MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
April 2022
Introduction to probability theory
Hypothesis testing
Linear models
All prerequisites, links to material and slides for this course can be found on github.
Or can be downloaded as a zip archive from here.
Course on statistics and experimental design
Before running any of the code, set the working directory to the folder we unarchived
Navigate to the LMS_RNASeq/course folder in the Rstudio menu
Session -> Set Working Directory -> Choose Directory
Use getwd() to see where your current directory is
# Set working directory
# setwd("/Volumes/bioinfomatics$/jurtasun/Courses/CBW2022/LMS_StatisticsInR/course")
# Working with Mac
# setwd("~/Downloads/LMS_Statistics/course")
# Working with Windows
# setwd("~\\Downloads\\LMS_Statistics\\course")
# Ask for current directory
# getwd()
data/life_expectancy_at_birth.csv
and store it as a dataframe# Read .csv file
life_exp <- read.csv("exercises/data/life_expectancy.csv", header = T)
# Explore data
head(life_exp)
time area life.expectancy gender
1 1991-1993 England 73.69 Males
2 1992-1994 England 74.02 Males
3 1993-1995 England 74.18 Males
4 1994-1996 England 74.44 Males
5 1995-1997 England 74.61 Males
6 1996-1998 England 74.84 Males
# Check dimensions
dim(life_exp)
[1] 160 4
# Subset the df for the England cases
life_exp_eng <- life_exp[life_exp$area == "England", ]
dim(life_exp)
[1] 160 4
dim(life_exp_eng)
[1] 40 4
# Get number of cases for each group
ftable(life_exp_eng$gender)
Females Males
20 20
# Option a - use min() and max() functions
min_life_exp_eng <- min(life_exp_eng[, "life.expectancy"])
max_life_exp_eng <- max(life_exp_eng[, "life.expectancy"])
c(min_life_exp_eng, max_life_exp_eng)
[1] 73.69 83.01
cat("Min: ", min_life_exp_eng, "\nMax: ", max_life_exp_eng)
Min: 73.69
Max: 83.01
# Option b - use the range() function
min_life_exp_eng <- range(life_exp_eng[, "life.expectancy"])[1]
max_life_exp_eng <- range(life_exp_eng[, "life.expectancy"])[2]
cat("Min: ", min_life_exp_eng, "\nMax: ", max_life_exp_eng)
Min: 73.69
Max: 83.01
# Get median value
median(life_exp_eng[life_exp_eng$gender == "Males", "life.expectancy"])
[1] 76.11
median(life_exp_eng[life_exp_eng$gender == "Females", "life.expectancy"])
[1] 80.69
# Get momenta of the distribution
mean(life_exp[life_exp$area == "England", "life.expectancy"])
[1] 78.55425
var(life_exp[life_exp$area == "England", "life.expectancy"])
[1] 7.610123
sd(life_exp[life_exp$area == "England", "life.expectancy"])
[1] 2.758645
# Equivalent - use the subset we generated before
mean(life_exp_eng[, "life.expectancy"])
[1] 78.55425
var(life_exp_eng[, "life.expectancy"])
[1] 7.610123
sd(life_exp_eng[, "life.expectancy"])
[1] 2.758645
Introduction to probability theory
# Introduction to probability theory
print("Introduction to probability theory")
[1] "Introduction to probability theory"
Output of the experiment is a discrete variable
Binomial distribution: probability of getting \( k \) successes in \( n \) events \[ B(x = k) = {n \choose k} p^{k}(1-p)^{n-k} \]
# Simulate 1 flip of a coin
rbinom(n = 1, size = 1, prob = 0.5)
[1] 1
# Change order of arguments
rbinom(size = 1, prob = 0.5, n = 1 )
[1] 0
# Use them as positional arguments
rbinom(1, 1, 0.5)
[1] 0
# Ask R information about the function
?rbinom
# Probability of exactly 3 heads in 10 flips: dbinom()
dbinom(3, 10, prob = 0.5)
[1] 0.1171875
# Probability of less than than 4 heads in 10 flips: pbinom()
pbinom(3, 10, 0.5)
[1] 0.171875
# Probability of at least 4 heads in 10 flips
1 - pbinom(3, 10, 0.5)
[1] 0.828125
Poisson distribution: probability of counting \( k \) events in a given interval \[ P(x = k) = \frac{\lambda^{k} \; e^{-\lambda}}{k!} \]
Example: probability of counting \( 5 \) car accidents in a week with historical average 10 \( (k = 5, \lambda = 10) \) \[ P(x = 5) = \frac{10^{5} \; e^{-10}}{5!} \]
Example: probability of counting \( 10 \) cancer patients per month with historical average 12 \( (k = 10, \lambda = 12) \) \[ P(x = 10) = \frac{12^{10} \; e^{-12}}{10!} \]
# Simulate 10 Poisson distributed events with mean 5
rpois(n = 10, lambda = 5)
[1] 5 5 6 6 7 3 1 9 5 5
# Change order of arguments
rpois(lambda = 5, n = 10)
[1] 6 5 10 4 9 5 6 1 6 5
# Use them as positional arguments
rpois(10, 5)
[1] 4 6 4 9 8 4 5 2 6 3
# Ask R information about the function
?rpois
# Probability of exactly 5 cancer patients if mean value 6: dpois()
dpois(5, 6)
[1] 0.1606231
# Probability of at less than 4 cancer patients with mean value 6: ppois()
ppois(3, 6)
[1] 0.1512039
# Probability of at least 4 cancer patients with mean value 6
1 - ppois(3, 6)
[1] 0.8487961
Gaussian distribution: continuous distribution for a real-valued random variable \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\big( \frac{x - \mu}{\sigma} \big)^{2}} \]
Normal distribution: particular case with \( \mu = 0, \sigma = 1 \) \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^{2}} \]
Measurable physical quantities are often Gaussian distributed
Gaussian distribution: continuous distribution for a real-valued random variable \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\big( \frac{x - \mu}{\sigma} \big)^{2}} \]
Normal distribution \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^{2}} \]
Substitute the values \( x = 0, \pm 1, \pm 2 \) \[ f(0) = e^{-\frac{1}{2}( 0 )^{2}} = 1 \] \[ f(+1) = e^{-\frac{1}{2}( +1 )^{2}} = e \sim 0.607... \] \[ f(-1) = e^{-\frac{1}{2}( -1 )^{2}} = e \sim 0.607... \] \[ f(+2) = e^{-\frac{1}{2}( +2 )^{2}} = e^{2} \sim 0.368... \] \[ f(-2) = e^{-\frac{1}{2}( -2 )^{2}} = e^{2} \sim 0.368... \]
# Simulate 5 normal distributed events
rnorm(n = 5, mean = 1, sd = 0.1)
[1] 0.9760908 0.9374695 0.9406535 1.0013013 0.8661233
# Change order of arguments
rnorm(mean = 1, sd = 0.1, n = 5)
[1] 0.7574533 0.9509867 0.8813905 1.0582687 1.0509471
# Use them as positional arguments
rnorm(5, 1, 0.1)
[1] 1.0440829 0.9753177 1.0842880 1.0113053 0.9833578
# Ask R information about the function
?rnorm
# Generate sequence between -10 and 10 increment by 0.1
x <- seq(-10, 10, by = 0.1)
head(x)
[1] -10.0 -9.9 -9.8 -9.7 -9.6 -9.5
# Choose the mean as 2.5 and standard deviation as 0.5
y <- dnorm(x, mean = 2.5, sd = 0.5)
head(y)
[1] 1.530786e-136 2.226901e-134 3.112546e-132 4.179831e-130 5.392993e-128
[6] 6.685429e-126
# Plot figure
plot(x,y)
Performing multiple sets of measurements and compare means
Hypothesis testing
# Hypothesis testing
print("Hypothesis testing")
[1] "Hypothesis testing"
The null \( H_{0} \) and alternative \( H_{1} \) hypothesis
Definition of p-value
Definition of p-value
Issues with p-values
“Do not just add samples until getting a good p-value \( \longrightarrow \) increases chance of reporting a false positive”
# Linear models
print("Linear models")
[1] "Linear models"
In real life, non-linearly distributed data
Introduction to probability theory
Hypothesis testing
Linear models