Introduction to Statistics

MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
April 2022

Outline

Introduction to probability theory
1. Discrete and continuous probability
2. Discrete probability - Binomial and Poisson distribution
3. Continuous probability - Gaussian distribution
4. Momenta of a distribution (mean, variance, skewness, kurtosis)
Hypothesis testing
1. General approach: null and alternative hypothesis
2. Statistical tests (\( \chi^{2} \)-test, t-test, Wald test)
3. Adjusted p-values
4. Power calculations
Linear models
1. What is a linear model
2. Fitting a linear model
3. Data visualization
4. Generalized linear models

Materials

All prerequisites, links to material and slides for this course can be found on github.

LMS_Statistics

Or can be downloaded as a zip archive from here.

Download zip

Course on statistics and experimental design

Lazic, S. (2016). Experimental Design for Laboratory Biologists

Before we start...

Activate the R script source panel

Set working directory

Before running any of the code, set the working directory to the folder we unarchived
Navigate to the LMS_RNASeq/course folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

Set working directory - from terminal

Use getwd() to see where your current directory is

# Set working directory
# setwd("/Volumes/bioinfomatics$/jurtasun/Courses/CBW2022/LMS_StatisticsInR/course")

# Working with Mac
# setwd("~/Downloads/LMS_Statistics/course")

# Working with Windows
# setwd("~\\Downloads\\LMS_Statistics\\course")

# Ask for current directory
# getwd()

Exercise 1

Exercise 1

Exercise 1

Read in the file data/life_expectancy_at_birth.csv and store it as a dataframe

# Read .csv file
life_exp <- read.csv("exercises/data/life_expectancy.csv", header = T)

# Explore data
head(life_exp)

       time    area life.expectancy gender
1 1991-1993 England           73.69  Males
2 1992-1994 England           74.02  Males
3 1993-1995 England           74.18  Males
4 1994-1996 England           74.44  Males
5 1995-1997 England           74.61  Males
6 1996-1998 England           74.84  Males

# Check dimensions
dim(life_exp)

[1] 160   4

Exercise 1

How many males and females are in the area of England?

# Subset the df for the England cases
life_exp_eng <- life_exp[life_exp$area == "England", ]
dim(life_exp)

[1] 160   4

dim(life_exp_eng)

[1] 40  4

# Get number of cases for each group
ftable(life_exp_eng$gender)

 Females Males

      20    20

Exercise 1

Check maximum and minimum life expectancy on England

# Option a - use min() and max() functions
min_life_exp_eng <- min(life_exp_eng[, "life.expectancy"])
max_life_exp_eng <- max(life_exp_eng[, "life.expectancy"])
c(min_life_exp_eng, max_life_exp_eng)

[1] 73.69 83.01

cat("Min: ", min_life_exp_eng, "\nMax: ", max_life_exp_eng)

Min:  73.69 
Max:  83.01

# Option b - use the range() function
min_life_exp_eng <- range(life_exp_eng[, "life.expectancy"])[1]
max_life_exp_eng <- range(life_exp_eng[, "life.expectancy"])[2]
cat("Min: ", min_life_exp_eng, "\nMax: ", max_life_exp_eng)

Min:  73.69 
Max:  83.01

Exercise 1

Get median value of life expectancy for Males and Females in England

# Get median value
median(life_exp_eng[life_exp_eng$gender == "Males", "life.expectancy"])

[1] 76.11

median(life_exp_eng[life_exp_eng$gender == "Females", "life.expectancy"])

[1] 80.69

Exercise 1

Get the mean, variance and standard deviation of the life expectancy in England

# Get momenta of the distribution
mean(life_exp[life_exp$area == "England", "life.expectancy"])

[1] 78.55425

var(life_exp[life_exp$area == "England", "life.expectancy"])

[1] 7.610123

sd(life_exp[life_exp$area == "England", "life.expectancy"])

[1] 2.758645

Exercise 1

Get the mean, variance and standard deviation of the life expectancy in England

# Equivalent - use the subset we generated before
mean(life_exp_eng[, "life.expectancy"])

[1] 78.55425

var(life_exp_eng[, "life.expectancy"])

[1] 7.610123

sd(life_exp_eng[, "life.expectancy"])

[1] 2.758645

Exercise 1 - solutions

Exercise 1 - solutions

Introduction to probability theory

# Introduction to probability theory
print("Introduction to probability theory")

[1] "Introduction to probability theory"

Discrete vs continuous probability

Discrete output \( \longrightarrow \) probability well defined, compute by counting
Continuous output \( \longrightarrow \) need for probability distributions
Computing probabilities in both scenarios (Binomial, Poisson, Gaussian)

Discrete probability

Output of the experiment is a discrete variable

Flipping a coin
Throwing dice
Counting cars / patients / … in a given interval
Getting a specific genotype, read counts in RNAseq data

Binomial distribution

Binomial distribution: probability of getting \( k \) successes in \( n \) events \[ B(x = k) = {n \choose k} p^{k}(1-p)^{n-k} \]

Getting \( k \) heads in \( n \) coin flips
Getting \( k \) sucesses in \( n \) dice
Getting \( k \) successes randomly filling \( n \) questions of a (a, b, c) test

Binomial distribution

Example: probability of getting 4 heads in 5 flips of a coin (\( n = 5, k = 4, p = \frac{1}{2} \)) \[ B(x = 4) = {5 \choose 4} (1/2)^{4}(1-1/2)^{5-4} \]
Example: probability of getting 3 4s in 5 throws of a dice \( \rightarrow \) (\( n = 5, k = 4, p = \frac{1}{6} \)) \[ B(x = 3) = {5 \choose 3} (1/6)^{3}(1-1/6)^{5-3} \]
Getting \( 10 \) successes randomly filling \( 20 \) questions of a (a, b, c) test (\( n = 20, k = 10, p = \frac{1}{3} \)) \[ B(x = 10) = {20 \choose 10} (1/3)^{10}(1-1/3)^{20-10} \]

Binomial distribution

Binomial distribution in R - rbinom() function

# Simulate 1 flip of a coin
rbinom(n = 1, size = 1, prob = 0.5)

[1] 1

# Change order of arguments
rbinom(size = 1, prob = 0.5, n = 1 )

[1] 0

# Use them as positional arguments
rbinom(1, 1, 0.5)

[1] 0

# Ask R information about the function
?rbinom

Binomial distribution

Binomial distribution in R - dbinom(), pbinom(), qbinom(), function

# Probability of exactly 3 heads in 10 flips: dbinom()
dbinom(3, 10, prob = 0.5)

[1] 0.1171875

# Probability of less than than 4 heads in 10 flips: pbinom()
pbinom(3, 10, 0.5)

[1] 0.171875

# Probability of at least 4 heads in 10 flips
1 - pbinom(3, 10, 0.5)

[1] 0.828125

Poisson distribution

Poisson distribution: probability of counting \( k \) events in a given interval \[ P(x = k) = \frac{\lambda^{k} \; e^{-\lambda}}{k!} \]

\( \lambda \) is the average value
Counting \( k \) new cancer patients per day
Counting \( k \) car accidents per year
Counting \( k \) reads from RNAseq experiment

Poisson distribution

Example: probability of counting \( 5 \) car accidents in a week with historical average 10 \( (k = 5, \lambda = 10) \) \[ P(x = 5) = \frac{10^{5} \; e^{-10}}{5!} \]
Example: probability of counting \( 10 \) cancer patients per month with historical average 12 \( (k = 10, \lambda = 12) \) \[ P(x = 10) = \frac{12^{10} \; e^{-12}}{10!} \]

Poisson distribution

Poisson distribution in R - rpois() function

# Simulate 10 Poisson distributed events with mean 5
rpois(n = 10, lambda = 5)

 [1] 5 5 6 6 7 3 1 9 5 5

# Change order of arguments
rpois(lambda = 5, n = 10)

 [1]  6  5 10  4  9  5  6  1  6  5

# Use them as positional arguments
rpois(10, 5)

 [1] 4 6 4 9 8 4 5 2 6 3

# Ask R information about the function
?rpois

Poisson distribution

Poisson distribution in R - dpois(), ppois(), qpois() function

# Probability of exactly 5 cancer patients if mean value 6: dpois()
dpois(5, 6)

[1] 0.1606231

# Probability of at less than 4 cancer patients with mean value 6: ppois()
ppois(3, 6)

[1] 0.1512039

# Probability of at least 4 cancer patients with mean value 6
1 - ppois(3, 6)

[1] 0.8487961

Continuous probability

Until now we used frequentist approach (probability of individual events was discrete) But what happens when the output of the experiment is a continuous variable?
Compute probability of measuring a particular height, weight, temperature, position, … \[ P(h = 1.75cm) ? \rightarrow \textrm{infinite possible outomes!} \]
Continuous output \( \longrightarrow \) probability not well defined, need for \( \textit{densitiy} \) distributions
Used in statistical analysis (comparing means, errors, p-values, …)

Gaussian distribution

Gaussian distribution: continuous distribution for a real-valued random variable \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\big( \frac{x - \mu}{\sigma} \big)^{2}} \]
Normal distribution: particular case with \( \mu = 0, \sigma = 1 \) \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^{2}} \]
Measurable physical quantities are often Gaussian distributed

Gaussian distribution

Gaussian distribution: continuous distribution for a real-valued random variable \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\big( \frac{x - \mu}{\sigma} \big)^{2}} \]
Normal distribution \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^{2}} \]
Substitute the values \( x = 0, \pm 1, \pm 2 \) \[ f(0) = e^{-\frac{1}{2}( 0 )^{2}} = 1 \] \[ f(+1) = e^{-\frac{1}{2}( +1 )^{2}} = e \sim 0.607... \] \[ f(-1) = e^{-\frac{1}{2}( -1 )^{2}} = e \sim 0.607... \] \[ f(+2) = e^{-\frac{1}{2}( +2 )^{2}} = e^{2} \sim 0.368... \] \[ f(-2) = e^{-\frac{1}{2}( -2 )^{2}} = e^{2} \sim 0.368... \]

Gaussian distribution

Gaussian distribution in R - rnorm() function

# Simulate 5 normal distributed events
rnorm(n = 5, mean = 1, sd = 0.1)

[1] 0.9760908 0.9374695 0.9406535 1.0013013 0.8661233

# Change order of arguments
rnorm(mean = 1, sd = 0.1, n = 5)

[1] 0.7574533 0.9509867 0.8813905 1.0582687 1.0509471

# Use them as positional arguments
rnorm(5, 1, 0.1)

[1] 1.0440829 0.9753177 1.0842880 1.0113053 0.9833578

# Ask R information about the function
?rnorm

Gaussian distribution

Example: data visualization

# Generate sequence between -10 and 10 increment by 0.1
x <- seq(-10, 10, by = 0.1)
head(x)

[1] -10.0  -9.9  -9.8  -9.7  -9.6  -9.5

# Choose the mean as 2.5 and standard deviation as 0.5
y <- dnorm(x, mean = 2.5, sd = 0.5)
head(y)

[1] 1.530786e-136 2.226901e-134 3.112546e-132 4.179831e-130 5.392993e-128
[6] 6.685429e-126

Gaussian distribution

Example: data visualization

# Plot figure
plot(x,y)

plot of chunk unnamed-chunk-15

Momenta of a distribution

Momenta of a distribution

Mean: where the bulk of the events
Variance: how spread are the events around the mean
Skewness: measure of asymmetry of the probability distribution
Kurtosis: measure of the “tailedness” of the probability distribution

They can be computed for any probability distribution!

Standard deviation vs standard error

Performing multiple sets of measurements and compare means

Standard deviation (\( \sigma \)) quantifies variation over one set of measurements (second momentum of the probability distribution)
Standard error (SE) quantifies variation of \( {\color{red}{means}} \) from multiple sets! \[ \textrm{SE} = \frac{\sigma}{\sqrt{N}} \]
Source of confusion: SE can be estimated from single set of measurements, even though it describes variation of means

Exercise 2

Exercise 2

Exercise 2 - solutions

Exercise 2 - solutions

Hypothesis testing

# Hypothesis testing
print("Hypothesis testing")

[1] "Hypothesis testing"

Compare different models

The null \( H_{0} \) and alternative \( H_{1} \) hypothesis

State some null hypothesis \( H_{0} \)
See if we have enough evidence to reject it and state \( H_{1} \)

Statistic test

Quantify the significance of an observation
Certainty when accepting / rejecting a hypothesis
Statistical tests (\( \chi^{2} \)-test, t-test, Wald test)

p-values

Definition of p-value

“Probability of getting a result as extreme or more than the one obtained for the statistic score, assuming the null hypothesis was true”
Small p-value \( \rightarrow \) It was very unlikely to obtain the specific observation we got
We need a small p-value to reject the null hypothesis

p-values

Definition of p-value

Typical threshold p-value = 0.05 \( \rightarrow \) 5% of experiments will report a false positive
Some experiments/anaysis require more stringent thresholds
If extremely important to state that two drugs are different, p-value = 0.00001 \( \rightarrow \) only once every 100000 experiments will report a false positive

Adjusted p-values

Issues with p-values

Typical 0.05 cut-off is arbitrary and merely a convention
Two common biases further affect the integrity of research findings (selective reporting / p-hacking)
Adjusted p-values (Bonferroni correction & Benjamini-Hochberg method)
Jafari M, Ansari-Pour N. Why, When and How to Adjust Your P Values?. Cell J. 2019;20(4):604-607. doi:10.22074/cellj.2019.5992

Power calculations

“Do not just add samples until getting a good p-value \( \longrightarrow \) increases chance of reporting a false positive”

“Power”: probability that a test will \( {\color{blue}{correctly}} \) give a small p-value
Factors that affect power
1. Variation in data
2. Sample size
3. Test used

Exercise 3

Exercise 3

Exercise 3 - solutions

Exercise 3 - solutions

Linear models

# Linear models
print("Linear models")

[1] "Linear models"

Fitting a linear model

Find a function that describes a set of observations
Assume \( y \) and \( x \) are correlated variables \[ y(x) = b_{0} \cdot x + b_{1} \]

Fitting a linear model

Get the overall mean
Compute sum of residuals (SS) over the mean
Find line with the smallest SS \( \rightarrow \) fit
Evaluate fit - correlation coefficient R\( ^{2} \)

Fitting a linear model

Data points are a discrete set
We have no information about intermediate ranges
A fit provides a continuous function \( \rightarrow \) make predictions!

Data visualization

Importance of representing data
Ascombe's quartet (Francis Ascombe, 1973)
Data with exact same statistics (same linear fit describing behavior)
Very differently distributed: non-linear dependence, outlier effects, etc

Generalized linear models

In real life, non-linearly distributed data

Exponential regression \[ y(x) = e^{b_{0} + b_{1}x} = y_{0}e^{b_{1}x} \]
Negative binomial \[ y(x) = NB(x) \nonumber \]
Poisson regression (assume variance \( \sim \) mean) \[ y(x) = P(x) \]

Exercise 4

Exercise 4

Exercise 4 - solutions

Exercise 4 - solutions

Summary

Introduction to probability theory
1. Discrete and continuous probability
2. Discrete probability - Binomial and Poisson distribution
3. Continuous probability - Gaussian distribution
4. Momenta of a distribution (mean, variance, skewness, kurtosis)
Hypothesis testing
1. General approach: null and alternative hypothesis
2. Statistical tests (\( \chi^{2} \)-test, t-test, Wald test)
3. Adjusted p-values
4. Power calculations
Linear models
1. What is a linear model
2. Fitting a linear model
3. Data visualization
4. Generalized linear models