Introduction to Statistics

MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
April 2022

Outline

  • Introduction to probability theory

    1. Discrete and continuous probability
    2. Discrete probability - Binomial and Poisson distribution
    3. Continuous probability - Gaussian distribution
    4. Momenta of a distribution (mean, variance, skewness, kurtosis)
  • Hypothesis testing

    1. General approach: null and alternative hypothesis
    2. Statistical tests (\( \chi^{2} \)-test, t-test, Wald test)
    3. Adjusted p-values
    4. Power calculations
  • Linear models

    1. What is a linear model
    2. Fitting a linear model
    3. Data visualization
    4. Generalized linear models

Materials

All prerequisites, links to material and slides for this course can be found on github.

Or can be downloaded as a zip archive from here.

Course on statistics and experimental design

  • Lazic, S. (2016). Experimental Design for Laboratory Biologists

Before we start...

  • Activate the R script source panel
path

Set working directory

  • Before running any of the code, set the working directory to the folder we unarchived

  • Navigate to the LMS_RNASeq/course folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

path

Set working directory - from terminal

Use getwd() to see where your current directory is

# Set working directory
# setwd("/Volumes/bioinfomatics$/jurtasun/Courses/CBW2022/LMS_StatisticsInR/course")

# Working with Mac
# setwd("~/Downloads/LMS_Statistics/course")

# Working with Windows
# setwd("~\\Downloads\\LMS_Statistics\\course")

# Ask for current directory
# getwd()

Exercise 1

Exercise 1

  • Read in the file data/life_expectancy_at_birth.csv and store it as a dataframe
# Read .csv file
life_exp <- read.csv("exercises/data/life_expectancy.csv", header = T)

# Explore data
head(life_exp)
       time    area life.expectancy gender
1 1991-1993 England           73.69  Males
2 1992-1994 England           74.02  Males
3 1993-1995 England           74.18  Males
4 1994-1996 England           74.44  Males
5 1995-1997 England           74.61  Males
6 1996-1998 England           74.84  Males
# Check dimensions
dim(life_exp)
[1] 160   4

Exercise 1

  • How many males and females are in the area of England?
# Subset the df for the England cases
life_exp_eng <- life_exp[life_exp$area == "England", ]
dim(life_exp)
[1] 160   4
dim(life_exp_eng)
[1] 40  4
# Get number of cases for each group
ftable(life_exp_eng$gender)
 Females Males

      20    20

Exercise 1

  • Check maximum and minimum life expectancy on England
# Option a - use min() and max() functions
min_life_exp_eng <- min(life_exp_eng[, "life.expectancy"])
max_life_exp_eng <- max(life_exp_eng[, "life.expectancy"])
c(min_life_exp_eng, max_life_exp_eng)
[1] 73.69 83.01
cat("Min: ", min_life_exp_eng, "\nMax: ", max_life_exp_eng)
Min:  73.69 
Max:  83.01
# Option b - use the range() function
min_life_exp_eng <- range(life_exp_eng[, "life.expectancy"])[1]
max_life_exp_eng <- range(life_exp_eng[, "life.expectancy"])[2]
cat("Min: ", min_life_exp_eng, "\nMax: ", max_life_exp_eng)
Min:  73.69 
Max:  83.01

Exercise 1

  • Get median value of life expectancy for Males and Females in England
# Get median value
median(life_exp_eng[life_exp_eng$gender == "Males", "life.expectancy"])
[1] 76.11
median(life_exp_eng[life_exp_eng$gender == "Females", "life.expectancy"])
[1] 80.69

Exercise 1

  • Get the mean, variance and standard deviation of the life expectancy in England
# Get momenta of the distribution
mean(life_exp[life_exp$area == "England", "life.expectancy"])
[1] 78.55425
var(life_exp[life_exp$area == "England", "life.expectancy"])
[1] 7.610123
sd(life_exp[life_exp$area == "England", "life.expectancy"])
[1] 2.758645

Exercise 1

  • Get the mean, variance and standard deviation of the life expectancy in England
# Equivalent - use the subset we generated before
mean(life_exp_eng[, "life.expectancy"])
[1] 78.55425
var(life_exp_eng[, "life.expectancy"])
[1] 7.610123
sd(life_exp_eng[, "life.expectancy"])
[1] 2.758645

Exercise 1 - solutions

Introduction to probability theory

Introduction to probability theory

# Introduction to probability theory
print("Introduction to probability theory")
[1] "Introduction to probability theory"

Discrete vs continuous probability

  • Discrete output \( \longrightarrow \) probability well defined, compute by counting
  • Continuous output \( \longrightarrow \) need for probability distributions
  • Computing probabilities in both scenarios (Binomial, Poisson, Gaussian)

Discrete probability

Output of the experiment is a discrete variable

  • Flipping a coin
  • Throwing dice
  • Counting cars / patients / … in a given interval
  • Getting a specific genotype, read counts in RNAseq data

Binomial distribution

path

Binomial distribution: probability of getting \( k \) successes in \( n \) events \[ B(x = k) = {n \choose k} p^{k}(1-p)^{n-k} \]

  • Getting \( k \) heads in \( n \) coin flips
  • Getting \( k \) sucesses in \( n \) dice
  • Getting \( k \) successes randomly filling \( n \) questions of a (a, b, c) test

Binomial distribution

path
  • Example: probability of getting 4 heads in 5 flips of a coin (\( n = 5, k = 4, p = \frac{1}{2} \)) \[ B(x = 4) = {5 \choose 4} (1/2)^{4}(1-1/2)^{5-4} \]
  • Example: probability of getting 3 4s in 5 throws of a dice \( \rightarrow \) (\( n = 5, k = 4, p = \frac{1}{6} \)) \[ B(x = 3) = {5 \choose 3} (1/6)^{3}(1-1/6)^{5-3} \]
  • Getting \( 10 \) successes randomly filling \( 20 \) questions of a (a, b, c) test (\( n = 20, k = 10, p = \frac{1}{3} \)) \[ B(x = 10) = {20 \choose 10} (1/3)^{10}(1-1/3)^{20-10} \]

Binomial distribution

  • Binomial distribution in R - rbinom() function
# Simulate 1 flip of a coin
rbinom(n = 1, size = 1, prob = 0.5)
[1] 1
# Change order of arguments
rbinom(size = 1, prob = 0.5, n = 1 )
[1] 0
# Use them as positional arguments
rbinom(1, 1, 0.5)
[1] 0
# Ask R information about the function
?rbinom

Binomial distribution

  • Binomial distribution in R - dbinom(), pbinom(), qbinom(), function
# Probability of exactly 3 heads in 10 flips: dbinom()
dbinom(3, 10, prob = 0.5)
[1] 0.1171875
# Probability of less than than 4 heads in 10 flips: pbinom()
pbinom(3, 10, 0.5)
[1] 0.171875
# Probability of at least 4 heads in 10 flips
1 - pbinom(3, 10, 0.5)
[1] 0.828125

Poisson distribution

path

Poisson distribution: probability of counting \( k \) events in a given interval \[ P(x = k) = \frac{\lambda^{k} \; e^{-\lambda}}{k!} \]

  • \( \lambda \) is the average value
  • Counting \( k \) new cancer patients per day
  • Counting \( k \) car accidents per year
  • Counting \( k \) reads from RNAseq experiment

Poisson distribution

path
  • Example: probability of counting \( 5 \) car accidents in a week with historical average 10 \( (k = 5, \lambda = 10) \) \[ P(x = 5) = \frac{10^{5} \; e^{-10}}{5!} \]

  • Example: probability of counting \( 10 \) cancer patients per month with historical average 12 \( (k = 10, \lambda = 12) \) \[ P(x = 10) = \frac{12^{10} \; e^{-12}}{10!} \]

Poisson distribution

  • Poisson distribution in R - rpois() function
# Simulate 10 Poisson distributed events with mean 5
rpois(n = 10, lambda = 5)
 [1] 5 5 6 6 7 3 1 9 5 5
# Change order of arguments
rpois(lambda = 5, n = 10)
 [1]  6  5 10  4  9  5  6  1  6  5
# Use them as positional arguments
rpois(10, 5)
 [1] 4 6 4 9 8 4 5 2 6 3
# Ask R information about the function
?rpois

Poisson distribution

  • Poisson distribution in R - dpois(), ppois(), qpois() function
# Probability of exactly 5 cancer patients if mean value 6: dpois()
dpois(5, 6)
[1] 0.1606231
# Probability of at less than 4 cancer patients with mean value 6: ppois()
ppois(3, 6)
[1] 0.1512039
# Probability of at least 4 cancer patients with mean value 6
1 - ppois(3, 6)
[1] 0.8487961

Continuous probability

  • Until now we used frequentist approach (probability of individual events was discrete) But what happens when the output of the experiment is a continuous variable?
  • Compute probability of measuring a particular height, weight, temperature, position, … \[ P(h = 1.75cm) ? \rightarrow \textrm{infinite possible outomes!} \]
  • Continuous output \( \longrightarrow \) probability not well defined, need for \( \textit{densitiy} \) distributions
  • Used in statistical analysis (comparing means, errors, p-values, …)

Gaussian distribution

path
  • Gaussian distribution: continuous distribution for a real-valued random variable \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\big( \frac{x - \mu}{\sigma} \big)^{2}} \]

  • Normal distribution: particular case with \( \mu = 0, \sigma = 1 \) \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^{2}} \]

  • Measurable physical quantities are often Gaussian distributed

Gaussian distribution

path
  • Gaussian distribution: continuous distribution for a real-valued random variable \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\big( \frac{x - \mu}{\sigma} \big)^{2}} \]

  • Normal distribution \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^{2}} \]

  • Substitute the values \( x = 0, \pm 1, \pm 2 \) \[ f(0) = e^{-\frac{1}{2}( 0 )^{2}} = 1 \] \[ f(+1) = e^{-\frac{1}{2}( +1 )^{2}} = e \sim 0.607... \] \[ f(-1) = e^{-\frac{1}{2}( -1 )^{2}} = e \sim 0.607... \] \[ f(+2) = e^{-\frac{1}{2}( +2 )^{2}} = e^{2} \sim 0.368... \] \[ f(-2) = e^{-\frac{1}{2}( -2 )^{2}} = e^{2} \sim 0.368... \]

Gaussian distribution

  • Gaussian distribution in R - rnorm() function
# Simulate 5 normal distributed events
rnorm(n = 5, mean = 1, sd = 0.1)
[1] 0.9760908 0.9374695 0.9406535 1.0013013 0.8661233
# Change order of arguments
rnorm(mean = 1, sd = 0.1, n = 5)
[1] 0.7574533 0.9509867 0.8813905 1.0582687 1.0509471
# Use them as positional arguments
rnorm(5, 1, 0.1)
[1] 1.0440829 0.9753177 1.0842880 1.0113053 0.9833578
# Ask R information about the function
?rnorm

Gaussian distribution

  • Example: data visualization
# Generate sequence between -10 and 10 increment by 0.1
x <- seq(-10, 10, by = 0.1)
head(x)
[1] -10.0  -9.9  -9.8  -9.7  -9.6  -9.5
# Choose the mean as 2.5 and standard deviation as 0.5
y <- dnorm(x, mean = 2.5, sd = 0.5)
head(y)
[1] 1.530786e-136 2.226901e-134 3.112546e-132 4.179831e-130 5.392993e-128
[6] 6.685429e-126

Gaussian distribution

  • Example: data visualization
# Plot figure
plot(x,y)

plot of chunk unnamed-chunk-15

Momenta of a distribution

path
  • Momenta of a distribution
  1. Mean: where the bulk of the events
  2. Variance: how spread are the events around the mean
  3. Skewness: measure of asymmetry of the probability distribution
  4. Kurtosis: measure of the “tailedness” of the probability distribution
  • They can be computed for any probability distribution!

Standard deviation vs standard error

Performing multiple sets of measurements and compare means

  • Standard deviation (\( \sigma \)) quantifies variation over one set of measurements (second momentum of the probability distribution)
  • Standard error (SE) quantifies variation of \( {\color{red}{means}} \) from multiple sets! \[ \textrm{SE} = \frac{\sigma}{\sqrt{N}} \]
  • Source of confusion: SE can be estimated from single set of measurements, even though it describes variation of means

Exercise 2

Exercise 2 - solutions

Hypothesis testing

Hypothesis testing

# Hypothesis testing
print("Hypothesis testing")
[1] "Hypothesis testing"

Compare different models

path

The null \( H_{0} \) and alternative \( H_{1} \) hypothesis

  • State some null hypothesis \( H_{0} \)
  • See if we have enough evidence to reject it and state \( H_{1} \)

Statistic test

path
  • Quantify the significance of an observation
  • Certainty when accepting / rejecting a hypothesis
  • Statistical tests (\( \chi^{2} \)-test, t-test, Wald test)

p-values

path

Definition of p-value

  • “Probability of getting a result as extreme or more than the one obtained for the statistic score, assuming the null hypothesis was true”
  • Small p-value \( \rightarrow \) It was very unlikely to obtain the specific observation we got
  • We need a small p-value to reject the null hypothesis

p-values

path

Definition of p-value

  • Typical threshold p-value = 0.05 \( \rightarrow \) 5% of experiments will report a false positive
  • Some experiments/anaysis require more stringent thresholds
  • If extremely important to state that two drugs are different, p-value = 0.00001 \( \rightarrow \) only once every 100000 experiments will report a false positive

Adjusted p-values

path

Issues with p-values

  • Typical 0.05 cut-off is arbitrary and merely a convention
  • Two common biases further affect the integrity of research findings (selective reporting / p-hacking)
  • Adjusted p-values (Bonferroni correction & Benjamini-Hochberg method)
  • Jafari M, Ansari-Pour N. Why, When and How to Adjust Your P Values?. Cell J. 2019;20(4):604-607. doi:10.22074/cellj.2019.5992

Power calculations

path

“Do not just add samples until getting a good p-value \( \longrightarrow \) increases chance of reporting a false positive”

  • “Power”: probability that a test will \( {\color{blue}{correctly}} \) give a small p-value
  • Factors that affect power
    1. Variation in data
    2. Sample size
    3. Test used

Exercise 3

Exercise 3 - solutions

Linear models

# Linear models
print("Linear models")
[1] "Linear models"

Fitting a linear model

path
  • Find a function that describes a set of observations
  • Assume \( y \) and \( x \) are correlated variables \[ y(x) = b_{0} \cdot x + b_{1} \]

Fitting a linear model

path
  • Get the overall mean
  • Compute sum of residuals (SS) over the mean
  • Find line with the smallest SS \( \rightarrow \) fit
  • Evaluate fit - correlation coefficient R\( ^{2} \)

Fitting a linear model

path
  • Data points are a discrete set
  • We have no information about intermediate ranges
  • A fit provides a continuous function \( \rightarrow \) make predictions!

Data visualization

path
  • Importance of representing data
  • Ascombe's quartet (Francis Ascombe, 1973)
  • Data with exact same statistics (same linear fit describing behavior)
  • Very differently distributed: non-linear dependence, outlier effects, etc

Generalized linear models

path

In real life, non-linearly distributed data

  • Exponential regression \[ y(x) = e^{b_{0} + b_{1}x} = y_{0}e^{b_{1}x} \]
  • Negative binomial \[ y(x) = NB(x) \nonumber \]
  • Poisson regression (assume variance \( \sim \) mean) \[ y(x) = P(x) \]

Exercise 4

Exercise 4 - solutions

Summary

  • Introduction to probability theory

    1. Discrete and continuous probability
    2. Discrete probability - Binomial and Poisson distribution
    3. Continuous probability - Gaussian distribution
    4. Momenta of a distribution (mean, variance, skewness, kurtosis)
  • Hypothesis testing

    1. General approach: null and alternative hypothesis
    2. Statistical tests (\( \chi^{2} \)-test, t-test, Wald test)
    3. Adjusted p-values
    4. Power calculations
  • Linear models

    1. What is a linear model
    2. Fitting a linear model
    3. Data visualization
    4. Generalized linear models