MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
April 2022
The RNASeq pipeline
Normalization of RNASeq data
Normalization in DESeq2
Differential Expression analysis: DESeq2
Visualization of RNASeq data: SDM and PCA
All prerequisites, links to material and slides for this course can be found on github.
Or can be downloaded as a zip archive from here.
Before running any of the code, set the working directory to the folder we unarchived
Navigate to the LMS_RNASeq/course folder in the Rstudio menu
Session -> Set Working Directory -> Choose Directory
Use getwd() to see where your current directory is
# Set working directory
# setwd("/Volumes/bioinfomatics$/jurtasun/Courses/CBW2022/LMS_RNASeq/course")
# getwd()
# Working with Mac
# setwd("~/Downloads/LMS_RNASeq/course")
# Working with Windows
# setwd("~\\Downloads\\LMS_RNASeq\\course")
# Normalization of RNASeq data
print("Normalization of RNASeq data")
[1] "Normalization of RNASeq data"
Factors considered for normalization
Sequencing depth
Gene length
RNA composition
Sequencing depth: sequencing runs with more depth will have more reads mapping to each gene
Needed for comparison of gene expression between samples
In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth
Gene length: longer genes will have more reads mapping to them
Needed for comparing expression between different genes within the same sample
In the example below, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer
We don't want those effects to bias our analysis
# Load data - 3 replicates and 4 genes dataset
data_matrix <- read.table("exercises/data/data_matrix.txt")
# Explore data
head(data_matrix)
V1 V2 V3
1 10 12 30
2 20 25 60
3 5 8 15
4 0 0 1
# Set row and column names
colnames(data_matrix) <- paste("Replicate", 1:3, sep = "")
rownames(data_matrix) <- paste("Gene", 1:4, sep = "")
# Gene lengths: Gene1 = 2kb, Gene2 = 4kb, Gene3 = 1kb, Gene4 = 10kb
gene_lengths <- c(2, 4, 1, 10)
# Explore data
head(data_matrix)
Replicate1 Replicate2 Replicate3
Gene1 10 12 30
Gene2 20 25 60
Gene3 5 8 15
Gene4 0 0 1
# Get sequencing depth for each replica
read_depths <- colSums(data_matrix)
# "Tens" of reads; would be millions of reads in real data
read_depths_tens <- read_depths / 10
head(read_depths_tens)
Replicate1 Replicate2 Replicate3
3.5 4.5 10.6
# Divide each column by its read depth
data_matrix_RPM <- t(data_matrix) / read_depths_tens
data_matrix_RPM <- t(data_matrix_RPM)
# Explore data
head(data_matrix_RPM)
Replicate1 Replicate2 Replicate3
Gene1 2.857143 2.666667 2.83018868
Gene2 5.714286 5.555556 5.66037736
Gene3 1.428571 1.777778 1.41509434
Gene4 0.000000 0.000000 0.09433962
# Divide each row by gene length
data_matrix_RPKM <- data_matrix_RPM / gene_lengths
# Explore data
head(data_matrix_RPKM)
Replicate1 Replicate2 Replicate3
Gene1 1.428571 1.333333 1.415094340
Gene2 1.428571 1.388889 1.415094340
Gene3 1.428571 1.777778 1.415094340
Gene4 0.000000 0.000000 0.009433962
# Store result as data frame
colnames(data_matrix_RPKM) <- paste("RPKM-Rep", 1:3, sep = "")
data_matrix_RPKM <- as.data.frame(data_matrix_RPKM)
class(data_matrix_RPKM)
[1] "data.frame"
# Final result
head(data_matrix_RPKM)
RPKM-Rep1 RPKM-Rep2 RPKM-Rep3
Gene1 1.428571 1.333333 1.415094340
Gene2 1.428571 1.388889 1.415094340
Gene3 1.428571 1.777778 1.415094340
Gene4 0.000000 0.000000 0.009433962
# Divide each column by its read depth
data_matrix_RPK <- data_matrix / gene_lengths
# Explore data
head(data_matrix_RPK)
Replicate1 Replicate2 Replicate3
Gene1 5 6.00 15.0
Gene2 5 6.25 15.0
Gene3 5 8.00 15.0
Gene4 0 0.00 0.1
# Get sequencing depth for each replica
read_depths2 <- colSums(data_matrix_RPK)
# "Tens" of reads; would be millions of reads in real data
read_depths2_tens <- read_depths2 / 10
head(read_depths2_tens)
Replicate1 Replicate2 Replicate3
1.500 2.025 4.510
# Divide each column by its read depth
data_matrix_TPM <- t(data_matrix_RPK) / read_depths2_tens
data_matrix_TPM <- t(data_matrix_TPM)
# Explore data
head(data_matrix_TPM)
Replicate1 Replicate2 Replicate3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295
# Store result as data frame
colnames(data_matrix_TPM) <- paste("TPM-Rep", 1:3, sep = "")
rownames(data_matrix_TPM) <- paste("Gene", 1:4, sep = "")
data_matrix_TPM <- as.data.frame(data_matrix_TPM)
class(data_matrix_TPM)
[1] "data.frame"
# Final result
head(data_matrix_TPM)
TPM-Rep1 TPM-Rep2 TPM-Rep3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295
# RPM scaled matrix
head(data_matrix_RPKM)
RPKM-Rep1 RPKM-Rep2 RPKM-Rep3
Gene1 1.428571 1.333333 1.415094340
Gene2 1.428571 1.388889 1.415094340
Gene3 1.428571 1.777778 1.415094340
Gene4 0.000000 0.000000 0.009433962
# Total number per replica
total_RPKM <- colSums(data_matrix_RPKM)
head(total_RPKM)
RPKM-Rep1 RPKM-Rep2 RPKM-Rep3
4.285714 4.500000 4.254717
# RPM scaled matrix
head(data_matrix_TPM)
TPM-Rep1 TPM-Rep2 TPM-Rep3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295
# Total number per replica
total_TPM <- colSums(data_matrix_TPM)
head(total_TPM)
TPM-Rep1 TPM-Rep2 TPM-Rep3
10 10 10
# Normalization in DESeq
print("Normalization in DESeq2")
[1] "Normalization in DESeq2"
Normalize RNA composition, or “library” composition
RNA-seq is often used for comparing one condition to another, or one tissue to another
It could be that there are a lot of condition/tissue specific genes, transcribed only in one case
Here Gene 2 will shadow all the rest when performing normalization!
The DE gene takes up most of the counts for Sample A, but not Sample B
In the example below, when dividing each sample by the total number of counts to normalize, the counts would be greatly suppressed by the DE gene
Most genes for Sample A would be divided by the larger number of total counts and appear to be less expressed than those same genes in Sample B
Compare gene expression between two conditions (e.g. control vs treatment, WT vs KO)
Counts divided by sample-specific size factors (*)
Normalizing sequencing depth and RNA composition
# Normalization in DESeq
print("Differential Expression analysis: DESeq2")
[1] "Differential Expression analysis: DESeq2"
Load all_counts.csv.txt by using read.csv() function
# Read input file with count data
all_counts <- read.csv(file = "exercises/data/all_counts.csv", row.names = 1)
Load all_counts.csv.txt by using read.csv() function
# Explore data
head(all_counts)
Viv1 Viv2 Viv3 Hfd1 Hfd2 Hfd3
ENSMUSG00000090025 0 0 0 0 0 0
ENSMUSG00000064842 0 0 0 0 0 0
ENSMUSG00000051951 0 1 1 3 0 0
ENSMUSG00000089699 0 0 0 0 0 0
ENSMUSG00000088390 0 0 0 0 0 0
ENSMUSG00000089420 0 0 0 0 0 0
dim(all_counts)
[1] 37991 6
class(all_counts)
[1] "data.frame"
Read in sample_description.txt file by using read.table function
# Read input file with sample description
sam_des <- read.table("exercises/data/sample_description.txt", sep = "\t", header = TRUE)
Read in sample_description.txt file by using read.table function
# Explore data
head(sam_des)
Sample Group Batch InputFile1 InputFile2 OutputFile
1 Viv1 Viv a SRR2001243.fastq SRR2001244.fastq Viv1.bam
2 Viv2 Viv b SRR2001245.fastq SRR2001246.fastq Viv2.bam
3 Viv3 Viv c SRR2001247.fastq SRR2001248.fastq Viv3.bam
4 Hfd1 Hfd a SRR2001249.fastq SRR2001250.fastq Hfd1.bam
5 Hfd2 Hfd b SRR2001251.fastq SRR2001252.fastq Hfd2.bam
6 Hfd3 Hfd c SRR2001253.fastq SRR2001254.fastq Hfd3.bam
dim(sam_des)
[1] 6 6
class(sam_des)
[1] "data.frame"
Collect sample information
# Prepare data for DESeq
col_data <- data.frame(Sample = sam_des$Sample,
Group = sam_des$Group,
Batch = sam_des$Batch)
# Store data as factors
col_data$Sample <- as.factor(col_data$Sample)
col_data$Group <- as.factor(col_data$Group)
col_data$Batch <- as.factor(col_data$Batch)
# Check dimensions
all(colnames(all_counts) == sam_des$Sample)
[1] TRUE
Construct DESeqDataSet object
# Load DESeq2 library
library(DESeq2)
# Build DESeq dataset
dds <- DESeqDataSetFromMatrix(countData = all_counts,
colData = col_data,
design = ~Group)
# Apply DESeq normalization
dds <- DESeq(dds)
# Ask for information about DESeq
?DESeq
The standard differential expression analysis steps are wrapped into a single function, DESeq, which performs normalization, fitting to the model and statistical testing.
The DESeq function runs the following ones in order
estimateSizeFactors()
estimateDispersions()
nbinomWaldTest()
1 - Estimation of size factors
# Estimate size factors
sizeFactors(dds)
Viv1 Viv2 Viv3 Hfd1 Hfd2 Hfd3
1.2430187 0.7755226 1.0501449 0.9457439 1.0124687 1.0515602
2 - Estimation of dispersion
# Estimate gene-wise dispersions
head(dispersions(dds))
[1] NA NA 2.495857 NA NA NA
# Plot dispersions
plotDispEsts(dds)
3- Hypothesis test for differential expression
For significance testing, DESeq2 by default uses a Wald test, where the function tests whether each model coefficient differs significantly from zero, using previously calculated sizeFactors and dispersion estimates.
The Wald test P values are adjusted for multiple testing using the procedure of Benjamini and Hochberg
# Compute Wald test
nbinomWaldTest()
Results tables are generated using the function results(), which extracts a results table with log2 fold changes, p-values and adjusted p-values
# Obtain results from DESeq
res <- results(dds)
head(res)
log2 fold change (MLE): Group Viv vs Hfd
Wald test p-value: Group Viv vs Hfd
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000090025 0.000000 NA NA NA NA
ENSMUSG00000064842 0.000000 NA NA NA NA
ENSMUSG00000051951 0.902301 -0.545916 2.25912 -0.241649 0.809052
ENSMUSG00000089699 0.000000 NA NA NA NA
ENSMUSG00000088390 0.000000 NA NA NA NA
ENSMUSG00000089420 0.000000 NA NA NA NA
padj
<numeric>
ENSMUSG00000090025 NA
ENSMUSG00000064842 NA
ENSMUSG00000051951 NA
ENSMUSG00000089699 NA
ENSMUSG00000088390 NA
ENSMUSG00000089420 NA
# Order by adjusted p-value
res_ordered <- res[order(res$padj), ]
head(res_ordered)
log2 fold change (MLE): Group Viv vs Hfd
Wald test p-value: Group Viv vs Hfd
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000032080 14291.818 -5.04940 0.251952 -20.0411 2.41261e-89
ENSMUSG00000024526 465.879 -6.89875 0.346598 -19.9042 3.74279e-88
ENSMUSG00000069170 343.928 -4.05808 0.253508 -16.0077 1.12875e-57
ENSMUSG00000042041 651.002 -3.18880 0.203665 -15.6571 2.97319e-55
ENSMUSG00000032231 883.324 -2.58604 0.176730 -14.6327 1.73648e-48
ENSMUSG00000026043 833.868 -2.46575 0.171552 -14.3732 7.62213e-47
padj
<numeric>
ENSMUSG00000032080 4.02472e-85
ENSMUSG00000024526 3.12186e-84
ENSMUSG00000069170 6.27659e-54
ENSMUSG00000042041 1.23997e-51
ENSMUSG00000032231 5.79358e-45
ENSMUSG00000026043 2.11921e-43
Data quality assessment and quality control are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing.
We will use following visualization tools to assess the data quality
Heatmap of count matrix
Sample Distance Matrix
Principal Component Analysis
1. Regularized logarithm (rlog)
2. Variance stabilizing transformation or (vst)
Both transforms the original count data to the log2 scale normalized to library size
1. Regularized logarithm (rlog)
When applying a log transformation on a count matrix, careful with zero values
Add a pseudo-count to zero values in the count matrix
rlog already estimates the best pseudo-count
# Regularized log transformation
rld <- rlog(dds)
class(rld)
[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"
# Get rld in count format
rld_counts <- assay(rld)
class(rld_counts)
[1] "matrix" "array"
1. Variance Stabilizing Transformation (rlog)
vst also transforms the original count data to the log2 scale normalized to library size
Faster than rlog
Usefull when working with very long datasets
# Regularized log transformation
vsd <- varianceStabilizingTransformation(dds)
class(vsd)
[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"
# Get rld in count format
vsd_counts <- assay(vsd)
class(vsd_counts)
[1] "matrix" "array"
# Load pheatmap library
library("pheatmap")
# Get dds normalized counts
dds_counts <- counts(dds, normalized = TRUE)
head(dds_counts)
Viv1 Viv2 Viv3 Hfd1 Hfd2 Hfd3
ENSMUSG00000090025 0 0.000000 0.0000000 0.000000 0 0
ENSMUSG00000064842 0 0.000000 0.0000000 0.000000 0 0
ENSMUSG00000051951 0 1.289453 0.9522495 3.172106 0 0
ENSMUSG00000089699 0 0.000000 0.0000000 0.000000 0 0
ENSMUSG00000088390 0 0.000000 0.0000000 0.000000 0 0
ENSMUSG00000089420 0 0.000000 0.0000000 0.000000 0 0
# Get normalized counts - 20 higher values
select <- order(rowMeans(dds_counts), decreasing = TRUE)[1:20]
head(select)
[1] 25353 34980 29008 16168 34999 16721
# Heatmap of the rlog transformed data
pheatmap(assay(rld)[select, ])
# Heatmap of the vst transformed data
pheatmap(assay(vsd)[select, ])
# Sample Distance Matrix
print("Sample Distance Matrix")
[1] "Sample Distance Matrix"
# Compute SDM from rlog transformed data
sample_dist <- dist(t(assay(rld)))
class(sample_dist)
[1] "dist"
# Get SDM in matrix form
sdm <- as.matrix(sample_dist)
class(sdm)
[1] "matrix" "array"
# Load library
library("RColorBrewer")
# Add row names for clear plot
rownames(sdm) <- rld$Group
colnames(sdm) <- NULL
# Add colors
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
# Plot heatmap
pheatmap(sdm,
clustering_distance_rows = sample_dist,
clustering_distance_cols = sample_dist,
col = colors)
# Principal Component Analysis
print("Principal Component Analysis")
[1] "Principal Component Analysis"
# PCA plot on our rld transformed data
plotPCA(rld, intgroup = "Group")
# Save figure
library(ggplot2)
ggsave(file = "figures/PCA_plot_rld.png")
# PCA plot on our rld transformed data
plotPCA(vsd, intgroup = "Group")
# Save figure
library(ggplot2)
ggsave(file = "figures/PCA_plot_vst.png")
# Additional topics
print("Additional topics")
[1] "Additional topics"
# # Add Gene Symbols
# library(biomaRt)
#
# # retrieve the mm9 information from biomart
# mart=useMart('ENSEMBL_MART_ENSEMBL',dataset='mmusculus_gene_ensembl',
# host="may2012.archive.ensembl.org")
#
# bm<-getBM(attributes=c('ensembl_gene_id','mgi_symbol'),
# filters ='ensembl_gene_id',
# values=rownames(resOrdered), mart=mart)
#
# # see the first few rows of "bm" object
# head(bm)
#
# # ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl',version = 80)
# # merge the Gene_symbol to our DE dataset
# resAnnotated <- merge(as.data.frame(resOrdered),bm,by.x=0,by.y=1)
# head(resAnnotated)
# # change the column name
# colnames(resAnnotated)[1]<-"ensembl_gene_id"
# # Order results by adjusted p value
# resAnnotated<-resAnnotated[order(resAnnotated$pvalue,decreasing=F),]
#
# # show the result with gene symbol annotation
# head(resAnnotated)
The RNASeq pipeline
Normalization of RNASeq data
Normalization in DESeq2
Differential Expression analysis: DESeq2
Visualization of RNASeq data: SDM and PCA
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_3.3.5 RColorBrewer_1.1-3
[3] pheatmap_1.0.12 DESeq2_1.34.0
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] MatrixGenerics_1.6.0 matrixStats_0.61.0
[9] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[11] IRanges_2.28.0 S4Vectors_0.32.4
[13] BiocGenerics_0.40.0 knitr_1.38
loaded via a namespace (and not attached):
[1] httr_1.4.2 bit64_4.0.5 splines_4.1.2
[4] assertthat_0.2.1 highr_0.9 blob_1.2.2
[7] GenomeInfoDbData_1.2.7 pillar_1.7.0 RSQLite_2.2.12
[10] lattice_0.20-45 glue_1.6.2 digest_0.6.29
[13] XVector_0.34.0 colorspace_2.0-3 Matrix_1.4-1
[16] XML_3.99-0.9 pkgconfig_2.0.3 genefilter_1.76.0
[19] zlibbioc_1.40.0 purrr_0.3.4 xtable_1.8-4
[22] scales_1.1.1 BiocParallel_1.28.3 tibble_3.1.6
[25] annotate_1.72.0 KEGGREST_1.34.0 farver_2.1.0
[28] generics_0.1.2 ellipsis_0.3.2 withr_2.5.0
[31] cachem_1.0.6 cli_3.2.0 survival_3.3-1
[34] magrittr_2.0.3 crayon_1.5.1 memoise_2.0.1
[37] evaluate_0.15 fansi_1.0.3 textshaping_0.3.6
[40] tools_4.1.2 lifecycle_1.0.1 stringr_1.4.0
[43] munsell_0.5.0 locfit_1.5-9.5 DelayedArray_0.20.0
[46] AnnotationDbi_1.56.2 Biostrings_2.62.0 compiler_4.1.2
[49] systemfonts_1.0.4 rlang_1.0.2 grid_4.1.2
[52] RCurl_1.98-1.6 rstudioapi_0.13 bitops_1.0-7
[55] labeling_0.4.2 gtable_0.3.0 DBI_1.1.2
[58] R6_2.5.1 dplyr_1.0.8 fastmap_1.1.0
[61] bit_4.0.4 utf8_1.2.2 ragg_1.2.2
[64] stringi_1.7.6 parallel_4.1.2 Rcpp_1.0.8.3
[67] vctrs_0.4.0 geneplotter_1.72.0 png_0.1-7
[70] tidyselect_1.1.2 xfun_0.30