Introduction to RNASeq

MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
April 2022

Outline

  • The RNASeq pipeline

  • Normalization of RNASeq data

  • Normalization in DESeq2

  • Differential Expression analysis: DESeq2

  • Visualization of RNASeq data: SDM and PCA

Materials

All prerequisites, links to material and slides for this course can be found on github.

Or can be downloaded as a zip archive from here.

Before we start...

  • Activate the R script source panel
path

Set working directory

  • Before running any of the code, set the working directory to the folder we unarchived

  • Navigate to the LMS_RNASeq/course folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

path

Set working directory - from terminal

Use getwd() to see where your current directory is

# Set working directory
# setwd("/Volumes/bioinfomatics$/jurtasun/Courses/CBW2022/LMS_RNASeq/course")
# getwd()

# Working with Mac
# setwd("~/Downloads/LMS_RNASeq/course")

# Working with Windows
# setwd("~\\Downloads\\LMS_RNASeq\\course")

Normalization of RNASeq data

# Normalization of RNASeq data
print("Normalization of RNASeq data")
[1] "Normalization of RNASeq data"

Normalization of RNASeq data

Factors considered for normalization

  • Sequencing depth

  • Gene length

  • RNA composition

Normalization of RNASeq data

  • Sequencing depth: sequencing runs with more depth will have more reads mapping to each gene

  • Needed for comparison of gene expression between samples

  • In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth

gene
  • Image credits: Bioinformatics Training at Harvard Chan Bioinformatics Core

Normalization of RNASeq data

  • Gene length: longer genes will have more reads mapping to them

  • Needed for comparing expression between different genes within the same sample

  • In the example below, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer

gene
  • Image credits: Bioinformatics Training at Harvard Chan Bioinformatics Core

Normalization of RNASeq data

We don't want those effects to bias our analysis

  • Counts per million (CPM)
  • Reads per Kilobase and Million reads (RPKM)
  • Fragments per Kilobase and Million (FPKM)
  • Transcripts per million (TPM)

Normalization of RNASeq data

  • Difference between RPKM/FPKM and TPM, go through an example
  • Typical form of a count matrix data, with genes as rows and samples as columns
gene

Normalization of RNASeq data

  • For every gene, rep 3 way more reads than the other 2 \( \rightarrow \) higher sequencing depth
  • For every sample, Gene B twice as long than Gene A \( \rightarrow \) twice as many reads for any replicate
gene

Normalization of RNASeq data

  • RPKM normalization
  • Step1: normalize sequencing depth “per million” \( \rightarrow \) get RPM
  • Step2: normalize gene length in kilobases \( \rightarrow \) get RPKM
gene

Normalization of RNASeq data

  • TPM normalization
  • Step1: normalize gene length in kilobases \( \rightarrow \) get RPK
  • Step2: normalize sequencing depth “per million” \( \rightarrow \) get TPM
gene

Exercise 1

Exercise 1

  • Load data_matrix.txt by using read.table() function
# Load data - 3 replicates and 4 genes dataset
data_matrix <- read.table("exercises/data/data_matrix.txt")

# Explore data
head(data_matrix)
  V1 V2 V3
1 10 12 30
2 20 25 60
3  5  8 15
4  0  0  1

Exercise 1

  • Load data_matrix.txt by using read.table() function
# Set row and column names
colnames(data_matrix) <- paste("Replicate", 1:3, sep = "")
rownames(data_matrix) <- paste("Gene", 1:4, sep = "")

# Gene lengths: Gene1 = 2kb, Gene2 = 4kb, Gene3 = 1kb, Gene4 = 10kb
gene_lengths <- c(2, 4, 1, 10)

# Explore data
head(data_matrix)
      Replicate1 Replicate2 Replicate3
Gene1         10         12         30
Gene2         20         25         60
Gene3          5          8         15
Gene4          0          0          1

Exercise 1

  • 1. Perform RPKM normalization (Reads Per Kilobase of exon and per Million reads)
  • i) Normalize by sequencing depth (total number per replica (column))
# Get sequencing depth for each replica
read_depths <- colSums(data_matrix)
# "Tens" of reads; would be millions of reads in real data
read_depths_tens <- read_depths / 10
head(read_depths_tens)
Replicate1 Replicate2 Replicate3 
       3.5        4.5       10.6 
# Divide each column by its read depth
data_matrix_RPM <- t(data_matrix) / read_depths_tens
data_matrix_RPM <- t(data_matrix_RPM)

# Explore data
head(data_matrix_RPM)
      Replicate1 Replicate2 Replicate3
Gene1   2.857143   2.666667 2.83018868
Gene2   5.714286   5.555556 5.66037736
Gene3   1.428571   1.777778 1.41509434
Gene4   0.000000   0.000000 0.09433962

Exercise 1

  • 1. Perform RPKM normalization (Reads Per Kilobase of exon and per Million reads)
  • ii) Normalize per gene length (total number per gene (row))
# Divide each row by gene length
data_matrix_RPKM <- data_matrix_RPM / gene_lengths

# Explore data
head(data_matrix_RPKM)
      Replicate1 Replicate2  Replicate3
Gene1   1.428571   1.333333 1.415094340
Gene2   1.428571   1.388889 1.415094340
Gene3   1.428571   1.777778 1.415094340
Gene4   0.000000   0.000000 0.009433962

Exercise 1

  • 1. Perform RPKM normalization (Reads Per Kilobase of exon and per Million reads)
  • iii) Store results as dataframe
# Store result as data frame
colnames(data_matrix_RPKM) <- paste("RPKM-Rep", 1:3, sep = "")
data_matrix_RPKM <- as.data.frame(data_matrix_RPKM)
class(data_matrix_RPKM)
[1] "data.frame"
# Final result
head(data_matrix_RPKM)
      RPKM-Rep1 RPKM-Rep2   RPKM-Rep3
Gene1  1.428571  1.333333 1.415094340
Gene2  1.428571  1.388889 1.415094340
Gene3  1.428571  1.777778 1.415094340
Gene4  0.000000  0.000000 0.009433962

Exercise 1

  • 2. Perform TPM normalization (Transcripts per million)
  • i) Normalize by gene length (total number per gene (row))
# Divide each column by its read depth
data_matrix_RPK <- data_matrix / gene_lengths

# Explore data
head(data_matrix_RPK)
      Replicate1 Replicate2 Replicate3
Gene1          5       6.00       15.0
Gene2          5       6.25       15.0
Gene3          5       8.00       15.0
Gene4          0       0.00        0.1

Exercise 1

  • 2. Perform TPM normalization (Transcripts per million)
  • ii) Normalize per sequencing depth (total number per replicate (column))
# Get sequencing depth for each replica
read_depths2 <- colSums(data_matrix_RPK)
 # "Tens" of reads; would be millions of reads in real data
read_depths2_tens <- read_depths2 / 10
head(read_depths2_tens)
Replicate1 Replicate2 Replicate3 
     1.500      2.025      4.510 
# Divide each column by its read depth
data_matrix_TPM <- t(data_matrix_RPK) / read_depths2_tens
data_matrix_TPM <- t(data_matrix_TPM)

# Explore data
head(data_matrix_TPM)
      Replicate1 Replicate2 Replicate3
Gene1   3.333333   2.962963 3.32594235
Gene2   3.333333   3.086420 3.32594235
Gene3   3.333333   3.950617 3.32594235
Gene4   0.000000   0.000000 0.02217295

Exercise 1

  • 2. Perform TPM normalization (Transcripts per million)
  • iii) Store results as dataframe
# Store result as data frame
colnames(data_matrix_TPM) <- paste("TPM-Rep", 1:3, sep = "")
rownames(data_matrix_TPM) <- paste("Gene", 1:4, sep = "")
data_matrix_TPM <- as.data.frame(data_matrix_TPM)
class(data_matrix_TPM)
[1] "data.frame"
# Final result
head(data_matrix_TPM)
      TPM-Rep1 TPM-Rep2   TPM-Rep3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295

Exercise 1

  • 3. Compare both methods - get sum of normalized reads for each column (RPKM)
# RPM scaled matrix
head(data_matrix_RPKM)
      RPKM-Rep1 RPKM-Rep2   RPKM-Rep3
Gene1  1.428571  1.333333 1.415094340
Gene2  1.428571  1.388889 1.415094340
Gene3  1.428571  1.777778 1.415094340
Gene4  0.000000  0.000000 0.009433962
# Total number per replica
total_RPKM <- colSums(data_matrix_RPKM)
head(total_RPKM)
RPKM-Rep1 RPKM-Rep2 RPKM-Rep3 
 4.285714  4.500000  4.254717 

Exercise 1

  • 3. Compare both methods - get sum of normalized reads for each column (TPM)
# RPM scaled matrix
head(data_matrix_TPM)
      TPM-Rep1 TPM-Rep2   TPM-Rep3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295
# Total number per replica
total_TPM <- colSums(data_matrix_TPM)
head(total_TPM)
TPM-Rep1 TPM-Rep2 TPM-Rep3 
      10       10       10 

Normalization of RNASeq data

  • RPKM vs TPM normalization
  • Both correct for sequencing depth and gene length
  • The sum of total number of reads in each column is different with RPKM and the same with TPM

Exercise 1 - solutions

Normalization in DESeq2

# Normalization in DESeq
print("Normalization in DESeq2")
[1] "Normalization in DESeq2"

Normalization in DESeq2

  • Normalize RNA composition, or “library” composition

  • RNA-seq is often used for comparing one condition to another, or one tissue to another

  • It could be that there are a lot of condition/tissue specific genes, transcribed only in one case

  • Here Gene 2 will shadow all the rest when performing normalization!

gene

Normalization in DESeq2

  • The DE gene takes up most of the counts for Sample A, but not Sample B

  • In the example below, when dividing each sample by the total number of counts to normalize, the counts would be greatly suppressed by the DE gene

  • Most genes for Sample A would be divided by the larger number of total counts and appear to be less expressed than those same genes in Sample B

gene
  • Image credits: Bioinformatics Training at Harvard Chan Bioinformatics Core

Normalization in DESeq2

  • Compare gene expression between two conditions (e.g. control vs treatment, WT vs KO)

  • Counts divided by sample-specific size factors (*)

  • Normalizing sequencing depth and RNA composition

Differential Expression analysis: DESeq2

# Normalization in DESeq
print("Differential Expression analysis: DESeq2")
[1] "Differential Expression analysis: DESeq2"

Read count data

Load all_counts.csv.txt by using read.csv() function

# Read input file with count data
all_counts <- read.csv(file = "exercises/data/all_counts.csv", row.names = 1)

Read count data

Load all_counts.csv.txt by using read.csv() function

# Explore data
head(all_counts)
                   Viv1 Viv2 Viv3 Hfd1 Hfd2 Hfd3
ENSMUSG00000090025    0    0    0    0    0    0
ENSMUSG00000064842    0    0    0    0    0    0
ENSMUSG00000051951    0    1    1    3    0    0
ENSMUSG00000089699    0    0    0    0    0    0
ENSMUSG00000088390    0    0    0    0    0    0
ENSMUSG00000089420    0    0    0    0    0    0
dim(all_counts)
[1] 37991     6
class(all_counts)
[1] "data.frame"

Read sample information

Read in sample_description.txt file by using read.table function

# Read input file with sample description
sam_des <- read.table("exercises/data/sample_description.txt", sep = "\t", header = TRUE)

Read sample information

Read in sample_description.txt file by using read.table function

# Explore data
head(sam_des)
  Sample Group Batch       InputFile1       InputFile2 OutputFile
1   Viv1   Viv     a SRR2001243.fastq SRR2001244.fastq   Viv1.bam
2   Viv2   Viv     b SRR2001245.fastq SRR2001246.fastq   Viv2.bam
3   Viv3   Viv     c SRR2001247.fastq SRR2001248.fastq   Viv3.bam
4   Hfd1   Hfd     a SRR2001249.fastq SRR2001250.fastq   Hfd1.bam
5   Hfd2   Hfd     b SRR2001251.fastq SRR2001252.fastq   Hfd2.bam
6   Hfd3   Hfd     c SRR2001253.fastq SRR2001254.fastq   Hfd3.bam
dim(sam_des)
[1] 6 6
class(sam_des)
[1] "data.frame"

Prepare DESeq2 dataset object

Collect sample information

# Prepare data for DESeq
col_data <- data.frame(Sample = sam_des$Sample,
                  Group = sam_des$Group,
                  Batch = sam_des$Batch)

# Store data as factors
col_data$Sample <- as.factor(col_data$Sample)
col_data$Group <- as.factor(col_data$Group)
col_data$Batch <- as.factor(col_data$Batch)

# Check dimensions
all(colnames(all_counts) == sam_des$Sample)
[1] TRUE

Prepare DESeq2 dataset object

Construct DESeqDataSet object

# Load DESeq2 library
library(DESeq2)

# Build DESeq dataset
dds <- DESeqDataSetFromMatrix(countData = all_counts, 
                              colData = col_data, 
                              design = ~Group)

Normalization in DESeq (II)

  • The standard differential expression analysis steps are wrapped into a single function, DESeq, which performs normalization, fitting to the model and statistical testing
  • In following sections we will explain the different parts
# Apply DESeq normalization
dds <- DESeq(dds)

# Ask for information about DESeq
?DESeq

Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq, which performs normalization, fitting to the model and statistical testing.

The DESeq function runs the following ones in order

estimateSizeFactors()

estimateDispersions()

nbinomWaldTest()

DESeq function - estimateSizeFactors()

1 - Estimation of size factors

  • The sizeFactors vector assigns to each column of the count matrix a value, the size factor
  • Then count values in the different columns can be brought to a common scale by dividing by the corresponding size factor
# Estimate size factors
sizeFactors(dds)
     Viv1      Viv2      Viv3      Hfd1      Hfd2      Hfd3 
1.2430187 0.7755226 1.0501449 0.9457439 1.0124687 1.0515602 

DESeq function - estimateSizeFactors()

gene

DESeq function - estimateDispersions()

2 - Estimation of dispersion

  • This function obtains gene-wide dispersion estimates *Then, a curve is fit to the estimates to capture the overall trend of dispersion-mean dependence
# Estimate gene-wise dispersions
head(dispersions(dds))
[1]       NA       NA 2.495857       NA       NA       NA
# Plot dispersions
plotDispEsts(dds)

plot of chunk unnamed-chunk-23

DESeq function - nbinomWaldTest()

3- Hypothesis test for differential expression

For significance testing, DESeq2 by default uses a Wald test, where the function tests whether each model coefficient differs significantly from zero, using previously calculated sizeFactors and dispersion estimates.

The Wald test P values are adjusted for multiple testing using the procedure of Benjamini and Hochberg

# Compute Wald test
nbinomWaldTest()

Getting results

Results tables are generated using the function results(), which extracts a results table with log2 fold changes, p-values and adjusted p-values

# Obtain results from DESeq
res <- results(dds)
head(res)
log2 fold change (MLE): Group Viv vs Hfd 
Wald test p-value: Group Viv vs Hfd 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat    pvalue
                   <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000090025  0.000000             NA        NA        NA        NA
ENSMUSG00000064842  0.000000             NA        NA        NA        NA
ENSMUSG00000051951  0.902301      -0.545916   2.25912 -0.241649  0.809052
ENSMUSG00000089699  0.000000             NA        NA        NA        NA
ENSMUSG00000088390  0.000000             NA        NA        NA        NA
ENSMUSG00000089420  0.000000             NA        NA        NA        NA
                        padj
                   <numeric>
ENSMUSG00000090025        NA
ENSMUSG00000064842        NA
ENSMUSG00000051951        NA
ENSMUSG00000089699        NA
ENSMUSG00000088390        NA
ENSMUSG00000089420        NA

Getting results

  • The p-values are computing the so-called independent filtering \( \rightarrow \) the adjusted p-values for the genes which do not pass the filter threshold are set to NA.
  • Rresults by default assigns a p-value of NA to genes containing count outliers, as identified using Cook's distance
# Order by adjusted p-value
res_ordered <- res[order(res$padj), ]
head(res_ordered)
log2 fold change (MLE): Group Viv vs Hfd 
Wald test p-value: Group Viv vs Hfd 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSMUSG00000032080 14291.818       -5.04940  0.251952  -20.0411 2.41261e-89
ENSMUSG00000024526   465.879       -6.89875  0.346598  -19.9042 3.74279e-88
ENSMUSG00000069170   343.928       -4.05808  0.253508  -16.0077 1.12875e-57
ENSMUSG00000042041   651.002       -3.18880  0.203665  -15.6571 2.97319e-55
ENSMUSG00000032231   883.324       -2.58604  0.176730  -14.6327 1.73648e-48
ENSMUSG00000026043   833.868       -2.46575  0.171552  -14.3732 7.62213e-47
                          padj
                     <numeric>
ENSMUSG00000032080 4.02472e-85
ENSMUSG00000024526 3.12186e-84
ENSMUSG00000069170 6.27659e-54
ENSMUSG00000042041 1.23997e-51
ENSMUSG00000032231 5.79358e-45
ENSMUSG00000026043 2.11921e-43

Exercise 2

Exercise 2 - solutions

Data quality assessment

Data quality assessment and quality control are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing.

We will use following visualization tools to assess the data quality

  • Heatmap of count matrix

  • Sample Distance Matrix

  • Principal Component Analysis

Transformation of count data

  • In order to test for differential expression we operate on raw counts
  • However for other downstream analyses (e.g. visualization or clustering) it is useful to work with transformed versions of the count data
  • Remove dependence of the variance on the mean
  • There are two alternative approaches

1. Regularized logarithm (rlog)

2. Variance stabilizing transformation or (vst)

Both transforms the original count data to the log2 scale normalized to library size

Transformation of count data

1. Regularized logarithm (rlog)

  • When applying a log transformation on a count matrix, careful with zero values

  • Add a pseudo-count to zero values in the count matrix

  • rlog already estimates the best pseudo-count

# Regularized log transformation
rld <- rlog(dds)
class(rld)
[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"
# Get rld in count format
rld_counts <- assay(rld)
class(rld_counts)
[1] "matrix" "array" 

Transformation of count data

1. Variance Stabilizing Transformation (rlog)

  • vst also transforms the original count data to the log2 scale normalized to library size

  • Faster than rlog

  • Usefull when working with very long datasets

# Regularized log transformation
vsd <- varianceStabilizingTransformation(dds)
class(vsd)
[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"
# Get rld in count format
vsd_counts <- assay(vsd)
class(vsd_counts)
[1] "matrix" "array" 

Visualizing heatmaps

  • Visualization of count matrix \( \rightarrow \) heatmap
  • Get the normalized counts from the DESeq2 dataset
# Load pheatmap library
library("pheatmap")

# Get dds normalized counts
dds_counts <- counts(dds, normalized = TRUE)
head(dds_counts)
                   Viv1     Viv2      Viv3     Hfd1 Hfd2 Hfd3
ENSMUSG00000090025    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000064842    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000051951    0 1.289453 0.9522495 3.172106    0    0
ENSMUSG00000089699    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000088390    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000089420    0 0.000000 0.0000000 0.000000    0    0
# Get normalized counts - 20 higher values
select <- order(rowMeans(dds_counts), decreasing = TRUE)[1:20]
head(select)
[1] 25353 34980 29008 16168 34999 16721

Visualizing heatmaps

  • Plot heatmap for both rlog transformed data
# Heatmap of the rlog transformed data
pheatmap(assay(rld)[select, ])

plot of chunk unnamed-chunk-30

Visualizing heatmaps

  • Plot heatmap for both rld and vsd transformed data
# Heatmap of the vst transformed data
pheatmap(assay(vsd)[select, ])

plot of chunk unnamed-chunk-31

Visualizing heatmaps

  • Interpreting a heatmap
  • Clustering affects the data visualization
  • Organize data based on similarity

Visualizing heatmaps

  • Example: 3 samples and 4 genes
  • Hierarchically cluster (reorder) the rows (genes)
  • Step1: identify which gene is most similar to Gene 1

Visualizing heatmaps

  • Gene 1 and Gene 3 are the most similar
  • Merge them together as a single cluster
  • Step2: identify which gene is most similar to Gene 2 \( \rightarrow \) Gene 4

Visualizing heatmaps

  • Gene 2 and Gene 4 are the most similar
  • Merge them together as a single cluster
  • Step3: repeat considering the new clusters as single genes

Visualizing heatmaps

  • Gene 2 and Gene 4 are the most similar
  • Merge them together as a single cluster
  • Step3: repeat considering the new clusters as single genes

Visualizing heatmaps

  • Hierarchical clustering usually accompained by a dendogram
  • It indicates both the similarity and the order in which the clusters were formed
  • First cluster 1, then cluster 2 and finally cluster 3

Sample Distance Matrix

# Sample Distance Matrix
print("Sample Distance Matrix")
[1] "Sample Distance Matrix"

Sample Distance Matrix

  • The Sample Distance Matrix (SDM) is a useful way to explore this clustering behavior, by computing the “similarity” across the different samples
  • Ideally, one would expect that the different biological replicates cluster together in separate groups, meaning that our experimental desing is able to correctly discriminate between them
  • For the purpose of defining what “similarity” means, different mathematical approaches exist and depend on choice, based of the kind of experiment and data under study \( \rightarrow \) Euclidean distance

Sample Distance Matrix

  • Another use of the transformed data is sample clustering
  • Apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances
# Compute SDM from rlog transformed data
sample_dist <- dist(t(assay(rld)))
class(sample_dist)
[1] "dist"

Sample Distance Matrix

# Get SDM in matrix form
sdm <- as.matrix(sample_dist)
class(sdm)
[1] "matrix" "array" 
# Load library
library("RColorBrewer")

# Add row names for clear plot
rownames(sdm) <- rld$Group
colnames(sdm) <- NULL

# Add colors
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)

Sample Distance Matrix

# Plot heatmap
pheatmap(sdm,
         clustering_distance_rows = sample_dist,
         clustering_distance_cols = sample_dist,
         col = colors)

plot of chunk unnamed-chunk-35

Principal Component Analysis

# Principal Component Analysis
print("Principal Component Analysis")
[1] "Principal Component Analysis"

Principal Component Analysis

  • Typical form of a RNASeq count matrix
  • Genes as rows and samples as columns

Principal Component Analysis

  • Focus on the first two samples
  • Study correlation between sample 1 and sample 2

Principal Component Analysis

  • Represent correlation between two samples plot the measurements for each gene
  • From left to right, gene 1 highly transcribed in sample 1 and lowly in sample 2
  • From left to right, gene 9 highly transcribed in sample 2 and lowly in sample 1
  • Infer that we have two groups of samples, using different genes

Principal Component Analysis

  • Add information about sample 3 \( \rightarrow \) 3D plot
  • Inference in this way becomes more complicated \( \rightarrow \) multidimensional data

Principal Component Analysis

  • Consider the rest of columns on our dataset
  • Plot correlations of sample 1 vs sample 2, sample 1 vs sample 3, …
  • Inference in this way becomes unduable \( \rightarrow \) multidimensional data

Principal Component Analysis

  • Instead, perform dimensional reduction methods
  • Principal component analysis (PCA) plot \( \rightarrow \) convert these correlations to a 2D space
  • Samples that are highly correlated cluster together \( \rightarrow \) assign different color code

Principal Component Analysis

  • Ideally expect our replicates to be clearly aligned along PC1 or PC2
  • That means good quality of our data and good experimental design
  • Typically for bulk-RNASeq data, less than 20% correlation is considered negligible

Principal Component Analysis

  • Perform PCA plot of the rld transformed data
# PCA plot on our rld transformed data
plotPCA(rld, intgroup = "Group")

plot of chunk unnamed-chunk-37

# Save figure
library(ggplot2)
ggsave(file = "figures/PCA_plot_rld.png")

Principal Component Analysis

  • Perform PCA plot of the vsd transformed data
# PCA plot on our rld transformed data
plotPCA(vsd, intgroup = "Group")

plot of chunk unnamed-chunk-38

# Save figure
library(ggplot2)
ggsave(file = "figures/PCA_plot_vst.png")

Exercise 3

Exercise 3 - solutions

Additional topics

# Additional topics
print("Additional topics")
[1] "Additional topics"

Add Gene symbol - annotation files and biomart

# # Add Gene Symbols
# library(biomaRt)
# 
# # retrieve the mm9 information from biomart
# mart=useMart('ENSEMBL_MART_ENSEMBL',dataset='mmusculus_gene_ensembl',
#              host="may2012.archive.ensembl.org")
# 
# bm<-getBM(attributes=c('ensembl_gene_id','mgi_symbol'),
#           filters ='ensembl_gene_id',
#           values=rownames(resOrdered), mart=mart)
# 
# # see the first few rows of "bm" object
# head(bm)      
# 
# # ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl',version = 80)

Add Gene symbol (Continued)

# # merge the Gene_symbol to our DE dataset
# resAnnotated <- merge(as.data.frame(resOrdered),bm,by.x=0,by.y=1)
# head(resAnnotated)
# # change the column name
# colnames(resAnnotated)[1]<-"ensembl_gene_id"

Add Gene symbol (Continued)

# # Order results by adjusted p value
# resAnnotated<-resAnnotated[order(resAnnotated$pvalue,decreasing=F),]
# 
# # show the result with gene symbol annotation
# head(resAnnotated)

Outline

  • The RNASeq pipeline

  • Normalization of RNASeq data

  • Normalization in DESeq2

  • Differential Expression analysis: DESeq2

  • Visualization of RNASeq data: SDM and PCA

Session Information

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggplot2_3.3.5               RColorBrewer_1.1-3         
 [3] pheatmap_1.0.12             DESeq2_1.34.0              
 [5] SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [7] MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
[11] IRanges_2.28.0              S4Vectors_0.32.4           
[13] BiocGenerics_0.40.0         knitr_1.38                 

loaded via a namespace (and not attached):
 [1] httr_1.4.2             bit64_4.0.5            splines_4.1.2         
 [4] assertthat_0.2.1       highr_0.9              blob_1.2.2            
 [7] GenomeInfoDbData_1.2.7 pillar_1.7.0           RSQLite_2.2.12        
[10] lattice_0.20-45        glue_1.6.2             digest_0.6.29         
[13] XVector_0.34.0         colorspace_2.0-3       Matrix_1.4-1          
[16] XML_3.99-0.9           pkgconfig_2.0.3        genefilter_1.76.0     
[19] zlibbioc_1.40.0        purrr_0.3.4            xtable_1.8-4          
[22] scales_1.1.1           BiocParallel_1.28.3    tibble_3.1.6          
[25] annotate_1.72.0        KEGGREST_1.34.0        farver_2.1.0          
[28] generics_0.1.2         ellipsis_0.3.2         withr_2.5.0           
[31] cachem_1.0.6           cli_3.2.0              survival_3.3-1        
[34] magrittr_2.0.3         crayon_1.5.1           memoise_2.0.1         
[37] evaluate_0.15          fansi_1.0.3            textshaping_0.3.6     
[40] tools_4.1.2            lifecycle_1.0.1        stringr_1.4.0         
[43] munsell_0.5.0          locfit_1.5-9.5         DelayedArray_0.20.0   
[46] AnnotationDbi_1.56.2   Biostrings_2.62.0      compiler_4.1.2        
[49] systemfonts_1.0.4      rlang_1.0.2            grid_4.1.2            
[52] RCurl_1.98-1.6         rstudioapi_0.13        bitops_1.0-7          
[55] labeling_0.4.2         gtable_0.3.0           DBI_1.1.2             
[58] R6_2.5.1               dplyr_1.0.8            fastmap_1.1.0         
[61] bit_4.0.4              utf8_1.2.2             ragg_1.2.2            
[64] stringi_1.7.6          parallel_4.1.2         Rcpp_1.0.8.3          
[67] vctrs_0.4.0            geneplotter_1.72.0     png_0.1-7             
[70] tidyselect_1.1.2       xfun_0.30