Introduction to RNASeq

MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
April 2022

Outline

The RNASeq pipeline
Normalization of RNASeq data
Normalization in DESeq2
Differential Expression analysis: DESeq2
Visualization of RNASeq data: SDM and PCA

Materials

All prerequisites, links to material and slides for this course can be found on github.

LMS_RNASeq

Or can be downloaded as a zip archive from here.

Download zip

Before we start...

Activate the R script source panel

Set working directory

Before running any of the code, set the working directory to the folder we unarchived
Navigate to the LMS_RNASeq/course folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

Set working directory - from terminal

Use getwd() to see where your current directory is

# Set working directory
# setwd("/Volumes/bioinfomatics$/jurtasun/Courses/CBW2022/LMS_RNASeq/course")
# getwd()

# Working with Mac
# setwd("~/Downloads/LMS_RNASeq/course")

# Working with Windows
# setwd("~\\Downloads\\LMS_RNASeq\\course")

Normalization of RNASeq data

# Normalization of RNASeq data
print("Normalization of RNASeq data")

[1] "Normalization of RNASeq data"

Normalization of RNASeq data

Factors considered for normalization

Sequencing depth
Gene length
RNA composition

Normalization of RNASeq data

Sequencing depth: sequencing runs with more depth will have more reads mapping to each gene
Needed for comparison of gene expression between samples
In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth

Image credits: Bioinformatics Training at Harvard Chan Bioinformatics Core

Normalization of RNASeq data

Gene length: longer genes will have more reads mapping to them
Needed for comparing expression between different genes within the same sample
In the example below, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer

Image credits: Bioinformatics Training at Harvard Chan Bioinformatics Core

Normalization of RNASeq data

We don't want those effects to bias our analysis

Counts per million (CPM)
Reads per Kilobase and Million reads (RPKM)
Fragments per Kilobase and Million (FPKM)
Transcripts per million (TPM)

Normalization of RNASeq data

Difference between RPKM/FPKM and TPM, go through an example
Typical form of a count matrix data, with genes as rows and samples as columns

Normalization of RNASeq data

For every gene, rep 3 way more reads than the other 2 \( \rightarrow \) higher sequencing depth
For every sample, Gene B twice as long than Gene A \( \rightarrow \) twice as many reads for any replicate

Normalization of RNASeq data

RPKM normalization
Step1: normalize sequencing depth “per million” \( \rightarrow \) get RPM
Step2: normalize gene length in kilobases \( \rightarrow \) get RPKM

Normalization of RNASeq data

TPM normalization
Step1: normalize gene length in kilobases \( \rightarrow \) get RPK
Step2: normalize sequencing depth “per million” \( \rightarrow \) get TPM

Exercise 1

Exercise 1

Exercise 1

Load data_matrix.txt by using read.table() function

# Load data - 3 replicates and 4 genes dataset
data_matrix <- read.table("exercises/data/data_matrix.txt")

# Explore data
head(data_matrix)

Exercise 1

Load data_matrix.txt by using read.table() function

# Set row and column names
colnames(data_matrix) <- paste("Replicate", 1:3, sep = "")
rownames(data_matrix) <- paste("Gene", 1:4, sep = "")

# Gene lengths: Gene1 = 2kb, Gene2 = 4kb, Gene3 = 1kb, Gene4 = 10kb
gene_lengths <- c(2, 4, 1, 10)

# Explore data
head(data_matrix)

      Replicate1 Replicate2 Replicate3
Gene1         10         12         30
Gene2         20         25         60
Gene3          5          8         15
Gene4          0          0          1

Exercise 1

1. Perform RPKM normalization (Reads Per Kilobase of exon and per Million reads)
i) Normalize by sequencing depth (total number per replica (column))

# Get sequencing depth for each replica
read_depths <- colSums(data_matrix)
# "Tens" of reads; would be millions of reads in real data
read_depths_tens <- read_depths / 10
head(read_depths_tens)

Replicate1 Replicate2 Replicate3 
       3.5        4.5       10.6

# Divide each column by its read depth
data_matrix_RPM <- t(data_matrix) / read_depths_tens
data_matrix_RPM <- t(data_matrix_RPM)

# Explore data
head(data_matrix_RPM)

      Replicate1 Replicate2 Replicate3
Gene1   2.857143   2.666667 2.83018868
Gene2   5.714286   5.555556 5.66037736
Gene3   1.428571   1.777778 1.41509434
Gene4   0.000000   0.000000 0.09433962

Exercise 1

1. Perform RPKM normalization (Reads Per Kilobase of exon and per Million reads)
ii) Normalize per gene length (total number per gene (row))

# Divide each row by gene length
data_matrix_RPKM <- data_matrix_RPM / gene_lengths

# Explore data
head(data_matrix_RPKM)

      Replicate1 Replicate2  Replicate3
Gene1   1.428571   1.333333 1.415094340
Gene2   1.428571   1.388889 1.415094340
Gene3   1.428571   1.777778 1.415094340
Gene4   0.000000   0.000000 0.009433962

Exercise 1

1. Perform RPKM normalization (Reads Per Kilobase of exon and per Million reads)
iii) Store results as dataframe

# Store result as data frame
colnames(data_matrix_RPKM) <- paste("RPKM-Rep", 1:3, sep = "")
data_matrix_RPKM <- as.data.frame(data_matrix_RPKM)
class(data_matrix_RPKM)

[1] "data.frame"

# Final result
head(data_matrix_RPKM)

      RPKM-Rep1 RPKM-Rep2   RPKM-Rep3
Gene1  1.428571  1.333333 1.415094340
Gene2  1.428571  1.388889 1.415094340
Gene3  1.428571  1.777778 1.415094340
Gene4  0.000000  0.000000 0.009433962

Exercise 1

2. Perform TPM normalization (Transcripts per million)
i) Normalize by gene length (total number per gene (row))

# Divide each column by its read depth
data_matrix_RPK <- data_matrix / gene_lengths

# Explore data
head(data_matrix_RPK)

      Replicate1 Replicate2 Replicate3
Gene1          5       6.00       15.0
Gene2          5       6.25       15.0
Gene3          5       8.00       15.0
Gene4          0       0.00        0.1

Exercise 1

2. Perform TPM normalization (Transcripts per million)
ii) Normalize per sequencing depth (total number per replicate (column))

# Get sequencing depth for each replica
read_depths2 <- colSums(data_matrix_RPK)
 # "Tens" of reads; would be millions of reads in real data
read_depths2_tens <- read_depths2 / 10
head(read_depths2_tens)

Replicate1 Replicate2 Replicate3 
     1.500      2.025      4.510

# Divide each column by its read depth
data_matrix_TPM <- t(data_matrix_RPK) / read_depths2_tens
data_matrix_TPM <- t(data_matrix_TPM)

# Explore data
head(data_matrix_TPM)

      Replicate1 Replicate2 Replicate3
Gene1   3.333333   2.962963 3.32594235
Gene2   3.333333   3.086420 3.32594235
Gene3   3.333333   3.950617 3.32594235
Gene4   0.000000   0.000000 0.02217295

Exercise 1

2. Perform TPM normalization (Transcripts per million)
iii) Store results as dataframe

# Store result as data frame
colnames(data_matrix_TPM) <- paste("TPM-Rep", 1:3, sep = "")
rownames(data_matrix_TPM) <- paste("Gene", 1:4, sep = "")
data_matrix_TPM <- as.data.frame(data_matrix_TPM)
class(data_matrix_TPM)

[1] "data.frame"

# Final result
head(data_matrix_TPM)

      TPM-Rep1 TPM-Rep2   TPM-Rep3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295

Exercise 1

3. Compare both methods - get sum of normalized reads for each column (RPKM)

# RPM scaled matrix
head(data_matrix_RPKM)

      RPKM-Rep1 RPKM-Rep2   RPKM-Rep3
Gene1  1.428571  1.333333 1.415094340
Gene2  1.428571  1.388889 1.415094340
Gene3  1.428571  1.777778 1.415094340
Gene4  0.000000  0.000000 0.009433962

# Total number per replica
total_RPKM <- colSums(data_matrix_RPKM)
head(total_RPKM)

RPKM-Rep1 RPKM-Rep2 RPKM-Rep3 
 4.285714  4.500000  4.254717

Exercise 1

3. Compare both methods - get sum of normalized reads for each column (TPM)

# RPM scaled matrix
head(data_matrix_TPM)

      TPM-Rep1 TPM-Rep2   TPM-Rep3
Gene1 3.333333 2.962963 3.32594235
Gene2 3.333333 3.086420 3.32594235
Gene3 3.333333 3.950617 3.32594235
Gene4 0.000000 0.000000 0.02217295

# Total number per replica
total_TPM <- colSums(data_matrix_TPM)
head(total_TPM)

TPM-Rep1 TPM-Rep2 TPM-Rep3 
      10       10       10

Normalization of RNASeq data

RPKM vs TPM normalization
Both correct for sequencing depth and gene length
The sum of total number of reads in each column is different with RPKM and the same with TPM

Exercise 1 - solutions

Exercise 1 - solutions

Normalization in DESeq2

# Normalization in DESeq
print("Normalization in DESeq2")

[1] "Normalization in DESeq2"

Normalization in DESeq2

Normalize RNA composition, or “library” composition
RNA-seq is often used for comparing one condition to another, or one tissue to another
It could be that there are a lot of condition/tissue specific genes, transcribed only in one case
Here Gene 2 will shadow all the rest when performing normalization!

Normalization in DESeq2

The DE gene takes up most of the counts for Sample A, but not Sample B
In the example below, when dividing each sample by the total number of counts to normalize, the counts would be greatly suppressed by the DE gene
Most genes for Sample A would be divided by the larger number of total counts and appear to be less expressed than those same genes in Sample B

Image credits: Bioinformatics Training at Harvard Chan Bioinformatics Core

Normalization in DESeq2

Compare gene expression between two conditions (e.g. control vs treatment, WT vs KO)
Counts divided by sample-specific size factors (*)
Normalizing sequencing depth and RNA composition

Differential Expression analysis: DESeq2

# Normalization in DESeq
print("Differential Expression analysis: DESeq2")

[1] "Differential Expression analysis: DESeq2"

Read count data

Load all_counts.csv.txt by using read.csv() function

# Read input file with count data
all_counts <- read.csv(file = "exercises/data/all_counts.csv", row.names = 1)

Read count data

Load all_counts.csv.txt by using read.csv() function

# Explore data
head(all_counts)

                   Viv1 Viv2 Viv3 Hfd1 Hfd2 Hfd3
ENSMUSG00000090025    0    0    0    0    0    0
ENSMUSG00000064842    0    0    0    0    0    0
ENSMUSG00000051951    0    1    1    3    0    0
ENSMUSG00000089699    0    0    0    0    0    0
ENSMUSG00000088390    0    0    0    0    0    0
ENSMUSG00000089420    0    0    0    0    0    0

dim(all_counts)

[1] 37991     6

class(all_counts)

[1] "data.frame"

Read sample information

Read in sample_description.txt file by using read.table function

# Read input file with sample description
sam_des <- read.table("exercises/data/sample_description.txt", sep = "\t", header = TRUE)

Read sample information

Read in sample_description.txt file by using read.table function

# Explore data
head(sam_des)

  Sample Group Batch       InputFile1       InputFile2 OutputFile
1   Viv1   Viv     a SRR2001243.fastq SRR2001244.fastq   Viv1.bam
2   Viv2   Viv     b SRR2001245.fastq SRR2001246.fastq   Viv2.bam
3   Viv3   Viv     c SRR2001247.fastq SRR2001248.fastq   Viv3.bam
4   Hfd1   Hfd     a SRR2001249.fastq SRR2001250.fastq   Hfd1.bam
5   Hfd2   Hfd     b SRR2001251.fastq SRR2001252.fastq   Hfd2.bam
6   Hfd3   Hfd     c SRR2001253.fastq SRR2001254.fastq   Hfd3.bam

dim(sam_des)

[1] 6 6

class(sam_des)

[1] "data.frame"

Prepare DESeq2 dataset object

Collect sample information

# Prepare data for DESeq
col_data <- data.frame(Sample = sam_des$Sample,
                  Group = sam_des$Group,
                  Batch = sam_des$Batch)

# Store data as factors
col_data$Sample <- as.factor(col_data$Sample)
col_data$Group <- as.factor(col_data$Group)
col_data$Batch <- as.factor(col_data$Batch)

# Check dimensions
all(colnames(all_counts) == sam_des$Sample)

[1] TRUE

Prepare DESeq2 dataset object

Construct DESeqDataSet object

# Load DESeq2 library
library(DESeq2)

# Build DESeq dataset
dds <- DESeqDataSetFromMatrix(countData = all_counts, 
                              colData = col_data, 
                              design = ~Group)

Normalization in DESeq (II)

The standard differential expression analysis steps are wrapped into a single function, DESeq, which performs normalization, fitting to the model and statistical testing
In following sections we will explain the different parts

# Apply DESeq normalization
dds <- DESeq(dds)

# Ask for information about DESeq
?DESeq

Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq, which performs normalization, fitting to the model and statistical testing.

The DESeq function runs the following ones in order

estimateSizeFactors()

estimateDispersions()

nbinomWaldTest()

DESeq function - estimateSizeFactors()

1 - Estimation of size factors

The sizeFactors vector assigns to each column of the count matrix a value, the size factor
Then count values in the different columns can be brought to a common scale by dividing by the corresponding size factor

# Estimate size factors
sizeFactors(dds)

     Viv1      Viv2      Viv3      Hfd1      Hfd2      Hfd3 
1.2430187 0.7755226 1.0501449 0.9457439 1.0124687 1.0515602

DESeq function - estimateSizeFactors()

DESeq function - estimateDispersions()

2 - Estimation of dispersion

This function obtains gene-wide dispersion estimates *Then, a curve is fit to the estimates to capture the overall trend of dispersion-mean dependence

# Estimate gene-wise dispersions
head(dispersions(dds))

[1]       NA       NA 2.495857       NA       NA       NA

# Plot dispersions
plotDispEsts(dds)

plot of chunk unnamed-chunk-23

DESeq function - nbinomWaldTest()

3- Hypothesis test for differential expression

For significance testing, DESeq2 by default uses a Wald test, where the function tests whether each model coefficient differs significantly from zero, using previously calculated sizeFactors and dispersion estimates.

The Wald test P values are adjusted for multiple testing using the procedure of Benjamini and Hochberg

# Compute Wald test
nbinomWaldTest()

Getting results

Results tables are generated using the function results(), which extracts a results table with log2 fold changes, p-values and adjusted p-values

# Obtain results from DESeq
res <- results(dds)
head(res)

log2 fold change (MLE): Group Viv vs Hfd 
Wald test p-value: Group Viv vs Hfd 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat    pvalue
                   <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000090025  0.000000             NA        NA        NA        NA
ENSMUSG00000064842  0.000000             NA        NA        NA        NA
ENSMUSG00000051951  0.902301      -0.545916   2.25912 -0.241649  0.809052
ENSMUSG00000089699  0.000000             NA        NA        NA        NA
ENSMUSG00000088390  0.000000             NA        NA        NA        NA
ENSMUSG00000089420  0.000000             NA        NA        NA        NA
                        padj
                   <numeric>
ENSMUSG00000090025        NA
ENSMUSG00000064842        NA
ENSMUSG00000051951        NA
ENSMUSG00000089699        NA
ENSMUSG00000088390        NA
ENSMUSG00000089420        NA

Getting results

The p-values are computing the so-called independent filtering \( \rightarrow \) the adjusted p-values for the genes which do not pass the filter threshold are set to NA.
Rresults by default assigns a p-value of NA to genes containing count outliers, as identified using Cook's distance

# Order by adjusted p-value
res_ordered <- res[order(res$padj), ]
head(res_ordered)

log2 fold change (MLE): Group Viv vs Hfd 
Wald test p-value: Group Viv vs Hfd 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSMUSG00000032080 14291.818       -5.04940  0.251952  -20.0411 2.41261e-89
ENSMUSG00000024526   465.879       -6.89875  0.346598  -19.9042 3.74279e-88
ENSMUSG00000069170   343.928       -4.05808  0.253508  -16.0077 1.12875e-57
ENSMUSG00000042041   651.002       -3.18880  0.203665  -15.6571 2.97319e-55
ENSMUSG00000032231   883.324       -2.58604  0.176730  -14.6327 1.73648e-48
ENSMUSG00000026043   833.868       -2.46575  0.171552  -14.3732 7.62213e-47
                          padj
                     <numeric>
ENSMUSG00000032080 4.02472e-85
ENSMUSG00000024526 3.12186e-84
ENSMUSG00000069170 6.27659e-54
ENSMUSG00000042041 1.23997e-51
ENSMUSG00000032231 5.79358e-45
ENSMUSG00000026043 2.11921e-43

Exercise 2

Exercise 2

Exercise 2 - solutions

Exercise 2 - solutions

Data quality assessment

Data quality assessment and quality control are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing.

We will use following visualization tools to assess the data quality

Heatmap of count matrix
Sample Distance Matrix
Principal Component Analysis

Transformation of count data

In order to test for differential expression we operate on raw counts
However for other downstream analyses (e.g. visualization or clustering) it is useful to work with transformed versions of the count data
Remove dependence of the variance on the mean
There are two alternative approaches

1. Regularized logarithm (rlog)

2. Variance stabilizing transformation or (vst)

Both transforms the original count data to the log2 scale normalized to library size

Transformation of count data

1. Regularized logarithm (rlog)

When applying a log transformation on a count matrix, careful with zero values
Add a pseudo-count to zero values in the count matrix
rlog already estimates the best pseudo-count

# Regularized log transformation
rld <- rlog(dds)
class(rld)

[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"

# Get rld in count format
rld_counts <- assay(rld)
class(rld_counts)

[1] "matrix" "array"

Transformation of count data

1. Variance Stabilizing Transformation (rlog)

vst also transforms the original count data to the log2 scale normalized to library size
Faster than rlog
Usefull when working with very long datasets

# Regularized log transformation
vsd <- varianceStabilizingTransformation(dds)
class(vsd)

[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"

# Get rld in count format
vsd_counts <- assay(vsd)
class(vsd_counts)

[1] "matrix" "array"

Visualizing heatmaps

Visualization of count matrix \( \rightarrow \) heatmap
Get the normalized counts from the DESeq2 dataset

# Load pheatmap library
library("pheatmap")

# Get dds normalized counts
dds_counts <- counts(dds, normalized = TRUE)
head(dds_counts)

                   Viv1     Viv2      Viv3     Hfd1 Hfd2 Hfd3
ENSMUSG00000090025    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000064842    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000051951    0 1.289453 0.9522495 3.172106    0    0
ENSMUSG00000089699    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000088390    0 0.000000 0.0000000 0.000000    0    0
ENSMUSG00000089420    0 0.000000 0.0000000 0.000000    0    0

# Get normalized counts - 20 higher values
select <- order(rowMeans(dds_counts), decreasing = TRUE)[1:20]
head(select)

[1] 25353 34980 29008 16168 34999 16721

Visualizing heatmaps

Plot heatmap for both rlog transformed data

# Heatmap of the rlog transformed data
pheatmap(assay(rld)[select, ])

plot of chunk unnamed-chunk-30

Visualizing heatmaps

Plot heatmap for both rld and vsd transformed data

# Heatmap of the vst transformed data
pheatmap(assay(vsd)[select, ])

plot of chunk unnamed-chunk-31

Visualizing heatmaps

Interpreting a heatmap
Clustering affects the data visualization
Organize data based on similarity

Visualizing heatmaps

Example: 3 samples and 4 genes
Hierarchically cluster (reorder) the rows (genes)
Step1: identify which gene is most similar to Gene 1

Visualizing heatmaps

Gene 1 and Gene 3 are the most similar
Merge them together as a single cluster
Step2: identify which gene is most similar to Gene 2 \( \rightarrow \) Gene 4

Visualizing heatmaps

Gene 2 and Gene 4 are the most similar
Merge them together as a single cluster
Step3: repeat considering the new clusters as single genes

Visualizing heatmaps

Gene 2 and Gene 4 are the most similar
Merge them together as a single cluster
Step3: repeat considering the new clusters as single genes

Visualizing heatmaps

Hierarchical clustering usually accompained by a dendogram
It indicates both the similarity and the order in which the clusters were formed
First cluster 1, then cluster 2 and finally cluster 3

Sample Distance Matrix

# Sample Distance Matrix
print("Sample Distance Matrix")

[1] "Sample Distance Matrix"

Sample Distance Matrix

The Sample Distance Matrix (SDM) is a useful way to explore this clustering behavior, by computing the “similarity” across the different samples
Ideally, one would expect that the different biological replicates cluster together in separate groups, meaning that our experimental desing is able to correctly discriminate between them
For the purpose of defining what “similarity” means, different mathematical approaches exist and depend on choice, based of the kind of experiment and data under study \( \rightarrow \) Euclidean distance

Sample Distance Matrix

Another use of the transformed data is sample clustering
Apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances

# Compute SDM from rlog transformed data
sample_dist <- dist(t(assay(rld)))
class(sample_dist)

[1] "dist"

Sample Distance Matrix

# Get SDM in matrix form
sdm <- as.matrix(sample_dist)
class(sdm)

[1] "matrix" "array"

# Load library
library("RColorBrewer")

# Add row names for clear plot
rownames(sdm) <- rld$Group
colnames(sdm) <- NULL

# Add colors
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)

Sample Distance Matrix

# Plot heatmap
pheatmap(sdm,
         clustering_distance_rows = sample_dist,
         clustering_distance_cols = sample_dist,
         col = colors)

plot of chunk unnamed-chunk-35

Principal Component Analysis

# Principal Component Analysis
print("Principal Component Analysis")

[1] "Principal Component Analysis"

Principal Component Analysis

Typical form of a RNASeq count matrix
Genes as rows and samples as columns

Principal Component Analysis

Focus on the first two samples
Study correlation between sample 1 and sample 2

Principal Component Analysis

Represent correlation between two samples plot the measurements for each gene
From left to right, gene 1 highly transcribed in sample 1 and lowly in sample 2
From left to right, gene 9 highly transcribed in sample 2 and lowly in sample 1
Infer that we have two groups of samples, using different genes

Principal Component Analysis

Add information about sample 3 \( \rightarrow \) 3D plot
Inference in this way becomes more complicated \( \rightarrow \) multidimensional data

Principal Component Analysis

Consider the rest of columns on our dataset
Plot correlations of sample 1 vs sample 2, sample 1 vs sample 3, …
Inference in this way becomes unduable \( \rightarrow \) multidimensional data

Principal Component Analysis

Instead, perform dimensional reduction methods
Principal component analysis (PCA) plot \( \rightarrow \) convert these correlations to a 2D space
Samples that are highly correlated cluster together \( \rightarrow \) assign different color code

Principal Component Analysis

Ideally expect our replicates to be clearly aligned along PC1 or PC2
That means good quality of our data and good experimental design
Typically for bulk-RNASeq data, less than 20% correlation is considered negligible

Principal Component Analysis

Perform PCA plot of the rld transformed data

# PCA plot on our rld transformed data
plotPCA(rld, intgroup = "Group")

plot of chunk unnamed-chunk-37

# Save figure
library(ggplot2)
ggsave(file = "figures/PCA_plot_rld.png")

Principal Component Analysis

Perform PCA plot of the vsd transformed data

# PCA plot on our rld transformed data
plotPCA(vsd, intgroup = "Group")

plot of chunk unnamed-chunk-38

# Save figure
library(ggplot2)
ggsave(file = "figures/PCA_plot_vst.png")

Exercise 3

Exercise 3

Exercise 3 - solutions

Exercise 3 - solutions

Additional topics

# Additional topics
print("Additional topics")

[1] "Additional topics"

Add Gene symbol - annotation files and biomart

# # Add Gene Symbols
# library(biomaRt)
# 
# # retrieve the mm9 information from biomart
# mart=useMart('ENSEMBL_MART_ENSEMBL',dataset='mmusculus_gene_ensembl',
#              host="may2012.archive.ensembl.org")
# 
# bm<-getBM(attributes=c('ensembl_gene_id','mgi_symbol'),
#           filters ='ensembl_gene_id',
#           values=rownames(resOrdered), mart=mart)
# 
# # see the first few rows of "bm" object
# head(bm)      
# 
# # ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl',version = 80)

Add Gene symbol (Continued)

# # merge the Gene_symbol to our DE dataset
# resAnnotated <- merge(as.data.frame(resOrdered),bm,by.x=0,by.y=1)
# head(resAnnotated)
# # change the column name
# colnames(resAnnotated)[1]<-"ensembl_gene_id"

Add Gene symbol (Continued)

# # Order results by adjusted p value
# resAnnotated<-resAnnotated[order(resAnnotated$pvalue,decreasing=F),]
# 
# # show the result with gene symbol annotation
# head(resAnnotated)

Outline

The RNASeq pipeline
Normalization of RNASeq data
Normalization in DESeq2
Differential Expression analysis: DESeq2
Visualization of RNASeq data: SDM and PCA

Session Information

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggplot2_3.3.5               RColorBrewer_1.1-3         
 [3] pheatmap_1.0.12             DESeq2_1.34.0              
 [5] SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [7] MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
[11] IRanges_2.28.0              S4Vectors_0.32.4           
[13] BiocGenerics_0.40.0         knitr_1.38                 

loaded via a namespace (and not attached):
 [1] httr_1.4.2             bit64_4.0.5            splines_4.1.2         
 [4] assertthat_0.2.1       highr_0.9              blob_1.2.2            
 [7] GenomeInfoDbData_1.2.7 pillar_1.7.0           RSQLite_2.2.12        
[10] lattice_0.20-45        glue_1.6.2             digest_0.6.29         
[13] XVector_0.34.0         colorspace_2.0-3       Matrix_1.4-1          
[16] XML_3.99-0.9           pkgconfig_2.0.3        genefilter_1.76.0     
[19] zlibbioc_1.40.0        purrr_0.3.4            xtable_1.8-4          
[22] scales_1.1.1           BiocParallel_1.28.3    tibble_3.1.6          
[25] annotate_1.72.0        KEGGREST_1.34.0        farver_2.1.0          
[28] generics_0.1.2         ellipsis_0.3.2         withr_2.5.0           
[31] cachem_1.0.6           cli_3.2.0              survival_3.3-1        
[34] magrittr_2.0.3         crayon_1.5.1           memoise_2.0.1         
[37] evaluate_0.15          fansi_1.0.3            textshaping_0.3.6     
[40] tools_4.1.2            lifecycle_1.0.1        stringr_1.4.0         
[43] munsell_0.5.0          locfit_1.5-9.5         DelayedArray_0.20.0   
[46] AnnotationDbi_1.56.2   Biostrings_2.62.0      compiler_4.1.2        
[49] systemfonts_1.0.4      rlang_1.0.2            grid_4.1.2            
[52] RCurl_1.98-1.6         rstudioapi_0.13        bitops_1.0-7          
[55] labeling_0.4.2         gtable_0.3.0           DBI_1.1.2             
[58] R6_2.5.1               dplyr_1.0.8            fastmap_1.1.0         
[61] bit_4.0.4              utf8_1.2.2             ragg_1.2.2            
[64] stringi_1.7.6          parallel_4.1.2         Rcpp_1.0.8.3          
[67] vctrs_0.4.0            geneplotter_1.72.0     png_0.1-7             
[70] tidyselect_1.1.2       xfun_0.30