Plotting in R

1 Introduction

Visualisation of large data sets has become an essential task in molecular biology and medicine. While there are many software solutions which can be used for visualisation of data starting, for instance, with Excel or LibraOffice Calc, the R platform has become a de facto standard for visualisation in genomics. Rstudio is an interface to R which facilitates the writing and execution of R code. We will use it in our tutorial.

The course is intended to be self-contained, but if you are new to R, we recommend that you go through our previous introduction to R:

[[https://lmsbioinformatics.github.io/LMS_Reproducible-R/]]

2 Materials

  • All prerequisites, links to material and slides for this course can be found on github - PlottingInR

  • Or can be downloaded as a zip archive from here - Download zip

3 Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

3.1 Set working directory - in Rstudio

You can navigate to the unarchived LMS_PlottingInR/course folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

path

3.2 Set working directory - in the console

Use getwd() to see where your current directory is

setwd("/PathToMyDownload/LMS_PlottingInR/course")
# e.g. setwd("~/Downloads/LMS_PlottingInR/course")

4 Motivation

  • Visualisation plays a key part in the analysis of genomic data and presentation of the results. Why is this so?

  • The reason for the importance of visualisation is simply that we humans are not good in reading large tables, which generally are produced in genomics. For instance, can you easily spot the significant genes (padj < 0.01 and abs(log2 fold change) > 1) in the table below?

4.1 Typical results

a typical table

4.2 Data visualization

This task will be much easier when we visualise the data and highlight the significant genes by the colour red:

5 Data visualization: A view of every Points of View column

DataVisualCol

http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html

5.1 Design of data figures

  • accuracy

5.2 Salience

  • speed
Salience

Salience

Visal conjunctions

Visal conjunctions

5.3 Color blindness

CB

  • Reference: Crameri F, Shephard GE, Heron PJ. The misuse of colour in science communication. Nat Commun. 2020 Oct 28;11(1):5444. doi: 10.1038/s41467-020-19160-7. PMID: 33116149; PMCID: PMC7595127.

https://pubmed.ncbi.nlm.nih.gov/33116149/

  • Colour vision tests

  • Colour map classes and types

  • Guideline for choosing the right scientific colour map

6 A short intro to the most important data structures in R

For our purpose, we need to know two data structures in R:

6.1 Data frame

A data frame is a table in which each column contains values of one variable and each row contains one set of values from each column. Both row and column names should be unique.

  • A data frame is produced when we upload a comma-delimited table into R using the read.csv() command.

  • A useful command to check the content of a data frame is head() which displays the top of the data.frame.

# reading in table 
df <- read.csv("data/DiffEx.csv")
# showing first 6 rows of the data frame
head(df)
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2
  • subset the data.frame based on columns

    • Single columns can either be accessed via the $ operator, writing the explicit name [,name] (if the column has a name) or more generally by [,j] for the jth column.
# different options to access the data 
head(df$baseMean)
## [1] 10893.6671   296.2968  1967.0850  1359.6740   414.7163  2986.8342
head(df[,"baseMean"])
## [1] 10893.6671   296.2968  1967.0850  1359.6740   414.7163  2986.8342
head(df[,2])
## [1] 10893.6671   296.2968  1967.0850  1359.6740   414.7163  2986.8342
  • subset the data frame based on rows

    • Another common way to subset the data frame is to supply a logical (Boolean) vector. For example, if we use the following thresholds to define significant genes (padj < 0.01 and abs(log2 fold change) > 1) from the DESeq2 analysis results.
# define the DE genes using logical vector 
diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)
# diffExpressed is a vector with TRUE or FALSE values
head(diffExpressed)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
class(diffExpressed)
## [1] "logical"
# make sure the length of the logical vector is consistent with the original dataset
## check the dimensions of the original dataset
dim(df)
## [1] 18357    10
## check the length of the logical vector
length(diffExpressed)
## [1] 18357
# use table function to see TRUE & FALSE
table(diffExpressed)
## diffExpressed
## FALSE  TRUE 
## 17120  1237
# subset the original df using diffExpressed vector
DE_res <- df[diffExpressed,]

# check the dimension of the new data frame
dim(DE_res)
## [1] 1237   10
# showing first 6 rows of the data frame
head(DE_res)
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2
  • Another useful command to check the type of each columns from a data frame is str() which compactly displays the structure of a R object.
# different options to access the data 
str(df)
## 'data.frame':    18357 obs. of  10 variables:
##  $ ensembl_id     : chr  "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770" "ENSMUSG00000074480" ...
##  $ baseMean       : num  10894 296 1967 1360 415 ...
##  $ log2FoldChange : num  2.86 4.42 1.74 -2.17 3.52 ...
##  $ lfcSE          : num  0.109 0.187 0.077 0.102 0.166 ...
##  $ stat           : num  26.1 23.7 22.6 -21.3 21.2 ...
##  $ pvalue         : num  6.09e-150 4.39e-124 3.18e-113 3.17e-100 1.23e-99 ...
##  $ padj           : num  1.12e-145 4.03e-120 1.94e-109 1.46e-96 4.51e-96 ...
##  $ baseMean_DOX7D : num  21317 627 3080 473 820 ...
##  $ baseMean_DOX24H: num  2945.5 29.3 921.2 2126.6 71.6 ...
##  $ mgi_symbol     : chr  "Scg2" "Gfra1" "Dlg1" "Mex3a" ...

6.2 Vector

A vector is a sequence of data elements of the same type. We use it often in the definiton of colors or genes to be displayed. A vector can be constructed using the c() function.

colo <- c("blue","red","green")

colo
## [1] "blue"  "red"   "green"

7 Two philosophies of plotting in R

In our tutorial, we use two sources of plotting capabilities. The first one (base graphics) might resemble a printer, where you put in your data, make some choices, press the button and voila you have the picture. The second one (ggplot2) is more like an oil painting, it is layered and needs some setting up, but it is more easily modifiable (although it might be more laborious to get to the full painting in the first place).

Two philosophies for plotting in R

Disclaimer: The distinction between the two plotting philosophies is not that strict, we can still do some limited modification after initial plotting also with base graphics using auxiliary functions.

8 base graphics

This functionality for plotting is easy to use, but becomes difficult or impossible to draw complicated plots.

For using base graphics, we start with the plot() function and then could add features using auxilary functions such as

  • lines: add lines to a plot, given a vector x values and a corresponding vector of y values (or a 2-column matrix); the function connects the points

  • points: add points to a plot

  • text: add text labels to a plot using specified x, y coordinates

  • title: add annotations to x, y axis labels, title, subtitle, outer margin

  • mtext: add arbitrary text to the margins (inner or outer) of the plot

  • axis: adding axis ticks/labels

8.1 usage

This scatterplot is an example of the use of the plot() function. Basically, it takes the x and y arguments for the x and y coordinates of the plot.

# plotting 
plot(x=log2(df$baseMean_DOX24H), y=log2(df$baseMean_DOX7D))

Additional arguments provide options for adjusting the plot such as xlab, ylab, pch and main). These are explained using help(plot) or ?plot.

# help(plot) or ?plot
?plot
plot.default {graphics} R Documentation
The Default Scatterplot Function
Description
Draw a scatter plot with decorations such as axes and titles in the active graphics window.

Usage
## Default S3 method:
plot(x, y = NULL, type = "p",  xlim = NULL, ylim = NULL,
     log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
     ann = par("ann"), axes = TRUE, frame.plot = axes,
     panel.first = NULL, panel.last = NULL, asp = NA,
     xgap.axis = NA, ygap.axis = NA,
     ...)
  • type: what type of plot should be drawn. Possible types are
        "p" for points,
        "l" for lines,
        "b" for both,
        "c" for the lines part alone of "b",
        "o" for both ‘overplotted’,
        "h" for ‘histogram’ like (or ‘high-density’) vertical lines,
        "s" for stair steps,
        "S" for other steps, see ‘Details’ below,
        "n" for no plotting.

In the following, we also define the differentially expressed genes and highlight them by red color using points() to add them to our origin plots.

# plotting 
plot(x=log2(df$baseMean_DOX24H),y=log2(df$baseMean_DOX7D),
     xlab="log2 expression DOX24H", 
     ylab="log2 expression DOX7D", pch=20, 
     main="Comparison of expression at 24h and 7 days")

diffExpressed <- (!is.na(df$padj) & df$padj < 0.01 & abs(df$log2FoldChange) > 1)
head(diffExpressed)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
length(diffExpressed)
## [1] 18357
table(diffExpressed)
## diffExpressed
## FALSE  TRUE 
## 17120  1237
points(log2(df$baseMean_DOX24H[diffExpressed]),log2(df$baseMean_DOX7D[diffExpressed]),
       col="red",pch=20)

  • Plotting ‘character’ (pch) - symbol to use

9 ggplot2 package

A very versatile package which has become a de facto standard for producing high quality figures is an R package based on the “grammar of graphics”. This enables developing plots by layering graphical elements on top of each other.

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).” - Wickham, 2016

Every ggplot2 plot has three key components:

  1. data provided as data frame,

  2. aesthetic mappings between variables in the data and visual properties, and

  3. layer: usually created with a geom function.

9.1 usage

Start with the ggplot function which initializes a ggplot object. Here you need to declare the input data frame and to specify the set of plot aesthetics. The actual plotting will be done by applying another layer (with the “+” sign) using various specific geom functions. For a scatter plot like the one above, we will use the geom_point() function.

# install.packages("ggplot2")
library("ggplot2")

# (1) ggplot function with initializes a ggplot object
ggplot(df)

# (2) specify the set of plot aesthetics
ggplot(df, aes(x=log2(baseMean_DOX24H), y=log2(baseMean_DOX7D)))

# (3) specify plot type using geom_point()
ggplot(df, aes(x=log2(baseMean_DOX24H), y=log2(baseMean_DOX7D))) +
  geom_point()

Note that we do not need to indicate that the x and y are specific columns of the dataframe df e.g. by using the $ operator as we already defined in the ggplot function that df should be used.

Now we add more layers to get a similar plot as we got using base graphics.

ggplot(df, aes(x=log2(baseMean_DOX24H), y=log2(baseMean_DOX7D))) +
  geom_point() + 
  ggtitle("Comparison of expression at 24h and 7 days")+
  xlab("log2 expression DOX24H") +
  ylab("log2 expression DOX7D")

This looks promising but we would like to highlight the differentially expressed genes. Since ggplot works with the input data frame, we need to include this information into the data frame df. Then, we add color to the aesthetic mappings and set it the added column. geom_point() will select its color for the different values of the columns, but if we do not like it, we can overwrite or more exactly put another layer on it (like on an oil painting.)

# check the dimension for object "df"
dim(df)
## [1] 18357    10
diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

# diffExpressed is a vector with TRUE or FALSE values
head(diffExpressed)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
length(diffExpressed)
## [1] 18357
df2 <- data.frame(df, diffEx=diffExpressed)

head(df2)
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol diffEx
## 1 1.12e-145     21316.8751      2945.51899       Scg2   TRUE
## 2 4.03e-120       626.6633        29.26926      Gfra1   TRUE
## 3 1.94e-109      3080.2873       921.18292       Dlg1   TRUE
## 4  1.46e-96       472.8374      2126.62645      Mex3a   TRUE
## 5  4.51e-96       820.3147        71.59931       Rin1   TRUE
## 6  5.58e-96      4687.6749      1193.30291      Ildr2   TRUE
table(df2$diffEx)
## 
## FALSE  TRUE 
## 17120  1237
g <- ggplot(df2, aes(x=log2(baseMean_DOX24H), 
                     y=log2(baseMean_DOX7D),color=diffEx)) +
  geom_point() + 
  ggtitle("Comparison of expression at 24h and 7 days")+
  xlab("log2 expression DOX24H") +
  ylab("log2 expression DOX7D")
g

# less fancy colors please  
g + scale_color_manual(values=c('black','red')) + 
  labs(color = "Diff. exp.")

# more classic look please  
g + scale_color_manual(values=c('black','red')) + 
  labs(color = "Diff. exp.") + 
  theme_classic()

Note that we assigned first the ggplot object to the variable g and then plot by entering it again. The nice thing here is that we can put further layers on the plot without calling the ggplot() and other functions again. This is convenient if we need to generate various different version of one figure.

10 Exercise I

  • Plotting in R exercise 1

  • note: In this exercise, you will need to know the difference between “numeric” and “factor” R data types

11 Solutions I

12 Our case study

In 1933, the Dutch pediatrician Cornelia De Lange described a disorder in two children, which became known as Cornelia de Lange syndrome (CdLS). Its features can include short stature, intellectual disability, developmental delay and signs of autism. Over the years, research showed that Cornelia de Lange syndrome is a genetic disorder and that it is frequently linked to mutations in components of cohesin complex. This complex is fundamental for obtaining higher order structures of DNA within the nucleus.

  • Classical facial dysmorphisms of CdLS and related overlapping syndromes
Classical facial dysmorphisms of CdLS and related overlapping syndromes
  • Five major CdLS facial features and their different severity
Classical facial dysmorphisms of CdLS and related overlapping syndromes

Clinical Genetics, Volume: 97, Issue: 1, Pages: 3-11, First published: 13 November 2019, DOI: (10.1111/cge.13674)

How cohesin mutations cause Cornelia de Lange syndrome is not well understood. To elucidate the potential role of cohesin for Cornelia de Lange syndrom, a group at the LMS applied several genomic approaches including RNA-seq technology.

Abstract of our case study https://doi.org/10.1038/s41467-021-23141-9
Weiss et al., Nat Commun. 2021

After comparing expression in cortical neurons from post-mortem brains of Cornelia de Lange syndrome patients with controls, the researchers showed in a mouse model that doxycycline (dox) inducible depletion of one cohesin component (RAD21) leads to similar changes of expression including the downregulation of genes with neuronal and synaptic functions. The researchers asked then the crucial question if the normal expression could be restored when the depletion of RAD21 is lifted. A positive answer might give us hope that correction of cohesin functions could provide a potential therapeutic approach for Cornelia de Lange patients.

The figure below shows one of the experiments in this study. Rad21-TEV neurons were grown in culture, for 10 days. On day 10, cells were treated for 6 hours with either 100ng/ml of doxycycline, or vehicle. After 6 hours, the samples were washed and fresh media was added. Samples for cohesin depletion (24H) were collected 18 hours later, and samples for rescue (7D) were collected 7 days after addition of doxycyline/vehicle.

Experimental design

The file "data/DiffEx.csv" contains the expression after RAD21 depletion at 24 hours (baseMean_DOX24H) and RAD21 restoration at day 7 (baseMean_DOX7D). It also contains log2 fold changes and adjusted p-values (padj). "data/GO_UpDE.csv" contains gene ontology analysis results. "data/DoxExpression.txt" contains normalised gene counts information. To gain insights into these data and to present them to the research community, we need to turn the table into some informative plots. In the following, we would like to complete following tasks:

12.1 Our tasks:

  1. Display the distribution of log2 fold changes

  2. Visualise the expression and changes in expression

  3. Plot the functional categories enriched in significantly differentially expressed genes

Let’s get started.

13 Displaying distributions - Base graphics

13.1 Histograms - Base graphics

hist(df$log2FoldChange,breaks=100,xlab = "Log2 fold changes",
     main="Histogram of log2 fold changes",col="red")

13.2 Density plots - Base graphics

  • log2 FC density plot
d <- density(df$log2FoldChange)
plot(d,xlab="Log2 fold changes",
     main="Distribution of log2 fold changes")
polygon(d,col="red")

  • log2 normalised expression levels in two conditions
log2_DOX24H <- log2(df$baseMean_DOX24H)
log2_DOX7D <- log2(df$baseMean_DOX7D)

d1 <- density(log2_DOX7D)
d2 <- density(log2_DOX24H)

plot(d1, lwd=2, xlab="Log2 baseMean",
     main="Distribution of log2 fold changes")
lines(d2,col=rgb(0,0,1),lwd=2)
polygon(d1,col=rgb(1,0,0,0.2))
polygon(d2,col=rgb(0,0,1,0.2))
legend(x=18,y=0.08,legend=c("DOX24H","DOX7D"),
       fill=c(rgb(1,0,0,0.2),rgb(0,0,1,0.2)))

14 Displaying distributions - ggplot2

14.1 Histograms - ggplot2

# install.package(ggplot2)
library(ggplot2)

# basic histogram
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram()

# change the number of bins and add x and y lab
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram(bins=100) + 
  xlab("Log2 fold changes") + ylab("Frequency")

# change colours 
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram(bins=100, col="black",fill="red") + 
  xlab("Log2 fold changes") + ylab("Frequency")

# use "classic background" and add title
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram(bins=100, col="black",fill="red") + 
  xlab("Log2 fold changes") + ylab("Frequency") +
  theme_classic() +
  ggtitle("Histogram of log2 fold changes")

14.2 Density plots - ggplot2

  • log2 FC density plot
# basic density plot
ggplot(df, aes(x=log2FoldChange)) + 
  geom_density()

# change colour and use "theme_bw"
ggplot(df, aes(x=log2FoldChange)) + 
  geom_density(fill="red") + 
  theme_bw()

  • log2 normalised expression levels in two conditions

    • we have to change the data format in to the format that ggplot2 likes

    • we are going to convert the “wide” data.frame into “long” data.frame

# subset the data.frame and only keep the information of interest
df_subset <- df[,colnames(df) %in% 
                  c("ensembl_id","baseMean_DOX7D","baseMean_DOX24H")]
dim(df_subset)
## [1] 18357     3
head(df_subset)
##           ensembl_id baseMean_DOX7D baseMean_DOX24H
## 1 ENSMUSG00000050711     21316.8751      2945.51899
## 2 ENSMUSG00000025089       626.6633        29.26926
## 3 ENSMUSG00000022770      3080.2873       921.18292
## 4 ENSMUSG00000074480       472.8374      2126.62645
## 5 ENSMUSG00000024883       820.3147        71.59931
## 6 ENSMUSG00000040612      4687.6749      1193.30291
# convert df_subset into a "long" format
# install.packages("reshape2")
library("reshape2")
df_subset_long <- melt(df_subset,id="ensembl_id")
dim(df_subset_long)
## [1] 36714     3
str(df_subset_long)
## 'data.frame':    36714 obs. of  3 variables:
##  $ ensembl_id: chr  "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770" "ENSMUSG00000074480" ...
##  $ variable  : Factor w/ 2 levels "baseMean_DOX7D",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value     : num  21317 627 3080 473 820 ...
head(df_subset_long)
##           ensembl_id       variable      value
## 1 ENSMUSG00000050711 baseMean_DOX7D 21316.8751
## 2 ENSMUSG00000025089 baseMean_DOX7D   626.6633
## 3 ENSMUSG00000022770 baseMean_DOX7D  3080.2873
## 4 ENSMUSG00000074480 baseMean_DOX7D   472.8374
## 5 ENSMUSG00000024883 baseMean_DOX7D   820.3147
## 6 ENSMUSG00000040612 baseMean_DOX7D  4687.6749
# now it is ready for ggplot2
ggplot(df_subset_long, aes(x=log2(value), fill=variable)) + 
  geom_density() 

# setup the alpha value to make the colours transparent
ggplot(df_subset_long, aes(x=log2(value), fill=variable)) + 
  geom_density(alpha=0.5) 

# use the "theme_classic" and add xlab
ggplot(df_subset_long, aes(x=log2(value), fill=variable)) + 
  geom_density(alpha=0.5) + theme_classic() + xlab("Log2 baseMean")

  • violin plot
ggplot(df_subset_long, aes(x=variable,y=log2(value), fill=variable)) + 
  geom_violin(alpha=0.5) + theme_classic() + 
  ylab("Log2 baseMean") + xlab("Samples")

  • violin plot + box plot
# add geom_boxplot() layer directly
ggplot(df_subset_long, aes(x=variable,y=log2(value),fill=variable)) + 
  geom_violin(alpha=0.5) + 
  geom_boxplot() + 
  theme_classic() + 
  ylab("Log2 baseMean") + xlab("Samples")

# what to do if we want violin plot filled with 2 different colours and 
# box plot filled with "white" colour
ggplot(df_subset_long, aes(x=variable,y=log2(value))) + 
  geom_violin(aes(fill=variable),alpha=0.5) + 
  geom_boxplot(width=0.2) + 
  theme_classic() + 
  ylab("Log2 baseMean") + xlab("Samples")

15 Exercise II

Regular Expression

Regular Expression

16 Solutions II

17 Visualising expression and changes in expression

17.1 Scatterplots - base graphics

We already did this before. Now we want to highlight some genes and put their gene symbol into the plot. For the start, we will highlight the 5 most significant genes.

# plotting 
plot(x=log2(df$baseMean_DOX24H),y=log2(df$baseMean_DOX7D),
     xlab="log2 expression DOX24H", 
     ylab="log2 expression DOX7D",
     pch=20, main="Comparison of expression at 24h and 7 days",
     col="gray")

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

points(log2(df$baseMean_DOX24H[diffExpressed]),
       log2(df$baseMean_DOX7D[diffExpressed]),
       col="red",pch=20)

o <- order(df$padj)[1:5]
df[o,]
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
points(log2(df$baseMean_DOX24H[o]),
       log2(df$baseMean_DOX7D[o]),
       col="black",pch=21,cex=1.5,lwd=2)

text(log2(df$baseMean_DOX24H[o]),
     log2(df$baseMean_DOX7D[o]),
     labels=df$mgi_symbol[o], 
     pos = 3,offset=0.5,font=2)

This looks good. Now let’s label the top 20 genes.

# plotting 
plot(x=log2(df$baseMean_DOX24H),y=log2(df$baseMean_DOX7D),
     xlab="log2 expression DOX24H", ylab="log2 expression DOX7D",pch=20, 
     main="Comparison of expression at 24h and 7 days",col="gray")

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

points(log2(df$baseMean_DOX24H[diffExpressed]),
       log2(df$baseMean_DOX7D[diffExpressed]),
       col="red",pch=20)

o <- order(df$padj)[1:20]

points(log2(df$baseMean_DOX24H[o]),
       log2(df$baseMean_DOX7D[o]),
       col="black",pch=21,cex=1.5,lwd=2)

text(log2(df$baseMean_DOX24H[o]),
     log2(df$baseMean_DOX7D[o]),
     labels=df$mgi_symbol[o], pos = 3,offset=0.5,font=2)

Oh, no.

17.2 ggplot2 and ggrepel

# install.packages("ggrepel")
library(ggrepel)

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)
# diffExpressed is a vector with TRUE or FALSE values

df4ggplot <- data.frame(df, diffEx=diffExpressed)

head(df4ggplot)
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol diffEx
## 1 1.12e-145     21316.8751      2945.51899       Scg2   TRUE
## 2 4.03e-120       626.6633        29.26926      Gfra1   TRUE
## 3 1.94e-109      3080.2873       921.18292       Dlg1   TRUE
## 4  1.46e-96       472.8374      2126.62645      Mex3a   TRUE
## 5  4.51e-96       820.3147        71.59931       Rin1   TRUE
## 6  5.58e-96      4687.6749      1193.30291      Ildr2   TRUE
table(df4ggplot$diffEx)
## 
## FALSE  TRUE 
## 17120  1237
df4ggplot <- df4ggplot[order(df4ggplot$padj),]

g_top20 <- ggplot(df4ggplot, aes(x=log2(baseMean_DOX24H), 
                     y=log2(baseMean_DOX7D),color=diffEx)) +
  geom_point() + 
  scale_color_manual(values=c('black','red')) + 
  labs(color = "Diff. exp.") + 
  theme_classic() +
  ggtitle("Comparison of expression at 24h and 7 days")+
  xlab("log2 expression DOX24H") +
  ylab("log2 expression DOX7D") +
  geom_label_repel(data=df4ggplot[c(1:20),],
                   mapping=aes(label = mgi_symbol),
                   box.padding = 0.5,max.overlaps = 20)

g_top20

18 Volcano plots - base graphics

# plotting 
plot(x=df$log2FoldChange,y=-log10(df$pvalue),
     xlab="log2 fold change", ylab="log10 p-value",
     pch=20, main="Volcano plot",col="gray")

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

points(df$log2FoldChange[diffExpressed],
       -log10(df$pvalue[diffExpressed]),col="red",pch=20)

o <- order(df$padj)[1:10]

points(df$log2FoldChange[o],-log10(df$pvalue[o]),
       col="black",pch=21,cex=1.5,lwd=2)

text(df$log2FoldChange[o],-log10(df$pvalue[o]),
     labels=df$mgi_symbol[o], pos = 1,offset=0.5,font=2)

19 Volcano plots - ggplot2

# install.packages("ggrepel")
library(ggrepel)

df4ggplot <- df
df4ggplot$DE.cat <- ifelse(
  df4ggplot$padj < 0.01 & abs(df4ggplot$log2FoldChange) > 1,
  "TRUE", "FALSE"
)

df4ggplot$DE.cat <- factor(df4ggplot$DE.cat,
                           levels=c("TRUE","FALSE"))
df4ggplot <- df4ggplot[order(df4ggplot$padj),]

vol_plot <- ggplot(data=df4ggplot, 
                   aes(x=log2FoldChange, y=-log10(pvalue),col=DE.cat)) +
  geom_point() +  
  xlab("log2FoldChange") + ylab("-log10 PValue")+ 
  scale_color_manual(values=c("red","grey"))+
  geom_label_repel(data=df4ggplot[c(1:10),],
                   mapping=aes(label = mgi_symbol),
                   box.padding = 0.5,max.overlaps = 20)+
  theme_classic()

vol_plot

20 Exercise III

21 Solutions III

22 Heatmaps - base graphics

# install.packages("RColorBrewer")
library(RColorBrewer)
DoxEx <-  read.csv("data/DoxExpression.txt",row.names=1)
head(DoxEx)
##                      DOX24H_1  DOX24H_2   DOX24H_3    DOX7D_3    DOX7D_1
## ENSMUSG00000090025  2.4928423  2.430152  2.5169965  2.5058382  2.6820156
## ENSMUSG00000051951 12.4809504 12.339864 12.3060097 11.9827350 11.7679441
## ENSMUSG00000025900 -0.8083506 -0.736612 -0.8085774 -0.8090216 -0.8091383
## ENSMUSG00000025902 -2.2108951 -2.210912 -2.2110380 -2.2113214 -2.2113968
## ENSMUSG00000064376 -1.9723416 -1.972370 -1.9725807 -1.9730551 -1.9569306
## ENSMUSG00000088000 -2.2107546 -2.210769 -2.2108726 -2.2111068 -2.2111690
##                       DOX7D_2
## ENSMUSG00000090025  2.2964760
## ENSMUSG00000051951 12.1394396
## ENSMUSG00000025900 -0.7711231
## ENSMUSG00000025902 -2.2107869
## ENSMUSG00000064376 -1.9721606
## ENSMUSG00000088000 -2.2106652
class(DoxEx)
## [1] "data.frame"
# order df based on padj
o <- order(df$padj)
head(df[o,])
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2
# select top 30 genes based on padj values
selectedGenes <- df[o,]$ensembl_id[1:30]

head(selectedGenes)
## [1] "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770"
## [4] "ENSMUSG00000074480" "ENSMUSG00000024883" "ENSMUSG00000040612"
# subset data of interest and convert it to matrix
DoxExSelected_mx <- as.matrix(DoxEx[selectedGenes,])

heatmap(DoxExSelected_mx,cexCol=1,scale="row", 
        col=colorRampPalette(brewer.pal(8, "Blues"))(25))

23 Heatmaps and RColorBrewer

# install.packages("RColorBrewer")
library(RColorBrewer)

# display all color palettes
display.brewer.all()

# display colorblind-Friendly color palettes
display.brewer.all(colorblindFriendly = TRUE)

# return 8 colours from the color palette "RdBu"
brewer.pal(8, "RdBu")
## [1] "#B2182B" "#D6604D" "#F4A582" "#FDDBC7" "#D1E5F0" "#92C5DE" "#4393C3"
## [8] "#2166AC"

# use `colorRampPalette()` to create a colour ramp from the 8 colours from the color palette "RdBu"
colorRampPalette(brewer.pal(8, "RdBu"))(25)
##  [1] "#B2182B" "#BC2C34" "#C7413E" "#D15748" "#DB6B55" "#E37F65" "#EC9374"
##  [8] "#F4A784" "#F7B799" "#F9C6AD" "#FCD6C1" "#F3DDCF" "#E7E0DB" "#DAE2E7"
## [15] "#CBE2EE" "#B9D9E9" "#A6CFE3" "#94C6DE" "#7EB8D7" "#67A9CF" "#509BC7"
## [22] "#3E8DC0" "#3480B9" "#2A73B2" "#2166AC"

24 Heatmaps - ComplexHeatmap I

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("ComplexHeatmap")
library("ComplexHeatmap")

https://jokergoo.github.io/ComplexHeatmap-reference/book/

?Heatmap

Heatmap {ComplexHeatmap}    R Documentation
  Constructor method for Heatmap class

## Description
  Constructor method for Heatmap class

## Usage
Heatmap(matrix, col, name,
    na_col = "grey",
    color_space = "LAB",
    rect_gp = gpar(col = NA),
    border = NA,
    border_gp = gpar(col = "black"),
    cell_fun = NULL,
    layer_fun = NULL,
    jitter = FALSE,
    
    row_title = character(0),
    row_title_side = c("left", "right"),
    row_title_gp = gpar(fontsize = 13.2),
    row_title_rot = switch(row_title_side[1], "left" = 90, "right" = 270),
    column_title = character(0),
    column_title_side = c("top", "bottom"),
    column_title_gp = gpar(fontsize = 13.2),
    column_title_rot = 0,
    
    cluster_rows = TRUE,
    cluster_row_slices = TRUE,
    clustering_distance_rows = "euclidean",
    clustering_method_rows = "complete",
    row_dend_side = c("left", "right"),
    row_dend_width = unit(10, "mm"),
    show_row_dend = TRUE,
    row_dend_reorder = is.logical(cluster_rows) || is.function(cluster_rows),
    row_dend_gp = gpar(),
    cluster_columns = TRUE,
    cluster_column_slices = TRUE,
    clustering_distance_columns = "euclidean",
    clustering_method_columns = "complete",
    column_dend_side = c("top", "bottom"),
    column_dend_height = unit(10, "mm"),
    show_column_dend = TRUE,
    column_dend_gp = gpar(),
    column_dend_reorder = is.logical(cluster_columns) || is.function(cluster_columns),
    
    row_order = NULL,
    column_order = NULL,
    
    row_labels = rownames(matrix),
    row_names_side = c("right", "left"),
    show_row_names = TRUE,
    row_names_max_width = unit(6, "cm"),
    row_names_gp = gpar(fontsize = 12),
    row_names_rot = 0,
    row_names_centered = FALSE,
    column_labels = colnames(matrix),
    column_names_side = c("bottom", "top"),
    show_column_names = TRUE,
    column_names_max_height = unit(6, "cm"),
    column_names_gp = gpar(fontsize = 12),
    column_names_rot = 90,
    column_names_centered = FALSE,
    
    top_annotation = NULL,
    bottom_annotation = NULL,
    left_annotation = NULL,
    right_annotation = NULL,
    
    km = 1,
    split = NULL,
    row_km = km,
    row_km_repeats = 1,
    row_split = split,
    column_km = 1,
    column_km_repeats = 1,
    column_split = NULL,
    gap = unit(1, "mm"),
    row_gap = unit(1, "mm"),
    column_gap = unit(1, "mm"),
    show_parent_dend_line = ht_opt$show_parent_dend_line,
    
    heatmap_width = unit(1, "npc"),
    width = NULL,
    heatmap_height = unit(1, "npc"),
    height = NULL,
    
    show_heatmap_legend = TRUE,
    heatmap_legend_param = list(title = name),
    
    use_raster = NULL,
    raster_device = c("png", "jpeg", "tiff", "CairoPNG", "CairoJPEG", "CairoTIFF", "agg_png"),
    raster_quality = 1,
    raster_device_param = list(),
    raster_resize_mat = FALSE,
    raster_by_magick = requireNamespace("magick", quietly = TRUE),
    raster_magick_filter = NULL,
    
    post_fun = NULL)

25 Heatmaps - ComplexHeatmap II

DoxEx <-  read.csv("data/DoxExpression.txt",row.names=1)

# order df based on padj
o <- order(df$padj)
head(df[o,])
##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2
# select top 30 genes based on padj values
selectedGenes <- df[o,]$ensembl_id[1:30]

head(selectedGenes)
## [1] "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770"
## [4] "ENSMUSG00000074480" "ENSMUSG00000024883" "ENSMUSG00000040612"
# subset data of interest and convert it to matrix
DoxExSelected <- as.matrix(DoxEx[selectedGenes,])
DEmatok <- t(scale(t(DoxExSelected)))

# print default heatmap
ht <- Heatmap(DEmatok)
draw(ht)

# change font size for the row names and row names
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6))
draw(ht)

# change colours
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = colorRampPalette(brewer.pal(8, "RdBu"))(25))
draw(ht)

# change colours - upregulated as red and downregulated as blue
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)))
draw(ht)

# change figure legend
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",
               title_position = "topcenter"))
draw(ht)

# make it "slimmer" 
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"))

draw(ht,heatmap_legend_side = "bottom")

# turn off "cluster by columns"

ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"),
        cluster_columns=F)

draw(ht,heatmap_legend_side = "bottom")

# create top annotation
anno <- data.frame(SampleNames=colnames(DEmatok))
anno$TimePoint <- sub("(.+)(_)(.+)","\\1",anno$SampleNames)
anno$TimePoint <- factor(anno$TimePoint)
time_col <- brewer.pal(10, "Paired")[c(9:10)]
names(time_col) <- levels(anno$TimePoint)
col_ha = HeatmapAnnotation(TimePoint = anno$TimePoint,
                           col = list(TimePoint = time_col))
ht_top <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"),
        cluster_columns=F,top_annotation = col_ha)

draw(ht_top,heatmap_legend_side = "bottom")

# remove column names and remove row names

ht_top_NoL <- Heatmap(DEmatok, row_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"),
        cluster_columns=F,top_annotation = col_ha,
        show_column_names = F,show_row_names = F)

draw(ht_top_NoL,heatmap_legend_side = "bottom")

25.1 Saving your plots

  1. useful functions: pdf(), png(), jpeg(), postscript(), svg()

  2. plot object that you would like to save

  3. close the file with dev.off()

?svg()
## Usage
svg(filename = if(onefile) "Rplots.svg" else "Rplot%03d.svg",
    width = 7, height = 7, pointsize = 12,
    onefile = FALSE, family = "sans", bg = "white",
    antialias = c("default", "none", "gray", "subpixel"),
    symbolfamily)
filename4ht <- "heatmapCourseOutput.svg"
svg(file=filename4ht, width = 4, height=5.5)
  draw(ht_top_NoL,heatmap_legend_side = "bottom")
dev.off()

26 Exercise IV

27 Solutions IV

28 Plots of functional enrichment

Gene Ontology analysis was performed based on Up-regulated genes in Dox7D vs Dox24H using thresholds padj < 0.01 and log2(FC) > log2(2). Please see “data/GO_UpDE.csv”

Gene Ontology results

28.1 Bar plot

  • We use bar plot to show top 15 GO terms
GOres <-  read.csv("data/GO_UpDE.csv")
# rank "GOres" based on "FDR.over"
GOres <- GOres[order(GOres$FDR.over),]

# subset top 15 GO terms
GOres_top15 <- GOres[c(1:15),]

# basic bar plot
ori_bar <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity") +
  xlab("-log10(FDR)")
ori_bar

# add colour
blue_bar <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity",fill="blue") +
  xlab("-log10(FDR)") + theme_classic()

blue_bar

  • present the GO terms based on their significance
# the "GOres_top15" already ordered by the "FDR.over" but it is not shown in the bar plot
head(GOres_top15)
##     category over_represented_pvalue under_represented_pvalue numDEInCat
## 1 GO:0023052                4.98e-22                        1        318
## 2 GO:0007154                5.35e-21                        1        317
## 3 GO:0032501                5.47e-21                        1        369
## 4 GO:0050896                2.89e-19                        1        386
## 5 GO:0007165                1.00e-17                        1        282
## 6 GO:0051716                3.41e-17                        1        339
##   numInCat                             term ontology FDR.over FDR.under
## 1     4390                        signaling       BP 9.90e-18         1
## 2     4436               cell communication       BP 3.63e-17         1
## 3     5461 multicellular organismal process       BP 3.63e-17         1
## 4     6054             response to stimulus       BP 1.44e-15         1
## 5     3932              signal transduction       BP 3.99e-14         1
## 6     5189    cellular response to stimulus       BP 1.13e-13         1
# We use 'factor'
class(GOres_top15$term)
## [1] "character"
## version 1
ordered_term <- GOres_top15$term
head(GOres_top15$term)
## [1] "signaling"                        "cell communication"              
## [3] "multicellular organismal process" "response to stimulus"            
## [5] "signal transduction"              "cellular response to stimulus"
GOres_top15$term <- factor(GOres_top15$term,levels=ordered_term)

ordered_bar <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity",fill="blue") +
  xlab("-log10(FDR)") + theme_classic()

ordered_bar

## version 2
ordered_term_rev <- rev(ordered_term)
head(ordered_term_rev)
## [1] "tissue development"                         
## [2] "system development"                         
## [3] "multicellular organism development"         
## [4] "regulation of cell population proliferation"
## [5] "cell-cell signaling"                        
## [6] "synaptic signaling"
GOres_top15$term <- factor(GOres_top15$term,levels=ordered_term_rev)

ordered_barRev <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity",fill="blue") +
  xlab("-log10(FDR)") + theme_classic()

ordered_barRev

28.2 Bubble plot

  • We can further define geneRatio as number of DE genes associated with GO process / total number of DE genes.
# calculate number of up.DE
df <- read.csv("data/DiffEx.csv")
upDE_df <- df[!is.na(df$padj) & df$padj<0.01 & df$log2FoldChange> 1,]
num.UpDE <- nrow(upDE_df)

GOres_top15$geneRatio <- GOres_top15$numDEInCat/num.UpDE
head(GOres_top15)
##     category over_represented_pvalue under_represented_pvalue numDEInCat
## 1 GO:0023052                4.98e-22                        1        318
## 2 GO:0007154                5.35e-21                        1        317
## 3 GO:0032501                5.47e-21                        1        369
## 4 GO:0050896                2.89e-19                        1        386
## 5 GO:0007165                1.00e-17                        1        282
## 6 GO:0051716                3.41e-17                        1        339
##   numInCat                             term ontology FDR.over FDR.under
## 1     4390                        signaling       BP 9.90e-18         1
## 2     4436               cell communication       BP 3.63e-17         1
## 3     5461 multicellular organismal process       BP 3.63e-17         1
## 4     6054             response to stimulus       BP 1.44e-15         1
## 5     3932              signal transduction       BP 3.99e-14         1
## 6     5189    cellular response to stimulus       BP 1.13e-13         1
##   geneRatio
## 1 0.4555874
## 2 0.4541547
## 3 0.5286533
## 4 0.5530086
## 5 0.4040115
## 6 0.4856734
GOres_top15_bubble <- GOres_top15[order(GOres_top15$geneRatio),]
GOres_top15_bubble$term <- factor(GOres_top15_bubble$term,
                                  levels = GOres_top15_bubble$term)

bubblePlot <- ggplot(GOres_top15_bubble,aes(x=geneRatio,y=term)) + 
  geom_point(aes(size=numDEInCat,color=FDR.over)) +
  xlab("geneRatio") + theme_classic()+
  scale_colour_gradient(name = "FDR",
                         low = "#D55E00", high = "#0072B2")
bubblePlot

29 Saving your plots - ggsave

29.1 ggsave()

  • It defaults to saving the last plot that you displayed, using the size of the current graphics device

  • It also guesses the type of graphics device from the extension

  • device: “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only)

  • Plot to save, defaults to last plot displayed.

?ggsave()
## Usage
ggsave(filename, plot = last_plot(), device = NULL, path = NULL,
  scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"),
  dpi = 300, limitsize = TRUE, ...)
  
data(mtcars)

plot1 <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
plot2 <- ggplot(mtcars, aes(mpg, wt,col=as.factor(vs))) + geom_point()

ggsave("mtcars_default.png", plot1)
ggsave("mtcars_col.png", plot2)

30 Exercise V

31 Solutions V

32 Resources

33 sessionInfo

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.10.0 RColorBrewer_1.1-2    ggrepel_0.9.1        
## [4] reshape2_1.4.4        ggplot2_3.3.5        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8          circlize_0.4.13     png_0.1-7          
##  [4] assertthat_0.2.1    digest_0.6.29       foreach_1.5.1      
##  [7] utf8_1.2.2          R6_2.5.1            plyr_1.8.6         
## [10] stats4_4.1.2        evaluate_0.14       highr_0.9          
## [13] pillar_1.6.5        GlobalOptions_0.1.2 rlang_0.4.12       
## [16] jquerylib_0.1.4     magick_2.7.3        S4Vectors_0.32.2   
## [19] GetoptLong_1.0.5    rmarkdown_2.11      labeling_0.4.2     
## [22] stringr_1.4.0       munsell_0.5.0       compiler_4.1.2     
## [25] xfun_0.28           pkgconfig_2.0.3     BiocGenerics_0.40.0
## [28] shape_1.4.6         htmltools_0.5.2     tidyselect_1.1.1   
## [31] tibble_3.1.6        bookdown_0.24       IRanges_2.28.0     
## [34] codetools_0.2-18    matrixStats_0.61.0  fansi_1.0.2        
## [37] crayon_1.4.2        dplyr_1.0.7         withr_2.4.3        
## [40] jsonlite_1.7.3      gtable_0.3.0        lifecycle_1.0.1    
## [43] DBI_1.1.2           magrittr_2.0.1      scales_1.1.1       
## [46] rmdformats_1.0.3    stringi_1.7.6       farver_2.1.0       
## [49] doParallel_1.0.16   bslib_0.3.1         ellipsis_0.3.2     
## [52] generics_0.1.1      vctrs_0.3.8         rjson_0.2.21       
## [55] iterators_1.0.13    tools_4.1.2         glue_1.6.1         
## [58] purrr_0.3.4         parallel_4.1.2      fastmap_1.1.0      
## [61] yaml_2.2.2          clue_0.3-60         colorspace_2.0-2   
## [64] cluster_2.1.2       knitr_1.36          sass_0.4.0