A short intro to the most important data structures in R

For our purpose, we need to know two data structures in R:

Data frame

A data frame is a table in which each column contains values of one variable and each row contains one set of values from each column. Both row and column names should be unique.

A data frame is produced when we upload a comma-delimited table into R using the read.csv() command.
A useful command to check the content of a data frame is head() which displays the top of the data.frame.

# reading in table 
df <- read.csv("data/DiffEx.csv")
# showing first 6 rows of the data frame
head(df)

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2

subset the data.frame based on columns
- Single columns can either be accessed via the $ operator, writing the explicit name [,name] (if the column has a name) or more generally by [,j] for the jth column.

# different options to access the data 
head(df$baseMean)

## [1] 10893.6671   296.2968  1967.0850  1359.6740   414.7163  2986.8342

head(df[,"baseMean"])

## [1] 10893.6671   296.2968  1967.0850  1359.6740   414.7163  2986.8342

head(df[,2])

## [1] 10893.6671   296.2968  1967.0850  1359.6740   414.7163  2986.8342

subset the data frame based on rows
- Another common way to subset the data frame is to supply a logical (Boolean) vector. For example, if we use the following thresholds to define significant genes (padj < 0.01 and abs(log2 fold change) > 1) from the DESeq2 analysis results.

# define the DE genes using logical vector 
diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)
# diffExpressed is a vector with TRUE or FALSE values
head(diffExpressed)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

class(diffExpressed)

## [1] "logical"

# make sure the length of the logical vector is consistent with the original dataset
## check the dimensions of the original dataset
dim(df)

## [1] 18357    10

## check the length of the logical vector
length(diffExpressed)

## [1] 18357

# use table function to see TRUE & FALSE
table(diffExpressed)

## diffExpressed
## FALSE  TRUE 
## 17120  1237

# subset the original df using diffExpressed vector
DE_res <- df[diffExpressed,]

# check the dimension of the new data frame
dim(DE_res)

## [1] 1237   10

# showing first 6 rows of the data frame
head(DE_res)

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2

Another useful command to check the type of each columns from a data frame is str() which compactly displays the structure of a R object.

# different options to access the data 
str(df)

## 'data.frame':    18357 obs. of  10 variables:
##  $ ensembl_id     : chr  "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770" "ENSMUSG00000074480" ...
##  $ baseMean       : num  10894 296 1967 1360 415 ...
##  $ log2FoldChange : num  2.86 4.42 1.74 -2.17 3.52 ...
##  $ lfcSE          : num  0.109 0.187 0.077 0.102 0.166 ...
##  $ stat           : num  26.1 23.7 22.6 -21.3 21.2 ...
##  $ pvalue         : num  6.09e-150 4.39e-124 3.18e-113 3.17e-100 1.23e-99 ...
##  $ padj           : num  1.12e-145 4.03e-120 1.94e-109 1.46e-96 4.51e-96 ...
##  $ baseMean_DOX7D : num  21317 627 3080 473 820 ...
##  $ baseMean_DOX24H: num  2945.5 29.3 921.2 2126.6 71.6 ...
##  $ mgi_symbol     : chr  "Scg2" "Gfra1" "Dlg1" "Mex3a" ...

Vector

A vector is a sequence of data elements of the same type. We use it often in the definiton of colors or genes to be displayed. A vector can be constructed using the c() function.

colo <- c("blue","red","green")

colo

## [1] "blue"  "red"   "green"

Two philosophies of plotting in R

In our tutorial, we use two sources of plotting capabilities. The first one (base graphics) might resemble a printer, where you put in your data, make some choices, press the button and voila you have the picture. The second one (ggplot2) is more like an oil painting, it is layered and needs some setting up, but it is more easily modifiable (although it might be more laborious to get to the full painting in the first place).

Disclaimer: The distinction between the two plotting philosophies is not that strict, we can still do some limited modification after initial plotting also with base graphics using auxiliary functions.

base graphics

This functionality for plotting is easy to use, but becomes difficult or impossible to draw complicated plots.

For using base graphics, we start with the plot() function and then could add features using auxilary functions such as

lines: add lines to a plot, given a vector x values and a corresponding vector of y values (or a 2-column matrix); the function connects the points
points: add points to a plot
text: add text labels to a plot using specified x, y coordinates
title: add annotations to x, y axis labels, title, subtitle, outer margin
mtext: add arbitrary text to the margins (inner or outer) of the plot
axis: adding axis ticks/labels

usage

This scatterplot is an example of the use of the plot() function. Basically, it takes the x and y arguments for the x and y coordinates of the plot.

# plotting 
plot(x=log2(df$baseMean_DOX24H), y=log2(df$baseMean_DOX7D))

Additional arguments provide options for adjusting the plot such as xlab, ylab, pch and main). These are explained using help(plot) or ?plot.

# help(plot) or ?plot
?plot

plot.default {graphics} R Documentation
The Default Scatterplot Function
Description
Draw a scatter plot with decorations such as axes and titles in the active graphics window.

Usage
## Default S3 method:
plot(x, y = NULL, type = "p",  xlim = NULL, ylim = NULL,
     log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
     ann = par("ann"), axes = TRUE, frame.plot = axes,
     panel.first = NULL, panel.last = NULL, asp = NA,
     xgap.axis = NA, ygap.axis = NA,
     ...)

type: what type of plot should be drawn. Possible types are

        "p" for points,
        "l" for lines,
        "b" for both,
        "c" for the lines part alone of "b",
        "o" for both ‘overplotted’,
        "h" for ‘histogram’ like (or ‘high-density’) vertical lines,
        "s" for stair steps,
        "S" for other steps, see ‘Details’ below,
        "n" for no plotting.

In the following, we also define the differentially expressed genes and highlight them by red color using points() to add them to our origin plots.

# plotting 
plot(x=log2(df$baseMean_DOX24H),y=log2(df$baseMean_DOX7D),
     xlab="log2 expression DOX24H", 
     ylab="log2 expression DOX7D", pch=20, 
     main="Comparison of expression at 24h and 7 days")

diffExpressed <- (!is.na(df$padj) & df$padj < 0.01 & abs(df$log2FoldChange) > 1)
head(diffExpressed)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

length(diffExpressed)

## [1] 18357

table(diffExpressed)

## diffExpressed
## FALSE  TRUE 
## 17120  1237

points(log2(df$baseMean_DOX24H[diffExpressed]),log2(df$baseMean_DOX7D[diffExpressed]),
       col="red",pch=20)

Plotting ‘character’ (pch) - symbol to use

ggplot2 package

A very versatile package which has become a de facto standard for producing high quality figures is an R package based on the “grammar of graphics”. This enables developing plots by layering graphical elements on top of each other.

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).” - Wickham, 2016

Every ggplot2 plot has three key components:

data provided as data frame,
aesthetic mappings between variables in the data and visual properties, and
layer: usually created with a geom function.

usage

Start with the ggplot function which initializes a ggplot object. Here you need to declare the input data frame and to specify the set of plot aesthetics. The actual plotting will be done by applying another layer (with the “+” sign) using various specific geom functions. For a scatter plot like the one above, we will use the geom_point() function.

# install.packages("ggplot2")
library("ggplot2")

# (1) ggplot function with initializes a ggplot object
ggplot(df)

# (2) specify the set of plot aesthetics
ggplot(df, aes(x=log2(baseMean_DOX24H), y=log2(baseMean_DOX7D)))

# (3) specify plot type using geom_point()
ggplot(df, aes(x=log2(baseMean_DOX24H), y=log2(baseMean_DOX7D))) +
  geom_point()

Note that we do not need to indicate that the x and y are specific columns of the dataframe df e.g. by using the $ operator as we already defined in the ggplot function that df should be used.

Now we add more layers to get a similar plot as we got using base graphics.

ggplot(df, aes(x=log2(baseMean_DOX24H), y=log2(baseMean_DOX7D))) +
  geom_point() + 
  ggtitle("Comparison of expression at 24h and 7 days")+
  xlab("log2 expression DOX24H") +
  ylab("log2 expression DOX7D")

This looks promising but we would like to highlight the differentially expressed genes. Since ggplot works with the input data frame, we need to include this information into the data frame df. Then, we add color to the aesthetic mappings and set it the added column. geom_point() will select its color for the different values of the columns, but if we do not like it, we can overwrite or more exactly put another layer on it (like on an oil painting.)

# check the dimension for object "df"
dim(df)

## [1] 18357    10

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

# diffExpressed is a vector with TRUE or FALSE values
head(diffExpressed)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

length(diffExpressed)

## [1] 18357

df2 <- data.frame(df, diffEx=diffExpressed)

head(df2)

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol diffEx
## 1 1.12e-145     21316.8751      2945.51899       Scg2   TRUE
## 2 4.03e-120       626.6633        29.26926      Gfra1   TRUE
## 3 1.94e-109      3080.2873       921.18292       Dlg1   TRUE
## 4  1.46e-96       472.8374      2126.62645      Mex3a   TRUE
## 5  4.51e-96       820.3147        71.59931       Rin1   TRUE
## 6  5.58e-96      4687.6749      1193.30291      Ildr2   TRUE

table(df2$diffEx)

## 
## FALSE  TRUE 
## 17120  1237

g <- ggplot(df2, aes(x=log2(baseMean_DOX24H), 
                     y=log2(baseMean_DOX7D),color=diffEx)) +
  geom_point() + 
  ggtitle("Comparison of expression at 24h and 7 days")+
  xlab("log2 expression DOX24H") +
  ylab("log2 expression DOX7D")
g

# less fancy colors please  
g + scale_color_manual(values=c('black','red')) + 
  labs(color = "Diff. exp.")

# more classic look please  
g + scale_color_manual(values=c('black','red')) + 
  labs(color = "Diff. exp.") + 
  theme_classic()

Note that we assigned first the ggplot object to the variable g and then plot by entering it again. The nice thing here is that we can put further layers on the plot without calling the ggplot() and other functions again. This is convenient if we need to generate various different version of one figure.

Exercise I

Plotting in R exercise 1
note: In this exercise, you will need to know the difference between “numeric” and “factor” R data types

Solutions I

Plotting in R solution 1

Our case study

In 1933, the Dutch pediatrician Cornelia De Lange described a disorder in two children, which became known as Cornelia de Lange syndrome (CdLS). Its features can include short stature, intellectual disability, developmental delay and signs of autism. Over the years, research showed that Cornelia de Lange syndrome is a genetic disorder and that it is frequently linked to mutations in components of cohesin complex. This complex is fundamental for obtaining higher order structures of DNA within the nucleus.

Classical facial dysmorphisms of CdLS and related overlapping syndromes

Classical facial dysmorphisms of CdLS and related overlapping syndromes

Five major CdLS facial features and their different severity

Clinical Genetics, Volume: 97, Issue: 1, Pages: 3-11, First published: 13 November 2019, DOI: (10.1111/cge.13674)

How cohesin mutations cause Cornelia de Lange syndrome is not well understood. To elucidate the potential role of cohesin for Cornelia de Lange syndrom, a group at the LMS applied several genomic approaches including RNA-seq technology.

Abstract of our case study https://doi.org/10.1038/s41467-021-23141-9

Weiss et al., Nat Commun. 2021

After comparing expression in cortical neurons from post-mortem brains of Cornelia de Lange syndrome patients with controls, the researchers showed in a mouse model that doxycycline (dox) inducible depletion of one cohesin component (RAD21) leads to similar changes of expression including the downregulation of genes with neuronal and synaptic functions. The researchers asked then the crucial question if the normal expression could be restored when the depletion of RAD21 is lifted. A positive answer might give us hope that correction of cohesin functions could provide a potential therapeutic approach for Cornelia de Lange patients.

The figure below shows one of the experiments in this study. Rad21-TEV neurons were grown in culture, for 10 days. On day 10, cells were treated for 6 hours with either 100ng/ml of doxycycline, or vehicle. After 6 hours, the samples were washed and fresh media was added. Samples for cohesin depletion (24H) were collected 18 hours later, and samples for rescue (7D) were collected 7 days after addition of doxycyline/vehicle.

The file "data/DiffEx.csv" contains the expression after RAD21 depletion at 24 hours (baseMean_DOX24H) and RAD21 restoration at day 7 (baseMean_DOX7D). It also contains log2 fold changes and adjusted p-values (padj). "data/GO_UpDE.csv" contains gene ontology analysis results. "data/DoxExpression.txt" contains normalised gene counts information. To gain insights into these data and to present them to the research community, we need to turn the table into some informative plots. In the following, we would like to complete following tasks:

Our tasks:

Display the distribution of log2 fold changes
Visualise the expression and changes in expression
Plot the functional categories enriched in significantly differentially expressed genes

Let’s get started.

Displaying distributions - Base graphics

Histograms - Base graphics

hist(df$log2FoldChange,breaks=100,xlab = "Log2 fold changes",
     main="Histogram of log2 fold changes",col="red")

Density plots - Base graphics

log2 FC density plot

d <- density(df$log2FoldChange)
plot(d,xlab="Log2 fold changes",
     main="Distribution of log2 fold changes")
polygon(d,col="red")

log2 normalised expression levels in two conditions

log2_DOX24H <- log2(df$baseMean_DOX24H)
log2_DOX7D <- log2(df$baseMean_DOX7D)

d1 <- density(log2_DOX7D)
d2 <- density(log2_DOX24H)

plot(d1, lwd=2, xlab="Log2 baseMean",
     main="Distribution of log2 fold changes")
lines(d2,col=rgb(0,0,1),lwd=2)
polygon(d1,col=rgb(1,0,0,0.2))
polygon(d2,col=rgb(0,0,1,0.2))
legend(x=18,y=0.08,legend=c("DOX24H","DOX7D"),
       fill=c(rgb(1,0,0,0.2),rgb(0,0,1,0.2)))

Displaying distributions - ggplot2

Histograms - ggplot2

# install.package(ggplot2)
library(ggplot2)

# basic histogram
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# change the number of bins and add x and y lab
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram(bins=100) + 
  xlab("Log2 fold changes") + ylab("Frequency")

# change colours 
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram(bins=100, col="black",fill="red") + 
  xlab("Log2 fold changes") + ylab("Frequency")

# use "classic background" and add title
ggplot(df, aes(x=log2FoldChange)) + 
  geom_histogram(bins=100, col="black",fill="red") + 
  xlab("Log2 fold changes") + ylab("Frequency") +
  theme_classic() +
  ggtitle("Histogram of log2 fold changes")

Density plots - ggplot2

log2 FC density plot

# basic density plot
ggplot(df, aes(x=log2FoldChange)) + 
  geom_density()

# change colour and use "theme_bw"
ggplot(df, aes(x=log2FoldChange)) + 
  geom_density(fill="red") + 
  theme_bw()

log2 normalised expression levels in two conditions
- we have to change the data format in to the format that ggplot2 likes
- we are going to convert the “wide” data.frame into “long” data.frame

# subset the data.frame and only keep the information of interest
df_subset <- df[,colnames(df) %in% 
                  c("ensembl_id","baseMean_DOX7D","baseMean_DOX24H")]
dim(df_subset)

## [1] 18357     3

head(df_subset)

##           ensembl_id baseMean_DOX7D baseMean_DOX24H
## 1 ENSMUSG00000050711     21316.8751      2945.51899
## 2 ENSMUSG00000025089       626.6633        29.26926
## 3 ENSMUSG00000022770      3080.2873       921.18292
## 4 ENSMUSG00000074480       472.8374      2126.62645
## 5 ENSMUSG00000024883       820.3147        71.59931
## 6 ENSMUSG00000040612      4687.6749      1193.30291

# convert df_subset into a "long" format
# install.packages("reshape2")
library("reshape2")
df_subset_long <- melt(df_subset,id="ensembl_id")
dim(df_subset_long)

## [1] 36714     3

str(df_subset_long)

## 'data.frame':    36714 obs. of  3 variables:
##  $ ensembl_id: chr  "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770" "ENSMUSG00000074480" ...
##  $ variable  : Factor w/ 2 levels "baseMean_DOX7D",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value     : num  21317 627 3080 473 820 ...

head(df_subset_long)

##           ensembl_id       variable      value
## 1 ENSMUSG00000050711 baseMean_DOX7D 21316.8751
## 2 ENSMUSG00000025089 baseMean_DOX7D   626.6633
## 3 ENSMUSG00000022770 baseMean_DOX7D  3080.2873
## 4 ENSMUSG00000074480 baseMean_DOX7D   472.8374
## 5 ENSMUSG00000024883 baseMean_DOX7D   820.3147
## 6 ENSMUSG00000040612 baseMean_DOX7D  4687.6749

# now it is ready for ggplot2
ggplot(df_subset_long, aes(x=log2(value), fill=variable)) + 
  geom_density()

# setup the alpha value to make the colours transparent
ggplot(df_subset_long, aes(x=log2(value), fill=variable)) + 
  geom_density(alpha=0.5)

# use the "theme_classic" and add xlab
ggplot(df_subset_long, aes(x=log2(value), fill=variable)) + 
  geom_density(alpha=0.5) + theme_classic() + xlab("Log2 baseMean")

violin plot

ggplot(df_subset_long, aes(x=variable,y=log2(value), fill=variable)) + 
  geom_violin(alpha=0.5) + theme_classic() + 
  ylab("Log2 baseMean") + xlab("Samples")

violin plot + box plot

# add geom_boxplot() layer directly
ggplot(df_subset_long, aes(x=variable,y=log2(value),fill=variable)) + 
  geom_violin(alpha=0.5) + 
  geom_boxplot() + 
  theme_classic() + 
  ylab("Log2 baseMean") + xlab("Samples")

# what to do if we want violin plot filled with 2 different colours and 
# box plot filled with "white" colour
ggplot(df_subset_long, aes(x=variable,y=log2(value))) + 
  geom_violin(aes(fill=variable),alpha=0.5) + 
  geom_boxplot(width=0.2) + 
  theme_classic() + 
  ylab("Log2 baseMean") + xlab("Samples")

Exercise II

Plotting in R exercise 2
note, In this exercise, apart from ggplot2, you will also be working with some commonly used tools in R: (1) the melt() function from the reshape2 package, (2) use of regular expression to manipulate data. More details about regular expression [[https://en.wikipedia.org/wiki/Regular_expression]]

Regular Expression

Solutions II

Plotting in R solution 2

Visualising expression and changes in expression

Scatterplots - base graphics

We already did this before. Now we want to highlight some genes and put their gene symbol into the plot. For the start, we will highlight the 5 most significant genes.

# plotting 
plot(x=log2(df$baseMean_DOX24H),y=log2(df$baseMean_DOX7D),
     xlab="log2 expression DOX24H", 
     ylab="log2 expression DOX7D",
     pch=20, main="Comparison of expression at 24h and 7 days",
     col="gray")

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

points(log2(df$baseMean_DOX24H[diffExpressed]),
       log2(df$baseMean_DOX7D[diffExpressed]),
       col="red",pch=20)

o <- order(df$padj)[1:5]
df[o,]

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1

points(log2(df$baseMean_DOX24H[o]),
       log2(df$baseMean_DOX7D[o]),
       col="black",pch=21,cex=1.5,lwd=2)

text(log2(df$baseMean_DOX24H[o]),
     log2(df$baseMean_DOX7D[o]),
     labels=df$mgi_symbol[o], 
     pos = 3,offset=0.5,font=2)

This looks good. Now let’s label the top 20 genes.

# plotting 
plot(x=log2(df$baseMean_DOX24H),y=log2(df$baseMean_DOX7D),
     xlab="log2 expression DOX24H", ylab="log2 expression DOX7D",pch=20, 
     main="Comparison of expression at 24h and 7 days",col="gray")

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

points(log2(df$baseMean_DOX24H[diffExpressed]),
       log2(df$baseMean_DOX7D[diffExpressed]),
       col="red",pch=20)

o <- order(df$padj)[1:20]

points(log2(df$baseMean_DOX24H[o]),
       log2(df$baseMean_DOX7D[o]),
       col="black",pch=21,cex=1.5,lwd=2)

text(log2(df$baseMean_DOX24H[o]),
     log2(df$baseMean_DOX7D[o]),
     labels=df$mgi_symbol[o], pos = 3,offset=0.5,font=2)

Oh, no.

ggplot2 and ggrepel

# install.packages("ggrepel")
library(ggrepel)

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)
# diffExpressed is a vector with TRUE or FALSE values

df4ggplot <- data.frame(df, diffEx=diffExpressed)

head(df4ggplot)

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol diffEx
## 1 1.12e-145     21316.8751      2945.51899       Scg2   TRUE
## 2 4.03e-120       626.6633        29.26926      Gfra1   TRUE
## 3 1.94e-109      3080.2873       921.18292       Dlg1   TRUE
## 4  1.46e-96       472.8374      2126.62645      Mex3a   TRUE
## 5  4.51e-96       820.3147        71.59931       Rin1   TRUE
## 6  5.58e-96      4687.6749      1193.30291      Ildr2   TRUE

table(df4ggplot$diffEx)

## 
## FALSE  TRUE 
## 17120  1237

df4ggplot <- df4ggplot[order(df4ggplot$padj),]

g_top20 <- ggplot(df4ggplot, aes(x=log2(baseMean_DOX24H), 
                     y=log2(baseMean_DOX7D),color=diffEx)) +
  geom_point() + 
  scale_color_manual(values=c('black','red')) + 
  labs(color = "Diff. exp.") + 
  theme_classic() +
  ggtitle("Comparison of expression at 24h and 7 days")+
  xlab("log2 expression DOX24H") +
  ylab("log2 expression DOX7D") +
  geom_label_repel(data=df4ggplot[c(1:20),],
                   mapping=aes(label = mgi_symbol),
                   box.padding = 0.5,max.overlaps = 20)

g_top20

Volcano plots - base graphics

# plotting 
plot(x=df$log2FoldChange,y=-log10(df$pvalue),
     xlab="log2 fold change", ylab="log10 p-value",
     pch=20, main="Volcano plot",col="gray")

diffExpressed <- (df$padj < 0.01 & abs(df$log2FoldChange) > 1)

points(df$log2FoldChange[diffExpressed],
       -log10(df$pvalue[diffExpressed]),col="red",pch=20)

o <- order(df$padj)[1:10]

points(df$log2FoldChange[o],-log10(df$pvalue[o]),
       col="black",pch=21,cex=1.5,lwd=2)

text(df$log2FoldChange[o],-log10(df$pvalue[o]),
     labels=df$mgi_symbol[o], pos = 1,offset=0.5,font=2)

Volcano plots - ggplot2

# install.packages("ggrepel")
library(ggrepel)

df4ggplot <- df
df4ggplot$DE.cat <- ifelse(
  df4ggplot$padj < 0.01 & abs(df4ggplot$log2FoldChange) > 1,
  "TRUE", "FALSE"
)

df4ggplot$DE.cat <- factor(df4ggplot$DE.cat,
                           levels=c("TRUE","FALSE"))
df4ggplot <- df4ggplot[order(df4ggplot$padj),]

vol_plot <- ggplot(data=df4ggplot, 
                   aes(x=log2FoldChange, y=-log10(pvalue),col=DE.cat)) +
  geom_point() +  
  xlab("log2FoldChange") + ylab("-log10 PValue")+ 
  scale_color_manual(values=c("red","grey"))+
  geom_label_repel(data=df4ggplot[c(1:10),],
                   mapping=aes(label = mgi_symbol),
                   box.padding = 0.5,max.overlaps = 20)+
  theme_classic()

vol_plot

Exercise III

Plotting in R exercise 3

Solutions III

Plotting in R solution 3

Heatmaps - base graphics

# install.packages("RColorBrewer")
library(RColorBrewer)
DoxEx <-  read.csv("data/DoxExpression.txt",row.names=1)
head(DoxEx)

##                      DOX24H_1  DOX24H_2   DOX24H_3    DOX7D_3    DOX7D_1
## ENSMUSG00000090025  2.4928423  2.430152  2.5169965  2.5058382  2.6820156
## ENSMUSG00000051951 12.4809504 12.339864 12.3060097 11.9827350 11.7679441
## ENSMUSG00000025900 -0.8083506 -0.736612 -0.8085774 -0.8090216 -0.8091383
## ENSMUSG00000025902 -2.2108951 -2.210912 -2.2110380 -2.2113214 -2.2113968
## ENSMUSG00000064376 -1.9723416 -1.972370 -1.9725807 -1.9730551 -1.9569306
## ENSMUSG00000088000 -2.2107546 -2.210769 -2.2108726 -2.2111068 -2.2111690
##                       DOX7D_2
## ENSMUSG00000090025  2.2964760
## ENSMUSG00000051951 12.1394396
## ENSMUSG00000025900 -0.7711231
## ENSMUSG00000025902 -2.2107869
## ENSMUSG00000064376 -1.9721606
## ENSMUSG00000088000 -2.2106652

class(DoxEx)

## [1] "data.frame"

# order df based on padj
o <- order(df$padj)
head(df[o,])

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2

# select top 30 genes based on padj values
selectedGenes <- df[o,]$ensembl_id[1:30]

head(selectedGenes)

## [1] "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770"
## [4] "ENSMUSG00000074480" "ENSMUSG00000024883" "ENSMUSG00000040612"

# subset data of interest and convert it to matrix
DoxExSelected_mx <- as.matrix(DoxEx[selectedGenes,])

heatmap(DoxExSelected_mx,cexCol=1,scale="row", 
        col=colorRampPalette(brewer.pal(8, "Blues"))(25))

Heatmaps and RColorBrewer

# install.packages("RColorBrewer")
library(RColorBrewer)

# display all color palettes
display.brewer.all()

# display colorblind-Friendly color palettes
display.brewer.all(colorblindFriendly = TRUE)

# return 8 colours from the color palette "RdBu"
brewer.pal(8, "RdBu")

## [1] "#B2182B" "#D6604D" "#F4A582" "#FDDBC7" "#D1E5F0" "#92C5DE" "#4393C3"
## [8] "#2166AC"

# use `colorRampPalette()` to create a colour ramp from the 8 colours from the color palette "RdBu"
colorRampPalette(brewer.pal(8, "RdBu"))(25)

##  [1] "#B2182B" "#BC2C34" "#C7413E" "#D15748" "#DB6B55" "#E37F65" "#EC9374"
##  [8] "#F4A784" "#F7B799" "#F9C6AD" "#FCD6C1" "#F3DDCF" "#E7E0DB" "#DAE2E7"
## [15] "#CBE2EE" "#B9D9E9" "#A6CFE3" "#94C6DE" "#7EB8D7" "#67A9CF" "#509BC7"
## [22] "#3E8DC0" "#3480B9" "#2A73B2" "#2166AC"

Heatmaps - ComplexHeatmap I

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("ComplexHeatmap")
library("ComplexHeatmap")

## Loading required package: grid

## ========================================
## ComplexHeatmap version 2.10.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##   genomic data. Bioinformatics 2016.
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================

https://jokergoo.github.io/ComplexHeatmap-reference/book/

?Heatmap


Heatmap {ComplexHeatmap}    R Documentation
  Constructor method for Heatmap class

## Description
  Constructor method for Heatmap class

## Usage
Heatmap(matrix, col, name,
    na_col = "grey",
    color_space = "LAB",
    rect_gp = gpar(col = NA),
    border = NA,
    border_gp = gpar(col = "black"),
    cell_fun = NULL,
    layer_fun = NULL,
    jitter = FALSE,
    
    row_title = character(0),
    row_title_side = c("left", "right"),
    row_title_gp = gpar(fontsize = 13.2),
    row_title_rot = switch(row_title_side[1], "left" = 90, "right" = 270),
    column_title = character(0),
    column_title_side = c("top", "bottom"),
    column_title_gp = gpar(fontsize = 13.2),
    column_title_rot = 0,
    
    cluster_rows = TRUE,
    cluster_row_slices = TRUE,
    clustering_distance_rows = "euclidean",
    clustering_method_rows = "complete",
    row_dend_side = c("left", "right"),
    row_dend_width = unit(10, "mm"),
    show_row_dend = TRUE,
    row_dend_reorder = is.logical(cluster_rows) || is.function(cluster_rows),
    row_dend_gp = gpar(),
    cluster_columns = TRUE,
    cluster_column_slices = TRUE,
    clustering_distance_columns = "euclidean",
    clustering_method_columns = "complete",
    column_dend_side = c("top", "bottom"),
    column_dend_height = unit(10, "mm"),
    show_column_dend = TRUE,
    column_dend_gp = gpar(),
    column_dend_reorder = is.logical(cluster_columns) || is.function(cluster_columns),
    
    row_order = NULL,
    column_order = NULL,
    
    row_labels = rownames(matrix),
    row_names_side = c("right", "left"),
    show_row_names = TRUE,
    row_names_max_width = unit(6, "cm"),
    row_names_gp = gpar(fontsize = 12),
    row_names_rot = 0,
    row_names_centered = FALSE,
    column_labels = colnames(matrix),
    column_names_side = c("bottom", "top"),
    show_column_names = TRUE,
    column_names_max_height = unit(6, "cm"),
    column_names_gp = gpar(fontsize = 12),
    column_names_rot = 90,
    column_names_centered = FALSE,
    
    top_annotation = NULL,
    bottom_annotation = NULL,
    left_annotation = NULL,
    right_annotation = NULL,
    
    km = 1,
    split = NULL,
    row_km = km,
    row_km_repeats = 1,
    row_split = split,
    column_km = 1,
    column_km_repeats = 1,
    column_split = NULL,
    gap = unit(1, "mm"),
    row_gap = unit(1, "mm"),
    column_gap = unit(1, "mm"),
    show_parent_dend_line = ht_opt$show_parent_dend_line,
    
    heatmap_width = unit(1, "npc"),
    width = NULL,
    heatmap_height = unit(1, "npc"),
    height = NULL,
    
    show_heatmap_legend = TRUE,
    heatmap_legend_param = list(title = name),
    
    use_raster = NULL,
    raster_device = c("png", "jpeg", "tiff", "CairoPNG", "CairoJPEG", "CairoTIFF", "agg_png"),
    raster_quality = 1,
    raster_device_param = list(),
    raster_resize_mat = FALSE,
    raster_by_magick = requireNamespace("magick", quietly = TRUE),
    raster_magick_filter = NULL,
    
    post_fun = NULL)

Heatmaps - ComplexHeatmap II

DoxEx <-  read.csv("data/DoxExpression.txt",row.names=1)

# order df based on padj
o <- order(df$padj)
head(df[o,])

##           ensembl_id   baseMean log2FoldChange      lfcSE      stat    pvalue
## 1 ENSMUSG00000050711 10893.6671       2.855441 0.10948634  26.08034 6.09e-150
## 2 ENSMUSG00000025089   296.2968       4.421013 0.18660593  23.69170 4.39e-124
## 3 ENSMUSG00000022770  1967.0850       1.741321 0.07700282  22.61373 3.18e-113
## 4 ENSMUSG00000074480  1359.6740      -2.167605 0.10199629 -21.25180 3.17e-100
## 5 ENSMUSG00000024883   414.7163       3.518405 0.16605551  21.18812  1.23e-99
## 6 ENSMUSG00000040612  2986.8342       1.974114 0.09325275  21.16950  1.82e-99
##        padj baseMean_DOX7D baseMean_DOX24H mgi_symbol
## 1 1.12e-145     21316.8751      2945.51899       Scg2
## 2 4.03e-120       626.6633        29.26926      Gfra1
## 3 1.94e-109      3080.2873       921.18292       Dlg1
## 4  1.46e-96       472.8374      2126.62645      Mex3a
## 5  4.51e-96       820.3147        71.59931       Rin1
## 6  5.58e-96      4687.6749      1193.30291      Ildr2

# select top 30 genes based on padj values
selectedGenes <- df[o,]$ensembl_id[1:30]

head(selectedGenes)

## [1] "ENSMUSG00000050711" "ENSMUSG00000025089" "ENSMUSG00000022770"
## [4] "ENSMUSG00000074480" "ENSMUSG00000024883" "ENSMUSG00000040612"

# subset data of interest and convert it to matrix
DoxExSelected <- as.matrix(DoxEx[selectedGenes,])
DEmatok <- t(scale(t(DoxExSelected)))

# print default heatmap
ht <- Heatmap(DEmatok)
draw(ht)

# change font size for the row names and row names
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6))
draw(ht)

# change colours
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = colorRampPalette(brewer.pal(8, "RdBu"))(25))
draw(ht)

# change colours - upregulated as red and downregulated as blue
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)))
draw(ht)

# change figure legend
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",
               title_position = "topcenter"))
draw(ht)

# make it "slimmer" 
ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"))

draw(ht,heatmap_legend_side = "bottom")

# turn off "cluster by columns"

ht <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"),
        cluster_columns=F)

draw(ht,heatmap_legend_side = "bottom")

# create top annotation
anno <- data.frame(SampleNames=colnames(DEmatok))
anno$TimePoint <- sub("(.+)(_)(.+)","\\1",anno$SampleNames)
anno$TimePoint <- factor(anno$TimePoint)
time_col <- brewer.pal(10, "Paired")[c(9:10)]
names(time_col) <- levels(anno$TimePoint)
col_ha = HeatmapAnnotation(TimePoint = anno$TimePoint,
                           col = list(TimePoint = time_col))
ht_top <- Heatmap(DEmatok, 
              row_names_gp = gpar(fontsize = 6),
              column_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"),
        cluster_columns=F,top_annotation = col_ha)

draw(ht_top,heatmap_legend_side = "bottom")

# remove column names and remove row names

ht_top_NoL <- Heatmap(DEmatok, row_names_gp = gpar(fontsize = 6),
        col = rev(colorRampPalette(brewer.pal(8, "RdBu"))(25)),
        heatmap_legend_param =
          list(title = "Row z-score",width = unit(6, "mm"),
               title_position = "topcenter",direction = "horizontal"),
        cluster_columns=F,top_annotation = col_ha,
        show_column_names = F,show_row_names = F)

draw(ht_top_NoL,heatmap_legend_side = "bottom")

More details please see: https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html

Saving your plots

useful functions: pdf(), png(), jpeg(), postscript(), svg()
plot object that you would like to save
close the file with dev.off()

?svg()

## Usage
svg(filename = if(onefile) "Rplots.svg" else "Rplot%03d.svg",
    width = 7, height = 7, pointsize = 12,
    onefile = FALSE, family = "sans", bg = "white",
    antialias = c("default", "none", "gray", "subpixel"),
    symbolfamily)

filename4ht <- "heatmapCourseOutput.svg"
svg(file=filename4ht, width = 4, height=5.5)
  draw(ht_top_NoL,heatmap_legend_side = "bottom")
dev.off()

Exercise IV

Plotting in R exercise 4

Solutions IV

Plotting in R solution 4

Plots of functional enrichment

Gene Ontology analysis was performed based on Up-regulated genes in Dox7D vs Dox24H using thresholds padj < 0.01 and log2(FC) > log2(2). Please see “data/GO_UpDE.csv”

Bar plot

We use bar plot to show top 15 GO terms

GOres <-  read.csv("data/GO_UpDE.csv")
# rank "GOres" based on "FDR.over"
GOres <- GOres[order(GOres$FDR.over),]

# subset top 15 GO terms
GOres_top15 <- GOres[c(1:15),]

# basic bar plot
ori_bar <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity") +
  xlab("-log10(FDR)")
ori_bar

# add colour
blue_bar <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity",fill="blue") +
  xlab("-log10(FDR)") + theme_classic()

blue_bar

present the GO terms based on their significance

# the "GOres_top15" already ordered by the "FDR.over" but it is not shown in the bar plot
head(GOres_top15)

##     category over_represented_pvalue under_represented_pvalue numDEInCat
## 1 GO:0023052                4.98e-22                        1        318
## 2 GO:0007154                5.35e-21                        1        317
## 3 GO:0032501                5.47e-21                        1        369
## 4 GO:0050896                2.89e-19                        1        386
## 5 GO:0007165                1.00e-17                        1        282
## 6 GO:0051716                3.41e-17                        1        339
##   numInCat                             term ontology FDR.over FDR.under
## 1     4390                        signaling       BP 9.90e-18         1
## 2     4436               cell communication       BP 3.63e-17         1
## 3     5461 multicellular organismal process       BP 3.63e-17         1
## 4     6054             response to stimulus       BP 1.44e-15         1
## 5     3932              signal transduction       BP 3.99e-14         1
## 6     5189    cellular response to stimulus       BP 1.13e-13         1

# We use 'factor'
class(GOres_top15$term)

## [1] "character"

## version 1
ordered_term <- GOres_top15$term
head(GOres_top15$term)

## [1] "signaling"                        "cell communication"              
## [3] "multicellular organismal process" "response to stimulus"            
## [5] "signal transduction"              "cellular response to stimulus"

GOres_top15$term <- factor(GOres_top15$term,levels=ordered_term)

ordered_bar <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity",fill="blue") +
  xlab("-log10(FDR)") + theme_classic()

ordered_bar

## version 2
ordered_term_rev <- rev(ordered_term)
head(ordered_term_rev)

## [1] "tissue development"                         
## [2] "system development"                         
## [3] "multicellular organism development"         
## [4] "regulation of cell population proliferation"
## [5] "cell-cell signaling"                        
## [6] "synaptic signaling"

GOres_top15$term <- factor(GOres_top15$term,levels=ordered_term_rev)

ordered_barRev <- ggplot(GOres_top15,aes(x=-log10(FDR.over),y=term)) + 
  geom_bar(stat="identity",fill="blue") +
  xlab("-log10(FDR)") + theme_classic()

ordered_barRev

Bubble plot

We can further define geneRatio as number of DE genes associated with GO process / total number of DE genes.

# calculate number of up.DE
df <- read.csv("data/DiffEx.csv")
upDE_df <- df[!is.na(df$padj) & df$padj<0.01 & df$log2FoldChange> 1,]
num.UpDE <- nrow(upDE_df)

GOres_top15$geneRatio <- GOres_top15$numDEInCat/num.UpDE
head(GOres_top15)

##     category over_represented_pvalue under_represented_pvalue numDEInCat
## 1 GO:0023052                4.98e-22                        1        318
## 2 GO:0007154                5.35e-21                        1        317
## 3 GO:0032501                5.47e-21                        1        369
## 4 GO:0050896                2.89e-19                        1        386
## 5 GO:0007165                1.00e-17                        1        282
## 6 GO:0051716                3.41e-17                        1        339
##   numInCat                             term ontology FDR.over FDR.under
## 1     4390                        signaling       BP 9.90e-18         1
## 2     4436               cell communication       BP 3.63e-17         1
## 3     5461 multicellular organismal process       BP 3.63e-17         1
## 4     6054             response to stimulus       BP 1.44e-15         1
## 5     3932              signal transduction       BP 3.99e-14         1
## 6     5189    cellular response to stimulus       BP 1.13e-13         1
##   geneRatio
## 1 0.4555874
## 2 0.4541547
## 3 0.5286533
## 4 0.5530086
## 5 0.4040115
## 6 0.4856734

GOres_top15_bubble <- GOres_top15[order(GOres_top15$geneRatio),]
GOres_top15_bubble$term <- factor(GOres_top15_bubble$term,
                                  levels = GOres_top15_bubble$term)

bubblePlot <- ggplot(GOres_top15_bubble,aes(x=geneRatio,y=term)) + 
  geom_point(aes(size=numDEInCat,color=FDR.over)) +
  xlab("geneRatio") + theme_classic()+
  scale_colour_gradient(name = "FDR",
                         low = "#D55E00", high = "#0072B2")
bubblePlot

Saving your plots - ggsave

ggsave()

It defaults to saving the last plot that you displayed, using the size of the current graphics device
It also guesses the type of graphics device from the extension
device: “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only)
Plot to save, defaults to last plot displayed.

?ggsave()

## Usage
ggsave(filename, plot = last_plot(), device = NULL, path = NULL,
  scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"),
  dpi = 300, limitsize = TRUE, ...)

data(mtcars)

plot1 <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
plot2 <- ggplot(mtcars, aes(mpg, wt,col=as.factor(vs))) + geom_point()

ggsave("mtcars_default.png", plot1)
ggsave("mtcars_col.png", plot2)

Exercise V

Plotting in R exercise 5

Solutions V

Plotting in R solution 5

Resources

cheatsheet

https://www.rstudio.com/resources/cheatsheets/

+ Data visualization with ggplot2 cheatsheet

+ String manipulation with stringr cheatsheet

Useful information for colour palettes

https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/

https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3

https://github.com/karthik/wesanderson

sessionInfo

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.10.0 RColorBrewer_1.1-2    ggrepel_0.9.1        
## [4] reshape2_1.4.4        ggplot2_3.3.5        
## 
## loaded via a namespace (and not attached):
##  [1] shape_1.4.6         circlize_0.4.13     GetoptLong_1.0.5   
##  [4] tidyselect_1.1.1    xfun_0.28           bslib_0.3.1        
##  [7] purrr_0.3.4         colorspace_2.0-2    vctrs_0.3.8        
## [10] generics_0.1.1      stats4_4.1.2        htmltools_0.5.2    
## [13] yaml_2.2.2          utf8_1.2.2          rlang_0.4.12       
## [16] jquerylib_0.1.4     pillar_1.6.5        glue_1.6.1         
## [19] withr_2.4.3         DBI_1.1.2           BiocGenerics_0.40.0
## [22] matrixStats_0.61.0  foreach_1.5.1       lifecycle_1.0.1    
## [25] plyr_1.8.6          stringr_1.4.0       munsell_0.5.0      
## [28] gtable_0.3.0        GlobalOptions_0.1.2 codetools_0.2-18   
## [31] evaluate_0.14       labeling_0.4.2      knitr_1.36         
## [34] IRanges_2.28.0      fastmap_1.1.0       doParallel_1.0.16  
## [37] parallel_4.1.2      fansi_1.0.2         highr_0.9          
## [40] Rcpp_1.0.8          scales_1.1.1        S4Vectors_0.32.2   
## [43] magick_2.7.3        jsonlite_1.7.3      farver_2.1.0       
## [46] rjson_0.2.21        png_0.1-7           digest_0.6.29      
## [49] stringi_1.7.6       dplyr_1.0.7         clue_0.3-60        
## [52] tools_4.1.2         magrittr_2.0.1      sass_0.4.0         
## [55] tibble_3.1.6        cluster_2.1.2       crayon_1.4.2       
## [58] pkgconfig_2.0.3     ellipsis_0.3.2      assertthat_0.2.1   
## [61] rmarkdown_2.11      iterators_1.0.13    R6_2.5.1           
## [64] compiler_4.1.2

Plotting in R

Introduction

Materials

Set the Working directory

Set working directory - in Rstudio

Set working directory - in the console

Motivation

Typical results

Data visualization

Data visualization: A view of every Points of View column

Design of data figures

Salience

Color blindness

A short intro to the most important data structures in R

Data frame

Vector

Two philosophies of plotting in R

base graphics

usage

ggplot2 package

usage

Exercise I

Solutions I

Our case study

Our tasks:

Displaying distributions - Base graphics

Histograms - Base graphics

Density plots - Base graphics

Displaying distributions - ggplot2

Histograms - ggplot2

Density plots - ggplot2

Exercise II

Solutions II

Visualising expression and changes in expression

Scatterplots - base graphics

ggplot2 and ggrepel

Volcano plots - base graphics

Volcano plots - ggplot2

Exercise III

Solutions III

Heatmaps - base graphics

Heatmaps and RColorBrewer

Heatmaps - ComplexHeatmap I

Heatmaps - ComplexHeatmap II

Saving your plots

Exercise IV

Solutions IV

Plots of functional enrichment

Bar plot

Bubble plot

Saving your plots - ggsave

ggsave()

Exercise V

Solutions V

Resources

sessionInfo