Plotting in R

css: Rpress.css author: MRC London Institute of Medical Sciences (LMS) date: https://lmsbioinformatics.github.io/LMS_PlottingInR/ width: 1440 height: 1100 autosize: true font-import: font-family: ‘Slabo 27px’, serif;

Materials.

id: materials

All prerequisites, links to material and slides for this course can be found on GitHub.

* PlottingInR

Or they can be downloaded as a zip archive from here.

* Download zip

Before we start…

Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived LMS_PlottingInR/course folder using the RStudio menu

Session -> Set Working Directory -> Choose Directory


Set working directory - in the console

Use getwd() to see where your current directory is

getwd()

Use setwd() to set the working directory

setwd("/PathToMyDownload/LMS_PlottingInR/course")
# if you are working with Mac
# e.g. setwd("~/Downloads/LMS_PlottingInR/course")

# if you are working with Windows
# e.g. setwd("~\\Downloads\\LMS_PlottingInR\\course")
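A slightly more defensive version of the same step is sketched below. The path is only a placeholder (an assumption; adjust it to wherever you unarchived the course), and the check simply avoids the error that setwd() raises when the folder does not exist.

```r
# Hypothetical location of the unarchived course folder; change it to your own.
course_dir <- "~/Downloads/LMS_PlottingInR/course"

if (dir.exists(course_dir)) {
  # setwd() invisibly returns the previous working directory
  old_dir <- setwd(course_dir)
  message("Working directory is now: ", getwd())
} else {
  message("Folder not found, working directory unchanged: ", getwd())
}
```

Keeping the previous directory in old_dir makes it easy to go back later with setwd(old_dir).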

Plotting in R

type:section id:plotting

Introduction

R has excellent graphics and plotting capabilities. They are mostly found in the following three sources.

+ base graphics: easy to use and conceptually motivated by drawing on a canvas, but complicated plots become difficult or impossible to draw
+ the lattice package
+ the ggplot2 package

The lattice and ggplot2 packages are built on the grid graphics package, while the base graphics routines adopt a pen and paper model for plotting.

We will start with base graphics and then focus on ggplot2.

R Base Graphics

type:section id: baseGraph

Base Graphics

First we'll produce a very simple graph using the values in the following data.frame:

base_graph_df<- data.frame(sample_num=c(1:6),
                           treatment=c(0.02,1.8, 17.5, 55,75.7, 80),
                           control= c(0, 20, 40, 60, 80,100))

base_graph_df
  sample_num treatment control
1          1      0.02       0
2          2      1.80      20
3          3     17.50      40
4          4     55.00      60
5          5     75.70      80
6          6     80.00     100

Base Graphics

Plot the treatment with default parameters

?plot
* Usage

plot(x, y, ...)
plot(x=base_graph_df$sample_num, y=base_graph_df$treatment)

# or just 
plot(base_graph_df$sample_num, base_graph_df$treatment)

Line Plot

[Figure: default plot of treatment against sample_num, drawn as points]

Question

What will happen if we change the order of the arguments?

from

plot(base_graph_df$sample_num, base_graph_df$treatment)

to

plot(base_graph_df$treatment, base_graph_df$sample_num)


Question

Changing from

plot(base_graph_df$sample_num, base_graph_df$treatment)

to

plot(base_graph_df$treatment, base_graph_df$sample_num)

swaps the axes: plot() treats its first positional argument as x and its second as y, so treatment is now drawn on the x-axis and sample_num on the y-axis.

========================================================

plot(y= base_graph_df$treatment, x= base_graph_df$sample_num)
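Naming the arguments, as above, removes the dependence on their order. Two other ways of making the x/y roles explicit are sketched here: the formula interface of plot() and with(). The data.frame is re-created so that the snippet runs on its own.

```r
base_graph_df <- data.frame(sample_num = 1:6,
                            treatment  = c(0.02, 1.8, 17.5, 55, 75.7, 80),
                            control    = c(0, 20, 40, 60, 80, 100))

# Formula interface: read as "treatment as a function of sample_num",
# so treatment goes on the y-axis and sample_num on the x-axis.
plot(treatment ~ sample_num, data = base_graph_df)

# with() is another way to avoid repeating base_graph_df$
with(base_graph_df, plot(x = sample_num, y = treatment))
```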

=======================================================

Now, let's add a title, a line to connect the points, and some colour:

Plot treatment using blue points overlaid by a line

hint: look into the “type” argument

?plot

Arguments

  • type: what type of plot should be drawn. Possible types are

    "p" for points,
    "l" for lines,
    "b" for both,
    "c" for the lines part alone of "b",
    "o" for both ‘overplotted’,
    "h" for ‘histogram’ like (or ‘high-density’) vertical lines,
    "s" for stair steps,
    "S" for other steps, see ‘Details’ below,
    "n" for no plotting.
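To see what the type values listed above look like, the following sketch draws the same six points once per type in a grid of panels. The vectors repeat the treatment values from base_graph_df so the snippet is self-contained; "n" is left out because it draws nothing.

```r
sample_num <- 1:6
treatment  <- c(0.02, 1.8, 17.5, 55, 75.7, 80)

plot_types <- c("p", "l", "b", "c", "o", "h", "s", "S")

# Draw a 2 x 4 grid of panels and restore the previous layout afterwards
old_par <- par(mfrow = c(2, 4))
for (plot_type in plot_types) {
  plot(sample_num, treatment, type = plot_type,
       main = paste0('type = "', plot_type, '"'))
}
par(old_par)
```

If the plotting window is very small, R may complain that the figure margins are too large; enlarging the window usually resolves it.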

=======================================================

plot(base_graph_df$sample_num,base_graph_df$treatment, type="o", col="blue")

Create a title with a red, bold/italic font

hint: 1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol

title(main="Treatment", col.main="red", font.main=4)

Line Plot

[Figure: treatment drawn as blue points joined by a line, under the red bold-italic title "Treatment"]

========================================================

Now let's add a red line for the control column of the data.frame base_graph_df and specify the y-axis range directly so it will be large enough to fit the data:

base_graph_df$control
[1]   0  20  40  60  80 100
plot(base_graph_df$sample_num,base_graph_df$treatment, type="o", col="blue", ylim=c(0,100))
lines(base_graph_df$control, type="o", pch=0, lty="dashed", col="red")
title(main="Expression Data", col.main="red", font.main=4)

==========================================================


Plotting ‘character’ (pch) - symbol to use
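One way to get an overview of the available symbols is to draw them all. This is a minimal sketch, not the code behind the original slide; it covers the numeric symbols 0 to 25, and the grid layout and grey fill are arbitrary choices.

```r
pch_values <- 0:25

# Lay the 26 symbols out on a 2-row grid: 13 symbols per row
grid_x <- pch_values %% 13
grid_y <- pch_values %/% 13

# bg sets the fill colour, which only affects symbols 21 to 25
plot(grid_x, grid_y, pch = pch_values, cex = 2, bg = "grey",
     xlim = c(-0.5, 12.5), ylim = c(-0.8, 1.5),
     axes = FALSE, xlab = "", ylab = "", main = "pch = 0 to 25")

# Print each symbol's number underneath it
text(grid_x, grid_y - 0.4, labels = pch_values, cex = 0.8)
```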


The line type - lty

lty can be c(“blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, “twodash”) or number c(0, 1, 2, 3, 4, 5, 6)
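The correspondence between the names and the numbers can be checked by drawing one line per type. Again this is only a sketch under the assumption that a simple labelled list is enough; the coordinates are arbitrary.

```r
line_types <- c("blank", "solid", "dashed", "dotted",
                "dotdash", "longdash", "twodash")

# Empty canvas: type = "n" sets up the coordinates without drawing anything
plot(c(0, 1), c(0, 6), type = "n", axes = FALSE,
     xlab = "", ylab = "", main = "Line types (lty)")

for (i in seq_along(line_types)) {
  lty_number <- i - 1   # the names correspond to the numbers 0 to 6
  # "blank" (0) is invisible, so its row shows a label but no line
  lines(c(0.4, 1), c(lty_number, lty_number), lty = line_types[i])
  text(0, lty_number, adj = 0,
       labels = paste0(lty_number, ' = "', line_types[i], '"'))
}
```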


change axes labels, colour and add legend

Next let’s change the axes labels to match our data and add a legend.

We'll also compute the y-axis limits using the range function so any changes to our data will be automatically reflected in our graph.

g_range <- range(0, base_graph_df$treatment, base_graph_df$control)
g_range
[1]   0 100

range returns a vector containing the minimum and maximum of all the given arguments.

plot(base_graph_df$sample_num ,base_graph_df$treatment, 
     type="o", col="blue", 
     ylim=g_range,axes=FALSE, ann=FALSE)

========================================================

Make the x axis using sample_num as tick labels

axis(1, at=1:6, labels=base_graph_df$sample_num)

Make the y axis with horizontal labels and ticks every 20 units

axis(2, las=1, at=seq(g_range[1],g_range[2],20))

Create box around plot

box()

========================================================

Plot control vector with red dashed line and square points

lines(base_graph_df$control, type="o", pch=0, lty=2, col="red")

Create a title with a red, bold/italic font

title(main="Expression Data", col.main="red", font.main=4)

Label the x and y axes with purple text

title(xlab="Samples", col.lab="purple")
title(ylab="Values", col.lab="purple")

========================================================

Create a legend at (1, g_range[2]) that is slightly smaller (cex) and uses the same line colors and points used by the actual plots

legend(1, g_range[2], c("treatment","control"), cex=0.8, col=c("blue","red"), pch=1:0, lty=1:2) 


ggplot2 R package

type:section id:ggplot2

ggplot2 is a powerful R package based on the grammar of graphics (Wilkinson, 2005).

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).” - Wickham, 2016

GoT_dataset

We'll use the GoT_dataset that was kindly provided by Dr. Reidar P. Lystad (Inj Epidemiol. 2018. 5(1):44. doi: 10.1186/s40621-018-0174-7).


GoT_dataset

We will use "GoT_dataset/episode_data.csv" and "GoT_dataset/subset_GoT.csv" today.

We can use the head function to look at the first few rows of file "episode_data.csv"

[csv: comma-separated values file]

library(ggplot2)
episode_data<-read.csv("GoT_dataset/episode_data.csv")

# show first 6 rows for this dataset
head(episode_data)
  season episode_number                            episode_name
1      1              1                      "Winter Is Coming"
2      1              2                         "The Kingsroad"
3      1              3                             "Lord Snow"
4      1              4 "Cripples, Bastards, and Broken Things"
5      1              5                 "The Wolf and the Lion"
6      1              6                        "A Golden Crown"
  gross_running_time opening_credits_time closing_credits_time
1               3546                  110                   33
2               3182                  111                   34
3               3294                   96                   27
4               3201                   96                   26
5               3123                  101                   24
6               3027                  103                   26
  net_running_time cumulative_net_running_time
1             3403                        3403
2             3037                        6440
3             3171                        9611
4             3079                       12690
5             2998                       15688
6             2898                       18586

GoT_dataset

str() is another useful function; it shows the structure of the episode_data object

str(episode_data)
'data.frame':   73 obs. of  8 variables:
 $ season                     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ episode_number             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ episode_name               : Factor w/ 73 levels " \"What Is Dead May Never Die\"",..: 71 47 25 13 65 2 73 56 6 17 ...
 $ gross_running_time         : int  3546 3182 3294 3201 3123 3027 3325 3345 3238 3028 ...
 $ opening_credits_time       : int  110 111 96 96 101 103 105 105 116 116 ...
 $ closing_credits_time       : int  33 34 27 26 24 26 26 28 32 32 ...
 $ net_running_time           : int  3403 3037 3171 3079 2998 2898 3194 3212 3090 2880 ...
 $ cumulative_net_running_time: int  3403 6440 9611 12690 15688 18586 21780 24992 28082 30962 ...

========================================================

Every ggplot2 plot has three key components:

1. data,

2. aesthetic mappings between variables in the data and visual
properties, and

3. layer: usually created with a geom function.

Usage

?ggplot
ggplot(data = NULL, mapping = aes(), ...,
  environment = parent.frame())

========================================================

Use ggplot2's ggplot() function to set up the data and aesthetic mappings

g<-ggplot(data=episode_data, 
          aes(x=gross_running_time,y=net_running_time))

print(g)
[Figure: an empty panel showing only the axes, because no layer has been added yet]

add geom_point layer - Scatter plot

g<-ggplot(data=episode_data, 
          aes(x=gross_running_time,y=net_running_time))
g + geom_point()

add geom_histogram layer - Histogram plot (1/2)

ghis<-ggplot(data=episode_data, aes(x=net_running_time))
ghis + geom_histogram()

add geom_histogram layer - Histogram plot (2/2)

change binwidth

#ghis<-ggplot(data=episode_data, aes(x=net_running_time))
ghis + geom_histogram(binwidth=200)
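When neither argument is given, geom_histogram() falls back to 30 bins and prints a message suggesting a better value. The sketch below contrasts the two ways of controlling the bins. It uses made-up running times rather than the course data, so that it runs without the csv file; toy_episodes and its values are assumptions for illustration only.

```r
library(ggplot2)

# Made-up running times in seconds (NOT the course data), for illustration only
set.seed(1)
toy_episodes <- data.frame(
  net_running_time = round(rnorm(73, mean = 3200, sd = 400))
)

ghis_toy <- ggplot(data = toy_episodes, aes(x = net_running_time))

# Either fix the width of each bin (in the units of x, here seconds) ...
p_width <- ghis_toy + geom_histogram(binwidth = 200)

# ... or fix the number of bins and let ggplot2 work out the width
p_bins <- ghis_toy + geom_histogram(bins = 10)

print(p_width)
print(p_bins)
```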

add geom_density layer - Density plot (1/3)

add geom_density layer

ghis<-ggplot(data=episode_data, aes(x=net_running_time))
ghis + geom_density()

add geom_density layer - Density plot (2/3)

ghis<- ggplot(data=episode_data, 
              aes(x=net_running_time,fill=as.factor(season)))
ghis + geom_density() 

add geom_density layer - Density plot (3/3)

ghis<- ggplot(data=episode_data, 
              aes(x=net_running_time,fill=as.factor(season)))
ghis + geom_density(alpha=0.25) 

Bar plot (1/4) - geom_bar()

add geom_bar layer

episode_data$season<-as.factor(episode_data$season)
gbar<- ggplot(data=episode_data, aes(x=season))
gbar + geom_bar() 

Bar plot (2/4) - change labels for x,y axes and add title

use xlab(), ylab(), and ggtitle()

gbar<- ggplot(data=episode_data, aes(x=season))
gbar + geom_bar() + 
  xlab("Season")+ ylab("Number of episodes")+ ggtitle("Bar plot")

Bar plot (3/4) - change labels for x,y axes and add title

or just use labs()

gbar<- ggplot(data=episode_data, aes(x=season))
gbar + geom_bar() + 
  labs(x="Season",y="Number of episodes",title="Bar plot")

Bar plot (4/4) - coord_flip()

use coord_flip() to swap the x and y axes, so the bars are drawn horizontally

gbar + geom_bar() + coord_flip()

Exercise

Time for exercise -

Solutions

Box plot (1/8) - use geom_boxplot()

use the "GoT_dataset/subset_GoT.csv" dataset

subset_GoT<-read.csv(file="GoT_dataset/subset_GoT.csv")

# use head function to see first few rows (default = 6)
# we use the argument n=4 to limit the number of rows to be shown
head(subset_GoT, n=4)
   id         name sex        religion            occupation social_status
1 100 Waymar Royce   M Unknown/Unclear Boiled leather collar       Lowborn
2 101 Gared Tuttle   M Unknown/Unclear Boiled leather collar       Lowborn
3 102         Will   M Unknown/Unclear Boiled leather collar       Lowborn
4 103         Irri   F  Great Stallion Boiled leather collar       Lowborn
  allegiance_last allegiance_switched dth_flag exp_time_sec exp_time_hrs
1   Night's Watch                   N        1          342         0.10
2   Night's Watch                   N        1          405         0.11
3   Night's Watch                   N        1          692         0.19
4       Targaryen                   Y        1        48489        13.47

Box plot (2/8) - use geom_boxplot()

str(subset_GoT)
'data.frame':   359 obs. of  11 variables:
 $ id                 : int  100 101 102 103 104 105 106 107 108 109 ...
 $ name               : Factor w/ 357 levels "Adrack Humble",..: 341 89 344 125 136 33 262 266 260 76 ...
 $ sex                : Factor w/ 2 levels "F","M": 2 2 2 1 2 2 2 2 2 2 ...
 $ religion           : Factor w/ 8 levels "Drowned God",..: 8 8 8 3 6 6 2 8 6 6 ...
 $ occupation         : Factor w/ 3 levels "Boiled leather collar",..: 1 1 1 1 1 2 2 1 3 2 ...
 $ social_status      : Factor w/ 2 levels "Highborn","Lowborn": 2 2 2 2 1 1 1 2 1 1 ...
 $ allegiance_last    : Factor w/ 9 levels "Bolton","Frey",..: 5 5 5 8 5 7 7 7 7 6 ...
 $ allegiance_switched: Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
 $ dth_flag           : int  1 1 1 1 0 0 1 1 1 1 ...
 $ exp_time_sec       : int  342 405 692 48489 230347 230347 87621 45722 176937 27606 ...
 $ exp_time_hrs       : num  0.1 0.11 0.19 13.47 63.99 ...

Box plot (3/8) - use geom_boxplot()

add geom_boxplot layer

ggplot(data=subset_GoT,aes(x=social_status,y=exp_time_hrs))+
  geom_boxplot()

Box plot (4/8) - geom_point()

add another layer

ggplot(data=subset_GoT,aes(x=social_status,y=exp_time_hrs))+
  geom_boxplot() + geom_point()

Box plot (5/8) - position_jitter()

ggplot(data=subset_GoT,aes(x=social_status,y=exp_time_hrs))+
  geom_boxplot() + geom_point(position = position_jitter())
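By default position_jitter() moves points both horizontally and vertically and gives a different result on every run. A hedged sketch of a more controlled version follows. It uses made-up data in place of subset_GoT so that it runs without the csv file; toy_got and its values are assumptions.

```r
library(ggplot2)

# Made-up data (NOT subset_GoT), just to have two groups to plot
set.seed(42)
toy_got <- data.frame(
  social_status = rep(c("Highborn", "Lowborn"), each = 30),
  exp_time_hrs  = c(runif(30, 0, 60), runif(30, 0, 40))
)

p_jitter <- ggplot(data = toy_got, aes(x = social_status, y = exp_time_hrs)) +
  # hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # width: horizontal spread; height = 0 keeps the y values exact;
  # seed makes the random offsets reproducible
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

print(p_jitter)
```

Setting height = 0 matters when the y values are measurements: the points are spread sideways for visibility, but their vertical positions still show the real data.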

Box plot (6/8) - add colour

map the fill colour to occupation with fill=occupation

ggplot(data=subset_GoT,aes(x=social_status,y=exp_time_hrs,fill=occupation))+
  geom_boxplot()

Box plot (7/8) - facet_wrap()

add facet_wrap() layer

facet_wrap(~variable)

ggplot(data=subset_GoT,aes(x=social_status,y=exp_time_hrs,fill=occupation))+
  geom_boxplot()+facet_wrap(~sex)

Box plot (8/8) - facet_grid()

add facet_grid() layer

facet_grid(Rows.for.var1~Columns.for.var2)

ggplot(data=subset_GoT,aes(x=social_status,y=exp_time_hrs,fill=occupation))+
  geom_boxplot()+facet_grid(dth_flag~sex)

Saving your plots - ggsave

Saving your plots - ggsave (1/2)

data(mtcars)

ggplot(mtcars, aes(mpg, wt)) + geom_point()
ggsave("mtcars_default.pdf")
ggsave("mtcars.pdf", width = 4, height = 4)

Saving your plots - ggsave (2/2)

Usage

ggsave(filename, plot = last_plot(), device = NULL, path = NULL, scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"), dpi = 300, limitsize = TRUE, ...)

The plot argument is the plot to save; it defaults to the last plot displayed.

data(mtcars)

plot1<-ggplot(mtcars, aes(mpg, wt)) + geom_point()
plot2<-ggplot(mtcars, aes(mpg, wt,col=as.factor(vs))) + geom_point()

ggsave("mtcars_default.png",plot1)
ggsave("mtcars_col.png",plot2)
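The size and resolution arguments from the usage above can be combined as in this sketch. The output goes to a temporary file, which is an assumption made only so that the example leaves nothing behind; the dimensions are arbitrary.

```r
library(ggplot2)

plot1 <- ggplot(mtcars, aes(mpg, wt)) + geom_point()

# tempfile() keeps the example from cluttering the working directory;
# in practice you would pass your own file name
out_file <- tempfile(fileext = ".png")

# 12 cm x 8 cm at 300 dpi; the device (PNG) is guessed from the extension
ggsave(out_file, plot = plot1, width = 12, height = 8, units = "cm", dpi = 300)

file.exists(out_file)
```

Passing the plot object explicitly, rather than relying on last_plot(), avoids saving the wrong figure when several plots have been drawn.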

Useful Resources

type:section id:useful

Cheatsheet


    https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Data visualization: A view of every Points of View column


    http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html

Data visualization - Design of data figures


Data visualization - Salience


Data visualization - Color blindness


Summary

Base Graphics VS ggplot2 (1/5)

Base Graphic 1

base_graph_df
  sample_num treatment control
1          1      0.02       0
2          2      1.80      20
3          3     17.50      40
4          4     55.00      60
5          5     75.70      80
6          6     80.00     100

Base Graphics VS ggplot2 (2/5)

Base Graphic 2

plot(base_graph_df$sample_num, base_graph_df$treatment,
     type="o", col="blue",
     ylim=g_range, axes=FALSE, ann=FALSE)
axis(1, at=1:6, labels=base_graph_df$sample_num)
axis(2, las=1, at=seq(g_range[1],g_range[2],20))
box()

lines(base_graph_df$control, type="o", pch=0, lty=2, col="red")
title(main="Expression Data", col.main="red", font.main=4)
title(xlab="Samples", col.lab="purple")
title(ylab="Values", col.lab="purple")
legend(1, g_range[2], c("treatment","control"), cex=0.8, col=c("blue","red"), pch=1:0, lty=1:2)

Base Graphics VS ggplot2 (3/5)

ggplot2 - prepare the data.frame

# convert the data.frame into the long format that ggplot2 expects
# install.packages("reshape2")
library("reshape2")

base_graph_4gg<-melt(base_graph_df, id.vars="sample_num")
base_graph_4gg$variable<-relevel(base_graph_4gg$variable,ref="control")
head(base_graph_4gg,n=10)
   sample_num  variable value
1           1 treatment  0.02
2           2 treatment  1.80
3           3 treatment 17.50
4           4 treatment 55.00
5           5 treatment 75.70
6           6 treatment 80.00
7           1   control  0.00
8           2   control 20.00
9           3   control 40.00
10          4   control 60.00
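To make clear what melt() has done, the same long table can be assembled by hand in base R. This is only an illustrative sketch of the reshaping idea, not how reshape2 is implemented; long_by_hand and value_cols are names invented for the example.

```r
base_graph_df <- data.frame(sample_num = 1:6,
                            treatment  = c(0.02, 1.8, 17.5, 55, 75.7, 80),
                            control    = c(0, 20, 40, 60, 80, 100))

value_cols <- c("treatment", "control")

# Stack the value columns on top of each other and record where each
# value came from in a "variable" column
long_by_hand <- data.frame(
  sample_num = rep(base_graph_df$sample_num, times = length(value_cols)),
  variable   = factor(rep(value_cols, each = nrow(base_graph_df)),
                      levels = value_cols),
  value      = unlist(base_graph_df[value_cols], use.names = FALSE)
)

# Same step as in the slide: make "control" the first factor level
long_by_hand$variable <- relevel(long_by_hand$variable, ref = "control")

head(long_by_hand, n = 10)
```

Each row now holds one measurement, which is what lets ggplot2 map the variable column to colour, shape and line type.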

Base Graphics VS ggplot2 (4/5)

ggplot2 - plot the figure with default settings

library("ggplot2")

ggplot(base_graph_4gg,aes(x=sample_num,y=value,col=variable,group=variable)) +
  geom_point(aes(shape=variable))+
  geom_line(aes(linetype=variable))+
  labs(title="Expression Data",x ="Sample", y = "Values")

Base Graphics VS ggplot2 (5/5)

ggplot2 - plot the figure that matches base graphics

ggplot(base_graph_4gg,aes(x=sample_num,y=value,col=variable,group=variable)) +
  geom_point(aes(shape=variable))+
  geom_line(aes(linetype=variable))+
  scale_color_manual(values=c("red", "blue"))+
  scale_shape_manual(values=c(0,1))+
  scale_linetype_manual(values=c("dashed","solid"))+
  labs(title="Expression Data",x ="Samples", y = "Values")+
  theme_classic()+
  theme(plot.title = element_text(colour = "red",face="bold.italic",hjust = 0.5),
        axis.title = element_text(colour = "purple"))

Exercise

Time for exercise -

Solutions