Introduction to R, Session 1
1 Overview
- Background to R
- Data types in R
- Reading and writing data in R
- Loading Libraries in R
2 Materials
Links to material and slides for this course can be found on github - Reproducible-R
Or can be downloaded as a zip archive from here - Download zip
3 Materials - Presentations, source code and practicals.
Once the zip file is unarchived, all presentations as HTML slides and pages, their R code, and HTML practical sheets will be available in the directories underneath.
- presentations/slides/ Presentations as an HTML slide show.
- presentations/singlepage/ Presentations as an HTML single page.
- presentations/rcode/ R code in presentations.
- exercises/ Practicals as HTML pages.
- answers/ Practicals with answers as HTML pages and R code solutions.
4 Set the Working directory
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived Reproducible-R folder in the RStudio menu
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/PathToMyDownload/Reproducible-R")
# e.g. setwd("~/Downloads/Reproducible-R")
5 Background to R
What is R?
R is a scripting language and environment for statistical computing.
Developed by Robert Gentleman and Ross Ihaka.
Inheriting much from S (Bell labs).
- Suited to high level data analysis
- Open source & cross platform
- Extensive graphics capabilities
- Diverse range of add-on packages
- Active community of developers
- Thorough documentation
6 Why should you use R ?
R comes with excellent “out-of-the-box” statistical and plotting capabilities.
R provides access to 1000s of packages (CRAN/MRAN/R-forge) which extend the basic functionality of R while maintaining high quality documentation.
In particular, Robert Gentleman developed the Bioconductor project where 100’s of packages are directly related to computational biology and analysis of associated high-throughput experiments.
7 How to get R
Freely available from R-project website.
RStudio provides an integrated development environment (IDE) which is freely available from RStudio site
We will be using RStudio and R already installed on your machines.
8 A quick tour of RStudio
- Four main panels
- Scripting panel
- R interface
- Environment and history
- Files, directories and help
9 Data types in R
- Simple calculations
- Variables
- Vectors
- Lists
- Matrices
- Data frames
10 Simple Calculations
> 3+1
## [1] 4
> 2*2
## [1] 4
> sqrt(25)-1
## [1] 4
11 Functions
The sqrt(25) call demonstrates the use of functions in R. A function performs a complex operation on its arguments and returns the result.
In R, arguments are provided to a function within the parentheses, ( ), that follow the function name. So sqrt(ARGUMENT) will provide the square root of the value of ARGUMENT.
Other examples of functions include min(), sum(), max().
Note multiple arguments are separated by a comma.
min(2,4,6)
## [1] 2
sum(2,4,6)
## [1] 12
max(2,4,6)
## [1] 6
11.1 Using functions
R has many useful functions “built in” and ready to use as soon as R is loaded.
An incomplete, illustrative list can be seen here
In addition to R standard functions, additional functionality can be loaded into R using libraries. These include specialised tools for areas such as sequence alignment, read counting etc.
If you need to see how a function works try ? in front of the function name.
?sqrt
11.2 Using functions (Arguments have names and order)
With functions such as min() and sqrt(), the arguments to be provided are obvious and the order of these arguments doesn't matter.
min(5,4,6)
## [1] 4
min(6,4,5)
## [1] 4
Many functions however have an order to their arguments. Try and look at the arguments for the dir() function using ?dir.
?dir
11.3 Using functions (Setting names for arguments)
Often we know the names of arguments but not necessarily their order. In cases where we want to be sure we specify the right argument, we provide names for the arguments used.
dir()
dir(full.names=T)
This also means we don’t have to copy out all the defaults for the arguments preceding it.
dir(full.names=T)
# Is equivalent to...
dir(".",NULL,FALSE,T)
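As a minimal sketch of the same idea with another built-in function, seq() accepts its from, to and by arguments in any order once they are named. The variable names byName and byPosition are chosen here purely for illustration.

```r
# Named arguments can be supplied in any order
byName <- seq(by=3, to=10, from=1)
# Unnamed arguments must follow the documented order: from, to, by
byPosition <- seq(1, 10, 3)
byName
# Both calls describe the same sequence
all.equal(byName, byPosition)
```

Naming arguments makes a call easier to read and protects against mistakes when the argument order is not obvious.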
12 Variables
As with other programming languages and even graphical calculators, R makes use of variables.
A variable stores a value as a letter or word.
In R, we make use of the assignment operator <-
x <- 10
Now x holds the value of 10
x
## [1] 10
x
## [1] 10
Variables can be altered in place
x <- 20
x
## [1] 20
Variables can be used just as the values they contain.
x + sqrt(25)
## [1] 25
Variables can be used to create new variables
y <- x + sqrt(25)
y
## [1] 25
13 Vectors
In R the most basic variable or data type is a vector. A vector is an ordered collection of values. The x and y variables we have previously assigned are examples of a vector of length 1.
x
## [1] 20
length(x)
## [1] 1
13.1 Vectors - Creating vectors
To create a multiple value vector we use the function c() to combine the supplied arguments into one vector.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
length(x)
## [1] 10
Vectors of continuous stretches of values can be created by the shortcut - :
y <- 6:10
y
## [1] 6 7 8 9 10
Other useful functions to create stretches of numeric vectors are seq() and rep(). The seq() function creates a sequence of numeric values from a specified start and end value, incrementing by a user-defined amount. The rep() function repeats a variable a user-defined number of times.
seq(from=1,to=5,by=2)
## [1] 1 3 5
rep(c(1,5,10),3)
## [1] 1 5 10 1 5 10 1 5 10
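Both functions take further arguments that are often handy. The sketch below assumes the each argument of rep() and the length.out argument of seq(); the variable names are illustrative only.

```r
# Repeat each element twice, rather than repeating the whole vector
repeatedEach <- rep(c(1,5,10), each=2)
repeatedEach
# Ask for a fixed number of evenly spaced values instead of a step size
evenlySpaced <- seq(from=0, to=1, length.out=5)
evenlySpaced
```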
13.2 Vectors - Indexing
Square brackets [] identify the position within a vector (the index). These indices can be used to extract relevant values from vectors.
x
## [1] 1 2 3 4 5 6 7 8 9 10
x[1]
## [1] 1
x[8]
## [1] 8
Indices can be used to extract values from multiple positions within a vector.
x[c(1,6)]
## [1] 1 6
Negative indices can be used to extract all positions except that specified
x[-5]
## [1] 1 2 3 4 6 7 8 9 10
13.3 Vectors - Indexing and replacement
We can use indices to modify a specific position in a vector
x
## [1] 1 2 3 4 5 6 7 8 9 10
x[5] <- -5
x
## [1] 1 2 3 4 -5 6 7 8 9 10
Indices can be specified using other vectors.
y
## [1] 6 7 8 9 10
x[y] <- 0
x
## [1] 1 2 3 4 -5 0 0 0 0 0
Remember!
Square brackets [] for indexing
x[1]
## [1] 1
Parentheses () for function arguments.
sqrt(4)
## [1] 2
13.4 Vectors - Arithmetic operations
Vectors in R can be used in arithmetic operations as seen with variables earlier. When a standard arithmetic operation is applied to a vector, the operation is applied to each position in the vector.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
y <- x*2
y
## [1] 2 4 6 8 10 12 14 16 18 20
Multiple vectors can be used within arithmetic operations.
x+y
## [1] 3 6 9 12 15 18 21 24 27 30
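One consequence worth knowing is what happens when the two vectors differ in length: R reuses (recycles) the shorter one. A minimal sketch, with illustrative variable names:

```r
longVector <- c(1,2,3,4)
shortVector <- c(10,20)
# shortVector is reused as 10,20,10,20 to match the length of longVector
recycledSum <- longVector + shortVector
recycledSum
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, so it is worth checking lengths with length() first.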
13.5 Vectors - Character vectors
So far we have only looked at numeric vectors or variables.
In R we can also create character vectors, again using the c() function. These vectors can be indexed just the same.
y <- c("ICTEM","CommonWealth","Wolfson")
y[2]
## [1] "CommonWealth"
Character vectors can be used to assign names to other vectors.
x <- c(1:3)
names(x) <- y
x
## ICTEM CommonWealth Wolfson
## 1 2 3
These named vectors may be indexed by a position's "name".
x[c("ICTEM","Wolfson")]
## ICTEM Wolfson
## 1 3
Index names missing from vectors will return the special value “NA”
x[c("Strand")]
## <NA>
## NA
A note on NA values
In R, like many languages, when a value in a variable is missing, the value is assigned an NA value.
Similarly, when a calculation cannot be performed, R will return a NaN value.
- NA - Not Available.
- NaN - Not A Number.
NA values allow R to handle missing data correctly but require different handling than standard numeric or character values. We will illustrate an example handling NA values later.
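As a brief preview, the sketch below shows two common ways of dealing with missing values: detecting them with is.na() and skipping them with the na.rm argument that many summary functions accept. The vector measurements is made up for illustration.

```r
measurements <- c(1, NA, 3)
# Which positions are missing?
missingPositions <- is.na(measurements)
missingPositions
# A single NA makes the mean NA ...
meanWithNA <- mean(measurements)
# ... unless missing values are removed first
meanWithoutNA <- mean(measurements, na.rm=TRUE)
meanWithoutNA
# A calculation with no defined result gives NaN, which is.na() also reports
notANumber <- 0/0
is.na(notANumber)
```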
13.6 Vectors - The unique() function
The unique() function can be used to retrieve all unique values from a vector.
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
unique(geneList)
## [1] "Gene1" "Gene2" "Gene3" "Gene4" "Gene5"
13.7 Vectors - Logical vectors
Logical vectors are a class of vector made up of TRUE/T or FALSE/F boolean values.
z <- c(T,F,T,F,T,F,T,F,T,F)
# z <- c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
z
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
Logical vectors can be used like an index to specify positions in a vector. TRUE values will return the corresponding position in the vector being indexed.
x <- 1:10
x[z]
## [1] 1 3 5 7 9
13.8 Vectors - The %in% operator
A common task in R is to subset one vector by the values in another vector.
The %in% operator in the context A %in% B creates a logical vector indicating whether each value in A matches any value in B.
This can then be used to subset the values within one character vector by those in a second.
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
secondGeneList <- c("Gene5","Gene3")
logical_index <- geneList %in% secondGeneList
logical_index
## [1] FALSE FALSE TRUE FALSE TRUE FALSE TRUE
geneList[logical_index]
## [1] "Gene3" "Gene5" "Gene3"
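The same logical vector can be reversed with ! to keep only the values that are absent from the second vector. This is a small sketch reusing the gene lists above; notInSecond is an illustrative name.

```r
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
secondGeneList <- c("Gene5","Gene3")
# %in% is evaluated first, then ! flips every TRUE/FALSE
notInSecond <- geneList[!geneList %in% secondGeneList]
notInSecond
```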
13.9 Vectors - Logical vectors from operators
Vectors may be evaluated to produce logical vectors. This can be very useful when using a logical vector to index.
Common examples are:
- == evaluates as equal.
- > and < evaluate as greater or less than respectively.
- >= and <= evaluate as greater than or equal or less than or equal respectively.
x <- 1:10
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x[x > 5]
## [1] 6 7 8 9 10
13.10 Vectors - Combining logical vectors
Logical vectors can be used in combination in order to index vectors. To combine logical vectors we can use some common R operators.
- & - Requires both logical values to be TRUE.
- | - Requires either logical value to be TRUE.
- ! - Reverses the logical value, so TRUE is FALSE and FALSE is TRUE.
x <- 1:10
!x > 4
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x > 4 & x < 7
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
x > 4 | x < 7
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Such combinations can allow for complex selection of a vector’s values.
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x[x > 4 & x < 7]
## [1] 5 6
x[x > 4 & !x < 7]
## [1] 7 8 9 10
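Logical vectors are also useful outside of indexing. As a hedged sketch, which() reports the positions that are TRUE and sum() counts them, because TRUE is treated as 1 and FALSE as 0. The name selected is illustrative.

```r
x <- 1:10
selected <- x > 4 & x < 7
# Positions at which the condition holds
selectedPositions <- which(selected)
selectedPositions
# Number of values meeting the condition
selectedCount <- sum(selected)
selectedCount
```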
13.11 Exercise!
Exercise on vectors can be found here
14 Matrices
14.1 Creating matrices
In programs such as Excel, we are used to tables.
All programming languages have a concept of a table. In R, the most basic table type is a matrix.
A matrix can be created using the matrix() function with the arguments of nrow and ncol specifying the number of rows and columns respectively.
narrowMatrix <- matrix(1:10, nrow=5, ncol=2)
narrowMatrix
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
wideMatrix <- matrix(1:10, nrow=2, ncol=5)
wideMatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
By default when creating a matrix using the matrix() function, the values fill the matrix by columns. To fill a matrix by rows the byrow argument must be set to TRUE.
wideMatrix <- matrix(1:10, nrow=2, ncol=5)
wideMatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
wideMatrixByRow <- matrix(1:10, nrow=2, ncol=5, byrow=TRUE)
wideMatrixByRow
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
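Filling by rows is closely related to transposing. As an illustrative check, the t() function swaps the rows and columns of the narrow matrix and gives the same result as the row-filled wide matrix.

```r
narrowMatrix <- matrix(1:10, nrow=5, ncol=2)
wideMatrixByRow <- matrix(1:10, nrow=2, ncol=5, byrow=TRUE)
# t() turns the 5 x 2 matrix into a 2 x 5 matrix
transposed <- t(narrowMatrix)
transposed
# The two matrices hold the same values in the same positions
identical(transposed, wideMatrixByRow)
```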
14.2 Finding dimensions
To find dimensions of a matrix, the dim() function will provide dimensions as the row then column number while nrow() and ncol() will return just row number and column number respectively.
dim(narrowMatrix)
## [1] 5 2
nrow(narrowMatrix)
## [1] 5
ncol(narrowMatrix)
## [1] 2
14.3 Joining vectors and matrices
A matrix can be created from multiple vectors or other matrices.
cbind() can be used to attach data to a matrix as columns.
x <- 1:10
y <- 11:20
z <- 21:22
newMatrix <- cbind(x,y)
newMatrix
## x y
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
## [4,] 4 14
## [5,] 5 15
## [6,] 6 16
## [7,] 7 17
## [8,] 8 18
## [9,] 9 19
## [10,] 10 20
rbind() can be used to bind data to a matrix as rows.
newerMatrix <- rbind(newMatrix,z)
newerMatrix
## x y
## 1 11
## 2 12
## 3 13
## 4 14
## 5 15
## 6 16
## 7 17
## 8 18
## 9 19
## 10 20
## z 21 22
14.4 Joining incompatible vectors and matrices
When creating a matrix using cbind() or matrix() from incompatible vectors, the shorter vector is recycled.
recycledMatrix2 <- matrix(1:5,ncol=2,nrow=3)
## Warning in matrix(1:5, ncol = 2, nrow = 3): data length [5] is not a sub-
## multiple or multiple of the number of rows [3]
recycledMatrix2
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 1
For the rbind() function, the longer vector is clipped.
recycledMatrix3 <- rbind(recycledMatrix2,c(1:5))
## Warning in rbind(recycledMatrix2, c(1:5)): number of columns of result is not a
## multiple of vector length (arg 2)
recycledMatrix3
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 1
## [4,] 1 2
14.5 Column and row names
As we have seen with vectors, matrices can be named. For matrices the naming is done by columns and rows using the colnames() and rownames() functions.
namedMatrix <- matrix(1:10,ncol=5,nrow=2)
colnames(namedMatrix) <- paste("Column",1:5,sep="_")
rownames(namedMatrix) <- paste("Row",1:2,sep="_")
namedMatrix
## Column_1 Column_2 Column_3 Column_4 Column_5
## Row_1 1 3 5 7 9
## Row_2 2 4 6 8 10
Information on matrix names can also be retrieved using the same functions.
colnames(namedMatrix)
## [1] "Column_1" "Column_2" "Column_3" "Column_4" "Column_5"
rownames(namedMatrix)
## [1] "Row_1" "Row_2"
14.6 Indexing
Selecting and replacing portions of a matrix can be done by indexing using square brackets [] much like for vectors.
When indexing matrices, two values may be provided within the square brackets separated by a comma to retrieve information on a matrix position.
The first value(s) corresponds to row(s) and the second to column(s).
- myMatrix[rowOfInterest,columnOfInterest]
narrowMatrix
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
Value of first column, second row
narrowMatrix[2,1]
## [1] 2
Similarly, whole rows or columns can be extracted. Single rows and columns will return a vector. When multiple columns or row indices are specified, a matrix is returned.
Values of second column (row index is empty!)
narrowMatrix[,2]
## [1] 6 7 8 9 10
Values of third row (column index is empty!)
narrowMatrix[3,]
## [1] 3 8
Values of second and third row (column index is empty!)
narrowMatrix[c(2,3),]
## [,1] [,2]
## [1,] 2 7
## [2,] 3 8
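If a single row or column should stay a matrix rather than collapse to a vector, the drop argument of the square brackets can be set to FALSE. A minimal sketch, with illustrative variable names:

```r
narrowMatrix <- matrix(1:10, nrow=5, ncol=2)
# Default behaviour: a single row is returned as a plain vector
asVector <- narrowMatrix[3,]
dim(asVector)
# drop=FALSE keeps the result as a 1 x 2 matrix
asMatrix <- narrowMatrix[3,,drop=FALSE]
dim(asMatrix)
```

This is useful in code that later calls nrow() or ncol() on the result, since those return NULL for a plain vector.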
14.6.1 Indexing by name
As with vectors, names can be used for indexing when present
colnames(narrowMatrix) <- paste("Column",1:2,sep="_")
rownames(narrowMatrix) <- paste("Row",1:5,sep="_")
narrowMatrix[,"Column_1"]
## Row_1 Row_2 Row_3 Row_4 Row_5
## 1 2 3 4 5
narrowMatrix["Row_1",]
## Column_1 Column_2
## 1 6
narrowMatrix["Row_1","Column_1"]
## [1] 1
14.6.2 Advanced indexing
As with vectors, matrices can be subset by logical vectors
narrowMatrix
## Column_1 Column_2
## Row_1 1 6
## Row_2 2 7
## Row_3 3 8
## Row_4 4 9
## Row_5 5 10
narrowMatrix[,1]
## Row_1 Row_2 Row_3 Row_4 Row_5
## 1 2 3 4 5
narrowMatrix[,1] < 5
## Row_1 Row_2 Row_3 Row_4 Row_5
## TRUE TRUE TRUE TRUE FALSE
narrowMatrix[narrowMatrix[,1] < 5,]
## Column_1 Column_2
## Row_1 1 6
## Row_2 2 7
## Row_3 3 8
## Row_4 4 9
14.7 Arithmetic operations
As with vectors, matrices can have arithmetic operations applied to cells, rows, columns or the whole matrix
narrowMatrix
## Column_1 Column_2
## Row_1 1 6
## Row_2 2 7
## Row_3 3 8
## Row_4 4 9
## Row_5 5 10
narrowMatrix[1,1]+2
## [1] 3
narrowMatrix[1,]+2
## Column_1 Column_2
## 3 8
mean(narrowMatrix)
## [1] 5.5
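Summaries per row or per column are a common next step. The sketch below uses the base R functions rowSums() and colMeans() on a fresh copy of the matrix; the result names are illustrative.

```r
narrowMatrix <- matrix(1:10, nrow=5, ncol=2)
# One total per row
rowTotals <- rowSums(narrowMatrix)
rowTotals
# One mean per column
columnMeans <- colMeans(narrowMatrix)
columnMeans
# Arithmetic on the whole matrix is applied to every cell
doubled <- narrowMatrix * 2
doubled[5,2]
```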
14.8 Replacement
As with vectors, matrices can have their elements replaced
narrowMatrix
## Column_1 Column_2
## Row_1 1 6
## Row_2 2 7
## Row_3 3 8
## Row_4 4 9
## Row_5 5 10
narrowMatrix[1,1] <- 10
narrowMatrix[,2] <- 1
narrowMatrix
## Column_1 Column_2
## Row_1 10 1
## Row_2 2 1
## Row_3 3 1
## Row_4 4 1
## Row_5 5 1
14.9 Matrices can contain only one data type
Matrices must be all one type (i.e. numeric or character).
Here, replacing one value with a character value will turn the numeric matrix into a character matrix.
narrowMatrix[,2] *2
## Row_1 Row_2 Row_3 Row_4 Row_5
## 2 2 2 2 2
narrowMatrix[1,1] <- "Not_A_Number"
narrowMatrix
## Column_1 Column_2
## Row_1 "Not_A_Number" "1"
## Row_2 "2" "1"
## Row_3 "3" "1"
## Row_4 "4" "1"
## Row_5 "5" "1"
narrowMatrix[,2] *2
## Error in narrowMatrix[, 2] * 2: non-numeric argument to binary operator
14.10 Exercise
Exercise on matrices can be found here
15 Factors
15.1 Creating factors
A special case of a vector is a factor.
Factors are used to store data which may be grouped in categories (categorical data). Specifying data as categorical allows R to properly handle the data and make use of functions specific to categorical data.
To create a factor from a vector we use the factor() function. Note that the factor now has an additional component called “levels” which identifies all categories within the vector.
vectorExample <- c("male","female","female","female")
factorExample <- factor(vectorExample)
factorExample
## [1] male female female female
## Levels: female male
levels(factorExample)
## [1] "female" "male"
15.2 summary() function
An example of the use of levels can be seen from applying the summary() function to the vector and factor examples
summary(vectorExample)
## Length Class Mode
## 4 character character
summary(factorExample)
## female male
## 3 1
15.3 Display order of levels
In our factor example, the levels have been displayed in alphabetical order. To adjust the display order of levels in a factor, we can supply the desired display order to the levels argument in the factor() function call.
factorExample <- factor(vectorExample,levels=c("male","female"))
factorExample
## [1] male female female female
## Levels: male female
summary(factorExample)
## male female
## 1 3
15.4 Nominal factors
In some cases there is no natural order to the categories such that one category is greater than the other (nominal data). In this case we can see that R is gender neutral.
factorExample <- factor(vectorExample,levels=c("male","female"))
factorExample[1] < factorExample[2]
## Warning in Ops.factor(factorExample[1], factorExample[2]): '<' not meaningful
## for factors
## [1] NA
15.5 Ordinal factors
In other cases there will be a natural ordering to the categories (ordinal data). A factor can be specified to be ordered using the ordered argument in combination with a specified levels argument.
factorExample <- factor(c("small","big","big","small"),ordered=TRUE,levels=c("small","big"))
factorExample
## [1] small big big small
## Levels: small < big
factorExample[1] < factorExample[2]
## [1] TRUE
15.6 Replacement
Unlike vectors, replacing elements within a factor isn’t so easy. While replacing one element with an established level is possible, replacing with a novel element will result in a warning.
factorExample <- factor(c("small","big","big","small"))
factorExample[1] <- c("big")
factorExample
## [1] big big big small
## Levels: big small
factorExample[1] <- c("huge")
## Warning in `[<-.factor`(`*tmp*`, 1, value = "huge"): invalid factor level, NA
## generated
factorExample
## [1] <NA> big big small
## Levels: big small
To add a new level we can use the levels() function.
levels(factorExample) <- c("big","small","huge")
factorExample[1] <- c("huge")
factorExample
## [1] huge big big small
## Levels: big small huge
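Typing out every level by hand is error prone, because levels() replaces levels by position. One hedged alternative is to append the new level to the existing ones, as sketched below with an illustrative factor named sizes.

```r
sizes <- factor(c("small","big","big","small"))
# Keep the existing levels and add one more at the end
levels(sizes) <- c(levels(sizes), "huge")
sizes[1] <- "huge"
sizes
# table() counts the values in each level, including newly added ones
sizeCounts <- table(sizes)
sizeCounts
```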
16 Data frames
We saw that with matrices you can only have one type of data. We tried to create a matrix with a character element and the entire matrix became a character.
In practice, we would want to have a table which is a mixture of types (e.g. a table with sample names (character), sample type (factor) and survival time (numeric)).
16.1 Creating Data Frames
In R, we make use of the data frame object which allows us to store tables with columns of different data types. To create a data frame we can simply use the data.frame() function.
patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName, Type=patientType,Survival_Time=survivalTime)
dfExample
## Name Type Survival_Time
## 1 patient1 male 1
## 2 patient2 female 30
## 3 patient3 male 2
## 4 patient4 female 20
16.2 Indexing and replacement
16.2.1 Using []
Data frames may be indexed just as matrices.
dfExample
## Name Type Survival_Time
## 1 patient1 male 1
## 2 patient2 female 30
## 3 patient3 male 2
## 4 patient4 female 20
dfExample[dfExample[,"Survival_Time"] > 10,]
## Name Type Survival_Time
## 2 patient2 female 30
## 4 patient4 female 20
16.2.2 Using $ to specify columns
Unlike matrices, it is possible to index a column by using the $ symbol.
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
dfExample$Survival_Time
## [1] 1 30 2 20
dfExample[dfExample$Survival_Time < 10,]
## Name Type Survival_Time
## 1 patient1 male 1
## 3 patient3 male 2
Using $ allows R to complete a partial column name (partial matching), which can speed up coding.
dfExample$Surv
## [1] 1 30 2 20
But this will not work.
dfExample[,"Surv"]
16.3 Creating new columns with $
The $ operator also allows for the creation of new columns for a data frame on the fly.
dfExample
## Name Type Survival_Time
## 1 patient1 male 1
## 2 patient2 female 30
## 3 patient3 male 2
## 4 patient4 female 20
dfExample$newColumn <- rep("newData",nrow(dfExample))
dfExample
## Name Type Survival_Time newColumn
## 1 patient1 male 1 newData
## 2 patient2 female 30 newData
## 3 patient3 male 2 newData
## 4 patient4 female 20 newData
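New columns are often computed from existing ones. The sketch below rebuilds a small version of the data frame and adds a logical column; the column name LongSurvival and the threshold of 10 are assumptions made for illustration only.

```r
dfExample <- data.frame(Name=c("patient1","patient2","patient3","patient4"),
                        Survival_Time=c(1,30,2,20))
# The comparison is evaluated for every row, giving one TRUE/FALSE per patient
dfExample$LongSurvival <- dfExample$Survival_Time > 10
dfExample
```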
16.4 Indexing and replacement
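Numeric columns in a data frame may always have their values replaced. Factor columns can only take values that are existing levels. In versions of R before 4.0.0, character vectors were converted to factors by default when a data frame was created, so replacing values in such columns with new values failed. From R 4.0.0 onwards, character vectors are kept as characters by default and can be replaced freely, as in the Name column here.
dfExample[dfExample[,"Survival_Time"] < 10,"Survival_Time"] <- 0
dfExample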
## Name Type Survival_Time newColumn
## 1 patient1 male 0 newData
## 2 patient2 female 30 newData
## 3 patient3 male 0 newData
## 4 patient4 female 20 newData
dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
## Name Type Survival_Time newColumn
## 1 patientX male 0 newData
## 2 patient2 female 30 newData
## 3 patientX male 0 newData
## 4 patient4 female 20 newData
16.5 Factors in data frames
It is possible to update factors in data frames just as with standard factors.
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)

levels(dfExample[,"Name"]) <- c(levels(dfExample[,"Name"]) , "patientX")
dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
## Name Type Survival_Time
## 1 patientX male 1
## 2 patient2 female 30
## 3 patientX male 2
## 4 patient4 female 20
16.6 Creating data frames without factors
If you wish to avoid using factors in data frames then the stringsAsFactors argument to the data.frame() function should be set to FALSE
dfExample <- data.frame(Name=patientName,
                        Type=patientType,
                        Survival_Time=survivalTime,
                        stringsAsFactors = F)
dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
## Name Type Survival_Time
## 1 patientX male 1
## 2 patient2 female 30
## 3 patientX male 2
## 4 patient4 female 20
16.7 Ordering with order() function
A useful function in R is order()
For numeric vectors, order() by default returns the indices of a vector in that vector’s increasing order. This behaviour can be altered by using the “decreasing” argument passed to order().
testOrder <- c(2,1,3)
testOrder
## [1] 2 1 3
testOrder[order(testOrder)]
## [1] 1 2 3
testOrder[order(testOrder,decreasing=T)]
## [1] 3 2 1
16.8 Ordering with NA values
When a vector contains NA values, these NA values will, by default, be placed last in ordering indices. This can be controlled by the na.last argument.
testOrder <- c(2,1,NA,3)
testOrder[order(testOrder,decreasing=T,na.last=T)]
## [1] 3 2 1 NA
testOrder[order(testOrder,decreasing=T,na.last=F)]
## [1] NA 3 2 1
16.9 Ordering data frames
Since the order() function returns an index of intended order for a vector, we can use it to order data frames by certain columns
dfExample
## Name Type Survival_Time
## 1 patientX male 1
## 2 patient2 female 30
## 3 patientX male 2
## 4 patient4 female 20
dfExample[order(dfExample$Surv, decreasing=T),]
## Name Type Survival_Time
## 2 patient2 female 30
## 4 patient4 female 20
## 3 patientX male 2
## 1 patientX male 1
We can also use order() to arrange a data frame by multiple columns by providing multiple vectors to the order() function. Ordering will be performed in order of arguments.
dfExample[order(dfExample$Type,
                dfExample$Survival,
                decreasing=T),]
## Name Type Survival_Time
## 3 patientX male 2
## 1 patientX male 1
## 2 patient2 female 30
## 4 patient4 female 20
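The decreasing argument above applies to every column at once. To sort one column upwards and another downwards, one common approach for numeric columns is to negate them, as in this sketch. The data frame is rebuilt here so the example is self-contained, and rowOrder is an illustrative name.

```r
dfExample <- data.frame(Name=c("patient1","patient2","patient3","patient4"),
                        Type=factor(c("male","female","male","female")),
                        Survival_Time=c(1,30,2,20))
# Type in increasing (level) order, then Survival_Time from largest to smallest
rowOrder <- order(dfExample$Type, -dfExample$Survival_Time)
rowOrder
dfExample[rowOrder,]
```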
16.10 Merging data frames
A common operation is to join two data frames by a column of common values.
dfExample <- data.frame(Name=patientName,
                        Type=patientType,
                        Survival_Time=survivalTime)
dfExample
## Name Type Survival_Time
## 1 patient1 male 1
## 2 patient2 female 30
## 3 patient3 male 2
## 4 patient4 female 20
dfExample2 <- data.frame(Name=patientName[1:3],
                         height=c(6.1,5.1,5.5))
dfExample2
## Name height
## 1 patient1 6.1
## 2 patient2 5.1
## 3 patient3 5.5
16.11 Merging data frames with merge()
To do this we can use the merge() function with the data frames as the first two arguments. We can then specify the columns to merge by with the by argument. To keep only data pertaining to values common to both data frames the all argument is set to FALSE.
mergedDF <- merge(dfExample,dfExample2,by=1,all=F)
mergedDF
## Name Type Survival_Time height
## 1 patient1 male 1 6.1
## 2 patient2 female 30 5.1
## 3 patient3 male 2 5.5
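For contrast, setting all to TRUE keeps rows that appear in only one of the data frames and fills the missing values with NA. A minimal sketch with cut-down versions of the two data frames; mergedAll is an illustrative name.

```r
dfExample <- data.frame(Name=c("patient1","patient2","patient3","patient4"),
                        Survival_Time=c(1,30,2,20))
dfExample2 <- data.frame(Name=c("patient1","patient2","patient3"),
                         height=c(6.1,5.1,5.5))
# patient4 has no height, so the merged row receives NA in that column
mergedAll <- merge(dfExample, dfExample2, by="Name", all=TRUE)
mergedAll
```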
16.12 Time for an exercise
Exercise on data frames can be found here
17 Lists
17.1 Creating lists
Lists are the final data-type we will look at.
In R, lists provide a general container which may hold any data types of unequal lengths as part of its elements. To create a list we can simply use the list() function with arguments specifying the data we wish to include in the list.
firstElement <- c(1,2,3,4)
secondElement <- matrix(1:10,nrow=2,ncol=5)
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))
myList <- list(firstElement,secondElement,thirdElement)
myList
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
##
## [[3]]
## colOne colTwo
## 1 1 One
## 2 2 Two
## 3 4 Three
## 4 5 Four
17.2 Named lists
Just as with vectors, list elements can be assigned names.
myNamedList <- list(First=firstElement,Second=secondElement,Third=thirdElement)
myNamedList
## $First
## [1] 1 2 3 4
##
## $Second
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
##
## $Third
## colOne colTwo
## 1 1 One
## 2 2 Two
## 3 4 Three
## 4 5 Four
17.3 Indexing
Lists, as with other data types in R, can be indexed. In contrast to other types, using [] on a list will subset the list to another list of selected indices. To retrieve an element from a list in R, two square brackets [[]] must be used.
myList <- list(firstElement,secondElement,thirdElement)
myList[1]
## [[1]]
## [1] 1 2 3 4
myList[[1]]
## [1] 1 2 3 4
As with data.frames, the $ sign may be used to extract named elements from a list
myNamedList$First
## [1] 1 2 3 4
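The difference between single and double brackets is easiest to see by checking the class of what comes back. This sketch uses a cut-down named list; the names singleBracket and doubleBracket are illustrative.

```r
myNamedList <- list(First=c(1,2,3,4), Second=matrix(1:10,nrow=2,ncol=5))
# Single brackets return a list containing the selected element
singleBracket <- myNamedList["First"]
class(singleBracket)
# Double brackets return the element itself
doubleBracket <- myNamedList[["First"]]
class(doubleBracket)
# The extracted element can then be indexed like any other vector
doubleBracket[2]
```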
17.4 Joining lists
Again, similar to vectors, lists can be joined together in R using the c() function
myNamedList <- list(First=firstElement,Second=secondElement,Third=thirdElement)
myNamedList <- c(myNamedList,list(fourth=c(4,4)))
myNamedList[c(1,4)]
## $First
## [1] 1 2 3 4
##
## $fourth
## [1] 4 4
17.5 Joining vectors to lists
Note that on the last slide we are joining two lists. If we joined a vector to a list, all elements of the vector would become list elements.
myList <- c(myList,c(4,4))
myList
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
##
## [[3]]
## colOne colTwo
## 1 1 One
## 2 2 Two
## 3 4 Three
## 4 5 Four
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 4
17.6 Flattening lists
Sometimes you will wish to “flatten” out a list. When a list contains compatible objects, i.e. a list of all one type, the unlist() function can be used. Note the maintenance of names with their additional suffixes.
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
myNamedList
## $First
## [1] 1 2 3
##
## $Second
## [1] 2 6 7
##
## $Third
## [1] 1 4 7
flatList <- unlist(myNamedList)
flatList[1:7]
## First1 First2 First3 Second1 Second2 Second3 Third1
## 1 2 3 2 6 7 1
17.7 Flattening lists to matrices
A common step is to turn a list of standard results into a matrix. This can be done in a few steps in R.
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
flatList <- unlist(myNamedList)
listAsMat <- matrix(flatList,
                    nrow=length(myNamedList),
                    ncol=3,
                    byrow=T,
                    dimnames=list(names(myNamedList)))
listAsMat
## [,1] [,2] [,3]
## First 1 2 3
## Second 2 6 7
## Third 1 4 7
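An alternative worth knowing, sketched here as one option rather than the course's method, is to pass the list to rbind() through do.call(). Each list element becomes a row and the element names become row names.

```r
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
# Equivalent to rbind(First=c(1,2,3), Second=c(2,6,7), Third=c(1,4,7))
boundMat <- do.call(rbind, myNamedList)
boundMat
```

This assumes every element has the same length; elements of differing lengths would be recycled with a warning.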
18 Reading and writing data in R
18.1 Reading data into R
18.1.1 Data from External sources
Most of the time, you will not be generating data in R but will be importing data from external files.
A standard format for this data is a table
Gene_Name | Sample_1.hi | Sample_2.hi | Sample_3.hi | Sample_4.low | Sample_5.low | Sample_1.low |
---|---|---|---|---|---|---|
Gene_a | 4.309631 | 3.162187 | 2.345090 | 3.546890 | 2.428225 | 3.364047 |
Gene_b | 3.822922 | 3.510406 | 4.151327 | 2.630857 | 3.661457 | 5.160221 |
Gene_c | 3.491185 | 3.715750 | 2.761014 | 2.564055 | 3.972775 | 3.241509 |
Gene_d | 3.939807 | 3.953388 | 2.553703 | 7.188691 | 8.318255 | 9.024141 |
Gene_e | 10.252551 | 9.199695 | 10.661764 | 2.489717 | 3.162538 | 2.489696 |
Gene_f | 8.719102 | 9.969657 | 10.715328 | 3.483089 | 3.967558 | 3.377990 |
Gene_g | 10.027842 | 10.257793 | 9.347402 | 9.822065 | 9.472598 | 7.850575 |
Gene_h | 10.600430 | 9.435366 | 8.806229 | 9.494210 | 8.736870 | 9.971543 |
18.1.2 Data from text file with read.table()
Tables from text files can be read with the read.table() function
Table <- read.table("data/readThisTable.csv",sep=",",header=T)
Table[1:4,1:3]
## Gene_Name Sample_1.hi Sample_2.hi
## 1 Gene_a 4.111851 3.837018
## 2 Gene_b 6.047822 5.683518
## 3 Gene_c 2.597068 3.316300
## 4 Gene_d 6.009197 5.927419
Here we have provided two arguments.
- sep argument specifies how columns are separated in our text file. (“,” for .csv, “\t” for .tsv)
- header argument specifies whether columns have headers.
18.1.3 Row names in read.table()
read.table() allows for significant control over reading files through its many arguments. Have a look at options by using ?read.table
The row.names argument can be used to specify a column to use as row names for the resulting data frame. Here we use the first column as row names.
Table <- read.table("data/readThisTable.csv",sep=",",header=T,row.names=1)
Table[1:4,1:3]
## Sample_1.hi Sample_2.hi Sample_3.hi
## Gene_a 4.111851 3.837018 4.360628
## Gene_b 6.047822 5.683518 4.315889
## Gene_c 2.597068 3.316300 3.681509
## Gene_d 6.009197 5.927419 2.244701
18.1.4 Setting factors from read.table()
As mentioned, data which is read into R through read.table() will be of data frame class.
To avoid character columns being converted into factors, we can specify the stringsAsFactors argument here.
Table <- read.table("data/readThisTable.csv",sep=",",header=T,stringsAsFactors=F)
Other very useful arguments for read.table() include:
- skip - To set number of lines to skip when reading.
- comment.char - To set the start identifier for lines not to be read.
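To try these two arguments without needing a file, the sketch below reads from a character string through the text argument of read.table(). The table contents and the comment line are made up for illustration.

```r
csvText <- "# exported by a hypothetical tool
Gene_Name,Sample_1.hi
Gene_a,4.1
Gene_b,6.0"
# Option 1: skip the first line and switch off comment handling
skipped <- read.table(text=csvText, sep=",", header=TRUE,
                      skip=1, comment.char="")
# Option 2: treat any line starting with # as a comment
commented <- read.table(text=csvText, sep=",", header=TRUE,
                        comment.char="#")
skipped
```

Both calls should give the same two-row table; skip suits files with a fixed number of preamble lines, while comment.char suits files where such lines are marked with a symbol.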
18.1.5 Data from other sources
The read.table() function can also read data from a URL (http).
URL <- "http://mrccsc.github.io/readThisTable.csv"
Table <- read.table(URL,sep=",",header=T)
Table[1:2,1:3]
## Gene_Name Sample_1.hi Sample_2.hi
## 1 Gene_a 4.111851 3.837018
## 2 Gene_b 6.047822 5.683518
And from the clipboard. (This is the Windows version.)
Table <- read.table(file="clipboard",sep=",",header=T)
18.1.6 Data from file columns
read.table() function will by default read every row and column of a file.
The scan() function allows for the selection of particular columns to be read into R and so can save memory when files are large.
x <- scan("data/readThisTable.csv",sep=",",
          what = c(list(""),rep(list(NULL), 6)),skip=1)
x[1:3]
## [[1]]
## [1] "Gene_a" "Gene_b" "Gene_c" "Gene_d" "Gene_e" "Gene_f" "Gene_g" "Gene_h"
##
## [[2]]
## NULL
##
## [[3]]
## NULL
18.2 Writing data in R
Once we have our data analysed in R, we will want to export it to a file.
The most common method is to use the write.table() function
write.table(Table,file="data/readThisTable.csv",sep=",")
Since our data has column names but no row names, I will provide the arguments col.names and row.names to write.table()
write.table(Table,file="data/readThisTable.csv", sep=",", row.names =F,col.names=T)
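A quick way to check an export is to read the file straight back in. This sketch writes a small made-up table to a temporary file created with tempfile(), so it does not touch the course data; outFile and readBack are illustrative names.

```r
Table <- data.frame(Gene_Name=c("Gene_a","Gene_b"),
                    Sample_1.hi=c(4.1,6.0))
# A temporary file path that is removed when the R session ends
outFile <- tempfile(fileext=".csv")
write.table(Table, file=outFile, sep=",", row.names=FALSE, col.names=TRUE)
# Read the file back with matching sep and header settings
readBack <- read.table(outFile, sep=",", header=TRUE)
all.equal(Table, readBack)
```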
18.3 Time for an exercise
Exercise on reading and writing data can be found here
19 Loading Libraries
19.1 What are libraries in R?
R libraries or packages are collections of functions developed by the community. R packages include
- reusable R functions
- the documentation for these functions and
- sample data to test their functionality.
19.2 What are repositories?
A repository is a place where packages are deposited by the community and from which you can install them. The two most popular repositories for R packages are:
- CRAN
- Bioconductor
19.3 How to install R packages from CRAN?
You can install the packages from CRAN using the function install.packages()
For example,
install.packages("babynames")
To load the installed packages and use them, you can use the library() function
For example,
library(babynames)
You can also install packages through the RStudio menu
-> Tools -> Install packages ..
19.4 How to install R packages from Bioconductor?
To install packages from Bioconductor, you first have to install the Bioconductor package manager “BiocManager”
To install it,
install.packages("BiocManager")
Then, you can install any packages from Bioconductor using the BiocManager::install() function
For example,
BiocManager::install("DESeq2")
To load the installed packages and use them, you can use the library() function
For example,
library(DESeq2)
20 How to install R packages from Bioconductor in older versions of R?
If you are using an older version of R (R < 3.5.0), you have to use biocLite to install Bioconductor packages.
To install packages from Bioconductor, you have to first source the “biocLite” script.
source("https://bioconductor.org/biocLite.R")
Then, you can install any packages from Bioconductor using the biocLite() function
For example,
biocLite("DESeq2")
To load the installed packages and use them, you can use the library() function
For example,
library(DESeq2)
20.1 How to get help about the packages?
You can get the documentation of the package by using the function help() or “??”
For example
help(babynames)
# Or you can use "??"
?? babynames
For more detailed information on each and every parameter of all the functions in a package, you can refer to the reference manual on the package webpage or use the browseVignettes() function.
For example,
browseVignettes("DESeq2")
You can see what libraries are available in the Packages panel or by the library() function with no arguments supplied
library()
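In scripts that are shared with others it can help to install a package only when it is missing. The sketch below uses requireNamespace() for that check. The package name stats is used only because it ships with R, so the example runs without a network connection; in practice you would substitute the package you need.

```r
# Name of the package to check; "stats" is a stand-in that is always available
packageName <- "stats"
# requireNamespace() returns FALSE rather than stopping if the package is absent
if (!requireNamespace(packageName, quietly=TRUE)) {
  install.packages(packageName)
}
# character.only=TRUE lets library() accept the name stored in a variable
library(packageName, character.only=TRUE)
```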
End of Session 1