Bioconductor Tutorial
1 Overview
- Introduction to Bioconductor
- GenomicRanges
- Reading Sequence alignments
- Annotations
- Exercises
2 Bioconductor
Bioconductor (BioC) is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomics data
- Started in 2001
- Gained popularity through microarray analysis packages
- Colloborative effort by developers across the world
- Distributed as R packages
- Two releases every year
3 Objectives
- Provide access to powerful statistical and graphical methods for analysing genomics data.
- Facilitate retrieval and integration of annotation data (GenBank, GO, Entrez Gene, PubMed).
- Allow the rapid development of extensible, interoperable, and scalable software.
- Promote high-quality documentation and reproducible research.
- Provide training in computational and statistical methods.
4 www.bioconductor.org
5 Bioconductor Release 3.14 - it works with R version 4.1.0
Software (2083)
- Provides implementation of analysis methods
AnnotationData (904)
- mapping between microarray probe, gene, pathway, gene ontology, homology and other annotations
- Representations of GO, KEGG and other annotations, and can easily access NCBI, Biomart, UCSC and other sources
ExperimentData (408)
- code, data and documentation for specific experiments or projects
Workflows (29)
- Common analysis work flows for genomics data
6 Installating BioC Packages
Installing & Using a package for R version < 3.5.0:
source("https://bioconductor.org/biocLite.R")
::biocLite(c("GenomicFeatures", "AnnotationDbi")) BiocInstaller
Installing & Using a package for R version after 3.5.0:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(c("GenomicFeatures", "AnnotationDbi")) BiocManager
7 Finding Help
- ?function
library("limma")
?lmFit
- Browse Vignettes browseVignettes(package=“limma”)
- Bioconductor Course Materials: http://www.bioconductor.org/help/course-materials/
- Bioconductor Support Forum - https://support.bioconductor.org/
Prints version information about R and all loaded packages
sessionInfo()
8 BioC packages for Sequencing Data Analysis
- Data structures: IRanges, GenomicRanges, Biostrings, BSgenome
- Input/Ouput: ShortRead, Rsamtools, GenomicAlignments and rtracklayer (GTF,GFF,BED)
- Annotation: GenomicFeatures, BSgenome, biomaRt, TxDb.*, org.*
- Alignment: Rsubread, Biostrings
- Accessing Database: SRAdb & GEOquery
- ChIP-seq peak identification, motif discovery and annotation: ChIPQC, chipseq, ChIPseqR, ChIPpeakAnno, DiffBind, rGADEM, BayesPeak, MotifDb, SeqLogo.
- RNA-seq and Differential expression analysis: Rsubread, GenomicAlignments, edgeR, DESeq2, DEXseq, goseq
- SNP: snpStats, SeqVarTools, GGtools
- Work-flows: ReportingTools, easyRNASeq, ArrayExpressHTS, oneChannelGUI
9 GenomicRanges
- Genomic Ranges provides data structure for efficiently storing genomic coordinates
- Collection of genes coordinates
- Transcription factor binding sites (ChIP-Seq peaks)
- Collection of aligned sequencing reads
- Builds on top of Interval Ranges (IRanges) package and lays foundation for sequencing analysis.
- IRanges are collection of integer interval and GenomicRanges extends IRanges by including chromosome and strand.
- Provides collection of functions for accessing and manipulating Genomic coordinates
- Use cases: Identifying TF binding overlap, counting sequencing reads overlap with a gene
- Main classes: GRanges and GRangesList
10 Run Length Encoding (Rle)
- Run length encoding is a data compression technique
- Efficiently encoding the redundant information
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("GenomicRanges") BiocManager
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
## re-install: 'GenomicRanges'
# Orginial vector
# chr1, chr2, chr2, chr2, chr1, chr1, chr3, chr3
library(GenomicRanges)
<- Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 2))
chr chr
## character-Rle of length 8 with 4 runs
## Lengths: 1 3 2 2
## Values : "chr1" "chr2" "chr1" "chr3"
The above Rle can be interpreted as a run of length 1 of chr1, followed by run length of 3 of chr2, followed by run length of 2 of chr1 and followed by run length of 2 of chr3
11 Constructing GRanges object
GRanges class represents a collection of genomic features with single start and end location on the genome. GRanges object can be cretated using GRanges function.
library("GenomicRanges")
<- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
gr1 ranges = IRanges(start=11:20, end = 50:59, names = head(letters,10)),
strand = Rle(c("-", "+", "-", "+", "-"), c(1,2, 2, 3, 2)),
score = 1:10, GC = runif(10,0,1))
gr1
## GRanges object with 10 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## a chr1 11-50 - | 1 0.459069
## b chr2 12-51 + | 2 0.629740
## c chr2 13-52 + | 3 0.194531
## d chr2 14-53 - | 4 0.511634
## e chr1 15-54 - | 5 0.920725
## f chr1 16-55 + | 6 0.453292
## g chr3 17-56 + | 7 0.628677
## h chr3 18-57 + | 8 0.309197
## i chr3 19-58 - | 9 0.945319
## j chr3 20-59 - | 10 0.909125
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
12 Constructing GRanges object
- The coordinates in GRanges are 1-based and left-most (start of a read will always be left-most coordinate of the read regardless of which strand the read aligned to).
- Additional data stored beyond genomic coordinates (separated by “|”) are called metadata. In our case metadata column contains score and GC content.
- Metadata columns are optional and can be extracted from GRanges object using mcols function.
mcols(gr1)
## DataFrame with 10 rows and 2 columns
## score GC
## <integer> <numeric>
## a 1 0.459069
## b 2 0.629740
## c 3 0.194531
## d 4 0.511634
## e 5 0.920725
## f 6 0.453292
## g 7 0.628677
## h 8 0.309197
## i 9 0.945319
## j 10 0.909125
13 Constructing GRanges object from data frame
<- read.table("./data/mm9Genes.txt",sep="\t",header=T)
mm9genes head(mm9genes)
## chr start end strand ens Symbol
## 1 chr9 105729415 105731415 - ENSMUSG00000043719 Col6a6
## 2 chr18 43636450 43638450 - ENSMUSG00000043424 Gm9781
## 3 chr7 70356455 70358455 - ENSMUSG00000030525 Chrna7
## 4 chrX 100818093 100820093 - ENSMUSG00000086370 B230206F22Rik
## 5 chr12 82881157 82883157 - ENSMUSG00000042724 Map3k9
## 6 chr7 152124774 152126774 - ENSMUSG00000070348 Ccnd1
<- GRanges(seqnames=mm9genes$chr,
mm9genes.GR ranges=IRanges(start=mm9genes$start,end=mm9genes$end),
strand=mm9genes$strand,
ENSID=mm9genes$ens,
Symbol=mm9genes$Symbol)
Another option: makeGRangesFromDataFrame()
Converting GRanges object to data frame: as.data.frame(GRanges)
14 Constructing GRanges object from data frame
head(mm9genes.GR)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | ENSID Symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr9 105729415-105731415 - | ENSMUSG00000043719 Col6a6
## [2] chr18 43636450-43638450 - | ENSMUSG00000043424 Gm9781
## [3] chr7 70356455-70358455 - | ENSMUSG00000030525 Chrna7
## [4] chrX 100818093-100820093 - | ENSMUSG00000086370 B230206F22Rik
## [5] chr12 82881157-82883157 - | ENSMUSG00000042724 Map3k9
## [6] chr7 152124774-152126774 - | ENSMUSG00000070348 Ccnd1
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
15 GenomicRangesList
- To represent hierarchical structured data, ex: Exons in a transcript
- List-like data structure
- Each element of the list is GRanges instance
<- GRanges(seqnames = Rle(c("chr1", "chr3","chr2", "chr1", "chr3"), c(1, 2,1, 2, 4)),
gr2 ranges = IRanges(start=55:64, end = 94:103, names = letters[11:20]),
strand = Rle(c("+", "-", "+", "-"), c(1, 4, 3, 2)),
score = 1:10, GC = runif(10,0,1))
<- GRangesList("Peak1" = gr1, "Peak2" = gr2) GRL
16 Operations on GenomicRanges
Operations | Functions/methods |
---|---|
Accessors | seqnames, start, end, ranges, strand, width, names, mcols, length |
Extraction | GR[i], GRL[[i]], head, tail |
Set operations | reduce, disjoin |
Overlaps | findOverlaps, subsetByOverlaps, countOverlaps, nearest, precede, follow |
Arithmetic | shift, resize, distance, distanceToNearest |
17 Operations on GenomicRanges
18 Operations on GenomicRanges
head(ranges(gr1))
## IRanges object with 6 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## a 11 50 40
## b 12 51 40
## c 13 52 40
## d 14 53 40
## e 15 54 40
## f 16 55 40
start(gr1)
## [1] 11 12 13 14 15 16 17 18 19 20
width(gr1)
## [1] 40 40 40 40 40 40 40 40 40 40
19 Operations on GenomicRanges
Subsetting GRanges
seqnames(gr1)=="chr1"] gr1[
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## a chr1 11-50 - | 1 0.459069
## e chr1 15-54 - | 5 0.920725
## f chr1 16-55 + | 6 0.453292
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Merge overlapping genomic ranges within the same GRanges object
reduce(gr1)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 16-55 +
## [2] chr1 11-54 -
## [3] chr2 12-52 +
## [4] chr2 14-53 -
## [5] chr3 17-57 +
## [6] chr3 19-59 -
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
20 Finding overlapping regions
- One of the common tasks in Sequencing data analysis
- Ex: Identifying transcription factor binding sites overlap with promoters
- findOverlaps function finds intervals overlap between two GRanges object.
- Usage: function(query,subject)
<- findOverlaps(gr1,gr2,ignore.strand=F)
gr1_overlaps gr1_overlaps
## Hits object with 5 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 6 1
## [2] 9 2
## [3] 9 3
## [4] 10 2
## [5] 10 3
## -------
## queryLength: 10 / subjectLength: 10
Output of findOverlaps is a ‘Hits’ object indicating which of the query and subject intervals overlap.
20.1 Finding overlapping regions
Convert the ‘Hits’ object to 2 column matrix using as.matrix(). Values in the first column are indices of the query and values in second column are indices of the subject.
<- as.matrix(gr1_overlaps)
gr1_overlaps.m "queryHits"], ] gr1[gr1_overlaps.m[,
## GRanges object with 5 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## f chr1 16-55 + | 6 0.453292
## i chr3 19-58 - | 9 0.945319
## i chr3 19-58 - | 9 0.945319
## j chr3 20-59 - | 10 0.909125
## j chr3 20-59 - | 10 0.909125
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
21 Finding overlapping regions
subsetByOverlaps extracts the query intervals overlap with subject intervals
subsetByOverlaps(gr1,gr2,ignore.strand=F)
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## f chr1 16-55 + | 6 0.453292
## i chr3 19-58 - | 9 0.945319
## j chr3 20-59 - | 10 0.909125
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
21.1 Finding overlapping regions - %in% and %over%
Alternate ways of find overlapping regions
%in% gr2] gr1[gr1
## GRanges object with 0 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
%over% gr2] gr1[gr1
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## f chr1 16-55 + | 6 0.453292
## i chr3 19-58 - | 9 0.945319
## j chr3 20-59 - | 10 0.909125
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Note the difference between %in% and %over%
22 Finding overlapping regions
Other interesting functions: nearest() and distanceToNearest()
distanceToNearest(gr1,gr2,ignore.strand=F)
## Hits object with 8 hits and 1 metadata column:
## queryHits subjectHits | distance
## <integer> <integer> | <integer>
## [1] 1 5 | 8
## [2] 4 4 | 4
## [3] 5 5 | 4
## [4] 6 1 | 0
## [5] 7 7 | 4
## [6] 8 7 | 3
## [7] 9 3 | 0
## [8] 10 3 | 0
## -------
## queryLength: 10 / subjectLength: 10
22.1 precede, follow
Find nearest range in gr2 that precede or follow each range in gr1
precede(gr1,gr2)
## [1] NA NA NA NA NA 6 7 7 NA NA
follow(gr1,gr2)
## [1] 5 NA NA 4 5 NA NA NA 9 9
precede, follow returns the index of previous/next ranges. Overlapping ranges are excluded
23 Counting overlapping regions
countOverlaps() tabulates number of subject intervals overlap with each interval in query, ex: counting number of sequencing reads overlap with genes in RNA-Seq
countOverlaps(gr1,gr2,ignore.strand=T) # note the strand!
## a b c d e f g h i j
## 0 0 0 0 0 1 1 2 2 2
24 Computing Coverage
coverage calculates how many ranges overlap with individual positions in the genome. coverage function returns the coverage as Rle instance.
coverage(gr1)
## RleList of length 3
## $chr1
## integer-Rle of length 55 with 6 runs
## Lengths: 10 4 1 35 4 1
## Values : 0 1 2 3 2 1
##
## $chr2
## integer-Rle of length 53 with 6 runs
## Lengths: 11 1 1 38 1 1
## Values : 0 1 2 3 2 1
##
## $chr3
## integer-Rle of length 59 with 8 runs
## Lengths: 16 1 1 1 37 1 1 1
## Values : 0 1 2 3 4 3 2 1
25 Time for Exercises!
26 File formats in NGS
- FASTQ (Fasta with quality)
- SAM/BAM - Sequence Alignment/Map format
- BED
- Wiggle/bedgraph
FASTQ
@ERR590398.1 HWI-ST1146:148:C2FVTACXX:8:1101:1172:2059
NAAAATGCATATTCCTAGCATACTTCCCAAACATACTGAATTATAATCTC+ERR590398.1 HWI-ST1146:148:C2FVTACXX:8:1101:1172:2059
A1BDFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJ
27 BAM/SAM
BAM/SAM consist two sections: Header and Alignment
Header section:
Meta data (reference genome, aligner), starts with “@”
28 BAM/SAM
Alignment section:
Fields | Values |
---|---|
QNAME | ID of the read (“query”) |
FLAG | alignment flags |
RNAME | ID of the reference (typically: chromosome name) |
POS | Position in reference (1-based, left side) |
MAPQ | Mapping quality (as Phred score) |
CIGAR | Alignment description (mismatch, gaps etc.) |
RNEXT | Mate/next read reference sequence name |
MPOS | Mate/next read position |
TLEN | observed Template Length |
SEQ | sequence of the read |
QUAL | quality string of the read |
29 Reading Sequence alignments (BAM/SAM)
Methods for reading BAM/SAM
- readAligned from ShortRead package – Accept multiple formats – BAM, export
- Reads all files in a directory
- Reads base call qualities, chromosome, position, and strand
- scanBam from Rsamtools package
- scanBam reads BAM files into list structure
- Options to select what fields and which records to import using ScanBamParam
- readGAlignments from GenomicAlignments package
30 Reading Sequence alignments (BAM/SAM)
library("Rsamtools")
<- system.file("extdata", "ex1.bam", package="Rsamtools")
BamFile <- IRangesList(seq1=IRanges(1000, 2000),seq2=IRanges(c(100, 1000), c(1000, 2000)))
which <- c("rname", "strand", "pos", "qwidth", "seq")
what <- ScanBamParam(which=which, what=what)
param <- scanBam(BamFile, param=param)
bamReads
length(bamReads) # Each element of the list corresponds to a range specified by the which argument in ScanBamParam
## [1] 3
names(bamReads[[1]]) # elements specified by the what and tag arguments to ScanBamParam
## [1] "rname" "strand" "pos" "qwidth" "seq"
31 Reading Sequence alignments (BAM/SAM)
ScanBamParam:
# Constructor
ScanBamParam(flag = scanBamFlag(), what = character(0), which,simpleCigar = FALSE,reverseComplement = FALSE,
tag = character(0), tagFilter = list(), mapqFilter=NA_integer_)
# Constructor helpers
scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = NA,
hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
isFirstMateRead = NA, isSecondMateRead = NA, isSecondaryAlignment = NA,
isNotPassingQualityControls = NA, isDuplicate = NA)
32 Reading Sequence alignments (BAM/SAM)
- GenomicRanges package defines the GAlignments class – a specialised class for storing set of genomic alignments (ex: sequencing data)
- Only BAM support now – future version may include other formats
- The readGAlignments function takes an additional argument, param allowing the user to customise which genomic regions and which fields to read from BAM -param can be constructed using ScanBamParam function
library(GenomicAlignments)
<- readGAlignments(BamFile) SampleAlign
32.1 Reading Sequence alignments (BAM/SAM)
We can also customise which regions to read
<- IRangesList(seq1=IRanges(1000, 2000),seq2=IRanges(1000, 2000))
region <- ScanBamParam(what=c("rname", "pos", "cigar","qwidth"),which=region)
param1 <- readGAlignments(BamFile,param=param1) SampleAlign1
33 Time for Exercises!
34 BioC Annotation Packages
Annotation packages can be broadly classified in to gene-centric and genome-centric.
Gene-centric annotation packages (AnnotationDbi):
- Organism level packages: contains gene annotation for entire organism. Follows “org.XX.YY.db” pattern (Ex: org.Hs.eg.db)
- General System biology data: KEGG.db (association between pathways and genes), GO.db (Gene ontology term and genes) and ReactomeDb
- Platform level packages: Annotation for a specific platform (ex: hgu133a.db for Affymetrix HGU133A microarray).
35 BioC Annotation Packages
Genomic-centric GenomicFeatures packages:
- TranscriptDB (TxDB) packages contains genomic coordiantes for transcripts specific to a genome build, ex: TxDb.Hsapiens.UCSC.hg19.knownGene. These packages allow access to various features on transcriptome, including exons, genes and transcripts coordinates
Web-based annotation services:
- biomaRt provides interface to query web-based `biomart’ resource for genes, sequence, SNPs, and etc.
36 AnnotationDbi Accessor Functions
- columns What kind of annotation available in AnnotationDb object.
- keytypes Displays which type of identifiers can be passed in to select function.
- keys returns keys (index) for the database contained in the AnnotationDb object. Used along with keytypes in select function to retrieve interested annotation
- select will retrieve the annotation data as a data.frame based on the supplied keys, keytypes and columns.
Note: Package name = Annotation object
We will explore how to retrieve annotation from gene-centric organism level annotation package (org.Hs.eg.db)
37 Accessing annotation from org.Hs.eg.db
Load the package and list the contents
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
To know more about the above identifier types
help(SYMBOL)
38 Accessing annotation from org.Hs.eg.db
Which keytypes can be used to query this database? keytypes (What is the difference between columns and keytypes?)
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
38.1 Accessing annotation from org.Hs.eg.db
If we want to extract few identifiers of a particular keytype, we can use keys function
head(keys(org.Hs.eg.db, keytype="SYMBOL"))
## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
We can extract other annotations for a particular identifier using select function
select(org.Hs.eg.db, keys = "A1BG", keytype = "SYMBOL", columns = c("SYMBOL", "GENENAME", "CHR") )
## Warning in .deprecatedColsMessage(): Accessing gene location information via 'CHR','CHRLOC','CHRLOCEND' is
## deprecated. Please use a range based accessor like genes(), or select()
## with columns values like TXCHROM and TXSTART on a TxDb or OrganismDb
## object instead.
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL GENENAME CHR
## 1 A1BG alpha-1-B glycoprotein 19
39 Annotating results - Example
How can we annotate our results (ex: RNA-Seq differential expression analysis results)?
First, we will load an example results:
load(system.file("extdata", "resultTable.Rda", package="AnnotationDbi"))
head(resultTable)
## logConc logFC LR.statistic PValue FDR
## 100418920 -9.639471 -4.679498 378.0732 3.269307e-84 2.613484e-80
## 100419779 -10.638865 -4.264830 291.1028 2.859424e-65 1.142912e-61
## 100271867 -11.448981 -4.009603 222.3653 2.757135e-50 7.346846e-47
## 100287169 -11.026699 -3.486593 206.7771 6.934967e-47 1.385953e-43
## 100287735 -11.036862 3.064980 204.1235 2.630432e-46 4.205535e-43
## 100421986 -12.276297 -4.695736 190.5368 2.427556e-43 3.234314e-40
Rownames of the above dataframe are “Entrez gene identifiers” (human). We will extract gene symbol for these Entrez gene identifiers from org.Hs.eg.db package using select fucntion.
40 Annotating results - Example
<- select(org.Hs.eg.db, keys = rownames(resultTable), keytype = "ENTREZID", columns = "SYMBOL") SYM
## 'select()' returned 1:1 mapping between keys and columns
<- merge(resultTable,SYM,by.x=0,by.y=1)
NewResult head(NewResult)
## Row.names logConc logFC LR.statistic PValue FDR
## 1 100127888 -10.57050 2.758937 182.8937 1.131473e-41 1.130624e-38
## 2 100131223 -12.37808 -4.654318 179.2331 7.126423e-41 6.329847e-38
## 3 100271381 -12.06340 3.511937 188.4824 6.817155e-43 7.785191e-40
## 4 100271867 -11.44898 -4.009603 222.3653 2.757135e-50 7.346846e-47
## 5 100287169 -11.02670 -3.486593 206.7771 6.934967e-47 1.385953e-43
## 6 100287735 -11.03686 3.064980 204.1235 2.630432e-46 4.205535e-43
## SYMBOL
## 1 SLCO4A1-AS1
## 2 LOC100131223
## 3 RPS28P8
## 4 MPVQTL1
## 5 <NA>
## 6 TTTY13B
41 TranscriptDB (TxDb) packages
- TxDb packages provide access genomic coordinates to various transcript-related features from UCSC and Biomart data sources.
- TxDb objects contains relationship between mRNA transcripts, exons, CDS and their associated identifiers
- TxDb packages follows specific naming scheme, ex: TxDb.Mmusculus.UCSC.mm9.knownGene
We will explore TxDb package for Mouse mm9 genome from UCSC. We will first install the package and load in to our current working space.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("TxDb.Mmusculus.UCSC.mm9.knownGene") BiocManager
42 TxDb.Mmusculus.UCSC.mm9.knownGene
By default, the annotation object will have same name as package name. Create an alias for convenience.
library("TxDb.Mmusculus.UCSC.mm9.knownGene")
<- TxDb.Mmusculus.UCSC.mm9.knownGene txdb
Since TxDb are inherited from AnnotationDb object, we can use columns, keys, select, and keytypes functions.
<- c("100009600", "100009609", "100009614")
keys select(txdb,keys=keys,columns=c("GENEID","TXNAME"),keytype="GENEID")
## 'select()' returned 1:1 mapping between keys and columns
## GENEID TXNAME
## 1 100009600 uc009veu.1
## 2 100009609 uc012fog.1
## 3 100009614 uc011xhj.1
43 TxDb.Mmusculus.UCSC.mm9.knownGene
Most common operations performed on TxDb objects are retrieving exons, transcripts and CDS genomic coordinates. The functions genes, transcripts, exons, and cds return the coordinates for the group as GRanges objects.
<- transcripts(txdb)
TranscriptRanges 1:3] TranscriptRanges[
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 4797974-4832908 + | 1 uc007afg.1
## [2] chr1 4797974-4836816 + | 2 uc007afh.1
## [3] chr1 4847775-4887990 + | 3 uc007afi.2
## -------
## seqinfo: 35 sequences (1 circular) from mm9 genome
<- exons(txdb)
ExonRanges 1:2] ExonRanges[
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 4797974-4798063 + | 1
## [2] chr1 4798536-4798567 + | 2
## -------
## seqinfo: 35 sequences (1 circular) from mm9 genome
44 TxDb.Mmusculus.UCSC.mm9.knownGene
TxDb package also provides interface to discover how genomic features are related to each other. Ex: Access all transcripts or exons associated to a gene. Such grouping can be achieved by transcriptsBy, exonsBy, and cdsBy functions. The results are returned as GRangesList objects.
<- transcriptsBy(txdb, by = "gene")
Transcripts 1:2] Transcripts[
## GRangesList object of length 2:
## $`100009600`
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr9 20866837-20872369 - | 28943 uc009veu.1
## -------
## seqinfo: 35 sequences (1 circular) from mm9 genome
##
## $`100009609`
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr7 92088679-92112519 - | 23717 uc012fog.1
## -------
## seqinfo: 35 sequences (1 circular) from mm9 genome
45 TxDb.Mmusculus.UCSC.mm9.knownGene
Other interesting functions: intronsByTranscript, fiveUTRsByTranscript and threeUTRsByTranscript
GenomicFeatures also provides functions to create TxDb objects directly from UCSC and Biomart databases: makeTranscriptDbFromBiomart and makeTranscriptDbFromUCSC
install.packages("RMariaDB")
<- makeTxDbFromUCSC(genome = "mm9", tablename = "knownGene") USCmm9KnownGene
Save the annotation object and label them appropriately to facilitate reproducible research:
saveDb(txdb,file="Mouse_ucsc_mm9_20150204.sqlite")
<- loadDb("Mouse_ucsc_mm9_20150204.sqlite") txdb
46 Annotations from the web – biomaRt
biomaRt package offers access to biomart based online annotation resources (marts). Each mart has several datasets. getBM function can be used to retrieve annotation from the biomarts. Use the following functions to find values for the arguments in getBM
Function | Description |
---|---|
listMarts() | List the available biomart resources |
useMart() | select the mart |
listDatasets() | available dataset in the select biomart resource |
useDataset() | select a dataset in the select mart |
listAttributes() | available annotation attributes for the selected dataset |
listFiltersList() | available filters for the selected dataset |
47 biomaRt - Example
library("biomaRt")
<- listMarts() # List available marts
marts 1,] marts[
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 106
<- useMart("ensembl") # select ensembl
ensembl <- listDatasets(ensembl) # list datasets
ens_datasets <- useDataset("hsapiens_gene_ensembl",mart=ensembl) # select human dataset
ens_human <- listAttributes(ens_human) # list available annotation
ens_human_Attr <- listFilters(ens_human) # list availabel filters ens_human_filters
Extract genomic coordinates, ensembl gene identifier and gene symbol for genes in chromsome X
<- getBM(attributes = c("chromosome_name","start_position","end_position","ensembl_gene_id","strand","external_gene_name"), filter="chromosome_name",values="X", mart=ens_human) chrXGenes
48 Using rtracklayer for creating bigwig format files
library(rtracklayer)
<- RleList(chr1=Rle(c(0,0,0,0,0,1,1,1,0,0,0,0,0)),chr2=Rle(c(0,0,0,0,0,1,1,1,2,2,2,2,2)))
myRleList
export.bw(myRleList,con="chr1_Myc.bw")