ChIPseeker for ChIP peak annotation
ChIPpeakAnno WAS the only R package for ChIP peak annotation. I used it for annotating peak in my recent study.
I found it does not consider the strand information of genes. I reported the bug to the authors, but they are reluctant to change.
So I decided to develop my own package, ChIPseeker, and it’s now available in Bioconductor.
> require(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> require(ChIPseeker)
> peakfile = getSampleFiles()[[4]]
> peakfile
[1] "/usr/local/Cellar/r/3.1.0/R.framework/Versions/3.1/Resources/library/ChIPseeker/extdata//GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
> peakAnno <- annotatePeak(peakfile, tssRegion=c(-3000, 1000), TxDb=txdb, annoDb="org.Hs.eg.db")
>> loading peak file... 2014-04-13 12:03:45 PM
>> preparing features information... 2014-04-13 12:03:45 PM
>> identifying nearest features... 2014-04-13 12:03:45 PM
>> calculating distance from peak to TSS... 2014-04-13 12:03:46 PM
>> assigning genomic annotation... 2014-04-13 12:03:46 PM
>> adding gene annotation... 2014-04-13 12:03:50 PM
>> assigning chromosome lengths 2014-04-13 12:03:50 PM
>> done... 2014-04-13 12:03:50 PM
> head(as.GRanges(peakAnno))
GRanges object with 6 ranges and 14 metadata columns:
seqnames ranges strand | V4 V5
|
[1] chr1 [ 815092, 817883] * | MACS_peak_1 295.76
[2] chr1 [1243287, 1244338] * | MACS_peak_2 63.19
[3] chr1 [2979976, 2981228] * | MACS_peak_3 100.16
[4] chr1 [3566181, 3567876] * | MACS_peak_4 558.89
[5] chr1 [3816545, 3818111] * | MACS_peak_5 57.57
[6] chr1 [6304864, 6305704] * | MACS_peak_6 54.57
annotation geneChr geneStart geneEnd geneLength geneStrand
[1] Distal Intergenic chr1 803451 812182 8732 -
[2] Promoter (<=1kb) chr1 1243994 1247057 3064 +
[3] Promoter (<=1kb) chr1 2976181 2980350 4170 -
[4] Promoter (<=1kb) chr1 3547331 3566671 19341 -
[5] Promoter (<=1kb) chr1 3816968 3832011 15044 +
[6] 3 UTR chr1 6304252 6305638 1387 +
geneId transcriptId distanceToTSS ENSEMBL SYMBOL
[1] 284593 uc001abt.4 -5701 ENSG00000230368 FAM41C
[2] 126789 uc001aed.3 0 ENSG00000169972 PUSL1
[3] 440556 uc001aka.3 0 ENSG00000177133 LINC00982
[4] 49856 uc001ako.3 0 ENSG00000116213 WRAP73
[5] 100133612 uc001alg.3 0 ENSG00000236423 LINC01134
[6] 390992 uc009vly.2 1452 ENSG00000173673 HES3
GENENAME
[1] family with sequence similarity 41, member C
[2] pseudouridylate synthase-like 1
[3] long intergenic non-protein coding RNA 982
[4] WD repeat containing, antisense to TP73
[5] long intergenic non-protein coding RNA 1134
[6] hes family bHLH transcription factor 3
---
seqlengths:
chr1 chr10 chr11 chr12 ... chr8 chr9 chrX
249250621 135534747 135006516 133851895 ... 146364022 141213431 155270560
annotatePeak function can accept peak file or GRanges object that contained peak information. The sample peakfile shown above is a sample output from MACS. All the information contained in peakfile will be kept in the output of annotatePeak.
The annotation column annotates genomic features of the peak, that is whether peak is located in Promoter, Exon, UTR, Intron or Intergenic. If the peak is annotated by Exon or Intron, more detail information will be provided. For instance, Exon (38885 exon 3 of 11) indicates that the peak is located at the 3th exon of gene 38885 (entrezgene ID) which contain 11 exons in total.
All the names of column that contains information of nearest gene annotated will start with gene in camelCaps style.
If parameter annoDb=“org.Hs.eg.db” is provided, extra columns of ENSEMBL, SYMBOL and GENENAME will be provided in the final output.
For more detail, please refer to the package vignette. Comments or suggestions are welcome.
——- below was added at April 30, 2014 ———–
After two weeks of ChIPseeker release in Bioconductor, I am very glad to receive an email from Cambridge Systems Biology Centre that they are using ChIPseeker in annotating Drosophila ChIP data.
Although I only illustrate examples in human, ChIPseeker do supports all species only if genomic information was available. :mrgreen: :mrgreen:
> require(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> require(ChIPseeker)
> txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
>
> chr <- c("chrX", "chrX", "chrX")
> start <- c(71349, 77124, 86364)
> end <- c(72835, 77836, 88378)
> peak <- GRanges(seqnames = chr, ranges = IRanges(start, end))
> peakAnno <- annotatePeak(peak,
+ tssRegion=c(-3000,100),
+ TxDb = txdb,
+ annoDb="org.Dm.eg.db")
>> preparing features information... 2014-04-30 21:28:27
>> identifying nearest features... 2014-04-30 21:28:30
>> calculating distance from peak to TSS... 2014-04-30 21:28:31
>> assigning genomic annotation... 2014-04-30 21:28:31
>> adding gene annotation... 2014-04-30 21:28:42
Loading required package: org.Dm.eg.db
Loading required package: DBI
>> assigning chromosome lengths 2014-04-30 21:28:43
>> done... 2014-04-30 21:28:43
> as.GRanges(peakAnno)
GRanges object with 3 ranges and 12 metadata columns:
seqnames ranges strand |
|
[1] chrX [71349, 72835] * |
[2] chrX [77124, 77836] * |
[3] chrX [86364, 88378] * |
annotation geneChr geneStart
[1] Promoter (<=1kb) chrX 72458
[2] Intron (FBtr0100135/FBgn0029128, intron 1 of 7) chrX 72458
[3] Intron (FBtr0100135/FBgn0029128, intron 2 of 7) chrX 72458
geneEnd geneLength geneStrand geneId transcriptId distanceToTSS
[1] 96056 23599 + FBgn0029128 FBtr0100135 0
[2] 96056 23599 + FBgn0029128 FBtr0100135 5378
[3] 96056 23599 + FBgn0029128 FBtr0100135 15920
ENTREZID SYMBOL GENENAME
[1] 3354987 tyn trynity
[2] 3354987 tyn trynity
[3] 3354987 tyn trynity
---
seqlengths:
chrX
22422827
>
Citation
Yu G, Wang LG and He QY*. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383.