Nearest gene annotation

Almost all annotation software calculate the distance of a peak to the nearest TSS and assign the peak to that gene. This can be misleading, as binding sites might be located between two start sites of different genes or hit different genes which have the same TSS location in the genome.

The function annotatePeak provides option to assign genes with a max distance cutoff and all genes within this distance were reported for each peak.

> anno = annotatePeak(getSampleFiles()[[4]], addFlankGeneInfo=TRUE, flankDistance=5000)
>> loading peak file...              2014-09-30 01:12:09 PM 
>> preparing features information...         2014-09-30 01:12:09 PM 
>> identifying nearest features...       2014-09-30 01:12:10 PM 
>> calculating distance from peak to TSS...  2014-09-30 01:12:11 PM 
>> assigning genomic annotation...       2014-09-30 01:12:11 PM 
>> adding flank feature information from peaks...2014-09-30 01:12:37 PM 
>> assigning chromosome lengths          2014-09-30 01:12:39 PM 
>> done...                   2014-09-30 01:12:39 PM 
> head(as.GRanges(anno))
GRanges object with 6 ranges and 14 metadata columns:
      seqnames             ranges strand |          V4        V5
                      |     
  [1]     chr1 [ 815092,  817883]      * | MACS_peak_1    295.76
  [2]     chr1 [1243287, 1244338]      * | MACS_peak_2     63.19
  [3]     chr1 [2979976, 2981228]      * | MACS_peak_3    100.16
  [4]     chr1 [3566181, 3567876]      * | MACS_peak_4    558.89
  [5]     chr1 [3816545, 3818111]      * | MACS_peak_5     57.57
  [6]     chr1 [6304864, 6305704]      * | MACS_peak_6     54.57
             annotation  geneChr geneStart   geneEnd geneLength geneStrand
                    
  [1] Distal Intergenic     chr1    803451    812182       8732          -
  [2]  Promoter (<=1kb)     chr1   1243994   1247057       3064          +
  [3]  Promoter (<=1kb)     chr1   2976181   2980350       4170          -
  [4]  Promoter (<=1kb)     chr1   3547331   3566671      19341          -
  [5]  Promoter (<=1kb)     chr1   3816968   3832011      15044          +
  [6]  Promoter (1-2kb)     chr1   6304252   6305638       1387          +
           geneId transcriptId distanceToTSS
             
  [1]      284593   uc001abt.4         -5701
  [2]      126789   uc001aed.3             0
  [3]      440556   uc001aka.3             0
  [4]       49856   uc001ako.3             0
  [5]   100133612   uc001alg.3             0
  [6]      390992   uc009vly.2          1452
                                                                                                                                                               flank_txIds
                                                                                                                                                               
  [1]                                                                                                                                                           uc001abt.4
  [2] uc001aea.2;uc001aeb.2;uc001aec.1;uc001aed.3;uc010nyi.2;uc009vjx.3;uc009vjy.1;uc001aee.2;uc001aef.2;uc001aeg.2;uc001aeh.2;uc001aei.2;uc001aek.2;uc009vjz.2;uc010nyj.2
  [3]                                                                                                    uc001aka.3;uc010nzg.1;uc001akc.3;uc001ake.3;uc001akf.3;uc009vlh.3
  [4]                                                                   uc001ako.3;uc001akn.3;uc010nzi.2;uc001akp.3;uc021ofb.1;uc021ofc.1;uc021ofd.1;uc021ofe.1;uc021off.1
  [5]                                                                                                    uc001alg.3;uc010nzo.2;uc001alh.4;uc001alf.3;uc009vls.3;uc031pkz.1
  [6]                                                                                                                                                uc001amp.2;uc009vly.2
                                                                                        flank_geneIds
                                                                                          
  [1]                                                                                          284593
  [2] 116983;116983;116983;126789;126789;126789;54973;54973;54973;54973;54973;54973;54973;54973;54973
  [3]                                                           440556;440556;63976;63976;63976;63976
  [4]                                                 49856;49856;49856;7161;7161;7161;7161;7161;7161
  [5]                                              100133612;100133612;100133612;339448;339448;339448
  [6]                                                                                   387509;390992
                                                           flank_gene_distances
                                                                    
  [1]                                                                     -2910
  [2] -1978;-18;0;0;0;-128;8492;15729;15729;15729;15729;15729;15729;15729;15729
  [3]                                               0;0;-4514;-4514;-4514;-4514
  [4]                                 0;0;0;-1253;-1253;-1253;-1253;-1253;-1253
  [5]                                                               0;0;0;0;0;0
  [6]                                                                   15331;0
  -------
  seqinfo: 24 sequences from hg19 genome

Three extra columns, flank_txIds, flank_geneIds, flank_gene_distances were added if parameter addFlankGeneInfo is set to TRUE.

Genomic annotation

The annotation column annotates the genomic location of a given peak. Since some annotations overlap, they are assigned based on the following priority.

  • Promoter (defined by tssRegion parameter)
  • 5' UTR
  • 3' UTR
  • Exon
  • Intron
  • Downstream (immediate downstream of a gene, within 3kb)
  • Distal Intergenic

I have implemented plotAnnoPie and plotAnnoBar to visualize their distribution. I am wondering how to visualize the annotation overlap and found venn pie-agram can partially solve this issue. It’s hard/impossible to visualize all the overlap information within one vennpie plot.

I employed the vennpie to ChIPseeker, and now with the vennpie function, we can visualize some of the overlap annotations.

vennpie(anno)

In order to manage complex data, annotatePeak now output a S4 object. Users can use as.GRanges to convert it to as.GRanges object or as.data.frame to convert it to data.frame and then the output can be exported by write.table function.

Citation

Yu G, Wang LG and He QY*. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383.