KEGG MODULE is a collection of manually defined functional units, called KEGG modules and identified by the M numbers, used for annotation and biological interpretation of sequenced genomes. There are four types of KEGG modules:

  • pathway modules – representing tight functional units in KEGG metabolic pathway maps, such as M00002 (Glycolysis, core module involving three-carbon compounds)
  • structural complexes – often forming molecular machineries, such as M00072 (Oligosaccharyltransferase)
  • functional sets – for other types of essential sets, such as M00360 (Aminoacyl-tRNA synthases, prokaryotes)
  • signature modules – as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin)

Continue reading

To my knowledge, BioEdit is the most comprehensive biological sequence alignment editor. Most of my labmates run this software using Parallels Desktop. For some of them, BioEdit is the only reason to install Parallels Desktop.

I recently needed to edit an alignment, so I installed BioEdit on my iMac using Wine, a compatibility layer for running Windows applications on POSIX-compliant operating systems. Although Wine has been well known in the Linux community for many years, many OS X users have never heard of it.

Continue reading

Google Drive @ HKU

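Finding a good cloud drive has long been a headache for me. Dropbox is very good, but its space is limited, and the various cloud drives from mainland China are all rubbish. I gave Baidu Cloud a try, but Baidu really let me down and the experience was very poor. I later found a better solution: GitLab, where you can create an unlimited number of projects, each with 10 GB of space, which is far more generous than GitHub. The only drawback is that the .git folder also takes up a lot of space.

Only after more than two years at HKU did I discover that the HKU email account comes with unlimited Google Drive storage.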

Continue reading

I extended the subview function to support embedding an image file in a ggplot object.

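library(ggplot2)
library(ggtree)  # assumption: subview() was exported by ggtree when this was written

set.seed(123)
d <- data.frame(x=rnorm(10), y=rnorm(10))

# download the image to a temporary PNG file
imgfile <- tempfile(fileext=".png")
download.file("https://avatars1.githubusercontent.com/u/626539?v=3&u=e731426406dd3f45a73d96dd604bc45ae2e7c36f&s=140",
              destfile=imgfile, mode='wb')

# embed the image at the first data point, then draw the points
p <- ggplot(d, aes(x, y))
subview(p, imgfile, x=d$x[1], y=d$y[1]) + geom_point(size=5)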

Continue reading

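This post was written at the invitation of Taiyun Wei (@cloud_wei) and was first published in 2015 on 统计之都 (Capital of Statistics).

Phylogenetic trees look a lot like hierarchical clustering, so it is worth explaining some of the differences between the two.

Hierarchical clustering focuses on classification: it groups together items that are close to each other. Building a phylogenetic tree can also be seen as a clustering process, but its focus is on inferring evolutionary relationships and evolutionary distances.

The input to hierarchical clustering is a distance, such as the Euclidean or Manhattan distance, and items that are close are grouped together. Phylogenetic inference, in contrast, starts from an alignment of biological sequences (DNA or amino acids). The simplest approach is to count the number of mismatches between sequences, known as the Hamming distance (usually normalized by sequence length); once you have distances, hierarchical clustering methods can of course be applied. The simplest tree-building method is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), which is really just hierarchical clustering with average linkage. Hardly anyone uses it for phylogenetic inference nowadays. More commonly used is neighbor joining: if two nodes are both relatively far from all the other nodes yet relatively close to each other, they are neighbors, so neighbors are not necessarily the two nodes with the smallest distance between them. People who seriously work on evolution basically do not use this method either. The mainstream method today is maximum likelihood (ML), which uses an evolutionary model to estimate the topology and branch lengths, such that the estimate has the highest probability of producing the observed data (the multiple sequence alignment). Maximum parsimony and Bayesian inference are also used to build phylogenetic trees.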
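The link between Hamming distance, average linkage and UPGMA can be sketched in a few lines of base R. This is only a minimal illustration: the four toy sequences and their names are made up for the example, and a real analysis would start from an actual multiple sequence alignment.

```r
# Toy aligned sequences (hypothetical data; equal length, as in an alignment)
seqs <- c(
  A = "ACGTACGTAC",
  B = "ACGTACGTTC",
  C = "ACGAACGTTC",
  D = "TCGAACCTTG"
)

# Split each sequence into single characters: one row per sequence
chars <- do.call(rbind, strsplit(seqs, ""))

# Normalized Hamming distance: fraction of alignment columns that differ
n <- nrow(chars)
hamming <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    hamming[i, j] <- mean(chars[i, ] != chars[j, ])
  }
}

# UPGMA is average-linkage hierarchical clustering on these distances
upgma <- hclust(as.dist(hamming), method = "average")
plot(upgma, main = "UPGMA (average linkage) on Hamming distances")
```

Swapping `method = "average"` for another linkage gives an ordinary hierarchical clustering of the same distance matrix; neighbor joining and maximum likelihood need dedicated phylogenetics software and are not covered by this sketch.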

Continue reading

In response to the issue, I extended the covplot function to support viewing the coverage of a list of GRanges objects or BED files.

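library(ChIPseeker)

# sample peak files shipped with ChIPseeker
files <- getSampleFiles()

# read two of the peak files into a named GRangesList
peak <- GenomicRanges::GRangesList(CBX6=readPeakFile(files[[4]]),
                                   CBX7=readPeakFile(files[[5]]))

# plot the coverage of both peak sets
p <- covplot(peak)
print(p)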

Continue reading

Guangchuang Yu

Bioinformatics Professor @ SMU

Guangzhou