Genetic drift is the term used in population genetics to refer to the statistical drift over time of gene frequencies in a population due to random sampling effects in the formation of successive generations. In a narrower sense, genetic drift refers to the expected population dynamics of neutral alleles (those defined as having no positive or negative impact on reproductive fitness), which are predicted to eventually become fixed at zero or 100% frequency in the absence of other mechanisms affecting allele distributions.

The most important keyword in the definition of genetic drift is random sampling effects. The figure belowed illustrates this idea. The surviving individuals do not necessarily have selection advantage. They are randomly selected.

Continue reading

@kaji331 compared cluserProfiler with GeneAnswers and found that clusterProfiler gives larger p values.

It eventually came out that he passed the input gene as numeric vector, which was supposed to be character and he used an old version of clusterProfiler which didn’t use as.character to coerce the input

But his comment forces me to test it.

Continue reading

enrichment map

In PLOB’s QQ group, someone asked how to change the color of enrichment map in Cytoscape. I am very curious how enrichment map can helps to interpret enrichment results. It took me 2 hours to implement it using R and I am surprised that the enrichment map is better than anticipated.

Continue reading

去年ubuntu下apt-get了R-3.0.2, 用了没多久就发现了system命令有问题,通常情况下调用系统命令是正常的,但是我调用bowtie的时候,就会报错: Warning: Could not open read file "/tmp/8156.inpipe1" for reading; skipping... Error: Encountered internal Bowtie 2 exception (#1) Command: /usr/bin/bowtie2-align --wrapper basic-0 -p 12 -x /ssd/genomes/hg19 -S tmp.sam -1 /tmp/8156.inpipe1 -2 /tmp/8156.inpipe2 这很明显是因为fasta.gz文件,bowtie需要调用zcat来读的,在R里调用bowtie就找不到好基友zcat的输出管道。当时没在意,R不干,那就找shell。 去年用NMF包的时候,报出了人类不友好的错误,联系了包作者Gaujoux,在作者的帮助下,找到了是doParallel包的问题: > library(doParallel) > Loading required package: foreach foreach: simple, scalable parallel programming from Revolution Analytics Use Revolution R for scalability, fault tolerance and more. http://www.revolutionanalytics.com Loading required package: iterators Loading required package: parallel > registerDoParallel(11) > > foreach(i = 1:10) %dopar% { getwd() } > *** caught segfault *** address 0x7fbeb6d399d0, cause 'memory not mapped' 其实parallel包中用mclapply也是同样报错。于是又把维护deb包的Dirk拉进来,Dirk是Rcpp的作者,高人就是不一样,看了错误,立刻就指出是BLAS库的问题。

Continue reading

主成分分析

使用《Bioinformatics and Molecular Evolution》Table 2.2 (page 24)中的数据:

> aa
  VOL  BULK POLARITY    PI HYD1  HYD2 SURFACE_AREA FRACT_AREA
A  67 11.50     0.00  6.00  1.8   1.6          113       0.74
R 148 14.28    52.00 10.76 -4.5 -12.3          241       0.64
N  96 12.28     3.38  5.41 -3.5  -4.8          158       0.63
D  91 11.68    49.70  2.77 -3.5  -9.2          151       0.62
C  86 13.46     1.48  5.05  2.5   2.0          140       0.91
Q 114 14.45     3.53  5.65 -3.5  -4.1          189       0.62
E 109 13.57    49.90  3.22 -3.5  -8.2          183       0.62
G  48  3.40     0.00  5.97 -0.4   1.0           85       0.72
H 118 13.69    51.60  7.59 -3.2  -3.0          194       0.78
I 124 21.40     0.13  6.02  4.5   3.1          182       0.88
L 124 21.40     0.13  5.98  3.8   2.8          180       0.85
K 135 15.71    49.50  9.74 -3.9  -8.8          211       0.52
M 124 16.25     1.43  5.74  1.9   3.4          204       0.85
F 135 19.80     0.35  5.48  2.8   3.7          218       0.88
P  90 17.43     1.58  6.30 -1.6  -0.2          143       0.64
S  73  9.47     1.67  5.68 -0.8   0.6          122       0.66
T  93 15.77     1.66  5.66 -0.7   1.2          146       0.70
W 163 21.67     2.10  5.89 -0.9   1.9          259       0.85
Y 141 18.03     1.61  5.66 -1.3  -0.7          229       0.76
V 105 21.57     0.13  5.96  4.2   2.6          160       0.85
> aa.pc=prcomp(aa, center=TRUE, scale.=TRUE)
> aa.pc
Standard deviations:
[1] 1.88954924 1.67729608 0.88574923 0.65278312 0.53130822 0.27233782 0.21394709
[8] 0.05808934

Rotation:
                     PC1         PC2           PC3         PC4        PC5
VOL           0.05790316 -0.58494270  0.1057571970 -0.09311996  0.1684497
BULK         -0.22089815 -0.48085679  0.1685873871 -0.15647005 -0.6842181
POLARITY      0.44390146 -0.10372198  0.1432388489  0.73088587 -0.1747201
PI            0.18580066 -0.25244261 -0.9417631776  0.02722752 -0.0522075
HYD1         -0.49279466 -0.03284274 -0.1281013177  0.35033569 -0.3534423
HYD2         -0.50821923  0.03312719 -0.1292531953 -0.12776013  0.1837842
SURFACE_AREA  0.10045084 -0.56587879  0.1408585179 -0.10640314  0.3652725
FRACT_AREA   -0.45283231 -0.17244828  0.0009997326  0.53059489  0.4220135
                     PC6         PC7          PC8
VOL          -0.11705506  0.21351836 -0.739571539
BULK          0.25124136 -0.35181704  0.109655303
POLARITY      0.39252691  0.22974777  0.009653888
PI            0.04562347 -0.09136074  0.030622127
HYD1         -0.50083501  0.48686655  0.064292387
HYD2          0.70974650  0.41191312 -0.019940539
SURFACE_AREA -0.11107422  0.24097250  0.659316956
FRACT_AREA   -0.01018472 -0.55201871 -0.027363410
> summary(aa.pc)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     1.8895 1.6773 0.88575 0.65278 0.53131 0.27234 0.21395
Proportion of Variance 0.4463 0.3517 0.09807 0.05327 0.03529 0.00927 0.00572
Cumulative Proportion  0.4463 0.7980 0.89603 0.94930 0.98459 0.99386 0.99958
                           PC8
Standard deviation     0.05809
Proportion of Variance 0.00042
Cumulative Proportion  1.00000

PC1和PC2能够解释79.8%的variation。

Continue reading

上个月写了《使用ggplot画图》一文,图片太多,一下子把博客的12G流量给用光了,月中博客就被停了,小伙伴们太给力了,blog不能访问,真是件不爽的事情。

言归正传,从2002年使用redhat 7.1开始,就没中止过使用linux,为什么要说again呢,因为在过去几年的工作时间里,使用的是OS X和win7做为桌面的,linux只在服务器端使用。自从用了OS X之后,就不想再用linux当桌面了,OS X绝对是节省生命的系统,本身是unix-like系统,对各种桌面软件支持又好,特别是中华大地,linux还是小众,用来当桌面实在是各种坑。

不过现在又回到学生时代了,所以还得继续学生屌丝的折腾,本科时从redhat,debian到gentoo,硕士时实验室的机器用ubuntu,而笔记本用NetBSD。这么多年的使用,该有的坑都填了,但是时代在发展,新坑还是不断有的,使用百度云来同步化文献库,就是最近的新坑,装上linux之后,发现度娘没有客户端,首先想到的当然是wine,wine出来的偶尔能行,时常崩溃,跟它死磕我也会崩溃,github上寻找开源实现,各种功能缺失和限制,那就只能虚拟机了,把home目录做为共享盘,挂到虚拟机中,但是度娘一直都是占100% CPU,然后数据无法同步,只有两个原因,不是硬盘就是网络,感觉好像我一个分区800G太大了,但在虚拟中对硬盘读写正常,网络因为使用NAT,虚拟成内网,有可能是这个原因,于是换成bridge,把虚拟机当成现实网络中的另一台主机,但度娘依然只占CPU,不干活。那只能还是硬盘问题,最后问题解决了,通过在linux中提供samba服务,然后在虚拟机中通过网上邻居把共享目录映射成网络驱动器,这时候度娘终于肯干活了。都不知道是virtualbox的驱动有问题,还是度娘太挑,这太TM坑了。

Continue reading

Why use ggplot2

ggplot2是我见过最human friendly的画图软件,这得益于Leland Wilkinson在他的著作《The Grammar of Graphics》中提出了一套图形语法,把图形元素抽象成可以自由组合的成分,Hadley Wickham把这套想法在R中实现。

为什么要学习ggplot2,可以参考ggplot2: 数据分析与图形艺术序言(btw: 在序言的最后,我被致谢了)。

Hadley Wickham也给出一堆理由让我们说服自己,我想再补充一点,Hadley Wickham是学医出身的,做为学生物出身的人有什么理由不支持呢:)

Continue reading

Author's picture

Guangchuang Yu

Bioinformatics Professor @ SMU

Bioinformatics Professor

Guangzhou