R always surprises me

Hello! First of all, I would like to thank you for this wonderful and very powerful package!

I have tried to plot a phylogenetic tree with a heatmap of an associated matrix (with gheatmap). I found that the row names of the matrix are not matched exactly to the tip names of the tree when data are missing from the associated matrix. In other words, if we have two species (e.g., Species198 and Species1981), but only one of them is represented in the associated matrix, we will have colored cells for both species in the heatmap.

Here is a reproducible example:

library(ggtree)

set.seed(111)

# Prepare a species list
tipss <- c("Fomes", "Rozella", "Saitoella", "Entorrhiza", "Cryptococcus", "Tremella", "Puccinia", "Amoeboaphelidium")
tipss <- c(tipss, "Species198", "Species1981")

# Generate random tree
trr <- ape::rtree(n = length(tipss), rooted = TRUE, tip.label = tipss)

# Generate associated data matrix for each species
abunds <- matrix(data = sample(1:100, size = 4*length(tipss)), nrow = length(tipss))
rownames(abunds) <- tipss
colnames(abunds) <- paste("Samp", 1:ncol(abunds), sep="")

# Remove data for some species
abunds <- abunds[-which(rownames(abunds) == "Species198"), ]     # !! no information for this species
abunds <- abunds[-which(rownames(abunds) == "Cryptococcus"), ]

tt <- ggtree(trr) + geom_tiplab()
gheatmap(tt, abunds)

Here is the resulting picture:

You can see that the heatmap cells are blank for Cryptococcus (as expected). However, the cells corresponding to Species198 are colored with the Species1981 data (marked with a red arrow).

With best regards, Vladimir

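Someone reported this bug to me on GitHub, saying that my gheatmap function does not match names exactly. That can't be right, can it?! But the reproducible code left no room for doubt. I dug through the code and eventually found that the culprit was R itself. R has surprised us once again, as the following demonstrates (dd is a data frame holding the abundance data, with no row for Species198):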

> dd
                 Samp1 Samp2 Samp3 Samp4
Fomes               47    58     3    13
Rozella            100    83    27    18
Saitoella           36    51    78    63
Entorrhiza          70    32    43    16
Tremella            75    73    35    20
Puccinia            61    53     8    72
Amoeboaphelidium    95    66    92    19
Species1981         30    48    34    41
> dd['Species198',]
            Samp1 Samp2 Samp3 Samp4
Species1981    30    48    34    41

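That is the root of the problem. I have of course fixed the bug, and gheatmap no longer behaves this way. Still, we need to be careful when using R: when subsetting by name, it is better not to rely on [. R can always surprise you. Earlier posts such as 《你的数据被化了妆？》 (Has Your Data Been Wearing Makeup?) and the collected complaints in 《R的诡异事件》 (Strange Happenings in R) are worth rereading.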

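《R的诡异事件》 also covered partial matching, but only the fact that $ and [[ have different default behaviour. In fact, the default behaviour of [ also differs between a matrix and a data frame: on a matrix, [ matches names exactly, whereas on a data.frame it partially matches row names.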
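To see the difference concretely, here is a minimal sketch with made-up toy data. The object names m and d and the values are illustrative assumptions, not taken from the bug report; only base R is used.

```r
# Toy data: "Species1981" is present, "Species198" is not
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("Fomes", "Species1981"), c("Samp1", "Samp2")))
d <- as.data.frame(m)

# Matrix: names are matched exactly, so a missing name raises an error
# ("subscript out of bounds"), which we catch here
res_m <- tryCatch(m["Species198", ], error = function(e) "error")
print(res_m)

# Data frame: the row name is partially matched, so the
# "Species1981" row comes back without any warning
res_d <- d["Species198", ]
print(rownames(res_d))
```

The matrix fails loudly, while the data frame silently returns a row for a different species, which is exactly the kind of result that is easy to overlook.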

If we read the documentation of [ via ?base::`[`, we find that it describes exact matching:

Character indices:

Character indices can in some circumstances be partially matched (see ‘pmatch’) to the names or dimnames of the object being subsetted (but never for subassignment). Unlike S (Becker _et al_ p. 358), R never uses partial matching when extracting by ‘[’, and partial matching is not by default used by ‘[[’ (see argument ‘exact’).

Thus the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by ‘$’. Even in that case, warnings can be switched on by ‘options(warnPartialMatchDollar = TRUE)’.

Neither empty (‘""’) nor ‘NA’ indices match any names, not even empty nor missing names. If any object has no names or appropriate dimnames, they are taken as all ‘""’ and so match nothing.

This sentence is unambiguous:

R never uses partial matching when extracting by ‘[’, and partial matching is not by default used by ‘[[’ (see argument ‘exact’).

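According to this, [ always matches exactly and [[ matches exactly by default. Only [[ can opt in to partial matching, through its exact argument; [ has no such option.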
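The documented defaults for a plain list can be checked with a short sketch. The list x and its element name are made up for illustration.

```r
x <- list(abundance = 1:3)

r_exact   <- x[["abun"]]                 # NULL: `[[` is exact by default
r_partial <- x[["abun", exact = FALSE]]  # 1 2 3: partial matching on request
r_dollar  <- x$abun                      # 1 2 3: `$` partially matches by default
r_single  <- x["abun"]                   # one NULL element: `[` does not partially match

print(r_exact)
print(r_partial)
print(r_dollar)
print(r_single)
```

Setting options(warnPartialMatchDollar = TRUE), as the help page mentions, makes the $ case emit a warning.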

But if we read the documentation of the data frame method via ?base::`[.data.frame`, it says the opposite:

Both ‘[’ and ‘[[’ extraction methods partially match row names. By default neither partially match column names, but ‘[[’ will if ‘exact = FALSE’ (and with a warning if ‘exact = NA’). If you want to exact matching on row names use ‘match’, as in the examples.

The example in the documentation looks like this:

 sw["C", ] # partially matches
 sw[match("C", row.names(sw)), ] # no exact match

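So be careful with [ from now on, because its behaviour is inconsistent: on a matrix everything is matched exactly, while on a data frame column names are matched exactly but row names are partially matched. This can also lead to latent bugs elsewhere. Many classes define their own [ method and use it to subset one of their slots, and if that slot is a data frame, the method inherits the partial matching of row names. Who knows which [ methods carry this bug, or when it will bite! My gheatmap function is a case in point: had a careful user not noticed, I would never have known the bug existed.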
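One way to guard against this, following the match() advice in the help page, is to look up row positions explicitly before subsetting. The sketch below is only an illustration: exact_rows is a hypothetical helper, not a base R or ggtree function, and it is not necessarily how gheatmap was fixed.

```r
# Hypothetical helper: keep only rows whose names match `ids` exactly
exact_rows <- function(df, ids) {
  idx <- match(ids, rownames(df))         # NA where a name is absent
  df[idx[!is.na(idx)], , drop = FALSE]    # integer indexing, no partial matching
}

# Toy data frame (values are made up for illustration)
d <- data.frame(Samp1 = c(47, 30), Samp2 = c(58, 48),
                row.names = c("Fomes", "Species1981"))

missing_sp <- exact_rows(d, "Species198")                    # zero rows
present_sp <- exact_rows(d, c("Species198", "Species1981"))  # one row

print(nrow(missing_sp))
print(rownames(present_sp))
```

Because match() compares whole strings, a name that is absent simply yields no row instead of borrowing the data of a similarly named one.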


Guangchuang Yu