10 General Linear Model

10.1 T检验是回归的特例

data(iris)
tt <- iris[, c("Petal.Length", "Sepal.Length")]
t.test(Petal.Length, Sepal.Length, data=tt, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Petal.Length and Sepal.Length
## t = -13.098, df = 298, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.398643 -1.772023
## sample estimates:
## mean of x mean of y 
##  3.758000  5.843333

我们取iris的Petal.Length和Sepal.Length做两样本T检验,因为方差分析和回归分析都假定齐方差,所以这里以齐方差方式进行T检验。

数据可以转换成以下分组形式:

require(reshape2)
## Loading required package: reshape2
tt2 <- melt(tt)
## No id variables; using all as measure variables
head(tt2)
##       variable value
## 1 Petal.Length   1.4
## 2 Petal.Length   1.4
## 3 Petal.Length   1.3
## 4 Petal.Length   1.5
## 5 Petal.Length   1.4
## 6 Petal.Length   1.7
t.test(value ~ variable, data=tt2, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  value by variable
## t = -13.098, df = 298, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.398643 -1.772023
## sample estimates:
## mean in group Petal.Length mean in group Sepal.Length 
##                   3.758000                   5.843333

分成两组,按分组变量进行T检验,结果是一样的。

对分组数据进行回归分析:

lm.fit  <- lm(value ~ variable, data=tt2)
summary(lm.fit)
## 
## Call:
## lm(formula = value ~ variable, data = tt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7580 -0.8433  0.1993  0.9457  3.1420 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.7580     0.1126   33.38   <2e-16 ***
## variableSepal.Length   2.0853     0.1592   13.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.379 on 298 degrees of freedom
## Multiple R-squared:  0.3654, Adjusted R-squared:  0.3632 
## F-statistic: 171.6 on 1 and 298 DF,  p-value: < 2.2e-16

回归分析完全包含了T检验,并且给出更丰富的信息。

Intercept相当于Petal.Length的均值,Slope则是均值差,slope的T检验结果和上面两样本均值差T检验是一样的。 而F值是T检验结果的平方:

13.10^2
## [1] 171.61

10.2 T检验是方差分析的特例

res <- aov(value ~ variable, data=tt2)
summary(res)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## variable      1  326.1   326.1   171.6 <2e-16 ***
## Residuals   298  566.5     1.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

one way ANOVA进行两组分析和T检验是一样一样的,F值是t值的平方。

我们常说T检验和ANOVA是用来做组间比较的,而相关性和回归是用来度量相互关系的,实际上,他们是没有差别的。

10.3 ANOVA是多重回归的特例

这里使用two way ANOVA一节中使用的数据:

data <- read.table("data/gender_dose.tsv", header=TRUE)
summary(with(data, aov(Alertness~Gender+Dosage)))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Gender       1  76.56   76.56   3.197 0.0971 .
## Dosage       1   5.06    5.06   0.211 0.6533  
## Residuals   13 311.31   23.95                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(Alertness~Gender+Dosage, data=data))
## 
## Call:
## lm(formula = Alertness ~ Gender + Dosage, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.438 -3.406  0.000  1.594 10.562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.688      2.119   7.403 5.17e-06 ***
## Genderm       -4.375      2.447  -1.788   0.0971 .  
## Dosageb        1.125      2.447   0.460   0.6533    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.894 on 13 degrees of freedom
## Multiple R-squared:  0.2077, Adjusted R-squared:  0.08584 
## F-statistic: 1.704 on 2 and 13 DF,  p-value: 0.2201

上面的例子中假设Gender和Dosage是完全无关的,如果把两因素的互作考虑在内,则:

summary(with(data, aov(Alertness~Gender*Dosage)))
##               Df Sum Sq Mean Sq F value Pr(>F)
## Gender         1  76.56   76.56   2.952  0.111
## Dosage         1   5.06    5.06   0.195  0.666
## Gender:Dosage  1   0.06    0.06   0.002  0.962
## Residuals     12 311.25   25.94
summary(lm(Alertness~Gender+Dosage+Gender*Dosage, data=data))
## 
## Call:
## lm(formula = Alertness ~ Gender + Dosage + Gender * Dosage, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.500 -3.375  0.000  1.562 10.500 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       15.750      2.546   6.185 4.69e-05 ***
## Genderm           -4.500      3.601  -1.250    0.235    
## Dosageb            1.000      3.601   0.278    0.786    
## Genderm:Dosageb    0.250      5.093   0.049    0.962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.093 on 12 degrees of freedom
## Multiple R-squared:  0.2079, Adjusted R-squared:  0.009862 
## F-statistic:  1.05 on 3 and 12 DF,  p-value: 0.4062