# Chapter 6 Regression

## 6.1 `lm`

This function is used for regression according to a linear model,
i.e. linear regression. It returns a model-class object. There are
specialized functions for such models, e.g. to extract residuals
(`resid`), to extract regression coefficients (`coef`), to modify
the model (`update`), etc.
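These extractor functions can be tried on any fitted model; a minimal sketch, using R's built-in `cars` dataset rather than the data of this chapter:

```
m <- lm( dist~speed, data=cars )  # fit a simple linear regression
coef(m)                           # regression coefficients b0 and b1
head( resid(m) )                  # residuals, one per observation
m0 <- update( m, .~.-speed )      # drop the predictor: intercept-only model
```

The same calls work unchanged on the models constructed below.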

In the following example, we construct two regression models. The first step in your regression analysis should always be a thorough visual and numerical inspection, including histograms and scatterplots of the variables under study.

The first model is

\(\texttt{nsyl} = b_0\)

having only a constant intercept. The second model includes the speakers’ age as a predictor, i.e.

\(\texttt{nsyl} = b_0 + b_1 \texttt{age}\)

(The intercept is included in this model too, by default, unless
suppressed explicitly with `-1` or `~0` in the regression formula.)
The key question here is whether inclusion of a predictor yields a
better model, with significantly smaller residuals and a significantly
higher \(R^2\). The intercept-only model and the linear-regression
model are compared with the `anova` function.

```
require(hqmisc)
data(talkers)
model1.lm <- lm( nsyl~1, data=talkers )   # intercept only
model2.lm <- lm( nsyl~age, data=talkers ) # intercept and age predictor
anova( model1.lm, model2.lm ) # compare two models
```

```
## Analysis of Variance Table
##
## Model 1: nsyl ~ 1
## Model 2: nsyl ~ age
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     79 381.85
## 2     78 375.06  1    6.7823 1.4105 0.2386
```

Including the `age` predictor does improve the model a little, as
indicated by the somewhat smaller residual sum of squares (`RSS`).
The improvement, however, is too small to be significant. The linear
effect of a speaker’s age on his or her average phrase length (in
syllables) is not significant.
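The \(R^2\) of a fitted model can also be inspected directly; a sketch, assuming the two models fitted above:

```
summary( model2.lm )$r.squared      # proportion of variance explained by age
summary( model2.lm )$adj.r.squared  # adjusted for the number of predictors
```

Both values will be close to zero here, in line with the non-significant `anova` comparison.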

## 6.2 `glm`

For logistic regression we use the function `glm`, with
`family=binomial`, again with a regression formula as an obligatory
argument. Logistic regression can be imagined as computing the logit
of the hit rate for each cell, and then regressing these logits on
the predictor(s). Here is an annotated example (Moore and McCabe
2003, Exercise 15.25).
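The logit transformation itself is easy to verify in R; a minimal sketch, using an arbitrary hit rate of 0.75:

```
p <- 0.75       # hypothetical hit rate
log( p/(1-p) )  # logit by hand: the log of the odds, log(3)
qlogis(p)       # same value, via R's built-in quantile function
```

Both expressions return \(\log(3) \approx 1.0986\).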

The response variable `outcome` indicates the death (`0`) or survival
(`1`) of 2900 patients in two hospitals.

```
ips1525 <- read.table(
  file=url("http://www.hugoquene.nl/emlar/ipsex1525.txt"),
  header=TRUE, sep="," )
with( ips1525, table(outcome) )
```

```
## outcome
## 0 1
## 79 2821
```

The overall log odds of survival, \(\log(2821/79) = 3.575399\), will reappear below as the intercept estimate.

```
# intercept-only logistic-regression model
model1.glm <- glm( outcome~1, data=ips1525, family=binomial )
summary(model1.glm)
```

```
##
## Call:
## glm(formula = outcome ~ 1, family = binomial, data = ips1525)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -2.684  0.235  0.235  0.235  0.235
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   3.5754     0.1141   31.34   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 725.1  on 2899  degrees of freedom
## Residual deviance: 725.1  on 2899  degrees of freedom
## AIC: 727.1
##
## Number of Fisher Scoring iterations: 6
```
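As a quick check on the output above, the intercept estimate can be back-transformed with `plogis` (the inverse logit); a small sketch, using the counts reported earlier:

```
plogis( 3.5754 )  # inverse logit of the intercept: survival probability
2821/2900         # observed proportion of survivors, ~0.9728
```

The two values agree, as they should for an intercept-only model.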

The intercept coefficient of the intercept-only logistic regression
model equals the overall log odds of survival; this intercept is
significantly different from zero.^{9} Next, let’s try to improve
this model by including two predictors: first, the `hospital` where
the patient was treated, and second, the patient’s `condition` at
intake, classified as bad (`0`) or good (`1`).

```
model2.glm <- glm( outcome~hospital, data=ips1525, family=binomial )
model3.glm <- glm( outcome~hospital*condition,
                   data=ips1525, family=binomial )
```

The difference in deviance between two nested logistic-regression models approximately follows a \(\chi^2\) distribution. Hence we can compare nested models by computing the \(\chi^2\) probability of the change in deviance. In the table below, model 2 is compared against model 1, and model 3 against model 2.
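The analysis-of-deviance table below can be produced with `anova`, now requesting \(\chi^2\) probabilities via `test="Chisq"`; a sketch, assuming the three models fitted above:

```
anova( model1.glm, model2.glm, model3.glm, test="Chisq" )
```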

```
## Analysis of Deviance Table
##
## Model 1: outcome ~ 1
## Model 2: outcome ~ hospital
## Model 3: outcome ~ hospital * condition
##   Resid. Df Resid. Dev Df Deviance   Pr(>Chi)
## 1      2899     725.10
## 2      2898     722.78  1   2.3253     0.1273
## 3      2896     703.96  2  18.8222 0.00008181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The results indicate that there is no significant difference between the two hospitals in their survival rates (model 2, \(p>.10\)), but there is a significant effect of intake condition on the outcome (model 3, \(p<.001\)). Of course, you should also inspect the models themselves before drawing conclusions.
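Such an inspection could include the coefficients on the odds scale; a sketch, assuming the models fitted above:

```
summary( model3.glm )    # coefficients, standard errors, z values
exp( coef(model3.glm) )  # coefficients expressed as odds ratios
```

An odds ratio above 1 indicates higher odds of survival for that predictor value.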

### References

Moore, David S., and George P. McCabe. 2003. *Introduction to the Practice of Statistics*. 4th ed. Freeman.

If the odds are \(1:1\), then the log of the odds is \(\log(1)=0\).↩