Chapter 10 Inference in MLR

\(\newcommand{\E}{\mathrm{E}}\) \(\newcommand{\Var}{\mathrm{Var}}\) \(\newcommand{\bmx}{\mathbf{x}}\) \(\newcommand{\bmH}{\mathbf{H}}\) \(\newcommand{\bmI}{\mathbf{I}}\) \(\newcommand{\bmX}{\mathbf{X}}\) \(\newcommand{\bmy}{\mathbf{y}}\) \(\newcommand{\bmY}{\mathbf{Y}}\) \(\newcommand{\bmbeta}{\bm{\beta}}\) \(\newcommand{\XtX}{\bmX^\mT\bmX}\) \(\newcommand{\mT}{\mathsf{T}}\) \(\newcommand{\XtXinv}{(\bmX^\mT\bmX)^{-1}}\)

10.1 What kind of hypothesis test?

In multiple linear regression, there are three types of questions we could ask about the importance of the predictor variables. They differ by whether they involve one predictor, all predictors, or a subset of predictors.

Single Variable

  • What is the importance of a specific predictor?
  • This question can be addressed with a T-test (Section 10.3).
  • Examples
    • What is the relationship, if any, between average air pollution levels and cardiovascular mortality rates, after controlling for temperature?
    • What is the relationship, if any, between gender and salary, controlling for years of experience and position title?
    • What is the relationship, if any, between marijuana use and opioid overdoses, controlling for socioeconomic status?
    • What is the relationship, if any, between hours spent on homework and final exam score, controlling for class level and department?

All Variables

  • What is the overall importance of the model?
  • This question can be addressed with a global F-test (Section 10.5).
  • Examples:
    • How well do temperature and air pollution explain variation in cardiovascular mortality rates?
    • Do gender, years of experience, and position title explain differences in salary?
    • Can rates of opioid overdoses be explained by rates of marijuana use and socioeconomic status?
    • Can we predict final exam score using time spent on homework, class level, and department?

Subset of Variables

  • Which subsets of predictors are important?
  • This question can be addressed with a partial F-test (Section 10.6).
  • Examples:
    • How well do temperature and air pollution explain variation in cardiovascular mortality rates, when accounting for underlying health status?
    • Do gender and years of experience explain differences in salary after adjusting for position title?
    • Can rates of opioid overdoses be explained by rates of marijuana use and socioeconomic status, when accounting for healthcare provider networks?
    • Can we predict final exam score using time spent on homework and year in school, when accounting for subject area?

10.2 Photosynthesis Data

For examples in this chapter, we will use data on photosynthesis output in trees from Reich et al., Nature, 2018.9 They measured photosynthesis output under different conditions, including variations in the amount of water in the soil and temperature in the surrounding air.

We can fit a multiple linear regression model for photosynthesis output (\(Y\)) that adjusts for soil moisture content (\(x_1\)), an indicator of whether the tree was artificially warmed (\(x_2\)), and leaf temperature (\(x_3\)).10

ph_lm <- lm(photosyn~soil_water + warming_treatment  + tleaf,
            data=photo)

We obtain the fitted model:

\[\begin{equation} \hat y = 3.89 + 40.5x_{1} + 1.4x_{2} -0.022x_3 \tag{10.1} \end{equation}\]
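These estimates can be read directly from the fitted model object:

round(coef(ph_lm), 3)  # (Intercept), soil_water, warming_treatmentwarmed, tleaf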

10.3 Hypothesis Tests for \(\beta_j\)

10.3.1 Scientific vs. Statistical Question

Using this model (10.1), we could ask the scientific question:

Is there a relationship between soil water content ratio and photosynthesis output, after adjusting for leaf temperature and warming treatment?

To translate this into a statistical question, we need to isolate what represents the relationship between soil water content ratio and photosynthesis output. This is precisely \(\beta_1\), since it represents the slope of the relationship between soil water content (\(x_1\)) and average photosynthesis output (\(E[Y]\)), for trees with the same value of leaf temperature and warming treatment. Thus, our corresponding statistical question is:

Is \(\beta_{1}\ne 0\) in this model?

10.3.2 General Form of \(H_0\)

The standard setup for a hypothesis test of a single coefficient parameter in MLR is similar to SLR. We consider null and alternative hypotheses of the form \[\begin{equation} H_0: \beta_j = \beta_{j0} \quad \text{vs.} \quad H_A: \beta_j \ne \beta_{j0} \tag{10.2} \end{equation}\] One-sided hypotheses are also possible, although less common than the two-sided versions. To test the null hypothesis in (10.2), we use the \(t\)-statistic that has the same form as the one from SLR: \[t = \frac{\hat\beta_j - \beta_{j0}}{\widehat{se}(\hat\beta_j)}\] However, unlike in SLR, the standard error in the denominator, \(\widehat{se}(\hat\beta_j) = \sqrt{\hat\sigma^2(\XtX)^{-1}_{jj}}\), depends on \(x_j\) and all of the other \(x\)’s. This means that the correlation between predictor variables can impact the results of the hypothesis test (and the width of confidence intervals).

When \(H_0\) is true, \(t\) follows a T-distribution with \(n-p\) degrees of freedom. The corresponding two-sided \(p\)-value is computed in the usual way: \(p = 2P(T_{n-p} > |t|)\). The \(t\) statistic and \(p\)-value provided by R in standard output correspond to \(\beta_{j0}=0\).
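As a sketch of how these pieces combine, the \(t\) statistic and two-sided \(p\)-value for a single coefficient (here \(\beta_1\), the soil_water coefficient, with \(\beta_{10} = 0\)) can be computed directly from the fitted model:

b1hat <- coef(ph_lm)[2]                 # estimate of beta_1
seb1hat <- sqrt(diag(vcov(ph_lm)))[2]   # estimated standard error of beta_1-hat
t_stat <- (b1hat - 0)/seb1hat           # t statistic for H0: beta_1 = 0
2*pt(abs(t_stat), df=df.residual(ph_lm), lower.tail=FALSE)  # two-sided p-value

This reproduces the statistic and p.value columns of the standard output shown below.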

Example 10.1 In the photosynthesis data, is there evidence of a relationship between soil water content ratio and average photosynthesis output, after adjusting for leaf temperature and warming treatment?

To answer this, we compute the test statistic and obtain: \[t = \frac{\hat\beta_1 - \beta_{10}}{\widehat{se}(\hat\beta_1)} = \frac{40.5 - 0}{2.84} = 14.24\] The corresponding \(p\)-value is: \[p = 2P(T_{1311} > 14.24) < 0.0001\] Thus, we reject the null hypothesis that \(\beta_1 = 0\) and conclude that there is a linear relationship between soil water content and photosynthesis output, when adjusting for warming treatment and leaf temperature.

All of the information necessary to conduct this hypothesis test is available in the standard summaries of an lm object in R. For example:

tidy(ph_lm)
## # A tibble: 4 × 5
##   term                    estimate std.error statistic  p.value
##   <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)               3.89      0.777      5.01  6.21e- 7
## 2 soil_water               40.5       2.85      14.2   7.02e-43
## 3 warming_treatmentwarmed   1.40      0.317      4.43  1.02e- 5
## 4 tleaf                    -0.0223    0.0255    -0.877 3.81e- 1

10.4 Confidence Intervals for \(\beta_j\)

Confidence intervals for the \(\beta\) parameters have the same form as in SLR: \[\begin{equation} (\hat\beta_j - t_{1-\alpha/2}\widehat{se}(\hat\beta_j), \hat\beta_j + t_{1-\alpha/2}\widehat{se}(\hat\beta_j)) \tag{10.3} \end{equation}\] where \(t_{1-\alpha/2}\) is such that \(P(T_{n-p} < t_{1-\alpha/2}) = 1-\alpha/2\). The interval in (10.3) is a random interval that, assuming the model is correct, includes the true value of the parameter \(\beta_j\) with probability \(1-\alpha\).

The same functions that provide confidence intervals in R for SLR (confint() and broom::tidy()) provide them for models with multiple variables. If desired, the intervals can also be constructed from the individual components in (10.3).
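For example, confint() applied to the photosynthesis model returns intervals for all four coefficients at once:

confint(ph_lm, level=0.95)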

Example 10.2 To construct a confidence interval for \(\beta_1\) in (10.1), we first compute the necessary elements:

  • \(\hat\beta_1 = 40.5\)
  • \(t_{1 - \alpha/2} = 1.96\)
  • \(\widehat{se}(\hat\beta_1) = 2.84\)

We then compute the interval:

\[(40.5 - 1.96*2.84, 40.5 + 1.96*2.84) = (34.9, 46.1)\]

In R, we could accomplish this using:

t_alphaOver2 <- qt(p=0.975, df=df.residual(ph_lm))  # t quantile with n - p df
b1hat <- coef(ph_lm)[2]                             # estimate of beta_1
seb1hat <- sqrt(diag(vcov(ph_lm)))[2]               # se of beta_1-hat
c(Lower=b1hat - t_alphaOver2*seb1hat,
  Upper=b1hat + t_alphaOver2*seb1hat)

But in practice, it is much faster and simpler to use tidy():

tidy(ph_lm, conf.int=TRUE, conf.level=0.95)
## # A tibble: 4 × 7
##   term                    estimate std.error statistic  p.value conf.low conf.…¹
##   <chr>                      <dbl>     <dbl>     <dbl>    <dbl>    <dbl>   <dbl>
## 1 (Intercept)               3.89      0.777      5.01  6.21e- 7   2.37    5.42  
## 2 soil_water               40.5       2.85      14.2   7.02e-43  34.9    46.1   
## 3 warming_treatmentwarmed   1.40      0.317      4.43  1.02e- 5   0.782   2.02  
## 4 tleaf                    -0.0223    0.0255    -0.877 3.81e- 1  -0.0723  0.0276
## # … with abbreviated variable name ¹​conf.high

10.5 Testing for Significance of Regression (Global F-Test)

10.5.1 Scientific vs. Statistical Question

A different scientific question we could ask about model (10.1) is:

Is there any linear relationship between all of these predictor variables (soil water content ratio, tree warming status, and leaf temperature) and photosynthesis output?

Our corresponding statistical question is:

Is \(\beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0\)?

In words, this corresponds to:

\(H_0\): There is no linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.
\(H_A\): There is a linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.

10.5.2 Global F-test

The general form of this set of hypotheses, called a Global F-Test, is: \[\begin{align*} H_0:& \beta_1 = \beta_2 = \cdots = \beta_k = 0 \\ H_A:& \beta_j \ne 0 \text{ for at least one } j \end{align*}\]

An important limitation of the Global F-test is that \(H_A\) does not specify which coefficient is non-zero, only that at least one is non-zero. In many cases, we might follow up a Global F-test with an additional t-test for a specific parameter.11

In MLR, we use the same approach to testing \(H_0\) that we used in SLR (Section 6.3). The F statistic is \[f = \frac{SS_{Reg} / df_{Reg}}{SS_{Res}/ df_{Res}}.\] The sum of squares decomposition is the same as in SLR: \[\begin{align*} SS_{Tot} &= \sum_{i=1}^n (y_i - \overline{y})^2\\ SS_{Reg} &= \sum_{i=1}^n (\hat y_i - \overline{y})^2\\ SS_{Res} &= \sum_{i=1}^n (y_i - \hat y_i)^2\\ SS_{Tot} &= SS_{Reg} + SS_{Res} \end{align*}\] The degrees of freedom scale up and down according to the number of predictor variables. With \(k\) predictor variables, \(df_{Reg} = k\). Note that this is not \(p=k+1\), since we do not account for the intercept here. The denominator degrees of freedom is \(df_{Res} = n - p = n - (k + 1) = n - k - 1\). When \(H_0\) is true (and assuming approximate normality), \(f\) follows an \(F_{df_{Reg}, df_{Res}}\) distribution. We reject \(H_0\) if \(f\) is large enough (using the test level \(\alpha\)).
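A minimal sketch of this computation in R, building each sum of squares directly from the fitted values of ph_lm:

y <- model.response(model.frame(ph_lm))   # observed outcomes
yhat <- fitted(ph_lm)                     # fitted values
n <- length(y)
k <- length(coef(ph_lm)) - 1              # number of predictors (excludes intercept)
ss_reg <- sum((yhat - mean(y))^2)         # SS_Reg
ss_res <- sum((y - yhat)^2)               # SS_Res
f_stat <- (ss_reg/k)/(ss_res/(n - k - 1)) # global F statistic
pf(f_stat, df1=k, df2=n - k - 1, lower.tail=FALSE)  # p-value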

Example 10.3 In the photosynthesis data, is there evidence of a linear relationship between any of the predictor variables and the average photosynthesis output? To answer this, we compute the \(f\) statistic:

\[f = \frac{5863/3}{36803/(1315 - 4)} = 69.6\] The corresponding \(p\)-value is \(P(F_{3,1311} > 69.6) < 0.0001\). We reject the null hypothesis that there is no linear relationship between average photosynthesis output and the soil water content ratio, tree warming status, and leaf temperature.

R provides the \(F\) statistic and \(p\)-value for a test of significance of regression. This is provided at the bottom of summary() output:

summary(ph_lm)
## 
## Call:
## lm(formula = photosyn ~ soil_water + warming_treatment + tleaf, 
##     data = photo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.782  -4.148  -0.606   3.770  23.560 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.89447    0.77745   5.009 6.21e-07 ***
## soil_water              40.51751    2.84574  14.238  < 2e-16 ***
## warming_treatmentwarmed  1.40314    0.31670   4.431 1.02e-05 ***
## tleaf                   -0.02233    0.02545  -0.877    0.381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.298 on 1311 degrees of freedom
## Multiple R-squared:  0.1374, Adjusted R-squared:  0.1354 
## F-statistic: 69.62 on 3 and 1311 DF,  p-value: < 2.2e-16

10.5.3 ANOVA Table

The information for the global F-test can be summarized in an ANOVA table:

Source of Variation | Sum of Squares | Degrees of Freedom | MS | F
Regression | \(SS_{reg}\) | \(k\) | \(MS_{reg}\) | \(MS_{reg}/MS_{res}\)
Residual | \(SS_{res}\) | \(n-(k+1) = n-p\) | \(MS_{res}\) |
Total | \(SS_{tot}\) | \(n-1\) | |
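Base R’s anova() applied to a single model prints sequential sums of squares rather than this table, but the remaining entries can be assembled from the quantities in the sketch from Section 10.5.2:

ms_reg <- ss_reg/k            # MS_reg
ms_res <- ss_res/(n - k - 1)  # MS_res (this is also sigma-hat^2)
ms_reg/ms_res                 # F statistic, matching f_stat above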

10.6 Testing for Subsets of Coefficients

10.6.1 Scientific vs. Statistical Question

A different scientific question we could ask about model (10.1) is:

Is there any linear relationship between average photosynthesis output and warming treatment and leaf temperature in a model that already contains soil water content?

Our corresponding statistical question is:

Is \(\beta_2 \ne 0 \text{ and/or }\beta_3 \ne 0\)?

Compared to the question for the Global F-test, we are now posing a question about a subset of the predictor variables. This is the same as comparing the “full” model: \[Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \epsilon_i\] and the “reduced” model \[Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i.\]

This corresponds to the following null and alternative hypotheses: \[\begin{align} H_0:& \beta_2 = \beta_3 = 0\\ H_A:& \beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0 \tag{10.4} \end{align}\]

10.6.2 Partial F-test

The hypotheses in (10.4) can be tested using a Partial F-test, which calculates an \(f\) statistic comparing a “full” model with all predictor variables to the “reduced” model with a subset of parameters.

The corresponding test statistic is: \[\begin{equation} f = \dfrac{\left(SS_{reg}^{Full} - SS_{reg}^{Reduced}\right)/ r}{SS_{Res}/(n - p)} \tag{10.5} \end{equation}\] The numerator of \(f\) is the difference between the variation explained by the full model (\(SS_{reg}^{Full}\)) and the variation explained by the reduced model (\(SS_{reg}^{Reduced}\)). This difference is scaled by the number of variables (\(r\)) that are being set to zero in the reduced model. The denominator is, as with the Global F-test, the average amount of variability unexplained by the full model. The \(f\) statistic in (10.5) is compared against an \(F_{r, n-p}\) distribution to obtain the \(p\)-value.

Sometimes it is convenient to represent the full and reduced models compactly using vector and matrix notation. First, partition the full vector of coefficients \(\boldsymbol\beta\) into two subvectors: \[\boldsymbol\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{k-r} \\ \beta_{k-r+1} \\ \vdots \\ \beta_k\end{bmatrix} = \begin{bmatrix} \boldsymbol\beta_A \\ \boldsymbol\beta_B \end{bmatrix}\] Similarly, partition \(\bmX\) into two parts: \[\bmX = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1,k-r} & x_{1, k-r + 1} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{n,k-r} & x_{n, k-r + 1} & \cdots & x_{nk} \end{bmatrix} = \begin{bmatrix} \bmX_A & \bmX_B \end{bmatrix}\]

Without loss of generality, we assume that the reduced model corresponds to setting \(\boldsymbol\beta_B = \mathbf{0}\).12 The Full Model is: \[\bmy = \bmX\boldsymbol\beta + \boldsymbol\epsilon = \bmX_A\boldsymbol\beta_A + \bmX_B\boldsymbol\beta_B + \boldsymbol\epsilon\]
The Reduced Model is: \[\bmy = \bmX_A\boldsymbol\beta_A + \boldsymbol\epsilon\] This gives us the general form of the null and alternative hypotheses for a partial F-test: \[H_0: \boldsymbol\beta_B = \mathbf{0} \text{ vs. } H_A: \boldsymbol\beta_B \ne \mathbf{0}\] Let \(SS_{Reg}(\boldsymbol\beta)\) denote the usual regression sum of squares from the full model and \(SS_{Reg}(\boldsymbol\beta_A)\) denote the regression sum of squares from the reduced model. Then \(SS_{Reg}(\boldsymbol\beta) - SS_{Reg}(\boldsymbol\beta_A)\) is the “extra sum of squares” due to \(\boldsymbol\beta_B\) and we can write the \(f\) statistic from (10.5) as \[f = \dfrac{(SS_{Reg}(\boldsymbol\beta) - SS_{Reg}(\boldsymbol\beta_A)) / r}{SS_{Res}/(n - p)}\] Again, we complete the one-sided test by comparing to an \(F_{r, n-p}\) distribution to obtain the \(p\)-value.

To perform a partial F-test in R:

  • Fit the full model
  • Fit the reduced model
  • Use anova(reduced_mod, full_mod)

Example 10.4 Is there any linear relationship between average photosynthesis output and warming treatment and leaf temperature in a model that already contains soil water content?

We first fit the full model:

ph_lm <- lm(photosyn~soil_water + warming_treatment  + tleaf,
            data=photo)

and then the reduced model:

ph_lm_reduced <- lm(photosyn~soil_water,
            data=photo)

For comparing the models, we use the anova() command:

anova(ph_lm_reduced, ph_lm)
## Analysis of Variance Table
## 
## Model 1: photosyn ~ soil_water
## Model 2: photosyn ~ soil_water + warming_treatment + tleaf
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1   1313 37358                                  
## 2   1311 36803  2    555.45 9.8933 5.439e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The column RSS gives \(SS_{res}^{Red}\) and \(SS_{res}^{Full}\). Since \(SS_{res}^{Red} - SS_{res}^{Full} = SS_{reg}^{Full} - SS_{reg}^{Red}\), we can use these values to get the extra sum of squares, which is provided in the Sum of Sq column. The right two columns provide the value of \(f\) and the \(p\)-value.
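As a check, we can reproduce \(f\) from the two models’ residual sums of squares (deviance() returns \(SS_{res}\) for an lm object):

ss_res_red <- deviance(ph_lm_reduced)   # SS_res for the reduced model
ss_res_full <- deviance(ph_lm)          # SS_res for the full model
r <- 2                                  # number of coefficients set to zero
f_stat <- ((ss_res_red - ss_res_full)/r)/(ss_res_full/df.residual(ph_lm))
pf(f_stat, df1=r, df2=df.residual(ph_lm), lower.tail=FALSE)  # p-value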

We reject \(H_0\) and conclude that there is a linear relationship between average photosynthesis output and warming treatment and leaf temperature in a model that already contains soil water content (\(p < 0.0001\)).

Example 10.5 Is there any linear relationship between average photosynthesis output and leaf temperature in a model that already contains soil water content and an indicator of warming treatment?

For this example, our null hypothesis is \(H_0: \beta_3 = 0\). If we use the partial F-test for this hypothesis, we obtain:

ph_lm_reduced2 <- lm(photosyn~soil_water + warming_treatment, data=photo)
anova(ph_lm_reduced2, ph_lm)
## Analysis of Variance Table
## 
## Model 1: photosyn ~ soil_water + warming_treatment
## Model 2: photosyn ~ soil_water + warming_treatment + tleaf
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1   1312 36825                           
## 2   1311 36803  1    21.598 0.7694 0.3806

We would thus fail to reject the null hypothesis. This is the exact same result we would have obtained if we had used a \(t\)-test. In fact, we once again have the relationship \(f = t^2\), when the partial \(f\)-test is for setting one (and only one) coefficient equal to zero.
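We can confirm this numerically from the standard output:

t_tleaf <- summary(ph_lm)$coefficients["tleaf", "t value"]
t_tleaf^2   # approximately 0.769, matching the partial F statistic above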

10.7 Adjusted \(R^2\)

In Section 6.2, we introduced the coefficient of determination, \(R^2\), as a measure of model fit. It might seem natural to use \(R^2\) to compare different models, but it suffers from one important drawback: adding a variable to a model cannot decrease \(R^2\). In fact, \(R^2\) will almost always go up when a variable is added. This is a mathematical fact, and it holds regardless of whether the new variable is related to the outcome.

To get around this, we can compare models using adjusted \(R^2\): \[\begin{align*} R^2_{Adj} &= 1 - \frac{SS_{Res}/(n - p)}{SS_{Tot}/(n-1)} \end{align*}\] To identify the differences, it helps to compare \(R^2_{Adj}\) to the regular version of \(R^2\): \[\begin{align*} R^2 &= 1 - \frac{SS_{Res}}{SS_{Tot}}\\ \end{align*}\] In \(R^2_{Adj}\), \(SS_{Res}\) is divided by the residual degrees of freedom (\(n-p\)) and \(SS_{Tot}\) is divided by the total degrees of freedom (\(n-1\)). If we were to add a variable that did not help explain the outcome variable, then \(SS_{Res}\) would stay essentially the same but \(SS_{Res}/(n- p)\) would increase slightly, since \(n-p\) decreases. This results in a smaller value of \(R^2_{Adj}\). As the sample size gets larger (i.e., as \(n \to \infty\)), \(R^2_{Adj}\) approaches the value of \(R^2\).
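A sketch of this computation, checked against the value R reports:

ss_res <- deviance(ph_lm)                # SS_Res
y <- model.response(model.frame(ph_lm))  # observed outcomes
ss_tot <- sum((y - mean(y))^2)           # SS_Tot
n <- length(y)
p <- length(coef(ph_lm))                 # p = k + 1
1 - (ss_res/(n - p))/(ss_tot/(n - 1))    # adjusted R^2
summary(ph_lm)$adj.r.squared             # matches (0.1354)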

It’s important to note that while we can use \(R_{Adj}^2\) for model comparison, it no longer represents the percentage of variability in the outcome explained by variation in the predictors. That interpretation still belongs to \(R^2\).

10.8 Estimation and Prediction of the Mean

To complete our description of inference for the MLR model, we now turn to estimation and prediction of the mean.

Let \(\bmx_0^\mT = \begin{bmatrix} 1 & x_{01} & x_{02} & \cdots & x_{0k}\end{bmatrix}\) denote a specific set of predictor variables.
The estimated mean corresponding to these values is \[\hat\mu_{0} = \bmx_0^\mT\hat{\boldsymbol\beta} = \hat\beta_0 + x_{01}\hat\beta_1 + \dots + x_{0k}\hat\beta_k\] As with SLR, this is the same value as a predicted observation, \(\hat y_0 = \bmx_0^\mT\hat{\boldsymbol\beta}\).

10.8.1 Confidence Intervals for the Mean

To represent uncertainty, we use a confidence interval for the mean. This requires calculating the standard error of \(\hat\mu_0\). Using matrix notation, calculating the variance of \(\hat\mu_0\) is straightforward: \[\begin{align*} \Var(\hat \mu_0) & = \Var(\bmx_0^\mT\hat{\boldsymbol\beta}) \\ &= \bmx_0^\mT \Var(\hat{\boldsymbol\beta})\bmx_0 \\ &= \bmx_0^\mT\left(\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0 \\ \end{align*}\] We estimate \(\Var(\hat \mu_0)\) with: \[\widehat{\Var}(\hat \mu_0) = \bmx_0^\mT\left(\hat\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0\] and so the standard error is: \[\widehat{se}(\hat\mu_0) = \sqrt{\bmx_0^\mT\left(\hat\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0}\] This gives us the following form for the CI for the mean: \[(\hat\mu_0 - t_{1-\alpha/2}\widehat{se}(\hat\mu_0), \hat\mu_0 + t_{1-\alpha/2}\widehat{se}(\hat\mu_0))\] As before, the degrees of freedom used to calculate \(t_{1 - \alpha/2}\) is \(n-p\).
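A sketch of this calculation for a single vector of predictor values \(\bmx_0\) (the same values used in Example 10.6 below), using the fact that vcov(ph_lm) returns \(\hat\sigma^2(\bmX^\mT\bmX)^{-1}\):

x0 <- c(1, 0.15, 0, 22)  # intercept, soil_water, warming indicator (ambient = 0), tleaf
mu0hat <- sum(x0*coef(ph_lm))                       # estimated mean
se_mu0 <- sqrt(drop(t(x0) %*% vcov(ph_lm) %*% x0))  # standard error of mu0-hat
tq <- qt(0.975, df=df.residual(ph_lm))
c(Lower=mu0hat - tq*se_mu0, Upper=mu0hat + tq*se_mu0)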

In R, we again use the predict() command to calculate the mean (or a prediction) and associated intervals. For the newdata= argument, we must now provide a data frame with the complete set of predictor variable values. Setting interval="confidence" will give the CI as output.

Example 10.6 What is the estimated mean photosynthesis output for trees that had a soil water content ratio of 0.15, were not warmed (so warming_treatment="ambient"), and had a leaf temperature of 22 degrees?

We can find this easily using R:

pred_data <- data.frame(soil_water=0.15,
                        warming_treatment="ambient",
                        tleaf=22)
predict(ph_lm, newdata=pred_data,
        interval="confidence")
##        fit      lwr      upr
## 1 9.480918 9.006312 9.955524

The estimated mean photosynthesis output for trees with a soil water content ratio of 0.15, that are not warmed, and have a leaf temperature of 22 degrees is 9.48 (95% CI 9.00, 9.96).

Example 10.7 What is the estimated mean photosynthesis output for trees that had a soil water content ratio of 0.20, were warmed (so warming_treatment="warmed") and had a leaf temperature of 19 degrees?

We can follow the same procedure as Example 10.6, and in fact compute both answers at the same time by passing a data frame with multiple rows to newdata:

pred_data2 <- data.frame(soil_water=c(0.15, 0.2),
                        warming_treatment=c("ambient",
                                            "warmed"),
                        tleaf=c(22, 19))
predict(ph_lm, newdata=pred_data2,
        interval="confidence")
##         fit       lwr       upr
## 1  9.480918  9.006312  9.955524
## 2 12.976911 12.212084 13.741737

10.8.2 Prediction Intervals for New Observations

To make a prediction interval, we extend the CI for the mean in the same way as SLR (Section 5.3). The prediction interval is: \[(\hat y_0 - t_{1-\alpha/2}\sqrt{\hat\sigma^2 \left[\bmx_0^\mT (\bmX^\mT\bmX)^{-1}\bmx_0 + 1\right]}, \hat y_0 + t_{1-\alpha/2}\sqrt{\hat\sigma^2 \left[\bmx_0^\mT (\bmX^\mT\bmX)^{-1}\bmx_0 + 1\right]})\] The additional uncertainty in predicting a new observation, compared to a mean, can lead to quite wide prediction intervals, depending on the value of \(\hat\sigma^2\).
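Continuing the sketch from Section 10.8.1, the prediction interval simply adds \(\hat\sigma^2\) to the variance:

sigma2hat <- summary(ph_lm)$sigma^2   # estimate of sigma^2
var_pred <- drop(t(x0) %*% vcov(ph_lm) %*% x0) + sigma2hat
c(Lower=mu0hat - tq*sqrt(var_pred), Upper=mu0hat + tq*sqrt(var_pred))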

In R, these can be computed by setting interval="prediction" inside predict().

Example 10.8 What is the predicted photosynthesis output for a tree that had a soil water content ratio of 0.15, was not warmed (so warming_treatment="ambient") and had a leaf temperature of 22 degrees?

predict(ph_lm, newdata=pred_data,
        interval="prediction")
##        fit       lwr     upr
## 1 9.480918 -0.924065 19.8859

The predicted photosynthesis output for a tree with a soil water content ratio of 0.15, that was not warmed, and had a leaf temperature of 22 degrees is 9.48 (95% PI -0.92, 19.89).

10.9 Exercises

Exercise 10.1 In model (10.1), state in words what the null hypothesis \(H_0: \beta_2 = 0\) means. Conduct the test at the \(\alpha=0.05\) level and summarize your conclusions in one or two sentences.

Exercise 10.2 Calculate a 95% confidence interval for \(\beta_3\) in (10.1). Use the interval to test \(H_0: \beta_3 = 0\) at the \(\alpha=0.05\) level.


  9. Reich, P.B., A. Stefanski, K.M. Sendall, and R.L. Rich. 2018. Photosynthetic data on experimentally warmed tree species in northern Minnesota, 2009-2011, used in the paper Reich et al. Nature 2018. ver 2. Environmental Data Initiative. https://doi.org/10.6073/pasta/258239f68244c959de0f97c922ac313f

  10. For this example, we are ignoring some features of the design such as correlation by plot.

  11. However, this leads to issues of multiple comparisons that need to be addressed.

  12. If we want to set a different subset equal to zero, just rearrange the vector.