Chapter 8 The Multiple Linear Regression (MLR) Model

\(\newcommand{\E}{\mathrm{E}}\) \(\newcommand{\Var}{\mathrm{Var}}\)

8.1 Multiple Linear Regression

Simple linear regression (SLR) gave us a tool to model the relationship between a predictor (\(x\)) and an outcome (\(Y\)):

\[Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i\]

This has a clear drawback: most real-world outcomes are impacted by more than one variable. Multiple linear regression (MLR) extends SLR to include multiple predictors:

\[\begin{equation} Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_kx_{ik} + \epsilon_i \tag{8.1} \end{equation}\]

As we will see, the mathematical details of SLR extend readily to having more than one predictor variable. However, the graphical representations are often more difficult to create.

The multiple linear regression model (8.1) has analogous assumptions to simple linear regression:

  • \(\E[\epsilon_i] = 0\)
  • \(\Var(\epsilon_i) = \sigma^2\)
  • \(\epsilon_i\) are uncorrelated
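
One consequence of the first assumption is used repeatedly in this chapter, so it is worth writing out. As a short sketch, treating the predictor values as fixed and taking the expectation of (8.1) gives:

\[\begin{align*} \E[Y_i] &= \E[\beta_0 + \beta_1x_{i1} + \dots + \beta_kx_{ik} + \epsilon_i]\\ &= \beta_0 + \beta_1x_{i1} + \dots + \beta_kx_{ik} + \E[\epsilon_i]\\ &= \beta_0 + \beta_1x_{i1} + \dots + \beta_kx_{ik} \end{align*}\]

So the non-random part of the model describes the mean of the outcome at the given predictor values.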

In Section 9.4, we will see an alternative form of the MLR model that uses matrix algebra to simplify computations and to derive properties of the parameter estimates directly from these assumptions.

But first, in this chapter we explore how including additional predictor variables changes the effects and interpretations of the coefficient parameters \(\beta_j\).

8.2 MLR Model 1: One continuous and one binary predictor

Consider once again the penguin data with body mass as the outcome. But now we will use a model that includes multiple predictor variables. From the data plotted in Figure 8.1, we can see two trends:

  • Penguins with longer flippers tend to have greater body mass
  • Male penguins tend to have greater body mass than female penguins

Figure 8.1: Flipper length and body mass in the Palmer Penguin dataset.

In Examples 4.7 and 4.8, respectively, we showed there was significant evidence for these two trends. But what happens when we include both variables in the model at the same time?

Let’s use the following notation for modeling the penguin data:

  • \(Y_i =\) Body mass (in grams) for penguin \(i\)
  • \(x_{i1} =\) Flipper length (in mm) for penguin \(i\)
  • \(x_{i2} =\) Indicator of sex for penguin \(i\). \(0 =\) female, \(1=\) male.

Instead of the SLR Model \(Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i\), we can use the model

\[\begin{equation} Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i \tag{8.2} \end{equation}\]

From equation (8.2), we can construct two different regression lines. For female penguins, \(x_{i2} = 0\), so (8.2) reduces to:

\[\begin{align*} Y_i &= \beta_0 + \beta_1x_{i1} + \beta_2(0) + \epsilon_i\\ &= \beta_0 + \beta_1x_{i1} + \epsilon_i \end{align*}\]

If we take expectations, we find that the regression line for mean body mass of female penguins is:

\[\begin{equation} \E[Y_i | x_{i1} = x_{i1}, x_{i2} = 0] = \beta_0 + \beta_1x_{i1} \end{equation}\]

Notice here how we are using the notation \(\E[Y_i | x_{i1} = x_{i1}, x_{i2} = 0]\) to denote the expected value of \(Y_i\) for observations with \(x_{i1} = x_{i1}\) and \(x_{i2} = 0\). This is an example of general notation for the expectation of \(Y\) conditional on specific values of the predictor variables \(x_{ij}\).

For male penguins, \(x_{i2} = 1\), so (8.2) reduces to:

\[\begin{align*} Y_i &= \beta_0 + \beta_1x_{i1} + \beta_2(1) + \epsilon_i\\ &= (\beta_0 + \beta_2) + \beta_1x_{i1} + \epsilon_i \end{align*}\]

Taking the expectation of this equation gives:

\[\begin{equation} \E[Y_i| x_{i1} = x_{i1}, x_{i2} = 1] = (\beta_0 + \beta_2) + \beta_1x_{i1} \end{equation}\]

The key difference between the regression lines for female and male penguins is the addition of \(\beta_2\), which changes the intercept. For female penguins, the intercept of the line is \(\beta_0\), while for male penguins it is \(\beta_0 + \beta_2\). Both lines still have the same slope, \(\beta_1\), so they are parallel. Graphically, we can represent this as:


Figure 8.2: Flipper length and body mass in the Palmer Penguin dataset.
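
The two parallel lines can also be checked numerically. The following is a minimal sketch in Python, not a fit to the penguin data: the coefficient values are hypothetical, and the names `mean_body_mass` and `line_for_sex` are introduced here only for illustration.

```python
# Hypothetical coefficient values, chosen only for illustration.
# They are NOT estimates from the penguin data.
BETA_0 = -5000.0  # intercept, in grams
BETA_1 = 45.0     # slope for flipper length, in grams per mm
BETA_2 = 350.0    # shift in intercept for male penguins, in grams


def mean_body_mass(flipper_mm, male):
    """Mean body mass under model (8.2); male is 0 (female) or 1 (male)."""
    return BETA_0 + BETA_1 * flipper_mm + BETA_2 * male


def line_for_sex(male):
    """Return (intercept, slope) of the mean line for one sex."""
    intercept = mean_body_mass(0, male)
    # The slope is the change in the mean for a 1 mm longer flipper.
    slope = mean_body_mass(1, male) - intercept
    return intercept, slope


female_line = line_for_sex(0)
male_line = line_for_sex(1)

print("female (intercept, slope):", female_line)
print("male   (intercept, slope):", male_line)
print("difference in intercepts:", male_line[0] - female_line[0])
```

With these assumed values, both lines have the same slope (`BETA_1`) and their intercepts differ by exactly `BETA_2`.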

Consider the following groups of penguins:

  • Group A: Female penguins with 200mm flippers
  • Group B: Female penguins with 190mm flippers
  • Group C: Male penguins with 200mm flippers
  • Group D: Male penguins with 190mm flippers
Example 8.1 According to the MLR model (8.2), what is the difference in average body mass between penguins in Group A and Group B?

To answer this, let’s first write out the equation of the mean body mass for each group of penguins:

\[\text{Group A:} \quad \E[Y_i | x_{i1} = 200, x_{i2} = 0 ] = \beta_0 + \beta_1*200\]

\[\text{Group B:} \quad \E[Y_i | x_{i1} = 190, x_{i2} = 0 ] = \beta_0 + \beta_1*190\]

The difference between these is:

\[\begin{align*} \E[Y_i | x_{i1} = 200, x_{i2} = 0 ] - \E[Y_i | x_{i1} = 190, x_{i2} = 0 ] &= \left(\beta_0 + \beta_1*200 \right) - \left(\beta_0 + \beta_1*190\right)\\ & = 200\beta_1 - 190\beta_1\\ &= 10\beta_1 \end{align*}\]

So for female penguins that differ in flipper length by 10mm, the difference in their average body mass is \(10\beta_1\).

Example 8.2 According to the MLR model (8.2), what is the difference in average body mass between penguins in Group C and Group D?

We follow the same procedure, first finding the equation for the mean body mass in each group and then computing their difference.

\[\text{Group C:} \quad \E[Y_i | x_{i1} = 200, x_{i2} = 1 ] = \beta_0 + \beta_2 + \beta_1*200\]

\[\text{Group D:} \quad \E[Y_i | x_{i1} = 190, x_{i2} = 1 ] = \beta_0 + \beta_2 + \beta_1*190\]

\[\begin{align*} \E[Y_i | x_{i1} = 200, x_{i2} = 1 ] - \E[Y_i | x_{i1} = 190, x_{i2} = 1 ] &= \left(\beta_0 + \beta_2 + \beta_1*200 \right) - \left(\beta_0 + \beta_2 + \beta_1*190\right)\\ & = 200\beta_1 - 190\beta_1\\ &= 10\beta_1 \end{align*}\]

So for male penguins that differ in flipper length by 10mm, the difference in their average body mass is \(10\beta_1\).

Example 8.3 According to the MLR model (8.2), what is the difference in average body mass between penguins in Group C and Group A?

\[\begin{align*} \E[Y_i | x_{i1} = 200, x_{i2} = 1 ] - \E[Y_i | x_{i1} = 200, x_{i2} = 0 ] &= \left(\beta_0 + \beta_2 + \beta_1*200 \right) - \left(\beta_0 + \beta_1*200\right)\\ & = \beta_2 \end{align*}\]

We would obtain the same difference if we compared Group D to Group B. So for penguins with the same flipper length, the difference in average body mass between male penguins and female penguins is \(\beta_2\).
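
The group comparisons in Examples 8.1 to 8.3 can be reproduced with a few lines of Python. This is only a sketch: the coefficient values are hypothetical (not estimated from the penguin data), and the function and variable names are introduced here for illustration.

```python
# Hypothetical coefficient values for model (8.2); NOT estimates from data.
BETA_0, BETA_1, BETA_2 = -5000.0, 45.0, 350.0


def mean_body_mass(flipper_mm, male):
    """Mean body mass under model (8.2); male is 0 (female) or 1 (male)."""
    return BETA_0 + BETA_1 * flipper_mm + BETA_2 * male


# Each group is (flipper length in mm, sex indicator).
groups = {
    "A": (200, 0),  # female, 200mm flippers
    "B": (190, 0),  # female, 190mm flippers
    "C": (200, 1),  # male, 200mm flippers
    "D": (190, 1),  # male, 190mm flippers
}
means = {name: mean_body_mass(*x) for name, x in groups.items()}

diff_a_b = means["A"] - means["B"]  # Example 8.1: expect 10 * BETA_1
diff_c_d = means["C"] - means["D"]  # Example 8.2: expect 10 * BETA_1
diff_c_a = means["C"] - means["A"]  # Example 8.3: expect BETA_2
diff_d_b = means["D"] - means["B"]  # same comparison at 190mm: expect BETA_2

print(diff_a_b, diff_c_d, diff_c_a, diff_d_b)
```

Whatever values are chosen for the coefficients, the first two differences equal `10 * BETA_1` and the last two equal `BETA_2`, matching the algebra above.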

8.3 MLR Model 2: Two continuous predictors

Instead of modelling body mass using flipper length and sex, we could model body mass using flipper length and bill length. Mathematically, this means considering a model with two continuous predictor variables.

First, we can see graphically that there appears to be a positive correlation between bill length, flipper length, and body mass.

We can again use equation (8.2) as our model, but now with

  • \(Y_i =\) Body mass (in grams) for penguin \(i\)
  • \(x_{i1} =\) Flipper length (in mm) for penguin \(i\)
  • \(x_{i2} =\) Bill length (in mm) for penguin \(i\)
Example 8.4 What is the difference in average body mass for penguins with the same flipper length and that differ in bill length by 1 mm?

In this example, we don’t know the specific flipper length of the penguins, but we are told that they have the same length. So when computing their mean body mass, we can use a variable (\(x_1\)) to represent this value. We also don’t know what their bill lengths are, except that they differ by one unit. We can use \(x_2 + 1\) and \(x_2\) to denote these two quantities. The difference in average body mass between the specified groups of penguins is:

\[\begin{align*} \E[Y_i | x_{i1} = x_1, x_{i2} = x_2 + 1 ] - \E[Y_i | x_{i1} = x_1, x_{i2} = x_2 ] &= \left(\beta_0 + \beta_1*x_1 + \beta_2(x_2 + 1)\right) - \left(\beta_0 + \beta_1*x_1 + \beta_2x_2\right)\\ & = (x_2 + 1)\beta_2 - x_2\beta_2\\ &= \beta_2 \end{align*}\]

By the same procedure, we could find that the difference in average body mass for penguins with the same bill length that differ in flipper length by 1mm is \(\beta_1\).
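
A quick numerical check makes the point that this difference does not depend on the particular values of \(x_1\) and \(x_2\). The sketch below is in Python and uses hypothetical coefficient values (not estimates from the penguin data); the function name and the grids of predictor values are chosen only for illustration.

```python
# Hypothetical coefficient values for model (8.2) with two continuous
# predictors; NOT estimates from the penguin data.
BETA_0, BETA_1, BETA_2 = -5700.0, 48.0, 6.0


def mean_body_mass(flipper_mm, bill_mm):
    """Mean body mass given flipper length and bill length (both in mm)."""
    return BETA_0 + BETA_1 * flipper_mm + BETA_2 * bill_mm


flipper_values = (180, 200, 220)  # arbitrary illustrative values of x_1
bill_values = (35, 45, 55)        # arbitrary illustrative values of x_2

# 1 mm difference in bill length, flipper length held fixed.
bill_diffs = [
    mean_body_mass(x1, x2 + 1) - mean_body_mass(x1, x2)
    for x1 in flipper_values
    for x2 in bill_values
]

# 1 mm difference in flipper length, bill length held fixed.
flipper_diffs = [
    mean_body_mass(x1 + 1, x2) - mean_body_mass(x1, x2)
    for x1 in flipper_values
    for x2 in bill_values
]

print(set(bill_diffs))     # a single value, equal to BETA_2
print(set(flipper_diffs))  # a single value, equal to BETA_1
```

Every entry of `bill_diffs` equals `BETA_2` and every entry of `flipper_diffs` equals `BETA_1`, regardless of where on the grid the comparison is made.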

8.4 Interpreting \(\beta_j\) in the general MLR model

For the MLR model with two predictor variables, the coefficient parameters can be interpreted as:

  • \(\beta_0 =\) Average value of \(Y_i\) for observations with \(x_{i1}=0\) and \(x_{i2}=0\)
  • \(\beta_1 =\) Difference in average value of \(Y_i\) for a 1-unit difference in \(x_{i1}\) among observations with the same value of \(x_{i2}\)
  • \(\beta_2 =\) Difference in average value of \(Y_i\) for a 1-unit difference in \(x_{i2}\) among observations with the same value of \(x_{i1}\)

These can be generalized to an MLR model with \(p-1\) predictor variables (that is, \(k = p-1\) in (8.1)):

  • \(\beta_0 =\) Average value of \(Y_i\) when all the \(x\)’s are zero
  • \(\beta_j =\) Difference in average value of \(Y_i\) for a 1-unit difference in \(x_{ij}\) among observations with the same values of all other \(x\)’s
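
The general interpretation can be illustrated with a small Python sketch. The helper names `mean_response` and `unit_difference`, the coefficient values, and the predictor values are all hypothetical and serve only to show the idea.

```python
def mean_response(betas, xs):
    """Mean of Y for betas = [beta_0, ..., beta_k] and xs = [x_1, ..., x_k]."""
    if len(betas) != len(xs) + 1:
        raise ValueError("need one more coefficient than predictors")
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


def unit_difference(betas, xs, j):
    """Difference in mean response when x_j (1-based) is 1 unit larger
    and all other predictors keep the same values."""
    shifted = list(xs)
    shifted[j - 1] += 1
    return mean_response(betas, shifted) - mean_response(betas, xs)


# Hypothetical coefficients and predictor values, for illustration only.
betas = [10.0, 2.0, -3.0, 5.0]  # beta_0, beta_1, beta_2, beta_3
xs = [4.0, 7.0, 1.0]            # x_1, x_2, x_3

intercept_mean = mean_response(betas, [0.0, 0.0, 0.0])
diffs = [unit_difference(betas, xs, j) for j in (1, 2, 3)]

print(intercept_mean)  # equals beta_0
print(diffs)           # equals [beta_1, beta_2, beta_3]
```

Setting every predictor to zero recovers \(\beta_0\), and changing one predictor by a single unit while holding the others fixed recovers the corresponding \(\beta_j\).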