Chapter 17 Model Selection for Association

\(\newcommand{\E}{\mathrm{E}}\) \(\newcommand{\Var}{\mathrm{Var}}\) \(\newcommand{\bmx}{\mathbf{x}}\) \(\newcommand{\bmH}{\mathbf{H}}\) \(\newcommand{\bmI}{\mathbf{I}}\) \(\newcommand{\bmX}{\mathbf{X}}\) \(\newcommand{\bmy}{\mathbf{y}}\) \(\newcommand{\bmY}{\mathbf{Y}}\) \(\newcommand{\bmbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bmepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bmmu}{\boldsymbol{\mu}}\) \(\newcommand{\bmSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\XtX}{\bmX^\mT\bmX}\) \(\newcommand{\mT}{\mathsf{T}}\) \(\newcommand{\XtXinv}{(\bmX^\mT\bmX)^{-1}}\)

Acknowledgment: Some of the information in this section was based upon ideas from Scott Emerson and Barbara McKnight.

17.1 Model Misspecification – Mathematical consequences

Suppose we have \(\bmX = \begin{bmatrix} \bmX_q & \bmX_r \end{bmatrix}\) and \(\bmbeta = \begin{bmatrix} \bmbeta_q \\ \bmbeta_r \end{bmatrix}\). Assume the true regression model is: \[\bmy = \bmX\bmbeta + \bmepsilon = \bmX_q\bmbeta_q + \bmX_r\bmbeta_r + \bmepsilon\] with the usual assumptions of \(\E[\bmepsilon] = 0\) and \(\Var(\bmepsilon) = \sigma^2 \bmI\).

17.1.1 Correct model

Suppose we fit the correct model

  • Model we fit is: \(\bmy = \bmX_q\bmbeta_q + \bmX_r\bmbeta_r + \bmepsilon\)
  • OLS estimator is: \[\hat\bmbeta = (\bmX^\mT\bmX)^{-1}\bmX^\mT\bmy\]
  • \(\hat\bmbeta\) is unbiased: \[\E[\hat\bmbeta] = (\bmX^\mT\bmX)^{-1}\bmX^\mT\E[\bmy] =(\bmX^\mT\bmX)^{-1}\bmX^\mT(\bmX\bmbeta) = \bmbeta\]
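
The matrix formula can be checked numerically. Below is a minimal sketch using simulated data (the variable names and settings are illustrative, not from the text) showing that \((\bmX^\mT\bmX)^{-1}\bmX^\mT\bmy\) reproduces the coefficients returned by lm().

# Simulated data (illustrative): verify the matrix form of the OLS estimator
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))      # design matrix with an intercept column
beta <- c(1, 2, -1)
y <- drop(X %*% beta) + rnorm(n)
solve(t(X) %*% X) %*% t(X) %*% y       # (X^T X)^{-1} X^T y
coef(lm(y ~ X[, -1]))                  # same estimates from lm()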

17.1.2 Not including all variables

What happens if we fit model only using \(\bmX_q\) (leaving out the \(\bmX_r\) variables)?

  • Model we fit is \(\bmy = \bmX_q\bmbeta_q + \bmepsilon\)
  • OLS estimator is: \[\hat\bmbeta_q = (\bmX_q^\mT\bmX_q)^{-1}\bmX_q^\mT\bmy\]
  • Expected value of \(\hat\bmbeta_q\) is: \[\E[\hat\bmbeta_q] = (\bmX_q^\mT\bmX_q)^{-1}\bmX_q^\mT\E[\bmy] = \bmbeta_q + (\bmX_q^\mT\bmX_q)^{-1}\bmX_q^\mT\bmX_r\bmbeta_r\]
    • This means \(\hat\bmbeta_q\) is biased, unless \(\bmbeta_r = \mathbf{0}\) or \(\bmX_q^\mT\bmX_r = \mathbf{0}\) (the omitted predictors are orthogonal to the included predictors).
    • Size and direction of bias depends on \(\bmbeta_r\) and correlation between variables.
  • It can be shown that \(\hat\sigma^2\) is biased high.

Key Result 1: Not including necessary variables can bias our results.
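
Key Result 1 can be seen in a minimal simulation sketch (the data and settings below are made up for illustration): the outcome depends on two correlated predictors, but the fitted model omits one of them.

# Simulated data (illustrative): omitted-variable bias
set.seed(17)
n <- 1000
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)        # x1 and x2 are correlated
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
coef(lm(y ~ x1))                 # coefficient on x1 is biased away from 2
coef(lm(y ~ x1 + x2))            # coefficient on x1 is approximately 2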

17.1.3 Including too many variables

What happens if we include \(\bmX_q\), \(\bmX_r\), and \(\bmX_s\)?

  • Model we fit is: \(\bmy = \bmX_q\bmbeta_q + \bmX_r\bmbeta_r + \bmX_s\bmbeta_s + \bmepsilon = \bmX^*\bmbeta^* + \bmepsilon\)
  • OLS estimator is: \[\hat\bmbeta^* = (\begin{bmatrix}\bmX & \bmX_s\end{bmatrix}^\mT\begin{bmatrix}\bmX & \bmX_s\end{bmatrix})^{-1}\begin{bmatrix}\bmX & \bmX_s\end{bmatrix}^\mT\bmy\]
  • Expected value of \(\hat\bmbeta^*\) is: \[\E[\hat\bmbeta^*] = (\bmX^{*\mT}\bmX^*)^{-1}\bmX^{*\mT}\E[\bmy] = (\bmX^{*\mT}\bmX^*)^{-1}\bmX^{*\mT}\bmX^*\begin{bmatrix}\bmbeta \\ \mathbf{0}\end{bmatrix} = \begin{bmatrix}\bmbeta \\ \mathbf{0}\end{bmatrix}\] since \(\E[\bmy] = \bmX\bmbeta = \bmX^*\begin{bmatrix}\bmbeta \\ \mathbf{0}\end{bmatrix}\).
    • Our estimate of \(\bmbeta\) is unbiased!
  • It can be shown that \(\Var(\hat\bmbeta)\) will increase when extraneous variables are added.

Key Result 2: Adding extra variables does not bias our results, but can increase the variances of \(\hat\bmbeta\) (and thus reduce power).
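
Key Result 2 can be illustrated the same way (again with made-up data): adding an extraneous variable that is correlated with the predictor of interest leaves the estimate unbiased but inflates its standard error.

# Simulated data (illustrative): extra variables increase Var(beta-hat)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)    # extraneous variable, correlated with x1
y <- 1 + 2 * x1 + rnorm(n)             # y does not depend on x3
summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]       # smaller standard error
summary(lm(y ~ x1 + x3))$coefficients["x1", "Std. Error"]  # larger standard error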

17.2 Confounding

17.2.1 Confounders

Confounding is the distortion of a predictor-outcome relationship caused by one or more additional variables and their relationships to both the predictor and the outcome.

A variable (\(C\)) is a confounder of the relationship between predictor of interest (\(X\)) and outcome (\(Y\)) if:

  • \(C\) is causally related to the outcome \(Y\) in the population
  • \(C\) is causally related to the predictor of interest \(X\)
  • \(C\) is not on the causal pathway between \(X\) and \(Y\) (i.e. it is not a consequence of \(X\))

17.2.2 Directed Acyclic Graphs (DAGs)

DAGs are used to represent causal relationships

  • Outcome, predictor of interest, and other variables are “nodes”
  • Arrows between two nodes denote a causal relationship
  • “Acyclic” = no closed loops
  • DAGs generally show just the existence of a relationship, not its strength or magnitude
  • Nodes not connected are presumed to have no causal relationship
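
DAGs can also be written down in code. The sketch below uses the dagitty package (assuming it is available; the node names are illustrative) to encode the basic confounding structure in which \(C\) affects both \(X\) and \(Y\), and \(X\) affects \(Y\).

# Illustrative DAG using the dagitty package (assumed to be installed)
library(dagitty)
dag <- dagitty("dag {
  X -> Y
  C -> X
  C -> Y
}")
plot(graphLayout(dag))    # draw the DAG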

17.2.2.1 DAG Example

  • \(C\) is a confounder
  • \(Z\), \(M\), and \(W\) are not confounders

17.2.3 Confounding Example: FEV in Children

Is there a relationship between smoking and lung function in children?

  • Outcome: FEV, Forced Expiratory Volume (fev)
    • Higher value is better
  • Predictor of Interest: Smoking Status (smoke, 1 = Yes, 0 = No)

# Unadjusted model for FEV by smoking status; tidy() is from the broom package
fev_mod1 <- lm(fev~smoke, data=childfev)
tidy(fev_mod1, conf.int=TRUE)
## # A tibble: 2 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    2.57     0.0347     74.0  1.49e-319    2.50      2.63 
## 2 smoke          0.711    0.110       6.46 1.99e- 10    0.495     0.927

From this model, we would conclude:

Children who smoke have on average a 0.71 L greater FEV (95% CI: 0.50, 0.93) than children who do not smoke (\(p<0.0001\)).

What’s going on here?

  • Older children have higher FEV
  • Older children are more likely to be smokers

\(\Rightarrow\) Age is a confounder of the relationship between smoking and FEV

Let’s fit a model that also adjusts for age.

# Adjusted model: add age, the suspected confounder
fev_mod2 <- lm(fev~smoke + age, data=childfev)
tidy(fev_mod2, conf.int=TRUE)
## # A tibble: 3 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.367   0.0814       4.51 7.65e-  6    0.207    0.527 
## 2 smoke         -0.209   0.0807      -2.59 9.86e-  3   -0.368   -0.0504
## 3 age            0.231   0.00818     28.2  8.28e-115    0.215    0.247

Children who smoke have on average a 0.21 L lower FEV (95% CI: -0.37, -0.05) than children of the same age who do not smoke (\(p=0.010\)).

  • By adjusting for the confounder in the MLR model, we can correctly estimate the relationship between smoking and FEV.

17.2.4 Accounting for Confounders

  • Confounding can have a drastic impact on our results
    • If ignored, it can lead to completely wrong conclusions
  • Account for confounding variables by adjusting for them in regression
  • Always interpret your results in the context of what is in the model

17.2.5 Confounding & Randomized Experiments

Why are randomized experiments so popular?

  • General approach:
    • Randomly assign experimental units (rats, trees, people, etc.) to a treatment condition (food, fertilizer, drug, etc.)
    • Compare the difference in outcome between the treatment groups
  • By randomly assigning the treatment conditions, there is no systematic relationship between the treatment condition and any other variables (measured or unmeasured) \(\Rightarrow\) No Confounding
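
A small simulation sketch (made-up data) illustrates the point: age affects the outcome, but because treatment is assigned at random it is unrelated to age, so the unadjusted comparison still targets the correct treatment effect.

# Simulated data (illustrative): randomization removes confounding
set.seed(1)
n <- 500
age <- rnorm(n, mean = 10, sd = 3)
trt <- rbinom(n, size = 1, prob = 0.5)     # randomly assigned, independent of age
y <- 1 + 0.25 * age + 0.5 * trt + rnorm(n)
coef(lm(y ~ trt))          # unadjusted estimate is close to the true effect, 0.5
coef(lm(y ~ trt + age))    # adjusting for age mainly changes precision, not bias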

17.3 Confirmatory v. Exploratory

17.3.1 Association Study Goals

In studies of association, the goal is to estimate the potential relationship between a predictor and an outcome

  • The target is inference about a quantity (usually \(\beta\)) that summarizes the relationship between the predictor and outcome, rather than accuracy in predicting \(y\)
  • Want to limit bias from confounding
  • Want high power for detecting an association, if it exists
  • Study may be confirmatory or exploratory

17.3.2 Confirmatory Analyses

A confirmatory analysis attempts to answer the scientific question the study was designed to address

  • The analysis is hypothesis testing
  • Interpretation can be strong
  • Analysis must follow a detailed protocol designed before data were collected to protect the interpretation of the \(p\)-value.
    • Ex: The statistical analysis plan for all U.S. clinical trials must be published before recruitment begins.
  • Variables included in the model and tests performed cannot depend on features of the data that were collected
    • Cannot change your model based on what you see in the data you collect
    • Protects against overfitting and preserves Type I error rate
  • Failure to be strict about the analysis plan is a major factor in the large number of contradictory published results

17.3.3 Exploratory Analyses

An exploratory analysis uses already collected data to explore additional relationships between the outcome and other measured factors

  • The analysis is hypothesis generating
  • Interpretations should be cautious
  • The general analysis plan should be established first, but
    • The form of variables in the model (e.g. splines, cut points for categories) may depend on what is found in the data
    • The variables chosen for the model and the presence of interactions may depend on what is found in the data
  • Exploratory analyses can be the basis for future studies, but do not provide definitive evidence

17.3.4 Model Selection Process – Confirmatory Analysis

For a confirmatory analysis:

  1. State the question of interest and hypothesis
  2. Identify relevant causal relationships and confounding variables
  3. Specify the model
    1. Specify the form of the predictor of interest
    2. Specify the forms of the adjustment variables
    3. Specify the hypothesis test that you will conduct
  4. Collect data.
  5. Fit the pre-specified model and conduct the pre-specified hypothesis test.
  6. Present the results

17.3.5 Model Selection Process – Exploratory Analysis

For an exploratory analysis:

  1. State the question of interest and hypothesis
  2. Identify relevant causal relationships and confounding variables
  3. Perform a descriptive analysis of the data
  4. Specify the model
    1. Specify the form of the predictor of interest
    2. Specify the forms of the adjustment variables
  5. Fit the model.
  6. Develop hypotheses about possible changes to the model.
    1. Assess the evidence for changes to the model form, using hypothesis tests
    2. Identify model(s) that fit data and science well
  7. Present the results as exploratory

17.3.6 Identifying a Statistical Model

  • Collect your data and choose your statistical model based on what question you want to answer.
  • Do not choose the question to answer just based on a statistical model or because you have a certain type of data.
  • Better to have an imprecise answer to the right question than a precise answer to the wrong question
    • Do not ignore confounding for the sake of improving power, since this will lead to biased estimates
    • “Validity before precision”
  • Variable selection should be based upon understanding of science and causal relationships (e.g. DAGs), not statistical tests of relationships in the data.

17.4 Variable Selection

17.4.1 What not to do

Many “automated” methods for variable/model selection exist

  • Forward selection
  • Backward elimination
  • Minimizing AIC/BIC

While rule-based and simple to implement, these are not appropriate methods to use for selecting a model to test an association.

Statistical issues can impact model estimates

  • Outliers, high-leverage points, and influential observations can have out-sized impact on model fit
  • Including multiple correlated adjustment variables can inflate standard errors and reduce power (multicollinearity)

These are statistical issues that should be considered in overall assessment and interpretation of your model. But they should not drive variable selection.

17.4.2 Predictor of Interest

Confirmatory Analysis

  • Choose form (e.g. linear, log, quadratic, etc.) consistent with the prior hypothesis

Exploratory Analysis

  • Include all relevant exposure variables in the model
  • Explore what form fits best
    • Unordered categorical?
    • Linear, Logarithmic, Quadratic?
    • What is the best category size? (e.g. 2-year or 5-year groups)

17.4.3 Adjustment Variables

Confirmatory Analysis

  • Include variables if your prior hypothesis is that they are confounders
  • Form should be as flexible as possible, to reduce residual confounding
    • Unordered indicators for categorical variables
    • Spline representations of continuous variables (often 3 df is enough, but not always; see the sketch after this list)
    • Interactions between confounders, if there is a priori reason to hypothesize such a relationship
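
As a sketch of the flexible adjustment forms above (reusing the childfev data from Section 17.2.3; the model name and 3-df choice are illustrative, not from the text), a natural cubic spline for age can be specified with the splines package:

# Illustrative only: adjust for age flexibly with a 3-df natural spline
library(splines)
fev_mod3 <- lm(fev ~ smoke + ns(age, df = 3), data = childfev)
tidy(fev_mod3, conf.int = TRUE)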

Exploratory Analysis

  • Develop list of scientifically plausible confounders (i.e. draw a DAG)
  • Include plausible confounders in the model
  • Explore the impact of different forms of the variables
    • Linear v. spline for continuous
    • Variations in cut point of categories
    • Interactions

17.4.4 Things to Consider

When developing DAGs and identifying confounders for inclusion in the model, some points to keep in mind:

  • Are two plausible confounders meaningfully different?
    • Ex: Employment status and personal income
  • Sample size
    • With large \(n\), can disentangle differences between related variables
    • Small sample sizes make estimating many parameters difficult
      • If \(n=100\), don’t include splines with 50 df or a variable with 30 categories!
    • However, sample size alone should not determine your model!
  • Always include lower-order terms (see the formula sketch after this list)
    • If including interaction, include main effects
    • If including quadratic term, include linear term
  • Interactions
    • Interactions are often hard to estimate unless very strong
    • Confounding can depend greatly on the presence of interactions
    • Interactions are often the subject of exploratory analyses, but are less common as the primary question in a confirmatory analysis
  • Parsimony
    • Traditionally, parsimonious models have been preferred, since they are easier to interpret
    • Modern datasets and questions are typically too complex for simple models to be appropriate
    • A model can be scientifically parsimonious but still statistically complex
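
For the lower-order-term and interaction points above, R’s formula syntax handles the bookkeeping. The models below are illustrative only, again using the childfev data:

# Illustrative formula syntax: lower-order terms are included automatically
lm(fev ~ smoke * age, data = childfev)             # expands to smoke + age + smoke:age
lm(fev ~ smoke + age + I(age^2), data = childfev)  # quadratic term together with the linear term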

17.5 Exercises

Exercise 17.1 Do ice cream sales cause shark bites?

  • Predictor of Interest: Ice Cream Sales
  • Outcome: Number of shark bites
  • We also know:
    • Ice cream sales are higher during warm weather
    • More people are at the beach during warm weather
Draw a DAG representing these relationships.
Example 17.1 Vaccine trials for COVID-19 compare rates of infection in treatment and placebo groups, without adjusting for anything else. Why is there no confounding?