Like Cats and Dogs

Why model selection and inference just can’t get along

Peter Humburg

18 February 2021

Model Selection and Inference

The Situation

  • Research studies often collect data on many variables
  • Can’t include all of them in analysis
    • too many
    • collinearity
  • Need to select subset of variables
    • Backward selection
    • Penalized regression
  • Want to interpret resulting model to answer research questions

The Problem

Standard statistical inferences are often carried out based on a model that is determined by a data-driven selection criterion. Such procedures, however, are both logically unsound and practically misleading.
Zhang et al.
Biometrika (1992)

A Closer Look

What is the effect of model selection on coefficient estimates?

Setup

Simulate some data!

Variable    True coefficient
X1          2
X2          1.5
X3          0.5
X4          0.1
X5 to X15   0
  • Sample size: 20
  • Generate 1000 datasets
  • Fit three models
    • True model
    • Full model
    • Forward selection model
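The setup above can be sketched in a few lines of Python. The slides do not say how predictors and errors were generated, so this sketch assumes independent standard normal predictors and standard normal errors; the helper names `simulate` and `ols` are made up for illustration. It fits only the true and the full model, and its numbers need not match the results reported on later slides.

```python
import numpy as np

# Assumptions (not stated on the slides): independent standard normal
# predictors, standard normal errors, no intercept in the generating model.
TRUE_COEF = np.array([2.0, 1.5, 0.5, 0.1] + [0.0] * 11)  # X1..X15
N_OBS, N_SIM = 20, 1000


def simulate(rng, n=N_OBS, coef=TRUE_COEF):
    """Draw one dataset (X, y) from the assumed linear model."""
    X = rng.standard_normal((n, coef.size))
    y = X @ coef + rng.standard_normal(n)
    return X, y


def ols(X, y):
    """Least-squares slopes from a fit with intercept (intercept dropped)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


rng = np.random.default_rng(2021)
true_fit = np.empty((N_SIM, 4))
full_fit = np.empty((N_SIM, 15))
for i in range(N_SIM):
    X, y = simulate(rng)
    true_fit[i] = ols(X[:, :4], y)  # true model: X1..X4 only
    full_fit[i] = ols(X, y)         # full model: all 15 predictors

print("true model, mean estimates:", true_fit.mean(axis=0).round(2))
print("full model, mean estimates:", full_fit.mean(axis=0).round(2))
```

Both models give unbiased estimates on average; what differs is their spread, which is much larger for the full model because it spends 16 parameters on 20 observations.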

Simulation Results

What to expect?

  • Coefficients for X1 - X4 should be close to true values
  • Remaining coefficients should be close to 0
  • Proportion of datasets in which a coefficient is significant estimates power (1 - Type II error rate) for X1 - X4 and the Type I error rate for X5 - X15
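As a rough illustration of how such rates are obtained, the sketch below counts how often each coefficient of the full model is significant across simulated datasets. The significance level of 0.05, the standard normal predictors and errors, and the helper name `slope_pvalues` are assumptions, so the output is not expected to reproduce the tables in these slides.

```python
import numpy as np
from scipy import stats

TRUE_COEF = np.array([2.0, 1.5, 0.5, 0.1] + [0.0] * 11)  # X1..X15
ALPHA = 0.05  # assumed nominal level


def slope_pvalues(X, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    df = n - design.shape[1]
    se = np.sqrt(resid @ resid / df * np.diag(xtx_inv))
    return 2 * stats.t.sf(np.abs(beta / se), df)[1:]


rng = np.random.default_rng(2021)
n_sim = 1000
rejected = np.zeros((n_sim, TRUE_COEF.size), dtype=bool)
for i in range(n_sim):
    X = rng.standard_normal((20, TRUE_COEF.size))
    y = X @ TRUE_COEF + rng.standard_normal(20)
    rejected[i] = slope_pvalues(X, y) < ALPHA

rates = rejected.mean(axis=0)
print("power, X1..X4:        ", rates[:4].round(3))
print("Type I error, X5..X15:", rates[4:].round(3))
```

For the zero coefficients the rejection rate estimates the Type I error rate; for the non-zero ones it estimates power.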

Estimates


Power

Variable   True model   Full model   Forward selection
X1         1            1            1
X2         1            1            1
X3         0.704        0.248        0.727
X4         0.217        0.07         0.29
  • Power to detect X3 and X4 is low
    • especially for full model

Estimates for 0 coefficients

Type I error

Variable   Full model   Forward selection
X5         0.049        0.162
X6         0.062        0.187
X7         0.054        0.16
X8         0.047        0.171
X9         0.043        0.167
X10        0.059        0.16
X11        0.05         0.168
X12        0.06         0.17
X13        0.071        0.201
X14        0.06         0.17
X15        0.05         0.185
  • Inflated Type I error
  • Forward selection: 3 to 4 times the nominal level


  • Model selection is likely to include variables without real effect
  • Interpreting these coefficients may be misleading!
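The inflation can be reproduced qualitatively with a small simulation. The slides do not state which selection criterion was used, so this sketch assumes greedy forward selection by AIC, standard normal predictors and errors, and a nominal level of 0.05; all function names are hypothetical. A zero coefficient counts as a false positive when it is selected and significant in the final model.

```python
import numpy as np
from scipy import stats

TRUE_COEF = np.array([2.0, 1.5, 0.5, 0.1] + [0.0] * 11)  # X1..X15


def design_matrix(X):
    return np.column_stack([np.ones(X.shape[0]), X])


def rss(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    D = design_matrix(X)
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ beta
    return resid @ resid


def slope_pvalues(X, y):
    """Naive two-sided t-test p-values that ignore the selection step."""
    D = design_matrix(X)
    xtx_inv = np.linalg.inv(D.T @ D)
    beta = xtx_inv @ D.T @ y
    resid = y - D @ beta
    df = len(y) - D.shape[1]
    se = np.sqrt(resid @ resid / df * np.diag(xtx_inv))
    return 2 * stats.t.sf(np.abs(beta / se), df)[1:]


def aic(rss_value, n, k):
    """Gaussian AIC up to a constant; k predictors plus an intercept."""
    return n * np.log(rss_value / n) + 2 * (k + 1)


def forward_select(X, y):
    """Add the predictor that lowers AIC most until no addition helps."""
    n, p = X.shape
    selected = []
    best = aic(np.sum((y - y.mean()) ** 2), n, 0)  # intercept-only model
    while len(selected) < p:
        scores = {j: aic(rss(X[:, selected + [j]], y), n, len(selected) + 1)
                  for j in range(p) if j not in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        best = scores[j_best]
        selected.append(j_best)
    return selected


rng = np.random.default_rng(2021)
n_sim, alpha = 300, 0.05
full_hits = np.zeros(15)
fwd_hits = np.zeros(15)
for _ in range(n_sim):
    X = rng.standard_normal((20, 15))
    y = X @ TRUE_COEF + rng.standard_normal(20)
    full_hits += slope_pvalues(X, y) < alpha
    sel = forward_select(X, y)
    fwd_hits[sel] += slope_pvalues(X[:, sel], y) < alpha

full_rate = full_hits[4:].mean() / n_sim  # average over X5..X15
fwd_rate = fwd_hits[4:].mean() / n_sim
print("Type I error, full model:       ", round(full_rate, 3))
print("Type I error, forward selection:", round(fwd_rate, 3))
```

The naive p-values treat the selected model as if it had been fixed in advance, which is exactly the step that the selection invalidates.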

Can we make it work?

What are our options?

  • Choose model before looking at data
  • Split dataset
  • Select variables without looking at response
    • Can help to deal with collinearity (VIF)
  • Adjust p-values to account for effect of model selection
    • e.g. R package selectiveInference
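Of the options above, screening by variance inflation factor (VIF) is the simplest to illustrate because it uses only the predictors. The sketch below is a minimal version with a made-up helper `vif` and artificial data in which the fourth column nearly duplicates the first.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on
    all other columns. This equals the j-th diagonal entry of the inverse
    correlation matrix of the predictors.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.standard_normal(200)  # near-copy of column 0

v = vif(X)
print(v.round(1))  # large values for columns 0 and 3, near 1 for the others
```

Because the response never enters the calculation, dropping a variable with a large VIF (cut-offs of 5 or 10 are common rules of thumb) does not distort later p-values in the way response-driven selection does.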

How does adjusting for model selection affect Type I and Type II error rates?

Type I Error

Variable   Full model   Forward selection   Forward selection (adjusted)
X5         0.049        0.162               0.017
X6         0.062        0.187               0.026
X7         0.054        0.16                0.016
X8         0.047        0.171               0.018
X9         0.043        0.167               0.023
X10        0.059        0.16                0.031
X11        0.05         0.168               0.016
X12        0.06         0.17                0.026
X13        0.071        0.201               0.018
X14        0.06         0.17                0.018
X15        0.05         0.185               0.016
  • Marked reduction in Type I error
  • Adjustment may be too conservative: adjusted rates fall well below those of the full model

Power

Variable   True model   Full model   Forward selection   Forward selection (adjusted)
X1         1            1            1                   0.868
X2         1            1            1                   0.922
X3         0.704        0.248        0.727               0.232
X4         0.217        0.07         0.29                0.048
  • Clear reduction in power
  • Larger sample size required
  • Need to consider model selection when planning a study to ensure adequate power
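One way to act on this when planning a study is to estimate power by simulation for a range of sample sizes, using the analysis that will actually be run. The sketch below uses the full model as a stand-in analysis; in practice the selection step and the adjusted inference would take its place. The data-generating model (standard normal predictors and errors), the 0.05 level and the function names are assumptions.

```python
import numpy as np
from scipy import stats

TRUE_COEF = np.array([2.0, 1.5, 0.5, 0.1] + [0.0] * 11)  # X1..X15


def slope_pvalues(X, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    df = n - design.shape[1]
    se = np.sqrt(resid @ resid / df * np.diag(xtx_inv))
    return 2 * stats.t.sf(np.abs(beta / se), df)[1:]


def estimate_power(n, target=2, n_sim=500, alpha=0.05, seed=0):
    """Share of simulated datasets of size n in which the target slope
    (index 2 is X3) is significant in the stand-in analysis."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        X = rng.standard_normal((n, TRUE_COEF.size))
        y = X @ TRUE_COEF + rng.standard_normal(n)
        hits += int(slope_pvalues(X, y)[target] < alpha)
    return hits / n_sim


power_by_n = {n: estimate_power(n) for n in (20, 40, 80, 160)}
for n, power in power_by_n.items():
    print(f"n = {n:3d}: estimated power for X3 = {power:.2f}")
```

The smallest sample size whose estimated power reaches the target (often 0.8) is then a candidate for the study design.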