Like Cats and Dogs

Why model selection and inference just can’t get along

Peter Humburg

18 February 2021

Model Selection and Inference

The Situation

Research studies often collect data on many variables
Can’t include all of them in analysis
- too many
- collinearity
Need to select subset of variables
- Backward selection
- Penalized regression
- …
Want to interpret resulting model to answer research questions

The Problem

"Standard statistical inferences are often carried out based on a model that is determined by a data-driven selection criterion. Such procedures, however, are both logically unsound and practically misleading."
Zhang et al., Biometrika (1992)

A Closer Look

What is the effect of model selection on coefficient estimates?

Setup

Simulate some data!

Variable	True Coefficient
X1	2
X2	1.5
X3	0.5
X4	0.1
X5 - X15	0

Sample size: 20
Generate 1000 datasets
Fit three models
- True model
- Full model
- Forward selection model
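The setup can be sketched in plain Python. The slides do not say how the predictors were generated or which criterion drove forward selection, so the independent standard normal predictors, the unit-variance noise, and the AIC stopping rule below are all assumptions, and `simulate`, `ols_rss` and `forward_select` are hypothetical helper names.

```python
import math
import random

# True coefficients for X1..X15, taken from the setup table
TRUE_COEFS = [2.0, 1.5, 0.5, 0.1] + [0.0] * 11


def simulate(n=20, rng=random):
    """Draw one dataset. Standard normal predictors and noise are assumptions."""
    X = [[rng.gauss(0, 1) for _ in TRUE_COEFS] for _ in range(n)]
    y = [sum(b * x for b, x in zip(TRUE_COEFS, row)) + rng.gauss(0, 1) for row in X]
    return X, y


def ols_rss(X, y, cols):
    """Residual sum of squares for y regressed on an intercept plus columns `cols`."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    # Augmented normal equations [Z'Z | Z'y]
    A = [
        [sum(z[i] * z[j] for z in Z) for j in range(k)]
        + [sum(z[i] * yi for z, yi in zip(Z, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(k):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, z))) ** 2 for z, yi in zip(Z, y))


def forward_select(X, y):
    """Greedy forward selection; stops when AIC no longer improves (assumed rule)."""
    n, p = len(y), len(X[0])

    def aic(cols):
        return n * math.log(ols_rss(X, y, cols) / n) + 2 * (len(cols) + 1)

    selected, best = [], aic([])
    while len(selected) < p:
        scores = {j: aic(selected + [j]) for j in range(p) if j not in selected}
        j = min(scores, key=scores.get)
        if scores[j] >= best:
            break
        selected.append(j)
        best = scores[j]
    return selected


X, y = simulate(rng=random.Random(1))
print(forward_select(X, y))  # 0-based column indices of the selected predictors
```

In this sketch the true model corresponds to `cols=[0, 1, 2, 3]` and the full model to `cols=list(range(15))`. Repeating the draw 1000 times would mirror the simulation described above.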

Simulation Results

What to expect?

Coefficients for X1 - X4 should be close to true values
Remaining coefficients should be close to 0
The proportion of datasets in which a coefficient is significant estimates the Type I error rate (X5 - X15) or the power, i.e. 1 - Type II error rate (X1 - X4)
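A minimal sketch of how such a rate could be computed from per-dataset p-values. The function name `rejection_rate`, the 0.05 threshold, and the treatment of unselected variables as "not significant" are assumptions for illustration; the slides do not state how unselected variables were counted.

```python
def rejection_rate(p_values, alpha=0.05):
    """Share of simulated datasets in which a coefficient is significant.

    `None` marks a dataset where the variable was not selected; counting that
    as "not significant" is an assumption, not something the slides state.
    """
    rejected = sum(1 for p in p_values if p is not None and p < alpha)
    return rejected / len(p_values)


# Hypothetical p-values for one coefficient across five simulated datasets
print(rejection_rate([0.01, 0.20, None, 0.04, 0.60]))  # 0.4
```

For a variable with a true coefficient of 0 this proportion estimates the Type I error rate; for X1 - X4 it estimates power.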

Estimates

Power

Variable	True model	Full model	Forward selection
X1	1	1	1
X2	1	1	1
X3	0.704	0.248	0.727
X4	0.217	0.07	0.29

Power to detect X3 and X4 is low
- especially for full model

Estimates for 0 coefficients

Type I error

Variable	Full model	Forward selection
X5	0.049	0.162
X6	0.062	0.187
X7	0.054	0.16
X8	0.047	0.171
X9	0.043	0.167
X10	0.059	0.16
X11	0.05	0.168
X12	0.06	0.17
X13	0.071	0.201
X14	0.06	0.17
X15	0.05	0.185

Inflated Type I error
Forward selection: 3 - 4× nominal level

Model selection is likely to include variables without real effect
Interpreting these coefficients may be misleading!
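The size of the inflation can be checked directly against the Type I error table. This sketch assumes a nominal level of 0.05, which the slides do not state explicitly but which matches the full model's error rates.

```python
NOMINAL = 0.05  # assumed significance level; the slides only say "nominal level"

# Forward selection Type I error rates for X5..X15, copied from the table
forward = [0.162, 0.187, 0.16, 0.171, 0.167, 0.16, 0.168, 0.17, 0.201, 0.17, 0.185]

ratios = [rate / NOMINAL for rate in forward]
print(f"{min(ratios):.2f} to {max(ratios):.2f} times nominal")  # 3.20 to 4.02 times nominal
```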

Can we make it work?

What are our options?

Choose model before looking at data
Split dataset
Select variables without looking at response
- Can help to deal with collinearity (e.g. by screening on variance inflation factors, VIF)
Adjust p-values to account for effect of model selection
- e.g. R package selectiveInference
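Of these options, splitting the dataset is the simplest to sketch: one part of the data is used only for selection and the other only for inference. The helper name `split_indices` and the equal split are assumptions; the slides do not prescribe a split ratio.

```python
import random


def split_indices(n, seed=None):
    """Randomly split row indices 0..n-1 into a selection half and an inference half."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    return sorted(idx[:half]), sorted(idx[half:])


select_rows, infer_rows = split_indices(20, seed=1)
# Run variable selection on select_rows only, then fit the chosen model and
# read p-values from infer_rows only, so the rows used for testing played no
# role in choosing the model.
```

The price is that each step sees only part of the data, which matters with samples as small as the 20 observations used in the simulation.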

How does adjusting for model selection affect Type I and Type II error?

Type I Error

Variable	Full model	Forward selection	Forward selection (adjusted)
X5	0.049	0.162	0.017
X6	0.062	0.187	0.026
X7	0.054	0.16	0.016
X8	0.047	0.171	0.018
X9	0.043	0.167	0.023
X10	0.059	0.16	0.031
X11	0.05	0.168	0.016
X12	0.06	0.17	0.026
X13	0.071	0.201	0.018
X14	0.06	0.17	0.018
X15	0.05	0.185	0.016

Marked reduction in Type I error
Adjustment may be too conservative
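One rough way to judge whether the adjusted rates sit below the nominal level by more than simulation noise is to compare them with the Monte Carlo standard error of a proportion. This sketch assumes a nominal level of 0.05 and that each rate is a proportion over all 1000 simulated datasets; neither is stated outright in the slides.

```python
import math

NOMINAL, N_SIM = 0.05, 1000  # 0.05 is assumed; 1000 datasets is from the setup

# Adjusted forward selection Type I error rates for X5..X15, from the table
adjusted = [0.017, 0.026, 0.016, 0.018, 0.023, 0.031, 0.016, 0.026, 0.018, 0.018, 0.016]

# Standard error of a simulated proportion whose true value is the nominal level
se = math.sqrt(NOMINAL * (1 - NOMINAL) / N_SIM)

# Distance of each observed rate from the nominal level, in standard errors
z_scores = [(rate - NOMINAL) / se for rate in adjusted]
print(f"SE = {se:.4f}, z from {min(z_scores):.1f} to {max(z_scores):.1f}")
```

Under these assumptions every adjusted rate lies more than 2.7 standard errors below 0.05, which is consistent with the adjustment being conservative in this setting.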

Power

Variable	True model	Full model	Forward selection	Forward selection (adjusted)
X1	1	1	1	0.868
X2	1	1	1	0.922
X3	0.704	0.248	0.727	0.232
X4	0.217	0.07	0.29	0.048

Clear reduction in power
Larger sample size required
Need to consider model selection when planning a study to ensure adequate power