Standard statistical inferences are often carried out based on a model that is determined by a data-driven selection criterion. Such procedures, however, are both logically unsound and practically misleading.
What is the effect of model selection on coefficient estimates?
Simulate some data!

Variable | True Coefficient |
---|---|
X1 | 2 |
X2 | 1.5 |
X3 | 0.5 |
X4 | 0.1 |
X5 - X15 | 0 |
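
A minimal sketch of how data with these true coefficients could be simulated. The slides give only the coefficients, so the sample size (`n=100`), the independent standard-normal predictors, and the unit noise standard deviation are all assumptions made for illustration, as is the helper name `simulate`.

```python
import random

# True coefficients from the table: X1..X4 are non-zero, X5..X15 are zero.
TRUE_BETA = [2.0, 1.5, 0.5, 0.1] + [0.0] * 11


def simulate(n=100, noise_sd=1.0, seed=None):
    """Draw n rows of 15 predictors and a linear response y.

    Assumptions (not stated in the slides): predictors are independent
    N(0, 1) draws and the noise is N(0, noise_sd^2).
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in TRUE_BETA] for _ in range(n)]
    y = [sum(b * x for b, x in zip(TRUE_BETA, row)) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y


X, y = simulate(n=100, seed=1)
print(len(X), len(X[0]), len(y))  # rows, predictors, responses
```

With correlated predictors or a different noise level, the rejection rates in the later tables would shift, so this setup is not expected to reproduce those numbers exactly.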

What to expect?

X | true | full | forward |
---|---|---|---|
X1 | 1 | 1 | 1 |
X2 | 1 | 1 | 1 |
X3 | 0.704 | 0.248 | 0.727 |
X4 | 0.217 | 0.07 | 0.29 |

X | full | forward |
---|---|---|
X5 | 0.049 | 0.162 |
X6 | 0.062 | 0.187 |
X7 | 0.054 | 0.16 |
X8 | 0.047 | 0.171 |
X9 | 0.043 | 0.167 |
X10 | 0.059 | 0.16 |
X11 | 0.05 | 0.168 |
X12 | 0.06 | 0.17 |
X13 | 0.071 | 0.201 |
X14 | 0.06 | 0.17 |
X15 | 0.05 | 0.185 |
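
The inflation for the null variables X5 to X15 can be illustrated with a small Monte Carlo experiment. The sketch below rests on several assumptions: it reads the table entries as the share of runs with a p-value below 0.05 (the slides do not say so explicitly), uses greedy forward selection by AIC as the selection rule, replaces the t reference distribution with a normal approximation, and reuses the assumed simulation setup (independent N(0, 1) predictors, unit noise, `n=100`). All function names are hypothetical. For the forward model it reports the rejection rate among the null variables that were actually selected.

```python
import math
import random
from statistics import NormalDist

TRUE_BETA = [2.0, 1.5, 0.5, 0.1] + [0.0] * 11   # X1..X15, as in the table
NULL_VARS = range(5, 16)                         # columns of X5..X15 (column 0 = intercept)


def gram(X, y):
    """Return Z'Z, Z'y and y'y, where Z is X with a leading intercept column."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    G = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    g = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return G, g, sum(yi * yi for yi in y)


def invert(A):
    """Invert a small positive-definite matrix by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]


def fit(G, g, yy, n, cols):
    """OLS on the columns in `cols`; returns RSS and naive two-sided p-values."""
    inv = invert([[G[i][j] for j in cols] for i in cols])
    gs = [g[i] for i in cols]
    beta = [sum(a * b for a, b in zip(row, gs)) for row in inv]
    rss = yy - sum(b * v for b, v in zip(beta, gs))
    sigma2 = rss / (n - len(cols))
    z = [b / math.sqrt(sigma2 * inv[i][i]) for i, b in enumerate(beta)]
    # Assumption: normal approximation instead of the exact t distribution.
    pvals = [2.0 * (1.0 - NormalDist().cdf(abs(v))) for v in z]
    return rss, dict(zip(cols, pvals))


def forward_select(G, g, yy, n):
    """Greedy forward selection by AIC (assumed criterion); intercept always in."""
    def aic(rss, k):
        return n * math.log(rss / n) + 2 * k

    chosen, rest = [0], list(range(1, len(g)))
    best = aic(fit(G, g, yy, n, chosen)[0], 1)
    while rest:
        score, j = min((aic(fit(G, g, yy, n, chosen + [c])[0], len(chosen) + 1), c)
                       for c in rest)
        if score >= best:          # no candidate improves AIC: stop
            break
        best = score
        chosen.append(j)
        rest.remove(j)
    return chosen


def run(reps=200, n=100, alpha=0.05, seed=1):
    """Rejection rates for null variables: full model vs. naive post-selection fit."""
    rng = random.Random(seed)
    full_hits = full_total = fwd_hits = fwd_total = 0
    for _ in range(reps):
        X = [[rng.gauss(0, 1) for _ in TRUE_BETA] for _ in range(n)]
        y = [sum(b * x for b, x in zip(TRUE_BETA, row)) + rng.gauss(0, 1) for row in X]
        G, g, yy = gram(X, y)
        p_full = fit(G, g, yy, n, list(range(len(g))))[1]
        p_fwd = fit(G, g, yy, n, forward_select(G, g, yy, n))[1]
        for j in NULL_VARS:
            full_total += 1
            full_hits += p_full[j] < alpha
            if j in p_fwd:         # only variables that survived selection
                fwd_total += 1
                fwd_hits += p_fwd[j] < alpha
    return full_hits / full_total, fwd_hits / fwd_total


full_rate, forward_rate = run()
print(f"full model: {full_rate:.3f}, after forward selection: {forward_rate:.3f}")
```

The design point is that the same data pick the model and then test it: a null variable enters the forward model only when it already looks promising, so its naive p-value is biased toward significance. The exact rates depend on the selection rule and on whether they are computed conditionally on selection, so they need not match the table.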

selectiveInference

X | full | forward | forward adjusted |
---|---|---|---|
X5 | 0.049 | 0.162 | 0.017 |
X6 | 0.062 | 0.187 | 0.026 |
X7 | 0.054 | 0.16 | 0.016 |
X8 | 0.047 | 0.171 | 0.018 |
X9 | 0.043 | 0.167 | 0.023 |
X10 | 0.059 | 0.16 | 0.031 |
X11 | 0.05 | 0.168 | 0.016 |
X12 | 0.06 | 0.17 | 0.026 |
X13 | 0.071 | 0.201 | 0.018 |
X14 | 0.06 | 0.17 | 0.018 |
X15 | 0.05 | 0.185 | 0.016 |

X | true | full | forward | forward adjusted |
---|---|---|---|---|
X1 | 1 | 1 | 1 | 0.868 |
X2 | 1 | 1 | 1 | 0.922 |
X3 | 0.704 | 0.248 | 0.727 | 0.232 |
X4 | 0.217 | 0.07 | 0.29 | 0.048 |