NBER has a new working paper on the challenges of assessing value-add in the education context:
Estimates of teacher “value-added” suggest teachers vary substantially in their ability to promote student learning. Prompted by this finding, many states and school districts have adopted value-added measures as indicators of teacher job performance. In this paper, we conduct a new test of the validity of value-added models. Using administrative student data from New York City, we apply commonly estimated value-added models to an outcome teachers cannot plausibly affect: student height.
We find the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement, raising obvious questions about validity. Subsequent analysis finds these “effects” are largely spurious variation (noise), rather than bias resulting from sorting on unobserved factors related to achievement. Given the difficulty of differentiating signal from noise in real-world teacher effect estimates, this paper serves as a cautionary tale for their use in practice.
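To see how pure noise can produce apparently large "teacher effects" on an outcome no teacher can influence, here is a toy simulation. This is my own sketch, not the paper's model (the authors use covariate-adjusted value-added models on NYC administrative data); the number of teachers, class size, and height standard deviation below are made-up numbers for illustration.

```python
import numpy as np

# Toy illustration (not the paper's model): with no true teacher effect on
# height, small class sizes alone generate sizable estimated "effects".
rng = np.random.default_rng(0)

n_teachers = 500          # hypothetical number of teachers
class_size = 25           # hypothetical class size
sd_height = 7.0           # assumed within-class SD of student height (cm)

# Every teacher's true effect on height is exactly zero.
heights = rng.normal(loc=140.0, scale=sd_height, size=(n_teachers, class_size))

# A naive "value-added" estimate: each teacher's class mean minus the grand mean.
teacher_effects = heights.mean(axis=1) - heights.mean()

# The estimated effects vary even though every true effect is zero; their SD is
# roughly sd_height / sqrt(class_size), i.e. pure sampling noise.
print(f"SD of estimated teacher 'effects' on height: {teacher_effects.std():.2f} cm")
print(f"Expected from noise alone: {sd_height / np.sqrt(class_size):.2f} cm")
```

Run it and the two numbers come out essentially identical, which is the paper's point in miniature: a dispersion of estimated effects is not, by itself, evidence of real teacher influence.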
Evaluating performance and correctly attributing causation is tough. It is especially tough when there are strong constraints or limitations on accurate randomization.
This leaped out at me as I’m currently working on a couple of papers about health plan quality. The data we are using are a series of composite measures built from consumer surveys and claims analysis, and we are trying to figure out which insurer characteristics are associated with high-value care and which are associated with low-value care. We can find the associations easily enough. But as soon as we start hypothesizing about causal pathways to those outcomes, I get a splitting headache.