
Ed Knows Policy

EKP -- a local (Washington, DC) and national blog about education policy, politics, and research.

Wednesday, May 24, 2006

Questions for Sanders et al. on NBPTS study

I'm no NBPTS apologist, but I have to say that I'm underwhelmed by the final report (pdf) written by Sanders, Ashton, and Wright. NBPTS posted a summary of the negative comments offered by peer reviewers of the report, but that set of critiques is pretty unconvincing too. Three major flaws (or potential flaws; it's hard to tell from their report) are:

  1. Attrition bias

  2. Underpowered tests

  3. Omitted variable bias

I'll try to explain those in English on the continuation page.

    1. Attrition bias. A potentially huge source of bias is sample attrition. The researchers threw away about 35% of the records (p. 4) without explaining well enough why or whether it might bias their results. They list 6 reasons for dropping cases, some of which are harmless and others not so much. For example, "the student record lacked two prior year scores." Well, that would mean that teachers who work with highly mobile populations are going to have fewer records and they might even drop out of the sample altogether (if they teach fewer than 10 "stayers"). If your stayers are better than your leavers, then this analysis uses a decidedly nonrandom subsample of your students to judge your performance. I'm not sure I want these guys in charge of estimating my value added.

    To see why this might be the source of sample loss rather than the other 5 reasons, just look at the sample size by grade level. Analysis of grades 7 and 8, which requires data on kids who necessarily changed schools to move from elementary to middle school, relies on the smallest samples.
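To make the stayers-versus-leavers point concrete, here's a toy simulation (all numbers are invented for illustration, not taken from the report): if mobile students -- the ones lacking two prior-year scores -- tend to score lower, then a class mean computed from stayers alone overstates the class's performance.

```python
import random

random.seed(0)

# Hypothetical classroom: ~35% of records belong to "leavers" (mobile
# students who get dropped), and -- the assumption doing the work here --
# leavers score lower on average than stayers.
n = 1000
students = []
for _ in range(n):
    mobile = random.random() < 0.35          # roughly the 35% drop rate
    score = random.gauss(40 if mobile else 50, 10)  # assumed gap
    students.append((mobile, score))

all_mean = sum(s for _, s in students) / n
stayers = [s for mobile, s in students if not mobile]
stayer_mean = sum(stayers) / len(stayers)

# The stayers-only mean comes out noticeably higher than the true mean,
# so judging a teacher on stayers alone flatters (or damns) nonrandomly.
print(round(all_mean, 1), round(stayer_mean, 1))
```

Run it and the stayers-only average beats the full-class average, purely because of who got dropped.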

    2. Underpowered tests. I know it's hard to complain that the sample size is too small when the authors report that they have 260,000 records and 4,600 teacher-year-grade-subject combinations. But when people use bullshit units like "teacher-year-grade-subjects," you should start to worry. By my calculations they had 102 National Board Certified teachers (NBCs) in their sample -- still not too shabby, especially considering that the comparison groups were as large or, in the case of "never in NBPTS," much larger. Yet, because they did all analyses broken down by grade, that translates into an average of 20 NBCs per grade, ranging from 41 teachers in grade 1 down to only 8 in grade 8. Maybe they had more teachers and fewer than 3 years per teacher, but it's not reported, so I assumed the worst here.

    Why am I harping on sample size here when so many good studies use even fewer teachers? Because when your finding is that the group differences are not statistically significant, there are two possible explanations. One is that there really is no difference (i.e., NBPTS is no good); the other is that you have too few data points to know. The latter explanation is not ruled out here.
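You can put a back-of-the-envelope number on the "too few data points" worry with a normal-approximation power calculation. This is a sketch, not the authors' analysis: the half-standard-deviation effect size is my illustrative choice, and the group counts of 8 and 41 are the grade-8 and grade-1 NBC figures from my calculation above.

```python
import math

def phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(delta_sd, n_per_group):
    # Approximate power of a two-sided z-test (alpha = .05) to detect a
    # mean difference of delta_sd teacher-level standard deviations with
    # n_per_group teachers in each group, equal variances assumed.
    se = math.sqrt(2.0 / n_per_group)
    z_crit = 1.96
    return 1.0 - phi(z_crit - delta_sd / se) + phi(-z_crit - delta_sd / se)

# With only 8 NBCs (grade 8), even a half-SD difference is badly
# underpowered; with 41 (grade 1) the odds of detecting it are decent.
print(round(power_two_sample(0.5, 8), 2))
print(round(power_two_sample(0.5, 41), 2))
```

With 8 teachers per group the test catches a half-SD difference well under half the time, which is exactly the "null result by construction" problem.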

    In the end, they may be right that teacher variability is so great relative to certification that certification is essentially meaningless, but I think they need to compare NB cert with other signals of teaching quality, like experience, degrees, or exam scores. If NB cert is even a tiny bit better than those other signals, then it's probably worth the money.

    3. Omitted variable bias. The paper boasts about how sophisticated its model is (and how superior it is to the studies by Goldhaber and Anthony and by Cavalluzo) because it treats teacher effects as random. But at the end of the day, the fixed-versus-random-effects question is one of interpretation, not of right or wrong. A fixed-effects interpretation says that we treat the teachers in the study as a fixed sample and explore relationships among them, like their group averages by certification status. The random-effects interpretation says that we assume they come from a larger population, and instead of estimating their average effectiveness, we estimate what kind of "effectiveness distribution" they seem to have come from. The latter approach is more ambitious, so it's harder to identify parameters or show that entire distributions differ, and hence you're more likely to generate null findings. Combine that with the problem of few NBCs per grade and you're nearly guaranteed to get the result that they did.
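Here's a toy sketch of the two interpretations (invented numbers, and a simple method-of-moments calculation rather than the authors' actual mixed-model machinery): the fixed-effects reading estimates each sampled teacher's own mean and compares them; the random-effects reading tries to recover the variance of the "effectiveness distribution" the teachers were drawn from.

```python
import random

random.seed(1)

# Each teacher j has a "true" effect mu_j; we observe noisy student scores.
teachers = 30
per_class = 25
true_effects = [random.gauss(0, 2) for _ in range(teachers)]
data = [[mu + random.gauss(0, 8) for _ in range(per_class)]
        for mu in true_effects]

# Fixed-effects reading: each teacher's own classroom mean, to be
# compared directly (e.g., averaged by certification status).
fe_estimates = [sum(cls) / per_class for cls in data]

# Random-effects reading: one-way-ANOVA-style estimate of the variance
# of the population the teachers came from.
grand = sum(fe_estimates) / teachers
between = sum((m - grand) ** 2 for m in fe_estimates) / (teachers - 1)
within = sum(
    sum((x - sum(cls) / per_class) ** 2 for x in cls) / (per_class - 1)
    for cls in data
) / teachers
sigma2_teacher = max(between - within / per_class, 0.0)

print(round(sigma2_teacher, 2))  # estimate of the population variance
```

Notice how the random-effects target is a single variance component estimated from only 30 teacher means: with 8 NBCs in a grade, that distribution is being pinned down from almost nothing.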

    So what does this have to do with omitted variables? A true value added score is supposed to account for everything outside the control of the teacher. Ideally, you want to compare teachers with the same kids in the same buildings and the same working conditions. No researcher can do all that, but you can come a lot closer than Sanders et al. did. Their model controls for hardly anything besides prior test score: just race and sex of the student and teachers' years of experience.

    But wait, they did use prior test scores, right, so that takes care of everything? Well, no. Not everyone with the same initial score has the same expected growth. Even conditional on prior score, you will see slower growth for more economically disadvantaged students or those with disabilities or special needs. I'm not asking the authors to do a randomized trial (although that would be nice), but they could have controlled for even a weak proxy for family income like free/reduced-price lunch eligibility. They could have included controls for disability status or special needs. Never mind all the harder stuff, like school conditions, that might also contribute to the between-classroom variability they document so carefully.
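A minimal simulation of the point (the lunch-eligibility penalty and class compositions are assumptions for illustration, not estimates from any data): two equally effective teachers, but one serves far more free/reduced-price-lunch students, who -- by assumption here -- gain less conditional on prior score. A model that controls only for prior score charges that gap to the teacher.

```python
import random

random.seed(2)

def classroom_mean_gain(frl_share, n=200):
    # Average score gain for a class where frl_share of students are
    # FRL-eligible.  Both teachers contribute the same true effect (5.0);
    # the assumed FRL penalty of 2.0 points is what the model omits.
    gains = []
    for _ in range(n):
        frl = random.random() < frl_share
        gains.append(random.gauss(5.0 - (2.0 if frl else 0.0), 3.0))
    return sum(gains) / n

a = classroom_mean_gain(frl_share=0.10)  # teacher A: 10% FRL
b = classroom_mean_gain(frl_share=0.80)  # teacher B: 80% FRL

# B's "value added" looks lower despite identical true effectiveness.
print(round(a, 2), round(b, 2))
```

Same teacher effect, different rosters, different "value added" -- that's omitted variable bias in one picture.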

    Like I said, this study may show that National Board certification is worthless, but I'm not yet convinced.

