Shrinkage of Value-Added Estimates and Characteristics of Students with Hard-to-Predict Achievement Levels

ABSTRACT It is common in the implementation of teacher accountability systems to use empirical Bayes shrinkage to adjust teacher value-added estimates by their level of precision. Because value-added estimates based on fewer students and students with “hard-to-predict” achievement will be less precise, the procedure could have differential impacts on the probability that the teachers of fewer students or students with hard-to-predict achievement will be assigned consequences. This article investigates how shrinkage affects the value-added estimates of teachers of hard-to-predict students. We found that teachers of students with low prior achievement and who receive free lunch tend to have less precise value-added estimates. However, in our sample, shrinkage had no statistically significant effect on the relative probability that teachers of hard-to-predict students received consequences.


Introduction
Due to the incentives provided by the federal Race to the Top program and Elementary and Secondary Education Act (ESEA) waivers, districts and states rapidly implemented new teacher evaluation systems to make high-stakes decisions about tenure, pay, and retention, based in part on statistical measures of teacher effectiveness. These evaluation systems generally incorporated multiple measures of effective teaching, such as student achievement growth, classroom observations, and student learning objectives (Mihaly et al. 2013). One of the most controversial has been the student achievement growth component, for which districts have used different approaches. One common approach has been to use a value-added model (VAM) to estimate teachers' contributions to student achievement on standardized assessments. Under the 2016 reauthorization of ESEA, the federal government gave states more control over their teacher evaluation systems. As states rethink their evaluation systems, there is an even greater need to understand how statistical models like VAMs contribute to assigning teachers to performance categories.
VAMs predict individual student achievement based on the student's characteristics, including baseline achievement, and compare this prediction with the actual achievement of a teacher's students. The prediction is derived using data on other students in the state or district and represents what we would expect the student to achieve if he or she were taught by the average teacher. The difference between how a teacher's students actually performed and how they were predicted to perform represents the estimate of the teacher's value added to student achievement.
The precision of the teacher's value-added estimate is related to the average differences between the actual and predicted performance of the students in a teacher's class. Precision depends both on the number of students matched to the teacher and the ability of the model to predict the achievement of the specific students in the teacher's class. All else being equal, the estimates of teachers matched to fewer students are more likely to be imprecise because one or two students with unusually high or low achievement growth can have a larger influence on the average differences between the actual and predicted performance of the students in the teacher's class. Differences between students' actual and predicted performance are related to the existence of factors that influence achievement but are not observed in the data. Some students may have achievement that is hard to predict in the sense that their background characteristics are not very informative about their future achievement. The role of this potential heteroscedasticity (variation across student groups in the difference between their actual and predicted achievement) has received relatively little attention in the value-added literature. 1 However, Stacy et al. (2012) found that students with low socioeconomic status and low prior achievement had larger differences between actual and predicted achievement than more-advantaged students. They showed that the value-added estimates of teachers who had many disadvantaged students were less stable over time.
It is common in the implementation of teacher accountability systems, including those in the District of Columbia, New York City, and Los Angeles, to use a procedure known as empirical Bayes (EB) shrinkage to adjust the teacher value-added estimates by their level of precision. 2 The adjusted estimate is a weighted average of the teacher's initial value-added estimate and the value-added of an average teacher, with more precise estimates receiving greater weight. This procedure is called shrinkage because less precise estimates receive lower weight in the adjustment and are shrunk toward the average. Because value-added estimates based on fewer students and students with hard-to-predict achievement will be less precise, the procedure could have differential impacts on the probability that the teachers of fewer students or students with hard-to-predict achievement will be assigned consequences. For example, in the absence of shrinkage, teachers with very few students could be more likely to have estimates in the extremes of the distribution. 3 Similarly, if teachers of disadvantaged students have imprecise initial value-added estimates, shrinkage could reduce the probability that these teachers' estimates are in the extremes of the distribution.
How shrinkage affects the probability that teachers of particular types of students will receive consequences depends on the designs of accountability systems. Many evaluation systems use thresholds to identify highly effective or ineffective teachers and assign consequences. Because consequences are often discrete (for example, whether to retain the teacher or whether to require the teacher to receive professional development), very small changes in a teacher's estimate near the threshold can have substantial consequences. In designing a high-stakes accountability system, one goal is to avoid classification errors. 4 In these systems, shrinkage is conceptually appealing because it produces estimates of teacher effectiveness that are conservative in the assignment of consequences to teachers. Shrinkage estimates are conservative because an imprecise estimate will be shrunk more to the overall average, and the teacher might be less likely to be assigned consequences as a result. Because precision depends on the number of students and the characteristics of the students in a teacher's class, shrinkage could potentially reduce the probability that teachers of these types of students receive consequences. This might be desirable in evaluation systems because differences in teachers' probabilities of being misclassified could be considered unfair and have deleterious effects on teachers' incentives to teach classes that include certain groups of students.

2 The shrinkage estimates are called the best linear unbiased predictors (BLUPs) because they minimize the expected mean squared error (MSE) between teachers' estimates and their actual contributions. However, critics of the approach note that it obtains these properties by intentionally introducing statistical bias in the estimates and does not generally provide an accurate ranking of teachers (Tate 2004).

3 Kane and Staiger () found that small schools were more likely to have large changes in achievement across years than large schools.

4 Shrinkage could also differentially reduce the probability of consequences for teachers with greater variation in their true impacts on students. Variation in true impacts cannot be disentangled from variation due to random shocks to student performance, and the shrinkage procedure errs on the side of attributing imprecision to random variation rather than to variation in true impacts.
This article investigates how shrinkage affects the value-added estimates of teachers of hard-to-predict students, focusing on the context of threshold-based accountability systems. We examine three main questions:
1. What student characteristics are correlated with having hard-to-predict test scores?
2. Are these student characteristics correlated with how much teacher value-added estimates change when shrinkage methods are used?
3. Are these student characteristics correlated with how much shrinkage affects the probability that these teachers' estimates exceed policy-relevant thresholds?
Consistent with heteroscedasticity, we find that the achievement of particular students, such as students with low prior achievement and students who receive free lunch, is harder to predict. Teachers of these types of students tend to have less precise value-added estimates, and shrinkage increases their estimates' precision. Shrinkage also reduces the absolute value of the value-added estimates of teachers of hard-to-predict students. However, in our data, shrinkage has no statistically significant effect on the relative probability that teachers of hard-to-predict students receive consequences.
The main contribution of this article is to examine heteroscedasticity and shrinkage in the context of accountability systems based on thresholds. This differs from other papers on shrinkage, which examine how shrinkage affects teacher ranks (Tate 2004; Guarino et al. 2015a), and other work on heteroscedasticity, which examines its effect on the precision and inter-year stability of value-added estimates (Stacy et al. 2012). Our purpose is to examine how shrinkage affects the probability that teachers of hard-to-predict students are classified at the extremes of the value-added distribution of teacher effectiveness, because many evaluation systems use these thresholds to determine consequences.

Theory
To illustrate how heteroscedasticity affects estimates of teacher value added, we present the following VAM:

$$Y_{igt} = \lambda Y_{i,g-1} + \beta' X_{igt} + \theta_t + \varepsilon_{igt} \tag{1}$$

where $Y_{igt}$ is the test score of student $i$ in grade $g$ who has teacher $t$ and $Y_{i,g-1}$ is the prior test score for student $i$ in the previous grade $g-1$. The vector $X_{igt}$ denotes control variables for student demographic characteristics, $\theta_t$ is a vector of teacher effects, 5 and $\varepsilon_{igt}$ is the student-level error term. Under homoscedasticity, all student-level errors $\varepsilon_{igt}$ have the same variance; that is, the accuracy of a predicted score does not depend on the characteristics of the student. However, estimates of precision typically allow for the possibility of heteroscedastic errors. One possible cause of heteroscedasticity is measurement error in test scores. Typically, scores near the center of the distribution of a standardized test are more reliable than scores in the tails of the distribution (Resch and Isenberg 2014). 6 Consequently, students who are likely to be at the top or the bottom of the achievement distribution might tend to have large differences between their actual and predicted scores. Another reason for heteroscedasticity is the presence of unobserved factors that affect achievement. For example, we do not observe parental involvement or students' motivation, which could affect achievement and vary across different types of students.

5 Teacher effects can be estimated as random variables that are independent of other factors affecting test scores or as fixed effects that might be correlated with the other factors affecting test scores. We use the latter method, which estimates teacher effects as fixed effects, to account for the possibility that teacher assignments are correlated with the covariates included in the model.

Shrinkage places less weight on imprecise initial value-added estimates when constructing the final value-added estimates.
If teachers who are assigned students with large residuals tend to have imprecise estimates of initial value added, then the shrinkage procedure will proportionately shrink the estimates of these teachers more and the final estimates of these teachers will be proportionately more precise than without shrinkage. This relationship means that shrinkage reduces but does not eliminate differences in variances of value-added estimates.
To see this, consider the EB shrinkage procedure outlined in Morris (1983). A teacher's shrinkage estimate is the weighted average of the teacher's preshrinkage value-added estimate and the estimate for the average teacher. Let $\hat{\theta}_t$ be the mean-centered preshrinkage point estimate for teacher $t$ from the value-added regression model, and let the associated heteroscedasticity-robust variance estimate be $\widehat{\mathrm{var}}[\hat{\theta}_t] = \hat{\sigma}_t^2$. The EB estimate for a teacher is approximately equal to a precision-weighted average of $\hat{\theta}_t$ and the overall mean of all estimated teacher effects. 7 Because the overall mean of the estimated teacher effects is zero by design, the teacher's EB estimate $\hat{\theta}_t^{EB}$ can be written as follows:

$$\hat{\theta}_t^{EB} = \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + \hat{\sigma}_t^2}\,\hat{\theta}_t \tag{2}$$

where $\hat{\sigma}$ is an estimate of the standard deviation of teacher effects (purged of sampling error), which is constant for all teachers. The term $\hat{\sigma}^2 / (\hat{\sigma}^2 + \hat{\sigma}_t^2)$ must be less than 1, so the EB estimate always has a smaller absolute value than the initial estimate.
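The weighted-average form described above can be illustrated with a minimal sketch. The teacher labels, estimates, sampling variances, and the value of the teacher-effect variance are all hypothetical, and the sketch omits the bias correction from Morris (1983) that the text mentions.

```python
def eb_shrink(theta_hat, var_hat, sigma2):
    """Return (weight, shrunken estimate) for one teacher.

    theta_hat: mean-centered preshrinkage estimate
    var_hat:   estimated sampling variance of theta_hat
    sigma2:    estimated variance of true teacher effects (treated as known)
    """
    weight = sigma2 / (sigma2 + var_hat)  # lies strictly between 0 and 1
    return weight, weight * theta_hat


# Hypothetical inputs: teachers A and B share the same initial estimate,
# but B's estimate is four times as noisy.
sigma2 = 0.04
teachers = {"A": (0.30, 0.01), "B": (0.30, 0.04), "C": (-0.20, 0.02)}

weights = {}
shrunk = {}
for name, (theta_hat, var_hat) in teachers.items():
    weights[name], shrunk[name] = eb_shrink(theta_hat, var_hat, sigma2)

for name in teachers:
    print(name, round(weights[name], 3), round(shrunk[name], 3))
```

In this toy input, the noisier estimate for teacher B is pulled further toward zero than the equally large but more precise estimate for teacher A, which is the differential shrinkage the section goes on to analyze.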
Although heteroscedasticity in student background characteristics does not affect the preshrinkage point estimates, it does affect the variance of the teacher estimates and, thus, the EB estimates. The change in the EB estimate in response to a change in the variance of the preshrinkage estimate is given by the following:

$$\frac{\partial \hat{\theta}_t^{EB}}{\partial \hat{\sigma}_t^2} = -\frac{\hat{w}_t}{\hat{\sigma}^2 + \hat{\sigma}_t^2}\,\hat{\theta}_t \tag{3}$$

where $\hat{w}_t \equiv \hat{\sigma}^2 / (\hat{\sigma}^2 + \hat{\sigma}_t^2)$, the weight on the preshrinkage estimate. The negative sign indicates that estimates of teachers who teach students with characteristics that produce large residuals are shrunk more than those of teachers who teach students with smaller residuals: a positive estimate moves down toward zero, and a negative estimate moves up toward zero. The extra shrinkage is largest for teachers with larger unshrunken estimates and for teachers with more of the types of students who tend to produce large residuals. The extra shrinkage will tend to distort the relationship between teacher effectiveness and teachers' average student characteristics related to heteroscedasticity (or other characteristics of teachers or classrooms related to heteroscedasticity). This is an example of the trade-off implicit in shrinkage between introducing bias and the choice to minimize the mean squared error (MSE) of the estimates. 8 The negative relationship in Equation (3) does not indicate that these teachers are more or less effective on average; it is instead an artifact of being conservative in handing out extreme value-added scores to teachers.

6 Standardized tests are designed to differentiate students who are proficient from those who are not, so they provide more reliable scores for students around the level of proficiency. Koedel, Leatherman, and Parsons () developed a method for accounting for measurement error that varies across the achievement distribution.

7 In Morris (1983), the EB estimate does not exactly equal the precision-weighted average of the two values due to a correction for bias. This adjustment increases the weight on the overall mean by (K -)/(K -), where K is the number of teachers. For ease of exposition, we have omitted this correction from the description given here and consider this expression to hold with equality.
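One benefit of the trade-off between bias and MSE is the increased efficiency of the EB estimator. This can be seen in the lower variances associated with the EB estimates. Ignoring the sampling error associated with $\hat{\sigma}^2$ and $\hat{\sigma}_t^2$, the variance of a teacher's EB estimate is approximately equal to the following: 9

$$\mathrm{var}\big[\hat{\theta}_t^{EB}\big] \approx \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + \hat{\sigma}_t^2}\,\hat{\sigma}_t^2 = \hat{w}_t\,\hat{\sigma}_t^2 \tag{4}$$

Thus, the EB estimate has a smaller variance than that of the preshrinkage estimate, because $\hat{w}_t$ is less than 1.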
Heteroscedasticity affects $\mathrm{var}\big[\hat{\theta}_t^{EB}\big]$ through the variance on the preshrinkage estimates. The change in the variance of a teacher's EB estimate in response to a change in the variance of the preshrinkage estimate is proportional to the square of the weight on the point estimate in the EB calculation:

$$\frac{\partial\,\mathrm{var}\big[\hat{\theta}_t^{EB}\big]}{\partial \hat{\sigma}_t^2} = \hat{w}_t^2 \tag{5}$$

Because $\hat{w}_t$ is less than 1 but greater than 0, shrinkage reduces but does not eliminate heteroscedasticity from the EB estimates. This suggests that shrinkage might mitigate but will not eliminate the differential probability that teachers of hard-to-predict students are classified in the extremes of the value-added distribution.
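The two sensitivities discussed in this section can be worked out in a few lines. The derivation below is a sketch under two assumptions that are stated here rather than taken from the article: the teacher-effect variance $\hat{\sigma}^2$ is treated as fixed when $\hat{\sigma}_t^2$ changes, and the EB variance takes the standard normal-normal form $\hat{w}_t\hat{\sigma}_t^2$ without the degrees-of-freedom corrections the authors mention.

```latex
\begin{align*}
\hat{w}_t &= \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + \hat{\sigma}_t^2},
\qquad
\frac{\partial \hat{w}_t}{\partial \hat{\sigma}_t^2}
  = -\frac{\hat{\sigma}^2}{(\hat{\sigma}^2 + \hat{\sigma}_t^2)^2}
  = -\frac{\hat{w}_t}{\hat{\sigma}^2 + \hat{\sigma}_t^2} \\[4pt]
\frac{\partial \hat{\theta}_t^{EB}}{\partial \hat{\sigma}_t^2}
  &= \frac{\partial (\hat{w}_t \hat{\theta}_t)}{\partial \hat{\sigma}_t^2}
  = -\frac{\hat{w}_t}{\hat{\sigma}^2 + \hat{\sigma}_t^2}\,\hat{\theta}_t \\[4pt]
\frac{\partial (\hat{w}_t \hat{\sigma}_t^2)}{\partial \hat{\sigma}_t^2}
  &= \hat{w}_t + \hat{\sigma}_t^2\,\frac{\partial \hat{w}_t}{\partial \hat{\sigma}_t^2}
  = \hat{w}_t - \hat{w}_t\,\frac{\hat{\sigma}_t^2}{\hat{\sigma}^2 + \hat{\sigma}_t^2}
  = \hat{w}_t - \hat{w}_t (1 - \hat{w}_t)
  = \hat{w}_t^2
\end{align*}
```

Because $0 < \hat{w}_t < 1$, the last line lies strictly between 0 and 1: a one-unit increase in the preshrinkage variance raises the EB variance by less than one unit but by more than zero, which is the sense in which shrinkage dampens heteroscedasticity without removing it.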

Value-Added Model and Data
We estimate a VAM that corresponds closely (though not exactly) with the model estimated for the District of Columbia Public Schools (DCPS) for teacher evaluations in the 2011-2012 school year (described in Isenberg and Hock 2012). This model has many features that are common in VAMs estimated in other states or districts. Six of nine teacher-level VAMs for which we could obtain documentation use teacher fixed effects and then apply empirical Bayes shrinkage to the fixed effect estimates in a post hoc adjustment, which is the method we use for our analysis. The other three VAMs bring about shrinkage as part of the regression model by using teacher random effects. 10 The popularity of the two-step approach may be due in part to the desire to avoid the assumption of a random effects model that teacher effectiveness is uncorrelated with student characteristics. Using simulated data, Guarino, Reckase, and Wooldridge (2015b) suggest that teacher fixed effects models are more robust to circumstances in which there is systematic sorting of students to teachers, that is, to the type of sorting that would violate this assumption of a random effects model. We estimate the VAM separately for teachers of elementary (grades 4 and 5) and middle school (grades 6 through 8) students:

$$Y_{tig} = \lambda_g Y_{i(g-1)} + \omega_g Z_{i(g-1)} + \beta' X_i + \pi' \delta_g + \eta' T_{it} + \varepsilon_{tig} \tag{6}$$

where $Y_{tig}$ is the posttest score for student $i$ in grade $g$ linked to teacher $t$ and $Y_{i(g-1)}$ is the same-subject pretest for student $i$ in grade $g-1$ during the previous year. The variable $Z_{i(g-1)}$ denotes the pretest in the opposite subject. Thus, when estimating teacher effectiveness in math, $Y_{tig}$ and $Y_{i(g-1)}$ represent math tests, with $Z_{i(g-1)}$ representing reading tests and vice versa. The pretest scores capture prior inputs into student achievement, and the associated coefficients $\lambda_g$ and $\omega_g$ vary by grade. The vector $X_i$ denotes control variables for individual student background characteristics, specifically indicators for free lunch eligibility, reduced-price lunch eligibility, limited English proficiency, and special education status. The coefficients on these characteristics are constrained to be the same across grades within each grade span. The vector $\delta_g$ includes grade indicators.

8 Morris (1983) and others refer to EB estimates as unbiased. For example, the EB random effects estimator is often called the BLUP. The EB estimates are unbiased only in the sense that the expectation of the mean over all groups (teachers) gives an unbiased estimate of the true mean over the population of group effects. As a direct consequence of Equation (2), a particular teacher's EB estimate is not generally an unbiased estimate of that teacher's true effect.

9 The actual formula we employ for the variance of the EB estimates is based on a formula in Morris (1983) that accounts for sampling error in the estimated variances and makes degrees of freedom corrections. For ease of exposition, we have omitted these corrections from the description given here and consider this expression to hold with equality.
The vector $T_{it}$ contains one indicator variable for each teacher. A student contributes one observation to the model for each teacher to whom the student is linked. This contribution is based on a roster confirmation process that enables teachers to indicate whether and for how long they have taught the students on their administrative rosters and to add any students who are not listed on their administrative rosters. Students are weighted in the regression according to their dosage, which indicates the amount of time the teacher taught the student. 11 The vector $\eta$ includes one coefficient for each teacher. Finally, $\varepsilon_{tig}$ is the random error term.

10 (Rotz, Johnson, and Gill ). All incorporate empirical Bayes shrinkage. Of this group, Baltimore, Florida, and Louisiana use random effects and the rest use fixed effects followed by a post-regression step for shrinkage.

11 To estimate the effectiveness of teachers who share students, we use a technique called the full roster plus method, which attributes equal credit to teachers of shared students. In this method, each student contributes one observation to the model for each teacher to whom he or she is linked and students are weighted based on the dosage they contribute (Hock and Isenberg ; Isenberg and Walsh ). Then we create additional observations to equalize the dosage that each student contributes to the model. The proportion of dosage each student contributes to a teacher remains unchanged because the additional observations are linked to a distinct set of so-called shadow teacher indicators. Coefficient estimates for these shadow teachers are discarded.
Measurement error in the pretest scores will attenuate the estimated relationship between the pre- and posttest scores. To avoid this bias, the VAM in Equation (6) is estimated in two regression steps to account for both measurement error in the pretests and clustering of standard errors at the student level, which cannot be implemented simultaneously. In the first step, we estimate Equation (6) adjusting for measurement error in the pretests using an errors-in-variables correction (eivreg in Stata) that relies on published information on the test-retest reliability of the District of Columbia Comprehensive Assessment System (DC CAS) (Buonaccorsi 2010). We then use the measurement-error corrected values of the pretest coefficients to calculate adjusted posttest scores by subtracting the predicted effects of the pretest scores from the posttest scores:

$$\tilde{Y}_{tig} = Y_{tig} - \hat{\lambda}_g Y_{i(g-1)} - \hat{\omega}_g Z_{i(g-1)} \tag{7}$$

where $\tilde{Y}_{tig}$ is the adjusted posttest score and $\hat{\lambda}_g$ and $\hat{\omega}_g$ are the measurement-error corrected pretest coefficients. Notably, because the measurement error correction accounts only for measurement error that is constant across the test score distribution, any measurement error that varies across the distribution will contribute to heteroscedasticity.
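The attenuation that motivates the first step can be seen in a deliberately simplified setting. The sketch below assumes a single pretest regressor with classical measurement error, a hypothetical reliability of 0.9, and simulated data; it is not the multi-covariate errors-in-variables routine the authors run in Stata, only the textbook one-regressor case in which the naive slope converges to the true slope multiplied by the reliability.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on a single regressor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n = 20000
true_slope = 0.8     # hypothetical effect of the true pretest on the posttest
reliability = 0.9    # hypothetical test-retest reliability of the pretest

# Error variance chosen so that var(true) / var(observed) equals reliability.
error_sd = ((1 - reliability) / reliability) ** 0.5

true_pre = [random.gauss(0, 1) for _ in range(n)]
observed_pre = [t + random.gauss(0, error_sd) for t in true_pre]
post = [true_slope * t + random.gauss(0, 0.5) for t in true_pre]

naive_slope = ols_slope(observed_pre, post)    # attenuated toward zero
corrected_slope = naive_slope / reliability    # one-regressor correction

print(round(naive_slope, 3), round(corrected_slope, 3))
```

The naive slope lands near 0.72 rather than 0.8, and dividing by the assumed reliability approximately recovers the true value. With several correlated regressors the correction is a matrix adjustment rather than a single division, so this should be read only as intuition for why the first step is needed.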
In the second step, we then estimate teacher effects by regressing the adjusted posttest scores from the first step on student background characteristics, grade indicators, and teacher indicators, clustering standard errors by student using a cluster-robust sandwich variance estimator (Liang and Zeger 1986; Arellano 1987). The initial teacher value-added estimates are the coefficients on the teacher indicators in this regression, and their variance is given by the squared standard errors of the coefficient estimates. We then remove any teachers with fewer than 15 students, and re-center the value-added estimates to have a mean of 0. Finally, we calculate the EB estimates by applying Equation (2) to the initial estimates and variances from the main value-added regression. We also calculate the variance of the EB estimates using Equation (4).

We use data from DCPS and the Office of the State Superintendent of Education of the District of Columbia (OSSE). The data contain information on students' test scores and demographic characteristics and enable students to be linked to their teachers. In DCPS, students in grades 3 through 8 and 10 were tested in math and reading using the DC CAS tests. Our analysis is based on students who were in grades 4 through 8 during the 2011-2012 school year and who had both pre- and posttest scores. To enable us to compare students across grades, we standardize student test scores within subject, year, and grade to have a mean of 0 and a standard deviation of 1. We exclude students who are repeating the grade so that in each grade, we compare only students who completed the same examinations.
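The standardization step described above can be sketched as follows. The record layout, field names, and scores are hypothetical, and the use of the population standard deviation (rather than the sample version) is an assumption, since the text does not say which was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_groups(records):
    """Add a z-score computed within each (subject, year, grade) cell.

    Each cell needs at least two distinct scores; otherwise the
    standard deviation is zero and the z-score is undefined.
    """
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["subject"], rec["year"], rec["grade"])].append(rec["score"])

    cell_stats = {key: (mean(vals), pstdev(vals)) for key, vals in cells.items()}

    standardized = []
    for rec in records:
        cell_mean, cell_sd = cell_stats[(rec["subject"], rec["year"], rec["grade"])]
        standardized.append({**rec, "z": (rec["score"] - cell_mean) / cell_sd})
    return standardized


# Hypothetical scale scores for two grades in one subject and year.
records = [
    {"subject": "math", "year": 2012, "grade": 4, "score": 40},
    {"subject": "math", "year": 2012, "grade": 4, "score": 50},
    {"subject": "math", "year": 2012, "grade": 4, "score": 60},
    {"subject": "math", "year": 2012, "grade": 5, "score": 55},
    {"subject": "math", "year": 2012, "grade": 5, "score": 65},
]
standardized = standardize_within_groups(records)
z_scores = [round(rec["z"], 3) for rec in standardized]
print(z_scores)
```

After this transformation a score of 0 means "at the mean for that subject, year, and grade", which is what allows students in different grades to be pooled within a grade span.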
Teachers vary in the characteristics of students that they teach. Table 1 displays the summary statistics for the 297 elementary school teachers (top panel) and 182 middle school teachers (bottom panel) in our sample. For simplicity, we restrict our analysis to math teachers and exclude teachers linked to fewer than 15 or more than 150 students. Excluding teachers who are matched to fewer than 15 students follows the district's implementation of its teacher accountability system. Because groups of students with large residuals will affect the precision of teacher value-added estimates only if they are unevenly distributed across teachers, the table reports the teacher-level averages of student characteristics. For homeroom teachers (typically elementary school teachers), these values are the average characteristics of the students in the teachers' classrooms, whereas for departmentalized teachers, these values are averaged across multiple classes. Table 1 shows that there is wide variation in students' test scores and classroom composition across teachers. For example, in math, the average pretest scores of teachers' students range from −1.34 to 1.36 standard deviations in elementary school and from −1.12 to 1.16 standard deviations in middle school. Teachers also vary substantially in the percentages of their students who are eligible for free and reduced-price lunch, have limited English proficiency, or receive special education services. Table 1 also reports the number of student-equivalents taught by teachers, which weights the number of students taught by teachers by their dosages, that is, the portions of time that individual students were enrolled in their teachers' classes. Elementary school teachers are typically linked to fewer students than middle school teachers, consistent with more departmentalized teachers in middle school. However, even within a grade span there is considerable variation in student-equivalents across teachers, which could affect the precision of value-added estimates.
Postshrinkage value-added estimates and variances are smaller than the preshrinkage estimates and variances. Table 1 presents the preshrinkage and shrinkage value-added estimates and their estimated variances. For elementary school teachers, the preshrinkage estimates range from −0.90 to 0.96 and the estimated preshrinkage variances range from 0 to 0.07. Because the shrinkage procedure assigns the preshrinkage estimates and variances weights that are less than 1 (the average weight for elementary teachers is 0.85), the shrinkage estimates have a smaller range than the preshrinkage estimates and the shrinkage variances are smaller than the preshrinkage variances. For elementary school teachers, the postshrinkage estimates range from −0.77 to 0.71 and the estimated postshrinkage variances range from less than 0.005 to 0.04.
In the next section, we describe our methodology to examine how differences in the characteristics of teachers' students are related to precision and how accounting for imprecision using shrinkage affects the probability that teachers are classified in the extremes of the value-added distribution based on their students' characteristics.
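To make the threshold question concrete, the sketch below flags the lowest-ranked share of teachers before and after shrinkage. All numbers are hypothetical, the cutoff is recomputed within each distribution (an assumption; a system could instead fix the cutoff in advance), and the shrinkage weight is the simple form without the corrections the authors apply.

```python
def flag_lowest(values, share):
    """Flag teachers whose estimate is at or below the cutoff for the lowest `share`."""
    k = max(1, int(len(values) * share))
    cutoff = sorted(values)[k - 1]
    return [v <= cutoff for v in values]


sigma2 = 0.04  # hypothetical variance of true teacher effects

# Hypothetical preshrinkage estimates and sampling variances for five teachers.
# Teacher 0 has the lowest estimate but also by far the noisiest one.
pre_estimates = [-0.50, -0.40, 0.00, 0.10, 0.30]
pre_variances = [0.09, 0.01, 0.02, 0.02, 0.02]

post_estimates = [
    sigma2 / (sigma2 + var) * est
    for est, var in zip(pre_estimates, pre_variances)
]

pre_flags = flag_lowest(pre_estimates, share=0.2)
post_flags = flag_lowest(post_estimates, share=0.2)

print(pre_flags)
print(post_flags)
```

In this toy example the flag moves from the teacher with the noisy estimate to a teacher with a slightly higher but much more precise estimate. Whether such reclassification occurs systematically for teachers of hard-to-predict students is the empirical question the article addresses.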

Empirical Approach
To describe heteroscedasticity in the student-level errors, we estimate the relationship between the squared residuals from the VAM (a measure of the accuracy of the prediction) and student characteristics using the following student-teacher-level regression:

$$\hat{\varepsilon}_{tig}^2 = \alpha + \beta' X_i + \upsilon_{igt} \tag{9}$$

The outcome $\hat{\varepsilon}_{tig}^2$ is the squared residual for student $i$ in grade $g$ with teacher $t$ from the student-teacher-level regression of student test scores on student characteristics and grade and teacher indicators. Because the VAM contains teacher indicators, these residuals are the remaining variation in student outcomes after taking into account the average differences in the effectiveness of students' teachers. Thus, unpredictable outcomes for certain types of students do not simply reflect assignment of these students to teachers with a wider distribution of ability levels. Instead, the residuals measure the remaining difference between these students' outcomes and those of other students taught by the teacher. The vector $X_i$ denotes student characteristics (that is, average math and reading pretest scores, eligibility for free or reduced-price lunches, limited English proficiency, and special education status), and $\upsilon_{igt}$ is an error term.
The regression in Equation (9) is intended to provide a parsimonious description of heteroscedasticity and not a formal test for its presence. For ease of interpretation, we estimate these regressions separately for each student characteristic and grade span; the sole exceptions to the separate regressions are eligibility for free or reduced-price lunch, which are entered simultaneously into the regression. 12 To address repeated observations of students, we estimate standard errors that account for clustering in the residual at the student level and weight the regressions by the student-teacher dosages. As a sensitivity analysis, we also estimate Equation (9) using the natural log of the squared residual as the dependent variable.
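A dosage-weighted bivariate version of this descriptive regression can be sketched in a few lines. The residuals, the free-lunch indicator, and the dosages below are hypothetical, and the sketch returns point estimates only; the student-clustered standard errors described above are not computed.

```python
def weighted_ols(x, y, w):
    """Weighted least squares of y on a constant and one regressor x.

    Returns (intercept, slope). Weights play the role of dosages.
    """
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    num = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    return mean_y - slope * mean_x, slope


# Hypothetical student-teacher observations.
residuals = [0.1, -0.2, 0.5, -0.6]   # residuals from a value-added regression
free_lunch = [0, 0, 1, 1]            # indicator for free lunch eligibility
dosage = [1.0, 1.0, 0.5, 1.0]        # share of the year taught by the teacher

squared_residuals = [r ** 2 for r in residuals]
intercept, slope = weighted_ols(free_lunch, squared_residuals, dosage)

print(round(intercept, 4), round(slope, 4))
```

With a single binary regressor, the slope equals the difference in dosage-weighted mean squared residuals between the two groups, so a positive slope indicates that the flagged group's achievement is harder to predict in this toy data.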
After examining which types of students tend to have higher residuals, we analyze the relationship between these student characteristics and the precision and magnitude of their teachers' pre- and postshrinkage value-added estimates. We estimate the following teacher-level bivariate regression model, again using separate regressions for each student characteristic and grade span:

$$W_t = \alpha + \gamma' X_t + u_t \tag{10}$$

where $W_t$ represents a measure of the precision or magnitude of the pre- or postshrinkage value-added estimate for teacher $t$ and $u_t$ is an error term. Specifically, for both pre- and postshrinkage estimates, we examine as outcomes the natural log of the estimates' variance, the absolute value of the estimates, and indicators for the estimates being above or below a given percentile of the value-added distribution. We examine the absolute value rather than the original value of the estimate because value-added estimates have mean zero (for the average teacher), so the further from zero, the more extreme the estimate of teacher effectiveness in a good (positive) or bad (negative) direction. When $W_t$ represents an indicator for being above or below a cut-point, we present results based on a linear probability model, but as a sensitivity analysis we also estimate a logit model. The vector $X_t$ represents the average characteristics of the teacher's students from Table 1, including measures of the student-equivalents taught by the teacher (the natural log of the total or the raw total). As we did in the student-teacher-level regressions, for ease of interpretation, we estimate separate regressions for each student characteristic (with the exception of eligibility for free or reduced-price lunch) and grade span. We calculate robust standard errors for each teacher.

12 We do not examine higher-order polynomials for pretest scores.

Table 2 presents the results from the regressions that examine the relationship between student characteristics and the various outcomes. The first column displays the results from the student-teacher-level regressions of the squared residuals (Equation (9)) on student characteristics and provides evidence that some groups of students have significantly larger residuals than others. This is consistent with heteroscedasticity and shows which types of students have achievement that is hard to predict. In elementary school, the hard-to-predict students are those who are eligible for free lunch, receive special education services, and have lower pretest scores. 13 The same characteristics are also significantly associated with larger residuals in middle school, as is limited English proficiency. When we substituted the natural log of the squared residual as the dependent variable, we obtained very similar results: for each variable, the direction (positive or negative) was identical to the results shown in Table 2, and the same variables were statistically significant (not shown). These results are consistent with those of Stacy et al. (2012), who found higher residuals for students eligible for free lunch and students with lower prior test scores but did not explicitly examine special education or limited English proficiency status.

Results
Consistent with the significant relationships between particular student characteristics and the student-level residuals, elementary school teachers assigned hard-to-predict students also have less precise value-added estimates. In column 2 of Table 2, we provide the results from teacher-level regressions, which separately regress the natural log of teachers' preshrinkage variances on each of the characteristics of teachers' students and the natural log of the teachers' number of student-equivalents (Equation (10)). 14 For elementary school teachers, a one-standard-deviation decrease in students' average pretest scores in math is significantly associated with a 20% increase in the teacher's preshrinkage variance. Lower pretest scores in reading and free lunch eligibility are also significantly associated with higher preshrinkage variances in elementary school. Interestingly, limited English proficiency is significantly correlated with lower preshrinkage variances in elementary school, and there is no significant relationship between the preshrinkage variances and special education status. Having a larger number of students is associated with lower preshrinkage variances. 15

13 To determine whether students with very high or very low pretest scores have larger residuals, as would be expected from measurement error in the tests, we also examined the relationship between the residuals and the absolute value of the pretest scores, an indicator for having a below-mean pretest score, and an interaction between the absolute value of the pretest score and having a below-mean pretest score (results not shown). Although having a pretest score at either extreme of the distribution is associated with larger residuals, residuals are significantly higher for students at the bottom of the pretest distribution than for those at the top.

14 We use the natural log of the variance as the dependent variable in these specifications for two reasons. First, the natural log of the variance more closely meets the normality assumption for efficient ordinary least squares (OLS) estimates. Second, the specification reflects the theoretical relationship between the variance and observable characteristics, where $\hat{V}_t$ is the estimated variance and $N_t$ is the number of student-equivalents.

15 The coefficient on the natural log of the number of student-equivalents is not −1, as would be expected from the equation in the previous footnote. This is likely due to correlations between class size and other student characteristics that are correlated with teachers' variances.

Table 2. Source: Administrative data from OSSE and DCPS. NOTE: Each result is from a separate regression except for those for eligibility for free or reduced-price lunch, which were entered simultaneously into a regression. In column 1, the results in the first four rows are from a student-level regression of the squared residuals from the value-added model on the student characteristics. The remaining columns are teacher-level regressions. In columns 2 and 3, the outcome is the natural log of the value-added estimates' variance. Column 4 tests the statistical significance of the difference in the relationship between student characteristics and the natural log of the value-added estimates' variance. In columns 5 and 6, the outcome is the absolute value of the value-added estimates. Column 7 tests the statistical significance of the difference in the relationship between student characteristics and the absolute value of the value-added estimates. Robust standard errors are in parentheses. * Significantly different from zero at the . level, two-tailed test. DCPS = District of Columbia Public Schools; OSSE = Office of the State Superintendent of Education of the District of Columbia. n.a. = not applicable.

The same general patterns hold in middle school. Lower pretest scores in math and reading, free lunch eligibility, and special education status are significantly associated with higher preshrinkage variances. The magnitudes of the associations are generally larger for middle school teachers than for elementary school teachers. For example, a one-standard-deviation decrease in students' math scores is significantly associated with a 64% increase in a teacher's preshrinkage variance in the middle school grades. Column 2 of Table 2 also investigates the relationship between the natural log of the teachers' number of student-equivalents and the variance.

These same relationships with student characteristics are significantly smaller for postshrinkage variances. Recall that shrinkage increases the precision of estimates at the cost of intentionally introducing bias. This effect of shrinkage is reflected in columns 3 and 4 of Table 2. In column 3, we examine the relationships among teachers' postshrinkage variances, the characteristics of their students, and teachers' number of student-equivalents. Compared with the preshrinkage variances, the postshrinkage variances have the same number of statistically significant relationships with student characteristics, although the size of these correlations is significantly smaller for the postshrinkage variances. For example, for elementary school students, a one-standard-deviation decrease in students' average pretest scores in math is significantly associated with a 17% increase in the teacher's postshrinkage variance, compared with a 20% increase in the preshrinkage variance, a statistically significant difference. Statistical tests of the differences between the coefficients for the preshrinkage and postshrinkage models (columns 2 and 3, respectively) can be seen in column 4. Similarly, shrinkage reduces the size of the relationships of both free lunch eligibility and teachers' number of student-equivalents with the estimates' variance.
Shrinkage also affects the absolute value of the estimates of teachers with hard-to-predict students. Column 5 of Table 2 presents results from teacher-level regressions of the absolute value of teachers' preshrinkage value-added estimates on student characteristics and teachers' number of student-equivalents. In elementary school, the preshrinkage value-added estimates for teachers of students with lower pretest scores and teachers with more students eligible for free lunch are significantly farther from the overall teacher mean than those of other teachers, whereas the estimates for teachers with more students with limited English proficiency are significantly closer to it. 16 In middle school, student characteristics do not have a significant relationship with the absolute value of the preshrinkage value-added estimates, possibly because we lack the power to precisely estimate these relationships. In column 6, we examine the relationship between the absolute value of teachers' postshrinkage value-added estimates, student characteristics, and teachers' number of student-equivalents. As before, student characteristics and teachers' number of student-equivalents still have statistically significant relationships with the absolute values of the postshrinkage estimates, but these relationships are weaker than the corresponding relationships with the preshrinkage absolute values.

16 The absolute value is measured relative to the overall teacher mean, rather than the mean conditional on the characteristics of teachers' students. The preshrinkage estimates could be related to the characteristics of teachers' students if teachers are not equitably distributed. The results in columns 5 through 7 of Table 2 do not adjust for correlated teacher assignments. However, shrinkage is the only difference between the value-added estimates used in columns 5 and 6, so the difference in the relationships between student characteristics and the absolute values of the value-added estimates (column 7) can be attributed to shrinkage.

We next assess how shrinkage affects the probability that teachers are assigned positive or negative consequences. We focus on the extremes of the distribution because teachers with such estimates are the most likely to be targeted for rewards or sanctions. 17 In particular, we examine the relationship between student characteristics and indicator variables for whether the estimates fell into the top or bottom deciles of their distributions. 18 We presume that districts "grade on a curve"; that is, they set thresholds for categories of effectiveness based on teachers' positions in the value-added distribution. 19 By using the same relative thresholds of effectiveness for pre- and postshrinkage estimates, we use different absolute thresholds as measured in standard deviations of teacher value added. Examining value-added estimates in a relative rather than absolute sense also affects how to think about how estimates change after shrinkage. In an absolute sense, all estimates move toward the center of the distribution, with less precise estimates moving more. In relative terms, however, relatively imprecise initial estimates will move toward the center, and relatively precise estimates will move toward the extremes only if they change places in a rank order with imprecise initial estimates.

We present these results in Table 3. In column 1, using a linear probability model, we regress an indicator variable for the preshrinkage estimate falling in either the top or bottom deciles of the preshrinkage value-added distribution on each student characteristic and on teachers' number of student-equivalents (Equation (10)). For elementary school, we find that having students with low pretest scores, students eligible for free lunch, or smaller numbers of students increases the probability that teachers' preshrinkage estimates are in either the top or bottom deciles of the distribution. In middle school, there is no relationship between the characteristics of teachers' students and the probability that the estimates fall in the top or bottom deciles. Column 2 examines the probability that the postshrinkage estimate is in either the top or bottom deciles of the postshrinkage value-added distribution, and column 3 tests whether the difference between the pre- and postshrinkage relationships is statistically significant.

17 The probability that teachers in the top or bottom decile receive consequences depends on the weight that value-added estimates receive in the overall evaluation, the correlation between value-added estimates and other evaluation components, and the placement of the thresholds.
Of 57 elementary school teachers in the top or bottom decile before applying shrinkage, 2 moved out of the extremes of the distribution after applying shrinkage. For middle school teachers, 5 of 32 teachers did likewise. We find that shrinkage has little effect on the probability that teachers of particular student groups are classified in the top or bottom deciles of the value-added distribution.
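The relative-threshold comparison can be illustrated with simulated teachers. Every quantity in the sketch below is hypothetical: the class sizes, the variance of true effects (`sigma2`), and the textbook shrinkage weight are assumptions, not the study's data or its exact procedure. The sketch flags estimates in the top or bottom decile of their own distribution before and after shrinkage, then counts the teachers whose flag changes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 500

# Hypothetical teachers: true effects plus noise that shrinks with class size.
sigma2 = 0.04                                    # assumed variance of true effects
true_effect = rng.normal(0.0, np.sqrt(sigma2), size=n_teachers)
n_students = rng.integers(10, 60, size=n_teachers)
pre_var = 1.0 / n_students                       # assumed sampling variance
pre_est = true_effect + np.sqrt(pre_var) * rng.normal(size=n_teachers)

# Textbook shrinkage toward a mean of zero (an assumption of this sketch).
weight = sigma2 / (sigma2 + pre_var)
post_est = weight * pre_est


def in_extremes(values, share=0.10):
    """Flag values in the bottom or top `share` of their own distribution."""
    low, high = np.quantile(values, [share, 1.0 - share])
    return (values < low) | (values > high)


# Relative thresholds: each distribution gets its own decile cutoffs.
pre_flag = in_extremes(pre_est)
post_flag = in_extremes(post_est)
moved = pre_flag != post_flag

print("flagged before shrinkage:", int(pre_flag.sum()))
print("flagged after shrinkage: ", int(post_flag.sum()))
print("teachers whose flag changed:", int(moved.sum()))
```

Because the cutoffs are recomputed from the postshrinkage distribution, the number of flagged teachers stays fixed. A teacher's flag changes only when shrinkage reorders that teacher relative to others near a cutoff, which is the rank-order mechanism described above.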
Columns 4 through 6 repeat the same analysis using an indicator variable for the estimate falling in either the top or bottom quintiles of the distribution. Because the top and bottom quintiles cover a much larger range of teacher effectiveness, there are fewer significant relationships between student characteristics and the probability of the estimate's being classified in either the top or bottom quintile, though the magnitude of some of the relationships appears to have increased. Comparing movement of teachers in and out of the top and bottom quintiles, 7 of 119 elementary school teachers switched places, as did 3 of 73 middle school teachers. However, shrinkage does not significantly affect the probability that teachers of hard-to-predict students are classified in either the top or bottom quintiles of the value-added distribution. 20

18 As a sensitivity analysis, we also replace the top and bottom deciles by the top and bottom quintiles.

19 In a typical teacher evaluation system, like the system that was in place in DCPS, this does not force a certain number of teachers to have positive consequences and others to have negative consequences because value-added estimates are used in conjunction with rigorous teacher observations and other measures to create a final evaluation score for each teacher.

Table 3. Source: Administrative data from OSSE and DCPS. NOTE: Each result is from a separate teacher-level regression except for those for eligibility for free or reduced-price lunch, which were entered simultaneously into a regression. In columns 1 and 2, the outcome is the probability the teacher value-added estimate is below the 10th or above the 90th percentile of estimates. Column 3 tests the statistical significance of the difference in the relationship between student characteristics and the probability of being below the 10th or above the 90th percentile of estimates. In columns 4 and 5, the outcome is the probability the teacher value-added estimate is below the 20th or above the 80th percentile of estimates. Column 6 tests the statistical significance of the difference in the relationship between student characteristics and the probability the teacher value-added estimate is below the 20th or above the 80th percentile of estimates. Robust standard errors are in parentheses. * Significantly different from zero at the . level, two-tailed test. DCPS = District of Columbia Public Schools; OSSE = Office of the State Superintendent of Education of the District of Columbia.

Conclusion
Recent work has raised concerns about the effect of heteroscedasticity on estimates of value added because heteroscedasticity can reduce the precision of value-added estimates for teachers of particular types of students (e.g., Stacy et al. 2012). Because shrinkage adjusts value-added estimates by their level of precision, it might be viewed as a method to mitigate heteroscedasticity. This article empirically investigates how shrinkage affects the relationship between student characteristics and the precision of value-added estimates, the absolute value of those estimates, and the probability that the estimates fall in the extremes of the value-added distribution. We find three main results. First, consistent with heteroscedasticity, we find that particular student characteristics associated with disadvantage (lower pretest scores and eligibility for free lunch) are associated with less precise teacher estimates. Second, we show that shrinking the estimates improves the precision of the adjusted estimates and reduces the absolute value of estimates for teachers with hard-to-predict students. Third, in these data, shrinking has no statistically significant impact on the relative probability of exceeding a threshold for teachers of hard-to-predict students.

20 As a sensitivity analysis, we estimated Equation (10) using a logit model in place of a linear probability model. The direction and statistical significance of coefficients reported in columns 1, 2, 4, and 5 of Table 3 are nearly identical (not shown). The only difference is for the coefficient for the number of student-equivalents. These coefficients are just under the threshold for statistical significance in the elementary grade span for the top/bottom decile specification using a linear probability model. By contrast, they are not statistically significant when using a logit model.
The first two results are predicted by theory, while the third result is puzzling. One piece of this puzzle may be that we assume that the accountability thresholds are rescaled along with the estimates. So, while shrinkage moves all teachers toward the middle of the distribution, the thresholds also adjust inward as shrinkage reduces the variance of teacher estimates. One must use caution in generalizing the result that, essentially, the accountability thresholds move in as quickly as teachers of disadvantaged students are pulled toward the middle. In this investigation, we have focused on data from a single year and district. An analysis using more years and districts would reveal whether this result is an anomaly or part of a broader pattern. In the meantime, given the conceptual appeal of shrinkage, the relative simplicity of implementation as a postestimation step, and the importance of this procedure in protecting teachers with relatively few students from being placed in the extremes of the distribution by chance, we recommend its use as part of any value-added model used for teacher accountability.