U.S. states’ performance on NAEP mathematics and reading exams after the implementation of school letter grade accountability policies

Abstract Researchers explored how 13 states in which policymakers have adopted an A-F school letter grade accountability system performed on the National Assessment of Educational Progress (NAEP) post-policy implementation. Researchers found mixed results: approximately half of these 13 states increased achievement post-policy, and the other half did not, evidencing that states' adoption of similar A-F policies may not realize increased student achievement as intended and may, rather, yield more-or-less random results. Policy implications are discussed, also given that seven of these 13 states are Republican/Republican-leaning and one is Democratic/Democratic-leaning, which has implications for others looking to potentially adopt such relatively conservative educational policy approaches toward reform.

ABOUT THE AUTHOR Audrey Amrein-Beardsley, PhD, is a Professor in the Mary Lou Fulton Teachers College at Arizona State University (ASU). Her research focuses on validation studies of multiple teacher- and school-level test-based evaluation and accountability measures and policies. She also serves as an expert witness in many legal cases surrounding the (mis)uses of such accountability-based outputs at multiple educational levels.

PUBLIC INTEREST STATEMENT
Spurred by one state's work with Jeb Bush's ExcelinEd foundation to advance its A-F school letter grade accountability policy, researchers explored how all 13 states that had adopted an A-F school letter grade accountability system (i.e., to hold schools accountable for the A-F letter grades they receive) had performed on the National Assessment of Educational Progress (NAEP) after A-F policy implementation. Researchers investigated the 13 states' score variations as compared to the nation post A-F implementation until 2017 in NAEP grade 4 mathematics, grade 8 mathematics, grade 4 reading, and grade 8 reading. Researchers found mixed results, with about half of the states increasing student achievement post policy implementation and half not, evidencing that states' adoption of A-F policies did (or does) not apparently increase student achievement as intended by such school-level accountability policies. Rather, analyses yielded random results, whereby the demonstrated policy effects on student achievement were really no different than a flip of a coin.

Introduction
In the fall of 2017, one state's board of education convened a technical advisory group to advise the state on its A-F school letter grade policies and procedures, as pertinent to the state's continued use of its A-F school letter grade accountability system. The system was put into place to help the state reform its historically poor-performing schools, and this was to be done via the use of the A-F system to help the state make consequential decisions about its schools, for example, in terms of state funding. Funding tied to school-level performance was to help incentivize school improvement in student learning, with student learning measured by proxy as student achievement.
This state was being pressed to move forward with its A-F reform-based plans by the Foundation for Excellence in Education (FEE). FEE was launched while Jeb Bush was the state of Florida's governor, although it has since been rebranded as ExcelinEd (ExcelinEd, n.d.a.), perhaps given bad press (see, for example, In the Public Interest, n.d.; see also Fang, 2013; Layton, 2015; Mathis, 2011). ExcelinEd describes itself as a "501(c)(3) nonprofit organization focused on state education reform" and operates on approximately $12 million per year of donations from the Bill & Melinda Gates Foundation, Michael Bloomberg Philanthropies, the Walton Family Foundation, and the Pearson, McGraw-Hill, Northwest Evaluation Association, ACT, College Board, and Educational Testing Service (ETS) testing corporations, among others (ExcelinEd, n.d.b.).
Notwithstanding, in this case, the state board of education was being offered the "customized support" that ExcelinEd provides "state policymakers, reformers, educators, parents and communities" across the nation in the name of "aggressive" educational reform and increased educational "quality" via increased school-level accountability. Given "[t]hese foundational policies are proven [emphasis added] pathways to improving student achievement," ExcelinEd continues to work with "states [that] are challenging the status quo" and "taking a stand for [states'] futures by launching bold, student-centered reforms to transform education for all children" (ExcelinEd, n.d.a.). ExcelinEd was working with this state to do the same.
More precisely, one of the appointed members of the state's technical advisory board was a member of the ExcelinEd team and was charged with moving the state forward in terms of its A-F policies and procedures (ExcelinEd, n.d.c.). During each public meeting of the technical advisory board, it became increasingly clear that this individual was the only one urging the state to continue the consequential nature of its A-F policies, consistently pushing back against the other six technical advisory board members, who had made it clear that they opposed the state's continued implementation of its A-F system without evidence of the aforementioned proof that it worked. These six members repeatedly referred to the type of evidence needed to warrant consequential use of the state's A-F data following the Standards for Educational and Psychological Testing (American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME), 2014). For example, they expressly called for evidence that the policy worked in terms of its intended consequences, and that bias was not of empirical or practical concern, given schools that had larger proportions of students eligible for free and reduced lunches may have been more likely to receive the lowest letter grades (e.g., with correlations evidenced at the time ≅ −0.60), among other concerns.
There was so much conflict around the state's A-F system among the seven technical advisory board members, in fact, that the aforementioned ExcelinEd member eventually resigned, but did so only after inserting into the public record the following written statement: "And, when comparing outcomes for the same period prior to the implementation of the original A-F school grading system, [state name removed for peer review] students made more progress after implementation than in the years before" (ExcelinEd, 2018).

Figure 1. Research design for states that adopted an A-F system, with O denoting the observational points before and after year X in which the state implemented its A-F policy (i.e., the point of intervention, after which changes in the dependent variable may be observed one or more years or observations later).
Board members, however, did not accept this statement without another round of debate. At the core of board members' concerns was a set of "exaggerated visuals" used to support the above statement, given inconsistencies in subgroup performance across visuals (e.g., English learner (EL) issues) and non-representative data (e.g., Native American students were missing, although such students represent a proportion greater than that of the African American students visualized across state figures). Further, statistics indicating whether noted differences were statistically and practically significant were absent.
Thereafter, another board member called for the study ultimately conducted herein, and this same board member agreed to be one of the three researchers who ultimately conducted it. Thus, the purpose of this study was to evaluate the relationship between the implementation of states' school letter grade accountability systems (states' A-F letter grades) and states' NAEP exam scores in grades 4 and 8 mathematics and reading, post school letter grade policy implementation. To date, 13 states use a school letter grade accountability system, with Florida being the first state to implement a school letter grade policy in 1998-1999 (see Table 1). It was these 13 states on which researchers focused for the purpose of this study.
It is important to note here that seven of these 13 states were, as of the U.S. presidential election in 2020, Republican or Republican-leaning; one of these states was Democratic or Democratic-leaning. All of this has implications for students, schools, and districts within such states, given these states are more likely to adopt such relatively conservative policy approaches to incite educational reform. Also important to underscore is that states that are Republican or Republican-leaning, which are those in which policymakers are more likely to adopt such test-based educational reform policies, are also, as compared to national averages, the states (a) in which per-pupil school funding is typically lower, (b) which serve higher relative percentages of high-needs students who are also from poor and racial minority backgrounds, (c) that serve relatively more immigrant students and their families, (d) as well as students for whom English is a second language, and (e) in which student-to-teacher ratios are higher, and the like. It is important to note that these policies, within these states but also possibly beyond should similar states follow suit, will have greater potential impacts on higher-needs students. Hence, this research is also important so as to examine whether these "new" test-based policies meant to reform schools, primarily in such states, seem to work.
As such and, again, with these unique populations of students in mind, researchers sought to answer the following research question: To what extent, if any, did these 13 states' NAEP scores vary, in terms of pre- and post-school A-F accountability system implementation, in mathematics and reading as compared to the national average over time? Researchers investigated score variations post policy implementation for the following four NAEP exams: grade 4 mathematics, grade 8 mathematics, grade 4 reading, and grade 8 reading (i.e., the primary grades and subject areas in which the NAEP is administered every other year).
Before researchers explain their methodological approach, it is important to define what school letter grades are, as well as review the literature. For the purposes of this study, researchers define A-F school letter grades as any state's annual achievement profile required via policy or state statute to help the state define and then label school quality every year for every public school in the state. The A-F scale is akin to the A, B, C, D, and F grading scale commonly used within classrooms within schools, where an A is the highest grade (e.g., 90% or above for states using 0-100-point scales), a C is an average grade (e.g., 70% or above), and an F is the lowest and failing grade (e.g., below 60%). Fundamentally, the grades are meant to help parents and members of the public assess and compare schools' performance, and ultimately help states hold schools, and the educators within them, accountable for meeting higher standards so as to increase student achievement and improve upon other important indicators of school quality (e.g., graduation rates) over time.
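The classroom-style scale described above can be expressed as a simple mapping. The cutoffs below follow the common 0-100-point classroom scale (A at 90 or above, C at 70 or above, F below 60); the B and D cutoffs and the function itself are illustrative assumptions, as actual state grading formulas vary and typically combine multiple weighted indicators rather than a single percentage:

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 school score to an A-F letter grade.

    Cutoffs follow the common classroom scale described in the text;
    real state A-F systems use varied, multi-indicator formulas, so
    this is an illustrative sketch only.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```

For example, a school scoring 72 on such a scale would receive a C, while one scoring 58 would receive an F.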
To collect the literature that informed this study, researchers used multiple combinations of relevant search terms like "School Accountability," "A-F Letter Grade," and "School Letter Grade" within two education-specific databases, namely, the Education Resources Information Center (ERIC) and JSTOR. Researchers limited their searches to only peer-reviewed articles that were published after No Child Left Behind (NCLB) was enacted, given NCLB spurred the school accountability movement (No Child Left Behind, 2002). Search results yielded seven relevant articles, noted with asterisks in the reference list, the contents of which are discussed next. It is also important to note here that no international articles were discovered, perhaps because other nations have not yet entertained these school-level, accountability-based policies. Given the United States (U.S.) often "leads" other nations in such large-scale, accountability-based policies (Sørensen, 2016), other nations might also heed the evidence thus far.

School A-F letter grade accountability systems
The move to hold schools accountable for their students' performance on standardized tests began with the U.S.'s No Child Left Behind (No Child Left Behind, 2002) Act, which required that all states provide evidence that their students were achieving adequate yearly progress (AYP) each year. NCLB also required each state to produce an annual report card that listed whether each school was succeeding (see, for example, U.S. Department of Education, 2004). The A-F letter grade systems under analysis constitute a unique subset of the general report cards required by NCLB in that they are similar in purpose: to bear an overall picture of school, district, and state educational quality. As such, 13 states have implemented an A-F grading system since No Child Left Behind (2002), although Florida implemented its system prior to the passage of NCLB. It was these 13 states on which researchers focused for the purpose of this study. See, again, Table 1 for states and years of system implementation.
Yet, even as A-F state-level policies have been implemented in Florida and other states, much is unknown about these systems, as the use of report cards for school-level accountability purposes is an understudied topic (*Coe & Brunet, 2006). Not much research has been conducted on the intended and unintended effects of states' A-F systems; hence, what is known is more theoretical than empirical.
On the positive, albeit theoretical, side, the grades and scales used in states' A-F systems seemingly satisfy multiple public interests within the educational sector by, for example, helping parents and educators better understand an A-F school rating system than a more complex alternative. Using a letter grade system to define schools influences public perception, more specifically, in that the presentation of school performance data through letter grades leads to arguably strong perceptions about very strong or very weak schools as per the data yielded via A-F systems (*Jacobsen et al., 2014). Likewise, using report cards shapes public perception to the extent that people might perceive more variation between high- and low-performing schools than actually exists (*Murray & Howe, 2017). Using report cards also influences parents' perceptions when evaluating potential schools in which to enroll their children. In that the majority of states have open enrollment policies (Education Commission of the States, 2017), this can be a challenge given parents may not have information beyond that which is most often included in these systems, and parents might rely on this limited measure when making enrollment decisions.
This brings us to the concerns, albeit still theoretical, about states' A-F systems. *Jacobsen et al. (2014) explain, for example, that "the information available to 'outsiders' can shape perceptions about organizational functionality, impacting public support for a public good [emphasis added]" (p. 5). If the public is taking up as "truth" that which is being offered via a state's A-F system, in other words, this clearly has implications for all, especially given A-F systems' reductionistic and consequently consumable ways of capturing and assessing overall school quality (see also *Coe & Brunet, 2006). While individual states define school quality differently, each state has at the center of its A-F policy intent substantive increases in student achievement as the most common intended consequence post A-F policy implementation. Regardless of whether increased student achievement is what matters most for the public good, the decisions made by state leaders via every A-F system, for example, in terms of what data to include and what values to attribute to each indicator, have vital implications for the public, for how the public perceives and understands schools as public goods, and for how the public uses its understandings for better or worse, regardless of the validity behind the understandings or "truths" as taken up and inferred.
Of related concern is that the single letter grades used across these systems, even if comprised of more factors than just increases in students' test scores over time, do not often tell complete or comprehensive stories about schools and their students (*Coe & Brunet, 2006; *Polikoff et al., 2014). This is especially true when school report cards are more analytical versus holistic in nature and, again, when states' large-scale standardized test scores carry considerably more weight when evaluating a school's quality than other factors, like programs offered to specialized populations, school strengths and concentrations, and community services offered. Accordingly, critics caution that making policy decisions based on A-F grades is potentially dangerous, given that such grades often do not account for student subgroup performance and effectively ignore achievement gaps within schools and districts (Adams et al., 2016b; see also Adams et al., 2016a; *Murray & Howe, 2017). Such grades "may exacerbate segregation patterns by steering well-resourced and quality-conscious parents away from perfectly good schools, and, in doing so, they may enact a self-fulfilling prophecy by concentrating inequality" (*Schneider et al., 2018, p. 27; see also *Jacobsen et al., 2014). To reveal different aspects of their contexts, schools may accordingly need to compensate for such policies by communicating more effectively to parents and others what actually underlies their worth.

Study methods
Researchers sought to answer the following research question: To what extent, if any, did states' NAEP scores vary, pre- and post-school A-F accountability system implementation, in mathematics and reading as compared to the national average over time? Researchers investigated score variations post policy implementation for the following four NAEP exams: grade 4 mathematics, grade 8 mathematics, grade 4 reading, and grade 8 reading. Researchers used these exams given these are the primary exams administered, every two years, to all states throughout the U.S.

NAEP data
The NAEP was the measure noted as the "gold standard" in the statement made by the technical advisory board member included above (ExcelinEd, 2018), as it is widely considered a very good measure of state-level student achievement that also allows state-by-state comparisons (Desimone et al., 2005). The NAEP was initially created during the 1960s, though it was not used at the statewide level until 1990, when states voluntarily participated in the assessment to benchmark their own students' achievement as compared to students throughout the nation (Beaton et al., 2011; see also U.S. Department of Education, 2003). NCLB requires that states' grade 4 and grade 8 students participate in the NAEP mathematics and reading exams every two years, although there are also other tests, for example, in science, writing, U.S. history, civics, and art, administered following different schedules (National Center on Educational Statistics (NCES), n.d.b). If any state opts out of participating in the NAEP, it forfeits its Title I funding (U.S. Department of Education, 2005).
Due to the standardization of the NAEP, many sets of researchers (e.g., Amrein & Berliner, 2002; Camilli, 2000; Carnoy & Loeb, 2002; Dee & Jacob, 2011; Hanushek & Raymond, 2005; Rosenshine, 2003) have used NAEP scores to compare student test performance across states. For example, many sets of researchers have conducted time series analyses similar to that conducted here (see forthcoming) to inform the federal government as to whether states that attached high-stakes consequences to their state tests, as required per NCLB, outperformed those that had not (e.g., Amrein & Berliner, 2002; Braun, 2004; Dee & Jacob, 2011; Fuller et al., 2007; Lee & Reeves, 2012; Nichols et al., 2003). Similarly, for purposes of this study, the NAEP allowed for state-by-state comparisons, more specifically comparisons of states with A-F school letter grade systems versus the nation, on student achievement in grade 4 and grade 8 NAEP mathematics and reading.

Comparative interrupted time series
Researchers employed a comparative interrupted time series (CITS) research design to answer the research question. CITS studies are useful for determining the degree to which large-scale social or governmental policies, such as the A-F school letter grade policies of interest herein, demonstrate an impact (Hallberg et al., 2018; Judd et al., 1991; Smith & Glass, 1987). In CITS designs, strings of observations of the dependent variables (in this study defined as changes in student achievement on the NAEP exams over time) are made before and after treatments (in this study defined as states' implementations of A-F school letter grade policies) are introduced. This research design enables researchers to yield convincing evidence about the effects that common interventions may have on dependent variables of interest over time, especially if patterns across similar cases (in this study defined as states with similar policies in place at similar or different times) are observed.

Figure 2. Examples of CITS graphs with (A-H) and without (I-K) intervention effects. The magnified ♦ denotes the point at which the intervention was introduced.
Researchers made such observations by displaying trend lines of states' NAEP scores using time series graphs and analyzing the patterns of trend lines before and after the implementation of states' A-F systems to demonstrate relationships between A-F implementation (i.e., treatments) and NAEP scores (i.e., effects; see Fraenkel et al., 2015; Glass, 1988; Hallberg et al., 2018; Smith & Glass, 1987). To facilitate such analyses, and similar to a difference-in-differences analysis controlling for baseline scores one year before implementation, researchers employed two important controls to help evaluate the certainty and strength of any treatment effects that might be observed. First, they used state-level data points before the introduction of the policy treatment to yield baseline information (Glass, 1988; Hallberg et al., 2018; Smith & Glass, 1987). Whether changes in the dependent variable (i.e., NAEP scores) were authentic or artificial might then be better determined on a case-by-case basis, post baseline. Put differently, if a positive change after the intervention is noted, one might conclude that the treatment had its desired or hypothesized effect, or vice versa (Hallberg et al., 2018; Wong et al., 2012).
Second, researchers positioned national trend lines alongside the trend lines of the 13 states that had adopted an A-F system at the time of this study, and at focus in this study, to take into consideration the national context as well as any normal fluctuations or large-scale extraneous influences on the data (Glass, 1988). Researchers used the non-A-F state group, or the states and DC with no A-F policy per each respective year, as a comparison group to help estimate how the dependent variable would have oscillated sans treatment (Campbell & Stanley, 1963; Glass, 1988; Hallberg et al., 2018; Smith & Glass, 1987). In other words, researchers used national trend lines to help them control or account for whether effects at any state level were genuine or mere reflections of national trends. See Figure 1 for an illustration of this type of research design and Figure 2 for visuals of CITS graphs with and without discernible intervention effects.
Researchers calculated each state's score differential over time and compared it to the average national growth (or lack thereof) over the same period of time. Researchers computed this calculation by taking the most recent score, from 2017, for a given state and subtracting from it the score from the most recent NAEP test that occurred prior to that state's A-F system implementation. For example, Alabama's A-F system was implemented in 2013, so researchers subtracted the state's 2011 scores from its 2017 scores, as the 2011 administration of the NAEP was the exam given immediately prior to 2013. Researchers then compared each state's relative growth (or lack thereof) to that of the national average via a series of bar charts. See Figures 3-6 in anonymous google doc here. However, while Hallberg et al. (2018) called for calculating indicators of statistical or practical significance, researchers were unable to do this given they did not have access to the raw NAEP scores that would make such calculations possible.
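The score-differential calculation described above can be sketched as follows. The NAEP scale scores below are invented for illustration only (real values come from the NAEP Data Explorer), and the data structures and function name are hypothetical:

```python
# Hypothetical sketch of the score-differential calculation described in
# the text. Scores below are invented for illustration; real values come
# from the NAEP Data Explorer.
naep_g4_math = {
    "Alabama": {2009: 228, 2011: 231, 2013: 233, 2015: 231, 2017: 232},
}
af_implementation_year = {"Alabama": 2013}

def score_differential(state: str, latest_year: int = 2017) -> int:
    """Most recent score minus the score from the NAEP administration
    immediately preceding the state's A-F implementation year."""
    scores = naep_g4_math[state]
    policy_year = af_implementation_year[state]
    # Baseline: the last NAEP administration before A-F implementation.
    baseline_year = max(y for y in scores if y < policy_year)
    return scores[latest_year] - scores[baseline_year]

# For Alabama (implemented 2013), the baseline is the 2011 administration.
print(score_differential("Alabama"))  # 232 - 231 = 1
```

Each state's differential computed this way could then be set against the change in the national average over the same window to judge relative growth.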
It should also be noted that the time frame under which A-F policies might be expected to change students' NAEP performance (e.g., one year or three years post policy implementation) is indeterminate. Making any fixed determinations of this type would, accordingly, be arbitrary, also given that researchers who have conducted similar studies have not made such fixed determinations either (e.g., Author(s) (); Camilli, 2000; Carnoy & Loeb, 2002; Dee & Jacob, 2011; Hanushek & Raymond, 2005). Rather, and again, researchers subtracted from each state's most recent NAEP score the score from the most recent NAEP test that occurred prior to that state's A-F system implementation. This choice, researchers decided, was the least arbitrary of the options. However, researchers also included all 13 states' data in Appendices A and B (in an anonymous google doc here) so that anyone interested in states' performance, for example, one, two, etc. years post policy implementation could also conduct their own examinations of states' individual or collective policy effects.
While CITS studies can be useful for determining the degree to which large-scale social or governmental policies make an impact, especially across similar cases as is the case here (Hallberg et al., 2018; Judd et al., 1991; Smith & Glass, 1987), CITS analyses are not to be used to yield causal claims unless major threats to internal validity can be dismissed. One of the biggest threats to internal validity in any CITS design is history, or alternative events that occur at approximately the same time as the intervention that may contribute to the observed effect and disguise what actual effect the treatment caused (Campbell & Stanley, 1963; Fraenkel et al., 2015; Hallberg et al., 2018; Judd et al., 1991). However, history threats can be alleviated via the use of national trendlines, as researchers included here, although it is still possible that other state-by-state policies or policy interactions might influence the changes in trendlines observed.
It should be noted here, accordingly, that the 13 states of interest herein also had, and have historically had, in place some of the highest stakes attached to their educational accountability policies beyond the school-level A-F policies. See, for example, state-level histories and states' Accountability Pressure Rating (APR) ratings calculated in detail in Nichols et al. (2006). While this makes these states' adoptions of A-F school letter grading policies less unforeseen, these states' high-stakes accountability histories should also be taken into consideration as potentially, if not likely, confounded. Put differently, the examination researchers conducted herein, while focused on 13 states' performances on the NAEP post A-F school letter grading policies, might also be at least somewhat indicative of how these same high-stakes states are performing as compared to the nation in general. Likewise, differentiating between whether hypothesized or rival causes created a given effect is difficult, as both are confounded (Smith & Glass, 1987); hence, all such trendlines should still be critically consumed and not just assumed to be causal.
Three other common threats to validity include selection, instrumentation, and attrition, although all three are also mitigated here. Selection was not a threat to validity as researchers used state-level data from all 50 states and Washington DC, so no states were omitted. Another non-issue was instrumentation, as the NAEP has consistently assessed mathematics using the same assessment framework since the early 1990s (National Assessment Governing Board (NAGB), 2017a). While the framework upon which the NAEP reading exam drew was modified for exams starting in 2009, a bridge study found that NAEP reading exams before and after the framework modifications were similar; therefore, results and trendlines from exams from both sets of frameworks could be compared (NCES, 2010; see also National Assessment Governing Board (NAGB), 2017b). Finally, the threat of attrition should be taken into consideration with CITS studies, as compositional shifts over time, for example, in the groups of students taking the NAEP exam, could affect NAEP score outcomes (Hallberg et al., 2018). This too, however, was mitigated given the sophistication of NAEP's state-level sampling procedures, which yield representative and accurate depictions of state performance regardless of, for example, large-scale changes in states' student-level demographics over time (Johnson, 1992; see also NCES, n.d.a.).

Note (Amrein-Beardsley et al., Cogent Education, 2022): Grade 4 and grade 8 reading scores were not available for Florida prior to 1998, so the 1998 score was used in the score differential calculation.

Data collection
Researchers used the NAEP Data Explorer (Nation's Report Card, n.d.) to retrieve the composite NAEP reading and mathematics scores for grades 4 and 8 for all available years for all 50 states plus Washington DC. For each A-F state, researchers utilized individual state-level scores for each of the four exams. Researchers then calculated, per year and per exam, an aggregate average score comprised of all A-F states and all non-A-F states. Each non-A-F state average was calculated for each individual state's analysis and used as the national comparison group against which each state's NAEP scores post A-F implementation could be compared, which, again, helped to reduce the aforementioned history threat to validity (Cook & Campbell, 1979; Hallberg et al., 2018). Researchers only included a state's scores in the A-F calculation per year if that state had implemented its A-F system by that year. For example, Alabama instituted its A-F system in 2013, so its NAEP scores were included in the non-A-F average for all years prior to 2013 and in the A-F average for all years 2013 and beyond.
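The per-year grouping rule described above (a state counts toward the A-F average only in years at or after its implementation year) can be sketched as follows. The implementation map and scores are invented for illustration; only Florida's 1999 and Alabama's 2013 implementation years come from the text:

```python
# Sketch of the per-year A-F vs. non-A-F grouping described in the text.
# The subset of states and the scores below are illustrative only.
implementation_year = {"Florida": 1999, "Alabama": 2013}

def group_averages(scores_by_state: dict, year: int):
    """Split states into A-F and non-A-F groups for a given year and
    return the mean score of each group (None if a group is empty)."""
    af, non_af = [], []
    for state, score in scores_by_state.items():
        policy_year = implementation_year.get(state)
        if policy_year is not None and year >= policy_year:
            af.append(score)       # A-F system in place by this year
        else:
            non_af.append(score)   # comparison (non-A-F) group
    mean = lambda xs: sum(xs) / len(xs) if xs else None
    return mean(af), mean(non_af)

# In 2011, only Florida had an A-F system in place; Alabama (implemented
# 2013) and a state with no A-F policy fall into the comparison group.
scores_2011 = {"Florida": 240, "Alabama": 231, "Ohio": 244}
print(group_averages(scores_2011, 2011))  # (240.0, 237.5)
```

By 2013 and beyond, Alabama's score would shift from the non-A-F average into the A-F average, mirroring the Alabama example in the text.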
For the A-F states, score data ranged from as few as one score post-implementation (i.e., for Ohio, Texas, and West Virginia) to nine scores (i.e., Florida), with most states having two scores post-implementation (e.g., Alabama, North Carolina). To address their main research question, researchers included NAEP scores for one test administration prior to the beginning of each state's implementation of its letter grade accountability system and interpreted this as each state's pre-policy baseline NAEP score.
Researchers also included in the state-level trend graphs of scores over time (see Appendices A and B in an anonymous google doc here) three years of NAEP scores prior to A-F implementation for illustrative purposes, as is also recommended by Hallberg et al. (2018; see also Bloom, 2003). Researchers included these additional years to illustrate ample years of NAEP activity prior to any treatment effect, in that any treatment effect can only be recognized if pretreatment trends (e.g., three years) are illustrated (Hallberg et al., 2018; Wong et al., 2012).
However, researchers calculated the pre- to post-policy changes using only one year of prior NAEP data; that is, the one year immediately preceding each state's A-F system implementation. While some might argue that, for example, the anticipation of such policies could cause changes in scores earlier than one year prior, researchers believed using two or even three years prior was less defensible and more arbitrary than using one year back as the baseline point. If others might disagree, researchers illustrated three years of prior data, again, as recommended for demonstrative purposes (Hallberg et al., 2018; see also Appendices A and B in an anonymous google doc here), but also in case others might want to visually assess any trends using more than one year prior. See the years of NAEP data included for each state in Table 2.
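The baseline rule above (the one administration immediately preceding implementation, compared against 2017) can be sketched as follows, again with hypothetical placeholder scores rather than the study's data.

```python
# Sketch of the study's pre/post change: 2017 score minus the score on the
# last NAEP administration before A-F implementation (the baseline).
# The example scores are illustrative placeholders, not actual NAEP data.
def pre_post_change(scores, adoption_year, end_year=2017):
    """Net change from the pre-policy baseline to end_year, or None if
    either the baseline or the end-year score is unavailable."""
    pre_years = [y for y in scores if y < adoption_year]
    if not pre_years or end_year not in scores:
        return None
    baseline = scores[max(pre_years)]  # administration immediately prior
    return scores[end_year] - baseline


example = {2009: 228, 2011: 231, 2013: 233, 2015: 231, 2017: 232}
change = pre_post_change(example, adoption_year=2013)  # baseline is the 2011 score
```

Under this rule a hypothetical state adopting its A-F system in 2013 is baselined at its 2011 administration, even though earlier years (here 2009 and 2011) are still plotted in the trend graphs for demonstrative purposes.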

Results
Overall, researchers found mixed results as to whether and to what extent there was a clear relationship between states implementing A-F school letter grade accountability systems and increased student achievement.

Grade 4 mathematics
For grade 4 mathematics, the average score across all A-F states was consistently lower than the average score across all 50 states and DC (see Table 3). Since 2009, states that had implemented an A-F system scored between one and three points lower on the grade 4 mathematics exam than did all states and DC, as the A-F states' average scores between 2009 and 2017 ranged from 238 to 240 while the national average scores during the same time period ranged from 240 to 242.
While the average across all A-F states for the NAEP grade 4 mathematics exam was lower than the national average, several states had a higher score on the 2017 exam than on the exam immediately preceding the A-F implementation (see Table 4). Five of the 13 states (38.5%) had net score increases since their A-F systems were implemented, seven states (53.8%) had net score decreases since A-F implementation, and one state (7.7%) demonstrated no change. Compared to the national average on grade 4 mathematics scores, eight of the 13 states (61.5%) demonstrated growth over time greater than that of the national average (see Figure 3). Three of the 13 states (23.1%) demonstrated less growth, and two states (15.4%) had comparable growth to that of the national average.

Grade 8 mathematics
For grade 8 mathematics exams, the average score across all A-F states was consistently lower than the average score across all 50 states and DC (see, again, Table 3). Since 2009, states that had implemented an A-F system ranged between three and six points lower on the grade 8 mathematics exam than did all states and DC, as the A-F states' average scores between 2009 and 2017 ranged from 278 to 280 while the national average scores during the same time period ranged from 282 to 285.
While the average across all A-F states for the NAEP grade 8 mathematics exam was lower than the national average, several states had a higher score on the 2017 exam than on the exam immediately preceding A-F implementation (see, again, Table 4). Five of the 13 states (38.5%) had net score increases since their A-F systems were implemented, yet eight states (61.5%) actually had net score decreases since A-F implementation. Grade 8 mathematics growth compared to that of the national average varied more than that of grade 4 mathematics, as six of the 13 states (46.2%) demonstrated greater growth over time compared to that of the national average (see Figure 4), six other states (46.2%) demonstrated less growth, and one state (7.7%) had comparable growth to that of the national average.

Grade 4 reading
For grade 4 reading exams, the average score across all A-F states was consistently lower than the average score across all 50 states and DC (see Table 5). Since 2009, states that had implemented an A-F system scored between two and four points lower on the grade 4 reading exam than did all states and DC, as the A-F states' average scores between 2009 and 2017 ranged from 217 to 220 while the national average scores during the same time period ranged from 221 to 223.
While the average across all A-F states for the NAEP grade 4 reading exam was lower than the national average, several states had a higher score on the 2017 exam than on the exam immediately preceding the A-F implementation (see, again, Table 4). Eight of the 13 states (61.5%) had net score increases in grade 4 reading after their A-F systems were implemented, three states (23.1%) demonstrated net score decreases post A-F implementation, and two states (15.4%) showed no change. Grade 4 reading evidenced a similar pattern to grade 4 mathematics in that eight of the 13 states (61.5%) had greater growth over time compared to that of the national average (see Figure 5), while five of the 13 states (38.5%) had less growth compared to that of the national average growth.

Table 6. Growth, or lack thereof, over time compared to the national average, per state (grade 4 reading and grade 8 reading). Note: A plus sign (+) denotes a state with growth greater than that of the national average, a minus sign (-) denotes a state with growth less than that of the national average, and a zero (0) denotes a state with growth on par with that of the national average.

Grade 8 reading
For grade 8 reading, the average score across all A-F states was consistently lower than the average score across all 50 states and DC (see, again, Table 5). Since 2009, states that had implemented an A-F system ranged between two and six points lower on the grade 8 reading exam than did all states and DC, as the A-F states' average scores between 2009 and 2017 ranged from 260 to 263 while national average scores during the same time period ranged from 264 to 268.
While the average across all A-F states for the NAEP grade 8 reading exam was lower than the national average, several states had a higher score on the 2017 exam than on the exam immediately preceding A-F implementation (see, again, Table 4). Eight states (61.5%) had net score increases since their A-F systems were implemented, two states (15.4%) had net score decreases since A-F implementation, and three states (23.1%) showed no change.
In grade 8 reading, states evidenced a similar pattern to grade 8 mathematics in that a plurality of states demonstrated less growth compared to that of the national average (see Figure 6). Five of the 13 states (38.5%) had greater growth over time compared to that of the national average, while six of the 13 states (46.2%) had less growth. Two states (15.4%) exhibited comparable growth to that of the nation.

Additional results
As indicated, there did not appear to be any concrete pattern in terms of whether there was a consistent positive or negative relationship between states implementing an A-F accountability system and NAEP scores on grade 4 and grade 8 mathematics and reading. In sum, relative to the national average, the NAEP data slightly favored A-F states on grade 4 mathematics and grade 4 reading, the A-F states were evenly split on grade 8 mathematics, and a plurality of A-F states demonstrated less growth than the nation on grade 8 reading.
Also of note is that most states did not display the same type of growth, or lack thereof, across all four exams. For example, while Texas demonstrated growth greater than that of the national average for both grade 4 and grade 8 mathematics, it demonstrated less growth than that of the national average for both grade 4 and grade 8 reading (see Table 6). While some states (i.e., 30.8%; n = 4/13) did have consistent patterns across all four exams, either in favor of or against the state's NAEP performance in comparison to the nation, the majority of states did not yield consistent results. Indeed, a lack of consistency was clear across these 13 states' NAEP trends over time.

Conclusions
Spurred by one state's work with ExcelInEd on its A-F school letter grade accountability policy, researchers sought to explore how all 13 states that had adopted an A-F school letter grade accountability system performed on the NAEP post policy implementation. Researchers investigated 13 states' score variations as compared to the nation, from post implementation through 2017, in NAEP grade 4 mathematics, grade 8 mathematics, grade 4 reading, and grade 8 reading and found, overall, mixed results as to whether and to what extent there was a clear relationship between states implementing A-F school letter grade accountability systems and the increased student achievement that was to result. Put differently, these 13 states' adoptions of A-F policies were not systematically related to NAEP gain scores. More specifically, researchers found that approximately half of these 13 states increased achievement as per the NAEP post-A-F policy while the other approximate half did not, evidencing that states' adoptions of similar A-F policies may not realize increased student achievement as intended and, rather, may yield more-or-less random results.
This finding is certainly significant for other states throughout the U.S. in which leaders have adopted, or might be considering adopting, similar A-F school letter grade policies, given that such state-level accountability policies still seem to be trending due, at least in part, to the ongoing efforts of ExcelInEd. Such state leaders should, at minimum, exercise great caution and care when continuing to implement, or potentially adopting, such policies on, perhaps, promises versus actual proof. The proof, or lack thereof, presented in this study should, at minimum, help state leaders and others critically question and consume claims like those presented to the leaders of the particular state of interest in this study (i.e., that student achievement on the NAEP provided evidence that the state was "moving in the right direction," as compared to the nation, post adoption of its A-F school grading system; ExcelinEd., 2018).
Indeed, evidence in this study suggested that since this state's adoption of its A-F policy, the state surpassed the nation on grade 4 mathematics, grade 4 reading, and grade 8 reading, and was on par with the nation in grade 8 mathematics as per the NAEP. While this is not entirely contradictory to that which was advanced by ExcelinEd's representative (see also the trends observed in Florida, Mississippi, Utah, and West Virginia), other states yielded opposite results (e.g., Arkansas, New Mexico, North Carolina, Ohio, Oklahoma). See these and other results, again, in Table 3.
Nonetheless, the larger picture painted via the analyses in this study is, perhaps, of more significance and worth, given researchers' inclusion of all A-F states in their sample to examine the extent to which the intended results, or increased achievement, may or may not have generalized. Again, researchers found that the NAEP data slightly favored A-F states on grade 4 mathematics and reading, that states were roughly evenly split on grade 8 mathematics, and that a plurality of states demonstrated less growth on grade 8 reading. This evidence likely does not support the advancement of states' policy efforts in these regards. Rather, again, the evidence suggests that states that have adopted these policies have, relative to the nation and DC, fared no better or worse in terms of increasing student achievement as a causal result of adopting and implementing an A-F school letter grade system. In reality, how these states have performed post policy implementation is not much different than random, or a flip of a coin.
This, in and of itself, has implications for educational policy and practice in that this is the first empirical study of which researchers are aware to offer such evidence to states currently using or contemplating the continued use of such systems. These states should, accordingly, attend to this evidence prior to further policy consideration, perhaps especially when organizations pushing state leaders and policymakers to advance such policies make claims about these "foundational policies [being] proven [emphasis added] pathways to improving student achievement" (ExcelinEd., n.d.a.), albeit without much, if any, evidence in support.
Likewise, while the U.S. continues to lead other nations in terms of similar accountability-based and policy-backed initiatives, other countries might heed this evidence as well. Put differently, in the U.S. the use of accountability-based policy instruments to evaluate schools and teachers "has been taken exceptionally far [emphasis added]," while "most other high-income countries remain [relatively more] cautious" towards the use of such accountability-based tactics (Sørensen, 2016, p. 1). However, with the support of global bodies such as the Organisation for Economic Co-operation and Development (OECD), such policies continue to trend and continue to be adopted worldwide for purposes and uses similar to those in the U.S. (see also, Araujo et al., 2016). Hence, what has been learned from states' uses and misuses of such policies throughout the U.S. can and should also be understood by others in other countries when considering whether to adopt or implement similar school-level accountability policies within their own educational accountability policy contexts.