National tests and the wellbeing of primary school pupils: new evidence from the UK

ABSTRACT There is growing concern about the mental wellbeing of young people, including how this is related to national tests. This is a particularly important policy issue in England, where it is claimed that the end-of-primary Key Stage 2 tests cause stress for schools, pupils and teachers. I investigate this issue using data from the Millennium Cohort Study, comparing the wellbeing of pupils in England (measured around the point they are sitting their Key Stage 2 tests) to the rest of the UK (where Key Stage 2 tests are not taken). No evidence is found that the Key Stage 2 tests in England are associated with lower levels of happiness, enjoyment of school, self-esteem or children’s mental wellbeing. Likewise, no evidence is found that children who are happier, more self-confident or with higher levels of wellbeing obtain higher Key Stage 2 test scores.


Introduction
High-stakes tests, where children sit assessments covering key academic competencies with results potentially having material consequences for themselves and their schools, are becoming an increasingly common element of school systems across the world (Howard, 2020). In many countries, such tests go hand-in-hand with school and teacher accountability, with the results used to make judgements about pupil, school and teacher performance (Amoako et al., 2019). Some argue that the independent information provided by such tests plays a vital role in monitoring and supporting young people's academic achievement, allowing underperforming pupils, teachers and schools to be identified and remedial action to be put in place (Figlio & Loeb, 2011). Results from such assessments also play a key role in providing information to key education stakeholders, with 'league-tables' published by the national government to help parents select the appropriate school for their child (Allen et al., 2014).
Yet high-stakes testing has also come under much criticism within the education community (Minarechová, 2012). It has been argued that high-stakes testing in a small number of key subjects (usually English and mathematics) leads to a narrowing of the school curricula, with teachers spending less time on Arts, Humanities and Social Studies, and instead being motivated to 'teach to the test' (Berliner, 2011). Others have suggested that the results from such tests do not fulfil their intended purpose, are misused in making judgements about school and teacher effectiveness (Bitler et al., 2019) and may hinder rather than support the development of children (Amrein & Berliner, 2003). This, in turn, undermines the important work done by schools.
However, by far the strongest criticism against such tests is that they cause stress and harm the wellbeing of young people (Cho & Chan, 2020; UK Education Select Committee, 2020). This has been a growing concern in recent years, with mental health problems amongst school children in England on the rise (NHS Digital, 2018). Although such mental health problems are likely to have many causes, including family relationships (Amato & Keith, 1991), genetics (Gottschalk & Domschke, 2017) and social media (Frith, 2017), it is widely believed that high-stakes testing, by putting extra pressure upon young people, is partially to blame. Much discussion of this issue focuses upon teenagers, as they prepare to take end of secondary school examinations, such as the General Certificate of Secondary Education (GCSE) in England. Yet there has been growing concern about the impact that high-stakes testing has upon younger pupils as well (UK Education Select Committee, 2020). For instance, almost all 10/11-year-olds in England take Key Stage 2 Standard Attainment Tests (SATs) when they finish primary school. It has been claimed that these tests cause stress amongst both teachers and pupils (Bradbury, 2019), with newspaper headlines screaming how 'more primary school children [are] suffering stress from SATs' (Weale, 2017) and how the 'stress of SATs gives children nightmares' (Ward, 2018).
But is this really true? The main thrust of the existing academic evidence on this issue would seem to suggest so. For instance, Reay and Wiliam (1999:346) undertook qualitative research with a class of 20 students in a working-class primary school in London. They found that 'SATs have set in motion a new set of tensions with which year 6 students are expected to cope' and 'there are strong currents of fear and anxiety permeating children's relationships to the SATs process'. Similarly, Webb and Vulliamy (2006) argued that the testing system in England led to stress and anxiety amongst pupils. Reviewing the literature, Wyse and Torrance (2009:219) argued that 'the introduction of the National Curriculum Tests in England seemed to produce a correlation between low achievement and low self-esteem'. Connor (2001) found that, although the introduction of Key Stage 2 tests in England was linked to an overall rise in education standards, there were also signs that they had led to increased stress amongst pupils. This finding was then replicated within another small-scale follow-up survey (Connor, 2003). Importantly, Connors et al. (2009:9) described how there was 'a distinct temporal dimension' about pupils' feelings towards the Key Stage 2 tests, with many nervous beforehand (particularly in the period following the Easter holiday) but then experiencing 'a sense of relief once the SATs were over and done with'. In a large-scale survey, Hutchings (2015) found that most teachers strongly agreed with the statement that 'many pupils become very anxious/stressed in the time leading up to SATs' (Hutchings, 2015:55), with a teacher stating in a qualitative interview that 'In the lead up to SATs, I have had pupils in tears, feeling sick, feeling stressed because they were so worried about the results from the tests' (Hutchings, 2015:56).
On the other hand, it is also worth noting the counterarguments as to why the Key Stage 2 tests may not be negatively associated with pupil wellbeing. Three possibilities stand out. First, tests are not inherently high or low stakes, with this depending upon (and differing across) stakeholders. Thus, while Key Stage 2 tests may well be high-stakes for teachers and schools (due to the publication of school league tables), this may not be the case for children (and, in turn, their parents), given that there are no major educational consequences for them based upon the results. Children may therefore not feel any added pressure from Key Stage 2 tests in comparison to the likely alternatives, such as schools/teachers using their own end-of-year tests or a greater reliance upon teacher assessment. Second, given that Key Stage 2 tests may well be considered high-stakes for teachers and schools, it has been argued that these groups may in turn increase the academic pressure they put upon pupils via 'fear appeals' (Putwain & Von Der Embse, 2018). Yet we do not currently have a good estimate of how prevalent the use of such fear appeals is amongst teachers of Key Stage 2 pupils across England as a whole. Indeed, many primary school teachers may be more concerned about the welfare of their pupils than about how they perform on the Key Stage 2 tests, and thus may actually try to protect them from such pressures, rather than try to use them as a performance-enhancing tool. Moreover, the impact such fear appeals have upon children across the population as a whole is not clear, with there likely to be heterogeneous effects. For instance, while some Key Stage 2 pupils may see such fear appeals as threatening, others may respond positively to them as a challenge. This is consistent with the work of Putwain et al. (2012:299), who noted how there were a 'variety of experiences of SATs - some as anxiety-provoking and stressful, consistent with threat appraisals, and some as more positive (e.g. excitement), consistent with a challenge appraisal'. At the population level, it is hence possible that negative effects upon the wellbeing of some pupils may be offset by positive effects upon others. Finally, the impact that Key Stage 2 tests have upon children could be domain-specific, having a narrow impact upon their attitudes and feelings about school. This may, or may not, spill over into their general wellbeing.
Although the work cited above is insightful, important gaps in the evidence on the link between Key Stage 2 tests and pupil wellbeing remain. Much of the work in this area is of a qualitative nature, or based upon small convenience samples. Moreover, very few studies have attempted to contrast findings against a meaningful control group, who have not had to sit the Key Stage 2 tests at the same time. Similarly, few studies have attempted to evaluate whether children's socio-emotional outcomes, such as their mental wellbeing, happiness, self-esteem or enjoyment of school, improve once the Key Stage 2 tests have passed. This, in turn, leaves several important questions unanswered. For instance, is the wellbeing of primary school children who face taking high-stakes national tests lower than that of those who do not? Does the wellbeing of primary pupils decline in the approach to these tests, but then return to a more normal state afterwards (as suggested by much of the existing evidence base described above)? And how much of a problem is this really in England, where national tests for primary schools carry particular importance (at least for schools)?
This paper attempts to make this contribution to the existing literature. Using Millennium Cohort Study (MCS) data from the United Kingdom, we draw comparisons of pupil wellbeing across England (where Key Stage 2 tests are sat at the end of primary school) and the rest of the UK (where Key Stage 2 tests are not taken). Importantly, a large part of the age 11 MCS fieldwork was conducted during the Key Stage 2 test period, meaning we can also examine how pupil wellbeing changes as these high-stakes tests approach. Finally, we also investigate whether pupil wellbeing is associated with performance on the Key Stage 2 tests in England. In other words, do pupils who feel unhappy in the build-up to Key Stage 2 tests end up with lower scores on these tests?
To preview key findings, this paper largely presents a story of null effects. No consistent evidence is found of children's mental wellbeing, happiness, self-esteem or enjoyment of school being associated with the timing of Key Stage 2 tests in England. Differences in these socio-emotional outcomes between England and the rest of the UK are also typically small in terms of magnitude, with little sign that they differ before and after the Key Stage 2 tests take place. Similarly, no link is found between children's socio-emotional state (such as their happiness, wellbeing or enjoyment of school) and their scores on the Key Stage 2 tests. Hence, in stark contrast with existing literature and conventional wisdom, this paper concludes that there is no suggestion that primary school children's wellbeing is substantially affected by the presence of national tests, and that there is insufficient evidence to support calls for the Key Stage 2 tests to be scrapped.
The paper now proceeds as follows. The Millennium Cohort Study (MCS) data is described in section 2, along with our empirical methodology. Results are presented in section 3, with discussion and conclusions in section 4.

Data
The Millennium Cohort Study (MCS) is a rich, nationally-representative longitudinal study of UK children. A stratified, clustered survey design was used, with geographic areas (electoral wards) selected as the primary sampling unit, and then households with newly born children randomly selected from within (see Plewis, 2004 for further details). Six waves have been conducted between 2000/01 and 2015, when children were nine months, 3, 5, 7, 11 and 14 years old. Parents, children and their teachers have been interviewed within the various waves. In total, 19,243 cohort members participated in the first survey, when children were nine months old (12,224 in England). Within this paper, the focus is upon data from the fifth survey wave when children were in the final year of primary school. Unless stated otherwise in the table/figure notes, the MCS wave 5 survey weight is applied throughout the analysis, with standard errors adjusted for the clustering in the MCS survey design.
Key Stage 2 tests take place in one specific week of the academic year. In 2012, this was the week beginning 14th May. Table 1 illustrates how the timing of the test is related to the fifth wave MCS fieldwork. There is clearly quite substantial overlap, with the MCS fieldwork in England taking place in the build-up to the Key Stage 2 test week.
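The alignment between interview dates and the test week can be sketched in a few lines. This is an illustration, not the paper's code: it assumes, following the notes to Table 1, that a week runs from Friday to Thursday, and it assumes that week 0 is the week containing the Key Stage 2 tests (Friday 11 May to Thursday 17 May 2012). The names `KS2_WEEK_START`, `survey_week` and `in_build_up_window` are hypothetical.

```python
from datetime import date

# Assumption: week 0 is the Friday-to-Thursday week containing the
# Key Stage 2 tests (Monday 14 May to Thursday 17 May 2012).
KS2_WEEK_START = date(2012, 5, 11)  # the Friday before the tests begin


def survey_week(interview_date: date) -> int:
    """Whole Friday-to-Thursday weeks between the interview and the test week.

    Negative values fall before the test week, positive values after it.
    """
    return (interview_date - KS2_WEEK_START).days // 7


def in_build_up_window(interview_date: date) -> bool:
    """True for interviews from Friday 30 March to Thursday 17 May 2012."""
    return -6 <= survey_week(interview_date) <= 0


print(survey_week(date(2012, 3, 30)))  # first day of the build-up window
print(survey_week(date(2012, 6, 1)))   # first Friday in June
```

Under these assumptions an interview on 1 June falls in week 3, and the window from 30 March to 17 May covers weeks -6 to 0.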

Measures of pupil wellbeing
Table 1 (fragment: the earlier rows and the column headers are missing from the source; the two count columns are presumed to be England and the rest of the UK)

Week    Survey dates                  England    Rest of the UK
[...]   [...] to 31st May             221        72
3       1st June to 7th June          231        78
4       8th June to 14th June         269        88
5       15th June to 21st June        156        67
6       22nd June to 28th June        170        81
7       29th June to 5th July         124        77
8       6th July to 12th July         137        74
9       13th July to 19th July        140        52
10      20th July to 26th July        128        63

Notes: A week here is defined as starting on Friday and ending on Thursday. This is due to the Key Stage 2 tests taking place between Monday and Thursday. Grey shading indicates the analytic sample we restrict the MCS to within parts of our analysis. In the six weeks prior to the Key Stage 2 tests, the 'rest of the UK' sample comprises 56% of children from Wales, 20% from Scotland and 24% from Northern Ireland (MCS weight applied).

Within the fifth MCS survey wave, children answered a self-completion questionnaire. Throughout this paper, the following outcomes are of interest, measured using standardised scales:
• Mental well-being. Children were asked: 'In the last four weeks, how often did you [...]
[...] to each statement using a four-point scale (strongly agree to strongly disagree).

Two-Parameter Item-Response Theory (IRT) models are used to create the four outcome scales, based upon the questionnaire items outlined above. These scales are standardised to mean zero and standard deviation one across all wave 5 MCS respondents, with all results hence presented in terms of effect sizes. (An item-level analysis, where results are reproduced for each question separately, rather than combining them into scales, is presented in Appendix E.)
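As a rough illustration of the scaling step, the sketch below shows the textbook two-parameter logistic item response function and a weighted standardisation to mean zero and standard deviation one. It is not the estimation routine used in the paper: the source does not report how the four-point items enter the model, and the discrimination (`a`), difficulty (`b`), latent score and weight values here are made up.

```python
import math


def two_pl_probability(theta: float, a: float, b: float) -> float:
    """Textbook 2PL: probability of endorsing a binary item given trait theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def standardise(scores, weights):
    """Rescale scores to weighted mean 0 and weighted standard deviation 1."""
    total = sum(weights)
    mean = sum(w * s for s, w in zip(scores, weights)) / total
    var = sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / total
    sd = math.sqrt(var)
    return [(s - mean) / sd for s in scores]


# Hypothetical latent scores and survey weights for five children.
theta_hat = [-1.2, -0.3, 0.1, 0.8, 1.9]
survey_weight = [1.0, 0.8, 1.2, 1.0, 0.5]
z = standardise(theta_hat, survey_weight)
```

Once a scale is in this form, a difference of 0.1 between two groups can be read directly as 0.1 standard deviations, which is how the effect sizes in the paper are expressed.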

Background data
A wide array of information has been collected from children and their families in previous survey waves. As well as information on demographic background (gender, ethnicity, parental education), this includes attitudes and enjoyment of school at age 7, behavioural issues (as measured by the Strengths and Difficulties Questionnaire) and scores on a range of low-stakes cognitive tests (including in reading and mathematics). Such information is used as controls within the analysis.

Differences in socio-emotional outcomes between England and the rest of the UK
To begin, the four socio-emotional outcomes of interest will be compared across England (where Key Stage 2 tests are taken) and the rest of the UK (where they are not). The sample is first restricted to only those children who completed the MCS survey between Friday 30 March 2012 and Thursday 17 May 2012 (the end of the Key Stage 2 test window). Average scores for each of the four scales (and upon each item) are then compared between children who live in England and children who live in other parts of the UK. If Key Stage 2 tests really do have a substantial negative impact upon children's wellbeing, then one would anticipate school enjoyment, happiness, self-esteem and mental wellbeing to be lower in the former (England) than the latter (rest of the UK). For reference, Table 2 provides a comparison of the background characteristics of children in England and the rest of the UK. This comparison is presented both before and after the sample restriction discussed above has been imposed.
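The descriptive comparison described above amounts to a weighted difference in means. The following minimal sketch uses invented records and hypothetical function names; it does not reproduce MCS data or the paper's estimates.

```python
def weighted_mean(values, weights):
    """Survey-weighted mean of a sequence of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def england_gap(records):
    """Weighted mean outcome in England minus that in the rest of the UK.

    Each record is (standardised_score, in_england, survey_weight).
    Because scores are in standard deviation units, the gap is an effect size.
    """
    eng = [(s, w) for s, e, w in records if e]
    rest = [(s, w) for s, e, w in records if not e]
    return weighted_mean(*zip(*eng)) - weighted_mean(*zip(*rest))


# Made-up records: three children in England, two elsewhere in the UK.
records = [
    (-0.3, True, 1.0),
    (0.1, True, 1.0),
    (-0.1, True, 2.0),
    (0.2, False, 1.0),
    (0.0, False, 1.0),
]
gap = england_gap(records)
```

A negative gap would indicate a worse outcome in England, matching the sign convention used for the effect sizes reported in the paper.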
To check that the results are not being driven by differences in the composition of the sample across countries, the following OLS regression model will be estimated to supplement the descriptive analysis outlined above:

$$O_{ij} = \alpha + \beta C_i + \gamma' D_i + \delta' A_i + \theta' P_i + \varepsilon_{ij} \qquad (1)$$

Where:
$O_{ij}$ = The socio-emotional outcome scale of interest.
$C_i$ = A dummy indicator for whether the child lives in England (1) or not (0).
$D_i$ = A vector of controls for children's background characteristics (ethnicity, household income, single parent household, parental education, gender).
$A_i$ = Academic achievement of the child measured at age 7, based upon scores in the MCS maths, pattern construction and reading tests.
$P_i$ = A vector of socio-emotional outcomes measured for the child at age 7, including happiness, SDQ scores, school enjoyment and mental wellbeing.
$\varepsilon_{ij}$ = The error term.
$\alpha$, $\gamma$, $\delta$, $\theta$ = The intercept and the coefficients on the control variables.
$i$ = Child i.
$j$ = Primary sampling unit j.

The model presented in equation (1) includes controls for gender, family background and a range of socio-emotional and cognitive measures gathered in the prior (age 7) survey wave, with imputation used to account for any missing covariate data. The parameter of interest from this model is $\beta$: the extent to which wellbeing, self-esteem, happiness and enjoyment of school are lower amongst children in England than in the rest of the UK in the build-up to the Key Stage 2 tests. The robustness of these results will be tested by re-producing estimates having either extended the time horizon used to determine the sample window (e.g. to include those interviewed between 1 March and 17 May 2012) or reduced it (e.g. to between 17 April and 17 May 2012).
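To illustrate why conditioning on background characteristics matters, the sketch below fits a small least-squares model with an England dummy and a single prior-outcome control. It is a toy example under stated assumptions: the data are invented and noiseless, the solver is a plain normal-equations routine written for this illustration, and the imputation and clustered standard errors used in the paper are omitted.

```python
def ols(X, y, weights=None):
    """Solve the (weighted) normal equations X'WX b = X'Wy by Gauss-Jordan."""
    n, k = len(X), len(X[0])
    w = weights if weights is not None else [1.0] * n
    # Augmented matrix [X'WX | X'Wy], one row per coefficient.
    A = []
    for r in range(k):
        row = [sum(w[i] * X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
        row.append(sum(w[i] * X[i][r] * y[i] for i in range(n)))
        A.append(row)
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]


# Invented data: the outcome depends on country and on the age 7 measure.
england = [1, 1, 1, 1, 0, 0, 0, 0]
prior = [-1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.5, 1.5]
outcome = [0.2 - 0.1 * e + 0.5 * p for e, p in zip(england, prior)]

X = [[1.0, float(e), p] for e, p in zip(england, prior)]
coefs = ols(X, outcome)  # [intercept, England coefficient, prior coefficient]

eng_mean = sum(o for o, e in zip(outcome, england) if e) / sum(england)
rest_mean = sum(o for o, e in zip(outcome, england) if not e) / (len(england) - sum(england))
raw_gap = eng_mean - rest_mean
```

In this constructed example the raw gap is positive while the coefficient on the England dummy is negative, because the two groups differ in their prior scores. This is the kind of compositional difference the controls in the regression are meant to absorb.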

Changes in socio-emotional outcomes around the key stage 2 test date
Next, I investigate how socio-emotional outcomes of children change as the Key Stage 2 test date approaches. The expectation is that, if these tests really do have a negative impact upon children's well-being, then there should be a spike in the percentage of young people reporting poor socio-emotional outcomes (e.g. unhappiness at school and more generally in life) the closer the MCS wave 5 survey was to the Key Stage 2 test date. In particular, one would anticipate that those who completed the MCS survey close to the test week (14 May) would have worse socio-emotional outcomes than those who completed the survey a long time beforehand (e.g. February 2012) or afterwards (e.g. June/July 2012). Importantly, such a pattern should only be observed in England, and not the rest of the UK, as England is the only country which runs Key Stage 2 tests. The analysis will begin by simply plotting the average score on each of the four outcome measures by the date (week) the MCS survey was completed (separately for England and the rest of the UK). However, in order to place a causal interpretation on these results, one must assume that the timing of the MCS interviews is not systematically related to the Key Stage 2 tests. For instance, if children with certain characteristics (e.g. children with particularly low levels of wellbeing) managed to move their survey date to avoid a clash with the Key Stage 2 tests, then such an assumption would be unlikely to hold. Unfortunately, such detailed information about the MCS interview schedule is not available within the data held. It is however possible to control for the background characteristics of cohort members in the analysis (including their wellbeing measured at the prior, age 7, survey wave) to provide some reassurance that this is not driving the results.
Consequently, our primary analysis will be based upon the following OLS regression model:

$$Y_{ij} = \alpha + \beta' \text{SurveyWeek}_i + \gamma' X_i + \varepsilon_{ij} \qquad (2)$$

Where:
$Y_{ij}$ = One of the outcomes of interest (e.g. mental well-being scale).
$\text{SurveyWeek}_i$ = The week when the child/family completed the MCS survey (entered as a set of dummy variables).
$X_i$ = A vector of background characteristics measured prior to the age 11 survey wave. This includes gender, socio-economic status, prior academic achievement and socio-emotional outcomes (e.g. happiness, SDQ scores) at age 7.
$\alpha$, $\gamma$ = The intercept and the coefficients on the background characteristics.
$i$ = Child i.
$j$ = Electoral ward j (the primary sampling unit in the MCS).
$\varepsilon_{ij}$ = Error term.

The parameter of interest from model (2) is $\beta$. This reveals the strength of the association between the MCS survey week and children's socio-emotional outcomes after key potential confounders (such as gender, socio-economic status, prior academic achievement, prior measures of happiness, self-esteem and well-being) have been controlled. To summarise results from this model, predicted scores for each outcome are generated, with the average of these predictions then presented by survey week. It is of particular interest whether there are non-linearities in these average predicted values, and whether one observes a 'spike' in poor socio-emotional outcomes (e.g. unhappiness) the closer the MCS survey was to the Key Stage 2 test week.

The link between pupil wellbeing and key stage 2 performance
Finally, I am also interested in the link between children's wellbeing in the build-up to the Key Stage 2 tests and their subsequent test performance. The MCS sample is first restricted to children in England who were surveyed between Friday 30 March 2012 and Thursday 17 May 2012 (the end of the Key Stage 2 test window), where consent was obtained by the survey organisers to link the data to administrative records (the National Pupil Database). This leaves an analytic sample size of 1,977 pupils.
The mental state of this group of children in the lead up to Key Stage 2 tests is known. This includes how happy they were at school, their self-confidence as well as their general well-being. These variables will be used as the key covariates in a regression model, investigating whether they are associated with children's Key Stage 2 scores. The models will again include a wide array of controls for potential confounding factors, including demographics, socio-emotional outcomes and prior achievement measured in previous MCS sweeps (captured by the wide array of cognitive tests conducted as part of the MCS). Formally, this model will be specified:

$$KS2_{ij} = \alpha + \beta\,\text{Mental}_i + \gamma' X_i + \delta' \text{PriorAch}_i + \theta' \text{PriorMental}_i + \varepsilon_{ij} \qquad (3)$$

Where:
$KS2_{ij}$ = Key Stage 2 scores.
$\text{Mental}_i$ = Indicators of children's mental state in the lead up to the Key Stage 2 tests (e.g. low scores on the mental well-being scale).
$X_i$ = A range of potential confounders (e.g. demographics).
$\text{PriorAch}_i$ = Measures of children's achievement at age 7.
$\text{PriorMental}_i$ = Measures of children's socio-emotional outcomes at age 7.
$\alpha$, $\gamma$, $\delta$, $\theta$ = The intercept and the coefficients on the control variables.
$\varepsilon_{ij}$ = Error term.

Multiple imputation will be used to adjust for item non-response, with Huber-White adjustments made to the estimated standard errors to take the complex survey design into account. The parameter of interest ($\beta$) will capture the extent to which children's mental state (recorded in the six-week period before their Key Stage 2 tests) is associated with the scores they achieved on this test (controlling for the other factors in the model). Importantly, this will include rich controls for prior achievement and for their socio-emotional outcomes in previous MCS waves. These estimates will thus provide new insight into how children's mental state in the run-up to their Key Stage 2 tests is linked to their performance on this test, over and above previous experience of socio-emotional issues (e.g. low self-confidence, behavioural problems).
Four specifications of this model will be estimated. First, the unconditional bivariate association will be presented (i.e. a model with no controls). The second model specification will then add controls for children's demographic background, with measures of prior academic achievement added in model three. The final model, which is the preferred specification, will include controls for demographic characteristics, prior academic achievement (measured at age 7) and prior socio-emotional outcomes (also measured at age 7). The robustness of these results will also be tested by altering the time window used to define the analytic sample (see Appendices C and D). In the main text, results are presented using Key Stage 2 mathematics scores as the outcome variable. Analogous results focusing upon Key Stage 2 English and total average point scores can be found in Appendix F.

Differences in socio-emotional outcomes in the build-up to key stage 2 tests
Table 3 begins by presenting differences in the four socio-emotional outcome measures between children in England and the rest of the UK. Recall that the sample has been restricted to those children who completed the MCS age 11 survey up to six weeks before the Key Stage 2 tests took place. Panel (a) presents unconditional estimates, with results from the regression models presented in panel (b). All estimates refer to effect sizes, with positive (negative) values indicating better (worse) outcomes in England than the rest of the UK.
In the build-up to Key Stage 2 tests, children's mental wellbeing is lower in England than in other parts of the UK. This holds true in both panel a (unconditional) and panel b (conditional), with the addition of controls leading to essentially no change in the parameter estimates. In terms of magnitude, the difference is modest, standing at an effect size difference of around 0.17 standard deviations (statistically significant at the five percent level). Table 3 hence provides clear evidence that, at least in the period leading up to Key Stage 2 tests, the mental wellbeing of pupils in England is lower than in other parts of the UK.
On the other hand, the final row of Table 3 suggests that the opposite holds true for children's enjoyment of school. Specifically, school engagement is actually reported to be slightly higher amongst children who live in England in the build-up to the Key Stage 2 tests, compared to those who live elsewhere. Again, the difference is relatively small, standing at 0.19 standard deviations in panel a (unconditional estimates) and 0.11 standard deviations in panel b (conditional estimates), with the latter sitting just on the boundary of statistical significance at the conventional five percent threshold (p = 0.05; t = 1.96). Nevertheless, there is clearly no evidence from Table 3 that the Key Stage 2 tests lead pupils to stop enjoying, or to become disengaged from, school, with it quite possible that the opposite (to a small extent) holds true.
Results for the other two outcomes (happiness and self-esteem) are slightly more mixed. In the unconditional estimates, differences between children in England and the rest of the UK are essentially zero. These differences increase in the conditional estimates presented in panel (b), up to around 0.1 standard deviations, with children in England displaying slightly lower levels of happiness and self-esteem. For both outcomes, the difference between England and the rest of the UK is on the boundary of statistical significance in the conditional estimates, though the magnitude is small. Hence, overall, there is little evidence of a substantive difference in happiness and self-esteem between children in England and children in the rest of the UK as the Key Stage 2 test week approaches.
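The boundary case reported above for school enjoyment (p = 0.05; t = 1.96) can be checked with a short calculation. The sketch assumes a standard normal reference distribution, since the degrees of freedom implied by the clustered design are not stated in the text, and the implied standard error is simply the reported estimate divided by the reported t statistic.

```python
import math


def two_sided_p(t_stat: float) -> float:
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


p = two_sided_p(1.96)       # very close to the 0.05 threshold
implied_se = 0.11 / 1.96    # standard error implied by estimate / t
print(round(p, 4), round(implied_se, 3))
```

On this approximation an estimate of 0.11 standard deviations with t = 1.96 corresponds to a standard error of roughly 0.056, so the 95 percent confidence interval runs from approximately zero to a little over 0.2.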

Variation in socio-emotional outcomes around the key stage 2 test week
The analysis now turns to variation in children's mental wellbeing in the period before and after the Key Stage 2 test week. If preparation for the Key Stage 2 tests has an influence upon pupils' socio-emotional outcomes, one would expect a decline (in mental wellbeing, for instance) in England as the tests approach (e.g. due to the increased workload and pressure induced by the test). One would then expect there to be reversion towards the mean after the tests have been completed, as any stress, anxiety and pressure from the test fades away. Similarly, differences in socio-emotional outcomes between England and the rest of the UK should increase up to the Key Stage 2 test week and then decline afterwards. Figure 1 considers whether such a pattern can be observed within the MCS data, presenting local polynomial smoothing estimates of the link between the MCS age 11 survey date and the four socio-emotional outcomes of interest. Alternative estimates plotting the mean scale score for each outcome by survey week can be found in Appendix A.
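For readers unfamiliar with local polynomial smoothing, the sketch below implements a local linear (degree one) smoother with an Epanechnikov kernel. The kernel, the polynomial degree and the 21-day bandwidth are assumptions made for illustration; the text does not report the settings used for Figure 1, and the data here are invented.

```python
def epanechnikov(u: float) -> float:
    """Epanechnikov kernel: positive weight only for |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def local_linear(x, y, x0, bandwidth):
    """Local linear (degree-1 polynomial) estimate of E[y | x = x0]."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0
        k = epanechnikov(d / bandwidth)
        s0 += k
        s1 += k * d
        s2 += k * d * d
        t0 += k * yi
        t1 += k * d * yi
    denom = s0 * s2 - s1 * s1
    if denom == 0.0:
        raise ValueError("too few observations inside the bandwidth")
    # Intercept of the kernel-weighted regression of y on (x - x0).
    return (s2 * t0 - s1 * t1) / denom


# Invented data: days relative to the test week and a scale score that
# drifts linearly with the survey date.
days = list(range(-90, 71, 5))
score = [-0.1 + 0.001 * d for d in days]
smoothed = [local_linear(days, score, d0, bandwidth=21) for d0 in (-42, 0, 42)]
```

Because the invented scores lie exactly on a straight line, the smoother reproduces that line at each evaluation point. With real data, a dip in the England curve just before day 0 followed by a recovery would be the pattern consistent with test-related stress.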
Panel (a) begins with the estimates for mental wellbeing. There are three key points to note. First, the estimates for England are always lower than the estimates for the rest of the UK. This in part reflects the finding already presented from Table 3; children's mental wellbeing tends to be slightly lower in England than elsewhere. Second, the trend line for England is essentially flat; one does not observe any clear, systematic fluctuation in the estimates around the Key Stage 2 test date. Finally, there is some suggestion that the gap in mental wellbeing between children in England and the rest of the UK widens slightly as the Key Stage 2 test week approaches and falls slightly afterwards (this can be seen via the arch in the dotted line for the rest of the UK). This should be caveated, however, by the fact that (i) it is driven by changes in the results for the rest of the UK (and not England) and (ii) the magnitude of any change over this period is small (0.1 standard deviations or less). Overall, Figure 1 panel (a) therefore does little to suggest that the mental wellbeing of children in England is linked to the Key Stage 2 tests. Panels (b) and (c) present analogous estimates for children's self-esteem and their levels of happiness. Regarding the former, the trend lines for England and the rest of the UK always sit close together, with neither showing any clear association with the MCS survey date. In particular there is no evidence that children in England suffer from lower self-esteem at the time that Key Stage 2 tests take place. With respect to the latter, children in England report lower levels of happiness than their peers elsewhere in the UK (effect size difference = 0.1). Yet this holds true throughout the period under investigation, with no evidence that this difference is related to the survey date, or that there is any 'peak' around the Key Stage 2 test week. 
Indeed, a similar gap in children's happiness between England and the rest of the UK continues to be observed even when the MCS survey was conducted during the school summer holidays (i.e. the period around 70 days after the Key Stage 2 test week). Put another way, the two trend lines plotted in Figure 1(c) are always parallel. There is consequently little to suggest that the self-esteem and happiness of children in England are, on average, related to the fact that they sit Key Stage 2 tests.
Finally, results for school enjoyment are plotted in Figure 1 panel (d). Interestingly, the figures for England and the rest of the UK start off reasonably close together (e.g. at around 90 days before the Key Stage 2 tests are taken). Then, from around 60 days before the Key Stage 2 tests, the two lines start to diverge. From this point until the Key Stage 2 test week, school enjoyment amongst pupils in England is, on average, marginally higher than in the rest of the UK (effect size difference of around 0.1). After the Key Stage 2 tests have finished, the two lines come together again. A cautious interpretation of this finding is that it provides little evidence that children's enjoyment of school is negatively affected by the Key Stage 2 tests, with it possible that the opposite might even hold true.
Figure 2 provides similar estimates, but these are now conditional results based upon the regression models specified in section 2.2 (which include controls for demographic background, prior academic achievement and socio-emotional outcomes of children at age 7).
On the whole, the key findings from these conditional estimates follow a similar pattern to those for the unconditional estimates presented in Figure 1 above. The mental wellbeing of children in England is always slightly lower than the mental wellbeing of children elsewhere in the UK (see Figure 2(a)). Yet there is no sign that this varies systematically around the Key Stage 2 test week, with any difference in the mental wellbeing of children in England and other parts of the UK apparently greatest after the Key Stage 2 tests have been taken. This reiterates the substantive conclusion presented above; there is little to suggest that Key Stage 2 tests in England have any clear link to children's mental wellbeing.
A similar pattern emerges with respect to children's self-esteem (Figure 2(b)) and their happiness in general (Figure 2(c)). In the former, the estimated trend line for England is essentially flat and overlaps with the line for the rest of the UK, with no evidence of any peak near the Key Stage 2 test week. Likewise, there is no evidence that children in England reported being happier after the Key Stage 2 tests had passed, or that they were any less happy as the tests approached (see Figure 2(c)). There is hence again no evidence that children's level of happiness or self-esteem is linked to the Key Stage 2 tests.
Finally, in Figure 2(d), one once again observes enjoyment of school to be slightly higher in England than in the rest of the UK in the six weeks prior to the Key Stage 2 tests taking place (a difference of up to 0.2 standard deviations). However, once the Key Stage 2 tests have finished, school enjoyment amongst children in England is similar to that in the rest of the UK. Note though that this result is not being driven by changes in school enjoyment in England, with average scale scores broadly flat in this country over the 20-week period considered. Rather, the change seems to be driven by lower levels of school enjoyment elsewhere in the UK during the first summer half-term. Regardless, Figure 2(d) continues to support the conclusion of there being no evidence of a negative association between Key Stage 2 tests and children's enjoyment of school in England.
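To make the "standard deviation" metric used in these comparisons concrete, the following is a minimal sketch of one common way to express a gap between two groups as an effect size: the scale is z-scored over the pooled sample and the difference in group means is then taken. This is an illustrative assumption rather than the paper's documented procedure, and the function name and toy scores are hypothetical (they are not MCS data).

```python
from statistics import mean, pstdev


def standardised_gap(group_a, group_b):
    """Difference in group means after z-scoring over the pooled sample."""
    pooled = list(group_a) + list(group_b)
    mu = mean(pooled)
    sigma = pstdev(pooled)  # population SD of the pooled sample
    z_a = [(x - mu) / sigma for x in group_a]
    z_b = [(x - mu) / sigma for x in group_b]
    return mean(z_a) - mean(z_b)


# Hypothetical toy scale scores, for illustration only (not MCS data).
england = [4, 5, 3, 4, 5, 4]
rest_of_uk = [4, 4, 3, 4, 5, 3]

gap = standardised_gap(england, rest_of_uk)
print(f"Standardised gap (England minus rest of UK): {gap:.2f} SD")
```

A positive value indicates a higher average score in the first group; the conditional estimates in Figure 2 additionally adjust for background controls, which this unadjusted sketch does not attempt.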

The link between pupil wellbeing and Key Stage 2 performance
To conclude, Table 4 considers the link between each of the four age 11 socio-emotional scales and children's Key Stage 2 mathematics scores (see Appendix F for analogous results for English and total average point scores). Recall that the sample has been restricted to children in England who completed the MCS survey in the six weeks prior to the Key Stage 2 test week. Estimates refer to effect size changes in Key Stage 2 mathematics scores for a one standard deviation increase in the relevant socio-emotional scale. Model M3 is the preferred specification (with the richest set of background controls) and is hence the focus of discussion. Starting with children's mental wellbeing, there is no evidence that this is linked to performance on the Key Stage 2 tests. This holds true even in the most basic model specification without any controls (M0), with a one standard deviation increase in children's mental wellbeing associated with just a 0.02 standard deviation change in Key Stage 2 mathematics test scores. The estimated effect size is hence consistently small and not statistically significant at conventional thresholds (see Note 1). Key Stage 2 scores hence do not seem to be related to children's mental wellbeing.
A similar result emerges for the link between children's happiness, self-esteem and their Key Stage 2 scores. Although there does appear to be a positive association in Model M0 (no controls) and M1 (demographic controls only), the estimated effect sizes fall rapidly once prior achievement measured at age 7 (model M2) and prior socio-emotional outcomes measured at age 7 (model M3) have been controlled. Hence, in the preferred model specification (M3), a one standard deviation increase in children's self-esteem or their happiness is associated with only a very small (and statistically insignificant) 0.02-0.03 standard deviation change in Key Stage 2 mathematics scores. There is hence little evidence that happier or more self-confident children perform better in their Key Stage 2 tests.
The final column of Table 4 concludes with the results for the association between Key Stage 2 mathematics scores and children's enjoyment of school. A small association remains in model specification M3, with a one standard deviation increase in school enjoyment linked to a 0.04 standard deviation increase in Key Stage 2 mathematics test scores (this sits on the boundary of statistical significance at the conventional five percent level; p = 0.025, t = 2.24). Yet caution should be exercised when interpreting this result, for at least two reasons. First, it should be emphasised that any association between these two variables is weak, with a very large change in the school enjoyment scale (of around three standard deviations) needed to generate even a 0.1 standard deviation increase in Key Stage 2 scores. Second, it is possible that there are unobserved confounders that have not been controlled for; adjusting for them may drive the estimate further towards zero (continuing the decline in effect size observed as additional controls are added between models M0 and M3). The most appropriate interpretation hence seems to be that the link between enjoyment of school in the build-up to the Key Stage 2 tests and performance on these tests is weak (at best). Bringing these results together, there is little overall evidence to suggest that children's socio-emotional state (such as their happiness, mental wellbeing, self-esteem or enjoyment of school) is related to their scores on the Key Stage 2 tests.
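The arithmetic behind the school enjoyment result can be checked with a short sketch. It assumes a large-sample normal approximation to the t-distribution (an assumption made here for illustration; the paper does not state the degrees of freedom) and a purely linear association. The function names are hypothetical.

```python
import math


def two_sided_p_normal(test_statistic: float) -> float:
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(test_statistic) / math.sqrt(2))


def predictor_change_needed(target_outcome_change: float, slope: float) -> float:
    """SD change in the predictor implied by a target SD change in the outcome."""
    return target_outcome_change / slope


# Values reported in the text for model M3 (school enjoyment, mathematics).
p_value = two_sided_p_normal(2.24)
sd_change_needed = predictor_change_needed(0.1, 0.04)

print(f"Approximate two-sided p-value: {p_value:.3f}")
print(f"SD change in enjoyment for a 0.1 SD gain: {sd_change_needed:.1f}")
```

Under these assumptions the approximation reproduces the reported p-value to three decimal places, and the implied change in school enjoyment is about two and a half standard deviations, in line with the "around three" quoted in the text.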

Robustness tests
The appendices to this paper present a series of alternative estimates to illustrate the sensitivity of the results to different analytic choices. In Appendix B, the 'rest of the UK' is restricted to children living in Wales and Scotland only, given that the high-stakes 'transfer test' (which determines grammar school entry) is also taken by Northern Irish pupils in the final year of primary school (albeit much earlier in the academic year, typically around November, with results released in January). The exclusion of Northern Irish pupils does little to change the substantive results. Appendix C restricts the sample selection used in parts of the analysis (e.g. Tables 3 and 4) to a shorter period of time before the Key Stage 2 test date. Similarly, Appendix D extends the sample selection to a longer period of time. Again, results do not materially differ from those of the main analysis presented above. Appendix E provides an item-level analysis. Specifically, rather than constructing four scales from the various questionnaire items, responses to each question are analysed individually. The motivation is that some questions on some scales may theoretically be more likely to be linked to the Key Stage 2 test than others. Take the happiness scale, for instance. Arguably, responses to the items asking children how happy they are about their school work and the school they go to are more relevant for the analysis than the items asking about children's happiness with their looks or their friendship groups. Overall, little consistent evidence emerges that the results presented above are driven by responses to particular items, including those most clearly tied to children's experiences at school. Finally, Appendix F investigates the links between the wellbeing measures and pupils' performance on the Key Stage 2 English tests, as well as their total Key Stage 2 point scores. Consistent with the results focusing upon mathematics scores presented in Table 4, we generally find the association between the wellbeing of primary school pupils and their performance on the Key Stage 2 tests to be weak.
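The window-based robustness checks (Appendices C and D) amount to re-running the analysis on samples selected with a shorter or longer period before the test week. A minimal sketch of that selection step follows. The record layout, field names and the narrower and wider window lengths are hypothetical choices made for illustration; only the six-week window is taken from the main analysis.

```python
from dataclasses import dataclass


@dataclass
class Interview:
    pupil_id: int          # hypothetical identifier
    days_before_test: int  # positive = interviewed before the test week


def restrict_window(interviews, max_days_before):
    """Keep interviews held at most max_days_before days ahead of the test week."""
    return [i for i in interviews if 0 < i.days_before_test <= max_days_before]


# Hypothetical records, for illustration only (not MCS data).
sample = [Interview(1, 10), Interview(2, 35), Interview(3, 50), Interview(4, -5)]

main_window = restrict_window(sample, 42)    # six weeks, as in the main analysis
narrow_window = restrict_window(sample, 28)  # hypothetical shorter window
wide_window = restrict_window(sample, 56)    # hypothetical longer window

for label, subset in [("main", main_window), ("narrow", narrow_window), ("wide", wide_window)]:
    print(label, [i.pupil_id for i in subset])
```

Each restricted sample would then be passed to the same regression specification, so that any change in the estimates can be attributed to the choice of window alone.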

Conclusions
National examinations are becoming an increasingly common feature of education systems across the globe. Proponents of such tests argue that they provide vital information about school, teacher and pupil performance, which can then be used to hold educational institutions and their staff to account (Figlio & Loeb, 2011). Yet a concern held by many is that these tests may be having a severe negative effect upon the wellbeing and socio-emotional development of children (UK Education Select Committee, 2020). Although this is most often associated with those in their mid-to-late teens as they approach their end-of-school exams, it is claimed that the same holds true for national tests held during primary school (Hutchings, 2015). Nowhere is this more apparent than in England, with stories regularly appearing in the media about how the 'stress of SATs [Key Stage 2 tests] gives children nightmares' (Ward, 2018). Such notions have also been supported by some qualitative research or small-scale survey data, with quotes from parents, pupils and teachers used to illustrate the negative impact these tests are having upon children in primary school.
But is this really true? Existing evidence linking national tests (and, in particular, the Key Stage 2 tests) to lower levels of pupil wellbeing has been limited by the fact that most studies are of a qualitative nature, tend to rely upon small convenience samples and, critically, do not draw comparisons to a meaningful control group. For instance, do children become much happier once the Key Stage 2 tests are over, and are young people elsewhere in the UK (where the Key Stage 2 tests do not take place) happier and more self-confident than their peers in England? Addressing such issues is important in order to better understand the pros and cons of holding high-stakes national tests at the end of primary school, guiding key education policy issues such as whether the Key Stage 2 tests should be scrapped. This paper has investigated such issues, presenting new evidence on the link between Key Stage 2 tests held annually in England and primary school children's socio-emotional outcomes. Running against conventional wisdom (and the main thrust of previous research), little evidence is found that the Key Stage 2 tests in England are associated with lower levels of mental wellbeing, happiness, self-esteem or school enjoyment amongst its primary pupils. In particular, there is no sign that children from other parts of the UK (where Key Stage 2 tests are not held) have significantly better outcomes on these measures, or that there is any meaningful systematic variation around the time of the Key Stage 2 test date. Similarly, there is no evidence that pupils who report higher levels of wellbeing in the build-up to the Key Stage 2 tests then go on to perform any better in these tests. These null effects are robust to different analytic approaches and various robustness tests. The central conclusion of this paper is hence that the Key Stage 2 tests do not seem to be negatively associated with children's socio-emotional outcomes. Similarly, there is no evidence that pupils who are happier or more self-confident in the build-up to the Key Stage 2 tests achieve higher scores.
These findings should be interpreted in light of the limitations of the research. Six issues particularly stand out. First, estimates refer to associations between different aspects of children's wellbeing around the time of the Key Stage 2 tests. Although a range of possible confounders have been controlled for, the estimates may not capture cause and effect. Second, although the MCS contains a wide array of measures, it has not asked children directly about test anxiety or whether they feel under pressure due to the Key Stage 2 tests. Yet, as pointed out by an anonymous referee, this could equally be interpreted as a strength of this paper, with the more 'neutral' measures used in this study being particularly well suited to capturing the potential effects of the Key Stage 2 tests on pupils' general wellbeing. Third, comparisons have been drawn between England and the rest of the UK, noting how in the latter Key Stage 2 tests do not take place. Yet this does not necessarily mean that children are not tested at all; rather, it may just be done somewhat less formally. Fourth, the data we analyse were collected in 2012, with a number of reforms made to national assessments in England over recent years. Although this is unlikely to undermine the key findings, a stronger relationship between Key Stage 2 tests and pupil wellbeing could be found if the importance attached to these tests has increased. Fifth, the findings are clearly only applicable to primary school children, with results not necessarily generalisable to pupils of older ages. Finally, although the empirical findings suggest that a negative association does not exist in the population, it is possible that there are (potentially small) sub-groups of pupils who do experience wellbeing issues due to the Key Stage 2 tests. Unfortunately, considering heterogeneity between sub-groups has not been possible in the analysis due to the limited sample size.
It is also important to consider why results from this paper are in stark contrast to most of the existing literature. One possibility is that, rather than seeing Key Stage 2 tests as threatening, some children may respond positively to them as a challenge. This is consistent with our finding that children's enjoyment of school is not negatively affected by the approach of the Key Stage 2 tests, at least not on average at the population level. Hence one possible explanation for our null findings is that the Key Stage 2 tests may have heterogeneous effects (positive for some groups, but negative for others) which end up cancelling one another out. Indeed, much of the previous work in this area (and subsequent media attention) has tended to focus upon the downsides of the Key Stage 2 tests, while ignoring the potential positives. Alternatively, the wellbeing of the average Year 6 pupil may not be affected by the Key Stage 2 tests because they simply do not feel them to be particularly high-stakes, as there is little consequence for them personally from the results. Relatedly, given the lack of individual consequences, pupils may not feel pressure about these tests from their parents. From a policy point of view, this is particularly important when one considers the likely counterfactual to Key Stage 2 tests if they were to be scrapped (as some policymakers have suggested). Children may not feel any more or less pressurised by Key Stage 2 tests than by the likely alternatives, such as tests set by their class teacher, greater use of teacher assessment or more emphasis being placed upon end-of-year school reports. On the other hand, it seems unlikely that the difference between results in this paper and the conventional wisdom is due to our focus upon more general measures of wellbeing. Specifically, even for a more domain-specific measure (school enjoyment) we continue to find largely null effects. It hence seems improbable that the findings reported here are due to a lack of spillover between how Key Stage 2 tests affect children's attitudes and feelings about school and their general wellbeing overall.
In addition, there are also some important methodological differences between this study and previous work. For instance, the data used in the present study are drawn from a large, nationally representative cohort study, with children answering questions about their happiness and wellbeing in the context of a much broader social survey. This is rather different from previous work, which typically either involved qualitative interviews or a survey focused specifically upon experiences of Key Stage 2 tests. Relatedly, this study is one of the first to consider the counterfactual and to try to draw comparisons to a meaningful control group. Hence, although previous research has suggested that children in England may be unhappy around the time of the Key Stage 2 tests, few studies have presented any evidence that children would be better off had the Key Stage 2 tests not taken place. A third issue is that some of the studies within the literature (e.g. Hutchings, 2015) have had 'industry' involvement (e.g. teaching unions have sponsored the study), with there being potential conflicts of interest and agendas to promote. Indeed, this plays into a broader concern with this literature: that teachers and the wider education community generally dislike the Key Stage 2 tests and how they are now used for school accountability, with many wanting them to be scrapped. Cynically, being able to point towards research evidence illustrating how the Key Stage 2 tests harm children's wellbeing is useful to support this narrative. Finally, there is the issue of publication bias, with it well-known that null results are less likely to be reported and subsequently published (Jerrim & De Vries, 2015). This may, in turn, have led to an overly negative view of the link between Key Stage 2 tests and children's socio-emotional outcomes emerging in the literature, with it possible that many other null findings in this area have simply not been written up.
What do the results then suggest for education policy in England? Given the results of this paper, and considering the findings of the wider literature, evidence of substantial negative effects of Key Stage 2 tests upon pupils simply does not seem strong enough to support calls for these tests to be scrapped. Yet, at the same time, there is also a dearth of hard evidence that Key Stage 2 tests have substantial benefits for pupils' learning and academic development. Thus, given the fundamental role that Key Stage 2 tests currently play in England (feeding into both primary and secondary school accountability metrics), they should continue for the foreseeable future in their current form. However, further high-quality research needs to be conducted to better understand both the positive and negative effects that such tests have upon young people. This includes work which seeks to identify particular sub-groups which may suffer from negative wellbeing effects, as well as those who may respond more positively. This will help teachers, policymakers and the broader education community to better understand the trade-offs such testing involves, and whether such policy advice should change in the future.

Note
1. The result for model M2 is an exception, where the parameter sits on the boundary of statistical significance at the five percent level.

Disclosure Statement
No potential conflict of interest was reported by the author(s).

Funding

Appendices
Appendix C. Alternative estimates narrowing the window before the Key Stage 2 tests