Analytic or holistic? A study about how to increase the agreement in teachers’ grading

ABSTRACT In Sweden, grades are used for selection to upper-secondary school and higher education, even though agreement in teachers’ grading is low and the selection therefore potentially unfair. Furthermore, measures taken to increase the agreement have not been successful. This study has explored how to increase agreement in teachers’ grading by comparing analytic and holistic grading. Teachers (n = 74) have been randomly assigned to two different conditions (i.e. analytic or holistic grading) in either English as a foreign language (EFL) or mathematics. Findings suggest that analytic grading is preferable to holistic grading in terms of agreement among teachers. The effect was stronger and statistically significant in EFL, but could be observed in mathematics as well. However, teachers in the holistic conditions provided more references to criteria in their justifications, whereas teachers in the analytic conditions to a much larger extent made references to grade levels without specifying criteria.


Introduction
In Sweden, where this study is situated, grades are high stakes for students. Grades are the only criteria used for selecting students as they leave compulsory school and apply for upper-secondary school. When students apply for higher education, selection is also made based on the 'Swedish Scholastic Aptitude Test', but a minimum of one third of the seats (often more) are based on grades. Given that it is the individual teacher who synthesises students' performances into a grade, and that the reliability of teachers' grades has been questioned (e.g. Swedish National Agency of Education, 2019), this is a problematic situation which can potentially have a significant influence on the lives of thousands of students each year.
Measures have been taken to increase the agreement in teachers' grading, most recently by legally requiring that teachers in primary and secondary education take results from national tests into account when grading. This relatively loose strategy of 'taking test results into account' has not yet yielded any observable change in teachers' grades, however, and there is still a large discrepancy between teacher-assigned grades and test results for most subjects (Swedish National Agency of Education, 2019). The current situation may potentially lead to political demands for tying grades even more closely to test results, which in turn may have other unwanted consequences. For example, research suggests that the most widespread consequence of high-stakes testing is a narrowing of the curriculum, where teachers pay less attention to (or even exclude) subject matter that is not tested (Au, 2007; Pedulla et al., 2003). Furthermore, in their review of research on the impact of high-stakes testing on student motivation, Harlen and Deakin Crick (2003) conclude that results from such tests have been found to have a 'particularly strong and devastating impact' (p. 196) on low-achieving students.
The question addressed by this study is therefore whether it is possible to increase the agreement in teachers' grading by other means than national tests, such as specifying how teachers synthesise data on student performance into a grade.

Grades and grading
The term 'grading', as used here, means making a decision about students' overall performance according to (implicit or explicit) grading criteria. It follows from this definition that grades (as a product) are composite 'measures' (i.e. expressed along an ordinal scale) of student proficiency, based on a more or less heterogeneous collection of data on student performance. It also follows from the definition that grading, as a process, involves human judgement.
In Sweden, it is the responsibility of the individual teacher to synthesise students' performances into a grade and the teacher's decision cannot be appealed. However, in those subjects where there are national tests, the results from these tests have to be taken into account when grading. Exactly how the test results should be taken into account, or their weight in relation to the student's other performances, is not specified. Obviously, a great deal of trust has been vested in teachers' capacity to grade consistently and fairly.

Consistency in teachers' grading
As shown in the review by Brookhart et al. (2016), covering more than 100 years of research about assessment and grading, research on teachers' grading has a long history. In this research, two findings are particularly consistent: (a) Although student achievement is the factor that above all others determines a student's grade, grades commonly include other factors as well, most notably effort and behaviour, and (b) There is great variability among teachers regarding both the process and product of grading (Brookhart, 2013).
Regarding the inclusion of non-achievement factors when grading, this seems primarily to be an effect of teachers wanting the grading to be fair to the students, which means that teachers find it hard to give low grades to students who have invested a lot of effort (Brookhart, 2013; Brookhart et al., 2016). Including non-achievement factors is therefore a way for teachers to balance an ethical dilemma in cases where low grades are anticipated to have a negative influence on students.
The variation in scores, marks, and grades between different teachers, and also for individual teachers on different occasions, has been extensively investigated. Several of the recent reviews of research about the reliability of teachers' assessment and grading make reference to the early studies by Starch and Elliott (e.g. Brookhart et al., 2016; Parkes, 2013), who compared teachers' marking of student performance in English, mathematics, and history (Starch & Elliott, 1912, 1913a, 1913b). These authors used a 100-point scale, and teachers' marks in English (n = 142), for example, covered approximately half of that scale (60-97 and 50-97 points respectively for the two tasks). In history, the variability was even greater than it was for English and mathematics. They therefore concluded that the variation is a result of the examiner and the grading process rather than the subject (for an overview, see Brookhart et al., 2016).
Interestingly, almost a hundred years later, Brimi (2011) used a design similar to that used in the Starch and Elliott study in English, but with a sample of teachers specifically trained in assessing writing. The results, however, were the same (50-93 points on a 100-point scale).
In his review on the reliability of classroom assessments, Parkes (2013) also turns his attention to the intra-rater reliability of teachers' assessment. As an example, Eells (1930) compared the marking of 61 teachers in history and geography on two occasions, 11 weeks apart. The share of teachers making the same assessment on both occasions varied from 16 to 90 percent for the different assignments. The 90 percent agreement was an extreme outlier, however, and the others were clustered around 25 percent. None of the teachers made the same assessment for all assignments, and the estimated reliability ranged from .25 to .51. The author concludes that 'It is unnecessary to state that reliability coefficients as low as these are little better than sheer guesses' (p. 52).
A number of objections can be made in relation to the conclusions above, due to limitations of the studies. For example, as pointed out by Brookhart (2013), the tasks used by Starch and Elliott would not be considered high-quality items according to current standards. Rather, they would be anticipated to be difficult to assess and likely to lead to large variations in marking. Another limitation is that most studies are 'one-shot' assessments, where teachers are asked to assess or grade performances from unknown or fictitious students. While it may be argued that such assessments are more objective, they miss the idea that teachers' assessments can become more accurate over time as evidence of student proficiency accumulates. Further, their assessments can potentially become more valid as teachers come to understand what their students mean, even if expressed poorly. Lastly, teachers do not always have access to assessment criteria, which also means that their assessments could be anticipated to vary greatly. Still, teachers' assessments are not always sufficiently reliable even with very detailed scoring protocols, such as rubrics. In a review of research on the use of rubrics, Jonsson and Svingby (2007) report that most assessments were below the threshold for acceptable reliability. Brookhart and Chen (2015), in a more recent review, claim that the use of rubrics can yield reliable results, but then criteria and performance-level descriptions need to be clear and focused, and raters need to be adequately trained. Taken together, even taking into account the limitations of some individual studies, the majority of studies on this topic point in the same direction with regard to the variability of teachers' assessments and grading.

Different models for grading
The underlying reasons for the observed variability in teachers' assessments and grading are complex and involve both idiosyncratic beliefs and external pressure, as well as a number of other factors, such as subject taught, school level, and student characteristics (e.g. Duncan & Noonan, 2007; Isnawati & Saukah, 2017; Kunnath, 2017; Malouff & Thorsteinsson, 2016; McMillan, 2003; Randall & Engelhard, 2008). Due to this complexity, teachers' grading is sometimes portrayed in almost mystical terms, such as when Bloxham et al. (2016) write that assessment decisions, at least at the higher education level, are so 'complex, intuitive and tacit' (p. 466) that any attempts to achieve consistency in grading are likely to be more or less futile.
One possible reason for the observed variability in teachers' grading, which is the focus of this study, is that teachers use different models (or strategies) for grading. Korp (2006) has, in a Swedish context, described three different models, which will be called holistic, arithmetic, and intuitive.
In the arithmetic model, the grade is calculated as a sum or a mean based primarily on test results. The arithmetic model therefore requires that the teachers document student performance quantitatively (cf. Sadler, 2005). According to Korp, the teachers who used this model did not mention either the national curriculum or the grading criteria when talking about their grading practice.
In the intuitive model, students' grades are influenced by a mixture of factors, such as test results, attendance, attitudes, and lesson activity. From these factors, the teacher may have a general impression of the student's proficiency in the subject, and that, rather than the specific performance in relation to the grading criteria, will determine the grade.
In contrast to the arithmetic and intuitive models, in the holistic model the teacher compares all available evidence about student proficiency to the grading criteria and makes a decision based on this holistic evaluation. Since it is only in the holistic model that the teachers base their decision on explicit grading criteria, this is the only model in line with the intentions of the Swedish grading system.
The fact that the holistic model works in line with the intentions of the grading system does not mean that this model is easier for teachers to apply. On the contrary, the teachers in the study by Korp (2006) expressed dissatisfaction with the idea that they were expected to integrate different aspects of student performance. Not surprisingly, it is considerably easier to arrive at a composite measure of student proficiency when using a homogeneous set of data, such as points from written tests, as compared to the heterogeneous material in a portfolio (Nijveldt, 2007). This tension between a unidimensional or multidimensional basis for grading is therefore yet another instance of the reliability versus validity trade-off. While unidimensional data may result in more coherent and reliable grading, such data only represents a fraction of student proficiency in a subject. Multidimensional data, on the other hand, may provide a fuller and more valid picture of student proficiency but is more difficult to interpret and evaluate in a reliable manner.
In addition to the models presented by Korp (2006), Jönsson and Balan (2018) introduced yet another model, which is called 'analytic' and can be described as a combination of the holistic and arithmetic models. In this model, teachers compare student performance on individual tasks with the grading criteria, producing what is referred to as 'assignment grades', which are then used, more or less arithmetically, when assigning an overall grade. The analytic model thereby resembles the holistic model by making reference to explicit grading criteria (i.e. not only test scores or a general impression), but reduces the complexity of the final synthesis by already having transformed the heterogeneous data into a common scale.
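To make the final, 'more or less arithmetic' step of the analytic model concrete, the following is a minimal sketch rather than a description of how any teacher in the study actually proceeded. It assumes that the six letter grades are coded linearly from F = 1 to A = 6 and that the overall grade is taken as the lower median of the assignment grades. Both the aggregation rule and the function name `overall_grade` are hypothetical choices made for illustration.

```python
from statistics import median_low

# Assumed linear coding of the six-step scale (A highest, F fail).
GRADE_TO_NUM = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}
NUM_TO_GRADE = {num: grade for grade, num in GRADE_TO_NUM.items()}


def overall_grade(assignment_grades):
    """Aggregate assignment grades into one overall grade.

    Hypothetical rule: take the lower median, so the result is always
    a grade that was actually awarded on at least one assignment.
    """
    nums = [GRADE_TO_NUM[g] for g in assignment_grades]
    return NUM_TO_GRADE[median_low(nums)]


# Four assignment grades for one student, one per occasion.
print(overall_grade(["C", "D", "C", "B"]))
```

The median is used here instead of the mean because grades are ordinal. A teacher could just as well weight later assignments more heavily, which is one reason the text describes the step as only 'more or less' arithmetic.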
As suggested by Jönsson and Balan (2018), the reduction of complexity in the analytic model may contribute to increasing the consistency in teachers' grading while preserving the connection to shared criteria. A disadvantage, however, is that each individual decision is based on a much smaller dataset, as compared to a holistic judgement, which takes all available evidence about student proficiency into account. Consequently, there is a risk that the analytic model may influence the validity of the grades negatively in terms of construct underrepresentation.

Analytic versus holistic assessment and grading
Analytic assessment involves assessing different aspects of student performance, such as mechanics, grammar, style, organisation, and voice in student writing. Alternatively, holistic assessment means making an overall assessment, considering all criteria simultaneously.
The most apparent advantage of using analytic assessments is that they provide a nuanced and detailed image of student performance by taking different aspects into consideration. By judging the quality of, for instance, different dimensions of student writing, both strengths and areas in need of improvement may be identified, which, in turn, facilitate formative assessment and feedback.
In contrast, holistic assessments focus on the performance as a whole, which also has advantages. For example, holistic assessments are likely to be less time-consuming and there are indications that teachers prefer them (e.g. Bloxham et al., 2011). Furthermore, focusing on the whole prevents teachers from giving too much weight to individual parts. This is an important argument, since there is a widespread fear that easy-to-assess surface features get more attention in analytic assessment than more complex features that are more difficult to assess (Panadero & Jonsson, 2020). Sadler (2009), for instance, argues that teachers' holistic and analytic assessments rarely coincide, and implies that analytic assessments may therefore not be valid. Focusing on different aspects of student performance can therefore be seen as both a strength and a weakness with analytic assessments. This is because even if surface features do not receive more attention than other features, the analytic assessments still need to be partitioned into separate criteria. Consequently, there is a risk of assessments becoming fragmented and missing the entirety.
However, in the same way as analytic assessments risk missing the whole when focusing on the parts, holistic assessments risk missing the parts when focusing on the whole. There is also, similar to the intuitive model of grading, the risk of unintentionally including other factors than student performance or attaching weight to other criteria than assumed (Bouwer et al., 2017). For example, Tomas et al. (2019) used analytic rubrics as supplementary to holistic assessments in order to elicit weightings of the criteria used. Their analyses indicated the existence of a ranking of importance and contribution of different criteria to the overall marks, which differed from the expectations and assumptions of the assessors. While the assessors were assumed to give priority to complex skills such as critical thinking, descriptions and explanations of concepts contributed more highly towards the overall mark in their assessments. Since analytic assessments specify the criteria to consider, this situation is probably more common for holistic assessments. The authors therefore argue that analytic and holistic assessments should not be seen as opposites, but rather as complementary.
Still, there are researchers who are quite sceptical of analytic assessments (e.g. Bloxham et al., 2011; Sadler, 2009). In their defence, it should be noted that these sceptics often work within systems that require teachers to numerically score student performance according to analytic criteria, and where the individual scores are arithmetically aggregated into a composite score, mark, or grade. Under such circumstances, where qualitative and quantitative aspects of assessment are mixed, the sum of the parts may very well depart from the teacher's holistic judgement, depending on how the scoring logic or algorithm is designed, which may lead to frustration and dissatisfaction. It could be argued, however, that such specific applications should be kept apart from the basic principles of analytic and holistic assessments.

Materials and methods
This study started from the observation that although grades are high stakes for students in Sweden, agreement in teachers' grading is low, potentially making the selection to upper-secondary school and higher education unfair. And even though measures have been taken to increase the agreement in grading, these measures have not been very effective. Furthermore, tying teachers' grading even closer to results from national tests may have other negative consequences. This study therefore aims to explore other routes for increased agreement in teachers' grading by comparing the advantages and disadvantages of analytic and holistic grading models in terms of a) agreement between teachers and b) how teachers justify their decisions.
The overall design of this study is experimental, where a number of teachers have been randomly assigned to two different conditions (i.e. analytic or holistic grading) in either English as a foreign language (EFL) or mathematics (Table 1). The study builds on a previous pilot study (Jönsson & Balan, 2018) and uses a similar methodology.
Information about the project was distributed in two different regions of the country, where teachers volunteered to participate in the study. The teachers chose whether to participate in EFL or mathematics and were then randomly assigned to one of the two conditions. The study was performed entirely online and no personal data has been collected, only grades and written justifications from the teachers.

Procedure
In the analytic condition, the teachers received written responses to the same assignment from four students on four occasions during one semester (i.e. a total of 16 responses). The assignments all addressed writing in EFL or mathematics, but otherwise had different foci (Table 2). All responses were from students aged 12, but with different levels of proficiency in either subject. The responses were authentic student responses that had been anonymised. The teachers were asked to grade the student responses within a week after receiving them and report their decisions through an internet-based form. At the end of the semester, teachers were asked to provide an overall grade for each of the four students, accompanied by a justification. The final grades and written justifications were used as data in the study.
In the holistic condition, participants were given all of the material on a single occasion so that they were not influenced by any prior assessments of the students' responses. Similar to the analytic condition, they were asked to provide a grade for each of the four students and a justification for each grade.

Agreement between and among teachers
A common method for estimating the agreement between different assessors (i.e. interrater agreement) is by using correlation analysis. Depending on whether scores (a continuous variable) or grades (a discrete variable) are given, either Pearson's or Spearman's correlation may be used. Spearman's correlation (ρ) is a nonparametric measure of rank correlation, which is suitable for ordinal scales, such as grades. Naturally, if letter grades are used, they need to be converted to numbers in order to perform a correlation analysis. In this study, the grade A (i.e. the highest grade) has been converted to 6 and F (fail) to 1. Since only rank correlation has been used, no assumptions regarding equal distance between numbers are needed.
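The conversion and rank correlation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis script. The text only gives the coding for A (6) and F (1), so the linear coding of B to E is an assumption, and the three teachers' grades are invented example data. Tied grades receive average ranks, which is the usual convention for Spearman's ρ.

```python
from itertools import combinations
from statistics import mean

# A = 6 and F = 1 are given in the text; B to E are assumed to fall in between.
GRADE_TO_NUM = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(grades_a, grades_b):
    """Spearman's rho = Pearson correlation of the (tie-averaged) ranks."""
    x = [GRADE_TO_NUM[g] for g in grades_a]
    y = [GRADE_TO_NUM[g] for g in grades_b]
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_rho(teachers):
    """Mean rank correlation over all pairs of teachers."""
    return mean(spearman(a, b) for a, b in combinations(teachers, 2))


# Invented grades from three teachers for the same four students.
t1 = ["A", "C", "E", "F"]
t2 = ["B", "D", "E", "F"]  # same ranking as t1, but different grades
t3 = ["A", "C", "F", "E"]  # the two weakest students swapped
print(spearman(t1, t2), spearman(t1, t3), mean_pairwise_rho([t1, t2, t3]))
```

Note that `t1` and `t2` correlate perfectly although they agree on the exact grade for only two of the four students, which is why rank correlation alone says little about absolute agreement.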
A disadvantage of using correlation analysis is that the assessments of two assessors may be highly correlated even if they do not agree on the exact grade, only the internal ranking. Therefore, in this study Spearman's correlation has been combined with measures of absolute agreement. Pair-wise comparison is a common measure of agreement (Jönsson & Balan, 2018), but it has some notable disadvantages. First, it does not take the possibility of the agreement occurring by chance into account. Second, all agreements are given equal weight, which means that a couple of assessors who agree, but whose assessments deviate strongly from the consensus of other assessors, still contribute to the overall agreement. Third, since the observed agreements are divided by the number of all possible comparisons between assessors, the estimation tends to be low and it could be argued to underestimate the agreement between teachers. In this study, we have therefore used Fleiss' κ and what we call 'consensus agreement' as measures of absolute agreement (for an in-depth discussion of different measures, see Stemler, 2004). Fleiss' κ provides an estimate similar to pair-wise agreement but takes the possibility of the agreement occurring by chance into account. We have also used 'consensus agreement', which is based on the idea that teachers as a group, or a 'community of practice', should ideally share a common interpretation of the grading criteria and standards, which the individual teacher either agrees or disagrees with. By using the most frequent grade for each student (i.e. the mode), agreement among teachers has been estimated by calculating the proportion of grades that agree with this central tendency. Although this estimate of agreement only takes into consideration the proportion of grades that agree with the majority, it is generally higher than estimates based on pair-wise comparisons.
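The three measures of absolute agreement discussed above can be compared in a short sketch. The formula for Fleiss' κ is the standard one. The implementation of 'consensus agreement' follows the verbal description in the text (the proportion of grades that equal the most frequent grade for each student) and is therefore an interpretation. The ratings are invented, and the sketch assumes that every student is graded by the same number of teachers.

```python
from collections import Counter


def pairwise_agreement(ratings):
    """Mean share of agreeing rater pairs per student (P-bar in Fleiss' notation)."""
    n = len(ratings[0])  # raters per student, assumed constant
    per_student = [
        (sum(c * c for c in Counter(r).values()) - n) / (n * (n - 1))
        for r in ratings
    ]
    return sum(per_student) / len(per_student)


def fleiss_kappa(ratings):
    """Fleiss' kappa: pairwise agreement corrected for chance agreement."""
    n_students, n = len(ratings), len(ratings[0])
    totals = Counter()
    for r in ratings:
        totals.update(r)
    # Chance agreement from the overall share of each grade.
    p_e = sum((t / (n_students * n)) ** 2 for t in totals.values())
    # Undefined (division by zero) if every rating is the same grade.
    return (pairwise_agreement(ratings) - p_e) / (1 - p_e)


def consensus_agreement(ratings):
    """Share of all grades that equal the modal grade for their student."""
    agreeing = sum(Counter(r).most_common(1)[0][1] for r in ratings)
    return agreeing / sum(len(r) for r in ratings)


# Invented example: one inner list per student, one grade per teacher.
ratings = [
    ["C", "C", "C", "D"],
    ["E", "E", "F", "E"],
    ["A", "B", "A", "A"],
    ["D", "C", "D", "D"],
]
print(pairwise_agreement(ratings))   # 0.5
print(consensus_agreement(ratings))  # 0.75
print(round(fleiss_kappa(ratings), 3))
```

In this example three of four teachers agree on every student, which gives a consensus agreement of .75 but a pair-wise agreement of only .50. This illustrates the point that pair-wise estimates tend to be lower than consensus estimates.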
Finally, the median absolute deviation (MAD) has been calculated as a measure of distribution, and the Mann-Whitney U test has been used to test for statistical significance.
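The text does not specify exactly which values the MAD and the Mann-Whitney U test were applied to, so the sketch below rests on assumptions. MAD is computed here on numerically coded grades around their median, and the U test compares two invented samples of absolute deviations from the consensus grade, one per condition. The p-value uses the normal approximation with tie correction and no continuity correction, which is rough for small samples. A statistics package with an exact test would be preferable in practice.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist, median


def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using a tie-corrected normal approximation."""
    combined = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x = sum(average_ranks(combined)[:n1])
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    ties = sum(t ** 3 - t for t in Counter(combined).values())
    # sigma is zero (and z undefined) if all observations are identical.
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))
    z = (u - n1 * n2 / 2) / sigma  # z <= 0 because u is the smaller U
    return u, min(1.0, 2 * NormalDist().cdf(z))


# Invented absolute deviations (in grade steps) from the consensus grade.
analytic = [0, 0, 0, 1, 0, 0]
holistic = [1, 2, 1, 0, 1, 2]
print(mad([2, 2, 3, 3, 4, 5]))
print(mann_whitney_u(analytic, holistic))
```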

Justifications by teachers
The justifications for the grades were subjected to both qualitative and quantitative content analysis. First, all words the teachers used to describe the quality (either positively or negatively) of students' performance were identified. All words were coded as different nodes in the data, even if they referred to the same quality. This was done in order to recognise the full spectrum of teachers' language describing quality.
Second, all nodes were grouped in relation to commonly used criteria for assessing either writing (i.e. mechanics, grammar, organisation, content, style, and voice) or mathematics (i.e. methods, concepts, problem solving, communication, and reasoning). In addition, some teachers in EFL made references to comprehensibility and whether students followed instructions and finished the task. Some (in both EFL and mathematics) also made inferences about students' abilities or referred only to grade levels (i.e. not describing quality). Four additional criteria, called 'Comprehensibility', 'Rigour', 'Student', and 'Only grades', were therefore added to the categorisation framework. For an example of the categorisation procedure, see Figure 1.
In the quantitative phase, the frequency of teachers' references to the different criteria was used to summarise the findings and make possible a comparison between the different conditions.
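The quantitative step amounts to counting coded references per criterion and condition. A minimal sketch of such a tally is given below. The coded references are invented, and the data layout (one tuple per reference) and function names are hypothetical. Only the criterion labels are taken from the categorisation framework described above.

```python
from collections import Counter

# Invented coded references: (condition, teacher id, criterion).
coded_references = [
    ("holistic", "t1", "style"),
    ("holistic", "t1", "content"),
    ("holistic", "t1", "mechanics"),
    ("holistic", "t2", "style"),
    ("holistic", "t2", "content"),
    ("analytic", "t3", "style"),
    ("analytic", "t3", "only grades"),
    ("analytic", "t4", "only grades"),
]


def references_by_criterion(refs):
    """Count references to each criterion, separately for each condition."""
    counts = {}
    for condition, _teacher, criterion in refs:
        counts.setdefault(condition, Counter())[criterion] += 1
    return counts


def references_per_teacher(refs):
    """Mean number of references per teacher in each condition.

    Only teachers with at least one coded reference are counted.
    """
    totals = Counter(condition for condition, _, _ in refs)
    teachers = {}
    for condition, teacher, _ in refs:
        teachers.setdefault(condition, set()).add(teacher)
    return {c: totals[c] / len(teachers[c]) for c in totals}


print(references_by_criterion(coded_references))
print(references_per_teacher(coded_references))
```

Because the groups may differ in size, comparing references per teacher (or relative shares per criterion) is likely to be more informative than comparing raw counts.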

Findings
The estimations of agreement between and among teachers are summarised in Table 3. In EFL, the teachers in the analytic condition agreed on the exact same grade in two thirds of the cases, while teachers in the holistic condition agreed in three out of five cases. The kappa estimations suggest fair (holistic condition) to moderate (analytic condition) agreement. The mean rank correlation was high in both groups. The distribution estimate (MAD) was clearly greater in the holistic condition as compared to the analytic condition.
In mathematics, the agreement was also higher in the analytic condition as compared to the holistic condition, and the kappa estimations suggest slight (holistic condition) to fair (analytic condition) agreement. The mean rank correlation was high in both groups, and the distribution estimate was greater in the holistic condition as compared to the analytic condition.
Taken together, the agreement for the analytic condition is higher as compared to the holistic condition for both subjects, and for all estimates, although the difference is only statistically significant for EFL (p < .01).
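The verbal labels used for the kappa estimates ('slight', 'fair', 'moderate') coincide with the widely used benchmark bands proposed by Landis and Koch. That the labels in this study follow those bands is an assumption, since the text does not name its source. Under that assumption, the mapping from a kappa value to a label can be sketched as follows.

```python
# Upper bounds of the benchmark bands (assumed to be the ones used in the text).
KAPPA_BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
]


def kappa_label(kappa):
    """Return the verbal label for a kappa value."""
    if kappa < 0:
        return "poor"
    for upper, label in KAPPA_BANDS:
        if kappa <= upper:
            return label
    return "almost perfect"


for value in (0.15, 0.37, 0.45):
    print(value, kappa_label(value))
```

These bands are conventions rather than statistical thresholds, so the labels are best read as rough descriptions of the estimates in Table 3.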

Justifications
All in all, the teachers in EFL made 787 references to quality indicators in their justifications (i.e. on average, 18.7 references per teacher), using 64 different terms for describing these qualities. As can be seen in Table 4, the teachers in the holistic condition made slightly more references in relation to most qualities. Similarly, in mathematics, the teachers made 358 references to quality indicators in their justifications (i.e. on average, 11.2 references per teacher), using 29 different terms for describing these qualities (Table 5).
Note to Table 3: MAD = median absolute deviation (1 = the mean difference is one grade level, e.g. from E to F).
Table 4 summarises teachers' references in relation to the criteria in EFL. Overall, most references were made to style dimensions (appr. 40%). This is also the most nuanced category, with 20 different terms used to describe these dimensions. In comparison, there were 9 terms used in relation to content, which comes second. Table 5 provides the corresponding information for mathematics, where the use of concepts (e.g. 'fractions' or 'circumference') is the largest category (appr. 25%).
There are some differences between the conditions, where teachers in the holistic groups provided comparatively more references in relation to most criteria for both subjects. In EFL, the largest relative differences are in relation to mechanics and content, while in mathematics the largest relative differences are in relation to concepts and communication. The most striking difference, however, is that it is primarily teachers in the analytic condition who make references only to grade levels. Table 6 shows an example where the groups make similar assessments of the students' merits according to the criteria, but the analytic group provides significantly more references only to grades.

Discussion
This study aimed to compare analytic and holistic grading conditions by investigating the extent to which teachers agree on students' grades, as well as how they justify their decisions, when using an analytic or holistic model of grading.

Comparing the two conditions
As suggested by previous research (Jönsson & Balan, 2018), the analytic grading model may reduce the complexity of grading, thereby increasing the agreement between teachers. The findings lend some support to this interpretation as there is a difference between the two conditions in relation to both measures of agreement (i.e. Fleiss' kappa and consensus agreement), where the agreement is higher in the analytic conditions, although the difference is only statistically significant in EFL. The MAD values also show that the teachers in the holistic groups tend to deviate more from the consensus. The correlations, however, are very similar across the groups, suggesting that the teachers' perceptions of the relative rank of the students are largely unaffected by the model of grading. This finding also suggests that there is indeed a shared conception of quality performance, although teachers may disagree on the exact grade. In relation to teachers' justifications, critics of analytic assessments tend to assume that such assessments result in construct underrepresentation (e.g. Sadler, 2009). This is to some extent verified by the findings, since teachers in the holistic conditions provided slightly more references in relation to most criteria and the teachers in the analytic conditions provided substantially more references to grade levels, without describing any qualities. Interestingly, in EFL, teachers in the holistic condition provided more references to mechanics, which is often seen as a surface feature, but also to content. Neither of the conditions could therefore be accused of focusing primarily on surface features, such as organisation, mechanics, and grammar in EFL.
Regarding the assessment of individual students, there were no clear differences between the groups. Instead, there is quite a strong consensus about the qualities in students' performances. For instance, as can be seen in Table 6, both groups agree that the use of methods is a strength for this student, as are communication and concepts, while reasoning and problem solving are relative weaknesses. This consensus means that there is a potential for the teachers to agree on the formative feedback to give the students (i.e. strengths and suggestions for improvement). Consequently, formative assessments may not necessarily be affected by the inequality of grading if no overall assessment has to be made.
In sum, the findings suggest that the teachers in the sample are in agreement about which criteria to use when assessing and also about the extent to which these criteria are fulfilled in students' texts. The teachers also largely agree on the rank order of student performance. However, when assigning specific grades, the agreement is generally low. This observation supports the idea of assessment as a two-tier process, where the first stage involves the discernment of criteria in relation to the performance, and the second involves making a judgement about the quality of the performance (Sadler, 1987). Teachers may therefore be in agreement during the first stage but not the second (or vice versa), for instance, because they attach different weight to individual criteria when making an overall assessment. The agreement was higher for the analytic model in both subjects and for both measures, but the difference was only statistically significant in EFL.

Findings in relation to research on teachers' grading
It is difficult to draw any firm conclusions about teachers' grading based on this study, since the conditions in the experimental situation differ in several respects from how teachers may work during ordinary conditions. There are, however, some interesting observations that can be made.
First, even if the agreement in the analytic conditions was higher as compared to the holistic conditions, agreement in all groups was relatively low (52-67%), which is in line with previous research on teachers' grading (e.g. Brookhart et al., 2016). This could to some extent be expected, as the study uses a methodology similar to that used in several other studies, with tasks not designed by the teachers themselves and fictitious students. What might differ, however, is that the teachers in this study had access to shared grading criteria (which are very detailed in Sweden) and that the teachers in both subjects have experience with assessing student performance on national tests, both of which could be assumed to contribute to increased agreement. Furthermore, several of the factors that have previously been shown to influence teachers' grading, such as student characteristics and external pressure (e.g. Malouff & Thorsteinsson, 2016; McMillan, 2003), were not part of the experimental conditions, which could also be assumed to lower the variability among teachers and increase the agreement. As can be seen, however, the agreement is still low, which might indicate that factors such as idiosyncratic beliefs (e.g. Kunnath, 2017) and individual weighting of criteria contribute more strongly to the variability.
Second, in the justifications provided by the teachers, there are some references to the students as persons, but otherwise there are no signs of teachers taking other aspects into consideration. This finding is most likely a methodological artefact, partly since, as discussed above, such aspects were not part of the experimental conditions and the teachers therefore had no personal information about the students, and partly because the teachers knew that someone was going to review their justifications. What this suggests, however, is that it may be possible for teachers to confine themselves to using only legitimate criteria if they know that their grading will be reviewed.

Conclusions and implications for practice
The findings from this study suggest that analytic grading, where teachers assign grades to individual assignments and then use these 'assignment grades' when deciding on the final grade, is preferable to holistic grading in terms of agreement among teachers. The effect was stronger and statistically significant in EFL, but a clear tendency could be observed in mathematics as well, for all measures.
Teachers in the holistic conditions provided more references to criteria in their justifications, whereas teachers in the analytic conditions to a much larger extent made references to grade levels without specifying criteria. The gain in agreement for the analytic grading model is consequently countered by teachers making fewer references to criteria. This effect is, however, quite small.
All groups were in agreement on which criteria to use when assessing student work and the qualities identified in students' performance, which means that formative assessments may not be affected to the same extent by the low agreement in teachers' grading. Furthermore, if adopting an analytic model of grading, teachers may consider keeping the 'assignment grades' to themselves, as these are based on limited data and could be assumed to be unreliable, while only communicating qualitative feedback to the students (i.e. strengths and suggestions for improvements) in order to support students' learning and improved performance.

Limitations and future research
There are a number of important limitations in this study that need to be considered. First, although it is an experimental study where the participants were randomly assigned to the conditions, the sample was small and the participants volunteered to participate. Second, the experimental conditions differed in several respects from teachers' ordinary grading; for instance, the teachers did not design the tasks themselves and had no personal relationship with the students. Furthermore, as mentioned above, the current study cannot confirm which individual and/or contextual factors influenced teachers' grading, only that some factors beyond the grading model (such as giving different weight to different criteria) gave rise to variability in the sample. Given the variability of assigned grades, as well as the fact that the influence of such factors has been a robust finding in numerous studies over the years, the lack of support for it in this study is most likely an artefact of the design. Teachers can be assumed to restrict their written judgements to what they believe are legitimate criteria, since they know that someone will review their grading.
It should also be emphasised that no comparisons can be made between the two subjects since it is not known whether student performance is equally difficult to synthesise. For instance, only written performance was included for the teachers in EFL, while there was also oral performance for the teachers in mathematics, making the material more heterogeneous. Such differences may explain why the agreement between teachers in mathematics was lower than that between teachers in EFL in this study.
In conclusion, the aim of this study was to explore alternative ways to increase the agreement in teachers' grades, as opposed to increasing the influence of test results. Findings suggest that the agreement between teachers may increase by adopting an analytic grading model, but only slightly. Furthermore, agreement between teachers is still low, even when a large number of factors, which have previously been shown to influence teachers' grading, were removed as part of the experimental design. The results could therefore be used to support the argument that teachers' grading is indeed so 'complex, intuitive and tacit' (Bloxham et al., 2016, p. 466) that any attempts to achieve consistency in grading are likely to be futile. On the other hand, the findings could be seen as a small step towards increasing the agreement. For example, what would happen if the teachers, in addition to adopting the analytic grading model, used task-specific criteria for assessing the individual assignments, and then simpler criteria for aggregating the 'assignment grades', instead of using the (detailed, but abstract) national grading criteria? Or what would happen if the teachers, in addition to adopting the analytic grading model, participated in a moderation procedure, where they reviewed each other's grading?
We propose a continued investigation into possibilities to increase agreement in teachers' grading in ways that can be implemented as part of ordinary practice, such as the ones mentioned above (i.e. improving criteria and participating in moderation procedures), as well as other strategies suggested by previous research, such as aiding teachers in identifying different levels of quality (Brookhart et al., 2016). Such an endeavour can be supported by currently available resources (e.g. Brookhart & Nitko, 2008; Guskey & Brookhart, 2019) and may contribute to steering clear of a too-heavy reliance on external test results and the unfair grading that Swedish students experience today.

Note
1. There are currently tests provided in ten subjects during the last year of compulsory school, but each individual student only performs five of them.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Anders Jönsson is professor of Educational Sciences at Kristianstad University. His research interest is in classroom assessment, for both formative and summative purposes. Anders is the scientific leader of the research platform 'Collaboration for Learning', which aims to contribute to the development of teaching in schools and preschools through practice-based research.
Andreia Balan is an assistant professor at the City of Helsingborg. Her research interest is in formative assessment and mathematics education.
Eva Hartell is a teacher in STEM subjects and holds a PhD in the area of classroom assessment. She is currently working with research and development in the municipality of Haninge and at KTH Royal Institute of Technology in Sweden.