Organisational quality of ESL argumentative essays and its influence on pre-service teachers’ judgments

Abstract
Assessing student writing is a complex task and experimental studies have shown that textual determinants, such as spelling or vocabulary, affect teachers’ judgments of other analytic criteria in student essays. Among these elements, organisation has not been extensively explored. This experimental study examines how organisational quality influences teacher judgments of ESL student texts in order to obtain a comprehensive overview of important determinants in the assessment process. Pre-service English teachers (N = 53) in Switzerland and Germany assessed four argumentative essays of two different overall text qualities in which two organisational features had been manipulated in terms of cohesive devices and paragraphing. Findings show that texts with higher organisational quality were assessed more positively concerning all analytic scales than texts with low organisational quality, indicating halo effects. In addition, an interaction effect was found between overall text quality and organisational quality, suggesting that the effect of organisational quality was stronger when overall text quality was low. This study sheds new light on the interaction between individual text features in the assessment of complex learner texts. Practical implications regarding essay assessment in teacher education are discussed.


PUBLIC INTEREST STATEMENT
Many pre-service teachers report that they do not feel adequately prepared for the assessment of student writing after having completed teacher education. Assessing student writing is a complex task and research has shown that many factors, such as spelling mistakes, influence teachers' judgments. For example, teachers assigned lower grades to texts containing many spelling mistakes than to the same texts with few spelling mistakes, even when assessing organisation, argumentation or content. While the influence of spelling mistakes has been well researched, the authors of this article analysed whether organisation has a similar influence on the assessment of student writing. Pre-service teachers assessed the same student texts in different versions, once with clear paragraph breaks and suitable linking words and once without any paragraph breaks and with few linking words. The aim of this study is to learn more about the influence of text features on teachers' judgments in order to increase the quality of assessment in the writing classroom.

Introduction
A growing body of research has investigated the influence of specific factors on teacher judgments of written student compositions, including students' ethnicity (Kaiser et al., 2016), teachers' knowledge and experience (e.g. Bates & Nettelbeck, 2001; Feinberg & Shapiro, 2003; Ready & Wright, 2011) and textual features, such as spelling and vocabulary (Birkel & Birkel, 2002; Scannell & Marshall, 1966; Vögelin et al., 2019). The present study focuses on pre-service teachers since there is an increasing awareness of the importance of teacher education also in writing assessment studies (Dempsey et al., 2009). It aims at generating valuable insights for research on assessing student writing and its determinants while also providing practical information on how to develop pre-service teachers' diagnostic skills in teacher education programmes. Teaching and assessing student writing is an essential component of teachers' professional competences (Baumert & Kunter, 2006; Edelenbos & Kubanek-German, 2004; Schrader, 2013). Many pre-service teachers in Switzerland and Germany report that they do not feel sufficiently prepared to assess students' classroom performances when they enter the profession (Rauin & Meier, 2007) and that this deficiency is particularly strong where written compositions are concerned (Porsch, 2010). While there is an increased focus on teachers' diagnostic competence in general (Südkamp & Praetorius, 2017), subject-specific components of the process of diagnosis need to be investigated in more detail (Alderson et al., 2014) in order to educate pre-service teachers to assess students' English writing as fairly and objectively as possible.
Although previous analyses investigating the influence of specific linguistic features on teacher judgments have advanced the understanding of the process of writing assessment and its determinants, they have primarily focused on spelling (mistakes), grammar, and vocabulary (Rezaei & Lovorn, 2010; Scannell & Marshall, 1966; Vögelin et al., 2019). For instance, Rezaei and Lovorn (2010) showed that pre-service teachers were affected by 20 structural, mechanical, spelling and grammar errors in well-written essays and assigned lower scores to these texts, even in criteria relating solely to content. Similarly, Vögelin et al. (2019) showed that pre-service teachers assessed student texts with a high level of lexical sophistication and diversity significantly more positively not only with regard to vocabulary but also concerning grammar and frame of essay.
This study focuses on the genre of argumentative essays since it is an essential component of learning and teaching English in many European countries, especially at upper-secondary level (Keller, 2013; Zhu, 2001). Command of argumentative essays requires, among other linguistic and rhetorical skills, the ability to structure arguments in order to persuade the readership with clearly organised and coherent paragraphs. For that reason, organisation plays an important role when writing argumentative essays (Hyland, 2008). Because of its importance, organisation is included in most analytic rating scales as a separate scale, such as the ESL Composition Profile (Jacobs et al., 1981), the Special Test of English Proficiency (STEP; Lumley, 2002), or the 6 + 1 Trait Model (Culham, 2003). Despite its importance in argumentative texts, empirical studies investigating argumentative essays with regard to organisation are few and far between.
In order to address this gap in the research, the present experimental study investigates the influence of organisational quality in argumentative essays on teacher judgments by minimizing the interference of other confounding variables. Its outcomes should provide a more detailed insight into the intricate process of assessing ESL compositions and raise awareness of its complexity for (pre-service) teachers.

Theoretical background
This review of the relevant literature first discusses core theoretical concepts connected to organisation in writing such as coherence, cohesion, and metadiscourse. Second, it discusses the assessment of organisation in student writing and explores possible ways of measuring and manipulating organisation in student texts using specific features, such as paragraphing and cohesive devices. Third, the review examines the role of organisation in student writing in relation to the substantial body of literature exploring other text features that influence teacher judgments.

Cohesion, coherence and metadiscourse
From an educational perspective, an argumentative essay is typically organised in three main "stages" and associated "moves" (Hyland, 1990). The main stages consist of an introduction with a thesis statement, paragraphs with arguments, and a conclusion summarizing the main arguments. Following this structure, organisation is foregrounded in most ESL writing course books as a key aspect of writing essays (Hyland, 2008). For instance, the widely used course book Writers at Work: The Essay by Zemach and Stafford-Yilmaz (2008) provides different tasks and instructions for students to teach them how to write introductions, organise information in paragraphs, and compose an effective conclusion which will persuade readers. The basic structure of argumentative essays is thus well defined, but the conceptualisation of each component varies in the numerous academic contexts and research projects in which the genre is relevant (Kruse, 2013). While some researchers might regard the presence of thesis statements and topic sentences as central for evaluating organisation, others place their emphasis more on physical appearance such as paragraphing and the presence of cohesive devices when assessing organisation (Erdosy, 2004; Ruegg & Sugiyama, 2013). Organisation is a feature of writing quality with a particularly wide array of definitions, making it one of the most difficult aspects of an essay to assess (Erdosy, 2004; Freedman, 1979), particularly because it shares considerable overlap with other text features (Ruegg & Sugiyama, 2013). Three concepts which are closely related to organisation in student writing (coherence, cohesion and metadiscourse) are relevant in this context. For a long time, the terms coherence and cohesion were used interchangeably (Yang & Sun, 2012) and previous literature has often employed ambiguous definitions (Knoch, 2007; Lee, 2002; Ruegg & Sugiyama, 2013).
As studies in discourse analysis emerged, the focus shifted to ties between sentences (Lee, 2002). Halliday and Hasan (2013) define cohesion as the presence of textual ties between sentences in order to build connections between ideas and to aid the reader in following the text (see also Crossley & McNamara, 2009; Lee, 2002; Yang & Sun, 2012). While textual cohesion seems to be superficial (Yang & Sun, 2012), it is an essential aspect of successful and coherent language comprehension (Crossley & McNamara, 2009). Thus, a text with a more proficient and frequent use of cohesive cues is usually deemed more coherent and understandable (McNamara & Kintsch, 1996; Yang & Sun, 2012).
By contrast, the concept of coherence describes the semantic relations between ideas in a text (McNamara & Kintsch, 1996) and refers to the mental representations that readers derive from a text (Crossley et al., 2016; McNamara et al., 2010; Witte & Faigley, 1981; Yang & Sun, 2012). As it is not directly represented on the linguistic level, the concept is often regarded as abstract and fuzzy (Connor, 1990; Lee, 2002). Even though coherence and cohesion interact to a considerable extent, a cohesive text is not necessarily coherent (Ruegg & Sugiyama, 2013; Witte & Faigley, 1981). A text might include cohesive devices that link its sentences but lack coherence or a clear purpose.
Related to cohesion and coherence, metadiscourse is a relatively new concept widely employed in discourse analysis and language education (Hyland, 2005). The term was coined by Harris in 1959, and several scholars, such as Vande Kopple (1985), Crismore et al. (1993), and Hyland (2005), have further developed the concept and presented taxonomies to classify an array of metadiscoursal features. Metadiscourse refers to "the linguistic material in texts, whether spoken or written, that does not add anything to the propositional content but that is intended to help the listener or reader organize, interpret, and evaluate the information given" (Crismore et al., 1993, p. 40). Typically employed as an umbrella term, metadiscourse encompasses an open-ended set of cohesive and interpersonal features which signal the writer's intent and are intended to guide a receiver through a text (Crismore et al., 1993; Hyland, 2005; Hyland & Tse, 2004). Typical examples are transitions, such as in addition, thus, or finally. These words do not contribute to topic development but help to achieve the communicative purpose of supporting the reader in understanding a text (Crismore et al., 1993; Hyland, 2005; Vande Kopple, 1997).
Further, metadiscourse is a characteristic of good ESL and native speaker student compositions (Cheng & Steffensen, 1996) and an important component of argumentative writing (Hyland, 1998). It is a crucial element of learning a second or foreign language since the degree of organising a text and guiding readers through it differs between languages (Hyland, 2005). Although the number of studies on metadiscourse has increased, its relation to cohesion and coherence remains uncertain. Jalilifar and Alipour (2007) state that metadiscourse influences cohesion, which is grammar-bound in nature, and does not account for coherence, which is meaning-bound. By contrast, Lee (2002) argues that metadiscourse markers account for a coherent text. This study focuses on textual devices which do not add to the propositional content of a text but help readers organise and interpret information, thereby contributing to a coherent text (Cheng & Steffensen, 1996; Crismore et al., 1993; Lee, 2002).
Cohesive devices go by numerous names, such as metadiscoursal features, metadiscourse markers, connectives, and cohesive ties (Crossley et al., 2016; Hyland, 2005; McNamara et al., 2014; Witte & Faigley, 1981). According to Halliday and Hasan (2013), conjunctive elements are not cohesive per se but presuppose the presence of other elements in the discourse. They distinguish between cohesive ties achieving between-sentence cohesion and cohesive ties achieving within-sentence cohesion (Halliday & Hasan, 2013). Whereas the latter (intrasentence cohesion) is primarily determined by grammatical structures of sentences, intersentence cohesion describes the use of cohesive ties creating cohesion across sentence boundaries (Bridge & Winograd, 1982). According to Halliday and Hasan (2013), cohesive devices across sentence boundaries are more prominent since they are the sole source of texture, in contrast to cohesive devices within a sentence, which are also structural relations. The effect is thus more striking and the meaning more obvious. Furthermore, these cues are local in nature, meaning that the cohesive devices relate to sentence-level cohesion, in contrast to global cohesion, and include explicit connections between sentences and paragraphs (Crossley et al., 2016; Halliday & Hasan, 2013).
This study defines cohesive devices as elements that create links between sentences which help readers connect and organise materials without contributing to propositional content (Hyland, 2005; Jalilifar & Alipour, 2007; Vande Kopple, 1997). This is in line with Ruegg and Sugiyama's (2013) definition of cohesive devices and their adaptation of Hyland and Tse's (2004) model of metadiscourse. Previous research investigating the relationship between cohesive devices and writing quality of ESL/EFL student texts showed somewhat conflicting results. Several studies found that essays rated higher contain more cohesive ties than essays rated lower (Chiang, 2003; Tejada et al., 2015; Witte & Faigley, 1981; Yang & Sun, 2012). By contrast, previous research with expert raters shows divergent results with regard to first language (L1) as well as second language (L2) writing (Crossley et al., 2016; Crossley & McNamara, 2012; Guo et al., 2013). According to these studies, local cohesion indices are either negatively or not at all correlated with expert judgments of text coherence and text quality.

Assessing organisation in student writing
Previous research focused predominantly on different text features that influence raters and teachers when assessing organisation (Freedman, 1979; Ruegg & Sugiyama, 2013). Freedman (1979) examined the influence of text features on teachers' holistic assessment of student texts by rewriting four essays to be weaker or stronger in the four categories: content, organisation, sentence structure, and mechanics. Each of the four categories consisted of several features that were rewritten. For example, the manipulation of organisation involved differing paragraphing (appropriate paragraphing vs. three misparagraphings per 250-word page), ordering (logical ordering, keeping main ideas together, respecting "given-new" strategies vs. violating logical order, separating main ideas, violating "given-new" strategies) and transitioning (appropriate transitions vs. deleting transitions) (Freedman, 1979). This resulted in 96 essays (12 text variations on eight different topics), which were rated by 12 teachers using a 4-point holistic scale. Analyses of variance revealed that content and organisation affected the holistic ratings the most, in contrast to sentence structure and mechanics (Freedman, 1979). However, in the experimental variation of her study, the exact influence of one feature was difficult to determine since several aspects of one category were rewritten at once. This should be examined further.
In a recent study, Ruegg and Sugiyama (2013) investigated features that raters are sensitive to when assessing organisation in students' argumentative essays. Their findings revealed that raters seem to attach more importance to the most visible aspects of organisation, such as the number of paragraphs and the number of cohesive devices, than to the deeper level aspects such as coherent flow of ideas. An essay with several clearly distinguished paragraphs appeared to raters as more organised than a text consisting of only one paragraph (Ruegg & Sugiyama, 2013). Similarly, Crossley et al. (2014) conducted an empirical evaluation of linguistic features measured by simple natural language processing (NLP) tools. They found that the number of paragraphs of 11th grade students' essays correlated positively with human-rated essay scores (r = .388, p < .001). However, these findings have not yet been verified in an empirical study which investigates how the organisational quality of student essays affects teacher judgments of other criteria by manipulating organisational features. Based on previous studies suggesting that raters base their judgment of organisation on paragraphing and cohesive devices, the present study investigated the relationship between these features and the assessment of other analytic criteria.

Text features influencing teacher judgments
Reviewing various textual features and their influence on teachers' assessment of student essays, previous literature reported the influence of spelling, grammar and vocabulary on teachers' analytic assessment of other criteria (Birkel & Birkel, 2002; Rezaei & Lovorn, 2010; Scannell & Marshall, 1966; Vögelin et al., 2019). While Birkel and Birkel (2002) conducted their study with experienced elementary school teachers, many subsequent studies examined pre-service and in-service teachers' judgments. For instance, pre-service teachers assigned lower grades to papers with punctuation, spelling, and grammar errors even though they were instructed to grade content alone (Scannell & Marshall, 1966). Furthermore, teachers are strongly influenced by the physical appearance of a text, for example, a student's handwriting (Charney, 1984; Marshall & Powers, 1969). This refers to distortion effects in an analytic assessment situation, distinguishing between construct-relevant and construct-irrelevant factors (Messick, 1994). Teachers should be able to judge the construct-relevant factors of a student's written performance (e.g., vocabulary, organisation, spelling, etc.) separately from construct-irrelevant factors such as the student's ethnicity or gender (Kaiser et al., 2016). Failing to distinguish between construct-relevant and construct-irrelevant factors indicates a halo effect (Thorndike, 1920). A halo effect describes the positive or negative influence of one feature on other independent criteria in the rating process (Bechger et al., 2010; Thorndike, 1920). In ESL writing assessment, teachers typically decide on a set of relevant analytic scales for a particular writing task or context. Behind this selected set of criteria, there is a much larger set of criteria applicable to describe writing quality which interact and correlate with each other (Sadler, 1989).
For instance, poor vocabulary skills might affect the quality of grammar in a student text and it might be difficult to differentiate between these individual dimensions of a text. In an analytic approach, teachers try to disambiguate these dimensions by selecting a limited number of relevant and distinctive criteria and assigning them to separate scales. Previous studies have shown that the use of analytic scales is particularly relevant for pre-service teachers who are not used to assessing authentic, multifaceted student performances with uneven learning profiles (Weigle, 2002). In particular, second language writers who struggle with formal aspects of writing, such as spelling, benefit from analytic assessment, as different dimensions of writing will be judged separately. When regarding criteria separately, teachers are concerned with individual properties (or qualities) of a text (Sadler, 1989). This approach contrasts with a holistic approach in which teachers react to a piece of writing as a whole and do not assume operational independence among criteria, thus judging its quality in a broader sense. In a holistic approach, teachers make judgments based on the most salient criteria in a text and it might be difficult for students to make systematic progress if they do not know what configuration of criteria teachers are referring to in their appraisal of the text (Sadler, 1989). This is different in the analytic approach where students see which criteria are relevant for task completion, how a teacher rates each individual one, and consequently, in which areas they need to improve. Thus, it becomes an important aspect of teacher professionalism to be able to evaluate selected criteria relevant for a genre separately in an analytic assessment (Parr & Timperley, 2010), even though there might be some overlap between these aspects from a linguistic perspective.
In general, linguistic resources (in this case cohesive devices) have the function of displaying certain moves in a text (Swales, 1990). In the case of argumentative essays, cohesive devices typically display the order of arguments, which helps the reader to quickly grasp the structure of a text and thus follow it more easily. Another text pattern would be paragraphing for each argument. These are requirements of the genre of argumentative essays and need to be learned before writing an essay. In this study, learners studied the structure of an argumentative essay and identified the different moves in the preceding teaching unit (cf. "Student texts").

Purpose
Assessing writing in authentic classroom contexts entails an interplay of different influencing factors as well as a larger variety of text qualities. Features such as grammar, vocabulary or argumentation are distributed in varying configurations in different essays. This study, however, does not aim at representing the full diversity of students' written compositions. The aim of this experimental study is to explore how the organisational quality of student texts influences pre-service teachers' analytic assessment. Based on Ruegg and Sugiyama's (2013) findings, we varied organisational quality in the student essays by manipulating two key features of organisation in ESL argumentative essays: the number of paragraphs and the number of cohesive devices. By manipulating organisational quality in ESL argumentative essays, we investigated whether the experimental variation of organisation affected teacher judgments of other criteria. Consequently, the current study addresses the following research question:
• Do the number of paragraphs and the selected cohesive devices in ESL argumentative essays affect pre-service teachers' judgments of distinct and independent text features (indicating halo effects)?
Based on the studies summarized above, and our own work on the influence of spelling and lexical quality on other features of text assessment, we hypothesise that organisation has a similar effect and influences teacher perception of text features which should, from a pedagogical perspective, be viewed as separate.

Method
This study examines organisation from an educational perspective within the context of upper-secondary ESL writing. In order to obtain detailed insights into the influence of specific textual determinants in the assessment process, this study employs a design and methodology similar to those of an earlier study that investigated the influence of lexical diversity and sophistication on teacher judgments (Vögelin et al., 2019). This new, independent study involved the variation of organisational quality and was conducted with a different sample of participants.

Student texts
First, a teaching unit on argumentative writing was developed and implemented at an upper-secondary grammar school ("Gymnasium") in Switzerland. This type of schooling (International Standard Classification of Education [ISCED] level 3) is quite selective in Switzerland: only about 21% of students completed their upper-secondary education in this type of school in 2015 (National Statistical Office of Switzerland, 2016). Over 4 weeks, the students' English teacher taught them techniques of argumentative writing with materials from the coursebook Writers at Work: The Essay (Zemach & Stafford-Yilmaz, 2008). The teaching unit encompassed watching a video on "how technology shapes our life" with subsequent discussion, reading various articles on the influence of technology on people's lives, analysing the structure of argumentative essays, finding examples for one's arguments and independent writing of parts of argumentative essays. Thus, students learned how to structure and organise different elements of an argumentative essay and how to sustain their arguments by providing different types of support and by using the appropriate type of language required by the genre. At the end of 4 weeks, students were asked to write an essay in 90 minutes answering the following writing prompt: "Do you agree or disagree with the following statement? As humans are becoming more dependent on technology, they are gradually losing their independence." The students were in 11th grade (2 years before their final examination), had been studying English for 5 years, and were upper-intermediate learners.
From this dataset of 15 ESL argumentative essays, two texts with a high overall text quality and two texts with a low overall text quality were selected using the NAEP rating scale (National Assessment Governing Board, 2010). While the weaker student compositions exhibited insufficient levels of work with regard to the learning goals of the unit, the stronger compositions showed levels of work that were considered to have surpassed the learning goals. After preselecting two strong and two weak texts, two groups of independent expert raters (N = 9 and N = 7), consisting of experienced English lecturers and teachers who were part of the research team, each rated one weak and one strong text for overall text quality using the NAEP rating scale. The intra-class correlation coefficients (ICC) were satisfactory, with ICC (2,9) = .88 and ICC (2,7) = .75.
As outlined in the literature review, previous studies have reported that raters base their assessment of organisation in student writing on paragraphing and the number of cohesive devices. Since both features are central skills for upper-secondary ESL students to write argumentative texts, this study experimentally varied these features in the four student texts to investigate their role in the assessment process. We are aware that, from a linguistic point of view, our selected features do not cover the entire construct of organisation. We chose this operationalisation because, first, paragraphing and cohesive devices are viewed as central for writing well-structured argumentative essays at upper-secondary level in the pedagogical literature and feature heavily in relevant course books. Second, empirical studies have shown these two features to be particularly salient in determining raters' view of organisational quality in student texts (Ruegg & Sugiyama, 2013).
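The notation ICC (2,k) conventionally denotes a two-way random-effects intraclass correlation for the average of k raters. The following is a minimal sketch of that coefficient in plain Python, assuming a complete ratings table with one row per rated text and one column per rater. The function name `icc_2k` and the toy ratings are illustrative assumptions, not the study's data, and the source does not state which software the authors used.

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random effects, average of k raters.

    ratings: list of rows, one row per rated text, each row holding
    the scores given by the same k raters (no missing values).
    """
    n = len(ratings)       # number of rated texts
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way ANOVA decomposition
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-texts mean square
    ms_cols = ss_cols / (k - 1)                # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Toy data (hypothetical): 6 texts rated by 4 raters
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2k(toy_ratings), 2))
```

Because the coefficient treats raters as a random sample and penalises systematic differences in rater severity, it is a measure of absolute agreement rather than mere consistency.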
The study defined cohesive devices following the model of metadiscourse by Hyland and Tse (2004) and the study by Ruegg and Sugiyama (2013). In addition, we analysed cohesive devices found in our dataset of authentic student essays and complemented the list with further devices of our own. Table 1 displays the selection of cohesive devices employed in this study. In order to ensure that the manipulated number of cohesive devices was within the range of authentic student essays at that level, we examined the number of features per 100 words in the ESL compositions of the data set. We found a minimum of 0.75 tokens and a maximum of 2.80 tokens of cohesive devices per 100 words. Based on the mean length of M = 459.75 words of the four essays, we included three cohesive features in essays with a low organisational quality and 12 features in essays with a high organisational quality. Depending on the cohesive devices present in the original essay, different cohesive devices were inserted or substituted in order to increase coherence. To standardize the number and type of cohesive devices between texts, some cohesive devices already present in a student text had to be deleted or altered at times. All cohesive devices were inserted at the beginning of a sentence and carried little or no propositional meaning, as in the following example: "In addition, you cannot lose time if you use your phone" (own emphasis). In texts with low organisational quality, connectives consisted only of the tokens so, because (of), and and.
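The density calculation described above can be sketched as follows. This is a minimal illustration, assuming a simple word tokeniser and a short, hypothetical device list; the study's actual inventory is the one shown in its Table 1, and the helper names are not taken from the source.

```python
import re

# Hypothetical device list for illustration only (the study's list is in Table 1)
COHESIVE_DEVICES = ("in addition", "however", "therefore", "finally")


def devices_per_100_words(text, devices=COHESIVE_DEVICES):
    """Return the number of cohesive-device tokens per 100 words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    normalised = " ".join(words)  # punctuation removed, single spaces
    hits = sum(
        len(re.findall(r"\b" + re.escape(device) + r"\b", normalised))
        for device in devices
    )
    return 100.0 * hits / len(words)


def tokens_for_density(density_per_100, n_words):
    """Convert a density (tokens per 100 words) into a token count."""
    return density_per_100 * n_words / 100.0


MEAN_LENGTH = 459.75  # mean essay length reported in the study

# Observed density range in the dataset, scaled to the mean essay length
low_bound = tokens_for_density(0.75, MEAN_LENGTH)
high_bound = tokens_for_density(2.80, MEAN_LENGTH)
print(round(low_bound, 2), round(high_bound, 2))
```

Scaled to the mean essay length, the observed range corresponds to roughly 3.4 to 12.9 tokens per essay, which is close to the three and 12 devices used in the two experimental conditions.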
There were 12 tokens (11 types) of cohesive devices in each text with high organisational quality: six transitions, four frame markers, one code gloss, and one attitude marker. The level of lexical sophistication and diversity was measured between the text variations using Coh-Metrix (Graesser et al., 2004) and the Tool for the Automatic Analysis of Lexical Sophistication (TAALES). Using these objective measures, we made certain that the insertion of sophisticated cohesive devices in texts with a mean length of M = 459.75 words did not lead to a significant difference in the overall quality of vocabulary.
Regarding paragraphing, we deleted all paragraph breaks in texts with low organisational quality so that the essay appeared as one large paragraph. In texts with high organisational quality, essays were structured into five or more paragraphs according to the existing arguments in the text (i.e., each paragraph contained one idea or argument).
We thus manipulated organisational quality by systematically varying both paragraphing and cohesive devices in four student texts. This created a corpus of eight texts in which each text, regardless of overall text quality, existed in two versions: one with low organisational quality and one with high organisational quality. These texts then formed the basis of our experimental study.
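The resulting design can be laid out as a small sketch: four source essays crossed with two levels of organisational quality. The essay labels and dictionary keys below are hypothetical; the device counts and the paragraphing contrast are the values reported above.

```python
from itertools import product

# Hypothetical labels for the two strong and two weak source essays
SOURCE_TEXTS = [
    ("strong_1", "high"),
    ("strong_2", "high"),
    ("weak_1", "low"),
    ("weak_2", "low"),
]

# Manipulation levels as described in the text
ORGANISATION_LEVELS = {
    "low": {"cohesive_devices": 3, "paragraph_breaks": False},
    "high": {"cohesive_devices": 12, "paragraph_breaks": True},
}

# Cross every source essay with both organisation levels
corpus = [
    {
        "text": name,
        "overall_quality": quality,
        "organisation": level,
        **ORGANISATION_LEVELS[level],
    }
    for (name, quality), level in product(SOURCE_TEXTS, ORGANISATION_LEVELS)
]

for version in corpus:
    print(version["text"], version["overall_quality"], version["organisation"])
```

Each essay appears once per organisation level, so overall text quality and organisational quality are fully crossed in the eight versions.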

Rating scales
This study employed both analytic and holistic rating scales (Crusan, 2010; Knoch, 2009; Rakedzon & Baram-Tsabari, 2017; Weigle, 2002) to investigate whether organisational quality influences teacher judgments on both types of scales. To measure the influence of organisation on other features of text quality, and to address the genre-specific characteristics of argumentative essays, we developed specific rating scales aligned with the relevant syllabus of the 11th grade at the upper-secondary grammar school in Switzerland and Germany. Our scales were based on existing analytic scales where appropriate, such as the 6 + 1 trait model (Culham, 2003) and the Test in English for Educational Purposes (TEEP; Weir, 1988). This resulted in the following six analytic scales: organisation, support of arguments, spelling and punctuation, grammar, vocabulary, and overall task completion (see Appendix). Special attention was paid to the organisation scale, since previous studies reported raters' difficulties in understanding the descriptors of this concept (Ruegg & Sugiyama, 2013) or their tendency to follow their own internal manifestations of organisation (Erdosy, 2004). Thus, the descriptors of the organisation scale are based on the KEPT Essay Rating scales (Ruegg & Sugiyama, 2013), the ESL Composition Profile (Jacobs et al., 1981) and the STEP writing performance criteria and descriptors (Lumley, 2002). In a pilot phase, a group of experienced lecturers created an a priori rating scale together with the research team, in which they adapted and evaluated the assessment criteria in two rounds of feedback in order to eliminate imprecise wording and ambiguities. As previous studies have shown no significant differences between 4-point and 6-point rating scales with regard to reliability and sensitivity (McColloy & Remsted, 1965; Nimehchisalem, 2010), we chose rating scales with only four levels in order to make them practical and easy to use.
We also included a holistic scale in this study as a manipulation check. A successful manipulation would include a lower holistic assessment for texts with lower organisational quality, and vice versa, since organisation is a component of the holistic scale employed in this study. We used the well-validated National Assessment of Educational Progress (NAEP) rating scale, which is divided into six levels (National Assessment Governing Board, 2010). This scale was previously used in large-scale empirical writing assessment studies in Germany because it contains age- and grade-appropriate criteria that fit well with the national curriculum (Knopp et al., 2012). In the Swiss and German educational context, no standardized holistic rating scale for argumentative essays at upper-secondary level exists. Thus, we chose the NAEP scale as it is widely used in international research and is a good fit with local curricula.
Existing theoretical work on the nature of analytic assessment has shown that the criteria used in appraising the quality of performance interlock to some degree (Bacha, 2001; Sadler, 1989). Analytic criteria may sometimes be fuzzy or be regarded as a continuous gradation from one category to another, as in assessing "originality" (Sadler, 1989). In designing our own scales, we therefore paid special attention to the distinctiveness of our genre-specific analytic rating scales by differentiating between the six scales and defining the levels within each scale by using detailed descriptors. Our aim was to test whether participants would be able to discern these criteria as a profile of distinct strengths and weaknesses, and to gauge the size of a possible influence of our manipulated criterion.

Participants
Participants were students attending seminars on English language methodology or English language skills at Swiss and German universities (N = 53). Of the participants, 60.4% were enrolled in a teacher education programme for upper-secondary school and 28.3% in a programme for lower-secondary school. The remaining 11.3% were enrolled in an English master's degree with the intention of later becoming upper-secondary teachers. The participants' age ranged from 21 to 48 years with a mean of 27.98 years (SD = 5.78). The majority of participants were female (69.8%). Furthermore, 54.7% of the participants reported that their English proficiency corresponded to level C2 according to the Common European Framework (Council of Europe, 2001). The remaining participants reported that their English was equivalent to a C1 (35.8%) or that English was their first language (7.5%). The study thus dealt with participants who, in their own estimate, were highly competent users of the English language.

Procedure
This study used the computer-based assessment tool Student Inventory ASSET (SIA), which has been adapted from the Student Inventory (SI) (Vögelin et al., 2019). First, participants received background information on the genre of the texts (argumentative essays), the school context, and the teaching unit which had been implemented prior to the writing prompt. After reading this background information, participants were asked to familiarize themselves with the descriptors of the holistic and analytic rating scales employed in the study. Each participant then evaluated four texts in randomized order: two weak texts with differing levels of organisational quality (high/low) and two strong texts with differing levels of organisational quality (high/low). Participants assessed each text on the holistic scale first and then on the analytic rating scales. In order to control for the influence of handwriting (Marshall & Powers, 1969), student essays were typed and displayed in the same font. Last, participants were asked background questions concerning their studies and English language proficiency. After data gathering was complete, participants saw their scores in relation to expert ratings of the four texts. In addition to the expert ratings on both types of scales, experts' descriptive comments on the student texts were presented in order to justify their ratings. In that way, participants could see where they had assessed more strictly or leniently and in what ways their reasons for the assessment aligned with those of the experts. This feedback was not part of the empirical study but was added at the end as a benefit to participants who had taken part in the study.

Analyses
We used an experimental 2 × 2 design with two independent variables: overall text quality and organisational quality. To test the hypothesis, we conducted repeated-measures multivariate analyses of variance, which assume normal distribution of the data, independence, and sphericity. Concerning the first assumption, the analysis of variance is robust against violations of normality since our experimental design was balanced with more than 20 participants per condition (Schmider et al., 2010). Further, the second and third assumptions were met by our design. Table 2 displays skewness and kurtosis values, which were in acceptable ranges for all dependent variables.
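As a minimal sketch of how the univariate follow-up tests in such a 2 × 2 within-subject design can be understood (this is not the software or script used in this study): each main effect and the interaction reduce to a single contrast score per participant across the four cells, and the resulting F(1, n − 1) statistic equals the squared one-sample t statistic of that contrast. The function name, the contrast labels, and the ratings below are hypothetical and serve illustration only.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings, one tuple per participant, in the cell order
# (weak text / low organisation, weak / high, strong / low, strong / high).
# These numbers are invented for illustration; they are NOT the study's data.
ratings = [
    (1, 2, 3, 3),
    (1, 3, 3, 4),
    (2, 3, 3, 3),
    (1, 2, 2, 3),
    (2, 2, 3, 4),
    (1, 3, 4, 4),
]

# Contrast weights for the two main effects and their interaction.
CONTRASTS = {
    "text_quality": (-1, -1, 1, 1),
    "organisation": (-1, 1, -1, 1),
    "interaction": (1, -1, -1, 1),
}


def contrast_f(data, weights):
    """Return (F, df_error) for a within-subject contrast with 1 numerator df."""
    # One contrast score per participant.
    scores = [sum(w * x for w, x in zip(weights, row)) for row in data]
    n = len(scores)
    # One-sample t test of the contrast scores against zero; F = t squared.
    t = mean(scores) / (stdev(scores) / sqrt(n))
    return t * t, n - 1


for name, weights in CONTRASTS.items():
    f_value, df_error = contrast_f(ratings, weights)
    print(f"{name}: F(1, {df_error}) = {f_value:.2f}")
```

With 53 participants, the error term of each such contrast has 52 degrees of freedom. A negative mean interaction contrast would indicate that the gain from high organisational quality is smaller for strong texts than for weak texts.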
There were no missing values as participants could only proceed in SIA once they had assessed all four texts. We calculated intra-class correlation coefficients (ICC) for the inter-rater agreement on individual texts. Inter-rater agreement ranged from ICC (2,1) = .34 on the scale task completion to ICC (2,1) = .74 on the scale organisation. The relatively low ICCs on scales other than organisation indicate substantial variation in the ratings of each of the eight texts in comparison to the variation of ratings across the texts. Yet, the eight texts consist of the same four texts in two versions each, with a slight variation of organisation and only two levels of overall text quality across the four texts. Therefore, we can safely assume that the low ICCs are a result of the low overall variance of text quality and should expect to receive high ICCs only on the ratings of organisation. Table 3 shows descriptive statistics for the variable text quality with regard to the holistic and the six analytic rating scales. Correlations between the holistic and analytic scores ranged between r = .61 (p < .001) for spelling and r = .74 (p < .001) for support of arguments (see Table 4). The multivariate test indicated significant effects for text quality (F(7, 46) = 44.41, p < .001, η2 = .87) and thus, we conducted univariate post-hoc tests.
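For readers unfamiliar with the ICC (2,1) values reported above, the following is a minimal sketch of the two-way random-effects, absolute-agreement, single-rater coefficient computed from a complete texts × raters table. It assumes no missing ratings; the function name and the toy ratings are hypothetical and are not taken from this study.

```python
def icc_2_1(table):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `table` is a list of rows, one per rated text, each holding one rating
    per rater (the same raters for every text, no missing values).
    """
    n = len(table)       # number of rated texts
    k = len(table[0])    # number of raters
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-text mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy ratings (rows = texts, columns = raters), invented for illustration.
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_2_1(toy_ratings):.2f}")
```

Because the between-text mean square sits in the numerator, the coefficient shrinks when the rated texts differ little on a scale relative to the disagreement among raters, which is in line with the interpretation of the low ICCs given above.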

Organisational quality
Concerning the variable organisational quality, the multivariate test indicated significant effects (F(7, 46) = 46.99, p < .001, η2 = .88); thus, we conducted univariate post-hoc tests (see Table 5). The second set of univariate tests showed that texts with a high organisational quality were judged more positively than texts with a low organisational quality with regard to the holistic score (F(1, 52) = 52.00, p < .001, η2 = .50), indicating a successful manipulation check.

Interaction between text quality and organisational quality
A multivariate effect was detected for first-level interactions between text quality and organisational quality (F(7, 46) = 2.36, p < .05, η2 = .26). The third set of univariate tests for the interaction effect indicated that the effect of organisational quality was stronger when overall text quality was low. The effects of organisation on the judgment of organisation (F(1, 52) = 5.57, p < .05, η2 = .10) and support of arguments (F(1, 52) = 6.34, p < .05, η2 = .11) were stronger in texts of low overall quality than in texts of high overall quality (see Figures 1 and 2).
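The effect sizes reported in this section can be related to the F statistics through the standard identity for partial eta squared, η²p = (F · df1) / (F · df1 + df2). The text labels the values simply as η2; reading them as partial eta squared is an assumption on our part, although the short check below reproduces each reported value to two decimals. The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its df."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error, eta squared as reported in the text)
reported = [
    ("text quality, multivariate", 44.41, 7, 46, 0.87),
    ("organisational quality, multivariate", 46.99, 7, 46, 0.88),
    ("organisational quality, holistic score", 52.00, 1, 52, 0.50),
    ("interaction, multivariate", 2.36, 7, 46, 0.26),
    ("interaction, organisation scale", 5.57, 1, 52, 0.10),
    ("interaction, support of arguments", 6.34, 1, 52, 0.11),
]

for label, f_value, df_effect, df_error, eta_reported in reported:
    eta = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: computed {eta:.2f}, reported {eta_reported:.2f}")
```

Read this way, organisational quality accounts for about half of the variance in the holistic score that is not attributable to other effects, whereas the interaction effects on the individual analytic scales are considerably smaller.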

Discussion
The aim of this experimental study was to explore how organisational quality of ESL argumentative essays influences pre-service teachers' judgments of other independent textual features in an analytic assessment process. By focusing on organisation (operationalised by paragraphing and cohesive devices) as an influencing factor on teacher judgments, this study provides new insights into the assessment process of ESL texts which, we hope, are beneficial both for the research and practice of writing assessment. We focused on pre-service teachers in this study because there is an increasing awareness of the importance of teacher education for developing diagnostic competence in general (Blömeke et al., 2015). Especially in writing education, it has been shown that teaching experience does not automatically transfer to high diagnostic competence or accurate assessment. For example, Jansen et al. (submitted) showed that experienced English teachers at upper-secondary level in Switzerland and Germany consistently underestimated students' writing competences (i.e., judged their texts too harshly) in comparison to benchmark scores. As Dempsey et al. (2009) showed, learning to assess student texts objectively and fairly requires scaffolded practice in assessing multiple student papers, applying explicit assessment criteria under expert guidance and discussing assessments in peer groups. It is this type of specific training that pre-service teachers need to receive in their teacher education, and our study aimed to lay the groundwork for suitable materials and training tools in this field.
Findings suggest that pre-service teachers were able to differentiate between strong and weak ESL argumentative essays on holistic as well as analytic rating scales, replicating results from a previous study on ESL text assessment (Vögelin et al., 2019). Thus, participants judged texts with higher overall text quality more positively than texts with lower overall text quality with regard to all criteria. In order to explore the influence of organisation on other analytic scales, this study further investigated whether experimentally varied organisational qualities of student texts influence teachers' judgments of other analytic criteria. Results showed that the manipulation of organisational quality significantly affected participants' assessment on a holistic scale, indicating a successful manipulation check. Furthermore, the variation of organisational quality also influenced assessment on all other analytic scales employed in this study: support of arguments, spelling, grammar, vocabulary, and overall task completion. On all these features, texts with a higher organisational quality were rated more positively than texts with a lower organisational quality.
While findings indicate halo effects for criteria which are sufficiently distinct from organisation, other criteria might be more difficult to distinguish from organisation in student texts. In our view, the influence of organisation on formal criteria, such as spelling and grammar, suggests halo effects. Even though the number of spelling mistakes was kept constant between text variations, spelling was assessed more positively in texts with higher organisational quality. The same can be said for grammar, which was defined as usage of a variety of complex grammatical structures and few grammatical mistakes. Again, organisation should also not affect the assessment of grammar since cohesive devices were inserted solely at the beginning of a sentence and did not alter the original grammatical structure of the student essay in any way.
We also argue for halo effects for the criterion overall task completion as it was operationalized in this study. According to our analytic rating scale, task completion was defined as whether the student text fully addresses the essay question and was thus concerned with whether the written text is relevant to the essay question (see Appendix). As the variety of transition words and the paragraphing of a text only marginally overlap with the criterion of whether an essay fully addresses the essay question, we would again argue that this constitutes a halo effect.
The criteria support of arguments and vocabulary are more complex cases since they share some overlap with the analytic scale organisation. First, the criterion support of arguments was defined in this study as whether authors present a variety of different examples to support their argument and whether the relevance of these examples to the topic is fully explained. By inserting cohesive devices, the relevance of examples might be better explained to the readership. However, the influence of one criterion on the other is still remarkable, as students' abilities to present arguments and clarify their relevance to the topic are noticeably different from their use of paragraphing and cohesive devices, which do not add to the propositional content of the essay.
Second, quality of vocabulary, defined as whether the author uses sophisticated, varied vocabulary, correlates operationally with organisation since the insertion or deletion of cohesive devices affects the lexical quality of the texts. Further, the two scales share some overlap with regard to the formulation "sophisticated transition words" and "sophisticated, varied vocabulary". This should be avoided in future studies. Nevertheless, the automated tools of text analysis (Coh-Metrix and TAALES) employed in this study indicated that lexical diversity and sophistication stayed within a comparable range, and the differences perceived by human raters in our experimental variation far exceeded the minor changes made in our manipulation.
Overall, the findings of this study suggest that the manipulation of organisational quality affected the assessment of other analytic scales to a greater or lesser degree. While the effect of organisation on the analytic scales spelling, grammar, and task completion indicates halo effects in pre-service teachers' assessment, the influence on vocabulary and support of arguments is not that clear-cut due to overlap between the analytic criteria. In sum, these findings underscore how difficult it is for pre-service teachers to keep different categories in analytic rating scales apart and assess each correctly. Findings further support the notion that teachers partly base their assessment, consciously or subconsciously, on the most noticeable aspects of writing (Charney, 1984; Erdosy, 2004; Ruegg & Sugiyama, 2013), since the successful manipulation of organisational quality in this study focused on the most visible aspects of organisation (paragraphing) and on cohesive devices only. Overall, results reinforce those reported in previous studies, namely that analytic assessments are prone to halo effects, which can lead to unfair assessments of student writing (Birkel & Birkel, 2002; Freedman, 1979; Rezaei & Lovorn, 2010; Scannell & Marshall, 1966; Vögelin et al., 2019). Our study suggests that organisation should play an important role in ESL teacher education as it influences pre-service teachers' judgments of other criteria in significant ways.

Implications for practice
This study contributes one modest piece to the large puzzle of text assessment involving the multifaceted construct of organisation, which is highly relevant for teachers of ESL argumentative writing at upper-secondary level. Results imply a need for raising awareness among pre-service teachers of possible influences in the assessment process and preparing them to assess student texts as fairly and objectively as possible by using analytic rating scales. Pre-service teachers should be aware of the influence of cohesive devices and paragraphing on their assessment of other criteria, such as spelling, grammar and task completion. They should further be familiarized with key elements of organisation in argumentative essays in order to teach students how to present and signal their arguments in a successful argumentative essay. Our study illustrates that it is important to discuss and practice assessing individual text features separately on analytic scales in teacher education. This could be realized in an initial practice session in which pre-service teachers could assess multiple authentic student texts in SIA and receive immediate feedback. Pre-service teachers could discuss in peer groups their assessment experiences and the influence of specific text features. In a subsequent session on rating scales, participants could discuss the advantages and disadvantages of various rating scales and apply them under expert guidance. This is particularly relevant for Swiss and German pre-service teachers who can later choose their own rating scales for use in their classroom assessments in English writing.
Further research should also develop subject-specific analytic rating scales which are aligned with the local syllabi of the educational contexts in which they are used (Rakedzon & Baram-Tsabari, 2017; Rezaei & Lovorn, 2010). Subject-specific rating scales enable teachers to identify, examine and credit specific elements of a genre more easily in comparison to general rating scales.

Limitations and directions for future research
This study employed a controlled manipulation of two organisational features within an experimental setting and thus follows a narrow operationalisation of the broad construct of organisation in student writing. As it did not aim at covering the complete range of possible text variations relating to organisation which occur in authentic student texts, a generalization to a real assessment situation would require additional studies.
While our operationalization of organisation was somewhat limited, the manipulation employed in this study resembled naturally occurring phenomena and reflected the nature as well as the frequency of organisational features within the range of student texts in the dataset upon which this research is based. However, future studies should investigate the influence of organisational quality on the basis of a larger corpus of student essays in order to cover the wide range of quality occurring within authentic written performances.
This study is further limited due to the combined manipulation of cohesive devices and paragraphing. This co-occurrence of manipulated text features does not allow any conclusions about which feature had more influence on teacher judgments. Following previous research stating that teachers are strongly influenced by superficial aspects of student writing (Charney, 1984; Scannell & Marshall, 1966), one might assume that the variation of paragraphs had a greater influence on teachers' assessment, since it is more prominent at first glance. It would be interesting to analyse the precise influence of each text feature in separate studies. We included both features in this first study as they are essential components of organisation from a rater's perspective (Ruegg & Sugiyama, 2013).
Further, this study defined organisational quality of a student text in terms of surface-level indicators. In contrast to previous studies that investigated the underlying constructs of organisation, such as cohesion and writing proficiency (Crossley et al., 2014; Crossley & McNamara, 2012) or coherence (Lee, 2002; McNamara et al., 2010; Witte & Faigley, 1981), we intended to explore the influence of two organisational features on teacher judgments in an experimental setting. Manipulating a feature such as coherence would have implied complete re-writing of original texts, inevitably also changing aspects such as grammar or vocabulary which were distinct variables in our design. Still, our setting is not completely artificial as our dataset of argumentative essays exhibited several compositions by students who seemed to be struggling with appropriate paragraphing or effective use of cohesive devices.
Future studies could explore how halo effects can be minimized in order to make writing assessment more objective. They could also investigate teacher judgments in a more "natural" context. How do pre-service and in-service teachers assess organisation in classroom assessment activities? Additionally, it would be interesting to examine the potential of teacher training measures. How do pre-service teachers best learn to assess student texts in their education programmes? These studies could also compare first and second/foreign language writing assessment.
In conclusion, we believe that our study enriches the comprehension of pre-service teachers' assessment of ESL essays and contributes useful data on an essential aspect of diagnosis in the English classroom. The ultimate goal of this research is to contribute to more precise and fairer assessments of student performances in the English classroom.