Using Causal Explanation Speaking Tasks to Assess Young EFL learners’ Speaking Ability: The Effects of Age, Cognitive, and L2 Linguistic Development

ABSTRACT This paper examined to what extent causal explanation speaking tasks (CESTs) are cognitively appropriate for assessing young language learners' (YLLs) L2 speaking. Ninety-six YLLs in China (48 each from Grades 4 and 6) performed two CESTs in both their L1 (Chinese) and L2 (English). They also completed receptive and productive L2 vocabulary size tests. We examined how their CEST performance scores, choice of causal antecedents, and speech utterances were related to the language mode of the tasks (L1 vs. L2), grade level, and L2 vocabulary size. L2 CEST performance scores had significant positive correlations with L2 productive vocabulary size. CESTs were found to be generally cognitively appropriate for YLLs, as their high L1 performance scores indicated that performing CESTs is within their L1 capacity. By examining the causal connectives used by YLLs, we found that learners from both age groups had sufficient cognitive ability to verbalise causality. Yet YLLs' cognitive ability to interpret and verbalise mental states is still developing, and reasoning among causal antecedents that have competing causal relationships with the final state can be cognitively challenging. We discuss the findings with reference to the design of cognitively appropriate CESTs that can assess both language and thinking skills.


INTRODUCTION
The EFL curricular reform in China has started to promote the development of both language and thinking skills for young language learners (YLLs) so as to develop their key competences through language learning (Ministry of Education China, 2022). Such a change corresponds to EFL educators' changing belief that language education provides learners with opportunities not only for language development but also for holistic development in thinking skills (Gong, 2015). However, there is a lack of assessment tasks that can capture such an instructional focus. Causal explanation speaking tasks (CESTs), a type of speaking task that uses why questions to elicit explanations about cause and effect, deserve special attention. The reason is that why questions and meaning-focused tasks are increasingly used to develop both the language and thinking skills of YLLs (L. Li, 2011, 2016; Zheng, 2018). Causal explanation itself is part of the can-do statements of curriculum objectives that integrate language and thinking skills. For example, primary and secondary school EFL learners in China should be able to identify and analyse cause and effect (Ministry of Education China, 2022). It is also a "cognitive discourse function" that bridges content and language in Content-and-Language-Integrated Learning (CLIL; Dalton-Puffer, 2013, p. 216). Therefore, we developed causal explanation speaking tasks that can be used in standardized tests and by classroom teachers.
In addition, compared with other speaking tasks for YLLs, which normally involve description, narration, and simple daily conversations (ETS, 2022a; Papp, 2018a, pp. 343-345), CESTs have a distinctive and perhaps higher cognitive demand because they require YLLs to reason about cause and effect and explain why. Considering YLLs' developing cognitive capacity and growing L2 proficiency, there are concerns about the extent to which a CEST is a language task and the extent to which it is cognitively appropriate for YLLs. To address these concerns, our study draws on the argument-based approach to validation (Chapelle et al., 2008; Kane, 2013) as well as Weir's socio-cognitive framework, which gives special consideration to the requirement that a cognitively appropriate task should be within YLLs' L1 capacity (Field, 2018). We also followed the suggestions of the Standards for Educational and Psychological Testing (AERA, APA and NCME, 2014) on the importance of collecting multiple sources of validity evidence. In particular, we aim to better understand the roles that learners' cognitive and L2 linguistic development play in their L2 CEST performance.

Gathering validity evidence of the assessment of young language learners
Validation is, in essence, collecting evidence to support the intended interpretations of test scores and test use (AERA, APA and NCME, 2014; Chapelle et al., 2008; Kane, 2013). The argument-based approach to test validation requires test developers to back up their claims with theoretical and/or empirical evidence (Chapelle et al., 2008). To gather validity evidence for the assessment of YLLs, we must take YLLs' special characteristics into consideration. Weir's socio-cognitive framework for test validation pinpoints that the cognitive demand of an L2 task should be within a YLL's L1 capacity (Field, 2018), which suggests that we should take into account the age-related cognitive development reflected in YLLs' L1 performance. Thus, to back up the claim that a CEST is a cognitively appropriate language test, we should collect evidence on how YLLs' L2 performance of CESTs is related to their L2 proficiency and to the development of the cognitive abilities demonstrated in their L1 performance. In the case of CESTs, these include the age-related cognitive abilities to interpret and verbalise causal relationships (see the next section) and mental states (see the section on Theory of mind and mental state words). Such considerations are also aligned with the requirement of the Standards for Educational and Psychological Testing (AERA, APA and NCME, 2014) that validation should gather multiple sources of validity evidence.

Linguistic and thinking skills required in causal explanation speaking tasks
Higher order thinking skills (hereafter thinking skills) were defined by Resnick (1987, p. 3) as involving multiple solutions, nuanced judgement, multiple criteria, uncertainty, self-regulation, imposing meaning, and effort. Thinking skills are generally considered broader than critical thinking, which does not include creative thinking (Wegerif et al., 2015; X. Li & Liu, 2021). The assessment of thinking skills generally takes two approaches (Swartz & McGuinness, 2014). One is the psychometric approach, which treats thinking skills as a construct that can be measured "separately" and "efficiently" by an instrument; the other is the curriculum approach, which measures thinking skills by evaluating learners' written or oral performance in a learning context (Swartz & McGuinness, 2014, p. 44). The existing instruments for thinking skills are mainly for adults, such as the Thinking Skills Assessment (Cambridge Assessment Admissions Testing, n.d.), the HEIghten® Critical Thinking assessment (Liu et al., 2014) and the Watson Junior® Critical Thinking Appraisal (Pearson, n.d.). There are no readily available psychometric instruments measuring young learners' thinking skills. Swartz and McGuinness (2014) suggested using the curriculum approach for young learners, but they acknowledged that it could be problematic for assessment because students' content knowledge, thinking skills, and language ability tend to be mixed up in the rating. Thus, they emphasized the importance of aligning the assessment with the objectives of a curriculum and using criterion-related rating scales.
The objectives of teaching, learning, and assessing thinking skills are typically operationalised as a series of linguistic functions, such as understanding, explaining, and analysing (e.g., Ministry of Education China, 2022). These verbs are considered manifestations of thinking skills by, for example, Bloom's taxonomy (Anderson et al., 2014; Bloom, 1956). They are conceptualised by Dalton-Puffer (2013) as "cognitive discourse functions" (p. 227) that demonstrate thinking skills and connect language with content. In a soft CLIL curriculum, where language communication is a target and cognition (thinking skills) is a highly valued goal (Ikeda, 2022), causal explanation is one such cognitive discourse function. We define causal explanation as the explanation elicited by why questions about cause and effect (Bailey, 2017; Donaldson, 1986; Lombrozo & Vasilyeva, 2017). To accomplish a causal explanation speaking task, learners need both the linguistic knowledge to verbalise causality and to describe mental states and actions, and the cognitive ability to reason about cause and effect (Donaldson, 1986) and to interpret the relevant conceptual domains, e.g., mental states. In addition, causal reasoning (the cognitive ability underlying causal explanation; Lombrozo & Vasilyeva, 2017) and intentional reasoning constitute part of the cognitive demand of a language task (Robinson, 2001; Robinson & Gilabert, 2007).
Previous research on causal explanation or reasoning concentrated on pre-school children's development of causal connectives (CC) in their L1 (for a review, see Donaldson, 1986) or on the processes of reasoning (e.g., Das Gupta & Bryant, 1989; Legare et al., 2010). Research on school-age English-as-a-second-language (ESL) learners' progression in causal explanation in L1 and L2 appeared only recently (Bailey & Heritage, 2014; Bailey, 2017). However, no study has yet looked at young EFL learners' causal explanation development or their performance of CESTs under testing conditions. Hickling and Wellman (2001) observed that children from 2 years and 3 months of age used CC in daily spontaneous L1 speech across four conceptual domains: physical, psychological, socio-conventional, and biological. Das Gupta and Bryant (1989) found that 4-year-old children significantly outperformed 3-year-olds in making correct causal inferences when describing the initial and final states of an object depicted in photos. Similarly, H. Li et al. (2005) found among Chinese children that 4-year-olds performed better than 3.5-year-olds in causal reasoning tasks, and that cause-effect reasoning was easier than effect-cause reasoning. Bailey and colleagues (Bailey & Heritage, 2014; Bailey, 2017) tracked the progression of explanation language used by English native-speaking monolingual children (NS) and ESL children in kindergarten and Grade 3. After around six months, features like coherence/cohesion remained static in NS kindergarteners but not in their ESL peers, while features like sentence and vocabulary sophistication showed modest increases in both NS and ESL children. They also observed modest gains in the coherence/cohesion and vocabulary sophistication of NS 3rd graders' explanations, while there was attrition in these two aspects among their ESL peers.

Theory of mind and mental state words
Theory of mind (ToM) refers to the ability to understand one's own and others' minds (Wellman, 2011). Empirical studies suggest that pre-school children's developmental milestones in ToM (for a review, see Wellman, 2011, 2014) start from understanding people as intentional agents at around 1 year old and progress to gradually understanding different types of intentions (e.g., diverse beliefs and false belief) from around 3 to 5 years old. From preschool to school age onwards, children's ToM develops from being intuitive to more reflective. Children start to reflect on, question, and extend the realities they experience as they grow older and accumulate more world experience (Wellman, 2014).
The development of ToM can be manifested in children's increasing use of mental state words (MSW), the words used to depict mental states. Previous studies of MSW compared the L1 use of MSW between children with and without delayed language development or language impairments (e.g., Lee & Rescorla, 2002), and explored the development and use of MSW in different contexts (e.g., Pascual et al., 2008). Lee and Rescorla (2002) investigated the use of MSW among persistently late talkers, late bloomers, and typically developing children (all aged around 36 months). They divided MSW into four types, namely physiological (e.g., sleepy), emotion (e.g., happy), desire (e.g., want), and cognitive (e.g., think) states. Butler et al. (2017) tracked the use of MSW in school-age YLLs' L1 and L2 picture-book narration in a two-year longitudinal study (from Grade 4 to Grade 6). They defined MSW as adjectives, adverbs, and adverbials used to verbalise the inner states of characters. They observed more frequent use of MSW in both L1 and L2 as children grew older, and a lower frequency of MSW in L2 than in L1.
Another stream of relevant studies looked at how different degrees of intentional reasoning in a task might be related to different linguistic indices of task performance, but mostly among adult language learners (e.g., Robinson, 2007). Robinson (2007) invited university students to accomplish picture arrangement tests with different levels of intentional reasoning. Following the categorisation of MSW in Lee and Rescorla (2002), he found that more MSW, especially psychological and cognitive state words, appeared in the performance of tasks with a higher degree of intentional reasoning. However, none of these studies has investigated YLLs' use of MSW in causal explanation speaking tasks.

Vocabulary size and L2 speaking proficiency
Vocabulary size can be defined as "the number of known words in terms of form-meaning connections" (Uchihara & Clenton, 2020, p. 541), and it usually comprises receptive and productive vocabulary sizes. Vocabulary size has been recognized as a core component in L2 speech production models. According to Levelt's monolingual speech production model (Levelt, 1989, 1999) and Kormos (2006), speakers generate a pre-verbal message, encode this pre-verbal message by retrieving relevant vocabulary from their mental lexicon along with the relevant grammatical rules, and transform it into speech utterances. In the bilingual speech model (Kormos, 2006), L1 and L2 concepts, lemmas, and lexemes are stored together in the mental lexicon. In addition, the mental representation of lexical items is a component of core language proficiency (Hulstijn, 2007).
The number of words that children have mastered has been regarded as an important indicator of their language development milestones (Speech & Language, n.d.; Stanford Medicine Children's Health, 2023). In developmental studies, vocabulary size has been used as a proxy for children's verbal ability (e.g., Stites & Özçalışkan, 2013). In studies of L2 development, vocabulary size is an important predictor of general language proficiency. For instance, Uchihara and Clenton (2020) found that receptive vocabulary size was significantly correlated with the use of lexical resources in the rating of L2 speaking, but not with lexical sophistication in speech utterances. de Jong et al. (2012) found that productive vocabulary size was a significant predictor of the speaking performance of adult Dutch native speakers and intermediate and advanced L2 (Dutch) learners. Similarly, Koizumi and In'nami (2013) found that English productive vocabulary size significantly predicted various aspects (e.g., fluency and accuracy) of, as well as the overall, speaking proficiency of Japanese secondary school lower-intermediate EFL learners. However, very few empirical studies have been conducted among beginner-level YLLs.

Identifying the effects of age, cognitive and L2 linguistic development on young language learners' test performance
Test developers aim to make assessment tasks age-appropriate for YLLs, and each standardized assessment for YLLs has intended age groups, usually aged six and above (e.g., Devine, 2018; ETS, 2022b; Pearson, 2022). However, findings on age effects on speaking test performance have been inconclusive. In Huang et al. (2021), a study that validated the use of the TOEFL Junior® Speaking test for adolescents, 252 EFL learners from Grades 7 to 9 (average age = 14) were recruited from three local schools in Taipei. Participants' TOEFL Junior® speaking test scores had significant positive correlations with their age, the length of English instruction at school, and out-of-school exposure to English, but negative correlations with their starting age of learning English. Research conducted in diverse educational contexts, however, has shown mixed effects of age. Getman (2020) analysed the scores of the TOEFL Primary® Speaking test taken by over 400 YLLs aged 7 to 13 in 11 countries. The study found that the speaking tasks functioned similarly for different age groups and showed no age bias. In a large dataset from Cambridge English Language Assessment that included the test scores of YLLs aged 4 to 16 from around 16 countries/regions (Papp, 2018b, pp. 84-86), older test-takers tended to achieve lower band levels in the listening and speaking sections of different test suites. Such inconsistent findings suggest that the effects of age on L2 test performance may need to be interpreted together with other potential confounding factors, for instance the varying starting age of learning English, the different types of English language curriculum that YLLs are exposed to, and the length and quality of English instruction as well as informal exposure outside the classroom (Muñoz, 2007, 2014). Apart from these contextual factors, age effects may also be confounded with YLLs' individual differences in cognitive development.
It is challenging to tease apart the effects of L2 linguistic development and cognitive development on YLLs' performance of assessment tasks, largely because of the difficulty of recruiting a sufficient sample, obtaining appropriate and readily available instruments, and controlling for confounding effects. Some child development studies use a psychometric approach. Such an approach first assesses each linguistic and cognitive factor with an independent measurement, then includes age as one of the independent variables and calculates their statistical relationships with language performance, mostly in L1 (e.g., Stites & Özçalışkan, 2013). However, this approach can be hard to use if the relevant instruments do not exist or are not available. In the field of language assessment, studies that take a psychometric approach mainly concern older language learners and do not include age as a variable under investigation (e.g., Awwad & Tavakoli, 2022), or they examine how one particular cognitive factor might impact YLLs' task performance while linguistic factors are not the focus of investigation (e.g., Brunfaut et al., 2021; Michel et al., 2019).
As an alternative approach to teasing apart cognitive effects from linguistic effects, some scholars compare the linguistic features of L1 and L2 performance across different age groups or time points (e.g., Butler & Zeng, 2014, 2015; Butler et al., 2017). An underlying assumption of this approach is that the variance in L1 performance across age groups is largely due to variance in YLLs' cognitive ability. This approach is especially useful for revealing how cognitive development and L2 linguistic development can be manifested in linguistic production. For instance, Butler and Zeng (2015) compared the L1 and L2 performance of interactional tasks between 4th-grade and 6th-grade YLLs (ages 9-12, 24 dyads) in China. They found that 4th graders had much lower mutual engagement in both L1 and L2 tasks, while 6th graders demonstrated more collaboration. Such a cross-language-mode pattern suggested that the level of collaboration is largely due to age-related cognitive development rather than to differences in linguistic resources. Butler et al. (2017) conducted a two-year longitudinal study of L1 and L2 narrative development among 32 participants (ages 9 to 12) in China. They investigated the use of CC and MSW (for specific findings regarding MSW, see the section Theory of mind and mental state words). They found that a participant who first used CC in L1 performance would ultimately use them in L2, and a participant who used CC in L2 was observed to have already used CC in L1. Such a pattern showed that "certain cognitive maturity and sophistication of narrative constructions" in L1 may be a precondition for verbalizing causality in L2 narration (Butler et al., 2017, p. 165). To our knowledge, neither the psychometric approach nor the approach that examines the linguistic features of L1 and L2 performance across age groups has been applied to investigate YLLs' performance of CESTs under exam conditions.

RESEARCH QUESTIONS
The literature reviewed above covered the validity frameworks supporting our validation and how both the linguistic and cognitive aspects of causal explanation might develop in children. It suggests the necessity of, yet currently insufficient, investigation into how age and L2 linguistic and cognitive development might affect YLLs' L2 performance. We reviewed the methodological approaches to this issue and considered it important to apply them to the case of causal explanation speaking tasks (CESTs), as doing so helps us test the validity argument that CESTs assess YLLs' L2 speaking proficiency in a cognitively appropriate way. As there are no readily available instruments for measuring YLLs' thinking skills, we blended the two approaches reviewed above, using L1 performance of CESTs as a proxy for YLLs' cognitive development and grade level as a proxy for age. Our study aimed to answer the following questions: (1) To what extent are YLLs' L1 (Chinese) performance scores, choice of causal antecedent, and use of causal connectives and mental state words in the L1 performance of CESTs related to their grade levels? (2) To what extent are YLLs' L2 (English) performance scores, choice of causal antecedent, and use of causal connectives and mental state words in the L2 performance of CESTs related to their L1 counterparts, L2 vocabulary sizes, and grade levels?

Participants
We recruited 96 students from a public primary school in Shenzhen, China. Half of them were from Grade 4 (ages 9-10), and the other half from Grade 6 (ages 11-12). These students had 150 minutes of English lessons per week. We chose Grades 4 and 6 because our pilot study showed that lower-grade participants' successful accomplishment of CESTs depended more on the examiner's cues and encouragement, whereas 4th and 6th graders were more likely to perform CESTs independently. We asked teachers to recruit children without diagnosed learning difficulties, received written consent from both teachers and parents, and orally confirmed the child participants' willingness to participate. The Ethics Committee of the authors' institute approved this project before data collection. We used 94 participants' data for analysis because two participants from Grade 4 were found to have participated in one of the pilots.

Instruments
Causal explanation speaking tasks. We adapted causal reasoning tasks (Das Gupta & Bryant, 1989; Reed et al., 2015) to elicit participants' L1 and L2 causal explanations. These tasks were originally used to test pre-school children's ability to identify a cause that changes an object's, an animal's, or a person's physical or psychological status. Using the tasks in Reed et al. (2015) as a reference, we invited a professional visual designer to create the visuals of the CESTs. We revised the time setting and test instructions of the CESTs based on two pilots with 20 students. To ensure consistency between the Chinese and English written instructions, we translated the written instructions from English, then invited another Chinese-English bilingual researcher to translate them back into Chinese, and revised any inconsistencies. Four tasks were used in the current study: two for the familiarization stage (Figures 1 and 2) and two for the main study (Figures 3 and 4). In each task (see Figure 1 as an example), the audio instructions (identical to the written instructions) were first played automatically. After that, participants were required to say whether the animal or person in the middle (Picture X) was happy or not, select the one of the three causal antecedents (Pictures A, B, and C) that could explain the psychological status of the character in Picture X, and then explain why they chose that picture and why they did not choose the other two. Among the three causal antecedents, one cause has a strong causal link to the final state (e.g., the monkey hit the rabbit with a stone and made it angry), one has a neutral link (e.g., the monkey was eating a banana beside the rabbit), and the third has a weak link to the final state (e.g., the rabbit was dancing happily with the monkey). Nevertheless, considering that one's facial expression and mental state are subject to individual interpretation, no absolutely correct answer was set.
In other words, participants would not necessarily receive a low score if they chose the causal antecedent that bears a neutral or weak link with the final state. We evaluated their performance in terms of language and delivery, as well as content and logic (for specifics of the rating scale, see Appendix A).
Vocabulary size tests. We measured participants' receptive and productive vocabulary sizes with the Peabody Picture Vocabulary Test 5 (PPVT5) (Dunn, 2019) and the Expressive Vocabulary Test 3 (EVT3) (Williams, 2019). These scores were used as proxy indicators of participants' English oral proficiency, given the strong correlation between vocabulary size and overall English language and/or speaking proficiency (Uchihara & Clenton, 2020). Although both tests were originally designed for English native speakers in the United States, they have been used in studies investigating YLLs' vocabulary size or second language development in EFL contexts (e.g., Butler et al., 2017; Sun et al., 2018).

Procedures
Each participant did the same set of CESTs in both Chinese and English. We counterbalanced the order of language modes but not the order of task types, to partially reduce practice effects. The sequence of the two CESTs was the same across groups and grades, with the Boy task in Figure 3 first and the Dog task in Figure 4 second. We collected the data in a quiet room at the primary school. The students first confirmed their willingness to participate. To build rapport and minimize their anxiety about speaking English with a stranger, we played a game in English with them. Then, we invited them to do a set of practice tasks on the computer for familiarization. During the familiarization process, they were allowed to ask the researcher any procedural questions about the test. They were also informed that the researcher would not answer their questions during the test administration. Subsequently, each participant performed the two tasks on the computer screen in both L1 and L2, and their speaking was audio-recorded. We also recorded their eye movements and conducted stimulated recalls, but these data are reported in another manuscript (Ding & Yu, in preparation). Afterwards, they took the PPVT5 and then the EVT3. At the end of data collection, we gave every participant a small gift as a reward.

Analysis
Rating the speaking performance of causal explanation speaking tasks. We referred to the TOEFL Primary® Speaking Test Scoring Guide (ETS, 2018), the rubrics for evaluating reasoning (Brookhart, 2010), and a primary-school-level CLIL curriculum assessment design (Massler et al., 2014) to develop our rating rubrics. The rubrics cover language use and delivery (i.e., clarity and accuracy, fluency and intelligibility) and content and logic (relevance, coherence, logic, and completeness of causal explanation/reasoning). Following an iterative approach to developing speaking rating scales (McKay, 2006, pp. 293-294), we revised the prototype rating scale over three rounds based on independent ratings and discussions of 32 speech samples from 8 participants. For the final scoring rubrics, see Appendix A. To increase rating reliability, the first author and another Chinese-English bilingual researcher used the augmentation method (Penny et al., 2000) to assess each participant's task performance on the original 4-point scale (A, B, C, D). The augmented grades were converted to numeric scores from 9 to 0, corresponding to A+ down to D.
To ensure high inter-rater reliability, the two bilingual researchers independently rated the performances and discussed any differences in their scoring. If a difference remained after discussion, the two scores were averaged as the final score. The two raters' final ratings of all the sub-scores had Cohen's Kappa values ranging from 0.25 to 0.38 and Spearman's Rho from 0.51 to 0.92. 100% of the ratings showed exact or adjacent agreement (a discrepancy of less than 2 points).
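The three agreement statistics above (Cohen's Kappa, Spearman's Rho, and exact-or-adjacent agreement) can be sketched as follows. This is a minimal illustration on fabricated rating vectors, not the study's data; the kappa function is a plain implementation of the standard formula for two raters.

```python
# Sketch of the inter-rater statistics reported above, computed on made-up
# augmented ratings (9 = A+ ... 0 = D). Illustrative only.
from scipy.stats import spearmanr

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters over the same items."""
    n = len(r1)
    cats = sorted(set(r1) | set(r2))
    po = sum(a == b for a, b in zip(r1, r2)) / n                    # observed agreement
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)   # chance agreement
    return (po - pe) / (1 - pe)

rater1 = [9, 8, 8, 7, 6, 5, 9, 4, 7, 6]   # hypothetical final ratings, rater 1
rater2 = [8, 8, 7, 7, 6, 6, 9, 5, 7, 5]   # hypothetical final ratings, rater 2

kappa = cohen_kappa(rater1, rater2)
rho, _ = spearmanr(rater1, rater2)
# Exact-or-adjacent agreement: a discrepancy of less than 2 points
adjacent = sum(abs(a - b) < 2 for a, b in zip(rater1, rater2)) / len(rater1)
print(round(kappa, 2), round(rho, 2), adjacent)
```

Kappa corrects raw agreement for chance, which is why it can be low (0.25-0.38) even when exact-or-adjacent agreement is 100%: on a 10-point scale, near-miss ratings count as disagreements for kappa but as agreement under the adjacent criterion.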
Analysing the speech utterances. The same two researchers coded the causal connectives (CC) and mental state words (MSW) and calculated their frequency and variety. One coder (the first author) built the initial coding scheme according to how CC and MSW were conceptualised and operationalised in Butler et al. (2017) and Lee and Rescorla (2002). CC refers to connectives that indicate a causal relationship, such as so and because. To code MSW, we followed the classification of Lee and Rescorla (2002) and included physiological (e.g., sleepy), emotion (e.g., happy), desire (e.g., want), and cognitive (e.g., think) MSW. The frequency with which a target linguistic item occurs in one task performance indicates YLLs' ability to use it. We also calculated the variety of the target words by counting the number of different target words used in one task performance. For example, if because and so both occur, the variety of CC counts as two. The first author coded around a quarter of all the data, summarized her coding in a table, and trained and discussed it with the other coder, based on which the coding scheme was revised. The two coders then coded all the data independently. To calculate inter-coder reliability, we multiplied the total number of agreements by two and divided it by the two coders' total number of decisions (Mao, 2017). The inter-coder reliability reached 97.47%. The two coders resolved any remaining disagreements by discussion until agreement was reached.

L1 performance scores in relation to grade levels. As shown in Table 1, participants achieved similar scores in both tasks, in both sub-scores and total scores. Participants' performance all reached A or very close to A. In addition, the total score of the Boy task was slightly lower than that of the Dog task. A Wilcoxon Signed Rank Test showed that this difference was statistically significant (T = 1793.00, z = −3.356, p < .001).
As the Chinese scores were not normally distributed and grade is a binary between-group variable, we conducted a Mann-Whitney U test to check whether there was a significant difference between participants from Grade 4 and Grade 6. No statistically significant difference was observed in the Chinese scores of the Boy task between Grade 4 and Grade 6 (Table 2). For the Dog task, however, 6th graders had significantly higher scores than 4th graders in the language-and-delivery (LD) sub-score and the total score, but no significant difference in the content-and-logic (CL) sub-score. This finding suggests that the cognitive demand of the Dog task in L1 might slightly favour participants from the higher grade level. However, the difference was small, and participants from both grade levels were able to achieve a level equal to or very close to A.

As shown in Table 3, Picture A was the most chosen causal antecedent in the L1 performance of the Boy task: 76 out of 94 participants chose it. In the Dog task, the vast majority of participants (n = 90) chose Picture B. Participants from Grades 4 and 6 made similar choices in the L1 performance of both tasks (Table 4).
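The grade-level comparisons above use the Mann-Whitney U test for non-normal scores with a binary grouping variable. A minimal sketch with scipy, on invented score vectors rather than the study's data:

```python
# Illustrative Mann-Whitney U comparison of two independent groups,
# as used for the Grade 4 vs Grade 6 comparisons. Scores are fabricated.
from scipy.stats import mannwhitneyu

grade4_scores = [5, 6, 6, 7, 5, 6, 7, 5, 6, 6]   # hypothetical Grade 4 totals
grade6_scores = [6, 7, 7, 8, 6, 7, 8, 7, 7, 6]   # hypothetical Grade 6 totals

# Two-sided test: appropriate for ordinal, non-normally distributed scores
u_stat, p_value = mannwhitneyu(grade4_scores, grade6_scores, alternative="two-sided")
print(u_stat, p_value)
```

The test compares rank distributions rather than means, so it makes no normality assumption, which matches the rationale given for choosing it over a t-test here.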

L1 use of causal connectives and mental state words in relation to grade levels.
We performed a Mann-Whitney U test to explore the extent to which the L1 use of CC and MSW is related to grade level. As shown in Table 5, participants from Grade 6 used CC and MSW slightly more frequently and with slightly greater variety than those from Grade 4. However, the Mann-Whitney U test (Table 5) showed that these differences were not statistically significant, except for the frequency and variety of MSW in the Boy task. As both 4th and 6th graders' L1 performance scores reached the highest level of the rating scale, the absence of significant cross-grade differences in the L1 use of CC suggests that YLLs in these two age groups have sufficient cognitive ability to use CC in their L1. We can draw a similar conclusion about the use of MSW in the Dog task. The different findings on the use of MSW between the Boy task and the Dog task suggest that the Boy task might place a higher cognitive demand on verbalising MSW. (Note: N/A in the tables of causal-antecedent choices means the researcher was not able to tell which picture was chosen based on the participant's task performance.)

L2 performance scores and L2 vocabulary sizes.
Table 6 presents the L2 performance scores of the two tasks. Overall, participants' L2 performance was at a similar level across the two tasks; the means of the sub-scores ranged from 4.60 to 5.51. The English scores were significantly lower than the Chinese scores (see Table 2 for the Chinese performance scores), as shown by the Wilcoxon Signed Rank Test statistics in Table 7. In addition, the L2 scores of the Boy task were significantly lower than those of the Dog task (Table 8). Participants' average raw score on the English receptive vocabulary size test was 46.06 (SD = 15.73), which was higher and more widely dispersed than their productive vocabulary size (M = 31.91, SD = 9.48). The average total English vocabulary size score was 77.98 (SD = 23.10).
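The paired L1-versus-L2 comparisons reported via the Wilcoxon Signed Rank Test can be sketched as follows; this assumes SciPy and uses invented paired scores (not the study's data), with each pair representing one learner performing the same task in L1 and L2:

```python
from scipy.stats import wilcoxon

# Hypothetical paired scores: the same ten learners rated on one task
# in L1 and in L2 (illustrative values only).
l1_scores = [6, 6, 5, 6, 6, 5, 6, 6, 5, 6]
l2_scores = [5, 4, 5, 5, 4, 4, 5, 5, 3, 4]

# Wilcoxon signed-rank test for paired, non-normal scores; pairs with
# a zero difference are dropped by default (zero_method="wilcox").
t_stat, p_value = wilcoxon(l1_scores, l2_scores)
print(f"T = {t_stat}, p = {p_value:.3f}")
```

Because every nonzero difference here favours L1, the T statistic (the smaller signed-rank sum) is 0, mirroring the paper's pattern of L1 scores exceeding L2 scores.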

L2 scores in relation to L1 scores, L2 vocabulary sizes, and grade levels.
Multiple linear regression was conducted to analyse how English task performance might be related to Chinese task performance, English vocabulary sizes, and grade level. Regression Model 1 concerns the Boy task, and Regression Model 2 the Dog task. The datasets for the two models met the assumptions for regression analysis, namely normally distributed residuals, homoscedasticity, and the absence of multicollinearity.
The language-and-delivery sub-score of L1 performance, English productive vocabulary size, and grade level are significant predictors and have a significant collective effect on the English performance of the Boy task (F(5, 88) = 19.068, p < .001, R² = .520; see Table 9). The content-and-logic sub-score of L1 performance and English receptive vocabulary size are not significant predictors. As for the Dog task (see Table 10), the language-and-delivery and content-and-logic sub-scores of L1 performance, English productive vocabulary size, and grade are significant predictors of the L2 performance score. These predictors explain 54% of the variance in the L2 performance score (F(5, 88) = 20.695, p < .001, R² = .540). As in the Boy task, receptive vocabulary size is not a significant predictor.
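The overall model fit reported above (an R² and an F statistic for a five-predictor model) can be sketched with synthetic data. All variable values, names, and effect sizes below are invented for illustration; only the model structure mirrors the paper's:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 94  # matches the residual df reported as F(5, 88)

# Hypothetical predictors mirroring the five in the study's models.
l1_ld   = rng.normal(5, 1, n)     # L1 language-and-delivery sub-score
l1_cl   = rng.normal(5, 1, n)     # L1 content-and-logic sub-score
recept  = rng.normal(46, 15, n)   # receptive vocabulary size
product = rng.normal(32, 9, n)    # productive vocabulary size
grade   = rng.integers(0, 2, n).astype(float)  # 0 = Grade 4, 1 = Grade 6

# Synthetic outcome: L2 score driven by L1 delivery, productive
# vocabulary, and grade, plus noise (invented coefficients).
y = 0.6 * l1_ld + 0.05 * product + 1.5 * grade + rng.normal(0, 1.5, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), l1_ld, l1_cl, recept, product, grade])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Overall-model F statistic with k predictors and n - k - 1 residual df.
k = X.shape[1] - 1
f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
print(f"R^2 = {r_squared:.3f}, F({k}, {n - k - 1}) = {f_stat:.2f}")
```

Residual normality, homoscedasticity, and multicollinearity checks would be run on `resid` and `X` before trusting such a model, as the paper describes.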

The L2 choice of causal antecedent in relation to L1 choice, L2 vocabulary sizes, and grade levels.
In the L2 performance (Appendix B), most participants (n = 75) chose Picture A in the Boy task, and Picture B (n = 83) in the Dog task. This is similar to the result of L1 performance. It seems that the major difference between L1 (see Table 3) and L2 performance of the same task was that more participants did not mention their choice in the L2 performance.
One surprising finding was that the Boy task elicited a slightly more varied set of causal-antecedent choices than the Dog task in both L1 and L2 performance. In the Boy task, certain participants may have struggled between Picture A, which bears a strong causal relationship with the final state, and Picture B, which was designed to bear a neutral causal relationship with the final state. For example (see Excerpts 1 and 2), although Annie (a pseudonym, as is Bob) mentioned that the boy in both Picture A and Picture B is not happy, she chose Picture A. By comparison, Bob explained that the boy in both Picture A and Picture B faced certain challenges, but he decided that Picture B would make the boy unhappy. Such potentially competing relationships between causal antecedents did not exist in the Dog task.

Excerpt 2 (by Bob from Grade 4):
The boy is sad. I choose B because the boy is cannot, do the problem. I don't choose C because the boy and another boy is playing the car. I don't choose A too because the teacher said (15s bell rings) don't, don't play game.
In addition, we found that both whether participants made a choice and which picture they chose may have had some impact on their content-and-logic sub-scores. In both tasks, participants who did not make a choice tended to receive much lower content-and-logic scores than those who did. In the Boy task, participants who chose Picture C, which has a neutral causal relationship with the final state, also tended to receive much lower content-and-logic sub-scores. Furthermore, participants who did not state their choice in the L2 performance of either task had much smaller L2 receptive and productive vocabulary sizes, suggesting that whether participants stated their choice is related to their L2 proficiency. As in the L1 performance, participants across the two grades made similar choices in the L2 performance of both tasks (Table 4).

L2 use of causal connectives and mental state words in relation to the L1 counterparts, grade levels, and L2 vocabulary sizes.
As shown in Appendix C, the L1 performance of CESTs contained much more frequent and varied use of CC and MSW than the L2 performance. We also noticed that if a CC occurred in a YLL's L2 performance, a similar CC was also found in that YLL's L1 performance; conversely, if no causal connective was identified in the L1, none was identified in the L2 either. Such an observation is consistent with Butler et al. (2017). Yet this pattern was not evident in the use of MSW.
As the frequency and variety of CC and MSW were not normally distributed, we performed a Wilcoxon Signed Rank Test to analyse how they differed across L1 and L2 performance. The statistics (Table 11) showed that both the frequency and variety of CC and MSW in the L1 performance (except for the variety of CC in the Boy task) were significantly higher than in the L2 performance in both tasks.
In the L2 performance (see Table 12), 6th graders made significantly more frequent and varied use of CC than 4th graders. Considering that participants from the two grade levels used CC in L1 with similar frequency and variety, this significant difference in L2 performance is highly likely to be related to the difference in their L2 proficiency. This is further evidenced by the finding that the frequency and variety of CC had significant positive correlations with L2 productive vocabulary size and grade level in both tasks (Tables 13 and 14). Moreover, 6th graders used significantly more frequent and varied MSW in their L2 performance than 4th graders, except for the variety of MSW in the Dog task. The frequency and variety of MSW in the Boy task and the frequency of MSW in the Dog task also had significant positive correlations with L2 productive vocabulary size and grade level. Given the significant cross-grade differences identified in the L1 use of MSW, we can conclude that both L2 and cognitive development may have contributed to the L2 use of MSW, but that the Boy task may place higher cognitive demands on verbalising mental states than the Dog task.
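Given the ordinal and non-normal measures involved, a rank-based correlation is a natural fit for the associations reported above. The sketch below uses Spearman's rho, which is an assumption about the method rather than a claim about the authors' exact procedure, and all data values are invented:

```python
from scipy.stats import spearmanr

# Hypothetical per-learner measures (illustrative values only):
# CC frequency in one L2 performance, and productive vocabulary size.
cc_frequency     = [0, 1, 1, 2, 2, 3, 3, 4, 2, 1]
productive_vocab = [20, 25, 28, 33, 35, 40, 42, 45, 30, 27]

# Spearman's rank correlation tolerates ties and non-normality.
rho, p_value = spearmanr(cc_frequency, productive_vocab)
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```

A positive rho here would correspond to the paper's pattern of more frequent CC use accompanying larger productive vocabularies.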

DISCUSSION
This study aimed to investigate the effects of YLLs' cognitive and L2 development on their performance in CESTs, in order to explore validity evidence that CESTs are cognitively appropriate language tests for YLLs. We examined how different facets of L1 performance (scores, picture choices, and speech utterances) are related to grade levels, and how different facets of L2 performance are related to their L1 counterparts, grade levels, and L2 vocabulary sizes. We found that 4th and 6th graders received similar scores in their L1 performance of the Boy task, and similar content-and-logic sub-scores in their L1 performance of the Dog task. The majority of L1 performances (95%) received a top score. Despite the small variance in YLLs' levels of cognitive development manifested in the L1 performance scores, most participants from the two grade groups were able to accomplish the tasks in L1 successfully. By examining the L1 speech utterances, we found that 4th and 6th graders used CC in L1 with similar frequency and variety in both tasks, which suggests that participants from both grade levels have sufficient cognitive ability to verbalise causality in their L1. Regarding MSW, 6th graders used them with significantly higher frequency and variety than 4th graders in the L1 Boy task, but similarly in the L1 Dog task, indicating that the cognitive demand of verbalising mental states may be higher in the Boy task than in the Dog task. Taking together the findings on L1 performance scores and L1 speech utterances, we can conclude that the cognitive demand of CESTs is generally appropriate and manageable for YLLs from Grade 4 and Grade 6, but we need to be aware that YLLs from lower grade levels may be less able to verbalise mental states. Alternatively, it could be that CESTs per se differ in cognitive demand: some demand more than others in verbalising mental states, and are thus more challenging for YLLs from lower grade levels.
We found that L2 productive vocabulary size was a significant predictor of the L2 performance scores of both CESTs. Given the strong predictive power of productive vocabulary size for speaking proficiency (Uchihara & Clenton, 2020), these findings offer robust evidence for using CESTs to assess YLLs' speaking proficiency. Meanwhile, it would be desirable to obtain more evidence by comparing the scores of CESTs with the scores of independent speaking tests (e.g., Huang et al., 2021). This was in our original plan, but due to the closure of the primary school in China during the pandemic, we were not able to administer the independent speaking test. In addition, we found it hard to differentiate the effects of L2 proficiency from those of cognitive development on L2 CEST performance using performance scores alone. This could be caused by the design of the rubrics, in which the relevance and completeness of content are mixed with the logic of causal explanation under the category content and logic. This makes it hard to distinguish a speech that contains a relatively complete description of all the pictures but no CC from a speech that contains an incomplete description but more frequent use of causal connectives. Consequently, it is difficult to pinpoint the exact reasons why the content-and-logic sub-score of L1 performance is an insignificant predictor of L2 performance in the Boy task but a significant one in the Dog task. On reflection, we think we should have a separate assessment criterion for logic, an indicator of thinking skills, that is more clearly distinguishable from content. Additionally, our finding that the L1 language-and-delivery sub-score was a statistically significant predictor of L2 speaking performance lends further support to Kormos's (2006) claim that L1 and L2 speech production share the same underlying cognitive mechanisms, such as long-term memory and the conceptual knowledge stored in it.
However, we should be aware that the strong positive correlation between L1 and L2 speaking performance can be explained by the fact that YLLs' L1 and L2 speaking performance was assessed by the same set of tasks rather than by separate tasks. It is highly likely that they translated their conceptual understanding of the same tasks from L1 to L2.
Examining the L2 speech utterances alongside the L1 speech utterances across the two grade levels was helpful for teasing apart the roles that YLLs' L2 and cognitive development might have played in their task performance. The findings show that the frequency and variety of CC and MSW were significantly lower in the L2 than in the L1 performance, except for the variety of CC in the Boy task. The frequency and variety of the L2 use of CC and MSW were significantly higher in Grade 6 than in Grade 4, and they had significant positive correlations with L2 productive vocabulary size and grade level, except for the variety of MSW in the L2 Dog task performance. Combining these findings with the findings from the L1 speech utterances, we can conclude that differences in the L2 use of CC were mainly related to variance in L2 proficiency, whereas differences in the L2 use of MSW were related to both cognitive and L2 linguistic development, which corresponds to Wellman's (2014) observation that school-age children's theory of mind is still developing. These findings further suggest that test designers can include the use of CC as an important component in the rating scale of CESTs because (a) YLLs were found to have sufficient cognitive ability to use CC in their L1, and (b) the variance of CC in L2 performance was found to be highly correlated with L2 speaking proficiency. Regarding MSW, we would make some tentative suggestions to test designers: (a) rate and compare YLLs' L2 performance only against performances from a similar age group; (b) include the developmental features of MSW in the rubrics, but do not require the highest level of performance to show sophisticated use of MSW, e.g., YLLs should be able to use "some" rather than "a variety of" MSW; and (c) design CESTs that require causal explanation of other conceptual domains that the target age group has sufficient cognitive ability to explain.
Another noteworthy finding is the unexpected difference between the Boy task and the Dog task. The two CESTs were designed to have similar conceptual demands and to generate similar speech utterances. Our findings, however, show that the Boy task elicited a more varied set of causal-antecedent choices than the Dog task. Picture A and Picture B of the Boy task might both serve as causal antecedents, though Picture B was originally designed as a neutral choice, and some YLLs might have hesitated between them when accomplishing the Boy task. If test designers want to reduce the cognitive demand of CESTs, they may carefully control the degree of causal relationship that each causal antecedent has with the final state, in order to reduce the likelihood that YLLs waver between two potentially competing though seemingly legitimate choices. In addition, we must acknowledge the sequence effects in the study: test-takers accomplished the Boy task before the Dog task, though we counter-balanced the order of language mode. This may partially explain why the mean performance scores of the Dog task were higher than those of the Boy task. Future studies should fully counter-balance the tasks to minimize sequence effects.

CONCLUSION
To respond to the EFL education curriculum reform, we developed CESTs to capture both thinking and linguistic skills. Following the argument-based approach (Chapelle et al., 2008; Kane, 2013) and the consideration of YLLs' characteristics in Weir's socio-cognitive framework (Field, 2018), we believe that teasing apart the roles of L2 proficiency and cognitive development in L2 performance can offer strong validity evidence for using CESTs to assess YLLs' L2 speaking performance. We examined different sources of performance data, including performance scores, the choice of causal antecedents, and the speech utterances. We conclude that CESTs can be an effective speaking assessment because we consistently identified modest significant correlations between CEST L2 performance scores and L2 productive vocabulary size. We also conclude that our CESTs are generally cognitively appropriate as L2 assessment tasks for YLLs because both 4th and 6th graders' L1 performance scores reached the highest level of the rating scale. Nonetheless, we pointed out certain caveats for test designers and researchers, including how to improve the design of the rating scale. Additionally, YLLs across the two age groups in our study were found to have sufficient cognitive ability to verbalise causality and use CC as required by the tasks. Yet we found that different CESTs had varying levels of conceptual demand for choosing a causal antecedent and verbalising the relevant conceptual domain (e.g., mental states). A CEST might be cognitively less appropriate for an age group whose relevant cognitive ability (e.g., theory of mind) is still developing. The findings of our study are limited to CESTs, but the way we collected validity evidence may shed light on the validation of assessment tasks for YLLs in general.
To make language assessment tasks more cognitively appropriate for YLLs, it is important to examine different sources of performance data in both L1 and L2 across different age groups, and to be fully aware of how different cognitive dimensions of a task might exert different levels and kinds of conceptual demand on YLLs.

Appendices

Appendix A. Rating Scale

Score A

Language use and delivery
• The meaning is clear. Most sentences are complete and correct, most words are used in a correct and appropriate way, and there is even some lexical variety. Minor errors in grammar or word choice do not affect task achievement.
• Speech is intelligible, and the delivery is generally fluid. It requires minimal listener effort for comprehension.

Content and logic
• The explanation is supported by clear description of three causal antecedents and the final state, as well as by some details about the characters' psychological status or actions.
• Explanation of the events is organized in a logical manner, with varied use of connective devices, e.g., because, so, and, but.

Score B

Language use and delivery
• The meaning is mostly clear. There are some complete sentences with some correct use of simple words. Some errors in grammar or word choice may interfere with task achievement.
• Speech is generally intelligible, but the delivery may be slow, choppy, or hesitant. It requires some listener effort for comprehension.

Content and logic
• The explanation is supported by mostly clear description of at least two of the causal antecedents and the final state. The explanation is brief and lacks details about the characters' psychological status or actions.
The English words following the Chinese are the English translation of the Chinese mental state words.