The trade-off between STEM knowledge acquisition and language learning in short-term CLIL implementations

ABSTRACT Bilingual education could solve many challenges introduced by an increasingly internationalised education system. Content and Language Integrated Learning (CLIL), in particular, may equip students with the necessary cultural and communicative skills to succeed in today’s academic environment. However, it is not yet clear how CLIL can be employed effectively in short-term educational contexts where full-term bilingual programmes are not feasible. We designed and assessed a one-day CLIL module for ninth graders at our university’s gene-technology lab. The assessment of our module with 252 grammar school students indicates that a CLIL module does not achieve the same learning success as an equivalent non-CLIL module. Even with additional language scaffolding material, full access to online dictionaries, and the availability of crucial workbook passages in their native language, CLIL students could not achieve the same short-term content learning success. We consequently argue that more attention should be paid to the inherent trade-off between language and content learning when carrying out short-term CLIL programmes. Moreover, we caution against using only content and language scaffolding to mediate this trade-off.


Introduction
Appropriate English language skills have become a requirement for most professions (Sardegna et al., 2017). In higher education, they are often needed for lectures, scientific publications, or essays and theses (Earls, 2013; Lanvers, 2018). Despite the high relevance of English as a lingua franca (Oktaviani & Fauzan, 2017), students often lack appropriate English language competencies (Zenner-Höffkes et al., 2021). In response, the European Commission (2004) has promoted the introduction of Content and Language Integrated Learning (CLIL) in grammar school teaching (Finkbeiner & Fehling, 2006).
CLIL is understood to be an innovative approach to foreign language learning (Tarasenkova et al., 2020). It retains the focus on content learning but delivers content in both the native (L1) and the foreign language (L2) (Eurydice, 2005; Coyle et al., 2010). The strategic combination of the L1 and L2 increases L2 exposure and ensures that students understand the input regardless of their L2 proficiency (Krashen, 1982; Lin, 2015). The aim of CLIL is to 'deepen awareness of both [the native] and target language' through a plurilingual approach (Marsh et al., 2001, p. 16). Unlike many immersion models (e.g. Wode, 1995), the L2 is an important didactic element in CLIL (Coyle et al., 2010).
Effectively combining the L1, L2, and scientific concepts in CLIL requires well-planned curricula and extensive scaffolding (Lin, 2015). Scaffolding is 'a type of teacher assistance that helps students learn new skills, concepts, or levels of comprehension of material' (Maybin et al., 1992, p. 188). It empowers students to construct knowledge from guided experience in a bottom-up approach independent of the instructor (Roehler & Cantlon, 1997). These qualities make scaffolding particularly suited for less structured and more interactive learning environments (Prawat & Floden, 1994; Korthagen & Lagerwerf, 1995). Prominent examples include experiments or discovery activities in science subjects (Lin, 2015; Marsh et al., 2001). In addition to content learning, these learning environments encourage discussion amongst students and require the use of scientific terminology in both L1 and L2 groups (Klieme et al., 2010; Lemke, 1990; Meyerhöffer & Dreesmann, 2019). This makes science subjects ideal testbeds for CLIL implementation (Rodenhauser & Preisfeld, 2014).
Despite the potential benefits of CLIL for science subjects, its adoption falls short in Germany. One reason for this slow uptake appears to be mixed results for content learning in many CLIL studies (e.g. De Dios Martínez Agudo, 2019; Piesche et al., 2016). These studies generally report positive effects of CLIL on language learning, but they also observe significantly lower scores for content learning (Koch & Bünder, 2006; Meyerhöffer & Dreesmann, 2019). Moreover, many grammar school teachers in Germany still study a combination of two science subjects, such as Biology and Chemistry, instead of a science subject and a foreign language, such as English. This leaves few teachers with the opportunity to teach CLIL classes (Sylvén, 2013; Vásquez et al., 2020). Additionally, an ambitious and crowded curriculum restricts the time that teachers can dedicate to experimentation and discovery activities. They tend to relegate science teaching to regular classrooms to maximise content learning, which leaves little room for CLIL (Itzek-Greulich et al., 2014).
Previous research has already tried to tackle this problem by offering science laboratories outside of regular classrooms. In Germany, these laboratories are typically located at universities, which allows teachers and their classes to spend a full day in an authentic learning environment focused on laboratory learning. However, only two of more than 443 laboratories offer practical CLIL experimentation in a science subject (Schülerlabor Atlas, 2022). Whilst their results are promising and indicate that even one-day interventions can positively influence content and language learning (e.g. Buse et al., 2018; Rodenhauser & Preisfeld, 2014), more research is required.
Inspired by Rodenhauser and Preisfeld (2014) and Buse et al. (2018), we designed and piloted a one-day CLIL university laboratory module for ninth graders of Bavarian grammar schools with a focus on genetics. The module builds on an established non-CLIL laboratory module that successfully combines experimentation and model learning.
We retained the overall structure of the module and used the previous non-CLIL learning group as a comparison group (Roth et al., 2020). In line with previous research, we expect a general trade-off between content knowledge and language learning. Drawing on cognitive load theory, we hypothesise that the cognitive capacity required for language learning will limit the acquisition of content knowledge, even when extensive content and language scaffolding is provided (e.g. Coyle, 2007; Grandinetti et al., 2013). To mediate this trade-off, CLIL modules need to reduce the amount of content knowledge to accommodate the cognitive effort required for learning content in another language. In the following, we first outline the relevant theoretical background, including the basic principles of content and foreign language learning, and examine how they can be successfully combined in CLIL education. Additionally, we explain the benefits of practical experimentation, which has already proven its value in content learning. We then elaborate on the different phases of our laboratory experimentation and explain how we measured content learning. We conclude with a critical discussion of our results and relate them to previous studies.

Research context
In the fall of 2020/2021, over 180 000 students (61 000 female) enrolled in STEM undergraduate courses at German universities (Statistisches Bundesamt, 2021a). Another 22 000 students pursued their studies in various international study programmes and many students selected study programmes that require professional English language competences, such as European Studies, International Relations, and Information Systems (Statistisches Bundesamt, 2021b). Moreover, in the past two decades, higher education in Germany has developed an increasingly international curriculum. English has gained a strong foothold in this curriculum as the language of sciences, and most study programmes use peer-reviewed and internationally published articles as reference literature for their lectures (Earls, 2013; Lanvers, 2018). These articles are used as the foundation of term papers, scientific essays, or theses (Ammon, 2001; Gürtler & Kronewald, 2015). STEM undergraduate courses like Biology seminars, for instance, often require students to analyse scientific articles and present their findings to their peers in English (Gürtler & Kronewald, 2015). Yet, many students lack the required English language competencies (Zenner-Höffkes et al., 2021). To address this lack and promote a common language across the EU, the European Commission published several guidelines and briefs on how to best incorporate English language learning in schools across Europe (European Commission, 2003; European Commission, 2012; European Commission, 2017).
In over 18 EU member states, English has become a compulsory language. Most of these countries consider CLIL essential for improving English language practice without further straining the already crowded curriculum (European Commission, 2003; European Commission, 2012; European Commission, 2017). In Germany, for instance, CLIL has gained particular traction in grammar schools. Grammar schools aim to equip students with the necessary skills to access colleges or universities. Before students enter grammar schools, they have typically acquired advanced literacy skills in their native language so that CLIL instruction can be implemented effectively (Feddermann et al., 2021). CLIL is realised in either bilingual strands or singular CLIL modules. The modules do not have a fixed timeframe and can even be single lessons (Krechel, 2013).

Foreign language and content learning
Unlike immersion models, CLIL follows the Pluriliteracies Teaching for Learning (PTL) approach, which aims to develop scientific literacy and content learning in more than one language (Meyer et al., 2015; Meyer & Coyle, 2017). In CLIL, the native L1 and the foreign L2 receive equal attention in the development of scientific literacy and are as important as content learning (Cammarata & Haley, 2017; Poza, 2016). CLIL is guided by two of Krashen's (1982) hypotheses on the importance of language for the construction of meaning. The maximum input hypothesis proposes maximum exposure to the L2 to promote second language learning. The comprehensible input hypothesis highlights meaningful exposure to content in the L2 combined with the occasional use of the L1 to support learning.
This complex and mutually dependent relationship between L1, L2, and content (Yore & Treagust, 2006) is reflected in Cummins' (1979) developmental interdependence hypothesis. He postulates that the L1 must exceed a certain threshold of proficiency to (1) enable higher-order thinking, (2) allow for adequate L2 acquisition, and (3) facilitate content learning in the L2. For L2 learning, Cummins (1979) distinguishes between everyday language and scientific language. He identifies them as Basic Interpersonal Communicative Skills (BICS) and Cognitive Academic Language Proficiency (CALP) respectively. Scientific literacy, as a core achievement of CLIL (Meyerhöffer & Dreesmann, 2019), requires both BICS and CALP. Yet, combining their teaching can be challenging (Cummins, 1979).
One way to navigate this challenge is to engage students 'in authentic communication through the use of hands-on tasks […] related to everyday experiences' (Buxton et al., 2008, p. 501). Hands-on tasks require information to be presented in a coherent and accessible manner, which stimulates CALP and BICS simultaneously (Gonzalez-Howard & McNeill, 2016). Argumentation during hands-on tasks fosters scientific literacy in both the L1 and L2 (Gonzalez-Howard & McNeill, 2016; Oga-Baldwin, 2019; Walker & Sampson, 2013). For CLIL, this combination can mediate between external stimuli, internal processes and academic goals (Lam et al., 2012). Lin (2015) incorporates this idea in a scaffolding strategy called the Multimodalities/Entextualization Cycle (MEC). The MEC encompasses three distinct phases of active student engagement with language and content. The first phase uses multimodalities such as videos, diagrams, experiments, or discovery activities to create a 'rich experiential context'. The second phase requires the student to combine everyday L1 or L2 to explore the topic and switch between multimodalities while 'engag[ing] in reading and note-making'. The third phase encourages the student to use L1 and L2 scientific terminology to, for instance, explain and evaluate an experimental design with the help of language scaffolding material (Lin, 2010, 2015).
The 'rich experiential context' and the individual learning approaches of the MEC are rooted in the constructivist model of learning. Constructivism is a theory of learning in which learners actively construct their knowledge from experience. This framework for learning requires more interactive learning environments than the previous transmission model (Prawat & Floden, 1994; Korthagen & Lagerwerf, 1995) and is at the very heart of CLIL (Ting, 2010). Building on the Five Es (Engagement, Exploration, Explanation, Elaboration, and Evaluation; Bybee, 1997; Bybee & Powell, 1993), the constructivist model of learning has the power to explain not only content learning but also language learning. Much like L2 acquisition (Gonzalez-Howard & McNeill, 2016; Lin, 2015; Oga-Baldwin, 2019; Walker & Sampson, 2013), constructivist learning requires engagement or motivation. Students need to be interested in the content, but they also need to actively construct meaning from their experience (Boddy et al., 2003).
Learning, however, is more than just an outcome of experience and one method does not work for all students (Hodson, 2014), especially in a language learning context (Gonzalez-Howard & McNeill, 2016). Experience is highly individual and may be influenced by factors like prior knowledge and exposure, or different learning styles (Lee et al., 2015; Lin, 2015). The individuality and heterogeneity of experience can be accounted for in different ways, such as using an inquiry learning approach (Vygotsky, 1971). This approach allows each student to choose their preferred learning style. Inquiry learning is commonly understood as a 'bottom up' approach that gives learners agency to create knowledge through observation and experimentation with the teacher acting as a guide (Donaldson & Allen-Handy, 2020; Rocard et al., 2007). This does not necessarily mean that learners can freely decide on learning content, since curricula predefine a clear set of learning goals. Students can, however, choose methods for their individual learning. They are given the agency to shape their own classroom experience and create meaning through social construction (Renninger et al., 2018).

Content and language integrated experimentation
Hands-on experiments in an educational setting can offer an engaging learning environment required for CLIL and rooted in constructivist learning and inquiry (e.g. Buse et al., 2018; Lin, 2015). During the experiments, students can either follow a predefined laboratory procedure and discuss their learnings, or transfer it into another form of representation, and design their own experiments (Gardner & Elliott, 2014; Tobin, 1990). Either way, the processes involved in experimentation satisfy requirements for the Five Es and engage the MEC for the development of scientific literacy in both the L1 and L2 (Bybee, 1997; Bybee & Powell, 1993; Lin, 2015). Hands-on experimentation can also lead to intense scientific discourse that can help students to not only transfer their content knowledge but encourage them to interactively practise appropriate scientific terminology (Honeycutt-Swanson et al., 2014; Kelly, 2007). Moreover, it can engage students in science (Lovey & Riggs, 2019).
Whilst hands-on experiments have been shown to positively influence content learning across various settings (e.g. Fernández-Fontecha et al., 2020; Kelly, 2007; Mierdel & Bogner, 2019), the results for combinations with CLIL remain ambiguous. A study by Evnitskaya and Morton (2011), for instance, highlights the benefits of putting students into the role of observers, constructors, and critiquers during experimentation. Using both L1 and L2, they can learn how to communicate their findings to different audiences. Studies by Rodenhauser and Preisfeld (2014) and Buse et al. (2018), however, show that CLIL does not provide substantial benefits over regular non-CLIL experimentation. Piesche et al. (2016) even found that CLIL negatively impacts content learning. Most of these studies based their insights on long-term observations of CLIL strands. Very few studies have investigated content learning (Meyerhöffer & Dreesmann, 2019) and even fewer have examined the feasibility of CLIL-based experimentation and scientific modelling in a one-day setting.

Objectives of the study
The present study seeks to explore the potential benefits of incorporating a Content and Language Integrated Learning (CLIL) approach in science education. By investigating the influence of a short-term CLIL science module during a hands-on laboratory experience, this research aims to shed light on how CLIL can enhance students' content learning. Furthermore, comparing the content learning outcomes between the CLIL group and the non-CLIL group will provide valuable insights into the effectiveness of CLIL as a teaching method. Thus, our study focuses on the following research questions:
1. How does a short-term CLIL science module influence content learning during a hands-on laboratory?
2. How does CLIL influence content learning outcomes when compared to the non-CLIL group?
The specific goals were three-fold: (1) to assess students' overall content learning throughout the laboratory experience, (2) to determine potential differences between non-CLIL and CLIL groups by comparing their content learning scores, and (3) to examine correlations between CLIL learners' content learning performance and their Biology and English grades.

Intervention phases
Our study builds on a one-day gene-technology laboratory module which has been developed in various iterations since 2016 (Goldschmidt & Bogner, 2016; Langheinrich & Bogner, 2016; Mierdel & Bogner, 2019; Roth et al., 2020). The structure of the last iteration was the basis for our non-CLIL comparison group. We also retained its overall structure and experimental and modelling phases for the design of our CLIL module (Roth et al., 2020). However, we used English as the language of instruction (English texts and workbooks) and a separate vocabulary exercise book for language scaffolding. These small modifications allowed us to explore the effects of CLIL whilst ensuring treatment comparability (Table 1).
The CLIL (and the non-CLIL) module was designed for ninth graders of Bavarian grammar schools and focused on the structure of DNA. Learning activities followed the Five Es of the constructivist model of learning (Bybee, 1997; Bybee & Powell, 1993) and each phase of the module focused on a different aspect of the structure and function of DNA. Short theoretical introductions provided the 'hooks' for further exploration in the experimentation and modelling phases. Students were encouraged to take notes on their observations during the experimentation and modelling phases. Open questions in the laboratory manual additionally required them to find explanations for their observations (ESM 1). A short recapitulation after each phase and a separate evaluation phase for the modelling activities were provided to help students contextualise their content knowledge.
The different phases of the module each reflected the three scaffolding cycles of the MEC (Lin, 2015). For the experimentation and modelling phases, the laboratory manual provided visuals, process models, and diagrams to make the topic more accessible independent of language proficiency (ESM 2). Group work on the different phases and open questions in the laboratory manual encouraged students to use both the L1 and L2 and train BICS and CALP (Cummins, 1979) while exploring the structure of DNA. English was used for instruction and was encouraged for communication between students. However, we included German translations (code-switching) for key vocabulary in the laboratory manual (Cheshire & Gardner-Chloros, 1998). An additional vocabulary exercise book, which included one page for each phase, provided the relevant scientific terminology. Students could use the vocabulary to answer the open questions or to discuss and evaluate the experimentation and modelling phases. The questions were in English, but answers were accepted in both English and German. This approach was consistent with a supportive lexical focus on form (FonF) in CLIL environments (Morton, 2015, p. 256).
The instructor, who has a background in both English and Biology, acted as a guide throughout the module and provided demonstrations of key experimental steps prior to experimentation. For the theoretical phases and for the final interpretation phase, the instructor used an interactive smartboard presentation to engage students and a poster on gel-electrophoresis with clozes to explain the procedure (Table 1). Throughout the two months of interventions at the university laboratory, three to four classes participated on separate days each week. Students always remained in their respective classes and worked in pairs.
1. Pre-lab phase. Many students had little prior experience in laboratory procedures. In this phase, the instructor used an interactive smartboard presentation with visuals and demonstrations to familiarise students with the different laboratory techniques and concepts related to experimentation (e.g. Sarmouk et al., 2019).
2. DNA-related theoretical and experimental phases. We introduced each experimental phase with a short overview of the equipment, experimental procedures, and underlying concept. To engage students, we invited them to solve a murder mystery using evidence (saliva) similar to that which the assailant left on the coat of the victim's spouse (DNA relevance, Table 1). A closer look at the composition of saliva in an interactive smartboard presentation reactivated students' prior knowledge about cells and introduced the concept of DNA. After DNA extraction, students received an introduction to gel-electrophoresis. With the help of a poster and a demonstration of the agarose gel preparation, the instructor explained the techniques of gel-electrophoresis.
3. Experimental Phases. We used an evidence-based, two-step approach in this phase (Roth et al., 2020): Students answered questions in their laboratory manuals and then worked in pairs to explore possible approaches to solve the problem before carrying out the experiments. On completion of the experiments, students took notes, preferably in English, about their observations and answered open questions to clarify the explanations of their observations. This reflective writing (Kovanović et al., 2018; Wilmes & Siry, 2019) encouraged the students to rethink and reassess the steps in experimental procedures (Mierdel & Bogner, 2019; for details, see below). Instead of simply following instructions to complete their tasks, they used their cognitive abilities to make sense of the experiments (Mierdel & Bogner, 2019).
4. Model-related Phases. After a short lunch break, the students entered the model-related phases. During these phases, they were given the opportunity to build on the knowledge of the experimentation phases and consolidate their knowledge through model building and evaluation (Bybee, 1997; Bybee & Powell, 1993). To help students understand modelling in science, we divided our model-related phases into two modelling and two evaluation phases. The modelling phases involved mental modelling based on the analysis of an original English letter from Crick to his son (Usher, 2013) and a modelling phase using craft materials. Model evaluation included both model evaluation-1 and model evaluation-2 (Table 1).
Our four model-related phases were adapted from the four stages of modelling defined by Justi and Gilbert (2002). Students first gathered information on the structure of DNA. Based on an analysis of Crick's letter, which contained metaphors describing the structure of DNA (Usher, 2013), they were given the opportunity to construct a mental model (model phase 1, Table 1). This mental model served as a basis for a physical model built from craft materials (model phase 2, Table 1). In the final phase, students identified limitations of their first model (Justi & Gilbert, 2002).
Students evaluated their model in a reciprocal self-evaluation mode (evaluation-1 phase) by comparing their craft DNA models with a paper-and-pencil version of the model. Additional open-ended questions in the laboratory manual on the structure of DNA supported this self-evaluation process (Roth et al., 2020). In the model evaluation-2 phase, students used a commercially available DNA demonstration model to assess the quality of their hand-crafted DNA models.
5. Interpretation Phase. The interpretation phase began with the result of the modelling and evaluation phases. The instructor used an interactive smartboard presentation to review the different modelling phases and presented the original DNA model created by Watson and Crick. The presentation aimed to raise students' awareness about the different DNA models, which often vary in their level of detail. The instructor showed the results of the gel-electrophoresis and explained the reason for the formation of the different bands on the electrophoresis gel.

Language scaffolding
We designed a language scaffolding exercise book with scaffolding exercises for each intervention phase (for examples, see ESM 3). It contained language-specific riddles in the form of different word search puzzles or crossword puzzles. Moreover, students were required to match definitions with words and select or provide their appropriate translations. Students were allowed to use English-English and English-German dictionaries when necessary.
In addition to the language scaffolding exercise book, we included short translations or explanations for key scientific vocabulary in the laboratory manuals and interactive smartboard presentations. We also used code-switching for our interactive poster with clozes to explain the procedure of gel-electrophoresis. The text on the poster was written in English. New scientific vocabulary, which required more in-depth explanations, was omitted from the text and printed on magnetic strips. The magnetic strips also included German translations of the scientific vocabulary. Where explanations in English were unclear, the instructor switched to German upon request.
Students were encouraged to approach the instructor or a junior assistant to ask questions. The research assistants were proficient in English and Biology, with a language level of at least C1 (requirement to graduate in English at German universities). The instructor and assistants would first try to rephrase explanations in English or use code-switching for specific terms or sentences. Students were given cards in green, yellow, and red to indicate how well they had understood the theoretical phases. Green indicated that the content had been well understood; yellow showed that additional explanations were necessary; and red mandated a recapitulation of the theoretical phase and/or code-switching.
Participants
A total of 252 ninth graders from Bavarian grammar schools participated in the intervention (girls 52.4%, boys 47.6%; SD_Gender = 6.2; M_Age = 14.6, SD_Age = 0.7). Seven classes took part in the non-CLIL intervention (n = 139) and eight classes in our CLIL intervention (n = 107). The non-CLIL group was divided into 70 student groups (68 2-person groups and one 3-person group) and the CLIL treatment group into 53 student groups (51 2-person groups and one 3-person group). To reduce the potential influence of school grades and previous knowledge, we calculated T0 knowledge scores and compared Biology grades for both groups. We found no significant difference (Mann-Whitney U test [MWU]: Z = −.725, p = .468) in grades between the groups. We did not perform any testing of prior language skills, as none of the students had considerable exposure to the English language outside of the classroom. Most participating teachers had a background in Biology and Chemistry; only one teacher had studied Biology and English. Information on the teachers' background was provided in the application letter and could be verified by short resumes on the schools' websites. The teachers either actively participated in the laboratory experiments or observed their students' performance and involvement.
Participation of schools and students was voluntary. To recruit volunteers, we sent invitations to neighbouring grammar schools in a 50-mile radius six months prior to the intervention. Teachers were asked not to teach DNA structure and function before student participation in the study. To comply with regulations of the Bavarian Ministry of Education and Cultural Affairs, we asked for written parental consent before the students participated in our study. Data collection was pseudonymised and students could not be identified (Declaration of Helsinki, 2013). The design of the module and the questionnaires were pre-approved by the ethics committee of the Bavarian Ministry of Education and Cultural Affairs and received the reference number X.7-BO5106/149/10. The content of the module matches requirements of the state's syllabus and follows national competency requirements (KMK, 2005). The 2018 non-CLIL module underlying our intervention was adapted in two design, evaluation, and development cycles with piloting groups to accommodate CLIL. The 2018 module classes did not take part in the 2020 CLIL intervention, and the non-CLIL and CLIL groups were treated as an independent variable (Cook & Campbell, 1979) (Table 1).

Variables
We tested content learning using an established content knowledge questionnaire for structural and functional characteristics of DNA (Langheinrich & Bogner, 2016; Mierdel & Bogner, 2019; Roth et al., 2020). The questionnaire contained 30 multiple-choice questions, each with four distractors and one correct answer (for examples, see Table 2). In addition to the questionnaire, we distributed a cloze test to assess language learning, a questionnaire to measure student self-efficacy beliefs, and a questionnaire to capture creativity (Roth, Conradty, et al., 2022) at three different testing times: two weeks before participation (pre-test; T0), directly after the module (post-test; T1), and eight weeks after participation (retention-test; T2). The cloze test, which assessed reading skills, was taken in the L2 while all other questionnaires were in German.
Based on studies like Serra (2007) and Massler (2011), we decided to use the content knowledge questionnaire in the L1 to measure the advantages of L1 cognition in CLIL beginners. A more recent study by Canz et al. (2021) supports our approach and highlights the importance of students being able to communicate content knowledge in the L1 'to reach educational standards in the subject and do not experience disadvantages compared to monolingually taught students' (p. 11). For the purpose of objectivity, students were not given the hypotheses underlying the study.
The questionnaire used for content learning was designed and developed during three studies with more than 2000 student participants and their teachers (Langheinrich & Bogner, 2016; Mierdel & Bogner, 2019; Roth et al., 2020). Its items were created based on texts from schoolbooks and the course syllabus for genetics. The items' difficulties were calculated with the help of pilot groups, and we replaced items that were too easy or too difficult. Eighteen items focused on structural aspects of DNA and 12 on functional aspects (Roth et al., 2020; Roth, Scharfenberg, 2022). Questions and respective multiple-choice answers were altered and randomly assigned after each testing period (T0, T1, T2) to prevent 'automated' responses.
We showed content validity of the questions by comparing them with the state syllabus and the content of the module. Inter-item correlations below .20 (T0 = .08; T1 = .19; T2 = .18) indicated that the items were distinct and that they tested different areas of content knowledge. In combination with the complexity of the latent construct of cognitive achievement, the low inter-item correlations confirmed construct validity (Rost, 2004). The reliability of the questionnaire was determined by Cronbach's alpha. Scores of .74 (T0), .76 (T1), and .78 (T2) exceeded the threshold of .70, which, according to Lienert and Raatz (1998), allows for differentiating groups. An additional calculation of item difficulties (percentage of correct answers, Bortz & Döring, 1995) indicated a range between 5% (high difficulty) and 90% (low difficulty). Comparisons between the different testing times showed that item difficulties improved from T0 to T1, particularly for the CLIL group (Figure 1).
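The two item statistics named above can be illustrated with a minimal Python sketch. It assumes that every answer is scored 0 (wrong) or 1 (correct); the function names and the toy data are hypothetical and are not taken from the study.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-student lists of 0/1 item scores."""
    k = len(responses[0])  # number of items
    # Variance of each item across students
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the students' sum-scores
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def item_difficulty(responses):
    """Percentage of correct answers per item (higher means easier)."""
    n = len(responses)
    k = len(responses[0])
    return [100 * sum(row[i] for row in responses) / n for i in range(k)]


# Hypothetical toy data: four students, three items
demo = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(demo)
difficulties = item_difficulty(demo)
```

With real data, each row would hold one student's 30 item scores for a single testing time, so that alpha and the item difficulties can be computed separately for T0, T1, and T2.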
We calculated sum scores and analysed these for improvements in content knowledge (T1 minus T0) and retention (T2 minus T0). Furthermore, we calculated the actual learning success with respect to the maximum attainable score (30 correct answers): (T1 − T0) × (T1/30); and the persistent learning success: (T2 − T0) × (T2/30) (Scharfenberg et al., 2007). Increases in content knowledge were rated according to students' actual knowledge to better compare learning success. This rating also accounts for students who exhibit a significant increase in knowledge yet low achieved scores, and vice versa. We calculated correlations between Biology and English grades and post-test (T1) scores for knowledge items using Spearman's rho (Field, 2012).
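As a worked illustration of these variables, the following sketch applies the two formulas to invented sum scores. The student values and the helper names `improvement` and `learning_success` are assumptions made for this example. Only the formulas and the maximum score of 30 come from the text.

```python
MAX_SCORE = 30  # maximum attainable score in the 30-item knowledge test


def improvement(t0, t_later):
    """Difference variable: T1 - T0 (improvement) or T2 - T0 (retention)."""
    return t_later - t0


def learning_success(t0, t_later, max_score=MAX_SCORE):
    """Gain weighted by the achieved share of the maximum score.

    (T1 - T0) x (T1/30) gives actual learning success,
    (T2 - T0) x (T2/30) gives persistent learning success.
    """
    return (t_later - t0) * (t_later / max_score)


# Two hypothetical students with the same raw gain of 10 points
student_a = {"T0": 5, "T1": 15}
student_b = {"T0": 20, "T1": 30}

for name, s in (("A", student_a), ("B", student_b)):
    gain = improvement(s["T0"], s["T1"])
    success = learning_success(s["T0"], s["T1"])
    print(name, gain, success)
```

Both hypothetical students gain 10 points, but student B's gain is weighted more heavily because it ends at a higher achieved score. This is the sense in which the rating takes students' actual knowledge into account.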

Statistical analysis
We used nonparametric tests to analyse our data as first assessments showed a non-normal distribution of our variables (Kolmogorov-Smirnov test (Lilliefors modification): partially p < .001). Our assessment of subgroups, which often did not reach the threshold required for assuming a Gaussian distribution, supported our decision to use these tests (Lomax, 1986). We used boxplots to visualise our results. To analyse intra-group differences between the three testing times, we applied the Friedman test (F) to illustrate general differences and the Wilcoxon signed-rank test (W) to show changes between testing times. For differences between the non-CLIL and CLIL treatment groups, we used the Mann-Whitney U test (MWU). Moreover, we used Bonferroni corrections to exclude marginally significant results that could have arisen by chance due to multiple testing (Field, 2012). Where the results remained significant after the corrections, we calculated effect sizes r (Lipsey & Wilson, 2001) and categorised them into small (> 0.1), medium (> 0.3), and large (> 0.5) effect sizes. We used Spearman's rank correlations for correlation analyses and reported them as Spearman's rho values.
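The inter-group procedure can be sketched as follows. This is a simplified illustration, not the analysis pipeline used in the study: it assumes the normal approximation of the Mann-Whitney U statistic without tie correction, and it assumes the common conversion r = |Z| / sqrt(N) for the effect size. All function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u_z(x, y):
    """Return U for group x and its z value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u1, (u1 - mean_u) / sd_u


def bonferroni_alpha(alpha, n_tests):
    """Significance level per test after Bonferroni correction."""
    return alpha / n_tests


def effect_size_r(z, n_total):
    """Effect size r from a z value (assumed conversion: |Z| / sqrt(N))."""
    return abs(z) / math.sqrt(n_total)


def classify_effect(r):
    """Thresholds as stated in the text: small > 0.1, medium > 0.3, large > 0.5."""
    if r > 0.5:
        return "large"
    if r > 0.3:
        return "medium"
    if r > 0.1:
        return "small"
    return "negligible"


# Invented difference scores for two small groups (illustration only)
group_clil = [1, 2, 3]
group_non_clil = [4, 5, 6]

u, z = mann_whitney_u_z(group_clil, group_non_clil)
r = effect_size_r(z, len(group_clil) + len(group_non_clil))
print(u, z, r, classify_effect(r), bonferroni_alpha(0.05, 3))
```

In practice, a statistics package with exact p values and tie handling would replace the hand-written test. The sketch only shows how the reported quantities relate to one another.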

Results
Intra-group analyses of content learning
Our intra-group analysis (F and W tests, Table 3) of students' overall knowledge of model-related and scientific knowledge of DNA indicated significant changes for the monolingual and bilingual treatment groups: their knowledge increased from T0 to T1 and then declined from T1 to T2, but never fell below T0 levels. These results suggest that both student groups were able to attain short-term and mid-term knowledge through participation in the intervention.

Inter-group analyses of content learning
To mitigate the influence of differences in students' prior knowledge at T0 and to determine students' short-term increase in content knowledge and mid-term retention rates, we also calculated difference variables (Table 4, note a). Our calculations for improvements in knowledge (T1 − T0) and retention rates (T2 − T0) across all 30 knowledge-test items were based on sum scores (Field, 2012) (Table 4). Additional calculations aimed to determine learning success variables for actual learning success ((T1 − T0) × (T1/30)) and persistent learning success ((T2 − T0) × (T2/30)) (Scharfenberg et al., 2007). Both difference and learning success variables were then used to assess inter-group differences employing the Mann-Whitney U test (see Table 4, notes).

Overall knowledge
The assessment of the 30-item knowledge test revealed significant differences in improvements in knowledge and retention rates between non-CLIL and CLIL learners. Non-CLIL learners scored significantly higher with a medium-to-large effect size in both difference variables (Table 4, notes d/e). While these results already provided a good indication of the students' performance, they did not account for overall content knowledge. We therefore also calculated the learning success variables: actual learning success ((T1 − T0) × (T1/30)) and persistent learning success ((T2 − T0) × (T2/30)). We then calculated Mann-Whitney U scores for inter-group differences and Wilcoxon scores for differences between the testing times (Field, 2012). For short-term content learning, we identified significant differences with a small effect size (Table 4; notes h). Non-CLIL learners achieved higher short-term content learning scores than CLIL learners. The differences, however, did not persist over time (Table 4; notes i). For mid-term permanent content learning, both non-CLIL and CLIL learners achieved comparable content learning scores in the 30-item knowledge test (Figure 2).

Correlation biology grades
There was no correlation in the CLIL group (rS = −.049, p = .621) between Biology grades and post-test (T1) scores. Yet, difference and learning success variables of the 30-item knowledge test revealed significant negative correlations (rS = −.171, p = .045) for non-CLIL learners. In effect, non-CLIL students with lower grades appear to have improved their difference and learning success scores in post-tests when compared to students with better grades. This did not hold true for CLIL learners.
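For readers less familiar with the statistic, Spearman's rho is the Pearson correlation of the ranked values. The following sketch computes it for invented grade and score pairs. The data, the function names, and the absence of a significance test are all simplifications for illustration and do not reproduce the reported coefficients.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented grade and score pairs (illustration only)
hypothetical_grades = [1, 2, 3, 4]
hypothetical_scores = [9, 7, 5, 2]

print(spearman_rho(hypothetical_grades, hypothetical_scores))
```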

Discussion
Data from our one-day outreach module suggest that it had a positive effect on content learning regardless of whether it was implemented as CLIL or non-CLIL. Both treatment groups profited from the combination of hands-on tasks and minds-on activities (Table 3), which corroborates findings from previous studies (e.g. Goldschmidt & Bogner, 2016; Langheinrich & Bogner, 2016; Mierdel & Bogner, 2020; Roth et al., 2020). An important determinant of this success appears to have been our choice not to use 'cookbook' procedures and to accommodate both cognitive and affective domains (Hofstein & Lunetta, 2004). In effect, the students learned not only to handle laboratory equipment confidently but also to identify problems, generate feasible hypotheses, analyse data, and develop models to explain their results (Carmel et al., 2020).
Yet, if not regularly taught in class, these 'non-cookbook' procedures can limit knowledge acquisition due to cognitive load (e.g. Mierdel & Bogner, 2020; Roth et al., 2020). We observed the negative impact of high cognitive load on knowledge processing in a previous study (Mierdel & Bogner, 2019). The study showed that a non-CLIL module based on outreach hands-on learning is already cognitively exhausting (e.g. Meissner & Bogner, 2012; Scharfenberg & Bogner, 2010). Adding conceptual learning in a foreign language can easily increase this exhaustion (Piesche et al., 2016; Rodenhauser & Preisfeld, 2014) and overstrain the working memory (Roussel et al., 2017). Although we did not measure cognitive load in the current study, we observed signs of exhaustion like a loss of focus and fatigue in the second half of the intervention. Students and teachers also made frequent remarks on how exhausted they were during and after the laboratory and how much effort the use of the L2 required. Moreover, the item difficulty was perceived to be much higher by the CLIL groups although the items were the same as for the non-CLIL group (Figure 1). This suggests that the germane load, which is crucial for memorising and learning, has been impaired by high intrinsic and extraneous load (Sweller, 2015) resulting from the combination of content and language learning.
Initially poorer short-term knowledge scores for the CLIL groups (Table 4; Figure 2) suggest that our module mentally exhausted the students and exceeded their processing capabilities. These results confirm competing demands between language and content learning. Canz et al. (2021), for instance, show that greater experience in the L1 leads to an improvement in the processing of content as opposed to the L2. Poor active and passive language skills may negatively affect content learning. In our study, students received not only a maximum of L2 exposure but also high content input from experimentation and modelling activities. This may have been too challenging (Craik & Lockhart, 1972).
We also observed positive effects. For instance, the discursive activity of negotiating meaning in the process of doing science by talking science helped many of the students to understand fundamental scientific concepts (e.g. Evnitskaya & Morton, 2011; Kress et al., 2001). The combination of hands-on tasks with minds-on activities in CLIL learning (Glynn & Muth, 1994) seems to have supported knowledge acquisition (Table 3). Students conversed during the sessions, answered the open questions in the laboratory manual, and solved riddles in the vocabulary exercise book. Moreover, the instructor and research assistants rarely used code-switching except for the content knowledge test, which was consistently in the L1, to make CLIL beginners feel more confident. Besides positive effects on confidence, testing content knowledge in the L1 helps to better map students' de facto knowledge achievement (Canz et al., 2021; Massler, 2011), and shows their ability to transfer knowledge encoded in the L2 to the L1 (Canz et al., 2021).
With mid-term permanent content learning success, other factors than cognitive load appear to be at play as the differences between the CLIL and non-CLIL groups were significantly lower (Table 4; Figure 2). A possible explanation may be that CLIL learners cognitively process content on several levels (Meyerhöffer & Dreesmann, 2019; Rodenhauser & Preisfeld, 2014). Although more challenging, many text passages in our CLIL module required repeated and precise reading, which may explain why a similar amount of knowledge remained anchored in long-term memory (e.g. Marian & Fausey, 2006; Rodenhauser & Preisfeld, 2014). Piesche et al. (2016) or Fernández-Sanjurjo et al. (2017) provide a similar explanation for knowledge acquisition in both native language and CLIL learners. Admiraal et al. (2006) or Haagen-Schützenhöfer et al. (2011) do not report significant differences between their CLIL and non-CLIL groups for all testing times. Instead, they ascribe their success to the choice of teaching methods, such as delivering the same content knowledge across several learning cycles in different forms of presentation (e.g. text, visuals, etc.).
We used this approach in our CLIL module and adopted linguistic, graphic, and interactive scaffolds (ESM 1; ESM 2) to support content and language learning. Appropriate scaffolding at both the language and content level can reduce the cognitive load but requires high-quality scaffolds (e.g. Fernández-Fontecha et al., 2020; Gottlieb, 2016). Our short-term content knowledge acquisition was not as hypothesised. This suggests that our scaffolds reduced neither the language-cognitive nor the content-cognitive demands (Grandinetti et al., 2013). Many students had difficulties learning the required vocabulary whilst at the same time understanding the genetic concepts and laboratory procedures. Their language may not have developed fast enough to realise positive content-level effects of the scaffolds. Short-term CLIL implementations may therefore require more than scaffolding to promote knowledge acquisition (e.g. Coyle, 2007; Grandinetti et al., 2013). Fernández-Fontecha et al. (2020), for instance, suggest multimodal scaffolding, which includes images, to present content in semiotic forms other than text.
Another approach could be to have students learn context-specific key vocabulary, such as experimental procedures and equipment, prior to participation in short-term CLIL implementations, like our gene-technology lab. This would reduce cognitive load and students would only need to put their recently learned vocabulary into context (McGuiness, 1999). At the same time, an increased focus on key vocabulary should not distract from the complex relationship between language and content learning (European Commission, 2004). To further explore this relationship in the science classroom, additional research will be required. One avenue worth investigating may be how deep learning as opposed to more superficial content learning could be fostered using CLIL in short-scale bilingual implementations. Ke et al. (2020), for instance, have found commonalities between high levels of discursive activity and deep-learning processes associated with scientific modelling. The exploration of this type of content knowledge could help us tackle the trade-off between content and language learning.

Limitations
Our study builds on a one-day outreach CLIL module with ninth graders who have little prior experience in hands-on experimentation and with the foreign language outside of classrooms. We also lacked commonly agreed standardised instruments for assessing CLIL content learning, which may impair adequate comparison (Dalton-Puffer, 2011). Additionally, the context-dependency of CLIL encumbers the extrapolation of our results to other contexts (Pérez-Cañado, 2012). Due to the diversity of possible CLIL implementations, the generalisation of our results is limited to short-term implementations of CLIL in science subjects (Fernández-Sanjurjo et al., 2017). To better understand this type of CLIL, CLIL researchers may need to explain the various options of implementation more clearly.

Conclusion
Previous studies have primarily focused on language learning or content learning in long-term CLIL modules (Meyerhöffer & Dreesmann, 2019). Our study of a short-term CLIL module contributes to understanding the relationship between language and science content learning at different levels. Although the CLIL group was shown to be less successful in the short term, CLIL outreach learning provides positive data for long-term learning in the content knowledge test. Whilst we cannot establish the exact reason for lower scores in the CLIL group, we think that either reducing the content in later trials or prior training of key vocabulary may help reduce cognitive load (Martin, 2015; Sweller, 2015).
76.9
A positively charged particle migrates in an electrical field … (Item 1)
(a) between the two poles
(b) to the positive pole
(c) to the negative pole*
(d) not at all
52.9/28.5 8.4/15.0
What is wrong? The migration speed of a molecule within the electrical field depends on …
The molecular structure of DNA can be best compared to …
… is wrong? DNA molecules are being made visible with gel-electrophoresis via … (Item 30)
(a) blue coloured loading buffer
(b) a dye that attaches to DNA molecules
(c) a dye that glows under UV radiation
(d) addition of dye to the gel*
Note: Correct answers are marked with an asterisk (*).

Figure 1. Item difficulties of monolingual and bilingual learners for knowledge items between T0 and T1.

Figure 2. Differences between monolingual and bilingual learners in content learning between testing times T0, T1, and T2; calculated improvements in content knowledge and retention rates as well as temporary and permanent learning success.

Table 2. Knowledge item examples.

Table 3. Content learning of non-CLIL and CLIL student groups.

Table 4. Dependent variables for both non-CLIL and CLIL, analysed with regard to content knowledge scores, difference variables and learning success.