Effects of games in STEM education: a meta-analysis on the moderating role of student background characteristics

ABSTRACT Game-based learning has proven to be effective and is widely used in science education, but usually the heterogeneity of the student population is being overlooked. To examine the differential effects of game interventions in STEM (Science, Technology, Engineering and Mathematics) related subjects on diverse student groups, a meta-analysis has been conducted that included 39 studies that compared game-based learning interventions with traditional classrooms in primary and early secondary education. We found moderate positive effects on cognition (g = .67), motivation (g = .51), and behaviour (g = .93). Additionally, substantial heterogeneity between studies was found. Moderator analyses indicated that primary school students achieve higher learning outcomes and experience game interventions as more motivating than secondary school students, whereas gender did not have any moderating effect. There were too few studies reporting information on the remaining moderators (socioeconomic status, migration background, and special educational needs) to include them in a multiple meta-regression model. Therefore, we assessed their role by separate moderator analyses, but these results need to be interpreted with caution. Additional descriptive analyses suggested that game-based learning may be less beneficial for students with low socioeconomic status compared to students with high socioeconomic status.


Introduction
Technical advances lead to an increased demand for highly skilled employees in STEM (Science, Technology, Engineering and Mathematics) industries. Therefore, students need to acquire science literacy to ensure that they have the skillset to actively participate in today's society (European Commission, 2015; European Commission, 2021). Prior research suggests that many students are not very motivated for STEM education and that, for many students, the motivation to engage in STEM subjects already decreases in upper primary school grades (Renninger et al., 2015; Shin et al., 2019), which can lead to students choosing a different specialisation in later grades to avoid STEM subjects (Simpkins et al., 2006). Game-based learning is widely used in schools as an approach to motivate students for science. It intends to satisfy students' psychological needs for autonomy, competence, and relatedness and thereby contribute to greater interest, motivation, engagement, and achievement (R.M. Ryan & Rigby, 2020).
Various literature reviews and meta-analyses have been conducted with regard to the effects of game-based learning on student outcomes. Overall, they concluded that game-based learning improves students' learning outcomes (Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013) and contributes to students' motivation and engagement (Lamb et al., 2018; Zainuddin et al., 2020). Looking in particular at game-based learning in STEM education, results are mixed and vary from no significant effect on science learning (Wouters et al., 2013) to improved learning outcomes for all STEM subjects (Riopel et al., 2019). Although all these studies have provided valuable insights with regard to the overall effects and have shown that games have the potential to improve students' motivation and learning outcomes, more insight is needed regarding the differential effects of games for different groups of students. That is, even though games may, on average, yield positive outcomes, these outcomes may not be the same for all students. Students can differ from one another with regard to their age and gender, but also regarding their background characteristics such as their socioeconomic status (SES), migration background, or the presence of special educational needs (SEN). Given that the effects of games have been found to depend on prior knowledge (Shute et al., 2015) and game self-efficacy (Nietfeld, 2020), it could be that games are not equally effective for different student groups who may differ in prior knowledge or self-efficacy beliefs. These results align with prior research showing differential effectiveness for other educational interventions (e.g., Li et al., 2021). Likewise, the effects of games may differ based on different student characteristics. Since games are widely used in education, it is important to gain insight into whether some groups of students benefit more from games than others. 
If games are less effective for certain groups of students, the wide implementation of games might actually contribute to the perpetuation of preexisting differences between groups.
As classrooms become increasingly diverse (Fine-Davis & Faas, 2014;OECD, 2019), it is important to gain more insight into whether games are indeed beneficial for all groups of students. The present meta-analysis aims to extend previous game-based learning research by examining the potential moderating effects of different student characteristics, in early STEM education. We focus on early STEM education in primary and lower secondary education as early interventions could be particularly beneficial to counteract the widely reported decrease in motivation for STEM education (Renninger et al., 2015;Shin et al., 2019;Stewart et al., 2013).

Game-based learning
Games are designed to be immersive and enjoyable activities (Kinzie & Joseph, 2008). They provide a safe environment for graceful failure (Kinzie & Joseph, 2008; Plass et al., 2015, 2020) and can offer adaptivity to engage each learner individually (Plass et al., 2015; Van Oostendorp et al., 2014). While there is no consensus on the exact definition of a game, we align with the definition of Plass et al. (2020), who define game-based learning as learning tasks that are redesigned into games with a full range of game features to make them more interesting and effective for learning. Typical characteristics of games that are widely agreed upon are that games have clear rules of play, are challenging, and motivate a player to play. We chose to adopt this broad definition to avoid restricting the possible sample of studies in this meta-analysis. Regarding STEM education, games can be used to demonstrate different science concepts, and they enable students to work with phenomena that would otherwise be invisible or to explore locations that would usually be unreachable (Klopfer & Thompson, 2020). Moreover, games can provide learners with additional scaffolding and context, which may support the development of skills (e.g., problem solving) that are needed for STEM education (Klopfer & Thompson, 2020).
Game-based learning is believed to affect a wide range of outcomes. Plass et al. (2015) described four different theoretical perspectives for game-based learning research that refer to the game design as well as the outcomes that games are targeting: a cognitive perspective, a motivational perspective, an affective perspective, and a sociocultural perspective. The cognitive perspective of game-based learning focuses on the construction of mental models and as such is using theories describing cognitive aspects of learning. Research from this perspective typically includes cognitive outcome measures, such as learning outcomes. Research focusing on motivational perspectives of game-based learning is based on motivational theories such as self-determination theory (Ryan & Deci, 2000) or self-efficacy theory (Bandura, 1977) and focuses on games as a tool to engage and motivate players. The outcome measures associated with this perspective usually include measures of motivation and associated constructs such as interest and engagement. Studies based on affective theories, such as the control-value theory (Pekrun, 2006) focus on games as a means to create a more positive affective emotional experience for students. Outcome measurements mostly pertain to the emotions experienced by students (e.g., curiosity or boredom; Loderer et al., 2020). Research using sociocultural perspectives investigates how social interactions in game environments influence learning (Plass et al., 2015). As such, research could investigate the differences among single play, collaborative play, and competitive play or the identity construction within large game communities.
In a review on game-based learning in primary education, Hainey et al. (2016) distinguished four different categories of outcome variables that are most commonly examined in research, including (1) knowledge acquisition and content understanding, (2) affection and motivation, (3) perceptual and cognitive outcomes, and (4) behavioural change.
In the present study, we follow the categories of Hainey et al. (2016) for our outcome variables. Category (1) knowledge acquisition and content understanding and category (3) perceptual and cognitive outcomes were combined into one single category 'cognitive outcomes', since other frameworks (e.g., Plass et al., 2015) refer to cognition as a single category and due to the conceptual overlap between these two categories. Accordingly, we distinguish the following three types of outcomes of game-based learning:
(1) Cognitive outcomes. This category aligns with Plass' cognitive perspective and combines Hainey's categories (1) knowledge acquisition and content understanding and (3) perceptual and cognitive outcomes, as these two categories contain overlapping constructs that are hard to distinguish. Outcome variables within this category are, for example, assessed learning outcomes, knowledge retention (Fokides, 2018b), or conceptual knowledge (Chee & Tan, 2012).
(2) Motivational outcomes. In line with Hainey, affect and motivation were combined into one category, motivation. We understand motivation as affect towards a task, which can be named differently depending on the theory used (e.g., intrinsic motivation (Ryan & Deci, 2000), interest (Hidi & Renninger, 2006), self-efficacy (Bandura, 1977), enjoyment (Ryan & Deci, 2000)).
(3) Behavioural outcomes. Aligning with Hainey, we added the category behaviour. This category includes various behavioural outcomes, such as self-reported strategy use, behavioural engagement, or effort (Fredricks et al., 2004).
Unlike the other three perspectives, the sociocultural perspective of Plass et al. (2020) does not specify a category of outcomes. Rather, it focuses on specific game features (individual versus collaborative), which is beyond the scope of the present study as the present study focuses on the moderating role of different student characteristics.

Background characteristics of students
Games are an increasingly popular tool in schools and several meta-analyses have shown the beneficial effects of game-based learning (Lamb et al., 2018; Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013). These meta-analyses have mostly focused on explaining heterogeneity between studies by moderators such as discipline (Riopel et al., 2019; Wouters et al., 2013), instructional methods (Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013), and game features (Lamb et al., 2018; Riopel et al., 2019). Very limited attention has been paid to the potential differential effects of game-based learning for students with different background characteristics. As such, the question of whether and how the effects of games differ across student groups is largely unexplored. Regular classrooms are typically characterised by heterogeneity. That is, students can differ in age, gender, migration background, SES, and in having special educational needs (e.g., ADHD, dyslexia). Some of these background characteristics of students are associated with a higher risk of school failure and early school leaving. According to the European Agency for Special Needs and Inclusive Education (2019), the following learner circumstances can have a negative impact on students' educational outcomes: (im)migration background, low SES, or special educational needs due to learning or behavioural disabilities. This finding is supported by results of other studies, in which students with low SES, special educational needs, or a migration background score lower in STEM-related subjects than their peers (Mullis et al., 2020; Garon-Carrier et al., 2018; Lie, Selcen Gucey, & Moore, 2019). Those differences can be seen as early as kindergarten. Children with disadvantageous learner circumstances show lower mathematical precursor abilities (Garon-Carrier et al., 2018) and lower literacy, language, and self-regulatory skills (Verhaeghe et al., 2018) than other children. 
Additionally, low parental income can negatively influence science learning opportunities inside and outside the home environment (e.g., museum visits) because fewer resources are available.
Even though gender has not been classified as a predictor that could negatively influence educational outcomes (European Agency for Special Needs and Inclusive Education, 2019), there are gender differences in the domain of STEM. Women are still an underrepresented group in STEM careers (Turner et al., 2019) and girls in general report lower science self-efficacy (Marriott et al., 2019). Lower self-efficacy can also be found in students from disadvantageous backgrounds with negative learning biographies. Students become discouraged over time, with lower levels of hopefulness and high levels of anxiety, frustration, and distress (Bandura, 1977; Schunk & DiBenedetto, 2020). To reduce these negative emotions, students can adopt aversive behaviour as a coping strategy (Bandura, 1977), which could explain the loss of interest in the specific subject and, more generally, the higher risk of early school dropout.
There are several reasons to assume that games may have differential effects for students with different background characteristics. Boys have been found to play digital games more frequently than girls (Homer et al., 2012;Hygen et al., 2019;Kinzie & Joseph, 2008) and boys' and girls' preferences for game characters and game type (e.g., strategic, problem-solving) can differ (Homer et al., 2012). Additionally, games seem to have the potential to motivate and to reach especially those students with negative learning biographies through their typical characteristics such as graceful failure, scaffolding or providing autonomy. At the same time, high achieving or gifted students could potentially get disengaged by game-based learning due to cognitive underload (Van Oostendorp et al., 2014). Studies have found that games could increase the social participation of at-risk students (Hanghøj et al., 2018), improve word reading skills for dyslectic children (Ronimus et al., 2019), have beneficial effects for training children with autism (Ke & Moon, 2018;Lu et al., 2018) and can increase prosocial behaviour among socioeconomically diverse students (Harrington & O'Connell, 2016). However, the lower availability of home resources (Mullis et al., 2020) could also indicate that not every student has the same access to technology and might not be used to playing digital games. Several studies have found differences in ICT literacy between students from different socioeconomic and ethnic backgrounds (Aesaert et al., 2015;Scherer & Siddiq, 2019;Volman et al., 2005), whereas gender differences are rather small (Gnambs, 2021;Scherer & Siddiq, 2019). Moreover, ICT skills are lower in younger students (Aesaert et al., 2015) yet increasing over time (Gnambs, 2021). These differences could be an indicator for possible inequalities in the effects of game-based learning as game-based learning research mostly focuses on digital games, which requires basic knowledge of the tool used.
Thus far, many overview studies have neglected to examine these student background characteristics as moderators of the effects of game-based learning, and only age (Riopel et al., 2019; Tokac et al., 2018; Vogel et al., 2006; Wouters et al., 2013) and gender (Vogel et al., 2006) have been tested as moderators. Findings are not conclusive, as these overviews suggested less beneficial, though non-significant, effects for younger students (Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013) as well as no differences between age groups (Tokac et al., 2018). For gender, no significant differences were found (Vogel et al., 2006).

The present study
The present study aims to extend previous game-based learning research by examining the potential moderating effects of different student characteristics in early STEM education. Therefore, we conducted three meta-analyses to examine the effect of game-based learning in STEM education on the outcome variables cognition, motivation, and behaviour. Our aim was to conduct moderator analyses to examine possible variations in the effectiveness of game-based learning due to background characteristics of students and to identify whether this could enlarge existing inequalities in classrooms. As such, we examined age (i.e., school level), gender, socioeconomic status, migration background, and special educational needs as potential moderators of the effects of game-based learning. The present meta-analysis could therefore be a valuable contribution to both research and practice, as it aims to give insight into whether game-based learning works equally well for every student and to highlight possible differences in effectiveness. This knowledge could help both researchers and educators to decide when and for whom to use game-based learning and whether they need to tailor their instruction when using game-based learning in classrooms. The scope of this study was to cover students in K1-8, and it answers the following research question: How do effects of game-based learning differ for students with different characteristics (school level, gender, migration background, socio-economic status, special educational needs) in early STEM education?

Literature search
Published articles and unpublished research (e.g., dissertations) on game-based learning in early STEM education were searched in the databases ERIC and PsychInfo (via Ovid), Scopus, and EBSCOHost. We chose these databases, in consultation with an expert on search strategies, because they include both domain-specific and general databases. Since many relevant studies are interdisciplinary in nature, this combination of databases was expected to cover the entire range of relevant journals. Our aim was to discover all relevant studies. Therefore, our search terms included synonyms of game-based learning, STEM education, motivation, and learning outcomes and were used for free text search and subject headings. An example of a search string we used in PsychInfo (via Ovid) can be found in Appendix A. For a quality check of the search terms, we compared them to the search terms of previous meta-analyses and consulted a university librarian with expertise on search strategies for meta-analyses.
Additionally, we restricted the search to records published after 2000, with English as publication language, and to elementary and secondary school students (K1-8). The results of the literature search and the selection of articles are shown in the PRISMA diagram (Moher et al., 2009) in Figure 1. The PRISMA diagram shows the numbers of articles found, included, and excluded at each step of this review.

Inclusion of studies
The first database search resulted in a total of 4452 papers. After deduplication, a total of 4331 papers were left for initial screening, for which the platform Rayyan (Ouzzani et al., 2016) was used. In the initial screening phase, all papers were screened based on their titles and abstracts, and papers that clearly did not fit the inclusion criteria presented in Table 1 (e.g., participants that were teachers instead of students, qualitative studies, no control group, simulations instead of a game) were excluded. If the abstract did not provide enough information, the paper was included for the full-text screening. After the abstract screening, 306 papers were left for full-text assessment. Those articles were assessed for eligibility by applying the inclusion and exclusion criteria presented in Table 1. Aligning with the broad definition of 'games' we adopted, we included all studies which used the term 'game' for their intervention. Moreover, all included studies needed to investigate at least one of our defined outcome variables. As such, studies that examined, for example, knowledge retention, science literacy or conceptual understanding were included and coded for the outcome variable cognition. To establish reliability, at each screening step given in Figure 1, 10% of all papers were screened independently by the first and second authors or a trained research assistant. Disagreements were resolved through discussion (e.g., whether an outcome variable fitted within one of the three outcome categories). The interrater-reliability for the title and abstract screening was 84%, for the full-text screening 100%.
We excluded papers that were not in English, did not investigate the effects of games and where the dependent variable was not cognition, motivation or behaviour (for example, teachers' perception, collaborative learning, or students' perceptions of different game mechanics). Additionally, papers were excluded if they did not report enough data to calculate an effect size. In total, 39 papers were included. Two studies included data of independent groups that could be treated as individual studies. So, in total, k = 43 studies were coded for this meta-analysis with a total of 60 effect sizes.

Coding of study characteristics and moderator variables
First, we coded basic information from the study, i.e., general study information (e.g., author, publication year), educational context features (e.g., academic domain), methodological characteristics (e.g., type of control group), and general sample characteristics (e.g., grade, gender, SES). Thereafter, information to calculate the effect size was coded, i.e., the outcome variable (cognition, motivation or behaviour) and statistical data needed to calculate the effect size (e.g., number of participants, mean scores). After two rounds of testing and adjusting the coding scheme, the first two authors coded 13% of the included papers (k = 6) individually to check for interrater-reliability. For most variables, an interrater-reliability of Cohen's κ = 1.00 was reached; for the variables study design and SES composition, differences in coding were resolved through discussion.
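As an illustration of the two agreement measures mentioned in the screening and coding steps (percent agreement and Cohen's κ), the following minimal sketch computes both for two raters. The function name and the example decisions are hypothetical and are not taken from the study data.

```python
from collections import Counter

def agreement_and_kappa(rater_a, rater_b):
    """Return (percent agreement, Cohen's kappa) for two raters' categorical codes."""
    n = len(rater_a)
    # Observed agreement: share of items coded identically by both raters
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal category frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / n**2
    # Kappa rescales observed agreement by what is achievable beyond chance
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

# Hypothetical include/exclude decisions for six papers
a = ["in", "in", "out", "out", "in", "out"]
b = ["in", "in", "out", "out", "out", "out"]
p_o, kappa = agreement_and_kappa(a, b)
print(round(p_o, 3), round(kappa, 3))
```

Percent agreement does not correct for chance, which is why κ is usually the more conservative of the two figures.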
Outcome variables were coded as cognition (e.g., learning outcomes), motivation (e.g., intrinsic motivation, interest, self-efficacy), or behaviour (e.g., self-reported strategy use, behavioural engagement, or effort). For background characteristics, the percentages of males, low SES students, migration background students, and special needs students in the sample were coded. Additionally, the grade, calculated as sample average, and two composition variables for SES and special educational needs were coded. Those composition variables included detailed information about the SEN type and SES characteristics and were intended to help interpreting possible effects in the moderator analysis. Study characteristics (e.g., journal type, study design) were included to assess the effects of study quality on the outcomes. To code SES, we followed the definitions of the American Psychological Association (2007) and the TIMSS studies (Mullis et al., 2020). As such, low SES could be based either on parental characteristics (e.g., low-income level, low educational level, or low occupational level), material and resources (e.g., low support or few books at home, missing material), geographic characteristics (e.g., disadvantaged neighbourhood), or eligibility for free lunch programs. Special educational needs students were defined as students with either learning disabilities (e.g., dyslexia), behavioural disabilities (e.g., ADHD), or gifted students. Students with learning or behavioural disabilities were combined into one category, whereas gifted students were considered to be a separate category.

Computing effect sizes
Each study could have a maximum of three effect sizes, one for each outcome variable. In total 63 effect sizes were computed. For computing the effect sizes, standardised mean differences were used as effect size metric (Borenstein et al., 2009). Hedges' g was used as effect size unit; it is adjusted from Cohen's d to control for small sample size bias (Hedges & Olkin, 1985). In case ANCOVAs were reported, the adjusted post-test means were used to calculate the effect size. In case of a pre-post-test control group design, we intended to use the equations provided by Morris (2008) to calculate the effect size. However, none of the studies reported the correlation between pre-test and post-test results, and calculating the correlation manually using formula 24 of Morris and DeShon (2002) was impossible due to missing data. Thus, sensitivity analysis would not have been possible (Borenstein et al., 2009) and this approach had to be discarded. Therefore, we used raw post-test means and standard deviations to calculate the effect sizes (Borenstein et al., 2009, p. 26; Thompson, 2007, pp. 424-425). In case neither of those data was available, information from inferential statistics (t-test, U-test) was used and the effect size was calculated with the online effect size calculator of Lipsey and Wilson (2001).
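The computation of Hedges' g from raw post-test means and standard deviations can be sketched as follows. This is a minimal illustration using the standard formulas for two independent groups; the function name and the input values are hypothetical and do not come from any included study.

```python
import math

def hedges_g(m_t, sd_t, n_t, m_c, sd_c, n_c):
    """Return (g, var_g) for two independent groups from post-test summary data."""
    df = n_t + n_c - 2
    # Pooled within-group standard deviation
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    # Cohen's d: standardised mean difference (treatment minus control)
    d = (m_t - m_c) / sd_pooled
    # Sampling variance of d
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    # Small-sample correction factor J turns d into Hedges' g
    j = 1 - 3 / (4 * df - 1)
    return j * d, j**2 * var_d

# Hypothetical game group vs. control group
g, var_g = hedges_g(m_t=75.0, sd_t=10.0, n_t=30, m_c=70.0, sd_c=10.0, n_c=30)
print(round(g, 3), round(var_g, 4))
```

The correction factor J is always below 1, so g is slightly smaller in magnitude than d, and the difference shrinks as sample sizes grow.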

Multiple outcomes and non-independence
Nine studies included multiple outcome measures for a single category (e.g., multiple cognitive outcomes). An example is a study that assessed scientific concept and scientific argumentation skills as outcomes (Chen, Lu & Lien, 2019). In these cases, an effect size was computed for each outcome and these were pooled per study in a fixed-effect meta-analysis; the resulting pooled effect size was then included in the main meta-analysis (Borenstein et al., 2009; Higgins et al., 2019). Additionally, 10 studies reported the data of multiple intervention or control group conditions. In those cases, a combined effect was computed, pooling the means and standard deviations using the formulas provided by Higgins et al. (2019). Two studies (Fokides, 2018a; Chang et al., 2015) reported their results for several independent groups (e.g., different grades of different schools), in which case those groups were treated as individual studies.
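The two pooling steps described above can be sketched as follows. The first function uses the standard formula for merging two groups' sample sizes, means and standard deviations into one group; the second is a plain inverse-variance fixed-effect pool. Function names and numbers are hypothetical, and the sketch assumes the pooled outcomes are treated as independent.

```python
import math

def combine_groups(n1, m1, sd1, n2, m2, sd2):
    """Merge two groups into one: returns (n, mean, sd) of the combined group."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    # Within-group sums of squares plus the between-group component
    ss = (n1 - 1) * sd1**2 + (n2 - 1) * sd2**2 + (n1 * n2 / n) * (m1 - m2) ** 2
    return n, mean, math.sqrt(ss / (n - 1))

def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted mean of several effect sizes and its variance."""
    weights = [1 / v for v in variances]
    pooled = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    return pooled, 1 / sum(weights)

# Hypothetical: two game conditions merged into one intervention group
print(combine_groups(20, 10.0, 2.0, 20, 14.0, 2.0))
# Hypothetical: two cognitive outcomes from one study pooled into one effect
print(fixed_effect_pool([0.4, 0.6], [0.04, 0.04]))
```

Note that the combined standard deviation exceeds both group standard deviations whenever the group means differ, because the between-group spread is added.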

Data analysis
For each dependent variable, one random-effects meta-analysis was conducted with the metafor package in R (Viechtbauer, 2010), using the restricted maximum-likelihood estimator for all models. The confidence interval of individual studies was compared with the confidence interval of the pooled effect and checked through sensitivity analysis to detect possible outliers (Viechtbauer & Cheung, 2010). Besides the effect size, its standard error and its 95% confidence interval, the τ² for between-study variance is reported. To indicate the degree of heterogeneity of the effect sizes, Cochran's Q, I², and the prediction interval were reported; the latter gives an overview of the effect size range (Borenstein, 2019). The I² statistic indicates the proportion of variance between studies due to heterogeneity rather than sampling error (Borenstein, 2019).
For the moderator analysis, the mixed-effects model using the Knapp-Hartung adjustment (Knapp & Hartung, 2003) was chosen, allowing the studies within subgroups to be treated as random and the subgroups as fixed, while simultaneously controlling for small sample size. For categorical subgroups dummy variables were computed, whereas for continuous variables, a mixed-effects meta-regression was conducted.
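To make the reported quantities (pooled effect, τ², Q, I²) concrete, the sketch below runs a random-effects meta-analysis on hypothetical data. Note that it swaps in the closed-form DerSimonian-Laird estimator of τ² for simplicity; the analyses described here used the restricted maximum-likelihood estimator in metafor, which is iterative and can give somewhat different values.

```python
import math

def random_effects_dl(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 (not REML)."""
    k = len(effects)
    w = [1 / v for v in variances]
    sum_w = sum(w)
    # Fixed-effect mean, needed for Cochran's Q
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum_w
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))
    df = k - 1
    # Scaling constant C and the method-of-moments estimate of tau^2
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # I^2: share of total variation attributed to heterogeneity, in percent
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each study's sampling variance
    w_star = [1 / (v + tau2) for v in variances]
    mu = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {"mu": mu, "se": se, "tau2": tau2, "q": q, "df": df, "i2": i2,
            "ci": (mu - 1.96 * se, mu + 1.96 * se)}

# Hypothetical effect sizes (Hedges' g) and sampling variances
result = random_effects_dl([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(result)
```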

Publication bias
To reduce publication bias, unpublished studies were included and the publication type was added as one of the moderator variables to assess study quality. To check whether studies with small sample sizes might have been missing, funnel plots were computed with the metafor package to check for asymmetry. Additionally, Egger's test for funnel plot asymmetry and Duval and Tweedie's trim-and-fill procedure were conducted to adjust the effect size for publication bias (Borenstein, 2019; Van Lissa, 2019).
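The idea behind Egger's test can be sketched in its classical form: the standardised effect (effect divided by its standard error) is regressed on precision (one divided by the standard error), and an intercept that deviates from zero suggests funnel plot asymmetry. The function name and data below are hypothetical, and the metafor implementation used for the analyses may rely on a different model formulation.

```python
import math

def egger_intercept(effects, std_errors):
    """Classical Egger regression: returns (intercept, SE of intercept, t statistic)."""
    y = [g / se for g, se in zip(effects, std_errors)]  # standardised effects
    x = [1 / se for se in std_errors]                   # precision
    k = len(y)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Ordinary least-squares slope and intercept
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with k - 2 degrees of freedom
    resid_var = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / (k - 2)
    se_int = math.sqrt(resid_var * (1 / k + mean_x**2 / sxx))
    t_stat = intercept / se_int if se_int > 0 else float("nan")
    return intercept, se_int, t_stat

# Hypothetical data in which smaller studies (larger SE) report larger effects
intercept, se_int, t_stat = egger_intercept([0.9, 0.6, 0.4, 0.3], [0.4, 0.3, 0.2, 0.1])
print(round(intercept, 3), round(se_int, 3), round(t_stat, 3))
```

The t statistic would be compared against a t distribution with k − 2 degrees of freedom; that lookup is omitted here to keep the sketch dependency-free.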

Included studies
The 39 papers were published between 2008 and January 2020. The majority of studies came from peer-reviewed journals (90%) and all papers included information on the grade of tested participants. Regarding the subject domains in which the game intervention took place, most of them were in mathematics (40.25%), followed by biology (31%), physics (10.25%), technology (8%), chemistry (8%), and engineering (2.5%). Two-thirds used a randomised control group design, of which 15 studies randomly assigned an entire classroom to one condition, and 13 randomly assigned students within classes to a condition. Roughly one-third of included studies used a quasi-experimental design (k = 11). Retrieving information on the background characteristics of students yielded 26 papers reporting information on gender, 4 papers reporting students' migration background, 9 papers reporting information on students' socioeconomic status, and 9 papers including information on the number of special educational needs students. Table B.1 in Appendix B provides additional information about the game interventions of all included studies. The most commonly used game design features for learning (Plass et al., 2015) in our sample were incentive systems, visual aesthetic designs, and teaching or practicing knowledge and skills.

Additional analysis
For some moderators, there was an insufficient or relatively low number of studies to include in the moderator analyses (see Results) because many studies did not report sufficient information on students' background characteristics. Therefore, we decided to complement the moderator analyses with descriptive analyses in which we qualitatively describe the potential role of the moderator variables.

Effects and variation across studies
To answer our research question (How do effects of game-based learning differ for students with different characteristics (school level, gender, migration background, socio-economic status, special educational needs) in early STEM education?), we conducted one meta-analysis for each outcome variable (cognition, motivation, behaviour) to examine the overall effect of game-based learning before moving on to moderator analyses. The results of the random-effects meta-analyses are presented in Table 2, where the mean weighted effect size (Hedges' g), the between-study variance, the Q-test for heterogeneity and the prediction interval are presented.
The results show that for all three dependent variables, cognition (g = .67, p < .001), motivation (g = .51, p < .001), and behaviour (g = .93, p .01), the weighted mean shows a moderate effect in favour of games compared to conventional teaching methods. This means that students who were in the intervention group scored on average 0.67 standard deviations higher in tests assessing cognitive outcomes than those who were not. The corresponding 95% confidence interval for the effect size of cognition ranges from 0.51 to 0.88, which means that the mean standardised difference across comparable studies could plausibly fall anywhere in this range. The same pattern can be seen in the confidence intervals of the motivation and behaviour effect sizes, which are presented in Table 2. The Q statistic provides a test of the null hypothesis that all studies in the analysis share a common effect size (Borenstein, 2019). The Q-value is higher than the degrees of freedom for all outcome variables, with a p-value of < .001. Therefore, the null hypothesis is rejected: the true effect size is not identical in all studies. The I² statistic ranges from 76% to 92%, which indicates that most of the variance in observed effects reflects variance in true effects rather than sampling error.
Note (Table 2). k = number of studies; # students = total number of participants; g = mean weighted effect size in Hedges' g; SE = standard error; CI = confidence interval; τ² = between-study variance; Q = Cochran's heterogeneity test; df = degrees of freedom of the Q-test; I² = percentage of variation between studies that is due to heterogeneity rather than sampling error; PI = prediction interval.
Whereas the confidence interval indicates the precision of the estimated mean effect size, the prediction interval gives a better overview of the dispersion of effect sizes across populations (Borenstein, 2019). For cognition, the 95% prediction interval ranges from −0.38 to +1.76. This means that, in the population represented by the included studies, the true effect size will fall within this range in 95% of cases. The mean effect size is moderate (g = 0.67), indicating that average knowledge improves by an amount that may have a substantive impact (Borenstein, 2019). However, the prediction interval shows that the dispersion of effects around this mean is substantial. Hence, there are some populations where the impact is very strong, some where it is moderate, and some where it is trivial or even negative. The same pattern can be seen in the prediction intervals for the outcome variables motivation (−0.11; +1.12) and behaviour (+0.08; +1.78).
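To make the relation between the quantities in Table 2 concrete, the following minimal sketch pools study-level effect sizes with a random-effects model. It assumes the DerSimonian-Laird estimator for τ², which the text does not specify; the function name, the example data, and the `t_crit` argument (the critical t-value with k − 2 degrees of freedom, taken from a table) are hypothetical and serve illustration only.

```python
import math
from statistics import NormalDist


def random_effects(effects, variances, t_crit=None):
    """Pool effect sizes (e.g. Hedges' g) with a DerSimonian-Laird model.

    effects   -- observed effect size per study
    variances -- sampling variance per study
    t_crit    -- optional critical t-value (k - 2 df) for the prediction interval
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum(w)
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # I^2 in percent

    w_re = [1.0 / (v + tau2) for v in variances]          # random-effects weights
    mean = sum(wi * g for wi, g in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    z = NormalDist().inv_cdf(0.975)

    result = {
        "g": mean, "se": se, "ci": (mean - z * se, mean + z * se),
        "tau2": tau2, "q": q, "df": df, "i2": i2,
    }
    if t_crit is not None:
        # The prediction interval adds tau^2 to the uncertainty of the mean.
        half = t_crit * math.sqrt(tau2 + se ** 2)
        result["pi"] = (mean - half, mean + half)
    return result


# Hypothetical data for five studies; 3.182 is the 97.5% t quantile for 3 df.
example = random_effects(
    [0.10, 0.45, 0.70, 1.20, 1.90],
    [0.04, 0.05, 0.03, 0.06, 0.08],
    t_crit=3.182,
)
```

Because the prediction interval includes τ² in addition to the standard error of the mean, it is always at least as wide as the confidence interval, which is why a clearly positive mean effect can coexist with a prediction interval that crosses zero.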
A sensitivity analysis was performed to test for between-study heterogeneity and to check whether the observed effect size is robust and not heavily influenced by outlier studies (Viechtbauer & Cheung, 2010). The findings do not indicate any strong outlier impact on the calculated effect sizes for cognition, motivation, and behaviour.

Moderator analyses - background characteristics
The results of the moderator analyses testing for differential effects of games for students with different background characteristics are presented in Tables 3 and 4 for cognition and motivation, respectively. No moderator analyses were conducted for behaviour because only six studies in total focused on this outcome variable, which is too few to perform a moderator analysis. All studies reported the school level and most studies reported on gender (k = 26), but only a few studies reported on the other background characteristics. For cognition, school level (k = 40), gender (k = 23), SES (k = 9), migration background (k = 4), SEN (k = 7), and giftedness (k = 5) were included as moderators. For motivation, moderator analyses could only be performed for school level (k = 15), gender (k = 12), SES (k = 6), and SEN (k = 3). No moderator analyses could be performed for migration background and giftedness, as too few studies (k < 3) reported sufficient information.
Instead of combining the different background moderators in a multiple meta-regression, separate regression models were fitted for each moderator. This approach was chosen because the focus of this meta-analysis is on the main effects of each moderator rather than their combined effects, and because combining them would mean excluding most of the studies, as only very few studies included information on all moderators, thereby increasing the risk of Type II error. To control for Type I error, we adopted a stricter significance threshold (p = .001) for these variables. The statistics for heterogeneity (τ², R², I²) for each regression model are reported in Tables 3 and 4. For cognition, school level was found to be a significant moderator, as the omnibus test of the coefficients was significant (p < .001) and the model accounted for 34% of the heterogeneity. The findings indicate that games are less effective for secondary school students (g = −0.67, p < .001) than for primary school students. The other background variables (gender, SES, migration background, and special educational needs) were not found to be significant moderators, as the omnibus tests of the coefficients were not significant. More specifically, gender (k = 23) did not have a statistically significant moderating effect on cognition (p = .37). Likewise, SES, migration background, and special educational needs did not significantly moderate the effects of games on cognition. However, these results should be interpreted with caution given the small number of studies included. For motivation, the school level model accounted for 22% of the heterogeneity with a significant omnibus test (p = .063). Secondary school students seem to experience a significantly lower motivating effect than primary school students (g = −.37, p = .063). The omnibus tests for the study quality model and for the remaining background characteristics were not significant, indicating that these variables have no moderating effect.
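A single-moderator model of the kind described above can be sketched as a weighted regression of effect sizes on a dummy-coded moderator. The sketch below is a simplification under stated assumptions: τ² is passed in as a known value, whereas a full mixed-effects model estimates the residual τ² from the data, and the function name and data are hypothetical.

```python
import math
from statistics import NormalDist


def meta_regression(effects, variances, moderator, tau2):
    """Weighted least squares of effect size on one moderator.

    Weights are 1 / (sampling variance + tau2). tau2 is assumed to be
    known here; in practice it is estimated from the residuals.
    """
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, moderator)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, moderator))
    sxy = sum(wi * (x - x_bar) * (y - y_bar)
              for wi, x, y in zip(w, moderator, effects))
    slope = sxy / sxx                      # difference between the two groups
    intercept = y_bar - slope * x_bar      # mean effect in the reference group
    se_slope = math.sqrt(1.0 / sxx)
    z = slope / se_slope
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value for the slope
    return {"intercept": intercept, "slope": slope,
            "se_slope": se_slope, "z": z, "p": p}


# Hypothetical data: moderator is 0 for primary and 1 for secondary school.
fit = meta_regression(
    effects=[0.9, 1.1, 0.8, 1.0, 0.3, 0.2, 0.4],
    variances=[0.04] * 7,
    moderator=[0, 0, 0, 0, 1, 1, 1],
    tau2=0.05,
)
significant_strict = fit["p"] < 0.001   # stricter threshold, as in the text
```

With a dummy-coded moderator, the slope equals the difference between the weighted mean effect of the two groups, so a negative coefficient for secondary school reads as a smaller effect than in primary school. With only one moderator, the omnibus test of the coefficients coincides with the test of this single slope.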

Publication bias
A common concern in meta-analyses is that the studies included could be a non-random subset of all studies performed, especially since studies that report significant effects are more likely to be published than studies that do not. This in turn raises concerns that the mean effect size of any given meta-analysis could be larger than the mean effect size in all studies that were actually performed (Borenstein, 2019). One indication of such publication bias is that the effect sizes in a meta-analysis are not evenly distributed around the mean and that there is a relative lack of studies with small samples and null or negative effects. To test for publication bias, the Egger regression test and Duval and Tweedie's trim-and-fill procedure were used. In addition, we included study quality characteristics in our moderator analyses.
The Egger regression test yields significant p-values for cognition (p = .004) and motivation (p = .002), which provides evidence of a small-study effect. This could reflect the fact that the effect size is larger in smaller studies, or it could reflect publication bias. The Egger regression test for behaviour was not significant (p = .284). For cognition, the study quality characteristics model (Table 3) accounted for 36% of the heterogeneity with a significant omnibus test (p = .001). Studies published in peer-reviewed journals seem to have a significantly lower effect size than studies in non-peer-reviewed journals (g = −1.216, p = .001, while accounting for the other model variables). For motivation, this model was not significant (p = .855).
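The logic of the Egger regression test can be sketched as follows. This is the classical formulation, in which the standardised effect (g / SE) is regressed on precision (1 / SE) and the intercept captures funnel-plot asymmetry; the software used for the analyses above may implement a variant, and the function name and data are hypothetical.

```python
import math


def egger_test(effects, std_errors):
    """Classical Egger regression: (g / SE) regressed on (1 / SE).

    An intercept far from zero indicates a small-study effect. The
    returned t-value is compared against a t distribution with k - 2 df.
    """
    k = len(effects)
    x = [1.0 / se for se in std_errors]                  # precision
    y = [g / se for g, se in zip(effects, std_errors)]   # standardised effect
    x_bar = sum(x) / k
    y_bar = sum(y) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df = k - 2
    se_intercept = math.sqrt(rss / df * (1.0 / k + x_bar ** 2 / sxx))
    # t is undefined when the regression fits perfectly (zero residuals).
    t = intercept / se_intercept if se_intercept > 0 else None
    return {"intercept": intercept, "slope": slope,
            "se_intercept": se_intercept, "t": t, "df": df}


# Hypothetical data in which less precise studies report larger effects.
asymmetry = egger_test(
    effects=[0.35, 0.65, 0.70, 1.10, 1.15],
    std_errors=[0.10, 0.20, 0.30, 0.40, 0.50],
)
```

If effect sizes were unrelated to study precision, the regression line would pass through the origin and the intercept would be close to zero. A significant intercept indicates a small-study effect, which, as noted above, may or may not stem from publication bias.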
Duval and Tweedie's trim-and-fill procedure suggests that four studies may be missing from the meta-analysis on motivation (see Figure 2; missing studies are indicated with white dots) and one study from the meta-analysis on behaviour, whereas no additional studies were imputed for cognition. The observed weighted mean for motivation is 0.51 and the adjusted weighted mean, after imputing these missing studies, is 0.43. These findings indicate that the effect would still be moderate even if publication bias has shifted the effect size upwards, and that the basic conclusion (that games have a motivating effect compared to traditional teaching) remains unchanged. A similar conclusion can be drawn for the meta-analysis on behaviour: the observed weighted mean is 0.93 and the adjusted mean, after imputing the missing study, is 0.92.

Additional descriptive analyses
Given the limited number of studies included in the moderator analyses, we also descriptively reviewed the results regarding our moderators by comparing the effect sizes of studies with different numbers of students with certain background characteristics. In doing so, we aimed to see whether preliminary patterns emerged from the data that could not be detected in the moderator analyses. Although we cannot draw strong conclusions from these descriptive findings, they may point to directions for future research.

Socioeconomic status (SES)
In total, only nine studies reported socioeconomic background information on their sample (see Table 5), yet none investigated whether their game intervention had a different effect for students with different SES backgrounds. Out of those nine studies, two included only high SES students, with the socioeconomic indicators private school (Yallihep & Kutlu, 2020) and high parental education level (Núñez Castellar, All, de Marez, & Van Looy, 2015), whereas two studies included only low SES students (Chang et al., 2015; Khan, Ahmad, & Malik, 2017) in their sample. For Chang et al. (2015), the SES indicator geographic level was used, whereas the data of Khan et al. (2017) were collected in a low-cost private school in Pakistan and coded as a low SES sample based on the material and resources level (see Note 1). The remaining five studies included a heterogeneous sample, with the proportion of low SES students ranging from 34% (Star et al., 2014) to 95% (Ronelus, 2016). Among those nine studies, two investigated motivational effects only (Atwood-Blaine, 2016; Ke, 2008), whereas the other seven studies measured either cognitive or multiple outcomes. In total, seven studies looked into cognitive effects, six studies into motivational effects, and one study measured behavioural effects of game interventions.
Even though the moderator analyses for SES failed to reach significance due to the low sample size, there seems to be a tendency towards a lower effect size (g = −.52) for the effects of games on cognition for low SES students compared to high SES students. A closer look at the cognitive outcomes of those studies suggests differences in effectiveness, with effect sizes ranging from g = −.12 to g = 1.98 (displayed in Figure 3). The four studies that included only high or only low SES students, in particular, seem to differ in the effectiveness of the intervention. Whereas the two studies with high SES samples report high effect sizes of g = 1.98 and g = 0.71 with confidence intervals in the positive range, the two studies with low SES samples report small effect sizes of g = 0.09 and g = 0.06 with confidence intervals covering negative as well as positive values. This observation can also be made about the data in Table 3. Hence, it may be that games are less effective for improving cognitive outcomes in low SES students, but more research is needed to test this assumption. Furthermore, the findings do not appear to demonstrate any clear pattern between SES and motivational outcomes (see forest plot in Appendix C), suggesting that games may not differ in motivational effectiveness for students with varying SES backgrounds.

Migration background
Four studies reported information on the minority background of students (see Table 6). One study was conducted in a school with a student body consisting of minority students only (Ronelus, 2016), whereas for the remaining studies the proportion ranges between 40% (Ke & Clark, 2020; Star et al., 2014) and 75% (Anderson & Barnett, 2013). All four studies investigated cognition, two additionally assessed motivation, and one included behavioural outcome measures. The moderator analysis was not significant (p = .320) for the outcome variable cognition. For descriptive purposes, the effect sizes are displayed in the forest plot in Figure 4. The effect size for cognitive outcomes ranges from g = −.12 to g = 1.56. Interestingly, the study with the highest effect size had a sample consisting of only minority students (Ronelus, 2016), whereas the study with the lowest effect size had the lowest proportion of minority students (Star et al., 2014). For motivation, no moderator analysis could be conducted due to the low number of studies. Additionally, when comparing the effect sizes in the forest plot in Appendix C, there does not seem to be any difference in the motivational outcomes for minority students in particular. In general, there does not seem to be a difference in the effectiveness of game-based learning for STEM education for this moderator.

Special educational needs and giftedness
Nine studies included information on the special needs status of their sample (see Table 7). In the moderator analyses, we differentiated between SEN learning and giftedness.
Eight studies included information about the presence of students with SEN in their sample, three of which reported that there were no students with SEN, and five studies included SEN students varying from 9% to 100% of the sample. Eight studies looked into the cognitive effects, four into motivational effects and one into behavioural effects of game-based learning.
Looking specifically into the cognitive effects for SEN students, the effect sizes range from g = .34 to g = 1.43, as shown in Figures 5 and 6. The highest effect size was reached with an intervention aimed at students with dyscalculia (De Castro et al., 2014), which is also the only study that included a homogeneous student sample. The moderator analyses for both SEN learning and SEN gifted were not significant. Of the five studies that reported on the number of gifted students, only two actually included gifted students: one included 25% gifted students and one had a sample consisting only of gifted students (100%). For gifted students, the effect sizes of the two studies are g = −.38 and g = .67. For these cases, it is important to point out that the sample with 100% gifted students (Chee & Tan, 2012) seemed to achieve better learning outcomes than the sample with 25% gifted students (Long & Aleven, 2017). No difference in effectiveness can be observed when looking at the forest plot in Figure 6. Consequently, there do not seem to be noteworthy differences in the effectiveness on cognition between students with special educational needs, gifted students, and other students, even though some might seem to benefit slightly more from the intervention.
Finally, looking into the motivational outcomes of game-based learning, also in this case there does not seem to be any particular difference between the effectiveness for special needs and gifted students. In other words, game-based learning seems to be equally motivating for all student groups.

Discussion
Game-based learning is increasingly implemented in primary and secondary STEM education. Even though previous meta-analyses have indicated the overall effectiveness of games for improving students' cognition, motivation, and behaviour (Lamb et al., 2018; Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013), thus far little is known about whether games are equally effective for different groups of students, and thus whether some groups benefit more than others. Therefore, the aim of this meta-analysis was to gain insight into whether games have different effects for students with different background characteristics (gender, age, SES, migration background, and SEN) in early STEM education. To address this aim, we first conducted three random-effects meta-analyses to compute the weighted mean effect, before moving on to mixed-effects moderator analyses and complementary descriptive analyses. In short, our meta-analyses revealed that, overall, game-based learning has positive effects on students' cognition, motivation, and behaviour, but also suggested some differences based on certain background characteristics of students, whereas other characteristics did not have any moderating effect.

Overall findings
In line with the findings of other meta-analyses in STEM education, which reported effect sizes ranging between 0.29 and 0.67 (Lamb et al., 2018; Riopel et al., 2019; Wouters et al., 2013), our results show that, overall, game-based learning in STEM education has beneficial effects on students' cognitive outcomes (g = .67), their motivation (g = .51), and their behaviour (g = .93), compared to traditional classrooms. Yet, for all three meta-analyses we found significant heterogeneity between effect sizes, suggesting that the findings differed substantially across studies. This heterogeneity was not caused by sampling error, and the prediction intervals indicated substantial dispersion of effects. Thus, although the overall effect was positive, in some studies students in the game intervention actually had more negative outcomes than students in traditional classrooms. This raised the question of the potential differential role of students' background characteristics.

Moderator analyses
Unfortunately, many studies failed to report sufficient information about students' background characteristics to include them in the moderator analyses. While most studies reported students' age (school level) and the gender distribution of the sample, only a few studies reported on the other background characteristics. Moreover, none of the studies included in this meta-analysis explicitly tested for differences in effects based on students' background. This complicated and limited the possibilities for performing moderator analyses, especially for the outcome category behaviour, for which the total number of included studies was too low to perform any moderator analyses. For cognition and motivation, a sufficient number of studies to draw solid conclusions was available only for the moderators age and gender. Given the small number of studies reporting on students' SES, migration background, or SEN, the findings of these moderator analyses need to be considered preliminary and interpreted with caution. Therefore, we combined these moderator analyses with a more descriptive review of the findings to provide some first indications regarding potential directions of moderating effects. Below, we discuss the findings for each moderator.
Age. The moderator analyses for students' school level (as an indication of their age) included a sufficient number of studies to be able to draw conclusions. Our results indicate that games are less beneficial for secondary school students compared to primary school students with regard to cognition (g = −.67) and motivation (g = −.37) for STEM. In other words, primary school students in particular seem to benefit from the use of STEM games compared to traditional classroom settings. These results contrast with previous findings that indicated that game-based learning is equally effective for all age groups (Tokac et al., 2018) or less beneficial for younger students (Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013). None of the previous reviews specifically compared primary with secondary school students; in most cases, primary and secondary school students were combined and compared with adults in higher education. However, since previous research indicated that motivation for STEM subjects already decreases from upper elementary school onwards (Renninger et al., 2015; Shin et al., 2019), it may be that comparing primary with secondary school students yields a differential effect that could be explained by the general decrease of interest that is common among secondary school students. Hence, these findings indicate that interventions using games to enhance cognition and motivation may be especially effective when implemented early.
Gender. Gender on the other hand had no significant moderating effect on cognition (p = .37) and motivation (p = .16), meaning there is no recordable difference in the effectiveness of game-based learning for boys and girls. Those results align with the result of Vogel et al. (2006), showing no significant difference in the effectiveness of game-based learning. This means, even though boys and girls have different preferences for game types (Homer et al., 2012) and play them in different frequencies (Hygen et al., 2019), they equally benefit from game-based learning interventions in STEM subjects.
SES. Similar to what Scherer and Siddiq (2019) found when examining the effect of ICT use, students with low socioeconomic backgrounds may learn less when game-based learning is used compared to a traditional classroom setting. Even though the moderator analysis for SES failed to reach significance, there seems to be a tendency of a lower effect size (g = −.52) for the effects of games on cognition for low SES students compared to high SES students that could also be observed in the descriptive analysis. For motivation on the other hand, no difference in the effectiveness of game-based learning was found, indicating that games can be equally motivating for low and high SES students. One possible explanation could be the lack of home resources of low SES students, which would mean that they are less accustomed to the use of technology in general. The game-based learning intervention could be perceived as fun and exciting, raising the motivation for the subject, but due to the lack of ICT skills compared to high SES students, cognitive overload could interfere with successfully learning from these interventions. This aligns with the results of Hanghøj et al. (2018), where at-risk students were motivated through game-based learning in school, but did not learn more through the intervention.
Migration background. Previous research on ICT literacy indicated that minority students report lower ICT skills compared to majority students (Volman et al., 2005). Therefore, we expected to find a moderating effect for game-based learning interventions. The moderator analyses, however, were not significant and no differences in the effectiveness were found in the descriptive analysis. Further research is needed to investigate possible differential effects for this moderator, since we could only include four studies in this meta-analysis that reported information on minority background students.
SEN and Giftedness. The moderator analyses for both SEN learning and SEN gifted were not significant, and no noteworthy differences in effectiveness were found through the descriptive analysis. Games have been specifically designed and used as a treatment for students with special educational needs and have been shown to be effective (Ke & Moon, 2018; Lu et al., 2018; Ronimus et al., 2019). However, none of these studies looked into the effects of game-based learning interventions compared to traditional teaching, raising the question of whether similar results could also be seen in these interventions. On the other hand, we suspected that gifted students might not be challenged enough in game-based learning interventions, leading to under-stimulation that could negatively influence the observed outcomes. Our results are preliminary, but it seems that game-based learning does not increase pre-existing inequalities and that both SEN and gifted students benefit from game-based learning interventions as much as students without SEN.
These results have several practical implications. First, games are a suitable tool to motivate all students for a STEM subject and might therefore be a good way to also reach disengaged students in a classroom. However, since not every student seems to be able to learn well with game-based learning, teachers have to be careful about how they design their lessons, taking into consideration that some students might need additional support to be able to benefit from using games. Furthermore, future research could investigate whether additional instruction or the use of dynamic difficulty adjustment (Van Oostendorp et al., 2014) within a game can counteract the disadvantages that low SES students seem to be facing. Although the number of included studies in our meta-analysis was rather small, our results show that it is important to take the different characteristics of students (particularly age and SES) into consideration when planning a game-based learning intervention, to avoid contributing to pre-existing differences.

Limitations
The majority of the included studies did not report sufficient demographic information on their sample, apart from the information reported for age and gender. Therefore, some of the moderator analyses could not be conducted or were performed with low numbers of studies. One approach to deal with such missing data would be to manually insert estimates (Borenstein et al., 2009), but this had to be discarded in our case as the number of studies with missing information was too high. Another approach would be to use a random forests algorithm for meta-analyses to explore heterogeneity and to identify relevant moderators (Van Lissa, 2020). Again, this approach was not feasible for our sample, as the amount of missing data was simply too high to draw any conclusions. Therefore, the results of the moderator analyses on the other background characteristics and the additional descriptive analyses can only be seen as a first indication of potential differential effects or the lack thereof.
One general difficulty of conducting a meta-analysis is finding all available studies that fit the inclusion criteria. By searching several multidisciplinary databases we tried to find all available papers, but we cannot fully exclude the possibility that the choice of databases has limited the number of studies that were found. Moreover, our exclusion criteria might have limited the number of available studies. One example is the exclusion of non-English articles, which might have excluded relevant papers in other languages. Another example is the exclusion of articles that investigated a different age group. It could be of interest to investigate the possible differential effects of game-based learning in other age groups and settings as well, as games that are used in, for example, employee training might show a similar differential effect depending on the background of the employee. We adopted a broad understanding of games and did not take a media-comparison approach to distinguish between, for example, paper-based games and virtual game environments, which might yield different effects. Additionally, we did not look into specific game features that could have an effect on the outcome variables and might moderate the effects, as this was beyond the scope of the present meta-analysis.

Implications for theory, research and practice
The findings of this meta-analysis suggest several implications for theory, future research, and practice.
Implications for theory. Game-based learning is a multidisciplinary field and, as such, research on game-based learning is very diverse in terms of the underlying theories. Many studies on game-based learning adopt theories from other fields, for example, theories on multimedia learning, motivation, and, to a lesser extent, instructional design. However, theories on inclusive education, for example, the universal design for learning approach (UDL; CAST, 2018), have, to our knowledge, not been incorporated in studies on games. These theories can provide insights into how to make games suitable for diverse student groups. Incorporating such theories in intervention studies and game design could help to make games optimally suitable for different student groups and maximise the beneficial effects.
Implications for future research. In this meta-analysis, it was not possible to investigate possible interaction effects of students' background characteristics with certain game features due to our limited sample size. Nevertheless, it would be relevant to gain more insight into the effectiveness of game-based learning for different student groups. To allow future meta-analyses to look into these types of interactions, future empirical studies need to report their sample characteristics in more detail. Moreover, future meta-analyses could investigate the differential effect of games not only in STEM education but also in other domains. This could lead to a larger sample size and could allow additional moderator analyses to investigate possible interaction effects between certain game mechanics and player characteristics.
In addition, empirical studies are needed which explicitly compare the effects of games between student groups. Even more, games can strongly differ from one another and they may also elicit different effects in different disciplines. There may be interactions between game features, disciplines, and student characteristics (e.g., certain games may be particularly effective for certain groups in a certain domain). Empirical studies are needed to address such interactions, to provide more insights into which types of games, in which disciplines, work best for which students. Moreover, future research could also examine the factors that may underlie different effectiveness for different groups, such as poor self-regulation skills or differences in experiences with games. Gaining insight into the exact causes of differences can help to overcome these differences in effectiveness by adjusting the game design or game instructions.
Implications for practice. Overall, our findings confirmed the findings of previous meta-analyses (Lamb et al., 2018; Riopel et al., 2019; Vogel et al., 2006; Wouters et al., 2013; Zainuddin et al., 2020) and suggested that games can be a helpful tool to improve students' learning outcomes, motivation, and behaviour. However, our findings also warrant some caution. That is, despite their inconclusiveness, the findings do seem to suggest that the effects of games can differ between student groups. This means that educators need to be aware that games that have been shown to be effective in general may not be as effective for their specific population, or that the effects may vary between students. Hence, educators need to carefully consider whether and which games to implement for their student population, and they need to monitor continuously whether the games yield the desired effects and whether there are students who do not benefit from the game in terms of their cognition, motivation, or behaviour. In these cases, additional instruction (i.e., pretraining) or an alternative method may be required.
More specifically, the findings suggested that games may be less effective for low SES students than for their higher SES counterparts. Hence, implementing games in diverse classrooms may exacerbate existing gaps between students with different socio-economic backgrounds and needs to be considered carefully. Moreover, our findings show that game-based learning in STEM education is effective in general, but that younger students in particular seem to benefit from the use of games. This means that games should be implemented early on to be most effective, particularly to improve students' motivation for STEM subjects, which might help to counteract the general decrease of interest in these subjects later on.

Conclusion
The present study quantitatively summarised the results of 39 studies about the effects of games on STEM learning, motivation, and behaviour compared to conventional classroom settings in K1-8. In general, it can be concluded that students in the game interventions achieve significantly higher learning outcomes, report more motivation, and show more behavioural change compared to students in traditional classrooms. Students in primary education (K1-6) in particular seem to benefit from game interventions in STEM education. The high heterogeneity between effect sizes in our sample, however, indicates that the implementation of games as additional teaching tools should be planned with caution, as not every student benefits equally from game-based learning. Our study provides some preliminary support for the assumption that students with a low socioeconomic background might learn less with game-based learning than students with a higher socioeconomic status, whereas gender, migration background, and special educational needs did not seem to have any differential effect. More empirical research is needed that reports and investigates the effectiveness of game-based learning for heterogeneous student groups. Future meta-analyses will then be able to provide more substantiated results.

Note
1. Private schools in Pakistan are divided between low-cost and elite schools and make up two thirds of all available middle schools (Asian Development Bank, 2019). Even though those schools are labelled private, one quarter is missing books (UNESCO, 2007). Compared to public schools they still have better resources such as running water and electricity, but following our definition of low SES the low-cost private school was coded as such.