THE RELATIONSHIP BETWEEN L2 LISTENING AND METACOGNITIVE AWARENESS ACROSS LISTENING TESTS AND LEARNER SAMPLES

ABSTRACT To better understand the relationship between second-language (L2) listening comprehension and metacognitive awareness, we examined the moderating effects of listening tests and learner samples while focusing on aspects of metacognitive awareness. Students of English-as-a-foreign-language at a Japanese university (n = 75; the 2019 cohort) took the Test of English as a Foreign Language Institutional Testing Program (TOEFL ITP®) test, a paper-based TOEFL; the TOEFL Internet-based test (TOEFL iBT®); and the Metacognitive Awareness Listening Questionnaire (MALQ). Another group of students (n = 107; the 2020 cohort) took the TOEFL ITP and MALQ. Random forest analysis was applied to the results of the 2019 cohort, showing that, in order of importance, person knowledge, mental translation, and directed attention were related to listening comprehension in both listening tests. Problem solving was not related in either listening test. Further, planning and evaluation strategies were related to listening comprehension only in the TOEFL ITP. Comparison between the TOEFL ITP results of the 2019 and 2020 cohorts showed that only person knowledge was related to listening comprehension across the two cohorts, indicating a strong generalizability of person knowledge and weak generalizability of the remaining metacognitive strategies across learners. Implications and future directions are discussed.


Introduction
Successful second-language (L2) listening comprehension requires complex cognitive processes that involve various variables (e.g., Buck, 1991; Vandergrift & Goh, 2012). Consequently, research has been conducted on how listening is directly related to these variables (see a meta-analysis by In'nami et al., 2021; Karalık & Merç, 2019) and how such relationships are moderated (Wallace, 2020). The present study adds to this corpus of research by focusing on aspects of metacognitive awareness and by assessing the moderating effects of listening comprehension tests (which differ in task types, task demands, and paper-based vs. computer-based formats) and samples of learners (2019 and 2020 cohorts) on the relationship between metacognitive awareness and listening comprehension. Researchers have found a positive relationship between L2 listening comprehension and metacognitive awareness as a whole. However, it has not been clear whether and how such a relationship changes according to aspects of metacognitive awareness. Although a variety of listening tests exist, it is not clear how they affect the relationship between listening and metacognitive awareness. While a comparison of results across learners supports the generalizability of findings, research in this area remains sparse. Investigating these issues would further enhance our understanding of how L2 aural information is comprehended.
CONTACT Yo In'nami innami@tamacc.chuou.ac.jp Division of English Language Education, Faculty of Science and Engineering, Chuo University. This article has been republished with minor changes. These changes do not impact the academic content of the article.
Note (Table 1). MC = multiple-choice task. OE = open-ended task. BEC = Business English Certificate by Cambridge English. All learners are L2 English learners, except for Vandergrift and Baker (2015), which targeted L2 French learners. a Comprising multiple-choice (including matching) and open-ended questions (e.g., plan, map, diagram labeling; form, note, table, flow-chart, summary completion; sentence completion; and short-answer questions; https://ielts.idp.com/prepare/article-question-types-listening). b Comprising fill-in-the-gap and multiple-choice questions. c Comprising multiple-choice and dictation questions (Jin, 2010). d Test takers are given time to read the questions. e Evolutionary algorithm-based symbolic regression.
For example, Goh and Hu (2014) found that (a) person knowledge and (b) problem solving were significant predictors of listening comprehension in multiple regression. Vafaee and Suzuki (2020) found that mental translation and problem solving did not load significantly on the metacognitive awareness factor and were removed from the model. The three metacognitive aspects retained in the model, that is, (a) person knowledge, (c) planning and evaluation, and (d) directed attention, significantly predicted listening comprehension. Aryadoust (2015a) found that (b) problem solving and (c) planning and evaluation strategies were significant predictors of listening comprehension in an evolutionary algorithm-based symbolic regression. Aryadoust (2015b) found that all strategies were significant predictors of listening comprehension in an artificial neural network analysis. Vandergrift and Baker (2015) reported that (a) person knowledge was the only significant factor affecting listening when the three learner groups (2008, 2009, and 2010 cohorts) were analyzed together, whereas (b) problem solving or (d) directed attention was a significant factor affecting listening when the data of one of the three cohorts were analyzed separately. Table 1 suggests that the relationship between L2 listening and metacognitive awareness varies across contexts and that some moderator variables affect this relationship. We focused on the moderating effects of two variables: listening tests and learner samples.

Effects of listening tests on the relationship between L2 listening comprehension and metacognitive awareness
It is important to investigate the effects of listening tests on the relationship between listening comprehension and metacognitive awareness. While listening tests differ in many respects, we review three differences that are related to the current study. First, constructs measured and task types may lead to different task demands and elicit different listening and test-taking strategies and processes, eventually affecting the relationship between metacognitive awareness and listening. Previous studies have examined learners' test-taking processes or strategies in the context of listening tests that target different aspects of listening ability and include different tasks. For example, Wu (1998) used a verbal protocol method to elicit the cognitive processes involved in answering paper-based, multiple-choice test items measuring literal and inferential understanding of passages, and found that when learners failed to understand the passage, they used non-linguistic general knowledge to compensate for their lack of understanding. Taguchi (2001) used a questionnaire to measure strategies employed during a listening test that seemed to assess the literal understanding of passages. It was found that all the strategies (repair, affective, compensatory, and linguistic) were used by learners who attempted a paper-based listening test containing questions with two options (yes-no). Carrell (2007) investigated how notetaking strategies affected the scores in a computer-based, multiple-choice L2 listening test assessing literal and inferential understanding of passages. The results showed a moderate relationship between notetaking strategies and listening scores, wherein the number of words that directly corresponded with the answers and the number of content words strongly correlated with the listening scores (r = .63 for answers; r = .49 for content words).
Further, using eye-tracking and cued retrospective reporting, Suvorov (2018) examined the aspects of visual information that examinees paid attention to in a video-mediated, multiple-choice L2 listening test assessing literal and inferential understanding of passages. The results showed that learners focused on speaker-related (e.g., the speaker's body movements) and lecture-related (e.g., visual aids) information that helped them understand the passage. However, the information was also distracting if it was not explicitly related to the point the speaker was trying to make. These studies focused on the listening strategies or processes involved in answering multiple-choice questions, although research has also suggested that open-ended questions elicit different processes (Buck, 1991). In the context of the relationship between metacognitive awareness and L2 listening, Table 1 shows that previous studies used either multiple-choice questions (Vandergrift & Baker, 2015) or a combination of multiple-choice and short-answer questions (the other studies in Table 1).

Second, in addition to constructs measured and task types, another factor that could impact the relationship between metacognitive awareness and L2 listening is the availability of question preview opportunities. For example, IELTS allows examinees to preview questions before listening to a passage and answering the questions (British Council, 2021). Existing literature has documented that such a question preview positively affects listening results (Sherman, 1997). Table 1 shows that Goh and Hu (2014), Aryadoust (2015a, 2015b), Vafaee and Suzuki (2020), and Ghorbani Nejad and Farvardin (2019) used tests that offered the learners an opportunity to preview questions. The positive effect of previewing on the relationship between L2 listening and metacognitive awareness may be attributed to the fact that previewing helps learners to understand the test and prepare for corresponding problem-solving activities (see also Field, 2019).
Finally, the third factor is the delivery mode of tests. With the advancement of Information and Communication Technology, it has become conventional to deliver tests on computers (e.g., Douglas & Hegelheimer, 2007). Coniam (2006) compared scores across the two modes (i.e., computer-based and paper-based) of an L2 listening test and reported that learners performed better on the computer-based test than on the paper-based test overall. Table 1 shows that all previous studies have used paper-based tests. While differences in test-taking processes have been examined across reading and writing test formats (e.g., Brunfaut et al., 2018; Sawaki, 2001), few studies have investigated listening. It can be assumed that listening scores and processes may not change as long as learners are comfortable using a computer, although this would be worth examining in future studies.
We have reviewed three factors that may be involved in understanding the relationship between metacognitive awareness and L2 listening: (a) constructs measured and task types, (b) the availability of question preview opportunities, and (c) the delivery mode of tests. The current study expands on previous studies by investigating these three factors across two TOEFL formats: the TOEFL ITP and the TOEFL iBT.

Test of English as a Foreign Language Institutional Testing Program (TOEFL ITP®) test and TOEFL Internet-based Test (TOEFL iBT®)
The two formats of TOEFL tests share similarities and differences. First, regarding test constructs and task types, both use multiple-choice task types and play the listening passages only once. However, for the construct measured, the TOEFL ITP focuses on "the ability to understand spoken English . . . used in colleges and universities" (Educational Testing Service, 2021c). While the TOEFL iBT also measures this construct, it includes broader but more precisely defined constructs: "listening for: basic comprehension" and "pragmatic understanding (speaker's attitude and degree of certainty) and connecting and synthesizing information" (Educational Testing Service, 2021d). Although both tests share the same target language use domain (i.e., listening comprehension of the English language used in tertiary education), the authenticity and demand of tasks in the TOEFL iBT are higher so as to represent the construct and the academic domain more fully. As summarized in Table 2, the TOEFL iBT administers fewer questions over a longer period of time, whereas the TOEFL ITP administers a larger number of questions in a shorter period of time. It can be assumed that using longer passages and fewer items reflects the need to measure pragmatic, connecting, and synthesizing abilities. Second, in terms of question preview, in the paper-based TOEFL ITP, multiple-choice options are presented in the test book before, while, and after the passage is played, whereas the questions are aurally presented (and not written in the test book) after the passage is played. While examinees can consider the options beforehand, comprehending them before and/or while the passage is played is not an easy task because of limited time and cognitive resources. In contrast, in the TOEFL iBT, multiple-choice options and questions are presented on the screen after the passage is played.
To minimize the risk of losing track of the passage, examinees can strategically take notes to structure the content of the passage.
Third, the availability of the TOEFL iBT in a digital format compared to the paper-based TOEFL ITP may increase task demands unless learners are familiar with using computers.
Consequently, all these characteristics could make the TOEFL iBT more challenging than the TOEFL ITP. As little research has been conducted, it is not clear what aspects of metacognitive awareness of listening, as measured in the MALQ, are related to comprehension in each test.

Effects of learner samples on the relationship between L2 listening comprehension and metacognitive awareness
Another factor that affects the relationship between metacognitive awareness and L2 listening comprehension is learner samples. This is important because it bears on generalizability, or "the extent to which research results can justifiably be applied to a situation beyond the research setting" (Chalhoub-Deville, 2006, p. 3). As reviewed above, findings across studies have suggested that metacognitive awareness is positively related to listening comprehension. However, how each aspect of metacognitive awareness measured by the MALQ affects this relationship remains inconclusive. Consequently, while the existing literature suggests a strong generalizability of the relationship between L2 listening and metacognitive awareness as a whole, it suggests a weak generalizability of the relationship between L2 listening and each aspect of metacognitive awareness.
Previous studies have differed in many respects. As seen in Table 1, participants' proficiency levels and learning settings were different. For example, Goh and Hu (2014) and Vafaee and Suzuki (2020) both used the IELTS listening test in EFL contexts, but their participants' L1s (Chinese and Persian, respectively) and proficiency levels differed (slightly higher but less varied in Goh & Hu, 2014 [M = 24.58 out of 40; SD = 5.47] than in Vafaee & Suzuki, 2020 [M = 19.86 out of 40; SD = 8.93]). Therefore, it is challenging to identify the factor that is responsible for producing mixed results across studies regarding the importance of each aspect of metacognitive awareness. This can be addressed by examining two groups of similar learners. Ercikan (2009) argued that participants, construct definitions, measurements, and settings affect the generalizability of study findings. Consequently, if these conditions are similar, similar results are likely to be obtained. However, Vandergrift and Baker (2015) assessed three cohorts with the same characteristics and reported different results. For one cohort, the directed attention aspect of metacognitive awareness was significantly related to listening, whereas for another cohort, problem solving was significant. When all the cohorts were combined, person knowledge emerged as significant. This suggests the effects of learner samples on the relationship between L2 listening comprehension and metacognitive awareness. The current study further examines this issue by assessing two cohorts and their responses to the same research instruments (i.e., the same definitions and measures of listening comprehension and metacognitive awareness) at the same university (i.e., the same settings). The results would provide further evidence of the stability of the relationship between L2 listening comprehension and metacognitive awareness.

Random forests
To examine the research questions as reported below, we used random forests. According to Strobl, Malley et al. (2009), the random forest approach is an example of a recursive partitioning method for examining the relationships among variables. Variables can be categorical or continuous, and normal or nonnormal. Linear and nonlinear relationships among variables can be modeled. To obtain precise and stable results, the random forest uses bootstrapping. This means that a sample of data is drawn randomly and is used to examine the relationships between variables. This process is repeated, and the results are averaged across runs. For example, in the current study (see below for details), we examined how five metacognitive strategies were related to L2 listening comprehension. It could be, for instance, that Strategy A is related to comprehension in one bootstrap sample, Strategies B and C in another, and Strategy D in a third. The results are then tallied and averaged across runs to obtain accurate and stable findings.
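The resample-and-average logic described above can be sketched in a few lines. The study itself used the party package in R; the following stdlib-only Python analog (with hypothetical function names) simply estimates the correlation between a strategy score and a listening score on repeated bootstrap samples and averages the estimates:

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation of two equal-length score lists.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:  # degenerate resample with no variance
        return 0.0
    return cov / (sx * sy)

def bootstrap_correlation(xs, ys, runs=1000, seed=1):
    # Draw samples with replacement, estimate on each, average across runs.
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(runs):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.fmean(estimates)
```

Random forests grow a decision tree on each bootstrap sample rather than computing a correlation, but the resampling-and-averaging principle is the same.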
Compared to the random forest method, multiple regression, which is often used in L2 studies (Plonsky & Ghanbar, 2018), typically examines a linear relationship among normally distributed, categorical, or continuous variables by analyzing the whole data set once (without bootstrapping). Thus, random forests allow researchers to take a more flexible approach to analyzing data while still obtaining precise and stable results across resampled data sets.
Further, unlike regression, the order in which variables are entered into the model does not matter in random forests. We believe this is important because the way independent variables are entered into a regression affects the results (see, for example, Tabachnick & Fidell, 2014) but is not always reported. The issue is more serious when researchers have a large number of independent variables to model, as it is not always easy to decide which variable to enter into the equation first. In random forests, all variables enter the analysis simultaneously, and their importance can be interpreted more easily than in regression. Collectively, these features of the random forest method make it an appealing approach to modeling relationships among variables, including those in the current study.

Research questions
Our review has shown two primary gaps in previous studies. First, researchers have found a positive relationship between L2 listening comprehension and metacognitive awareness as a whole. However, it has not been clear whether and how such a relationship changes according to aspects of metacognitive awareness. Second, it remains to be examined whether and how listening tests and learners moderate the relationship. To fill these gaps, the current study examines the relationship between L2 listening and metacognitive awareness while focusing on aspects of metacognitive awareness and considering the moderating effects of listening tests (TOEFL ITP vs. iBT) and learner samples (two cohorts), neither of which has been rigorously compared in previous studies. The following two research questions were addressed:

1. How is the relationship between L2 listening comprehension and metacognitive awareness affected by different listening tests?

2. How is the relationship affected by different samples of learners?
The study is outlined in Table 3, with further information described in the Method section.

Participants
Two groups of learners participated in the current study. The first group (the 2019 cohort) consisted of 75 English-as-a-foreign-language students majoring in medicine at a Japanese private university. They were 1st-year undergraduate students (aged 18 or above) enrolled in a preparation course for the TOEFL test. Their listening proficiency ranged from A2 to C1 (B1 on average) on the Common European Framework of Reference, based on their TOEFL scores (see Table 4) and Educational Testing Service (2021a, 2021b). They had learned English for at least six years in secondary school before enrolling in the university. According to their instructors, the students routinely used computers inside and outside the classroom, so their computer skills would not affect their performance on the computer-based tests, although they had had little experience taking computer-based tests.
The other learner group (the 2020 cohort) included 107 students from the same Japanese private university with the same major (medicine). They enrolled in the school in April 2020, a year after the 2019 cohort. The two cohorts differed only in the timing of their enrollment. According to the instructors who taught both cohorts, they had much in common in terms of academic achievement and did not demonstrate any noticeable differences. Further, no major changes in Japanese educational policy were implemented around this time. Thus, both cohorts were considered very similar, with three minor exceptions. First, the 2019 cohort took the TOEFL ITP and iBT, whereas the 2020 cohort took only the TOEFL ITP. Second, for the 2020 cohort, all courses, including the TOEFL preparation course, were delivered online due to the coronavirus outbreak, whereas the 2019 cohort studied in the classroom. Third, the 2019 cohort took the TOEFL ITP twice in the academic year (once in April and once in December), and their scores from December were analyzed. The 2020 cohort took the TOEFL ITP once, in September, as the April administration could not be held in the midst of the pandemic; their scores from September were analyzed. These three differences between the two cohorts arose due to logistical constraints that were beyond the researchers' control.

Instruments and procedures
One or two measures of listening proficiency and one measure of metacognitive awareness were administered. For the former, the 2019 cohort took the paper-based TOEFL ITP Level 1 and the TOEFL iBT. Both tests were conducted annually as part of the quality assurance effort of the university. The TOEFL ITP was administered in December 2019. As for the TOEFL iBT, participants were required to take it between September and December 2019 and submit their scores to the department in December 2019. More than 95% of them submitted scores from the November administration; thus, the impact of the timing of taking the test was considered minimal. The 2020 cohort attempted the TOEFL ITP Level 1 in September 2020. As the current study focused on listening, participants' listening section scores were used. Metacognitive awareness of L2 listening was measured using the MALQ (Vandergrift et al., 2006). Following the MALQ Scoring and Interpretation Guide (n.d.), Goh (2008), and Wallace (2020), negatively worded items (i.e., item numbers 3, 4, 8, 11, 16, and 18) were reverse coded so that, as with the remaining items, a higher value indicated a more favorable response. Items were presented as they appeared in Vandergrift et al. (2006). The MALQ was administered online between December 2019 and January 2020 for the 2019 cohort and in July 2020 for the 2020 cohort.

Analyses
The data comprised TOEFL ITP listening section scores, TOEFL iBT listening section scores, and MALQ section scores grouped according to the aspects (i.e., subcomponents) of metacognitive awareness. MALQ section scores were the sums of the responses for the items measuring a particular aspect of metacognitive awareness. For example, person knowledge was measured using three items. The responses for these items were summed, and a maximum score of 18 was obtained (6 points × 3 items).
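This scoring step can be made concrete with a short sketch. The reverse-coded item numbers are those listed in the Instruments and Procedures section, but the item-to-section mapping below is an illustrative assumption, and the arithmetic assumes the MALQ's 6-point response scale (so reversing maps a response r to 7 - r):

```python
# Reverse-code negatively worded MALQ items and sum a section score.
# Items 3, 4, 8, 11, 16, and 18 are the reverse-coded items named in the text;
# the assignment of items to the person-knowledge section is hypothetical.

NEGATIVE_ITEMS = {3, 4, 8, 11, 16, 18}

def reverse_code(item_no, response, scale_max=6):
    # On a 1-to-6 scale, reversing maps 1 -> 6, 2 -> 5, ..., 6 -> 1.
    if item_no in NEGATIVE_ITEMS:
        return scale_max + 1 - response
    return response

def section_score(responses):
    # responses: {item_no: raw 1-to-6 response} for one MALQ section.
    return sum(reverse_code(no, r) for no, r in responses.items())

# Hypothetical person-knowledge responses for one learner (3 items, max 18).
person_knowledge = {3: 2, 8: 1, 15: 6}
```

With three items, the maximum section score is 18 (6 points x 3 items), matching the scoring described above.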
Descriptive statistics were obtained using the psych package (Revelle, 2020) in R. The data were then analyzed using random forests in the party package (Hothorn et al., 2021) in R, following the recommendation of Strobl, Hothorn et al. (2009). The random forests' results were interpreted using variable importance values (which show the relative importance of independent variables in relation to a dependent variable) and partial dependence plots (which show the pattern of such a relationship). Partial dependence plots were drawn using the pdp package (Greenwell, 2018) in R.
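The intuition behind variable importance values can be illustrated with a stdlib-only permutation sketch. This is not the conditional importance computed by the party package; it is a toy analog in which a hand-fit two-predictor linear model stands in for the forest, and importance is the increase in mean squared error after shuffling one predictor at a time:

```python
import random
import statistics

def fit_two_predictor_ols(x1, x2, y):
    # Solve the 2x2 normal equations for y ~ b0 + b1*x1 + b2*x2.
    m1, m2, my = map(statistics.fmean, (x1, x2, y))
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return lambda a, b: b0 + b1 * a + b2 * b

def mse(model, x1, x2, y):
    return statistics.fmean((model(a, b) - t) ** 2
                            for a, b, t in zip(x1, x2, y))

def permutation_importance(model, x1, x2, y, seed=0):
    # Importance = error increase after shuffling one predictor at a time.
    rng = random.Random(seed)
    base = mse(model, x1, x2, y)
    p1 = x1[:]
    rng.shuffle(p1)
    p2 = x2[:]
    rng.shuffle(p2)
    return {"x1": mse(model, p1, x2, y) - base,
            "x2": mse(model, x1, p2, y) - base}
```

When the outcome depends only on x1, shuffling x1 inflates the error substantially while shuffling x2 barely changes it, mirroring how predictors that matter receive large positive importance values.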
Based on Strobl, Hothorn et al. (2009), we selected the party package for two reasons. First, it can address the bias that arises from differences in the number of items for each variable, ranging from person knowledge (6 points × 3 items = 18 points) to problem solving (6 points × 6 items = 36 points). Otherwise, variable importance values would be overestimated in favor of predictor variables with higher maximum scores; in other words, problem solving might have a larger variable importance value than person knowledge simply because it had more items. Second, the party package can adjust the bias in variable importance values that arises when a predictor variable is not correlated with an outcome variable by itself and is correlated only through another predictor variable. Further, we confirmed that the order of the importance of variables stayed the same when different random seed values were used.² Supplementary material is available at https://osf.io/hnvr8/.

Results
The descriptive statistics are presented in Table 4. All instruments (and their subsections) displayed a wide range of values, suggesting a broad distribution of scores. The reliability estimates for the MALQ were acceptable (.756 and .793 for the 2019 and 2020 cohorts, respectively), whereas those for its subsections varied (from .530 for the 2020 planning and evaluation section to .812 for the 2019 person knowledge section). As Figures 1 and 2 show, person knowledge strongly correlated with the two listening tests (r = .59 and .53 for the 2019 cohort; .62 for the 2020 cohort). Patterns of correlations between the variables of metacognitive awareness were similar to those in Vandergrift et al. (2006). For example, there was a moderate relationship between directed attention and problem solving (r = .63 for the 2019 cohort; r = .52 for the 2020 cohort; and .57 in Vandergrift et al., 2006, p. 446). While Figures 1 and 2 are presented for comparison with previous studies, the current study focuses on the results of the random forests for interpretation due to the advantages they offer (see the Random Forests section above).
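The text does not name the reliability coefficient; assuming these are Cronbach's alpha values (the usual internal-consistency estimate for Likert questionnaires), the standard formula alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores) can be computed directly:

```python
import statistics

def cronbach_alpha(item_scores):
    # item_scores: one list of responses per item, respondents in the same order.
    k = len(item_scores)
    totals = [sum(col) for col in zip(*item_scores)]  # per-respondent totals
    item_var = sum(statistics.pvariance(items) for items in item_scores)
    return (k / (k - 1)) * (1 - item_var / statistics.pvariance(totals))
```

Perfectly parallel items yield an alpha of 1.0, and weaker inter-item agreement pulls the estimate down toward values like the .530 reported for one subsection.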
Variable importance scores are presented in Table 5 and Figure 3. Higher positive values indicate the increasing importance of variables. Values close to or less than zero signify the nonimportance of variables, reflecting random fluctuation (Strobl, Malley et al., 2009). For the 2019 TOEFL ITP listening scores, person knowledge had the highest positive importance score (5.83), suggesting that it was the strongest correlate of the listening score. This was followed by mental translation (0.27), directed attention (0.13), and planning and evaluation (0.04). Problem solving had a negative importance value (−0.01), suggesting that it was not associated with the listening score. For the 2019 TOEFL iBT listening scores, person knowledge (7.45), mental translation (2.15), and directed attention (0.35) had positive importance scores. In contrast, planning and evaluation (−0.25) and problem solving (−0.13) had negative importance scores. For the 2020 TOEFL ITP listening scores, person knowledge (14.02) was the only variable whose importance value lay outside the range spanned by the negative values: the importance values of directed attention (0.17) and mental translation (0.19) were positive but smaller than 0.25, the absolute value of the −0.25 for problem solving. This suggests that directed attention and mental translation were not associated with the listening score.
To better understand the nature of the relationship between the aspects of metacognitive awareness and listening test scores, Figure 4 depicts partial dependence plots. Figure 4 (top panel) shows that for the 2019 listening ITP scores, planning and evaluation, directed attention, person knowledge, and mental translation (the four major correlates in Table 5) were linearly related to the listening scores. According to Figure 4 (middle panel), such a linear relationship was also found for the 2019 listening iBT scores: the uses of directed attention, person knowledge, and mental translation (the three major correlates in Table 5) were directly proportional to the listening scores. Finally, Figure 4 (bottom panel) shows that for the 2020 listening ITP score, person knowledge (the only major correlate in Table 5) demonstrated a linear relationship with the listening score.

Note (Table 5). As for variable importance scores, positive values indicate that those metacognitive awareness aspects were associated with the listening score; negative values indicate otherwise (e.g., problem solving and the 2019 listening ITP score). Further, to interpret the results, we followed Strobl, Malley et al.'s (2009) statement that "All variables with importance that is negative, zero, or positive but with a value that lies in the same range as the negative values can be excluded from further exploration" (p. 343). Thus, for the 2020 listening ITP, directed attention and mental translation demonstrated positive values that were within the same range as planning and evaluation strategies and problem solving (i.e., −0.25 to 0.25), so they were excluded from further interpretation. a Values in parentheses indicate the importance rank of each metacognitive awareness aspect (e.g., person knowledge is a stronger correlate of the 2019 listening ITP score than is mental translation).
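Strobl, Malley et al.'s (2009) exclusion rule quoted in the note to Table 5 can be applied mechanically. The sketch below uses the importance values reported above for the 2020 TOEFL ITP (planning and evaluation is omitted because only its sign, not its value, is reported in the text):

```python
def retained_aspects(importance):
    # Strobl et al.'s rule: drop variables whose importance is negative, zero,
    # or positive but within the range spanned by the negative values.
    negatives = [v for v in importance.values() if v < 0]
    threshold = max(abs(v) for v in negatives) if negatives else 0.0
    return sorted((k for k, v in importance.items() if v > threshold),
                  key=lambda k: -importance[k])

# 2020 TOEFL ITP importance values reported in the text.
itp_2020 = {"person knowledge": 14.02, "mental translation": 0.19,
            "directed attention": 0.17, "problem solving": -0.25}
```

Applied to the 2020 values, only person knowledge survives the threshold of 0.25; applied to the 2019 TOEFL ITP values, the rule retains the four aspects interpreted above.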

Research Question 1: Relationships across listening tests
To better understand how different listening tests affect the relationship between L2 listening comprehension and metacognitive awareness, a group of students took a listening test in two formats (TOEFL ITP vs. iBT). As summarized in Table 6, results showed that, in order of importance, person knowledge, mental translation, and directed attention were related to listening comprehension in both formats. Planning and evaluation strategies were related to listening comprehension only in the TOEFL ITP.
In both formats, person knowledge, that is, one's beliefs about listening, was most strongly related to listening comprehension. Higher positive values of person knowledge after reverse coding indicate that learners find listening in English easier than reading, speaking, or writing in English. Such learners tend to feel that English listening comprehension is not especially challenging for them and tend not to feel nervous while attempting listening tests. These beliefs reflect learners' higher self-perceived listening ability and greater confidence in listening. As shown in Table 4, the mean scores of the current learners were neither high nor low (56.06% and 55.67%). The consistent relationship between person knowledge and listening across the listening tests seems to suggest that learners' positive beliefs about, or greater confidence in, their listening ability are related to listening comprehension, a finding that corroborates previous studies.
More specifically, Taguchi (2001) and Wu (1998) found that students who were less confident in listening faced great difficulty in processing aural information. For example, if they lost track, it was difficult for them to catch up due to the fast speech rate of the text. Goh (1997) reported that successful listeners had positive attitudes toward listening and understood the importance of using appropriate strategies. Our results show that, just as in paper-based tests, person knowledge emerged as an important factor in a computer-based test that assesses broader but more precisely defined constructs of listening. As reported in Table 1, previous studies have also found a constant relationship between person knowledge and listening. However, unlike other studies, Aryadoust (2015a) used an evolutionary algorithm-based symbolic regression that does not assume linearity between variables and found that a constant relationship did not occur between person knowledge and listening. Different methods of examining relationships and other possible factors should be considered in further explorations.
Further, after person knowledge, mental translation was the second most strongly related aspect of listening comprehension. Higher positive values of mental translation after reverse coding indicate that listeners did not mentally translate keywords or other words as they listened. This strategy reflects the avoidance of mental translation and the presence of automatic recognition and processing of auditory information, leading to quick comprehension of the listening passage. This corresponds with previous studies that have reported the debilitating effect of mental translation on comprehension and observed this strategy among low-proficiency learners (e.g., Goh & Hu, 2014). The role of automatic recognition and processing of auditory information could be even more prominent in computer-based tests, such as the TOEFL iBT, where the cognitive demands of listening tasks (e.g., multiple-choice options and questions are presented on the screen after the passage is played; connecting and synthesizing abilities are tested using long passages) are heavier than those in paper-based tests. In fact, mental translation had a higher importance value in the TOEFL iBT than in the TOEFL ITP (0.27 for the TOEFL ITP; 2.15 for the TOEFL iBT; see Table 5), suggesting that the avoidance of mental translation plays a more important role in computer-based tests with higher cognitive demands than in paper-based tests with lighter demands.
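The importance values discussed here come from the study's random forest analysis. The exact implementation is not reproduced in this section, but the general approach can be sketched with scikit-learn's permutation importance; the data below are randomly generated placeholders (not the study's data), and the factor names serve only as labels.

```python
# Sketch: estimating variable importance for five MALQ factors with a
# random forest and permutation importance. Data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
factors = ["person_knowledge", "mental_translation", "directed_attention",
           "planning_evaluation", "problem_solving"]
X = rng.normal(size=(75, 5))                       # 75 learners x 5 factors
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=75)    # toy listening scores

model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=30, random_state=0)

# Rank factors by mean importance (drop in model score when shuffled)
for name, imp in sorted(zip(factors, result.importances_mean),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.2f}")
```

In this toy setup, the first factor is built to drive the outcome most strongly, so it surfaces with the largest importance value, mirroring how person knowledge ranked first in the study's analysis.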
Directed attention was related to listening comprehension in both tests, albeit to a lesser degree. It refers to strategies such as paying more attention to the text when it is difficult to understand; in other words, working harder to regain concentration. These strategies signify the importance of focusing on the aural input and regulating attention even under difficult circumstances. This importance was observed in both the TOEFL iBT and the TOEFL ITP. Problem solving was not related to listening comprehension in either of the tests. It signifies strategies to infer the meaning of words from context and to monitor whether one's understanding of the passage is correct against one's background knowledge or experience. Although these strategies could have been expected to be significant in both tests, the reason for their absence was not clear. One possibility is that participants who used problem solving were occupied with this activity and missed essential incoming information related to the answers (Ridgway, 2000).
Planning and evaluation strategies were related to listening comprehension only in the TOEFL ITP. Planning signifies that a listener prepares themselves to listen beforehand; planning takes place before and during listening. Evaluation signifies checking whether one's understanding of the passage is satisfactory and considering how one might improve one's listening skills; evaluation takes place during and after listening. The absence of these relationships in the TOEFL iBT could be explained by students' inexperience with computer-based tests. As they take more tests, they might use planning and evaluation strategies in computer-based tests as well as in paper-based tests. As this process continues, their strategy use may gradually come to relate to listening comprehension in computer-based tests as well.
However, the presence of question preview may alter the relationship between planning and evaluation and problem solving strategies and listening in computer-based tests, because the opportunity to preview questions allows learners to prepare their cognitive resources and use them more efficiently. Question preview may help a learner discern the meaning of unfamiliar words by referring to topical knowledge activated before listening commences (i.e., effective use of problem solving). Further, such learners may plan ahead, set a clear goal before listening, and handle listening tasks better. As mentioned above, the TOEFL iBT does not provide question preview and the TOEFL ITP does not allot additional time to read the questions; therefore, it is unlikely that this strategy was employed by the learners. In Table 1, of the five studies that allowed question preview (Aryadoust, 2015a, 2015b; Ghorbani Nejad & Farvardin, 2019; Goh & Hu, 2014; Vafaee & Suzuki, 2020), three (Aryadoust, 2015a, 2015b; Goh & Hu, 2014) reported a relationship between problem solving and listening, and three (Aryadoust, 2015a, 2015b; Vafaee & Suzuki, 2020) reported a relationship between planning and evaluation strategies and listening, whereas studies without question preview did not find such relationships. Further research may be needed on how question preview affects the relationship between planning and evaluation strategies and listening. Finally, the three differences observed in the introductory section sufficiently explain the subtle differences between the results of the current study and those of previous studies.

Research Question 2: Relationships across cohorts
As for Research Question 2, regarding how the relationship between L2 listening comprehension and metacognitive awareness was affected by different samples of students, two cohorts (the 2019 and 2020 cohorts) took the same listening test. As shown in Table 6, the results from the 2019 cohort indicated that, in order of importance, the metacognitive strategies of person knowledge, mental translation, directed attention, and planning and evaluation were related to listening. Results from the 2020 cohort showed that only person knowledge was related to listening comprehension. Problem solving was not related in either of the samples.
Observing the relationship between listening comprehension and person knowledge across the two cohorts suggests the generalizability of such a relationship beyond the study's setting. Among the five aspects of metacognitive awareness of listening, person knowledge was stably related to listening comprehension across groups of students. However, this does not corroborate Vandergrift and Baker (2015), who observed that different strategies were significant across cohorts and when the cohorts were analyzed together. The combined results from the current and previous studies seem to suggest that the relationship between listening comprehension and person knowledge is robust in some contexts.

Strategies other than person knowledge (planning and evaluation, directed attention, and mental translation) were not related to listening comprehension for the 2020 cohort, and problem solving was not significant for either cohort. Two reasons may explain this. First, the 2020 cohort took the test for the first time and may not have had the attentional resources to deploy metacognitive strategies while listening, as their mental resources were focused on answering the questions. Their instructors commented that although most students had taken listening tests for university entrance examinations, they tended to feel perplexed by the TOEFL ITP, in which the audio was played only once, in contrast to Japanese university entrance exams, where the audio was typically played twice (e.g., the National Center Test for University Admissions; see Yanagawa, 2012).
Second, the timing of the administration of the MALQ may have influenced the results related to strategies other than person knowledge. The 2019 cohort completed the MALQ in the second semester (December-January) and the 2020 cohort completed it in the first semester (in July). According to their instructors, students tended to take lessons less seriously in the second semester, as they got used to the university lifestyle. This suggests that the 2019 cohort's responses to the MALQ might not fully reflect their perception of metacognitive awareness.
In sum, consistently observing the relationship between listening comprehension and person knowledge across two cohorts suggests the robustness of such a relationship, indicating its strong generalizability across learners. In contrast, different results of planning and evaluation, directed attention, and mental translation, despite many aspects shared by the two cohorts (with minor differences across the groups), suggest weak generalizability of the relationship between these strategies and listening comprehension.

Conclusion
Research has shown that metacognitive awareness is positively related to L2 listening comprehension, wherein successful listeners employ various strategies to address the cognitive demands of the incoming aural information (e.g., Goh & Hu, 2014; Vandergrift et al., 2006). The current study extended this line of research by focusing on aspects of metacognitive awareness and by examining how this relationship is moderated by (a) listening comprehension tests (e.g., eliciting broadly defined constructs, using more demanding listening situations and a computer-based method) and (b) samples of learners. As for the effects of listening tests, three aspects of metacognitive awareness (i.e., person knowledge, [the avoidance of] mental translation, and directed attention) were related to listening comprehension in both tests. Problem solving was not related in either test. Regarding the effects of learner samples, person knowledge was consistently related to listening comprehension, suggesting a strong generalizability of the findings across learners. In contrast, other aspects of metacognitive awareness were inconsistently related to listening comprehension, suggesting their weak generalizability. Consequently, the relationship between L2 listening comprehension and metacognitive awareness is moderately affected by (a) listening comprehension tests and strongly affected by (b) learner samples.
However, the current study has certain limitations. First, we used only a questionnaire to assess metacognitive awareness. According to Vandergrift and Baker (2015), questionnaires assess learners' perceptions rather than their actions; therefore, questionnaire data may not completely reflect their behaviors. Future research should employ other measures, such as eye-tracking techniques, which can be used to examine a listener's gaze behavior as an indicator of how metacognitive strategy use unfolds. Low and Aryadoust (2021) found that listeners' gaze behavior was only moderately related to their questionnaire responses and recommended that researchers use gaze behavior rather than questionnaires. As the results of the two methods differ (e.g., see Bax, 2013), it would be better to use them complementarily to measure listeners' metacognitive awareness. Second, even with multiple measures, it might be challenging to assess metacognitive awareness because its use may depend on the task that learners are engaging with; this is an interactionalist perspective (Bachman, 2006) on understanding metacognitive awareness. The MALQ, in contrast, measures metacognitive awareness without reference to particular tasks; this is a trait perspective (Bachman, 2006). Metacognitive awareness could be better assessed by specifying contextual factors and using other methods, such as verbal protocols, along with questionnaires. Moreover, researchers should examine other variables that are not only directly associated with listening but also moderate the relationship between metacognitive strategies and listening (see Wallace, 2020).
For pedagogical implications, our findings could help teachers better understand the relationship between L2 listening and metacognitive awareness across listening tests and learner groups. Following Vandergrift et al.'s (2006) instructional advice, teachers could focus on teaching metacognitive strategies that are more strongly related to listening (e.g., person knowledge). Doing so could help learners identify their strengths, weaknesses, goals, and actions in relation to listening comprehension.

Notes
1. In'nami et al.'s meta-analysis used correlations that had been collected before the current study was conducted; thus, the meta-analysis did not include the current study. For readers interested in the correlations between metacognitive awareness and L2 listening comprehension in the current study, the correlations between the TOEFL ITP listening (paper) and the MALQ were r = .54 (2019 data) and .42 (2020 data). The correlation between the TOEFL iBT listening (computer) and the MALQ was r = .37 (2019 data). All correlations were statistically significant, p < .01.

2. Following a reviewer's suggestion, we conducted factor analysis separately on the 2019 and 2020 data to examine whether the five-factor structure of the MALQ (supported in Vandergrift et al., 2006) was also supported in the current study. The results for the 2019 data showed that the matrix was not positive definite. This seemed to be due to the model-implied correlation matrix of the latent variables: the (standardized) correlation between directed attention and problem solving was 1.047. To address the not-positive-definite matrix, the ridge option in the lavaan package was used (Rosseel, 2012). However, the matrix remained not positive definite. Thus, we were not able to examine the factor structure of the MALQ for the 2019 data. For interested readers, the results, which were questionable and should not be interpreted, showed that the model fit the data poorly (comparative fit index [CFI] = .736, root mean square error of approximation [RMSEA] = 0.084 [90% confidence interval = 0.065, 0.103], standardized root mean square residual [SRMR] = .108). As for the 2020 data, the results showed that the model fit the data poorly (CFI = .661, RMSEA = 0.089 [0.074, 0.104], SRMR = .108). In sum, the factor structure of the MALQ could not be tested in the 2019 data, and it was not supported in the 2020 data.
Thus, the findings from the study may need to be interpreted with caution. It is also worth noting that it was not clear whether these results, which did not confirm the five-factor structure, were specific to the current data sets. This is particularly because, to the best of the authors' knowledge, little research has been conducted on the factor structure of the MALQ since it was developed by Vandergrift et al. (2006). One exception is Wallace (2020), who reported removing the mental translation items from his model because they loaded weakly on the metacognitive strategy factor. This suggests that the mental translation strategy is not explained by a common ability underlying planning and evaluation, directed attention, person knowledge, and problem solving. It further suggests the need to examine to what extent Vandergrift et al.'s five-factor model holds across studies.
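The significance claims in note 1 can be checked from r and n alone, since the test statistic for a Pearson correlation is t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom. A minimal sketch, using the sample sizes of the cohorts described above:

```python
# Sketch: two-tailed p-value for H0: rho = 0, given only r and n.
from math import sqrt
from scipy import stats

def corr_p_value(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r**2), compared to t(df = n - 2)."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# Correlations and sample sizes reported in note 1
for r, n in [(0.54, 75), (0.42, 107), (0.37, 75)]:
    print(f"r = {r:.2f}, n = {n}: p = {corr_p_value(r, n):.4f}")
```

All three p-values fall below .01, consistent with the significance levels reported in the note.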