Psychometric assessment of an instrument evaluating the effects of affective variables on students’ WTC in face-to-face and digital environment

Abstract The current study aimed to translate and examine the psychometric characteristics of the Indonesian version of Lee and Hsieh’s (2019) questionnaire measuring the effects of four affective variables (i.e. self-confidence, L2 anxiety, grit, and motivation) on students’ willingness to communicate (WTC) in face-to-face (F2F) settings, both inside and outside the classroom, and in a digital environment. Data were collected from 458 students: 269 university students, 102 upper secondary school students, and 87 lower secondary school students. Several statistical analyses (i.e. unidimensionality assessment, reliability analysis, misfit analysis, and differential item functioning analysis) were performed using Rasch analysis. Findings of the study revealed that seven of the eight constructs in the Indonesian version of Lee and Hsieh’s (2019) questionnaire are valid with Indonesian samples, and that they preserve the psychometric characteristics of the original scale. Recommendations are offered in reference to the findings.


PUBLIC INTEREST STATEMENT
A body of literature has suggested several constructs to examine students' willingness to communicate (WTC) in a foreign language classroom. However, little has been explored regarding the assessment of WTC constructs in face-to-face and digital settings. One exception is Lee and Hsieh's (2019) questionnaire, which measures the effects of four affective variables (i.e. self-confidence, L2 anxiety, grit, and motivation) on students' WTC in face-to-face (F2F) settings (i.e. inside and outside classrooms) and in a digital environment. The current study examined the psychometric characteristics of the Indonesian version of Lee and Hsieh's (2019) questionnaire using Rasch analysis. It is significant in providing empirical evidence of those characteristics. More importantly, it contributes to the under-explored topic in the literature on WTC in F2F and digital environments within an Indonesian context.

Introduction
The term willingness to communicate (henceforth WTC) was first coined by McCroskey and Baer in 1985 (Khany & Nejad, 2017; Ningsih et al., 2018). The term was originally conceptualized to portray individual differences that encourage people to communicate in their first language (Yashima et al., 2018). In the second language (L2) learning context, the term L2 WTC is used to reflect learners' readiness to take part in particular communication events using the target language when they are given the opportunity (Khany & Nejad, 2017). Teachers' understanding of their students' WTC is crucial in L2 learning because students' WTC determines their participation in learning and their achievement (Amiryousefi, 2018; Ningsih et al., 2018).
Many scholars have attempted to develop constructs to understand the nature of WTC. Horwitz et al.'s (1986) seminal work presents the Foreign Language Classroom Anxiety Scale (FLCAS), suggesting a connection between communication apprehension and students' learning anxiety in the classroom. Yashima (2002) synthesizes a body of literature on WTC constructs and reveals that WTC can predict students' communication frequency in L2, while students' motives can be used to predict their WTC, their L2 communication frequency, or both. MacIntyre, Clément et al. (1998), as cited in Yashima (2002), suggest several variables that influence students' L2 WTC or use of L2, including student personality, intergroup climate, attitudes, motivation, self-confidence, and communicative competence. Other latent variables, such as students' cultural background, shyness, and interaction issues, are also known to contribute to WTC (Cao, 2010), in addition to interest, motives, and demographic factors (e.g. age and gender) (Amiryousefi, 2018) and, potentially, students' cognitive styles (de Sinatra et al., 2012).
Scholars have also examined the role of technology in promoting students' WTC, as well as students' WTC in a digital environment. Some previous studies (e.g. Buckingham & Alpaslan, 2017; Lee, 2019; Reinders & Wattana, 2014; Waldeck et al., 2001) have revealed that the incorporation of technology in L2 learning improved students' learning motivation and lowered their affective barriers, thus increasing their WTC. Unfortunately, several earlier studies still adopted WTC constructs developed for the assessment of WTC in non-digital settings, such as those of MacIntyre and Conrod (2001) and Cao and Philp (2006). Lee and associates (i.e. Lee, 2019; Lee & Hsieh, 2019; Lee & Lee, 2019) proposed a WTC construct that incorporated several affective variables (e.g. motivation, self-confidence, risk-taking, speaking anxiety, and grit), intercultural background, and WTC communication situations (inside the classroom, outside the classroom, and in digital settings). In particular, Lee and Hsieh (2019) developed a questionnaire to measure the effects of four affective variables (i.e. self-confidence, L2 anxiety, grit, and motivation) on students' willingness to communicate (WTC) in face-to-face (F2F) settings (i.e. inside and outside classrooms) and in a digital environment.
This current study attempted to examine the psychometric characteristics of the Indonesian version of Lee and Hsieh's (2019) questionnaire. Specifically, it aimed to examine the internal construct validity as well as the internal consistency of the translated questionnaire using Rasch analysis. Rasch analysis is a psychometric technique widely used by instrument developers and researchers to monitor the quality of an instrument by estimating item difficulty and person ability at the same time (Yu, 2020). Rasch measurement is believed to offer objective measurement because it provides invariant measurement characteristics across diverse situations (Wright, 1992, as cited in Yu, 2020). To the best of our knowledge, a validation study of Lee and Hsieh's (2019) questionnaire using Rasch analysis has not been reported elsewhere. The current study is thus significant in providing empirical evidence related to the psychometric characteristics of the Indonesian version of the questionnaire. More importantly, it contributes to the under-explored topic in the literature on WTC in F2F and digital environments within an Indonesian context.
Lee and Hsieh's (2019) questionnaire aims to measure the effects of four affective variables (i.e. self-confidence, L2 anxiety, grit, and motivation) on students' WTC in F2F settings, both inside and outside classrooms, and in a digital environment. The 33 questionnaire items were developed using a 5-point Likert scale, and the questionnaire comprised three sections: the affective variable scale, the second language willingness to communicate (L2 WTC) scale, and the demographic information of the participants. The affective variable scale included six self-confidence items, six anxiety items, four motivation items, and five grit items. The L2 WTC scale comprised four items on F2F WTC inside the classroom, four items on F2F WTC outside the classroom, and four items on WTC in a digital environment. Responses to the questionnaire items in each subscale are detailed below.

Translation into Bahasa Indonesia
The original Lee and Hsieh (2019) questionnaire was written in English and was translated into Bahasa Indonesia by the first and second authors. The translated survey was sent to other researchers who were fluent in both English and Bahasa Indonesia to validate and proofread the translation. The wording of the Indonesian translation was refined in accordance with their feedback, ensuring that the meaning of the original items was maintained in the translation.

Sample
The sample of the current study was selected using the following procedure. First, the researchers identified potential participants over social media (e.g. WhatsApp individual accounts or groups). Second, invitations were sent to the target participants through a Google Form link. Before completing the survey, the participants were asked to fill out a consent and demographic information section. A total of 458 students from lower secondary schools, upper secondary schools, and Indonesian universities completed the survey. There were 337 females (73.6%) and 121 males (26.4%). The majority of the participants were Indonesian EFL university students (N = 269, 58.7%), followed by upper secondary school students (N = 102, 22.3%) and lower secondary school students (N = 87, 19%). Linacre (1994) argues that a sample size of 108 is considered appropriate when the scale is well targeted, and that a size of 243 is required when the scale is not well targeted. Given that a sample of 458 participants from three cohorts of Indonesian students was available in the current study, the Rasch analysis was expected to provide an appropriate level of precision. More importantly, fit statistics were examined, with Outfit considered before Infit, to address potential outliers that might distort the data and the analysis.
Table 1 (recovered fragment): WTC in a digital environment; 4 items; rated from Definitely not willing (1) to Definitely willing (5).

Analysis
Rasch analysis using WINSTEPS (version 4.5.1) was performed to evaluate the data from the 33 items. The analysis included the assessment of individual item and person fit, examining Outfit before Infit statistics and mean square before Z-standardised (Zstd) values (Ling Lee et al., 2020), as well as unidimensionality, item and person separation reliability, the effectiveness of the rating scale, item and person mapping, and item bias (Chan & Subramaniam, 2020; Huang et al., 2020). The acceptable values of the Infit and Outfit mean square statistics range between 0.60 and 1.40, and those of Zstd between −2 and +2 (Huang et al., 2020). A total of 169 respondents were found to misfit and were thus removed (Linacre, 2010). The remaining 289 respondents included 169 university students (37%), 63 upper secondary school students (14%), and 58 lower secondary school students (12%). The total sample of 289 still met the recommended threshold for sample size (i.e. 50-250 samples) (Linacre, 1994).
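The fit criteria above can be illustrated with a minimal sketch. It assumes that the model-expected score and model variance of each response are already available from the Rasch software; the function names and the example values are hypothetical and are not taken from the study data. Outfit is computed as the unweighted mean of squared standardized residuals, and Infit as the information-weighted version.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one respondent or one item.

    observed: raw responses; expected: model-expected scores;
    variance: model variances of the responses (all assumed given).
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / v for r, v in zip(squared_residuals, variance)) / len(squared_residuals)
    # Infit: squared residuals weighted by their model variance (information).
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit


def is_acceptable(mean_square, low=0.60, high=1.40):
    """Check a mean square against the 0.60-1.40 range cited in the text."""
    return low <= mean_square <= high


# Hypothetical respondent answering four 5-point items (illustrative values only).
observed = [4, 5, 3, 4]
expected = [3.8, 4.2, 3.1, 3.9]   # assumed model expectations
variance = [0.9, 0.7, 1.0, 0.9]   # assumed model variances
infit, outfit = fit_mean_squares(observed, expected, variance)
# Outfit is inspected before Infit, following the order described above.
keep_respondent = is_acceptable(outfit) and is_acceptable(infit)
print(round(infit, 2), round(outfit, 2), keep_respondent)
```

Mean squares well below 0.60 suggest responses that are more predictable than the model expects, and values above 1.40 suggest noisy or erratic responses; in this sketch either case would mark the respondent for review.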

The analysis of unidimensionality of the items
Unidimensionality assessment was carried out to evaluate whether the questionnaire items measure a single construct (Yu, 2020). The assessment was done by evaluating the Rasch Principal Component Analysis (PCA) of residuals for each variable, and the findings revealed that 34.4%-64.9% of the variance was explained by the Rasch measures (see Table 2). These values were greater than the PCA threshold of 20% (Sumintono & Widhiarso, 2014). Furthermore, the eigenvalues of the first contrast for all variables ranged between 1.5 and 1.9, lower than 2.0 (Linacre, 2018). These findings indicate that the Indonesian version of Lee and Hsieh's (2019) questionnaire fitted the Rasch model, reflecting unidimensional measurement of the underlying constructs.
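As a rough illustration of the first-contrast criterion, the sketch below computes the largest eigenvalue of the correlation matrix of item residuals and compares it with 2.0. It assumes that standardized residuals have already been exported, one column per item; it only approximates what dedicated Rasch software reports. The helper names and the residual values are hypothetical.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlations between equally long columns (one column per item)."""
    n = len(columns[0])
    centered = []
    for col in columns:
        m = sum(col) / n
        centered.append([x - m for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def largest_eigenvalue(matrix, iterations=1000, seed=0):
    """Power iteration; suitable for a symmetric positive semi-definite matrix."""
    rng = random.Random(seed)
    k = len(matrix)
    # Random start so the vector is unlikely to be orthogonal to the main axis.
    vec = [rng.random() + 0.1 for _ in range(k)]
    length = math.sqrt(sum(x * x for x in vec))
    vec = [x / length for x in vec]
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        value = math.sqrt(sum(x * x for x in product))
        vec = [x / value for x in product]
    return value


# Hypothetical standardized residuals: three items, six respondents.
residuals = [
    [0.5, -1.2, 0.3, 0.9, -0.4, -0.1],
    [-0.7, 0.8, 0.2, -1.1, 0.6, 0.3],
    [0.1, 0.4, -0.9, 0.2, -0.3, 0.7],
]
first_contrast = largest_eigenvalue(correlation_matrix(residuals))
print("first contrast below 2.0:", first_contrast < 2.0)
```

An eigenvalue below 2.0 means the largest residual dimension has less than the strength of two items, which is the reasoning behind the threshold cited above.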

Item and person separation reliability
Item separation reliability refers to the reproducibility of the item ordering when the same items are given to a new sample with comparable ability, whereas person separation reliability concerns the reproducibility of the person classification when a new sample responds to the same items (Chang et al., 2014). The item separation index should be higher than 3, with reliability higher than 0.90, while the person separation index should be greater than 2, with reliability greater than 0.8 (Linacre, 2018; Van Zile-Tamsen, 2017). As shown in Table 2, the item separation reliability of the Lee and Hsieh (2019) questionnaire was excellent for the global scale and the subscales (α > 0.91), with high item separation for the global scale and the subscales (separation index > 3). The person separation reliability was considered good for the global scale (α = 0.80), the subscale "Self-Confidence" (α = 0.80), the subscale "Speaking Anxiety" (α = 0.83), and the scale "L2 WTC outside classroom" (α = 0.81), and was at a "fair to good" level for the subscale "Motivation" (α = 0.62), "L2 WTC inside classroom" (α = 0.78), and "L2 WTC in digital environment" (α = 0.64) (Bond & Fox, 2015; Ningrum et al., 2019). The person separation reliability of the subscale "Grit" was very low (α = 0.44). Although the three "fair to good" subscales still showed acceptable internal consistency, they are unlikely to be able to distinguish between high and low performers in the relevant person sample (Linacre, 2018).
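The paired thresholds (separation of 2 with reliability of 0.8, separation of 3 with reliability of 0.9) follow from the standard relation between the separation index G and reliability R, namely R = G² / (1 + G²). The sketch below applies that relation to the person reliabilities quoted in this section. The function names are hypothetical, and the separations it prints are implied values derived from the formula, not the figures reported in Table 2.

```python
import math


def separation_from_reliability(reliability):
    """Separation index G implied by a reliability R: G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation):
    """Reliability R implied by a separation index G: R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


def strata(separation):
    """Number of statistically distinct levels: H = (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


# Person separation reliabilities as quoted in the text above.
person_reliability = {
    "Global scale": 0.80,
    "Self-Confidence": 0.80,
    "Speaking Anxiety": 0.83,
    "Motivation": 0.62,
    "Grit": 0.44,
    "L2 WTC inside classroom": 0.78,
    "L2 WTC outside classroom": 0.81,
    "L2 WTC in digital environment": 0.64,
}
for name, r in person_reliability.items():
    g = separation_from_reliability(r)
    print(f"{name}: implied separation {g:.2f}, strata {strata(g):.2f}")
```

Under this relation a reliability below 0.8 always implies a person separation below 2, which is why the lower-reliability subscales are described as unlikely to separate high from low performers.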

Effectiveness of the rating scales
The ordering of the rating scale step categories was evaluated to understand participants' ability to distinguish among the rating categories used in the questionnaire (see Table 1 for the rating scales). Table 3 details the functioning of the rating scale step categories. As shown in Table 3, each of the rating categories included more than the minimum of 10 observations, and the distribution of responses by category tended to be right-skewed. The outfit MNSQ values were less than 2.0, but the distance between adjacent step thresholds was lower than the ideal range of 1.4 to 5 logits. In addition, the average calibration, as well as the step threshold, increased monotonically from the lowest rating point (i.e. −0.27) to the highest (1.59). These findings indicate that the categories allowed the assessment of the latent variable and might suggest precision of the assessment (Van Zile-Tamsen, 2017). The monotonic increase in the average calibration suggested that "the higher category selected, the higher the students' average measures" (DiStefano & Jiang, 2020, p. 39). However, the narrow threshold distances suggest that participants might have had difficulty distinguishing among the categories when completing the questionnaire.
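The rating scale checks described above can be sketched as follows, assuming an Andrich rating scale model in which the probability of each category depends on the person measure, the item difficulty, and a shared set of step thresholds. The threshold values below are hypothetical placeholders rather than the calibrations in Table 3, and the function names are illustrative.

```python
import math


def category_probabilities(theta, delta, thresholds):
    """Category probabilities under a rating scale model.

    theta: person measure; delta: item difficulty; thresholds: step
    thresholds tau_1..tau_m shared by all items (all in logits).
    """
    # Cumulative sums of (theta - delta - tau_j); the lowest category has an empty sum.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - delta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_are_ordered(thresholds):
    """True when the step thresholds increase monotonically."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def gaps_in_ideal_range(thresholds, low=1.4, high=5.0):
    """True when every distance between adjacent thresholds is 1.4 to 5 logits."""
    return all(low <= b - a <= high for a, b in zip(thresholds, thresholds[1:]))


# Hypothetical thresholds for a 5-point scale (four steps).
example_thresholds = [-1.5, -0.4, 0.5, 1.4]
print(thresholds_are_ordered(example_thresholds))   # ordering check
print(gaps_in_ideal_range(example_thresholds))      # distance check
print([round(p, 3) for p in category_probabilities(0.0, 0.0, example_thresholds)])
```

Thresholds can be correctly ordered while still lying too close together, which is the pattern reported in this section: the categories advance in the right direction, but adjacent categories are hard for respondents to tell apart.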

Item and person mapping
Item and person maps (also known as Wright maps) were developed to examine the spread of respondents' perceptions and the distribution of difficulty levels across the questionnaire items. Figure 1 presents the Wright item-person map detailing the distribution of person abilities and item difficulties. The left side of the map shows the distribution of the measured ability of the respondents, from most able at the top to least able at the bottom, and the right side of the map shows the distribution of the items, from most difficult at the top to least difficult at the bottom. The item difficulties range from −1.21 logits to 1.61 logits.
As shown in Figure 1, item Q18 (1.61 logits) of the subscale "Grit" was the hardest item for the students to agree with, whereas item Q16 (−1.21 logits) of the subscale "Motivation" was the easiest. The subscale "Grit" had three items that were difficult for the participants to endorse (i.e. Q18, Q20, and Q21), lying two standard deviations above the mean item difficulty. Three items in the subscale "Motivation" (i.e. Q14, Q15, and Q16) and three in the subscale "WTC in a digital environment" (i.e. Q30, Q31, and Q32) were easy to endorse, lying two standard deviations below the mean item difficulty. The findings indicate that the instrument can readily capture students' motivation to communicate in L2, but has relative difficulty identifying the grit factor in this group of students.
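The reading of the Wright map in terms of standard deviations from the mean item difficulty can be sketched as a simple screening step. The item labels and difficulty values below are hypothetical and do not reproduce Figure 1; the function name is illustrative, and the sample standard deviation is an assumption about how the spread is computed.

```python
from statistics import mean, stdev


def flag_extreme_items(difficulties, n_sd=2.0):
    """Split items into (hard, easy) lists lying beyond n_sd standard
    deviations above or below the mean item difficulty (in logits)."""
    values = list(difficulties.values())
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption)
    hard = [item for item, d in difficulties.items() if d > centre + n_sd * spread]
    easy = [item for item, d in difficulties.items() if d < centre - n_sd * spread]
    return hard, easy


# Hypothetical item difficulties in logits (illustrative values only).
example = {
    "A1": -0.20, "A2": 0.10, "A3": 0.05, "A4": -0.10, "A5": 0.15,
    "A6": -0.05, "A7": 0.00, "A8": 0.20, "A9": -0.15, "A10": 1.60,
}
hard_items, easy_items = flag_extreme_items(example)
print("hard to endorse:", hard_items)
print("easy to endorse:", easy_items)
```

Items flagged as hard sit above most respondents on the map and are rarely endorsed; items flagged as easy sit below most respondents and are endorsed by nearly everyone, so both kinds contribute little to separating respondents.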

Item bias
The analysis of differential item functioning (DIF) using Rasch-Welch tests was performed to evaluate whether there were potential item biases caused by group characteristics (e.g. gender and level of education) (Chan & Subramaniam, 2020). An item is considered to exhibit DIF if its DIF contrast (the difference in the difficulty of the questionnaire item between groups) is greater than 0.5 logits and is significant (Rasch-Welch probability value < 0.05) (Chan & Subramaniam, 2020; Linacre, 2018). The DIF evaluation by participant gender revealed that all DIF contrast values were lower than 0.5 logits. In contrast, several items with significant DIF were found with respect to students' level of education (see Table 4).
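The two-part DIF criterion can be written as a small decision rule. The sketch below assumes that the group-specific item difficulties and the Rasch-Welch probability are taken from the software output. The function names are hypothetical, the use of the absolute contrast is an assumption so that the rule works in either direction, and the probability of 0.04 is a placeholder standing in for a value reported only as p < 0.05.

```python
def dif_contrast(difficulty_group_a, difficulty_group_b):
    """Difference in item difficulty (logits) between two groups."""
    return difficulty_group_a - difficulty_group_b


def shows_dif(contrast, p_value, min_contrast=0.5, alpha=0.05):
    """Flag an item when the contrast exceeds 0.5 logits in size
    and the Rasch-Welch probability is below 0.05."""
    return abs(contrast) > min_contrast and p_value < alpha


# Item Q2 difficulties as quoted in the text for lower secondary (LS)
# and university (U) students; the exact p-value is a placeholder.
contrast_q2 = dif_contrast(-0.86, -0.31)
print(round(contrast_q2, 2), shows_dif(contrast_q2, p_value=0.04))
```

Requiring both conditions keeps a large but statistically unstable contrast, or a significant but negligible one, from being treated as item bias.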
Item Q2 was considered too easy for lower secondary school students, but not for university students (DIF LS = −0.86 logit, DIF U = −0.31 logit, p < 0.05). Item Q4 was observed to be easy for lower secondary school students, but was considered difficult for upper secondary and university

Discussion and conclusion
Although quite a number of recent studies on WTC offer measures to evaluate students' willingness to communicate in L2 settings, they have rarely compared two different WTC settings: F2F classroom settings and the digital environment. The current study examined the psychometric characteristics of the Indonesian version of Lee and Hsieh's (2019) questionnaire measuring the effects of four affective variables (i.e. self-confidence, L2 anxiety, grit, and motivation) on students' willingness to communicate (WTC) in F2F settings, both inside and outside classrooms, and in a digital environment. It contributes to the under-explored topic in the literature on WTC in F2F and digital environments within an Indonesian context. The psychometric properties of the Indonesian version of the questionnaire were examined through several different types of analysis (i.e. analyses of the unidimensionality of the items, item and person separation reliability, item and person mapping, and item bias), primarily based upon item response theory (IRT). IRT is concerned with the assessment of individual items, offering person and item parameter invariance when the model fits, and providing more detailed information about the scale measuring particular constructs of interest (Zanon et al., 2016). Based upon this theory, Rasch analysis provides researchers with a tool to assess the psychometric quality of a scale and enables them to improve particular aspects of it, allowing them to predict a person's score according to his or her item performance (Boone et al., 2014; Linacre, 2018; Zanon et al., 2016).
The results of the Rasch analysis have shown that Lee and Hsieh's (2019) questionnaire possesses a sufficient range of item difficulty. The monotonic increase found in the average calibration of the rating points indicates that the rating scale used in the questionnaire allowed the assessment of the eight variables and might suggest precision of the assessment. Of the eight variables in the questionnaire, six were considered to have strong internal consistency and one a fair level of internal consistency. The remaining variable, "grit", had a low reliability level and thus should not be used in the questionnaire. The findings of the current study also suggest that three items in the grit subscale were among the most difficult items to agree with. These findings correspond with our earlier analysis (see Mulyono & Saskia, 2020), which showed that the variable "motivation" was marginally reliable and that the "grit" subscale had unacceptably low reliability.
Although a body of literature has provided strong empirical evidence about the role of grit in promoting students' WTC in a foreign language (Lee, 2020; Lee & Drajati, 2019; Lee & Lee, 2019), Credé et al. (2017) have raised a concern regarding the construct validity of scales measuring an individual's level of grit. In their meta-analysis of studies on grit, Credé et al. (2017) identified several pieces of empirical evidence suggesting that the relationship between an individual's grit and his or her performance might not be strong in every type of setting. Citing MacNamara et al. (2014), Credé et al. (2017) argue that a high level of grit may only be observed when the individual is given difficult but clearly defined tasks. Such difficult but clearly defined tasks are unlikely to be well articulated in the construct of grit offered by Lee and Hsieh (2019). The grit items thus need to be redesigned to reflect the "perseverance of effort" construct (Credé et al., 2017; Lee, 2020).
Note. LS = lower secondary school, US = upper secondary school, U = university.
Overall, the Indonesian version of Lee and Hsieh's (2019) questionnaire measuring the effects of four affective variables (i.e. self-confidence, L2 anxiety, grit, and motivation) on students' WTC in F2F settings (i.e. inside and outside classrooms) and in a digital environment fits the proposed model based on Rasch analysis. The validity of the questionnaire is supported by the absence of item bias between male and female participants. However, four items with DIF were found across participants' educational levels (i.e. Q2, Q4, Q9, and Q18), particularly between university students and lower secondary school students. Rephrasing these four items is recommended to address potential item bias for participants with different educational backgrounds. A fuller description of the contexts of F2F communication outside the classroom and of digital settings is also suggested for future updates of the questionnaire.