No difference without item comparison! The effect of parallel item presentation on the self-efficacy of out-of-field physical education (PE) student teachers and qualified PE student teachers

Abstract Teachers' self-efficacy is a relevant judgement of self-belief by teachers. Studies reveal inverse response bias in teachers' self-assessment. Parallel item presentation can be used as a method to reduce such distortions. The major goal of this study was to develop and verify such a measure of parallel item presentation in order to compare the self-efficacy of qualified and out-of-field PE teachers. Therefore, out-of-field and qualified PE student teachers (N = 68) were randomised into two groups. They responded to 14 self-efficacy items related to classroom subjects and PE teaching. One group of out-of-field (n = 17) and qualified (n = 18) PE student teachers was presented with the items in parallel so that they could compare classroom and PE teaching items. For the other group of out-of-field (n = 11) and qualified (n = 22) PE student teachers, the items were presented sequentially so that no direct comparison was possible. Data were analysed using nested ANOVA. The results reveal that with a dimensional item comparison, out-of-field PE teachers have a significantly lower self-efficacy in PE than qualified PE student teachers (p = .006, ηp² = .18). Without comparison, there is no significant difference. The method of parallel item presentation can thus contribute to the reduction of inverse response bias.


PUBLIC INTEREST STATEMENT
Self-efficacy, i.e. a person's assessment of being able to cope successfully even with difficult situations, is a relevant resource for teachers. Out-of-field physical education (PE) teachers often overestimate their self-efficacy. This can lead to an interpretation problem. Some studies report, for example, that self-efficacy values of out-of-field PE teachers are equally high or even higher than those of qualified PE teachers. One reason for the overestimation is their lower expert and experiential knowledge. Therefore, an additional comparative framework is suggested in the present study. The PE student teachers are always presented with two self-efficacy items in parallel, one on classroom teaching and one on PE. The study shows that the comparison "classroom vs. PE teaching" results in qualified PE student teachers having higher PE-related self-efficacy scores than out-of-field PE student teachers. In the classic presentation of the items, qualified and out-of-field PE student teachers show comparable PE-related self-efficacy scores.
Self-efficacy is a personal judgement of an individual's belief in their own ability to perform tasks (Bandura, 1977, 1997). A distinction is made between general and task-specific self-efficacy (Bandura, 1997; Zimmerman & Cleary, 2005). General self-efficacy encompasses all areas of life and describes a person's general ability to cope with life's challenges (e.g., "I can solve most problems if I invest the necessary effort"; sample item from Schwarzer et al., 1997, p. 88). However, according to Schwarzer, the self-efficacy of a teacher is not only a generic construct for teaching (e.g., "I am convinced that I am able to teach successfully all relevant subject content to even the most difficult students"; Schwarzer & Hallum, 2008, p. 171) but also a domain-specific construct for certain subjects.
"Teachers' self-efficacy can be defined as their beliefs about their capability to teach their subject matter even to difficult students" (Holzberger et al., 2013, p. 774). High self-efficacy in teachers is associated with an increased sense of commitment, job satisfaction and enhanced resilience when dealing with stressful and challenging situations in the workplace (Caprara et al., 2003; Dicke et al., 2014; Klassen & Chiu, 2010). In addition, it also appears to have a direct influence on teachers' lesson planning and classroom management, in particular on the way teachers offer constructive support in the classroom (Tschannen-Moran & Woolfolk Hoy, 2001).
Individuals tend to assess their own self-efficacy on the basis of mastery experience or vicarious experience (Bandura, 1977, 1997). In general, they derive such subjective assessments from temporal, social and dimensional comparisons (Möller et al., 2009; Paulick et al., 2017). In a temporal comparison, individuals compare how they mastered similar challenges in the past (e.g., "I have already tackled a similarly difficult situation successfully/unsuccessfully, so this time I will also . . . "). A social comparison, in contrast, means the subjective evaluation of the individual's experience with reference to other people (e.g., "I can do this better/worse than my colleague"). A further determinant of self-efficacy is the variability of situational dimensions (e.g., "This type of teaching situation in the classroom context is easier/harder to cope with than in my physical education class").
As a rule, self-efficacy is measured using standardised self-assessment scales (e.g., Schwarzer & Hallum, 2008) which comprise items measuring the levels of confidence and self-efficacy teachers perceive when having to deal with various classroom situations (see sample items mentioned above). Appropriate procedures of this type are recommended for skills diagnosis ("Kompetenzdiagnostik" in German). Pan et al. (2013), for example, also applied such self-assessment scales to physical education.
However, the quality of the data yielded by self-assessment tools is also a matter of controversy. One major criticism is that subjective self-assessments do not allow conclusions to be drawn about actual skills (Shavelson, 2010, 2013). Empirical research has revealed both differences and similarities between self-assessment and other-assessment of teachers' skill levels (Kunter & Baumert, 2006; Wagner et al., 2016). Voltmann-Hummes (2008), for instance, identified high self-efficacy levels in PE teachers with a low level of qualification (for instance, those teaching "out-of-field", see below). Apart from this, there are virtually no studies of this kind in physical education (PE) teacher education, and experts regard the quality of self-assessment procedures as being severely under-researched (Baumgartner, 2017).
Studies from other areas show that (a) women tend to underestimate themselves and men tend to overestimate themselves (Dahlbom et al., 2011; Jakobsson, 2012), and that (b) people with low levels of professional competence tend to overestimate themselves whereas people with high levels of professional competence tend to underestimate themselves (Kruger & Dunning, 1999). Baumgartner (2017) has shown that the latter observation is also true for PE teacher education. Kruger and Dunning (1999) believe that people with low levels of professional competence overestimate themselves because they do not yet possess the metacognitions necessary to arrive at a realistic self-assessment. However, a so-called "illusory optimism" (Mezulis et al., 2004; Taylor, 1989; Taylor & Brown, 1994), that is, an individual's tendency to overestimate their own chances of success, also has a functional value. Bandura (1997) believes that an individual can increase their motivation and volition by slightly overestimating their own abilities, a frame of mind which seems especially desirable in novice teachers as they seek to acquire the professional skills they need. Contrasting with this is "defensive pessimism" (Norem, 2001; Norem & Illingworth, 1993; Ntoumanis et al., 2010). If individuals anticipate negative personal feedback, they underestimate their competence to protect themselves against potential disappointment. This explains why the self-efficacy of novice teachers decreases when, during the initial (university) phase of their teacher training, they are confronted with situations in which they acquire practical classroom teaching experience (Tschannen-Moran et al., 1998). In cases such as these, underestimating their competence serves the (preventative) function of protecting their self-esteem.
According to the research summarised above, distortions (such as illusory optimism and defensive pessimism) seem to be a natural phenomenon within teachers' self-assessment. However, it seems problematic to use such self-assessment scales to draw conclusions about teachers' competencies. This problem is especially relevant when using a self-assessment scale to compare samples that differ in the teachers' qualification level. Within schools such differences exist, since the subject PE, in particular in primary schools, is taught not only by qualified but also by out-of-field teachers (Brettschneider & Brandl-Bredenbeck, 2011). Du Plessis (2016, p. 42) defines out-of-field teaching as "an instance where teachers teach subject areas and year levels outside their scope of qualification or expertise". According to Du Plessis (2020), out-of-field teaching is a phenomenon that implicates issues of teaching quality and students' learning outcomes.
It can be assumed that qualified teachers have a higher level of self-efficacy than out-of-field teachers (Hobbs & Törner, 2019). If the findings on item distortions are taken into account, we come to the conclusion that illusory optimism and defensive pessimism are not merely a functional over- or underestimation of ability, but they are the source of potential inverse distortions in the way respondents answer the questions. Consequently, we assume that these potential inverse distortions can result in only slight differences in self-efficacy of qualified and out-of-field PE teachers despite the respondents' differing levels of qualification (Flores et al., 2004; Fox & Peters, 2013; Voltmann-Hummes, 2008).
Previous studies have shown (Garbarski et al., 2015; Lee et al., 2016) that the way items are presented can have a significant influence on the way the respondents answer the questions. Franke (1997), for instance, showed that grouping and randomising items can have an effect on the reliability and validity of questionnaires. Her studies showed that item-blocking, for instance, caused significantly lower mean values as well as a lower reliability.
Following Podsakoff et al. (2003), parallel item presentation is considered to be an appropriate strategy to cope with inverse distortions. Parallel item presentation comprises two similar items which differ only in terms of their situational application reference, for instance, "I am good at Asian cooking vs. I am good at Italian cooking". The method follows the assumption that a dimensional comparison stimulates cognitive phases of item response. In particular, the phases of retrieval (retrieving information from long-term memory through key stimuli), judgement (evaluating retrieved information and making a decision) and response reporting (checking consistency through comparison between the decision made and other response options) seem to be favoured by a specifically created dimensional item comparison.
In a first pilot study, we have successfully applied the method of parallel item presentation for reducing inverse response distortions within the group of out-of-field and qualified PE student teachers (Liebl, 2018). However, a more systematic approach is still pending.
Accordingly, the objective of the present study is to further develop and verify such a measure of parallel item presentation in order to compare the self-efficacy of qualified and out-of-field PE teachers.
We seek to meet this objective by using parallel item presentation in line with Podsakoff et al. (2003), such that the respondents are asked to assess themselves once with reference to classroom teaching and once with reference to PE teaching. By considering the respondents' teaching competence in relation to their classroom teaching and PE teaching, our intention is to evaluate the information collected within a dimensional comparison and to support the final consistency check for the item responses (Podsakoff et al., 2003).

Hypotheses
We differentiate between out-of-field and qualified PE student teachers and address the following problem: does a dimensional comparison of items on self-efficacy in classroom-based subjects and of items on self-efficacy in PE result in a different self-efficacy in PE in out-of-field and qualified PE student teachers? In line with the approach presented here, we would expect:
• Out-of-field and qualified PE student teachers to differ when the items are presented in parallel and, therefore, a dimensional item comparison is possible.
• Out-of-field and qualified PE student teachers to show a less clear, inconclusive difference in self-efficacy when the items are presented one after the other and, therefore, a dimensional item comparison is not possible.

Design
In July 2018, we conducted a quasi-experimental study with student teachers at the University of Regensburg (Germany) (N = 68). Participants filled out an online questionnaire (platform: soscisurvey.de). Respondents were assigned to a parallel group (PG, n = 35) and a sequential group (SG, n = 33) according to a simple randomisation (Kim & Shin, 2014). The PG consisted of n = 17 out-of-field and n = 18 qualified PE student teachers; the SG consisted of n = 11 out-of-field and n = 22 qualified PE student teachers. The group of out-of-field PE student teachers was made up of primary and middle school student teachers who were studying PE neither as a major nor a minor subject but who were nevertheless still likely to have to teach PE based on the class teacher principle.
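The assignment step can be illustrated with a minimal sketch of simple randomisation. The function name, the participant identifiers and the seed are hypothetical choices for illustration; the study itself only states that a simple randomisation was used within the online questionnaire.

```python
import random


def simple_randomisation(participant_ids, seed=None):
    """Assign each participant independently to 'PG' or 'SG' (probability 0.5 each)."""
    rng = random.Random(seed)  # seeded only to make the illustration repeatable
    return {pid: rng.choice(["PG", "SG"]) for pid in participant_ids}


# Hypothetical identifiers 1..68, matching the reported sample size N = 68.
assignment = simple_randomisation(range(1, 69), seed=2018)
group_sizes = {
    group: sum(1 for g in assignment.values() if g == group)
    for group in ("PG", "SG")
}
print(group_sizes)
```

Because every participant is assigned independently, the two group sizes are not guaranteed to be equal, which is consistent with the unequal sizes reported above (n = 35 vs. n = 33).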

Sample
The sample characteristics are listed in Table 1. All participants provided written informed consent to participate in the study and confidentiality of collected data was maintained. Due to the study's quasi-experimental approach, there were significant differences in students' levels of qualification depending on their respective school type, χ²(3) = 34.53, p < .001. Out-of-field PE student teachers are mostly students of primary or middle school education whereas qualified PE student teachers are mostly students of secondary modern or grammar school education. However, there was no significant difference between the PG and SG based on the students' school type, χ²(3) = 3.80, p = .284. Because the majority of primary school student teachers are women (Wolfram et al., 2009), there are more women than men among out-of-field PE student teachers, χ²(1) = 10.66, p = .001. There was no significant difference in sex distribution between PG and SG, χ²(1) = 1.63, p = .201. However, when comparing all four subgroups, sex distribution differed significantly (Fisher's exact test: χ² = 21.93, p < .001). Standardised residuals indicated that this was especially true for the two qualified groups, with women being overrepresented in the SG, and men being overrepresented in the PG. Students in the PG (M = 23.88, SD = 3.82) were slightly older than those in the SG (M = 22.14, SD = 2.22), t(66) = 2.31, p = .024, d = 0.56, and qualified PE student teachers (M = 23.63, SD = 2.41) were significantly older, with a medium effect size, than out-of-field PE student teachers (M = 22.07, SD = 3.94), t(66) = 2.01, p = .048, d = −0.50.
There was no significant difference between men and women regarding the dependent variable self-efficacy in PE, t(29.39) = 1.61, p = .119, d = 0.45. The same was true for the students' chosen school type, F(3, 64) = 0.69, p = .559, ηp² = .03. Neither was there any correlation between the age of the respondents and the self-efficacy in PE, r = −.09, p = .482. In addition, controlling for the variables sex, age and school type does not change our main effect; see Table 2 in the Data analysis section.
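The effect sizes reported for the age comparisons can be approximated from the published means and standard deviations. The following is a sketch only: it assumes Cohen's d with a pooled standard deviation and assumes that all participants of each group entered the comparison (PG n = 35, SG n = 33; qualified n = 18 + 22 = 40, out-of-field n = 17 + 11 = 28). The function name is hypothetical.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


# Age: parallel group vs. sequential group (values as reported in the text).
d_group = cohens_d(23.88, 3.82, 35, 22.14, 2.22, 33)

# Age: qualified vs. out-of-field PE student teachers (values as reported).
d_qualification = cohens_d(23.63, 2.41, 40, 22.07, 3.94, 28)

print(round(d_group, 2), round(d_qualification, 2))
```

Under these assumptions the sketch yields values of roughly 0.55 and 0.50. The small deviation from the reported d = 0.56 may stem from rounding of the published descriptive statistics, and the sign of d depends only on which group is entered first.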

Instruments
In our study we used a self-assessment scale based on the self-efficacy instrument compiled by Schmitz and Schwarzer (2000). The original self-assessment scale consists of 10 items which measure teachers' expectation of self-efficacy. It shows an acceptable to good degree of internal consistency (study 1: α = .76; study 2: α = .81; Schmitz & Schwarzer, 2000). For our own study, we removed three items (see Note 2) from the original scale since they were not relevant to student teachers. The remaining seven items (Schmitz & Schwarzer, 2000; Schwarzer & Hallum, 2008, p. 171) were adapted according to classroom or PE-related teaching (see Appendix):
• Even if I am disrupted while teaching a subject in the classroom, I am confident that I can maintain my composure and continue to teach well.
• Even if I am disrupted while teaching a physical education class, I am confident that I can maintain my composure and continue to teach well.
Notes to Table 1: Pr = primary school; Mi = middle school; SM = secondary modern school; Gr = grammar school; PG = parallel group; SG = sequential group; OF = out-of-field PE student teachers; QL = qualified PE student teachers. (a) One participant did not indicate their sex.
Participants were asked to answer each item according to the following response format: 1 = not at all true, 2 = barely true, 3 = moderately true, 4 = exactly true (Schwarzer & Hallum, 2008). Because the sample size used in this study was not extremely large, we used data of a pilot study (N = 110) to test for reliability and validity of the shortened 7-item version of the scale (Liebl, 2018). This analysis revealed an internal consistency of ω = .69. Confirmatory factor analysis with maximum likelihood estimation also showed an acceptable fit of the data for a unidimensional model (χ²(14) = 19.11, p = .161, CFI = .93, RMSEA = .06). The resulting 14 items on self-efficacy in classroom-based subjects (SCR) and self-efficacy in PE (SPE) were presented to the two PG groups (PG out-of-field and PG qualified) in pairs as described above (SCR1 & SPE1; SCR2 & SPE2; . . . ; SCR7 & SPE7) in order to provide a dimensional comparison and support the final consistency check of the respondents' item responses.
Notes to Table 2: *p < 0.1; **p < 0.05; ***p < 0.01; Mi = middle school; SM = secondary modern school; Gr = grammar school; PG = parallel group; SG = sequential group; when comparing the models, the person who did not state his or her sex (Table 1) had to be excluded.
Conversely, the two SG groups (SG out-of-field and SG qualified) first answered the seven SPE and subsequently the seven SCR items (SPE1-7; SCR1-7) in order to preclude any updating effect for classroom and PE teaching. The SPE and SCR items were presented independently of one another. The return function in the online questionnaire was deactivated to preclude item comparison. The dependent variable SPE was measured by calculating the mean of the seven SPE items.
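The two presentation formats and the scoring of the dependent variable can be summarised in a short sketch. The item labels follow the abbreviations used above; the representation of each questionnaire screen as a tuple, the function names and the example responses are assumptions made for illustration and are not taken from the actual questionnaire implementation.

```python
from statistics import mean

N_ITEMS = 7  # seven classroom (SCR) and seven PE (SPE) items


def parallel_order():
    """PG: each classroom item is shown together with its PE counterpart."""
    return [(f"SCR{i}", f"SPE{i}") for i in range(1, N_ITEMS + 1)]


def sequential_order():
    """SG: all PE items first, then all classroom items, each shown on its own."""
    pe_items = [(f"SPE{i}",) for i in range(1, N_ITEMS + 1)]
    classroom_items = [(f"SCR{i}",) for i in range(1, N_ITEMS + 1)]
    return pe_items + classroom_items


def spe_score(responses):
    """Dependent variable: mean of the seven SPE items (response format 1 to 4)."""
    return mean(responses[f"SPE{i}"] for i in range(1, N_ITEMS + 1))


# Hypothetical responses of one participant on the 1-4 scale.
example = {f"SPE{i}": r for i, r in enumerate([3, 4, 3, 3, 4, 3, 3], start=1)}
example.update({f"SCR{i}": 3 for i in range(1, N_ITEMS + 1)})
score = spe_score(example)
print(parallel_order()[0], sequential_order()[0], round(score, 2))
```

In both formats the same 14 items are answered and the SPE score is computed identically; only the grouping of the items on screen differs, which is the manipulation of interest.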

Data analysis
We analysed the data using nested analysis of variance (Krzywinski et al., 2014). This method assesses the variation introduced at each hierarchy layer in relation to the layer below it. First, we checked normality of residuals and homogeneity of variances. Both statistical assumptions were met (Shapiro-Wilk test: W = 0.983, p = .486; Levene test: F(3, 64) = 0.76, p = .519). Afterwards, we compared different nested analysis of variance models with and without further control variables (sex, age, school type); see Table 2. Adding control variables to the model does not alter the main effects (out-of-field vs. qualified and PG vs. SG), which shows that the between-group effects are not due to differences in the control variables. Thus, in our study we established "group" (PG vs. SG) as the top group factor and "qualification" (out-of-field vs. qualified) as the subgroup factor, with self-efficacy in PE as the dependent variable (Figure 1). In the case of significant group effects, we computed Tukey post hoc tests. The alpha level was set at 5%. Effect sizes are represented by partial eta squared (ηp²), where values from 0.01 indicate small, from 0.06 medium and from 0.14 large effects (Cohen, 1988). The analyses were carried out using the software R (version 4.0.2; packages: car, sjstats, emmeans).

Figure 2. Self-efficacy in physical education (PE) for the parallel (PG) and the sequential group (SG), divided into out-of-field (OF) and qualified (QL) PE student teachers. The error bars represent 95% confidence intervals.

Results
The descriptive statistics can be found in Figure 2 and Table 3. The nested analysis of variance showed a significant and large group effect on self-efficacy in PE (F(3, 64) = 4.58, p = .006, ηp² = .18). The PG qualified (M = 3.38, SD = 0.34) descriptively reported a higher self-efficacy in PE than the PG out-of-field (M = 3.01, SD = 0.38). Descriptively, the SG qualified (M = 3.18, SD = 0.43) also had a higher self-efficacy in PE than the SG out-of-field (M = 2.91, SD = 0.33). Post hoc tests showed that these differences between qualified and out-of-field student teachers were significant only for the PG, p = .025, but not for the SG, p = .216.
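The reported effect size can be cross-checked against the F statistic. The sketch below uses the standard identity ηp² = (F · df_effect) / (F · df_effect + df_error) and the thresholds attributed to Cohen (1988) in the Data analysis section; the function names and the label "negligible" for values below 0.01 are illustrative assumptions.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohen_label(eta_p2):
    """Verbal label using the thresholds 0.01 (small), 0.06 (medium), 0.14 (large)."""
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch only


# Group effect reported in the Results: F(3, 64) = 4.58.
eta = partial_eta_squared(4.58, 3, 64)
print(round(eta, 2), cohen_label(eta))
```

With F(3, 64) = 4.58 this gives a value of about 0.177, which rounds to the reported ηp² = .18 and falls in the range labelled as a large effect.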
Qualified PE student teachers rated their self-efficacy in PE higher than out-of-field PE student teachers only if a dimensional comparison of classroom-based and PE-related self-efficacy items is provided. Without an item comparison, there is no difference in self-efficacy in PE between qualified and out-of-field PE student teachers. This result confirmed our hypotheses. We can therefore assume that a dimensional comparison of items on self-efficacy in classroom-based subjects and self-efficacy in PE leads to a different self-efficacy in PE in out-of-field and qualified PE teachers.

Discussion
Teachers' self-efficacy (Bandura, 1977, 1997) is a relevant, personal judgement of self-belief by teachers in their ability to perform tasks (Dicke et al., 2014; Hovey et al., 2020; Porsch, 2015; Skaalvik & Skaalvik, 2007; Tschannen-Moran & Woolfolk Hoy, 2007) and is usually measured using self-assessment scales (Pan et al., 2013; Schwarzer & Hallum, 2008). Studies show that people with low levels of professional expertise frequently overestimate their abilities and that people with high levels of professional expertise underestimate their abilities (Baumgartner, 2017; Kruger & Dunning, 1999). The assumption is that this can lead to inverse distortions in the response behaviour of respondents, resulting in virtually no differences in self-efficacy despite differing levels of qualification, e.g., qualified and out-of-field teaching (Flores et al., 2004; Fox & Peters, 2013; Voltmann-Hummes, 2008). Thus, our study tested one method for reducing this distortion: by asking respondents to directly compare items on self-efficacy in classroom-based subjects and self-efficacy in PE, this method uses a dimensional comparison in order to minimise respondents' tendency to overestimate or underestimate themselves.
Notes to Table 3: PE = physical education; PG = parallel group; SG = sequential group; OF = out-of-field PE student teachers; QL = qualified PE student teachers.
The results show that, in a standardised written self-assessment of a respondent's self-efficacy, there is only a significant difference between out-of-field and qualified PE student teachers if a dimensional item comparison is used. With an item comparison, out-of-field PE student teachers (PG out-of-field ) rated their self-efficacy in PE significantly lower than qualified PE student teachers (PG qualified ). With respect to SG, differences between qualified and out-of-field student teachers were not significant. We argue that the reason for this is not so much the slightly smaller difference in means, but the larger variance in SG qualified (see Figure 2). This can be seen as an indication that the item comparison enables a clearer self-assessment, especially for the more highly qualified. A dimensional item comparison can help participants of self-assessments to achieve more self-reflected and focused evaluation of their self-efficacy. Our findings underpin the assumptions that a dimensional related comparison of two similar items can foster cognitive phases of item response (Podsakoff et al., 2003).
One limitation of this study is that it is not possible to say whether the item comparison resulted in a more differentiated as well as a more realistic self-assessment by the respondents. Consequently, in order to validate such self-assessment measures future studies should focus on a comparison of subjective self-assessments with intersubjective other-assessments or objective skills tests (Shavelson, 2013).
Other limitations include potential institutional influences (the study was only carried out at a single university), the subject combination (only the school types were analysed and not the subject combination [with the exception of physical education]) or practical teaching experience. Also, the uneven sex distribution in the different experimental groups warrants attention. It might be possible that the high percentage of male participants in the qualified PG group influenced the relatively high self-efficacy scores for this group because males tend to have higher self-efficacy estimates than women (e.g., Huang, 2013). We recommend future studies to expand sample size and subject-related variety of participants in order to determine the validity of the presented parallel item self-assessment. Furthermore, a block randomisation instead of a simple randomisation should be taken into consideration as suggested by Kim and Shin (2014).
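The block randomisation suggested above can be sketched as follows. This is a generic permuted-block scheme under assumed settings (block size 4, two arms, hypothetical participant identifiers and seed), not a procedure used in the present study.

```python
import random


def block_randomisation(participant_ids, block_size=4, seed=None):
    """Permuted-block randomisation: each full block holds equal numbers of PG and SG."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equally sized arms")
    rng = random.Random(seed)  # seeded only to make the illustration repeatable
    ids = list(participant_ids)
    assignment = {}
    for start in range(0, len(ids), block_size):
        block = ["PG", "SG"] * (block_size // 2)
        rng.shuffle(block)  # random order within the block
        for pid, group in zip(ids[start:start + block_size], block):
            assignment[pid] = group
    return assignment


# Hypothetical identifiers 1..68; 68 participants form 17 complete blocks of 4.
blocked = block_randomisation(range(1, 69), block_size=4, seed=1)
n_pg = sum(1 for g in blocked.values() if g == "PG")
n_sg = sum(1 for g in blocked.values() if g == "SG")
print(n_pg, n_sg)
```

With complete blocks the two arms are exactly balanced (here 34 and 34). Applying the same scheme separately within the out-of-field and the qualified students would, under the same assumptions, also balance the subgroup sizes across PG and SG.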
Another limitation concerns the reliability of the shortened 7-item scale. We found an internal consistency of ω = .69 which is just short of the often-cited recommended cutoff value of .70 (Nunnally & Bernstein, 1994). However, critics argue that this cutoff value was not derived from empirical research and that an increase in reliability is often obtained at the cost of validity (Cho & Kim, 2015). Still, there remains some doubt about the reliability of our scale. Also, although the confirmatory factor analysis revealed an acceptable fit of a unidimensional model, the sample size used for this kind of analysis was small and therefore, the result should be interpreted with caution.
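The fit index of the confirmatory factor analysis can be related to the reported chi-square statistic. The sketch assumes the common formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))) and the pilot sample of N = 110; some software uses N instead of N − 1 in the denominator, which does not change the rounded result here. The function name is hypothetical.

```python
from math import sqrt


def rmsea(chi2, df, n):
    """Root mean square error of approximation from a model chi-square statistic."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Values reported for the unidimensional model: chi-square(14) = 19.11, N = 110.
rmsea_value = rmsea(19.11, 14, 110)
print(round(rmsea_value, 3))
```

The result of about 0.058 is consistent with the reported RMSEA = .06. Because the formula divides by the sample size, the estimate becomes less stable in small samples, which is in line with the caution expressed above.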
In general, surveys with student teachers are quite common due to the facilitated sample access. Due to the early internships in the Bavarian teacher education system, some of which have to be completed before the start of the study, it can be assumed that the student teachers surveyed also have sufficient experience to answer the items. However, the present results are of course not directly transferable to in-service teachers. Therefore, future research on the method "parallel item presentation" should also focus on experienced teachers.
Further research could possibly transfer the method "parallel item presentation" to other subjects. In accordance with the subject PE, a dimensional-related comparison, for example, in the subject music, could lead to a comparison in terms of "classroom vs. music-related teaching".
The present study focused on effects of item placement based on the method of dimensional item comparison. In addition, it seems worthwhile to examine effects of item placement in general, since a general effect of item placement is also conceivable. Against this background, a test for such effects would be possible with two groups: one with the classroom item first, the other with the PE-specific item first. With such an item placement it could be possible to identify the direction of influence between classroom and PE-specific items (i.e., does the answer to the classroom item influence the answer to the PE-specific item or vice versa?).
Our study shows that traditional self-assessment procedures can only be recommended to a limited extent. The major problem lies in the interpretation of the self-assessment data and the conclusions to be drawn from them. One must not draw conclusions about teachers' actual competences from such data. This is especially true when trying to compare samples that differ in their professional qualification level.
Thus, our findings are especially relevant for the development of instruments of self-examination and the counselling of novice PE teachers. This approach simply requires self-assessment questionnaires which are easy to implement, and it lends itself well to the dimensional item comparison method. Indeed, the dimensional item comparison method is particularly well-suited to these kinds of practical application.

Notes
1. In Bavaria (Germany), the Ministry of Culture stipulates that student teachers of primary and middle school education who have not selected PE as a major or a minor are obliged to take a basic programme of PE of two to three contact hours per week per semester. In our study, we asked the students whether they had already taken this basic programme of physical education or not.
2. These three items are: (a) I know that I can maintain good contacts with parents even under difficult circumstances; (b) I feel confident in my ability to interest students in new projects; (c) I can facilitate innovations even when confronted with sceptical colleagues.

Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval
Ethical approval is not required for non-interventional studies such as the present survey due to national laws. All survey participants gave written informed consent to participate in the study and confidentiality of the data collected was maintained.