Tale of the tape: psychometric investigation of two popular teacher self-efficacy scales

ABSTRACT This study compares the psychometric properties of two popular teacher self-efficacy scales: the Teachers' Sense of Efficacy Scale (TSES) and the Teacher Efficacy Scale (TES). In this quantitative study, the two scales were distributed to 287 preservice teachers in the USA and Australia. Findings indicate that the two scales function well and show linear measurement properties after Rasch optimization. However, the TES's General Teacher Efficacy factor was found to have very low reliability. Regression analyses indicated a significant positive relationship between the Student Engagement subscale of the TSES and the TES. Overall, there is only a limited relationship between the two scales, which have been used for decades. Therefore, we argue for the need for a new scale that reflects all components of Bandura's Triadic Reciprocal Determinism.

relationship with 'Environment' and 'Behavior'; therefore, this model should be viewed as the framework for discussions of self-efficacy.
All teachers have a sense of self-efficacy; however, there are two specific types of belief systems that make up this broad term. First, a teacher's Personal Teaching Efficacy (PTE: nested in the 'Person' part of the TRD model) is a teacher's feelings of confidence in their teaching skills. Second, a General Teaching Efficacy (GTE, nested in the 'Environment' part of the TRD model) relies on external issues that are out of the teacher's control (Protheroe, 2008), such as parental involvement or school climate (Dembo & Gibson, 1985). Teachers with low GTE believe that the influence of the environment overwhelms their ability to have an impact on student learning. In this paper, we examined to what extent two popular teacher self-efficacy scales address PTE and GTE when administered to preservice teachers. Preservice teachers' beliefs affect their perceptions, judgments, decision-making and actions in the classroom (Cerit, 2010). In this sense, teacher training effectiveness can be considered according to the development of preservice teachers' cognitive structure of teaching competence, a significant part of which is a sense of efficacy (Yeung & Watkins, 2000). Additionally, determining the level of preservice teachers' self-efficacy beliefs may contribute to foreseeing how they will behave during inservice training (Chesnut & Burley, 2015). Therefore, it is imperative that we assess their teacher self-efficacy with valid scales.
The first scale we examined was the Teacher Efficacy Scale (TES; Gibson & Dembo, 1984), which has played an important role in spawning teacher efficacy research (Nie et al., 2011). Gibson and Dembo's (1984) study on the concept of teachers' self-efficacy belief, as well as other studies, revealed that it consisted of two factors (Hoy & Woolfolk, 1993; Torre Cruz & Arias, 2007): Personal Teacher Efficacy (PTE - the teacher's belief in their ability to bring about positive student outcomes) and General Teacher Efficacy (GTE - the teacher's belief that their ability to bring about desired outcomes is limited by external factors such as home environment and family background) (Gibson & Dembo, 1984). These factors were confirmed and used by many researchers until the late 1990s (e.g. Emmer & Hickman, 1991; Hoy & Woolfolk, 1993; Riggs & Enochs, 1990; Soodak & Podell, 1993). Since that time, the popularity of the scale has faded somewhat, due to issues with the validity of the GTE factor (e.g. Pajares, 1997; Tschannen-Moran & Woolfolk Hoy, 2001; Woolfolk & Hoy, 1990). Yet the scale continues to be used in a variety of sociocultural contexts, although its operationalization is questionable and its psychometric properties have been shown to be weak; in fact, the two factors account for just 28.8% of the total variance (Nie et al., 2011).
The second scale examined was the Teachers' Sense of Efficacy Scale (TSES; Tschannen-Moran & Woolfolk Hoy, 2001). Duffin et al. (2012) stated that the TSES had become the dominant means of assessing teacher self-efficacy among preservice teachers. The TSES's authors pointed out the conceptual and methodological shortcomings of the TES and developed the TSES in an effort to overcome them. Its three subscales measure teachers' beliefs about their abilities to engage their students, to use instructional strategies effectively, and to manage their classroom successfully. What each subscale is measuring has been under debate, with the authors claiming that the scale presents a unified and stable factor and that the subscales are therefore inter-correlated. This claim has been supported by some studies (e.g. Salas-Rodríguez et al., 2021) and challenged by others. In particular, Ma et al. (2019), who conducted an extensive analysis of studies, argued that 'the lack of a unified and stable factor structure in the TSES and similar scales has been clearly demonstrated in other research, so the original claim by Tschannen-Moran and Woolfolk-Hoy was not only anomalous and disconfirmed by their own data but has also been unsupported by subsequent research' (p. 626).
Overall, the TSES treats teacher self-efficacy as a task-specific, three-dimensional construct reflecting instructional practices, classroom management, and student engagement. This Bandura-based instrument, developed in reaction to the partial invalidity of scores on the TES, has been described by its authors as 'superior to previous measures of teacher efficacy in that it has a unified and stable factor structure' (Woolfolk Hoy & Burke-Spero, 2005, p. 354), but psychometric and theoretical issues need to be considered. Given that studies continue to use the two scales and to draw conclusions about a variety of educational topics, the issue of using valid scales remains at the forefront. The psychometric properties of the two scales are discussed further in the Instruments section.
One might ask what the rationale is for still examining the psychometric properties of scales that were originally constructed 40 (TES) and 20 (TSES) years ago. The answer is that no other scales have dominated the field despite their shortcomings; therefore, analyses and reviews of the two scales are still necessary. This has been a perpetual process that started shortly after the construction of the TES: because of its conceptual and statistical questions, back in 1993, Hoy and Woolfolk recommended that researchers conduct factor analysis on their own data when they use the scale, a recommendation that has been followed since. Despite its shortcomings, the scale has remained a popular instrument to date (e.g. Cerit, 2010; Semerci & Uyanik, 2018). As far as the TSES is concerned, back in 2001, the authors themselves (Tschannen-Moran and Woolfolk Hoy) acknowledged that their scale needed 'further testing and validation' (Tschannen-Moran & Woolfolk Hoy, 2001, p. 802). This issue has not gone away; as recently as 2019, Ma and colleagues argued that 'Careful examination of at least some aspects of the TSES and highly similar scales therefore seems warranted . . .' (Ma et al., 2019, p. 619). Further, we compared the two scales so that researchers who use one scale can understand how their scores correspond to the other. When researchers choose one of the two scales, they assume that they are measuring the same construct, i.e. teacher self-efficacy. So, it would be helpful to know whether the scores of one scale relate to the other and in what way. Given that there is still urgency to examine the psychometric properties of the two scales, we formulated the following research questions:

RQ1: Do the Teacher Efficacy Scale (TES; Gibson & Dembo, 1984) and Teachers' Sense of Efficacy Scale (TSES; Tschannen-Moran & Woolfolk Hoy, 2001) indicate good measurement quality (e.g. demonstrate evidence of construct validity, reliability, and functioning interval-level scaled units)?
RQ2: Do aspects of teacher self-efficacy concerning Student Engagement, Classroom Management, and Instructional Strategies (TSES factors) predict teacher self-efficacy at the personal level (TES PTE) (i.e. is there a relationship between the three factors of TSES and TES's PTE)?
RQ3: Do aspects of teacher self-efficacy concerning Student Engagement, Classroom Management, and Instructional Strategies (TSES factors) predict teacher efficacy at the environmental level (TES GTE) (i.e. is there a relationship between the three factors of TSES and TES's GTE)?

Participants
This study included a convenience sample of 287 preservice (trainee) teachers from two Anglo-Saxon countries: the USA (n = 135) and Australia (n = 152). The two countries were chosen because, in addition to a common language, they share similar professional standards for teachers. The USA has adopted the INTASC (Interstate New Teacher Assessment and Support Consortium, 2013) standards, while the Australian Professional Standards for Teachers were endorsed by Australia's Education Ministers in December 2010 and released by the Australian Institute for Teaching and School Leadership (AITSL) in February 2011. Since this study focuses on psychometric analysis, we included participants from these two countries to test the invariance of the scales and their generalizability across nations. Overall, the demographics indicate that the two groups were similar. More specifically, of the 135 US preservice teachers, 82% were female, similar to the ratio of male and female teachers in the US (National Center for Education Statistics, 2019). The race distribution of the US sample was 89% White Caucasian and 11% non-white. Moreover, 68% of US participants were under 25; 27% were between 26 and 35; and 5% were over 35 years of age. All participants were studying to become general education teachers: 23% in pre-school, 44% in elementary and 33% in high schools. Of the 152 Australian preservice teachers, 84% were female, similar to the ratio of male and female teachers in Australia (Australian Bureau of Statistics, 2015) and to the US sample. Furthermore, 77% of the Australian participants were under 25; 20% were between 26 and 35; and 3% were over 35 years of age. The race distribution of the Australian sample was 86% White Caucasian and 14% non-white.
The US participants were attending a public university in New York State, had completed their liberal arts requirements, and were registered in their first foundations of education course; thus, they were in the third of four years of undergraduate study. The participants had chosen to become teachers but did not yet hold a teaching licence. The Australian participants were also attending a public university, in the state of New South Wales, and were in the third year of a four-year undergraduate teacher training degree.

Instruments
In this study, the following two scales were distributed to the participants: 1. Teacher Efficacy Scale (TES; Gibson & Dembo, 1984). The TES was designed as a 30-item Likert scale. The authors built the scale on the formulations of the RAND studies but brought to bear the conceptual underpinnings of Bandura as well (Tschannen-Moran & Woolfolk Hoy, 2001). Perplexed when factor analysis of the items yielded a two-factor structure, the authors assumed that the two factors reflected the two expectancies of Bandura's SCT: self-efficacy and outcome expectancy. Consequently, the authors called the first factor Personal Teaching Efficacy (PTE), assuming that it reflected self-efficacy, and the second General Teaching Efficacy (GTE), assuming that it captured outcome expectancy. Thus, their study, as well as others, revealed that the scale consisted of two factors (Hoy & Woolfolk, 1993; Torre Cruz & Arias, 2007). Meanwhile, as research using the TES continued, researchers noticed that some items loaded on both factors; therefore, a shorter 10-item version ended up being used that included only items that loaded clearly on one factor or the other (Guskey & Passaro, 1994; Hoy & Woolfolk, 1993; Soodak & Podell, 1993; Woolfolk & Hoy, 1990). Although the TES has been a popular instrument to date (e.g. Cerit, 2010; Semerci & Uyanik, 2018), conceptual and statistical questions remain, and this is what our study examined.
While there is general consensus that PTE assesses teachers' beliefs in their competence, the interpretation of the GTE factor is still somewhat ambiguous. On the one hand, Emmer and Hickman (1991) called GTE 'external influences', which is reminiscent of Rotter's construct of external locus of control. On the other hand, Riggs and Enochs (1990) labelled GTE an outcome expectancy: the component of Bandura's SCT in which a person assesses the consequences of an act they expect to achieve. However, Bandura (1986) does not explain or justify such reasoning as an outcome expectancy. An outcome expectancy does not add much to the motivational explanation, as the outcome a person expects stems from a judgment of their own capabilities rather than from assessing what might be possible for somebody else to achieve in a similar situation. Therefore, GTE cannot be considered an outcome expectancy (Tschannen-Moran et al., 1998; Woolfolk & Hoy, 1990). After eliminating the outcome expectancy argument as an interpretation of GTE, researchers have to keep looking for an explanation of what is going on with this factor. A possible explanation might be that the GTE factor assesses the 'Environment' portion of TRD.
2. Teachers' Sense of Efficacy Scale (TSES; Tschannen-Moran & Woolfolk Hoy, 2001). The scale has been widely used. Ma et al. (2019) reported that just 'for the three years spanning 2016 to 2018 (inclusive), no fewer than 26 articles were based on the TSES or a close adaptation of it' (p. 619). The scale initially included 52 items based on a scale that Bandura himself had constructed; in the process of validation, it was reduced to 24 (long version) and 12 (short version) items. Factor analysis revealed three factors: efficacy for instructional strategies, for classroom management, and for student engagement. These factors have been regarded as comprising three moderately correlated but independent domains. As previously discussed, the authors claim that the scale presents a unified and stable factor; but such a claim has been challenged by Ma et al. (2019), whose analysis indicated that there is no empirical evidence of a unified and stable factor structure in the TSES, so the original claim by Tschannen-Moran and Woolfolk-Hoy has been unsupported.
The scale is supported by validity and reliability evidence (Klassen et al., 2009; Tschannen-Moran & Woolfolk Hoy, 2001; Wolters & Daugherty, 2007). When checking for construct validity in particular, the authors ran correlations with other scales, including the TES, and reported that the strongest correlations between the TSES and other measures were with scales that assess PTE. The strong correlations with other measures of PTE support the notion that the TSES deals mainly with the 'Person' factor of TRD and that the TES's GTE is measuring something else, most probably the 'Environment' factor of TRD.

Procedure
Participants in the U.S. and Australia were surveyed at the end of a lecture/course that dealt with foundational knowledge of education. The survey was distributed by a research assistant to all sections of the course while the instructor was not in the room. A box was provided for the completed surveys. The response rate was 68.7%. Participants were given an information sheet outlining the aims of the study and were informed that their participation was voluntary and anonymous.

Data analysis
Four main statistical analyses were conducted in order to answer the three research questions. To answer RQ1, the factor structures of the TES and the TSES were investigated by conducting a series of factor analyses on the two scales. The examination of the factor structures would provide evidence of the construct validity of the two multidimensional scales through the testing of convergent and discriminant validity, and reliability. Next, a series of analyses using the polytomous Rasch model (PRM) were run on the TES and the TSES and their respective dimensions (i.e. factor structures). These analyses were conducted to test 1) the linear measurement properties of the scales; 2) the functioning of the items; and 3) the unidimensionality of each factor per scale.
Moreover, Rasch analyses were conducted to provide evidence of construct validity. Here, we adopt a unitary approach to construct validity (Messick, 1989) and provide evidence via several approaches, such as confirmatory factor analysis (i.e. convergent and discriminant validity), criterion validity, and Rasch analysis. Rasch analysis provides unidimensionality testing through item and model fit statistics and post hoc tests examining patterns in the standardised residuals. These tests provide rigorous evidence of construct-irrelevant variance (or the lack thereof) and of construct validity.
After optimization of the TES and TSES subscales and affirmation of their linear measurement (interval level) scale properties, a series of multiple regression analyses (enter method) were conducted to investigate the predictive effect of Classroom Management, Instructional Strategies, and Student Engagement (TSES factors) on the TES PTE and GTE in order to answer RQ2 and RQ3, respectively.
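The regression step described above can be sketched as follows. This is a minimal illustration only: the data are simulated, and variable names are our own, but the design mirrors the enter method, with all three TSES predictors entered simultaneously to predict a TES outcome.

```python
import numpy as np

# Simulated person measures standing in for the Rasch-optimized subscale
# scores (the real study used N = 287 preservice teachers).
rng = np.random.default_rng(0)
n = 287
engagement = rng.normal(size=n)    # TSES Student Engagement
management = rng.normal(size=n)    # TSES Classroom Management
strategies = rng.normal(size=n)    # TSES Instructional Strategies
pte = 0.5 * engagement + 0.1 * management + rng.normal(scale=0.5, size=n)

# Enter method: all predictors in one simultaneous OLS model.
X = np.column_stack([np.ones(n), engagement, management, strategies])
beta, *_ = np.linalg.lstsq(X, pte, rcond=None)

residuals = pte - X @ beta
r_squared = 1 - residuals.var() / pte.var()
print(beta, r_squared)
```

The same design is repeated with the GTE measure as the outcome to address RQ3.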

Factor analyses of TES and TSES scales
A series of CFAs were conducted using IBM SPSS AMOS (version 29) on the two-factor TES and three-factor TSES. The maximum likelihood method was used to examine the sample variance-covariance matrix of the respective scales. Goodness of fit was calculated with the comparative fit index (CFI; Bentler, 1990), root mean square error of approximation (RMSEA; Steiger, 1990), and standardized root mean square residual (SRMSR; Little, 2013). The chi-square statistic is acknowledged to be sensitive to sample size. Acceptable model fit was determined by the commonly used criteria: CFI > .95, RMSEA < .06, and SRMSR < .08 (Little, 2013). The construct validity of the scales was assessed by the factor loadings, composite reliability (CR), average variance extracted (AVE), and convergent and discriminant validity testing. Factor loadings should be > .5, while > .7 is considered good. Composite reliability, which is similar to Cronbach's alpha, should be .7 or above. An AVE of .5 or above is considered evidence of convergent validity (Hair et al., 2010). Evidence of discriminant validity is assessed by comparing the square root of the AVE against the correlations of the factors or constructs: if the square root of the AVE is larger than the factor correlations, then the scales indicate discriminant validity (Fornell & Larcker, 1981).
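The CR, AVE, and Fornell-Larcker checks described above can be computed directly from standardized factor loadings. The loadings below are made-up placeholders (only the .36 factor correlation is taken from the TES results reported later); the formulas are the standard ones.

```python
import numpy as np

def composite_reliability(loadings):
    # CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)
    lam = np.asarray(loadings)
    num = lam.sum() ** 2
    return num / (num + (1 - lam**2).sum())

def average_variance_extracted(loadings):
    # AVE = mean of the squared standardized loadings
    lam = np.asarray(loadings)
    return (lam**2).mean()

pte_loadings = [0.72, 0.68, 0.65, 0.70, 0.61]  # hypothetical values
factor_correlation = 0.36                       # TES PTE-GTE correlation

cr = composite_reliability(pte_loadings)
ave = average_variance_extracted(pte_loadings)
# Fornell-Larcker criterion: sqrt(AVE) must exceed the factor correlation
discriminant_ok = np.sqrt(ave) > factor_correlation
print(cr, ave, discriminant_ok)
```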

Rasch analyses (RQ1)
The Rasch model is a mathematical model of probability which is commonly used in the psychometric testing of questionnaires and tests. It has utility in that it can be used to determine the linear functioning and interval-level quality of raw data. The model depicts the probability of correctly responding to an item as a logistic function of the relative distance between a person's location (i.e. ability, or their level of the trait) and an item's location (difficulty, or the amount of the trait expressed by the item) (Tennant & Conaghan, 2007). These location parameters are situated on the same logistic scale.
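The logistic relationship described above can be written out explicitly. The notation below (β for the person location, δ for the item location, τ for category thresholds) follows common Rasch conventions rather than symbols defined in this article:

```latex
% Dichotomous Rasch model: probability that person n succeeds on item i
P(X_{ni} = 1) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}

% Polytomous Rasch model (PRM), item i with maximum score m_i and
% thresholds \tau_{ik} (with \tau_{i0} \equiv 0):
P(X_{ni} = x) =
  \frac{\exp\left(\sum_{k=0}^{x} (\beta_n - \tau_{ik})\right)}
       {\sum_{j=0}^{m_i} \exp\left(\sum_{k=0}^{j} (\beta_n - \tau_{ik})\right)}
```

When β_n = δ_i, the dichotomous probability is exactly .5, which is why person and item parameters can be placed on the same logit scale.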
The PRM was run on all the subscales of the teacher efficacy scales using the full (Australian and US) sample (N = 287). Rasch Unidimensional Measurement Modelling (RUMM) 2030 software (Andrich et al., 2010) was used for all analyses. Person parameter estimates were obtained by the Weighted Maximum Likelihood method. Log-likelihood ratio tests indicated that partial credit parametrization be applied per Rasch analysis (all ps < .02), except in the cases of the Instructional Strategies (p = .74) and Student Engagement (p = .87) subscales of the TSES, where rating scale parametrization was applied. The RUMM 2030 program provides summary statistics inclusive of 1) threshold order (in the case of partial credit parametrization), 2) model fit, 3) reliability indices, 4) individual item fit, 5) differential item functioning (DIF), and 6) response dependency and unidimensionality requirements.
The correct ordering of response categories per item should indicate congruence between increasing levels of the latent trait and the likelihood of choosing higher level response categories and vice versa.Disordered thresholds indicate the reverse of this situation but can be resolved by collapsing anomalous response categories into a single category.
Fit of the data to the Rasch model is expressed by the item-trait interaction (χ² statistic), which indicates a consistent hierarchical ordering of items across the measured trait (Tennant & Conaghan, 2007). A non-significant result (p > .05) indicates acceptable model fit and is evidence of the correct linear functioning of the scale. The person separation index (PSI) indicates internal consistency (reliability) and functions similarly to a Cronbach's alpha (i.e. > .7 is acceptable). Individual item fit is determined by fit residuals falling within acceptable ranges (−2.5 to +2.5); by chi-square and F-test statistics (ANOVAs), where a non-significant p value (> .05) indicates item fit; and by visual inspection of the item characteristic curves (ICCs). ICCs are graphs which plot the observed data against a sigmoid theoretical curve (i.e. the Rasch estimates). Close proximity of the observed data to the theoretical curve indicates item fit.
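The theoretical sigmoid curve underlying an ICC can be sketched as below. This is an illustrative dichotomous case with arbitrary values, not output from RUMM 2030: the expected score is a logistic function of the person-item distance, and item fit is judged by how closely class-interval means track this curve.

```python
import numpy as np

def rasch_icc(theta, delta):
    # Expected probability of success for a dichotomous Rasch item:
    # a logistic function of the person-item distance (theta - delta).
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

# Theoretical curve over a range of person locations for an item at delta = 0
thetas = np.linspace(-4, 4, 9)
expected = rasch_icc(thetas, delta=0.0)
# Observed class-interval means plotted against this curve form the ICC;
# close proximity indicates item fit.
print(expected)
```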
Differential item functioning (DIF) is used to detect response bias in items. DIF occurs when groups of persons with the same levels of the latent trait respond significantly differently to items. In the Rasch model, items must work invariantly across groups for a scale to be unidimensional (Tennant & Conaghan, 2007). Hence, items indicating DIF will misfit the Rasch model and, consequently, undermine the construct validity of the measurement instrument. DIF is widely used to investigate the cross-cultural validity of measurement instruments (Tennant & Conaghan, 2007). Because our sample was cross-cultural, it was important to test for any country-specific bias in the items. DIF by gender was also run because strong gendered effects have been found in prior research (e.g. Ehrich et al., 2020).
There are two main types: uniform DIF, where groups respond differently on items across class intervals in a consistent manner, and non-uniform DIF, where group responses across class intervals are inconsistent. DIF can be detected by running two-way ANOVAs on the standardised residuals to determine group differences; DIF is indicated when a significant difference between comparison groups is attained (i.e. p < .05 after Bonferroni adjustments). DIF can also be detected by observation of each comparison group's ICCs (i.e. plotted observed means by class interval), whereby distance or lack of proximity between the comparison groups' ICCs indicates response bias. DIF was run on the person factors of gender (male versus female) and country (Australia versus U.S.A.).
In cases where significant uniform DIF is found, it can be resolved by splitting the item, that is, by creating two group-specific items (i.e. one item for each group) and separately calibrating the item for each group.In cases of significant non-uniform DIF, the item cannot be resolved and is usually removed from the scale (Pallant & Tennant, 2007).
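The item-splitting resolution for uniform DIF amounts to replacing one item column with two group-specific columns, so each group is calibrated on its own copy of the item. A minimal sketch with hypothetical responses and group labels:

```python
import numpy as np

# Hypothetical responses to one item showing uniform DIF by country
responses = np.array([3, 4, 2, 5, 4, 1], dtype=float)
country = np.array(["AUS", "USA", "AUS", "USA", "AUS", "USA"])

# Split: each group keeps only "its" copy of the item; the other copy is
# set to missing, which Rasch software treats as not administered.
item_aus = np.where(country == "AUS", responses, np.nan)
item_usa = np.where(country == "USA", responses, np.nan)
print(item_aus, item_usa)
```

The two split items are then calibrated separately, yielding a group-specific difficulty for each, and overall model fit is re-checked.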
Finally, tests of response dependency and unidimensionality indicate whether items are too closely related and whether only one latent trait is measured per dimension, respectively. Response dependency and unidimensionality are both tested by a post hoc principal components analysis (PCA) on the standardised person-item residuals. In theory, such residuals are considered random noise (Tennant & Conaghan, 2007); consequently, any patterns which emerge can suggest multiple dimensions in the data. Response dependency is determined by an examination of the PCA residual correlation matrix. Items whose residuals are strongly related to each other (> .3) are considered to be response dependent, that is, the response to one item affects the response to another. Unidimensionality testing follows Smith's (2002) t-test procedure. In this procedure, subsets of negatively and positively loaded items from the first principal component are compared for each person measure using multiple t-tests. A large percentage of cases which are significantly different (i.e. > 5%) at the 5% level suggests potential multiple dimensions in the data.

Results of factor analyses: TES (Gibson & Dembo, 1984) and TSES (Tschannen-Moran & Hoy, 2001)

TES results. An initial CFA on the two-factor TES provided poor fit, chi-square (100) = 261.3, p < .001; CFI = .876; TLI = .851; RMSEA = .079; SRMSR = 1.057, where three items from the GTE factor (items 2, 3, & 14) did not load onto the model (all loadings < .4). Acceptable model fit was achieved only after removal of these three items, chi-square (62) = 129.5, p < .001; CFI = .940; TLI = .925; RMSEA = .065; SRMSR = .061. All loadings were above .52 except for item 1 (.44) and item 16 (.43). Levels of AVE for both subscales fell below the required level of .5 (Personal Teacher Efficacy [PTE] subscale = .45; General Teacher Efficacy [GTE] subscale = .32), indicating poor convergent validity. The CR indices indicated good reliability for the PTE subscale (.87) and low reliability for the GTE subscale (.65). Discriminant validity evidence was attained as the √AVEs (PTE subscale = .67; GTE subscale = .57) were larger than the factor correlation (.36).
In summary, these factor analyses indicated evidence for a two-factor structure for the TES, although three items failed to load (items 2, 3 & 14). The TES was found to have poor convergent validity but good discriminant validity. Similarly, evidence was found supporting the three-factor TSES structure. In contrast with the TES, the TSES was found to have good convergent validity but poor discriminant validity.

RQ1: Results of Rasch analysis
Teacher Efficacy Scale (TES; Gibson & Dembo, 1984) Personal Teacher Efficacy (PTE) subscale. An initial Rasch analysis on the PTE subscale (items 1, 5, 6, 7, 9, 10, 12, 13, 15) indicated disordered thresholds in items 1, 5, 10 and 15. Items 1, 10 and 15 were resolved by collapsing the 1 and 2 response categories into a single category. A significant item-trait interaction indicated misfit of the data to the Rasch model, χ²(36) = 98.6992, p < .001. However, the scale had good reliability (PSI = .84, α = .87). Examination of the individual item fit statistics indicated misfit in items 1 (When a student does better than usual, many times it is because I exerted a little extra effort), 5 (When a student is having difficulty with an assignment, I am usually able to adjust it to his/her level), and 6 (When a student gets a better grade than he usually gets, it is usually because I found better ways of teaching that student) (see Table 1 for details).
A DIF analysis on the person factor of gender (male vs female) indicated significant non-uniform DIF in items 1 (F = 5.046, p < .001) and 6 (F = 5.015, p < .001), and by country (USA vs Australia) in items 1 (F = 5.527, p < .001) and 6 (F = 5.829, p < .001). By gender, males ranked themselves higher than expected whereas females ranked themselves lower than expected in item 1 (When a student does better than usual, many times it is because I exerted a little extra effort). This may reflect a more general pattern of males displaying more self-confidence in their teaching abilities than females (e.g. Ehrich et al., 2020). The same applied to item 6 (When a student gets a better grade than he usually gets, it is usually because I found better ways of teaching that student); DIF here could also be attributed to greater male confidence. DIF by country in these items is more difficult to explain, with lower-ability US students ranking themselves higher than expected while higher-ability students ranked themselves lower than expected, and vice versa for the Australian student sample. Possibly, this may reflect differences in the practicums of the two teacher education programs. For example, preservice teachers in Australia commence practicums in schools earlier in their degree than US preservice teachers. More specifically, at the time of the study, the US participants were taking their first education course and had no classroom experience before that class. Hence, US preservice teachers may overstate their teaching abilities until they accrue authentic classroom experience, and vice versa for the Australian sample.
A PCA on the standardised residuals indicated response dependency between items 6 (When a student gets a better grade than he usually gets, it is usually because I found better ways of teaching that student) and 9 (When the grades of my students improve it is usually because I found more effective teaching approaches) (residual correlation = .33), indicating that responses on these items were too closely related. Unidimensionality testing following Smith's (2002) t-test method indicated a multi-dimensional scale, with a large percentage of cases that were significantly different (10.21%) at the 5% level. Overall, the subscale indicated poor functioning according to the Rasch model. Consequently, another Rasch analysis was run on items 7, 9, 10, 12, 13 and 15 after the removal of misfitting and response-dependent items and/or items with significant non-uniform DIF (i.e. items 1, 5, & 6). This reduced subscale indicated good fit of the data to the Rasch model, χ²(24) = 22.3, p = .56, good reliability (PSI = .82; α = .87), and correct ordering of thresholds per item. All individual items indicated well-functioning fit statistics according to the model. A DIF analysis indicated no significant DIF on the person factor gender; however, significant uniform DIF was found on the person factor country (Australia vs USA) in item 7 (When I really try I can get through to most students) (F = 9.191, p < .003) after Bonferroni adjustments (alpha = .003). However, after splitting the item, the model fit statistics indicated good fit, χ²(28) = 25.8, p = .58, and reliability (PSI = .82, α = .87); hence the item was retained.
A PCA on the standardised residuals and a review of the residual correlation matrix indicated no response dependency (all correlations < .3). Unidimensionality testing following Smith's (2002) t-test procedure indicated that 7.45% of cases differed at the 5% level, which is a violation of unidimensionality requirements, indicating potential multiple dimensions in the data. Overall, however, the 6-item PTE subscale (items 7, 9, 10, 12, 13 and 15) functioned well according to the Rasch model.

General Teacher Efficacy (GTE) subscale. The GTE subscale (items 2, 3, 4, 8, 11, 14 and 16) misfit the Rasch model, χ²(28) = 67.85, p < .001. Examination of the threshold map indicated ordered thresholds for all items, except for items 8 and 11, which were both resolved by collapsing response categories 5 and 6, and item 14, which was resolved by collapsing the 1 and 2 response categories. Item 8 (A teacher is very limited in what he/she can achieve because a student's home environment is a large influence on his/her achievement) and item 14 (The influences of a student's home experiences can be overcome by good teaching) indicated misfit to the model (see Table 1).
A PCA on the standardised residuals indicated no response dependency (all correlations < .3).Unidimensionality results indicated a slight violation of requirements with 6.01% of cases differing at the 5% level.
After removal of the misfitting items (i.e. items 8 and 14), another Rasch analysis was run on the remaining items (2, 3, 4, 11 and 16), which found good fit to the model, χ²(20) = 21.5, p = .37. All thresholds were ordered with the exception of item 11, which became ordered when the response categories of (1 and 2) and (5 and 6) were collapsed into single categories. All items were found to fit the Rasch model. However, the reliability index indicated very low reliability (PSI = .57, α = .62). DIF analysis found no significant DIF by country or gender after Bonferroni adjustment (p < .003). Finally, a PCA on the standardised residuals indicated no response dependency (all correlations < .3). Evidence of a unidimensional scale was also found, with only 2.82% of cases differing at the 5% level.
Teacher self-efficacy scale (TSES; Tschannen-Moran & Woolfolk Hoy, 2001)

Classroom Management.
Overall fit of the data to the Rasch model was found for the Classroom Management subscale after the removal of item 1 (How much can you do to control disruptive behaviour in the classroom?), which misfit the model, χ²(9) = 7.6, p = .57. All other items fit the model (see Table 2) and all threshold values were ordered. Reliability indices were reasonable (PSI = .79, α = .84). A DIF analysis on the person factors of gender (male vs female) and country (Australia vs USA) indicated no significant DIF. A PCA on the standardised residuals indicated no response dependency (all correlations < .3). However, post hoc testing following Smith's (2002) t-test procedure indicated a slight violation of unidimensionality requirements, with 6.19% of cases differing at the 5% level. Overall, the Classroom Management subscale functioned well as a 3-item scale according to the Rasch model.

Instructional Strategies.
Good fit of the data to the Rasch model was found for the Instructional Strategies subscale (items 5, 9, 10 and 12), χ²(9) = 15.5, p = .22. The PSI and alpha indicated a reliable scale (PSI = .80, α = .84). All individual items fit the model well (see Table 2). A DIF analysis on the person factor of gender (male vs female) indicated no significant DIF. However, significant uniform DIF was found for country (USA vs Australia) in item 12 (How well can you implement alternative strategies in your classroom?) (F = 5.24, p = .003) after Bonferroni adjustment (alpha = .004). The item was resolved after splitting; that is, the revised item fit the Rasch model, χ²(15) = 24.0, p = .06 (PSI = .80, α = .84). A PCA on the standardised residuals indicated no response dependency (all correlations < .3). However, there was a slight violation of unidimensionality requirements, with 6.6% of cases differing at the 5% level. Overall, the 4-item subscale functioned well according to the Rasch model.

Student Engagement.
Good fit to the Rasch model was found for the Student Engagement subscale (items 2, 3, 4 and 11), χ²(9) = 16.3, p = .06, with good reliability indices (PSI = .83, α = .79). All individual items fit the model except item 11 (How much can you assist families in helping their children do well in school?), which was consequently removed from the scale (see Table 2). Some misfit was found in item 3 (How much can you do to get students to believe they can do well in school work?); however, examination of its ICC indicated good functioning, so the item was retained. No significant DIF was found on the person factors of gender (male vs female) or country (Australia vs USA). Examination of the residual correlation matrix indicated no strongly correlated items (all < .3). Unidimensionality requirements were also met, with fewer than 5% of cases differing at the 5% level (4.36%). Overall, a well-functioning 3-item scale was found according to the Rasch model.

RQ2 and RQ3 results of regression analyses
Two multiple regression analyses (enter method) were conducted to investigate the predictive effect of Classroom Management, Instructional Strategies, and Student Engagement (TSES factors) on the TES PTE and GTE (outcome variables). No items were removed from any of the scales in these analyses, to allow comparison with prior research. The results of the multiple regression for Model 1 (predictors: classroom management, instructional strategies, student engagement; outcome: PTE) showed that the predictors significantly explained 34.6% of the variance (R² = .346, F(3, 260) = 45.32, p < .001). Only student engagement and classroom management were significant predictors, accounting for 31.8% and 2.2% of the unique variance, respectively. Student engagement was the strongest predictor: a one standard deviation increase in student engagement was associated with an increase of .571 standard deviations in the TES PTE (β = .57, p < .001), whereas for classroom management the corresponding increase was only .213 standard deviations (β = .213, p < .005). Instructional strategies did not significantly predict the TES PTE.
The results of the multiple regression for Model 2 indicated that the predictors (classroom management, instructional strategies, student engagement) significantly explained 93.3% of the variance (R² = .933, F(3, 264) = 1216.61, p < .001). However, only student engagement was a significant predictor, accounting for most of the variance (90.25%). A one standard deviation increase in student engagement was associated with an increase of .962 standard deviations in the TES GTE (β = .96, p < .001). The results for both regression models are displayed in Table 3.
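The "unique variance" figures reported above correspond to squared semi-partial correlations, i.e. the drop in R² when a single predictor is removed from the full model. A minimal sketch of that calculation with simulated data follows; the variable names, effect sizes, and sample size are illustrative assumptions, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# hypothetical standardised predictors standing in for the three TSES subscales
engagement = rng.normal(size=n)
management = rng.normal(size=n)
strategies = rng.normal(size=n)
# simulated outcome in which engagement dominates, echoing the reported pattern
pte = 0.6 * engagement + 0.2 * management + rng.normal(scale=0.8, size=n)

# full model: intercept + all three predictors, fit by ordinary least squares
X = np.column_stack([np.ones(n), engagement, management, strategies])
beta, *_ = np.linalg.lstsq(X, pte, rcond=None)
pred = X @ beta
r2_full = 1 - ((pte - pred) ** 2).sum() / ((pte - pte.mean()) ** 2).sum()

def r2_without(col: int) -> float:
    """R-squared of the model refit with predictor column `col` removed."""
    Xr = np.delete(X, col, axis=1)
    b, *_ = np.linalg.lstsq(Xr, pte, rcond=None)
    resid = pte - Xr @ b
    return 1 - (resid ** 2).sum() / ((pte - pte.mean()) ** 2).sum()

# unique variance = squared semi-partial correlation for each predictor
unique_engagement = r2_full - r2_without(1)
unique_management = r2_full - r2_without(2)
```

The unique variances need not sum to R² when predictors are correlated, which is why the reported 31.8% and 2.2% fall short of the model's 34.6%.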
Note that the US and Australian samples were combined in the preceding regression analyses. Because measurement invariance did not hold for some items (i.e. significant non-uniform DIF was found in a number of items in the TES PTE and GTE scales), this may have confounded the results. Therefore, an identical regression analysis was run with the misfitting items and items showing significant non-uniform DIF removed. The results were unchanged; hence, using the combined sample was justified. Contact the authors for details of this analysis.

Discussion
It is advantageous for teachers to have a high level of teacher self-efficacy, since there is abundant evidence that it positively affects the motivation, attitudes, and performance of both teachers and students (e.g. Ashton & Webb, 1986; Muijs & Reynolds, 2015; Turkoglu et al., 2017; Zee & Koomen, 2016). Several scales have been developed over time to assess the construct of teacher self-efficacy, but there are still unresolved issues concerning its measurement (Klassen et al., 2011). In the present study we compared the results of two popular scales, the Teacher Efficacy Scale (TES; Gibson & Dembo, 1984) and the Teachers' Sense of Efficacy Scale (TSES; Tschannen-Moran & Woolfolk Hoy, 2001), when administered to a sample of 287 preservice teachers (152 from Australia and 135 from the USA). First, we examined RQ1, in which we looked for evidence of sound psychometric functioning of the two scales. We concluded that the two scales function well and indicate linear measurement properties after Rasch optimization through the removal of the items that misfit the model. The analysis of the scales demonstrated evidence of both gender and country invariance and indicated generalizable subscales. Of some concern, however, was the TES GTE subscale, which indicated very low reliability (.57), significantly below minimal acceptable standards for a measure of internal consistency. We then compared the two scales with each other (RQ2 and RQ3). The reason for this comparison was so that researchers who use one scale can understand how their scores correspond to the other scale. When researchers choose one of the two scales, they assume that they are measuring the same construct, i.e.
teacher self-efficacy. So, it would be helpful to know whether, and in what way, the scores of one scale relate to the other. With that in mind, we conducted regression analyses and established significant relationships between two of the three TSES subscales and TES's PTE, and one with TES's GTE. More specifically, we found a significant positive relationship between the Student Engagement and Classroom Management subscales of the TSES and TES's PTE. There was also a significant positive relationship between the Student Engagement subscale of the TSES and TES's GTE. These findings indicate that the abovementioned subscales measure similar constructs in the two scales, which provides evidence of criterion validity for the two subscales. It is apparent from the analyses that most of the variance, in both TES's GTE and PTE, is explained by Student Engagement.
It has been more than 20 years since correlations were run between the two scales (Hoy & Woolfolk, 1993; Tschannen-Moran & Woolfolk Hoy, 2001); the results indicated that the long form of the TSES correlated significantly with TES's PTE (r = 0.64, p < 0.01) as well as its GTE (r = 0.16, p < 0.01), although the latter correlation was lower. Similar results were reported for the short form. We also found a positive relationship between the TSES and TES's PTE (r = .33, p < .001) and TES's GTE (r = .15, p < .02), computed on the raw total scores of these scales. The findings were general and expected: the stronger correlations were found between the TSES and TES's PTE. This is not surprising, since both deal with the 'Person' component of TRD. However, our psychometric findings indicate that the relationship among the various subscales is more complicated than these simple correlations between scale totals would suggest. We found significant positive relationships only between the Student Engagement and Classroom Management subscales of the TSES and TES's PTE, and between the TSES's Student Engagement and TES's GTE. Our findings indicate that these subscales are the points of intersection between the two scales. The remaining TSES subscales share no relationship with TES's PTE and GTE, which indicates that they are measuring different constructs. TSES's Student Engagement, which explains most of the variance in the regression analyses, can be defined as depicting students' willingness to participate in routine school activities, such as attending class, submitting required work, and following teachers' directions in class (Chapman, 2003; Woodcock & Tournaki, 2023). Such important skills should be emphasized by teachers and assessed by scales of self-efficacy.
In conclusion, our results take us back to the psychometric limitations of the scales and the measurement of the elusive construct of teacher self-efficacy. In this field, researchers deal with the properties of each scale separately in their studies and, as already mentioned, both scales have psychometric shortcomings in their current forms (i.e. misfitting, response-dependent, and non-invariant items, and multiple dimensions). To resolve them, researchers have been in a perpetual process of seeking to assess the construct of self-efficacy. This process started shortly after the construction of the TES, owing to its conceptual and statistical questions; back in 1993, Hoy and Woolfolk recommended that researchers conduct factor analysis on their own data whenever they use the scale, a practice that has been followed to this day (e.g. Cerit, 2010; Semerci & Uyanik, 2018). As far as the TSES is concerned, back in 2001 the authors themselves (Tschannen-Moran and Woolfolk Hoy) acknowledged that their scale needed 'further testing and validation' (p. 802). This issue has not gone away, and as recently as 2019, Ma et al. were arguing that we should keep examining aspects of the TSES. Therefore, our findings add to the question (and not the answer) not only of how scores obtained by one scale compare to scores on the other but, more importantly, of what it is that the two scales are measuring. The lack of correlations among most subscales of the two instruments is problematic, since it brings us back to issues of construct validation. So, in order to move forward, we need to go back and examine again Bandura's definition of self-efficacy: 'Perceived self-efficacy refers to beliefs in one's capabilities to organize and execute the courses of action required to produce given attainments' (Bandura, 1997, p. 3). This definition highlights some inherent challenges in measuring the construct of self-efficacy, as it appears to involve beliefs on more than one dimension; based on the definition, more than beliefs are at play in the construct. It is clear, both theoretically and empirically, that one of the challenges to the two self-efficacy scales is their inability to capture the full richness of the construct itself. To date, most empirical studies have failed to use complex multidimensional measures. In 1997, Bandura cautioned researchers that measuring self-efficacy using undifferentiated self-efficacy scales was problematic and would result in poor validity. Moreover, both of the scales we compared rely on cross-sectional designs and are therefore unable to account for the particular and potentially fluctuating nature of teacher self-efficacy. Therefore, a new, well-validated scale that takes into account the richness of the definition of self-efficacy as well as the shortcomings of the existing scales is needed. Such a scale will also have to take into account both the 'Person' and 'Environment' components of TRD in its construction and item analysis, and be valid for both preservice and inservice teachers.

Limitations
The results of this study should be viewed in light of its limitations. The data are self-reported and, as such, potentially biased, as respondents may tend to overestimate themselves (Bryman, 2008). While both scales have been validated with preservice and inservice teachers, in this study we surveyed only preservice teachers, keeping in mind that 'different sets of attributes and outcomes related to self-efficacy have been claimed for each group' (Ma et al., 2019, p. 614). Lastly, the sample was one of convenience, similar to the studies reviewed in the literature, and came from only one state in each country, so caution should be used when making national and international comparisons.
Despite these limitations, it is widely accepted that teacher self-efficacy plays an important part in teachers' performance in the classroom; therefore, its valid assessment is imperative. Our findings indicate a limited relationship between the two popular scales that have been used for decades; therefore, we argue for the need for a new scale that reflects all the components of Bandura's TRD.

Table 1 .
Fit of the individual items of the personal and general teacher efficacy scales to the Rasch model.

Table 3 .
Multiple regression results for TSES factors (classroom management, instructional strategies, student engagement) predicting TES PTE and GTE.

Table 2 .
Fit of the individual items of the TSES (Tschannen-Moran) to the Rasch model.