Teachers’ self-efficacy in teaching mathematics: tracing possible changes from teacher education to professional practice

ABSTRACT This article reports on a cross-sectional study tracing possible changes in primary teachers’ self-efficacy in teaching mathematics across various points in their professional careers, involving both novice and experienced pre-service teachers (PSTs) and in-service teachers (ISTs). In the relevant literature, self-efficacy appears to be operationalised differently for PSTs and ISTs. Here, we conceptualise it as a unidimensional construct when measuring the future-oriented beliefs teachers hold regarding their own ability to explain mathematics to others, with a focus on understanding the underlying mathematical concepts and procedures. To measure teacher self-efficacy, we used a previously developed and validated 20-item instrument. Participants included novice PSTs (n = 191), experienced PSTs (n = 130), novice ISTs (n = 119) and experienced ISTs (n = 194). Rasch analysis enabled a comparison of the results from the four groups, confirming the theory-based expectation that self-efficacy in teaching mathematics develops with experience. The article concludes with a discussion of the implications of our work, as well as directions for further research. For example, future studies could shed light on the conditions under which the development of self-efficacy can be supported more effectively at different transition points of mathematics teachers’ professional careers.


Introduction
This paper constitutes a first attempt to explore the developmental nature of teachers' self-efficacy in teaching mathematics. As we explain subsequently, the developmental nature of self-efficacy is not explored here directly (through a longitudinal exploration of a specific cohort over a long period of time). Rather, it is approached by proxy, by considering snapshots of the self-efficacy of different cohorts in a cross-sectional manner. In general, self-efficacy is concerned with people's perceptions of their own competence rather than with their actual competence. Nevertheless, the significance of the concept lies with its potential to have 'consequences for the course of action' people 'choose to pursue and the effort they exert in those pursuits' (Hoy & Spero, 2005, p. 344). Along these lines, a large body of literature has established positive correlations between teacher self-efficacy and factors such as pupil achievement, teachers' instructional quality, and teachers' management of educational reform (Bruce & Ross, 2008; Depaepe & König, 2018; Gabriele & Joram, 2007). Additionally, a well-developed sense of self-efficacy appears to be negatively associated with teacher anxiety and burnout (Brouwers & Tomic, 2000; Gresham, 2008).
As indicated by several authors (Brinkmann, 2019; Giles et al., 2016), self-efficacy in teaching mathematics (SETM) develops mainly during students' pre-service teacher (PST) education. In other studies, a fluctuation in self-efficacy during the first years of teaching has been observed (Hoy & Spero, 2005; Işıksal-Bostan, 2016; Thomson et al., 2022). Nonetheless, to the best of our knowledge, the development of SETM at various points in teachers' professional careers (e.g. from pre-service education to the first years of teaching to having several years of professional experience) remains unexplored. Therefore, the current study aims to help fill this gap by exploring the following research question: In what possible ways does SETM change throughout teachers' professional careers?
In the following sections, we examine some theoretical considerations that helped us frame our study, followed by a presentation of our methodology and findings. We conclude with a discussion of the contributions and implications of this study, as well as the limitations of our work and how this research area could be advanced.

Self-efficacy and its sources
The concept of self-efficacy was introduced by Bandura (1977, 1997) and refers to the future-oriented beliefs people hold regarding their own capabilities to accomplish specific tasks or courses of action. As such, self-efficacy 'is not concerned with what someone believes they will do, but about what someone believes they can do' (Maddux & Kleiman, 2016, p. 89) in specific conditions. For Bandura (1977), self-efficacy stems from four sources: mastery experiences (i.e. past experiences of success in performing a particular task, or the idea of, 'I did it before, so I can do it again'); vicarious experiences (i.e. observing others successfully performing a task, or the idea of, 'If they can do it, then I can do it too'); verbal persuasions (i.e. feedback and encouragement received from other people, or the idea of, 'These people believe in me, so I can do it'); and psychological/affective responses (e.g. stress and anxiety levels). According to Bandura (1997), mastery experience is the most powerful source, a view supported by a significant number of studies (Usher & Pajares, 2008). However, mastery experiences can be elusive in relation to complex tasks and situations because it is not always easy to identify when one has been successful (Palmer, 2011). In this respect, Skaalvik and Skaalvik (2007) emphasise that it is not success per se that provides efficacy information but rather one's perception of success. Such a perception may derive from different sources or combinations of them (Morris et al., 2023). Hence, the four sources outlined above do not work in isolation. Rather, an individual's self-efficacy is a function of inputs from various sources.

Self-efficacy and the teaching of mathematics
In mathematics teacher education research, self-efficacy has been extensively examined in relation to PSTs and, to a lesser extent, in-service teachers (ISTs). However, the concept appears to be conceptualised and operationalised differently, depending on which of these two groups of participants is involved. In studies involving PSTs, self-efficacy has been explored in relation to mathematics (one's beliefs about one's own subject knowledge competence; e.g. Bjerke & Solomon, 2020; Norton, 2019), in relation to mathematics teaching (one's beliefs about one's own ability to teach mathematics in meaningful and supportive ways; e.g. Charalambous et al., 2008; Giles et al., 2016) or as a two-dimensional construct comprising both mathematics and mathematics teaching (Bates et al., 2011; Briley, 2012). In addition, self-efficacy is often studied in relation to actual mathematics subject knowledge (Akay & Boz, 2010; Carney et al., 2016). Overall, in the PST body of literature, especially in studies that tend to approach mathematics self-efficacy and SETM as distinct constructs, it is established that these two are highly interrelated and that together they have an impact on mathematical competence, knowledge of mathematical concepts and the fluency of procedures (Bates et al., 2011; Briley, 2012; Li & Kulm, 2008).
Studies involving ISTs appear to be significantly less common than those involving PSTs. The limited number of published papers on ISTs focus mainly on SETM, that is, teachers' perceived efficacy in teaching mathematics effectively and managing the classroom (e.g. Charalambous & Philippou, 2010; Wilhelm & Berebitsky, 2019). Occasional exceptions (Andrews & Xenofontos, 2015; Beswick et al., 2012; Xenofontos & Andrews, 2020) approach self-efficacy from the perspectives of both subject knowledge and teaching. This relative absence of research 'alludes to a hidden (yet erroneous) assumption that in-service teachers have a high sense of mathematics self-efficacy' (Xenofontos & Andrews, 2020, p. 263).
We acknowledge that one's beliefs about one's own mathematical competence and ability to teach mathematics are not always easily distinguishable (Bates et al., 2011; Bjerke & Eriksen, 2016; Briley, 2012; Xenofontos & Andrews, 2020). Therefore, for the purpose of this study, we conceptualise SETM as a unidimensional construct, involving both beliefs about one's own subject knowledge competence and beliefs about one's own ability to explain mathematics to others so that learners can understand the underlying mathematical concepts and procedures.

Methodology
To examine in what possible ways SETM changes throughout teachers' professional careers, a cross-sectional research design was chosen instead of a longitudinal design (Robson, 2002; Shanahan, 2010). In acknowledging what can be seen as a shortcoming of a cross-sectional design, we want to highlight that in this work we explore possible changes in SETM. Ideally, we would have liked to have followed the same sample of participants at different stages of their professional careers, so that we could talk about the developmental nature of SETM. However, this was not possible for practical reasons. For example, in our sample, we have participants with more than 10 years of teaching experience; it would not have been realistic to follow them from their undergraduate studies up until their tenth year of teaching. Typically, studies employing longitudinal designs to examine teachers' self-efficacy follow the same participants over a much shorter period (e.g. Hoy & Spero, 2005; Işıksal-Bostan, 2016; Thomson et al., 2022). In addition, cross-sectional designs are regularly used in research about the professional development of teachers in terms of their beliefs, knowledge and practices (e.g. see Collie et al., 2020; Dudenhöffer et al., 2017; Lawrence et al., 2019). In the following section, we give an account of our study's participants and the stages they were at in their teaching careers. Then, we describe the instrument used to gather our data before outlining the process of validating, anchoring and analysing our data.

Participants
The data in this study were collected in person at a Norwegian university (targeting PSTs) and through an online survey (targeting ISTs). In Norway, primary teacher education has been a 4-year programme since the beginning of the 1990s (Skagen & Elstad, 2023), leaving most of the ISTs in our data sample with the same educational background as the PSTs in our sample.
The PSTs were enrolled in a 4-year teacher education programme for primary school teachers (i.e. grades 1-7, ages 6-13) that included a compulsory 30-credit course in mathematics pedagogy spanning the first 2 years, which was previously reported on by Bjerke and Eriksen (2016). We divided the PSTs into two groups: the first group (hereafter called novice PSTs, n = 191) completed the survey at the start of their first semester, while the second group (hereafter called experienced PSTs, n = 130) completed the survey near the end of their second year, coinciding with the end of the compulsory mathematics pedagogy course.
The ISTs, also primary teachers, were invited (both through the researchers' own professional networks and through invitations shared in closed groups for Norwegian teachers on social media) to fill out an online version of the same instrument to which the PSTs responded. The ISTs take part in the closed groups for various reasons: some because they are interested in discussing mathematics teaching and sharing their ideas, others because they need help and support in their struggle with mathematics and mathematics teaching. We suggest that this in sum reflects the diversity in the group of PSTs. Several empirical studies (e.g. Graham et al., 2020; Henry et al., 2012; Tricarico et al., 2015) have suggested that the first five years of teaching can be seen as a threshold in one's professional career, representing a sufficient amount of time for one to gain a range of experiences as a practicing teacher. For this reason, we divided the ISTs into two groups: those with 1-5 years of mathematics teaching experience (hereafter called novice ISTs, n = 119) and those with more than 5 years of experience (hereafter called experienced ISTs, n = 194).

The instrument
Data were collected through an instrument explicitly targeting a core activity of teaching mathematics: helping a generic child/pupil with mathematical tasks. The survey was originally designed to measure and map the development of SETM during teacher education and is described in more detail in Bjerke and Eriksen (2016). It consists of 20 items/tasks, each requiring participants to indicate their level of confidence in performing that task on a 4-point Likert scale (with the response categories 'Not confident', 'Somewhat confident', 'Confident' and 'Very confident'). Specifically, 10 items focus on rules and algorithmic procedures in mathematics (e.g. 'How confident are you that you can help a child with the task, "Calculate 23 × 0.7"?'), which simply ask for instrumental calculations (hereafter referred to as calculate items). The other 10 items focus on reasoning and explanations (e.g. 'How confident are you that you can help a child with the task, "Explain why you must find the common denominator when you add two fractions"'), requiring what Skemp (1976) called relational understanding, or, 'knowing both what to do and why' (p. 20; hereafter referred to as reasoning items). The exact wording of all 20 items can be found in Bjerke and Eriksen (2016).
In line with the approach taken when validating and reporting the instrument (Bjerke & Eriksen, 2016) and when reporting on development in PSTs (Bjerke, 2017), the Rasch Rating Scale Model (RSM) was applied when analysing the data. In general, the RSM aims at studying anomalies in data, instead of choosing a model that best describes the given data. In this respect, the RSM helped in identifying the items that best measured the underlying construct. Furthermore, it revealed the items that were interpreted in the same way by the different groups of participants.

Data analysis
Likert scales yield ordinal data (typically assigned numbers, e.g. 0 for 'Not confident' and 3 for 'Very confident') that cannot be approached as linear (Boone et al., 2014). This means that the spacing between the response categories is not necessarily equal. For this reason, using the raw score obtained by adding up a participant's responses on a Likert scale to denote their level of self-efficacy is problematic (Bond & Fox, 2007).
With the RSM, ordinal data are converted into linear measures. As such, the strength of the RSM lies in the construction of a genuine interval estimate for the underlying construct, wherein both items and persons are measured on the same scale in logits (i.e. the logarithm of the odds of success). As a result, comparisons between items, between persons, and between items and persons are possible in the form of establishing a person's probable answer to an item: the higher the estimate is for a person, the more self-efficacious that person feels, and the higher the estimate is for an item, the more self-efficacy is needed to endorse it. Following the approach taken in a previous work (see Bjerke & Eriksen, 2016), we applied the RSM to analyse our data. WINSTEPS Rasch Analysis and Rasch Measurement software (version 3.81.0; Linacre, 2014) was used to test the compliance of the data with the RSM.
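For readers less familiar with the model, the logit and a standard textbook formulation of the rating scale model can be written as below. The notation is an assumption introduced here for illustration and is not taken from the instrument papers: \theta_n denotes the measure of person n, \delta_i the measure of item i, \tau_k the k-th Andrich threshold (shared by all items), and P_{nik} the probability that person n selects category k on item i, with categories numbered 0 ('Not confident') to 3 ('Very confident').

```latex
\operatorname{logit}(P) = \ln\frac{P}{1 - P}
\qquad\qquad
\ln\frac{P_{nik}}{P_{ni(k-1)}} = \theta_n - \delta_i - \tau_k ,
\quad k = 1, 2, 3
```

Read this way, a higher person measure raises the odds of choosing the higher of two adjacent categories, while a higher item measure lowers them, which is what places persons and items on a common scale.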
To monitor the data quality, we explored the degree of fit between the data from the mixed sample of PSTs and ISTs (n = 634) and the RSM. When the data satisfied the conditions of the model (primarily unidimensionality), and when detailed principal component analysis revealed no multidimensionality, we could be sure that the instrument measured only one underlying construct. The unidimensionality condition of the RSM held sufficiently well for the data, with mean square (MNSQ) fit statistics showing fit values (between 0.64 and 1.21) within acceptable limits (0.6 < MNSQ < 1.4) for all items. The standardised fit statistics (ZSTD) showed no noticeable unpredictability in the data, with a well-covered administration that had no significant measurement gaps between the items; this indicated a uniform gradation in terms of difficulty (Baghaei, 2008). The Rasch reliability estimates (which, in general, underestimate reliability) for persons and items were 0.86 and 0.99, respectively, indicating a reproducible measure.
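To make the MNSQ criterion more concrete, the following is a minimal sketch of how infit and outfit mean squares for a single item can be computed from model expectations, using the conventional definitions (outfit as the mean squared standardised residual, infit as the information-weighted version). It is not the WINSTEPS implementation, whose estimation details may differ. The function names, person measures, item measure, thresholds and responses are all hypothetical and chosen only for illustration.

```python
import math


def rsm_probs(theta, delta, taus):
    """Category probabilities (0..len(taus)) under the rating scale model."""
    logits = [0.0]
    for tau in taus:
        # Cumulative sum of (theta - delta - tau_j) up to each category.
        logits.append(logits[-1] + (theta - delta - tau))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def item_fit(responses, thetas, delta, taus):
    """Return (infit MNSQ, outfit MNSQ) for one item."""
    sq_resid_sum = 0.0  # sum of squared residuals
    var_sum = 0.0       # sum of model variances
    z2_sum = 0.0        # sum of squared standardised residuals
    for x, theta in zip(responses, thetas):
        probs = rsm_probs(theta, delta, taus)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        sq_resid_sum += (x - expected) ** 2
        var_sum += variance
        z2_sum += (x - expected) ** 2 / variance
    return sq_resid_sum / var_sum, z2_sum / len(responses)


# Hypothetical data: five persons answering one item scored 0-3.
thetas = [-1.0, 0.0, 0.5, 1.5, 2.5]   # assumed person measures (logits)
responses = [0, 1, 1, 2, 3]           # assumed observed categories
delta = 0.5                           # assumed item measure (logits)
taus = [-1.5, 0.0, 1.5]               # assumed Andrich thresholds

infit, outfit = item_fit(responses, thetas, delta, taus)
# Apply the acceptance limits quoted in the text (0.6 < MNSQ < 1.4).
acceptable = (0.6 < infit < 1.4) and (0.6 < outfit < 1.4)
print(f"infit={infit:.2f} outfit={outfit:.2f} acceptable={acceptable}")
```

Values near 1 indicate responses about as noisy as the model expects; values well above 1 signal unpredictability, and values well below 1 signal responses that are more predictable than expected.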
A danger to the generalisability of the instrument is that items might be interpreted in significantly different manners by different groups filling out a survey instrument (Bond, 2004). To investigate the invariance of the item difficulties (i.e. whether the item difficulties were the same for the four groups in our study), a set of differential item functioning (DIF) analyses was conducted, each holding one group's item difficulty up against its average difficulty for all groups (as proposed by Linacre, 2012). Five items (7, 8, 9, 14 and 15) exhibited DIF. These items are, for that reason, treated as distinct items for the four groups in the following analysis, whereas the remaining 15 items are used to anchor the four implementations. This anchoring enables an accurate comparison between the groups.
Next, a new analysis of the four anchored implementations was conducted, which led to a new round of data monitoring. Some misfits were detected, and data cleaning was necessary to avoid misfitting items. In line with the suggestions provided by Boone and Noltemeyer (2017), the data cleaning process involved deleting 'most unexpected responses' (the WINSTEPS software provides such lists) instead of removing full response strings (which would reduce the sample size). Table 1 shows that the unidimensionality condition of the RSM held sufficiently well for all groups, with MNSQ fit values below the acceptable limit of 1.4 for all 20 items but with several items showing MNSQ fit values of below 0.6. We followed the advice given by Boone et al. (2014) and took no action in such cases because even items with a very low MNSQ tell us something that is new and useful, and they do not contradict what we already know. The Rasch reliability estimates were 0.89, 0.84, 0.83 and 0.71, respectively, for the four groups of teachers (i.e. novice PSTs, experienced PSTs, novice ISTs and experienced ISTs) and 0.99, 0.98, 0.97 and 0.97 for the items, again indicating reproducible measures.
Comparisons of the results from the four groups who filled out the survey enabled investigations into the nature of and extent to which PSTs' and ISTs' SETM changes during teacher education and years of teaching mathematics in schools. We next turn to an analysis of the results of our research.

Findings
To provide an answer to the research question of in what possible ways SETM changes throughout teachers' professional careers, we started by comparing means across the four groups (i.e. novice and experienced PSTs and novice and experienced ISTs). Next, we investigated the nature of the anchored items (i.e. not exhibiting DIF) before we looked into those interpreted differently by the four groups (i.e. exhibiting DIF).
A set of Wright maps produced by the WINSTEPS software shows the hierarchies of the respondents (i.e. one for each group, shown to the right of the ruler in Figure 1) in combination with the Andrich thresholds of those anchored items (i.e. items having the same measure across the four groups, shown to the left of the ruler). The Andrich thresholds give the points of equal probability of adjacent categories (Bond & Fox, 2007), where, for example, the label '17.1*' represents item 17 at its first answer category threshold (i.e. between 'Not confident' and 'Somewhat confident'). A PST or IST with an estimate at this position on the scale will have an equal chance of choosing either of the two categories. Because the thresholds are ordered, a person with an estimate above the label '17.1*' but below the label '17.2*' will most probably endorse the category 'Somewhat confident' on item 17.
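The threshold reading described above can be illustrated with a small sketch of the rating scale model's category probabilities. The item measure and threshold values below are hypothetical, chosen only so that the arithmetic is easy to follow; they are not the estimates for item 17, and the helper names are ours.

```python
import math

CATEGORIES = ["Not confident", "Somewhat confident", "Confident", "Very confident"]


def rsm_probs(theta, delta, taus):
    """Category probabilities (0..len(taus)) under the rating scale model."""
    logits = [0.0]
    for tau in taus:
        # Cumulative sum of (theta - delta - tau_j) up to each category.
        logits.append(logits[-1] + (theta - delta - tau))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def most_probable(theta, delta, taus):
    """Label of the category with the highest model probability."""
    probs = rsm_probs(theta, delta, taus)
    return CATEGORIES[probs.index(max(probs))]


# Hypothetical item measure and ordered Andrich thresholds (in logits).
delta = 1.0
taus = [-2.0, 0.0, 2.0]

first_threshold = delta + taus[0]    # analogous to a label such as '17.1*'
second_threshold = delta + taus[1]   # analogous to a label such as '17.2*'

# A person located exactly at the first threshold: the two lowest
# categories are equally probable.
at_threshold = rsm_probs(first_threshold, delta, taus)

# A person located between the first and second thresholds.
midpoint = (first_threshold + second_threshold) / 2
between = most_probable(midpoint, delta, taus)

print([round(p, 3) for p in at_threshold])
print(between)
```

In this sketch, the person at the first threshold is equally likely to choose 'Not confident' or 'Somewhat confident', and the person between the first two thresholds is most likely to choose 'Somewhat confident'.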
Closer inspections of the distribution of teachers along the unidimensional scale for each group showed how the mean changed across the groups (i.e. 0.58 logits [standard deviation (SD) = 1.21] for novice PSTs; 1.56 [SD = 1.04] for experienced PSTs; 2.43 [SD = 2.00] for novice ISTs; and 3.29 [SD = 1.98] for experienced ISTs). A set of unpaired t-tests showed that the difference between each pair of successive groups of teachers was statistically significant at the 0.01 level (all p-values < 0.01), confirming the theory-based expectation that self-efficacy in teaching mathematics develops with experience. The observed ceiling effect in the two IST groups (i.e. a substantial number of teachers were more confident than the item that was hardest to endorse) limited what could be understood from the analysis presented in Figure 1, in that the instrument did not measure the ISTs well (and the experienced ISTs in particular); however, it did not cause any problems for this comparison of means. Our cross-sectional design enabled us to examine how SETM changed throughout teachers' professional careers, with significant increases as the teachers became more experienced.
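As a rough plausibility check, the comparisons between successive groups can be approximated from the summary statistics reported above. The sketch below uses Welch's unequal-variance t statistic, which is an assumption on our part (the variant of the unpaired t-test is not specified here), and it takes the group sizes from the Participants section, assuming that they also apply to the anchored analysis.

```python
import math

# (group, mean in logits, SD, n): means and SDs as reported in the text;
# group sizes from the Participants section (assumed unchanged by cleaning).
groups = [
    ("novice PSTs", 0.58, 1.21, 191),
    ("experienced PSTs", 1.56, 1.04, 130),
    ("novice ISTs", 2.43, 2.00, 119),
    ("experienced ISTs", 3.29, 1.98, 194),
]


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and approximate degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m2 - m1) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


results = []
for (name_a, m_a, s_a, n_a), (name_b, m_b, s_b, n_b) in zip(groups, groups[1:]):
    t, df = welch_t(m_a, s_a, n_a, m_b, s_b, n_b)
    results.append((name_a, name_b, t, df))
    print(f"{name_a} vs {name_b}: t = {t:.2f}, df = {df:.0f}")
```

With more than 100 degrees of freedom in each comparison, the two-sided critical value at the 0.01 level is roughly 2.6, so a t statistic above that value is in line with the significance reported in the text.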
We continue here with closer inspections of the items shown in Figure 1 (i.e. the ones that did not exhibit DIF, or the items that were interpreted in the same way by all four groups who participated in the survey). Item 17 ('Explain why division doesn't always make a number smaller') features at the top, while item 11 ('Calculate 342 − 238') is placed at the bottom. This is indicative of a trend. The upper half of the Andrich thresholds of those anchored items consisted of only reasoning items, except item 18 ('Circle the integer divisions without remainder: 92/2 105/2 (108 × 3)/2'), while most calculate items were placed at the bottom.
To illustrate this tendency of reasoning items being harder to endorse than calculate items (see Table 1 in the Methodology section for exact item measures), Figure 2 gives the proportion of responses in the category 'Very confident' for each item for each group of teachers. Each item is given four bars, with the top horizontal bar giving the proportion of experienced ISTs selecting 'Very confident' on this item, the second giving the proportion of novice ISTs selecting the same category, and so on. In Figure 2, the reasoning items are given to the left and the calculate items to the right. Because the bars tend to be longer on the right, we can see that teachers tend to be more inclined to select 'Very confident' on calculate items. Moreover, the figure reveals how the reasoning items are, overall, harder to endorse than the calculate items, a tendency that is more conspicuous the less experienced the group of teachers is. There is yet another thing to notice from Figure 2: even if years of experience seem to correlate with the number of teachers reporting being 'Very confident', it is still notable that even in the group of experienced ISTs (i.e. the top bars in Figure 2), no bar reaches 100%. This means that, even in the group of mathematics teachers with more than 5 years of teaching experience, a fair number feel less than 'Very confident' when judging their self-efficacy in teaching mathematics.
Another way of investigating the differences between the groups of teachers who participated in the survey is to explore the nature of the items that were interpreted differently. Five of the twenty items exhibited DIF, and a more detailed DIF analysis than reported on above revealed how these five items behaved in a pairwise comparison between the groups (see Table 2). These items were item 7 ('Explain why you always get an even number when multiplying an even and an odd number'), item 8 ('Calculate −17 + 5'), item 9 ('Calculate 23 × 0.7'), item 14 ('Explain why, when subtracting, you can sometimes borrow from the place to the left') and item 15 ('Calculate 2/3 + 1/5').
Table 2. Significant DIF contrasts between the four groups for five items.
To further analyse this table, let us take a closer look at what the grey cell in column 2 reveals (which is perhaps the most interesting cell in the table, as it compares the least experienced PSTs with the most experienced ISTs). This cell shows that when experienced ISTs' responses are compared with novice PSTs' responses, two items behaved unexpectedly: item 8 (a calculate item) was unexpectedly more difficult for experienced ISTs than for novice PSTs, while item 14 (a reasoning item) was significantly easier for experienced ISTs than for novice PSTs. The latter is perhaps easier to understand than the former, but we will revisit these in the Discussion section.
What appears in the grey cell of Table 2 repeats itself across the table (with one exception: item 9 in column 4) and reveals that the reasoning items exhibiting DIF tend to be easier than expected for more experienced teachers, while the calculate items exhibiting DIF (in bold in Table 2) tend to be harder than expected to endorse for more experienced teachers.

Discussion and concluding remarks
In this section, we revisit some of our key findings, discuss them and provide suggestions about how each could be addressed in future research. A first key finding is in support of the view that SETM develops over time. As we see in our data, the mean efficacy of each group of participants (i.e. novice PSTs, experienced PSTs, novice ISTs and experienced ISTs) was higher than that of every other group with fewer years of experience (in higher education and/or in professional practice). This is in accordance with previous studies that have concluded that PSTs' self-efficacy increases during (and presumably as a result of) their undergraduate studies (see Bjerke & Solomon, 2020; Brinkmann, 2019; Giles et al., 2016). However, we notice that even in the group of experienced ISTs, a fair number feel less than 'Very confident' when judging their SETM, which is most noticeable when measured by items demanding that teachers explain mathematics in such a way as to allow students to understand the underlying mathematical concepts and procedures.
Nevertheless, we still do not know much about the nature of this growth. Research evidence (Hoy & Spero, 2005; Işıksal-Bostan, 2016; Thomson et al., 2022) has documented a fluctuation in self-efficacy during the first years of teaching, something we did not examine in our research. Instead, we considered the novice IST group (i.e. with 1-5 years of teaching experience) to be a unified cohort, because of our sample size. Overall, while we observed higher self-efficacy in more experienced groups of participants, we did find that some items were interpreted differently by the four groups participating in the survey. Table 2 reveals that when more experienced teachers' responses were compared with those of less experienced teachers, some items behaved unexpectedly: the calculate items exhibiting DIF were unexpectedly more difficult to endorse for experienced than for novice teachers, while reasoning items exhibiting DIF were significantly easier to endorse for experienced than for novice teachers. How can this be? To understand this, we turned again to Bandura (1997), who stated that mastery experience is the most powerful source of self-efficacy. It is reasonable to expect that more experienced teachers have experienced more mastery. Remembering that it is not success per se that provides efficacy information but rather one's perception of success (Skaalvik & Skaalvik, 2007), we speculate that experienced teachers are more confident on reasoning items possibly because throughout their teaching careers they developed explanation skills in helping students understand the underlying mathematical principles. Likewise, it is quite possible that experience has led them to understand that performing accurate calculations is more complex and demanding than one might first think, and moreover, that they considered the need to explain even when encountering calculation tasks. To this end, further research is required so that the nature of this growth can be better explored and understood.
Our work highlights the importance of several transition points (Gueudet et al., 2016) in mathematics teachers' professional lives regarding their self-efficacy and its development: from novice PSTs to experienced PSTs; from experienced PSTs to novice ISTs; and from novice ISTs to experienced ISTs. As our findings suggest, both teacher education (Bjerke & Solomon, 2020) and professional experience (Xenofontos & Andrews, 2020) contribute to changes in SETM. However, what we still do not know much about, especially as far as ISTs are concerned, are the conditions under which the development of self-efficacy can be facilitated more effectively. More targeted research focussing on these transition points is needed. To this end, we echo the recommendation of Philippou and Pantziara (2015) for more qualitative investigations of teachers' self-efficacy, especially in regard to these transition points.
Finally, as highlighted by Xenofontos and Andrews (2020), SETM cannot be seen as independent of the cultural context within which teachers are educated and work. We fully acknowledge that our study is framed by the particularities of the Norwegian educational and cultural context. That said, we would like to encourage colleagues working in different countries to undertake similar investigations, in order for the international mathematics teacher education community to gain deeper insight into the role of context in the formulation of one's professional identity.
As discussed earlier, our study is cross-sectional. This can be seen as a limitation of our work, as it does not allow us to explore the developmental nature of SETM. Here we rather focus on possible changes in self-efficacy, the same way other cross-sectional studies have done in the past (e.g. Collie et al., 2020; Dudenhöffer et al., 2017; Lawrence et al., 2019). While, overall, longitudinal designs can address this issue, in our case this was not possible, as several ISTs had more than 10 years of teaching experience. A longitudinal design would require, roughly speaking, two decades, as we would need to follow the same cohort from the time they enter teacher education until several years later, when they become experienced ISTs. Alternatively, we propose a different design that could perhaps be employed in the future, involving a combination of cross-sectional and longitudinal elements. Specifically, three different groups (novice PSTs, experienced PSTs, novice ISTs) could participate for a number of years, until they become experienced PSTs, novice ISTs, and experienced ISTs respectively. To an extent, Hoy and Spero (2005) and Işıksal-Bostan (2016) employ such an approach, in examining how self-efficacy is developed throughout teacher education and the first year of professional practice. Such an approach would allow us to explore, not only possible changes in SETM, but also its developmental nature.

Figure 1. Wright maps of the four groups.

Figure 2. The number of teachers reporting being 'Very confident' on reasoning items (to the left) and calculate items (to the right).

Table 1. Item measures (M) and MNSQ and ZSTD fit values for items 1-20 across the four groups. Grey rows show reasoning items; items marked with an asterisk (*) are unanchored (i.e. M changes across the four groups).