Traditional vs Intersectional DIF Analysis: Considerations and a Comparison Using State Testing Data

ABSTRACT Recent research has demonstrated an intersectional approach to the study of differential item functioning (DIF). This approach expands DIF to account for the interactions between what have traditionally been treated as separate grouping variables. In this paper, we compare traditional and intersectional DIF analyses using data from a state testing program (nearly 20,000 students in grade 11, math, science, English language arts). We extend previous research on intersectional DIF by employing field test data (embedded within operational forms) and by comparing methods that were adjusted for an increase in Type I error (Mantel-Haenszel and logistic regression). Intersectional analysis flagged more items for DIF compared with traditional methods, even when controlling for the increased number of statistical tests. We discuss implications for state testing programs and consider how intersectionality can be applied in future DIF research.

with an item. When a grouping variable contains only two levels (e.g., female and male), a single comparison has been considered sufficient, and the historically overrepresented or privileged group usually serves as the reference (male) with the other as the focal group (female). When the grouping variable contains more than two levels, the standard approach is to estimate DIF via multiple pairwise comparisons (e.g., White as reference and Black as focal, White as reference and Hispanic as focal), although methods are also available for comparing across multiple groups simultaneously (e.g., Penfield, 2001).
Missing from the educational measurement literature, until recently, have been studies exploring interactions between multiple test taker grouping variables (e.g., crossing gender and race to create subgroups such as Black female and Hispanic male, compared with the reference group White male). Our preference for what we might call main effects DIF, where we consider grouping variables separately, may stem from an inclination toward parsimony as well as unfamiliarity in the field of educational measurement with the concept of intersectionality (Crenshaw, 1989, 1990). Intersectional perspectives are becoming more common in quantitative research (Bauer et al., 2021), and they may improve our understanding of fairness and bias in educational testing (Russell & Kaplan, 2021).

Intersectionality in Quantitative Methods
The shift to considering intersectionality in DIF and test bias research, and in quantitative methods more broadly, follows a shift in how we conceptualize validity with respect to socioculturally diverse groups of test takers. Traditional approaches to validation, as outlined in professional standards and guidelines (e.g., AERA, APA, & NCME, 2014; International Test Commission, 2018), have focused on establishing comparability across groups, where fair comparisons in test results are made possible by standardized test content and administration conditions. However, fairness in normative score comparisons may come at the expense of fairness in the interpretation of scores for individuals, especially ones from marginalized or underrepresented groups (Sireci, 2020). Validation through comparability alone can also be seen as producing evidence that is solidified as truth, thereby maintaining social, epistemic, and linguistic hierarchies in assessment systems (Cushman, 2016). The validation process must allow us to challenge and disrupt assessment to make it more justice-oriented and equitable.
A shift in perspective requires that we be introspective about traditional assessment practices while grounding validity in alternative theories and frameworks. Moss (1996) argued for a multi-method approach to assessment validation, where we leverage the contrast between traditional "naturalist" methods and contextualized "interpretive" ones, with the goal of "expanding the dialogue among measurement professionals to include voices from research traditions different from ours and from the communities we study and serve" (p. 20). Shepard (2000) provided a social-constructivist framework for assessment in support of teaching and learning, one that builds on cognitive, constructivist, and sociocultural theories. The concept of cultural validity draws attention to how assessments address sociocultural influences (i.e., values, beliefs, experiences, communication patterns, epistemologies) that shape how individuals think, learn, and respond to assessments (Solano-Flores & Nelson-Barber, 2001). Ecological theories have also been used to organize psychometric procedures, acknowledging how assessments are dynamic and socially embedded artifacts defined by networks of relationships within society (e.g., Vo & French, 2021; Zumbo, 2007).
Intersectionality encourages us to see an individual's experiences as the product of complex interactions between what we often compartmentalize as independent dimensions. The term originated in Black Feminism and legal scholarship, where Crenshaw (1989, 1990) argued that race-only and gender-only analyses could discount the voices and experiences of women of color. Others have also highlighted the limitations of isolating race, gender, and other social categories as the primary lens through which we study identity, differences, and disadvantages (e.g., Collins, 1990), with applications in the fields of psychology (e.g., Cole, 2009; Rosenthal, 2016), sociology (Choo & Ferree, 2010), and quantitative methods (Else-Quest & Hyde, 2016a, 2016b).
Else-Quest and Hyde (2016a) identify three key assumptions or requirements in the application of intersectionality to quantitative methods: 1) recognizing that each individual experiences, simultaneously, membership in multiple interconnecting social categories; 2) examining the power and inequality embedded within each social category; and 3) acknowledging how social categories are characteristics of the individual and of the social context in which individuals live. Building on these, Else-Quest and Hyde (2016b) then outline how an intersectional approach can be applied across different components of the quantitative research process, including theory, design, sampling techniques, measurement, data analytic strategies, and interpretation and framing. Within each component are specific methods that can be implemented with an intersectional perspective. We focus here on methods that are suitable for secondary analysis of testing and assessment data, including measurement invariance, within the measurement component, and interaction effects, within data analytic strategies.

Intersectionality in DIF Analysis
As noted above, DIF has traditionally been used to examine item bias one grouping variable at a time, and then by disaggregating to two levels or categories of the grouping variable at a time. For example, with a variable like race, we inspect results for Black and White, Hispanic and White, and Asian and White test takers. Next, we might consider test takers who are economically disadvantaged and advantaged. Applying Else-Quest and Hyde (2016b), each pairing within each variable constitutes a separate main effects DIF analysis. The absence of DIF is taken as evidence of measurement invariance at the item level for each binary comparison. Moreover, having considered all items and all grouping variables that are relevant and measurable in the target population, validity evidence is supposed to be established in support of the proposed interpretations and uses of the test.
Following Else-Quest and Hyde (2016b) and extending Russell and Kaplan (2021), reviewed further below, an intersectional approach to DIF would involve consideration of interaction effects that account for the simultaneous multiplicative influence (e.g., through structures, institutions, and power inequalities derived from those forces) associated with membership in multiple social categories. These interactions begin to acknowledge that a single social category (e.g., sex or race alone) cannot represent the complexities of an individual's multiple lived experiences. Through interaction effects, we can thus better represent our target population and provide a more contextualized and interpretive evaluation of fairness, equity, and validity.
Russell and Kaplan (2021) first proposed an intersectional approach to DIF and demonstrated its application with a standardized state test of grade 5 English language arts (ELA) containing 25 items. Item response data were analyzed for roughly 67,000 students across three grouping variables: gender (female, male), race (Black, Asian, Hispanic, White), and economic status (advantaged, disadvantaged). DIF was estimated for each item twice using the standardization method, first with the traditional approach and then with the intersectional approach (i.e., interactions of categories). The traditional approach had a different reference group per grouping variable (male for gender, White for race, advantaged for economic status). The interaction of the three grouping variables essentially produced one large grouping variable with a single reference group (chosen as the combination of male, White, advantaged) and 15 focal groups (combining the remaining levels, e.g., female, Asian, disadvantaged). The traditional approach thus resulted in five pairwise DIF comparisons per item (one for gender, three for race, and one for economic status), for 125 total comparisons, whereas the intersectional approach produced 15 pairwise DIF comparisons per item, for 375 total comparisons. Standardization results for each DIF comparison were labeled as negligible or no DIF, suspicious DIF, or likely DIF, where the last level flags an item for bias review.
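The mechanics of forming the intersectional grouping variable are straightforward to express in code. Below is a minimal R sketch (our illustration, not code from Russell and Kaplan; the factor values are hypothetical) showing how crossing the three grouping variables yields one large grouping variable with up to 16 cells, one reference plus 15 focal groups.

```r
# Hypothetical demographic factors for six illustrative students.
gender <- factor(c("male", "female", "female", "male", "female", "male"))
race   <- factor(c("White", "Black", "Asian", "Hispanic", "White", "Black"))
econ   <- factor(c("adv", "dis", "adv", "dis", "adv", "adv"))

# Crossing the three variables yields up to 2 x 4 x 2 = 16 cells: a
# reference group (e.g., male.White.adv) plus 15 focal groups when all
# cells are populated in the sample.
groups <- interaction(gender, race, econ, drop = TRUE)
levels(groups)
```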

Of the 25 items examined, the traditional approach identified five with suspicious levels of DIF. Four items were flagged via a single DIF comparison each (e.g., female vs. male or Hispanic vs. White), and one was flagged on four separate comparisons (disadvantaged, Black, Hispanic, and Asian). In contrast, the intersectional approach identified more items with suspicious DIF (14 total) and four with likely DIF. All five items identified with the traditional DIF approach had suspicious DIF with the intersectional approach, and four of the five also had likely DIF for at least one focal group. Aggregating across items, whereas likely DIF was found in none of the 125 total comparisons from the traditional approach, it was found in 10 of 375 comparisons (2.67%) from the intersectional approach.
Russell, Szendey, and Kaplan (2021) replicated the intersectional DIF analysis of Russell and Kaplan (2021), examining a total of 186 items from state tests in grade 5 math and science and grade 8 math, science, and ELA. The DIF methods and procedures for defining focal and reference groups were consistent across studies. Results were also similar: DIF was identified in three out of 930 total comparisons (0.32%) for the traditional approach and in 69 out of 2,790 total comparisons (2.32%) for the intersectional approach, supporting the conclusion that interacting grouping categories increases the rate of DIF detection with the standardization method.
Finally, Russell, Szendey, and Li (2022) extended the previous two studies by incorporating four DIF detection methods into their analysis: the standardization, Mantel-Haenszel (MH), SIBTEST, and logistic regression (LR) methods. Intersectional groups were defined as before (i.e., gender, racialized identity, and economic status), and data came from an operational test administration in grade 5 ELA with roughly 68,000 students (seemingly the same test as in the prior studies). Focusing on a subset of 17 items from the ELA test, results again showed an increase in the percentage of DIF flags, as well as in the number of items flagged, for the intersectional approach compared with the traditional approach. With the standardization, SIBTEST, and MH methods, percentages of DIF flags at the moderate or suspicious level were up to four times higher in the intersectional approach, and the number of items flagged with the most severe level of DIF increased from zero in the traditional approach (for all three methods) to two of 17 items with SIBTEST, three with MH, and four with standardization. Results for LR were more difficult to interpret. Under both the traditional and intersectional approaches, LR statistical significance testing flagged 15 of 17 items as having the most severe DIF. The authors explain that, due to large sample sizes, the LR method may have been overly sensitive to performance differences. This conclusion was corroborated by LR effect size estimates, which were small for all 17 items.
It should also be noted that Russell, Szendey, and Li (2022) appear to have applied the LR method for a given item by combining all grouping variables into a single DIF model. The typical LR approach involves running separate models for each grouping variable, in which case statistical significance and effect size can be estimated using model fit and variance explained for the DIF model compared with the simpler LR model that omits the grouping variable, and conclusions from each model apply to a single DIF comparison. The simultaneous approach, combining grouping variables, still allows for tests of statistical significance based on the standard errors of each model coefficient. This appears to be what Russell, Szendey, and Li (2022) used to test for statistical significance. However, it is unclear how they estimated effect sizes per focal group, when the effect size they used (change in variance explained; Zumbo, 1999) would have been estimated at the model level and thus would not be available for each DIF comparison.
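To make the distinction concrete, the two formulations can be sketched in R as follows (our schematic, not code from any of the studies cited; the data frame `d` and its column names are hypothetical).

```r
# Separate-models approach: one DIF model per focal group, fit to the
# subset containing that focal group and the reference group.
d_bw <- droplevels(subset(d, race %in% c("White", "Black")))
base_bw <- glm(item ~ total, family = binomial, data = d_bw)
dif_bw  <- glm(item ~ total + race, family = binomial, data = d_bw)
# Comparing base_bw and dif_bw yields a significance test and a change
# in variance explained for the single White vs. Black comparison.

# Simultaneous approach: one model with indicator terms for all focal
# groups (White as the reference level of the race factor).
dif_all <- glm(item ~ total + race, family = binomial, data = d)
summary(dif_all)$coefficients  # Wald z tests per focal group
# Model-level fit statistics (e.g., change in variance explained) are
# not separable by focal group in this formulation.
```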
Results from these three studies suggest that the traditional approach to DIF analysis may be insufficient for large scale standardized testing, as it may overlook more complex intersecting sources of DIF and potential item bias. The studies also leave a few questions unanswered. The first two studies (Russell & Kaplan, 2021; Russell, Szendey, & Kaplan, 2021) did not account for the expected increase in sensitivity due to chance (i.e., Type I error) because the standardization method did not involve statistical significance tests. Instead, these studies focused on relative comparisons to determine practical significance. There is a positive relationship between the number of DIF comparisons and the number of Type I errors. The more DIF comparisons we conduct (more items and/or more focal group comparisons), the more Type I errors we can expect to see (items flagged for significant DIF in the sample when no DIF would be present in the population). Therefore, an increase in sensitivity for the intersectional approach would be expected, and should be controlled for using other DIF detection methods. The third study (Russell, Szendey, & Li, 2022) did adjust results for the Type I error increase, but only in the LR method. The authors note that the remaining methods did not involve statistical significance testing and thus could not be adjusted for the larger number of tests in the intersectional approach. However, MH typically combines both effect size and statistical significance information when flagging DIF, and so could also be adjusted.
The intersectional DIF studies are also limited in that they all relied on operational testing data. In operational data, we can expect that reviews for DIF and item bias would have already taken place, so that problematic items would have already been revised or removed from a test prior to the selection of items for administration (although not under an intersectional approach, given that it is not standard practice at this time). Pilot or field test data, where an original DIF analysis typically occurs, should provide a more realistic and comprehensive evaluation of the intersectional approach. A related issue is that Russell and Kaplan (2021), Russell, Szendey, and Kaplan (2021), and Russell, Szendey, and Li (2022) do not appear to have used purification in their DIF analyses. Total score was used as a matching variable without adjustment for potential DIF items within it. The use of an unpurified matching variable can decrease power and increase Type I error in DIF detection (French & Maller, 2007).
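For readers unfamiliar with purification, a schematic two-stage version can be sketched as follows (our illustration; `resp` is a hypothetical 0/1 response matrix and `dif_flags()` a hypothetical wrapper around any of the DIF methods discussed here).

```r
# Stage 1: screen all items for DIF using the unpurified total score.
match_score <- rowSums(resp)
flagged <- dif_flags(resp, match_score)  # indices of flagged items

# Stage 2: recompute the matching score without the flagged items,
# then rerun the DIF screen against the purified criterion.
if (length(flagged) > 0) {
  match_score <- rowSums(resp[, -flagged, drop = FALSE])
}
flagged_purified <- dif_flags(resp, match_score)
```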
Our main goal in this study was to inform the practical application of intersectional DIF analysis by comparing DIF detection rates for analyses conducted using main effects vs interaction effects. We aimed to extend the work of Russell and colleagues, addressing limitations, and to continue the conversation about DIF analysis through an intersectional lens. We expected an increase in flagging when using interaction effects, due to the increase in the number of statistical tests conducted. We also expected that adjusting for Type I error would reduce this increase for interaction effects. Increases in Type I error were accounted for through the use of DIF statistics that incorporate significance tests, where error adjustments can be made, including the MH and LR methods, and DIF comparisons were made with field test items administered alongside operational items in an end-of-year state test. Without access to item content, the usual caveat still applies that DIF detection indicates potential for item bias that would need to be confirmed through a qualitative review. We also acknowledge that our contribution to the literature on intersectional DIF is limited in that, due to practical constraints, we could not address all of the key elements of intersectionality called for by Else-Quest and Hyde (2016b).

Data
Data came from a state testing program. Tests were administered at the end of the school year in math, ELA, and science to all students in 11th grade. At the time of administration, the state used Rasch modeling and precalibration to an item bank for each subject test. All students saw the same set of operational items (60 in math, 50 in ELA, 60 in science) plus a subset of field test items according to a random assignment of students to one of five test forms per subject (there were 50 field test items per subject, 10 per form). Thus, a given student completing all three tests responded to a total of 200 items (70 in math, 60 in ELA, 70 in science). Items were all multiple-choice and scored dichotomously.
Table 1 includes sample sizes by demographic group. This study included Black, Hispanic, and White students, as recorded by the state for ethnicity. Other groups, not analyzed here, included American Indian, Asian, Pacific Islander, and multiple ethnicities. These groups were excluded because their sample sizes fell below 400 after crossing ethnicity with sex (female or male). The total sample size after subsetting was 19,490, with 49% female and 51% male, and 6% Black, 16% Hispanic, and 77% White.

Analysis
For simplicity, and following Russell and colleagues, we limited our study to uniform DIF. The state from which data were obtained, calibrating via the Rasch model, also only reported on uniform DIF. We also used White students or White male students as the reference group in DIF comparisons, again following Russell and colleagues and traditional DIF procedures.
Analysis with traditional groupings or main effects was conducted first, in part to replicate results published in technical reports for the state (which we accomplished). These reports include DIF significance levels (labeled A, B, and C) that incorporate both statistical significance from the Mantel-Haenszel (MH) chi-square and effect size estimates (called delta) calculated from the odds ratio for reference group test takers getting an item correct relative to matched test takers in the focal group (Holland & Thayer, 1986). Following ETS conventions (Zwick, 2012), level A is applied when the MH chi-square is not statistically significant (p ≥ 0.05) or when the effect size delta is smaller in absolute value than 1. For level B, the MH chi-square has p < 0.05 and absolute delta of 1 or larger but below 1.5. Level C is applied when the MH chi-square is significant with p < 0.05 and absolute delta is 1.5 or larger. Level A typically indicates small or negligible DIF, level B is medium DIF, and level C is the largest of the three, usually taken to indicate an item that requires further investigation. DIF was analyzed separately for sex (male vs female) and then ethnicity (White vs Black, White vs Hispanic), producing three DIF results per field test item. The state examined other grouping variables as well, but these were not covered here. The control or matching variable in the MH analysis was always the total score across operational items. DIF was not examined for operational items.
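As a concrete illustration of this flagging rule, the following minimal R sketch (ours, not the state's procedure or our analysis code verbatim) computes the MH test and ETS delta for one item and one focal/reference pair. The array `tab` and its layout are assumptions for the example.

```r
# `tab` is assumed to be a 2 x 2 x K array of counts: group (reference,
# focal) by response (correct, incorrect) by matched total-score level.
mh <- mantelhaen.test(tab, correct = TRUE)

# ETS delta metric: -2.35 times the log of the MH common odds ratio.
# Negative delta indicates an item favoring the reference group.
delta <- -2.35 * log(unname(mh$estimate))

# A/B/C levels following the ETS conventions described above.
level <- if (mh$p.value >= 0.05 || abs(delta) < 1) "A" else
  if (abs(delta) < 1.5) "B" else "C"
```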
Next, DIF analysis with intersectional groupings was conducted using the MH method with three significance criteria, two of them chosen via Bonferroni-style correction. This extended the first analysis by examining the interaction between sex and ethnicity, which produced five DIF comparisons per field test item (Black female, Black male, Hispanic female, Hispanic male, and White female). The reference group was always the same (White male). Significance levels were applied first as with main effects, using the uncorrected 0.05 criterion. With 0.05 per comparison or flag, the overall Type I error rate per item was 0.14 with main effects (three comparisons) and 0.23 with interaction effects (five comparisons). Significance levels were applied a second time with the Type I error corrected at the item level, based on the increase in DIF comparisons from three (main effects) to five (interaction). In this second case, to achieve level C DIF, an effect needed to be statistically significant with p < 0.03, in addition to having an effect size of 1.5 or larger. The second approach maintained an overall error rate of 0.15 per item. A third application of significance levels used 0.01 as the criterion per DIF comparison. This resulted in an overall error rate of 0.05 per item.
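For reference, the per-item error rates above can be reconstructed as follows (our gloss on the arithmetic, not taken from the state's documentation). For m independent comparisons each tested at level α, the familywise Type I error rate and its additive Bonferroni bound are

```latex
\alpha_{\mathrm{FW}} \;=\; 1 - (1 - \alpha)^{m} \;\le\; m\alpha .
```

With α = 0.05 this gives 1 − 0.95^3 ≈ 0.14 for three comparisons and 1 − 0.95^5 ≈ 0.23 for five; with α = 0.01 and five comparisons, 1 − 0.99^5 ≈ 0.05. The 0.15 rate reported for the 0.03 criterion matches the additive bound (5 × 0.03 = 0.15), while the product form gives approximately 0.14.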
The main effects and interaction effects analyses were repeated using the standardization (ST) method (Dorans & Kulick, 1986). For a given item and DIF comparison, ST provides a standardized difference in proportion correct for the focal and reference groups, averaged over levels of the matching variable. As with MH, the matching variable was based on total scores computed across the operational items. Following the usual conventions, averages were weighted by the focal group sample size, and levels of significance were based on cutoff values for the average absolute difference: less than 0.05 for no DIF, 0.05 or larger but less than 0.10 for suspicious DIF, and 0.10 or larger for likely DIF. To simplify reporting, these labels were replaced with A, B, and C to match the MH ETS significance levels.
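In symbols, the ST index takes the following standard form (notation ours), where k indexes levels of the matching score, n_Fk is the focal group count at level k, and P_Fk and P_Rk are the focal and reference proportions correct at that level:

```latex
\mathrm{STD\text{-}P\text{-}DIF}
  = \frac{\sum_{k} n_{Fk}\,\bigl(P_{Fk} - P_{Rk}\bigr)}{\sum_{k} n_{Fk}},
\qquad
\text{level} =
\begin{cases}
\mathrm{A} & \text{if } |\mathrm{STD\text{-}P\text{-}DIF}| < 0.05,\\
\mathrm{B} & \text{if } 0.05 \le |\mathrm{STD\text{-}P\text{-}DIF}| < 0.10,\\
\mathrm{C} & \text{if } |\mathrm{STD\text{-}P\text{-}DIF}| \ge 0.10.
\end{cases}
```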
Finally, main and interaction effects DIF were also estimated using the logistic regression (LR) method (Swaminathan & Rogers, 1990). For a given item and uniform DIF comparison, the LR method involves fitting two models. The first model (referred to here as the base model) contains an intercept term and total score as the only predictor variable. The second model (the DIF model) adds the grouping variable. Indicator coding is typically used so that the coefficient for the grouping variable estimates the change in log-odds of correct response on the item for the focal group compared with the reference group (whose item performance is estimated via the intercept). DIF for grouping variables with more than one focal group can be estimated by subsetting the data and running separate DIF models (e.g., Black and White test takers in one model, Hispanic and White in another) or simultaneously (e.g., a single model with White as reference and indicator terms for Black and Hispanic). We used the former approach, with one DIF model per focal group.
DIF significance levels for LR were determined using chi-square likelihood ratio tests (comparing fit for the base and DIF models) and the increase in the coefficient of determination (Nagelkerke, 1991), or multiple R-squared, from the base to the DIF model (labeled ΔR²). Following guidelines from Jodoin and Gierl (2001), DIF was categorized as level A if the chi-square likelihood ratio was not statistically significant (p ≥ 0.05) or if ΔR² was smaller than 0.035. Level B required p < 0.05 and ΔR² of 0.035 or more but smaller than 0.07. Level C then required p < 0.05 and ΔR² of 0.07 or larger. With interaction effects, the cutoff for statistical significance was also reduced as in the MH analysis, to 0.03 and then 0.01. LR flags were also determined using all of the same cutoffs for Type I error but following guidelines from Zumbo (1999), where ΔR² thresholds are more demanding, at 0.13 for level B and 0.26 for level C.
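The LR procedure for one item and one focal/reference pair can be sketched in R as follows (a minimal sketch consistent with the description above, not our analysis code verbatim; the data frame `d` and its column names are assumptions).

```r
# `d` is assumed to hold one row per student: `item` (0/1 score on the
# studied item), `total` (operational total score), and `group`
# (0 = reference, 1 = focal).
base_mod <- glm(item ~ total, family = binomial, data = d)
dif_mod  <- glm(item ~ total + group, family = binomial, data = d)

# Likelihood ratio test with 1 df for uniform DIF.
lrt <- anova(base_mod, dif_mod, test = "LRT")

# Nagelkerke (1991) pseudo R-squared from model deviances; for 0/1
# responses the glm deviance equals -2 times the log-likelihood.
nagelkerke <- function(mod) {
  n <- length(mod$y)
  r2_cs <- 1 - exp((mod$deviance - mod$null.deviance) / n)  # Cox-Snell
  r2_cs / (1 - exp(-mod$null.deviance / n))                 # rescaled
}
delta_r2 <- nagelkerke(dif_mod) - nagelkerke(base_mod)

# Jodoin and Gierl (2001) style flag, as applied in this study.
flag <- if (lrt$`Pr(>Chi)`[2] >= 0.05 || delta_r2 < 0.035) "A" else
  if (delta_r2 < 0.07) "B" else "C"
```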
Our use of ΔR² as an effect size estimate for LR DIF diverges slightly from previous applications. Recommendations in Jodoin and Gierl (2001) and Zumbo (1999) seem to be based on LR models that test the combination of uniform and nonuniform DIF, which involves a chi-square test with 2 degrees of freedom (following Swaminathan & Rogers, 1990). In developing and testing their ΔR² cutoffs, Jodoin and Gierl (2001) did explore a separate test of uniform DIF with 1 degree of freedom, followed by a test of nonuniform DIF, also with 1. However, testing only for uniform DIF, as we implement here, did not seem to be their intention when examining ΔR² cutoffs.
Data were analyzed in R (version 4.2.0; R Core Team, 2022). The MH, ST, and LR analyses were conducted using code written by the authors and available online at https://github.com/talbano/epmr.

Results
Results are summarized first by DIF significance level (A, B, C), aggregating across items and comparisons, for each method (MH, ST, LR). Then, more detailed results (DIF statistics, focal and reference groups, by item) are presented for items flagged at level C.

Summarizing Across Items and DIF Comparisons
Table 2 summarizes results from the MH and ST analyses. With 50 field test items per test/subject, each analysis with main effects produced 150 total DIF comparisons (3 focal groups × 50 items), and each interaction effects analysis produced 250 comparisons (5 focal groups × 50 items). Table 2 shows the distributions of these 150 and 250 DIF comparisons across significance levels A, B, and C. When assigning levels with MH, the p-value cutoff for main effects was 0.05 per comparison, and shown here are interaction effects results based on the 0.05 and 0.01 cutoffs per comparison.
For MH main effects, there were 3, 1, and 1 DIF comparisons flagged at level C in ELA, math, and science, respectively. For MH interaction effects, these increased to 13, 12, and 10 for the 0.05 cutoff, and 11, 12, and 9 for 0.01. Dividing by 150 and 250, respectively, for MH main effects and interaction effects, level C DIF was found in 2%, 0.7%, and 0.7% of comparisons for main effects; 5.2%, 4.8%, and 4.0% of comparisons for interaction effects at 0.05; and 4.4%, 4.8%, and 3.6% of comparisons for interaction effects at 0.01, across the three tests. Results are not shown for the p-value cutoff of 0.03, as the distribution of DIF comparisons over levels changed only slightly compared with the cutoff of 0.05, and only in the A and B flags.
Main effects DIF with ST always flagged fewer items at level A and more items at level B than did MH. ST main effects flagged 2, 2, and 5 items at level C for ELA, math, and science, respectively. Similar to MH, the ST flags at level C increased with interaction effects, both in frequency and in percentage of total flags. In ELA, ST with interaction effects flagged the same percentage of items at level C as MH with the 0.01 cutoff (4.4%); for math it flagged a slightly higher percentage (5.6 vs 4.8%); and for science ST flagged twice as many items at level C (7.2 vs 3.6%).
Table 3 summarizes results from the LR analyses. As in Table 2, each analysis with main effects produced 150 total DIF comparisons, and each interaction effects analysis produced 250 comparisons. Table 3 shows the distributions of LR DIF comparisons across significance levels A, B, and C, with Type I error at 0.05 per comparison for main effects and interaction effects, and at 0.01 per comparison for interaction effects. Results are all based on the Jodoin thresholds. Results for the Zumbo thresholds are omitted, as they never flagged DIF beyond level A. The Jodoin thresholds were nearly as conservative in flagging as the Zumbo ones, with only one DIF flag at level B in ELA and math and only three flags at level B in science. LR never flagged an item at level C DIF.

Results for Items with Level C DIF
Tables 4-6 show results for all of the items flagged at level C based on either the MH method with 0.05 cutoff or the ST method, including level C DIF for main effects (which did not change based on the MH cutoff) and interaction effects. The tables show the reference and focal groups for each flag; the MH delta values, with negative values indicating items that favored the reference group (as did most items) and positive values indicating items that favored the focal group (one item in ELA, two items in math, none in science); the ST standardized difference in proportion correct (pdiff); and DIF significance levels based on MH and ST. Table 4 contains results for ELA. Three items were flagged for level C DIF via main effects, one with male as the reference group and female as focal (with MH delta of −2.220) and two with White as the reference group (delta of −1.772 for Hispanic students and 2.151 for Black students). The ST method flagged two of these at level C. Note that these three main effects flags were found on three separate items, but the same item may be shown with multiple flags in these tables, as was the case with ELA interaction effects. Table 4 shows the 14 level C flags found with interaction effects (for either MH or ST), spread across nine unique items. These flags include the three items from the main effects analysis, plus six additional items. Of the 11 C flags for MH with the 0.01 cutoff, eight were also flagged at level C with ST. One item (20) was flagged level C with ST but not MH, and two were flagged level C with ST and with MH at the 0.05 cutoff but not at the 0.01 cutoff (item 45 with White male and Black female, and item 49 with White male and Black male).
Table 5 shows level C DIF results for the math test, where MH with main effects flagged one item, ST main effects flagged two, and interaction effects produced 15 C flags across 11 unique items. All but four of these 15 flags involved Black students. The two C flags that favored the focal group were for Black male students (item 6, flagged by MH at both cutoffs, and item 27, flagged by all three methods). ST flagged at level C 11 of the 12 C flags from MH with cutoff 0.01 (all but item 6). ST also had three C flags that were not raised by MH (item 21 for White male and Hispanic female, and two comparisons on item 43).
Finally, Table 6 shows level C DIF results for the science test, where MH main effects found C DIF for one comparison (item 12), and ST main effects found five C flags (three of them on item 12, two on item 30). As shown in Table 2, there are twice as many C flags for ST with interaction effects (18) as for MH at cutoff 0.01 with interaction effects (9 flags). Some of these were on items that MH did flag at level C but under different comparisons (i.e., items 28, 30, and 47). Otherwise, C flags for ST with interaction effects were raised on four items that were never identified as having serious DIF via MH (items 21, 34, 37, 43). MH identified one C flag that ST did not (item 11).

Discussion
Overall, results from this study confirm the main conclusion from previous research: an intersectional approach, where we account for the interaction between social grouping variables, can uncover DIF that would go undetected when only considering main effects, even after controlling for the Type I error increase. Undetected intersectional DIF may lead to undetected item bias that skews and invalidates score interpretation and use for disadvantaged groups. This can have serious implications for operational testing programs, which traditionally do not account for intersecting group memberships in DIF analysis.
With the standardization method (ST), results for main effects compared with interaction effects were similar to those observed in previous studies (Russell & Kaplan, 2021; Russell, Szendey, & Kaplan, 2021; Russell, Szendey, & Li, 2022). Here, there was an increase from 1.3% of likely or C DIF flags with ST main effects to 4.4% with interaction effects for ELA. For math, the percentage increased from 1.3 to 5.6%, and for science, from 3.3 to 7.2%. In comparison, when examining 25 items in an ELA test, Russell and Kaplan (2021) found zero items with likely DIF using main effects, or what they call the traditional approach. With the intersectional approach, which involved more comparisons per item (15) than in the present study (5), they found likely DIF in 10 out of 375 comparisons, or 2.7%. Russell, Szendey, and Kaplan (2021) examined five different state tests and found likely DIF in 0.8% of main effect comparisons and in 5% of interaction effect comparisons. Results for the individual tests ranged from 0 to 2% for main effects and 1 to 5% for interaction effects. Russell, Szendey, and Li (2022) noted a similar trend for three of their DIF methods.
Rates of flagging likely DIF via ST were typically higher in this study than in previous studies. For main effects, the percentages were always higher, and for interaction effects they were higher for all but ELA (excluding LR). While there are several possible explanations for this (the studies presumably used data from different states and different testing programs), the increase in detection rates for ST is likely due, at least in part, to the fact that the present study focused on field test items that had not already been scrutinized for DIF. As a result, we can expect that there would be more items with differential functioning to work with here, compared with analyses that only include operational items.
As noted by Russell, Szendey, and Kaplan (2021), a limitation of the ST method is that it does not account for chance increases in the rate of DIF detection. The MH method, which incorporates statistical testing into the flagging process through the chi-square, did facilitate adjustments for the Type I error increase. The standard threshold of 0.05 was reduced first to 0.03 and then to 0.01. The reduction to 0.01 led to slightly lower rates of flagging significant or level C DIF in ELA and science, compared with the 0.05 cutoff and with the ST method. For C DIF in the math test, the MH flagging rate remained unchanged from the 0.05 cutoff to the 0.01 cutoff.
Finally, the LR method stood apart from MH and ST in that it never flagged a DIF comparison at level C. This finding aligns with Russell, Szendey, and Li (2022), where LR also never identified serious DIF in the items studied once effect size was accounted for. Together, results for LR suggest that cutoffs for ΔR² given in the literature (i.e., Jodoin & Gierl, 2001; Zumbo, 1999) may need to be revised to be less conservative for analyses that focus only on uniform DIF.
In addition to considering effect size cutoffs for LR, future research should also examine intersectional DIF with other methods (e.g., within item response theory models; Thissen, Steinberg, & Wainer, 1993) and other model formulations (e.g., with random effects for items; De Boeck, 2008), and using simulated data where effects can be manipulated and methods can be evaluated for accuracy. We recommend more exploratory research on covariates that may help explain differential performance (e.g., opportunity to learn; Albano & Rodriguez, 2013). As noted above, we recommend that intersectional DIF, to the extent that sample sizes support it, become standard procedure in operational testing programs. Intersectional analysis should also be conducted at the aggregate level to check for differential test functioning, an accumulation of bias over multiple items that can go undetected as DIF without a validated external criterion. Testing programs should recruit samples that are representative and of sufficient size according to intersectional groupings rather than traditional groupings.
The convention of using the historically advantaged test takers as the reference group should also be reconsidered. Because DIF is usually structured as a relative comparison, results will not change except in their signs if reference and focal groups are switched. However, moving away from the conventional reference groups will help to decenter male and White in discussions of test performance. Alternatives include centering more diverse groups, using models that evaluate DIF effects within groups relative to their own means (we have not seen this method used before), or using models that compare effects against an aggregate of all groups (e.g., Austin & French, 2020). On a related note, future DIF studies should also explore how DIF frameworks can be repurposed to examine differential performance on culturally responsive items.
As the educational measurement field becomes more attentive to intersecting social identities and experiences, we should not lose sight of broader calls to reimagine our assessment systems to become more socially just from the outset, from construct articulation to test consequences (Randall, 2021; Slomp, Corrigan, & Sugimoto, 2014). Studies examining items for potential DIF may not be designed to capture intersectional experiences from the outset (e.g., in construct definition). In this way, intersectional DIF may not see outside of its own biases when the underlying constructs capture privileged values, beliefs, and ways of knowing.
We recognize that examining interactional or multiplicative DIF effects alone does not constitute intersectional research as it is defined by critical scholars (e.g., Cole, 2009; Crenshaw, 1989). Effective application of intersectionality requires investigating the meaning and mechanisms behind the experiences of intersectional groups, beyond analysis of statistical differences (Cole, 2009). Homogeneity of experiences within a given intersectional category should not be assumed. Gender studies, for example, highlights how Black women can be subjected to both racism and sexism, sometimes by members of their own social groups (Settles, 2006). Unfortunately, qualitative rigor remains a significant challenge in DIF research due to the operational context in which DIF investigations are typically conducted. As DIF research continues to grow, we recommend that measurement professionals approach the study of social categories with more nuance and context, paying attention to how gender, race, and class can simultaneously affect perceptions, experiences, and opportunities in a society stratified along these dimensions (Cole, 2009). Such an approach may involve supplementing intersectional DIF with nonconventional methods such as counter-storytelling, ethnography, or case study, to name a few.
That said, there is no single gatekeeper or litmus test determining what constitutes intersectional research (Else-Quest & Hyde, 2016a). Intersectional DIF is a step toward acknowledging how multiple social and structural forces impact individuals. This method certainly does not resolve all underlying issues of test bias, but it provides a practical example of how a popular measurement tool can be better aligned with justice-oriented assessment practices. As we work toward creating assessments that reflect socioculturally authentic representations of the constructs we desire to measure, there is room to focus on refining and studying the accuracy of intersectional approaches in invariance analysis, including DIF, and understanding how they compare to traditional techniques.

Disclosure Statement
No potential conflict of interest was reported by the author(s).

Table 1.
Sample sizes by demographic group.

Table 2.
Summary of MH and ST DIF flags by test and level. Note: MH is the Mantel-Haenszel method and ST is the standardization method. M is main effects DIF (with unadjusted 0.05 cutoff for MH p-values), X is interaction effects, and X05 and X01 are interaction effects with cutoffs of 0.05 and 0.01. n values are counts of DIF comparisons flagged at each level. Percentages were found by dividing main effects counts by 150 total comparisons per test, and interaction effects counts by 250.

Table 3.
Summary of LR DIF flags by test and level.
Note: M indicates main effects DIF, X is interaction effects, and 05 and 01 are cutoffs of 0.05 and 0.01. n values are counts of DIF comparisons flagged at each level. Percentages were found by dividing main effects counts by 150 total comparisons per test, and interaction effects counts by 250.

Table 4.
Level C item results for ELA. Note: B is Black, F is female, H is Hispanic, M is male, and W is White. A single value under Reference and Focal indicates main effects DIF, whereas two values indicate interaction effects. Delta is the MH delta value, pdiff is the ST standardized difference, and MH 05, MH 01, and ST are significance levels for MH with cutoff 0.05, MH with cutoff 0.01, and the ST method.

Table 5.
Level C item results for math. Note: B is Black, F is female, H is Hispanic, M is male, and W is White. A single value under Reference and Focal indicates main effects DIF, whereas two values indicate interaction effects. Delta is the MH delta value, pdiff is the ST standardized difference, and MH 05, MH 01, and ST are significance levels for MH with cutoff 0.05, MH with cutoff 0.01, and the ST method.

Table 6.
Level C item results for science. Note: B is Black, F is female, H is Hispanic, M is male, and W is White. A single value under Reference and Focal indicates main effects DIF, whereas two values indicate interaction effects. Delta is the MH delta value, pdiff is the ST standardized difference, and MH 05, MH 01, and ST are significance levels for MH with cutoff 0.05, MH with cutoff 0.01, and the ST method.