What does the CBM-Maze test measure?

In this study, we identified the code-related (decoding, fluency) and language comprehension (vocabulary, listening comprehension) demands of the CBM-Maze test, a formative assessment, and compared them to those of the Gates-MacGinitie test, a standardized summative assessment. The demands of these reading comprehension tests and their developmental patterns were examined with multigroup structural regression models in a sample of 274 children in Grades 4, 7, and 9.

Reading comprehension is generally defined as a complex cognitive process (Kendeou, van den Broek, Helder, & Karlsson, 2014) that involves the construction of a mental representation of what a text is about, termed the situation model (Kintsch & van Dijk, 1978). The construction of this situation model is driven by at least two core processes: code-related processes that are responsible for the efficient recognition of words, and language comprehension processes that are responsible for the extraction of meaning from what is read (e.g., Snow, 2002; Verhoeven & Perfetti, 2008). These two sets of processes compose the influential simple view of reading (W. A. Hoover & Gough, 1990). It follows that measures of reading comprehension would tap both code-related and language comprehension processes.
Reading comprehension is measured by both summative and formative assessments, each serving a different purpose (Scriven, 1967). On one hand, summative assessment is used to discern the state of achievement, which summarizes performance at a particular point in time. Various standardized reading comprehension tests are used for this purpose, with users likely assuming that different tests are roughly equivalent or interchangeable (Keenan, Betjemann, & Olson, 2008). However, accumulating evidence in the extant literature challenges this assumption (e.g., Keenan et al., 2014; Keenan & Meenan, 2014; Papadopoulos, Kendeou, & Shiakalli, 2014). For example, Nation and Snowling (1997) were the first to provide evidence that a score on the Suffolk Reading Scale was more strongly associated with code-related skill scores, whereas a score on the Neale Analysis of Reading Ability was more strongly associated with language comprehension scores. Since that study, the demands of several other standardized tests have been examined (e.g., Andreassen & Bråten, 2010; Cutting & Scarborough, 2006; Keenan et al., 2008; Kendeou, Papadopoulos, & Spanoudis, 2012; Nation & Snowling, 1997). Specifically, the Woodcock-Johnson Passage Comprehension subtest is most strongly associated with code-related skills and less with language comprehension (Keenan et al., 2008), whereas the Gray Oral Reading Test and the Gates-MacGinitie Reading Comprehension Test associate more comparably with code-related and language comprehension skills (Cutting & Scarborough, 2006; Tilstra, McMaster, van den Broek, Kendeou, & Rapp, 2009). The results of these studies taken together demonstrate that summative, standardized reading comprehension tests differ in the extent to which they tap code-related and language comprehension abilities.
On the other hand, formative assessment is used to discern the needs of a student with respect to instruction in an effort to improve achievement. One class of formative assessments, Curriculum-Based Measures (CBM), is highly popular in the United States. Various versions of CBM measures have been adopted by schools in 30 U.S. states, including a statewide adoption in Iowa (92%), exceeding 5 million administrations annually (FastBridge Learning, 2015). CBM measures have been shown to be highly sensitive to growth over brief periods (see Espin, McMaster, Rose, & Wayman, 2012, for a review). Specifically, for the assessment of reading comprehension, the CBM-Maze test is established as a reliable and valid measure to assess progress (Espin, Wallace, Lembke, Campbell, & Long, 2010; Fuchs & Fuchs, 2002; Marcotte & Hintze, 2009; Pierce, McMaster, & Deno, 2010; Shin, Deno, & Espin, 2000; Tichá, Espin, & Wayman, 2009). The test has a standardized cloze format (Fuchs & Fuchs, 1992): Every seventh word is deleted and replaced with three multiple-choice alternatives (one correct and two incorrect words). This format is not strictly followed across different available CBM batteries. For example, in some instances the target word and distractors are selected purposely to increase test difficulty (Shinn & Shinn, 2002). The scoring procedures also vary across batteries, such that in some instances a ceiling number of errors or a correction for guessing is applied. It is important to note, however, that when these variations were directly compared in different experimental studies, no major differences were observed in reliability or predictive validity of the CBM-Maze test scores (for a review, see Pierce et al., 2010).
Even though there is evidence that scores on the CBM-Maze test rely on code-related abilities (Gellert & Elbro, 2013; Kendeou & Papadopoulos, 2012; Kendeou et al., 2012; Pierce et al., 2010), no studies have examined its reliance on language comprehension. In the absence of this information, it is unclear whether the CBM-Maze test serves as a strong predictor of reading comprehension, despite its wide use in the United States. Thus, the purpose of the current study is to identify the relative code-related and language comprehension demands of the CBM-Maze test. Identifying these demands in isolation would make it challenging to interpret their magnitude validly. Therefore, these demands were compared to those of the Gates-MacGinitie Reading Comprehension Test, a standardized test that relies on both code-related and language comprehension skills (Cutting & Scarborough, 2006). Further, for the selection of measures to assess code-related and language comprehension skills, we followed the normative paradigm in the extant literature. Specifically, the assessment of code-related skills often includes measures of reading accuracy and fluency (de Jong & van der Leij, 2002; Florit & Cain, 2011; Verhoeven & Perfetti, 2008). Both accuracy of word reading and automaticity and fluency are essential (Kirby & Savage, 2008), because when decoding is accurate but slow, the working memory demands on processing capacity often overcome and interfere with higher level reading comprehension processing (Perfetti & Hart, 2002). Thus, reading should be both accurate and fast to support higher level comprehension. The assessment of language comprehension skills often includes measures of vocabulary and listening comprehension (de Jong & van der Leij, 2002; Kim, 2016; van den Broek, Kendeou, & White, 2009).
Because we anticipated that the demands of reading comprehension would change developmentally (Adlof, Perfetti, & Catts, 2011; Afflerbach, Cho, & Kim, 2015), we sampled elementary, middle, and high school children. Specifically, a series of studies has shown that the relation between listening comprehension and reading comprehension increases over time, particularly in later elementary school, as the texts that are read and the questions asked about those texts become increasingly complex (e.g., Catts, Adlof, Hogan, & Weismer, 2005; Diakidoy, Stylianou, Karefillidou, & Papageorgiou, 2005; Vellutino, Tunmer, Jaccard, & Chen, 2007). Also, the relation between word reading accuracy and reading comprehension decreases over time (and may disappear by midelementary school), as most children have sufficient word reading skill to identify the words in written texts they encounter in reading comprehension tests (García & Cain, 2014; Vellutino et al., 2007). In contrast, the relation between reading fluency and reading comprehension increases, as most children begin to read not only with accuracy but also with speed (e.g., Language and Reading Research Consortium, 2015). Thus, in this study, the developmental patterns of children from elementary, middle, and high school were investigated.
Following previous work, we hypothesized that performance on the CBM-Maze test would depend more on code-related skills than performance on the Gates-MacGinitie test, whereas performance on the Gates-MacGinitie test would depend more on language comprehension skills than performance on the CBM-Maze test. With respect to the developmental patterns, we expected the code-related demands to decrease across grades (e.g., Vellutino et al., 2007) and the language comprehension demands to increase (e.g., Catts et al., 2005). To that end, we tested whether the code-related and language comprehension demands were equal or different across grades.

Participants
Participants were 92 fourth graders, 90 seventh graders, and 92 ninth graders who were part of a larger study on reading comprehension. This study was carried out in suburban schools in a major metropolitan area in the midwestern region of the United States and aimed to create profiles of readers of different age and skill levels by using both online (eye-tracking and think-aloud) and offline (cognitive, linguistic, and achievement) measures. For the goals of this article, a subset of the administered linguistic measures was used to examine the demands of the CBM-Maze and the Gates-MacGinitie Reading Comprehension Tests.
Students who participated in this study were selected from four elementary school classes, two classes from a middle school, and three classes from a high school. Students receiving special or gifted education were excluded from the study. The sample was drawn from an upper middle-class, predominantly White population. There were several missing values on the demographic information variables; thus, descriptive statistics were computed only for the available data. In Grade 4, there were 49 boys and 43 girls, who were on average 9 years 8 months old (SD = 4.30). In Grade 7, there were 36 boys and 53 girls, who were on average 12 years 9 months old (SD = 3.79). In Grade 9, there were 42 boys and 50 girls, who were on average 14 years 8 months old (SD = 4.65). All students spoke English as their native language. The majority of the children were Caucasian (83%), with a few African American (6%), Asian (6%), Hispanic (2%), and other ethnicities (3%). The distribution of race was comparable for the different grade-level groups.

CBM-Maze test
The CBM-Maze test (Deno, 1985; Espin & Foegen, 1996) requires students to read passages that include incomplete sentences. In each passage, the first sentence is always left intact, and after the first sentence, every seventh word is omitted. Students are required to choose, out of three options, the correct word to complete the sentence. Three written passages were presented to students, one at a time, in booklet form. These passages were based on the curriculum for each grade level. Students had 1 min (fourth graders) or 2 min (seventh and ninth graders) to read as much of each passage as possible and circle the appropriate word to accurately complete the sentences. This same pattern was repeated for all three passages. The total time for test administration ranged from 5 to 10 min. A student's score consists of the average number of words selected correctly minus the number of words selected incorrectly across the three passages. The test-retest reliability is .83 (Shin et al., 2000).
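The scoring rule described above can be illustrated with a minimal Python sketch. It assumes one reading of the rule, namely that correct minus incorrect selections are computed per passage and then averaged over the three passages; the function name `maze_score` and the response counts are hypothetical and are not taken from the study materials.

```python
from statistics import mean


def maze_score(passages):
    """Average of (correct - incorrect) selections across passages.

    `passages` is a list of (n_correct, n_incorrect) tuples, one per passage.
    This is an assumed reading of the scoring rule, not the official procedure.
    """
    if not passages:
        raise ValueError("at least one passage is required")
    return mean(correct - incorrect for correct, incorrect in passages)


# Hypothetical counts for one student on three passages.
example = [(12, 2), (10, 1), (14, 3)]
print(maze_score(example))
```

Because the mean of the differences equals the difference of the means, averaging correct and incorrect counts separately and then subtracting would give the same value under this reading.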

Gates-MacGinitie test
The Gates-MacGinitie Reading Comprehension Test (MacGinitie, MacGinitie, Maria, & Dreyer, 2000) includes 11 passages and 48 multiple-choice questions related to these passages for Grades 4, 7, and 9. The questions require constructing an understanding based on information that is either explicitly or implicitly stated in the passage. Students independently read the passages and questions and then record their answers on a separate answer sheet. Administration of the test takes approximately 45 min (10 min pretest; 35 min for actual test administration). A student's score consists of the total number of questions answered correctly. Test-retest reliability is .88 (MacGinitie et al., 2000).

Decoding
Students completed the Word Identification and the Word Attack subtests from the Woodcock-Johnson-III Achievement Test (WJ-III; McGrew & Woodcock, 2001). Together, they take about 10 min to administer. In the Word Identification task, students read aloud a list of words. The ceiling rule is set at six errors in a row. The procedure is the same for the Word Attack subtest, which includes pseudowords. A student's score is the average number of words and pseudowords read out correctly. The split-half reliability is .97 for the Word Identification subtest and .91 for the Word Attack subtest (Hosp & Fuchs, 2005).

Reading fluency
Students completed a CBM task to assess their oral reading fluency (Deno, 1985). In this task, students read aloud three separate age-appropriate passages for 1 min each. A student's score consists of the average number of words read correctly minus the average number of incorrectly read words (e.g., omissions, insertions, mispronunciations, substitutions, and hesitations of more than 3 s). Reliability ranged between .80 and .91 (Hintze & Silberglitt, 2005).

Listening comprehension
Students completed the Listening Comprehension subtest (H. D. Hoover, Hieronymus, Frisbie, & Dunbar, 1996) from the Iowa Test of Basic Skills (ITBS). This test assesses literal meaning, inferential meaning, following directions, visual relations, numerical/spatial/temporal relations, and speaker point of view. There are 33 questions in the fourth-grade test, 38 questions in the seventh-grade test, and 40 questions in the ninth-grade test. Administration of the test takes approximately 45 min. A student's score consists of the total number of questions answered correctly. Reliabilities ranged from .67 to .79 (H. D. Hoover et al., 1996).

Vocabulary
Students completed the vocabulary subtest from the ITBS. The fourth graders completed Level 10, the seventh graders completed Level 13, and the ninth graders completed Level 14. In this test, students read sentences that have a word underlined. Under each sentence are four possible meanings or synonyms for the underlined word, and the student circles the item that has the closest meaning to the meaning of the underlined word. A student's score consists of the total number of questions answered correctly. The reliabilities ranged between .70 and .91 (Malecki & Elliott, 2002).

Procedure
The CBM-Maze test, Gates-MacGinitie Reading Comprehension Test, and the Listening Comprehension subtest from the ITBS were administered to participants at each grade level in two sessions on 2 days. The CBM-Maze and Gates-MacGinitie Reading Comprehension Tests were administered in the first session. The Listening Comprehension Test was administered in the second session. The rest of the assessments (CBM Oral Reading Fluency, WJ-III Word Identification Task, WJ-III Word Attack Task, and ITBS Vocabulary) were administered during a third session that was one-on-one with a research assistant. All test administrations followed a standardized procedure (dictated by each test).

Analyses
A multigroup structural regression model was fitted to the data to evaluate the code-related and language comprehension demands of the CBM-Maze and the Gates-MacGinitie Reading Comprehension Tests. The differences between the demands across tests were tested in each grade by examining whether regression coefficients could be constrained to be equal across the two reading comprehension tests or whether it was better to freely estimate those coefficients. When regression coefficients are constrained to be equal, reliance on each specific skill is hypothesized to be the same for both the CBM-Maze and the Gates-MacGinitie tests. When regression coefficients are freely estimated, reliance on a specific skill is hypothesized to be different for the CBM-Maze and the Gates-MacGinitie tests. To test the hypothesized developmental patterns, that is, decreasing code-related demands and increasing language comprehension demands, we compared models in which these coefficients were constrained to be equal across grades with models in which these coefficients were freely estimated.
However, comparing standardized regression coefficients or correlations is not possible in a regular structural regression model. In such models, analyses are based on the comparison of covariance matrices across groups. These covariances are composites of the correlations between variables and the variance of each variable. Thus, differences between covariances can reflect differences between correlations, between variances, or both. Differences between standardized regression coefficients or correlations can be tested only with a structural regression model in which the differences in variances are taken into account, which is possible with the use of phantom factors. Phantom factors are latent variables in which the variance is constrained (de Jong, 1999; Rodríguez, van den Boer, Jiménez, & de Jong, 2015; van den Boer, van Bergen, & de Jong, 2014). See Figure 1 for a graphical display of a structural regression model with phantom factors. In a phantom factor model, the regression coefficients are already standardized and the relations between variables are correlations instead of covariances. For the independent, or exogenous, phantom factors (i.e., decoding, reading fluency, vocabulary, and listening comprehension), the observed variables had a loading on their corresponding latent phantom factor. The residual variances of the observed independent variables were fixed to zero, and the variances of the latent phantom factors were fixed to one (Rodríguez et al., 2015). For the dependent, or endogenous, phantom factors (i.e., both reading comprehension tests), the variances were freely estimated and the factor loadings of the observed dependent variables on their latent phantom factors were fixed to their standard deviations (van den Boer et al., 2014). The residual variances of the observed dependent variables were also fixed to zero.
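The point that a covariance confounds the correlation with the variances can be made concrete with a small Python sketch. All numbers below are hypothetical and chosen only for illustration; the helper names `covariance` and `correlation` are not part of the study's analysis code, and the sketch does not reproduce the phantom factor model itself.

```python
def covariance(r, sd_x, sd_y):
    """Covariance implied by a correlation r and two standard deviations."""
    return r * sd_x * sd_y


def correlation(cov, sd_x, sd_y):
    """Correlation recovered by rescaling a covariance by both SDs."""
    return cov / (sd_x * sd_y)


# Hypothetical groups: identical correlation, different spread of scores.
r = 0.50
cov_a = covariance(r, sd_x=10.0, sd_y=8.0)    # 40.0
cov_b = covariance(r, sd_x=15.0, sd_y=12.0)   # 90.0

# The covariances differ although the underlying correlation is the same,
# so comparing raw covariances across groups would suggest a difference
# that is due to the variances alone.
print(cov_a, cov_b)
print(correlation(cov_a, 10.0, 8.0), correlation(cov_b, 15.0, 12.0))
```

Rescaling each variable by its standard deviation, which is what fixing variances or loadings in a phantom factor achieves within the model, removes this confound so that the compared quantities are correlations.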

Parameters of the multigroup structural regression model were estimated with Mplus Version 7.11 (Muthén & Muthén, 2012). To obtain parameter estimates, full information maximum likelihood estimation was used. Model fit was evaluated with the chi-square goodness-of-fit test statistic, the root mean square error of approximation, and the comparative fit index (Kline, 2011). A nonsignificant chi-square value (p > .05) indicated exact model fit (Hayduk, 1996). Root mean square error of approximation values less than .05 indicated good approximate fit, values between .05 and .08 were taken as satisfactory fit, and values greater than .10 were considered a poor fit (Browne & Cudeck, 1993). Comparative fit index values greater than .90 were considered acceptable, and values greater than .95 were taken as good incremental model fit (Hu & Bentler, 1999). Differences between the fit of any two nested models were tested with the chi-square difference test (Kline, 2011).
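The fit criteria and the nested-model comparison described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Mplus procedure: the function names are hypothetical, the chi-square values in the example are invented, the label "unclassified" for values between .08 and .10 is an assumption (the text does not name that range), and the chi-square tail probability is computed with a closed-form expression that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (integer df >= 1)."""
    if df < 1 or x < 0:
        raise ValueError("require df >= 1 and x >= 0")
    if x == 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in x.
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)


def chi_square_difference(chi2_constrained, df_constrained, chi2_free, df_free):
    """Chi-square difference test for two nested models."""
    delta = chi2_constrained - chi2_free
    delta_df = df_constrained - df_free
    return delta, delta_df, chi2_sf(delta, delta_df)


def rmsea_label(rmsea):
    """Classify RMSEA using the cutoffs stated in the text."""
    if rmsea < 0.05:
        return "good"
    if rmsea <= 0.08:
        return "satisfactory"
    if rmsea > 0.10:
        return "poor"
    return "unclassified"  # .08 to .10 is not named in the text


def cfi_label(cfi):
    """Classify CFI using the cutoffs stated in the text."""
    if cfi > 0.95:
        return "good"
    if cfi > 0.90:
        return "acceptable"
    return "below acceptable"


# Hypothetical comparison: constrained model vs. a less constrained model.
delta, delta_df, p = chi_square_difference(12.5, 3, 2.1, 1)
print(round(delta, 2), delta_df, p < 0.05)
```

A significant difference (p < .05) would indicate that the equality constraints worsen fit and should be released; a nonsignificant difference would favor the more parsimonious constrained model.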

Data screening and descriptive statistics
Before conducting the analyses, the data were checked for outliers. Scores that were more than 3 standard deviations above or below the mean were omitted. In total, less than 1% of the data were omitted. Means and standard deviations of reading comprehension, decoding, reading fluency, vocabulary, and listening comprehension are displayed in Table 1. Correlations between all variables are displayed in Table 2. The correlations show that scores on the CBM-Maze and Gates-MacGinitie tests were highly related (correlations ranging from .69 to .79). Also, the CBM-Maze test related to both code-related (correlations ranging from .50 to .88) and language comprehension skills (correlations ranging from .49 to .65). A similar pattern was observed for the Gates-MacGinitie test (correlations with code-related skills ranged from .47 to .80; correlations with language comprehension skills ranged from .55 to .76). A comparison of these relations suggests that the two tests relate to both code-related and language comprehension skills. In addition, the relations of both reading comprehension tests with decoding decreased across grades, whereas the relations with reading fluency remained relatively stable. The relations of the tests with vocabulary and listening comprehension did not show any clear developmental pattern.
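The outlier rule described above can be sketched in Python as follows. The sketch makes several assumptions that the text does not specify: the mean and sample standard deviation are computed on all scores of a variable (including the extreme ones), the rule is applied once rather than iteratively, and an omitted score is stored as a missing value (`None`) rather than dropping the whole case. The function name `screen_outliers` and the scores are hypothetical.

```python
from statistics import mean, stdev


def screen_outliers(scores, cutoff=3.0):
    """Replace scores more than `cutoff` SDs from the mean with None."""
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (assumption)
    return [s if abs(s - m) <= cutoff * sd else None for s in scores]


# Hypothetical scores on one measure; the last value is extreme.
scores = [10, 12, 11, 9, 13, 10, 11, 12, 9, 10,
          11, 13, 12, 10, 9, 11, 12, 10, 11, 60]
cleaned = screen_outliers(scores)
print(cleaned.count(None))
```

Treating omitted scores as missing values is compatible with full information maximum likelihood estimation, which uses all remaining observations for a case.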

Differences between the demands of the CBM-Maze and the Gates-MacGinitie Reading Comprehension tests
A multigroup structural regression model with phantom factors was fitted to the data to evaluate and compare the unique demands of the CBM-Maze and the Gates-MacGinitie tests. In this model (Figure 1), phantom factors were specified for each construct, with the observed variables as single indicators. This model was a just-identified or saturated model, that is, the model had zero degrees of freedom. The regression coefficients of the structural part of this just-identified model are presented in Table 3. Differences between the demands of the two comprehension tests were examined by constraining regression coefficients of each of the four factors of the different demands to be equal across the two tests. Note that in these models the regression coefficients were constrained to be equal in each grade but not across grades.
Step-by-step model comparisons were used to test whether the regression coefficients in each grade should be constrained to be equal or freely estimated.

Decoding and fluency demands
With respect to decoding and fluency, two separate models were estimated with each of these demands constrained to be equal across tests. It was hypothesized that the CBM-Maze test would rely more heavily on decoding and reading fluency than the Gates-MacGinitie test. The model in which the regression coefficients of decoding on both reading comprehension tests were constrained to be equal within each grade fitted the data well (see Table 4, Model 1.1). This model could not be further improved by freely estimating regression coefficients of decoding on reading comprehension. With respect to reading fluency, the model in which this demand on both reading comprehension tests was constrained to be equal had a poor fit to the data (see Table 4, Model 2.1).
Step-by-step model testing suggested that the equality constraints had to be dropped in all grades (see Table 4, difference between Model 2.1 and 2.2). In sum, these model comparisons showed that the decoding demands of the CBM-Maze and the Gates-MacGinitie tests were equal in all grades, but the CBM-Maze test relied significantly more on reading fluency than the Gates-MacGinitie test.

Vocabulary and listening comprehension demands
The unique demands of the reading comprehension tests with respect to vocabulary and listening comprehension were also examined. It was hypothesized that the Gates-MacGinitie test would rely more heavily on vocabulary and listening comprehension skills than the CBM-Maze test. The model with equality constraints on the vocabulary demands of both reading comprehension tests in all grades had a poor fit to the data (see Table 4, Model 3.1).
Step-by-step model comparisons suggested that freely estimating the regression coefficients in Grades 4 and 9 for the vocabulary demands resulted in improvements of the fit of the models (see Table 4, Model 3.1 vs. Model 3.2), whereas removing the

Differences in the demands of the CBM-Maze and the Gates-MacGinitie tests across grades
The multigroup structural regression model with phantom factors was also used to investigate the demands of the CBM-Maze and the Gates-MacGinitie tests across grades. Because there is evidence that the demands of reading comprehension change developmentally, we added model constraints to the model in Figure 1, separately for each of the four factors. We hypothesized that the decoding and fluency demands of reading comprehension tests would decrease, whereas the vocabulary and listening comprehension demands would increase across grades. We tested whether equally constraining or freely estimating the regression coefficients resulted in the best-fitting models.

Decoding and fluency demands
A model with an equality constraint on the decoding demands of both tests across grades had a good fit to the data (see Table 5, Model 1.1) and could not be further improved. With respect to fluency, a model with equality constraints of both tests across grades also had a good fit to the data (see Table 5, Model 2.1) and could not be further improved. To sum up, these model comparisons revealed that the decoding and fluency demands of both reading comprehension tests remain equal across grades.

Vocabulary and listening comprehension demands
Changes in the vocabulary and listening comprehension demands of reading comprehension across grades were also examined. A model with equality constraints on the vocabulary demands across grades had a good fit to the data (see Table 5, Model 3.1) and could not be further improved. For listening comprehension, a model with equality constraints had a poor fit to the data (see Table 5, Model 4.1). This model could be improved by removing the equality constraint of the Gates-MacGinitie test in Grade 9 (see Table 5, difference between Model 4.1 and 4.2). This model had an acceptable fit to the data (see Table 5, Model 4.2) and could not be further improved. In sum, the model comparisons revealed that the vocabulary demands of both reading comprehension tests and the listening comprehension demands of the CBM-Maze test are equal across grades. The listening comprehension demands of the Gates-MacGinitie test, however, decrease in Grade 9.

Discussion
The purpose of the current study was to identify the relative code-related and language comprehension demands of the CBM-Maze test in elementary, middle, and high school children. These demands were compared to those of the Gates-MacGinitie Reading Comprehension Test. Even though the CBM-Maze and the Gates-MacGinitie tests correlated substantially with each other and with both code-related and language comprehension skills, the results of the multigroup structural regression modeling, in which unique effects are considered, showed that the CBM-Maze test depends more heavily on code-related skills than does the Gates-MacGinitie test, whereas the Gates-MacGinitie test relies more heavily on language comprehension skills than does the CBM-Maze test. With respect to developmental patterns, the code-related and language comprehension demands of both tests remain relatively stable across grades. These results highlight that the CBM-Maze test relies more on code-related skills than on language comprehension skills. Indeed, the test comparisons revealed that even though the decoding demands of the CBM-Maze and the Gates-MacGinitie tests are equal in all grades, the CBM-Maze test relies significantly more on reading fluency and less on language comprehension than the Gates-MacGinitie test. These findings are interesting considering the high correlation between the two tests across grades. Further, the findings also raise an important question, namely, whether the CBM-Maze test, when used as a formative assessment to influence instructional focus (vs. more simply using it as an efficient predictor), can adequately evaluate the strengths and weaknesses of students in reading comprehension. Previous studies also suggested that such cloze tests measure primarily lower level comprehension processes (e.g., sentence comprehension; Gellert & Elbro, 2013).
Further, the comparisons across grades showed that the demands on decoding and fluency remained relatively stable for both tests. Note that at each grade level, students completed grade-level decoding items and read grade-level texts (in CBM-Maze, CBM-Oral Reading, and in Gates-MacGinitie). Thus, a "stable" pattern across grades suggests that both decoding and fluency continue to covary with reading comprehension. Further, the unique contribution of decoding was hardly significant, whereas the role of fluency was substantial, consistent with recent findings showing that the contribution of decoding in reading comprehension is gradually taken over by reading fluency (LARRC, 2015). The small contribution of decoding can be attributed, in part, to suppression effects due to the high correlation of decoding and fluency. Note, however, that the model comparisons produce the same results when decoding and fluency are entered in separate models. A more likely explanation is that the measure of fluency used in this study (i.e., oral reading fluency) draws on skills beyond those captured by word identification efficiency alone (Eason, Sabatini, Goldberg, Bruce, & Cutting, 2013).
The comparisons across grades also showed that the demands on vocabulary remained stable for both tests, suggesting that vocabulary continues to covary with reading comprehension. The demands of listening comprehension also remained stable across grades even though they were relatively low for both tests (see Table 3), but they decreased in Grade 9 for the Gates-MacGinitie test. One potential explanation is the measurement of listening comprehension itself, as the ITBS Listening Comprehension Test taps a broad set of skills. Specifically, in addition to comprehension of literal and inferential meaning, which is often the focus of various listening comprehension tests (Hogan, Adlof, & Alonzo, 2014), the test specifications include comprehension of numerical/spatial/temporal relations, following directions, visual relations, and speaker point of view. It is unclear whether the text and questions included in the Gates-MacGinitie test pose consistent demands on this broad set of skills across grades. Another potential explanation is that the presence of both vocabulary and listening comprehension in the model may have resulted in suppression effects. Note, however, that the model comparisons produce the same results when vocabulary and listening comprehension are entered in separate models. A final explanation is that the nature of the demands on comprehension changes across development. Specifically, comprehension demands in higher grades (such as Grade 9 in our study) may not be adequately accounted for by listening comprehension alone; for example, question and text complexity in higher grades (e.g., content, structure) may pose increasing demands on 21st-century higher order skills such as reasoning (Goldman, 2012; Goldman & Pellegrino, 2015; Graesser, 2015; Sabatini, O'Reilly, Halderman, & Bruce, 2014). As previously noted, it is unclear whether the text and questions included in the Gates-MacGinitie test pose such demands; thus, further research is needed to examine this issue.
The current set of findings must be considered in the context of certain design and measurement constraints of the present study. One limitation derives from the use of different students in each grade level. Therefore, it is not clear whether the developmental patterns observed are "developmental" or due to cohort differences. Despite the advantages of the cross-sectional design that allowed direct comparisons of elementary, middle, and high school students at a single time point, a longitudinal design would have been more appropriate to establish developmental patterns. Further, the actual interpretation of the comparisons across grades is also limited by the use of a mix of standardized, state, and CBM tests. For example, both the fluency and the Maze tests were CBM measures that used grade-level texts at each grade, and their high correlation likely also reflects common method variance. Thus, the correlation between fluency and CBM-Maze may be difficult to interpret when compared to that of fluency with the Gates-MacGinitie test. Further, this study used one specific variation of the CBM-Maze test. As we discussed earlier, the test has several different variations; even though these variations may have limited, if any, implications for reliability and predictive validity (Pierce et al., 2010), they do have implications for the specific demands posed by the test. A final issue pertains to constraints to generalizability as a function of the sample demographics. It would be important to address this issue also in future work, drawing from a more diverse population.
Despite these limitations, the findings from this study taken together have important implications for both researchers and educators. One implication pertains to the diagnosis of students at risk of reading difficulties. Specifically, the type of reading comprehension test that is used determines to a large extent which students are diagnosed as at-risk or struggling readers (Keenan & Meenan, 2014; Papadopoulos et al., 2014). For example, in a sample of 1,500 participants who ranged from 8 to 19 years of age, only half of the individuals who performed poorly on a reading comprehension test that mainly relied on comprehension also performed poorly on a reading comprehension test that mainly relied on decoding (Keenan et al., 2014). It follows that if the CBM-Maze test is used to identify strengths and weaknesses in readers (in reading research or at schools), it will likely result in the identification of struggling readers who experience difficulties with lower level comprehension abilities. A second implication concerns potential revisions that could further improve the CBM-Maze test so that it fares better when compared to "balanced" standardized reading comprehension measures as a tool for measuring comprehension skills per se. The CBM-Maze test is a useful measure and has many advantages over traditional standardized test measures; it is reliable, fast and easy to administer, and inexpensive (Fuchs & Fuchs, 2002; Pierce et al., 2010). Future work can focus on how to revise the test so that its language comprehension demands are increased. For example, existing test variations that purposely select the target word and distractors offer one way to increase comprehension demands (Gellert & Elbro, 2013). Note, however, that several attempts to revise the CBM-Maze test have been made already, but these have not led to significant improvements in capturing deep comprehension (Lembke et al., 2016). Thus, more work is needed in this direction.
In conclusion, the current study revealed that the CBM-Maze test relies more on code-related skills than on language comprehension skills. Comparisons between the demands of the CBM-Maze and the Gates-MacGinitie tests also revealed that the CBM-Maze test relies more heavily on reading fluency and less on language comprehension skills than the Gates-MacGinitie test. This study has important implications for the use of the CBM-Maze as a formative assessment measure and suggests that it should be further revised to increase its demands on language comprehension skills.

Figure 1. Multigroup structural regression model with phantom factors for the demands of decoding, reading fluency, vocabulary, and linguistic comprehension on the CBM-Maze and the Gates-MacGinitie Reading Comprehension Tests.

Table 1. Descriptive statistics for reading comprehension, decoding, reading fluency, vocabulary, and listening comprehension.

Table 3. Standardized regression coefficients of the structural regression model with phantom factors of the decoding, reading fluency, vocabulary, and listening comprehension demands of the CBM-Maze (Maze) and the Gates-MacGinitie Reading Comprehension (Gates) tests.

equality constraint in Grade 7 did not result in improvement in model fit. The model with equality constraints in Grade 7 only had a good fit to the data (see Table 4, Model 3.2). For listening comprehension, the model with equality constraints on this demand of the two reading comprehension tests in all grades had a poor fit to the data (see Table 4, Model 4.1). Step-by-step model comparisons suggested that the fully constrained model could be improved only by removing the equality constraint in Grade 4 (see Table 4, Model 4.1 vs. 4.2). This model with equality constraints in Grades 7 and 9 had a good fit to the data (see Table 4, Model 4.2). In sum, these step-by-step model comparisons showed that the Gates-MacGinitie test relied more heavily on vocabulary in Grades 4 and 9 than the CBM-Maze test. The reliance on vocabulary of both tests was not significantly different in Grade 7; however, the pattern was in the expected direction. With respect to listening comprehension, the Gates-MacGinitie test relied more heavily on this demand than the CBM-Maze test in Grade 4. In Grades 7 and 9, the reliance on listening comprehension of both tests was not significantly different.
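The step-by-step comparisons described above are comparisons of nested models: a model with an equality constraint is tested against the same model with that constraint removed. The following is a minimal sketch of such a comparison, assuming a plain chi-square difference test; the source does not state which difference test or software was used, and the function names and fit values below are hypothetical, chosen only for illustration.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate with integer df.

    Uses the closed-form series for integer degrees of freedom,
    so only the standard library is needed.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_{r=1}^{(df-1)/2}
    #         x^(r - 1/2) / (1*3*...*(2r-1)), with phi the standard normal pdf.
    root = math.sqrt(x)
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + 2 * phi * total


def chi2_difference_test(chi2_constrained: float, df_constrained: int,
                         chi2_free: float, df_free: int):
    """Compare a constrained model with a nested, less constrained one.

    Returns (delta_chi2, delta_df, p_value). A small p-value suggests that
    removing the constraint(s) significantly improves model fit.
    """
    delta_chi2 = chi2_constrained - chi2_free
    delta_df = df_constrained - df_free
    if delta_df <= 0 or delta_chi2 < 0:
        raise ValueError("the constrained model must have more df and no better fit")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit values (NOT taken from the study): a model with one
# equality constraint versus the same model with that constraint removed.
delta_chi2, delta_df, p_value = chi2_difference_test(25.0, 12, 18.5, 11)
print(f"delta chi2 = {delta_chi2:.2f}, delta df = {delta_df}, p = {p_value:.4f}")
```

Under this logic, an equality constraint is removed only when doing so yields a significant improvement in fit; otherwise the more constrained model is retained, which corresponds to concluding that the two tests do not differ significantly on that demand in that grade.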

Table 4. Values of the selected fit indices for the models concerning the differences in the demands of the CBM-Maze and the Gates-MacGinitie tests. Note. RMSEA = root mean square error of approximation; CI = confidence interval; CFI = comparative fit index. *p < .05. **p < .01.

Table 5. Values of the selected fit indices for the models concerning the developmental patterns of the demands of the CBM-Maze (Maze) and the Gates-MacGinitie (Gates) tests.