Discrimination with unidimensional and multidimensional item response theory models for educational data

Abstract Achievement tests are used to characterize the proficiency of higher-education students. Item response theory (IRT) models are applied to these tests to estimate the ability of students (as a latent variable in the model). In order for quality IRT parameters to be estimated, especially ability parameters, it is important that the appropriate number of dimensions is identified. Through a case study, based on a statistics exam for students in higher education, we show how the dimensions and other model parameters can be chosen in a real situation. Our model choice is based both on empirical evidence and on background knowledge of the test. We show that dimensionality influences the estimates of the item parameters, especially the discrimination parameter, which provides information about the quality of the item. We perform a simulation study to generalize our conclusions. Both the simulation study and the case study show that multidimensional models have the advantage of discriminating better between examinees. We conclude from the simulation study that it is safer to use a multidimensional model than a unidimensional one if it is unknown which model is correct.


Introduction
In disciplines such as statistics, education and psychology, to mention a few, tests might be multidimensional. This means that these tests measure two or more dimensions of ability, also called constructs. The tests consist of several questions, called items. To model such items, multidimensional item response theory (MIRT) is a widely used tool. MIRT models are appropriate since they explain the relationship between two or more latent (unobserved) variables and the probability that an examinee correctly answers a particular question. As noted by Reckase (2007), the MIRT model has become a popular tool for the analysis of test content, computerized adaptive testing and the equating of test calibrations.
Generally, MIRT models can be categorized into compensatory and non-compensatory models. More often than not, MIRT models are compensatory, allowing the dimensions to interact. As Yang (2007) points out, the literature of educational research is dominated by compensatory model applications (Reckase 1979; Drasgow and Parsons 1983; Yen 1984; Way, Ansley, and Forsyth 1988; Reckase and McKinley 1991; Kirisci, Hsu, and Yu 2001). Our focus remains on compensatory models in this article. In a compensatory model, a high ability in one dimension can compensate for a low ability in another dimension in terms of the probability of correctly answering a question. A non-compensatory model, on the other hand, does not permit this: one needs to be proficient in each ability to obtain the highest score.
In MIRT models the emphasis has not been on minimizing the number of dimensions, as is the case in factor analysis. Reckase (1997) analyzed item response data with 25 or more dimensions. However, most recent applications of MIRT have used fewer dimensions, either due to limitations in estimation programs or because it is useful to plot the results, and it is difficult to plot more than three dimensions. The usage of unidimensional item response theory (UIRT) models is not new to the disciplines of education and psychology. However, meeting the unidimensionality assumption for ability is difficult in some achievement tests (Reckase 1985; Hambleton, Swaminathan, and Rogers 1991).
Many researchers (Spencer 2004; De La Torre and Patz 2005; Yang 2007; Kose and Demirtasli 2012) have compared unidimensional and multidimensional models. They have demonstrated that the ability and item parameters have less estimation error and provide more precise measurements under MIRT, because the number of latent traits underlying item performance increases. Moreover, the model-data fit indexes are in favor of MIRT models. Wiberg (2012) used multidimensional college admission test data to examine the consequences of applying UIRT models. She found in a simulation study that MIRT gives a better model fit than UIRT. However, the result of a consecutive UIRT approach is close to that of MIRT. She concluded that if the test has between-item multidimensionality, the use of consecutive UIRT instead of MIRT models is not harmful and may be easier to interpret. Hooker, Finkelman, and Schwartzman (2009) have highlighted a paradox occurring with MIRT in contrast to UIRT: if the answer of an examinee to an item is changed from wrong to correct, it could happen that the estimated ability decreases in some dimension. Further insights into this potential issue have been given by Finkelman, Hooker, and Wang (2010), Hooker (2010), and Jordan and Spiess (2012, 2018). These references argue for the need to address this issue to avoid unfairness in the test. Other researchers have contrasting views. According to van Rijn and Rijmen (2012), preventing paradoxical results is less relevant as long as one does not take a contest perspective on a test. According to van der Linden (2012), the convergence of ability estimates (e.g., the MLE) to the true ability is of main importance; additional information improves estimates even if paradoxical results occur (Reckase and Luo 2015).
In contrast to this view, Jordan and Spiess (2012) point out that fairness is an individual concept and does not refer to infinite fictitious repetitions, as the quality of estimates does.
The discrimination parameter of the model is a measure of the differential capability of an item. An item is considered valuable if it discriminates well between subjects with ability levels in the range of interest for the exam. A larger value of the discrimination parameter means that the probability of correctly answering a question increases rapidly with ability. A high discrimination parameter yields a steeper slope of the item characteristic curve (ICC), and the item is then better able to differentiate between subjects around the difficulty level of the item. Items with high discrimination power can therefore contribute more to assessment precision than items with low discrimination power, provided that the item difficulty is in the scope of the exam. Thus, a feature of importance for educationalists is the capability of an item to discriminate between examinees. In a UIRT model, items discriminate between subjects along one dimension, whereas items in a MIRT model differentiate examinees along each dimension.
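The effect of the discrimination parameter on the steepness of the ICC can be illustrated with a small numerical sketch (hypothetical parameter values; 2PL model as defined later in the paper): near the item difficulty, a highly discriminating item separates examinees just below and just above the difficulty much more sharply.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of a correct answer
    at ability theta for discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Probability gain between abilities -0.5 and +0.5 around difficulty b = 0:
low_disc  = icc_2pl(0.5, a=0.5, b=0.0) - icc_2pl(-0.5, a=0.5, b=0.0)
high_disc = icc_2pl(0.5, a=2.0, b=0.0) - icc_2pl(-0.5, a=2.0, b=0.0)
```

The item with a = 2.0 produces a much larger probability difference than the item with a = 0.5, which is exactly the sense in which it discriminates better around its difficulty level.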
The objectives of this study are: (1) to show how the question of dimensionality can be explored for an achievement test in higher education, (2) to show how a good model for analysis can be chosen in a real situation and (3) to investigate the impact of the number of dimensions in the chosen model on the discrimination parameter's estimate.

Data of case study
Data used in this paper are acquired from a statistics exam for higher education students. This exam was taken by 238 examinees in March 2015. The exam had a mixed format with 16 questions in total, consisting of 15 multiple-choice items and one free-response item. The first 15 questions, Q1 to Q15, offer 5 options of which one is correct. The responses have been dichotomized with the values 0 (wrong answer) and 1 (correct answer). The last question, Q16, offers a maximum of 10 points. We have polychotomized the response to this item into the values 0, 1, 2, 3, 4 and 5: the code 0 represents 0 points, 1 represents 0 < points ≤ 2, 2 represents 2 < points ≤ 4, and so on. The characteristics of the exam and the percentage of correct answers are summarized in Table 1. According to this percentage, question 1 was the easiest and question 12 the toughest in the exam. Most of the students were able to get 10 points for the last question. The item questions target different areas of statistics according to the teacher's intentions, e.g., basic statistics, testing of hypotheses and regression; see Table 1.

Unidimensional item response theory models (UIRT)
Let U_ij be the dichotomous response of examinee j (j = 1, 2, ..., N) to item i (i = 1, 2, ..., n) and P(U_ij = 1 | θ_j; a_i, b_i, c_i, g_i) the probability of a correct response to item i by examinee j with ability θ_j. Commonly used IRT models for dichotomous responses are the two- and three-parameter logistic models. However, the four-parameter logistic (4PL) model has recently received more attention, and some researchers have highlighted its potential for practical purposes and from a methodological point of view (Magis 2013). The R package catR recently introduced the 4PL model as a baseline IRT model for computerized adaptive test (CAT) generation (Magis and Raîche 2012). The 4PL model can be expressed as

P(U_ij = 1 | θ_j; a_i, b_i, c_i, g_i) = c_i + (g_i − c_i) / (1 + e^{−a_i(θ_j − b_i)}),   (1)

where θ_j is the ability of examinee j, b_i is the difficulty of item i (location of the item response function), a_i > 0 is the discrimination of item i that determines the steepness of the item response function, c_i is the lower asymptote parameter of the i-th item response function, and g_i is the upper asymptote parameter of the i-th item response function. The three-parameter logistic (3PL) model is the special case of the 4PL model with g_i = 1. The two-parameter logistic (2PL) model is the special case with g_i = 1 and c_i = 0. For the polychotomized free-response item we have used the graded-response (GR) model, which was first introduced by Samejima (1969) for ordered categorical responses. This model is a generalization of the 2PL model. The probability that the j-th examinee responds in category k or higher, out of the r_i categories of the i-th item, is expressed as

P(U_ij ≥ k | θ_j; a_i, b_i) = 1 / (1 + e^{−a_i(θ_j − b_ik)}),   k = 1, 2, ..., r_i − 1,   (2)

where a_i and θ_j are again the discrimination and ability parameters, respectively, and b_i = (b_i1, ..., b_i(r_i−1)) is a vector of category intercepts for item i.
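Formulas (1) and (2) can be sketched directly in code (hypothetical parameter values; the nested special cases 3PL and 2PL follow by fixing g = 1 and additionally c = 0):

```python
import numpy as np

def p_4pl(theta, a, b, c, g):
    """4PL model (1): P(U = 1 | theta) = c + (g - c) / (1 + exp(-a (theta - b)))."""
    return c + (g - c) / (1.0 + np.exp(-a * (theta - b)))

def p_gr_at_least(theta, a, b_k):
    """Graded-response model (2): P(U >= k | theta) for category intercept b_k."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b_k)))

# Special cases: 3PL sets g = 1; 2PL additionally sets c = 0.
p3 = p_4pl(0.7, a=1.2, b=0.0, c=0.2, g=1.0)   # 3PL probability
p2 = p_4pl(0.7, a=1.2, b=0.0, c=0.0, g=1.0)   # 2PL probability
```

As the ability tends to minus or plus infinity, the 4PL probability approaches the lower asymptote c and the upper asymptote g, respectively, matching the parameter interpretation above.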

Multidimensional item response theory models (MIRT)
In MIRT models, the probability of success is modeled as a function of multiple ability dimensions. Each person has a vector θ_j = (θ_j1, ..., θ_jm) of ability parameter values. Each item has a vector of discrimination parameter values a_i = (a_i1, ..., a_im), one difficulty parameter d_i = −a_i b_i, a guessing parameter c_i and an upper asymptote parameter g_i (Reckase 2009). We can write a four-parameter logistic MIRT model (M4PL) as a generalization of the UIRT model (1) by replacing a_i θ_j with Σ_{l=1}^{m} a_il θ_jl, where m is the total number of dimensions. The M3PL (or M2PL) is again a special case of the M4PL model obtained by setting g_i = 1 (or g_i = 1 and c_i = 0). The multidimensional graded response (MGR) model generalizes formula (2) by replacing a_i θ_j with Σ_{l=1}^{m} a_il θ_jl and using the vector of category intercepts d_i = (d_i1, ..., d_i(r_i−1)) for item i, where d_ik = −a_i b_ik.
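The compensatory nature of these models can be illustrated with a minimal M2PL sketch (hypothetical discrimination values): with equal discriminations on both dimensions, a high ability on one dimension fully offsets a low ability on the other.

```python
import numpy as np

def p_m2pl(theta, a, d):
    """Compensatory M2PL in slope-intercept form:
    P(U = 1 | theta) = 1 / (1 + exp(-(a . theta + d)))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

a = np.array([1.5, 1.5])   # hypothetical equal discriminations on two dimensions
d = 0.0                    # difficulty parameter d = -a*b

# High theta_1 compensates for low theta_2: same success probability
balanced    = p_m2pl(np.array([0.0,  0.0]), a, d)
compensated = p_m2pl(np.array([1.0, -1.0]), a, d)
```

In a non-compensatory model the second examinee, weak on dimension 2, would have a clearly lower success probability; here the linear combination of abilities makes the two profiles indistinguishable.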

Exploratory and confirmatory MIRT models
MIRT models are either exploratory or confirmatory, depending on the information available at the model specification step. An exploratory MIRT model is useful to explore the possible structure of a set of test items without imposing a preconceived structure on the items, or when an unconstrained solution is seen as a strong test to verify the hypothesized structure of the test items. After applying an exploratory MIRT model, one can identify the underlying structure of a set of test items. Confirmatory MIRT models are useful when we have a clear hypothesis about the structure of the set of items in an examination taken by students. In fact, a confirmatory MIRT model needs a specified number of latent abilities to explain the pattern of relationships between the items. In confirmatory MIRT the researcher postulates the relationship between the items and a specific number of latent variables a priori, based on knowledge of the theory, empirical research, or both. This hypothesized structure is then statistically tested. Confirmatory MIRT models are valuable for measuring aspects on which persons differ quantitatively, such as different knowledge structures of examinees, multiple abilities, strategies, satisfaction levels and attitudes.
It is common in exploratory analyses to use a model where all ability parameters can influence all items (Figure 1, middle), called within-item multidimensionality. A typical confirmatory analysis uses models where the different abilities influence disjoint sets of items (Figure 1, right), denoted between-item multidimensionality; see Adams, Wilson, and Wang (1997) and Hartig and Höhler (2009). We later use a model with between-item multidimensionality for an exploratory analysis of our case study. The MIRT model with between-item multidimensionality has also been called a MIRT model of simple structure or a multi-unidimensional IRT model; Sheng and Wikle (2007) compare this model with the UIRT model. It has been noted, e.g., by Jordan and Spiess (2012), that the model with between-item multidimensionality and ML estimation avoids the issue of paradoxical results mentioned in the introduction (however, if a Bayesian analysis is used and the dimension is at least 3, paradoxical results are still possible; see Finkelman, Hooker, and Wang (2010)).
A variant of a MIRT model with between-item multidimensionality is the consecutive UIRT model, where the items belonging to one ability are analyzed as separate models. In the right panel of Figure 1, removing the arrow between θ_1 and θ_2 illustrates the consecutive UIRT.

The ability parameters in the models
The ability of the examinees is characterized by a single ability parameter θ_j in UIRT or by a multidimensional parameter vector θ_j = (θ_j1, ..., θ_jm)^T in MIRT. We assume that the investigated population has normally distributed ability parameters with standardized variance 1, i.e., with a covariance matrix C having a diagonal of 1's; see Reckase (2007) and Chalmers (2012), and specifically for m = 1, Baker and Kim (2004) and Rizopoulos (2006). The standardization is important to make the discrimination parameters a_i in the model well-defined (identifiable), as otherwise dividing all abilities by a factor and multiplying all discrimination parameters by the same factor would lead to the same data fit. We think that this is even more important from an interpretation perspective than from a theoretical perspective: it is beneficial to make discrimination parameters comparable across different analysis models as well as across different examinations.
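The identifiability argument above can be checked numerically: rescaling all abilities by a factor and all discriminations by its inverse leaves every response probability, and hence the likelihood, unchanged (a sketch with hypothetical values, 2PL in slope-intercept form).

```python
import numpy as np

def p_2pl_si(theta, a, d):
    """2PL in slope-intercept form: P = 1 / (1 + exp(-(a*theta + d)))."""
    return 1.0 / (1.0 + np.exp(-(a * theta + d)))

theta, a, d = 0.8, 1.3, -0.4   # hypothetical ability and item parameters
k = 2.0

# Dividing the ability by k while multiplying the discrimination by k
# gives exactly the same probability: the scale is not identified
# unless the ability variance is fixed (e.g., to 1).
p_original = p_2pl_si(theta, a, d)
p_rescaled = p_2pl_si(theta / k, a * k, d)
```

This is why fixing the ability variance to 1 is needed before discrimination parameters can be compared across models or examinations.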

Model estimation
First, it is useful to define the response data in the form of an indicator function

χ_k(U_ij) = 1 if U_ij = k, and χ_k(U_ij) = 0 otherwise.

Let ψ be the set of all unknown item parameters. The conditional likelihood of the response pattern vector U_j of the j-th examinee can be defined as

L(U_j | θ_j, ψ) = Π_{i=1}^{n} Π_{k=0}^{r_i−1} P(U_ij = k | θ_j, ψ)^{χ_k(U_ij)},

where the number of categories r_i is set to 2 for dichotomous items (k ∈ {0, 1}). Denoting the density of the ability parameters by g(θ) (recall that we assume a standardized multivariate normal distribution), we integrate them out of the likelihood function. This yields the marginal likelihood of the observed data U (the N × n data matrix),

L(U | ψ) = Π_{j=1}^{N} ∫ L(U_j | θ, ψ) g(θ) dθ.

For the estimation of IRT models, the recommended method in the case of a small number of dimensions (less than 4) is the expectation maximization (EM) algorithm using fixed Gauss-Hermite quadrature, while the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm and Quasi-Monte Carlo EM (QMCEM) estimation are recommended in the case of higher dimensionality or when a multidimensional confirmatory model is required (Chalmers 2012). For higher dimensions, the EM algorithm becomes difficult since a large number of quadrature points is required in the E-step to evaluate the high-dimensional integrals in the likelihood function of the item parameters.
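The marginal likelihood of a single response pattern can be sketched with fixed Gauss-Hermite quadrature, as used in the E-step described above (a minimal unidimensional 2PL sketch with hypothetical item parameters; dichotomous items only, for brevity):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct answer."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def marginal_likelihood(responses, a, b, n_quad=21):
    """Marginal likelihood of one response pattern under a unidimensional
    2PL model, integrating the ability out against a standard normal
    density via fixed Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    theta = np.sqrt(2.0) * nodes          # change of variables for N(0, 1)
    w = weights / np.sqrt(np.pi)
    p = p_2pl(theta[:, None], a[None, :], b[None, :])  # quad points x items
    cond = np.prod(np.where(responses[None, :] == 1, p, 1.0 - p), axis=1)
    return float(np.sum(w * cond))

# Hypothetical 3-item test and one observed response pattern
a = np.array([1.0, 1.5, 0.8])
b = np.array([-0.5, 0.0, 0.5])
u = np.array([1, 0, 1])
L = marginal_likelihood(u, a, b)
```

A useful sanity check is that the marginal likelihoods of all 2^n possible response patterns sum to one, since the conditional probabilities sum to one at every quadrature point.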

Models comparison criteria
To evaluate the goodness of fit of the different competing models, we have used the following statistical approaches: the Akaike information criterion (AIC), the bias-corrected AIC (AICc), the Bayesian information criterion (BIC) and the sample size adjusted BIC (SABIC). These statistics are based on the log-likelihood of a fitted model, the sample size and the number of estimated parameters:

AIC = −2 LL + 2p,
AICc = AIC + 2p(p + 1) / (N − p − 1),
BIC = −2 LL + p log(N),
SABIC = −2 LL + p log((N + 2)/24).

Here p is the number of parameters being estimated (= number of item parameters − m(m−1)/2), LL is the log-likelihood of the model, N indicates the number of examinees and m represents the number of dimensions. Smaller AIC, AICc, BIC and SABIC values indicate better fit. Like BIC, SABIC places a sample-size-based penalty on adding parameters, but the penalty is not as high as for BIC. We note that we use the above form of AICc since it is the version implemented in the R package mirt, which we use for the analysis, and since Burnham and Anderson (2002) suggested that this form could be used generally. However, AICc in general is model dependent; see Section 7.4 of Burnham and Anderson (2002).
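These criteria are simple functions of LL, p and N; the sketch below computes them with hypothetical values (the exact definitions used by a given software package, such as mirt, should be checked against its documentation):

```python
import numpy as np

def fit_criteria(LL, p, N):
    """AIC, bias-corrected AICc, BIC and sample-size adjusted BIC
    from the log-likelihood LL, parameter count p and sample size N."""
    aic = -2.0 * LL + 2.0 * p
    return {
        "AIC":   aic,
        "AICc":  aic + 2.0 * p * (p + 1.0) / (N - p - 1.0),
        "BIC":   -2.0 * LL + p * np.log(N),
        "SABIC": -2.0 * LL + p * np.log((N + 2.0) / 24.0),
    }

# Hypothetical fitted model: LL = -3500, p = 40 parameters, N = 238 examinees
crit = fit_criteria(LL=-3500.0, p=40, N=238)
```

For any N > 22 or so, log((N + 2)/24) < log(N), so SABIC penalizes extra parameters less severely than BIC, as stated above.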
The likelihood ratio tests were also conducted between the nested models to find out whether a more complicated model fits better than the simple one.

Determination of number of factors
Standard principal component analysis (PCA) and factor analysis (FA) assume that the analyzed variables are continuous and follow a multivariate normal distribution. With this assumption, the sample covariance matrix is a sufficient statistic and inference can be done based on this matrix. In particular, methods are developed to determine the number of factors for the model which we briefly summarize now.
One starts by computing the eigenvalues and eigenvectors of the covariance matrix. One can retain only those factors that have eigenvalues greater than one, which is the "eigenvalue > 1" criterion or Kaiser-Guttman rule (Guttman 1954; Kaiser 1960). These factors explain more than the average of the total variance. However, Conway and Huffcutt (2003) summarize ample research findings and note that the Kaiser-Guttman rule "tends to produce too many factors" and "probably should not be relied on". Gorsuch (1983) proposes that the number of factors is expected to lie between [n/5] and [n/3], where n is the number of variables. This rule of thumb is more appropriate when the sample size is larger (at least 100 is recommended) and the data have fewer than 40 variables. Horn (1965) proposes another strategy, known as parallel analysis, which has emerged as one of the most strongly recommended techniques for deciding how many factors to retain (Zwick and Velicer 1986; Fabrigar et al. 1999; Velicer, Eaton, and Fava 2000; Hayton, Allen, and Scarpello 2004; Peres-Neto, Jackson, and Somers 2005; Henson and Roberts 2006; Ruscio and Roche 2012; Garrido, Abad, and Ponsoda 2013). Parallel analysis compares the scree of eigenvalues of the observed data with that of random data matrices of the same size as the original. For factor analysis, any factor with an adjusted eigenvalue greater than 0 is retained, where the adjustment is given by observed λ_p − mean simulated λ_p. Here, observed λ_p is the p-th ordered eigenvalue of the observed data (for p = 1 to n), and mean simulated λ_p is the corresponding mean eigenvalue over the simulated random data sets. Conway and Huffcutt (2003) have observed that different techniques suggest retaining different numbers of factors for an exploratory factor model, as is frequently reported in the literature.
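Horn's procedure can be sketched as follows (a simplified sketch on a correlation matrix, with a simulated two-factor data set whose structure is hypothetical):

```python
import numpy as np

def parallel_analysis(data, n_iter=200, seed=1):
    """Horn's parallel analysis: retain components whose observed
    eigenvalue exceeds the mean eigenvalue from random normal data
    of the same size. Returns the retained count and the adjusted
    eigenvalues (observed minus mean simulated)."""
    rng = np.random.default_rng(seed)
    N, n = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sim = np.zeros(n)
    for _ in range(n_iter):
        r = rng.standard_normal((N, n))
        sim += np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    sim /= n_iter
    adjusted = obs - sim
    return int(np.sum(adjusted > 0)), adjusted

# Hypothetical data: items 0-3 load on one factor, items 4-7 on another
rng = np.random.default_rng(0)
f = rng.standard_normal((500, 2))
loadings = np.zeros((2, 8)); loadings[0, :4] = 0.8; loadings[1, 4:] = 0.8
X = f @ loadings + 0.6 * rng.standard_normal((500, 8))
n_retained, adjusted = parallel_analysis(X)
```

Because the first two observed eigenvalues dominate those of comparable random data while the remaining ones fall below them, the procedure retains two factors for this construction.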
In our MIRT situation with mixed-format data consisting of dichotomous and polychotomous items, we cannot use these methods directly. However, these methods can be applied using the polychoric correlation matrix of the MIRT data; see, e.g., Li, Jiao, and Lissitz (2012) and Verma and Markey (2014). The polychoric correlation matrix between the items is computed under the assumption that the dichotomous and polychotomous items discretize continuous latent variables that follow bivariate normal distributions. In contrast to the situation where the data are multivariate normally distributed, the sample polychoric correlation matrix is not a sufficient statistic of the data. Hence, some information is lost when principal component analysis is performed based on this correlation matrix, and such a method is therefore called a limited information method. Nevertheless, it is useful in some situations to perform principal component analysis based on polychoric correlations to obtain first guidance on how many dimensions to use in a MIRT model. Further, this limited information method can provide quite accurate starting values for parameters (Bock, Gibbons, and Muraki 1988; Lee and Terry 2005) in the algorithms mentioned in the section on model estimation.
Alternatively, we can determine the number of factors in MIRT models with a full information method as follows: we use the full response patterns (not only the sample polychoric correlations) and compute the maximized log-likelihood and parameter estimates for several possible numbers of factors. We then decide on the preferred number of factors by applying model comparison criteria or likelihood ratio tests to the models with the different numbers of factors.
In this paper we use the polychoric correlation matrix and parallel analysis for principal component analysis as a limited information method to obtain a first indication of the dimension. Then, we analyze the IRT model with a full information method as described in the previous paragraph, and by this we obtain a recommended number of factors through model comparison criteria and likelihood ratio tests.

Analysis with MIRT model
We start by investigating an appropriate number of dimensions (number of latent factors) based on the polychoric correlation matrix (limited information method). This gives us a rough estimate of the dimensionality in our data set of students' results in the statistics examination. Figure 2 shows the polychoric correlations between the pairs of items. We conduct principal component analysis and consider first the eigenvalue-greater-than-one criterion to determine the appropriate number of factors. Table 2 shows that five components meet the rule. The first two components have eigenvalues (6.54, 1.42) far greater than one, which strongly suggests multidimensionality in the data. When computing the adjusted eigenvalues greater than 0, the parallel analysis indicates retaining 3 factors. The parallel analysis is performed using 1000 simulated data sets. According to Gorsuch's rule of thumb, we expect our data set to have around 4 factors, as 16/5 ≈ 3 and 16/3 ≈ 5. While these methods yield somewhat different recommendations for the number of factors, our general conclusion is that the data lack unidimensionality and that at least 2 components exist.
Since we have a small number of items (16), we decided to investigate multidimensional item response theory (MIRT) models with up to 3 dimensions, as recommended by the parallel analysis. We will later choose the most appropriate model empirically using the model comparison approaches described before. However, we simultaneously take a rationalist point of view and consider the content areas of the items (Table 1) to select a meaningful construct, and an appropriate model for that construct.
We compare three MIRT models, M2PL, M3PL and M4PL, each combined with the MGR model, with respect to their fit to the mixed-format data. The global fit statistics are reported in Table 3. Most of them support unidimensionality. They show that the M2PL + MGR model has smaller AIC, AICc, BIC and SABIC values than the other models.
The likelihood ratio (LR) test is applied to compare nested models. The resulting p-values of the LR tests for nested models are presented in Table 4. The block on the left-hand side of the table compares two different models, each having a specific dimension. The right-hand side block compares models with different dimensions within each of the M2PL + MGR, M3PL + MGR and M4PL + MGR models. These results show that a better model fit is produced with increased dimensionality and complexity of the model. We conclude, therefore, that the unidimensionality assumption does not hold. We now investigate which model (M2PL + MGR, M3PL + MGR or M4PL + MGR) is better for identifying the construct. Table 5 shows the structure of the construct for the 2- and 3-factor solutions. The groups are formed by the questions (items) having higher loading values in the same dimensions. A factor loading expresses the relationship of each item with the underlying factor. The items with the strongest association to a factor have a high loading on that factor. We have observed that the more complex models with more dimensions are better at grouping similar kinds of questions. Moreover, the factors representing the groups of items are more interpretable than the corresponding factors in the other models. So, the M4PL + MGR model is better at forming groups that are interpretable, and this result remains consistent with the comparison of models using the LR test. The LR test supports the 3-dimensional, more complex model (M4PL + MGR), and this model is good for identifying the underlying construct of the exam. Q1, Q3 and Q4 lie in the 1st dimension. This dimension represents the students' ability in basic statistics, which includes the interpretation of a box plot and the definition of the term "range". The 2nd dimension comprises Q5, Q6, Q7, Q11, Q12 and Q13.
This dimension evaluates the students' skill in solving contingency table and hypothesis testing problems. The 3rd construct comprises Q8, Q9, Q10, Q14, Q15, Q16 and Q2. This dimension indicates the students' ability in probability distributions, regression and the chi-square test. Q2, which is a basic statistics question, is also included in this construct. Its inclusion in this dimension is debatable. One possible reason could be that students have to perform some manual calculations (e.g., the calculation of the standard deviation for a given sample) to answer the question, which is also required for questions Q14, Q15 and Q16. The questions in black represent the 1st factor, those in red the 2nd factor and, similarly, those in green the 3rd factor.

Analysis using MIRT with between-item multidimensionality
Lastly, using the dimension information from the previous MIRT analysis, we now run the M4PL + MGR model with 3 factors and between-item multidimensionality. This model offers the possibility of a simpler interpretation and avoids the issue of paradoxical results. Further, between-item multidimensionality makes it possible to examine how the M4PL + MGR item response curves characterize the items compared to the unidimensional 3PL + GR model. The graphs of the item response curves of both models for each item are shown in Figures 3 to 6. The item characteristic curves (ICC) of Q1, Q3 and Q4 (factor 1; see Figure 3) in the multidimensional model with between-item multidimensionality are steeper than in the unidimensional model. Q3 and Q4 have an upper asymptote parameter smaller than 1 in the M4PL + MGR, while Q1 has an upper asymptote parameter close to 1 in the M4PL + MGR, corresponding to the 3PL + GR. Questions Q5, Q6, Q7, Q11, Q12 and Q13 (factor 2; see Figure 4) have approximately the same ICC in both models, especially at ability levels between -2 and 2. Slight differences in the guessing parameters are visible for Q5, Q6, Q11 and Q13. While for factor 3 (see Figure 5) some ICCs differ somewhat due to an upper asymptote estimate < 1 in the M4PL + MGR (Q8, Q9, Q10 and Q14), the most striking difference between the ICCs for this factor is, again, that the discrimination parameter is higher in the multidimensional model. Also for the graded-response item Q16 (see Figure 6), all categories show steeper curves in the multidimensional model. As a whole, the ICCs from the between-item multidimensionality model have steeper slopes than those from the unidimensional model. This shows that the items in the MIRT model with between-item multidimensionality have more discrimination power.
The difficulty levels of all questions are almost the same in both models, except for Q1, Q3, Q5, Q6 and Q13, which have slightly different difficulty levels when comparing the ICCs of the MIRT model and the UIRT model.

Simulation study
As observed in the case study in Sec. 4, the estimates of the items' discrimination parameters are larger for MIRT than for UIRT. We investigate whether this observation is generally valid and carry out a simulation study using a Monte Carlo approach. Motivated by the situation in the case study, mixed-format data of 16 items have been generated for 300, 500 and 1000 examinees using a unidimensional and a multidimensional model. To avoid the complexity of varying degrees of multidimensionality and to keep the simulation study simple, we have opted to investigate a two-dimensional model with between-item multidimensionality as a representative of the multidimensional models. Further, we considered 2PL models in all cases, as we are not interested here in the guessing and upper asymptote parameters. In the multidimensional case we have assumed the situation in the first column of Table 5, i.e., the 2nd dimension consists of items Q11 and Q12 and the 1st dimension of the other 14 items. In the two-dimensional model, we used the estimated covariance matrix (1, 0.78; 0.78, 1) from the original data as the covariance for the simulation. This means that the two latent abilities θ_1 and θ_2 are correlated with a correlation of 0.78. As true parameter values for the simulation, we used the estimated values obtained in the respective models, the 2PL + GR and the M2PL + MGR, in our case study. For data generation we used the two described models, and for data analysis we used both the 2PL + GR and the M2PL + MGR model in both cases, i.e., in each case we analyzed the data both with the correct model and with the wrong model. The design of the simulation study, including the data generation process and the data analysis method, is illustrated in Figure 7. Data were generated and analyzed with the R package mirt. For each data generation model and each considered sample size (300, 500, and 1000 students), 1000 simulation runs (repetitions) were done.
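The two-dimensional data generation step can be sketched as follows (a simplified dichotomous-only sketch with hypothetical item parameters; the study itself uses the estimates from the case study and includes the graded-response item):

```python
import numpy as np

def simulate_between_item(N, a, b, dim, cov, seed=0):
    """Generate dichotomous responses from a between-item M2PL model:
    item i loads only on dimension dim[i]; abilities are multivariate
    normal with the given covariance matrix."""
    rng = np.random.default_rng(seed)
    theta = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=N)
    # Each item uses the ability of its own dimension only
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, dim] - b)))   # shape (N, n_items)
    return (rng.random(p.shape) < p).astype(int)

# 16 hypothetical items; items 11 and 12 (indices 10, 11) load on dimension 2
a = np.full(16, 1.2)
b = np.linspace(-1.5, 1.5, 16)
dim = np.zeros(16, dtype=int); dim[[10, 11]] = 1
cov = np.array([[1.0, 0.78], [0.78, 1.0]])   # correlation 0.78 as in the study
U = simulate_between_item(500, a, b, dim, cov)
```

Each simulated data set of this form would then be fitted with both the correct two-dimensional model and the wrong unidimensional model, mirroring the design in Figure 7.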
The estimated discrimination parameters were compared with the true parameters used in the data generation. The root mean square error (RMSE), absolute bias (AB) and bias were computed for each item across the 1000 simulation runs. The mean and the standard deviation of the RMSE, AB and bias over the 16 items were then computed and are shown in Table 6.
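The three recovery measures are standard; as a sketch (with hypothetical estimates), for one parameter with true value t and estimates e_1, ..., e_R over R replications:

```python
import numpy as np

def recovery_stats(estimates, true_value):
    """RMSE, absolute bias (AB) and bias of a parameter over replications:
    RMSE = sqrt(mean((e - t)^2)), AB = mean(|e - t|), Bias = mean(e - t)."""
    err = np.asarray(estimates, dtype=float) - true_value
    return {"RMSE": float(np.sqrt(np.mean(err ** 2))),
            "AB": float(np.mean(np.abs(err))),
            "Bias": float(np.mean(err))}

# Hypothetical estimates of a discrimination parameter with true value 1.0
stats = recovery_stats([1.1, 0.9, 1.2, 0.8], true_value=1.0)
```

Note that the bias can be zero while RMSE and AB are not, which is why all three measures are reported for each item.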
In the 1st block of Table 6, data were generated from the unidimensional 2PL + GR model and analyzed with the 2PL + GR and the confirmatory M2PL + MGR. The estimated discrimination parameters of all items are positively biased, and the bias is reduced as the sample size increases. For each sample size, the RMSE, AB and bias values of the M2PL + MGR model are higher than those of the 2PL + GR, i.e., we obtain, as expected, better values when estimating with the correct model than with the overspecified two-dimensional model.
In the 2nd block of Table 6, data were generated from the two-dimensional confirmatory M2PL + MGR model and analyzed with the 2PL + GR and the confirmatory M2PL + MGR. When analyzed with the wrong unidimensional model, the estimated discrimination parameters of the items are negatively biased. The absolute value of the bias decreases as the sample size increases. When analyzed with the correct two-dimensional model, the estimated discrimination parameters of the items are positively biased, with the bias decreasing as the sample size increases. For each sample size, the RMSE, AB and the absolute value of the mean bias are higher for the 2PL + GR model than for the M2PL + MGR, i.e., we again obtain better values when estimating with the correct model than with the wrong unidimensional model. In both the 1st and the 2nd block of Table 6, the bias for each sample size is lower for the unidimensional model than for the multidimensional model. This highlights that we obtain higher estimated discrimination parameters in the multidimensional case than in the unidimensional case. Hence, this simulation study supports the finding of our case study that items have a higher discrimination capability in the multidimensional model than in the unidimensional model. Why is this the case? When using MIRT and considering a specific item, we first obtain a discrimination parameter estimate for each dimension. We then focus on the dimension with the highest discrimination capacity. This means that the discrimination estimate we consider in the MIRT case is the maximum of several estimates. It is the value which seems to belong to the ability dimension connected to this item. Therefore, the estimate based on MIRT is larger than that based on UIRT, no matter which of the models is the true one.
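The maximum-of-estimates explanation can be illustrated numerically: even when each per-dimension estimate is unbiased, the maximum of several noisy estimates is systematically above the true value (a toy sketch with independent normal estimation noise; the values are hypothetical and not taken from the study).

```python
import numpy as np

rng = np.random.default_rng(42)
true_a = 1.0

# Two noisy, individually unbiased estimates of the same discrimination
# (one per dimension); the MIRT analysis focuses on the larger of the two.
est = true_a + 0.2 * rng.standard_normal((10000, 2))
mean_single = est[:, 0].mean()     # close to true_a on average
mean_max = est.max(axis=1).mean()  # systematically above true_a
```

The gap between the mean of the maxima and the true value shrinks as the estimation noise shrinks, which is consistent with the positive bias in Table 6 decreasing as the sample size grows.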
We have focused so far on the discrimination parameter. For the difficulty parameter, the RMSE, AB and bias values (Table 7) show only small differences between the unidimensional and the multidimensional model. This indicates that the degree of multidimensionality has little effect on the estimation of the difficulty parameters, which confirms the findings of Childs and Oppler (1999) and Yang (2007).
In an actual study, in contrast to this simulation study, we do not know whether the true model is unidimensional or multidimensional. Applying the multidimensional model, even when the true model is unidimensional, is safer since the items then discriminate better between students and the estimated discrimination parameters are close to the true parameters, especially for large sample sizes, as the simulation results suggest. Furthermore, the model fit criterion AIC is only slightly higher when the wrong multidimensional model is fitted to unidimensional data (1st block in Table 6). On the other hand, if we apply the wrong unidimensional model in the multidimensional case (2nd block in Table 6), the AIC increases by a larger amount, and the negative bias of the estimated discrimination parameters is not reduced even when the sample size is increased to 1000. We conclude from these results that violating the unidimensionality assumption may seriously bias the ability and item parameter estimates, in accordance with the statements of Reckase (1979) and Harrison (1986).
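The AIC comparison works as follows in principle; the log-likelihood values and parameter counts in this sketch are purely hypothetical placeholders, not the values behind Table 6.

```python
def aic(log_lik, n_params):
    """Akaike information criterion; lower values indicate better fit."""
    return 2 * n_params - 2 * log_lik

# Hypothetical fits of both models to the same data set
aic_uni = aic(log_lik=-5210.0, n_params=40)    # unidimensional 2PL + GR
aic_multi = aic(log_lik=-5195.0, n_params=55)  # two-dimensional M2PL + MGR
```

The multidimensional model pays a fixed penalty for its extra parameters, so a small AIC disadvantage is expected even when it is overspecified; a large AIC increase for the unidimensional model, by contrast, signals that the unidimensionality assumption is violated.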

Discussion and conclusion
In some situations, it is possible to build the model by first conducting an exploratory factor analysis and then, with new data, confirming this model. While confirmation with new data would lead to strong certainty about the proposed model, we do not have this possibility here: the data set is too small to be split into subgroups for exploration and confirmation, and, due to the nature of examinations in higher education, the exam will not be repeated with the same questions for new students, so no new data with the same items will become available in the future. In order to obtain a meaningful model in a situation like ours, we recommend choosing the model using purely data-driven criteria in combination with knowledge about groups of items suspected to belong to the same factor. Models with a factor structure corresponding to the areas in Table 1 are preferable. Nevertheless, it is important to carefully consider unexpected results and to reflect on possible explanations; in our case study, the placement of item Q2 in another factor than its group Q1 to Q4 could be explained. This approach aims to facilitate a meaningful interpretation of our model, but we point out that the resulting model still needs to be seen as exploratory.
Instead of performing a confirmatory analysis, which is difficult or impossible here, we propose conducting analyses similar to ours for several exams from different occasions (i.e., with different questions) and looking out for results which are supported by several different analyses. Based on interpretations supported by independent analyses, exams in higher education can be improved; for example, low-discriminating items, or items which discriminate in an ability range which is not of interest for the exam, can be avoided. The sample size in the case study appears low in the context of the applied models (e.g., the 4PL), for which recommendations usually require higher numbers. However, this sample size is common in higher education. The approach described above tries to gain insights from the analysis in this challenging situation, while acknowledging that the results need to be taken with care.
As we have seen in the case study and the simulation study, the estimates of the discrimination parameter were higher when an MIRT model was used than with a UIRT model. This effect, which we explained in Sec. 5, has not received much attention in the literature yet. The estimate based on MIRT is larger than the one based on UIRT, no matter which of the models is the true one. This is important to keep in mind when working with both types of models: when one is used to a certain range of discrimination parameter estimates from UIRT analyses, one can expect a higher range when the analysis model is switched to MIRT.
An implicit assumption we made in this article is that highly discriminating items are preferable. In a situation like the one considered here, an achievement test in higher education, it is intuitively reasonable to strive for a set of several well-discriminating items with difficulties spread over a range of interest. In general, however, it is not clear whether high discrimination is preferable; Buyske (2005) mentions reasons to use low-discriminating items at the beginning of a computerized adaptive test. To investigate a general situation in detail, we need to explicitly define the objective. Optimal design methods can then formally determine the optimal item parameters, see e.g., Berger and Wong (2005).
We demonstrated in our simulation study that if it is unknown whether the UIRT or the MIRT model is the underlying one, it is safer to use the MIRT model: when the MIRT model is used for the analysis, the estimated discrimination parameters are quite close to the true parameters, especially for large sample sizes.
Based on our analysis of real data and on simulations with realistic parameter choices, we pointed out advantages of MIRT over UIRT. How our conclusions depend on the parameter values, e.g., on the correlation between the abilities, was not our focus here and should be investigated in future research.

Data availability
The data that support the findings of this study are available from the corresponding author (Ul Hassan) upon reasonable request.