Adjustment for guessing in a basic statistics test for Indonesian undergraduate psychology students using the Rasch model

Abstract The purposes of this study were to (1) calibrate the Basic Statistics Test for Indonesian undergraduate psychology students using the Rasch model, (2) test the impact of adjustment for guessing on item parameters, person parameters, test reliability, and the distributions of item difficulty and person ability, and (3) compare person scores after adjustment for guessing with the original analysis results. Adjustment for guessing was performed using a Rasch model-based method with a tailored analysis. Participants were 488 undergraduate psychology students (221 male, 267 female; mean age = 18.86, SD = 0.49) in Indonesia. The Rasch analysis results show that the assumptions of unidimensionality and local independence of the Basic Statistics Test are met. However, two items do not fit the Rasch model in either the original or the tailored analysis. Comparing the original and tailored analyses shows that adjustment for guessing affects the item parameters, person parameters, reliability, and distributions of item difficulties and person abilities. With the university's minimum pass level (MPL) based on total scores (scores > 60), adjustment for guessing resulted in more respondents passing the MPL. Thus, adjustment for guessing is recommended when making pass-fail decisions based on a predetermined MPL. Implications and suggestions for future research are discussed.


PUBLIC INTEREST STATEMENT
The research reports on adjustment for guessing using data from the Basic Statistics Test for Indonesian Undergraduate Psychology Students. The adjustment was performed using a Rasch-based method. Rasch measurement is an advanced methodology and has advantages over other relevant methods because adjustments for guessing can be made without including a guessing parameter in the model. This paper first introduces the basic concept of guessing adjustment based on the Rasch model, termed tailored analysis. Next, we demonstrate the procedure empirically. Finally, we compare the psychometric characteristics of the original and tailored analyses. The comparison shows that the adjustment for guessing affects the item parameters, person parameters, reliability, and distributions of item difficulties and person abilities. At the end of the article, we provide conclusions and resources to consult for further learning.

Introduction
Indonesia is a vast archipelago of 275.1 million people spread across 34 provinces (Worldometers, 2021). This enormous geographic area hosts 4,621 higher education institutions, including 633 universities (Ministry of Research and Technology Republic of Indonesia, 2019). Of the many study programs offered by Indonesian higher education institutions, psychology is one of the most popular in the social sciences (Muluk et al., 2018). Psychology is also one of the most popular majors worldwide (Gielen et al., 2017). Psychology, as a science and profession, is experiencing a period of rapid development in Indonesia. In 1952, a few years after Indonesia claimed its independence in 1945, psychology was introduced to the nation through the scientific speech of a psychiatrist, Prof. Dr. Slamet Iman Santoso, on the importance of psychology and the role of psychologists in the implementation of a selection process for the identification and placement of qualified workers (Sarwono, 2004). Indonesia now has 120 universities offering psychology study programs at the undergraduate, postgraduate, and doctoral levels (Muluk et al., 2018).
Statistics is a compulsory subject in psychology study programs in Indonesia. Generally, statistics courses are offered for three to nine credits. There is no agreed-upon psychology curriculum in Indonesia: each university offers a different number of statistics courses, and some offer advanced statistics courses for undergraduate psychology students (e.g., Brawijaya University, 2020; Syarif Hidayatullah State Islamic University Jakarta, 2018). The statistics courses offered at each university depend on the availability and expertise of lecturers. Moreover, because statistics is a compulsory subject, it is a prerequisite for subsequent courses; failing it can seriously delay a student's completion of their studies.
In general, Indonesia's student assessment system consists of three components: assignments, midterm examinations, and final semester exams. Although passing criteria differ among universities, a minimum pass level (MPL) based on a total score of 60 is typical (Syarif Hidayatullah State Islamic University Jakarta, 2018). The final semester exam determines whether a student passes a course. It is comprehensive and generally uses multiple-choice (MC) questions, with the MPL applied to a total score computed under the classical test theory (CTT) approach.
Although MC is quite popular and easy to use, there are factors beyond person ability that impact the ability estimate, one of which is guessing (Paek, 2015). If guessing is not adjusted or corrected for, then the estimates of person ability (Andrich & Marais, 2014), item statistics, and test reliability will not be accurate (Paek, 2015). In turn, pass-fail decisions are also affected by guessing. In this context, guessing refers to having no construct-relevant basis for preferring one response over another, known as the random-guessing assumption (Lord, 1980). Importantly, guessing is not a property of the test but a property of persons (Andrich & Marais, 2019; Wright & Stone, 1999).
In the context of guessing behavior, there are at least three solutions to this problem: correction for guessing with classical test theory (e.g., Cureton, 1966); the application of modern test theory to overcome the limitations of CTT, such as the IRT 3-PL model, which includes a guessing parameter in the model (e.g., Chiu & Camili, 2013; Von Davier, 2009); and adjustment for guessing based on the Rasch measurement model. Various adjustment procedures using the Rasch model have been developed, including tailored analysis (Andrich & Marais, 2014), cut-off of responses with low expectations (CUTLO; Royal & O'Neill, 2011), guessing adjustment (Waterbury & DeMars, 2019), the tailored data method (Wyse, 2016), and response pattern analysis (Wright & Stone, 1999).
Correction of guessing with CTT is often not an option because of the several limitations inherent in this theory. Modern measurement theories such as the IRT or Rasch models have been developed to overcome the limitations of the CTT (Boone & Staver, 2020). However, the IRT 3-PL model, which includes the guessing parameter, does not have sufficient statistics to estimate the guessing parameter (Lord, 1980) and would be difficult to use with the raw score-based passing criteria. In addition, the passing criteria using the raw score are not easy to change, considering that these criteria are easily understood and applied by university policymakers. The Rasch model (Rasch, 1960) is an option, considering that this model has statistical sufficiency and allows the transformation of the total score into an interval scale (Wright, 1977) so that the passing criteria based on the total score can still be used.
The Rasch model, which includes only one item parameter, namely item difficulty, is often considered unable to correct for guessing because it does not include a guessing parameter (Smith, 1993). In the view of the Rasch model's developers, guessing is not an item parameter: a test taker does not guess on an item because the item asks them to guess (Wright & Stone, 1999). In addition, if guessing and discrimination parameters are included as item parameters, as in the IRT 3-PL model, the model no longer has sufficiency, specific objectivity, and parameter separation (Andrich & Marais, 2019). Although it has a simple mathematical formula, the Rasch measurement model fulfills the basic requirements of fundamental measurement (Mair, 2018). The Rasch measurement model offers several alternative methods for adjusting for guessing, some of which were developed long ago (Choppin, 1974, 1975, 1983). For example, several studies have modified the Rasch model to account for guessing to produce accurate equating (Weitzman, 1996). Experts have intensively developed research on guessing adjustment with the Rasch model at the University of Western Australia (i.e., Andrich & Marais, 2014; Andrich et al., 2012, 2016; Humphry, 2015; Marais, 2015). This adjustment method using the Rasch measurement approach formalizes a procedure previously developed by Waller (1976). Other researchers have found that the Rasch model's statistical fit can function well for detecting guessing behavior (Holster & Lake, 2016). One study found that when responses likely to be guesses are corrected, the items' estimated difficulty levels change (Lin, 2018). The application of guessing adjustment to TIMSS 2015 data in South Africa found that the respondents' estimated abilities increased (Bansilal et al., 2019).
These studies show that research on adjustment for guessing with the Rasch model has developed intensively in the last few decades.
This study is related to adjustment for guessing using the Basic Statistics Test for undergraduate psychology students. This test measures the competencies that students must master to pass the Statistics 1 course, a prerequisite for taking Statistics 2 and Statistics 3 courses. Specifically, this study aims to: (1) calibrate the Basic Statistics Test for Indonesian undergraduate psychology students with the Rasch model, (2) make adjustment for guessing using the method developed by Choppin (1983) and Waller (1976), which was then formalized by Andrich et al. (2012), and (3) compare the results of the original raw score analysis and the results of adjustment for guessing with minimal passing criteria.

Basic statistics test for psychology students
The Basic Statistics Test for Psychology Students (TKSD-P; Tes Kemampuan Statistika Dasar untuk Mahasiswa Psikologi) is a test developed to measure basic statistical competencies of undergraduate psychology students. The competence measured by this test refers to the course outline and teaching materials, covering six topics: the definition of statistics; levels of measurement; measures of centrality, variability, and relationship; scale transformation; and the basics of linear regression. The materials were taught during 14 weeks of lectures. In developing this instrument, we assume that it is unidimensional and bipolar, where the extreme right indicates high statistical competence and the extreme left indicates low statistical competence. The test is a 25-item MC test with four answer options.
The passing criteria for this test are determined based on university policy, wherein a total score of 60 is a passing score, with the following passing categories: 60-69 for the minimum category (grade C), 70-79 for the good category (grade B), and 80-100 for the excellent category (grade A; Syarif Hidayatullah State Islamic University Jakarta, 2018). Students need to answer 15 of the 25 questions correctly to obtain a total score of 60.

Participants
This study comprised 488 respondents (221 men and 267 women) who were students majoring in psychology at four universities located in the Jakarta metropolitan area of Indonesia. Their age range was 18-20 years (mean = 18.86, SD = 0.49). The respondents were students who had taken Statistics 1 or an equivalent course. Data collection was carried out in a lecture setting using a paper-and-pencil format. Respondents also filled out an informed consent form, including notification that this research would be published while maintaining the confidentiality of respondents' identities.

The Rasch model
In the field of social measurement, there are three traditions of measurement: test-score tradition such as CTT, scaling tradition such as Rasch measurement theory (RMT), and structural tradition such as structural equation modeling (Engelhard & Wind, 2021). The motivation for RMT is that within a frame of reference, the comparison of persons and comparisons of items are invariant with respect to different subsets of items and persons, respectively (Andrich & Marais, 2019). However, the test-score tradition and structural tradition do not possess this unique property.
The Rasch model was developed by Rasch (1960) to analyze dichotomous data (1 = correct, 0 = incorrect). As developed by a group of experts at the University of Chicago, the Rasch model can now also be applied to polytomous data (Suryadi et al., 2020). Various models have been developed under RMT as a family of measurement models that share the property of specific objectivity (Engelhard & Wind, 2021; Masters & Wright, 1984). However, like other latent trait models, the Rasch model in its early development was not free from criticism (see Divgi, 1986; Goldstein, 1980, 2015), because other models have less restrictive assumptions (e.g., IRT 3-PL) than the Rasch model and are considered more realistic for describing the data. Nevertheless, recent literature and a scientometric review concluded that Rasch measurement is an influential psychometric approach in psychology research (Aryadoust et al., 2019; Edelsbrunner & Dablander, 2019).
Among psychometricians, the Rasch model can be viewed from either a model-based or a data-based tradition (Maydeu-Olivares & Montano, 2013), which differ philosophically. The dichotomous Rasch model is considered equivalent to IRT 1-PL under the data-based tradition. Under this tradition, if the model does not fit the data, the model is modified, for example, by adding an item parameter. The model-based tradition has been embraced by Georg Rasch and Rasch measurement developers (Wright, 1968). Under this tradition, if the data do not fit the model, the misfitting data are excluded from the analysis (Gershon et al., 1994; Linacre, 2010).
In contrast to the Rasch measurement model, which includes only an item difficulty parameter, IRT with the 3-PL model includes a pseudo-guessing parameter in addition to the item difficulty and discrimination parameters (see Chiu & Camili, 2013 for an overview). The Rasch model for dichotomous data has the following basic equation (De Ayala, 2009):

P(X_j = 1 | θ, δ_j) = exp(θ − δ_j) / [1 + exp(θ − δ_j)],

where P(X_j = 1 | θ, δ_j) is the probability of answering item j correctly (score 1), θ is the test taker's ability, and δ_j is the difficulty of item j. Item difficulty and person ability are expressed on a logit scale.
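As an illustrative sketch (not part of the original study's software, which used Winsteps), this equation can be expressed in a few lines of Python; the function name is ours:

```python
import math

def rasch_probability(theta: float, delta: float) -> float:
    """P(X_j = 1): probability of a correct response under the
    dichotomous Rasch model, exp(theta - delta) / (1 + exp(theta - delta))."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

# A person whose ability equals the item difficulty has a 50% chance:
print(round(rasch_probability(0.0, 0.0), 3))   # 0.5
# An item 2 logits harder than the person's ability: about a 12% chance.
print(round(rasch_probability(0.0, 2.0), 3))   # 0.119
```

Because only the difference θ − δ enters the formula, persons and items are located on the same logit scale.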
In using the Rasch model, several assumptions need to be fulfilled: (1) unidimensionality, meaning the test measures only one trait; (2) local independence, meaning a test taker's response to one item must be statistically independent of their responses to other items; and (3) parallel item characteristic curves (ICCs), meaning that all items have the same discriminating power (Mair, 2018). In this study, the assumptions of unidimensionality and local independence were tested. ICC parallelism was not tested because it is built into the basic formula of the Rasch model, which fixes a common slope for all items.
In the Rasch measurement model, the fit indices used are mean square-based statistics called the infit (information-weighted) and outfit (unweighted, outlier-sensitive) mean squares (Wright & Stone, 1979). The expected value of both statistics is 1, and values in the range 0.5-1.5 indicate that the data fit the Rasch model (Boone & Staver, 2020). Meanwhile, the reliability information of the Rasch model is reported in the form of the person separation reliability (PSR) and the separation index (Wright & Stone, 1979). Values above 0.70 and 1.50, respectively, indicate acceptable internal consistency and that the test functions well in separating test takers (Tennant & Conaghan, 2007).
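As a minimal sketch of how these fit statistics are defined (standard mean-square formulas; the function and variable names are ours, and this is not the exact Winsteps implementation):

```python
import math

def rasch_p(theta: float, delta: float) -> float:
    """Rasch probability of a correct response."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def item_fit(responses, thetas, delta):
    """Infit and outfit mean squares for one item.
    Outfit: unweighted mean of squared standardized residuals
    (sensitive to outliers). Infit: squared residuals weighted by
    the model variance p * (1 - p) (information-weighted)."""
    ps = [rasch_p(t, delta) for t in thetas]
    variances = [p * (1.0 - p) for p in ps]
    sq_residuals = [(x - p) ** 2 for x, p in zip(responses, ps)]
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(responses)
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit

# Two persons at theta = 0 on an item of difficulty 0, one right and one
# wrong, match the model expectation exactly, so both statistics equal 1.
print(item_fit([1, 0], [0.0, 0.0], 0.0))
```

Values well above 1 (underfit) flag unexpected responses such as lucky guesses on hard items; values well below 1 (overfit) flag responses that are too predictable.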

Correction for guessing based on the Rasch model
Correction for guessing using the Rasch model approach has developed rapidly over the past decade, as evident in the number of publications focusing on this method over the last five years (e.g., Bansilal et al., 2019; Waterbury & DeMars, 2019; Wyse, 2016). This procedure follows the methods proposed by Choppin (1983) and Waller (1976, 1989) and formalized by Andrich et al. (2012); standard Rasch analysis programs (e.g., RUMM2030, Winsteps) already have features that allow researchers to perform this type of analysis (see Linacre, 2018).
The tailored analysis procedure uses the probability of correctly answering an item as a criterion (Andrich et al., 2012). Several thresholds have been proposed for this procedure (Waterbury & DeMars, 2019). One of them is the −2 logit criterion: if a respondent correctly answers an item two logits above their ability, that is, an item with only about a 12% chance of being answered correctly, the answer is treated as missing data (Royal & O'Neill, 2011). Another criterion is −0.847 logits, whereby an unexpected correct response to an item with less than a 30% chance of success (an item more than 0.847 logits above the respondent's ability) is treated as missing data (Andrich et al., 2012, 2016). This study used the −0.847 logit (30%) probability threshold for the tailored analysis. As an illustration of the tailored analysis, we use three respondents' response patterns as an example, as shown in Table 1. In the original analysis (i.e., the initial item calibration for the original data), the respondents answered all 25 items (complete cases). By applying tailored analysis, any response by a person whose chance of answering the item correctly is below 30% is treated as missing data, regardless of whether the response is right or wrong, and the data are then recalibrated (Andrich et al., 2012). As can be seen, the tailored analysis resulted in item difficulties that differed from the original analysis.
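The masking step of the tailored analysis can be sketched as follows (an illustrative re-implementation of the paper's 30% criterion, not the RUMM2030 or Winsteps code; the names are ours):

```python
import math

def rasch_p(theta: float, delta: float) -> float:
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def tailor(responses, theta, deltas, threshold=0.30):
    """Treat responses to items whose model probability of success is
    below the threshold as missing (None). A 30% threshold corresponds
    to an item about 0.847 logits above the person's ability, since
    ln(0.30 / 0.70) = -0.847."""
    return [x if rasch_p(theta, d) >= threshold else None
            for x, d in zip(responses, deltas)]

# Person at theta = 0: the third item is 2 logits above their ability
# (P = .12 < .30), so its response becomes missing whether right or wrong.
print(tailor([1, 0, 1], theta=0.0, deltas=[-1.0, 0.5, 2.0]))   # [1, 0, None]
```

In the actual procedure, the masked data matrix is then recalibrated with the Rasch model.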

Data analysis procedure
The dichotomous Rasch model was employed to analyze the data using the Winsteps program with the joint maximum likelihood estimation (JMLE) method. The results of the analysis include the following: (1) tests of the unidimensionality and local independence assumptions; (2) a comparison of the item and person parameters between the original and tailored analyses; (3) a comparison of the proportion of persons passing the minimum score of 60 in the original and tailored analyses; (4) a graphical presentation with Wright maps; and (5) a comparison of the test information functions of the original and tailored analyses.
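For intuition about what JMLE does, a bare-bones routine for the dichotomous Rasch model might look like the sketch below (our illustration; Winsteps' actual implementation adds bias correction, convergence criteria, and handling of extreme scores):

```python
import numpy as np

def jmle(data, n_iter=100):
    """Minimal JMLE for the dichotomous Rasch model: alternate damped
    Newton-Raphson updates of person abilities and item difficulties,
    anchoring the mean item difficulty at 0 for identification.
    Assumes no person or item has an all-correct or all-wrong score."""
    data = np.asarray(data, dtype=float)
    theta = np.zeros(data.shape[0])   # person abilities (logits)
    delta = np.zeros(data.shape[1])   # item difficulties (logits)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
        info = p * (1.0 - p)          # Fisher information per response
        # Newton step for each person, damped to 1 logit for stability
        theta += np.clip((data - p).sum(axis=1) / info.sum(axis=1), -1, 1)
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
        info = p * (1.0 - p)
        delta -= np.clip((data - p).sum(axis=0) / info.sum(axis=0), -1, 1)
        delta -= delta.mean()         # fix the logit scale's origin
    return theta, delta
```

Because each person's raw score is a sufficient statistic for ability, the estimation reduces to matching observed and expected score totals for every person and item.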

Dimensionality and local independence
In this study, the Rasch model's unidimensionality assumption was tested using the principal component analysis of residuals (PCAR) method (Linacre, 1998). Under PCAR, variance explained by the measure greater than 40% (Holster & Lake, 2016) and an eigenvalue lower than 2.0 for the minor dimension (first contrast) indicate unidimensionality (Raiche, 2005). In the present study, the raw variance explained by the measure was 17.5 eigenvalue units (41.1%) in the original analysis and 11.0 eigenvalue units (30.6%) in the tailored analysis. The eigenvalue of the minor dimension was 1.5 for the original analysis and 1.6 for the tailored analysis. The hypothesis of unidimensionality was therefore supported based on the evidence from the minor dimension. After unidimensionality was established, successful implementation of Rasch measurement requires items that approximate local independence, whereby a person's responses to different test items are statistically independent of each other (Wright, 1996). In this study, the researchers used the standardized residual correlation between each item pair (Linacre, 1998), with values lower than 0.30 indicating the absence of local dependence (Saggino et al., 2020). No items of the Basic Statistics Test showed local dependence. Based on this evidence, the hypothesis of local independence of items was accepted.
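The PCAR logic can be roughly sketched as follows (our simplified re-implementation, not Winsteps' exact computation): standardize the residuals against the Rasch expectations, then inspect the largest eigenvalue of their item-by-item correlation matrix.

```python
import numpy as np

def pcar_first_contrast(data, thetas, deltas):
    """Largest eigenvalue of the item-by-item correlation matrix of
    standardized Rasch residuals. Values below about 2.0 for the first
    contrast are commonly read as supporting unidimensionality."""
    theta = np.asarray(thetas, dtype=float)[:, None]
    delta = np.asarray(deltas, dtype=float)[None, :]
    p = 1.0 / (1.0 + np.exp(-(theta - delta)))            # model probabilities
    z = (np.asarray(data, dtype=float) - p) / np.sqrt(p * (1.0 - p))
    corr = np.corrcoef(z, rowvar=False)                   # residual correlations
    return float(np.max(np.linalg.eigvalsh(corr)))
```

For data simulated from a unidimensional Rasch model, this value stays well below 2.0; a substantial secondary dimension pushes it upward.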

Item parameter estimates and fit statistics
Item calibration with the Rasch model for the original and tailored analyses produced items ordered from the highest to the lowest difficulty level. A comparison of the item calibration results and item fit statistics is presented in Table 2, which shows that the item statistics differed from the original analysis results after guessing was corrected with the tailored analysis procedure. The item order based on difficulty also changed, as did the item fit statistics. The original analysis yielded two misfitting items (item 22 with δ = 1.73 and item 9 with δ = −3.12), and two items also misfit in the tailored analysis (item 12 with δ = −1.69 and item 9 with δ = −3.61). However, one of the items that did not fit the Rasch model in the tailored analysis differed from the misfitting items in the original analysis. The standard deviation of the item difficulties and the mean infit and outfit statistics also changed, although these changes were relatively small.
A review of the misfitting items shows that item 9 has the lowest difficulty in both the original and tailored analyses. The chance that item 9 was answered correctly by guessing was therefore minimal. Meanwhile, item 22 in the original analysis is classified as a difficult item with δ = 1.73 and underfits (Outfit MNSQ > 1.5). When the guessing adjustment was made in the tailored analysis, this item came to fit the Rasch model, which shows that it was affected by guessing in the original analysis. Meanwhile, item 12, which fit the Rasch model in the original analysis, became underfitting in the tailored analysis. With an item difficulty of −1.69, this item was relatively easy. Its underfit in the tailored analysis was very likely related to high-expectation unexpected responses (carelessness), which were beyond the scope of this study.

Person ability estimates and reliability
After describing the impact of guessing correction on item parameters in the previous section, it is necessary to know whether the correction to guessing using the tailored analysis procedure also affects the person ability parameter, reliabilities, and proportion of students who pass the predefined criteria. The impact in question can be explored using the information presented in Table 3.
Table 3 shows that the mean person ability was 0.49 in the original analysis and 0.43 in the tailored analysis, slightly lower. The SD of person ability was 1.16 in the original analysis and 1.49 in the tailored analysis, noticeably larger. Correction for guessing thus results in a wider person distribution with a slightly lower mean ability. The wider distribution of person abilities in the tailored analysis was caused by an increase in the person measure estimates of the high-ability group and a decrease in those of the low-ability group.
Another statistic affected by the correction for guessing is person reliability. When the guessing correction procedure was applied, reliability increased. This increase occurs because the distribution of person abilities (mean and variance) also changes, resulting in a mean of 0.43, which is closer to the mean item difficulty, indicating better test-person targeting. Another finding relates to the number of respondents who met the predetermined passing criterion, namely, a total score of 60. In the original analysis, 29% of the participants scored 60 or above. After adjustment for guessing, 35% of the participants scored above 60, a marked increase in the percentage of persons passing the criterion under the tailored analysis. The comparison of mean person ability between the original and tailored analyses showed a statistically significant difference, t(487) = 3.59, p < 0.01 (mean difference = 1.03; SD = 6.32; SE of the mean = 0.29). Although the difference in mean person ability between the two analyses was statistically significant, the person scores from the original analysis were strongly positively correlated with the tailored analysis scores (r = 0.92, p < 0.01). This indicates that the correction for guessing had little impact on the rank ordering of person abilities.
In sum, correction for guessing by applying tailored analysis resulted in different person ability estimates, wider person distributions, and changed measurement precision. Appendix A provides a full ordinal-to-interval conversion table to maintain compatibility with the original passing criteria.

Wright map
In addition to the item parameter estimates described in the previous section, the Rasch analysis also provides an overview of the relationship between person ability and item difficulty, which can be compared simultaneously using the Wright map (Wilson & Draney, 2002). The Wright maps of the original and tailored analyses are shown side by side in Figure 1. In both maps, the mean item difficulty is 0 and serves as the arbitrary origin of the logit scale. As shown in Figure 1, the mean person ability is slightly above the mean item difficulty of the Basic Statistics Test in both the original and tailored analyses; visually, the mean in the tailored analysis differs little from the original analysis. The difference lies in the item difficulty distribution, which is wider in the tailored analysis, as also reflected in its larger standard deviation. Thus, as shown in Figure 1, the easiest items become easier and the hardest items become harder in parallel. This implicitly illustrates that the proportion of students who pass the minimum pass level also changes.

Test information function
The test information function (TIF) is the amount of information a test provides at each ability level, with a corresponding standard error (De Ayala, 2009). The TIFs of the original and tailored analyses are shown in Figure 2. For the original analysis, the curve's peak lies in the theta range of 0.14 to 0.35, with an information value of 3.93. After the guessing adjustment in the tailored analysis, the curve's peak lies in the theta range of −0.56 to 0.00, with an information value of 3.38. This means that the Basic Statistics Test provides optimal information for measuring persons with moderate statistical competency and is not optimal for persons with very high or very low statistical competency. The TIF value at the peak of the curve indicates the smallest measurement error. Although the TIF curve of the tailored analysis has a flatter peak, the information values for persons with very high and very low ability are slightly higher than in the original analysis (see Figure 2).
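For the dichotomous Rasch model, the TIF is simply the sum over items of p(1 − p), and the standard error of measurement at an ability level is 1 over the square root of the information. A brief sketch (our illustration; function names are ours):

```python
import math

def rasch_p(theta: float, delta: float) -> float:
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def tif(theta, deltas):
    """Test information at ability theta: sum of p * (1 - p) over items,
    largest where the item difficulties cluster around theta."""
    return sum(p * (1.0 - p) for p in (rasch_p(theta, d) for d in deltas))

deltas = [-1.0, 0.0, 1.0]
# Information peaks near the center of the difficulty distribution...
print(tif(0.0, deltas) > tif(2.0, deltas))   # True
# ...and the standard error there is 1 / sqrt(information).
print(round(1.0 / math.sqrt(tif(0.0, deltas)), 2))   # 1.25
```

This is why spreading item difficulties more widely, as the tailored analysis does, flattens the information curve while raising it at the extremes.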

Discussion
This study investigated the psychometric characteristics of the Basic Statistics Test for Indonesian undergraduate psychology students and adjustment for guessing by means of a tailored analysis using the Rasch model. The results of the Rasch analysis show that the assumptions of unidimensionality and local independence are met. However, two items did not fit the Rasch model. The results can be used to revise these two items or to consider removing them. Finally, this study shows why guessing adjustment through tailored analysis is recommended at the piloting stage of an instrument, especially if the measurement results will be used for policymaking regarding pass-fail decisions about test takers.

Figure 1. Wright maps of the original (left panel) and tailored (right panel) analyses
After adjusting for guessing with the tailored analysis, we found changes in item difficulty, where difficult items became harder and easy items became easier. This finding is in line with previous research, which found changes in item difficulty when guessed responses were corrected (Andrich & Marais, 2018). When guessed responses are not corrected, item difficulty is underestimated (Bansilal et al., 2019; Lin, 2018). Further, tailored analysis, which involves a larger proportion of missing data, affected all item statistics (Waterbury, 2019). This finding is also related to the change in person ability, which tends to be lower than when guessed responses are corrected (Andrich et al., 2012). The tailored analysis also provides unbiased item difficulty estimates (Andrich et al., 2016). Guessing corrections improve person ability estimates, especially for persons with high abilities (Prihoda et al., 2008).
Another finding with practical implications is the effect of adjustment for guessing on passing rates. Using a total score of 60 as the passing criterion, 29% of the participants passed the test based on the original analysis. After correcting for guessing, the proportion of participants who passed increased to 35%, a statistically significant difference. As shown in previous studies, this is a concern: students and faculty members worry when there is a significant change after guessing is corrected, especially when it relates to decision-making (Prihoda et al., 2008). The findings of this study also show that correction for guessing yields total scores different from the original analysis, although the two are highly positively correlated. Again, this finding is in line with previous work by Choppin (1974).
The results of this study indicate that correction for guessing also improves test reliability. This result is in line with another study, which found an increase in the reliability coefficient after correcting for guessing. Guessing corrections also change the distribution of person ability: by removing guessed responses from the data, the person ability distribution is matched more closely to the item difficulty distribution (Paek, 2015). Comparing the mean person abilities of the original and tailored analyses shows a decrease toward the mean item difficulty of 0.

Figure 2. Test information function of the original and tailored analyses
The tailored analysis also affected the TIF curve, which had a lower and flatter peak than the TIF curve of the original analysis. This is related to the shape of the item difficulty distribution relative to the person ability distribution: the information curve remains steeper when item difficulties are tightly clustered (Baker & Kim, 2017). The tailored analysis results indicate that respondents with high and low abilities receive better information values than in the original analysis. These results are in line with the findings of previous research (see Lin, 2018).
Regarding the TIF, Ostini and Nering (2006) stated that the notion of test information is rarely employed in the Rasch measurement literature. From the IRT perspective, the higher the peak of the TIF, the higher the reliability, because reliability can be computed as 1 − 1/information (Reeve & Fayers, 2005). In the present study, the PSR in the tailored analysis was higher than in the original analysis. This finding relates to the Rasch analysis program producing PSRs both with extreme scores included and with them removed (Clauser & Linacre, 1999), which were affected by the missing data in the tailored analysis. Thus, reliability information is reported through the PSR rather than being calculated through the TIF.
Although not the focus of this research, another statistic influenced by guessing adjustment is PCAR. The first contrast value of PCAR for tailored analysis was lower than that for the original analysis. This finding occurred because the tailored analysis produced a larger proportion of missing data. This finding is in line with the results of another study, which found that PCAR was influenced by missing data (Wind & Schumacker, 2021). The same was true for the Infit and Outfit statistics, which were affected by the proportion of missing data (Hohensinn & Kubinger, 2011). Future research should focus on exploring these findings further.
In addition, an important factor in the implementation of tailored analysis is the probability threshold (Waterbury & DeMars, 2019). This study used a 30% threshold based on the results of previous research (Andrich et al., 2012, 2016). However, recent research has shown that threshold selection needs to account for sample size (Wyse, 2016) and should be informed by comparing various thresholds on the same data (Waterbury & DeMars, 2019). These two issues were not investigated in this study and can be explored in future studies.
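The role of the 30% threshold in a tailored analysis can be sketched as follows: a response is treated as missing when the Rasch model probability of success for that person–item pair falls below the threshold, since correct answers in that region are plausibly lucky guesses. This is an illustrative sketch of the general procedure, assuming person and item estimates are already available; `tailor` and its arguments are hypothetical names, not the study's actual software:

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def tailor(responses, thetas, difficulties, threshold=0.30):
    """Tailored analysis sketch: responses[p][i] is the 0/1 score of
    person p on item i. Responses made where the model probability of
    success is below `threshold` are converted to missing (None)."""
    tailored = []
    for theta, row in zip(thetas, responses):
        tailored.append([
            None if rasch_p(theta, b) < threshold else x
            for x, b in zip(row, difficulties)
        ])
    return tailored

# A person at theta = -1 facing items at b = -2, 0, +2:
# P = 0.73, 0.27, 0.05, so the last two responses become missing.
data = tailor([[1, 1, 1]], [-1.0], [-2.0, 0.0, 2.0])
```

The sketch also makes visible why lower thresholds retain more data while higher thresholds remove more potentially guessed responses, which is exactly the trade-off the threshold-selection literature cited above addresses.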
This study has various theoretical and methodological limitations. The theoretical limitations are related to the test content. The content coverage of the Basic Statistics Test refers to the curriculum of the Faculty of Psychology at Syarif Hidayatullah State Islamic University. The statistics course is mandatory for psychology students at the university. Although this curriculum does not reflect the curriculum across psychology faculties all over Indonesia, all psychology faculties in Indonesia require their students to take between three and nine credit hours of statistics courses. Based on our literature review, statistical proficiency tests often include teaching materials and curricula based on statistical literacy theoretical frameworks, such as the Basic Literacy in Statistics Scale (Ziegler, 2014). However, the theoretical framework of statistical literacy used in these studies was not considered in this study. This aspect can be used to strengthen the conceptual basis for developing statistical tests in the future.
Methodological limitations are closely related to the non-random sampling: the sample comes from the capital city, whereas many psychology students outside the big cities do not have access to the learning materials covered by the test. In addition, the analysis shows that one item (item 12 in the tailored analysis) is affected by responses with a high expected probability of success (e.g., careless errors), which can be investigated more precisely with the cut responses with high expectation (CUTHI) procedure (see Gershon et al., 1994). Future research may expand guessing studies into more general contexts, such as measurement disturbances, also known as aberrant responses (e.g., Li & Olejnik, 1997). Additionally, the total score-based passing criteria can be improved in the future by incorporating standard-setting methods that relate the scores to the statistical ability measured by the test within a criterion-referenced testing framework.

Conclusion
The Rasch analysis results provide evidence that the construct validity of the Basic Statistics Test is satisfactory. Evaluation of the psychometric characteristics of the test showed good results for item difficulty distribution, test information function, person ability distribution, and test reliability. However, two test items did not fit the Rasch model, suggesting that the test needs further improvement. The application of tailored analysis was shown to affect the distributions of item difficulty and person ability, as well as test reliability. Comparison of the original and tailored analyses showed that the orderings of item difficulty and person ability from the two analyses did not differ significantly and were significantly correlated. In sum, this study suggests that the MPL based on total score can be employed by the university without changing the current policy.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Note. Scale transformation to 0-100 in the original analysis used Mean = 50.934 and SD = 9.145; the tailored analysis used Mean = 51.195 and SD = 8.301.