Questionnaire-taking motivation: Using response times to assess motivation to optimize on the PISA 2018 student questionnaire

Abstract

This study aims to assess student motivation to provide valid responses to the PISA student questionnaire. This was done by modeling response times using a three-component finite mixture model, comprising two satisficing response styles (rapid and idle) and one optimizing response style. Each participant’s motivation was operationalized as their probability of providing an optimizing response to questionnaire items. Overall, the model offered a good fit to the data. Results indicate that most responders were motivated to optimize, with a slight decline toward the end. Further, results showed a positive effect of questionnaire-taking motivation on PISA performance, suggesting a positive relationship to test-taking motivation. In conclusion, response times can be valuable indicators for assessing survey response quality and may serve as a proxy for test-taking motivation.


Introduction
Students participating in PISA (Programme for International Student Assessment) complete a test mainly focusing on math, science, and reading tasks (OECD, 2019). Following the test, they fill out a questionnaire covering background information and attitudes related to their education. However, concerns exist that scores from international large-scale assessments, such as PISA or TIMSS (Trends in International Mathematics and Science Study), may be impacted by varying motivation levels, potentially threatening the validity of results and cross-group comparisons. One factor assumed to influence students' motivation on the assessment is its low-stakes nature: there are no consequences contingent on how well the test-taker performs. Concerns related to the effects of motivation are substantiated by several studies showing positive effects of test-taking motivation on test performance (Eklöf et al., 2014; Ivanova et al., 2020; Pools & Monseur, 2021; Silm et al., 2020; Wise & DeMars, 2005).
Prior research on test-taking motivation and its link to performance in low-stakes assessments has mainly concentrated on students' self-reported or observed effort during the test. However, it is also interesting and important to consider students' motivation to thoroughly complete items in the post-test questionnaires: their questionnaire-taking motivation. Although there are studies investigating response biases and aberrant response patterns in the questionnaires that are part of large-scale assessments (Hopfenbeck & Maul, 2011; Khorramdel et al., 2017; van de Vijver, 2018), students' response behavior when taking the questionnaire, as indicated by response times, has received less attention (see, however, Soland et al., 2019; Ulitzsch et al., 2021). Since 2015, both the PISA questionnaire and literacy tests have been computer-based, with response-time data from items included in the publicly available dataset.
It seems reasonable that a student lacking motivation to complete a two-hour, low-stakes assessment may also be unmotivated to carefully finish a 30-minute questionnaire afterward. Drawing on previous studies examining test-taking effort assessed through response times, together with the satisficing framework used in survey research (see Krosnick, 1991), the current study aims to infer students' motivation to answer the PISA questionnaire using models of response times. Additionally, the study aims to explore the connection between students' questionnaire-taking motivation and their performance on the PISA reading test, hypothesizing that questionnaire-taking motivation may act as an independent proxy for their motivation to do well on the test.

Assessing test-taking effort and motivation
Assessing test-taking effort and motivation is not straightforward. Most existing studies have assessed test-taking motivation with post-test self-reports (Silm et al., 2020). However, computer-based tests that log response times enable exploring the use of response times to assess test-taking motivation (see Wise & Kong, 2005). An expectation is that response times may mitigate some shortcomings linked to post-test self-reports. A meta-analysis shows larger correlations between response time-derived effort measures and test performance than between self-reported effort and test performance (Silm et al., 2020). In the context of PISA, it can however be challenging to infer test-takers' motivation from response times, since different test-takers take different sets of items due to the matrix sampling design of the PISA assessment (see OECD, 2022, for an explanation of the design of PISA). A less cognitively demanding part of the assessment, which has a fixed number of items in a fixed order that all test-takers must complete, is the student questionnaire. Since questionnaire items are less dependent on the test-takers' subject knowledge and academic proficiency level, the response times should be affected to a lesser extent by test-takers' skill and knowledge, which could otherwise complicate both the modeling and the interpretation of their test-taking motivation. An additional advantage of the questionnaire is that all items are publicly available for inspection.

Approaches to assessing test-taking motivation with response times on achievement tests
There are two primary approaches to using response times as indicators of test-taking effort and motivation. The most common is a fixed-threshold approach, which assumes excessively fast response times result from rapid guessing. Thresholds for each item classify test-takers' response times, indicating a rapid guess if below the threshold or solution behavior otherwise. From these classifications, an index reflecting the test-taker's Response Time Effort (RTE) can be created, representing their effort on the entire assessment (Wise & Kong, 2005). RTE is calculated as the proportion of responses classified as solution behavior. Much of the work on threshold-based approaches and RTE has been conducted by Steven Wise and colleagues (Wise, 2006; Wise & Gao, 2017; Wise & Kong, 2005; Wise & Ma, 2012). A challenge in the threshold-based approach is determining the threshold and its rationale. One approach has been to set thresholds at 3 s, 10 s (see Kong et al., 2007), or at 10% of the mean of the response times given to an item; the NT10 threshold (Wise, 2017; Wise & Ma, 2012). Another issue is the potential overlap between motivated and unmotivated response times. For less cognitively demanding items, response times for not-so-rapid guessers may be similar to, or even longer than, those of alert and fast test-takers attempting to solve the problem quickly. This implies that it is not conceptually consistent to set hard deterministic thresholds that categorize test-takers' response times as rapid guessing and solution behavior without considering the uncertainty of the classifications. An alternative approach models rapid guessing and solution responses with finite mixture models, which assume that the observed data were generated from a set of latent data-generating processes. In the context of response time effort, a two-component mixture model has often been used, where one component is assumed to generate the response times from rapid guessing behavior and the other the response times from solution behavior (see Meyer, 2010; Pools & Monseur, 2021; Schnipke & Scrams, 1997; Wang & Xu, 2015). The finite mixture models resolve the problem with fixed thresholds and provide a "soft" classification where each test-taker's response time is given a probability of being generated by either component. The state of the art when using mixture models adjusts for the effects of non-engagement by jointly using information from response times and item response patterns (see Liu et al., 2020; Lu et al., 2020, 2023; Ulitzsch et al., 2020, 2022). The conceptualization of and methodological approach to response times, rapid guessing, and solution behavior reviewed in this section have primarily been applied to achievement tests. However, similar approaches could also be relevant to questionnaires.
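The contrast between the two approaches can be sketched numerically. The snippet below (all response times and mixture parameters are invented for illustration, not taken from any study) computes a hard-threshold RTE score and, for the same response times, the "soft" posterior probability of solution behavior under a two-component lognormal mixture:

```python
import numpy as np

def lognorm_pdf(x, mu, sigma):
    """Density of LogNormal(mu, sigma) evaluated at x > 0."""
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * np.sqrt(2 * np.pi))

# hypothetical response times (seconds) of one test-taker across eight items
rt = np.array([1.2, 0.8, 14.5, 22.0, 9.1, 30.3, 2.5, 18.7])

# fixed-threshold approach: RTE = proportion classified as solution behavior
threshold = 3.0  # a fixed 3-second threshold
rte = np.mean(rt >= threshold)

# mixture approach: "soft" classification via Bayes' theorem
lam = np.array([0.1, 0.9])    # mixing proportions: rapid guessing, solution behavior
mu = np.array([0.0, 2.8])     # log-scale means of the two components
sigma = np.array([0.5, 0.6])  # log-scale standard deviations

d_rapid = lognorm_pdf(rt, mu[0], sigma[0])
d_solution = lognorm_pdf(rt, mu[1], sigma[1])
post_solution = lam[1] * d_solution / (lam[0] * d_rapid + lam[1] * d_solution)
```

Where the hard threshold assigns each response wholly to one class, `post_solution` expresses the uncertainty of each classification as a probability, which is the conceptual advantage of the mixture approach discussed above.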

Satisficing in survey research
The satisficing and optimizing framework used to evaluate the quality of questionnaire responses (Krosnick, 1991) is in many ways similar to that of rapid guessing and solution behavior described above. Herbert Simon's satisficing theory, developed for administrative behavior and human problem-solving, acknowledges that due to the physical limitations of human knowledge and reasoning power, pure rationality is not realistic in most real-world decision-making scenarios. Thus, problem solvers need to settle on a good-enough, non-optimal alternative that is sufficient for the task, a phenomenon termed satisficing (Simon, 1956, 1997). Krosnick (1991) borrowed Simon's theory and terminology for theorizing the cognitive demands of survey responses and separated survey response strategies into two major classes: satisficing and optimizing. In an optimizing response strategy, the participant exerts full mental effort to answer the question in the best possible way to provide high-quality information. The cognitive activities needed in an optimizing response are (a) carefully interpreting the questions, (b) trying to recall relevant information from memory, (c) using the information from these memories to create a summary judgment of the question, and (d) reporting their response in a way that best fits their judgment or opinion. Satisficing, on the other hand, refers to a response style where respondents in some way take a mental "shortcut" to provide a seemingly valid answer. Krosnick further refined the concept into strong and weak satisficing, where strong satisficing is the response strategy that requires the least amount of information processing, e.g., "I don't know" answers, answering in a non-differentiating manner by repeatedly selecting the same point on the rating scale, or responding by randomly selecting between available alternatives. The slightly more thoughtful response strategy named weak satisficing is biased due to incomplete deliberation, such as choosing the first plausible answer that comes to mind. Empirical studies that have investigated satisficing have often relied on indicators such as response-order effects (for weak satisficing), non-differentiation, and selecting "I don't know" answers. However, many other indicators have also been used, such as item nonresponse and response latencies, predictors which in a review study by Roberts et al. (2019) have shown overall consistent findings relating to satisficing. The conditions suggested to lead to satisficing behavior are task difficulty and the ability and motivation of the responder (Krosnick, 1991), conditions that when used as predictors have shown significant effects on satisficing indicators (Roberts et al., 2019). There is, however, a lack of research on more direct measures of satisficing.

The relationship between satisficing and response times
Conceptually, the interplay between response times and satisficing is not entirely clear, but a consistent finding seems to be that satisficing takes less time than optimizing, which is reasonable given the definition of satisficing. For instance, Revilla and Ochoa (2015) found that shorter response times were related to lower-quality answers, and Gummer and Roßmann (2015) found that test-takers interested in the topic surveyed spent more time completing the questionnaire. Messages that encourage responders to carefully consider the response options before answering are associated with longer time spent on items (Clifford & Jerit, 2015; Dumitrescu & Martinsson, 2016) and decreased failure on trap questions (Clifford & Jerit, 2015). The format of survey questions further seems to affect response times (Roßmann et al., 2018). Responders in one study were unwilling to identify and read important definitions that could be revealed by mouse clicks (Galesic et al., 2008). Moreover, results from an eye-tracking study indicate that test-takers' effort to process questions, indicated by the fixation duration of stems and response categories, declined toward the end of a questionnaire (Chauliac et al., 2022).
It is hypothesized that satisficing behavior can also be displayed in the form of overly slow responses. Read et al. (2022) suggested a theoretical framework that included inattentive responders with fast response times, attentive responders with baseline duration, and inattentive responders with slow response times, where the slow responders' inattentiveness is a result of multitasking while filling in surveys. Their results suggested that 46% of the responders belonged to the fast and inattentive cluster, 40% to the baseline attentive cluster, and 14% to the slow inattentive cluster. This three-part conceptualization is somewhat in line with the suggestion that there might be disengaged response styles on achievement tests that do not occur rapidly (Wise et al., 2020; Zumbo et al., 2023).
Due to the variation in item and test-taker characteristics, it might be difficult to separate weak satisficing from optimizing. If a person is a slow reader and quick to decide on which option they want to choose, they might answer with a similar response time as a fast reader who deliberates longer. Strong satisficing that happens very quickly is similar to a rapid guess in RTE terminology, while optimizing is similar to solution behavior. Nevertheless, since rapid responses to questionnaire items without right or wrong answers could be guesses or simply filling in answers, we will refer to them as rapid satisficing responses.
Building on previous literature, we assume that there are three broad classes of response styles in the PISA questionnaire. If a PISA test-taker is motivated to answer a questionnaire item, they will try to optimize when answering the item. When a test-taker is not motivated to answer a questionnaire item, they will either satisfice with a rapid response or be inattentive or disengaged from the task, producing a slower response time.
A simple but crucial aspect often ignored in answering questionnaires is the motoric behavior needed to process information to comprehend questions and provide answers. In computer-administered surveys, this typically involves using a pointer device to click buttons or typing answers. The response times related to these behaviors are essential to consider when thinking about how different styles differ in the response times they produce. If a test-taker skips a question, they may only need to respond to visual cues, locate the button, and advance in the questionnaire, a process similar to an aim-and-click process that takes around 0.25 to 1 s (see the distribution at https://humanbenchmark.com/tests/aim). Thus, satisficing responses can occur very quickly. Conversely, if a test-taker aims to optimize their answer, they must read and comprehend questions, deliberate, and then physically respond, a process that should take longer than strong satisficing.

Results from previous research using response times to understand test-taking effort and motivation in low-stakes tests
Previous research reveals mixed results concerning the proportion of motivated test-takers on low-stakes assessments, but it is generally estimated to be high. Further, correlations between response-time-based measures of motivation and performance tend to be moderate to strong (Silm et al., 2020). Results from previous research indicate that most test-takers receive very high RTE scores, which suggests frequent engagement in solution behavior. In Wise and Kong (2005), 299 students out of 506 had an RTE = 1.0, with the lowest RTE being 0.2; in Kong et al. (2007), mean RTE varied between 0.93 and 0.95 depending on which threshold was used. Similar results have been reported by Wise and DeMars (2010) and Wise and Gao (2017), where 4.6% of test-takers had an RTE < 0.9. On the other hand, Rios et al. (2014) found that 23% of their sample qualified as unmotivated using one method of setting the threshold, while percentages ranged between 6 and 11% when using other methods. Michaelides et al. (2020) investigated the response time effort on PISA 2015's multiple-choice science items and reported below 6% rapid guessing (on average) and Spearman correlations between response time effort and performance of around 0.4. Results from another study using PISA 2015 data and a mixture modeling approach indicated that the average posterior probability of test-takers classified as effortful was between 0.94 and 0.85, with the lower proportion for item clusters administered later in the test (Pools & Monseur, 2021). Pools and Monseur (2021) further concluded that test-taking effort had a significant effect on PISA test scores, with correlations ranging between 0.38 and 0.59 depending on item cluster position. In a study by Ulitzsch et al. (2022), the rates of careless and insufficient effort responses varied between 6 and 13% depending on what time variable was used in the mixture model. A study examining variations between countries found that the proportion of rapid guessers ranged from 0.03 to 0.16 (Rios & Soland, 2022), and most individual items also received a high proportion of solution behaviors, ranging from 0.90 to 1 in previous studies (Wise, 2006; Wise & Gao, 2017). In Ulitzsch et al. (2021), the per-item rate of rapid guesses ranged from 1.7 to 24.1%. Items administered toward the end of the questionnaire tend to receive less effortful responses (Wise, 2006).
Regarding the percentage of satisficing in surveys, previous studies have resulted in estimates ranging from 8 to 30% (Gao, House, & Bi, 2016; Gao, House, & Xie, 2016; Kapelner & Chandler, 2010). Another estimate is that 30% of students taking a questionnaire engaged in satisficing behavior at least once (Vriesema & Gehlbach, 2021). The only study we found using a response time-based indicator of motivation on a questionnaire in a large-scale assessment (the OECD Test for Schools) was Soland et al. (2019), which applied an RTE approach with various threshold levels. Using a 2-s threshold, they found the RTE to be around 0.98, and a correlation between RTE on the achievement test and the RTE equivalent on the questionnaire of 0.35.
Although most findings and estimates presented above relate to achievement test data rather than questionnaire data, they still offer a reference point for interpreting this study's results.

Aim
The purpose of the present study is to (a) develop a model that uses response time data from questionnaire items to capture students' motivation to optimize on the PISA 2018 student questionnaire. Further, the study aims to use the model to (b) investigate whether students' motivation to optimize on the questionnaire is related to their performance on the PISA reading test. In addition, the study aims to (c) explore if the proportion of motivated responses changes by item position and (d) compare the estimated cutoff values derived from mixture modeling with fixed and NT10 thresholds.

Sample and pre-processing of data
Response time data from the student questionnaire and reading performance test scores (plausible values) from 5,124 Swedish students participating in PISA 2018 were used in the analysis. Reading was chosen as the performance outcome variable as it was the main subject in PISA 2018. The available response time data describe the test-takers' total time spent on one "screen" or "page" in the questionnaire. Such a screen contains one main theme or question, often accompanied by multiple subitems (e.g., a number of items asking for the student's reading enjoyment). As a response time to a questionnaire screen may or may not include subitems, we reserve the term item to refer to the full screen, and item response time to the time spent on the questionnaire screen. Some of the response time data were related to introductory and ending screens (STIntro1_TT, STInfo1_TT, and STEnd01_TT), but these were removed from the dataset. Further, items coded ST001_TT, ST002_TT, ST003_TT, and ST004_TT were short (e.g., asked for grade, gender, and date of birth) and required few cognitive steps and little effort to respond to, which would make satisficing and optimizing responses difficult to disentangle. Furthermore, these items were administered at the beginning of the questionnaire, and there was no observable multimodality in their response time distributions, which made these items difficult to fit and influenced our decision to remove them from the analysis. The final response time data set contained response time data from 61 of the 79 questionnaire items.

Conceptual model
As noted in the introduction, we assume there are three different response styles that test-takers can engage in when responding to items in the questionnaire. The first is a strong satisficing (unmotivated) response style, which could be a skip response produced so rapidly that it would be practically impossible to generate if all steps of an optimizing response were completed (in the following called "rapid" satisficing). The second is an optimizing response style, which implies trying to answer the questionnaire items with reasonable effort by reading the text and deliberating about answers before answering (a motivated response style). We further assume that there may be an unmotivated response style characterized by very long response times rather than very short ones, for example, idle participants who are slow to respond due to their lack of focus on the task (in the following called "idle" satisficing). It is assumed that each test-taker has a specific level of motivation that affects their decision to use one of these response styles when presented with a questionnaire item. We assume that when a test-taker is presented with a questionnaire screen, a highly motivated test-taker will have a greater tendency to answer with an optimizing response style. A test-taker with low questionnaire-taking motivation will, in comparison, have a low tendency to respond with an optimizing response style and will instead use either of the non-optimizing response styles (rapid or idle). The decision to use a rapid, optimizing, or idle response style on a given questionnaire screen is related to different response times, which together produce the bimodalities, long tails, and outliers that can be observed in the response time distributions. We assume that a rapid satisficing response is on average faster than an optimizing response and that an idle response must take at least 10 s, and at most as long as the longest observed response time to the item. Further, different questionnaire items have different characteristics: some are very simple while others are more cognitively demanding, which will affect the time it takes to complete an optimizing response, so it is anticipated that response times for optimizing and satisficing can differ between screens. We furthermore hypothesize that a test-taker's motivation to invest effort and optimize on the questionnaire could be related to the test-taker's performance on the PISA reading test. The above verbal account of the conceptual model provides the groundwork for the statistical Bayesian model presented in the section below.

Bayesian model
A test-taker's motivation is represented by the latent variable θ, which governs the probability that a test-taker will decide to use an optimizing response style when presented with a questionnaire item (screen) m. The response time rtn,m it takes to answer a questionnaire screen is modeled by a finite mixture model with three components. Each of the components, c1, c2, and c3, represents the component distribution for response time data generated by a rapid, optimizing, and idle response style, respectively. The rapid and optimizing response times are assumed to follow the typical shape of a response time distribution, which is often characterized by a skewed distribution with a long right tail, characteristics that can be modeled by a lognormal distribution. Thus, the response times generated by the rapid and optimizing components on the mth screen are assumed to follow lognormal distributions with screen-specific parameters. Slow idle satisficing response times are represented by a uniform distribution between 10 s and the maximum observed response time, max(rtm). The uniform distribution was selected due to uncertainty regarding the distribution shape of idle response times. This third, idle component accounts for outliers and could reduce bias when estimating the parameters related to rapid and optimizing responses. The mixture proportions for each component response style, λm, are modeled individually for each item m. When the parameters of the finite mixture models have been estimated, we can use Bayes' theorem to calculate the probability that the response time of test-taker n to item m was generated by the optimizing component. Each test-taker will, for each screen, receive a probability value that their observed response time was generated by the optimizing component. These probabilities are used to update the parameters of a beta distribution, which represents the posterior belief over θn, questionnaire-taking motivation. To investigate the effect of questionnaire-taking motivation on PISA reading performance, a linear regression was used to model the effect of θn on the mean of the reading plausible values. For a graphical representation of the conceptually important parts of the model, see Figure 1. In Figure 1, the top node with θn (motivation) affects the generation of a decision zn,m to, on item m, use a response style modeled by mixture component c1,m (rapid), c2,m (optimizing), or c3,m (idle), which in turn generates a response time rtn,m. Finally, to the right in the figure, a test-taker's questionnaire-taking motivation θn is linked to plausible values pvn through a linear model.
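The per-screen classification step described above can be sketched as follows. All parameter values are invented for illustration, and the Beta-updating rule shown (adding the component probabilities to a Beta(1, 1) prior) is one plausible reading of the verbal description, not the paper's exact specification:

```python
import numpy as np

def lognorm_pdf(x, mu, sigma):
    """Density of LogNormal(mu, sigma) at x > 0."""
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * np.sqrt(2 * np.pi))

# hypothetical parameters for one questionnaire screen (illustrative only)
lam = np.array([0.05, 0.93, 0.02])   # proportions: rapid, optimizing, idle
mu1, sigma1 = 0.0, 0.4               # rapid component (log scale)
mu2, sigma2 = 3.0, 0.5               # optimizing component (log scale)
idle_lo, idle_hi = 10.0, 600.0       # bounds of the uniform idle component

def optimizing_prob(rt):
    """P(optimizing component | rt) via Bayes' theorem over the three components."""
    d_rapid = lognorm_pdf(rt, mu1, sigma1)
    d_opt = lognorm_pdf(rt, mu2, sigma2)
    d_idle = (1.0 / (idle_hi - idle_lo)) if idle_lo <= rt <= idle_hi else 0.0
    num = lam[1] * d_opt
    return num / (lam[0] * d_rapid + num + lam[2] * d_idle)

# one test-taker's response times (seconds) across four screens
rts = [18.0, 25.0, 1.1, 30.0]
p = np.array([optimizing_prob(t) for t in rts])

# update a Beta(1, 1) prior over theta with the per-screen probabilities
alpha, beta = 1.0 + p.sum(), 1.0 + (1.0 - p).sum()
theta_mean = alpha / (alpha + beta)
```

Three slow-but-plausible responses pull the posterior mean of θ up, while the single 1.1-second response (which the rapid component dominates) pulls it down, mirroring how the model accumulates evidence about motivation across screens.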

Model equation
The full model equation and its implementation are provided in the supporting materials. The priors for λ expect a proportion of 0.08 [0.00, 0.34] rapid and idle satisficing responses and a proportion of 0.83 [0.53, 0.99] optimizing responses before updating with empirical data (99% credible intervals, CrI, within brackets). The hyperpriors for the rapid satisficing component imply, before being updated with any data, a mean response time of 0.29 [0.002, 1.34] seconds (99% CrI within brackets). While this prior expects very rapid response times, it was required for convergence. The hyperpriors for the optimizing component imply response times with a mean of 43.329 [0.257, 198.78] seconds (99% CrI within brackets). All the standard deviation parameters pvsd, σ1, and σ2,m were sampled from truncated distributions such that their values were ensured to be greater than 0.
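Prior-implied mean response times of the kind reported above can be obtained by pushing hyperprior draws through the lognormal mean formula, exp(μ + σ²/2). The sketch below illustrates the mechanics only; the hyperprior values are invented and differ from the ones actually used in the model:

```python
import numpy as np

rng = np.random.default_rng(4)

# invented hyperprior draws for a lognormal component's parameters
mu = rng.normal(-1.5, 0.4, size=100_000)
sigma = np.abs(rng.normal(0.5, 0.2, size=100_000))

# the mean of a LogNormal(mu, sigma) distribution is exp(mu + sigma^2 / 2)
implied_mean = np.exp(mu + sigma ** 2 / 2)

point = implied_mean.mean()                          # prior-implied mean response time
lo, hi = np.quantile(implied_mean, [0.005, 0.995])   # 99% interval of implied means
```

Summaries such as `point` and `[lo, hi]` correspond to the kind of prior-implied mean and 99% CrI reported in the text.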

MCMC estimation and inference
To estimate the parameters, 4 chains of length 4,000 were sampled: 2,000 draws for warmup, with the remainder retained for the posterior. Traceplots were examined for the mixture model parameters and did not indicate any problems with stationarity, mixing, or convergence. All Rhat values of estimated parameters were below 1.01 (Vehtari et al., 2021). To evaluate the fit of the models, posterior predictive plots were produced. Posterior predictive checks provide a qualitative way of evaluating how well the model fits the data by comparing how similar simulated datasets generated by the model are to the real, empirically observed data set. What we would like to see in a plot is that data simulated from the model look the same as the empirically observed data; this is indicated by a close overlap between empirical and simulated data. See Gabry et al. (2019) for further explanation of posterior predictive checks.
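The logic of a posterior predictive check can be sketched numerically: simulate replicated datasets from posterior draws and ask whether a statistic of the empirical data falls within the spread of the replicated statistics. Both the "empirical" data and the posterior draws below are simulated placeholders, not the PISA data or the fitted model:

```python
import numpy as np

rng = np.random.default_rng(2)

# placeholder "empirical" response times for one item
y = rng.lognormal(mean=3.0, sigma=0.5, size=500)

# replicated datasets from illustrative posterior draws of (mu, sigma)
n_rep = 200
mu_draws = rng.normal(3.0, 0.05, size=n_rep)
sigma_draws = np.abs(rng.normal(0.5, 0.03, size=n_rep))
yrep = np.array([rng.lognormal(m, s, size=y.size)
                 for m, s in zip(mu_draws, sigma_draws)])

# numerical counterpart of the graphical overlap: the empirical median
# should fall within the spread of the replicated medians
rep_medians = np.median(yrep, axis=1)
covered = rep_medians.min() <= np.median(y) <= rep_medians.max()
```

The density overlays in Figure 2 convey the same idea graphically: each gray line is one row of `yrep`, and a well-fitting model keeps the black empirical line inside the gray band.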
The probabilistic programming language Stan (Carpenter et al., 2017) and the CmdStan 2.30 interface were used to implement the model. R 4.2.1 was used for data cleaning, visualization, and summarization. To ensure transparency, the complete model code, as well as the code for cleaning and creating figures, can be examined in the complementary supporting materials in the repository at https://doi.org/10.5281/zenodo.7831743.

Evaluation of model fit
Figure 2 shows posterior predictive plots that compare model-simulated data to empirical data from real test-takers. In this plot, we can see a black line on top of gray shadowing. The black line is a density plot over the response times to each item. The gray shadowing is constructed from multiple lines, where each gray line is a density plot over response times produced by a simulation of a complete data set by the fitted model. As we can see, for most items the gray simulated lines and the black empirical lines show a close overlap, which indicates a good model fit. (For increased interpretability, the x-axis is limited to showing 0 to 60 s.) However, a number of items displayed a rather bad fit (ST015, ST158, ST164, ST165, ST166, and ST188), as can be seen by the gray lines not resembling the black line. Note, however, that the parameter estimates for these items would not overestimate the proportion of motivated test-takers; they err on the conservative side of the estimation of the proportion of satisficing responses. We will return to these problematic items in the discussion.
Figure 3 shows posterior predictive checks for the PISA scores, and since the simulated and empirical data overlap, the model indicates a good fit to the PISA scores. Since the model did not converge with weak priors, strong priors were set on the parameters of the lognormal distributions. Still, posterior predictive checks suggest that the data fit was overall good, and the results show that posterior estimates have changed considerably compared to the priors, which implies that our model is substantially informed by the data and not only sampling from the priors.

Overall estimates of questionnaire-taking motivation
The mean of the motivation parameters was 0.97 (mode = 0.998, 95% credible interval, CrI = [0.81, 1.00]). Thus, the model implies that there is a 95% probability that test-takers in the sample have a θ between 0.81 and 1.00. We can interpret the thetas as the motivation to engage in an optimizing response style. If a test-taker is estimated to have a θ of 0.9, then when presented with 100 questionnaire items, the test-taker is expected to answer 90 of them by optimizing and to use either a rapid or an idle response style on the remaining items. The results suggest that, in general, the motivation to optimize on the questionnaire is high.
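The interpretation of θ as an expected count of optimizing responses can be made concrete with a small simulation (the value of θ and the number of items are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

theta = 0.9       # probability of an optimizing response on each item
n_items = 100     # number of questionnaire items presented

# simulate many test-takers with this theta: each item is an independent
# optimize / don't-optimize decision with probability theta
optimize = rng.random((10_000, n_items)) < theta
mean_optimizing = optimize.sum(axis=1).mean()   # expected count is theta * n_items
```

Averaged over many simulated test-takers, the count of optimizing responses concentrates around θ × 100 = 90, matching the interpretation given above.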

Relationship between questionnaire-taking motivation and test performance
The top panel in Figure 4 shows the relationship between latent motivation on the student questionnaire and performance on the PISA test. For example, using the point estimate of the mode, the predicted score of a student with a motivation of 0.85 is 461, and an increase of 0.10 in motivation, to 0.95, predicts an increase in score to 498. A proficiency level in PISA 2018 covered around 80 points on the PISA scale (OECD, 2019); this implies that a motivational increase of around 0.25 would amount to an increase of one proficiency level. For comparison with previous studies that use correlations to assess the relationship between test-taking motivation and test score, we calculated Pearson (r_p) and Spearman (r_s) correlations between the point-estimated mean θ and PV1READ (r_p = 0.31, r_s = 0.75). Both types of correlations are calculated and reported in several previous studies using RTE and often yield results similar to those obtained in the present study, although in previous research it is not always clear which type of correlation coefficient has been reported. The difference between the two coefficients simultaneously suggests a strong positive monotonic relationship and a weak positive linear relationship, which indicates that the relationship between motivation and performance may be positive but non-linear. Due to this observation, we fitted a piecewise regression between the mean estimate of θ for each respondent and their plausible value, which includes a breakpoint bp where the slope changes. The bp was estimated at 0.998; see the bottom right panel of Figure 4, which shows the piecewise model zoomed in on the x-axis to display the breakpoint where the effect of θ on performance score increases, as well as the strong covariance indicated by the scatterplot. The proportion of responders with a θ above the bp was 64%. We then calculated correlations below and above the bp. For θ below the threshold, r_p was 0.18 and r_s was 0.12; for θ above the threshold, r_p was 0.81 and r_s was 0.81. Thus, overall results indicate a positive relationship between questionnaire motivation and test performance. This relationship was especially strong for test-takers whose motivation was estimated to be in the upper region of the motivation scale.
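The divergence between the two coefficients can be reproduced on synthetic data: for any strictly monotonic but non-linear relationship, the Spearman correlation is perfect while the Pearson correlation is not. A minimal sketch with made-up values (not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical illustration (not the study's data): a strictly monotonic but
# convex relationship gives a perfect Spearman correlation while the Pearson
# correlation stays clearly below 1.
theta = np.linspace(0.5, 1.0, 200)           # stand-in for motivation estimates
score = 200.0 + np.exp(12 * (theta - 0.5))   # stand-in for performance, convex in theta

r_p, _ = stats.pearsonr(theta, score)
r_s, _ = stats.spearmanr(theta, score)
print(round(r_p, 2), round(r_s, 2))  # r_s = 1.0 by construction; r_p is noticeably lower
```

The gap between the two coefficients grows with the curvature of the relationship, which is why a large r_s together with a modest r_p is a signal worth following up with a non-linear model such as the piecewise regression used here.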

Exploring the relationship between item position and response style
In Figure 5, estimates of the mixture component parameter λ are presented. This parameter reflects the proportion of each response style component for each item. The proportion of optimizing responses (λ_m2) is very high for the first items but gradually diminishes to around 0.94 for those presented near the end. At the same time, we see a concurrent increase in the proportion of rapid responses (λ_m1), to around 0.06 for the items presented near the end. The idle response style (λ_m3) was very rare: the item with the greatest proportion of idle responders was the tenth item (ST019), where the mean lambda for the third component was 0.0006. When satisficing occurred, it was thus primarily rapid satisficing. In summary, the results suggest that the relationship between item position and response style is positive for rapid satisficing and negative for optimizing: optimizing responses gradually decrease toward the end of the questionnaire, rapid satisficing responses progressively increase, and the proportion of idle satisficing remains very small throughout.
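The λ parameters act as prior component weights; combined with each component's density they give, for any observed response time, the posterior probability of each response style. A minimal sketch with invented parameter values (none of these numbers are the study's estimates; they only illustrate the mechanics):

```python
import numpy as np
from scipy import stats

# Hypothetical parameter values for one item's three-component lognormal
# mixture (rapid, optimizing, idle). All numbers are invented for illustration.
lam = np.array([0.05, 0.949, 0.001])                 # mixing weights λ_m1, λ_m2, λ_m3
mu = np.array([np.log(2), np.log(20), np.log(300)])  # log-scale locations (seconds)
sigma = np.array([0.3, 0.5, 0.4])                    # log-scale spreads

def responsibilities(rt):
    """Posterior probability that response time `rt` came from each component."""
    dens = lam * stats.lognorm.pdf(rt, s=sigma, scale=np.exp(mu))
    return dens / dens.sum()

print(responsibilities(2.0))    # dominated by the rapid component
print(responsibilities(20.0))   # dominated by the optimizing component
print(responsibilities(400.0))  # dominated by the idle component
```

Averaging these responsibilities over respondents for a given item recovers item-level proportions of the kind plotted in Figure 5.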

Model implied thresholds compared to fixed and NT10 thresholds
Figure 6 compares the model-implied thresholds with the fixed (3 and 10 s) and NT10 thresholds. The thresholds implied by the model used in this study indicate the cutoff at which response times become more likely to have been generated by satisficing than by optimizing responses. For example, for the first item (ST005), the thresholds separating rapid satisficing responses from optimizing responses coincide for the model-implied (dots/whiskers), fixed 3-s (dashed vertical line), and NT10 (triangles) methods, while for item ST011 the model-implied threshold suggests a cutoff at around 12 s, which differs substantially from the fixed and NT10 thresholds, all of which are set at lower response times. As can be seen in Figure 6, there is great variability among the model-implied thresholds, and they seldom overlap with the fixed and NT10 thresholds; most, however, fall between the fixed 3-s and 10-s thresholds. The idle thresholds show even greater variability and uncertainty (note the x-axis scale): most lie between 200 and 300 s, while for some items the thresholds are around 500-900 s.
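A threshold of this kind can be computed as the response time at which the weighted densities of two components cross. A minimal numerical sketch, again with invented parameter values (a rapid mode near 2 s and an optimizing mode near 20 s; the study's item-level estimates differ):

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical sketch: the model-implied threshold is the response time at
# which the weighted density of the rapid component equals that of the
# optimizing component. All parameter values below are invented.
lam_rapid, lam_opt = 0.05, 0.95

def f_rapid(t):
    return lam_rapid * stats.lognorm.pdf(t, s=0.3, scale=2.0)   # rapid mode near 2 s

def f_opt(t):
    return lam_opt * stats.lognorm.pdf(t, s=0.5, scale=20.0)    # optimizing mode near 20 s

# The crossing point lies between the two modes, so bracket the root there.
threshold = optimize.brentq(lambda t: f_rapid(t) - f_opt(t), 2.0, 20.0)
print(round(threshold, 1))  # times below this are more likely rapid responses
```

Because the crossing point depends on both the component locations and the mixing weights, such thresholds naturally vary from item to item, unlike a single fixed 3-s or 10-s rule.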

Discussion
In the present study, students' questionnaire-taking motivation on the PISA 2018 questionnaire was inferred through the modeling of response times. A Bayesian model was developed that uses response times to estimate a parameter reflecting the overall motivation to use an optimizing response style while taking into consideration the motivation to produce rapid and idle responses. We further examined the distribution of motivation (to optimize) and how it related to PISA test performance. We also explored order effects and compared model-implied thresholds to RTE thresholds (3 s, 10 s, and NT10).
The model provided an overall good fit to the response time data; with the exception of a few items, the questionnaire item response times could be well described by this model. The results indicate that most test-takers in this Swedish sample were highly motivated to respond to the PISA questionnaire. The mean estimate of the probability of optimizing was 0.97, which is in line with previous RTE research and somewhat higher than results from studies using similar models but different samples (Pools & Monseur, 2021; Ulitzsch et al., 2022). Note, however, that the uncertainty related to this estimate indicates that the level of motivation could be lower, and that the estimation is probably somewhat biased toward higher proportions of optimizing responses. Still, the findings indicate that the vast majority of test-takers within this sample were expected to optimize on all questions. Per-item motivation ranged between 0.94 and 1, which is in line with previous research.
The results further showed that questionnaire-taking motivation had a positive effect on PISA plausible values. This indicates that the motivation to optimize on the questionnaire is, at the population level, related to higher performance. Correlations between motivation and performance also indicate a positive relationship, in line with previous studies examining correlations between test-taking effort and performance. However, since the type of correlation calculated in previous studies is often not explicated, it is difficult to make further comparisons. The Spearman correlation suggested a strong relationship, while the Pearson correlation suggested a weak positive relationship. This can likely be explained by the relationship being monotonically rather than linearly increasing. This discrepancy complicates interpretation, which led to an additional analysis applying a piecewise (breakpoint) regression model. These results indicate a weak relationship between motivation and performance below the regression breakpoint (among test-takers with lower motivation) and a strong relationship above it (among test-takers with higher motivation). Explaining the drastic shift in the relationship above the breakpoint is challenging, because all responders above the breakpoint have a very high probability of having used optimizing responses on all items. The results show that students with response times highly aligned with the mode of the optimizing component are predicted to score higher than those with response times slightly less aligned. It is unclear whether this is due to increased motivation or to efficiency in reading the questionnaire items; differences in reading ability may thus account for part of the effect of questionnaire-taking motivation on test performance.
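The breakpoint analysis can be illustrated on simulated data: fit a two-slope linear spline at each candidate breakpoint and keep the one minimizing the squared error. A minimal sketch (the data-generating values here are invented, not the study's estimates):

```python
import numpy as np

# Hypothetical sketch of a piecewise (breakpoint) regression: a weak slope
# below a true breakpoint at 0.9 and a much steeper slope above it.
rng = np.random.default_rng(0)
theta = rng.uniform(0.5, 1.0, 500)                      # stand-in motivation estimates
score = (300 + 50 * theta + 900 * np.maximum(theta - 0.9, 0)
         + rng.normal(0, 10, 500))                      # stand-in performance scores

def fit_piecewise(x, y, candidates):
    """Grid-search the breakpoint; ordinary least squares at each candidate."""
    best = None
    for bp in candidates:
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - bp, 0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ beta) ** 2)
        if best is None or sse < best[0]:
            best = (sse, bp, beta)
    return best[1], best[2]

bp, beta = fit_piecewise(theta, score, np.linspace(0.6, 0.99, 40))
print(round(bp, 2))  # recovered breakpoint, close to the true 0.9
```

The third coefficient (`beta[2]`) is the change in slope at the breakpoint, which is the quantity that captures the "drastic shift" discussed above.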
When comparing the model-implied thresholds used in the current study with the fixed 3-s, fixed 10-s, and NT10 thresholds often used in previous studies on achievement test data, the results show that while the model-implied thresholds often fell between 3 and 10 s, they seldom overlapped with the others. The NT10 threshold was in most cases set more conservatively, at a shorter time, than the model-implied thresholds. Overall, very small proportions of responses were generated by the idle component, indicating that most response times were better modeled by either the rapid or the optimizing component. This raises the question of whether it is necessary to consider these slow responders, and how common this kind of response style is (if it exists at all). However, without the idle component, the parameters of the rapid and optimizing models could become slightly biased due to the outliers that were accommodated by the idle component. It could also be the case that some idle responses were misattributed to the optimizing component.
Our exploration of the order effect shows that items presented at the end of the questionnaire are associated with lower proportions of optimizers and higher proportions of satisficers. The proportion of optimizers was close to 1 for the first items and dropped almost monotonically to 0.94 for the last two items. The magnitude of the drop in motivation is in line with results from Wise (2006) and Pools and Monseur (2021). A possible factor that could affect the estimation of motivation on the questionnaire is fatigue from taking the test part of the PISA assessment before the questionnaire. This could be a source of error when using motivation from the questionnaire as a proxy for test-taking effort. However, results by Ivanova et al. (2020) indicate that test-takers regain some engagement after the break between test sessions. Recovery during the break, together with the fact that the questionnaire is less mentally taxing, suggests that any fatigue effects carried over from the test part are of minor concern.

Limitations and suggestions for future research
Although the overall results suggest a good model fit, a few questionnaire items (e.g., ST164, ST165, ST166, and ST188) displayed quite poor fit. A closer look reveals that most of the poorly fitting items display a quite pronounced third mode, or "bump", located between the rapid satisficing component and the optimizing component. It seems reasonable to assume that this extra modality in the distribution is the result of a distinct data-generating process, probably some kind of satisficing behavior. Future studies could investigate whether a four-component mixture model (rapid/rapid/optimizing/slow) or a three-component mixture model (rapid/rapid/optimizing) provides a better fit to these items. Perhaps a combination of three-component and four-component models, fitted to items depending on item characteristics, could be implemented. This is, however, deemed a challenging task and could lead to problems of model fit for the items that are currently well fitting. Very strong priors would have to be set to ensure convergence, and one would have to determine how to combine information from different component mixtures into one motivation parameter. Future research should also analyze the item response patterns of satisficing and optimizing in more depth, model them jointly with response times, and investigate whether scores from individual questionnaire items are affected by satisficing responses.
The current study used data from one national sample only (Sweden), and the findings cannot be generalized to other countries. Future studies could apply similar models to other samples to investigate the generalizability of the findings and possible cross-country differences (Michaelides & Ivanova, 2022; Rios & Soland, 2022).
More fine-grained response process data (e.g., number of actions, response time between actions) were not available for the current study. If such data were available, they could be used to further the understanding of how different response styles affect response times (e.g., to discern between slow idle satisficing and slow optimizing responses).
While we found a positive correlation between questionnaire-taking motivation and PISA reading test performance, it remains unclear whether motivation derived from questionnaire response times and from test item response times are related and measure the same type of motivation; this necessitates further research. While the results from this study suggest that, overall, most items in the PISA questionnaire receive high proportions of motivated, optimizing responses, certain items could receive high proportions of satisficing responses, potentially affecting the validity of the information gathered from those items. Future research should address this concern.
In conclusion, estimating test-taker motivation on questionnaires through models of satisficing and optimizing response times appears to be a promising addition to existing methods that address test-taking motivation.
Response times generated by optimizing on the mth screen come from a lognormal distribution with parameters μ_m2 and σ_m2.
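Using the λ notation from Figure 5, the implied per-screen response time density can be written as a three-component lognormal mixture. This is a notational sketch consistent with the description above, not necessarily the exact parameterization of the fitted model:

```latex
f(t_{pm}) = \sum_{c=1}^{3} \lambda_{mc}\, \operatorname{LogNormal}\!\left(t_{pm} \mid \mu_{mc}, \sigma_{mc}\right),
\qquad \sum_{c=1}^{3} \lambda_{mc} = 1,
```

where t_pm is person p's response time on screen m and c = 1, 2, 3 index the rapid, optimizing, and idle components, respectively.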

Figure 1. Graphical representation of the model.

Figure 2. Posterior predictive density plots of response time distributions (bandwidth = 0.2). The dark line shows the empirical data and the light gray lines show 50 simulations from the fitted model.
Figure 4 shows the model-implied relationship between motivation to optimize on the PISA questionnaire and performance on the PISA reading literacy test. The scatterplot represents the joint location of the test-takers' plausible values and estimated θ values. The top panel shows the simple linear model, and the two bottom panels show the piecewise regression model. The estimates of the b_0 coefficient (median = 148.10, 95% CrI = [139.84, 156.51]) and b_1 coefficient (median = 368.45, 95% CrI = [359.80, 376.96]) imply a clear positive relationship.

Figure 3. Posterior predictive density plots of plausible value distributions (bandwidth = 1). The dark line shows the empirical data and the light gray lines show 50 simulations of data from the fitted model.

Figure 4. PISA reading performance predicted by motivation to optimize. The line shows the mean estimate of reading plausible values (PV) as a function of latent motivation (θ). The shading shows 95% prediction intervals. The scatterplot shows the mean estimate of each test-taker's θ against their PV. The top panel shows the simple linear model, and the two bottom panels show the piecewise regression model.

Figure 5. Plot of the λ parameters 1, 2, and 3, showing estimates of the proportion of rapid, optimizing, and idle responses received by each item. The dots show the mean point estimates and the whiskers show 95% CrIs (some intervals are so small that they are hidden behind the dots). Estimates are plotted in order of presentation from top to bottom.

Figure 6. Model-implied thresholds for each item. The thresholds show the response time at which a satisficing response style becomes more likely than an optimizing response style. The left panel shows thresholds separating rapid satisficing and optimizing responses; the right panel shows thresholds separating optimizing and idle satisficing responses. The black filled dots with whiskers represent the model-implied thresholds (mean point estimate and 95% CrI). The triangles show NT10 thresholds. The dashed vertical line shows the 3-s threshold and the dotted line the 10-s threshold. Estimates are plotted in order of presentation from top to bottom.