Methodological issues with value-based decision-making (VBDM) tasks: The effect of trial wording on evidence accumulation outputs from the EZ drift-diffusion model

Abstract
Most value-based decision-making (VBDM) tasks instruct people to make value judgements about stimuli using wording relating to consumption; however, in some contexts this may be inappropriate. This study aims to explore whether variations of trial wording capture a common construct of value. This is a pre-registered experimental study with a within-subject design. Fifty-nine participants completed a two-alternative forced-choice task where they chose between two food images. Participants completed three blocks of trials: one asked which they would rather consume (standard wording), one asked which image they liked more, and one asked them to recall which image they rated higher during a previous block. We fitted the EZ drift-diffusion model to the reaction time and choice data to estimate evidence accumulation (EA) processes during the different blocks. There was a highly significant main effect of trial difficulty, but this was not modified by trial wording (F = 2.00, p = .11, ηp² = .03, BF10 = .05). We also found highly significant positive correlations between the EA rates across task blocks (rs ≥ .44, ps < .001). The findings provide initial validation of alternative wording in VBDM tasks that can be used in contexts where it may be undesirable to ask participants to make consummatory judgements.

ABOUT THE AUTHORS
Amber Copeland is a PhD student. In her PhD project, she is applying computational models of value-based decision-making (VBDM) to investigate the potential mechanisms that underlie behaviour change and recovery from addiction.
Dr Tom Stafford studies learning and decision making. Much of his research looks at risk and bias, and their management, in decision making. He is also interested in skill learning, using measures of behaviour informed by work done in computational theory, robotics and neuroscience. More recently a strand of his research looks at complex decisions, and the psychology of reason, argument and persuasion.
Professor Matt Field conducts research into the psychological mechanisms that underlie alcohol problems and other addictions. He is particularly interested in the roles of decision-making and impulse control in addiction, recovery, and behaviour change more broadly.

PUBLIC INTEREST STATEMENT
Existing tasks used to explore value-based decision-making (VBDM) often require people to make consummatory judgements; however, there are particular research contexts where this may be inappropriate. This study explored how robust VBDM tasks are to minor alterations in trial wording that do not require people to think about their momentary desire to consume objects depicted in pictures. Results showed that differences in evidence accumulation (EA) rates were equivalent across variations of trial wording. This study contributes towards initial validation of alternative trial wording that can be appropriate to implement in future research where it is undesirable to ask participants to make consummatory judgements, such as research into VBDM in people who are in recovery from addiction.

Introduction
In everyday life, people are confronted with multiple choices which range from trivial (e.g. whether to drink tea or coffee) to important (e.g. whether to say yes or no to a marriage proposal). Value-based decision-making (VBDM) is a theoretical framework that posits, on average, a person makes decisions based upon the things that they value (Berkman et al., 2017; Levy & Glimcher, 2012; Rangel et al., 2008). Based on this account, the overall value for each response option is calculated through a dynamic integration of different sources of value which incorporate the anticipated positive and negative consequences. The computation of the overall value is essential: it enables a person to compare and subsequently choose the response option with the highest overall value (Berkman et al., 2017).
Computational models, such as the drift-diffusion model (DDM; Ratcliff & McKoon, 2008), are widely used because of their ability to parameterise the internal cognitive components of decision-making. The DDM takes behavioural data (response time (RT) and choice accuracy) from two-alternative forced choice (2AFC) tasks as input, and then through a principled reconciliation, recovers decision parameters including evidence accumulation (EA) rate (also known as the 'drift rate', the rate at which momentary evidence is accumulated), response threshold (response caution represented in the speed-accuracy trade-off that a participant maintains), and non-decision time (encoding of stimuli and response execution; Stafford et al., 2020). The assumption that evidence accumulates noisily until it reaches a threshold for responding underlies the DDM (Ratcliff et al., 2016), which has been implemented across various domains of decision-making, including value-based (Berkman et al., 2017; Krajbich et al., 2010; Polanía et al., 2014).
In typical VBDM tasks (e.g. Polanía et al., 2014) participants initially make value judgements about a set of images, and in a subsequent 2AFC task they select the image they rated higher in value as quickly as possible. This experimental procedure generates the behavioural data that are necessary for the DDM to parameterise the internal processes of decision-making because it measures RT (the speed at which participants respond) and choice accuracy (whether the participant chose in accordance with their previous value ratings). The majority of existing VBDM tasks rely on the strength of desire to consume a commodity as a reflection of value (Krajbich et al., 2010; Mormann et al., 2010; Polanía et al., 2014; Tusche & Hutcherson, 2018): participants are initially instructed to rate and subsequently make choices about food images according to how much they would like to consume the food depicted in the image at the end of the experiment. However, the exact terminology used within the trials is often not made explicitly clear.¹ Conceptual work (Field et al., 2020) extended VBDM to recovery from addiction and outlined a number of hypotheses that await empirical testing. More recently, the application of computational models has been advocated to improve methodological rigour in the field of addiction (Pennington et al., 2021). However, methodological considerations, such as trial wording, have impeded the implementation of this research. This is because asking people in recovery to make consummatory judgements about substance-related images could be unethical (e.g. triggering desire to consume a substance could jeopardise recovery) or cause discomfort for people trying to abstain. Variations of trial wording, such as 'which do you like more?', could be less aversive; however, no research has explored whether variations of wording in VBDM tasks do indeed capture a common construct of value.
Given the lack of a standard methodology and the ambiguity in some existing research, it is important to explore how sensitive VBDM tasks are to alterations in wording. To explore this, we investigated whether behavioural data from the same task, but with variations of wording other than 'which would you rather consume', reflect a coherent construct of value as captured by EA rates. Design, hypotheses, and analysis strategy were pre-registered before data collection commenced (https://aspredicted.org/2tm3s.pdf). Our hypotheses were:
(1) There will be no significant differences in EA rates for food images when participants complete the VBDM task with the trial wording 'which would you rather consume?' versus 'which do you like more?' and 'which did you rate higher?'. Specifically, we hypothesise (A) no significant main effect of trial wording on EA rates, and (B) no significant interaction between trial wording and trial difficulty.
(2) There will be a 'difficulty effect'² such that EA rates for food images will significantly decrease alongside increasing difficulty level, and this will be consistent regardless of trial wording.
To complement these pre-registered hypotheses, we also predicted that there would be significant correlations (>.7) between the EA rates across trial wording conditions.

Design
This was an experimental study with a within-subject design. The dependent variable was the EA rate (estimated by fitting the DDM to RT and accuracy data from the VBDM task). Independent variables were trial wording ('consume', 'like', and 'recall' variants, see below) and trial difficulty (easy, medium, and hard). G*Power does not provide a function for conducting a power analysis for a two-way repeated measures ANOVA, so we offered study participation to 60 participants based on heuristics (Lakens, 2021). Existing research was considered only as a guide, and we recruited a larger sample than is the norm in this field (see Table S1 in supplementary materials).

Participants
We recruited 60 participants through Prolific (https://www.prolific.co/) but removed data from one participant who failed all attention checks, in line with the preregistered exclusion criteria (see supplementary materials for detail on attention checks). Our total sample therefore comprised 59 participants (33 females and 26 males), and ages ranged from 19 to 66 years old (M = 35.08, SD = 12.78). Inclusion criteria were age ≥18 years, current residence in the United Kingdom, having no dietary restrictions (e.g. being vegetarian/vegan), not following any diet (e.g. Weight Watchers), and having ≥95% approval rate from previous Prolific participations. The University of Sheffield research ethics committee approved the study,³ and all participants gave informed consent. Recruitment took place in November 2020 and participants were reimbursed with £6.25 Prolific credit for their time.

Pictorial stimuli
The 30 food images used in this study were taken from an image database (CROCUFID: Toet et al., 2019) which is accompanied by valence ratings. There are a variety of images: the food depicted varied from being fresh to moulded, rotten, and partly consumed. This meant that images could be selected in order to solicit differential value judgements (standardised CROCUFID images are available from the OSF repository at https://osf.io/5jtqx; see supplementary materials for images used in this study).

Brief self-report questions
Participants answered demographic questions (age and gender) and reported their current level of hunger using a visual analogue scale that ranged from 0 (I am not hungry at all) to 10 (very hungry). The mean hunger level was 4.49 (SD = 2.61).

Procedure
Participants completed the study online, which took an average of 28.45 minutes (SD = 14.29). Participants first completed self-report questions prior to completing the image-rating phase and the VBDM task (both programmed in PsychoPy and hosted on Pavlovia; Peirce et al., 2019).

Image-rating phase
Participants viewed 30 food images and made value judgements about them by placing each of the images into one of the four boxes using a computer mouse to indicate how much they would like to consume the food depicted in the image 'right now', ranging from: 'A lot', 'A little bit', 'Not really', and 'Not at all'. Participants were instructed to rate all 30 images while assigning at least five to each value category. Subsequently, five images were randomly selected from each value category for use in the VBDM task.
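The selection step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' PsychoPy implementation; the function name `select_task_images` and the data layout (a mapping from image name to rating category) are assumptions made for the example.

```python
import random

# The four value categories used in the image-rating phase.
CATEGORIES = ["A lot", "A little bit", "Not really", "Not at all"]


def select_task_images(ratings, per_category=5, rng=None):
    """Randomly pick `per_category` images from each value category.

    `ratings` maps an image name to one of CATEGORIES (hypothetical layout).
    Raises ValueError if a category holds fewer images than required,
    mirroring the instruction to assign at least five images per category.
    """
    rng = rng or random.Random()
    selected = {}
    for category in CATEGORIES:
        pool = sorted(name for name, cat in ratings.items() if cat == category)
        if len(pool) < per_category:
            raise ValueError(
                f"Need at least {per_category} images rated {category!r}"
            )
        selected[category] = rng.sample(pool, per_category)
    return selected
```

With 30 rated images and at least five per category, this yields 20 images (five per category) for the 2AFC trials.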

VBDM task
To begin, the five images randomly selected from each of the four value categories were displayed in the centre of the screen for 3 seconds each, followed by a 500 ms fixation cross, in order to remind participants about how they had ranked the images and to show them which subset of images had been randomly selected for use in the task. Participants subsequently completed the 2AFC trials. In each trial, two images appeared (one on the left and one on the right), and participants were instructed to press one of the two computer keys ('Z' for left and 'M' for right) to select one of the images as quickly as possible (see Figure 1). Participants first completed a practice block consisting of six trials. In the real task, there were three blocks of trials and in each block, participants were asked to think about the images in a different way. Specifically, the trial instructions varied between 'which would you rather consume?' ('consume' condition), 'which do you like more?' ('like' condition) and 'which did you rate higher?' ('recall' condition). Block order was randomised, with 150 trials in each, making 450 trials in total with a short break after every 50 trials. Difficulty levels across trials were varied, in that the difference in ratings between the two images could be 1, 2, or 3 (hard, medium, and easy choices, respectively). In each trial, there was a correct answer, and whether this appeared on the left or the right of the screen was random.
Participants were given a maximum of 4 seconds to respond in each trial; responses outside of this window were classed as 'miss trials', as commonly used in VBDM tasks (e.g. Polanía et al., 2014).

Figure 1. Example trials in the VBDM task.
Note. Trial wording varied across the three blocks: 'which would you rather consume?', 'which do you like more?', and 'which did you rate higher?'. Participants were instructed to press a key to select either of the images ('Z' for left, 'M' for right). Participants had up to 4 seconds to make their decision per trial, and each trial was followed by a 500 ms fixation cross located in the centre of the screen.
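The difficulty coding and accuracy scoring described for the 2AFC trials might be expressed as in the sketch below. It assumes, purely for illustration, that the four rating categories are coded numerically from 1 ('Not at all') to 4 ('A lot'); the function names are hypothetical and this is not the task code used in the study.

```python
# Rating difference -> difficulty label, as described in the text.
DIFFICULTY_BY_DIFFERENCE = {1: "hard", 2: "medium", 3: "easy"}


def classify_trial(left_rating, right_rating):
    """Return (difficulty, correct_side) for two ratings coded 1-4.

    The numeric coding (4 = 'A lot', 1 = 'Not at all') is an assumption.
    """
    difference = abs(left_rating - right_rating)
    if difference not in DIFFICULTY_BY_DIFFERENCE:
        raise ValueError("Images in a trial must differ by 1, 2, or 3 steps")
    correct_side = "left" if left_rating > right_rating else "right"
    return DIFFICULTY_BY_DIFFERENCE[difference], correct_side


def score_response(key, correct_side):
    """Map the 'Z'/'M' response keys to sides; return 1 if correct, else 0."""
    chosen = {"z": "left", "m": "right"}[key.lower()]
    return int(chosen == correct_side)
```

For example, a trial pairing an image rated 'A lot' on the left with one rated 'Not at all' on the right is an easy trial whose correct response is 'Z'.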

Data preparation and analysis
On the VBDM task, 'miss trials' (responses exceeding 4 seconds; 0.24% of trials) and trials with RTs under 300 ms (0.03%), which are likely to be fast guesses (Ratcliff et al., 2006), were removed, resulting in the overall removal of 0.27% of trials.
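A minimal sketch of this exclusion rule is given below. The trial representation (a list of dictionaries with an `rt` field in seconds, `None` when no response was recorded) and the function name are assumptions for illustration; the authors' analysis scripts are in R and may be organised differently.

```python
def filter_trials(trials, min_rt=0.3, max_rt=4.0):
    """Drop miss trials and fast guesses.

    Keeps trials whose RT (in seconds) lies between `min_rt` and `max_rt`
    inclusive. Returns the kept trials and the fraction of trials removed.
    """
    kept = [
        t for t in trials
        if t["rt"] is not None and min_rt <= t["rt"] <= max_rt
    ]
    removed_fraction = 1 - len(kept) / len(trials) if trials else 0.0
    return kept, removed_fraction
```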
There are a wide variety of sequential sampling models that are based on the common underlying assumption that decisions arise from a noisy accumulation-to-threshold process (Bogacz et al., 2006; Busemeyer et al., 2019). A prominent example is the DDM (Ratcliff & McKoon, 2008), of which there are also variations, but no general consensus of which model is optimal to use. We fitted the EZ-DDM (Wagenmakers et al., 2007) to RT and accuracy data from the VBDM task. This simplified version of the DDM is a powerful tool (Van Ravenzwaaij et al., 2017) that allows researchers to overcome the complexity of the parameter fitting procedure of the 'full' DDM. Crucially, research has demonstrated that relatively simple models such as the EZ-DDM yield comparatively accurate and robust inferences when compared to more complex model fitting approaches (Dutilh et al., 2019; Lin et al., 2020). The EZ-DDM takes the mean correct RT, variance of correct RT, and response accuracy as input and produces three key parameters, which are drift rate (v), boundary separation (a), and non-decision time (Ter). We estimated the parameters for each participant in each condition and for each difficulty level.
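For readers unfamiliar with the EZ-DDM, its closed-form equations (Wagenmakers et al., 2007) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' analysis script: the function name, the scaling parameter default `s = 0.1` (the convention in the original EZ paper; the scaling used in this study is not stated here), and the handling of perfect accuracy are assumptions.

```python
import math


def ez_ddm(prop_correct, rt_variance, rt_mean, n_trials=None, s=0.1):
    """Closed-form EZ-diffusion estimates; returns (v, a, Ter).

    prop_correct: proportion of correct responses
    rt_variance:  variance of correct RTs (seconds squared)
    rt_mean:      mean of correct RTs (seconds)
    n_trials:     trial count, needed only for the edge correction
    s:            scaling parameter (assumed default)
    """
    pc = prop_correct
    if pc == 1.0:
        # One commonly used edge correction for perfect accuracy.
        if n_trials is None:
            raise ValueError("n_trials is required when accuracy is 1.0")
        pc = 1 - 1 / (2 * n_trials)
    if pc in (0.0, 0.5):
        raise ValueError("EZ equations are undefined for accuracy 0 or 0.5")

    logit = math.log(pc / (1 - pc))
    x = logit * (logit * pc**2 - logit * pc + pc - 0.5) / rt_variance
    v = math.copysign(1.0, pc - 0.5) * s * x**0.25   # drift rate
    a = s**2 * logit / v                              # boundary separation
    y = -v * a / s**2
    mean_decision_time = (a / (2 * v)) * (1 - math.exp(y)) / (1 + math.exp(y))
    ter = rt_mean - mean_decision_time                # non-decision time
    return v, a, ter
```

Applied once per participant, wording condition, and difficulty level, a function of this kind would produce the EA rate (v) analysed in the results.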
Two-way repeated measures ANOVA and correlational analyses were used to analyse EA rates in accordance with our primary hypotheses. We calculated Bayes factors in JASP using default priors⁴ (version 0.16; JASP Team, 2021) because we hypothesised specifically that trial wording would not affect EA rates and this method allows us to quantify evidence in favour of the null hypothesis beyond p-values (Wagenmakers et al., 2018). We used common cut-offs for interpretation (Jeffreys, 1961), with Bayes factors greater than 3 or lower than 0.3 representing evidence in favour of the experimental and null hypotheses, respectively. All other analyses were conducted in RStudio version 4.0.2 (R Core Team, 2020). We did not make any pre-registered hypotheses about other DDM outputs (response thresholds and non-decision times), but these exploratory analyses are reported in the supplementary materials.
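The interpretation rule for Bayes factors stated above can be written as a small helper. The function name and the 'inconclusive' label for values between the two cut-offs are assumptions added for illustration; only the cut-offs of 3 and 0.3 come from the text.

```python
def interpret_bf10(bf10, upper=3.0, lower=0.3):
    """Classify a Bayes factor (BF10) using the cut-offs given in the text."""
    if bf10 > upper:
        return "evidence for the experimental hypothesis"
    if bf10 < lower:
        return "evidence for the null hypothesis"
    # Label chosen for this sketch; the text does not name this range.
    return "inconclusive"
```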

Preregistered analyses
EA rates were analysed using a two-way repeated measures ANOVA with trial wording (3: 'consume'; 'like'; 'recall') and trial difficulty (3: easy; medium; hard) as within-subject variables. There was a significant main effect of difficulty, F(2, 116) = 286.34, p < .001, ηp² = .83, with the Bayes factor indicating extreme evidence in favour of the experimental hypothesis (BF10 > 100), but not of trial wording, F(2, 116) = 2.81, p = .06, ηp² = .05, with the Bayes factor indicating moderate evidence in favour of the null hypothesis (BF10 = .19). Furthermore, there was no significant interaction between trial wording and trial difficulty, F(3.39, 196.47) = 2.00, p = .11, ηp² = .03, with the Bayes factor indicating strong evidence in favour of the null hypothesis (BF10 = .05).⁵ Post-hoc tests for the significant main effect of difficulty⁶ (applying the Holm-Bonferroni correction to p-values for multiple comparisons) revealed that EA rates in the easy trials (M = 2.53, SD = .82) were significantly higher compared to medium trials (M = 1.98, SD = .73; p < .001) and hard trials (M = 1.19, SD = .50; p < .001). Furthermore, EA rates on medium trials were significantly higher compared to EA rates on hard trials (p < .001). Overall, as shown in Figure 2, these findings demonstrate that EA rates increased as trial difficulty decreased, and this was not modified by trial wording.
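As a consistency check, the partial eta squared values reported above can be recovered from each F statistic and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The helper name below is hypothetical, and the sketch is offered only to show the arithmetic.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its df."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values taken from the ANOVA reported in the text.
difficulty = partial_eta_squared(286.34, 2, 116)
wording = partial_eta_squared(2.81, 2, 116)
interaction = partial_eta_squared(2.00, 3.39, 196.47)
```

Rounded to two decimals, the three values agree with the reported effect sizes of .83, .05, and .03.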
We conducted Pearson's correlation coefficient analyses to explore the direction, strength, and significance of the relationships between the EA rates across the different trial wording blocks. As shown in Figure 3, these correlational analyses revealed highly significant positive correlations between the EA rates in all three blocks (all ps < .001; consume and recall wording, r(57) = .68, p < .001; consume and like wording, r(57) = .46, p < .001; recall and like wording, r(57) = .44, p < .001).
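The correlational analysis can be illustrated with a short standard-library sketch that computes Pearson's r and converts it to a t statistic with n − 2 degrees of freedom (57 for the 59 participants here). The function names are hypothetical; the authors' own analyses were run in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, df):
    """t statistic for testing r against zero, with df = n - 2."""
    return r * math.sqrt(df / (1 - r**2))
```

For the smallest reported coefficient, `t_from_r(0.44, 57)` gives a t value of roughly 3.70, in line with the reported p < .001.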

Figure 2. Mean evidence accumulation (EA) rates split by trial difficulty and trial wording.
Notes. Light blue (circle) represents EA rates with the wording 'which would you rather consume', orange (triangle) represents EA rates with the wording 'which do you like more', and dark blue (square) represents EA rates with the wording 'which did you rate higher'. Error bars represent the standard error of the mean (SE).

Figure 3. Scatterplots to show correlations between the EA rates during the three blocks of trials.
Notes. On the left is the correlation between the EA rates during the block of trials with 'which would you rather consume' and the block of trials with 'which did you rate higher'. In the middle is the correlation between the EA rates during the block of trials with 'which would you rather consume' and the block of trials with 'which do you like more'. On the right is the correlation between the EA rates during the block of trials with 'which did you rate higher' and the block of trials with 'which do you like more'. The grey dashed line represents the line of equality. Shaded areas represent the 95% confidence interval.

Discussion
We explored whether a coherent construct of value was captured by variations of trial wording that do not require participants to think about their momentary desire to consume objects depicted in pictures. As hypothesised, EA rates significantly increased alongside decreasing trial difficulty, and this was not modified by trial wording. Contrary to our hypothesis, there was a trend towards a main effect of trial wording on EA rates, regardless of trial difficulty, reflective of lower EA rates when participants considered how much they 'liked' the images compared to when asked to indicate how much they wanted to consume the food depicted or to recall which they had rated higher previously. This effect, however, was sensitive to the software used to fit the DDM, and it fell short of statistical significance when data were analysed using the EZ-DDM in accordance with our pre-registered analysis plan. Finally, the EA rates across the different blocks were significantly positively correlated with each other, although the coefficients were not as large (>0.70) as we predicted (Abma et al., 2016).
These findings are important because the majority of VBDM tasks require participants to make choices across trials about their strength of desire to consume a commodity (Krajbich et al., 2010; Mormann et al., 2010; Polanía et al., 2014; Tusche & Hutcherson, 2018), and our findings suggest that variations of trial wording are viable alternatives that can capture value-based choices. Importantly, the variations in trial wording are interpreted to be viable alternatives as opposed to completely identical substitutes because the correlations between the EA rates varied in strength across comparisons and the main effect of trial wording approached statistical significance. Furthermore, in line with other research, the EA rates increased as trial difficulty decreased (Polanía et al., 2014). This establishment of a 'difficulty effect', regardless of the wording of the value-based question on that block of trials, is important because it demonstrates that the way that participants were responding during the 2AFC trials was compatible with the value judgments that they had made during the image-rating task.
The core focus of this study was EA rates because this parameter is hypothesised to represent value (Field et al., 2020); however, analyses on other decision parameters derived from the EZ-DDM (response thresholds and non-decision times) are presented in the supplementary materials. An implication of this research is that variations of trial wording used in this study may be appropriate to implement in future research contexts whereby it is undesirable to ask participants to make consummatory judgements, such as those in recovery from addiction (Field et al., 2020). It may be less aversive for people in recovery to express preferences in relation to how much they like an image rather than in relation to their desire to consume the item depicted; indeed, similar procedures have been implemented with people with addiction in clinical settings (Moeller & Stoops, 2015). A limitation of this study is that the EZ-DDM model fitting approach (Wagenmakers et al., 2007) precluded the fixing of decision parameters across conditions; however, when we used alternative software (the fast-dm-30; Voss et al., 2015) that enabled us to fix decision parameters, the main effect of trial wording on EA rates became statistically significant (see supplementary materials). A broader issue is that alternative, more complex models may provide a better fit to behavioural data and thereby be more sensitive to the effects of trial wording (Colas, 2017), which is an important topic for future research. A further limitation is that participants made value ratings, and were reminded of these, prior to completion of the 2AFC task, which may have inadvertently affected participant behaviour. Future research could explore the impact of all combinations of initial value ratings and subsequent wording of 2AFC trials using a full factorial design to investigate the robustness of the findings reported in this study.
To conclude, this study was an exploration into the sensitivity of VBDM tasks to minor alterations in trial wording. Results demonstrated robust evidence that differences in EA rates (sensitivity to whether trials are easy, medium, or hard determined by participants' own value ratings) were equivalent across the variations of trial wording. EA rates were affected by trial wording; however, this was confined to absolute EA rates, was not robust, and appeared sensitive to the software used to fit the DDM. This study contributes towards an initial validation of alternative wording that can be appropriate to implement in future research in which it is undesirable to ask participants to make consummatory judgments about the stimuli, such as research into VBDM in people who are in recovery from addiction.

Notes
1. See Table S1 in supplementary materials for a comparison of methods across different studies (including whether precise trial wording is included within the manuscript).
2. Difficulty is operationalised as the difference in value rating between the two images presented on each trial of the VBDM task. Trials are categorised into the following difficulty levels: hard (difference of 1), medium (difference of 2), or easy (difference of 3).
3. Ethical approval number: 033881.
4. r scale fixed effects = 0.5.
5. As requested by an anonymous reviewer, we repeated these analyses using alternative software, the fast-dm-30 (Voss et al., 2015), which enabled us to explore if trial wording would affect EA rates after fixing response thresholds across difficulty conditions and non-decision times across all conditions (i.e. per participant). These analyses are reported in the supplementary materials, but in brief we found that the main effect of trial wording became statistically significant when analysed with the fast-dm-30, with a Bayes factor indicative of moderate evidence in favour of the experimental hypothesis (BF10 = 4.18), contrasting with moderate evidence in favour of the null (BF10 = .19) when the EZ-DDM was used.
6. See supplementary materials for post-hoc tests on the main effect of trial wording and interaction between trial wording and trial difficulty.

Correction
This article was originally published with errors, which have now been corrected in the online version. Please see Correction (http://dx.doi.org/10.1080/23311908.2022.2086707).

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
Data and analysis scripts are available and can be found on researchbox: https://researchbox.org/505

Open Scholarship
This article has earned the Center for Open Science badges for Open Data and Preregistered. The data and materials are openly accessible at https://researchbox.org/505 and https://aspredicted.org/2tm3s.pdf.

Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/23311908.2022.2079801