Probability of minority inclusion is underestimated

ABSTRACT Perception of the probability of minority inclusion in the groups with which we interact is important to daily behaviours (e.g., teachers may consider the probability that a class of 30 students includes at least one gay/bisexual student). The present study showed that the participants surprisingly underestimated this probability even when the group size and the prevalence of the minority were given. Approximately 90% of the participants estimated lower than the mathematically normative probability. The underestimation was larger than in the case of the arithmetically isomorphic probability of the cumulative risk, suggesting a cognitive bias specific to the probability of inclusion. Some of the heuristics used for the estimations, such as the participants using an expected value, were relevant to the underestimation. This cognitive bias may mislead people into believing that minorities are irrelevant to them. It was also shown that the participants’ attitudes became more inclusive when they were informed of the normative probability of inclusion.


Introduction
Imagine you are a teacher addressing 30 new students. What is the probability that one or more of these students has colour deficiency? What is the probability of including gay or bisexual students? If you estimate these probabilities to be low, you may not be motivated to use accessible colour schemes and bias-free language. However, these minority traits are not visually apparent. In addition, minority people often hide these traits/identities to avoid harassment and discrimination. For example, in the EU and the UK, only 5% of LGBTI youth (aged 15-17) are very open (European Union Agency for Fundamental Rights, 2020). We have difficulty perceiving the presence of minorities in everyday situations, which requires probabilistic thinking. Hence, estimating the probability of inclusion ($p_{inc}$), namely, the probability that a person with a given trait is included in a group, is critical for real-world decision making in uncertain situations.
In this study, I examine how people estimate $p_{inc}$. When information about a group is scarce, $p_{inc}$ can be estimated using a binomial distribution. Given the group size n and prevalence of the trait in the general population q, $p_{inc}$ is obtained as $1 - (1 - q)^n$. However, in most cases, it is beyond humans' cognitive ability to calculate $p_{inc}$ mentally (e.g. $1 - (1 - 0.03)^{30}$ can only be calculated using a computer). Further, the cognitive psychology literature indicates that binomial probabilities are often misperceived. Using gambling tasks, early studies showed that conjunctive probabilities ($q^n$), namely, the probability of winning n consecutive gambles with a single win probability q, are overestimated, whereas disjunctive probabilities ($1 - (1 - q)^n$), namely, the probability of at least one win, are underestimated (Bar-Hillel, 1973; Cohen et al., 1971; Cohen & Hansel, 1957; Slovic, 1969). The latter is arithmetically isomorphic to $p_{inc}$. The same misperception has been reported for risk. If the risk of an accident occurring each year is q, then the cumulative risk (i.e. the probability of at least one accident during n years) is given by $1 - (1 - q)^n$. This is also isomorphic to $p_{inc}$. Underestimation has often been reported for the cumulative risk (De La Maza et al., 2019; Juslin et al., 2015), but it can also be overestimated (Doyle, 1997; Fuller et al., 2004).
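As a minimal sketch of the formulas above, the following Python snippet evaluates the disjunctive probability $1 - (1 - q)^n$ and, for contrast, the conjunctive probability $q^n$. The function names `p_inc` and `p_all` are illustrative assumptions rather than part of the study's materials, and group members are assumed to be independent draws, as in the binomial model.

```python
def p_inc(q: float, n: int) -> float:
    """Disjunctive probability: at least one of n independent members has the trait."""
    return 1.0 - (1.0 - q) ** n


def p_all(q: float, n: int) -> float:
    """Conjunctive probability: all n independent members have the trait."""
    return q ** n


# Example from the text: prevalence q = 3%, group size n = 30.
print(f"p_inc = {100 * p_inc(0.03, 30):.1f}%")
```

With the 3%-30 example, the disjunctive probability is roughly 60%, far above the prevalence of 3%.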
Although it is not surprising that people cannot mentally calculate these probabilities, the estimated probabilities are not random but biased because people often use cognitively effortless strategies, called heuristics, to estimate probabilities. Tversky and Kahneman (1974) proposed an "anchoring and adjustment" heuristic as a source of such bias. The elementary risk q is usually small, say 1%, and serves as a starting point ("anchor") for the estimation. However, the cumulative risk is often much higher than q (e.g. 39.5% if n is 50). While people do make adjustments, such adjustments are often insufficient, leading disjunctive probabilities to be underestimated. Subsequent studies have found a variety of heuristics for estimating the cumulative risk and shown that both underestimation and overestimation occur depending on q, n, and the heuristic used. People often use an additive heuristic (i.e. adding a constant value for each additional n). If one simply adds the elementary risk q for each n, it is also called a multiplicative heuristic (e.g. if annual risk q = 1%, then 10% risk for 10 years). The multiplicative heuristic provides good estimates when q and n are small, while it incurs overestimation bias for large q and n (Doyle, 1997; Fuller et al., 2004). Additive and multiplicative heuristics may even yield values above 100% (e.g. q = 5% and n = 50 yield 250%). In such cases, people either report the value as it is (De La Maza et al., 2019) or cap it at 100% (truncated heuristic; Doyle, 1997).
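The heuristics described above can be compared with the normative value in a short sketch. The function names are hypothetical labels chosen here for illustration; the (q, n) pairs are the examples mentioned in the text.

```python
def normative(q: float, n: int) -> float:
    """Normative cumulative risk: at least one event in n independent periods."""
    return 1.0 - (1.0 - q) ** n


def multiplicative(q: float, n: int) -> float:
    """Multiplicative heuristic: add the elementary risk q once per period (may exceed 1)."""
    return q * n


def truncated(q: float, n: int) -> float:
    """Truncated heuristic: multiplicative estimate capped at 100%."""
    return min(1.0, multiplicative(q, n))


for q, n in [(0.01, 10), (0.01, 50), (0.05, 50)]:
    print(
        f"q={q:.0%} n={n}: normative={normative(q, n):.1%} "
        f"multiplicative={multiplicative(q, n):.0%} truncated={truncated(q, n):.0%}"
    )
```

For small q and n the multiplicative estimate stays close to the normative value, whereas for q = 5% and n = 50 it reaches 250% unless it is capped.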
Another common heuristic is the constant heuristic, which ignores n and sticks to q (e.g. if annual risk q = 1%, then the cumulative risk is 1% for any number of years; De La Maza et al., 2019; Doyle, 1997; Shaklee & Fischhoff, 1990). Using the constant heuristic underestimates the cumulative risk. People also use a mean heuristic to estimate the cumulative risk (Juslin et al., 2015) based on the mean of elementary risks over time. This heuristic is equivalent to the constant heuristic when the elementary risk is constant.
Do people misperceive $p_{inc}$ as well as the cumulative risk? In the present study, I asked participants to estimate $p_{inc}$ with various topics, qs, and ns. It seemed likely that $p_{inc}$ would be estimated more accurately than the cumulative risk, since a knowledge-based cognitive schema may mitigate reasoning fallacies (Chen & Holyoak, 1985). For example, through daily lives, people may have learned that $p_{inc}$ for women among a group of 100 randomly chosen individuals should be much higher than the prevalence (≈ 50%), as gender is visually expressed in most cultures. Such learning is unlikely to occur for cumulative risks over many years. Another hypothesis was that people overestimate $p_{inc}$ because social surveys have revealed that the prevalence of ethnic and sexual minorities in the general population is often overestimated compared with reality (Citrin & Sides, 2008; Ipsos, 2015; Newport, 2015; Wong, 2007). Both sociological factors (Alba et al., 2005; Gallagher, 2003; Lee et al., 2019; Martinez et al., 2008) and cognitive factors (Kardosh et al., 2022; Khaw et al., 2021; Landy et al., 2018) underlie this misperception. If there is a general bias towards overestimating the number of minorities, $p_{inc}$ would also be overestimated.
Examining the perception of $p_{inc}$ also raises a theoretical issue about the study of probability judgements. As reviewed above, previous studies have considered disjunctive probabilities as an issue on the temporal axis (e.g. the cumulative risk over time and consecutive gambles). By contrast, $p_{inc}$ is a disjunctive probability over a population rather than over time. While cognitive fallacies in judgements of population-based probability are well known (e.g. base-rate neglect; Casscells et al., 1978; Kahneman & Tversky, 1973; Stengård et al., 2022), the perception of $p_{inc}$ has not been empirically studied yet. If people make probability judgements using different mental models in the time domain and the population domain, then $p_{inc}$ is estimated differently from the cumulative risk.

Participants
A sample size of 90 was planned for each of five conditions. A total of 450 individuals were recruited through a crowdsourcing platform (crowdworks.jp). The participants received a monetary reward of JPY 250. Twenty-one participants did not follow the attention check instructions (Gummer et al., 2021) and their data were excluded from the analyses. The remaining 429 comprised the final sample (236 women, 193 men, self-reported age 18-76 years [M = 40.8, SD = 9.8]). See the Supplementary Materials for the other demographic data and determination of the sample size. All the participants provided consent to participate in advance.

Design and conditions
Figure 1 shows the overview of the experiment. The participants were randomly assigned to one of five conditions: Negative, Positive, Visible, Neutral, and Majority. Each participant made five estimations of $p_{inc}$ (Q1-Q5). In each problem, the prevalence q (as a percentage) of the trait in question and group size n were given. The content of Q1-Q4 varied among the conditions, while Q5 was constant (blood type problem, q = 20%, n = 30). Hereafter, for convenience, the combination of q and n values used in each problem is abbreviated to q%-n (e.g. "20%-30").
In the Negative condition, the participants estimated $p_{inc}$ for minority traits often negatively stereotyped (colour deficiency [q = 3%], gay/bisexual students [7%]) in Q1-Q4. In the Positive condition, synaesthesia (3%) and absolute pitch (7%) were used as minority traits often positively stereotyped. These conditions were used to examine any effect of stereotypes on minorities. In the Visible condition, relatively visible minorities were used (foreigners [3%], police women [7%]). If participants had accurate knowledge of $p_{inc}$ through everyday experience, their estimations should be more accurate in this condition than in the Neutral condition. The Neutral condition was designed to examine the baseline without the influence of participants' knowledge of real minority traits. In this condition, fictional topics of sweat content "PS22" (3%) and gene "Cg-1X" (7%) were used. In the Majority condition, the participants estimated $p_{inc}$ for normal colour vision (97%) and heterosexual students (93%). The purpose of this condition was to test whether $p_{inc}$ for minorities and $p_{inc}$ for majorities are estimated in the same way. In all conditions, two group sizes (n = 30 and 80) were used. All the topics and q values were determined to be realistic in the context of Japan at the time of the experiment (see Table 1).

The questionnaires
The online questionnaires were designed using lab.js (Henninger et al., 2022). Since there were five conditions, five questionnaires were designed. See the Supplementary Materials for the complete list of the questions and texts. All the instructions and questions were written in Japanese, and the examples below are the English translations.
Figure 1. Overview of the conditions and procedures in Experiment 1. Each participant estimated $p_{inc}$ (probability of inclusion) for five problems (Q1-Q5). The topics of Q1-Q4 varied across conditions. The original questions were in Japanese. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size.
On the cover page, the participants reported their age, gender identity, highest level of education, and blood type. The blood type item was included to test whether the estimated $p_{inc}$ for blood type B (blood type problem, Q5) differed between the participants with blood type B and others. The contents of the subsequent pages varied by condition, as described below.
2.1.3.1. Negative condition. On page 2, two problems on colour deficiency were presented (Q1, 3%-30 and Q2, 3%-80):
Please answer in numbers. No need to enter "%".
Some people have difficulty distinguishing between reddish and greenish colours. Medically, this is called colour deficiency. It is said that 3% of the population has colour deficiency.
What do you think the percent probability is that there is even one person having colour deficiency among 30 people? [ ]
What do you think the percent probability is that there is even one person having colour deficiency among 80 people? [ ]
On page 3, two problems on gay/bisexual students were presented (Q3, 7%-30 and Q4, 7%-80). Then, the blood type problem (Q5, 20%-30) was presented on page 4 (see Figure 2(a) for the question text). On the next page, the participants completed a comprehension rating (Q6, "Did you understand the meaning of the question on the previous page?", 1: not at all to 4: very well) and an attention check (Q7, "To confirm that you are reading the questions, choose 1 for the options below", 1: not at all to 4: very well). The attention check item was used to detect inattentive respondents (see Gummer et al., 2021). The prevalence estimation task followed (page 6). This task was introduced to examine whether prevalence would be overestimated for the topics used in the experiment. The topics that appeared in the $p_{inc}$ problems (Q1-Q4) of the other conditions (except for the fictional topics of the Neutral condition) were used for this task. In the Negative condition, for example, the participants estimated the prevalence of synaesthesia, absolute pitch, foreigners, and police women.
Finally, the information on the actual prevalence (Table 1) of the minority examples that appeared in the questionnaire was provided (debriefing, page 7).
2.1.3.2. Positive condition. Two $p_{inc}$ estimation problems on synaesthesia were presented on page 2 (Q1, 3%-30 and Q2, 3%-80) and another two problems on absolute pitch were presented on page 3 (Q3, 7%-30 and Q4, 7%-80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were colour deficiency, gay/bisexual students, foreigners, and police women.
Table 1. The topics used in Experiment 1 and the results of the prevalence estimation task. The prevalence of each topic was estimated by the Experiment 1 participants who did not see the topic in the preceding $p_{inc}$ problems (see Figure 1). Medians of the estimated prevalence are shown with 95% CIs. The actual prevalence values were adopted from the references. In reality, it is often impossible to determine the prevalence of the traits definitively. These traits have a wide continuum of individual differences and may even be multidimensional (Bermudez & Zatorre, 2009; Bosten, 2019; Epstein et al., 2012; Simner, 2012).
2.1.3.3. Visible condition. Two $p_{inc}$ estimation problems on foreigners were presented on page 2 (Q1, 3%-30 and Q2, 3%-80) and another two problems on police women were presented on page 3 (Q3, 7%-30 and Q4, 7%-80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were colour deficiency, gay/bisexual students, synaesthesia, and absolute pitch.
2.1.3.4. Neutral condition. Two $p_{inc}$ estimation problems on fictional sweat content were presented on page 2 (Q1, 3%-30 and Q2, 3%-80) and another two problems on a fictional gene were presented on page 3 (Q3, 7%-30 and Q4, 7%-80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were colour deficiency, gay/bisexual students, synaesthesia, absolute pitch, foreigners, and police women. In addition, the participants were debriefed that the topics used in the $p_{inc}$ estimation problems were fictional.
2.1.3.5. Majority condition. Two $p_{inc}$ estimation problems on normal colour vision were presented on page 2 (Q1, 97%-30 and Q2, 97%-80) and another two problems on heterosexual students were presented on page 3 (Q3, 93%-30 and Q4, 93%-80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were synaesthesia, absolute pitch, foreigners, and police women.
Figure 2. Results of Experiment 1, participants' estimates of the probability of inclusion ($p_{inc}$) at various settings of q and n. A histogram of the estimated $p_{inc}$ for the blood type problem (Q5) is shown in (a). The blood type problem was presented to all the participants of Experiment 1. The results of the other four $p_{inc}$ problems (Q1-Q4) of the conditions other than the Majority condition are shown in (b) to (e). The topics used in these problems varied among the conditions. The results of the Majority condition are shown in (f) to (i). Estimates less than 0 or larger than 100 are not shown in this figure. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size. EV, expected value.

Analysis of the estimates
All the $p_{inc}$ estimates were recorded as integers because of the function of the online questionnaires. Negative estimates and estimates over 100 were included in the analyses, but were rare (0.2% of all the $p_{inc}$ estimates). Figure 2 shows the distributions of the participants' $p_{inc}$ estimates.

Blood type problem (Q5)
The blood type problem (20%-30) was common to all the participants. As shown in Figure 2(a), $p_{inc}$ was greatly underestimated. The normative $p_{inc}$ was 99.9, and 90.9% (390) of the estimates were below this. The median estimate was 20, which was significantly below the normative $p_{inc}$ (signed-rank test, T = 924, p < .001). Further, 50.8% of the estimates were equal to or less than q. A substantial proportion of the participants (17.9%) estimated "20" (= q), suggesting the use of the constant heuristic. An unexpected result was the considerable number of estimates well below q, particularly "6" (16.3%). An estimate of six corresponds to the expected value (EV), namely, 20% of 30. These estimates were not attributable to using the previously reported heuristics to estimate the cumulative risk. The estimates by blood type B participants (Mdn = 25, N = 86) were not different from those by non-type B (including "don't know") participants (Mdn = 20, N = 343; Wilcoxon rank-sum test, W = 14088, p = .517, Cliff's d = .045).
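The three reference values discussed above for the blood type problem can be reproduced with a few lines of arithmetic. This is only an illustrative sketch; the variable names are assumptions introduced here, and the EV line shows the expected number of people, which the EV-heuristic responses appear to report in place of a probability.

```python
q, n = 0.20, 30  # blood type problem (Q5): prevalence 20%, group size 30

normative_pct = 100 * (1 - (1 - q) ** n)  # probability of at least one member with the trait
constant_pct = 100 * q                    # constant heuristic: ignore n and report q
expected_count = q * n                    # EV heuristic: expected number of people with the trait

print(f"normative p_inc: {normative_pct:.1f}%")
print(f"constant heuristic: {constant_pct:.0f}")
print(f"expected value: {expected_count:.0f}")
```

The gap between the normative value (about 99.9%) and the two heuristic answers (20 and 6) illustrates how large the underestimation can be.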
The participants were partially aware of their difficulty in estimating $p_{inc}$. The blood type problem estimates of the participants who reported a subjectively good comprehension (rated 3 or 4 for the comprehension rating; N = 279) were significantly higher (Mdn = 40) than those who felt they had not comprehended the question (Mdn = 20; W = 17215, p = .002, Cliff's d = .177). However, 90.0% of those who reported a good comprehension still underestimated the $p_{inc}$ and 45.2% estimated equal to or less than q.
Did the participants lack the relevant mathematical knowledge? Although I did not ask the participants if they understood binomial probability, I did find an effect of education. When the participants were split into relatively highly educated (N = 265, approximately 4 or more years of higher education) and relatively less educated (N = 163) groups, the former provided higher estimates for the blood type problem than the latter (Mdn = 30 and 20, respectively; W = 18122, p = .005, Cliff's d = .161). Nevertheless, the highly educated group still showed a substantial underestimation (90.6%). Participant age did not correlate with the estimates (Spearman's ρ = .03).
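The group comparisons above report Cliff's d as the effect size. As a rough sketch of how this statistic is defined (not the author's analysis code), it can be computed as the proportion of cross-group pairs in which the first group's value is larger minus the proportion in which it is smaller. The data below are hypothetical and are not taken from the study.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's d: P(x > y) - P(x < y) over all cross-group pairs (ties count as zero)."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical percentage estimates, NOT data from the study.
group_a = [20, 40, 60, 90, 100]
group_b = [6, 20, 20, 30, 50]
print(round(cliffs_delta(group_a, group_b), 2))
```

A value of 0 indicates complete overlap between the groups, while values of 1 or -1 indicate no overlap at all.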

$p_{inc}$ estimates for minorities and majorities
For the $p_{inc}$ problems (Q1-Q4) of all the conditions, $p_{inc}$ was underestimated compared with the normative $p_{inc}$ (signed-rank tests, ps < $10^{-10}$). Figure 2(b) shows the results of the 3%-30 problems for the 348 participants in the four minority conditions (Negative, Positive, Visible, and Neutral). Most of the estimates (85.1%) were less than the normative $p_{inc}$ of 59.9. Both the constant heuristic (estimated "3") and the EV heuristic ("1" ≈ 0.03 × 30; note that the estimates were in integers) were evident. The results for the other settings of q and n (3%-80, 7%-30, and 7%-80) showed a virtually identical pattern (Figure 2(c-e)). As shown in Table 2, when q was 3% or 7%, roughly 90% of the estimates were underestimations, while 40-70% were equal to or less than q.
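The descriptive measures used above (the share of underestimations, the share of estimates at or below q, and the shares matching the constant and EV heuristics) can be sketched as follows. The function `summarise` and the response list are hypothetical illustrations, not the study's data or analysis script.

```python
def summarise(estimates, q_pct, n):
    """Summarise integer percentage estimates for a q%-n problem."""
    normative = 100 * (1 - (1 - q_pct / 100) ** n)
    ev = round(q_pct / 100 * n)  # expected number of people, rounded to an integer
    total = len(estimates)
    return {
        "normative": normative,
        "underestimated": sum(e < normative for e in estimates) / total,
        "at_or_below_q": sum(e <= q_pct for e in estimates) / total,
        "constant": sum(e == q_pct for e in estimates) / total,
        "ev": sum(e == ev for e in estimates) / total,
    }


# Hypothetical integer responses to a 3%-30 problem, NOT data from the study.
responses = [1, 1, 3, 3, 3, 10, 30, 50, 60, 90]
print(summarise(responses, q_pct=3, n=30))
```

Here an estimate counts as an underestimation whenever it falls below the normative value, following the definition given for Table 2.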
The participants estimated $p_{inc}$ differently for minorities and majorities. In the Majority condition, the use of the constant heuristic was still apparent, whereas the EV heuristic was not used (Figure 2(f-i)). About 75% of the estimates in the Majority condition were underestimations (Table 2). However, this was a lower percentage than for the Negative condition in which the participants estimated $p_{inc}$ for the counterpart minorities (Q2-Q3, χ²s(1) > 7.43, ps < .006, Cohen's ws > .210), except for Q1 (χ²(1) = 2.72, p = .099, w = .127).
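For readers unfamiliar with the effect size reported above, Cohen's w for a contingency table is conventionally defined as the square root of χ² divided by the total sample size. The sketch below assumes a Pearson χ² without continuity correction, which may differ from the exact procedure used in the study, and the counts are hypothetical.

```python
from math import sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def cohens_w(chi2, n):
    """Cohen's w effect size: sqrt(chi-square / total sample size)."""
    return sqrt(chi2 / n)


# Hypothetical counts (underestimated, not underestimated), NOT data from the study.
minority = (78, 9)
majority = (62, 20)
chi2 = chi_square_2x2(*minority, *majority)
n_total = sum(minority) + sum(majority)
print(round(chi2, 2), round(cohens_w(chi2, n_total), 3))
```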

Possible determinants of the $p_{inc}$ estimations
Why was $p_{inc}$ underestimated? Did the participants believe that the prevalence of the minorities in question was much less than the q values presented? This was unlikely, as people often overestimate minority prevalence (Citrin & Sides, 2008; Ipsos, 2015; Newport, 2015; Wong, 2007). In fact, the prevalence estimates by the participants were comparable to or larger than the qs presented (Table 1).
Are negative stereotypes (if any) about minorities relevant? Affect and motivation may distort people's probability judgements (Keller et al., 2006; Knäuper et al., 2005; Slovic & Peters, 2006; Weinstein, 1989). However, the $p_{inc}$ underestimation occurred as frequently in the Neutral condition (Table 2). Further, the estimates in the Negative condition were comparable with those in the Positive condition. The median estimates did not differ between these conditions (Wilcoxon rank-sum tests, ps > .192, Cliff's ds < .116). The proportion of the underestimation did not differ either (Fisher's exact probability tests, ps > .575, ws < .050). As an unexpectedly large number of estimates were below q, I examined the frequency of the estimates ≤ q in an ad hoc analysis (Table 2). The results of χ² tests on the four $p_{inc}$ problems (3%-30, 3%-80, 7%-30, and 7%-80) showed no significant difference between the Negative and Positive conditions (χ²s(1) < 2.40, ps > .122, ws < .120).
Would the underestimation be eliminated if relatively visible minority traits were used? The comparisons between the Visible and Neutral conditions revealed that this was partly the case, but only for the 7%-80 problem (Q4). The median estimate was significantly larger in the Visible condition Q4 than in the Neutral condition Q4 (W = 3392.5, p = .036, Cliff's d = .180). For the other settings of q and n (3%-30, 3%-80, and 7%-30), there were no such differences (ps > .139, Cliff's ds < .125). The proportion of estimates ≤ q was significantly lower in the Visible condition than in the Neutral condition for 7%-80 (χ²(1) = 10.54, p = .001, w = .241), but not for 3%-30, 3%-80, and 7%-30 (ps > .060, ws < .140). The frequency of underestimation did not differ between the conditions (ps > .253, ws < .100). Although a partial contribution of visibility to the $p_{inc}$ estimations was found, $p_{inc}$ was greatly underestimated in the Visible condition as well.

Experiment 2
The purpose of Experiment 2 was threefold. First, it tested the replicability of the $p_{inc}$ underestimation with a student sample. Second, to find out what heuristics were used, the participants were asked to verbally report their heuristics. Third, to rule out the effect of computational difficulty, they were asked to describe a solution without actually calculating it and then to calculate the normative $p_{inc}$ using computers.
Experiment 2 was not preregistered because its primary purpose was to examine qualitatively the participants' verbal reports on heuristics. The experimental procedure was approved in advance by the Niigata University Ethical Review Board for Human Research (2022-0030).

Participants
Forty-eight undergraduate students completed Experiment 2 (30 women and 18 men, age M = 19.4, SD = 0.8). See the Supplementary Materials for the other demographic data. They received a voucher worth JPY 300 as a reward. They were instructed in advance that they would need a computer to participate in this experiment. All the participants provided consent to participate in advance.

The questionnaire
The participants were not divided into conditions and they completed the same online questionnaire. See the Supplementary Materials for the entire list of the questions and the texts.
Table 2. Descriptive statistics of the estimated probability of inclusion in Experiments 1 and 2. Prevalence (q) and group size (n) were presented in the problems. Underestimation was defined as estimates less than the normative $p_{inc}$ (probability of inclusion) for each problem.
On the cover page, the participants reported their age, gender identity, and the faculty to which they belonged. Then, pages 2 and 3 showed four $p_{inc}$ estimation problems (3%-30, 3%-80, 7%-30, 7%-80), which were identical to those in the Negative condition in Experiment 1. On page 4, the comprehension rating item was shown, which was also identical to that in Experiment 1. Thereafter, the participants retrospectively reported how they made their estimation in a text field (verbal report, "How did you determine your answer to the question on the previous page?"). Experiment 2 did not incorporate an attention check. On page 5, the maths solution question was presented: the participants were asked to provide a mathematical solution for estimating $p_{inc}$ ("If you had to solve the following problem as a maths problem, how would you solve it? Please describe the solution in formulae or words. You do not have to provide the actual answer"). Then, the fictional gene problem of 7%-30 used in the Neutral condition of Experiment 1 was provided as the problem. On the next page, the mathematically normative solution for the fictional gene problem of the previous page was provided to the participants. They were asked to calculate it using a computer and any other devices (the calculation task).
Similar to Experiment 1's debriefing, the actual percentages of colour deficiency and gay/bisexual people known from surveys were given on page 7. Further, the participants were debriefed that the gene "Cg-1X" of the maths solution question was fictional.

Results and discussion
The general tendency to underestimate as well as the use of constant and EV heuristics were apparent (Figure 3, Table 2). For all the $p_{inc}$ estimation problems, the median estimates were significantly lower than the respective normative $p_{inc}$ (signed-rank tests, ps < .001). The percentages of underestimation did not differ from those of the Negative condition in Experiment 1 that used identical problems (Fisher's exact probability tests, ps > .220, ws < .127). However, the extent of the underestimation was smaller in Experiment 2. The median estimate was significantly larger than in the Negative condition in Experiment 1 in all the $p_{inc}$ problems (Wilcoxon rank-sum tests, ps < .001, Cliff's ds > .340). The estimates ≤ q were also less frequent in Experiment 2 than in the Negative condition in Experiment 1 (χ²s(1) > 10.6, ps < .001, ws > .279). The students provided relatively more accurate estimates than did the online workers in Experiment 1.
Analysis of the verbal reports after the $p_{inc}$ estimation problems identified only four students (8.3%) as having used the normative solution in any of the $p_{inc}$ estimation problems. This low percentage was not due to computational difficulty. For the maths solution question, only 12.5% of the students described the normative solution and an additional 8.3% reported partially normative solutions. However, when the normative solution was given (the calculation task), the majority of the students (60.4%) reported the correct answer ("88" or "89") and an additional 14.6% reported its complementary probability (e.g. "11").
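For reference, the normative answer in the calculation task can be worked out from the 7%-30 values given in the text (the decimal figures are rounded):

```latex
\begin{align*}
p_{inc} &= 1 - (1 - q)^n = 1 - (1 - 0.07)^{30} = 1 - 0.93^{30} \\
        &\approx 1 - 0.113 = 0.887
\end{align*}
```

This corresponds to about 88.7%, consistent with the answers "88" or "89", while the complementary probability $0.93^{30} \approx 11.3\%$ matches the answer "11".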
In summary, the $p_{inc}$ underestimation was not due solely to computational difficulty. Even when no actual calculation was required, most of the participants could not find the normative solution on their own. The results suggested that people have difficulty understanding the nature of $p_{inc}$.

Experiment 3
How can $p_{inc}$ estimations be improved? In Experiment 3, online workers estimated $p_{inc}$ for several modified versions of the 7%-30 problem in the Negative condition of Experiment 1.
[Figure caption fragment: Estimates larger than 100 are omitted in this figure. EV, expected value. q, prevalence presented in the problems. n, group size presented in the problems.]
Experiment 3 was preregistered. Four conditions (Control, Hint, Complementary, and No Group Size; see Section 4.1) were preregistered first (https://doi.org/10.17605/OSF.IO/JCPVG). After obtaining the data for these conditions, two additional conditions (Frequency and Cumulative Risk) were preregistered (https://doi.org/10.17605/OSF.IO/HMZB2) and conducted. The experimental procedures were approved in advance by the Niigata University Ethical Review Board for Human Research (2022-0170).

Participants
Online workers were recruited in the same way as in Experiment 1. A sample size of 90 was planned for each of six conditions. As a result, 589 participants completed Experiment 3 in exchange for JPY 110. The author excluded the data from the 42 participants who had already participated in Experiment 1 or 3, and the 21 participants who failed to follow the instructions of the attention check (see Figure 4). The remaining 526 participants comprised the final sample (329 women and 197 men, age 18-74 years [M = 39.7, SD = 10.4]). See the Supplementary Materials for the other demographic data and determination of the sample size. All the participants provided consent to participate in advance.

Design and conditions
Each participant was randomly assigned to one of six conditions: Control, Hint, Complementary, No Group Size, Frequency, and Cumulative Risk. The key difference between the conditions was Q2 of the questionnaires (Figure 4), which was the critical question to compare.

The questionnaires
Six online questionnaires were prepared corresponding to the six conditions. Figure 4 illustrates an overview of the conditions. All the instructions and questions were written in Japanese. See the Supplementary Materials for the original texts. On the cover page, the participants reported their age, gender identity, and highest level of education in the same manner as in Experiment 1. As shown below, the contents of the subsequent pages varied by condition.
4.1.3.1. Control condition. In the Control condition, the following question (Q2, critical question) was shown on page 2. Q1 was not shown (see Figure 4). The critical question was an improved version of the gay/bisexual student problem (7%-30) used in Experiment 1. For clarity, the phrase hitori demo ("even one") in Experiment 1 was replaced by sukunakutomo hitori ("at least one"). Furthermore, "%" was displayed next to the response field to emphasise that a percentage, not a number of people, should be estimated.
(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual. What do you think the percent probability is that there is at least one gay/bisexual student among 30 students? [ ]%
On page 3, confidence rating (1: not confident at all to 4: very confident, Q4), attention check (Q5), and verbal report task (Q6) were presented. Q3 was not presented in the Control condition (see Figure 4). The attention check was identical to that in Experiment 1. The verbal report task asked the participants to report how they determined the answer to the critical question. Finally, similar to the Experiment 1 debriefing, the actual percentage prevalence of gay/bisexual people known from surveys was described for the participants (page 4).
4.1.3.2. Hint condition. On page 2 of the Hint condition, the gender problem (Q1) and the critical question (Q2) were presented. The gender problem was used only in this condition and the critical question was common to the Control condition. In the gender problem, the participants estimated the $p_{inc}$ of men/women (i.e. q = 50%) for n = 30:
(Q1) About 50% of the population is [male/female]. Imagine you are on a train. There are 30 passengers in the train carriage beside you. What do you think the probability is that there is at least one [male/female] person among the passengers? [ ]%
The target gender (male/female) was randomly determined and displayed by the function of the online questionnaire. This question was expected to serve as a hint that $p_{inc}$ is higher than q. Then, the critical question followed. Pages 3 (confidence rating, attention check, and verbal report) and 4 (debriefing) were identical to those in the Control condition.
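To see why the gender problem could serve as a hint, its normative answer can be worked out from the values stated in the question (q = 50%, n = 30), assuming the passengers are independent draws from the population:

```latex
\begin{align*}
p_{inc} &= 1 - (1 - 0.5)^{30} = 1 - 2^{-30} \\
        &\approx 1 - 9.3 \times 10^{-10}
\end{align*}
```

The result is practically 100%, far above the prevalence of 50%.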
4.1.3.3. Complementary condition. This condition tested whether people could estimate the complementary probability of p inc, namely, the probability that there is no minority member in a group (100% − p inc).
On page 2 of this condition, the following critical question was presented:
(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.
What do you think the percent probability is that there is no gay/bisexual student among 30 students? [ ]%
The rest of the questionnaire (pages 3 and 4) was identical to that in the Control condition.
4.1.3.4. No group size condition. The EV heuristic observed in Experiment 1 may have been due to the participants' simple strategy to use the given numbers in a calculation without a specific aim (e.g. calculating the EV). Hence, it was hypothesised that if the group size n were not given, the EV heuristic would not be used, the constant heuristic would be used more, and thus the estimated p inc would be larger than that in the Control condition. To test this hypothesis, the participants of this condition were given q, but not n, in the critical question:
(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.
A company held a job fair for college students. Imagine what the venue looks like. Students are gathered at the venue.
What do you think the percent probability is that the students in the venue include at least one gay/bisexual student? [ ]%
On page 3, the participants were first asked to report the imagined number of students in the critical question ("In the previous question, how many students did you imagine in the venue?"). This "imagined n" question (Q3) was presented only in this condition. The responses to this question were expected to be correlated with the estimated p inc in the critical question if the participants accounted for the imagined group size in the p inc estimations. Following the imagined n question, the confidence rating, attention check, and verbal report questions were presented in the same way as in the Control condition. The debriefing (page 4), which was also identical to that in the Control condition, followed.
Figure 4. Overview of the conditions and procedures in Experiment 3. The content of the critical question (Q2) varied between the conditions. The original questions were in Japanese. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size.
4.1.3.5. Frequency condition. In the critical question (Q2) of this condition, the participants were asked to estimate how many classes out of 100 classes included at least one gay/bisexual student, where each class consisted of 30 students:
(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.
In one college, there are 100 classes with 30 students per class. How many of these classes do you think include at least one gay/bisexual student? [ ] classes
The expected frequency was 88.7 classes, which was equivalent to the normative p inc. This condition was investigated because risk communication studies have found that risks expressed in a frequency format are perceived differently from risks expressed in a percentage or decimal format (Visschers et al., 2009), while a frequency format may also improve probability reasoning (Gigerenzer & Hoffrage, 1995; Girotto & Gonzalez, 2001; Hoffrage et al., 2000; but see also McCloy et al., 2010).
The rest of the questionnaire (pages 3 and 4) was identical to that in the Control condition.
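The normative values quoted for these conditions can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original materials; it assumes that each group member independently has the trait with probability q, so that p inc = 1 − (1 − q)^n. The function name `p_inc` is chosen here for illustration only.

```python
def p_inc(q: float, n: int) -> float:
    """Probability that a group of n people includes at least one person
    with the trait, assuming independent membership with prevalence q."""
    return 1.0 - (1.0 - q) ** n


q, n = 0.07, 30  # prevalence and group size used in the critical question
inclusion = p_inc(q, n)

print(f"normative p_inc (Control/Hint): {100 * inclusion:.1f}%")
print(f"complementary probability:      {100 * (1 - inclusion):.1f}%")
print(f"expected classes out of 100:    {100 * inclusion:.1f}")
```

Under this assumption the three printed values are 88.7, 11.3, and 88.7, matching the normative figures stated for the Complementary and Frequency conditions.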
4.1.3.6. Cumulative risk condition. For comparison purposes, in the critical question (Q2), the participants of the Cumulative Risk condition were asked to estimate the percentage cumulative risk isomorphic to p inc. They were informed of the elementary risk and asked to estimate the cumulative risk:
(Q2) A certain medicine has the side effect of causing a mild headache. If you take this medicine once, there is a 7% probability that you will experience the side effect.
What do you think the percent probability is that you will experience the side effect at least once if you take this medicine once a day for 30 days? [ ]%
Page 3 (confidence rating, attention check, and verbal report) was identical to that in the Control condition. The debriefing (page 4) was not included because the topic of gay/bisexual students was not used in this condition.

Results and discussion
Even with the improved wording, the participants still showed a prominent underestimation in the Control condition critical question (Mdn = 7, 85.1% underestimation; see Figure 5(a) and Table A1). In the Hint condition, the estimates for the critical question (Mdn = 7) did not differ from those in the Control condition (W = 3512.5, p = .863, Cliff's d = .015; see Figure 5(c)). The percentage underestimation (86.6%) did not differ from that in the Control condition, either (χ²(1) = 0.1, p = .776, w = .022). Hence, providing the gender problem as a hint had no effect. In the Complementary condition, the median estimate for the critical question was 40, which was significantly higher than the normative percent probability of 11.3 (signed-rank test, T = 3478, p < .001; see Figure 5(d)). The participants overestimated the probability that there is no minority member in a group, consistent with the p inc underestimation.
In the No Group Size condition, the estimates for the critical question were comparable with those in the Control condition (Figure 5(e)) even though n was not given. The median estimate was 9.5, which was not statistically different from that in the Control condition (W = 3458.5, p = .179, Cliff's d = .117). It was expected that the constant heuristic would be more frequently used since q was the only numerical information given. However, the proportion of estimate "7" (= q) did not differ between this condition (11.1%) and the Control condition (5.7%; Fisher's exact probability test, p = .281, w = .096). The imagined ns (Q3) ranged from 1 to 10,000 and the median was 100. No clear correlation between the imagined n and p inc estimate was found (ρ = .17, p = .115). These results suggest that it is difficult to account for group size when estimating p inc.
The estimates in the Frequency condition critical question were larger (Mdn = 10) than those in the Control condition (W = 2827, p = .001, Cliff's d = .278), suggesting the partial mitigation of the underestimation. As shown in Figure 5(f), the estimates below q decreased compared with the Control condition, while the "7" estimates increased. The frequency format thus produced a clear difference in the participants' responses. Nevertheless, many of the participants (76.7%) in this condition still provided an underestimation and this proportion did not differ from that in the Control condition (χ²(1) = 2.0, p = .157, w = .106).
The estimates of the Cumulative Risk condition critical question (Mdn = 11) were higher than those in the Control condition (W = 2755, p = .002, Cliff's d = .272). Consistent with previous studies, the constant heuristic ("7") was often used to estimate the cumulative risk (Figure 5(g)). The estimates below q were infrequent. Hence, people estimate p inc and the cumulative risk differently. However, the proportion of underestimation (78.2%) was comparable with that in the Control condition (85.1%, χ²(1) = 1.4, p = .240, w = .089).
The results of the confidence rating (1: not confident at all to 4: very confident) showed that the participants were not confident in any of the conditions. The average confidence rating for each condition ranged from 1.66 to 2.07.
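For readers unfamiliar with the effect size reported alongside the rank-sum tests above, the following minimal sketch shows how Cliff's d is defined: the proportion of cross-group pairs in which the first group's value is larger, minus the proportion in which it is smaller. The two lists are hypothetical toy values invented for illustration; they are not the study data, and the function name `cliffs_delta` is likewise an assumption.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's d: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical toy estimates (percent), NOT the data of Experiment 3.
control = [3, 7, 7, 10, 50]
frequency = [7, 10, 10, 30, 90]

delta = cliffs_delta(frequency, control)
print(f"Cliff's d = {delta:.2f}")  # prints "Cliff's d = 0.44"
```

A value of 0 means that neither group tends to give larger estimates, whereas values near ±1 mean that almost every estimate in one group exceeds almost every estimate in the other.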

Heuristics for estimating p inc and the cumulative risk
Did the heuristics used by the participants differ between estimating p inc and the cumulative risk? Based on the participants' verbal reports (Experiments 2 and 3), I identified seven heuristics (Table 3). The composition of the heuristics used to estimate p inc significantly differed from that used to estimate the cumulative risk. A χ² test of independence revealed that the frequency of the heuristics used significantly differed between the Control and Cumulative Risk conditions in Experiment 3 (χ²(7) = 43.8, p < .001, with continuity correction, w = .502). Further, a post-hoc analysis of the residuals (α = .05) found that the frequencies for five of the seven heuristics were significantly different between the conditions (Table 3). The normative solution was more frequently observed in the Cumulative Risk condition than in the Control condition.

Use of the EV
A relatively common heuristic for estimating p inc was to calculate the EV (e.g. 0.07 × 30 = 2.1) and somehow translate it into a probability (EV-to-probability translation). Those participants using this heuristic typically stated that an EV of one or more indicated very high p inc (e.g. "90" and "100"). Although such a translation method is not mathematically normative, it is a relatively reasonable way to estimate p inc without the normative calculation. Indeed, high estimates (≥ 90) were often provided by using this heuristic, although some of the participants translated the EV into low probabilities (e.g. "10"). Importantly, this heuristic was infrequently used in the Cumulative Risk condition.
The EV heuristic was also frequently used to estimate p inc. It was apparent in Experiment 1 (Figure 2(a-e)) and was used by 25% of the students in Experiment 2 (Table 3). It was relatively rare in the Cumulative Risk condition in Experiment 3, although no significant difference from the Control condition was found. Interestingly, those participants who tried to calculate the EV (2.1) as a p inc estimate in Experiment 3 actually reported two separate heuristics. Most of them calculated 0.07 × 30, probably because they simply confused p inc with the EV. Others calculated 0.3 × 7. They seemed to posit that a q of 7% meant that p inc for a group of 100 was also 7%, and assumed that p inc should be proportional to n. As with the additive heuristic, the EV heuristic implied that the participants correctly thought that p inc increases as n increases, but it still yielded underestimations.

Constant and additive heuristics
Constant and additive heuristics were used less often for estimating p inc than for estimating the cumulative risk (Table 3), consistent with previous studies of cumulative risks (De La Maza et al., 2019; Doyle, 1997; Fuller et al., 2004). Figure 5(g) also shows the predominant use of the constant heuristic ("7") for estimating the cumulative risk. The additive heuristic typically multiplies q by n (7 × 30, i.e. multiplicative heuristic). Since the result exceeded 100, an additional heuristic was often used (e.g. to truncate it to 100), as also reported in previous studies (Doyle, 1997; Fuller et al., 2004).
Figure 5 (caption, partially recovered). In the Control condition, (a) the participants estimated p inc for q = 7% and n = 30 (Q2, gay/bisexual student problem with revised wording). For reference, b shows the result of the gender problem (Q1), which appeared only in the Hint condition. The results of the Control condition are plotted together with the results of the other conditions (c, e, f, and g) for comparison purposes. EV, expected value. q, prevalence presented in the problems. n, group size presented in the problems.
Given the above findings, the p inc estimation task was likely to have encouraged the participants to focus on the EV. By contrast, they were more likely to focus on q to estimate the cumulative risk.

Other heuristics
A heuristic of calculating 1/n was unexpected. The participants using this heuristic seemed to confuse the p inc of group size n with the proportion of one person in n ("the probability of one person out of 30 is 3%"). The confusion between the proportion and p inc was similar to that when using the constant heuristic. Because of the 1/n heuristic, "3" was the most frequent estimate in the Control condition of Experiment 3 (Figure 5(a)). However, it was not used in the Cumulative Risk condition (Table 3) or in the p inc estimations in Experiments 1 and 2 (Figure 2(d), Table 3). These observations suggest that the 1/n heuristic is likely to be used when facing a single p inc estimation problem. Since two problems using different ns were given simultaneously on the questionnaires of Experiments 1 and 2, the participants presumably noticed that the 1/n heuristic contradicts the intuition that increasing n must increase p inc.
Finally, approximately 30% of the participants in Experiments 2 and 3 simply guessed the estimate without making a calculation. Of these, many referred to their own experiences, beliefs, and knowledge from the media (e.g. "In my life, I have met those people at about that probability", "According to the TV and Internet, gay and bisexual people are all around us more than I thought", and "I estimated it based on the fact that my body often experiences side effects").
While the EV-related heuristics and constant heuristic were consistently observed in the present study, other heuristics (e.g. 1/n heuristic) were less common. Future studies are needed to examine these heuristics in more detail.
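To make the gap between these heuristics and the normative answer concrete, the following minimal sketch evaluates the purely numerical heuristics described above for the 7%-30 problem. It is an illustration rather than a model of the participants' reasoning: the dictionary labels are chosen here, truncation at 100 is only one of the additional heuristics mentioned for the additive case, and the normative value assumes independent membership.

```python
q_percent, n = 7, 30       # prevalence (%) and group size of the 7%-30 problem
q = q_percent / 100

estimates = {
    # Normative value, assuming independence: 1 - (1 - q)^n
    "normative": 100 * (1 - (1 - q) ** n),
    # Constant heuristic: report the prevalence itself
    "constant": float(q_percent),
    # EV heuristic: report the expected number of members (0.07 x 30) as a percentage
    "EV": q * n,
    # Additive heuristic: 7 x 30 = 210, here truncated to 100
    "additive (truncated)": float(min(q_percent * n, 100)),
    # 1/n heuristic: the share of one person in the group
    "1/n": 100 / n,
}

for name, value in estimates.items():
    print(f"{name:>22}: {value:5.1f}%")
```

The printed values are 88.7, 7.0, 2.1, 100.0, and 3.3, respectively. The constant, EV, and 1/n heuristics all fall far below the normative value, whereas the truncated additive heuristic overshoots it.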

Experiments 4a and 4b
Would the p inc underestimation be replicated in a more realistic, ecologically valid situation? Experiments 1-3 may have had low ecological validity because they were conducted online and the participants estimated p inc for a fictitious group. By contrast, the participants in Experiments 4a and 4b were asked to estimate p inc for the group of people in the classroom with them. Each experiment was a one-shot experiment conducted as a group. Experiment 4a was conducted for a relatively small group, whereas Experiment 4b used a relatively large group.
Experiments 4a and 4b were not preregistered because even if the sample size had been determined in advance, it would have been difficult to adhere to it. The experimental procedure was approved in advance by the Niigata University Ethical Review Board for Human Research (2023-0220).
Table 3. Percentage frequency of the heuristics used in Experiments 2 and 3. The analyses of the estimates and participants' retrospective verbal reports identified seven heuristics used in the estimation tasks in Experiments 2 and 3. Raw frequencies are in parentheses. In Experiment 3, the frequency of the heuristics was compared between the Control (p inc estimation) and Cumulative Risk conditions.
6.1. Method
6.1.1. Participants
Students were recruited after a class. The experimenter asked them to remain in the classroom if they wanted to participate and those who remained became participants. This procedure was performed once for a small class (Experiment 4a) and once for a large class (Experiment 4b). As a result, 16 individuals (10 men and 6 women) participated in Experiment 4a. They comprised 15 undergraduates (19-21 years) and one student from the local community (65 years). In Experiment 4b, 50 undergraduates (26 men and 24 women, age 18-22 years, M = 19.1) participated. See the Supplementary Materials for the other demographic data. In both experiments, all the participants took part without receiving monetary rewards or course credits. They provided consent to participate in advance. Each experiment took several minutes.

Procedure and the questionnaire
The procedure was the same in both experiments. First, questionnaire booklets were distributed to the participants. They were instructed to read the instructions on the cover page and respond to the demographic items (age, gender identity, and the faculty to which they belonged). Then, the experimenter asked them to proceed to the next page, which was as follows (all the content of the questionnaire was written in Japanese).

Do not fill this out until instructed.
There is a gene called ALDH2*2/*2. It is said that 7% of the population have this gene in Japan.
Now, there are [ ] people in this classroom.
What do you think the percent probability is that there is at least one person having the gene ALDH2*2/*2 in this classroom? [ ]%
Once you have completed your answers, proceed to the next page.
The experimenter instructed the participants to complete the number of people in the classroom in the first blank space, which was counted by the experimenter in advance. The number was 17 (16 participants and one experimenter) in Experiment 4a and 51 (50 participants and one experimenter) in Experiment 4b. Next, the participants were instructed to answer the p inc estimation problem (i.e. complete the second blank space).
After the p inc estimation, the participants proceeded to the knowledge check (page 3): Then, similar to other experiments, a description of the gene and its actual percentage prevalence was given as debriefing (page 4). After reading the debriefing, the participants submitted their questionnaires to the experimenter and the experiment finished.
The prevalence value (q) of the p inc estimation problem was set to 7% since it was also used in Experiments 1-3. The topic of the problem was a real genotype of the aldehyde dehydrogenase gene ALDH2. This was adopted because it is not fictional and the 7% prevalence value is realistic in Japan (Eng et al., 2007).

Results and discussion
The underestimation of p inc was replicated in both experiments (Figure 6). In Experiment 4a (7%-17), the normative p inc was 70.9 and most of the participants (15 out of 16) estimated below it. The median estimate was 3.9, which was significantly lower than the normative p inc (signed-rank test, T = 2, p < .001). In Experiment 4b (7%-51), the normative p inc was 97.5 and the median estimate (7) was significantly lower than it (T = 3, p < .001). Of the participants in Experiment 4b, 96.0% underestimated p inc, while 52.0% reported estimates equal to or less than q. Only one participant (Experiment 4b) reported prior knowledge of the gene.
The use of the EV and constant heuristics was again evident, as in Experiments 1-3. Four participants [...]
Figure 6. Results of Experiments 4a and 4b. Each participant estimated p inc for a real group of people in the classroom. The only difference between the two experiments was n (i.e. the group size presented in the problems). EV, expected value. q, prevalence presented in the problems. N, sample size.
The results were comparable to those observed in the online experiment with the student sample (Experiment 2). The p inc underestimation and use of the heuristics were replicated even in the more realistic situation.

Experiment 5
The purpose of Experiment 5 was twofold. First, it examined whether the p inc underestimation is related to less inclusive attitudes towards minorities. Second, it examined whether being given information on the normative p inc changes people's attitudes towards minorities to be more inclusive. The participants first rated the extent to which they agreed with inclusive statements on colour deficiency (first attitude rating), followed by a middle task comprising three conditions. In the p inc estimation condition, the participants estimated p inc for colour deficiency as the middle task. In the EV estimation condition, the middle task was to estimate the number of people with colour deficiency in a group. The participants in the Normative p inc condition were given the normative p inc and rated how believable that p inc was. Finally, the participants in all the conditions again rated the extent to which they agreed with the inclusive statements (second attitude rating). If the p inc underestimation were related to less inclusive attitudes towards minorities, the estimated p inc of the p inc estimation condition middle task would be correlated with the first attitude rating. If the information on the normative p inc changes participants' attitudes and simply estimating p inc or EV without knowing the normative p inc does not, there would be a difference between the first and second attitude ratings in the Normative p inc condition, but no difference in the other conditions.

Participants
Online workers were recruited in the same way as in Experiments 1 and 3. A sample size of 100 per condition was planned. As a result, 301 participants completed the experiment. They received a monetary reward of JPY 100. The data from 12 participants who did not pass the attention check (see Sections 7.1.2 and 7.2.1) were excluded from the analyses. The remaining 289 participants comprised the final sample (145 women, 143 men, and 1 unknown; age 20-77 years [M = 41.5, SD = 9.5]). Ninety-nine were assigned to the p inc estimation condition, 96 to the EV estimation condition, and 94 to the Normative p inc condition. See the Supplementary Materials for the other demographic data and determination of the sample size. All the participants provided consent to participate in advance.

Conditions and the questionnaires
As in Experiments 1 and 3, three online questionnaires were designed, one for each of the three conditions. The only difference between the conditions was the middle task. See the Supplementary Materials for the complete list of the questions and texts. All the instructions and questions were in Japanese.
On the cover page, the participants reported their age, gender identity, and highest level of education, in the same manner as in Experiment 3. On the next page, they reported the extent to which they agreed with each of the three statements on colour deficiency (first attitude rating). A visual analogue scale (VAS) ranging from 0 (disagree) to 100 (agree) was presented under each statement.
Some people have difficulty distinguishing between reddish and greenish colours. Medically, this is called colour deficiency. It is said that 3% of the population has colour deficiency.
How much do you agree with the following statements? Please answer by moving the slider.
Displays and information boards in public facilities should have colour schemes that are easy for people with colour deficiency to see.
School teachers should always consider that there may be students with colour deficiency in their classes.
Company management should take responsibility for creating a workspace in which people with colour deficiency can work comfortably.
Then, the participants performed the middle task (page 3). As shown below, the content of this task varied by condition. The second attitude rating (page 4) followed the middle task. The participants again reported their attitudes towards the three statements on colour deficiency as in the first attitude rating. In addition, an attention check item was presented at the end of this page ("To ensure that you are reading the questions, please move the slider below to the leftmost position labelled disagree").
On page 5, the actual percentage prevalence of colour deficiency known from surveys was described as a debriefing in the same way as in Experiment 1. This experiment took 5-10 min.

Data exclusion
To exclude inattentive responses, data from 12 participants who responded 6-100 on the VAS in the attention check item were excluded from the analysis. The remaining 289 who responded 0-5 were considered to have passed the attention check.

Middle task results
In the p inc estimation condition, the participants underestimated p inc as in the previous experiments. The median estimate was 4, which was significantly lower than the normative p inc of 91.3 (q = 3%, n = 80) (signed-rank test, T = 230, p < .001). Eighty-five participants (85.9%) underestimated p inc and 49 (49.5%) reported estimates equal to or less than q. See Figure A1 for the histogram.
In the EV estimation condition, most of the participants reported estimates very close to the EV of 2.4. Among the 96 participants, 72.9% responded "2" and 18.8% responded "3" (the online questionnaire only accepted integer responses). The minimum and maximum estimates were 1 and 5, respectively.
In the Normative p inc condition, the mean rated believability of the normative p inc (91.3) was 4.4 (SD = 1.69). On average, the participants rated the normative p inc as "neither believable nor unbelievable" (4) or "a little believable" (5). This result was unexpected given the strong underestimation of p inc in the previous experiments. Perhaps because the participants were not so confident in their p inc estimates, they were comfortable believing the normative p inc, which was inconsistent with their intuitive estimates.

Correlation of the initial attitudes with the p inc estimates
As an index for the initial attitude of each participant, the responses to the three items of the first attitude rating were averaged (first attitude score, see Figure 7). The higher the attitude score, the more inclusive the participant's attitude. Likewise, the second attitude score was calculated by averaging the three items of the second attitude rating.
Contrary to the hypothesis, the first attitude scores in the p inc estimation condition did not correlate with the p inc estimates (Spearman's ρ = .025, p = .810). The p inc estimates were not related to the second attitude score either (ρ = .046, p = .653).

Attitude change
To examine whether the middle tasks caused attitude changes, the second attitude scores were compared with the first attitude scores for each of the conditions. Each comparison was made using the within-participant t-test, and the p-values were adjusted using Holm's procedure so that the overall alpha was set to .05 for the triplet of t-tests.
As shown in Figure 7, the second attitude scores were significantly higher than the first attitude scores in the Normative p inc condition (t(93) = 5.79, p adj < .001, d z = .597). As hypothesised, being given information about the normative p inc changed the participants' attitude in favour of the inclusive statements. In the p inc estimation condition, the second attitude scores were also significantly higher than the first attitude scores (t(98) = 2.60, p adj = .021), while the effect size was relatively small (d z = .261). In the EV estimation condition, there was no such difference (t(95) = 0.71, p adj = .482, d z = .072). In summary, either knowing the normative p inc or estimating the p inc significantly changed the attitude scores, while estimating the EV did not.
Did the magnitude of attitude changes differ across the conditions? As an ad-hoc analysis, I compared the attitude score change (the second attitude score minus the first attitude score) between the three conditions. The mean attitude score change was 2.08 (SD = 7.97), 0.55 (7.67), and 5.23 (8.76) in the p inc estimation, EV estimation, and Normative p inc condition, respectively. A pairwise Welch test with Holm's correction revealed that the attitude score change was significantly larger for the Normative p inc condition than for the EV estimation condition (t(183.7) = 3.91, p adj < .001, Cohen's d = .568) and the p inc estimation condition (t(187.0) = 2.60, p adj = .020, d = .376), while there was no difference between the p inc estimation and EV estimation conditions (t(193.0) = 1.37, p adj = .173, d = .196).
These results suggested that knowing the normative p inc changed the participants' attitudes towards being more inclusive. The attitude changes caused by simply estimating p inc or the number of minority members in a group without knowing the normative p inc were small or negligible.
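The within-participant effect sizes reported above can be cross-checked against the reported mean changes and standard deviations. The sketch below is a consistency check and not the original analysis script; it assumes the standard definitions d z = mean change / SD of change and t = d z × √n for a paired t-test, and takes the condition sample sizes (99, 96, 94) from the Participants section. The variable names are chosen here for illustration.

```python
import math

# (mean attitude change, SD of change, n) as reported for each condition
changes = {
    "p_inc estimation": (2.08, 7.97, 99),
    "EV estimation": (0.55, 7.67, 96),
    "Normative p_inc": (5.23, 8.76, 94),
}

results = {}
for name, (mean, sd, n) in changes.items():
    d_z = mean / sd               # standardised mean change
    t = d_z * math.sqrt(n)        # paired t statistic with n - 1 degrees of freedom
    results[name] = (d_z, t)
    print(f"{name:>17}: d_z = {d_z:.3f}, t({n - 1}) = {t:.2f}")
```

The recomputed d z values (.261, .072, and .597) match those reported, and the recomputed t statistics agree with the reported ones up to the rounding of the published means and SDs.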

General discussion
The presented results suggest a cognitive bias specific to estimating p inc. There were significant differences in the participants' use of heuristics when estimating p inc and the cumulative risk, even though they require the same calculation. In particular, while the participants' frequent focus on the EV suggested that it served as an "anchor" for estimating p inc, q was primarily used as an "anchor" for estimating the cumulative risk. However, both these anchors are much smaller than the normative probabilities. Therefore, p inc and the cumulative risk are often underestimated.
Another intriguing finding was that the EV heuristic was not used when the participants estimated p inc for majorities (Figure 2(f-i)). Specifically, p inc estimates approximately equal to the EV became infrequent as q increased. For the cases in which n = 30, the proportion of estimates approximately equal to the EV was 35.9% for q = 3% (Figure 2(b)) and 30.2% for q = 7% (Figure 2(d)), whereas it dropped to 16.3% as q rose to 20% (Figure 2(a)). For q = 50% (Experiment 3 gender problem, Figure 5(b)), only one participant (1.2%) reported the EV (15) as the p inc estimate. Hence, the EV heuristic was specific to the p inc estimation with low q values, namely, the p inc of minorities.
The presented results do not indicate that the participants were irrational or intolerant towards minorities. The p inc underestimation was not because of negative stereotypes, but rather a common characteristic of human cognition. The participants often used heuristics that seemed reasonable, at least partially, in a situation where the normative calculation could not be made without using electronic devices. It has been reported that people make normative judgements on the cumulative risk when the normative calculation is easy (Pelham et al., 1994).
By reframing the problem when estimating difficult-to-calculate probabilities, people can translate such problems into mental models that can be processed with limited cognitive capacity. However, unsuitable translations lead to erroneous estimations. In the cases of estimating p inc and the cumulative risk, people often fail to translate the problem properly. In studies on conditional probability judgements, methods have been suggested to help participants build appropriate mental models, such as using natural frequency formats (Cosmides & Tooby, 1996; Gigerenzer & Hoffrage, 1995), adopting visual presentations (Brase, 2009; Tubau et al., 2019), and partitioning cases into subsets (Girotto & Gonzalez, 2001; Sloman et al., 2003), although the effects can be small or even absent (Evans et al., 2000; Stengård et al., 2022). Partitioning cases into subsets also improves the accuracy of estimating the cumulative risk (McCloy et al., 2010), whereas using the other two methods does not (Bar-Hillel, 1973; McCloy et al., 2010). Similarly, Experiment 3 showed a very slight improvement in the p inc estimation by reframing the problem in terms of the natural frequency of classes. Further studies are needed to assess how to improve the p inc estimation.
The present study clearly showed the difference between estimating p inc and the cumulative risk, despite arithmetic isomorphism. This implies that the mental models used to understand disjunctive probability over a population may differ from those used to understand disjunctive probability over time. Empirically, the additive and multiplicative heuristics are the most frequently used to judge the cumulative risk (Doyle, 1997; Fuller et al., 2004) and the major source of the overestimation of such risks. However, the present study showed that the EV and constant heuristics were the most frequently used for estimating p inc, which yielded an underestimation.
How did these differences arise? Doyle (1997) analysed how the participants used additive heuristics ("multiplicative" in his paper) for cumulative risk judgements and found that they are likely to notice the flaws of the heuristic when the result of the simple multiplication exceeds 100% (e.g. annual risk 5% × 25 years = 125%). Thus, one of the reasons for the infrequent use of the additive heuristic in the present study may be that the product of q and n exceeded 100% (e.g. 7% and 30). However, even for the 3%-30 problem (Figure 2(b)), the additive heuristic (3% × 30 = 90) was rarely used, while the use of the EV heuristic was the most frequent. Furthermore, even with the same setting of q and n, p inc was underestimated more than the cumulative risk, while the EV heuristic was frequently used to estimate p inc but not to estimate the cumulative risk (Experiment 3). Interestingly, Doyle (1997) noted that participants using the additive heuristic to estimate the cumulative risk of flooding frequently referred to the EV ("Your home … will be hit 2 1/2 times in 50 years [by flooding]", p. 520). Since the expected number of hits by flooding is proportional to years, Doyle (1997) suggested that such a focus on the EV led participants to use the additive heuristic. For the p inc estimation, however, my participants often reported the EV as the probability; the additive heuristic was rarely used. Hence, while people are likely to reframe both p inc and the cumulative risk in terms of the EV, the resulting mental models seem to differ considerably. The causes of such a difference must be clarified in future research using a wider range of q and n.

Conclusions
People greatly underestimate the probability of inclusion. Some of the participants in this study made relatively reasonable estimates using the EV-to-probability translation heuristic, but many showed underestimations comparable to or greater than the case of the cumulative risk. This cognitive fallacy seemed to be due to the confusion between prevalence, EV, and the probability of inclusion. As the probability of inclusion may be unfamiliar to many people, it is difficult to translate this concept into suitable mental models.
One might assume that if strong incentives exist (e.g. formal achievement tests), people provide more accurate estimates (Kruglanski & Freund, 1983). However, in everyday situations, there is no strong incentive to spend considerable effort making more accurate estimates. Hence, p inc estimations in everyday situations are likely to be comparable to those in the experiments reported herein, suggesting that people may be unaware of the relevance of minorities in their lives. Fortunately, as Experiment 5 suggested, knowing the normative p inc may help to reduce such unawareness. Further research is needed to determine how and to what extent p inc guides decision making in real-world settings.

Figure 2. Results of Experiment 1, participants' estimates of the probability of inclusion (p inc) at various settings of q and n. A histogram of the estimated p inc for the blood type problem (Q5) is shown in a. The blood type problem was presented to all the participants of Experiment 1. The results of the other four p inc problems (Q1-Q4) of the conditions other than the Majority condition are shown in b to e. The topics used in these problems varied among the conditions. The results of the Majority condition are shown in f to i. Estimates less than 0 or larger than 100 are not shown in this figure. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size. EV, expected value.

Figure 3. Results of Experiment 2, the estimated probability of inclusion for the student sample. Students (N = 48) with PCs estimated the probabilities of inclusion for four problems (Q1-Q4) identical to those in the Negative condition of Experiment 1 (colour deficiency problems, gay/bisexual student problems). Estimates larger than 100 are omitted in this figure. EV, expected value. q, prevalence presented in the problems. n, group size presented in the problems.

Figure 5. Results for the critical question (Q2) of Experiment 3. In the Control condition (a), the participants estimated p inc for q = 7% and n = 30 (Q2, gay/bisexual student problem with revised wording). For reference, b shows the result of the gender problem (Q1), which appeared only in the Hint condition. The results of the Control condition are plotted together with the results of the other conditions (c, e, f, and g) for comparison purposes. EV, expected value. q, prevalence presented in the problems. n, group size presented in the problems.

Figure 7. Results of Experiment 5. The participants provided their attitude ratings twice, namely, before (blue plot) and after (red plot) a middle task. The vertical axis represents the attitude score (the average of the three rating items). A high attitude rating score indicated strong agreement with the inclusive statements on colour deficiency. The middle task varied by condition: p inc estimation (a), EV estimation (b), and reading about the normative p inc value and rating how believable the value was (c). The open dots with thin lines represent the participants' scores and the filled dots with thick lines represent the averages. The p-values were adjusted for the triplet of t-tests. EV, expected value.
Statistically significant difference (p < .05) between the Control and Cumulative Risk conditions in Experiment 3.