ORIGINAL Assessment of the response of pollinator abundance to environmental pressures using structured expert elicitation

Policy-makers often need to rely on experts with disparate fields of expertise when making policy choices in complex, multi-faceted, dynamic environments such as those dealing with ecosystem services. For policy-makers wishing to make evidence-based decisions that will best support pollinator abundance and pollination services, one of the problems faced is how to access the information and evidence they need, and how to combine it to formulate and evaluate candidate policies. This is even more complex when multiple factors act in combination. The pressures affecting the survival and pollination capabilities of honey bees (Apis mellifera), wild bees, and other pollinators are well documented, but the evidence is incomplete. To estimate the potential effectiveness of candidate policy choices, there is an urgent need to quantify the effect of various combinations of factors on the pollination ecosystem service. Using high-quality experimental evidence is the most robust approach, but key aspects of the system may not be amenable to experimentation, or experimentation may be prohibitive in cost, time and effort. In such cases, it is possible to obtain the required evidence by using structured expert elicitation, a method for quantitatively characterizing the state of knowledge about an uncertain quantity. Here we report and discuss the outputs of the novel use of a structured expert elicitation, designed to quantify the probability of good pollinator abundance given a variety of weather, disease, and habitat scenarios.

Activity (IARPA). In making their estimates for each question, experts may answer questions about quantities or variables measured on a continuous scale, or about probabilities of discrete variables. In asking about probabilities, IDEA uses a three-step format (Hanea et al., 2016; Sutherland & Burgman, 2015).
In the pre-elicitation stage, the problem needs to be defined as precisely as possible to minimize any risk of semantic or other misunderstandings arising, and to aid in the identification of suitable experts. A detailed search is used to find potential experts and these are approached for their involvement. The data on which the calibration questions will be based is ideally identified at this stage and designed to be as close to the experts' domain and the questions of interest as possible. Finally, some training is delivered to the experts to explain what is required of them and, if relevant, to discuss the estimation of probabilities.
The elicitation stage consists of three phases: investigate, discuss and estimate. It begins with experts investigating several resources individually. A list of essential documents and resources is circulated amongst the experts. If any expert knows of relevant evidence which does not appear in the list, they suggest it to the facilitator, who makes the resource available to all the experts. However, the amount of resources provided should be limited to the essential reading, since too much information might bias the experts. In the discussion stage, more information and resources are revealed and discussed. Based on these investigations, the experts provide individual estimates of the quantities of interest by answering the questions without discussing with, or disclosing their responses to, the other experts. They are asked to provide their estimates in a particular order: their lowest plausible, highest plausible and then best estimate of the quantities of interest. This ordering is designed to avoid anchoring the upper and lower estimates around the best estimate and leads to better accuracy.
The second phase is a facilitated discussion of the anonymized results for each question in turn, which irons out any residual semantic difficulties and allows experts to share their reasoning and any additional evidence with each other. This ensures that every expert is answering the same question based on the same evidence. The third phase is a second round of individual, private estimates, allowing experts the opportunity to revise (or not) their estimates, based on what they heard in the discussion. The privacy afforded for providing their second round estimates protects them from any pressure to conform to the views of others.
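The three-point format and the second-round summary described above can be illustrated with a minimal sketch. The class name `ThreePointEstimate`, the function `group_average` and all numerical values are hypothetical and chosen only for illustration; the unweighted mean is one simple way to summarize a round and is not necessarily the aggregation used in the study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ThreePointEstimate:
    """One expert's answer to one question, as probabilities in percent.

    Fields are listed in the order in which they are elicited:
    lowest plausible, highest plausible, then best estimate.
    """
    lowest: float
    highest: float
    best: float

    def __post_init__(self):
        # The best estimate must lie inside the plausible range.
        if not 0 <= self.lowest <= self.best <= self.highest <= 100:
            raise ValueError("expected 0 <= lowest <= best <= highest <= 100")


def group_average(estimates):
    """Unweighted mean of the lowest, highest and best estimates across experts."""
    return ThreePointEstimate(
        lowest=mean(e.lowest for e in estimates),
        highest=mean(e.highest for e in estimates),
        best=mean(e.best for e in estimates),
    )


# Hypothetical second-round answers from three experts (illustrative values only).
round_two = [
    ThreePointEstimate(lowest=20, highest=60, best=40),
    ThreePointEstimate(lowest=30, highest=70, best=50),
    ThreePointEstimate(lowest=10, highest=50, best=30),
]
summary = group_average(round_two)
print(summary)
```

Validating the ordering at construction time catches data-entry errors early, before any scores or averages are computed.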
The calibration exercise is identical in format to the elicitation. The principal difference is that the 'answers' to the calibration questions can be checked: either they will become known shortly, or they are already known to the elicitation facilitator but not accessible to the experts. The calibration data is used to construct a set of test questions for which experts are asked to give lowest plausible, highest plausible and best estimates, with a facilitated discussion and second private estimate identical to the elicitation exercise. It is important that these questions are as central to the experts' area of expertise as possible and sufficiently similar to the quantities of interest in the main elicitation. The experts' estimates on the test questions can be compared by the elicitation team to the known values, and performance measures can be calculated. These performance measures can be used to inform the mathematical aggregation into a single distribution or point estimate (see Hanea et al., 2016, for details).

Performance measures
The Brier score is twice the squared difference between an estimated probability of an event (best guess) and the actual outcome; it therefore takes values between 0 and 2. Lower values are better and are achieved if an expert assigns large probabilities to events that occur, or small probabilities to events that do not occur. We calculated this per question, per expert.
The average Brier score is a participant's Brier score averaged over many questions, representing long-term accuracy. Since scores close to 0 are good for the individual questions, scores close to 0 are also good for the average of those individual question scores, and a large score corresponds to poor performance. A score of 0.5 can be achieved by setting all answers to 50%, i.e. 'I don't know'. We calculated this per expert.
The length of the uncertainty interval is an indication of an expert's confidence in the best estimate given; small scores are better because they represent certainty. We calculated this per question, per expert.
The calibration term of the Brier score places forecasts into groups or bins with the same forecast probability and is based on the difference between the empirical and the theoretical distribution of each bin; smaller scores are therefore better. We calculated one number per expert from all questions.
Relative informativeness is based on measuring entropy. For details see (R. Cooke, 1991). We calculated one score per expert from all answers. Larger scores are better.
The final stage is the mathematical aggregation of experts' judgements. Commonly, some form of weighting is used based on the calibration exercise, which provides insight into the ability of the experts to estimate probabilities, a task known to be difficult. The ideal expert is both expert in the domain of interest and good at estimating probabilities.
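The Brier score, the average Brier score and the length of the uncertainty interval described under Performance measures can be sketched as follows. This is a minimal illustration assuming probabilities are expressed on a 0 to 1 scale and outcomes are recorded as true or false; the function names and the example answers are hypothetical and are not taken from the study.

```python
def brier_score(best_estimate, occurred):
    """Brier score for one question: twice the squared difference between
    the estimated probability and the outcome (1 if the event occurred,
    0 otherwise). The result lies between 0 and 2; lower is better."""
    if not 0.0 <= best_estimate <= 1.0:
        raise ValueError("best_estimate must be a probability in [0, 1]")
    outcome = 1.0 if occurred else 0.0
    return 2.0 * (best_estimate - outcome) ** 2


def average_brier_score(best_estimates, outcomes):
    """Mean of the per-question Brier scores for one expert."""
    if len(best_estimates) != len(outcomes) or not best_estimates:
        raise ValueError("need one outcome per estimate and at least one question")
    scores = [brier_score(p, o) for p, o in zip(best_estimates, outcomes)]
    return sum(scores) / len(scores)


def interval_length(lowest, highest):
    """Length of the uncertainty interval; shorter means more confident."""
    if highest < lowest:
        raise ValueError("highest must not be below lowest")
    return highest - lowest


# Hypothetical calibration outcomes and two hypothetical experts.
outcomes = [True, False, True, False]
confident_expert = [0.9, 0.1, 0.8, 0.2]
undecided_expert = [0.5, 0.5, 0.5, 0.5]

confident_average = average_brier_score(confident_expert, outcomes)
undecided_average = average_brier_score(undecided_expert, outcomes)
print(confident_average, undecided_average)
```

In this sketch the expert who answers 50% to every question obtains an average score of 0.5, matching the 'I don't know' benchmark described above, while the hypothetical confident and correct expert scores closer to 0.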

Background materials
These materials were circulated as background ahead of the elicitation:

The elicitation questions
The experts then gave their individual, subjective estimates of the lowest plausible, highest plausible and best estimate of probabilities (in %) for the questions:
Q1.1 What is the probability of observing good honey bee abundance, given that the environment is supportive, the weather is average, and the varroa control is good?
Q1.2 What is the probability of observing good honey bee abundance, given that the environment is supportive, the weather is average, and the varroa control is poor?
Q1.3 What is the probability of observing good honey bee abundance, given that the environment is supportive, the weather is unusual, and the varroa control is good?
Q1.4 What is the probability of observing good honey bee abundance, given that the environment is supportive, the weather is unusual, and the varroa control is poor?
Q1.5 What is the probability of observing good honey bee abundance, given that the environment is unsupportive, the weather is average, and the varroa control is good?
Q1.6 What is the probability of observing good honey bee abundance, given that the environment is unsupportive, the weather is average, and varroa control is poor?
Q1.7 What is the probability of observing good honey bee abundance, given that the environment is unsupportive, the weather is unusual, and the varroa control is good?
Q1.8 What is the probability of observing good honey bee abundance, given that the environment is unsupportive, the weather is unusual, and the varroa control is poor?
Q2.1 What is the probability of observing good other bee abundance, given that the environment is supportive, the weather is average?
Q2.2 What is the probability of observing good other bee abundance, given that the environment is supportive, the weather is unusual?
Q2.3 What is the probability of observing good other bee abundance, given that the environment is unsupportive, the weather is average?
Q2.4 What is the probability of observing good other bee abundance, given that the environment is unsupportive, the weather is unusual?
Q3.1 What is the probability of observing good other pollinator abundance, given that the environment is supportive, the weather is average?
Q3.2 What is the probability of observing good other pollinator abundance, given that the environment is supportive, the weather is unusual?
Q3.3 What is the probability of observing good other pollinator abundance, given that the environment is unsupportive, the weather is average?
Q3.4 What is the probability of observing good other pollinator abundance, given that the environment is unsupportive, the weather is unusual?
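Questions Q1.1 to Q1.8 cover every combination of two environment states, two weather states and two varroa control states, so the eight best estimates together form a conditional probability table for honey bee abundance. The sketch below shows how such a table could be enumerated and marginalized. All probabilities are hypothetical placeholders rather than elicited values, the function name is illustrative, and the three conditions are assumed to be independent of one another, which may differ from the structure of the network used in the study.

```python
from itertools import product

ENVIRONMENT = ("supportive", "unsupportive")
WEATHER = ("average", "unusual")
VARROA_CONTROL = ("good", "poor")

# Scenario order matches Q1.1 to Q1.8: environment varies slowest,
# varroa control varies fastest.
scenarios = list(product(ENVIRONMENT, WEATHER, VARROA_CONTROL))

# Hypothetical best estimates of P(good honey bee abundance | scenario).
hypothetical_best_estimates = [0.8, 0.5, 0.6, 0.3, 0.5, 0.3, 0.4, 0.1]
cpt = dict(zip(scenarios, hypothetical_best_estimates))


def probability_good_abundance(cpt, p_env, p_weather, p_varroa):
    """Marginal probability of good abundance, summing over all scenarios.

    Assumes the three conditions are mutually independent (an assumption
    of this sketch).
    """
    total = 0.0
    for (env, weather, varroa), p_good in cpt.items():
        total += p_env[env] * p_weather[weather] * p_varroa[varroa] * p_good
    return total


# Hypothetical uniform beliefs about each condition.
uniform_env = {"supportive": 0.5, "unsupportive": 0.5}
uniform_weather = {"average": 0.5, "unusual": 0.5}
uniform_varroa = {"good": 0.5, "poor": 0.5}

baseline = probability_good_abundance(cpt, uniform_env, uniform_weather, uniform_varroa)

# Setting evidence: varroa control is known to be good.
good_varroa = {"good": 1.0, "poor": 0.0}
with_good_control = probability_good_abundance(cpt, uniform_env, uniform_weather, good_varroa)
print(baseline, with_good_control)
```

Fixing one condition to a known state, as in the last call, corresponds to entering evidence on that node of a Bayesian network and reading off the updated probability of good abundance.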

Calibration Questions
Based on (beeinformed.org, 2015; Seitz et al., 2015; Spleen et al., 2013)
CQ1 Total losses over winter together with a 95% CI were reported. What is the probability that the upper bound of the reported CI is higher than 25%?
CQ2 Total losses over the entire 2014-2015 beekeeping year, together with a 95% CI, were reported. What is the probability that the upper bound of the reported CI is less than 45%?
CQ3 Participants of the survey indicated a loss up to 18.7% on average as acceptable over winter. What is the probability that the proportion of all beekeepers who had higher colony losses than they deemed acceptable was less than 70%?
Based on (Liolios et al., 2015)
CQ4 Bees collected pollen from a large number of taxa, but only some of those contributed significantly to their nutritional requirements. What is the probability that the proportion of the taxa that included more than 80% of the total proteins that were available for bees was larger than 25%?
CQ5 What is the probability that the average crude protein of these selected pollen sources was larger than 30%?
CQ6 What is the probability that the average protein content of the plants blooming in the spring is less than 20%?
CQ7 What is the probability that the correlation between the average protein content per mixed sample and the corresponding average amount of collected pollen was statistically significant (significantly different than 0) during the year?
CQ8 What is the probability that the correlation between the protein content and the number of collected taxa was statistically significant (significantly different than 0) during the year?
Based on (Alvarez et al., 2015)
CQ9 What is the probability that the average length of the cells constructed by M. concinna is larger than 10 mm?
CQ10 33 nests were studied. What is the probability that the average number of cells per nest is lower than 10?
CQ11 What is the probability that the proportion of overall cells of M. concinna which did not reach maturity (for whatever reasons) is larger than 40%?
Based on (Jones et al., 2016)
CQ12 Given that in 2008, during January, February, and March, total rainfall was 14.22 cm, 51.5% of total annual rainfall, and during the growing season the highest average monthly temperature recorded was 27.63 °C for August and 10.57 °C for December, what is the probability that the total number of bumble bees observed (at all sites) decreased by more than 80%?
CQ13 If the season and sites are combined the primary pollinators by percent of visits in digger bees; longhorned digger bees and sweat bees. What is the probability that hummingbirds and Acton giant flower-loving flies accounted for more than 50% of the total visits?
CQ14 If the season and sites are combined, the primary pollinators by percent of visits in 1995 were: longhorn digger bees; hummingbirds; sweat bees; bumble bees and the Acton giant flower-loving fly. What is the probability that hummingbirds and digger bees accounted for more than 50% of the total visits?
CQ15 A comparison of the pollination data collected for early, mid, and late blooming seasons during the 1995 and 2008 efforts elicits patterns of abundance, indicating transition of pollinator presence across the season. During the 1995 season, the distribution of visitor abundance relative to total abundance was similar in early (39%) and mid-season (37%), whereas it tapered off during late season (24%). What is the probability that the visit abundance during the early season of 2008 is significantly different than the visit abundance during the early season of 1995 (as indicated by a one-way ANOVA)?
CQ16 Abundance of primary pollinators observed for years when data were available was correlated to mean annual precipitation for the year before observations to elicit trends in pollinator abundance. What is the probability that a statistically significant positive relationship was found for the collective grouping of butterflies and moths?
Based on (Liu et al., 2016)
CQ17 Given that all Nosema bombi infections from 2013 were detected using PCR methods, what is the probability of commercial Bombus colonies (detected to be) infected with Nosema bombi to amount to more than 1% in 2013?
CQ18 Consider the entire period of the study, all the inspected colonies, and the four pathogens and parasites tested for. What is the probability of the total (detected to be) infected colonies to amount to less than 0.5%?
CQ19 Consider the entire period of the study and all the inspected colonies. What is the probability of commercial Bombus colonies (detected to be) infected with Crithidia bombi to amount to more than 0.3%?

Figure S3. Second round estimates for each expert, given following the discussion of round one estimates, were compared anonymously with the estimates for the same expert in round one, along with the group average (the mean of the upper, lower and best estimates for round 2). Panel shown: Q1.6 What is the probability of observing good honey bee abundance, given that the environment is unsupportive, the weather is average, and varroa control is poor?
Figure S4. Bayesian network populated with the best estimate probabilities from Table 1 with A. good varroa control and B. poor varroa control. The numbers and bars show the probability of each state being true. Image produced in NETICA.
Figure S5. Bayesian network populated with the best estimate probabilities from Table 1 with a combination of A. good varroa control and supportive environment and B. unusual weather, good varroa control and supportive environment. The numbers and bars show the probability of each state being true. Image produced in NETICA.
Figure S6. Bayesian network populated with the best estimate probabilities from Table 1 showing the values for varroa control, weather and environment which would be required to ensure good abundance of A. honey bees alone, B. other bees alone, C. hover flies alone. The numbers and bars show the probability of each state being true. Image produced in NETICA.