Boldness-Recalibration for Binary Event Predictions

Probability predictions are essential to inform decision making across many fields. Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, i.e., spread out enough to be informative for decision making. However, there is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, i.e., non-bold. The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration that enables practitioners to responsibly embolden predictions subject to their required level of calibration. Specifically, we allow the user to pre-specify their desired posterior probability of calibration, then maximally embolden predictions subject to this constraint. We demonstrate the method with a case study on hockey home team win probabilities and then verify the performance of our procedures via simulation. We find that very slight relaxation of calibration probability (e.g., from 0.99 to 0.95) can often substantially embolden predictions when they are well calibrated and accurate (e.g., widening hockey predictions range from .26-.78 to .10-.91).


Introduction
Probability predictions are made for everyday events, from the mundane, like the probability it will rain, to the life-altering, like the probability that a natural disaster hits a particular city. These predictions arise from sophisticated statistical and machine learning techniques, from human judgement and expertise, or from both. Regardless, probability predictions are commonly used in important decision making processes in the fields of medicine, economics, image recognition via machine learning, sports analytics, entertainment and many others, so it is critical that we have methods that assess such predictions.
The purpose of this paper is to develop boldness-recalibration that enables forecasters to achieve well calibrated, responsibly bold probability predictions for binary events.
We describe the assessment of probability predictions in terms of three aspects: calibration, accuracy, and boldness. The first aspect is calibration. Predicted probabilities are well calibrated when the events they aimed to predict occur with the same probability that was forecasted. For example, if the home team wins 40% of games for which a hockey forecaster predicted a 40% chance of a win, the forecaster is well calibrated. Probability calibration is well studied within the fields of statistics, meteorology, psychology, machine learning, and others (Bross 1953, Murphy & Winkler 1977, Dawid 1982, DeGroot & Fienberg 1983, Gonzalez & Wu 1999, Guo et al. 2017). Naturally, calibration is considered a minimal desirable property of predicted probabilities (Dawid 1982). Without calibration, if a forecaster says the probability of a home team win is 70%, you cannot rely on that prediction to reflect the true probability of a win. However, well calibrated predictions are not necessarily accurate nor bold enough to be useful.
The second aspect to assess probability predictions is classification accuracy. Classification accuracy measures how well predictions distinguish between the events they aim to predict. Receiver-Operating Characteristic (ROC) curves and the corresponding area under the ROC curve (AUC) are frequently used to assess classification accuracy for probability predictions. These accuracy assessments do not measure calibration since any monotone transformation applied to forecaster probability predictions will produce the same ROC curve and AUC as the original predictions.
The third aspect to assess probability predictions is boldness. We define boldness simply as the spread (i.e., standard deviation) in probability predictions. To illustrate, in National Hockey League (NHL) games in the 2020-21 season, the home team won 53% of games, thus the sample proportion is ȳ = 0.53. A hockey forecaster who simply predicts the base rate of 0.53 for every NHL game is well calibrated, but lacks the boldness needed for actionable predictions. However, forecasters who produce bold predictions alone without calibration or good classification show they have misplaced confidence in their prediction ability.
Figure 1: Venn diagram highlighting the possible combinations of three aspects of probability predictions: calibration, boldness, and classification accuracy. We propose a boldness-recalibration approach that enables forecasters to maximize boldness while maintaining a high probability of calibration, subject to their classification accuracy. The star represents probability predictions that are well calibrated, bold, and accurate. The empty set ∅ indicates that forecasts that are not accurate cannot be both well calibrated and bold.
This brings into focus the core tension between calibration and boldness, subject to classification accuracy. Notice the star in Figure 1 at the center of the three aspects. This intersection is where forecaster predictions are accurate, well calibrated, reasonably bold, and thus, actionable for decision making. It is important to note that the level of boldness considered "reasonable" depends directly on classification accuracy and the overall decision making goal. In some cases (such as forecasting rain), maintaining the highest level of calibration achievable may be most important. Increasing boldness here may be useful so that event planners can consider well informed "worst case" weather predictions for a specific day. In other settings, poor classification accuracy (e.g., due to badly informed forecasters) may limit the amount of emboldening that is responsible. Sacrificing a small amount of calibration for greater boldness allows the analyst to responsibly examine riskier predictions in a variety of areas where investments of time, effort, and/or money are called for (e.g., sports betting, medical diagnostics, financial investing, hiring employees, etc.). This paper allows the analyst to examine emboldened probability predictions in the context of a user-specified requirement for calibration using Bayesian reasoning.
Several techniques exist that purely focus on correcting miscalibration. Dalton (2013) leverages the Cox linear-logistic model to test for calibration and proposes a relative calibration metric, but makes no mention of prediction boldness. Platt (2000) introduces Platt scaling, which recalibrates Support Vector Machine (SVM) output via sigmoid curves. Guo et al. (2017) propose recalibration via temperature scaling, a one-parameter extension of Platt scaling for Neural Network output. Zadrozny & Elkan (2001) propose a nonparametric approach called histogram binning where probabilities are bin-wise recalibrated to minimize squared loss. Zadrozny & Elkan (2002) extend this by fitting a piece-wise recalibration function on each bin interval. Naeini et al. (2015) extend this further to Bayesian Binning into Quantiles (BBQ), where multiple binning strategies are considered via Bayesian model averaging. However, none of these methods incorporate boldness in their adjustment.
Some approaches to assessing calibration also consider notions similar to boldness. Reliability diagrams plot the predicted forecaster probabilities versus the observed frequency within each bin (Murphy & Winkler 1977). A calibration metric called Expected Calibration Error (ECE) quantifies the miscalibration seen in reliability diagrams by averaging the distances between the predictions and observed frequencies within each bin. Sometimes histograms (Ranjan & Gneiting 2010, Dimitriadis et al. 2021) or density plots (Satopää 2022) of the predicted probabilities are included with reliability diagrams to visualize boldness. However, boldness is not quantified in this approach.
A common metric for prediction accuracy, the Brier Score, can be decomposed into three parts such that one component measures calibration, another measures resolution, and the last measures the uncertainty in the outcomes themselves (Brier 1950, Murphy 1973). Resolution (or discrimination) refers to how well forecasts distinguish between the two possible outcomes. Predicted probabilities have high resolution when they are further from the base rate. While resolution is similar in concept to boldness, they are not mathematically equivalent. We will show that boldness is measured independently of the event outcomes, whereas measures of resolution rely on the base rate, and thus are not fully disentangled from the overall uncertainty of event outcomes.
A few methods both recalibrate and embolden predictions in highly specific circumstances. Lichtendahl et al. (2022) and Satopää (2022) focus on aggregates of forecaster predictions and their spread, but they specifically focus on a forecast aggregation approach that is not applicable to individual forecasters. We focus on appropriately emboldening predictions from a single forecaster subject to their calibration and classification accuracy.
The predictions in the case study we present are not aggregate forecasts, and thus the approaches of Lichtendahl et al. (2022) and Satopää (2022) are not applicable here. Turner et al. (2014), Baron et al. (2014), and Atanasov et al. (2017) also focus on aggregates, using the linear log odds (LLO) recalibration function to adjust aggregate boldness. Roitberg et al. (2022) employ a network based temperature scaling approach to recalibrate and correct overly bold softmax pseudoprobabilities. However, Turner et al. (2014), Baron et al. (2014), Atanasov et al. (2017) and Roitberg et al. (2022) all rely on the Brier Score and/or ECE to assess calibration. Han & Budescu (2022) focus on LLO applied to forecasts of continuous, rather than binary, events. Gonzalez & Wu (1999) use LLO to recalibrate single forecaster predictions but focus solely on the psychological implications of probability perception for binary events.
None of these methods provide direct control of the calibration-boldness tradeoff.
To the best of our knowledge, no methodology yet exists that provides a mechanism to directly control the tradeoff between calibration and boldness. To address this gap in the literature, we propose boldness-recalibration. Boldness-recalibration allows users to set the desired level of calibration in terms of the posterior calibration probability and then maximizes boldness by maximizing spread in predictions subject to that calibration level. Three key virtues of this approach are that it (a) quantifies the calibration-boldness tradeoff in an interpretable manner (in a Bayesian sense), (b) is forecaster agnostic, meaning it operates only on probability and event data, not on how the forecaster made the predictions, and (c) does not rely on binning.
The rest of this paper is organized as follows. Section 2 introduces boldness-recalibration methodology and pertinent real-world and simulation data examples. In Section 3, we provide the results of our real-world and simulated examples. Section 4 provides a discussion and concluding comments.

Methods
The following approaches are forecaster agnostic, meaning they can be applied to any probability forecasts of binary events produced by forecasters from many domains, regardless of how the predictions were made. By forecaster, we mean any entity that produces probability predictions, regardless of whether they are machine learning output and/or a product of human judgement and expertise.

Linear Log Odds (LLO) Recalibration Function
To assess calibration, we use the linear log odds (LLO) recalibration function. Let $c(x_i; \delta, \gamma)$ be the LLO function,
$$c(x_i; \delta, \gamma) = \frac{\delta x_i^{\gamma}}{\delta x_i^{\gamma} + (1 - x_i)^{\gamma}}, \tag{1}$$
where $x_i$ is a probability prediction from a forecaster, $\delta > 0$ and $\gamma \in \mathbb{R}$. We call the outputted probability, $c(x_i; \delta, \gamma)$, the LLO-adjusted probability. The LLO-adjusted set is based on shifting and scaling each of the original forecaster probabilities $x_i$ on the log odds scale using $\delta$ and $\gamma$. Thus, on the log odds scale, the LLO-adjusted set is linear with respect to the log odds of $x_i$ according to intercept $\log(\delta)$ and slope $\gamma$, and can be re-written as
$$\log\left(\frac{c(x_i; \delta, \gamma)}{1 - c(x_i; \delta, \gamma)}\right) = \gamma \log\left(\frac{x_i}{1 - x_i}\right) + \log(\delta). \tag{2}$$
Suitable choices of $\delta$ and $\gamma$ can calibrate poorly calibrated probabilities. The flexibility of the LLO function can capture many forms of miscalibration (Gonzalez & Wu 1999, Turner et al. 2014). When both $\delta = 1$ and $\gamma = 1$, the LLO function imposes no shifting nor scaling, returning the original prediction $x_i$ (Gonzalez & Wu 1999). Thus, null values of $\delta_0 = \gamma_0 = 1$ correspond to the hypothesis that $x_i$ is well calibrated. This is similar to how reliability diagrams operate in that when predicted forecaster probabilities are close to the observed frequency within each bin (i.e., forecasts are well calibrated), the result resembles the x = y line. The same is true when plotting event rates by LLO-adjusted probability forecasts via $\delta = 1$ and $\gamma = 1$ under calibration. It is important to note that if $c(x_i; \delta, \gamma)$ is considered the LLO-adjusted "event" probability, the corresponding LLO-adjusted "nonevent" probability is $1 - c(x_i; \delta, \gamma)$ rather than $c(1 - x_i; \delta, \gamma)$.
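As an illustration, the properties described above can be checked numerically. The following is a minimal sketch, assuming the standard LLO form $c(x; \delta, \gamma) = \delta x^{\gamma} / (\delta x^{\gamma} + (1 - x)^{\gamma})$; the function names `llo` and `logit` and the example values are our own choices, not part of the original analysis code.

```python
import math

def llo(x, delta, gamma):
    """LLO-adjust a probability x in (0, 1); assumes delta > 0 and real gamma."""
    num = delta * x ** gamma
    return num / (num + (1.0 - x) ** gamma)

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

x = 0.7  # illustrative prediction

# delta = gamma = 1 imposes no shifting or scaling.
identity = llo(x, 1.0, 1.0)

# On the log odds scale the adjustment is linear:
# intercept log(delta), slope gamma.
adj = llo(x, 2.0, 0.5)
linear_form = math.log(2.0) + 0.5 * logit(x)

# The adjusted non-event probability is 1 - c(x), not c(1 - x).
nonevent = 1.0 - adj
not_nonevent = llo(1.0 - x, 2.0, 0.5)

print(identity, logit(adj), linear_form, nonevent, not_nonevent)
```

The last two printed values differ whenever $\delta \neq 1$, which is why the non-event probability must be taken as the complement of the adjusted event probability.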

Likelihood Function
We adopt a Bernoulli likelihood where the events are presumed independent and the probability of each event is governed by LLO-adjusted probabilities. Let $\mathbf{y}$ be a vector of $n$ outcomes corresponding to the $n$ predictions in $\mathbf{x}$ from a single forecaster. Then, we have
$$\pi(\mathbf{x}, \mathbf{y} \mid \delta, \gamma) = \prod_{i=1}^{n} c(x_i; \delta, \gamma)^{y_i} \left[1 - c(x_i; \delta, \gamma)\right]^{1 - y_i}. \tag{3}$$
This likelihood enables calibration maximization via maximum likelihood estimates (MLEs) for $\delta$ and $\gamma$. The $\hat{\delta}_{MLE}$ and $\hat{\gamma}_{MLE}$ values produce optimally calibrated probabilities, $c(x_i; \hat{\delta}_{MLE}, \hat{\gamma}_{MLE})$. Shifting via $\hat{\delta}_{MLE}$ on the log odds scale adjusts the average prediction to match the sample proportion. Scaling by $\hat{\gamma}_{MLE}$ on the log odds scale spreads out or contracts predictions based on accuracy. This may be a desirable approach when probability calibration is the sole priority. Our approach of adopting a Bernoulli likelihood governed by LLO-adjusted probabilities is equivalent to a specialized logistic regression model. Details can be found in the online supplement.
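Because the LLO-adjusted log odds are linear in the log odds of $x_i$, the MLEs can be found with a two-parameter logistic regression of the outcomes on $\log(x_i / (1 - x_i))$, with intercept $\log(\delta)$ and slope $\gamma$. The sketch below is one possible implementation using Newton-Raphson in plain Python; the helper names, the simulated "hedged" forecaster, and the convergence settings are illustrative assumptions, not the authors' code.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def llo_loglik(x, y, delta, gamma):
    """Bernoulli log-likelihood of outcomes y under LLO-adjusted predictions."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = math.log(delta) + gamma * logit(xi)  # log odds of c(x_i)
        total += yi * eta - math.log1p(math.exp(eta))
    return total

def llo_mle(x, y, iters=50, tol=1e-10):
    """Newton-Raphson on (log delta, gamma); returns (delta_hat, gamma_hat)."""
    z = [logit(xi) for xi in x]
    a, g = 0.0, 1.0  # start at delta = gamma = 1
    for _ in range(iters):
        grad_a = grad_g = h_aa = h_ag = h_gg = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(a + g * zi)))
            w = p * (1.0 - p)
            grad_a += yi - p
            grad_g += (yi - p) * zi
            h_aa += w
            h_ag += w * zi
            h_gg += w * zi * zi
        det = h_aa * h_gg - h_ag * h_ag
        # Solve the 2x2 system (information matrix) * step = gradient.
        step_a = (h_gg * grad_a - h_ag * grad_g) / det
        step_g = (h_aa * grad_g - h_ag * grad_a) / det
        a += step_a
        g += step_g
        if abs(step_a) + abs(step_g) < tol:
            break
    return math.exp(a), g

# Hypothetical hedged forecaster: true log odds are halved before reporting.
rng = random.Random(1)
p_true = [rng.uniform(0.05, 0.95) for _ in range(2000)]
y = [1 if rng.random() < p else 0 for p in p_true]
x = [1.0 / (1.0 + math.exp(-0.5 * logit(p))) for p in p_true]

delta_hat, gamma_hat = llo_mle(x, y)
print(delta_hat, gamma_hat)
```

For this simulated forecaster, undoing the hedging requires doubling the log odds, so the estimate of $\gamma$ should land near 2 and the estimate of $\delta$ near 1. The sketch does not guard against perfect separation of events and non-events, where the MLEs do not exist.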

Bayesian Assessment of Calibration
Using the likelihood function in the previous section, we take a Bayesian model selection-based approach to calibration assessment. We compare a well calibrated model, $M_c$ (where $\delta = \gamma = 1$), to an uncalibrated model, $M_u$ (where $\delta > 0$, $\gamma \in \mathbb{R}$). The posterior model probability of $M_c$ given the observed outcomes $\mathbf{y}$ serves as our measure of calibration for the testing framework and can be expressed as
$$P(M_c \mid \mathbf{y}) = \frac{P(\mathbf{y} \mid M_c) P(M_c)}{P(\mathbf{y} \mid M_c) P(M_c) + P(\mathbf{y} \mid M_u) P(M_u)}. \tag{4}$$
Here, $P(\mathbf{y} \mid M_i)$ is the integrated likelihood of the observed outcomes $\mathbf{y}$ given $M_i$ and $P(M_i)$ is the prior probability of model $i$, $i \in \{c, u\}$. The Bayes Factor comparing the uncalibrated model to the calibrated model is defined as
$$BF = \frac{P(\mathbf{y} \mid M_u)}{P(\mathbf{y} \mid M_c)}.$$
Inverting the Bayes Factor gives us
$$\frac{1}{BF} = \frac{P(\mathbf{y} \mid M_c)}{P(\mathbf{y} \mid M_u)}.$$
Thus the expression in (4) can be re-written as
$$P(M_c \mid \mathbf{y}) = \frac{1}{1 + BF \, \dfrac{P(M_u)}{P(M_c)}}. \tag{8}$$
An essential component of Bayesian model selection is the specification of prior model probabilities. To the best of our knowledge, this is the first attempt to assess calibration probability through Bayesian model selection, and thus best practices for setting $P(M_c)$ and $P(M_u)$ have not yet been established. We set these prior probabilities to 1/2 in subsequent analyses for illustrative purposes only. Fully reproducible code will be made available with final acceptance of the article in the online supplement, which will allow users to set alternate model priors.
Using the Bernoulli likelihood from the Likelihood Function section, the integrated likelihoods, $P(\mathbf{y} \mid M_i)$, are not analytically tractable. While a fully Bayesian approach could be implemented, we advocate for a useful approximation. We employ a large sample approximation to the Bayes Factor using the Bayesian Information Criterion (BIC) such that
$$BF \approx \exp\left\{-\tfrac{1}{2}\left(BIC_u - BIC_c\right)\right\} \tag{9}$$
to form the posterior model probability in (8). See Kass & Raftery (1995) and Kass & Wasserman (1995) for more information about this approximation. Here, the BIC under the well calibrated model $M_c$ is defined as
$$BIC_c = -2 \log \pi(\mathbf{x}, \mathbf{y} \mid \delta = 1, \gamma = 1), \tag{10}$$
where $\pi(\mathbf{x}, \mathbf{y} \mid \delta, \gamma)$ denotes the likelihood. The penalty term for number of estimated parameters is omitted in (10) as both parameters are fixed at 1 under $M_c$. The BIC under the uncalibrated model $M_u$ is defined as
$$BIC_u = 2 \log(n) - 2 \log \pi(\mathbf{x}, \mathbf{y} \mid \hat{\delta}_{MLE}, \hat{\gamma}_{MLE}). \tag{11}$$
With this approximation for $BF$, we form $P(M_c \mid \mathbf{y})$, which can be interpreted as the probability the set of forecasts $\mathbf{x}$ is well calibrated given the observed data $\mathbf{y}$. Again, $P(M_c \mid \mathbf{y})$ corresponds to calibration as $\delta = \gamma = 1$ implies events happen at the rate forecasted with no further adjustment. The interpretability of the posterior model probability, $P(M_c \mid \mathbf{y})$, is the key feature of this Bayesian test for calibration. By quantifying the calibration of probability forecasts with a readily interpretable metric, we enable easier comparison of forecasters in terms of calibration and more informed decision making. We posit that $P(M_c \mid \mathbf{y})$ is interpretable to the extent that Bayesian posterior probabilities that condition on observed data are interpretable. For a frequentist approach to assessing calibration using the same likelihood, see the Likelihood Ratio test presented in the online supplement.
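To make the BIC approximation concrete, the sketch below computes the posterior probability of calibration from two log-likelihood values: one evaluated at $\delta = \gamma = 1$ and one at the MLEs. It assumes the usual BIC form (two estimated parameters under $M_u$, none under $M_c$); the function name `posterior_calibration_prob` and the numeric inputs are hypothetical.

```python
import math

def posterior_calibration_prob(loglik_null, loglik_mle, n, prior_c=0.5):
    """Approximate P(M_c | y) from log-likelihoods via the BIC approximation.

    loglik_null: log-likelihood at delta = gamma = 1
    loglik_mle:  log-likelihood at the MLEs of delta and gamma
    n:           number of predictions
    prior_c:     prior probability of the calibrated model M_c
    """
    bic_c = -2.0 * loglik_null                     # no free parameters under M_c
    bic_u = 2.0 * math.log(n) - 2.0 * loglik_mle   # two free parameters under M_u
    log_bf = -0.5 * (bic_u - bic_c)                # log Bayes factor, M_u vs M_c
    if log_bf > 700.0:                             # avoid overflow in exp
        return 0.0
    prior_u = 1.0 - prior_c
    return 1.0 / (1.0 + math.exp(log_bf) * prior_u / prior_c)

# Hypothetical log-likelihood values for n = 868 predictions.
print(posterior_calibration_prob(-590.0, -588.5, 868))

# When the MLE improves the log-likelihood by exactly log(n), the two BICs
# tie, the Bayes factor is 1, and equal priors give a posterior of 0.5.
tie = posterior_calibration_prob(-100.0, -100.0 + math.log(50), 50)
print(tie)
```

The tie case shows the role of the penalty term: the uncalibrated model must improve the log-likelihood by more than $\log(n)$ before it is favored under equal priors.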

Boldness-Recalibration
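The previous Bayesian model selection approach assesses calibration alone with no regard for boldness. We now consider the boldness of predicted probabilities measured by their spread (i.e., standard deviation)
$$s_b = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \tag{12}$$
where $\bar{x}$ is the mean prediction. The goal of boldness-recalibration is to maximize $s_b$, the boldness of predictions, subject to a user-specified constraint on the calibration probability, $P(M_c \mid \mathbf{y})$. To accomplish this we let the user set the calibration level, $t$, that $P(M_c \mid \mathbf{y})$ must achieve. For example, if we want to ensure our recalibrated probabilities have a posterior probability of calibration of at least 95%, we would set $t = 0.95$. Then boldness ($s_b$) is maximized subject to $P(M_c \mid \mathbf{y}) = 0.95$. We call $\mathbf{x}_t$ the $(100 \cdot t)$% boldness-recalibration set where $x_{i,t} = c(x_i; \hat{\delta}_t, \hat{\gamma}_t)$ and
$$(\hat{\delta}_t, \hat{\gamma}_t) = \underset{\delta, \gamma}{\arg\max} \left\{ s_b : P(M_c \mid \mathbf{y}) = t \right\}, \tag{13}$$
with both $s_b$ and $P(M_c \mid \mathbf{y})$ evaluated on the LLO-adjusted probabilities $c(x_i; \delta, \gamma)$. To visualize the process of boldness-recalibration, consider the two schemas in Figure 2.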
The panel on the left depicts predictions that vary in boldness. The "less bold" predictions are closer to the base rate ȳ. The "more bold" predictions arise by moving the original predictions away from the base rate, and thus increasing spread.
The panel on the right of Figure 2 shows a boldness-recalibration contour plot. This plot is used in the case study to show $P(M_c \mid \mathbf{y})$ across a grid of LLO-adjustment parameters $\delta$ and $\gamma$. Rather than focus solely on where $P(M_c \mid \mathbf{y})$ is high (i.e., high calibration), we can draw a contour at $P(M_c \mid \mathbf{y}) = t$ to focus on our user-specified level of calibration. Then, along that contour we identify the $\delta$ and $\gamma$ that maximize spread in the LLO-adjusted probabilities via a grid-search based approach. The $\delta$ and $\gamma$ values corresponding to the star indicate precisely how to use Eq. (1) to embolden predictions subject to $t$. In Figure 2, we identify these parameters with the star along the red contour at $t = 0.95$. We call the LLO-adjusted probabilities under these parameters the 95% boldness-recalibration set.
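The grid search can be sketched as follows. This is an illustrative implementation under several stated assumptions, not the authors' code: equal model priors; the constraint is applied as $P(M_c \mid \mathbf{y}) \geq t$ because a finite grid rarely lands exactly on the contour; and the maximized log-likelihood is approximated by its maximum over the grid. The sketch also relies on an observation of ours: LLO adjustments compose into another LLO adjustment, so the maximized likelihood is the same for any adjusted set with $\gamma \neq 0$, and the log-likelihood of the adjusted set at $\delta = \gamma = 1$ equals the log-likelihood of the original set at the grid point. All function names and the simulated data are hypothetical.

```python
import math
import random
import statistics

def logit(p):
    return math.log(p / (1.0 - p))

def llo(x, delta, gamma):
    num = delta * x ** gamma
    return num / (num + (1.0 - x) ** gamma)

def loglik(x, y, delta, gamma):
    """Bernoulli log-likelihood under LLO-adjusted predictions."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = math.log(delta) + gamma * logit(xi)
        total += yi * eta - math.log1p(math.exp(eta))
    return total

def boldness_recalibrate(x, y, t, deltas, gammas):
    """Grid search: maximize spread subject to P(M_c | y) >= t."""
    n = len(x)
    ll = {(d, g): loglik(x, y, d, g) for d in deltas for g in gammas}
    ll_max = max(ll.values())  # grid approximation to the MLE log-likelihood
    best = None
    for (d, g), l in ll.items():
        # log BF = -(BIC_u - BIC_c) / 2 with BIC_u = 2 log n - 2 ll_max
        # and BIC_c = -2 l for the set adjusted by (d, g).
        log_bf = min(ll_max - l - math.log(n), 700.0)
        post = 1.0 / (1.0 + math.exp(log_bf))  # equal priors assumed
        if post >= t:
            s_b = statistics.stdev([llo(xi, d, g) for xi in x])
            if best is None or s_b > best["s_b"]:
                best = {"delta": d, "gamma": g, "s_b": s_b, "posterior": post}
    return best

# Hypothetical hedged forecaster (true log odds halved), n = 500.
rng = random.Random(7)
p_true = [rng.uniform(0.05, 0.95) for _ in range(500)]
y = [1 if rng.random() < p else 0 for p in p_true]
x = [llo(p, 1.0, 0.5) for p in p_true]

deltas = [0.5 + 0.1 * k for k in range(16)]   # 0.5 to 2.0
gammas = [0.1 * k for k in range(1, 41)]      # 0.1 to 4.0
result = boldness_recalibrate(x, y, 0.95, deltas, gammas)
print(result)
```

A finer grid, or a constrained optimizer in place of the grid, would locate the point on the contour more precisely; the coarse grid here only keeps the example short.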

Other Methods to Assess Calibration
We report the Brier Score and Expected Calibration Error for the examples in this paper.

Brier Score Calibration Component
For binary events, the Brier Score takes on the form
$$BS = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2, \tag{14}$$
where $x_i$ is the predicted probability for event $i$ and $y_i$ is the binary outcome (0 if a nonevent, 1 if event). The Brier Score in (14) can take on any value from 0 to 1, where lower values are better.
The Brier Score can be decomposed as follows:
$$BS = \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{y}_k)^2 - \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{y}_k - \bar{y})^2 + \bar{y}(1 - \bar{y}), \tag{15}$$
where $\bar{x}_k$ is the average prediction for bin $k$, $\bar{y}_k$ is the relative frequency of events corresponding to the observations in bin $k$, $\bar{y}$ is the overall base rate, $K$ is the total number of bins, $n_k$ is the number of observations within bin $k$, and $n$ is the total number of predictions (Murphy 1973). The first addend on the right hand side of (15) is a measure of calibration, which we will refer to as Brier Score Calibration (BSC), and is the measure we will compare to $P(M_c \mid \mathbf{y})$. The second addend on the right hand side of (15) is a measure of resolution, which we will abbreviate as BSR, and is a measure we will compare to $s_b$. The third addend is a measure of uncertainty in the outcomes, which we will abbreviate as BSU. Lower values of BSC are better, with BSC = 0 indicating perfect calibration. Higher values of BSR are better, with BSR = BSU indicating perfect resolution.
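The decomposition can be illustrated with a short sketch. It assumes equal-width bins on [0, 1] (the binning scheme is our choice) and uses hypothetical predictions that take only five distinct values, one per bin, so that BS = BSC - BSR + BSU holds exactly; when predictions vary within a bin, the identity holds only approximately.

```python
import random

def brier_score(x, y):
    """Mean squared difference between predictions and binary outcomes."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

def brier_decomposition(x, y, k_bins=10):
    """Return (BSC, BSR, BSU) using k_bins equal-width bins on [0, 1]."""
    n = len(x)
    ybar = sum(y) / n
    bins = {}
    for xi, yi in zip(x, y):
        k = min(int(xi * k_bins), k_bins - 1)  # bin index; x = 1 goes in last bin
        bins.setdefault(k, []).append((xi, yi))
    bsc = bsr = 0.0
    for members in bins.values():
        nk = len(members)
        xk = sum(m[0] for m in members) / nk   # average prediction in the bin
        yk = sum(m[1] for m in members) / nk   # event frequency in the bin
        bsc += nk * (xk - yk) ** 2
        bsr += nk * (yk - ybar) ** 2
    return bsc / n, bsr / n, ybar * (1.0 - ybar)

# Hypothetical forecaster who only ever predicts one of five values.
rng = random.Random(3)
x = [rng.choice([0.1, 0.3, 0.5, 0.7, 0.9]) for _ in range(1000)]
y = [1 if rng.random() < xi else 0 for xi in x]

bs = brier_score(x, y)
bsc, bsr, bsu = brier_decomposition(x, y, k_bins=5)
print(bs, bsc - bsr + bsu)
```

Both printed values agree up to floating point rounding, which serves as a check on the signs of the three addends.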

Expected Calibration Error (ECE)
For binary events, Expected Calibration Error (ECE) takes on the form
$$ECE = \sum_{k=1}^{K} \frac{n_k}{n} \left| \bar{y}_k - \bar{x}_k \right|, \tag{16}$$
where $K$ is the number of bins, $n_k$ is the number of predictions partitioned into bin $k$, $\bar{y}_k$ is the proportion of observed events in bin $k$, and $\bar{x}_k$ is the average probability prediction in bin $k$. ECE can take on any value from 0 to 1, where lower values are better.
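A small worked example may help fix the definition. The sketch below assumes equal-width bins on [0, 1], which is one common choice rather than something specified here; the function name and the four-point data set are illustrative.

```python
def expected_calibration_error(x, y, k_bins=10):
    """ECE with k_bins equal-width bins on [0, 1]."""
    n = len(x)
    bins = {}
    for xi, yi in zip(x, y):
        k = min(int(xi * k_bins), k_bins - 1)  # bin index; x = 1 goes in last bin
        bins.setdefault(k, []).append((xi, yi))
    ece = 0.0
    for members in bins.values():
        nk = len(members)
        xk = sum(m[0] for m in members) / nk   # average prediction in the bin
        yk = sum(m[1] for m in members) / nk   # event frequency in the bin
        ece += (nk / n) * abs(yk - xk)
    return ece

# Two bins of two predictions each:
#   bin [0, 0.5): mean prediction 0.2, event frequency 0.5, gap 0.3
#   bin [0.5, 1]: mean prediction 0.8, event frequency 1.0, gap 0.2
x = [0.2, 0.2, 0.8, 0.8]
y = [0, 1, 1, 1]
ece = expected_calibration_error(x, y, k_bins=2)
print(ece)  # 0.5 * 0.3 + 0.5 * 0.2 = 0.25, up to floating point rounding
```

Note that ECE depends on the number and placement of bins, which is one motivation for a binning-free assessment.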

Hockey Home Team Win Predictions
To demonstrate the capabilities of boldness-recalibration, we assembled data from FiveThirtyEight that pertain to the 2020-21 National Hockey League (NHL) season. FiveThirtyEight produced predicted probabilities for all 868 regular season games that season via modelling with carefully constructed components based on expert knowledge of the game of hockey. These predictions were furnished prospectively pre-game, with no in-game or post-game updating. FiveThirtyEight probabilities are potentially hedged towards the base rate of 0.53 with an inter-quartile range of 0.12 (0.47, 0.59), their full range being (0.26, 0.77).
More detailed information about this data set can be found in the online supplement.
In addition to this real-world forecaster, we generated a set of 868 random probability predictions to represent a hockey forecaster who is completely uninformed about the NHL games they aimed to predict. We call this forecaster our "random noise forecaster." To mimic this behavior and better enable comparability, our random noise forecaster is generated by taking random uniform draws from 0.26 to 0.77, the observed range in the FiveThirtyEight data. The purpose of the random noise forecaster is to demonstrate how boldness-recalibration operates when predictions are unrelated to the events they predict.
We want to ensure our method does not blindly embolden inaccurate forecasts.
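Generating such a forecaster takes only a few lines. The sketch below is illustrative: the seed is arbitrary, so it will not reproduce the exact draws used in the case study. It also compares the sample spread to the standard deviation of a uniform distribution on (0.26, 0.77), which is $0.51 / \sqrt{12} \approx 0.147$, in line with the $s_b$ of 0.146 reported for this forecaster in the case study.

```python
import math
import random
import statistics

rng = random.Random(2021)  # arbitrary seed, chosen only for reproducibility

# 868 uniform draws over the observed range of the FiveThirtyEight predictions.
noise_forecaster = [rng.uniform(0.26, 0.77) for _ in range(868)]

s_b = statistics.stdev(noise_forecaster)          # boldness of the random set
theoretical_sd = (0.77 - 0.26) / math.sqrt(12)    # sd of Uniform(0.26, 0.77)
print(round(s_b, 3), round(theoretical_sd, 3))
```

Because the draws ignore the games entirely, any spread in this set reflects noise rather than information about the outcomes.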

Simulation Study
Table 1 shows the four forecaster types in our simulation study. The data generation process for our simulation follows:
1. Generate $n$ true event outcomes via random independent Bernoulli draws, where the probability of success at each draw, $p_i$, takes on a random uniform draw from 0 to 1. The $p_i$s make up the well calibrated forecaster predictions by construction, as they directly correspond to the true probability of each event outcome.
2. To manipulate classification accuracy, add varying amounts of random noise, $v_i$, to each $p_i$ on the log odds scale, which is equivalent to
$$p_{i,\sigma} = \left[ 1 + \exp\left\{ -\left( \log\frac{p_i}{1 - p_i} + v_i \right) \right\} \right]^{-1},$$
where $p_{i,\sigma}$ is the set of noisy probabilities and $v_i \sim N(0, \sigma^2)$, $\sigma \in \{0, 0.1, 0.5, 1, 2\}$.
3. To simulate each forecaster type, LLO-adjust the noisy probabilities $p_{i,\sigma}$ using the $\delta$ and $\gamma$ values for that type given in Table 1.
Table 1: Parameter values under which each forecaster type is simulated along with description of forecaster type.
The first forecaster type, called Well Calibrated, represents forecasters whose predictions correspond to the true event rate. Notice in Table 1 that these predictions are LLO-adjusted under $\delta = \gamma = 1$, so the Well Calibrated probabilities are equivalent to the perfectly calibrated probabilities with added noise, i.e., $p^{wc}_{i,\sigma} = p_{i,\sigma}$. Thus under $\sigma = 0$, $p^{wc}_{i,0} = p_i$, the perfectly calibrated probabilities.
Our second forecaster type is called Hedger. The Hedger compresses probabilities around the base rate, 0.5 in this case. We call their predictions "hedged" as they reflect forecasters who are lacking boldness even though their accuracy could be high. In contrast, our third forecaster type, Boaster, represents a forecaster who exhibits excessive boldness. The majority of their predictions are far from the base rate and very close to the extremes of 0 and 1. The fourth forecaster type is Biased. These forecasters systematically make predictions that are higher or lower than the event rate.
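The data generation process can be sketched as below. Only the Well Calibrated setting ($\delta = \gamma = 1$) is taken from the text; the $\delta$ and $\gamma$ values used here for the Hedger, Boaster, and Biased forecasters are hypothetical placeholders chosen to produce the described behavior, not the values in Table 1. The function and variable names are also our own.

```python
import math
import random
import statistics

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def llo(x, delta, gamma):
    num = delta * x ** gamma
    return num / (num + (1.0 - x) ** gamma)

def simulate_forecaster(n, sigma, delta, gamma, seed):
    """Return (predictions, outcomes) for one simulated forecaster."""
    rng = random.Random(seed)
    eps = 1e-12  # keep probabilities strictly inside (0, 1)
    # True event probabilities and Bernoulli outcomes.
    p = [min(max(rng.random(), eps), 1.0 - eps) for _ in range(n)]
    y = [1 if rng.random() < pi else 0 for pi in p]
    # Add N(0, sigma^2) noise on the log odds scale.
    noisy = [inv_logit(logit(pi) + rng.gauss(0.0, sigma)) for pi in p]
    # LLO-adjust to induce the forecaster's miscalibration.
    x = [llo(q, delta, gamma) for q in noisy]
    return x, y

# Hypothetical (delta, gamma) settings; only Well Calibrated is from the text.
forecasters = {
    "Well Calibrated": (1.0, 1.0),
    "Hedger": (1.0, 0.5),
    "Boaster": (1.0, 2.0),
    "Biased": (2.0, 1.0),
}

spread = {}
for name, (delta, gamma) in forecasters.items():
    # Same seed for every type so all share the same outcomes and noise.
    x, y = simulate_forecaster(1000, 0.5, delta, gamma, seed=11)
    spread[name] = statistics.stdev(x)
print(spread)
```

Using a shared seed means the four forecasters differ only through the LLO adjustment, which isolates the effect of miscalibration on spread.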
While this simulation focuses on miscalibration simulated via LLO-adjustment, we also explore miscalibration simulated from Prelec's two parameter function:
$$w(x_i; \alpha, \beta) = \exp\left\{ -\beta \left( -\log x_i \right)^{\alpha} \right\}, \tag{17}$$
where $\alpha > 0$ and $\beta > 0$ (Prelec 1998). We follow the same simulation procedure as shown above, except using (17) rather than LLO in Step 3. Similar to LLO-adjustment, the Well Calibrated forecaster is Prelec-adjusted under $\alpha = \beta = 1$. The other forecaster parameter values for $\alpha$ and $\beta$ are summarized in Table 1. Although the original formulation of this function in Prelec (1998) limits $0 < \alpha < 1$, we also allow $\alpha \geq 1$, as the function still provides valid probabilities under these settings. Given that our methodology assumes miscalibration follows the LLO function, which likely will not hold in all scenarios, we simulate miscalibration via the Prelec function to assess how well our methodology performs under miscalibration misspecification.
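For reference, a minimal sketch of the Prelec adjustment follows, assuming the two parameter form $\exp\{-\beta(-\log x)^{\alpha}\}$ from Prelec (1998). The function name and the parameter values in the example are illustrative and are not the settings in Table 1.

```python
import math

def prelec(x, alpha, beta):
    """Prelec's two parameter function for x in (0, 1), alpha > 0, beta > 0."""
    return math.exp(-beta * (-math.log(x)) ** alpha)

# alpha = beta = 1 returns the original probability.
unchanged = prelec(0.3, 1.0, 1.0)

# Values of alpha >= 1 still map (0, 1) into (0, 1) and preserve ordering.
low = prelec(0.2, 2.0, 0.5)
high = prelec(0.6, 2.0, 0.5)
print(unchanged, low, high)
```

Like LLO, this function is monotone increasing in $x$, so it changes calibration without changing the ranking of predictions.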
To explore the effect of sample size on our methodology, we generated data sets of size n = 30, 100, 800, 2,000, and 5,000. In total, we present results from 100 Monte Carlo (MC) replicates for each value of n. Throughout the study, one MC replicate consists of a set of n outcomes, and a corresponding set of n predicted probabilities for each of the four forecaster types under each of the five noise settings and both LLO and Prelec adjustment (35 total sets of predicted probabilities for each replicate). For each of the 35 sets, we (i) LLO-adjust via the MLEs $\hat{\delta}_{MLE}$, $\hat{\gamma}_{MLE}$, (ii) 95%, 90%, and 80% boldness-recalibrate, and then (iii) evaluate the calibration and boldness of the adjusted sets from (i) and (ii).

Hockey Home Team Win Predictions Case Study
We applied boldness-recalibration to the two hockey forecasters at three specified levels of calibration: t = 0.95, 0.9, and 0.8. Figure 3 shows the predictions under each recalibration procedure. Points and lines colored blue correspond to predictions for games in which the home team won. Red corresponds to games in which the home team lost. The posterior model probability of calibration is reported in the parentheses in the axis label. Note that the posterior model probabilities are not necessarily linear from left to right. We order the sets in this way, starting with the original forecasters' sets, to make consistent comparisons throughout the results of the paper.
First, notice that the probability of calibration given the event outcomes for the original FiveThirtyEight forecasts is very high at 0.9904, whereas the probability for the random noise forecaster rounds down to 0.000. This indicates that FiveThirtyEight is well calibrated to begin with and, as we would expect, the random noise forecaster is not. Next, notice that by maximizing $P(M_c \mid \mathbf{y})$, the range of FiveThirtyEight's predictions expands from (0.26, 0.77) to (0.18, 0.84) and $s_b$ increases from 0.091 to 0.124 as seen in Table 2. FiveThirtyEight can achieve a maximal probability of calibration of 0.9988. In contrast, for the random noise forecaster to achieve their maximal calibration of 0.9988, they must pull their predictions in toward the base rate of 0.53. Their prediction range contracts from (0.26, 0.77) to (0.51, 0.56) and $s_b$ drops from 0.146 to 0.011. Not only does this imply the random noise forecaster is poorly calibrated, but it also suggests that their predictions do not have useful predictive information. We know this to be true because these predictions were randomly generated with no association with the outcome.
Now compare the spread of original predictions to the spread of the 95% boldness-recalibration (B-R) set. FiveThirtyEight can further embolden their predictions by accepting a 5% risk of miscalibration, expanding their range to (0.10, 0.90). This suggests that FiveThirtyEight could embolden predictions with a modest decrease in $P(M_c \mid \mathbf{y})$, whereas the random noise forecaster has no knowledge of the outcome and should make far more cautious calls. In this example, there is minimal gain in boldness moving from 95% B-R to 90% or 80%. It is up to the discretion of the user to determine whether accepting an additional 5% or 10% risk of miscalibration is worth the minimal gain in boldness.
Regardless, boldness-recalibration successfully increases the boldness of our skilled hockey forecaster while maintaining a user-specified level of calibration.For our random noise forecaster, boldness-recalibration suggests that increasing boldness would not be responsible, and instead contracts predictions.
In terms of the Brier Score, FiveThirtyEight and the random noise forecaster achieve scores of 0.2346 and 0.2675 respectively. It is hard to say how this practically translates to how much "better" FiveThirtyEight is compared to the random noise forecaster and what a "good" Brier Score is for this application. Despite the substantial increase in $s_b$ and prediction range for FiveThirtyEight, the Brier Score shows very little change, improving by 0.001 under MLE recalibration and an additional 0.002 under 95% B-R. The BSC is the same for the original and 95% B-R sets and BSR improves by 0.003. In contrast, BSC for the random noise forecaster drops to near zero after MLE recalibration and then worsens to 0.002 under 95% B-R. The BSR worsens after MLE recalibration and remains at 0.000, the worst achievable score for BSR, for all B-R sets. This further reflects that the random noise forecaster can only improve calibration by reducing boldness and resolution. We see that ECE is minimized when calibration is maximized for both forecasters and worsens by 0.017 under 95% B-R for FiveThirtyEight and by 0.028 for the random noise forecaster.
In terms of classification accuracy, FiveThirtyEight produces an AUC of 0.65. This implies their predictions are better than chance and provide some information in classifying a home team win. Our random noise forecaster produces an AUC of 0.51, which is very close to the underlying 0.5 we would expect as this forecaster makes predictions completely via random chance. Notice that AUC stays the same across all sets for both forecasters. This is because the LLO function is monotonic, so the ordering of predictions does not change and neither does AUC.
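This invariance is easy to verify numerically. The sketch below computes AUC as the proportion of (event, non-event) pairs in which the event received the higher prediction, counting ties as one half, and compares it before and after an LLO adjustment with positive $\gamma$. The eight predictions and outcomes are made up for illustration.

```python
def llo(x, delta, gamma):
    num = delta * x ** gamma
    return num / (num + (1.0 - x) ** gamma)

def auc(x, y):
    """AUC as the share of (event, non-event) pairs ranked correctly."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Illustrative predictions and outcomes (not the hockey data).
x = [0.30, 0.45, 0.50, 0.60, 0.62, 0.70, 0.40, 0.55]
y = [0, 0, 1, 1, 0, 1, 0, 1]

auc_original = auc(x, y)
auc_adjusted = auc([llo(xi, 1.3, 2.5) for xi in x], y)
print(auc_original, auc_adjusted)  # both 0.8125 (13 of 16 pairs ordered correctly)
```

Since AUC depends only on ranks, it cannot distinguish a hedged set from an emboldened one, which is why it is reported alongside, rather than instead of, the calibration and boldness measures.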

Simulation Study
The Hedgers with large added noise appear well calibrated. This is not surprising as we have already established that hedging predictions with poor classification accuracy is a favorable strategy to achieve calibration. Also notice that under miscalibration misspecification (i.e., we assume miscalibration follows LLO when it actually follows Prelec), we see similar, if not improved, detection of miscalibration.
(Figure caption, partially recovered: As sample size increases with vertical panels, $P(M_c \mid \mathbf{y})$ decreases for all forecasters except Well Calibrated with little to no added noise.)
All simulated prediction sets were MLE recalibrated, and 95%, 90%, and 80% boldness-recalibrated. Of the 17,500 total prediction sets, 95%, 90%, and 80% boldness-recalibration was successful in 99.4%, 99.2%, and 98.7% of cases, respectively. By "successful", we mean that these sets were maximally emboldened while calibration of t = 0.95, 0.9, or 0.8 was maintained. In most of the small percentage of cases where boldness-recalibration was not successful, the underlying optimization was unable to converge to parameters that achieved the desired level of calibration. In the other few cases, adding random noise to the probabilities caused perfect separation of events and non-events. Under MLE recalibration, these predictions are all moved to either 0 or 1, where no further emboldening is possible. The sets where boldness-recalibration was not achievable were removed from the results, as our focus is demonstrating the capabilities of boldness-recalibration.
In each panel of Figure 6, the first column of points represents the metric for the MLE recalibrated set. The second column of points represents the same metric for the 95% B-R set. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. We choose to only show the metric values under MLE recalibration and 95% boldness-recalibration for two reasons: (i) it is best practice to not operate under poor calibration, whether it be the original set or sets at low boldness-recalibration thresholds like 80%, and (ii) we see little change in these metrics moving from the 95% to the 90% B-R set. Additionally, we found that the results for the Prelec function were nearly identical to those for LLO miscalibration, so we focus on LLO here. Results for all sets can be found in the online supplement.
Notice in Figure 6a that when moving from maximal calibration to 95% calibration, all sets are emboldened to some degree. However, as added noise increases, boldness decreases. This is desirable, as less noisy predictions should be more bold than more noisy predictions.
The distinction in $s_b$ between levels of added noise becomes clearer with higher sample sizes. While the increase in $s_b$ from the MLE recalibrated set to the 95% B-R set diminishes with sample size, we expect that the degree to which emboldening is appropriate also diminishes. Where data are abundant and have proved to be extremely reliable, there may be less need for emboldening, as the predictions are already useful for decision making.
As for the Brier Score Resolution, notice in Figure 6b that there is little change between MLE recalibration and 95% B-R under small sample sizes. In just a few cases under $n = 30$, we see that BSR decreases after emboldening. There is virtually no change in BSR under large sample sizes. As expected, with more added noise, BSR is typically lower.
Brier Score Calibration is much less consistent across MC runs than the other metrics. Notice in Figure 6c that there is little to no distinction in BSC between levels of added noise. Additionally, at small sample sizes BSC sometimes improves but other times worsens. Under large sample sizes, BSC always worsens, as we would expect given that we are sacrificing calibration in favor of boldness. Similarly, in Figure 6d we see that there is little distinction in expected calibration error between levels of added noise. However, we see more consistency in terms of the degree of increase in ECE across MC runs. Under small sample sizes, we see little to no change in ECE. As sample size increases, we see larger increases in ECE.
This paper develops boldness-recalibration methodology surrounding the fundamental tension between calibration and boldness of predicted probabilities, subject to classification accuracy. While some methods consider concepts similar to boldness, such as resolution, relative to calibration, none provide direct control of the trade-off between the two. Our proposed Bayesian calibration assessment and boldness-recalibration approaches address this gap. This paper is for those who would consider making a small sacrifice in posterior probability of calibration to gain boldness so as to study riskier predictions for decision making.
The backbone of these approaches is the interpretable (in a Bayesian sense) posterior model probability, $P(M_c \mid y)$, which serves as a measure of calibration and is interpreted as the probability a set of predictions is calibrated, given the data observed. We define boldness as the spread (i.e., standard deviation) in predictions. In boldness-recalibration, the user pre-specifies a tolerable risk of miscalibration (e.g., $P(M_c \mid y) = 0.95$) and, subject to this constraint, our method maximizes spread in predictions and thus boldness. The difference in the posterior model probabilities for the original and boldness-recalibrated sets concisely quantifies the calibration-boldness trade-off. By pre-specifying calibration via $P(M_c \mid y)$, the user is given direct control of the boldness-calibration trade-off.
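The two quantities in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the usual two-parameter LLO form, $\mathrm{logit}\, c(x) = \log\delta + \gamma\,\mathrm{logit}\, x$, takes $M_c$ to be $\delta = \gamma = 1$, and approximates the Bayes factor with BIC, none of which is restated in this section. The helper names (`llo`, `llo_mle`, `posterior_calibration_prob`) and the toy data are hypothetical.

```python
import math
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def llo(x, delta, gamma):
    """Assumed LLO form for x in (0, 1): logit(c(x)) = log(delta) + gamma * logit(x)."""
    return 1.0 / (1.0 + math.exp(-(math.log(delta) + gamma * logit(x))))


def log_lik(x, y, delta, gamma):
    """Bernoulli log-likelihood of binary outcomes y under LLO-adjusted x."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = llo(xi, delta, gamma)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def llo_mle(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson MLE of (delta, gamma): a logistic regression of y on logit(x)."""
    a, g = 0.0, 1.0  # a = log(delta); start at the identity adjustment
    z = [logit(xi) for xi in x]
    for _ in range(max_iter):
        grad_a = grad_g = h_aa = h_ag = h_gg = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(a + g * zi)))
            w = p * (1.0 - p)
            grad_a += yi - p
            grad_g += (yi - p) * zi
            h_aa += w
            h_ag += w * zi
            h_gg += w * zi * zi
        det = h_aa * h_gg - h_ag * h_ag
        step_a = (h_gg * grad_a - h_ag * grad_g) / det
        step_g = (h_aa * grad_g - h_ag * grad_a) / det
        a, g = a + step_a, g + step_g
        if abs(step_a) + abs(step_g) < tol:
            break
    return math.exp(a), g


def posterior_calibration_prob(x, y, prior_c=0.5):
    """P(M_c | y) from a BIC-approximated Bayes factor (an assumption of this sketch)."""
    n = len(y)
    d_hat, g_hat = llo_mle(x, y)
    bic_c = -2.0 * log_lik(x, y, 1.0, 1.0)  # M_c: no free parameters
    bic_u = 2.0 * math.log(n) - 2.0 * log_lik(x, y, d_hat, g_hat)  # M_u: two parameters
    bf_uc = math.exp(min(-(bic_u - bic_c) / 2.0, 700.0))  # capped to avoid overflow
    return 1.0 / (1.0 + bf_uc * (1.0 - prior_c) / prior_c)


# Toy data: event frequencies of 0.2, 0.5, and 0.8 within three groups of ten.
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
x_calibrated = [0.2] * 10 + [0.5] * 10 + [0.8] * 10  # matches the frequencies
x_hedged = [0.4] * 10 + [0.5] * 10 + [0.6] * 10      # pulled toward 0.5

p_calibrated = posterior_calibration_prob(x_calibrated, y)
p_hedged = posterior_calibration_prob(x_hedged, y)
s_b_calibrated = statistics.stdev(x_calibrated)  # boldness as spread
s_b_hedged = statistics.stdev(x_hedged)
```

With equal prior model probabilities and a perfectly calibrated set, this approximation gives $P(M_c \mid y) = n/(n+1)$, so the posterior probability is bounded away from 1 at small $n$.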
Boldness-recalibration provides a means of appropriately emboldening probability predictions. The hockey case study shows that "appropriate" may have different meanings depending on the quality of the data. The predictions from FiveThirtyEight were substantially emboldened while maintaining reasonable calibration. This indicates their predictions are reliable but overly cautious. However, boldness-recalibration showed that the random noise forecaster should bring their predictions in towards the base rate rather than embolden. In this case, it is more appropriate to un-embolden, as their predictions were not reliable or useful for decision making.
We demonstrate via simulation study that $P(M_c \mid y)$ correctly identifies miscalibration, and that boldness-recalibration appropriately emboldens, in nearly all forecaster types at a reasonable sample size, even under miscalibration misspecification. After correcting miscalibration in the simulated predictions, we see an increase in boldness across all sets. In cases where the original predictions are noisy, spread is lower in the boldness-recalibrated set than in the original, and it is higher for those that are accurate.
While we leverage spread to measure boldness and the LLO function to recalibrate, one could consider alternatives to these choices. The core idea of selecting an emboldening plan that satisfies a required probability of calibration still holds. Another potential future research goal of interest is that of subjective elicitation of prior probabilities of calibration.
While we use $P(M_c) = P(M_u) = \frac{1}{2}$, this may not be ideal in situations where prior information can be obtained. Additionally, another goal is to investigate potential dependence structure among forecasts, as this methodology does not currently enable analysis of dependence between forecasts in a set. While we provide one example of a use case in this paper, we propose that these methods are useful in many situations where there are predicted probabilities of binary events. These methods allow decision makers to rely on these predictions to make informed decisions. Appropriately emboldened predictions, as produced by boldness-recalibration in an interpretable manner, enable better decision making in these critical situations.
... regression and our approach is that we let $p = c(x)$; in words, we let the governing probabilities of our outcomes equal the LLO-adjusted probabilities from the LLO function.
In Figure 7a, as expected, we see that $s_b$ is higher for Boasters and lower for Hedgers compared to the Well Calibrated forecasters. We also see that $s_b$ is the same for Biased compared to Well Calibrated, as the Biased probabilities are only shifted, not scaled. Additionally, we see that the mean $s_b$ across MC runs does not change with sample size. However, in Figure 7b we see that the mean Brier Score Resolution component (BSR) degrades (decreases) with sample size. Additionally, the difference in BSR between the Well Calibrated forecasters and the Hedgers and Boasters is not as clear as it is with $s_b$.

Additional Simulation Study Results
Figures 7c and 7f show that as sample size increases, the Brier Score Calibration component (BSC) and Expected Calibration Error (ECE) improve (decrease) for all settings, not just those that are well calibrated. At large sample sizes, the Hedgers with large added noise score nearly indistinguishably from the perfectly calibrated forecaster on the far left under BSC. While BSC for the Boasters and Biased forecasters is higher than for the well calibrated, the difference is minimal. ECE provides greater distinction between forecasters than BSC.
Notice that the Brier Score Uncertainty component (BSU) does not change across forecaster types (Figure 7d). This is because BSU does not rely on the probability predictions themselves. Rather, it is simply a measure of the uncertainty in the outcomes. Since BSU is a metric that cannot be changed by improving (or degrading) forecasts, and thus is out of the control of forecasters, we choose not to discuss BSU further.
It is more common to use the overall Brier Score to assess forecasters. Figure 7e shows that the Brier Score changes minimally with sample size, unlike its BSC and BSR components. However, the distinction between forecasters is still minimal under this metric. Unlike the posterior probability of calibration, it is difficult to interpret an individual forecaster's Brier Score in isolation. These results are an extension of those shown in the main text.
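For readers who want to reproduce these metrics, the following is a rough sketch of the Brier Score components and ECE. It assumes the standard Murphy decomposition (BS = BSC - BSR + BSU, which is exact when forecasts are grouped by their distinct values) and an ECE with ten equal-width bins; the binning used for the figures is not stated in this section, so both choices are assumptions. Function names and data are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(x, y):
    """Murphy decomposition, grouping forecasts by their distinct values."""
    n = len(y)
    base_rate = sum(y) / n
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    bsc = sum(len(o) * (f - sum(o) / len(o)) ** 2 for f, o in groups.items()) / n
    bsr = sum(len(o) * (sum(o) / len(o) - base_rate) ** 2 for o in groups.values()) / n
    bsu = base_rate * (1.0 - base_rate)  # depends on outcomes only
    bs = sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / n
    return {"BS": bs, "BSC": bsc, "BSR": bsr, "BSU": bsu}


def expected_calibration_error(x, y, n_bins=10):
    """Weighted mean |average prediction - event frequency| over equal-width bins."""
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        bins[min(int(xi * n_bins), n_bins - 1)].append((xi, yi))
    n = len(y)
    ece = 0.0
    for members in bins.values():
        mean_pred = sum(p for p, _ in members) / len(members)
        freq = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(mean_pred - freq)
    return ece


# Toy data: event frequencies of 0.2, 0.5, and 0.8 within three groups of ten.
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
x_calibrated = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
x_hedged = [0.4] * 10 + [0.5] * 10 + [0.6] * 10

parts_calibrated = brier_decomposition(x_calibrated, y)
parts_hedged = brier_decomposition(x_hedged, y)
ece_calibrated = expected_calibration_error(x_calibrated, y)
ece_hedged = expected_calibration_error(x_hedged, y)
```

In this toy example the two forecasters share the same BSU and BSR and differ only in BSC, which illustrates why BSU cannot distinguish forecasters.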

Likelihood Ratio Test for Calibration Results
Figure 11 shows that as sample size increases, the LRT p-value for calibration also decreases in all settings except for the Well Calibrated forecasters with little or no noise, where the null hypothesis is true by construction (and thus p-values are uniform). This reflects increasing power to detect miscalibration with increasing sample size. Again, notice that Hedgers with large added noise appear well calibrated, but as power increases, our LRT more easily detects the miscalibration.
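The behavior described above can be illustrated with a small sketch of a likelihood ratio test for calibration. It assumes the null hypothesis is $\delta = \gamma = 1$ under the LLO form $\mathrm{logit}\, c(x) = \log\delta + \gamma\,\mathrm{logit}\, x$, so the test statistic is referred to a chi-square distribution with 2 degrees of freedom, whose survival function is $e^{-\lambda/2}$. These details are assumptions of the sketch, since this section does not restate the test, and the function names and data are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def log_lik(x, y, a, g):
    """Bernoulli log-likelihood with logit(p) = a + g * logit(x), where a = log(delta)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + g * logit(xi))))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def llo_mle(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson MLE of (log delta, gamma)."""
    a, g = 0.0, 1.0
    z = [logit(xi) for xi in x]
    for _ in range(max_iter):
        grad_a = grad_g = h_aa = h_ag = h_gg = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(a + g * zi)))
            w = p * (1.0 - p)
            grad_a += yi - p
            grad_g += (yi - p) * zi
            h_aa += w
            h_ag += w * zi
            h_gg += w * zi * zi
        det = h_aa * h_gg - h_ag * h_ag
        step_a = (h_gg * grad_a - h_ag * grad_g) / det
        step_g = (h_aa * grad_g - h_ag * grad_a) / det
        a, g = a + step_a, g + step_g
        if abs(step_a) + abs(step_g) < tol:
            break
    return a, g


def lrt_calibration_pvalue(x, y):
    """p-value for H0: delta = gamma = 1, using a chi-square(2) reference."""
    a_hat, g_hat = llo_mle(x, y)
    stat = 2.0 * (log_lik(x, y, a_hat, g_hat) - log_lik(x, y, 0.0, 1.0))
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    return math.exp(-stat / 2.0)  # chi-square(2) survival function


# Hedged forecasts (0.4, 0.5, 0.6) against event frequencies of 0.2, 0.5, 0.8.
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
x_hedged = [0.4] * 10 + [0.5] * 10 + [0.6] * 10

p_n30 = lrt_calibration_pvalue(x_hedged, y)
p_n300 = lrt_calibration_pvalue(x_hedged * 10, y * 10)  # same pattern, ten times the data
```

The same degree of hedging is not flagged at $n = 30$ but is clearly flagged at $n = 300$, mirroring the increase in power with sample size.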

Figure 2: Schemas to visualize boldness-recalibration. The left panel shows boldness as a function of spread in predictions. Each line corresponds to a prediction. The right panel shows a boldness-recalibration contour plot where the x-axis is the shift parameter $\delta$, the y-axis is the scale parameter $\gamma$, and the z-axis is the $P(M_c \mid y)$ achieved by $\delta$ and $\gamma$. Contours correspond to $P(M_c \mid y) = 0.95$ (solid red), 0.9 and 0.8 (dashed black). The × corresponds to $(\hat{\delta}_{MLE}, \hat{\gamma}_{MLE})$ such that the resulting probabilities under LLO-adjustment have maximal probability of calibration. The star on the 0.95 contour corresponds to $(\hat{\delta}_{0.95}, \hat{\gamma}_{0.95})$ such that the resulting probabilities have maximal spread subject to 95% calibration. These LLO-adjusted probabilities are called the 95% boldness-recalibration set.
Figure 3 shows the boldness-recalibration plots for FiveThirtyEight (Left) and the random noise forecaster (Right). Regions in red show where $P(M_c \mid y)$ is high for the LLO-adjusted $x$ via the corresponding $\delta$ (x-axis) and $\gamma$ (y-axis) values. Regions in blue show where $P(M_c \mid y)$ is low. As expected, $\hat{\delta}_{MLE}$ and $\hat{\gamma}_{MLE}$, marked by the white × in Figure 3, lie at the point where the probability of calibration is maximized. The values for $\hat{\delta}_t$ and $\hat{\gamma}_t$ are marked by white points along the contour for each $t$. Recall these represent the set of LLO-adjustment parameters for which maximal boldness is achieved with a probability of calibration of at least $t$. These parameter values, along with the achieved $P(M_c \mid y)$, $s_b$, prediction range, Brier Score, BSC, BSR, ECE, and AUC, are summarized in Table 2.
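The contour-plot logic can be mimicked with a coarse grid search: evaluate $P(M_c \mid y)$ over a grid of $(\delta, \gamma)$ and keep the feasible point with the largest spread. This is only a sketch. The grid search stands in for whatever constrained optimizer the authors use, the BIC approximation to the Bayes factor and equal prior model probabilities are assumptions, and all function names and data are hypothetical. The sketch relies on one exact property: because $\mathrm{logit}\, c(x)$ is affine in $\mathrm{logit}\, x$, LLO adjustments compose into LLO adjustments, so for $\gamma \neq 0$ the maximized likelihood of any adjusted set equals that of the original set and needs to be computed only once.

```python
import math
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def llo(x, delta, gamma):
    """Assumed LLO form: logit(c(x)) = log(delta) + gamma * logit(x)."""
    return 1.0 / (1.0 + math.exp(-(math.log(delta) + gamma * logit(x))))


def log_lik(preds, y):
    """Bernoulli log-likelihood of outcomes y under the given predictions."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for p, yi in zip(preds, y))


def max_log_lik(x, y, max_iter=50, tol=1e-10):
    """Maximized LLO log-likelihood via Newton-Raphson on (log delta, gamma)."""
    a, g = 0.0, 1.0
    z = [logit(xi) for xi in x]
    for _ in range(max_iter):
        grad_a = grad_g = h_aa = h_ag = h_gg = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(a + g * zi)))
            w = p * (1.0 - p)
            grad_a += yi - p
            grad_g += (yi - p) * zi
            h_aa += w
            h_ag += w * zi
            h_gg += w * zi * zi
        det = h_aa * h_gg - h_ag * h_ag
        step_a = (h_gg * grad_a - h_ag * grad_g) / det
        step_g = (h_aa * grad_g - h_ag * grad_a) / det
        a, g = a + step_a, g + step_g
        if abs(step_a) + abs(step_g) < tol:
            break
    return log_lik([llo(xi, math.exp(a), g) for xi in x], y)


def boldness_recalibrate(x, y, t, deltas, gammas):
    """Grid point with maximal spread subject to P(M_c | y) >= t, or None if infeasible."""
    n = len(y)
    ll_max = max_log_lik(x, y)  # shared by every LLO-adjusted set (gamma != 0)
    best = None
    for d in deltas:
        for g in gammas:
            adjusted = [llo(xi, d, g) for xi in x]
            # BIC-approximated Bayes factor of M_u against M_c for the adjusted set
            bf_uc = math.exp(min(ll_max - log_lik(adjusted, y), 700.0)) / n
            post = 1.0 / (1.0 + bf_uc)  # equal prior model probabilities
            spread = statistics.stdev(adjusted)
            if post >= t and (best is None or spread > best["s_b"]):
                best = {"delta": d, "gamma": g, "post": post, "s_b": spread}
    return best


# Hedged forecasts (0.4, 0.5, 0.6) against event frequencies of 0.2, 0.5, 0.8.
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
x = [0.4] * 10 + [0.5] * 10 + [0.6] * 10
deltas = [round(0.5 + 0.1 * i, 1) for i in range(16)]  # 0.5, 0.6, ..., 2.0
gammas = [round(0.5 + 0.1 * j, 1) for j in range(56)]  # 0.5, 0.6, ..., 6.0

best_95 = boldness_recalibrate(x, y, 0.95, deltas, gammas)
best_99 = boldness_recalibrate(x, y, 0.99, deltas, gammas)
```

With only 30 observations the approximated posterior probability cannot exceed $30/31 \approx 0.968$, so the 0.99 request returns `None`, which is one way a requested calibration level can turn out to be unattainable.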

Figure 3: Boldness-recalibration contour plots for FiveThirtyEight (Left) and the random noise forecaster (Right). Regions in red reflect high $P(M_c \mid y)$ for the LLO-adjusted $x$ via the corresponding $\delta$ (x-axis) and $\gamma$ (y-axis) values. Regions in blue show low $P(M_c \mid y)$. The × marks $\hat{\delta}_{MLE}$ and $\hat{\gamma}_{MLE}$, where the probability of calibration is maximized. Contours at $t = 0.95$, 0.9, and 0.8 are drawn in white, and $\hat{\delta}_t$ and $\hat{\gamma}_t$ are marked by white points along each contour.

Figure 4: Lineplot visualizing how the predictions for FiveThirtyEight (Left) and the random noise forecaster (Right) change under LLO-adjustments via MLEs and boldness-recalibration. The first column of points in each panel is the original set of probability predictions. The second column of points is the predictions after recalibrating with the MLEs. The last three columns are the predictions after 95%, 90%, and 80% boldness-recalibration, respectively. A line is used to connect each original prediction to where it ends up after each recalibration procedure. Points and lines colored blue correspond to predictions for games in which the home team won. Red corresponds to games in which the home team lost. Achieved $P(M_c \mid y)$ is reported in parentheses in the axis label.
The posterior model probability of calibration, $P(M_c \mid y)$, prior to MLE recalibration and boldness-recalibration is summarized in Figure 5 for all 100 MC runs. The boxplots are grouped by simulated forecaster type shown on the x-axis. The y-axis shows the value of $P(M_c \mid y)$. Sample size increases with vertical panels from top to bottom. Horizontal panels indicate whether miscalibration was simulated under the LLO or the Prelec function. As the Well Calibrated forecasters do not change under either LLO or Prelec adjustment, they are separated out into their own panel. Within each group of boxplots, added noise $\sigma$ increases from left to right. Thus, for the Well Calibrated group, only the first boxplot with no added noise ($\sigma = 0$) is perfectly calibrated, and both calibration and accuracy decrease as noise increases. Boxplots for $s_b$, BS, BSR, BSC, and ECE can be found in the online supplement. We expect $P(M_c \mid y)$ to be high for well calibrated forecasters and low for poorly calibrated forecasters. As sample size increases, $P(M_c \mid y)$ decreases for all settings except the Well Calibrated forecasters with little to no added noise. This indicates our Bayesian approach performs sensibly, in that the ability to correctly detect miscalibration increases with sample size. Additionally, as more noise is added to the Well Calibrated forecaster, their probability of calibration decreases as expected.

Figure 5: Boxplots summarizing the posterior model probability of calibration, $P(M_c \mid y)$, on the y-axis for 100 MC runs on simulated forecasters. Boxplots are grouped by forecaster type on the x-axis. Within groups, added noise increases from left to right. Only the leftmost boxplot in the Well Calibrated group is perfectly calibrated, and ← indicates calibration increases as noise decreases. Horizontal panels indicate which adjustment (if any) was applied to create the forecaster type. As sample size increases with vertical panels, $P(M_c \mid y)$ decreases for all forecasters except Well Calibrated with little to no added noise.

Figure 6 summarizes the change in $s_b$, BSR, BSC, and ECE moving from the MLE recalibrated set to the 95% B-R set under LLO miscalibration. These lineplots differ from the lineplots in the hockey example in that the y-axis shows the value of the metric of interest. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. It is important to note that the y-axis is not fixed across vertical panels.

Figure 6: Lineplots summarizing on the y-axis the change in (a) boldness measured by $s_b$, (b) Brier Score Resolution, (c) Brier Score Calibration, and (d) Expected Calibration Error for 100 MC runs on LLO-miscalibrated simulated forecasters. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. The first column of points in each panel represents the value of each metric for the MLE recalibration set. The second column of points represents the same metric for the 95% B-R set. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. Note that the y-axis is not fixed across vertical panels, the points/lines are plotted in a randomized order, and one set is removed from each panel due to random perfect separation issues as described in the text.

Figure 7 summarizes (a) $s_b$, the Brier Score (b) Resolution, (c) Calibration, and (d) Uncertainty components, (e) the overall Brier Score, and (f) Expected Calibration Error for each simulation setting prior to MLE recalibration and boldness-recalibration. The boxplots are grouped by simulated forecaster type shown on the x-axis. The y-axis shows the value of the metric of interest. Sample size increases with vertical panels from top to bottom. Within each group of boxplots, added noise $\sigma$ increases from left to right. Thus, for the Well Calibrated group, only the first boxplot with no added noise ($\sigma = 0$) is perfectly calibrated, and calibration decreases as added noise increases. Notice that there is little to no difference between the boxplots under LLO miscalibration compared to Prelec miscalibration.

Figures 8, 9, and 10 summarize the change in $P(M_c \mid y)$ and $s_b$ (Figure 8); BSR, BSC, and BSU (Figure 9); and the overall Brier Score and ECE (Figure 10), moving from the original simulated set to MLE recalibration to 95%, 90%, and 80% boldness-recalibration under both LLO and Prelec miscalibration. These lineplots show the value of each metric on the y-axis. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. It is important to note that the y-axis is not fixed across vertical panels. The first column of points in each panel represents the value of each metric for the original set of predictions. The second column of points represents the same metric for the MLE recalibration set. The third, fourth, and fifth columns represent the same metric for the 95%, 90%, and 80% B-R sets, respectively. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise.

Figure 12 summarizes the change in the p-value across the original, MLE recalibrated, and 95%, 90%, and 80% boldness-recalibration sets. Notice that in nearly all forecaster types except Biased, added noise causes little change in the p-values within each forecaster type.

Figure 7: Boxplots summarizing (a) $s_b$, the Brier Score (b) Resolution, (c) Calibration, and (d) Uncertainty components, (e) the overall Brier Score, and (f) Expected Calibration Error (y-axis) for 100 MC runs on simulated forecasters. Boxplots are grouped by forecaster type on the x-axis. Within groups, added noise increases from left to right. Only the leftmost boxplot in the Well Calibrated group is perfectly calibrated, and ← indicates calibration increases as noise decreases. Horizontal panels indicate which adjustment (if any) was applied to create the forecaster type. Sample size increases with vertical panels.

Figure 8: Lineplots summarizing the change in (a) $P(M_c \mid y)$ and (c) $s_b$ on the y-axis for 100 MC runs on LLO-miscalibrated simulated forecasters. Panels (b) and (d) show $P(M_c \mid y)$ and $s_b$ under Prelec miscalibration. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. The first column of points in each panel represents the value of each metric for the original set. The second, third, fourth, and fifth columns of points represent the same metric for the MLE recalibrated, 95%, 90%, and 80% B-R sets, respectively. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. Note that the points/lines are plotted in a randomized order. The few sets where boldness-recalibration failed, as explained in the main manuscript, are removed from this plot.

Figure 9: Lineplots summarizing the change in (a) Brier Score Resolution, (c) Brier Score Calibration, and (e) Brier Score Uncertainty on the y-axis for 100 MC runs on LLO-miscalibrated simulated forecasters. Panels (b), (d), and (f) show BSR, BSC, and BSU, respectively, under Prelec miscalibration. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. The first column of points in each panel represents the value of each metric for the original set. The second, third, fourth, and fifth columns of points represent the same metric for the MLE recalibrated, 95%, 90%, and 80% B-R sets, respectively. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. Note that the points/lines are plotted in a randomized order. The few sets where boldness-recalibration failed, as explained in the main manuscript, are removed from this plot.

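Figure 10: Lineplots summarizing the change in (a) the overall Brier Score and (c) Expected Calibration Error on the y-axis for 100 MC runs on LLO-miscalibrated simulated forecasters. Panels (b) and (d) show the Brier Score and ECE under Prelec miscalibration. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. The first column of points in each panel represents the value of each metric for the original set. The second, third, fourth, and fifth columns of points represent the same metric for the MLE recalibrated, 95%, 90%, and 80% B-R sets, respectively. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. Note that the points/lines are plotted in a randomized order. The few sets where boldness-recalibration failed, as explained in the main manuscript, are removed from this plot.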

Figure 11: Boxplots summarizing the p-value from the Likelihood Ratio Test of calibration for 100 MC runs on simulated forecasters. Boxplots are grouped by forecaster type on the x-axis. Within groups, added noise increases from left to right. Only the leftmost boxplot in the Well Calibrated group is perfectly calibrated, and ← indicates calibration increases as noise decreases. Horizontal panels indicate which adjustment (if any) was applied to create the forecaster type. As sample size increases with vertical panels, the p-value decreases for all forecasters except Well Calibrated with little to no added noise.

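Figure 12: Lineplots summarizing the change in the p-value from the Likelihood Ratio Test of calibration on the y-axis for 100 MC runs on LLO-miscalibrated simulated forecasters. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. The first column of points in each panel represents the value of each metric for the original set. The second, third, fourth, and fifth columns of points represent the same metric for the MLE recalibrated, 95%, 90%, and 80% B-R sets, respectively. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. Note that the points/lines are plotted in a randomized order. The few sets where boldness-recalibration failed, as explained in the main manuscript, are removed from this plot.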

To manipulate boldness and create the four forecaster types, we LLO-adjust $p_{i,\sigma}$ under varying $\delta$ and $\gamma$ values, summarized in Table 1. Since the LLO function is monotone, forecasters LLO-adjusted from $p_{i,\sigma}$ maintain the same classification accuracy as $p_{i,\sigma}$.
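The monotonicity argument can be checked numerically with a short sketch. It assumes the LLO form $\mathrm{logit}\, c(x) = \log\delta + \gamma\,\mathrm{logit}\, x$ with $\gamma > 0$. The $\delta$ and $\gamma$ values below are hypothetical stand-ins and are not the values from Table 1, and AUC is used here as one convenient, rank-based measure of classification accuracy.

```python
import math
import statistics


def llo(x, delta, gamma):
    """Assumed LLO form: logit(c(x)) = log(delta) + gamma * logit(x)."""
    return 1.0 / (1.0 + math.exp(-(math.log(delta) + gamma * math.log(x / (1.0 - x)))))


def auc(preds, y):
    """Fraction of (event, non-event) pairs ranked correctly; ties count one half."""
    events = [p for p, yi in zip(preds, y) if yi == 1]
    non_events = [p for p, yi in zip(preds, y) if yi == 0]
    score = 0.0
    for e in events:
        for ne in non_events:
            score += 1.0 if e > ne else 0.5 if e == ne else 0.0
    return score / (len(events) * len(non_events))


base = [0.1, 0.3, 0.35, 0.5, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 1, 0, 1, 1, 0, 1]

# Hypothetical adjustment parameters (not the Table 1 values).
forecasters = {
    "Well Calibrated": base,
    "Hedger": [llo(p, 1.0, 0.5) for p in base],   # gamma < 1 pulls toward 0.5
    "Boaster": [llo(p, 1.0, 2.0) for p in base],  # gamma > 1 pushes toward 0 and 1
    "Biased": [llo(p, 2.0, 1.0) for p in base],   # delta != 1 shifts on the log-odds scale
}

auc_by_type = {name: auc(preds, y) for name, preds in forecasters.items()}
spread_by_type = {name: statistics.stdev(preds) for name, preds in forecasters.items()}
```

Every forecaster type ends up with the same AUC because a strictly increasing transformation leaves the ranking of predictions unchanged, while the spread differs across the Hedger and Boaster types.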

Table 2: Values of the posterior model probability of calibration $P(M_c \mid y)$, boldness ($s_b$), prediction range, Brier Score (BS), Brier Score calibration component (BSC), Brier Score resolution component (BSR), expected calibration error (ECE), and area under the ROC curve (AUC) for the original sets of predictions and those achieved under MLE recalibration and 95%, 90%, and 80% boldness-recalibration (B-R) via estimated adjustment parameters $\hat{\delta}$ and $\hat{\gamma}$ for FiveThirtyEight and the random noise forecaster.