Is agent-based modelling the future of prediction?

ABSTRACT This article argues that Agent-Based Modelling, owing to its capabilities and methodology, has a distinctive contribution to make to delivering coherent social science prediction. The argument has four parts. The first identifies key elements of social science prediction induced from real research across disciplines, thus avoiding a straw person approach to what prediction is. The second illustrates Agent-Based Modelling using an example, showing how it provides a framework for coherent prediction analysis. As well as introducing the method to general readers, argument by example minimises generic discussion of Agent-Based Modelling and encourages prediction relevance. The third deepens the analysis by combining concepts from the model example and prediction research to examine distinctive contributions Agent-Based Modelling offers regarding two important challenges: Predictive failure and prediction assessment. The fourth presents a novel approach – predicting models using models – illustrating again how Agent-Based Modelling adds value to social science prediction.


Introduction
Prediction is a notoriously contentious and conceptually challenging aspect of social science. In this article, I show how viewing it through the lens of a novel research method (the computer simulation technique called Agent-Based Modelling) offers both conceptual clarification and novel research tools. To avoid handwaving, I begin by analysing real research across the social sciences to see how prediction is done in practice, generalising to identify core interdisciplinary elements for subsequent analysis. Then, to minimise generic discussion, I introduce Agent-Based Modelling through an example specifically chosen to focus on prediction. This discussion illustrates how Agent-Based Modelling offers a coherent framework for analysing predictions. The two main sections of the article show how, building on real research, the distinctive approach of Agent-Based Modelling can contribute to three important challenges in social science prediction: Predictive failure, assessing predictions and evaluating predictive approaches when the nature of the underlying social process is, of necessity, imperfectly known. The final section sums up the contribution of the article (and of Agent-Based Modelling).

What is social science prediction? Induction from real research
Prediction in social science has a long history across many disciplines. The aim of this section is therefore to identify its common features inductively (by focusing on the arguments of real research), thus supporting the relevance of the subsequent Agent-Based Modelling discussion. This approach both avoids straw person claims about what prediction is and reduces bias towards particular approaches (though space does not allow coverage of prediction in all disciplines). The first example will be described in detail to illustrate important concepts in prediction but, again for space reasons, later examples will just be sketched to confirm existing claims or support new ones.
The first example (Burgess & Cottrell, 1936) comes from sociology, a field which used to publish prediction research regularly in prestigious journals but has now stopped. 1 Burgess and Cottrell research what they call marital adjustment: roughly, how happily married people are. They hypothesise that this adjustment can be predicted from relatively measurable partner characteristics. Immediately, two crucial aspects of prediction appear: research design and aims. The best research design for this study has characteristics measured before marriage with adjustment measured subsequently. This avoids the possibility that rationalisation can increase apparent association. However, such longitudinal designs are more costly and suffer distinctive data problems (like sample attrition exactly because some marriages fail). It is also important to understand why one would want to predict. One common aim is avoiding negative outcomes in society. In this case, if marrying someone with a quick temper is likely to produce unhappiness then individuals may choose not to do so.
We now face a general problem with older studies which is that researchers did not seem to see research design issues as clearly as we do now. The article strongly suggests that data were collected cross-sectionally after marriage so that happiness ratings were taken at the same time as reports about whether (for example) the couple did activities outside the home together. Furthermore, some items are clearly not about characteristics (for example, whether partners are tolerant or quick tempered – as measured by psychological scales, for example) but about behaviours (sharing activities) or practices (agreeing how to handle in-laws) which might reasonably change. This makes interpreting the associations predictively problematic. It is one thing if tolerant partners make good marriages but quite another if good marriages motivate sharing activities or agreement about in-laws. (This issue about causal order is well known in statistics – see Davis, 1986.) I suggest, however, that research that would now be done better is still not valueless. This style of prediction is still a coherent and useful thing to attempt (but of course that differs from actually succeeding).
This discussion leads to another crucial dimension of prediction, namely, why it might work at all. There are competing intuitions about this, but they are just intuitions, and the aim is to design research actually proving or disproving claims. So it is with prediction. If one accepts relatively stable psychological dispositions (which are themselves empirically supported – see, for example, Conley, 1985), one can easily see how being tolerant might help marriage partners to cope generally with negative events like unemployment. Equally, however, it would be implausible to claim that no endogenous processes (like creating shared experience or mutual adaptation) will affect marital happiness. Or that there aren't phenomena (like adultery or alcoholism) that may be beyond the protective capacity of dispositions and interactions (see, for example, Previti & Amato, 2004). But then a properly designed piece of research is exactly what establishes whether psychological variables can predict marital outcomes.
Another interesting aspect of analysing specific research is that early prediction did not develop independently across disciplines. My next example (Sarbin, 1944) is published in a good psychology journal but the author also published in sociology journals. Sarbin's article is conceptual rather than empirical but makes an important point for my argument (while confirming the importance of research design and that avoiding negative outcomes is a recurring goal of prediction). Sarbin makes a key distinction between what he calls actuarial prediction (which is what Burgess and Cottrell do in relating variables to outcomes) and individual prediction (often involving expert assessment). 2 This is clearly very important in criminology (for example) where the decision to parole someone may literally have life and death consequences. There is a tension here between general discomfort with supposing that simple models might predict at all and the possibility that (although expert judgements could be far more nuanced and individual) they might just not perform well. (Indeed, that is what Sarbin, 1943 appears to show.)

The other important concern that Sarbin raises is that it may simply be fallacious to apply actuarial probabilities to individuals. If people like Bob (according to the model) have a 72% chance of breaking the terms of their parole then Bob does not have a 72% chance of doing this. He either will or he will not (and that will depend on why people like Bob have a 72% chance of breaking parole, including characteristics that researchers haven't yet modelled). While it seems hard to dispute the logic of this point (you cannot predict the spin of one coin by knowing that many spins come out 50/50) the implications for prediction are less clear. Part of the problem is the absence of a stated mechanism in such accounts. Bob could have a 72% chance of breaking parole if the outcome for each prisoner resulted from an independent dice roll (but that seems implausible). On the other hand, if reoffending was perfectly predicted by some non-modelled phenomenon (like recurrent toothache) which nonetheless correlated with some model variables (like poverty or rural residence) then Bob's actual chance of breaking parole could be very different from the actuarial prediction.

This is part of a wider difficulty in keeping a clear conceptual distinction between what Hendry and others call the Data Generation Process – hereafter DGP (Hendry & Richard, 1983) – that is, the actual set of social processes giving rise to the data collected, and attributed theory/model accounts of these. Hendry's key point is that we cannot start from the assumption that any model is 'true' because the nature of abstraction is such that this assumption cannot be correct. If researchers believe breaking parole is caused by IQ (theory) and IQ merely correlates with what actually causes it (DGP) then the theory will be weakly confirmed (but erroneously). This style of theorising also creates a problem because there needs to be a clear mechanism by which general traits can cause decisions (for example, to abscond). At this stage, all I can do is draw attention to the role of mechanism in the possibility of effective prediction. I can develop this argument further (and cover prediction in another discipline) by considering work by Ohlin and Duncan (1949). Again, there is no clear disciplinary boundary (with research labelled as criminology appearing in a core sociology journal).
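Sarbin's point about Bob can be made concrete with a toy simulation. In the following sketch (in Python), the 'toothache' mechanism and all numbers are my inventions purely for illustration, tuned only so that the group rate comes out near 72%: it shows how an actuarial rate can be correct for a group while a given individual's chance, driven by an unmodelled cause, is very different.

```python
import random

rng = random.Random(0)
# Invented DGP: reoffending is actually driven by an unmodelled cause
# ("toothache") which merely correlates with the model variable
# ("poverty") on which the actuarial table is built.
population = []
for _ in range(100_000):
    poverty = rng.random() < 0.5
    toothache = rng.random() < (0.775 if poverty else 0.3)
    reoffends = rng.random() < (0.9 if toothache else 0.1)
    population.append((poverty, toothache, reoffends))

poor = [p for p in population if p[0]]
print("actuarial rate for the poverty group:",
      round(sum(p[2] for p in poor) / len(poor), 2))   # about 0.72

# Bob is in the poverty group but lacks the unmodelled cause: his own
# chance is about 0.1, far from the rate the table assigns him.
poor_no_ache = [p for p in poor if not p[1]]
print("rate for that group without the unmodelled cause:",
      round(sum(p[2] for p in poor_no_ache) / len(poor_no_ache), 2))
```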
Ohlin and Duncan's key point is that even allowing for good research design, effective measures of predictive success are needed. Their research reiterates the observation that prediction tends to survive in fields seeking avoidance of negative social outcomes like crime. (In fact, criminology is a rare discipline where prediction research was still consistently represented until recently – see, for example, Brennan & Oliver, 2013.)
Unlike Burgess and Cottrell, the research design of typical criminological prediction studies is fairly clear. Data is collected about incarcerated prisoners. From this, models are developed predicting who will break parole on release. If such models predicted perfectly, having fewer than a certain number of favourable factors (like good home circumstances) would result in broken parole while having more would result in adherence. Unsurprisingly, what actually arises is two overlapping peaks (Ohlin & Duncan, 1949, p. 442). People with few favourable factors will probably break parole. People with many probably will not but those in the middle could go either way. But there are obvious caveats to this approach relevant to subsequent arguments. The first is that data can only be about known parole violations (which aren't an unbiased sample of all violations). The second is that prediction effectiveness may depend on whether the parole decision is itself independent from attributes of the criminal. 3 Ideally, the comparison would be for a common crime like domestic burglary and the parole mechanism would be non-selective (freeing prisoners after 70% of their sentence automatically). Murderers may never qualify for parole or only after much longer sentences. Parole boards may also expect much stronger evidence of rehabilitation. This implies that a comparison of outcomes for all crimes (which is what Ohlin and Duncan offer) does not constitute an independent homogeneous sample.
Another well-known prediction challenge can be found in demography, forecasting future population. The reasons for doing this are both practical (how many schools might be needed?) and again to avoid negative outcomes (what must be done about human sustainability?). In a useful review, Booth (2006) makes a key distinction between extrapolative methods (what is predictable about future population from past population) and structural methods (any model-based attempt to predict). Interestingly, her main critique of structural methods is the risk of misspecification (omitting effects that are actually causal) owing to the weak state of demographic theory. The problem with extrapolative methods is fairly obvious. Why suppose that sufficient information is contained in past aggregates to determine future aggregates? This problem has three related aspects (all of which are relevant to the possible distinctive contribution of Agent-Based Modelling). The first is our conceptual understanding that birth rate results from large numbers of individual decisions within a social and regulatory context. This being so, it is unlikely that aggregate values are directly causal such that the birth rate next year would follow from the birth rate this year. The fact that this appears to happen actually results from individual-level stabilities (beliefs about suitable family sizes for example). The second aspect is that this approach has no access to the data that would warn of predictive failure. If, for example, there was an endogenous tendency towards smaller families, this would change the trend but extrapolation would only reveal this fully after the fact. The third aspect, which lacks clear conceptualisation in existing research, is the role of policy. Society often wants to falsify predictions of negative outcomes (not releasing people who might violate parole, not allowing human population to become unsustainable). This being so, intervention occurs precisely to change the underlying process that extrapolative methods fail to access. Retrospectively, this may show that a policy worked, but extrapolation can neither predict the policy's effect accurately nor make good on its ex ante prediction (absent the policy).

The final example covers two different disciplines in applying economic prediction methods to epidemics (Doornik et al., 2020). Economics echoes the tension between extrapolative and model-based prediction but adds to my overview in several ways. Firstly, Doornik et al. also emphasise that model-based prediction suffers from imperfect theory. Secondly, epidemics reiterate the problems of extrapolation in that society wants to falsify predictions of negative social outcomes (in this case COVID deaths). Thirdly, Doornik et al. produce very short range predictions (over weeks). This draws attention to the information content of past data and the surprise potential inherent in predictions. If the number of murders in a particular town was growing 500, 1000, 1500, 2000 ... each year (and suppose, implausibly, that there was no plan to intervene) then a prediction of 2500 would be sensible but also completely unimpressive given the trend. On the other hand, a prediction of 1000 would be completely unjustified in trend terms but hugely impressive (if realised) because it would imply that the predictive model captured the DGP well enough to detect and quantify a turning point before it was realised in the trend.
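A trivial sketch (using the invented murder figures above) shows what 'information content relative to trend' amounts to: the sensible prediction deviates from trend expectation by nothing at all, while the surprising one deviates by a wide margin.

```python
import numpy as np

years = np.arange(4)
murders = np.array([500, 1000, 1500, 2000])
slope, intercept = np.polyfit(years, murders, deg=1)
trend_forecast = slope * 4 + intercept          # extrapolation expects 2500
for prediction in (2500, 1000):
    print(f"prediction {prediction}: deviation from trend "
          f"{prediction - trend_forecast:+.0f}")
```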
Thus another ingredient in convincing prediction is the need to operate far enough ahead that a prediction can carry information beyond merely following the trend. Finally, economics reveals a further complication that another aim of prediction (for example, predicting stock prices) is profit. But just as society wants epidemic predictions to be falsified as negative social outcomes, so predictions for profit may falsify (or self-fulfil) themselves. (You predict that stock will go up so you buy it and it goes up. Or you predict that stock will go up and others try to second guess your prediction using futures and the stock actually goes down.) In the Burgess and Cottrell case, knowing what makes marriages work may merely reduce the number of unhappy people with no other social externalities but certain economic predictions illustrate the opposite extreme where whole markets consist of people trying to second guess each other and may therefore become fundamentally unpredictable. The takeaway message is that careful thought must be given to who is predicting for what purpose and what the social consequences of such predictions might be.
Finally, and it is interesting that this did not arise in the examples, prediction ethics must be considered. Suppose one really could show that Bob had a 72% chance of reoffending based on compelling evidence. Could society then justify denying him parole? What is an acceptably low chance of reoffending? Realistically it cannot be zero. Thus, even if it were possible to develop accurate predictions, that might not exhaust the social challenge.
To sum up then, analysing prediction research across disciplines suggests a consensus about its core components:

• It needs an appropriate research design (probably longitudinal) so that predicted outcomes clearly occur after supposed predictors.

• In designing predictive research, the possible impact of predictions needs to be considered as part of the design. If marital happiness is predicted based on attributes then while this may change who marries whom, there is no reason to suppose that it will change the underlying mechanism supporting the prediction (grumpy people remain hard to live with). By contrast, while society wants epidemic predictions falsified, it is problematic if (by falsifying them) their models become untestable.

• A clear conceptual framework is needed to show how different prediction approaches function and, in particular, to support the difficult task of thinking clearly about temporal logic. In a rising trend, it is probably impossible for model-based prediction to outperform extrapolation, but the situation is reversed when (in the future) there is a turning point which the extrapolative method will miss (at least until it is too late). Some way is needed of characterising the information content of existing data (and the amount of latitude in models) so that the claim that one approach to prediction really is outperforming another can be convincingly justified. (I will return to this in section 5.)

• The same clear conceptual framework is also needed for assessing claims about mechanism. What is it that remains stable (and what changes) such that prediction is viable? This is easy to see for stable psychological dispositions (being tolerant making happy marriages) but much harder for other possibilities (like the long-term persistence of social practices or homeostasis – increased birth rate in resource-limited societies simply leading to increased death rate). Furthermore, this framework needs to say something intelligible about the relationship between individual choices and aggregates so that sense can be made of social change and policy.
The generality resulting from this inductive approach to real research ensures that discussion of Agent-Based Modelling in subsequent sections can contribute to actual practice.

How does agent-based modelling frame prediction? A worked example
Agent-Based Modelling (Gilbert, 2020) is a technique that involves representing social processes as computer programmes rather than equations (in regression for example) or narratives (as in ethnography). It is also distinctive in attempting to represent these social processes directly rather than, for example, just solving pre-existing theoretical equations by computer (rather than using a pencil). The best way to illustrate these points clearly (particularly with reference to prediction) is to use an example. The point of the example is therefore not to be empirically accurate but to explain clearly how Agent-Based Modelling is distinctive and how it captures the key components of prediction identified in the previous section. The example chosen is the 'Wolf Sheep Predation' Agent-Based Model (hereafter ABM). 4 In this ABM, sheep eat grass (which is depleted but recovers depending on the sheep population) and reproduce (which puts more pressure on the grass). Wolves (which also reproduce) eat sheep and thus their population expands with larger sheep populations but contracts when these are smaller. Thus, the current state of the grass and the sheep/wolf populations depends on past interplay between these species. The ABM also includes parameters which shape the overall system behaviour. These are the initial numbers of sheep and wolves, the amount of energy sheep and wolves get from eating and the chance that each species will reproduce. 5 Although this ABM is simplistic and therefore subject to almost infinite criticism both conceptual and empirical, in terms of explanation it is both concise and precise. The exact state of the simulated world at time zero is known, as are all the processes and parameters for its subsequent evolution. This being so, it is possible to let it evolve till time t (considered to be the present). The ABM can then continue to evolve into the future on request but, in the meantime, prediction can be attempted (for example, what the wolf population will be at time t + 20) using any information and techniques desired.
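To make the discussion concrete, here is a minimal sketch of such dynamics in Python. It is deliberately non-spatial and every name and parameter value is an illustrative assumption of mine, not a feature of the actual NetLogo model (see note 5); like the real model, many settings drive one or both species extinct.

```python
import random

class WolfSheepGrass:
    """Minimal, non-spatial sketch of wolf-sheep-grass dynamics."""

    def __init__(self, sheep=100, wolves=50, grass=500, max_grass=1000,
                 sheep_gain=4.0, wolf_gain=20.0,
                 sheep_repro=0.04, wolf_repro=0.05,
                 grass_regrowth=60, seed=None):
        self.rng = random.Random(seed)
        # Each animal is represented simply by its current energy level.
        self.sheep = [self.rng.uniform(1, 2 * sheep_gain) for _ in range(sheep)]
        self.wolves = [self.rng.uniform(1, 2 * wolf_gain) for _ in range(wolves)]
        self.grass, self.max_grass = grass, max_grass
        self.sheep_gain, self.wolf_gain = sheep_gain, wolf_gain
        self.sheep_repro, self.wolf_repro = sheep_repro, wolf_repro
        self.grass_regrowth = grass_regrowth

    def step(self):
        rng = self.rng
        # Sheep spend energy, eat grass (more easily when it is plentiful),
        # die at zero energy and reproduce with a fixed probability.
        survivors = []
        for energy in self.sheep:
            energy -= 1
            if rng.random() < self.grass / self.max_grass:
                self.grass -= 1
                energy += self.sheep_gain
            if energy > 0:
                survivors.append(energy)
                if rng.random() < self.sheep_repro:
                    survivors.append(energy / 2)
        self.sheep = survivors
        # Wolves spend energy and hunt; success is more likely when sheep
        # are plentiful. They too die at zero energy and reproduce.
        survivors = []
        for energy in self.wolves:
            energy -= 1
            if self.sheep and rng.random() < min(1.0, len(self.sheep) / 300):
                self.sheep.pop(rng.randrange(len(self.sheep)))
                energy += self.wolf_gain
            if energy > 0:
                survivors.append(energy)
                if rng.random() < self.wolf_repro:
                    survivors.append(energy / 2)
        self.wolves = survivors
        # Grass regrows towards its ceiling.
        self.grass = min(self.max_grass, self.grass + self.grass_regrowth)
        return len(self.sheep), len(self.wolves), self.grass
```

Later code sketches in this article reuse this class.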
However, one just needs to run this ABM twice to identify the first serious challenge to successful prediction. Because the model is not deterministic (reproduction occurs probabilistically for example) even the same starting conditions and parameters will not produce identical time series (although they do resemble each other strongly in the magnitude and duration of population changes for example). Given this, it might occur to the reader that the same situation applies over any specific time period (rendering prediction impossible) but this is too pessimistic. Firstly, the initialisation is untypical with sheep and wolf populations set by the user (rather than by the system's own interactions). Thus the initial system situation may be outside the range ever found during its endogenous evolution. Secondly, it can be seen from the resemblance of population waves that generally, once the sheep population is rising, it will continue to do so for a while and then stabilise and fall. It is not observed that it rises for ever or that it rises for a while and then drops to zero. Thus while exact prediction may be impossible, the identification of less exact regularities may not be.
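Using the sketch above, the point takes a few lines to demonstrate: identical initial conditions and parameters, different random seeds, different detailed trajectories.

```python
# Two runs differing only in random seed: the series differ in detail
# even though the broad wave structure tends to recur.
for seed in (1, 2):
    model = WolfSheepGrass(seed=seed)
    wolf_series = [model.step()[1] for _ in range(200)]
    print(f"seed {seed}: wolves at t=200 = {wolf_series[-1]}, "
          f"peak = {max(wolf_series)}")
```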
But it is important to be clear that my argument at this stage is not about whether prediction can succeed (which is an empirical matter) but merely whether it can be made objectively intelligible. Within the framework of this example, it is. It makes complete sense to say '20 periods after the present I predict that the wolf population will be 30' and for that prediction to be definitively confirmed or refuted by running the ABM. In addition, it is clear how some challenges to prediction fit naturally into this framework. For example, if I believe the wolf population on 1 January 1980 is 40 but it is actually 25, then even a correct and fully deterministic ABM will not be able to track system evolution (so the effect of data error on prediction always matters). Further, this ABM encapsulates the assumption (relevant for subsequent discussions of intervention) that there is no structural change. Sheep and wolves are always equally fecund and grass has constant nutritional value. If, at a specific time, farmers started leaving contraceptive-laced meat lying about, one could explore the ability of different prediction approaches to identify and accommodate that change.
However, before taking the argument further it is necessary to digress into Agent-Based Modelling methodology and its relation to data. The crucial element here is that, in principle, it is possible to test ABMs. In designing one, existing data is identified (for example, time series of wolf and sheep populations), a decision is made on how to specify the ABM (for example, does observation show that starvation and predation are the only causes of death?) and how to calibrate it (what is the reproduction rate for sheep with more or less grass, perhaps established using literal field experiments?). Having designed an ABM that is as empirically grounded as possible, do simulated outcomes in fact reproduce real ones? (This is called validation. See, for example, Hägerstrand, 1965.) I have already suggested that perfect prediction is impossible in stochastic ABMs but is it possible, for example, to predict the population range of species or the probabilities that populations will be within specified ranges? This raises an interesting issue about different degrees of abstraction according to which data can be compared (which needs to be developed further in Agent-Based Modelling. See, for example, Bloomfield, 2000).
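In the illustrative sketch, such distributional claims can be evaluated directly by repeated simulation (the range, horizon and run count below are arbitrary):

```python
# Estimate the probability that the wolf population lies in a given
# range at a fixed horizon by simulating many independent runs.
runs, horizon, low, high = 500, 120, 25, 75
hits = 0
for seed in range(runs):
    model = WolfSheepGrass(seed=seed)
    for _ in range(horizon):
        sheep, wolves, grass = model.step()
    hits += low <= wolves <= high
print(f"P({low} <= wolves <= {high} at t={horizon}) is roughly {hits / runs:.2f}")
```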
This aspect of Agent-Based Modelling methodology takes my argument in two crucial directions. Firstly, it is clear how direct representation makes ABMs congruent with data. The ABM makes an assumption about birth rate and there is also an empirical fact of the matter about birth rate. This contrasts with theorised representations (or technical assumptions on which different modelling approaches like System Dynamics – see, for example, Wilensky, 2005 – depend) where there is no guarantee that concepts like transition probability or discount rate have real world referents. Following from this (and very relevant to policy and agency) it also makes perfect sense to say 'After 15 February 1981 wolf fertility began declining to half its previous value as farmers started distributing contraceptive-laced meat.' (As I shall show, however, it is very important to be clear about which statements can be made ex ante and ex post. Ex ante, one can only say that wolf fertility is unlikely to rise after this distribution but claiming that fertility drops by half can only be justified ex post.) Thus endogenous system changes can also be represented directly in an ABM. (Of course, this capability is not costless. In an epidemic ABM, for example, the death toll may be reduced by 50% if the population locks down but both intervention and status quo predictions could be wrong – if the ABM is faulty – and one still has to establish how much compliance there really was for evaluation purposes.) At this stage, however, the argument remains one of potential and not practice. ABM can represent the changes that arise from human agency and policy (unlike extrapolative prediction and, arguably, model-based predictions where mechanisms are only implicit). But there is still much hard work to do before this capability translates into predictive success.

Secondly, now that the temporal logic of prediction is clearer, so are claims about testing predictions using ABMs. As I shall argue subsequently, science should always worry about possibilities for cheating (either deliberately or through flawed methods) but let us suppose for the moment that researchers are totally honest and immune to self-deception. In this case, they run the ABM for enough time periods to generate data (which may be used, for example, to calibrate parameters or train a machine learning algorithm) and then make a prediction after that point. If the prediction succeeds (by whatever assessment criteria), then the approach is endorsed and it is sensible to suppose that the ABM may also predict the actual future. It is important to be clear, therefore, that while society wants real prediction (and it is the only totally cheat proof test) it does not follow that testing on known data is either pointless or specious. 6

Having identified key components of social science prediction and explained Agent-Based Modelling in the context of a predictive example, I am now in a position to demonstrate the contribution of Agent-Based Modelling to two important areas, namely analyses of predictive failure and prediction assessment.
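The temporal logic of that honest test is easy to express in laboratory form (again using the illustrative sketch; the naive persistence forecast stands in for whatever method is actually being evaluated):

```python
# Generate 1000 periods of 'history', commit to a prediction, and only
# then generate the 200 held-back periods on which it is judged.
reality = WolfSheepGrass(seed=42)
history = [reality.step() for _ in range(1000)]
prediction = history[-1][1]            # naive persistence: last wolf count
future = [reality.step() for _ in range(200)]
print("predicted wolves at t=1200:", prediction, "| actual:", future[-1][1])
```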

What can agent-based modelling contribute to social science prediction? Two examples
In this section, I examine how the distinctive approach of Agent-Based Modelling can improve conceptual understanding and research practice in two key areas: Predictive failure and evaluating predictions.

The challenge of predictive failure
Some possible causes of predictive failure have already been examined. Extrapolation simply does not allow new information (like underlying behavioural change) to be incorporated into the prediction until it starts affecting the aggregate being predicted. This is the classic problem of turning points given the belief that the aggregate somehow determines itself rather than simply being a summary of an underlying social process changing endogenously. By contrast, model-based prediction might work if the model could be mapped onto reality (and that means not only access to relevant data but also an effective representation of mechanism: How exactly does education level show association with birth rate? Will that association support successful causal intervention?) But apart from the challenge of devising prediction tests (and avoiding deliberate cheating), it is also necessary to consider how different research methods may permit self-deception. This can occur when, instead of data being used as an independent test for an ABM, the model (on the presumption that it is correct) is fitted to data (has its design and parameters adjusted to maximise match). The problem with this approach is so obvious that only the belief that there is no alternative can explain its being disregarded. If you start from the presumption that your model is correct, then you have no capacity to identify misspecification. Having created this problem, whether you can in fact fit the model merely depends on the information content of the data and the number of free model parameters. (The relationship between available data, model size and fitting versus calibration is a complicated one which there is no space to discuss fully here. See [ANONYMISED REFERENCE] for more analysis of this relationship.) With enough free parameters, you can fit anything (while in Agent-Based Modelling neither the specification nor the calibration should be free to allow this, each being empirically grounded as far as possible). However, the apparent success in fitting models to available data is illusory because misspecification and associated incorrect parameter values are not discovered until prediction of new data is attempted. 7

Having summarised the possibilities for predictive failure in existing approaches, I can now explain how Agent-Based Modelling provides a framework distinguishing sources of predictive failure that are avoidable (with suitable research design) and those which are unavoidable (and can thus only be properly acknowledged in interpreting prediction results).
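The fitting problem can be seen in a standard toy form (nothing here is specific to ABMs; the polynomial degrees, noise level and horizon are arbitrary choices of mine): with enough free parameters the 'model' matches past data almost perfectly and then predicts new data wildly badly.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(20)
series = 100 + 5 * t + rng.normal(0, 10, 20)   # noisy rising trend
# (numpy may warn that the high-degree fit is poorly conditioned -
# which is rather the point.)
loose = np.polyfit(t, series, deg=15)          # many free parameters: 'fitting'
tight = np.polyfit(t, series, deg=1)           # constrained form: 'calibration'
for name, coeffs in (("deg 15", loose), ("deg 1 ", tight)):
    in_sample = np.abs(np.polyval(coeffs, t) - series).mean()
    print(f"{name}: in-sample error {in_sample:6.2f}, "
          f"prediction for t=25 {np.polyval(coeffs, 25):.0f}")
```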
I have already suggested that designing models directly representing social processes (and particularly causes) is one important way to avoid predictive failure (because it means that corresponding data is more accessible and there are fewer opportunities for spurious reasoning, for example, that association is somehow causal.) I have also suggested how fitting (rather than calibration and validation) may create problems for predictive models by obscuring misspecification and resulting faulty parameters. By contrast, an ABM tries to achieve correct specification and calibration from the outset (however badly it in fact succeeds) so possible weaknesses cannot be concealed.
Nonetheless, it is clear how additional phenomena (like data error and stochasticity) impose limits on effective prediction even if (somehow) the correct DGP were known. However, such problems can be explored (and perhaps even quantified) using the special capabilities of ABMs, as I shall argue in section 5. But one source of predictive failure is unavoidable and all we can do is acknowledge it clearly. In a model, one can assess, up to the present, the extent to which underlying processes (for example, a shift to preference for smaller families) might affect predictions. But the one thing that cannot be done logically is to anticipate future changes in that preference. If the family size preference trend has been stable or falling up to the present, then (if it subsequently rises) prediction will simply fail. This issue may underpin the radical (but actually spurious) prediction critique that 'you never can tell'. In predicting the outcome of the Oxford-Cambridge boat race, the chance that one crew bus will be beamed up by aliens is not part of the model. Prediction must always take place in a credible context of ceteris paribus, in this case, that both crews arrive to race. Because a model of everything is impractical, there may always be events that not only falsify a specific prediction but actually invalidate the prediction process. (You did not get the winner wrong. There was no winner because there was no race.) But of course it is an empirical matter whether, in some circumstances, the ceteris paribus conditions do hold (generally the boat race does take place with two crews) so that prediction is legitimate and can meaningfully succeed or fail.
The most obvious manifestation of this issue in a prediction context is genuine novelty. Logically, no prediction method can quantify the possibility that, at some future point, an infallible contraceptive will be invented. But this fact should have no bearing on prediction attempts until it occurs and it does not (in fact) undermine concrete attempts to predict. One has to distinguish clearly: conjectured events should have no bearing on attempting prediction but, of course, every bearing on its success if they arise. This is a confusion between ex ante and ex post claims that must be rigorously avoided.

The challenge of assessing predictions
I have already suggested at various points that issues arise with assessing predictions and I draw these together here. The first is the time scale over which predictions are made. If this scale is too long, it is likely that misspecification, data error and genuine novelty will result in unavoidable predictive failure for all approaches. On the other hand, if the scale is too short, it is very hard for model-based approaches to distinguish themselves convincingly from extrapolative ones. Taking the classic example of a turning point, both approaches ought to predict a rising trend for a while but the crucial difference is that extrapolative methods will keep doing so until the variable starts to level off, while an effective model-based prediction will, over a suitable horizon, actually predict a lower value (a surprising prediction given the trend which thus has very high information content; see the sketch at the end of this discussion). Thus, it is necessary to consider whether models should have to show improved performance over simple trend prediction (since many time series have significant elements of mere trend).

The second issue has also been mentioned, but the argument will be consolidated here. There are obviously different ways of characterising data and (like significance levels) predictive performance probably cannot be absolute. A model that can predict the range of wolf and sheep populations is better than one that cannot. A model that can predict the distribution of populations across ranges (a 45% chance of being between 50 and 75) is better still. 8 But it is known that even an exact model of a stochastic system will be unable to achieve perfect prediction. Progressive research therefore requires us to identify steadily more demanding predictive challenges and to evaluate as better those ABMs that meet them. (This raises another important issue. The weaknesses of extrapolative methods are universal because they do not evaluate anything underlying the aggregate. Model-based methods may work well or badly depending on specific areas of application – marriage success based on psychological traits versus speculative markets – and how clear they are about mechanism claims. But the possibility remains that we may be able to show that certain approaches to prediction are not just successful in specific cases but generally, because they accurately represent the social processes implicated in predictive success or failure.)

This argument brings us full circle to issues of effective research design. It is very important that popular prediction ideas do not muddle us into making incoherent claims. For example, 'Donald Trump will be re-elected' is a falsifiable prediction if made before the election. But 'Donald Trump has a 65% chance to be re-elected' is not. (For that, he would have to be re-elected in 65 parallel universes out of 100!) In contrast, '65% of incumbents will be re-elected' is again falsifiable. And the attempt to 'cheat proof' prediction reminds us that one has a 50% chance to be 'right' about Donald Trump's re-election (in a two candidate race at least) by spinning a coin. So, for credibility, the claim actually needs to be (again based on model comparisons) 'I can successfully call this many presidential elections'. Thus, as with all other research, prediction must occur in a rigorously specified context: What is an appropriate sample size of potential predictions given the current best model? In what circumstances can the credibility of unique predictions actually be demonstrated or must these always be instances of more general classes?
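Here is the turning point logic in miniature (logistic growth is an arbitrary stand-in for a mechanism the model-based approach has captured; every number is invented):

```python
import numpy as np

# 'True' process: logistic growth towards a ceiling of 1000.
capacity, rate, x = 1000.0, 0.5, [10.0]
for _ in range(19):
    x.append(x[-1] + rate * x[-1] * (1 - x[-1] / capacity))
past, future = np.array(x[:12]), np.array(x[12:])

# Extrapolation continues the recent slope straight past the turn; a
# method capturing the mechanism would predict the levelling-off.
slope = past[-1] - past[-2]
extrapolated = past[-1] + slope * np.arange(1, len(future) + 1)
print("actual      :", np.round(future))
print("extrapolated:", np.round(extrapolated))
```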
These arguments also lead to the consolidation of another important issue already discussed, namely the relationship between attempted prediction and the present moment. One reason for the high status of genuine future prediction as a gold standard for social science is that it is robustly cheat proof (absent time travel). But this presumes that there are no other ways of cheat proofing model testing (and arguably with fitting there aren't). But if, as argued, the testing of ABMs has its own methodological protection from cheating (namely empirical calibration rather than fitting) then this problem may be practically less damaging (and there are other analogous solutions like prediction competitions or out-of-sample testing). Further, ensuring cheat proof prediction is not costless. If the need to prevent negative social outcomes does not allow you to test predictions ceteris paribus then the danger is that you will neither test the model without the policy nor be able to test it with. Far from being cheat proof then, unless we think carefully about research design and ex ante/ex post claims, the danger is that policy relevant predictions will be influential without any testing. (If a credible model predicts 10 million dead, then it is very likely that huge efforts will be made to falsify that outcome by intervention. This being so, a much lower ex post death toll will not tell us whether, in fact, the non-intervention prediction was completely spurious.)

This argument has an important corollary. We have much data about the 1918 flu pandemic (among others). Obviously it is not the data we might collect now and it will be inaccurate. Further, we are well aware that the 1918 flu is not the same as COVID but nonetheless two important questions arise. Firstly, when COVID was new and it was simply impossible to test for it or to get accurate data about model parameters, might it still have been better to develop models from historical data and then build on them than to guess? Secondly, once we qualify the idea that the only way to prevent cheating is to predict future events, might such historical modelling actually be quite valuable in narrowing the space of model possibilities against the day when we cannot ethically afford to test ABM predictions because the outcome may be avoidable deaths?
Finally, it is well known that, while not cheat proof, there are standard techniques for organising data to increase model credibility like out-of-sample testing. Even with fitted models, this approach adds credibility as long as out-of-sample performance is good, but there are reasons, discussed above, for worrying that it may not be (and also reasons why an empirically grounded Agent-Based Model might do better).

In this section, I have therefore supported the earlier claim that the ABM approach can make a distinctive contribution to conceptualising and researching recognised specific problems with prediction (predictive failure and assessing predictions). In the final section, I show how it can also make a novel contribution to addressing a more general problem: Developing effective prediction techniques when we do not actually know the DGP.

Another distinctive use for ABMs: the prediction laboratory
At this point, my argument shifts gear somewhat. The previous section dealt with the problems of making and evaluating specific predictions against data and the contributions that Agent-Based Modelling can make. But there is also a deeper challenge to which it can usefully contribute. That is rigorously analysing general claims about prediction when the actual DGP is not known. Thus, although, in principle, data error can be acknowledged as a phenomenon, nothing concrete can be said about it because the whole point is that true data values cannot be known. But we can assess, in as much detail as desired, the capability of different prediction methods to perform on data generated by a known ABM. So, if one tries to fit the correct wolf-sheep-grass ABM to data generated by the same ABM but perturbed by a fixed amount of data error, what happens? Does the difference manifest only in static system properties (like population ranges) or also in dynamic ones (like the time scale over which sheep populations rise and fall)? In this way, trustworthy insights about the relationship between DGP and ABMs can be developed (since we can repeat them over many different ABMs many times) which may therefore be applied with more confidence when the DGP is not known.
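In the illustrative sketch, the laboratory logic for data error looks like this (the error size and horizons are arbitrary):

```python
# Treat one run as the true DGP; give the forecaster an erroneous
# initial wolf count; compare trajectories at several horizons to see
# how the data error propagates.
def wolf_series(initial_wolves, seed, steps=100):
    model = WolfSheepGrass(wolves=initial_wolves, seed=seed)
    return [model.step()[1] for _ in range(steps)]

truth = wolf_series(40, seed=7)        # reality starts with 40 wolves
believed = wolf_series(25, seed=7)     # the forecaster believes 25
for t in (10, 50, 99):
    print(f"t={t}: true wolves {truth[t]}, under data error {believed[t]}")
```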
Other applications of this approach, already raised, would be devising performance measures for extrapolative and model-based methods over whole data sets (including the prediction time scale). There is no point in comparing models over timescales where none can perform well (because of things like genuine novelty) but can a time scale be established over which the ability of model-based methods to find turning points can be demonstrated effectively?
We can also develop this prediction laboratory approach to show, for example, exactly how extrapolative methods err and how fitting on past data may generate poor predictive performance compared to specification and calibration. Researchers can accept these issues in principle (although with difficulty) but that is very different to actually seeing them worked out concretely. Further, as already suggested, this approach could be used to illuminate (if not yet actually forecast) the consequences of policy, genuine novelty and so on. (How would population predictions change if perfect contraception was invented in 1 year, 5 or 10? What would one see in epidemic dynamics if 60% lock down compliance could be achieved within a month?) As already shown by the example of running the same ABM twice, problems that are both conceptually and practically difficult to engage with can rapidly be brought down to earth: What does it actually mean for prediction that social processes are stochastic? (And how do researchers engage sensibly with this idea when they cannot perceive it in the unique realisations of actual data?)

Two more applications of this laboratory approach immediately suggest themselves based on previous arguments. Firstly, one could explore the extent to which characterisations of systems are invariant to other system properties (like stochasticity). If the exact evolution of wolf and sheep populations cannot be predicted, is it possible instead to robustly predict population ranges or distributions of populations? This kind of analysis can thus be conducted on ABM data and then 'let loose' on real data once it is better conceptualised and we have some evidence that it might work.
Finally, and this is a very important issue, I have already hinted above that there is a general problem with ABM in operationalising certain quality control ideas from statistics. For a regression model we know what it means (and even what it may result in) if there are too many parameters relative to data (or they are the wrong ones). By contrast, while it is just as easy to see how an ABM may be mis-specified (random mixing is assumed when a disease is actually transmitted via social networks) it is much less clear how we formally establish how many social processes and parameters a given amount of data can support. Using the laboratory approach we can devise and evaluate tests that can then be used in cases where we do not know the DGP. For example, an ABM might be considered insufficiently discerning if, within its specification and calibration uncertainty, it can reproduce both a time series and its mirror image. This is a very ad hoc suggestion (and might simply not work) but it is only by devising procedures that we can concretely attempt and analyse that we can hope to clarify our conceptual thinking to the point where we can develop effective tests.
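As an equally ad hoc illustration of what such a test might look like: a model family that is closed under time reversal, like the polynomials below, necessarily fits a series and its mirror image equally well, and so cannot be discerning about directional dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(30)
series = np.cumsum(rng.normal(1.0, 2.0, 30))   # noisy upward drift

def mean_fit_error(y, degree=3):
    coeffs = np.polyfit(t, y, degree)
    return np.abs(np.polyval(coeffs, t) - y).mean()

print("fit to series      :", round(mean_fit_error(series), 2))
print("fit to mirror image:", round(mean_fit_error(series[::-1]), 2))
# The errors are identical (up to rounding): this family 'fails' the
# mirror test, so this data cannot constrain its directionality.
```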
In this section, I have shown how the ABM approach can not only contribute to existing prediction challenges but also provide new tools for evaluating prediction strategies in general through the 'prediction laboratory' insight based on models of models.

Conclusion
In this article, I have argued that ABMs (and their methodology) have a distinctive role to play in social science prediction by serving as a coherent framework changing our perspective on several crucial issues. The argument began by showing that, in practice, prediction across the social sciences shares core elements (avoidance of social ills, challenges of research design and conceptualisation, issues with the nature of models – and particularly their claims about mechanism – and so on). I then illustrated Agent-Based Modelling using a simple example and showed that this could coherently represent prediction (for example, about the wolf population 20 time periods hence) in terms of these elements. The next stage of the argument was to show how various challenges of predictive failure (whether avoidable or not) and prediction assessment would be viewed differently (and perhaps ameliorated) using ABMs: for example, that an ABM both directly represents the individual processes adding up to the aggregate being predicted and explicitly represents the changes resulting from policy – for example, that people stay at home and thus transmit infection less. Next, I illustrated valuable contributions that might arise from using Agent-Based Modelling as a kind of laboratory to develop concepts and tools that could be evaluated using a known DGP before being used more confidently on an unknown one. Finally, I drew on previous ideas to show how, although future prediction is totally cheat proof, other approaches (like good empirical methodology and prediction competitions) may also make cheating harder and have other advantages (like using data that is more readily available and avoiding the possibility that predictive models used first in crises actually end up untested).

Notes

1. As of 17.06.21, for example, the flagship American Journal of Sociology reports the following articles with prediction in the abstract using JSTOR search: 2020s (0), 2010s (0), 2000s (3), 1990s (3), 1980s (4), 1970s (4), 1960s (8), 1950s (9), 1940s (11), 1930s (4), 1920s (2).

2. An interesting point arises here. The arguments I present do not depend on whether prediction models are explicit. An expert can predict well even if they cannot explain how. The same issue arises in machine learning. As long as we design predictive research rigorously, we might trust algorithms even if we cannot understand them.

3. There is also a counterfactual problem for all prediction that changes the environment. If someone is never given parole they can never violate it. It is then impossible to tell whether they would have done so had the decision been otherwise.

4. This example is plainly not social but was chosen partly for brevity of explication and partly because there seem to be (perhaps surprisingly) no ABMs that are socially plausible, simple and yet generate long term dynamics with equivalent richness to the synchronised rise and fall of sheep, wolves and grass.

5. The ABM used here (Wilensky, 1997) is part of the models library for a package called NetLogo which can be downloaded for free (Wilensky, 1999).

6. The awe surrounding accurate predictions may distract from the fact that testing models honestly can simply involve scientific organisation. One gives a modeller 1000 periods of a time series, asks them to predict it and simply does not provide the 200 periods they are supposed to predict until afterwards! This is the logic of prediction competitions (Erev et al., 2010).

7. This is another way of explaining the difference between testing and fitting. If an ABM fails you revisit your specification and calibration assumptions but that does not 'exhaust' your testing data. By contrast, all you can do under fitting is more fitting until you have again 'exhausted' your data and therefore have to make a leap of faith about whether your model will actually work with new data.

8. It is harder to characterise so-called qualitative prediction but my arguments are not intended to rule out this approach. Arguably the claim 'the trend will mostly slope up', while having lower information content than 'the number of murders will increase by about 400 per year', is still clearly falsifiable.