Multiple Systems Estimation for Sparse Capture Data: Inferential Challenges when there are Non-Overlapping Lists

Multiple systems estimation strategies have recently been applied to quantify hard-to-reach populations, particularly when estimating the number of victims of human trafficking and modern slavery. In such contexts, it is not uncommon to see sparse or even no overlap between some of the lists on which the estimates are based. These create difficulties in model fitting and selection, and we develop inference procedures to address these challenges. The approach is based on Poisson log-linear regression modeling. Issues investigated in detail include taking proper account of data sparsity in the estimation procedure, as well as the existence and identifiability of maximum likelihood estimates. A stepwise method for choosing the most suitable parameters is developed, together with a bootstrap approach to finding confidence intervals for the total population size. We apply the strategy to two empirical data sets of trafficking in US regions, and find that the approach results in stable, reasonable estimates. An accompanying R software implementation has been made publicly available.


Introduction
Multiple systems estimation, a generalization of the mark-recapture approach (Petersen, 1896;Schwarz and Seber, 1999) is a class of methods that can be used to estimate the size of hard-to-reach populations in many contexts, including, in recent years, those comprised of human trafficking or slavery victims. The methods are typically applied to wildlife populations (Williams et al., 2002) and to hidden populations such as injection drug users (King et al., 2013). In the administrative or law enforcement context, multiple systems estimation aims to read across from lists of observed or identified individuals from a study population to estimate the total population of interest; see, for example, Bales et al. (2015) and Cruyff et al. (2017). A mathematical model is posited for the pattern of incidences across the lists, and the "dark figure", the number of unobserved cases, is estimated. A survey of the history of the methods and a range of applications is provided, for example, by Bird and King (2018).
Because the method estimates the number of victims including those that are not directly observed or detected, it plays an especially important role in policy making decisions to help combat human trafficking and modern slavery. For example, as set out in Bales et al. (2015), a multiple systems estimate constructed from data collated by a government agency was a key component of the strategy (Home Office, 2014) leading to the UK Modern Slavery Act 2015.
A frequent specific challenge posed by data on human trafficking is sparse overlap between the observed administrative lists; indeed, it appears to be the norm rather than the exception that there will be pairs of lists between which there is no observed overlap. This sparsity can lead to inferential and algorithmic difficulties and instabilities if it is not addressed. In applications such as wildlife populations, the researcher may be able to continue capturing from the study population until sufficient overlap is observed between the capture occasions. Such a strategy is not available in the human trafficking context, nor usually in other human rights areas either.
A pair of lists may fail to overlap for a number of reasons: there may be a genuine structural reason why the particular lists cannot overlap; there may be negative correlation between lists; or it may simply be that the overall sample size is relatively small and, especially if the two lists have small capture probabilities, there do not happen to be any cases which are on both lists. In this area, there is as yet limited understanding of data and of mechanisms, and furthermore data are often highly anonymized for reasons of confidentiality and security. Typically, those analyzing the data may not know anything about a list other than an uninformative label, because the collation between lists is carried out by a single trusted individual or agency on that understanding. See for example Bales et al. (2015) and Bales et al. (2019). Hence, there may be no further information available, beyond simply the number of overlapping cases, as to why no cases are observed in common between two lists.
We approach inference via Poisson log-linear regression modeling applied to counts of individuals that are observed on each possible combination of the lists. This is a well-known technique that allows one to model correlations and dependencies between lists. The standard approach is set out by Sandland and Cormack (1984), Cormack (1989), Cormack (1992), Rivest and Daigle (2004) and Bird and King (2018), among others, and implemented in Baillargeon and Rivest (2007) and Rivest and Baillargeon (2014). However, as Fienberg and Rinaldo (2012a) discuss in a much more general context, contingency tables with zero entries, as will arise if there is sparse overlap between lists, may lead to cases where to carry out maximum likelihood estimation it is necessary to extend the range of parameters to include −∞, and even then the maximum likelihood estimate of the model parameters may not exist or may not be identifiable.
In our context, therefore, empty overlaps between lists require careful treatment. The primary objective of this paper is to introduce inferential procedures and computational implementations that explicitly handle this case. For simplicity, we have focused attention only on models which include parameters for two-list effects (also called 'first-order interactions' by some authors), but the basic concepts of allowing for empty cells, and of checking for the existence of estimates, are straightforward to extend.
We first of all develop a method which fits a model stably, taking proper account of existence and identifiability issues that can arise if the data are sparse. We then consider a model-selection procedure to choose the most suitable set of parameters on which to base inference. A stepwise approach to model selection is used, but so that any effects of choosing the specific model are taken into account in the inference, confidence intervals for the estimation are constructed using the BC a bootstrap method (DiCiccio and Efron, 1996).
The methods are motivated and illustrated by data sets based on human trafficking victims in the New Orleans area (Bales et al., 2019) and the West-ern site of a research study in the USA (Farrell et al., 2019). Simulations studies are used to validate the stepwise approach and are based on data sets generated from these empirical data sets, where multinomial sampling is use to assign capture histories to the population members as suggested by Cormack (1992). We conduct our analyses in the R programming language (R Core Team, 2016), and have developed an accompanying R software package SparseMSE . The package allows readers to implement the methodology on their own data as well as to reproduce the results presented in the paper.
The paper is organized as follows. Section 2 outlines the Poisson log-linear model and gives the notation and likelihood setup, details specific issues concerning the existence and identifiability of maximum likelihood estimates, and also discusses issues relating to the breakdown of the assumptions underlying standard likelihood-ratio and information-theoretic approaches. It also develops algorithms for checking efficiently whether models present problems of nonexistence or unidentifiability of estimates. Section 3 develops the model-selection routine and corresponding inference procedure, setting out an efficient algorithmic approach to the bootstrap in this case. A simulation comparison with one of the current standard methods is included. Section 4 presents the results from the two empirical applications, as well as a simulation study informing the choice of threshold in our procedure. The concluding remarks in Section 5 include a comment on the R package, as well as discussions of the possible extension of the procedure to higher-order interactions, and to data with covariate information. There is further detail of several topics in the supplementary materials to the paper.

The Model
We first define notation and set out the model. The framework leads to an algorithmic approach facilitating correct and stable calculations. We then discuss the implications of sparse counts on existing inferential methods, followed by a discussion on checks for existence of maximum likelihood estimates and identifiability of the model.

Notation and Definitions
Suppose we have t capture occasions, or lists, on which members of the population can occur. An individual's capture history is the set of lists on which the individual is actually observed, or captured. A capture history is a subset ω of {1, 2, . . . , t}.
Now suppose that there are m individuals captured at least once in our study. Denote the m observed capture histories by ξ 1 , ξ 2 , . . . , ξ m . For any particular capture history ω define N ω to be the number of individuals observed to have exactly that capture history, i.e. the number of ξ i equal to ω. It is important to note that the actual data consist of a sample of size m from a discrete distribution over the possible capture histories.
The order of a capture history is defined to be the number of captures in the set. The braces are often omitted when the members of the history are given as suffices. Thus, for example, if t = 4 the capture history {1, 3} has order 2, and N {1,3} , usually written N 13 , is the number of individuals which are observed on both lists 1 and 3 but not on lists 2 or 4. A particular capture history of interest, with order 0, is the null capture history ∅. The quantity N ∅ is the dark figure of individuals which are not captured on any list, and therefore cannot be observed. The observed data give rise to the 2 t − 1 values {N ω : ω = ∅}, which we will also write as N.
It is characteristic of data collected in the modern slavery context that there will be some capture histories for which the observed count is zero. Typically, each list only records a relatively small proportion of the total population, because of the "hidden" nature of modern slavery as a crime, and the numbers of cases recorded on any particular pattern of overlaps between lists can easily be considerably smaller.
For any capture history ω define Thus N * ω is the number of observed cases that appear on all the lists in ω, regardless of whether they do or do not appear on other lists. For example, N * 12 is the total number of cases that are on both lists 1 and 2, while N 12 is the number of individuals that are on lists 1 and 2 but not on lists 3, 4, . . . . We will call {i, j} a non-overlapping pair of lists if N * ij = 0, so that no individual appears on both lists. The main objective of this paper is to develop estimation procedures and algorithms that properly account for these non-overlapping pairs of lists.
Because of the restriction to ψ = ∅ in the defining sum, the quantity N * ∅ does not include the dark figure but is the sum of all the observed values N ψ , the total of individuals actually captured at some point in the study.

The Poisson Loglinear Model
A standard model for the analysis is the Poisson loglinear model as set out by Cormack (1989). This assumes that, independently for each ω, for certain parameters α θ indexed by the possible capture histories. Capture histories are used in two different ways, firstly to index the observed data, and secondly to index the parameters. Usually, but not invariably, the letter ω will be used when observations N ω are indexed and θ for parameters α θ . The index ψ will be used in either case, as required. Thus, for example, the dark figure has expected value exp α ∅ , while the expected value of N 13 is exp(α ∅ +α 1 +α 3 +α 13 ). Denoting byα ∅ the maximum likelihood estimate of the parameter α ∅ , the estimate of the total population size will be N * ∅ +expα ∅ , the sum of the total number of cases actually observed and the estimate of the dark figure.
Altogether, there are 2 t parameters α θ , corresponding to the 2 t capture histories including the null capture history. There are only 2 t − 1 observable data points N ω from which to estimate the parameters; without placing constraints on the α θ parameters, the model is not identifiable.
As Cormack (1989) sets out, the natural approach is to set some of the α θ to zero, and then to estimate the remainder by maximum likelihood; for example, one may set all coefficients indexed by third-or higher-order histories to zero, and we will do this throughout. Even if all the two-list coefficients (those indexed by pairs of lists) are included, the number of parameters to be estimated is 1 + t + 1 2 t(t − 1) ≤ 2 t − 1 provided t ≥ 3. Model choice then reduces to deciding which two-list coefficients to include, and will be discussed further in Section 3. For any particular choice of coefficients, the estimation can be put into a standard generalized linear model formulation.
A consequence of the definition is that, for each ω, Unlike the N ω , the N * ω are not independent. For example, if capture histories ω and ψ share any lists, then the variables N * ω and N * ψ will be dependent.

The Log Likelihood Function
Before considering the treatment of non-overlapping pairs of lists, we derive some properties of the likelihood function. Let Θ be the collection of indices of parameters included in the model, and α = (α θ : θ ∈ Θ) the vector of parameters to be estimated. Note that Θ always contains ∅. Up to an additive constant depending only on the data, the log likelihood is given by Substituting the definition of the model, reversing the order of summation, and then substituting the definition (1), to obtain Turning to the other term in the log likelihood, say. Regarded as a function of the α θ , each µ ω is an increasing function of each of its arguments, and hence C(α) is a decreasing function of each of its arguments {α θ : θ ∈ Θ}. Furthermore, C(α) is a sum of concave functions of linear combinations of its arguments, so (α|N) is the sum of a linear and a concave function, and hence is a concave function. However, as Fienberg and Rinaldo (2012a) show in a much more general and abstract context, and as we shall see below, the maximum likelihood estimate of α need not be unique or even exist at all. The expressions for the components of the log likelihood function demonstrate the following, which will be useful in our discussion of model choice: 1. The statistics {N * θ : θ ∈ Θ} are jointly sufficient for the parameters α.

Dealing with Non-Overlapping Pairs
Suppose that {i, j} is a non-overlapping pair, so that N * ij = 0, and that α ij is one of the parameters in the model being fitted, so that {i, j} ∈ Θ. In the terminology of Fienberg and Rinaldo (2012a) we allow an extended maximum likelihood estimate, which means that that the parameters may take values in [−∞, ∞). If a parameter α θ = −∞ then we will have µ ω = 0 for all ω ⊇ θ, so the actual Poisson parameters will still all be finite. This section gives an elementary recapitulation of some of the results Fienberg and Rinaldo (2012a) cast into our specific framework.
In the first term (3) of the log likelihood, the coefficient of α ij is zero, so the maximum likelihood estimate of α ij will be obtained by maximizing C(α). Because C(α) is a decreasing function of each of its arguments, whatever the value of the other parameters the likelihood will be maximized as α ij → −∞. The maximum likelihood estimate of α ij may therefore be regarded as α ij = −∞. This explains why existing software packages yield errors or warnings if there are non-overlapping pairs in the data and the corresponding parameters are in the model. Because the linear model is expressed in terms of the logarithm of the Poisson parameter, the value −∞ for α ij gives the value zero for µ ω for all ω ⊇ {i, j}, a legitimate value for the actual Poisson parameters, regarding a Poisson distribution with parameter zero to be the degenerate distribution with value zero.
Substituting these zeroes for µ ω back into the expression for the log likelihood yields, writing α † ij for the vector of parameters with α ij excluded, This is exactly the Poisson log likelihood based on all the observations except those for the 2 t−2 capture histories which include both i and j. Note that the sum is over ω that do not include the set {i, j}, in other words both of i and j. If there is more than one non-overlapping pair in Θ, the same calculations can be carried out for each pair, leading to the following algorithm: 1. Initially define Ω † be the set of all non-null capture histories and Θ † = Θ.
2. For each {i, j} in Θ for which N * ij = 0, record that the maximum likelihood estimator of α ij is −∞ and remove α ij from the set of parameters Θ † yet to be estimated.
3. For each such {i, j} also remove from Ω † all ω for which ω ⊇ {i, j} (because N * ij = 0 the corresponding N ω will all be zero).
4. Use the standard generalized linear model approach to estimate the parameters with indices in Θ † from the observed counts of the capture histories in Ω † . The set Θ † comprises all the two-list parameters in the model that are not estimated to be −∞.
In the next section, we will see that the final step should also involve an explicit check for the existence and identifiability of the parameter estimates.

How Existing Methods go Wrong
Where there is a pair of non-overlapping lists, existing methods typically iterate towards a large negative estimate for the corresponding parameter α ij , only stopping because the number of iterations exceeds a prescribed limit, or because the second derivative of the log likelihood is numerically nearly zero. An error or warning message may be produced. By contrast, our approach deals explicitly with α ij , immediately giving it the value that maximizes the likelihood over the extended range [−∞, ∞). Once the parameters corresponding to non-overlapping pairs of lists have been correctly estimated, all the other parameters are estimated by an iterative process which converges rapidly and does not yield any errors.
Suppose, for the moment, that a large negative value of α ij is used, say α ij = −20 rather than α ij = −∞. For practical purposes e −20 is zero, so the fitted values of µ ω will be essentially zero for all ω ⊇ {i, j} and the corresponding terms will make no contribution to the maximization of the likelihood of the other parameters. Hence, the fitted values of the other parameters will be much the same as in our approach which actually estimates the parameter α ij correctly. We are not fitting a different model than other approaches; rather, we are correctly fitting a model which other approaches can only fit approximately and in an unsatisfactory way. Not only is it inelegant to use an iterative method to approach a known −∞ value of a parameter, but it leads to misleading estimates of the precision of the estimates. Because the second derivative of the log likelihood also rapidly tends to zero, the estimated parameter tends to have very large reported standard error, suggesting that its estimate is essentially uninformative. Furthermore, standard packages use approaches to inference and model choice based on likelihood and information criteria. The asymptotic theory and arguments behind these approaches, for example Wilks (1938) and Akaike (1974), break down when parameters are at an extreme of their ranges, as is the case in our application for the parameters corresponding to non-overlapping lists (see Section 1 of the supplementary materials for a simulation example illustrating that the likelihood asymptotics do not hold).
An exploration of the possibly misleading precision estimates, for two real data sets, is given in Table 1. The data sets are from the UK (Home Office, 2014) and The Netherlands , both tabulated in Silverman (2020). In each case the data consist of six lists, and in both cases there are two non-overlapping pairs. We will see that the corresponding parameters are significant in one case but not the other. The standard errors and p-values for glm are those produced using the default method for that routine.
The table shows the result of fitting the model including all two-list effects, using two algorithms, one being a 'standard' approach (Rivest and Baillargeon, 2014) which makes use of the R program glm, and the other the method set out above. In the standard approach, the call to glm actually records convergence, but after 21 and 22 iterations respectively, which is close to the default maximum number 25 of iterations in glm. In both cases a warning is generated. By contrast, the call to glm within our approach only requires 6 or 7 iterations. The estimates of all the other parameters, as expected, are virtually the same in both cases. The p-value for our approach is the probability that the non-overlap of the relevant pair could have occurred by chance when the model is fitted without the corresponding parameter; see Section 3.1 below. It can be seen that the effects are highly significant for the Netherlands data but not significant for the UK data. The reported standard errors and p-values are not meaningful for the standard approach.
In fact there are additional aspects not handled by current methods that need to be addressed, even if one allows for the parameters to be estimated over the extended range [−∞, ∞), and these are discussed in Section 2.6.

Existence and Identifiability Issues
Two estimability issues may arise when applying multiple systems estimation to sparse data, both of which will mean that the model will not give a welldefined finite estimate of the population size.
One possibility is that there is no value of the parameter vector α that maximizes the likelihood, even allowing the extended range [−∞, ∞) for the parameters. The other, separate, possibility is that (whether or not the likelihood can be maximized) there is parameter redundancy and the estimates are not identifiable. We discuss the existence question first. Fienberg and Rinaldo (2012b) show that existence of the estimate can be checked by solving a linear programming problem. Defining Θ † and Ω † as in the algorithm set out in Section 2.4, let A be the incidence matrix that maps the parameters in Θ † to the logarithm of the expected values of the counts of capture histories in Ω † . From (2), for θ ∈ Θ † and ω ∈ Ω † , A ωθ = 1 if θ ⊆ ω and 0 otherwise. Let t be the vector of sufficient statistics N * θ for θ ∈ Θ † . Then set up the linear programming problem of finding the maximum value of s over all scalars s and all real vectors A necessary and sufficient condition for a maximum likelihood estimate of α to exist (possibly allowing some parameters to be −∞) is that the maximizing value s max of s is strictly greater than 0. Setting x ω = N ω for all ω and s = min N ω will yield a feasible solution satisfying (4). Hence s max will be at least the minimum of the observed N ω over Ω † . In the non-sparse case, where every combination of capture histories is observed at least once, this minimum will be strictly positive and hence the maximum likelihood estimator will always exist. The other possibility is that, even if the likelihood can be maximized, the parameters are non-identifiable, so that the estimate is not unique, a state of affairs also called parameter redundancy; see, for example, Far et al. (2020). The model will be identifiable if and only if A is of full column rank. We show in Section 3 of the supplementary materials that non-identifiability can only arise if all list pairs are in the model and if the data are so sparse that every set of three lists contains at least one non-overlapping pair. This condition is easily checked.
Fienberg and Rinaldo (2012a) point out that most or all standard generalized linear modeling packages fail to check for existence of estimates. Nor do programs necessarily report unidenfiability directly, more often arbitrarily removing one or more of the parameters. Unless every possible capture history is actually represented in the observed data, therefore, it is important to check that a potential model gives a strictly positive value for the linear programming problem. If the full model containing all two-list parameters is being fitted then, in addition, identifiability should also be checked. If the model fails on either count it should be ruled out. These checks incur only a small computational overhead. A simple example is given in Table 2. As there are three possible two-way interactions, there are 2 3 = 8 possible choices of the two-list parameters to include in the model. We summarise the linear programming output s max and test results in Table 3. Table 3: Summary of linear programming output and test result for all possible choices of two-list effects to include in the model. For the model containing all three two-list effects, there are finite values of the Poisson means µ ω that maximize the likelihood, so s max > 0, but these do not correspond to unique values of parameters in the model.

Two-list parameters included
Test result s max none no error The results show that there is no immediate hierarchical relationship between models that do or do not satisfy the criterion for estimates to exist. For example, the linear program result is zero for the model including AB and AC, but either adding the third effect BC, or removing AB, will yield a model for which the result is strictly positive. This issue is elucidated further in Section 2.7 below.

Checking All Models
Given a particular data set, it is useful in certain contexts to check that the estimates exist no matter which two-list terms are included in the model. An appropriate algorithm allows the Fienberg-Rinaldo conditions to be confirmed much more quickly than the brute force approach of simply checking the criteria for every possible model. It will be assumed throughout that the model contains the intercept parameter α ∅ and the main effect parameters α i for i = 1, . . . , t. The model choice to be made is which, if any, of the two-list parameters α ij also to include. Because there are 1 2 t(t − 1) pairs {i, j}, the number of possible models is 2 t(t−1)/2 , which rapidly becomes very large as the number of lists increases.
Suppose that {i, j} is an overlapping pair of lists, in that N * ij > 0, and that the parameter α ij is in the current parameter set Θ. Consider the effect of removing this parameter from the model. Because {i, j} is an overlapping pair, this will not change the set Ω † , but because {i, j} is removed from Θ it will also be removed from Θ † (again defining Θ † and Ω † as in Section 2.4). In the linear programming problem (4), this will remove one column from the matrix A and the corresponding element of t. Hence one constraint will be removed, and therefore the maximum value of s cannot decrease. Therefore, if the estimate exists for parameter set Θ it will necessarily exist for subsets of Θ obtained by removing overlapping pairs. It follows that, to confirm whether all models satisfy the conditions for estimates to exist, it is only necessary initially to test parameter sets Θ that include all overlapping pairs, together with a subset (possibly empty) of the non-overlapping pairs. If there are M non-overlapping pairs in the data, then the number of such models is 2 M ; solving the linear programming problem for all these models is now feasible for a much larger range of data sets than if all 2 t(t−1)/2 models have to be considered explicitly. If the estimates exist for all such models, then they will exist for all models.
These checks were carried out for all the data sets discussed in this paper. For the full UK and Netherlands data, the number of models to be checked by solving a linear programming problem is reduced by a factor of 8192. Details for two other data sets are given in Sections 4.1 and 4.2. In every case, in contrast to the example set out in Table 2, the extended maximum likelihood estimate exists and is unique for every possible choice of model.
In the event that there are models for which the estimate does not exist, the approach can be extended to find a list of all such models efficiently. Let Θ 1 be the set of parameter indices corresponding to the empty capture history and all capture histories of order 1, Θ non 2 those corresponding to nonoverlapping pairs and Θ over 2 overlapping pairs. The initial search is over all models containing both Θ 1 and Θ over 2 . Suppose it yields a subsetΘ non 2 with the property that there is no maximum likelihood estimate within the model with parameter set Θ 1 ∪ Θ over 2 ∪Θ non 2 . We then perform a hierarchical search, retaining Θ 1 ∪Θ non 2 , over models where overlapping pairs are removed. At the first stage, parameters in Θ over 2 are removed individually and each resulting model checked. If any such model yields a zero result in the linear program, that is recorded, and the possibility of removing a second overlapping pair is investigated, and so on. At each stage, if the linear program yields a positive result so that the estimate exists, there is no need to investigate that branch of the hierarchy any further.

Inference and Model Choice
We now consider how to assess the significance of any particular two-list parameter, and develop a forward stepwise approach to model choice. We also develop the bootstrap procedure for evaluating confidence intervals, and present simulation results comparing the bootstrap with an approach that carries out inference conditional on the model actually selected.

Calculating Significance
Given any model defined by parameter set Θ, for any ω definê Under these definitions,μ ω [Θ] andμ * ω [Θ] are the estimated expected value of N ω and N * ω respectively. The notation [Θ] makes explicit the dependence on the parameter set Θ.
First, consider how to deal with non-overlapping pairs within the data. Suppose that for some θ ∈ Θ that N * θ = 0. Should we actually include θ in the model? We test the null hypothesis that α θ = 0, which is equivalent to saying that θ is not in the model. We fit the model without θ and then consider the p-value of a test statistic. A natural test statistic is N * θ , because of the results on sufficient statistics in Section 2.3. Recall that this is also a Poisson random variable since it is the sum of independent Poisson random variables (see 1 and 2). Hence, we test whether 0 is a surprising value to observe for a Poisson distribution estimated from the data but leaving out the parameter indexed by θ. If θ is in the model, then the observed value has probability one if θ takes its estimated value.
Hence, proceed as follows: 1. Fit the model leaving out the parameter α θ , in other words using just the parameter set Θ\θ. For the resulting fitted model, find the estimatê µ * θ [Θ \ θ].

The estimated parameter has
. This is the estimated probability that N * θ = 0 in the model defined by Θ \ θ. Unless we have already checked that the parameter set Θ \ θ passes the linear program test for the existence of the maximum likelihood estimate, that should be done; if the model fails that test then the effective p-value is zero because the parameter α θ cannot be removed from the model. This approach can be generalized to construct a (one-sided) p-value for any parameter θ ∈ Θ whether or not N * θ = 0. The p-value is the minimum of Here F Poiss (n, λ) is the lower tail probability that a Poiss(λ) random variable X satisfies X ≤ n, whilẽ F Poiss (n, λ) is the probability that X ≥ n.
An alternative approach is to use the sufficient statistic N * θ for α θ evaluated against its distribution conditional on the observed values of the sufficient statistics in the model with parameters indexed by Θ \ θ, rather than, as we have, against its unconditional distribution on the estimated model. The conditional distribution does not seem to be easily tractable, but this is an interesting avenue for future research.

Model Fitting
The model-fitting procedure is detailed stepwise, as follows: • Step 1: Set a threshold value for the p-value and fit the model with the main effects parameters only. • Step 2: Consider in turn each two-list parameter not already added to the model, and check that adding it to the model would not lead to a nonexistent estimate (or to non-identifiability if the full two-way model is proposed). • Step 3: Among those parameters that pass the checks, find the one with the smallest p-value, using the approach set out in Section 3.1. If that p-value is less than or equal to the given threshold, add the parameter to the model, and go back to Step 2. If the p-value is greater than the threshold, finish.

Note that in
Step 2 all two-list parameters not already included are considered, whether the pairs they correspond to are overlapping or nonoverlapping. The method is akin to forward stepwise regression. Note also that if the algorithm set out in Section 2.7 has already demonstrated that nonexistence and non-identifiability cannot arise for any model for the data set in question, then the check in Step 2 is not necessary.
It remains to choose the threshold p-value. We conduct a detailed simulation study in Section 4.3 which points to the choice p = 0.02, and that is the value which we would suggest, but users might wish to explore the sensitivity of the result to adjusting the parameter.

Boostrapping to Find Confidence Intervals
In general, current approaches find confidence intervals for the population size conditional on the terms actually included in the model, either for the Poisson log-linear model itself, or for modifications such as the multinomial model considered by Sandland and Cormack (1984). Because the choice of model itself depends on the observed data, it is preferable to construct confidence intervals which take account directly of the effect of model selection. A natural way of doing this is to use a bootstrap approach, which will also take account of any biases that the model selection approach may introduce. The BC a methodology of DiCiccio and Efron (1996) gives second-order accuracy and does not depend on any transformation of the scale on which the data are observed and the estimate of the total population made.
The observed data in our case are the original m observed capture histories (ξ 1 , ξ 2 , . . . , ξ m ). To construct each bootstrap sample, we could draw a random sample (ξ boot 1 , ξ boot 2 , . . . , ξ boot m ) of size m, with replacement, from the original data. If we denote by N boot ω the number of times the capture history ω occurs in the bootstrap sample, then the N boot ω have a multinomial distribution corresponding to m trials and probabilities proportional to the original N ω . In practice, therefore, the ξ boot i are not actually constructed, but we sample direct from the multinomial distribution of the capture history totals. The parameter for the number of trials in the multinomial distribution is the number m of capture histories actually observed and does not depend on any estimate of the dark figure.
For each bootstrap sample, we carry out the stepwise fitting procedure and obtain an estimate (bootstrap replication) of the population size. There is no constraint on choosing the same model. The BC a confidence intervals use percentiles of the bootstrap distribution of the population size, but they adjust the percentile actually used. The adjusted percentiles depend on an estimated bias parameterẑ 0 , defined so that Φ(ẑ 0 ) is the proportion of the bootstrap estimates that fall below the estimate from the original data, and an estimated acceleration factorâ, whose derivation depends on a jackknife approach.
The jackknife requires the population size to be estimated from every sample constructed from the original data by leaving out one of the data points ξ i . However, the number of jackknife estimates that need to be evaluated can usually be dramatically reduced, making for considerable computational savings, because the number of distinct values taken by the ξ i , the number of different capture histories actually observed, is in general much smaller than m. If there are K capture histories for which N ω > 0, only K jackknife estimates actually have to be calculated. These are then weighted in the calculations by the number of times N ω that the value ω appears in the original sample. To be explicit, letθ (i) be the estimate of the population size constructed from the original sample leaving out capture history ξ i . The effect of leaving out that capture history is to reduce N ξ i by one, and sô θ (i) =θ (−1) ξ i , where, for each capture history ω actually observed in the data, θ (−1) ω is the estimate of the population size from the original sample but with N ω replaced by N ω − 1. Only the K valuesθ (−1) ω have to be calculated. To calculate the acceleration factor, letθ (·) be the average of the jackknife estimatesθ (i) . Then Applying a similar weighting argument to the defining equations (6.6) and (6.7) of DiCiccio and Efron (1996), the estimated acceleration factorâ is then given bŷ These values of the parametersẑ 0 andâ are then used to choose the appropriate percentiles of the bootstrap distribution, using the standard BC a formulation.

Some Simulation Results
In order to compare our method with the standard BIC approach as implemented within Rcapture (Rivest and Baillargeon, 2014), a simulation study was carried out. The model fitted to the five-list UK data by the stepwise approach was used as a starting point. For this model, the predicted probabilities of each of the 32 possible capture histories (including the empty capture history) were calculated. The overall population size was that estimated by the model fit. The reason for using this model as a basis for a simulation is that it is reasonable to suppose that it will display features likely to be seen when using the methods in the human trafficking context. An example with five lists was used so that the repeated use of the BIC method does not become computationally burdensome.
The population size and the capture history probabilities were regarded as fixed, and were used as parameters for multinomial sampling to create 500 simulated data sets. For each simulation, population estimates and confidence intervals were constructed both using the BIC approach and using the stepwise method we have set out. For the BIC method, multinomial confidence intervals using the routine closedpCI.t within Rcapture were found; the confidence intevals for the stepwise method were constructed using the BC a approach. Because the simulations are constructed from a model with known population size, it was possible to assess the accuracy of the estimation. The root mean square error of the estimation was 3057 for the stepwise approach and 5834 for the BIC method. The root mean square errors of the estimate of the log of the population size were 0.19 and 0.34 respectively, so again, for this example, the stepwise approach has much better performance.
The coverage rate of the estimated confidence intervals was also determined. For the stepwise method using the BCa approach, the nominal 95% confidence interval contained the true value for 90% of the simulations, while the nominal 80% intervals had an actual coverage rate of about 70% (346 out of the 500 replications). While these rates are not perfect, the corresponding observed coverage rates for the methods using routines in Rcapture were considerably lower, 61.4% and 42.8% respectively.

Empirical Applications
In this section, our methods are applied to two data sets relating to victims of modern slavery and human trafficking in the USA. Both data sets display the sparseness of overlapping entries typical of data collected in this field. In addition they are also typical of data collected in local regions (rather than entire large countries) in having relatively small counts, with the total number of observed cases in the hundreds and not the thousands. The two data sets, together with those discussed in Section 2.5 are then used to construct a simulation study investigating the appropriate choice of threshold parameter.

The New Orleans Data
Bales et al. (2019) discuss a data set collated from a number of sources in New Orleans, given in Table 4. Altogether there are eight lists, and so the full incidence table of observable capture histories, including those combinations for which the actual observed number is zero, has 255 rows. The null capture history, corresponding to the dark figure, of course cannot be observed, and estimating it is the task of the analysis. There are 28 possible pairs of lists, and of these there are 18 non-overlapping pairs. Using the threshold p = 0.02 fits a model including one two-list parameter, indexed by the pair DE. The point estimate of the total population size is 1184. The BC a bootstrap confidence interval, based on 1000 bootstrap replications, is (717, 1657). If main effects only are chosen (which will be the case for the threshold p = 0.01 or smaller), then the resulting model yields a 95% confidence interval of (644, 1618) with a point estimate of 997. Arguably, with as many as 28 possible two-list parameters, there is some merit in using a smaller threshold.
Because some of the list counts are so small, the effect of combining the four smallest lists into one, to give a five-list version of the data, was also investigated. If this is done, none of the two-list parameters is significant even at the 5% level, and the BC a confidence interval is (589, 1703) with a point estimate of 1034, a result very close to that yielded by the eightlist data with the smaller threshold. As a further illustration of the issues discussed earlier in the paper, and the need to handle non-overlapping lists in the way we have developed, the Rcapture routine closedpMS.t was used to fit every possible choice of model with two-list effects. There are 1024 such models, and in only 124 of these was the fit successful without generating a warning. In the majority of cases there was a warning that the asymptotic bias is large.
Return to the full data as an example for the methodology set out in Section 2.7. There are 2 28 possible models, and 18 non-overlapping pairs. To check every possible model for existence of the maximum likelihood estimate, there are 2 18 linear programming problems to solve. This check, which would have been impossible if all 2 28 models had to be considered explicitly, only takes a few minutes on a standard PC. Neither of the problems identified in Section 2.7 arises for any model for these data.

The Western Site Data
One of two data sets considered by (Farrell et al., 2019) is collated from a number of sources in the Western site of a research study in the USA. The data are given in Table 5. Altogether there are 5 lists, and so the full incidence table including those combinations for which the observed number is zero has 31 rows. There are 10 possible pairs of lists, and of these there are 2 non-overlapping pairs. It is very quick to check that all possible models lead to estimates which exist and are identifiable.
The threshold of p = 0.02 yields a model including the two-list effect AE, with a point estimate of 2483. The BC a confidence interval is (1293,3670).
It should also be noted that the application of closedpMS.t to this data set again generated warnings in more than half of the 1024 possible models. In both this data set and the New Orleans five-list data set, warnings were generated among the top ten models according to the BIC that closedpMS.t displays by default, but not, as it happens, by the very top model. For the Netherlands data considered earlier, six out of the ten top models generates a warning.

Choosing the Threshold: A Simulation Study
In order to gain insight into the appropriate choice of threshold, a simulation study was carried out. In order to make this relevant to the context of human trafficking, the models considered are all based on the data sets referenced in this paper, in an attempt to ensure that the simulation study is based on data sets which have the kinds of characteristics likely to be encountered. The data sets considered were the UK, Netherlands, New Orleans and Western site data; in the case of the UK, Netherlands and New Orleans data, both the full and the five-list versions were included, giving seven data sets in all. For each of these, four different models were fitted; the 'full' model with all two-list effects included, the model based on main effects only, and the models chosen by the method we set out, using thresholds 0.001 and 0.05, to give a more parsimonious and a less restrictive fit. In every case, the model fit gives an estimate of the total population and of the probabilities of all possible capture histories.
For each of these 28 test cases, 1000 realizations of the capture history totals were simulated, by drawing from a multinomial distribution with the given population size and capture history probabilities. Each realization can be conceptualized as an example of a multiple systems sample from a population of known size, with characteristics similar to those likely to be observed in the human trafficking context. For each realization, estimates of the total population were obtained using a range of thresholds (0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1), as well as for the main effects model and the model with all two-list effects included, corresponding to thresholds 0 and 1 respectively. The estimates of the total population size were, as one would expect given the log-linear nature of the modeling, asymmetrically distributed around the true value for each simulation scenario, and a log transformation is appropriate. With this in mind the measure of accuracy used for each of the 28 sets of simulations, for each threshold parameter, was the mean square error of the logarithm of the estimate of the dark figure.
The general level of mean square error varied quite substantially across the 28 models considered. In order to take account of this variation, for each threshold, the mean, over the 28 models, of the logarithm of the mean square error was calculated to give an overall score for that threshold. The threshold with the minimum score is p = 0.02. Further details of the simulation study are given in Section 5 of the supplementary materials to this paper.

Concluding Remarks
The R software package SparseMSE  includes implementations of all the methodology described in this paper. In particular, it contains programs to check whether a particular model leads to either of the estimability issues set out in Section 2.6 above, and it incorporates these checks within a routine to fit any particular model, or to make the model choice using the stepwise procedure described in Section 3.2. It also allows for the possibility of checking all possible models using the approaches discussed in Section 2.7. Full details are given in the package documentation.
To conclude, in this paper we have investigated inference for multiple systems estimation using Poisson log-linear models, taking proper account of the possibility that the underlying data tables contain non-overlapping lists, as commonly arises when the data are collected in the context of studies on modern slavery and human trafficking. We have also set out an approach to model choice and demonstrated the utility and practicality of our approach on real data sets. This area is especially challenging for methodological development because there is no "ground truth" against which methods can be assessed, and frequently there are no details of the data available beyond anonymized list data of the form presented in the tables above. Nevertheless, reliable and stable methods are important for applications in public policy, even if they are conditional on assumptions that it may not be possible to verify.
For simplicity and clarity, the procedure has been discussed and detailed in full for models which consider up to terms indexed by pairs of lists. In principle, the model fitting and inference aspects can easily be extended to consider models based on higher-order terms, though it seems unlikely that any data sets collected in the contexts of human trafficking would merit this. For example, if a three-list parameter α 123 were a candidate for inclusion within the model, then the estimate of α 123 would be −∞ if the three-list overlap N * 123 were empty, and to fit the other parameters one would then remove all capture histories including all three lists 1, 2 and 3 from the glm stage.
Similarly, another possible extension is to the case where there is covariate information rather than just presence/absence on various lists. As in our main discussion suppose there is a pair (or larger set) of lists whose interaction parameter is in the model but for which no overlapping cases are observed for any value of a covariate. Then the right approach (depending on the exact details of the modeling) would be to set the corresponding interaction parameter to −∞ and then remove various zero cells containing the non-overlapping set of lists from the fitting procedure for the other parameters including those relevant to covariates.
One possible topic for future research is the combination of our insights with those of International Working Group for Disease Monitoring and Fore-casting (1995), which explores the effect of heterogeneity. Some of the approaches suggested in that paper may not be available. For example in the human trafficking context we may not be able to stratify the population, nor may the statisticians analyzing the data have any information about the lists themselves to evaluate the possibility of heterogeneity. On the other hand, if one is in a position to implement the proposals, then the possibility of effects of the kind we have explored has to be taken into account.

A Supplementary Materials
The supplementary materials contain additional details to complement the main paper "Multiple Systems Estimation for Sparse Capture Data: Inferential Challenges when there are Non-Overlapping Lists". In particular, the document provides: additional information and details for the simulation carried out to investigate the asymptotic likelihood theory that supports Section 2.5 of the main paper; the UK and Netherlands data sets, used as empirical examples; justification of the conditions for non-identifiability as mentioned in Section 2.6; and full details, including R code, of the simulation study conducted to inform a suitable choice of p-value threshold, as discussed in Section 4.3.

A.1 Investigation of the Asymptotic Likelihood Theory
One of the standard approaches in the literature when fitting log-linear models is to use methods based on likelihood ratios for inference and on information criteria for model choice. The asymptotic theory and arguments behind these approaches, for example Wilks (1938) and Akaike (1974), break down when parameters are at an extreme of their ranges, as is the case in our application for the parameters corresponding to non-overlapping lists. The simulation study to demonstrate this break down is based on models with three lists and an expected true population size of 1000, with captures on the various lists being independent, with probabilities (p 1 , p 2 , p 3 ), so that the probability of capture history ω is i∈ω p i i ∈ω (1 − p i ). The first model used has capture probabilities set to 0.3 for each capture occasion, which is more reminiscent of a classical mark-recapture study, while the second has capture probabilities set to 0.01, 0.04, and 0.2, which is somewhat typical of the sparse capture case. Therefore for the first model, if the history contains k captures then the probability is 0.3 k 0.7 3−k . For the second model the probability depends on the actual lists; for example the probability of capture history {1, 3} is 0.01 × (1 − 0.04) × 0.2.

QQ plot
Chi−Squared distribution

Simulated deviances
The simulation presented is carried out using the Poisson model. The probabilities are multiplied by 1000 and, for each simulation replicate, Poisson random values with expectations equal to these values are generated to give a full set of observed capture histories; together with the null capture history the expected number of counts (population size) is equal to 1000. The correct model for these data includes main effects only. Inference was carried out both for that model, and for the model with the addition of a two-list parameter indexed by the first two lists. The reduction in deviance between the two models was determined. In line with the standard asymptotic theory, the QQ-plots presented in Figure 1 show that the χ 2 1 distribution fits the observed deviances well for the "classic" model. However, not at all surprisingly, the fit for the sparse model is not good. This illustrates that likelihood ratio tests, and hence methods based on information criteria, cannot be relied on for fitting models in the sparse context.

A.2 Description of Empirical Data Sets
In Sections 2.5 and 4.3 of the main paper, the UK and Netherlands data sets are used in an empirical application and simulation study. These data sets have both been published previously, but for convenience they are included here.

A.2.1 The UK Data Set
As part of the UK Government's Modern Slavery Strategy (Home Office, 2014), Silverman (2014) discussed and analyzed a data set collated from a number of sources in the UK. The lists and corresponding abbreviations are given in Table 6. Table 6: List names and abbreviations for the UK data set (Silverman, 2014).
LA Local authorities NG Non-government organisations such as charities PF Police forces GO Government organisations GP The general public, through various routes NCA The National Crime Agency In the original analysis, the lists PF and NCA were combined into a single list, to yield what is referred to as the five-list UK data set. The sixlist version of the data set was published in Bales et al. (2015). A rationale for combining the lists PF and NCA is that the National Crime Agency has many of the characteristics of a police force.
Counts of the capture histories over combinations of the lists are given in Table 7. The full incidence table, including those combinations for which the observed number is zero, has 63 rows. There are 15 possible pairs of lists, and of these there are 2 non-overlapping pairs.   Table 8.  . I Inspectorate SZW K Border Force O Residential centers/shelters P National Police R Regional Coordinators Z Others Counts of the capture histories over combinations of the lists are given in Table 9. The full incidence table, including those combinations for which the observed number is zero, has 63 rows. There are 15 possible pairs of lists, and of these there are 2 non-overlapping pairs. Analysis is carried out both for the full data set, and for the 'five-list Netherlands data set' obtained by combining the two smallest lists I and O into a single list.

A.3 Non-identifiability
In Section 2.6 of the main paper we stated conditions for non-identifiability which can be summarised in the following proposition.

Proposition 1
The model is non-identifiable if and only if both the following conditions are satisfied: 1. All two-list parameters are included in the parameter set Θ.
2. Every set of three lists in the data contains at least one non-overlapping pair.
Proof. We will use the notation defined in the main paper. To maximize the likelihood, all two-list parameters in the model corresponding to nonoverlapping pairs will be −∞ and only the N ω for ω ∈ Ω † will remain in consideration. Therefore the model is identifiable if and only if the model matrix used in Step 4 of the algorithm in Section 2.4 is of full rank. This matrix is the incidence matrix A defined for ω ∈ Ω † and θ ∈ Θ † by A ωθ = 1 if θ ⊆ ω and 0 otherwise. The matrix will be rank deficient, and hence the model non-identifiable, if and only if there is a non-zero vector λ = (λ θ : θ ∈ Θ † ) such that Aλ = 0.
This condition will be equivalent to By its definition, the set Ω † includes all capture histories {i} of order 1.
Setting ω = {i} and using the definition of A, equation (5) implies that, for each i, in other words λ i = −λ ∅ for all i. Furthermore, for any pair of lists {i, j} ∈ Θ † , {i, j} must be in Ω † , so equation (5) so that λ ij = λ ∅ for all {i, j} ∈ Θ † . Now suppose that the model does not contain all two-list effects. Then there is a pair of lists, without loss of generality {1, 2}, that is not in the parameter set Θ. Suppose that Aλ = 0. We show that λ = 0 and hence A is of full column rank and the model is identifiable. The capture history {1, 2} will not have been removed from Ω † , and will not be in Θ † since Θ † ⊆ Θ, so there will be no λ 12 term to be considered. Therefore Substituting equation (6) for i = 1 and i = 2 shows that λ ∅ = 0 and hence λ i = λ ij = 0 for all i and for all {i, j} ∈ Θ † . Thus λ θ = 0 for all θ ∈ Θ † .
Thus the only possible non-identifiable model is the one which includes all the two-list parameters. Consider that model, and suppose as above, that Aλ = 0. Suppose, now, that there are three lists i, j and k such that all three of N * ij , N * jk and N * ik are non-zero. so that none of the pairs {i, j}, {j, k} and {i, k} are non-overlapping. Hence {i, j}, {j, k} and {i, k} are all in Θ † . The capture history {i, j, k} will be in Ω † , and so making use of equations (6) and (7). As before this will imply that λ θ = 0 for all θ ∈ Θ † , so that λ = 0, and hence the model will be identifiable. Now suppose, conversely, that Θ † contains no such triple pairs, so that every set of three lists contain at least one non-overlapping pair. The set Ω † will contain no capture histories of order 3 or above. Now, define λ ∅ = 1, λ i = −1 for all i and λ ij = 1 for all {i, j} ∈ Θ † . Then every element of Aλ will be calculated as in one of the two equations (6) or (7) and will be zero. Hence Aλ = 0 for a non-zero vector λ and so the model is, in this case, not identifiable. This completes the proof of the proposition, but it should also be noted that the non-identifiability will involve the parameter α ∅ . The form of the vector λ shows that it will not affect the likelihood if any value is added to the intercept parameter α ∅ and all two-list parameters, and subtracted from all one-list parameters. Therefore, even if there is a maximum likelihood estimate, the subset of parameters maximizing the likelihood will contain all values of α ∅ and hence will give no estimate of the dark figure.
To sum up, for the model containing all pairs {i, j}, A is of full column rank if and only if there is at least one set of three lists {i, j, k} that contains no non-overlapping pairs. To check this simply, define the matrix J by J ij = 1 if {i, j} is an overlapping pair, and zero otherwise, with all J ii = 0. The model will be identifiable if and only if trace(J 3 ) > 0. To see this, note that trace(J 3 ) = i,j,k J ij J jk J ki . The terms in this sum are all zero or one, and the trace will be strictly positive if and only if there is at least one non-zero term J ij J jk J ki , in other words if {i, j, k} contains no non-overlapping pairs.

A.4 Choosing the Threshold: A Simulation Study
This section gives further details of the simulation study carried out (Section 4.3 of the main paper) to inform an appropriate choice of p-value threshold for the stepwise algorithm. The full and five-list versions of the UK, Netherlands, and New Orleans data sets, and the Western site data set were considered in this study.

A.4.1 Description and Results of the Simulation Study
For each of the seven data sets, four models were fitted, the full model with all two-list effects, the 'main effects only' model with no two-list effects, and models based on the stepwise algorithm with p-value thresholds set to 0.001 and 0.05. For each of the resulting 28 data/model combinations, an estimate of the population size and capture history probabilities was obtained. Then, 1000 realizations of the observed capture history totals were drawn from a multinomial distribution with the given population size and capture history probabilities. The rationale of each of these simulation scenarios is to produce fully specified models which have the characteristics likely to be observed in real data in our context of interest.
For each simulated realization, estimates for the total population size were obtained using the models selected by the stepwise algorithm with p-value threshold set to 0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, and 1. Threshold 0 corresponds to the main effects only model and threshold 1 to the full model. If a realization yielded nonexistent estimates for the main effects model or non-identifiability of the full model, it was removed from consideration completely, to ensure the comparability of results across thresholds. As long as the main effects estimate exists, for other values of p Step 2 of the stepwise algorithm (as set out in Section 3.2 of the main paper) automatically excludes any models for which estimates are nonexistent and so the algorithm will produce bona fide estimates of the population size for all thresholds, as long as the full model estimate is identifiable.
The estimates of population size are highly asymmetric around their true value, and so accuracy was assessed by regarding the logarithm of the population size as the quantity to be estimated. Within each scenario, we then consider the mean square error of the estimate of the logarithm of the population size for each of the threshold values for the estimation. The overall value (and variability over estimation thresholds) of this mean square error varies considerably between scenarios. Some exploratory analysis suggests that taking the log of the mean square errors gives similar variability over thresholds within scenarios, and therefore produces results for the various scenarios that can be reasonably combined to give an overall score. The logarithms of the mean square errors are tabulated in Table 10.