Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models

We present a method to generate contingency tables that follow loglinear models with prescribed marginal probabilities and dependence structures. We make use of (loglinear) Poisson regression, where the dependence structures, described using odds ratios, are implemented using an offset term. We apply this methodology to carry out simulation studies in the context of population size estimation using dual system and triple system estimators, popular in official statistics. These estimators use contingency tables that summarise the counts of elements enumerated or captured within lists that are linked. The simulation is used to investigate these estimators in the situation that the model assumptions are fulfilled, and the situation that the model assumptions are violated.


Introduction
In simulation studies that make use of contingency tables, samples are drawn from a population with properties that are well understood.In this paper a method is proposed for the generation of contingency tables with fixed marginal probabilities and prescribed dependence structures that are defined in terms of odds ratios.An odds ratio describes the strength of the association for a two-way contingency table.
For multi-way tables conditional odds ratios play the same role for partial associations between variables (Agresti, 2013).A simple example of such a population would be a two-way contingency table with marginal probabilities of 0.7 and 0.8 and the dependence represented by an odds ratio of 2.
Loglinear models have interaction parameters that can be directly understood in terms of odds ratios.Therefore the proposed method is built around the use of loglinear models.Models with two and three variables are explored, where each variable is dichotomous.However, the methodology presented can generate populations using loglinear models of any size, in terms of both the number of variables and the number of levels of the variables.Also, we can generate populations following logistic regression models having only categorical explanatory variables, and latent variable models that can be formulated as loglinear models, such as the latent class model (Haberman, 1988) and the extended Rasch model (Hessen, 2011).
There are a few ways to fit loglinear models.It suits our purposes to define loglinear models in terms of Poisson regression models with a log link (Nelder and Wedderburn, 1972).Such models can be fitted with, for example, iteratively reweighted least squares (IRLS) (Green, 1984).The models we use require the marginal probabilities and odds ratios as input parameters.As a first step, the given marginal probabilities are used to generate the probabilities in a loglinear model for which complete independence holds (Good, 1963 andBishop et al., 1975, p.345) .The prespecified interaction structure is imposed by using an offset, which is an additional explanatory variable with a prespecified parameter value that is fixed and not estimated.
When each margin of a contingency table has only two categories, the contingency table is a summary of the observations of correlated binary variates.Previously, Lee (1993) proposed a linear programming method for this problem.In comparison to his method, our method is simpler.Also, we present our method in the context of various applications of the loglinear model where conditional odds ratios can also be specified, and of latent variable models.The work of Gange (1995) is related to our work in the sense that he uses the iterative proportional fitting algorithm, an algorithm that was very popular in the early ages of loglinear modelling; yet he does not make a link to loglinear models and mentions odds rations only in passing.Choi (2008) also presented a method based on loglinear models, where the association is formulated in terms of the Pearson chi-square coefficient, and iterative proportional fitting is used in the generation of probabilities.
Thus the way in which the association is specified by Choi is different from our method, that is more closely related to the way in which the loglinear model is defined.Similar to our work, Kateri (2014, section 2.2.5, p. 43) described that local odds ratios, row and column marginal distribution uniquely determine the table of joint probabilities, which also holds for any other minimal set of odds ratios.
Our work further develops this and demonstrates a simple approach to generating data from this information.Geenens (2020) sets out a similar approach into a framework where the dependence is measured using copulas.He explains that the bivariate distributions are uniquely determined by a 'margin-free' association parameter (and that multiple different kinds of parameter are possible, but with one-to-one relations between them. There is also much related but different work on generating correlated binary variates, see, for example, Emrich and Piedmonte (1991), Lee (1993), Gange (1995), Park, Park and Shin (1996), Lee (1997), Lunn and Davies (1998), Al Osh and Lee (2001), Demirtas (2006) and Jiang, Song, Hou and Zhao (2021).There is also some (though rather less) work on generating correlated multinomial or ordinal variables which can be used when the margins have more than two categories, for example Gange (1995), Choi (2008), Amatya and Demirtas (2015) and Touloumis (2016).
These approaches mostly focus on generating population probabilities in order to assess the performance of generalised estimating equation (GEE) models.In passing, some of these papers mention that their methodology can also be used with odds ratios instead of correlations, but they do not provide a detailed description of how this works.Geenens (2020) extends the connection of copulas and margin-free association measures to general discrete distributions.Combined with marginal information these generate distributions and Geenens demonstrates their existence and uniqueness.Our work is different in that we focus on generating data from loglinear models where the dependence is specified using odds ratios rather than correlations, and describe how this leads to a simple implementation using standard functions for fitting loglinear models.
The methodology we propose is illustrated with an application to population size estimation, as has been used in population censuses.This follows on from the work of Brown, Abbott and Diamond (2006), Baffour, Brown and Smith (2013) and Gerritse, Bakker and van der Heijden (2015) who used a fixed odds ratio to investigate the impacts of dependence on population size estimation.In this application, lists where individuals within the target population are enumerated are used to estimate the population sizes.These lists are subject to incompleteness (undercoverage error), and multiple system estimation has been proposed to adjust for this undercoverage error.Some assumptions about the dependence of the lists must be made (though these may be quite weak where there are many lists), so it is important to test these methods under prespecified conditions.The proposed methodology and its application allow investigations into the impacts of given dependence structures within contingency tables to be carried out, varying the sizes of odds ratios, population sizes and response probabilities, and under different models.
The structure of the paper is as follows.In Section 2 we outline the methodology around log linear models and model fitting.From this we develop the proposed method for including prespecified interaction parameters in section 2.4.In Section 3 we present an application where the proposed method is used to investigate the properties of population size estimation under dependence.A simulation study is presented for two-way and three-way contingency tables, alongside steps for users to implement the methodology.Section 4 makes an assessment of the advantages of our approach compared with existing methods for correlated binary or categorical data.

Models for two-way tables
The simplest situation that we consider is the loglinear model with two dichotomous variables A and B. Let A be indexed by i = 0, 1 and B by j = 0, 1.We denote the expected count for cell (i, j) of the two-way contingency table of variables A and B by µ ij .We denote the loglinear parameters by λ.Then the saturated loglinear model for two lists is: where λ is the intercept term, λ A i and λ B j are the respective parameters for variables A and B, and λ AB ij is the two factor interaction parameter.The parameters are identified by setting the parameter equal to 0 when an index is 0, i.e.
Thus the model has four parameters and four observed counts, and hence is called saturated.The loglinear interaction parameter λ AB 11 is closely related to odds ratio θ AB and the expected counts by When λ AB ij = 0, the odds ratio θ AB = 1, and the independence model holds.
The proposed methodology can be extended in a straightforward way to variables with more than two levels (compare Kateri, 2014, section 2).Assume a 3 × 3 table with levels i, j = 1, 2, 3, with level 3 being the reference category.The model is described in equation (1) with i, j now having different levels than in the 2×2 context.If we identify the 9 interaction parameters by setting these parameters equal to 0 when (i, j = 3), then there are four interaction parameters to be estimated, namely λ AB 11 , λ AB = (µ 12 /µ 13 ) (µ 32 /µ 33 ) = µ 12 µ 33 µ 13 µ 32 . (3)

Models for three-way tables
Consider the loglinear model for three dichotomous variables, where the third variable is denoted as C.Here the saturated model is where λ is the intercept term, λ A i λ B j and λ C j are respectively the parameters for variables A, B and C, λ AB ij , λ AC ik , and λ BC jk are the two factor interaction parameters and λ ABC ijk is the three factor interaction parameter.In order to identify the parameters, similar identifying restrictions hold as for the two-variable case.
When three dichotomous variables are present, the odds ratio for each pairwise relationship is conditional on the outcome of the variable not included in the pairwise relationship.E.g., when there is a relationship between variables A and B, the odds ratio is conditional on variable C:

Latent variable models
Loglinear models are also used in the context of latent variable models.The Rasch model, popular in the field of item response theory in psychometrics and in the field of population size estimation, assumes a single continuous latent variable that explains the dependence between observed dichotomous variables, i.e. the observed variables are conditionally independent given the latent variable.A version of the popular Rasch model (Rasch, 1980(Rasch, , originally published in 1960) ) is the socalled extended Rasch model (Hessen, 2011), that can be written for three observed variables as model ( 4) with all two-factor interactions identical, i.e.The latent class model assumes the existence of a categorical latent variable X with levels x, and given the level x the observed variables are independent.Let's assume that there are four observed variables A, B, C and D (having levels indexed by l) and a dichotomous variable X. Written in terms of (conditional) probabilities we have, This model was originally proposed by Lazersfeld and Henry (1968).Haberman (1988) showed that the model can also be written as a loglinear model: For the purpose of this paper, writing the latent class model as in equation ( 8) may not be that advantageous to generate a population, as equation ( 7) can be used for this purpose in a straightforward way by specifying the (conditonal) probabilities on the r.h.s. of equation ( 7), thus generating population probabilities for the four-way table.

Logistic regression
Due to the relation between the loglinear model and the logistic regression model with categorical explanatory variables (Bishop et al., 1975, section 2.2.4), the method proposed in this paper can also be used for this subset of logistic regression models.We show the relation between the models for a simple example.Consider model (4).Let the variable C with levels k = 0, 1 be the response variable and variables A and B be categorical variables.Then the logistic regression model is log where the λ-parameters λ C k , λ AC ik , λ BC jk and λ ABC ijk can be replaced by β-parameters respectively.The loglinear model

[AB][C] corresponds with the intercept only model in logistic regression as λ
Multinomial and ordinal response models can also be dealt with using this approach if they have equivalent loglinear models.Examples include the baseline category-logit model and the continuation-ratio logit model, see Agresti (2013).

The proposed method
Loglinear models are usually estimated using the maximum likelihood method.An important property of maximum likelihood estimates of loglinear models is that, for the parameters fitted, the corresponding margins of the observed counts equal the margins of the fitted counts (Bishop et al., 1975, p. 69) .These fitted counts are the sufficient statistics.We denote maximum likelihood estimates by adding a hat to parameters and expected counts.
We focus here on the estimation of hierarchical loglinear models, starting with the two-way table and extending to three-way tables later on.In hierarchical loglinear models, when higher-order parameters are included, the lower order parameters are included as well, for example, if λ AB ij is included, the lower order parameters λ A i and λ B j are included too.The use of hierarchical loglinear models allows for simple and easier interpretation of the model parameters.

Model fitting for two-way tables
For a two-way contingency table with observed counts n ij , when we fit model ( 1 Although the two most relevant models, the saturated model and the independence model, can be fitted in a straightforward way, we fit the model here through loglinear Poisson regression, as this suits our purposes later on.Assume that the four elements n ij are counts derived from four Poisson distributions.Then a likelihood can be set up for µ ij and maximized over the parameters.If we collect the elements m ij into a column vector, then model (1) can be written as, where the matrix with 0's and 1's is the design matrix formulating the loglinear model.Maximizing the likelihood gives maximum likelihood estimates μij and parameter estimates λ, λA 1 , λB 1 and λAB 11 .In fitting the independence model, the last column of the design matrix, (1, 0, 0, 0) T and the parameter λ AB 11 are left out of the equation.In software packages this can be accomplished in routines for Poisson regression, where the four observed counts n ij are provided as the Y -variable, the model is represented by the design matrix in equation ( 10), and the output consists of four fitted values μij and the parameter estimates.

Generating two-way tables with given properties
We now develop a model for the two-way contingency table with prespecified marginal probabilities and a prespecified interaction parameter.For the interaction parameter we want to make use of the odds ratio θ AB defined in equation ( 2).This can be accomplished by using an offset in Poisson regression.An offset is a column in the design matrix for which no parameter is estimated, i.e. the parameter is fixed to 1.The odds ratio is specified to give the desired level of dependence between lists.If we denote the prespecified odds ratio by θAB , then the desired loglinear interaction parameter is λAB 11 = log θAB .Thus the population that we want to generate has to follow the following model: The following equation, an adjusted version of equation ( 10), illustrates how this is implemented in Poisson regression: Notice that the parameter λ AB 11 from ( 10) is set to 1, i.e. it is not estimated, and this value 1 is multiplied with the prespecified log θAB in the design matrix.In software packages for Poisson regression this can be accomplished as follows: • specify an offset vector which is log θAB when (i, j) = (1, 1) and 0 otherwise.
• use an independence model as the design matrix, as we only need estimates for λ, λ A i and λ B j .
• in order to fit a model that leads to estimated counts that follow the prespecified marginal probabilities, calculate joint probabilities for which these marginal probabilities hold, and use these joint probabilities, multiplied with some constant, as the counts n ij to be analysed with model (11).
• then the estimates of expected frequencies can be used to derive the joint probabilities with the prespecified marginal probabilities and the prespecified odds ratio.
As an example, assume we want to generate population probabilities that have the prespecified marginal probabilities 0.7 and 0.8 and the prespecified odds ratio 2.
and the kernel of the loglikelihood is, At the maximum of the likelihood the derivate of the loglikelihood over the parameters is 0. So, for example, taking the derivative over λ A i gives n i+ − μi+ = 0 and this gives n i+ = μi+ .Similarly for n +j = μ+j .This completes the proof.It follows that including an offset for the interaction does not affect the property of ordinary loglinear model that for the fitted parameters the corresponding margins of the fitted values are equal to those of the observed values.
The model for the 3 × 3 table in section 2.1.1 has the following implementation: rectly, iteratively (for example, Bishop et al., 1975, p.84).
We now develop (4) in the form of Poisson regression.There are eight elements n ijk , being counts from eight Poisson distributions.The likelihood can be set up for µ ijk and maximised over the parameters.Model (4) can be denoted as, As before, for (17) the matrix with 0's and 1's is the design matrix formulating the loglinear model.A restricted model can be fitted by dropping columns in equation ( 17).For example, in fitting the model where B and C are independent given A, the last two columns of the design matrix and the last two parameters are left out of the equation.

Generating three-way tables with given properties
We now focus on loglinear models with prescribed odds ratios.We start with the model with no three-factor interaction.The model for a three-way contingency table with three prespecified odds ratios θAB , θAC and θBC is denoted as, In the adjusted version of (17) this becomes (using that log θAB +log θAC +log θBC = log θAB θAC θBC ): where the parameter for the offsets has been set to 1.In software packages for Poisson regression this can be accomplished as follows: • specify log θAB , log θAC and log θBC , and use them to generate the offset column of the design matrix as in ( 19).
• use an independence model in the design matrix, as we only need estimates for λ, λ A i , λ B j and λ C k .
• in order to fit a model that leads to estimated counts that follow prespecified marginal probabilities, calculate joint probabilities for which these marginal probabilities hold by multiplying the relevant probabilities.Use these joint probabilities, multiplied by some constant representing a notional population size, as the observed counts corresponding with n ijk to be analysed with model ( 19).
• then, as before in Section 2.4.1, the estimates of the expected frequencies can be used to derive the joint probabilities with the prespecified marginal probabilities and the prespecified odds ratios.
The R-code we used can be found in the online Appendix.
We now discuss the model with a prespecified three-factor interaction.Above we discussed model fitting for three-way tables, with no three-factor interaction.Here we develop the model to include a prespecified three-factor interaction parameter in a loglinear model.The three-factor interaction is defined as the state where two conditional odds ratios are not equal, for example, θAB|C=0 = θAB|C=1 .These conditional odds ratios are related to loglinear parameters by θAB|C=0 = exp λAB 11 and θAB|C=1 = exp λAB 11 + λABC

111
. As similar results hold for θAC|B=0 , θAC|B=1 , θBC|A=0 and θBC|A=1 this shows that the three-way interaction is represented by exp λABC 111 .A three-way interaction will be handled as follows.In Equation ( 19) we fill in log θAB|C=0 in cell (2,5) of the design matrix and log θAB|C=1 in cell (1,5) of the design matrix, and similarly for the other conditional odds ratios.Thus, the model with prespecified conditional odds ratios can be presented as, We can trivially write the term θAB|C=1 / θAB|C=0 , which is the difference in odds for C = 1 and C = 0, is equal to exp λABC 111 .The terms θAC|B=1 and θBC|A=1 include this same three-way interaction term, i.e. θAC|B=1 = θAC|B=0 θAB|C=1 θAB|C=0 , In fact since the three-factor interaction parameter λABC 111 is the difference between each pair of conditional odds ratios, it is clear that If we specify the denominators of each fraction, then as soon as one numerator is specified then all the fractions are determined.There are therefore only four free odds ratios in equation ( 24), and the three-factor interaction can be specified by providing all the denominators and any one of the numerators.Therefore we can rewrite element (1,5) of the design matrix in (20) as log θAB|C=1 θAC|B=1 θBC|A=1 = log θAB|C=0 θAC|B=0 θBC|A=0 θAB|C=1 θAB|C=0 3 , (25) using only four odds ratios.

Dual System Estimation
Nowadays dual system estimation methods are commonly used to estimate population totals for domains of interest.The dual system estimator is the simplest capture-recapture model and is used to estimate the part of the population that is missing after having linked individuals in two overlapping lists.Originally this method was implemented to estimate the size and density of wildlife populations, but it has also been developed and used in official statistics, where the dual-system estimator was originally used for the 1950 decennial United States Census to estimate coverage error (Wolter, 1986).
The linked data can be presented in a two-way contingency table: we observe the counts of those in both lists, those in list A only, those in list B only, but not those missing from both lists.Thus there are three observed cell counts, and the missing count for those in neither of the lists is to be estimated, which can conveniently be done using the loglinear model framework (Fienberg, 1972).
Dual system estimation relies on five key assumptions, outlined by the International Working Group for Disease Monitoring and Forecasting (1995): 1.The population is closed, i.e. there is no change such as births or deaths.
2. There is perfect matching between the two lists.
3. The capture/inclusion probability of elements in at least one of the lists is homogeneous (Seber, 1982;van der Heijden et al., 2012).
4. The probabilities of inclusion in the lists are independent.
5. There are no erroneous captures in any list (no overcoverage).
One of the key assumptions is that the probabilities of inclusion in the lists are independent, which is easily violated (Gerritse et al., 2015).However, with the data used for population size estimation, this assumption is impossible to verify and so the design of both lists is very important, to create the conditions where this assumption is likely to hold.A way to reduce the impact of violating this strict assumption is to divide the population by certain characteristics, such as age, sex and some geographical breakdown, so that we need only assume that the lists are independent conditional on these classifiers.
Assumption 4 follows on from assumption 3, which implies that individuals who where included in the first list and those who were not have the same probability of being included in the second list, so that presence in the first list does not affect response in the second list, and the lists are therefore statistically independent (International Working Group for Disease Monitoring and Forecasting, 1995).
Therefore, as there are three observed counts, to estimate cell counts μij for a twoway contingency table only three parameters can be used and no interaction term can be modelled, and (1) is adjusted to include no interaction term, i.e. λ AB ij = 0.

Triple System Estimation
The triple system estimator uses three linked lists.Counts can be summarised in a three-way contingency table where one count is missing, namely the count for the individuals that are missed by all three lists.By estimating this count and adding it to the seven observed counts, the population size is estimated.Unlike a twoway contingency table, in a three-way contingency table the pairwise interactions between lists can be modelled.This means that the homogeneous capture/inclusion 2. The prespecified odds ratios hold, and 3.The prespecified marginal response probabilities hold.
In order to investigate the behavior of population size estimators, multinomial samples are drawn from this population.For the study of the dual system estimator, in each sample the cell (0,0) is set to missing and the estimator is applied.For the study of the triple systems estimator, in each sample the cell (0,0,0) is set to missing and the estimator is applied.When t samples are drawn, t population size estimates are obtained that can be compared with the prespecified population size.

Simulation Study
The aim of the simulation study is to illustrate the methodology.Probabilities are generated using prespecified marginal probabilities and odds ratios.2000 multinomial samples are drawn with these probabilities and with the total population size set equal to 1000 or 2000.The prespecified probabilities and odds ratios mirror those used by Brown, Abbott and Diamond (2006) in their investigation of methods to deal with dependence in the coverage estimation for the 2001 population census of England and Wales.The outcomes of the simulations are presented in Table 1 and Table 2.The results include the specified values for the total population, the marginal response probabilities, the odds ratio (OR) between the lists, the mean and median of the population size estimates, empirical 95% confidence intervals (which is the range between the 2.5%) and 97.5% quantiles of the estimates, the Relative Bias %, (μ -N )/N and the Coefficient of Variation σ/N , where N is the prespecified population size, μ is the mean and σ is the standard deviation of the population size estimates over the simulations.
Table 1: Simulation results for dual systems estimation with inclusion probabilities π A and π B , sizes of odds ratios (OR), the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias, and coefficient of variation (CV), where the fitted model is The results presented in Table 1 from these simulations show, that when the lists are independent (OR = 1), and the independence model is fitted, the mean of the estimated population size across 2000 multinomial samples is very close to the true population size.However, the bias appears to depend on the size of the population and the response probabilities for each list.When the total population is smaller, N = 1000, and the response probabilities are lower, π A = 0.7 and π B = 0.8, for the lists, the mean of the estimated population sizes is slightly larger than the true population size.However, the bias is negligible when N = 1000 and π A = 0.8 and π B = 0.9, as the mean estimated population total across all multinomial samples is approximately equal to the true population total.Therefore, the higher the response probabilities of the lists, the less biased the estimates of the population size will be.
When the lists are dependent, and the independence model is fitted, the mean estimated population size is smaller than the true population size.This shows that positive odds ratios result in population size estimates being negatively biased (compare with Gerritse et al., 2015).In some cases, when the odds ratio = 1.1, the true population size falls within the confidence interval.This is an issue of power, as the misspecification of the model is small, and the number of simulations is apparently not large enough to detect it.
Table 2: Simulation results for triple systems estimation with inclusion probabilities π A = 0.8, π B = 0.7 and π C = 0.9, with different sizes of odds ratios (OR), specified models, the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias, and coefficient of variation

Discussion
We proposed a simple method to generate contingency tables with fixed marginal probabilities and dependence structures described by loglinear models.We discuss the generation of two-way and three-way contingency tables and this can be extended to in a straightforward way.
We specify the loglinear model in terms of Poisson regression with a log link.
Using an offset, prespecified interaction parameters can be fitted that can be understood in terms of odds ratios.For two-way tables, odds ratios describe dependence between two variables and conditional odds ratios describe partial dependence between three or more variables.
We think that this method is simpler than previously proposed methods which are discussed in the introduction, and it can be extended to other models that can be presented as loglinear models, such as logistic regression models with categorical predictors and some latent variable models.The structure of our approach to generating data is the same as the standard approach for undertaking multiple system estimation, so it is straightforward to understand how the dependence is introduced.Our approach is less tailored to the generalised estimating equations which have been the main application of other approaches.
In this paper we apply this method to population size estimation for two-way and three-way contingency tables.The results show the impacts of dependence between lists and model (mis)specification.list_a <-factor(c("in", "missed", "in", "missed"), levels=c("missed","in"))
), the fitted values are equal to the observed counts, i.e. μij = n ij .This model is denoted by the variables constituting the highest fitted margins, i.e. [AB].The margins n ij are the sufficient statistic for the saturated model.For the independence model, denoted by [A][B], the margins of variable A and variable B are fitted, i.e. μij = μi+ μ+j /μ ++ = n i+ n +j /n ++ , where a "+" denotes that the sum is taken over the corresponding index.
) 2.4.3.Model fitting for three-way tables For the saturated model with three variables, the fitted values are equal to the observed counts so μijk = n ijk .As an example of a restricted model, the estimates for the [AB][AC] model can be calculated directly the loglinear model [AB][AC][BC] estimates must be calculated indi- (CV) used to analyse the data in the column Model.Thus there are correctly specified models, such as OR = (1,1,1) and Model is [A][B][C], overparameterised models, such as OR = (1,1,1) and Model is [AB][AC][BC], and misspecified models, such as OR = (1.5, 2, 1) and Model is [A][BC].When the model is specified correctly the relative bias is close to zero.The variance of the estimates, decreases as the population size increases, compare, for example, OR = (1,1,1) and Model [A][B][C] with population sizes 1000 and 2000.When the model is overparameterised, the variance is larger than when the model is correctly specified.