A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks

Abstract The last two decades have seen considerable progress in foundational aspects of statistical network analysis, but the path from theory to application is not straightforward. Two large, heterogeneous samples of small networks of within-household contacts in Belgium were collected using two different but complementary sampling designs: one smaller but with all contacts in each household observed, the other larger and more representative but recording contacts of only one person per household. We wish to combine their strengths to learn the social forces that shape household contact formation and facilitate simulation for prediction of disease spread, while generalising to the population of households in the region. To accomplish this, we describe a flexible framework for specifying multi-network models in the exponential family class and identify the requirements for inference and prediction under this framework to be consistent, identifiable, and generalisable, even when data are incomplete; explore how these requirements may be violated in practice; and develop a suite of quantitative and graphical diagnostics for detecting violations and suggesting improvements to candidate models. We report on the effects of network size, geography, and household roles on household contact patterns (activity, heterogeneity in activity, and triadic closure). Supplementary materials for this article are available online.


Introduction
Networks of human interaction provide invaluable insights into the epidemiology of directly transmitted infectious diseases, and there is a great deal of interest in translating network data into epidemic models (Keeling and Eames, 2005, for a review). It is common to focus on epidemiologically important settings such as households (Goeyvaerts et al., 2018; Grijalva et al., 2015) and schools (Mastrandrea et al., 2015, for example), and such data are often used as-is for simulating disease spread and evaluating the impact of intervention strategies (Cencetti et al., 2021, for example). However, observation of larger and broader epidemiologically relevant networks is limited by time, resources, and considerations such as privacy, so it is often indirect or incomplete in a variety of ways, requiring statistical models to learn network structure from the available data and to reconstruct (simulate) networks consistent with it (Krivitsky and Morris, 2017).
Exponential-Family Random Graph Models (ERGMs), also called p* models (Wasserman and Pattison, 1996; Lusher et al., 2012; Schweinberger et al., 2020, among many), are a popular framework for specifying probability models for networks, postulating an exponential family on the sample space of graphs. Most applications of ERG modelling concern a single, completely observed population network, but methods for incomplete or indirect observation exist (Handcock and Gile, 2010; Krivitsky and Morris, 2017, among others). At the same time, the questions of asymptotics and inference, particularly under varying network sizes, have been debated in the literature (Schweinberger et al., 2020).
Though simpler mathematically, inference from samples of independent, non-overlapping networks is no less substantively challenging. In practice, the straightforward "i.i.d." inference scenario (e.g., brain networks) is relatively rare, and a far more prevalent one, particularly for social and contact networks, is that of multiple non-overlapping settings, of similar nature and using the same notion of a relationship, but having varying sizes, compositions, and exogenous influences. This variation is important because size and composition of networks can have profound effects on their structure (Krivitsky et al., 2011). Furthermore, the selection of networks to be observed may itself be a complex process, and the selected networks themselves may be incompletely observed, requiring network model inference to be integrated with survey sampling inference.
Samples of networks also provide opportunities. With only one network, ways to diagnose how well the model fits, and how well the inference generalises, are limited to working within that network: Hunter et al. (2008a) proposed a form of lack-of-fit testing that compares the observed network features not explicitly in the model to the distribution of those features simulated from the fitted model, and Koskinen et al. (2018) leveraged missing data techniques to compute an analogue of Cook's distance for each actor (the effect of each actor's observed relations on parameter estimates). On the other hand, models for independent (if heterogeneous) samples of networks can be diagnosed using familiar techniques developed for regression, provided those techniques can be adapted to networks, including partially observed networks.

ERGM analysis of samples of networks is increasingly common. One popular approach is meta-analysis, pooling individual networks' estimates (Lubbers, 2003). This approach is impractical for large samples of small networks, because the model may be nonidentifiable on each network individually (Vega Yon et al., 2021). More recent is multilevel (hierarchical) modelling: Zijlstra et al. (2006) developed it for the related p2 model, and Slaughter and Koehly (2016) for a Bayesian ERGM with random effects. Vega Yon et al. (2021) described exact maximum likelihood inference for samples of very small networks. Also, when modelling a time series of networks, transitions between successive networks are typically treated as conditionally independent (Leifeld et al., 2018, for example).
However, assessing an ERGM's goodness-of-fit for a sample of networks has tended to be limited to comparing distributions of observed network statistics to their expectations (Slaughter and Koehly, 2016; Stewart et al., 2019, for example) and replicating the diagnostics of Hunter et al. for each network (Vega Yon et al., 2021). Little attention has been paid to methods appropriate for partially observed networks and large samples of networks, or to identifying precisely how the model is misspecified.
Here, we consider two samples of within-household contact networks: one more complete but restricted to households with a young child, the other larger and more representative but with only one member's relations observed in each household, and both heterogeneous in household sizes and compositions. We wish to fit a probability model to these samples to pool their information and combine their strengths, which will allow us to learn about the social forces affecting the formation of their contacts (i.e., inference) and to predict their unobserved relations or those of other households in the population (i.e., prediction and simulation). More generally, we seek to answer three questions:

1. What are we estimating when we jointly fit a model to multiple networks?
2. What do we need to assume to combine information from multiple networks?
3. How do we test these assumptions?
In Section 2, we begin to address Question 1 by describing the datasets and applying the principles of model-based survey sampling inference to make explicit assumptions associated with inference from samples of networks that were previously left implicit. In Section 3, we review ERGM inference for missing data, introduce a parametrisation for jointly modelling an ensemble of networks, and discuss its inferential properties and the requirements for valid inference, addressing Question 2. We then consider in Section 4 the different ways in which these requirements may be violated and combine missing data theory with classic generalised linear model (GLM) diagnostics to produce tools for diagnosing lack of fit in the proposed framework, addressing Question 3; in addition, we propose fast model selection techniques for ERGMs for ensembles of networks. Finally, in Section 5, we apply these techniques to our original problem. Further details, discussion, and results are provided in Appendices A-E, referenced throughout this article. When appendix figures and tables are referenced, they are prefixed by the appendix, e.g., Figure B5 is Figure 5 from Appendix B.

Study Designs
Two paper-based surveys (Hoang et al., 2021; Goeyvaerts et al., 2018) were conducted in Belgium in 2010-2011, using similar survey instruments but differing in sampling design. In both surveys, participants recruited by random-digit dialling (or their guardians) reported their own and their household members' demographic information and recorded their contacts over the course of one day, including the contacts' ages and genders. The approximate duration and frequency of each contact were also recorded, but for the remainder of this work, we focus our attention on the presence or absence of contacts involving skin-to-skin touching.
The first major difference between the surveys is that whereas in the egocentric (E) survey (Hoang et al., 2021), only one participant per household was sampled, in the other (Goeyvaerts et al., 2018), the whole household (H) was enrolled. This within-household sampling design profoundly impacts the information available about the households: while all contacts are known for the networks in the H dataset (Figure 1a), only contacts incident on the one respondent in the household are known for the E dataset (Figure 1b), though, importantly (Krivitsky and Morris, 2017), enough information was collected to identify these contacts uniquely within the household.

Figure 1: Example observation units from the two datasets. Household composition is observed for both, but whereas the H households' relations are completely observed, the alter-alter relations in the E households are missing by design. Panel (b) shows the report of within-household contacts by ego #8.

The second major difference is that the H survey was restricted to households with a child aged 12 or under, whereas the E survey was not. More incidentally, the surveys differed slightly in their geographical localisation:
both surveys included households in the Flemish (Dutch-speaking) areas of Belgium, but only the H survey included participants from the (majority-French-speaking) Brussels-Capital region of Belgium. Also, both surveys called for fine-grained stratification by age, but in practice, the surveyor was not able to adhere to it exactly, and in the H survey, households for which any members' contacts were not successfully recorded were dropped altogether.

Descriptive Statistics
Dataset H comprises 318 households of size 2-7, for a total of 1266 members/respondents. Requiring less effort per household to collect, E comprises 1258 respondents whose households (ranging in size from 2 to 8) have a total of 4000 members, with 53% of the households' relationship states observed.
In the H dataset, individuals in their mid 20s are underrepresented; E is more representative in this respect, except for omitting individuals living alone or in shared housing. Both datasets' households are on average gender-balanced, but E's respondents have a different age and gender distribution from its household members in general, with women aged 25-55 overrepresented and adolescents of both genders underrepresented. Most households (H: 86%, E: 75%) were observed on a weekday; 11% of the H households are in Brussels. In the E dataset, 34% had a child 12 or under and so could have been in the H dataset.
With respect to structural properties, these networks are on average dense (H: 93%, E: 73%), with the H dataset's networks denser on average (p-value < 0.0001). The H dataset's networks exhibit high triadic closure, with the global clustering coefficient (3 × #triangles / #2-stars) averaging 92%; this coefficient cannot be estimated on the partially observed E dataset.
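As a concrete illustration of these two summaries, the sketch below (hypothetical toy data, not from either survey) computes the density and the global clustering coefficient of a small household-style network:

```python
from itertools import combinations

def density(edges, n):
    """Fraction of the n*(n-1)/2 possible undirected ties that are present."""
    return len(edges) / (n * (n - 1) / 2)

def global_clustering(edges, n):
    """Global clustering coefficient: 3 * (#triangles) / (#2-stars)."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    two_stars = sum(len(adj[i]) * (len(adj[i]) - 1) // 2 for i in range(n))
    triangles = sum(1 for i, j, k in combinations(range(n), 3)
                    if j in adj[i] and k in adj[i] and k in adj[j])
    return 3 * triangles / two_stars if two_stars else float("nan")

# Hypothetical 4-person household with one tie (between 2 and 3) absent:
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
print(round(density(edges, 4), 3), round(global_clustering(edges, 4), 3))
# → 0.833 0.75
```

The factor of 3 arises because each triangle closes three distinct 2-stars.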
More information can be found in Appendix A.

Implications for Inference
The E dataset is representative but comprises egocentrically observed, incomplete networks; the H dataset is highly selective but comprises complete networks. E generalises better to the population of Flanders; H allows higher-order (e.g., triadic) effects to be estimated, and includes Brussels. Combining information from multiple surveys with different strengths is not uncommon, and a variety of approaches can be used (Elliott et al., 2018). These data and the substantive problem are particularly amenable to a model-based approach: the ERGM framework seamlessly integrates exogenous (e.g., age) and endogenous (e.g., friend-of-a-friend) effects likely to be relevant. The model-based approach is also feasible: unlike some egocentric data (Krivitsky and Morris, 2017), each respondent's contacts in E could be identified uniquely within the household, so the model-based inference of Handcock and Gile (2010) is possible.
For the purposes of prediction (e.g., given the distribution of household composition, how would an infection brought home from school spread?), we require the analysis to generalise to the population of households in Flanders and Brussels. The missing information principle (Orchard and Woodbury, 1972; Breckling et al., 1994) suggests that if the model is accurate enough, it can be generalised to the population despite the heterogeneous and biased sample. More precisely, we require that the sampling process be ignorable or, if viewed as a missing data process, missing at random: the unobserved relationship states must be conditionally independent of the selection process given the model and what is observed (Rubin, 1976; Handcock and Gile, 2010). In our case, this also means that the model must render ignorable the dataset from which each network came. To accomplish this, it must also account for network size and composition effects.
For the purposes of inference (e.g., do mothers have more contact with their children than fathers?), we in addition require consistency and a sampling distribution for the parameter estimates. Fortunately, we can treat these networks as an independent sample: the probability that any member of any of the households in either sample has interacted with a member of one of the other households in either sample is low, and such an interaction is unlikely to affect within-household information in a systematic way in the first place. However, there are further nuances, discussed in Section 4.1.

Model Specification and Inference
Here, we review the ERGM framework and inference for it, then describe a framework for specifying ERGMs for multiple networks and its inference in turn. We refer the reader to the textbook by Lusher et al. (2012) and the review by Schweinberger et al. (2020) for detailed discussions of ERGMs' formulation, interpretation, and inference.

Exponential-Family Random Graph Models for Completely and Partially Observed Networks
Let N = {1, 2, . . . , n}, for n ≥ 2, be the set of actors whose relations are of interest. Since physical contacts are inherently two-way, we focus on undirected graphs: the set of potential relations of interest Y ⊆ {{i, j} ∈ N × N : i ≠ j} is a subset of the set of dyads, distinct unordered pairs of actors. The set of possible graphs of interest is then Y ⊆ 2^Y (the set of all subsets of Y). We use y ∈ Y for the graph data structure and y_{i,j} ∈ {0, 1} as the indicator of i and j being connected in y (with y_{i,j} ≡ y_{j,i}). An ERGM is specified by its sample space Y, a collection x ∈ X of quantitative and categorical exogenous attributes of actors (e.g., age and gender) or dyads (e.g., distance) used as predictors, and a (sufficient by construction) statistic g : Y × X → R^p. This statistic operationalises network features believed or hypothesised to be affecting the network process. With free model parameters θ ∈ R^p, the model assigns probabilities

Pr_{Y,x,g}(Y = y; θ) = exp{θ · g(y, x)} / κ_{Y,x,g}(θ),

where κ_{Y,x,g}(θ) = Σ_{y′∈Y} exp{θ · g(y′, x)} is the normalising constant. For the sake of brevity, we will omit specification elements "Y", "x", "g", and "θ" where unambiguous.
Network statistics that we will use in this work include the edge count |y|, used to model the propensity to have relations; edge counts within or between exogenous groups of actors, to model homophily and other types of mixing; and endogenous effects: the count of 2-stars, g_{2-star}(y) = Σ_{i=1}^n (|y_i| choose 2) (where |y_i| is the degree, the number of ties incident on actor i), to model degree heterogeneity, and the count of triangles, g_{triangle}(y) = Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^n y_{i,j} y_{j,k} y_{k,i} / 6, to model triadic closure. Ordinarily, we would be wary of using the latter two because of their well-known tendency to induce badly behaved "degenerate" models in large networks, and would instead use less degeneracy-prone effects (Schweinberger et al., 2020, Sec. 3.1 for context and history). However, this application's networks are very small and thus largely unaffected, so we use them for their simplicity.
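A direct transcription of these two endogenous statistics, using an adjacency-matrix representation (a sketch; the indexing conventions are ours):

```python
from math import comb

def g_2star(y, n):
    """2-star count: sum over actors i of C(|y_i|, 2), |y_i| the degree."""
    deg = [sum(y[i][j] for j in range(n) if j != i) for i in range(n)]
    return sum(comb(d, 2) for d in deg)

def g_triangle(y, n):
    """Triangle count: sum of y_ij * y_jk * y_ki over ordered distinct
    triples (i, j, k), divided by 6 since each triangle is counted 6 times."""
    total = sum(y[i][j] * y[j][k] * y[k][i]
                for i in range(n) for j in range(n) for k in range(n)
                if i != j and j != k and i != k)
    return total // 6

# The complete graph on 3 nodes has 3 two-stars and 1 triangle:
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(g_2star(tri, 3), g_triangle(tri, 3))  # → 3 1
```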
If the network is incompletely observed, likelihood estimation proceeds as follows (Handcock and Gile, 2010): to the unobserved true population network y, an observation process obs(·) (deterministic or conditioned-on) is applied, producing y^obs def= obs(y), an observed data structure with y^obs_{i,j} ∈ {0, 1, NA} representing observed-absent, observed-present, and unobserved potential relations, respectively. For the E dataset, obs(y) is such that y^obs_{i,j} = y_{i,j} whenever i or j is the household's respondent, and y^obs_{i,j} = NA otherwise. Let Y(y^obs) def= {y′ ∈ Y : obs(y′) = y^obs}: all complete networks that could have produced y^obs or, equivalently, all possible imputations of unobserved relations of y^obs; and define the conditional expectation μ(θ | y^obs) def= E{g(Y, x) | Y ∈ Y(y^obs); θ} and, analogously, the conditional covariance Σ(θ | y^obs). Then, under noninformative sampling and/or missingness at random, the face-value log-likelihood is

l(θ) = log Σ_{y∈Y(y^obs)} Pr_g(Y = y; θ),

and the observed information is (Orchard and Woodbury, 1972; Sundberg, 1974; Handcock and Gile, 2010)

I^obs(θ) = Σ(θ) − Σ(θ | y^obs). (1)

Unlike in the completely observed case, (1) is not the Fisher information, because it depends on the data (y^obs). The Fisher information, then, also takes the expectation over the possible values of y^obs under the model:

I(θ) = Σ(θ) − E_Y{Σ(θ | obs(Y))}.
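For intuition, the face-value likelihood can be evaluated by brute-force enumeration when the network is tiny. The sketch below does so for an edge-count-only model (a hypothetical simplification; real specifications involve many more statistics), with NA-coded dyads marginalised out by summing over all consistent completions:

```python
from itertools import product
from math import exp, log

def face_value_loglik(theta, y_obs, n):
    """Face-value log-likelihood: log of the sum, over completions in
    Y(y_obs), of Pr(Y = y; theta) for a one-statistic (edge count) ERGM.
    y_obs maps dyads (i, j), i < j, to 0, 1, or None (NA)."""
    dyads = [(i, j) for i in range(n) for j in range(i + 1, n)]
    weight = lambda yv: exp(theta * sum(yv))  # exp{theta * |y|}
    kappa = sum(weight(yv) for yv in product([0, 1], repeat=len(dyads)))
    num = sum(weight(yv) for yv in product([0, 1], repeat=len(dyads))
              if all(y_obs.get(d) in (None, v) for d, v in zip(dyads, yv)))
    return log(num) - log(kappa)
```

For example, with theta = 0 on a 3-node network with two observed-present ties and one NA dyad, exactly 2 of the 8 graphs are consistent with the data, so the log-likelihood is log(2/8).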

Multivariate Linear Model for ERGM Parameters
Consider a sample of networks indexed s = 1, . . . , S; for each, let z_s be a row q-vector of network-level covariates of interest, which may include its size, various compositional properties, and stratum membership, and let β be a q × p parameter matrix. Set the network-level parameters θ_s def= (z_s β)′, and model the networks jointly as Y_s ~ind ERGM_{Y_s,x_s,g_s}(θ_s). Thus, the components of the model specification (sample space, sufficient statistic, and any covariates) are allowed to vary arbitrarily between networks, but the ERGM parameters are parametrised by a matrix, analogously to multivariate linear regression models (e.g., Johnson and Wichern, 2007, Ch. 7). Here, we wrote q and p to be the same for all networks, but no generality is lost if the implementation allows fixing selected elements of β at 0. Strictly speaking, this can be viewed as a submodel of the model proposed by Slaughter and Koehly (2016) in a Bayesian framework, whose random effects can be reexpressed, a priori, as θ_s i.i.d. multivariate normal.
The elements of β control how network covariates affect ERGM parameters for both exogenous and endogenous network properties, and they do so in a familiar fashion, analogous to the linear predictor of a GLM-and as we show in later sections, we can borrow familiar GLM diagnostics as well.
Example: Network size effects For a given type of social setting (e.g., classroom, household), bigger networks will typically have lower density (|y|/{n(n − 1)/2} for undirected networks), with mean degree (|y|/(n/2)) being close to invariant to size; but the "default" ERGM behaviour is to preserve network density (Krivitsky et al., 2011), so that mean degree grows in proportion to n. Krivitsky et al. proposed to adjust this behaviour by an offset term of the form −log(n)|y|: other things being equal, the odds of a relation in a network of size n are scaled by n^{−1}, stabilising the mean degree. But their result is asymptotic, relies on sparsity, and only adjusts lower-order properties (density, mixing, and, fortuitously, the degree distribution).
Butts and Almquist (2015) proposed that, given a sample of networks, the effect of network size on density could be estimated from data by multiplying the log(n) above by a free parameter, say γ, rather than by −1, making the mean degree approximately proportional to n^{γ+1}. Here, we can do this by setting z_{s,k} = log(n_s) and g_{s,l}(y_s) = |y_s| for some indices k and l; then γ ≡ β_{k,l}. Considering that our networks are small and dense, we can model a nonlinear network size effect by adding a quadratic covariate z_{s,k+1} = log²(n_s), with β_{k+1,l} then becoming its coefficient. Alternatively, orthogonal polynomial contrasts, a spline, or dummy variables could be used.
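To see why the coefficient on log(n) behaves this way, consider the Bernoulli (edges-only) special case, where each tie has log-odds θ₀ + γ log(n); the parameter values below are arbitrary illustrations. With γ = −1 (the offset of Krivitsky et al.) the expected mean degree stabilises as n grows, while γ = 0 preserves density, so mean degree grows roughly linearly:

```python
from math import exp, log

def expit(x):
    return 1 / (1 + exp(-x))

def mean_degree(n, theta0, gamma):
    """Expected mean degree in a Bernoulli graph whose per-tie log-odds are
    theta0 + gamma * log(n); parameter values here are arbitrary."""
    return (n - 1) * expit(theta0 + gamma * log(n))

# gamma = -1: mean degree stabilises as n grows.
print([round(mean_degree(n, 1.0, -1.0), 2) for n in (10, 100, 1000)])
# → [1.92, 2.62, 2.71]

# gamma = 0: mean degree grows roughly linearly in n.
print([round(mean_degree(n, 1.0, 0.0), 2) for n in (10, 100, 1000)])
```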

Inference
We now describe this framework's inferential properties. Let I_d be an identity matrix of dimension d, let ⊗ be the Kronecker product, and let Z_s def= I_p ⊗ z_s. Then, we can reexpress θ_s = (z_s β)′ ≡ Z_s vec(β), yielding an exponential family with complete-data likelihood Pr(Y_s = y_s; β) ∝ exp{g_s(y_s, x_s) · Z_s vec(β)}, and define μ_s(β | y^obs_s) def= μ_s{Z_s vec(β) | y^obs_s}, and analogously μ_s(β), Σ_s(β | y^obs_s), and Σ_s(β). Then, the partially observed Fisher information is

I(vec β) = Σ_{s=1}^S Z_s′ [Σ_s(β) − E_{Y_s}{Σ_s(β | obs(Y_s))}] Z_s (2a)
         = Σ_{s=1}^S Z_s′ Var_{Y_s}[μ_s{β | obs(Y_s)}] Z_s. (2b)

For those networks in the sample that are completely observed, μ_s(β | y^obs_s) ≡ g_s(y_s) and Var_{Y_s}[μ_s{β | obs(Y_s)}] ≡ Σ_s(β). Since this is an independent sample of networks, consistency and asymptotic normality of the estimator of β in S can be shown (Sundberg, 1974), provided the sampling process is noninformative and I(vec β) is nonsingular asymptotically, which requires the model to be identifiable.
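The reexpression θ_s = z_s β ≡ Z_s vec(β) is just the identity vec(AB) = (I ⊗ A) vec(B) with column-major vec(·); a quick numerical check (dimensions and values are arbitrary):

```python
import numpy as np

p, q = 3, 2                          # p statistics, q network-level covariates
rng = np.random.default_rng(0)
beta = rng.normal(size=(q, p))       # q x p parameter matrix
z_s = rng.normal(size=(1, q))        # row q-vector of covariates for network s

theta_s = (z_s @ beta).ravel()       # network-level ERGM parameters (z_s beta)'

vec_beta = beta.flatten(order="F")   # column-major vec(beta), length p*q
Z_s = np.kron(np.eye(p), z_s)        # Z_s = I_p (kron) z_s, shape p x (p*q)
assert np.allclose(Z_s @ vec_beta, theta_s)
```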

Diagnosing Multivariate Linear ERGMs
Whether or not the estimation can be consistent and the inference be generalised to a broader population of households depends on the model being identifiable given available data and on its goodness-of-fit-both of which must take into account that at least some of the networks in the sample are partially observed. Here, we discuss likely causes and diagnostics for nonidentifiability, develop a generalisation to residual diagnostics for partially observed networks, and consider a variety of ways in which the model may fit poorly to the sample and how to diagnose them.

Causes and Diagnostics for Nonidentifiability
The key condition for consistency in Sundberg's result is that the Fisher information I(vec β) be nonsingular. Substantively, there are a number of reasons this condition might not be satisfied.
Nonidentifiable model specification A model may erroneously contain a relationship type or other network feature that is not possible in any potentially sampled network. For a trivial example, counting the number of connections between adults and children is not meaningful in a survey of households without children, and similarly counting 2-stars in households of size 2. Likewise, given the large number of potential network features and the large number of potential network-level covariates, it is not difficult to inadvertently specify a model that is not of full rank. An example is network size as a covariate under a sampling process that observes networks of only one distinct size, or a quadratic network size effect when only two distinct sizes are observed. Then the minuend of (2a) (i.e., Σ(β)) has, respectively, zeros on the diagonal or a linear dependence, and the model is not identified even under complete observation.
This form of nonidentifiability can generally be detected during the course of estimation by examining the variance-covariance matrices of simulated sufficient statistics.
Network observation process not informative of the model If the sampling process entails partially observed networks, some combinations of observation process and model specification may make an otherwise identifiable model nonidentifiable.
Example 1 Consider an undirected network with actors partitioned into groups A and B. A 3-parameter model whose statistic comprises the counts of all edges, of edges within group A, and of edges between members of A and members of B is identifiable, and its Σ(θ) is full-rank. But if only relationships incident on members of group A (A-A and A-B) are observed, while B-B relations are missing by design, then the elements of μ(θ | y^obs) are affinely dependent, and so I(θ) is singular. (See Appendix B.1.)

Example 2 For an i.i.d. sample of S 3-node undirected networks, it is possible to estimate a 3-parameter model with the edges, 2-stars, and triangles sufficient statistic, but not if any one of the 3 possible relations is unobserved in each network: a direct enumeration of the sample space in Appendix B.2 shows that (2b) is singular.
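Example 2's singularity can be verified numerically at θ = 0 (a sketch of the enumeration; here we fix the unobserved dyad to be the same one, {0, 1}, in every network, one of the variants the example covers):

```python
import numpy as np
from itertools import product

# Statistics g = (edges, 2-stars, triangles) for a 3-node undirected graph
# given by its dyad indicators (y01, y02, y12).
def stats(y01, y02, y12):
    deg = [y01 + y02, y01 + y12, y02 + y12]
    two_stars = sum(d * (d - 1) // 2 for d in deg)
    return np.array([y01 + y02 + y12, two_stars, y01 * y02 * y12], float)

graphs = list(product([0, 1], repeat=3))
g = np.array([stats(*y) for y in graphs])

# At theta = 0 all 8 graphs are equally likely; the complete-data Fisher
# information is the covariance of g, which is full-rank (identifiable).
Sigma = np.cov(g.T, bias=True)

# Suppose dyad {0,1} (y01) is never observed: mu(theta | y_obs) averages g
# over the two completions consistent with each observable pattern.
mu = np.array([np.mean([stats(v, y02, y12) for v in (0, 1)], axis=0)
               for (_, y02, y12) in graphs])

# The Fisher information in the variance form (2b) is now rank-deficient.
I_miss = np.cov(mu.T, bias=True)
print(np.linalg.matrix_rank(Sigma), np.linalg.matrix_rank(I_miss))  # → 3 2
```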
This type of nonidentifiability is more insidious. Its main symptom is that intermediate estimates of the difference in (2a) are not positive definite; but the algorithm of Handcock and Gile (2010) obtains this difference by subtracting the two simulated variance-covariance matrices, and for data with high missingness fraction and models with many parameters in particular, this symptom could be triggered by Monte Carlo error.

Residual Diagnostics for Partially Observed Networks
Traditional model diagnostics, whether for linear regression or for ERGMs (Hunter et al., 2008a), work by comparing the observed data points to those predicted by the fitted model. The approach of Hunter et al. in particular is to simulate networks from the fitted model and compare the statistics of the simulated networks, particularly statistics not in the original model, to their observed values. If the observed value falls outside the range of the simulated, lack of fit is indicated. However, most of the networks in the E dataset are partially observed, which means that there is no "true" observed value for a network feature. We therefore develop the theory for diagnostics for partially observed networks.
For notational convenience, let ỹ = [y_s]_{s=1}^S refer to a vector of completely observed networks. Consider a real-valued function t(ỹ) that evaluates a particular network feature of interest, either cumulatively over all of the networks or for a specific network. Analogously to μ(·) and Σ(·) in Section 3.1, let τ(β) def= E{t(Ỹ); β} and Ψ(β) def= Var{t(Ỹ); β}, and likewise for the conditional expectations.
We can form a standardised (Pearson) residual for t(ỹ) by evaluating

R_t = {t(ỹ) − τ(β̂)} / Ψ(β̂)^{1/2}, (3a)

with the expectation and the variance estimated by simulating from the fitted model. Under the true model, this residual would, by construction, have mean 0 and variance close to 1; this also facilitates outlier detection. If the networks are not completely observed, t(ỹ) cannot be evaluated directly, so it is natural to replace it with its empirical best predictor, τ(β̂ | ỹ^obs) def= E{t(Ỹ) | Ỹ ∈ Y(ỹ^obs); β̂} (Hunter et al., 2008b; Stewart et al., 2019; Krivitsky et al., 2021), where ỹ^obs is defined analogously to ỹ. Then,

R_t = {τ(β̂ | ỹ^obs) − τ(β̂)} / Var[τ{β̂ | obs(Ỹ)}]^{1/2}. (3b)

Estimating the variance in the divisor in (3b) is not trivial. We discuss it in Appendix C.1.
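A minimal sketch of the completely observed residual, with a Bernoulli model standing in for a fitted ERGM (all values hypothetical):

```python
import random
from statistics import mean, stdev

def pearson_residual(t_obs, simulate_t, nsim=2000, seed=1):
    """Standardised residual: observed statistic vs. its simulated
    distribution under the fitted model."""
    random.seed(seed)
    sims = [simulate_t() for _ in range(nsim)]
    return (t_obs - mean(sims)) / stdev(sims)

# Stand-in "fitted model": Bernoulli(0.8) ties among 5 actors (10 dyads).
n, p = 5, 0.8
ndyads = n * (n - 1) // 2
sim_edges = lambda: sum(random.random() < p for _ in range(ndyads))

# Observing the expected edge count (10 * 0.8 = 8) gives a residual near 0;
# an implausibly sparse network gives a large negative residual.
r_ok = pearson_residual(8, sim_edges)
r_bad = pearson_residual(2, sim_edges)
```

For a partially observed network, the observed value t_obs would itself be replaced by the conditional-simulation estimate of the empirical best predictor, as in (3b).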

Causes and Diagnostics for Lack-of-Fit
Within-network It may be the case that the within-network model fits poorly. Network statistics used for diagnostics by Hunter et al. (2008a) include the full degree distribution, the counts of shared partners (i.e., for a given pair of connected actors, how many common connections do they have?), and the distribution of geodesic distances. All of these can be used as t(·), but doing so may be impractical for two reasons. Firstly, family networks are relatively small and very dense, which makes the typically used statistics less informative. Secondly, the sheer number of networks in the dataset means that diagnosing each network individually is impractical, while pooling their within-network diagnostics is likely to wash out any effects because of their heterogeneity. Nonetheless, even if a statistic is suboptimal and difficult to interpret, for a model that fits well, R_{t,s} will still have mean 0 and variance close to 1.

Between-network
It may be the case that the model for the network-level parameters (θ_s) as a function of the global parameters (β) fits poorly: in particular, it may fail to account for network size and composition effects. At the network level, the model has a form similar to that of a GLM. We can thus use the developments of Section 4.2 directly to make familiar diagnostic plots: for some statistic t_s(ỹ) def= t(y_s) (e.g., density), one can plot the residuals R_{t_s} for s = 1, . . . , S against their respective E{t_s(Ỹ)} (the fitted values) or against a candidate predictor. Or, we can plot |R_{t_s}| instead for a scale-location plot, analogously to the standard diagnostic plots in R (R Core Team, 2021).
One can also use these residuals to bypass the computational cost of refitting an ERGM for each candidate specification: regressing the R_{t_s} on their respective z_{s,new} can be used to assess the explanatory power of a candidate predictor z_new. Since the individual networks are independent, the residuals should be nearly independent as well.
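A sketch of this screening regression on synthetic residuals (the data and the function name are illustrative, not from the paper's implementation):

```python
import numpy as np

def screen_predictor(residuals, z_new):
    """Regress network-level Pearson residuals on a candidate covariate and
    return the t-statistic of its slope; a large |t| flags systematic
    variation the current model leaves unexplained."""
    X = np.column_stack([np.ones_like(z_new), z_new])
    coef, *_ = np.linalg.lstsq(X, residuals, rcond=None)
    resid = residuals - X @ coef
    sigma2 = resid @ resid / (len(residuals) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1] / se

# Synthetic example: 200 networks and a candidate covariate z.
rng = np.random.default_rng(42)
z = rng.normal(size=200)
r_unrelated = rng.normal(size=200)            # model already fits: no signal
r_related = 0.8 * z + rng.normal(size=200)    # lack of fit along z
```

Here screen_predictor(r_related, z) is large in magnitude while screen_predictor(r_unrelated, z) is not, so z would be promoted into the model only in the first case, without ever refitting the ERGM.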
Between-dataset If we wish for the fitted model to generalise and render the sampling designs ignorable, the model must account for differences between the datasets without incorporating dataset effects directly. This can be checked via a hypothesis test, such as a simulation-based score test along the lines of that described by Krivitsky (2012) in the context of valued ERGMs, testing the significance of an explicit dataset effect without refitting the model. Details are given in Appendix C.2.
Non-systematic heterogeneity Lastly, even if there is no systematic bias in the model, there may be between-network heterogeneity due to unobserved factors. The above-described Pearson residuals incidentally provide us with a way to tell whether there is any heterogeneity left to explain: if there is none, R ts in (3) will, by construction, have mean 0 and variance around 1.

Application
We now return to the datasets we have introduced in Section 2, discuss model specification, and report resulting model diagnostics and results.
We have implemented the methodology described above in an extension to the ergm package for R.

Model
A model used to join these two datasets must be substantively meaningful and interpretable and must account for within-network conditional dependence among the relations, while making the network size, composition, and dataset effects ignorable to enable generalisable inference. We therefore dedicate a significant amount of attention to formulating and justifying each of its elements.

Design effects
If we wish to generalise our inference to the population of households, our model must make any design effects ignorable. The substantively motivated effects described below already control for some of those. For example, recall that the H dataset's households were omitted if there was nonresponse from even a single member. To the extent that the nonresponse rate is a function of household size (i.e., the bigger the household, the more likely it is that there is at least one nonresponse), a model that accurately controls for network size will reduce the informativeness of this nonresponse.
To account for the fact that only the H dataset incorporated households in Brussels, we must add an indicator of the household being in Brussels as a covariate.
Both surveys' selection was strongly affected, in different ways, by participants' and household members' ages, particularly those of children. We can adjust for the latter by including the presence of a child in a household as a covariate, or as a part of the model for interactions between different household roles. More generally, when modelling age effects, there is a tension between interpretability and accuracy: family roles are most conveniently modelled with discrete age categories, but outside of a few critical ages defined exogenously (e.g., school attendance, legal adulthood, and retirement), age effects are likely to be continuous, best modelled semiparametrically (e.g., with splines). Our compromise is to use relatively fine-grained age categories, discussed in the next section.
Household roles In order to study interaction patterns among household members, each member was assigned to a category constructed to approximate household roles, which were unfortunately not recorded in the original survey. A first classification was made according to age: young child (below 6 years of age), preadolescent (6 to 12 years), adolescent (13 to 18 years), younger adult (19 to 24 years), older adult (25 to 60 years), and senior (older than 60 years). We further divided older adults according to gender, to investigate the possibility of gender-specific interactions (Goeyvaerts et al., 2018), ending with a total of 7 categories (young child, preadolescent, adolescent, younger adult, older female adult, older male adult, senior). The age cutoff of 12 was chosen specifically to account for the design effect.
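The classification just described can be sketched as a simple rule (our reading of the category boundaries; the survey's exact age coding may differ):

```python
def household_role(age, gender=None):
    """Assign one of the 7 role categories from age in years and, for
    older adults, gender ('F' or 'M'). A sketch of the paper's scheme."""
    if age < 6:
        return "young child"
    if age <= 12:
        return "preadolescent"
    if age <= 18:
        return "adolescent"
    if age <= 24:
        return "younger adult"
    if age <= 60:
        return "older female adult" if gender == "F" else "older male adult"
    return "senior"
```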
We then modelled mixing by counting the number of contacts between pairs of household roles. Our data on some age categories are limited, both for the design reasons already discussed and because some age groups, particularly young adults and seniors, have a higher propensity to live alone or in a dormitory or a nursing home and thus be excluded from both datasets. For this reason, and for reasons of substantive parsimony, not every combination was modelled individually. In particular, interaction counts of adolescents with young children and preadolescents were combined, as were those of young adults with preadolescents and adolescents. (Young adults with young children were retained as a separate count, because their chances of being parents of a young child were relatively high.) Similarly, interactions of young adults with both older adult categories were combined, and so were all interactions of seniors with non-seniors.
To further incorporate the design effect of only households with a child 12 or under being eligible for inclusion in H, we added an effect for mixing between older male adults and older female adults in those households that did not have such a child.
Network size effects As discussed in Section 3.2, the effect of network size on our networks is not trivial; we use the polynomial effects of log n s on density described there to model it. In the analysis of the H dataset by Goeyvaerts et al. (2018), three different density parameters were used depending on household size.
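The construction of such size-varying density effects can be sketched as follows: a density coefficient that is polynomial in log n_s is equivalent to including the edge count interacted with successive powers of log n_s as separate sufficient statistics. The function below is a hypothetical Python illustration of this encoding (it is not the ergm.multi implementation):

```python
import math

def size_adjusted_density_stats(edge_count, n, degree=2):
    """Sufficient statistics encoding a density effect whose coefficient
    is a polynomial in log(n): the network's edge count interacted with
    successive powers of log(n), one statistic per power."""
    return [edge_count * math.log(n) ** k for k in range(degree + 1)]

# A hypothetical sample of households: (observed edge count, size n_s).
sample = [(3, 3), (6, 4), (1, 2)]
stats = [size_adjusted_density_stats(e, n) for e, n in sample]
```

The coefficients on the resulting statistics then play the roles of the intercept, linear, and quadratic terms of the size-varying density effect.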
Other network-level effects Some of the surveys were conducted on a weekend and others on a weekday (Table A1). Past literature (Goeyvaerts et al., 2018, for example) suggests that contact patterns may differ depending on the day, and so we incorporate this effect as well.
Contact patterns may differ systematically between families that live in detached housing and families that live in apartments. This potential effect has received limited attention in the literature to date. Our data do not include housing type but do include postal codes, whose population densities can then be used as a proxy for housing type. We use this potential predictor to illustrate the technique proposed in Section 4.3 of regressing the network-level residuals on potential network-level predictors. Alternatively, we may ask whether or not the postal code is in one of Belgium's larger cities.

Endogenous effects
To model actor heterogeneity and triadic closure, we use the 2-star and triangle counts given in Section 3.1. We reiterate the cautions about ERGM degeneracy, and further guard against it by allowing their coefficients to vary with network size as suggested by Goeyvaerts et al. (2018), using the same approach as for density.
An additional caveat here is that the E dataset, by virtue of only containing relations incident on one individual per household, does not contain information about triadic closure. (See Section 4.1 Example 2 and Appendix B.2.) We thus assume that net of all other effects, the effect of triadic closure on a household of a given size that does not have a child is the same as the effect of triadic closure on a household of that size that does have a child. It is not possible to test this assumption with the available data.
Other household properties (location, population density, and weekday) may also be predictive not only of density but also of higher-order structural features, and they can be incorporated, but not without cost. We thus begin with a model containing only density effects of location in Brussels and of whether the survey had been done on a weekend, and use its diagnostics to suggest further effects.
We call this starting model Model 1.

Diagnostics
We now apply the techniques from Section 4 to the proposed model. To validate our diagnostic techniques, we also fit a number of reduced models: in Appendix D, we demonstrate how we can identify their deficiencies.
Additional substantive models As proposed in Section 5.1, we regressed the Pearson residuals for edges, 2-stars, and triangles on a number of candidate covariates, with full results given in Table E16. We found the linear effect of log-population-density to be the most promising covariate. Interestingly, its apparent effect (higher χ² or lower P-value) is on the 2-star and the triangle statistics, rather than on the edge count statistic. To test whether this distinction has predictive power, we fit two more models: Model 2, adding the effect of population density on edges, 2-stars, and triangles; and Model 3, adding its effect on edges only. For completeness, we also fit Model 4, adding the effects of the Brussels postal code indicator, the weekend indicator, and population density on all of edges, 2-stars, and triangles. Specifications for all models are summarised in Table E1, and their complete results and diagnostics are provided in Appendix E.
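The residual regression used here can be sketched as follows, on synthetic data: network-level Pearson residuals, which have approximately unit variance under a well-specified model, are regressed on a candidate network-level covariate, and a 1-degree-of-freedom Wald χ² is computed for the slope. All names and data below are hypothetical:

```python
import numpy as np

def residual_regression_chi2(resid, x):
    """Regress network-level Pearson residuals on a candidate covariate
    and return the 1-df Wald chi-squared statistic for the slope.
    Pearson residuals have roughly unit variance under a well-specified
    model, so ordinary least squares is used."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, resid, rcond=None)
    cov = np.linalg.inv(X.T @ X)   # Var(beta) when Var(resid) = 1
    return beta[1] ** 2 / cov[1, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=500)                 # e.g., log population density
resid = 0.3 * x + rng.normal(size=500)   # residuals with a real trend
chi2 = residual_regression_chi2(resid, x)
```

A large χ² relative to the χ² reference distribution with 1 degree of freedom flags the covariate as a promising addition to the model.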
Unaccounted-for between-dataset differences For visual diagnostics, it is helpful to distinguish two "subdatasets" of the E dataset: those households with a child 12 years old or under, which could have been in the H dataset (E_Y), and those without, which could not (E_N). Selected residual plots are provided in Figure 2. We also provide smoothing lines for each subdataset individually: these lines diverging would indicate that the model had failed to account for some systematic difference between the datasets.
Overall, the distributions of the residuals are skewed downward. This is to be expected: the underlying statistics are highly skewed. We see that the residuals for the H dataset and the E_Y subdataset coincide, and that the residuals for the E_N subdataset diverge slightly. The apples-to-apples comparison of H and E_Y suggests that the model fuses the two datasets well, though it does not represent E_N households perfectly.
The simulated P-values for edges and 2-stars are, respectively, 0.015 and 0.071, with the omnibus P-value 0.036. Thus, at the conventional significance level, it appears that there are differences in network density that are not accounted for by the model.
We examine the practical significance of this via Figure 3. Focusing on the two comparable groups (the H dataset and the subset of E with a child present), we see that there may be some systematic underestimation of network density for H households and overestimation for E households (taking into account the number of households of each network size); but also that the differences are small in magnitude and are likely to be of no practical significance.
Outliers Our residual plots reveal some outlier networks. For example, observation 1429 (E household #1111), highlighted in Figure 2a and other panels, stands out because it is a family of four with two young children but no contacts reported by the father on the day of the survey. Similarly, the respondent in observation 798 (E #480) is an infant with no reported physical contacts, which could indicate a recording error.

Network size effects
The residuals against the network size are also shown in Figure 2b. A model that fails to account for network size would display a linear or curved pattern in the residuals. We see no evidence of such a pattern. We further test this by regressing the residuals for the edges, 2-stars, and triangles statistics for each network on that network's size, weighting them by the inverse of their estimated variance (from the divisors in (3)). If the model controls for network size effects adequately, these residuals will have mean zero for each network size, which we test using an omnibus ANOVA test, and we find that none are statistically significant (P-vals. 0.80, 0.69, and 0.69, respectively): network size appears to be accounted for.
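The omnibus test of zero mean residuals within each network size can be sketched as follows (a minimal, unweighted one-way ANOVA; the `anova_f` name is ours, and in the analysis above the residuals would additionally be weighted by their inverse estimated variances):

```python
import numpy as np

def anova_f(residuals, sizes):
    """One-way ANOVA F statistic testing whether mean residuals differ
    across network sizes; under a model that controls for size
    adequately, every group mean should be near zero."""
    groups = {}
    for r, n in zip(residuals, sizes):
        groups.setdefault(n, []).append(r)
    groups = [np.asarray(g, dtype=float) for g in groups.values()]
    k = len(groups)                      # number of size groups
    N = sum(len(g) for g in groups)      # total number of networks
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Toy example: residuals for networks of sizes 2 and 3.
f_stat = anova_f([1, 2, 3, 2, 3, 4], [2, 2, 2, 3, 3, 3])
```

The F statistic is referred to an F distribution with (k - 1, N - k) degrees of freedom to obtain the omnibus P-value.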

Non-systematic heterogeneity
We report the standard deviations of the Pearson residuals for edges, 2-stars, and triangles, both for the dataset as a whole and for each of the subdatasets, in Table E18. We find no evidence of overdispersion, though standard deviations generally below 1 suggest some degree of overfitting.
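This diagnostic amounts to comparing the sample standard deviation of the Pearson residuals with 1; a minimal sketch (function name ours):

```python
import math

def pearson_residual_sd(residuals):
    """Sample standard deviation of Pearson residuals: values well above 1
    suggest unmodelled between-network heterogeneity (overdispersion);
    values well below 1 suggest some degree of overfitting."""
    n = len(residuals)
    mean = sum(residuals) / n
    return math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
```

Under a well-specified model the residuals are approximately standardised, so their standard deviation should be close to 1.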

Continuous age effects
We plot Pearson residuals of the total number of within-household contacts of individuals of age a against a in Figure 2d. We see, as with other diagnostics, that H dataset density tends to be slightly underpredicted, whereas comparable households in the E dataset are overpredicted, and there are slight trends in the residuals within age categories, a consequence of discretisation. We also observe that a senior's propensity to interact appears to drop off as they become older, which the cut-off at 60 does not capture. On the other hand, the number of actors affected is relatively small.

Model comparison
Table 1 gives the parameter estimates for Model 1, our initial model, and Model 2, suggested by our residual diagnostics. Parameter estimates for Model 3 and Model 4 and measures of fit are given in Tables E23, E27, and E2, respectively. AIC is indifferent between Model 1 and Model 2 within the margin of error and prefers them over Models 3 and 4. The evidence for the effect of population density on the edge count is stronger in Model 2 (β = 0.18, SE = 0.071, P-val. = 0.01), in the presence of its effect on 2-stars and triangles, than it is in Model 3 (β = 0.05, SE = 0.033, P-val. = 0.17). This is as predicted by the residual regression discussed in Section 5.2: there does not appear to be a net effect of population density on the density of the contact networks in general, but there is an effect on higher-order properties, which, through being correlated with network density, makes its effect appear significant.
Notably, an omnibus test of all population density effects in Model 2 is not significant at the conventional level (χ² = 7.4, df = 3, P-val. = 0.06), so whether population density in fact has an effect, or whether we had found the best possible way to model it, is questionable; we leave these questions for future work, except to suggest that type of housing should be considered in future data collection.

Substantive interpretation
We discuss results primarily from Model 1, though Model 2 yields the same conclusions. Only a few of the effects are interpretable in isolation. In particular, we can conclude with some confidence (−β = 0.15, SE = 0.062, P-val. = 0.015) that weekends have a positive effect on the number of contacts observed in the household, in line with prior literature (Grijalva et al., 2015). Presence of a child in a household is associated with a higher propensity of older adult females and older adult males (i.e., likely the parents) to interact with each other (if no child: β = −0.8, SE = 0.25, P-val. = 0.0014). We do not find an effect of the household being in Brussels on network density (β = 0.1, SE = 0.20, P-val. = 0.54).
Interactions between age categories reveal tendencies in interaction between household members. We report the parameter estimates in a more intuitive layout in Figure 5. Since we do not use a baseline category ("intercept"), it is not meaningful to interpret them in isolation but only in contrast with each other. We can, for example, conclude that older female adults (i.e., mothers) tend to interact more than older male adults do.

Conclusion
Motivated by two collections of networks representing the same phenomena but collected using very different sampling designs, we developed a framework for combining the strengths of the two, facilitating population-wide simulation of household networks. In the process, we identified the requirements of this procedure and developed generally applicable techniques for specifying and diagnosing models for large samples of networks, techniques that, through their relationship to GLMs, can be used by researchers from a wide variety of disciplines. In addition, the techniques we have developed do not rely on the networks being completely observed. To make this methodology accessible to a broad audience, we produced a user-friendly R package ergm.multi.
Our two surveys were conducted in Flanders and Brussels in 2010-2011. It is important to design and analyse household surveys in different settings with different inclusion criteria (but, ideally, compatible measurement instruments) to gain further insights into contact patterns and the effects of endogenous factors such as triadic closure, exogenous individual attributes such as age, and exogenous household attributes such as size and type of residence. This work provides a foundation for identifying and testing these effects and for confirming the validity of the analysis, and opens the door to the design of future cost-effective yet highly informative hybrid network studies.
A number of methodological research directions remain. In our work, we used Pearson residuals. Other types of residuals, such as deviance residuals, tend to be better behaved and could, perhaps, be derived for this family of models. Similarly, Cook's distance may be possible to compute inexpensively for each network. We did not find evidence of non-systematic heterogeneity of networks. Where such heterogeneity is present, it can be accounted for in a mixed effects framework (Slaughter and Koehly, 2016) at additional computational cost, or perhaps by constructing ERGM sufficient statistics to absorb the variation (Butts, 2017; Krivitsky, 2012). Alternatively, quasi-likelihood and generalised estimating equation approaches may be extended to samples of networks.
ERGM computational and diagnostic techniques are agnostic to the structure of the sample space, so these approaches directly generalise to directed, temporal, valued, and multilayer network scenarios. For our two surveys in particular, physical contact was not the only relational measurement: the respondents were also asked about the approximate duration of interaction (close proximity) on an ordinal scale (time ranges

A Additional Data Summaries
Data summaries critical to the analysis were included in Section 2.2 in the body of the article. This appendix provides additional summaries and visualisations. Figure A1 shows the distributions of household sizes and of ages and genders of household members in the two datasets. Differences between the two are due to design, with individuals aged around 25 years being underrepresented in the H dataset (Panel (a)), as they are unlikely to be either parents or children of parents of young children. The members of the households in the E dataset (Panel (b)) are more representative of the population (though not completely so, as individuals living alone (n_s = 1) are excluded); but survey respondents (Panel (b), bottom) have a different age and gender distribution from E household members in general, with females aged 25-55 years overrepresented and adolescents of both genders underrepresented.
More household and structural summaries are provided in Table A1.

B.1 Example 1: Exogenous Subgroups
Recall an undirected network whose actor set N is partitioned into disjoint sets A and B. For actor sets X ⊆ N and Y ⊆ N, let |y_{X,Y}| be the number of edges between actors in sets X and Y, and let |y| be the total edge count, and consider an ERGM with 3 parameters, whose sufficient statistic is g(y) = (|y|, |y_{A,A}|, |y_{A,B}|). This statistic does not induce dependence among the dyads. Now suppose that all dyads except those within B are observed; then y_obs contains all of the information needed to compute g_2(·) and g_3(·), while E(|Y_{B,B}| | y_obs) does not depend on y_obs. Thus, µ_1(θ | y_obs) is an affine combination of µ_2(θ | y_obs) and µ_3(θ | y_obs) for all y_obs, making I(θ) singular. More explicitly, from (2b), the first column of I(θ) is the sum of its second and third.
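The singularity of I(θ) in this example can be verified numerically. The sketch below assumes a hypothetical 4-actor network with A = {1, 2} and B = {3, 4}, a dyad-independent Bernoulli ERGM with the statistic g(y) above, an arbitrary θ, and the within-B dyad unobserved; it checks that the first column of I(θ) = Var{µ(θ | Y_obs)} equals the sum of the other two:

```python
import itertools, math
import numpy as np

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical 4-actor network: A = {1, 2}, B = {3, 4}. Observed dyads:
# the within-A dyad and the four A-B dyads; the within-B dyad is missing.
theta = np.array([0.2, -0.5, 0.3])      # (edges, within-A, A-B)
p_AA = sigmoid(theta[0] + theta[1])     # within-A tie probability
p_AB = sigmoid(theta[0] + theta[2])     # A-B tie probability
p_BB = sigmoid(theta[0])                # within-B tie probability

mus, probs = [], []
for y in itertools.product([0, 1], repeat=5):   # (AA, AB, AB, AB, AB)
    g2, g3 = y[0], sum(y[1:])
    pr = p_AA if y[0] else 1 - p_AA
    for e in y[1:]:
        pr *= p_AB if e else 1 - p_AB
    # Conditional expectation of g(Y) given the observed dyads:
    mus.append([g2 + g3 + p_BB, g2, g3])
    probs.append(pr)

mus, probs = np.array(mus), np.array(probs)
mean = probs @ mus
# I(theta) = Var{mu(theta | Y_obs)} over the observed-data distribution:
I = (mus - mean).T @ np.diag(probs) @ (mus - mean)
```

Since µ_1 differs from µ_2 + µ_3 only by the constant p_BB, the covariance matrix above is singular by construction, matching the affine-combination argument.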

B.2 Example 2: Triadic Effects
Now, consider an independent, identically distributed sample of S undirected networks of size n = 3 and an ERGM whose sufficient statistic counts edges (|y|), 2-stars (Σ_i C(|y_i|, 2)), and triangles in network y, where |y_i| is the number of ties incident on actor i. Each network in the sample has a sample space of size |Y| = 2^(3 choose 2) = 8, enumerated in Table B1. For simplicity, we will set θ = 0, making all networks equiprobable. Then, we can evaluate the population covariance matrix and obtain |I(0)| = (9/4096)S^3. Notably, for S = 1, the MLE for θ_3 will be infinite due to the sufficient statistic being at its highest or lowest possible value, but this is unrelated to nonidentifiability. Now, consider an egocentric missing data regime, where relations between Actor 1 and the others are observed but the relationship between Actor 2 and Actor 3 (Y_{2,3}) is missing. The conditional expectations µ(θ | y_obs) and probabilities Pr(Y_obs = y_obs) for every possible y_obs are given in Ta
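The value |I(0)| = (9/4096)S³ can be checked by brute-force enumeration at S = 1 (a self-contained sketch; statistic definitions as above):

```python
import itertools
import numpy as np

def stats(y12, y13, y23):
    """(edges, 2-stars, triangles) for a 3-actor undirected network
    given its three dyad indicators."""
    deg = [y12 + y13, y12 + y23, y13 + y23]
    edges = y12 + y13 + y23
    two_stars = sum(d * (d - 1) // 2 for d in deg)
    triangles = 1 if edges == 3 else 0   # n = 3: triangle iff complete
    return (edges, two_stars, triangles)

# Enumerate all 8 networks on 3 actors and their sufficient statistics.
g = np.array([stats(*y) for y in itertools.product([0, 1], repeat=3)],
             dtype=float)

# At theta = 0 all 8 networks are equiprobable, and for S i.i.d.
# networks I(0) = S * Cov{g(Y)}.
cov = np.cov(g, rowvar=False, bias=True)   # population covariance
det = np.linalg.det(cov)                   # matches 9/4096 at S = 1
```

Because |I(0)| = S³ |Cov{g(Y)}|, the determinant of the enumerated covariance matrix reproduces the (9/4096)S³ formula.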

Direct method
Substituting simulated values into the expectation and the variance yields a consistent (as R1 → ∞ and R2 → ∞) estimator. This direct estimator for the variance of the conditional expectation in the divisor in (3b) has the following form:
σ̂²_direct = (R1 − 1)⁻¹ Σ_{r1=1}^{R1} {t(Y^(r1,·)) − t(Y^(·,·))}²,
where t(Y^(r1,·)) = R2⁻¹ Σ_{r2=1}^{R2} t(Y^(r1,r2)) is an estimator of the inner expectation for a given obs(Y^(r1)), and t(Y^(·,·)) = R1⁻¹ Σ_{r1=1}^{R1} t(Y^(r1,·)) is the grand mean. Here, we derive an approximate expression for its bias, which arises because the variance captures the additional variability (decreasing in R2) in estimating the inner expectation by simulation.
Assuming that R1 is sufficiently large that the variance of t(Y^(·,·)) is negligible, and since E{t(Y^(r1,·)) − t(Y^(·,·))} = 0, this estimator is biased, with bias decreasing as a function of R2 and arising from the simulation variability of the inner means. An adjusted estimator can then be obtained by subtracting an estimate of this bias.
Law of Total Variance method This approach takes advantage of the Law of Total Variance to write the target variance Var[E{t(Y) | Y_obs}] as the difference between the total variance Var{t(Y)} and the expected conditional variance E[Var{t(Y) | Y_obs}]: an unbiased (except perhaps from MCMC autocorrelation) estimator with the added benefit that some of the simulation error in the minuend and the subtrahend cancels. We derive its properties as follows.
To approximate its variance, we must assume a conditional distribution

Its variance can be derived similarly. A corollary of this derivation is that increasing R2 can only increase the precision up to a point, whereas as long as R2 ≥ 2, increasing R1 increases precision without limit. Nonetheless, if σ⁴ ≪ α, it may be worthwhile to increase R2, since conditional simulation is computationally cheaper.
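The contrast between the direct and Law of Total Variance estimators can be illustrated on a toy hierarchical normal model standing in for the ERGM simulations (all parameter values hypothetical): with Var{E(t | Y_obs)} = 1 and E[Var{t | Y_obs}] = 4, the direct estimator is biased upward by approximately σ_w²/R2 = 2, while the adjusted and Law of Total Variance estimators recover the target:

```python
import numpy as np

rng = np.random.default_rng(42)
R1, R2 = 200_000, 2
sigma_b, sigma_w = 1.0, 2.0   # target Var{E(t|Y_obs)} = 1; E{Var(t|Y_obs)} = 4

# Outer draws stand in for simulated observed data Y_obs^(r1); inner
# draws stand in for conditional simulations Y^(r1, r2).
mu = rng.normal(0.0, sigma_b, size=R1)                   # E(t | Y_obs)
t = mu[:, None] + rng.normal(0.0, sigma_w, size=(R1, R2))

inner_means = t.mean(axis=1)
direct = inner_means.var(ddof=1)        # biased upward by sigma_w^2 / R2
within = t.var(axis=1, ddof=1).mean()   # estimates E{Var(t | Y_obs)}
adjusted = direct - within / R2         # bias-corrected direct estimator
ltv = t.var(ddof=1) - within            # Law of Total Variance estimator
```

In this toy setting the direct estimator converges to about 3 rather than the target 1, while both corrected estimators are approximately unbiased.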

C.2 Simulation-based Score Test for Dataset Effects
Here, we describe two variants of the simulation-based score test for dataset effects. The null hypothesis is the candidate model, which does not include explicit dataset effects, and the alternative is the same model with an explicit dataset effect. Rejection of this null hypothesis therefore indicates that the dataset effect is not ignorable.
Due to the exponential family nature of the model, the test works by specifying a statistic t_dataset(y) that is not a part of the candidate model and that contains an explicit dataset effect, simulating complete network datasets Y^(r) ∼ ERGM(β̂) from the candidate model, and evaluating the sample quantile q = R⁻¹ Σ_{r=1}^{R} I{t_dataset(y) ≤ t_dataset(Y^(r))}. Then, 2 min(q, 1 − q) is a score test P-value for the null hypothesis θ_dataset = 0.
In our application, a straightforward choice is t_dataset(y) = Σ_{s∈H} |y_s|: a separate density for networks in the H dataset, which only depends on the completely observed networks. (Applying it to partially observed networks would require further adjustment along the lines of Section 4.2.) A joint test for several network features is also possible: if t_dataset(y) is vector-valued, then, with m the estimate of E{t_dataset(Y)} and V the estimate of Var{t_dataset(Y)}, both obtained by simulation, χ² = {m − t_dataset(y)}ᵀ V⁻¹ {m − t_dataset(y)} provides an omnibus score test. Unlike the quantile-based test, this test relies on approximate normality of t_dataset(Y).
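Both variants of the test are straightforward to compute from simulated statistics; the sketch below (function names ours, synthetic normal draws standing in for the simulated dataset-effect statistics) implements the quantile-based P-value and the omnibus χ²:

```python
import numpy as np

def quantile_score_test(t_obs, t_sim):
    """Two-sided simulation P-value: twice the smaller tail of the sample
    quantile of the observed dataset-effect statistic among its
    simulated null replicates."""
    t_sim = np.asarray(t_sim)
    q = np.mean(t_obs <= t_sim)
    return 2 * min(q, 1 - q)

def omnibus_score_test(t_obs, t_sim):
    """Omnibus chi-squared statistic for a vector-valued dataset-effect
    statistic, using the simulated mean and covariance."""
    t_sim = np.asarray(t_sim)
    m = t_sim.mean(axis=0)
    V = np.cov(t_sim, rowvar=False)
    d = m - np.asarray(t_obs)
    return d @ np.linalg.solve(V, d)

rng = np.random.default_rng(1)
sims = rng.normal(0, 1, size=1000)     # simulated null replicates
p = quantile_score_test(0.0, sims)     # observed value near the null mean
```

The omnibus statistic is referred to a χ² distribution with degrees of freedom equal to the dimension of t_dataset.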

D Capacity to Detect Misspecification
To illustrate misspecification detection, we fit a number of reduced models based on Model 1:
• Model 0: a model with only edge count, 2-star, and triangle statistics, with their linear and quadratic network size effects;
• Model 1i: the dyad-independent submodel of Model 1, excluding the 2-star and triangle effects;
• Model 1n: Model 1 without the linear or quadratic network size effects.
A comparison of their specification to that of the other models is given in Table E1.

Hypothesis testing for network size misspecification
In Model 1n, which lacks network size effects, tests on residuals reject the hypothesis of fit for edges and triangles (P-vals. 0.022 and 0.035, respectively), but not for 2-stars (P-val. = 0.069). In Model 0, which does incorporate them, none do (P-vals. 0.88, 0.82, and 0.84, respectively). Model 1i, which has network size effects for density (edge count) but not for other features, rejects the hypothesis of fit for triangles (P-val. = 0.0005) but not for edges or 2-stars (P-vals. 0.40 and 0.079, respectively). Thus, the method appears to be capable of detecting and identifying network size effect misspecification with some detail.
Residual plots for dataset effects Turning to residual plots, Figure D1 shows the edge residual plots for Model 1 (Panel (a)), Model 1n (Panel (b)), and Model 0 (Panel (c)). Observe that unlike Model 1's, Model 0's residuals for households without a child are systematically different from those with a child (from either dataset). Thus, the dataset effect is not ignorable. On the other hand, the lack of network size effects in Model 1n does not appear to be noticeable.

Figure D2 provides density effects plots for Model 1, Model 1n, and Model 0. We see in Panel (b) (Model 1n) that removing the network size effects does not significantly affect density predictions, as the other 25 parameters adjust to compensate, but we can observe a fairly consistent trend in density errors as a function of network size: for very small and large networks, the densities are underpredicted, but for mid-sized networks, they are overpredicted. (This pattern is not found in Model 1's Panel (a).) Model 0 (Panel (c)), as expected, shows large density prediction errors, since it does not attempt to account for household composition.

Pearson residual standard deviation for unaccounted-for heterogeneity Lastly, we consider the proposed diagnostic for unexplained heterogeneity: that Pearson residual variances exceed 1. Residual standard deviations are given in Table D1 for Model 1 and the three reduced models. Interestingly, all the dyad-dependent models, even Model 0, have residual standard deviations less than 1. This is likely because both the 2-star and the triangle statistic create positive dependence among relations in the network, absorbing the variation along the lines of Butts (2017). For small networks in particular, it might not be possible to distinguish anything but the most severe network-level heterogeneity from more local heterogeneity.
On the other hand, Model 1i, despite having as many parameters as Model 1n and almost thrice as many as Model 0, lacks dyad-dependent effects and cannot absorb this variation. Accordingly, its Pearson residual standard deviations exceed 1 for edges and 2-stars.

Four substantive models (1, 2, 3, and 4) and three models to study misspecification detection (0, 1i, and 1n) were fit. Their motivation is described in Section 5.2 in the body of the article and in Appendix D, respectively. Table E1 shows a side-by-side comparison of which effects were included in each model, and Table E2 gives their fit summaries (number of parameters, log-likelihood, AIC, and BIC).

Density error plots for network size and dataset effects
In the following subsections, we provide the details for each of these models:
• full coefficient table;
• plots of residuals vs. fitted values and network size for edge, 2-star, and triangle counts;
• plot of residuals for edges incident on actors of each age against age;
• plot of prediction errors in density by network size and subdataset;
• Score tests for dataset effect net of the model;
• results from regressing residuals for edges, 2-stars, and triangles on candidate network-level predictors; and
• sample standard deviations of Pearson residuals for some network statistics under the model.

Table E21: Score tests of the null hypothesis of no effect of dataset H on the specified network statistic over and above Model 2.

E.1 Full results and diagnostics for Model 0 (the minimal model)
Table E25: Score tests of the null hypothesis of no effect of dataset H on the specified network statistic over and above Model 3.

Statistic | Score χ² (df) | P-val.
Omnibus | 6.4 (2) | 0.0400
edges | nonparam. | 0.0188
2-stars | nonparam. | 0.0927

Table E29: Score tests of the null hypothesis of no effect of dataset H on the specified network statistic over and above Model 4.