Integrating Bayesian network and generalized raking for population synthesis in Greater Jakarta

ABSTRACT Constructing agent data with detailed sociodemographic information is of substantial importance for agent-based modelling. However, collecting data about the whole population is inefficient, since it requires an expensive and time-consuming survey, especially for a large population. This paper uses a novel approach that integrates a Bayesian network (BN) with generalized raking (GR) multilevel iterative proportional fitting (IPF). The approach is applied to construct the population of Greater Jakarta, Indonesia, which has 30 million inhabitants. The results show that the BN approach can produce data that represent the probability distribution of the sample data and that the IPF can match it against aggregate census data.


INTRODUCTION
Agent-based transportation models require detailed individual and household information such as sociodemographics and the geocoding of activity locations (Balmer, Axhausen, & Nagel, 2006). These data can be collected from multiple sources, but collecting all of them for an entire population is costly in terms of time and resources. Therefore, several approaches can be used for population synthesis from limited data. One of the initial methods is iterative proportional fitting (IPF), which was first introduced by Deming and Stephan (1940) and first applied in transport research by Beckman, Baggerly, and McKay (1996). To tackle the sample-availability problem, Barthelemy and Toint (2013) developed a sample-free method for synthetic population generation that relies on the available disaggregated-level information and random drawing; this method was shown to produce a consistent synthetic population. Basically, IPF consists of iteration steps in which each row and each column are proportionally adjusted to equal the marginal row and column totals. The steps are repeated until both rows and columns converge, i.e., until the sums of the rows and columns are sufficiently close to their marginal totals. However, the control totals of IPF are based on either marginal individual or marginal household totals. Since IPF is widely used for population synthesis, there have been several efforts to improve its quality and to develop new approaches. As reviewed by Müller and Axhausen (2011) and Sun and Erath (2015), IPF has been extended by many researchers, for example, to deal with the zero-cell issue (Guo & Bhat, 2007). Ye, Konduri, Pendyala, Sana, and Waddell (2009) proposed iterative proportional updating (IPU), Pritchard and Miller (2012) addressed memory consumption issues, and Casati, Muller, Fourie, Erath, and Axhausen (2015) and Zhu and Ferreira (2014) introduced hierarchical and multistage IPF procedures.
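The row and column adjustment at the core of IPF can be illustrated with a short sketch. The following Python example (illustrative numbers only; not the software used in this study) fits a small seed table to new marginal totals:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Two-dimensional IPF: alternately scale rows and columns of the
    seed matrix until its margins match the control totals."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        # Adjust each row proportionally to its marginal target.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Adjust each column proportionally to its marginal target.
        table *= col_targets / table.sum(axis=0)
        # Stop once the row sums have also converged.
        if np.allclose(table.sum(axis=1), row_targets, rtol=tol):
            break
    return table

# Toy example: a 2x2 seed table fitted to new marginal totals.
seed = np.array([[40.0, 30.0], [20.0, 10.0]])
fitted = ipf(seed,
             row_targets=np.array([60.0, 40.0]),
             col_targets=np.array([55.0, 45.0]))
```

After convergence, the fitted table matches both sets of marginal totals while preserving, as far as possible, the interaction structure of the seed.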
IPU and hierarchical iterative proportional fitting (HIPF) are advanced variants of IPF that consider both household and individual control totals. However, when only one type of control total is used, they reduce to standard IPF.
Furthermore, there are other methods, such as combinatorial optimization (CO) (Voas & Williamson, 2001), which estimates integer weights for a sample and replicates the desired marginal target totals; it is claimed to have lower variance than IPF. Farooq, Bierlaire, Hurtubia, and Flotterod (2013) proposed Markov chain Monte Carlo (MCMC), which deals with the zero-cell issue. There is an extension of MCMC, called hierarchical MCMC, that uses individual and household attributes simultaneously. Hafezi and Habib (2014) further employed fitness-based synthesis (FBS), which allows the consideration of multilevel controls by selecting the variables that have maximum fitness. However, this approach gives lower accuracy when the classification of attributes becomes more detailed. Sun, Erath, and Cai (2018) developed a hierarchical mixture model, which combines three different models: probabilistic tensor factorization, multilevel latent class modelling and rejection sampling. This approach can produce a pool of agents that respects the interdependencies of the household and individual structure. However, this method is complicated and not widely applicable.
Several use cases employ population synthesis with different approaches. We now summarize the use cases of the various methods for different locations and population sizes. As can be seen in Table 1, most of the locations are in developed countries and have populations of < 10 million. Furthermore, related software can be used for synthetic population generation, such as PopoSyn-Win, ILUTE, PopGen, FSUMTS, CEMDAP, ALBATROS, the R packages sms and synthpop, TRANSIMS, Synthia and SMILE (Müller & Axhausen, 2011; Templ, Meindl, Kowarik, & Dupriez, 2017).
As an alternative, Sun and Erath (2015) and Zhang et al. (2017) employed Bayesian networks (BN) for the task. They claim that a BN can capture the complex interactions and hierarchical household structure of the sample, and show that it outperforms other methods (i.e., MCMC, direct inflating (DI) and IPF). Moreover, Saadi, Mustafa, Teller, Farooq, and Cools (2016) used the hidden Markov model (HMM), a stochastic model that learns the complex joint distribution of the sample structure and can be regarded as the simplest form of BN. However, this model takes less account of the hierarchical household structure. Since the HMM cannot fit the marginal totals, Saadi, Farooq, Mustafa, Teller, and Cools (2018) improved the method by integrating the HMM with IPF. After this combination, the model can give a quasi-perfect result.
In this paper we use a BN. However, similar to an HMM, a BN alone can only generate synthetic data whose distribution resembles that of the sample (Sun & Erath, 2015; Zhang et al., 2017). Therefore, we add a similar type of integration as in Saadi et al. (2018); in our case, however, we integrate a BN with generalized raking (GR) multilevel IPF to fit the aggregate census data. The following reasons motivate this choice. First, when the sample covers < 40% of the population, a BN is known to give better results in modelling the interdependencies of individual and household attributes (Sun & Erath, 2015). Second, a BN outperforms other methods in terms of the resulting standardized root mean square error (SRMSE) and is an excellent model with which to capture heterogeneity when the sample is < 70%.
This study makes the following contributions:
- We combined two methodological approaches: a BN integrated with GR multilevel IPF.
- We applied the model to a developing country and a megacity, one of 31 megacities worldwide, of which 26 are in the less developed regions (United Nations, 2016) that have lower data availability.
- This paper adds to the growing literature on BN for population synthesis and its application in developing countries and megacities.
The paper is structured as follows. The next section explains the concept of BN. The third section discusses the study area and framework model used. The fourth section applies the approach to a population synthesis using BN and GR multilevel IPF to fit aggregate census data. Discussions and conclusions follow in the fifth and sixth sections.

BAYESIAN NETWORK
A BN uses a graphical method to learn the probabilities for a model (Cowell, Dawid, Lauritzen, & Spiegelhalter, 2006). It consists of two parts: a directed acyclic graph (DAG) and a set of conditional probability distributions (Sun & Erath, 2015; Horný, 2014), where the DAG encodes a set of correlated random variables. In a graphical structure G = (V, A), the variables are represented by nodes or vertices (V) and the correlations are represented by directed edges or arcs (A). In the example shown in Figure 1, the variables are income, age and gender. The directed edges from NodeIncome to NodeAge and from NodeIncome to NodeGender indicate that NodeAge and NodeGender are linked to NodeIncome by conditional probabilities. The corresponding conditional probability distributions are P[NodeAge|NodeIncome] and P[NodeGender|NodeIncome].
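To make the factorization concrete, the following Python sketch encodes the three-node network of Figure 1 with hypothetical conditional probability tables (all numbers are invented for illustration, not estimated from data). It shows both the joint probability implied by the DAG and forward (ancestral) sampling from it:

```python
import random

# Hypothetical CPTs for the network of Figure 1:
# Income -> Age, Income -> Gender.  All numbers are invented.
p_income = {"low": 0.6, "high": 0.4}
p_age_given_income = {
    "low": {"young": 0.7, "old": 0.3},
    "high": {"young": 0.4, "old": 0.6},
}
p_gender_given_income = {
    "low": {"female": 0.5, "male": 0.5},
    "high": {"female": 0.45, "male": 0.55},
}

def joint(income, age, gender):
    """Joint probability factorized along the DAG:
    P(Income, Age, Gender) = P(Income) P(Age|Income) P(Gender|Income)."""
    return (p_income[income]
            * p_age_given_income[income][age]
            * p_gender_given_income[income][gender])

def sample_agent():
    """Forward (ancestral) sampling: draw parents before children."""
    income = random.choices(list(p_income),
                            weights=list(p_income.values()))[0]
    age_cpt = p_age_given_income[income]
    age = random.choices(list(age_cpt), weights=list(age_cpt.values()))[0]
    gender_cpt = p_gender_given_income[income]
    gender = random.choices(list(gender_cpt),
                            weights=list(gender_cpt.values()))[0]
    return {"income": income, "age": age, "gender": gender}
```

Repeated calls to sample_agent() produce a synthetic sample whose distribution approaches the joint distribution encoded by the CPTs, which is the mechanism a BN-based synthesizer uses at scale.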

A learning algorithm for BNs
There are different algorithms for learning BNs, as explained by Scutari (2010), such as constraint-based, score-based and hybrid algorithms, each comprising several learning procedures. Here we used the R package bnlearn (Scutari, 2010), which implements Tabu Search as one of the score-based algorithms. Tabu Search is a generic heuristic, an iterative search procedure used to obtain the best solution from complex correlation patterns (Glover & Taillard, 1993); it can also escape local optima by accepting solutions close to the optimum that minimize the score (Scutari, 2010). It supports a whitelist and a blacklist: arcs on the blacklist will never be present in the network structure, while arcs on the whitelist are always included.

Network scores
Measuring how well a candidate graphical structure fits the data is important to ensure that the structure can produce a well-fitting synthetic population. Several scoring methods have been introduced for this process. The simplest is the maximum likelihood, i.e., the log-likelihood of a given pair (G, Θ) for the observations D:

\mathcal{L}(G, \Theta \mid D) = \log P(D \mid G, \Theta) \quad (1)

However, the log-likelihood alone is not suitable, as explained by Sun and Erath (2015), because it leads to overfitting: this score always favours a fully connected DAG. Therefore, the most common approaches use the Bayesian information criterion (BIC) (Rissanen, 1978; Schwarz, 1978) and the Akaike information criterion (AIC) (Rissanen, 1978; Akaike, 1974):

\mathrm{BIC}(G^h \mid D) = \log P(D \mid G^h, \hat{\Theta}) - \frac{d}{2} \log m \quad (2)

\mathrm{AIC}(G^h \mid D) = \log P(D \mid G^h, \hat{\Theta}) - d \quad (3)

where \hat{\Theta} is the maximum likelihood estimate of the parameters given a hypothetical structure G^h; d is the number of degrees of freedom in \hat{\Theta}; and m is the number of observations. In contrast to the maximum likelihood (1), BIC (2) and AIC (3) add a penalty to the optimized likelihood: for BIC the penalty is (d/2) log m, and for AIC it is d. Using the scoring function, the best network is selected for constructing the synthetic population.
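As a numerical illustration of these scores, the Python sketch below (a toy two-variable data set; the variable names, counts and degrees of freedom are invented for the example) computes the maximum-likelihood log-likelihood of a simple Income → Age network and applies the BIC and AIC penalties:

```python
import math
from collections import Counter

def loglik_two_node(data):
    """Maximum-likelihood log-likelihood of the two-node network
    Income -> Age, with CPTs set to the observed frequencies."""
    n = len(data)
    income_counts = Counter(inc for inc, _ in data)
    pair_counts = Counter(data)
    ll = 0.0
    for (inc, age), c in pair_counts.items():
        # Each record contributes log P(income) + log P(age | income).
        ll += c * (math.log(income_counts[inc] / n)
                   + math.log(c / income_counts[inc]))
    return ll

def bic(ll, d, m):
    """BIC: log-likelihood penalized by (d / 2) * log(m)."""
    return ll - d / 2 * math.log(m)

def aic(ll, d):
    """AIC: log-likelihood penalized by d degrees of freedom."""
    return ll - d

# Invented toy sample of (income, age) pairs.
data = ([("low", "young")] * 5 + [("low", "old")] * 3
        + [("high", "old")] * 2)
ll = loglik_two_node(data)
# d = 1 free parameter for P(income) + 2 for P(age | income).
bic_score = bic(ll, d=3, m=len(data))
aic_score = aic(ll, d=3)
```

For this sample size (m = 10), the BIC penalty (d/2) log m exceeds the AIC penalty d, so BIC punishes model complexity more heavily, which is the usual trade-off between the two criteria.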

Constructing the population of the Greater Jakarta Area
The study area is the Greater Jakarta Area, or Jabodetabek, which consists of Jakarta province and parts of West Java province and Banten province. It has 31.7 million inhabitants (BPS-Statistics, 2016a, 2016b, 2016c) (Table 2). The population data used for the synthetic population generation were partly obtained from the JAPTRAPIS (Jabodetabek Public Transport Policy Implementation and Strategy) study in 2009 (JICA, 2009; Ilahi & Axhausen, 2017).
There are two different types of data in the JAPTRAPIS study: the Household Travel Survey (HTS) and the Activity Diary Survey (ADS). We only use the HTS data, consisting of 178,954 households, or 334,973 individuals, for the population synthesis, which is equal to 3% of all households. In the HTS, the respondents are individuals who travel to school or to work. Therefore, the aggregate synthetic population is limited to the census population of individuals whose activities are studying or working, with a total of > 20 million (Table 3).
The data used for this approach come from multiple sources within the HTS data. The variables for individuals include age, gender, education and employment status; the variables for households are income, housing status, vehicle ownership and address. For the activity-based model, we also need the geocoding of the activity locations; we therefore geocode the addresses first using the Google Application Programming Interface (API). In total, there are three sources of data in the model. We use the R package dplyr to join the data. Figure 2 shows the framework of data combination and integration between BN and GR. Further detail of the BN is presented in Figure 4.

BN step
We consider seven variables by combining individual and household data for the population synthesis using BN, as presented in Table 4: type of activity, age, sex, income, housing, car ownership and driving licence. These variables are considered the main sociodemographic variables, chosen to reduce the complexity of the network structure; for income, housing status and car ownership, the household data are used. The network structure is chosen based on the AIC score, employing the Tabu Search algorithm to learn the structure of the BN, as implemented in the R package bnlearn (Scutari, 2010; R Core Team and others, 2013). There are two steps in the Tabu Search in this scenario: an initial search without whitelist and blacklist G structures, and a final search with them, where 256 searches are performed in the initial step and 64 in the final step. The final structure is obtained by an iterative process after measuring the error for each arc: the arc that gives the smallest error is forced into the network using the whitelist command, and the arc that gives the highest error is excluded from the network using the blacklist command.
In the final search, we found an AIC of −1,679,334 for the selected BN. The model structure and the conditional probabilities of the variables can be seen in Figure 3. The tables accompanying Figure 3 present the conditional probabilities; here we show the variables sex and income. There are directed arcs from NodeTA and NodeAge to NodeSex. Accordingly, the conditional probability for an income between five and eight and type of activity (TA) school is 0.43 for females and 0.57 for males. The interpretation is the same for the other variables, and the conditional probabilities always sum to 1. After identifying the best structure, we generated data as close as possible to the aggregate census data; the data were generated for 4, 8, 16 and 22 million agents.
We randomly joined the generated BN population to the remaining 20 variables in the HTS data, based on the seven variables available in both data sets. The complete data set after joining, consisting of 27 variables (Figure 4), is used for the agent-based model. For the extended target of 16 million agents, this join operation takes three days and 13 h; for 22 million agents, it takes more than five days on the ETH Zürich computer cluster (ETH Zürich, 2016).
To visualize the goodness of fit after the BN approach and the data join, we compared the HTS data with the final structure from the BN. Here we visualize two joint distributions: Figure 5 shows the joint distribution of income and age, and Figure 6 shows the joint distribution of income and region. The left panels show the joint distribution of the HTS data, the middle panels show the joint distribution of the BN data, and the right panels show the fit between the two. In Figure 6 we considered the variable region, which was not used in the BN approach, to check that the join procedure also gives a good fit for unselected variables. We found similar distributions after the HTS and BN data were joined.
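The random join on shared variables can be sketched as follows (a hedged Python illustration with invented records and only two shared variables instead of seven; this is not the dplyr implementation used in the study):

```python
import random
from collections import defaultdict

def random_join(agents, sample, key):
    """For each generated agent, draw a random donor record from the
    sample that matches on the shared key variables, and inherit the
    donor's remaining attributes."""
    pools = defaultdict(list)
    for rec in sample:
        pools[tuple(rec[k] for k in key)].append(rec)
    joined = []
    for agent in agents:
        donor = random.choice(pools[tuple(agent[k] for k in key)])
        merged = dict(donor)
        merged.update(agent)  # the agent's own values take precedence
        joined.append(merged)
    return joined

# Invented records: two shared variables and one extra HTS attribute.
hts = [
    {"income": "low", "age": "young", "education": "primary"},
    {"income": "low", "age": "young", "education": "secondary"},
    {"income": "low", "age": "old", "education": "tertiary"},
    {"income": "high", "age": "old", "education": "tertiary"},
]
agents = [{"income": "low", "age": "old"},
          {"income": "high", "age": "old"}]
population = random_join(agents, hts, key=("income", "age"))
```

Each synthetic agent thus inherits a plausible combination of the attributes not modelled by the BN, because the donor record is drawn from HTS observations with the same values on the shared variables.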
Furthermore, as also found by Sun and Erath (2015) and Zhang et al. (2017), the BN model can reproduce the distribution of the HTS data, but not the census totals: we can only set the target size of the generated population, whether 4 million or 22 million, and the generation is not based on the control totals of the aggregate census data in Table 3. Therefore, to solve this issue, we employ multilevel IPF with the R package MultilevelIPF (Müller, 2017) to adjust the artificial sample to the aggregate census data (Table 3).
GR multilevel IPF for fitting against census data

Several multilevel IPF algorithms have been implemented, such as HIPF, IPU, entropy optimization (Ent) and GR. We use the GR algorithm, which has been shown to outperform the other algorithms: it generates weights with minimal squared error and is faster. It matches the data to the aggregate control groups by estimating a weight for each observation in the sample (Müller, 2017). Two group controls are used: region and gender, and region and age.
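As an illustration of calibrating observation weights against group controls, the sketch below implements classical raking in Python (classical raking is one member of the generalized raking family, the one with the multiplicative distance function; the group labels and control totals are invented, and this is not the MultilevelIPF implementation):

```python
def rake(weights, group_lists, control_lists, n_iter=100):
    """Classical raking: repeatedly scale observation weights so that
    the weighted counts match each set of group control totals."""
    w = [float(x) for x in weights]
    for _ in range(n_iter):
        for groups, controls in zip(group_lists, control_lists):
            # Current weighted total for each group.
            totals = dict.fromkeys(controls, 0.0)
            for wi, g in zip(w, groups):
                totals[g] += wi
            # Scale every observation by its group's adjustment factor.
            w = [wi * controls[g] / totals[g] for wi, g in zip(w, groups)]
    return w

# Invented observations with two control dimensions, simplified from
# the paper's region/gender and region/age group controls.
gender = ["f", "f", "m", "m"]
age = ["young", "old", "young", "old"]
weights = rake([1, 1, 1, 1],
               [gender, age],
               [{"f": 60.0, "m": 40.0},
                {"young": 50.0, "old": 50.0}])
```

After convergence, the weighted counts of each gender and each age group match their control totals simultaneously, which is the per-observation analogue of the cell-based IPF adjustment.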
Furthermore, a weakness of IPF is its non-integer weights (Müller, 2017; Lovelace & Ballas, 2013), while integer weights are required for generating the target population. Integerization is the process of converting the decimal weights (which indicate how many times each agent is replicated) to integer values, and it is therefore an important step in producing the best fit to the marginal census data. It can be done by weighted random sampling without replacement using the wrswoR package (Müller, 2016), or with the truncate, replicate, sample (TRS) method (Lovelace & Ballas, 2013). We employ weighted random sampling without replacement, which consists of three steps: first, removing the decimal parts of the weights; second, using the decimal remainders as a vector of probability weights; and third, performing weighted sampling without replacement with these probabilities. The crank algorithm is used in this operation for faster results (Müller, 2016).
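The three integerization steps can be sketched as follows (a simplified Python stand-in for the wrswoR-based procedure, with an invented weight vector; the crank algorithm itself is not reproduced here):

```python
import math
import random

def integerize(weights):
    """Integerize decimal weights in three steps: truncate, use the
    decimal remainders as probabilities, then allocate the remaining
    agents by weighted sampling without replacement."""
    base = [math.floor(w) for w in weights]              # step 1: truncate
    remainders = [w - b for w, b in zip(weights, base)]  # step 2: remainders
    n_extra = round(sum(remainders))   # agents still to be placed
    pool = {i: r for i, r in enumerate(remainders) if r > 0}
    chosen = set()
    for _ in range(n_extra):                             # step 3: WRS w/o repl.
        u = random.random() * sum(pool.values())
        pick = None
        for i, r in pool.items():
            u -= r
            if u <= 0:
                pick = i
                break
        if pick is None:            # guard against floating-point edge cases
            pick = list(pool)[-1]
        chosen.add(pick)
        del pool[pick]
    return [b + (1 if i in chosen else 0) for i, b in enumerate(base)]

# Invented decimal weights whose total is an integer (7 agents).
counts = integerize([1.5, 2.25, 0.25, 3.0])
```

The procedure preserves the total number of agents exactly, while each individual replication count differs from its decimal weight by less than one.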
The data sets from the BN used for fitting with GR differ in size: 4, 8, 16 and 22 million agents. Poorer fits were obtained with the 8, 16 and 22 million agent data sets, possibly because their large size and complex structure create errors in the implementation; integerization also introduced replication errors. Therefore, we used the 4 million agent data set, which fitted with < 0.2% difference compared with the target census data. The synthetic population fits after applying GR multilevel IPF to the target of 20 million marginal census data and the control totals in Table 3.

DISCUSSION
Population synthesis is an important step when developing an agent-based model, and data are an important factor in ensuring that the population synthesis gives a precise description of the agents.

Figure 7. Difference (%) between population synthesis and census data by region and age.

There are several methods available: some improve previous methods, others introduce new ones, as mentioned above. Each region presents different difficulties, since each country has different regulations regarding data issues and data availability; less developed countries may have lower data accessibility and quality. One study generated a synthetic population without a sample to address the sample-availability issue (Barthelemy & Toint, 2013). However, it remains questionable at which level of disaggregation this can give a quasi-perfect distribution; on the other hand, it would be a solution for the initial development of an agent-based model. The size of the data set also influences the processing: as mentioned, it takes more than a day to join the different data sets, and accelerating the joining process for huge data sets would be an interesting topic. Moreover, a BN alone cannot fit the marginal census data, while using GR alone fails to ensure that the dependencies between the variables are reproduced. The combination of BN and GR, on the other hand, maintains the stability of the dependencies and ensures a quasi-perfect distribution. Saadi et al. (2018) combined HMM and IPF in the same spirit.

CONCLUSIONS
In the present case, we found that the BN can construct a synthetic population and reproduce the HTS data. The result after joining the population synthesis and HTS data gives well-fitting distributions. However, the BN output still differs from the marginal census data; therefore, we need to fit it against the marginal distribution of the census data.
The differences are addressed with GR. We made an effort to remain as close as possible to the target census data when fitting with GR, starting from the 4, 8, 16 and 22 million agents produced by the BN approach. However, the 8, 16 and 22 million agent data sets gave distributions that deviated from the census data. Therefore, we used the 4 million agents from the BN to fit the 20 million target agents of the marginal census data/control totals using GR, which is equal to 20% of the final data. Based on the results described above, we can conclude the following:
- The results confirm that the BN approach can be used to produce large samples with well-fitting distributions, which is useful for any researcher who has a limited sample to start with.
- The result from GR can fit the control totals or marginal census data.
- Combining BN and GR can help researchers produce data that fit the control totals or marginal census data.
This synthetic population will be used in our further research to develop an agent-based model using multi-agent transport simulation (MATSim) (Horni, Nagel, & Axhausen, 2016). This is the first scenario of an agent-based model for Greater Jakarta (Ilahi, Balac, Li, & Axhausen, 2019). Several variables from the synthetic population will be used in the agent-based modelling, such as age, gender, income, coordinates of both home and office, activities, licence and car ownership. Furthermore, we will integrate secondary activities in our population synthesis, since no such data have been available so far; the steps will be similar to what we have explained in this paper. In addition, another method to address joint time use should also be investigated in future work.