Bayesian estimation of MSM population size in Côte d’Ivoire

Côte d’Ivoire has one of the largest HIV epidemics in West Africa with around half million people living with HIV. Key populations like gay men and other men who have sex with men (MSM) are often disproportionately burdened with HIV due to specific acquisition and transmission risks. Quantifying the MSM population sizes at subnational level is critical to improving the HIV prevention interventions. While survey-based direct estimates of MSM numbers are available at a few urban centers in Cˆote d’Ivoire, no data on MSM population size exists at other areas without any community infrastructure to facilitate sufficient access to the MSM community. We use this limited data in a Bayesian regression setup to produce first empirically calculated estimates of the numbers of MSM in all areas of Cˆote d’Ivoire prioritized in the HIV response. Our hierarchical model imputes missing covariates using geospatial information and allows for proper uncertainty quantification leading to meaningful confidence bounds for the predicted MSM population size estimates. The intended impact of this process is to increase uptake and use of high quality, comprehensive epidemiologic and interventional data in program planning. These estimates will help design future surveys and support the planning of the scale and content of HIV prevention and treatment programs for MSM in Cˆote d’Ivoire.


Introduction
The last five years have witnessed significant advancements in the response to HIV including universal treatment for those living with HIV, antiretroviral based pre-exposure prophylaxis to prevent HIV, and new diagnostic approaches including HIV-self testing (UNAIDS 2017). However, leveraging these strategies to achieve an AIDS-Free generation by 2030 necessitates understanding who and why people continue to acquire HIV (Beyrer et al. 2014, Stahlman et al. 2016. In concentrated epidemics, it has long been understood that the majority of HIV infections are among populations with specific acquisition and transmission risks for HIV including gay men and other men who have sex with men (MSM), sex workers, people who use injection drugs, and transgender women (Beyrer et al. 2014).
However, in generalized HIV epidemics, the specific proximal risks for HIV have been less studied which challenges the ability to effectively specify both the most appropriate benefactors for these new interventions as well as the number of people in need (Boily et al. 2015, Mishra et al. 2016. To address the former, there has been a number of epidemiologic and mathematical modeling studies demonstrating the importance of addressing the HIV prevention and treatment needs of key populations in the context of generalized HIV epidemics (Mishra et al. 2016). However, there remain limited data on the sizes of key populations across most generalized HIV epidemic settings (Abdul-Quader et al. 2014).
Characterizing the numbers of key populations facilitates an understanding of the numbers of potential eligible candidates for more intensive HIV prevention interventions, the overall potential impact of those interventions when implemented at scale, and finally an improved understanding of the local HIV epidemic (Abdul-Quader et al. 2014, Holland et al. 2016. Moreover, to effectively parameterize mathematical models characterizing the modes of transmission, high quality data regarding the size, characteristics, and HIV burden among key populations are needed (UNAIDS/WHO Working Group on Global HIV/AIDS and STI Surveillance 2010). Concurrently, there has been increasing consensus on the appropriate methods for population size estimation for key populations (Quaye et al. 2015, UNAIDS/WHO Working Group on Global HIV/AIDS and STI Surveillance 2010). However, many current size estimates that have been completed resulted in national estimates, with less in the literature focused on subnational estimates in the majority of low and middle income settings (Tanser et al. 2014). Ultimately, size estimates at the subnational level are those most often used by by local ministries of health, implementing partners, and bilateral and multilateral funding agencies to guide the geographic and population prioritization of resources and efforts (Yu et al. 2014). Often the direct estimates of key population size have been in urban or peri-urban areas where the population densities of key populations are higher and where the community infrastructure exists to facilitate sufficient access to the community (Yu et al. 2014). However, HIV prevention and treatment needs are universal, necessitating methods for estimating population size of key populations at high risk of HIV acquisition and transmission at the subnational level (Tanser et al. 2014).
There exist a range of extrapolation methods to generate estimates at the national and subnational level. These methods differ in terms of their reliance on data, cost, and scientific rigor (Yu et al. 2014). Expert opinion involves consulting experts, including national stakeholders, technical experts, and key population groups, on how confident they are with the direct estimates and seeking their advice on how to apply these numbers to other off-sample areas. This method has low reliance on data, little cost, and relatively low scientific rigor. Simple and stratified imputation apply the mean from areas with direct estimates to the areas where predictions are needed. These methods have some reliance on auxiliary data and result in arguably more evidence-based rigor than relying on expert opinion alone. Less is known about other more complex methods, including regression, treating off-sample areas a missing data problem, and utilizing geospatial covariation or correlation to predict values, i.e., small-area estimation.
In West Africa, the epidemiology of HIV has been distinct from that in Eastern and Southern Africa (Djomand et al. 2014, Holland et al. 2016, Papworth et al. 2013. Specifically, the population HIV prevalence among all adults has not surpassed 5% though very high burdens have been observed among key populations Djomand et al. (2014). The burden of HIV in Côte d'Ivoire was estimated to be 3.2% among all adults equating to nearly half a million people living with HIV, nearly all of whom are over fifteen years old. In the national strategic plan for HIV, key populations including MSM, female sex workers (FSW), and people who inject drugs (PWID) have been deemed to be priority populations for HIV prevention and universal treatment for those living with HIV. However, similar to other settings, the enumeration and representative sampling of key populations has been challenged by the criminalized nature of these populations combined with high levels of stigma (Beyrer et al. 2012). Consequently, specialized sampling strategies for key populations in these settings have been used including respondent-driven sampling, time-location sampling, the prioritization for local AIDS control efforts (PLACE) and others. However, the majority of these studies have taken place in urban centers resulting in limited study of population size estimates for key populations in the majority of the country including rural, less densely populated settings Abdul-Quader et al. (2014).
Given limited data on population size of key populations in Côte d'Ivoire, the objective of this study was to assess the utility of small area estimation approaches to estimate the population size in the organizational units prioritized by the Presidents Emergency Plan For AIDS Relief (PEFPAR) along with proper quantification of uncertainty of those estimates. Specifically, the study aimed to utilize available direct estimates of population size and other demographic covariates in Côte d'Ivoire to generate model based estimates of population sizes of MSM for subnational areas.

Data
The size of MSM population was directly estimated for five regions of Côte d'Ivoire: Abidjan, Agboville, Bouake, Gagnoa, and Yamoussoukro from March 2015-February 2016. Regions for direct size estimation were selected to coincide with a respondent-driven sampling study of adult MSM in progress in these same five areas.

Direct estimates
Size estimates were generated through the use of various multiplier methods, including the unique object multiplier, NGO membership multiplier, service multiplier, and social event multiplier. Briefly, multiplier methods for size estimation compare two independent sources of count population size, assessing the overlap between these independent estimates in order to determine the total number of individuals in the population. The basis for these methods rests on the understanding that the proportion of individuals in the population who appear at a specific institution during a certain time period, for example a current registry of NGO membership, is equal to the proportion who appear at that same institution among the survey participants.
The first independent source of data consisted of a count or listing from programme data for NGO membership, service provision, and social event attendance. For the unique object multiplier method, the first source of data was a log of how many objects were distributed. The total number of MSM attending services at "Clinique de Confiance" was captured from program logs. The total number of MSM who attended the social event "evening GNARA" was recorded. The total number of MSM belonging to NGOs was also captured from program logs.
The second independent source of data for our estimates was a representative survey in which MSM were recruited through Respondent Driven Sampling (Heckathorn 1997, RDS), which is a strategy employed when individuals in the target population are hard-to-reach and when no known sampling frame exists. Methods for respondent-driven sampling have been described previously. Individuals were purposively asked questions to generate this second independent source of data for size estimation (Appendix A). As one example, for the service multiplier method, participants were asked if they had ever received services from " Clinique de Confiance." From these two independent sources of data, direct estimates were generated using the formula S = (N 1 × N 2 )/R , where S is the estimate of total population size, N 1 and N 2 respectively are the total numbers of people captured in the first and second independent source of data (e.g. programme log and MSM recruited to the cross-sectional RDS study), R is the overlap between two independent sources of data (e.g. number of MSM participating in cross-sectional RDS study who reported accessing services). Since, the estimates were based on RDS, 95% confidence intervals could be calculated (Salganik 2006) for the proportion R/N 2 of RDS participants who were also enlisted in the first programme.
Subsequently they are used to create confidence intervals for S.
Because those sampled were 90% from the age group 18-29, population size estimates were age-standardized to get a better estimate of all men (15-49) who have sex with other men. In order to age-standardize, a population size estimate proportion was calculated using the male population 18-29 of the given region. In accordance with previous studies looking to estimate the size of men who have sex with men (Purcell et al. 2012), it was assumed that the proportion of MSM remains constant across age groups. Therefore, the calculated population proportion was applied to the total male population 15-49 to get an estimate of the total number of men 15-49 who have sex with other men. However, for the data analysis we used the direct estimates for the 18-29 age group, instead of the age-standardized version for the 15-49 age group. Hence, the data analysis and subsequent predictions of number of MSM in the age group of 18-29 years does not rely on any such assumption.

Prediction areas
In total, there were 61 prediction areas that were selected to coincide with PEPFAR's official Organizational Units (OUs), and that also roughly correspond with Côte d'Ivoire's department-level administrative division. Prediction areas were selected to coincide with PEPFAR Organizational Units (OUs) in order to provide evidence-based estimates for targeted prevention, care and treatment programs and to inform Country Operational Processes. These PEPFAR OUs also roughly correspond with the official departmentlevel administrative division in Côte d'Ivoire. In settings where public health systems are decentralized, estimates for program denominators are needed at both the national and sub-national level in order to set actionable targets. This is especially important if there are large regional differences in burden of disease, resources, etc. The intended impact of this process is to increase uptake and use of high quality, comprehensive epidemiologic and interventional data in program planning, while building consensus on small area estimations of available data to guide additional data collection and programmatic efforts focused on HIV among key populations.

Covariates
Covariates were selected based on relevance to prediction model and availability of quality data at the appropriate administrative division (department level). Data for population density, density change and male population was obtained from publicly available data pub- There was no department-level age-stratified, sex-stratified data. We assumed a constant age and sex distribution across all departments: 55% of total male population for each of the departments/region seats is in the age group 18-29. Also, for Abidjan, close to 90% of our sample reported being from either Abobo, Cocody, Marcory, Triechville, or Youpougon. This is just five communes out of the total ten in Abidjan. We therefore considered our sample to better represent these five communes of Abidjan rather than the whole city. The total male (15-49) population for these communes was 842551 rather than 1286750 for the whole city and for men 18-29 was 368097 rather than 562160. We also assumed that the age-sex distribution was the same across all the communes.
Additionally, we also used estimated population density from the Landscan database where p i = logit(x T i β). However, note that not all the direct estimates are equally reliable. For example, we observe in Table 1 that the NGO membership based estimate of MSM population in Bouake differs by an order of magnitude from the other three estimates for the same region. The confidence interval for this estimate is also very wide suggesting limited reliability of the estimate. While it is less clear how to incorporate information from the confidence intervals in a binomial GLM setup, we can leverage these data in a linear regression setup via heteroskedastic errors.
Defining y ij = log(n ij /N i ), we specify the linear regression model as y ij ind ∼ N (x T i β, τ 2 ij ). The variance of a normally distributed random variable is proportional to the square of two-tailed 95% coverage interval. Therefore, we specify τ 2 ij = τ 2 (y ij·u − y ij·l ) 2 where y ij·u and y ij·l denotes the upper and lower bounds corresponding to y ij . This ensures that more uncertain estimates with very wide confidence bounds are given less weights in the model than the ones with narrow confidence intervals providing more precise information.
Regression models for small area estimation often include region specific random effects to improve estimation (Fay & Herriot 1979). However, Datta et al. (2011) argued that when the number of regions is small, the simpler model without random effects often performs better. Owing to the very small sample size of the dataset, we decided against introducing region specific random effects as it involves additional parameters.
Finally, we have used a log-transformation to define the y ij 's instead of a logit transformation, although the latter is more natural, as n ij /N i is a fraction. In the dataset, the proportions n ij /N i are typically very small (80% are less than 0.05). So the two transformations yield very similar y ij 's. Furthermore, as we discuss in Section 3.2, the log transformation is more interpretable in our final model which includes log(N i ) as one of the covariates. Hence, we preferred the log-transformation.

Covariate Selection
The covariates described in Section 2 were region specific. Hence, although there were 19 datapoints, there were only 5 unique sets of covariate values, one for each region. This impeded exploring nonlinear models linking y ij 's to x i and confined us to the parsimony of the linear model. Even in a linear setup, we want to select only one or two most relevant covariates from the five available -male population, population density, density change, HIV prevalence and Landscan density. We kept HIV prevalence in the model as we expected areas of where there are more MSM to be areas with high HIV prevalence, as MSM are disproportionately affected by HIV given the biology of HIV transmission combined with limited programming focused on mitigating risks specifically among gay men and other MSM. The total male population in the age group (N i ) has already been used to define y ij 's. Hence, it seems natural to exclude it from the linear model. However, Figure 1 reveals that y ij 's have a very strong negative correlation with log(N i )'s. Initially this negative correlation seems counter-intuitive given a rural-to-urban migration. One likely explanation is that in large urban centers the MSM community grows at a slower rate than the overall population even if the absolute numbers of MSM is higher. For example, if the numbers of MSM population grows at a rate proportional to N γ i for some γ < 1, then cov(y ij , log(N i )) = γ − 1 which is negative. Hence, we included log(N i ) as a covariate, which in turn justifies use of a log-transformation to define y ij 's, as it imparts a nice interpretability about of the relative growth rates of the MSM population and the total population. The other explanation would be that in urban areas a higher proportion of MSM are not accounted for in the survey or that the independent assumption is violated in the capture-recapture method. While both reasons are conceivable, the first is a feature of MSM population dynamics while the second is a sampling issue. In the absence of additional data collection, it is not feasible to discern between these two scenarios. However, internetbased surveys may facilitate learning more about the numbers of MSM in more stigmatizing settings.
Due to limited sample size, we restricted model selection only from linear models using no more than two of the five covariates and used leave-one-out cross validation to choose the covariate (alongside HIV prevalence). For each of the 4 models, we leave one region out and fit the model using the remaining regions to predict the MSM population size at the left out region. We repeat this for all the five regions and the average squared error between the predicted estimate and average direct estimate for each region gives the leave-one-out cross validated (LOOCV) error. Landscan density + HIV prevalence 5.7 × 10 −3 population density + HIV prevalence 6.5 × 10 −3 density change + HIV prevalence 6.7 × 10 −3 H i denotes the HIV prevalence for the i th region.

Spatial model for HIV prevalence
HIV prevalence data were missing at around 50% ( where v(s) = σ 2 (exp −φ||s − s i ||) s i ∈S . For the cross validation, we use these kriging equations to impute the HIV prevalence at each left out region, based on parameter estimates using data from the remaining regions. For comparison, we used the mean HIV prevalence of the in-sample data to predict at the left out region. The leave-one-out MSE for the spatial model (MSE=0.37) was around 20% better than for the mean imputation (MSE=0.48).
Hence, we use the spatial GP model for imputing the missing HIV prevalence data.

Hierarchical Bayesian Modeling
Obtaining meaningful confidence bounds for the predicted MSM population is critical. The uncertainty of the regression parameters and especially the spatial parameters are often ignored in frequentist predictions. Furthermore, for regions with missing HIV prevalence data, the kriging estimates in Equation 1 are accompanied by the kriging variances which can be large if the location is far from the data locations. Hence ignoring this source of uncertainty can lead to narrow prediction bounds. In a frequentist setting, it is unclear how to utilize the kriging variance when the imputed HIV prevalence will be used as a covariate to predict MSM population size. However, we can seamlessly integrate this multistage procedure into a hierarchical Bayesian model which allow for proper propagation of uncertainty associated with all different parts of model estimation into the final prediction of population size estimates.
Let S denote the set of locations where HIV data are available. Also, for any location s, let N (s) and H(s) respectively denote the male population of 18-29 years and the HIV prevalence. Finally, defining y j (s i ) = y ij , w ij = (y ij,u − y ij,l ) 2 and β = (β 0 , β 1 , β 2 ) , the full specification of the hierarchical model is given by: The top row of 2 is the log-normal regression model for the MSM percentages, the middle row is the spatial Gaussian Process model for HIV imputation and the bottom two rows are the parameter priors. Gamma(a, b) denotes the Gamma distribution with shape parameter a and rate parameter b and Unif(a, b) is the uniform distribution on (a, b). We use the  prevalence is relatively weak. The estimates of the spatial parameters indicate a strong spatial dependence in HIV prevalence, previously insinuated by the variograms in Figure   2.

Prediction
We use composition sampling to obtain posterior predictive distributions of MSM population size at a new location. If s ∈ S, then this is simply given by the samples {y  We observed that for some areas with very low population, the predicted MSM population percentage was abnormally high. Further investigation into this reveals that the minimum total male population among the five regions with survey data corresponds to the 36 th percentile of the empirical distribution of total male population among all the 61 regions.
Hence, the training data corresponds to larger areas with greater population and does not inform much about the regression relationship in regions where the population is very low. This, combined with the strong negative value of β 1 in Table 3 results in such high estimated MSM fractions.
As a heuristic remedy, we assume that the negative relationship flattens out below a certain population threshold. We truncate the total male population at the 10% quantile of the empirical distribution and use these thresholded values for prediction. While this is ad hoc, more formal methods like estimating the truncation point based on the data will always truncate within the data values, whereas replacing a linear regression with a general monotonic function will involve more parameters and hence is infeasible for our small dataset. Of course, our truncation does not affect parameter estimation as all the total population values for the training data are above the truncation level. This issue is less severe for HIV prevalence as the observed values for the five regions better represent the empirical distribution of HIV prevalence. Since, it also has a much weaker association with population size of MSM, we do not truncate the HIV prevalence values. represents observed data. IQR 95% is the 95%

Size estimates
Inter-quantile range, i.e. width of the 95% credible interval.
typically varied between as low as 0.5% to around 10%. The highest MSM percentages are predicted in Katiola, Kouassi-kouassikro and Bettie. However, these areas also had the widest credible intervals indicating the large uncertainties associated with the predictions. Figure 6 demonstrates the impact of HIV imputation on the prediction uncertainties. Since, the variance and width of confidence intervals of log-normal distribution are proportional to the mean, we use relative width (ratio of the 95% cofidence interval width to the estimate) as a more meaningful measure of uncertainty. In Figure 6a we plot the predictions of represents observed data. IQR 95% is the 95% Inter-quantile range, i.e. width of the 95% credible interval.
MSM population percentage against the relative width. We observe that the relative width was in general larger for locations with missing HIV prevalence data. This is expected as the Bayesian model properly propagates the uncertainty associated with the imputation of HIV prevalence in the final predictions. This is nicely reflected in the CI widths. In is the 95% Inter-quantile range, i.e. width of the 95% credible interval.
as 100.   Figure 6: Impact of imputing HIV on uncertainty: Plots of (a) relative 95% CI widths versus estimates and (b) relative 95% CI widths versus the leverages leverages. Red and blue dots correspond to respectively regions with and without HIV prevalence the joint model in (2). Bao et al. (2015) have demonstrated how to estimate populations sizes by incorporating data from multiple surveys and other data sources, in a fully Bayesian setup. While we have multiple estimates for each region, all of them are based on RDS and it is not clear how to adapt that approach when working with estimates from RDS. This once again highlights the need for more research on properly using RDS data in hierarchical models. Other relevant datasets, like MSM populations in other countries, if available, can be potentially leveraged to borrow strength in parameter estimation. However, care has to be taken when leveraging data from other countries, as different countries often have entirely different key population dynamics and borrowing strength may not be meaningful. Perhaps, more useful will be data for other associated key populations like Female Sex Workers, for the same set of regions. The correlation can be exploited in a multivariate setup to improve estimation of both populations.
The small sample size of numbers of cities with population size data for MSM has been a major limiting factor in modeling. However, the predictions presented here from this model represent the first empirically calculated estimates of the numbers of MSM in all areas of Côte d'Ivoire prioritized in the HIV response. This scenario of limited centers with measured population size is also not uncommon in the areas of the world where HIV prevalence is the highest given that these settings often also tend to criminalize same-sex practices or at least have significant stigma affecting MSM. In Southern and Eastern Africa, there is often only HIV prevalence data and size estimate data in one or a few urban centers for MSM though where studied, the HIV prevention and treatment needs are significant across these countries. While Côte d'Ivoire is in West Africa, it has one of the larger HIV epidemics in the region though limited information has been traditionally available for the numbers of MSM and the HIV burden among them. Thus, while the estimates provided here require further validation by supporting data to be collected in additional centers for MSM where predictions were completed, in the interim, these estimates can support the planning of the scale and content of HIV prevention and treatment programs for MSM in Côte d'Ivoire. Specifically, areas with wide credible intervals should be targeted for future surveys to improve modeling precision. Subsequently, validation and additional data points, will highlight the strengths and weaknesses of the current approach and pave the way for modeling improvements.

Funding and acknowledgment
We would like to thank the participants of these studies and community groups that support them for engaging in the study. Data collection in Côte d'Ivoire was funded by the Global