Using molecular transmission networks to understand the epidemic characteristics of HIV-1 CRF08_BC across China

ABSTRACT HIV-1 CRF08_BC has become a major epidemic in heterosexuals and intravenous drug users (IDUs) in southern China. In order to evaluate the trends of its epidemic and facilitate targeted HIV prevention, we constructed the genetic transmission networks based on its pol sequences, derived from the National HIV Molecular Epidemiology Survey. Through retrospective network analysis, to study the epidemiological and demographic correlations with the transmission network. Of the 1,829 study subjects, 639 (34.9%) were clustered in 151 transmission networks. Factors associated with increased clustering include IDUs, heterosexual men, young adults and people with lower education (P < 0.05 for all). The IDUs, MSM, young adult and person with low education had more potential transmission links as well (P < 0.05 for all). The most crossover links were found between heterosexual women and IDUs, with 30.9% heterosexual women linked to IDUs. The crossover links heterosexual women were mainly those with middle age and single (P < 0.001). This study indicated that the HIV-1 CRF08_BC epidemic was still on going in China with more than one third of the infected people clustered in the transmission networks. Meanwhile, the study could help identify the active CRF08_BC spreader in the local community and greatly facilitate précising AIDS prevention with targeted intervention.


Introduction
Inferring human immunodeficiency virus (HIV) molecular networks via sequence analysis can provide insight into the dynamics of viral transmission in populations and sub-populations [1][2][3][4]. As genetic similarity between HIV sequences generated from two persons with HIV, it suggests that these people likely belong to the same transmission network [5][6][7]. Previous studies using these molecular transmission methods have: revealed transmission routes, predisposing factors and epidemiologic linkages [6], and inform prevention efforts [2,8,9].
As the HIV epidemic in China has diversified, many new subtypes have been introduced and new circulating recombinant forms have been created. Currently, the four major strains, CFR01_AE, CRF07_BC, CRF08_BC and B' (Thailand variant of subtype B), cause more than 90% of all infections in China [1,10]. It is noteworthy that CRF08_BC has been involved in many newly reported new HIV-1 unique recombinant form by further recombination with other strains [11][12][13][14]. Previous local molecular epidemiology studies have also found that CRF08_BC has become one of the predominant subtypes among people with injection drug use and heterosexual risk, especially in the southwestern and eastern China [11,13,15]. These CRF08_BC studies have been largely cross-sectional [16,17]. Thus, pattern of transmission of CRF08_BC across all of China remain largely unknown and can affect the feasibility of interventions.
This study used China nationwide epidemiology survey and HIV sequence data to: infer a longitudinal HIV-1 CRF08_BC transmission network for the first time, provide an insight with depth in CRF08_BC transmission between risk populations, and elucidate potential links among risk networks. The results of these analyses can be used to provide effective information for HIV prevention efforts.

Sources of sequence data
All available HIV-1 CRF08_BC sequences covering the entire protease (PR) and partial reverse transcriptase (RT) region (HXB2 genome location 2253-3554) were collected from the China National Center for AIDS/STD Control and Prevention (NCAIDS) molecular database (the previous investigations was conducted in a cross-sectional method using stratified random sampling by province), moreover, partial sequences were downloaded from Los Alamos HIV sequence database (LANL, http:// www.hiv.lanl.gov, default parameters, last updated: January 2018). Our analysis included one sequence per person in the database. When more than one sequence was available (4.1% of persons) only the HIV sequence obtained from the earliest time point was included. We performed a BLAST search by using the collected sequence and selected top 10 sequences with the highest homology with the references for each sequence which was further determined the subtype through phylogenetic tree analysis and excluding repeated sequence [18,19]. Since network inference is less precise with shorter sequences, we eliminated sequences that were less than 900 nucleotides in length [5,20].

Inference of molecular network
To infer the molecular network, the genetic distance, Tamura-Nei 93 nucleotide substitution model (TN93) was used to calculate pairwise distance between all pairs of HIV pol sequences using HYPHY version 2.2.4 [1,10,21]. Using TN93 distance to compute single nucleotide transversion rates and 2 nucleotide conversion rates, as they are the sophisticated genetic distances, they can be represented by closed-form solutions that allow fast distance calculations [22]. We conducted an initial analysis using a genetic distance threshold of 0.7% substitutions/site, since this distance discriminated the maximum number of clusters and has a relatively high resolution in the genetic network (Supplementary Figure S1) [7,20]. And those conform with the criterion were deemed to have potential transmission relation. The network data were visualized using a custom R script in the network package in the R software version 3.5.1.

Analysis of molecular network characteristics
After the network was inferred, we analysed characteristics of the transmission network, including the number of sequences (nodes), links (edges), and the degrees which each individual in the network was defined as the number of links with other individuals. Correlates of clustering were investigated using bivariate analyses and multivariable logistic regression models, and the groups analysed included both all individuals and clustering versus not clustering individuals [10,23]. The topological properties and parameters of the CRF08_BC networks (e.g. nodes, edges, density, and clustering coefficients) were computed with the Cytoscape version 3.2.0 NetworkAnalyzer tool (https://med.bioinf.mpi-inf.mpg.de/ netanalyzer/index.php) [24,25].
Furthermore, we observed the annual dynamics of the significant dense clusters (stationary/expansion) during 2000-2017. For this purpose, modular framework and clusters of tightly interconnected nodes were identified using the Molecular Complex Detection (MCODE) 1.3 plug-in application with default parameters (including degree cut-off of 2, node score cut-off of 0.2, K-core of 2, and maximum depth from seed of 100). This rank and score larger more dense complexes higher in the molecular network results [26][27][28]. Meanwhile, in order to elucidate the effect of the clusters of different growth patterns of persons living with HIV to the epidemic situation, we can define the high-transmission speed clusters and low-transmission speed clusters according to the size and increment situation of the sub-cluster [29].

Analysis of individuals with multiple potential transmission links
In the network, two groups of people were compared including: individuals who were linked to 1-3 other persons and individuals who linked to ≥4 others (i.e. those in the highest quartile of those persons with any links) [6,30,31]. Multivariable logistic regression was applied to determine factors associated with having ≥4 links. The model included age at HIV diagnosis, sampling year, education, marital status, transmission category, and domicile of report.

Analysis of mixing between people reporting different transmission risk
To understand the characteristics of links between persons of the different risk groups and analyse the potential transmission partners, we explored (1) the number of individuals who have links to different risk groups in the transmission networks and (2) the characteristics of crossover persons. All analyses were stratified by transmission categories (including heterosexual contact, persons who inject drugs, male-to-male sexual contact and unknown), geographical distribution (including eastern, northern, central, and southwestern China), age (stratified into <20, 20-29, 30-39, 40-49, 50-59, and ≥60), education level (including primary school, middle school, high school, and above college), marital status (including single, married, and divorced or widowed), and the information not reported or not identified was classified as unknown. The chi-square tests (χ 2 ) and Fisher exact tests were applied to examine differences between subgroups. These datasets were analysed in SPSS version 22.  Table 1).

Characteristics of transmission networks
Under the threshold of 0.7% genetic distance, 639 nodes linked to at least one other sequence, thus considered as clustered in the inferred molecular network (Supplementary Figure S1). This resulted in 152 clusters ranging in size from 2 to 254 sequences with a total of 2144 links. The number of links per sequence ranged from 1 to 133 (median: 1, interquartile range: 1-4). Figure 1 illustrates the region and risk-specific distribution of HIV-1 CRF08_BC. In all clusters within the molecular network, people reporting IDUs and heterosexual risk were the largest risk groups, accounting for 207 (32.4%) and 394 (61.7%) of nodes, respectively. People with MSM risk accounted for only 13 (2.0%) nodes, and people with BT and Unknown risk together accounted for only, 25 (3.9%) of the clustering nodes. CRF08_BC infections were the most concentrated in the southwestern region (500, 78.2%), followed by eastern (102, 16.0%), central (26,4.1%), and northern (11, 1.7%). Furthermore, there was one large cluster in the network (n = 254 persons) which mainly belonged to southwestern and eastern region sequences, and the majority of the persons in this cluster reported IDUs and heterosexual risks (Table 1).

Characterizing network growth
To clarify the dynamic changes of the main CRF08_BC subtype molecular network, MCODE analysis identified two significant modules clusters with network scores ranging between 33.6 and 19.4. Therefore, we investigated the cluster changes in a subgroup analysis. As shown in Figure 2, the type A cluster which had 35 nodes and 572 edges, however, only has added 8 new nodes from 2000 to 2017. The proportion of new clusters in the transmission cluster was significantly reduced after 2010 and it belongs to the low-growth network. Of note, almost all of the new individuals belong to IDUs risk category. These networks have gradually weakened the impact on the epidemic situation. Additionally, the IDUs and heterosexual groups constituted the type B cluster with 26 nodes and 242 edges. The cluster grew from 7 nodes in 2000 to 26 nodes in 2017. Number of cluster new cases increased year by year and it belongs to the high-growth network, and its influence on the epidemic has progressively increased.

Factors correlated with clustering
Next, we explored which demographics and reported risks factors associated with clustering. In bivariate analyses (Supplementary Table S1), individuals whose sequences clustered were more likely to be in the heterosexual or IDU risk factor group, the middle-aged person and the persons with lower education. As expect, in multivariable analysis (Table 2), increased odds of clustering were indicated among persons reporting IDUs risk, men reporting heterosexual risk, people diagnosed between 30 and 49 years of age, and the persons with a lower level of education. Further, the clustering frequency differed across China with persons who resided in southwestern and eastern China at diagnosis had significantly higher odds of clustering. Moreover, participants from northern and central regions were less likely to fall into clusters than participants from the other sites, probably due to the fewer genotypes of HIV-1 CRF08_BC (lower subtype density) in these locations, rather than differences in transmission risk. The nodes sampled after 2008 and the MSM risk factor were significantly less likely to cluster (all P < .05). There was no difference in proportion clustering by ethnicity and marital status.

Nodes with multiple potential transmission links
On the whole, 2144 links were constructed by 639 nodes in this transmission networks, among which 47.3% included 1 link, 21.8% included 2-3 links, and 31.0% had ≥4 links ( Figure 3). Individuals with ≥4 links were more likely to be aged 30-39 years (P = .003), the persons with lower education (P < .001) and report IDUs/MSM risk factor (P < .05, P = .046; Table 3). Similar results were seen using a bivariate logistic regression model, except that the education level of senior high school and the MSM risk factor were marginally associated with ≥4 links clustering (P = .14 and .06; Supplementary Table S2). Overall, although these individuals with ≥4 links represented only 198 of 639 individuals (31.0%), they were extensively involved in the vast majority links (85.0%) of the CRF08_BC network.

Transmission links mixing by risk category
Among persons reporting a history of injection drug use, 81.2% were more commonly linked to other persons reporting IDUs risk, meanwhile, some were also linked to women reporting heterosexual risk (45.4%; Table 4). For the heterosexual risk groups of

Discussion
Since the HIV-1 CRF08_BC initial introduction to China in 1997 [32], it had a nontrivial impact on China public health. Previous studies ordinarily have focused on the HIV-1 CRF01_AE and CRF07_BC epidemics in China, but not CRF08_BC [33][34][35]. This research represents the first analysis of nationwide CRF08_BC data, infers a plausible genetic distance threshold 0.7% and established the transmission network. A previous study has been demonstrated that a lower genetic distance threshold can distinguish recent transmission in outbreak event and improve the probability that potential transmission partners share than epidemiological connection [20,36]. Therefore, from a public health perspective, using this genetic distance threshold can not only obtain    the maximum number of clusters in the CRF08_BC network but also better identify the recent partners in the growing transmission cluster.
Our study not only analysed the formation characteristics of networks but also explored mixing by risk category and it yielded significant new insights into HIV transmission among diverse populations. Not all cluster growth is equivalent, consequently, taking measures focused on the core group curbing risk behaviours are probable to reduce HIV infection transmission among other risk groups [7]. Moreover, our analysis used behavioural and epidemiologic information to investigate transmission patterns characteristics. Although CRF08_BC is mainly located in southwestern China, previous studies have described high frequencies CRF08_BC transmission in eastern China that may be attributed to high levels of immigrant labour in this region [35,37]. As a result, due to frequent population mobility, CRF08_BC has the possibility of a sharp increase as well in China.
HIV-1 CRF08_BC viral diversity varied across different risk populations in China [32,38]. Primarily, our research indicates that IDUs are primary transmission risk groups in the past. Notably, nowadays, the heterosexual contact has overtaken IDUs transmission routes as the greatest risk group for CRF08_BC infections nationwide (67.4 and 23.3%). It showed an obviously changes divergence of CRF08_BC transmission population which had resulted in the highest risk groups changed from IDUs to heterosexuals. Currently, the overall clustering rate of the network has decreased, but the results of the dynamics of network cluster growth trend indicated that not all cluster growth is equivalent [5,23]. Not surprisingly, by comparing the propagation networks of the two growth states, we found that the IDUs group in the network were slowly increasing or quiescent state in recent years, while the heterosexual transmission population is prominently active [1,10]. This suggests that our active/rapid growth network was the target that should be the focus of intervention in the future. Furthermore, we should strengthen the monitoring of heterosexual risk populations in the network, early diagnosis and treatment of infected people, and strengthen prevention and control management for uninfected people. For lowgrowth networks, it may also reflect the positive interventions for IDUs that have been taken in recent years, moreover, the effects of continuous advancement of treatment standards in China [20,39].
We found that clustering was associated with IDUs and heterosexual (male) group, young persons (aged  30-49 years), and lower education, suggesting that they are disproportionately involved in clusters associated with HIV transmission [16,20,40]. These factors also observed in the analysis of multiple potential transmission links. Previous studies report also has shown that nodes with more links in the network may play the role of "core population" who had a higher risk of transmission [41,42]. The level of behavioural intervention received by different educational schools is not balanced. Especially in China, influenced by traditional concepts, it is difficult to receive education about HIV/AIDS in the primary and middle schools. Therefore, some MSM/IDUs groups lack self-protection awareness due to their low education level. And in recent years, the MSM has gradually become a new group at higher risk for CRF08_BC transmission. This suggests that CRF08_BC has signs of transmission from the particular high-risk populations (IDUs and heterosexual) to the MSM population. Although MSM is less likely to be clustered at present, because the active transmission characteristics of the MSM population, it is easy to form "core population" with multiple potential transmission links. This part of the population has the ability to cause widespread transmission of CRF08_BC. Just like since CRF07_BC was first discovered, it has spread rapidly among China provinces through the MSM population [43]. Taken together, these results indicated that prioritizing IDUs, MSM groups, lower education and young person's intervention was likely to be significantly effective in preventing the spread of CRF08_BC [5,6,44].
In this research, the IDUs infection among CRF08_BC likely originates from two predominant sources. Firstly, a large proportion of these groups were linked to other IDUs. Besides, we also found that many IDUs were linked to heterosexual women (45.4% of all links), which indicated that a substantial proportion of IDUs might be involved in transmission with heterosexual women. Moreover, our study found networking by heterosexual, with women much more likely to link to transmission networks with other IDUs. In contrast, there were fewer links with IDUs in the heterosexual men. Although we are not able to establish directionality of transmission from these sequences, suggesting that a portion of heterosexual women are related to IDUs transmission, these transmissions may represent that women are at risk for HIV infection through multiple routes. In the heterosexual transmitted infection population, we found that young (aged 30-49 years) and singlehood persons more commonly had multiple potential transmission partners. Notably, the vast majority of young IDUs (aged 30-39 years) were more likely to link to other heterosexual. This type of assortative mixing may be due to the reason that they are more likely to engage in Two-way propagation behaviour and they are disproportionately involved in clusters associated with HIV transmission [1,2,45]. Together, these findings suggested that prioritizing middle-aged persons of IDUs and heterosexual women infected may be the most effective strategies in preventing transmission of HIV. And identifying and controlling these "bridge population" is essential to curb the spread of HIV among the population. Previous research has also revealed that using molecular network features to target intervention can effectively reduce the spread of HIV [6].
Although the results of our research were robust, there are still some limitations. As sequence in this study may be affected by selection/sampling bias due to limited funding for HIV-1 sequence projects collected by molecular epidemiological survey and differences in data integrity by location and other characteristics. Nevertheless, to the best of our knowledge, this study has so far included the highest number of CRF08_BC strain sequences for transmission networks analysis in China. In the subsequent investigation, we expect this potential bias to decrease as we expand the number of participating regions and the integrity of the data. Additionally, the sample transmission category is based on self-reported information, we could not access available clinical following data, such as CD4 count, viral load, and accurate diagnosis time. As a result, there may have been other associations that we did not have power to detect. In the future, we plan to conduct more detailed analysis to understand these molecular epidemiological survey data.
The genetic transmission network surveillance and analysis for HIV could offer tools to understand transmission dynamics among various regions distribution and diverse risk population of CRF08_BC in China. As the scope and coverage of HIV transmission cluster surveillance continue to grow, we expect to be able to make effort inferences. These observations may also be significant for targeted prevention interventions, which could use the HIV genetic transmission networks, combined with clinical data and epidemiology data, to help us better identify potential transmission relationship and further make a better assessment of CRF08_BC transmission trends at the population level.

Data availability statements
All data included in this study are available upon request by contact with the corresponding author.