High proportion of tuberculosis transmission among social contacts in rural China: a 12-year prospective population-based genomic epidemiological study

ABSTRACT
 Tuberculosis (TB) is more prevalent in rural than urban areas in China, and delineating TB transmission patterns in rural populations could improve TB control. We conducted a prospective population-based study of culture-positive pulmonary TB patients diagnosed between July 1, 2009 and December 31, 2020 in two rural counties in China. Genomic clusters were defined with a threshold distance of 12-single-nucleotide-polymorphisms, based on whole-genome sequencing. Risk factors for clustering were identified by logistic regression. Transmission links were sought through epidemiological investigation of genomic-clustered patients. Of 1517 and 751 culture-positive pulmonary TB patients in Wusheng and Wuchang counties, respectively, 1289 and 699 strains were sequenced. Overall, 624 (31.4%, 624/1988) patients were grouped into 225 genomic clusters. Epidemiological links were confirmed in 41.8% (196/469) of clustered isolates, including family (32.7%, 64/196) and social contacts (67.3%, 132/196). Social contacts were generally with relatives, within the community or in shared aggregated settings outside the community, but the proportion of clustered contacts in each category differed between the two sites. The time interval between diagnosis of student cases and contacts was significantly shorter than family and social contacts, probably due to enhanced student contact screening. Transmission of multidrug-resistant (MDR) strains was likely responsible for 81.4% (83/102) of MDR-TB cases, with minimal acquisition of additional resistance mutations. A large proportion of TB transmission in rural China occurred among social contacts, suggesting that active screening and aggressive contact tracing could benefit TB control, but contact screening should be tailored to local patterns of social interactions.


Introduction
Tuberculosis (TB) remains an important threat to global health, with an estimated 10 million people worldwide falling ill and at least 1.4 million deaths in 2019 [1]. Although the global incidence of TB has declined in recent years, there is still a significant gap between current rates of decline and the goals of the WHO End TB Strategy [2,3].
It is estimated that approximately one-third of TB patients worldwide were not diagnosed [1], so measures that increase diagnosis, especially early in the disease, are key to reducing transmission and thereby controlling TB. One of the core strategies to increase TB diagnosis and achieve global TB control is active case-finding [4]. The World Health Organization (WHO) recommends systematic TB screening of the general population in areas with an estimated TB prevalence of 0.5% or higher [5]. For areas with a lower TB prevalence, a cost-effective strategy involves identifying populations at high risk for transmission and then implementing targeted screening.
Genomic-epidemiological studies of tuberculosis can help identify populations at high risk of transmission, but the limited discriminatory power of previously employed genotyping methods and the barriers to epidemiological investigations have made it difficult to define these populations in China [6]. The populations at high risk for TB in other countries, such as HIV-positive, homeless, and drug abusing individuals, are less important contributors to the TB burden in China [5,7]. Instead, TB screening in China is focused principally on the elderly and internal migrants, but because of the large size of these populations in China, generalized screening is not practical. Most genomic-epidemiological studies of tuberculosis in China have been conducted in large cities, but China also has extensive rural areas where the prevalence of TB is up to three times higher than in urban regions [8]. Therefore, identifying highrisk populations and delineating transmission patterns in rural areas is important for improving TB control in China. To define the transmission patterns of the rural TB burden and thereby provide guidance for improving TB control, we conducted a 12-year prospective population-based genomic-epidemiological study in two rural counties in China.

Study design and participants
Wusheng and Wuchang are rural counties in southwestern and northeastern China, respectively (Supplementary Figure S1). Wusheng is located in the east of Sichuan Province, with an area of 960 square kilometres and an estimated 556,000 inhabitants in 2020. Wuchang is located in the south of Heilongjiang Province, with an area of 7,512 square kilometres. The study included 14 townships (2,299.5 square kilometres) in Wuchang, with a combined estimated 419,000 inhabitants in 2020. The average annual reported TB incidences were 95.8 (per 100,000 population) in Wusheng and 68.9 in Wuchang (Supplementary Figure S2). Lower incidences recorded in 2020 were probably inaccurate and a result of reduced TB diagnosis due to the COVID-19 pandemic.
The study population was comprised of all culturepositive pulmonary TB patients 15 years or older who were diagnosed between July 1, 2009 and December 31, 2020 in the regions. The study was approved by the institutional review board of Biomedical Sciences, Fudan University and all enrolled patients provided written informed consent.

Diagnostic procedures
The Wusheng County Centre for Disease Control and Prevention and the Wuchang City Centre for Tuberculosis Control and Prevention are responsible for the diagnosis and treatment of TB in their respective areas. Individuals with TB-like symptoms or abnormal chest radiographs in general hospitals and township health centres are referred to these designated medical institutions for diagnosis by sputum smear and culture. Sputum specimens were collected when the patients presented for TB diagnosis, before starting treatment. They were cultured in Löwenstein-Jensen media during 2009-2016 and in liquid media during 2017-2020. Culture-positive isolates were inactivated at 80°C for 30 min and stored in the −20°C refrigerator of the site laboratory.

Whole-genome sequencing
All stored clinical strains isolated during the study period were re-cultured from frozen stocks. DNA was extracted using the cetyl trimethyl ammonium bromide (CTAB) method and sequenced as previously described [9]. Raw sequence reads were trimmed with Sickle (version 1.33) and aligned to the inferred Mycobacterium tuberculosis complex ancestor sequence [10] using BWA-MEM. SAMtools (version 1.3.1) and Varscan (version 2.3.6) were then used to identify single nucleotide polymorphisms (SNPs). Pairwise SNP distances were calculated based on fixed SNPs (frequency ≥75%), excluding those in drug-resistance associated genes and repetitive regions of the genome (e.g. PPE/PE-PGRS family genes, phage sequences, and insertion or mobile genetic elements). A genomic cluster was defined as strains differing by 12 or fewer SNPs, consistent with linkage through recent transmission [11].
Based on the identified SNPs, a phylogeny tree was constructed with RAxML-NG (version 1.0.2) software, using the maximum likelihood method with 100 bootstraps and visualized with Interactive Tree of Life (https://itol.embl.de/). Strains were classified into different lineages according to Liu et al [12]. Strains belonging to lineage 2 (L2), also termed the Beijing family, were divided into L2.3, representing "modern" Beijing and other sub-lineages considered "ancient" Beijing [13].
Drug-resistance profiles were predicted for 14 anti-TB drugs based on the mutations reported to be associated with resistance [14], using Whole-genome sequencing (WGS) data as described previously [13]. Pan-susceptible TB was defined as strains without mutations associated with resistance to the four firstline drugs (isoniazid, rifampicin, ethambutol and pyrazinamide). Multidrug-resistant TB (MDR-TB) was defined as strains with mutations associated with resistance to at least isoniazid and rifampicin. Strains with mutations associated with resistance to any of the four first-line drugs but not MDR were termed other drug resistant (DR) TB.

Epidemiological investigation
We conducted epidemiological investigations twice. First, at the time of TB diagnosis we collected demographic, clinical and laboratory information on each patient and also collected data on their close contacts. We then conducted in-depth epidemiological investigation of patients whose strains belonged to genomic clusters, including their residences, workplaces and public community centres and facilities frequented in the three years prior to their TB diagnosis. Putative transmission networks were constructed based on the structure of the genomic phylogeny and the epidemiologic links. The epidemiologic links were defined as confirmedpatients who knew each other and had a history of contact before diagnosis, or probablepatients who did not know each other but lived in the same village. We classified the confirmed epidemiological links into 4 categories according to the closeness of the relationship between the patients: family the patients were immediate relatives; relativesthe patients were related but not immediate relatives; communitypatients lived in the same village; and aggregated settings outside the communitypatients who had contact outside the community in settings such as teahouses, psychiatric hospitals, schools and nursing homes.

Statistical analysis
The clustering rate was calculated by the N method and the cumulative clustering rate was obtained by calculating the clustering rate in 2009 and then adding the TB patients from every successive year of the study, as previously described [15]. Changes in temporal trends of annual clustering were detected using joinpoint regression analysis (Supplementary methods). The distribution of continuous and categorical variables between groups was compared using the Wilcoxon rank sum test or the chi-square test. Logistic regression analysis was used to calculate the odds ratios (OR) and 95% confidence intervals (CI) for risk factors associated with genomic clustering. Variables with p-values less than 0.2 in the univariable analysis were included in the multivariable analysis to calculate the adjusted odds ratios (aOR). Factors with a p-value less than 0.05 in the final model were considered statistically significant. All analyses were performed in Stata (version 14.0).

General population characteristics
Between July 1, 2009 and December 31, 2020, 5502 and 3435 pulmonary TB patients were officially registered in Wusheng and Wuchang, respectively, of whom 1517 and 751 had positive sputum cultures.
After excluding 103 strains that failed re-culturing, 1452 and 713 strains were sequenced. We further excluded 123 patients with non-tuberculous mycobacteria isolates and 54 patients with recurrent TB. The final analysis thus included 1289 and 699 patients from Wusheng and Wuchang, respectively (Figure 1), whose characteristics are shown in Table 1. The proportions of patients who were students or less than 25 years old were higher in Wusheng [15.4% (198/ 1289) and 7.6% (98/1289)] than in Wuchang [9.9% (69/699) and 2.9% (20/699)]. Patients in Wuchang were more likely to have had longer diagnosis delays, chest cavities and smear positivity.
Whole genome sequencing for genotyping and drug resistance prediction Phylogenetic analysis revealed that most strains belonged to the Beijing family lineage (69.0%, 1372/1988) ( Figure 2), with more belonging to the modern Beijing sublineage (74.1%, 1016/1372) than ancient Beijing sublineages (25.9%, 356/1372). All of the non-Beijing strains belonged to L4 (99.8%, 615/616), except for one L1 strain. Analysis of the genome sequences for mutations conferring resistance to 14 anti-TB drugs showed that the drugresistance profiles were similar between the two sites ( Figure 2; Table 2; Supplementary Table S1). In total, there were 1716 (86.3%) pan-susceptible strains and 272 (13.7%) strains with mutations associated with resistance to at least one anti-TB drug. The MDR strains accounted for 5.1% (102/ 1988) of the total strains, and 29.4% (30/102) of MDR strains were predicted to harbour mutations associated with resistance to the fluoroquinolones.

Genomic clustering analysis of M. tuberculosis
To estimate the level of recent transmission, we calculated the clustering rate between 2009-2020. A total of 624 (31.4%, 624/1988) strains, 347 (26.9%, 347/1289) from Wusheng and 277 (39.6%, 277/699) from Wuchang, were grouped into 225 genomic clusters containing 2-13 strains ( Figure 2). To understand the dynamics of clustering trends in the two sites, we calculated the cumulative clustering rate and found that the rate gradually increased as more strains were analyzed, but the trend eased between 2015 and 2016 in both sites. As seen in Figure 3, the clustering rate increased significantly after 2016, especially in Wusheng (p = 0.004, Supplementary Figure S3), rising from 21.8% in 2017-26.9% in 2020. The increased clustering rate may have been the result of strategies we implemented starting in 2017: active case-finding based on symptoms that identified 15-25% more patients; stepwise sputum collection that improved the diagnostic quality of the sputum [16]; and sputum cultures in liquid media that increased the percentage of culture-positive patients from 20-30% to 40-50% (Supplementary Figure S4).

Analysis of MDR-TB
In the analysis of risk factors for MDR-TB, the only association was a higher proportion of MDR in retreated patients (16.1%, 24/149) than in new cases  Table S2). To investigate the cause of MDR we analyzed the number of clustered and non-clustered patients in the 102 MDR patients and found that 34 were clustered. While retreated MDR-TB cases could have secondary resistance that was acquired during treatment, new cases were presumably infected with MDR strains and represent primary resistance. Among the nonclustered MDR patients, 49 were new patients. Therefore, in total, 81.4% (83/102) of MDR-TB patients were likely caused by the transmission of MDR strains. We also looked for accumulated additional drugresistance mutations along the transmission chain. Among 32 putative MDR-TB transmission in 11 genomic-clusters, we found just two events (6.3%, 2/32) indicating the acquisition of additional resistance (Supplementary Figure S5), which is lower than the incidence of acquired resistance reported in urban areas of China [11].

Epidemiological links of genomic-clustered patients
To further delineate the transmission links between the genomic-clustered patients we performed an indepth epidemiological investigation on all patients with clustered isolates. These investigations were completed for 469 Supplementary  Tables S3 and S4.
The 196 patients with confirmed epidemiological links were all close contacts. Among them ( Figure  4A), transmission between family contacts was identified in 32.7% (64/196) and between social contacts in 67.3% (132/196). To further analyze transmission patterns, we divided the social contacts into three categories: relatives outside of the immediate family, community contacts, and contacts who shared aggregated settings such as teahouses located outside the community. Epidemiological links were found in community contacts in 14.5% (16/110) of the confirmed transmission in Wusheng but 51.2% (44/ 86) in Wuchang. Surprisingly, the proportions were reversed for confirmed epidemiological links in aggregated settings outside the community, with 43.6% (48/ 110) in Wusheng and 15.1% (13/86) in Wuchang. The percentages of confirmed links to relatives beyond the immediate family were low in both sites, with 4.5% (5/ 110) in Wusheng and 7.0% (6/86) in Wuchang.
We constructed putative transmission networks based on genomic phylogeny and epidemiological links. Figure 5 shows examples of three of the largest clusters. The inhabitants of the villages in Wuchang live in fairly close proximity, often with daily contact in village shops that may explain the high level of transmission occurring within the community ( Figure 5A). In Wusheng, by contrast, the majority of identified epidemiological links were outside the community and mainly in schools and teahouses ( Figure 5B and C). The teahouses in Wusheng are generally located in townships outside the villages and are frequented by people from several different villages.

Time interval between TB diagnosis
We then analyzed the pairwise time interval between TB diagnosis in the clustered cases belonging to the different categories of contacts ( Figure 4B). We found no statistical difference (p = 0.90) in the time interval between diagnosis of family contacts (3.1 years, interquartile range [IQR] 0.9-5.9 years) and community contacts (3.0 years, IQR 1.3-4.6 years). Unexpectedly though, the average diagnostic time interval between contacts who shared the aggregated settings outside the community (1.1 years, IQR 0.4-2.3 years) was significantly shorter than for family (p < 0.01) or community contacts (p < 0.01). This shorter interval was perhaps because most of the contacts who shared aggregated settings outside the  community were students, and the average time interval between the diagnosis of student contacts (0.9 years, IQR 0.3-1.4 years) was significantly shorter (p < 0.01) than for all other contacts (1.9 years, IQR 0.7-3.6 years).

Risk factors for genomic clustering
Finally, we used logistic regression to identify risk factors associated with clustering. The univariate analysis found that occupation and the Beijing strain were significantly associated with clustering, and both associations persisted in the multivariate analysis (

Discussion
To our knowledge, this was the longest longitudinal population-based genomic epidemiological study of TB transmission in rural China. The overall cumulative clustering rate during the study period was 31.4%. Among 196 genomic-clustered patients with confirmed epidemiological links, 32.7% of transmission occurred between family contacts and 67.3% between social contacts. Of all MDR-TB cases diagnosed during the study, 81.4% were likely from transmission of MDR strains. The average time interval between the diagnosis of clustered student contacts was much shorter than for non-student contacts. Close contacts of TB patients are at a high-risk for TB transmission. A recent systematic review suggested that the pooled prevalence of TB in close contacts was 3.6% (95% CI: 3.3-4.0), and that contact investigation could increase TB case notification and thereby decrease the incidence of TB in the population [17]. In developed countries, index cases have been found to have 6-12 close contacts [18][19][20][21] who could potentially constitute 10-20% [22,23] of the total TB burden. Accordingly, the WHO strongly recommends systematic screening for TB disease among close contacts [5,24]. In China, however, the importance of screening close contacts has been underappreciated and its role in case finding has been limited. In China, only 2-3 close contacts per index case were identified and most of these were family members. It was therefore estimated that close contacts would contribute only about 1.0-3.3% of the total TB patients in China [25,26]. A recent community-based study of active case-finding that included 320,000 people in 10 provinces in China during 2013-2015 concluded that screening of close contacts contributed less then 2% of TB patients [7].
Our study found that 41.8% of clustered patients were close contacts of other patients and contributed 9.9% of the total TB patients, suggesting a high risk of TB transmission among close contacts in rural China. However, it was only through in-depth investigation of these genomic-clustered patients that we could identify the epidemiological links that are usually overlooked in routine contact investigation. In contrast to our findings, the routine TB contact investigations that were conducted in the study sites during the duration of our study registered only 2.4 close contacts per index case, who contributed just 1.8% of the total patients. Without the WGS data that identified clustered patients and directed our investigations, nearly 80% of the transmission events between close contacts would have been missed. Most secondary cases were diagnosed between 1-3 years after the index case, but the long study duration of 12 years, using WGS to analyze all TB isolates in the study areas, allowed us to identify clustered isolates that would be missed in studies with a shorter duration.
Screening of close contacts in China has been poorly implemented for a number of reasons. The huge number of TB patients, limited resources for TB prevention and control and the stigma of tuberculosis all contribute to a failure to identify and screen many close contacts. When contact screening is poorly implemented, the contribution of close contacts to the overall TB burden is underestimated and therefore less attention is paid to the screening, creating a vicious cycle. WHO guidelines define close contacts as persons sharing an enclosed space with the index case during the three months prior to commencement of the current treatment episode [24]. While the identification of family contacts is generally straightforward, the definition and identification of non-family contacts is difficult. In China, non-family contacts refer mainly to classmates and colleagues [26]. In this study, however, we found that many patients with confirmed epidemiologic links were contacts either within the community or contacts who shared aggregated settings outside the community. Therefore, in addition to family contacts, we classified others as social contacts in the hope of understanding where contacts occur and highlighting their relevance for TB transmission in rural China.
We found that in rural China 67.3% of the transmission occurred among social contacts, which was significantly higher than has been found in Vietnam, Zambia and South Africa (15-50%) [27,28]. This suggests that active case-finding in rural China should go beyond family contacts to include social contacts. In this study we proposed three categories of social contacts: relatives outside of the immediate family; people who know each other and live in the same community or village; and people frequenting aggregated settings outside the community. Largely due to differences in climate and lifestyle, the importance of the different categories of contacts varied between the two study sites (Supplementary Figure S6). The villages in Wuchang are very small, consisting of perhaps only a dozen families living in houses that are generally adjacent to each other. Wuchang is located in the northeast of China and has an average annual temperature of only 4°C. Consequently, villagers spend more than half of the year indoors and frequently interact to socialize and play cards with other villagers in the village shops. It is therefore not surprising that 51.2% of epidemiological links were attributable to transmission between social contacts within these communities. Wusheng, in contrast, is located in the southwest of China with an average annual temperature of 18°C. and villagers live in homes that are relatively scattered in the region. We found that transmission between villagers in this setting was less common than transmission at aggregated settings such as schools and teahouses, where individuals from several different villages congregate. These differences in transmission between the two locations suggests that strategies for screening and social contact tracing must be tailored to the conditions and association habits of the local populations.
The purpose of active case-finding is to diagnose patients early and reduce transmission. Importantly, we found that the time interval between TB diagnosis of student patients and contacts was 3-4 times shorter than for family or community contacts ( Figure 4B). Students are a demographic group in China that receives special attention from the government and society. Once a student patient is diagnosed with TB, their classmates and even schoolmates will be screened. If a similar enhanced screening strategy could be extended to other social contacts, patients might be diagnosed earlier in the course of their disease, thereby reducing rural TB transmission.
This study have several limitations. The most important limitation is that the clustering rate we calculated likely underestimates the extent of local TB transmission. We previously found that the current passive case-finding strategy in Wusheng identified only about 30% of the incident TB patients [29], but in the current study we were unable to even estimate the number of TB patients in the sampling region who were undiagnosed or diagnosed elsewhere. Some of the non-clustered patients could have been clustered with TB patients diagnosed locally but without cultured isolates, or with patients diagnosed outside the study period or the geographic regions sampled. In addition, some clustered patients died or were lost to follow-up before the epidemiological investigations could be completed.
In conclusion, this long-term genomic-epidemiological study helped to delineate the patterns of TB transmission in rural China. Transmission appears to occur principally among close contacts, and therefore contact investigation should be extended to include the social interactions that are common in the targeted population. Further improvement in contact tracing, especially the identification and screening of all close contacts, is critical for reducing transmission and improving TB control in rural areas.

Data availability
Sequencing data were deposited in the Genome Sequence Archive (https://bigd.big.ac.cn/gsa) under BioProject PRJCA008815 and PRJCA008816. Deidentified participant data from the study will be made available with publication to medical researchers on a not for profit basis by email request to the corresponding author for the purposes of propensity matching or meta-analysis.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Ethical approval
The study was approved by the institutional review board of Biomedical Sciences, Fudan University and all enrolled patients provided written informed consent.