Trends of pulmonary fungal infections from 2013 to 2019: an AI-based real-world observational study in Guangzhou, China

ABSTRACT The prevalence of pulmonary fungal infection (PFI) has increased rapidly in recent years. Changes in the risk factors, the distribution of underlying diseases and the clinical characteristics of some individual PFIs have been reported in the past decade. However, the overall picture for PFIs remains uncertain. This study reports the epidemiological characteristics of PFIs and their trends over recent years. We applied an automated natural language processing (NLP) system to extract clinically relevant information from the electronic health records (EHRs) of PFI patients at the First Affiliated Hospital of Guangzhou Medical University and then performed a trend analysis. From January 1, 2013, to December 31, 2019, 40,504 inpatients and 219,414 outpatients with respiratory diseases were screened, of whom 1368 inpatients and 1313 outpatients with PFI were identified. These patients came from throughout the country, although most were from southern China. Upward trends in PFIs were observed in both hospitalized patients and outpatients (P<0.05). Stratification by age showed that the incidence among hospitalized patients aged 14–30 years exhibited the most pronounced upward trend, increasing from 9.5 per 1000 patients in 2013 to 88.3 per 1000 patients in 2019. Aspergillosis (56.69%) was the most common PFI, but notably, the incidence of Talaromyces marneffei infection, which used to be considered uncommon, exhibited the most rapid increase. The incidence of PFIs among younger patients has increased, and infection by previously uncommon pathogens has also gradually increased. Increased attention should be paid to young PFI patients and to infections caused by uncommon PFI pathogens.


Introduction
With the widespread use of antibiotics, glucocorticoids, and immunosuppressive agents, and the growing population of immunocompromised patients, such as patients with cancer or HIV, the incidence of pulmonary fungal infections has increased [1,2]. PFIs used to occur mainly in HIV and other immunocompromised patients, but a growing number of reports indicate that immunocompetent patients and immunocompromised patients without traditional risk factors are also affected [3][4][5][6][7]. Recently, accumulating evidence of variations in the risk factors, distributions of underlying diseases and basic clinical characteristics associated with aspergillosis, cryptococcosis and Talaromyces marneffei infection has been reported [4][5][6][7]. However, whether the traditional clinical characteristics of pulmonary fungal diseases as a whole have changed remains uncertain. Studies with substantial data in this research area are urgently needed to provide deeper insight into pulmonary fungal infections and to guide clinical physicians.
In medicine, artificial intelligence (AI) methods have emerged as potentially powerful tools for mining electronic health record data to aid in disease diagnosis and management [8], but the analysis of EHR data presents several challenges, including the substantial amount of data and deviations or systematic errors in medical data [9]. However, a recent study reported that AI could overcome this difficult challenge through an automated NLP system and achieved excellent results [10].
Therefore, given the limited knowledge of changes in PFIs and the superiority of AI in analysing a substantial amount of data, we retrospectively extracted clinically relevant information from the EHRs of PFI patients using an automated NLP system. The aim was to investigate changes in the epidemiological characteristics of PFI patients.

Ethics statement
This study was approved by the ethics committee of the First Affiliated Hospital of Guangzhou Medical University (Ethical number: 2018-119). This study was a retrospective case series study, and no patients were involved in the study design, setting the research questions, or the direct outcome measures. No patients were asked for advice regarding the interpretation or reporting of results.

Study design and participants
We applied an automated NLP system to retrospectively select all patients with confirmed PFIs between 2013 and 2019 at the Department of Respiratory Medicine of the First Affiliated Hospital of Guangzhou Medical University. This department is famous for respiratory medicine in China and has been the top respiratory medical centre since 2009, with the average annual number of outpatient visits reaching over 117,146 and the average number of hospitalizations reaching 76,217 over the past 7 years.
First, based on diagnosis information from patients' medical records, AI comprehensively distinguished between patients with and without PFIs. Subsequently, PFI patients with suspected diagnoses were excluded (Figure 1). Then, clinical information (including sex, age, pathogens and underlying diseases), admission costs, drug use status and adverse events were extracted from the patients' records. After data collection, trend analyses of the changes in the characteristics of the PFI patients in recent years were performed. Multiple admissions in different years were considered separately, and missing data were excluded from the analysis.

Data production process
The data analysed in this study were produced mainly by the following process. All data acquisition was authorized by the relevant departments of the hospital. By interfacing with the hospital integrated platform or clinical business system, data extraction was performed during specified non-business hours of the hospital by means of "push view, intermediate library, message bus, and webservice" to collect the historical and incremental data of outpatients and inpatients.
After the data collection was complete, we analysed the EHRs according to recognized guidelines [11,12] and the basic dataset of electronic medical records (sourced from the National Health Commission (http://www.nhc.gov.cn/)) and established a series of data application standards, including standard medical terms and synonyms. In addition to the application of poststructural technology, the collected data were deep-cleaned and verified before forming a standardized specialized disease database, which was then used for the analysis.
During the data collection process, we strictly followed data security standards [13]. To ensure the safety of the data throughout the process, the construction of the platform met the criteria of the national "three levels of information security protection" certification and "ISO27001" certification.
Regarding accuracy, we conducted in-depth data verification covering data acquisition, data processing and data production, and produced a data verification report covering the whole data production process to ensure the accuracy, authenticity and reliability of the platform output data (Figure 2(A)).

Data standardization process
First, a diagnostic standard terminology database was established according to the International Classification of Diseases, 10th Revision (ICD-10) and Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) standard terminology. Then, by adding clinical case data, we created a synonym database to achieve accurate mapping between the original diagnosis data and the standard terminology. The diagnostic standardization process mainly included the following four steps: diagnostic parsing, diagnostic splitting, diagnostic matching, and diagnostic validation.
(1) Diagnostic parsing: Diagnostic data from the medical record homepage, outpatient medical records and discharge records were parsed to generate original diagnostic data.
(2) Diagnostic splitting: In cases with multiple disease diagnoses, the data were split to generate single original diagnoses. If validation showed the results to be inaccurate, the splitting rules were refined and the erroneous results were split again under the modified rules until accurate results were obtained.
(3) Diagnostic matching: After splitting, the original diagnostic data were matched by identifying existing synonyms in the diagnostic database. If no synonym was matched, the split raw data were normalized manually and then mapped to the standard diagnostic terminology.
(4) Diagnostic validation: The raw diagnostic data after normalization were manually checked through random sampling and then released after validation. Accurate results whose words or phrases were not yet in the thesaurus library were added to it. Inaccurate results were normalized again according to new rules (Figure 2(B)).
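The splitting and matching steps can be illustrated with a minimal Python sketch. It is not the system used in the study: the synonym table, the delimiter set and the function names (`split_diagnoses`, `match_diagnosis`, `standardize`) are all hypothetical, and the sketch only shows how a multi-diagnosis string might be split and mapped to standard terms, with unmatched entries set aside for manual normalization.

```python
import re

# Hypothetical synonym table mapping raw diagnosis strings to standard terms.
# The entries are illustrative only and are not taken from the study database.
SYNONYMS = {
    "pulmonary aspergillosis": "Pulmonary aspergillosis",
    "lung aspergillosis": "Pulmonary aspergillosis",
    "copd": "Chronic obstructive pulmonary disease",
    "chronic obstructive pulmonary disease": "Chronic obstructive pulmonary disease",
    "bronchiectasis": "Bronchiectasis",
}

# Assumed delimiters between diagnoses in a free-text diagnosis field.
SPLIT_PATTERN = re.compile(r"[;,，；、\n]+")


def split_diagnoses(raw_text):
    """Split a multi-diagnosis string into single raw diagnoses."""
    parts = SPLIT_PATTERN.split(raw_text)
    return [p.strip() for p in parts if p.strip()]


def match_diagnosis(raw_diagnosis, synonyms=SYNONYMS):
    """Map one raw diagnosis to a standard term, or None if no synonym matches."""
    return synonyms.get(raw_diagnosis.lower())


def standardize(raw_text, synonyms=SYNONYMS):
    """Return (matched standard terms, unmatched raw diagnoses).

    Unmatched entries correspond to the cases that would be normalized
    manually and, once validated, added to the synonym table.
    """
    matched, unmatched = [], []
    for raw in split_diagnoses(raw_text):
        term = match_diagnosis(raw, synonyms)
        if term is None:
            unmatched.append(raw)
        else:
            matched.append(term)
    return matched, unmatched


matched, unmatched = standardize("Lung aspergillosis; COPD, unknown lesion")
```

In this sketch the unmatched list plays the role of the manual-normalization queue; a validated entry would simply be added to `SYNONYMS` so that later records match automatically.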

Statistical analysis
The primary objectives of this research were to study the trend of PFI by age, pathogens and underlying diseases in recent years; the secondary objectives included analysing the trends in sex, incidence of patients with adverse events, mean length of hospital stay, admission costs and drug proportion. A trend analysis was performed using the proportions of PFI patients among all patients with respiratory diseases. The percentage change in the annual incidence and its 95% CI were estimated using a linear regression analysis of the log of the annual incidence. The data were analysed using SPSS (IBM SPSS Statistics 27, SPSS, Inc, Chicago, USA) and R (version 4.0.2). All statistical tests were two-sided, and a P-value < 0.05 was considered statistically significant.

and that the proportion of younger patients increased, while the proportion of older patients decreased. Similarly, among the outpatients, an upward trend was also observed in the 14-30 age group (P<0.05), with a 6.7% year-on-year increase. Although the annual percentage change was greater in the older patient group, the difference was very small (6.2% vs 7.8%). Nevertheless, an overall increasing trend was observed among the younger PFI patients (Figure 3, Table 1).
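The year-on-year percentage changes quoted in these trend analyses come from a linear regression on the log of the annual incidence. The following Python sketch illustrates that calculation under stated assumptions: the natural logarithm is assumed (the text says only "log"), the function names are hypothetical, the yearly series is synthetic, and the 95% CI is omitted. The study itself used SPSS and R, so this is an illustration of the technique rather than the authors' code.

```python
import math


def incidence_per_1000(cases, screened):
    """Incidence expressed per 1000 screened patients."""
    return 1000.0 * cases / screened


def annual_percent_change(years, incidence):
    """Annual percentage change from an ordinary least squares fit of
    ln(incidence) on calendar year: APC = (exp(slope) - 1) * 100."""
    n = len(years)
    logs = [math.log(v) for v in incidence]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    return (math.exp(slope) - 1.0) * 100.0


# Pooled 2013-2019 counts as given in the abstract; the per-1000 figures
# derived here are illustrative arithmetic, not values reported by the authors.
inpatient_rate = incidence_per_1000(1368, 40504)
outpatient_rate = incidence_per_1000(1313, 219414)

# Synthetic series growing by exactly 10% per year (NOT study data).
years = list(range(2013, 2020))
incidence = [5.0 * 1.10 ** (y - 2013) for y in years]
apc = annual_percent_change(years, incidence)
```

A confidence interval for the percentage change would be obtained by applying the same transformation to the confidence limits of the slope; that step needs a t-distribution quantile and is left out to keep the sketch dependency-free.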

Trends by pathogens
The most common infection responsible for PFIs in the hospitalized patients was aspergillosis [56.69% (n=932)], followed by cryptococcosis [17.03% (n=280)] and Talaromyces marneffei infection [2.55% (n=42)]. The incidence of each of these three infections showed an upward trend (P<0.05). Interestingly, the incidence of Talaromyces marneffei infection, which was previously uncommon, presented the fastest growth during the study period among all PFI patients, from 0.17 per 1000 patients in 2013 to 1.97 per 1000 patients in 2019, representing a 16% year-on-year increase (P<0.001). Other uncommon infections, such as pneumocystis pneumonia, pulmonary mucormycosis, and candidiasis, also increased rapidly but without statistical significance, which could be related to the small sample sizes. Additionally, in the internal percentage trend analysis of PFIs, we found that only the proportion of Talaromyces marneffei infection showed a statistically significant increase, from 0.65% to 4.63% (P<0.05); other previously uncommon pathogens also increased rapidly but without statistical significance (Supplemental material eTable3). These findings indicate that the incidence of some previously uncommon infections has increased, especially since 2016, and additional attention should be paid to some PFIs that are considered rare (Figure 4, Table 2).

Trends by underlying diseases
Pulmonary infection (28.59%) was the most common underlying disease among the PFI patients, followed by bronchiectasis (23.72%) and chronic obstructive pulmonary disease (COPD) (20.74%). Although the incidence rates of all underlying diseases among the PFI patients remained stable over the 7 years, with no statistically significant change, increasing trends in pulmonary infection, bronchiectasis, diseases requiring mechanical ventilation, tumour diseases, hypoproteinemia and diseases requiring invasive ventilation were observed (P>0.05). Here, we show only the most common underlying diseases; for more details, see the supplemental material (eTable2). The analysis of relevance also showed that pulmonary infection and chronic pulmonary diseases may be strong independent risk factors for PFIs, especially pulmonary aspergillosis and cryptococcosis (Figure 5, Table 3). However, regarding uncommon pathogens, further research is still required.

also decreased annually (from 57.26% to 44.29%). Interestingly, the trends of broad-spectrum antibacterial and antifungal drug use were consistent with the fluctuation in antibacterial drug use (Supplemental material eFigure1).

Discussion
This research aimed to use AI-based methods to study the epidemiological characteristics and trends in patients with pulmonary fungal diseases over recent years. Using an AI-based investigation, we demonstrated, for the first time, a remarkable increase in incidence among young PFI patients from 2013 to 2019. The incidence of some PFIs that were previously considered rare or uncommon also rapidly increased. Standardizing diagnostic terms and applying AI in this study improved the quality of the data extraction. This 7-year real-world, big data study filled a gap in knowledge regarding the epidemiological characteristics of fungal disease, providing deeper insight into fungal infection and guidance for clinical physicians.
During the study period, a remarkable increase in the incidence of PFI was observed. This sharp increase can be attributed to the increasing numbers of immunocompromised patients with malignancy, haematologic disease, and HIV, and of those receiving immunosuppressive agents for organ transplantation or autoimmune inflammatory conditions [1,2]. In addition, PFIs were generally thought to occur in patients with HIV or neutropenia. However, the number of reports documenting PFIs in immunocompetent patients or in immunocompromised patients who do not have the classic risk factors is increasing [3][4][5][6][7]. Furthermore, advances in diagnostic methods and techniques have also greatly contributed to the identification of PFIs [1].
Stratification by age showed that older patients accounted for most PFI cases, and the incidence of PFIs continuously increased in this group. These findings are consistent with those reported in previous studies [14][15][16]. However, the proportion of younger patients with fungal diseases, especially those between the ages of 14 and 30, also increased annually. This result is comparable to previous findings in some individual pulmonary fungal diseases. The lung was the predominant site of invasive aspergillosis infection (83.4%), and 21.6% of the cases occurred in paediatric patients [15]. Although most cases of pulmonary aspergillosis occur in the 50-70 age group, patients aged under 30 also accounted for a considerable proportion [17]. In Colombia, younger patients (under 40) even accounted for approximately 59.26% of 1976 cryptococcosis patients [18].
However, the trend of PFIs in youth has not been reported in any previous studies. Here, we present strong evidence and are the first to report that the incidence of PFIs is increasing in younger patients. This finding may be related to the increased proportion of younger immunocompromised people [1,19]. Patients with high-risk factors are more likely to develop PFIs than patients without such factors [7]. Additionally, the discovery of novel immunodeficiency syndromes in children may contribute to the identification of additional at-risk patient groups [6]. Finally, advances in diagnostic methods and techniques, including the use of metagenomic next-generation sequencing (mNGS), computed tomography (CT), positron emission tomography (PET), and bronchoscopy, have also significantly contributed to the identification of PFI, thus increasing the PFI definitive diagnosis rate [1].
Notes: The data represent the incidence per 1000 patients unless otherwise stated. *Estimated using a linear regression analysis of the log of the annual incidence (e.g. there was a statistically significant 5.9% year-on-year increase in the incidence of pulmonary aspergillosis during the 2013-2019 period).
Pulmonary aspergillosis was found to be the most common PFI and exhibited a continuously upward trend, which is comparable to previous findings [20][21][22]. Importantly, the incidence of some pathogens that were previously considered uncommon also showed sharp increases, with Talaromyces marneffei infection showing the steepest upward trend (Figure 4). The development of highly active antiretroviral therapy and other effective control measures for the HIV/AIDS epidemic has resulted in a decreased incidence of T. marneffei infection in HIV patients [23]. However, increasing numbers of cases have been reported in non-HIV-infected patients with anti-IFN-γ autoantibodies and in those receiving immunosuppressive agents, such as anti-CD20 monoclonal antibodies and kinase inhibitors, for malignancy, haematologic disease, organ transplantation and autoimmune inflammatory conditions [6]. Similar to the trend of Talaromyces marneffei infection, increasing trends of pulmonary mucormycosis, pneumocystis pneumonia and pulmonary candidiasis were also observed (P>0.05). In the future, additional cases of uncommon PFIs are likely to be reported in developing countries.
The reason may be that improvements in national health services in these countries are likely to increase the population of non-HIV-infected patients at risk of infection, including transplant recipients and cancer patients receiving targeted therapies [6]. Notably, we found that the incidence rates of these rare diseases had increased remarkably since 2016 compared with those in previous years. This finding may be a result of the application of mNGS technology [24,25] and the development of relevant guidelines [26]. Moreover, advances in diagnostic methods and techniques will greatly facilitate the detection and molecular characterization of causative pathogens [27][28][29].
Immunocompromising diseases (including malignancy, haematologic disease, and HIV infection) were previously the most common underlying diseases among fungal infection patients [20,22,30]. However, in this study, we found that pulmonary infections and chronic lung diseases may be important independent risk factors for PFIs, which is similar to the findings of recently published articles [15,21]. The risk in patients with these diseases is lower than that in immunodeficient patients, but these diseases are more widespread and common and involve a much larger population than those causing immunodeficiency [31]. While the reason why PFIs are highly prevalent in lung infection patients remains unclear, evidence of this phenomenon has been reported. Influenza has been recognized as an independent risk factor for invasive pulmonary aspergillosis and is associated with high mortality [32]. Additionally, positive and negative interactions between Aspergillus and Pseudomonas aeruginosa, which are two central members of the fungal and bacterial pulmonary microbiota, have also been reported [33,34]. Volatile compounds released by bacterial pathogens can stimulate the growth of fungal pathogens in lung infections [35]. In addition, chronic respiratory diseases, including COPD and bronchiectasis, have also been found to be major risk factors for PFIs [36,37].
PFIs are among the most common invasive fungal infections and present an increasing prevalence and serious threats to humans worldwide [15]. However, their importance is typically not well recognized or publicized. Most studies related to this topic were focused on invasive fungal infections [14,15,28,38] or individual PFI pathogens [4][5][6]. Knowledge regarding the epidemiology of PFIs in recent decades is relatively limited. To the best of our knowledge, this study is the largest and longest study ever performed investigating the trend in PFIs in recent years. These findings fill a gap in knowledge regarding the changed epidemiological characteristics of fungal disease, providing deeper insight into fungal infection and guidance for clinical physicians.
Limitations also exist in this research. First, this study was a single-centre study, and its conclusions may not apply to other countries. However, based on a large number of patients from throughout the country and 7 years of real-world big data, we used NLP techniques to extract the data and then performed a trend analysis, which supports the representativeness of this single-centre analysis. Second, this study was an observational study. Although some new insights into the clinical characteristics of PFIs have been reported, the causes of these trends and effective measures remain unknown. Our next study will explore why a rapidly increasing trend was observed among young PFI patients.
In conclusion, the trend analysis revealed a remarkably increasing incidence of PFIs among younger patients. The incidence of some PFIs previously considered rare or uncommon also showed rapid increases. The empirical findings of this study provide a new understanding of PFIs. Additional attention should be paid to young PFI patients and to some previously uncommon PFIs that have shown a rapidly increasing trend.

Availability of data and material
The data supporting the findings of this study are available from the corresponding author upon reasonable request. Participant data without names and identifiers will be made available after approval from the corresponding author. After the publication of the study findings, the data will be available to other researchers upon request. The research team will provide an email address for communication once the data are approved for sharing with others. A proposal containing a detailed description of the study objectives and statistical analysis plan will be needed to evaluate the reasonability of the data request. The corresponding author will make a decision based on these materials. Additional materials may also be required during the process.

Notes: The data represent the incidence per 100 PFI patients unless otherwise stated. *Estimated using a linear regression analysis of the log of the annual incidence. COPD: chronic obstructive pulmonary disease. CTD: connective tissue disease.