Exhaled metabolic markers and relevant dysregulated pathways of lung cancer: a pilot study

Abstract
Introduction: The clinical application of lung cancer detection based on breath tests is still challenging because of the lack of predictive molecular markers in exhaled breath. This study explored potential lung cancer biomarkers and their related pathways using a typical metabolomics workflow.
Material and methods: Breath samples from 60 lung cancer patients and 176 healthy people were analyzed by GC-MS. The raw data were GC-MS peak intensities after removal of the background signal. Differential metabolites were selected by univariate statistical analysis followed by multivariate statistical analysis based on OPLS-DA and Spearman rank correlation analysis. A multivariate PLS-DA model was established on the differential metabolites for pattern recognition. Subsequently, pathway enrichment analysis was performed on the differential metabolites.
Results: The discriminant capability was assessed by ROC curves, whose average AUC and average accuracy over 100 cross-validation runs were 0.871 and 0.787, respectively. Eight potential biomarkers were involved in a total of 18 metabolic pathways. Among them, 11 metabolic pathways had a p-value smaller than .1.
Discussion: Some of these pathways are related to risk factors or therapies of lung cancer. However, more of them are dysregulated pathways of lung cancer reported in studies based on genome or transcriptome data.
Conclusion: We believe that this work opens the possibility of using metabolomics methods to analyze exhaled-breath data and encourages the expansion of knowledge databases to cover more volatile metabolites.
Clinical significance: Although a series of related studies reported diagnostic models with highly sensitive and specific prediction, the clinical application of lung cancer detection based on breath tests is still challenging because of disease heterogeneity and the lack of predictive molecular markers in exhaled breath. This study may promote the clinical application of this technique, which is suitable for large-scale screening thanks to its low cost and non-invasiveness. As a result, the mortality of lung cancer may be decreased in future.
Key messages: In the present study, 11 pathways involving 8 potential biomarkers were found to be dysregulated pathways of lung cancer. We found that it is possible to apply metabolomics methods to the analysis of breath test data, which is meaningful for discovering convincing volatile markers with definite pathological and histological significance.


Introduction
Lung cancer is a malignant pulmonary disease that poses a great threat to the health and life of the population. It has the highest morbidity (11.6% of total cases) and mortality (18.4% of total cancer deaths) among all cancers [1]. Owing to the lack of typical symptoms, most lung cancer patients are diagnosed at a terminal stage and miss the best treatment period [2]. Screening with markers that are biologically related to tumour progression could further reduce mortality by identifying suspected patients as early as possible.
The exhaled breath test is a promising technique for large-scale screening of populations at high risk of lung cancer because of its convenience, low cost and non-invasiveness. Detection of lung cancer through breath tests has therefore come into sharp focus. In 2019, Rudnicka et al. analyzed breath samples from 108 patients with lung cancer and 121 healthy volunteers with gas chromatography-mass spectrometry (GC-MS) [3]. Cross-validation of the resulting model showed a sensitivity of 80% and a specificity of 91.23%. In addition, Huang and Li used selected ion flow tube mass spectrometry (SIFT-MS) to quantitatively analyze 116 volatile organic compounds (VOCs) in breath samples from 148 patients with histologically confirmed lung cancer and 168 healthy volunteers. A diagnostic model based on the eXtreme Gradient Boosting (XGBoost) method was built, showing an accuracy of 92% [4]. Although a series of studies reported diagnostic models with highly sensitive and specific prediction, far exceeding the performance of currently available low-dose computed tomography (LDCT) detection, the clinical application of this technique is still challenging because of disease heterogeneity and the lack of predictive molecular markers in exhaled breath.
Recently, Chen et al. used GC-MS data from 160 patients with lung cancer, 70 patients with benign pulmonary disease and 122 healthy subjects to explore exhaled markers of lung cancer. They found that 20 VOCs discriminated lung cancer patients from healthy subjects. Additionally, they reported 19 and 20 VOCs related to histological type and lung cancer stage, respectively [5]. Indeed, many efforts have been made to select breath markers associated with lung cancer [6,7] since 1985 [8]. However, none of these markers has been used clinically, because the underlying mechanisms by which those VOCs are produced are still unclear.
Metabolomics is a multidisciplinary diagnostic tool for exploring differences and dynamic changes in endogenous small-molecule metabolites, combining analytical chemistry and bioinformatics to systematically detect and analyze changes in metabolites in the body. It has been widely used on diverse metabolic samples including urine [9,10], serum/plasma [11,12] and tissue [13,14]. Based on such high-throughput data and online metabolic databases, e.g. the Human Metabolome Database (HMDB) and the Kyoto Encyclopedia of Genes and Genomes (KEGG), pathway enrichment analysis can be performed, which may explain the relationship between metabolic data and pathophysiological state. As a mature technique, metabolomics provides a scientific and systematic data mining process for the analysis of differential metabolites, supporting the validity, interpretability and reproducibility of the results. However, few studies have analyzed breath data with metabolic databases, even though various software and tools are designed for guiding and performing metabolomics data analysis.
Here, we aim to identify molecular markers of metabolic dysregulation in lung cancer using the GC-MS data obtained from a recent study [15]. In that study, we reported a method to differentiate subjects with lung cancer from healthy controls by means of an exhaled breath test with GC-MS. Breath profiles, instead of volatile markers, were analyzed with a machine-learning algorithm, and an accuracy of 85% was shown in six-fold cross-validation. Although that study performed relatively well in diagnosis, it is still necessary to elucidate the combination of metabolic markers and their related pathways before clinical application. Pathway enrichment analysis is a knowledge-based approach that depends largely on the databases available for bioinformatic analysis, such as KEGG and HMDB. Accordingly, our statistical analysis focussed on those candidate VOCs that could be annotated to online databases. This study was performed not only to find potential markers and relevant pathways for the detection of lung cancer but also to explore the possibility of combining metabolomics methods with breath data.

Data acquisition
Data were obtained from a previous case-control study in which 236 subjects were asked to participate [15]. All cases had incident lung cancer confirmed histologically or pathologically, while all controls had a negative LDCT scan. The detailed inclusion and exclusion criteria for subjects are listed in Supplementary materials (S1).
Sample collection and analysis were performed as previously published [15]. Briefly, to collect breath samples, subjects were asked to breathe tidally into a self-developed collection device, with which VOCs in 1000 mL of exhaled breath were captured and concentrated into a Tenax TA stainless steel tube (PerkinElmer, Waltham, MA). Each sampling tube was then shipped to the laboratory for chemical analysis, which was performed on a GC-MS (QP2010 Plus, Shimadzu, Tokyo, Japan) coupled with a thermal desorption (TD) instrument (TurboMatrix 300 TD, PerkinElmer, Waltham, MA). Subsequently, spectrum analysis including peak identification and background removal was carried out. Details of collection, detection and data pretreatment are given in Supplementary materials (S2). Metabolites that could be annotated to HMDB were then used for the subsequent analysis.

Statistical analysis
Statistical analysis was applied to VOCs present in more than 70% of samples in at least one group, with a quality control relative standard deviation smaller than 25%. A metabolite is described as "putative" following an accurate mass match to the HMDB database [16].
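The filtering rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, an intensity of zero is assumed to mean "not detected", and `qc_values` is assumed to hold repeated measurements of a quality control sample.

```python
from statistics import mean, stdev

def presence_fraction(values):
    """Fraction of samples in which the VOC was detected (assumed: intensity > 0)."""
    return sum(1 for v in values if v > 0) / len(values)

def relative_sd(values):
    """Relative standard deviation (sample SD divided by the mean)."""
    return stdev(values) / mean(values)

def keep_voc(case_values, control_values, qc_values,
             min_presence=0.70, max_rsd=0.25):
    """Keep a VOC if it is present in more than 70% of samples in at least
    one group and its quality control RSD is smaller than 25%."""
    present_enough = max(presence_fraction(case_values),
                         presence_fraction(control_values)) > min_presence
    return present_enough and relative_sd(qc_values) < max_rsd
```

For example, a VOC detected in four of five cases with stable quality control intensities would pass, whereas one detected in three of five cases and one of five controls would be dropped.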
Univariate statistical analysis was performed on the filtered data using the Mann-Whitney test. VOCs with a fold change larger than 2 and FDR p < .1 were selected as candidate differential metabolic biomarkers. Because of the confounding effects of smoking and gender, the univariate statistical analysis was stratified by smoking status and gender. VOCs selected as differential metabolic biomarkers in at least one of the five group-wise Mann-Whitney tests were used in the subsequent analysis.
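As a rough illustration of the selection step, the sketch below applies a fold-change cut-off and an FDR cut-off to precomputed p-values. Two points are assumptions rather than details given in the text: the FDR correction is taken to be the Benjamini-Hochberg procedure, and the fold-change criterion is applied in both directions (more than two-fold up or down). The function names are hypothetical, and the raw p-values would come from the Mann-Whitney test.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_candidates(names, fold_changes, p_values, fc_cut=2.0, fdr_cut=0.1):
    """Select VOCs with a more than two-fold change (either direction,
    an assumption) and an FDR-adjusted p-value below the cut-off."""
    q_values = benjamini_hochberg(p_values)
    return [name for name, fc, q in zip(names, fold_changes, q_values)
            if (fc > fc_cut or fc < 1.0 / fc_cut) and q < fdr_cut]
```

Running the same selection once per stratum (males, females, smokers, non-smokers and all subjects) and taking the union of the returned names would reproduce the "selected at least once" rule.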
Before multivariate analysis, data for all significantly altered metabolites were cube root-transformed, mean-centred and divided by the standard deviation of each variable. Orthogonal projections to latent structures discriminant analysis (OPLS-DA) and Spearman rank correlation analysis were performed on the scaled data.
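The scaling described above can be written compactly. The following is a minimal sketch for a single variable (one VOC across all samples); the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
import math
from statistics import mean, stdev

def cube_root(x):
    """Real cube root that also works for negative inputs."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def scale_variable(values):
    """Cube-root transform, mean-centre, then divide by the standard deviation."""
    transformed = [cube_root(v) for v in values]
    mu = mean(transformed)
    sd = stdev(transformed)  # sample SD (n - 1); an assumption
    return [(t - mu) / sd for t in transformed]
```

After this step every variable has zero mean and unit standard deviation, so that high-intensity peaks do not dominate the multivariate model.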
Those confirmed metabolites were imported into MBRole 2.0 for pathway enrichment analysis.

Subjects
Two hundred thirty-six participants, including 60 lung cancer patients and 176 healthy subjects, were recruited in the study (Table 1). There were no significant differences in smoking history or age between the two groups. However, a moderate gender imbalance existed at a significance level of 0.05. Because univariate statistical analysis was also conducted separately for males and females, this difference (p = .024) was considered acceptable.

Univariate statistical analysis
A total of 308 VOCs were obtained after data pre-treatment, of which 81 had confirmed HMDB annotations, namely the putative metabolites. Volcano plots of the log fold change of signal against the p-value from the nonparametric test were used to show the results of univariate statistical analysis [17] (Figure 1). According to the selection criteria (fold change > 2, FDR p < .1), several differential metabolites were selected in each group. The fold changes and the corresponding FDR p-values of the group-wise differential metabolites are listed in Supplementary materials (S4).
In addition to the comparisons stratified by gender and smoking status, univariate statistical analysis was also performed on all subjects. Twenty-four VOCs were found to be differential metabolites (Figure 1 and Figure S4), of which 10 were upregulated in patients and the others were downregulated. As shown in Figure 2, 8, 10, 7, 22 and 24 metabolites were selected in males, females, smokers, non-smokers and all subjects, respectively. A total of 31 VOCs that appeared at least once among the group-wise differential metabolites were taken as candidate differential metabolites (Table 2).

Multivariate statistical analysis
Correlation analyses were performed between all pairs of candidate differential metabolites using Spearman's correlation (Figure 3 and Figures S5 and S6). n-Nonanal and n-octanal had the strongest correlation among all metabolites (r = 0.67426, p < .001). Besides this pair, seven pairs of metabolites had correlation coefficients larger than 0.5 (S7). The remaining correlation coefficients were smaller than 0.5. Based on reference [18], it could be concluded that there was no strong correlation among the VOCs. Thus, it was not necessary to remove any of these 31 VOCs.
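For readers unfamiliar with the statistic, Spearman's coefficient is the Pearson correlation of the ranks of two variables. The sketch below is a generic, self-contained illustration with hypothetical function names; it is not the software used in the study, and it does not compute p-values.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applying `spearman` to every pair of the candidate VOCs and listing the pairs whose absolute coefficient exceeds 0.5 would give a table comparable to the one referred to above.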
Multivariate statistical modelling was performed using OPLS-DA on the 31 confirmed metabolites. This model showed moderately significant group separation (Q² = 0.331, R²Y = 0.357, Figure 4(a)). Permutation tests confirmed the robustness of the model (100 permutations, Figure 4(b)). PLS-DA was also performed, and the score plots are shown in Supplementary materials (S8).
As subjects from the case and control groups could not be separated completely by OPLS-DA, ROC analysis was performed on the output values of PLS-DA (Figure 5). Areas under the curve (AUCs) ranged from 0.822 to 0.92 in 100 cross-validations, and the average predictive accuracy was 0.787.
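The AUC reported above can be understood as the probability that a randomly chosen case receives a higher model output than a randomly chosen control. The sketch below computes it directly from that definition; the function names, the label coding (1 for case, 0 for control) and the decision threshold of 0.5 are assumptions made for illustration, not details of the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of case/control pairs in which the case has the
    higher score; tied pairs count as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of subjects classified correctly at a fixed threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

In a repeated cross-validation, these two functions would be evaluated on the held-out subjects of each run and then averaged over the runs.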
Scaled peak intensities of the 31 differential metabolites from 60 cases and 176 controls are displayed in a heatmap (Figure 6), showing that each group had a specific metabolic profile. For metabolites in the upper rows, patients had relatively lower levels than controls; for metabolites in the lower rows, the opposite was true.

Pathway enrichment
To further explore the relationship between the above 31 differential metabolites and the pathogenesis of lung cancer, these small-molecule metabolites were submitted to MBRole 2.0 (http://csbg.cnb.csic.es/mbrole2/) to obtain the key metabolic pathways involved. As shown in Figure 7 and Table 3, 8 potential biomarkers (Table 4) were involved in a total of 18 metabolic pathways. Among them, 11 metabolic pathways had a p-value smaller than .1, indicating a significant contribution to lung cancer metabolism, namely monoterpenoid biosynthesis, toluene and xylene degradation, glycosaminoglycan biosynthesis-heparan sulphate, reductive carboxylate cycle (CO2 fixation), biphenyl degradation, glycolysis/gluconeogenesis, C5-branched dibasic acid metabolism, pyruvate metabolism, selenoamino acid metabolism, taurine and hypotaurine metabolism and sulphur metabolism. Among these, glycosaminoglycan biosynthesis-heparan sulphate had the greatest rich factor, 0.3333.
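Over-representation tools of this kind typically score each pathway with a one-sided hypergeometric test. The sketch below shows that calculation together with a rich factor, here assumed to be the number of submitted metabolites found in a pathway divided by the number of metabolites annotated to that pathway. Both the choice of test and this definition are assumptions for illustration; the function names and all numbers in the usage note are hypothetical and are not taken from the study.

```python
from math import comb

def hypergeom_enrichment_p(hits, query_size, pathway_size, background_size):
    """P(X >= hits) when query_size metabolites are drawn without replacement
    from a background of background_size, of which pathway_size belong to
    the pathway of interest."""
    upper = min(query_size, pathway_size)
    total = comb(background_size, query_size)
    tail = sum(comb(pathway_size, k)
               * comb(background_size - pathway_size, query_size - k)
               for k in range(hits, upper + 1))
    return tail / total

def rich_factor(hits, pathway_size):
    """Share of the pathway's annotated metabolites found in the query set
    (assumed definition)."""
    return hits / pathway_size
```

As a purely illustrative case, one hit in a three-member pathway gives `rich_factor(1, 3)`, about 0.333, and drawing a single metabolite from a background of 30 gives an enrichment p-value of 0.1 for that pathway.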

Discussion
Although 308 VOCs were detected, only 81 putative metabolites were available for analysis. This marked difference most likely arises because many volatile metabolites could not be annotated in the HMDB or KEGG databases. As a knowledgebase of the human metabolome, HMDB records metabolites derived from human serum, urine, saliva and so on. Although HMDB has been continuously improved over the past decades, exhaled breath, as one of the metabolic products of the human body, has not been included. This reflects that metabolomic studies of breath tests are still in their infancy. Although VOCs in breath have been detected since 1971, exhaled breath still contains many unidentified compounds, especially enzymatically and non-enzymatically transformed products derived from well-known endogenous or exogenous compounds. Consequently, HMDB does not provide the metabolite coverage needed for researchers to identify these VOCs in breath.
In total, 31 differential metabolites were selected from the 81 putative metabolites through the Mann-Whitney test, Spearman rank correlation analysis and OPLS-DA. In cross-validation, the average accuracy of the multivariate model based on these 31 VOCs was 78.7%, whereas that of the previously reported model, which selected VOCs with a machine-learning algorithm, was 85% [15]. In our opinion, this difference arose because some VOCs that may be valuable for pattern recognition were removed during HMDB annotation. These removals reflect the lack of available data and knowledge on breath research in HMDB. However, the main aim of this study was to explore volatile biomarkers and related pathways rather than to maximize accuracy. As mentioned above, diagnostic models detached from biomedical meaning are often not robust enough. It is therefore meaningful to obtain a set of confident markers, even though it is not the entire set of lung cancer markers in breath.
Eleven pathways involving eight potential biomarkers were discovered in the enrichment analysis. Among them, the monoterpenoid biosynthesis pathway had the lowest p-value. Menthol, camphene and eucalyptol were annotated in this pathway. First, monoterpenoids are sometimes used in treatment [19][20][21]. For instance, camphene was reported to be a main component of lemongrass essential oil, which induces apoptosis and cell cycle arrest in A549 lung cancer cells [22]. In addition, eucalyptol shows several pharmacological activities that may be used in the treatment of pulmonary diseases including rhinosinusitis, bronchitis, asthma and chronic obstructive pulmonary disease (COPD) [23]. Likewise, camphene could be beneficial against respiratory illnesses and could act as a cough suppressant and anti-congestive agent [24]. Patients who had received treatment after the diagnosis of lung cancer or who had a history of airway inflammation in the past 3 months were indeed excluded. However, inclusion and exclusion of subjects were based on medical records in our hospital and on a questionnaire survey, and questionnaire answers are inherently subjective. Lung cancer generally presents with pulmonary symptoms. Therefore, many patients may have taken cough suppressants or cold medicines without stating so in their questionnaires. This could have been avoided to a certain extent by a more careful and rigorous questionnaire design. In future research, we will give more related examples and stricter definitions of each symptom and treatment. Second, some cigarettes contain menthol, and menthol cigarettes have been confirmed to increase lung cancer risk [25,26]. It is therefore uncertain whether the regulation of the monoterpenoid biosynthesis pathway is related to lung cancer or to other factors including therapy and smoking.
Although their relation to lung cancer is not yet clear, menthol, camphene and eucalyptol were detected in a study comparing breath VOCs from smokers and non-smokers with and without COPD [27]. That study also suggested that, even though some VOCs relate not only to disease but also to smoking status, measuring their concentrations is still useful for disease diagnosis. With regard to toluene and xylene degradation, o-xylene and acetic acid were found in this pathway. Although we did not find any literature reporting a relation between this pathway and lung cancer incidence, toluene and xylene are confirmed risk factors for lung cancer [28,29]. Additionally, they have been detected in exhaled breath [30] and were reported as lung cancer markers in exhaled breath in a series of papers [31][32][33][34]. However, many aromatic VOCs were also reported to result from cigarette exposure [35], which casts doubt on whether they can serve as lung cancer markers. In our view, even if cigarette smoke is the source of these molecules, their levels are still informative, because the key point is the difference in degradation capacity rather than the absolute concentrations of these molecules. To test this hypothesis, we applied univariate statistical analysis to subjects grouped by gender and smoking status. The group-wise differential metabolites are listed in Supplementary materials (S4), which show that o-xylene was selected as a differential metabolite in both smokers and non-smokers. However, it showed no significant difference between subjects with and without lung cancer when all subjects, including smokers and non-smokers, were considered together. Similar results were shown in the dual-centre study comparing breath VOCs mentioned above [27]. In that study, COPD patients were identified among smokers and non-smokers separately to overcome the confounding effects of smoking. However, the sample size of each subgroup was limited. More longitudinal studies of aromatic VOCs should be conducted in future, especially focussing on the toluene and xylene degradation pathway.
Other pathways obtained in our study have previously been reported as lung cancer-related pathways, including glycosaminoglycan biosynthesis-heparan sulphate, reductive carboxylate cycle (CO2 fixation), glycolysis/gluconeogenesis, C5-branched dibasic acid metabolism, pyruvate metabolism, selenoamino acid metabolism, taurine and hypotaurine metabolism and sulphur metabolism. Although metabolic analyses such as pathway enrichment have rarely been used in studies of lung cancer biomarkers in breath, similar studies have been conducted based on miRNA or DNA data, which provide much meaningful information for our work. Higher impact values indicate that a metabolic pathway is more relevant to the pathogenesis of lung cancer. Among all enriched pathways, glycosaminoglycan biosynthesis-heparan sulphate, with a rich factor of 0.333 (p = .008), may be the most likely dysregulated pathway related to lung cancer. Yang et al. reported it as one of the biological pathways enriched by differentially expressed smoking and lung cancer specific miRNAs [36]. Similarly, reductive carboxylate cycle (CO2 fixation) was reported as a lung cancer-related pathway in 2016 [37]. Wang et al. investigated specific genotypes of different subtypes or stages of lung cancer through gene expression variations of chromosome 2 genes. IDH1 was selected as a differential gene and enriched to the reductive carboxylate cycle (CO2 fixation) pathway, which is upregulated in lung cancer. Huang et al. performed a meta-analysis of 4 lung cancer microarray datasets encompassing 353 patients to reveal differentially expressed genes between normal lung tissues and lung cancer of different stages [38]. Overall, 1838 genes were found to be dysregulated, and glycolysis/gluconeogenesis was shown to be one of the significantly regulated pathways in lung cancer.
With regard to C5-branched dibasic acid metabolism, a study using genome-scale metabolic models to explore changes in metabolism under normal and cancer conditions concluded that this pathway is relevant to lung cancer and prostate cancer [39]. Additionally, its dysregulation is related to cystic fibrosis, which is one of the risk factors for lung cancer [40]. As with the pathways above, pyruvate metabolism [41,42], selenoamino acid metabolism [43] and taurine and hypotaurine metabolism [44] have also been confirmed to be closely related to lung cancer through other kinds of omics data. Although we believe that abnormalities of VOCs in the breath of lung cancer patients are closely related to dysregulation of these pathways, the results are not yet convincing enough. More details about the relationship between lung cancer and volatile metabolites, such as genetic-level information, are necessary. Other kinds of omics data have also been incorporated into HMDB, making it more compatible with the increasing number of multi-omic or systems biology studies [16]. Recently, efforts have been made to search for lung cancer markers in exhaled breath condensate (EBC), in which genes [45] and proteins [46] can be detected. With different sampling techniques, multi-omics data including VOCs, genes and proteins could be acquired from exhaled breath simultaneously. Therefore, studies combining volatile breath and EBC may be a promising way to mine data on exhaled lung cancer markers and their related pathways.
There are still some limitations in this study. As far as we know, subtypes and stages of lung cancer influence metabolic disorders. However, the sample size in our study was not sufficient for comparisons among subgroups. Another limitation is that many VOCs are not included in HMDB, owing to the scarcity of studies on pathways related to volatile metabolites. Therefore, metabolomics cannot yet perform as well as it should in exploring breath markers. Further, the lack of standardized sampling and analytical techniques may lead to different results between studies. For instance, expiratory flow rate, breath hold and inclusion of anatomical dead space were reported to significantly influence breath tests [47]. Addressing these weaknesses requires more researchers to expand the metabolic dataset and to standardize sampling and analytical techniques.

Conclusion
A key challenge in applying breath-based diagnosis of lung cancer clinically is the lack of a clear explanation of the relationship between volatile markers and lung cancer. Although our study did not provide a list of all markers in breath, it opens the possibility of exploring the dysregulated pathways that result in variation of VOCs in breath, which may illustrate where these markers are derived from. We believe that, with gradually improving bioinformatic databases (e.g. HMDB or KEGG), the bottleneck in studies of exhaled markers of lung cancer may be removed.

Disclosure statement
The authors declare no conflict of interest.

Institutional review board statement
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the institutional ethics review committee of Sir Run Run Shaw Hospital, Hangzhou, China (Approval no. ChiCTRDCD-15007106, 18 September 2015).

Informed consent statement
Informed consent was obtained from all subjects involved in the study.

Data availability statement
The data on which the results of the current study are based are available from the corresponding authors upon reasonable request.