Development of a gene signature associated with iron metabolism in lung adenocarcinoma

ABSTRACT There are few studies on the role of iron metabolism genes in predicting the prognosis of lung adenocarcinoma (LUAD). Therefore, our research aims to screen key genes and to establish a prognostic signature that can predict the overall survival rate of lung adenocarcinoma patients. RNA-Seq data and corresponding clinical materials of 594 adenocarcinoma patients from The Cancer Genome Atlas(TCGA) were downloaded. GSE42127 of Gene Expression Omnibus (GEO) database was further verified. The multi-gene prognostic signature was constructed by the Cox regression model of the Least Absolute Shrinkage and Selection Operator (LASSO). We constructed a prediction signature with 12 genes (HAVCR1, SPN, GAPDH, ANGPTL4, PRSS3, KRT8, LDHA, HMMR, SLC2A1, CYP24A1, LOXL2, TIMP1), and patients were split into high and low-risk groups. The survival graph results revealed that the survival prognosis between the high and low-risk groups was significantly different (TCGA: P < 0.001, GEO: P = 0.001). Univariate and multivariate Cox regression analysis confirmed that the risk value is a predictor of patient OS (P < 0.001). The area under the time-dependent ROC curve (AUC) indicated that our signature had a relatively high true positive rate when predicting the 1-year, 3-year, and 5-year OS of the TCGA cohort, which was 0.735, 0.711, and 0.601, respectively. In addition, immune-related pathways were highlighted in the functional enrichment analysis. In conclusion, we developed and verified a 12-gene prognostic signature, which may be help predict the prognosis of lung adenocarcinoma and offer a variety of targeted options for the precise treatment of lung cancer.


Introduction
Lung cancer is the leading cause of cancer-related deaths around the world [1]. In 2018, there were 2.1 million new lung cancer cases and 1.8 million deaths worldwide [2]. In recent years, the incidence of lung adenocarcinoma has consistently increased and has caused it to become the most common type of non-small cell lung cancer [3]. Thus, it is necessary to establish a neo-model for predicting the prognosis of lung adenocarcinoma in order to develop more effective diagnosis and treatment strategies.
Iron (Fe) is an essential nutrient for the human body; iron plays a prominent role in multiple forms of cell death, including apoptosis, necrosis, ferroptosis, and ascorbate-mediated death [4]. Circulating iron is normally found complexed with transferrin (Tf) and circulates in the bloodstream. Tf is absorbed in peripheral tissues by binding to TfR1 [5]. The high expression of TfR1 is not only related to the reduced response to chemotherapy, but also to the increased phosphorylation of Src kinases in breast cancer, promoting tumor cell division, motility and adhesion [5]. Disorders of iron metabolism in cancer are well known. Based on review of the literature, there is evidence that iron plays a particularly important role in lung cancer [6]. Disorders of iron metabolism are closely linked to the occurrence, proliferation and progression of tumors, and seriously affect tumorigenesis [7]. Sukiennicki [8] et al. showed that high iron and high iron protein represent higher body iron, which may be relevant to the occurrence of lung cancer. Ferritin and SOD are widely recognized in the occurrence of lung cancer [9,10]. Researchers have demonstrated that the increase of these two markers in lung cancer patients seems to be the result of inflammation and oxidative stress, and it is believed that inflammation and oxidative stress are important components of the pathogenesis of lung malignancies [10][11][12]. Chanvorachote [13] et al. found that iron can induce cancer stem cells and promote the production of an aggressive phenotype through the generation of ROS in lung cancer cells, which contributes to the occurrence of lung tumors. Although lung cancer is certainly not just an iron disease, these findings indicate that there is a clear and direct connection between iron and lung cancer. Therefore, it is necessary to identify novel prognostic biomarkers and construct more accurate prognostic models. Doing so can provide an effective reference for precise clinical treatment strategies for lung adenocarcinoma.
In our study, the mRNA expression profile and corresponding information data of patients with lung adenocarcinoma were obtained from the TCGA and GEO databases. We aimed to establish a credible iron metabolism-related prognostic gene signature for patients with lung adenocarcinoma. Our results help predict the prognosis of LUAD patients and provide a novel direction for the development of precise treatment strategies.

Data collection
The Cancer Genome Atlas (TCGA) data mining platform was searched and standardized RNA-seq data was downloaded. The number of fragments per million bases (FPKM) and relevant clinical data of LUAD, were accessed and naturalized into an expression matrix [14] (As of 16 July 2020, https://portal.gdc.can cer.gov/repository). The samples consisted of mainly 594 cases of LUAD (535 samples, 59 adjacent normal samples). For the clinical information materials of TCGA-LUAD patients, the following methods were used for preprocessing: (1) Samples without clinical data were deleted; (2) Samples with follow-up time lower than 30 days were deleted. In total, 486 LUAD patients were included in the research as a training set.
Iron metabolism genes were downloaded through the GeneCards data portal, screening out the relevant score threshold (relevant score ≥ 5), and finally 3037 iron metabolism-related genes were obtained (https://www.genecards.org/). The mRNA expression matrix of iron metabolism genes in this study was obtained by taking the intersection with the expression matrix of the above-mentioned TCGA-LUAD patients, which was then used for subsequent analysis.
In addition, we retrieved gene expression arrays (GSE42127) and clinical information materials of another 133 lung adenocarcinoma patients in the Gene Expression Omnibus (GEO) (https://www. ncbi.nlm.nih.gov/geo/) in order to verify the prognostic status of the gene signatures found in the training set (TCGA). Similarly, we deleted samples with no clinical data and with a followup time lower than 30 days. Finally, 131 patients were used as a test set for further validation.

Identify differentially expressed genes by dimension reduction algorithm
The 'SVA' R software was used to eliminate batch effects and other unnecessary changes in highthroughput experiments; the intersection genes of the TCGA and GEO data sets were obtained respectively [15]. Next, the 'limma' R software package was used to further distinguish the differentially expressed genes (DEGs) between the tumor tissue and the tumor adjoining tissue ((FDR)<0.05, logFC = 1.5) [16]. Finally, univariate Cox analysis was executed to single out iron metabolism-related genes with a strong prognostic ability (P < 0.01). The candidate metabolic genes obtained were used in the next step of constructing the prognostic gene signature.

PPI network construction
The candidate genes related to iron metabolism were obtained by univariate Cox analysis. On the STRING database portal (version 11.0) [17], the protein-protein interaction (PPI) network of candidate genes was downloaded. R software package was used to compute the correlation coefficients of iron metabolism candidate genes and construct a correlation network diagram.

Construction and evaluation of iron metabolism gene prognosis model
For the candidate genes obtained above, in order to prevent overfitting, LASSO-Cox regression analysis was carried out through the 'glmnet' package, and a predictive prognostic model containing 12 genes related to iron metabolism (iteration = 2000) was constructed [18]. The LASSO penalty was applied to simultaneous consideration of contraction and variable selection [18,19]. The penalty parameter (λ) of this metabolic model was confirmed through 10-fold cross-validation based on the 'glmnet' software package in the R software [18]. On the basis of standardized expression levels of a piece iron metabolic genes and its regression coefficient, the risk score equation of LUAD patients was calculated as follows: Risk Score = ∑ (The expression level of a piece metabolic gene × regression coefficient).
Patients with lung adenocarcinoma could be split into high-risk and low-risk groups, on the basis of the median risk score. Kaplan-Meier analysis was carried out and the 'survival ROC' software package was applied in order to draw time-dependent ROC curves [20,21]. In addition, it was possible to assess the predictive performance of the metabolic gene signature using a calibration chart that compared, predicted, and observed overall survival (OS) [20]. The GSE42127 data set with clinical data was used for further external verification.

Prognostic independence analysis of lung adenocarcinoma
There was a need to further determine independent prognostic parameters and verify the powerful prognostic ability of gene signature. Therefore, in order to conduct the study of gene signature and clinical pathological parameters (mainly age, gender, stage, TNM stage) that predict prognosis in the TCGA data set, we conducted univariate and multiple Cox regression analysis. Among them, in the multivariate Cox regression analysis, P < 0.05 was considered statistically significant. Therefore, we only considered the parameters with P value <0.05 in the univariate analysis.

Potential correlations between high and low risk populations and biological functions and immune cells
To explore the molecular mechanisms of the metabolic gene signature, we executed gene set enrichment analysis (GSEA) to further validate the model (version GSEA_4.0.3) [22]. Before that, we divided LUAD patients into high-risk and low-risk groups. Then, based on the gene expression data of lung adenocarcinoma patients obtained from TCGA, the ESTIMATE (using gene expression profile to assess stromal cells and immune cells in malignant tumor tissues) algorithm was used to calculate stromal, immune and estimated scores [23]. Subsequently, using the 'gsva' software package in the R software, we performed a single-sample gene set enrichment analysis (ssGSEA) [24]. Through the application and estimation of expression data, valuable insights into the state of immune cell infiltration and the activity of immune-related pathways were obtained [24].

Statistical analysis
Statistical analysis was performed in R software v. 4.0.2. The Student's t test was applied to analyze paired samples of tumor tissue and adjacent tumor tissue. For OS between people in different risk groups, Kaplan-Meier analysis and comparisons were applied. Through univariate and multivariate Cox regression analysis, survival assessment was further carried out. For further verification and evaluation, time-dependent ROC curve and calibration curves were drawn. Additionally, the hazard ratio (HR) and 95% confidence interval (CI) was computed. The stromal, immune and estimated scores were calculated using the ESTIMATE software package. In all the statistical tests involved in this research, a P value < 0.05 was considered statistically significant.

Results
In the present study, we aimed to identify genes involved in iron metabolism that affect the prognosis of LUAD. We identified 12 DEGs involved in iron metabolism. These DEGs were used to construct a new prognostic models and validate it to explore the prognostic predictive power and diagnostic power of the signature. In addition, we also performed functional enrichment analysis and immune correlation analysis to explore the potential biosynthetic mechanisms involved in the pathogenesis of lung adenocarcinoma.

Identification of iron metabolism DEGs in LUAD
After pre-processing, 486 LUAD patients from TCGA and 131 lung adenocarcinoma patients from GEO were selected. A detailed summary of the clinical features of these patients is shown in Table 1. In order to identify prognostic genes related to iron metabolism of LUAD, differential expression analysis was performed. The DEGs between tumor samples and neighboring tumor samples were selected through the Wilcox Test. A total of 257 iron metabolism-related DEGs were identified (adjusted p values<0.05 and |logFC|>1.5); among them, there were 154 up-regulated DEGs, and 103 significantly down-regulated DEGs. The heat map and volcano map of these differential genes are shown in (Figure 1(a,b)). To further identify the representative prognostic genes of iron metabolism, we performed univariate Cox analysis, leading to the retention of 46 DEGs (P < 0.01, Table 2). The interaction network between these genes is shown in (Figure 1(e)).

Establishment and verification of an iron metabolism-related gene signature
For constructing a genetic signature related to iron metabolism, the following steps were performed: first, on the basis of expression profiles of the above 46 candidate genes, LASSO Cox regression analysis (2,000 iterations) was carried out. According to the minimum λ, the optimal model was constructed with the minimum parameters ( Figure 1(c,d) Patients could be split into high-risk group (n = 238) and low-risk group (n = 239) using the optimal cutoff value of the risk score (After adjusted, Figure 2(a)). Kaplan-Meier analysis results revealed that the OS of the two different risk groups in the training group was significantly different. It was observed that the OS of the low-risk group was significantly higher than that of the high-risk group (P < 0.0001, Figure 2(b)). Next, the strong prognostic value of 12 gene signatures was analyzed using the time-dependent ROC curve. In addition, with respect to the prediction of risk scores for 1-year, 3-year, and 5-year overall survival, the AUCs were 0.735, 0.711, and 0.601, respectively (Figure 2(d, e,f)).

External verification of 12 gene signatures in GSE42127
The external data set GSE42127 further proved the predictive capability of the 12-gene prognostic signature. For patients in the GEO cohort, the same calculation method as the TCGA cohort was applied to compute the risk score, following which the LUAD patients were split into highrisk and low-risk groups (Figure 3(a)). Kaplan-Meier analysis results were similar to those obtained in the TCGA cohort; it was shown that the overall survival of the low-risk group was significantly longer than that of the high-risk group (P = 0.001, Figure 3(b)). Next, the prognostic ability of the signature was assessed through time-dependent ROC, wherein the 12gene signature could have a higher performance. When predicting the AUC of the overall survival (OS) of the 12-gene signature, the results at 1, 3 and 5 years were 0.904, 0.745, and 0.712, respectively (Figure 3(d,e,f)).  (Table 3). However, in the multivariate Cox regression analysis, the risk score was an independent predictor of OS.

Constructed and verified nomogram and calibration plots
All the clinical information parameters in the univariate Cox regression analysis mentioned above exist in the TCGA and GEO cohorts. Among these, gender, stage, T stage, M stage and N stage were the parameters involved in the TCGA nomogram (Figure 2(c)). The parameters included in the GEO nomogram were gender and stage (Figure 3 (c)). In the TCGA and GEO cohorts, a prognostic nomograph was constructed to predict the OS at 1, 3, and 5 years, respectively. It can be seen that those patients with higher scores have distinctly lower OS than those with lower scores. In addition, the results of the calibration chart have shown that the nomogram is significantly accurate in predicting the OS of patients with lung adenocarcinoma (TCGA: Figure 2(g,h,i), GEO: Figure 3 (g,h,i)).

Biological function and immune analysis of TCGA Cohort
Next, we aimed to deepen our understanding of the biological functions of the prognostic model. In 477 LUAD samples from high-risk and low-risk groups of TCGA (After adjusted), ssGSEA was used to explore the tumor microenvironment in different immune clusters, and to compute the stromal score, immune score and estimated score of cancer tissue expression profile. Based on the data, we have reason to conclude that patients with high immunity have higher estimated score, stromal score, and immune score than patients with low immunity. In contrast, the tumor purity of the low-immune patients was higher than that of the high-immune patients. The result is shown in Figure 4(a). These results indicated that the TCGA cohort significantly enriched many immune-related biological processes (P < 0.05). Among them, eight biological processes related to immunity include: immunoglobulin complex, natural killer cell chemotaxis, circulating immunoglobulin complex, immunoglobulin receptor binding, MHC class II protein complex, MHC protein complex, positive regulation of interferon−gamma biosynthetic process, and T cell receptor complex (P < 0.05, Figure 4(b)).  We also probed the correlation between high and low-risk populations and immune status. ssGSEA analysis was used to analyze the immune cells and related functions of the two groups. Between the low-risk group and the high-risk group of the TCGA cohort, it was observed that the scores of APC co-inhibition, Th2 cells, NK cells and MHC class I were significantly different (P < 0.05, Figure 4(c,e)). In addition, for the high-risk group, the scores of Inflammation-promoting, Parainflammation and Th1 cells were higher, while the scores were lower for HLA, type II IFN response, aDCs, B cells, iDCs, Mast cells, Neutrophils and T helper cells (P < 0.05, Figure 4(c,e)). In the ssGSEA score of the GEO cohort, between the high-risk group and the low-risk group, there were differences in cytolytic activity, MHC class I, Parainflammation, type II IFN response, Th2 cells, CD8 + T cells, iDCs, Inflammation−promotion, Th1 cells, and Treg (P < 0.05, Figure 4 (d,f)). (*** P < 0.001, ** P < 0.01, * P < 0.05, ns = not significant).

Discussion
Lung cancer is a malignant lesion formed by the immortal proliferation of cancer cells with genetic mutations in the lungs. Lung adenocarcinoma is the most common type of lung cancer. Studies have shown that malignant cancer is usually related to dysregulated iron metabolism, especially the expression of iron metabolism genes. This excess iron is needed not only in the early stages of tumor development, but also in the late stages of promoting the metastatic cascade [25,26]. In view of the complex network of iron metabolism genes in cancer cells and their effects on tumor growth and survival, it is necessary to understand their relevance to the prognosis of lung adenocarcinoma. In present study, we identified DEGs related to iron metabolism, and then constructed a 12-gene prognostic model through the LASSO Cox regression analysis and verified its relationship with OS in the external cohort (GSE42127). The results indicated that the 12-gene signature was able to divide LUAD patients in the TCGA and GEO datasets into two groups with different risk levels, namely the high-risk group and the low-risk group. Kaplan-Meier analysis suggested that patients with low-risk scores were correlated with better prognosis, and vice versa. These results indicated that our gene model significantly correlated with the overall survival of LUAD patients. In addition, univariate and multivariate Cox analysis results revealed that our signature model was closely related to risk scores. Risk scores are an extremely important factor that predict the prognosis of patients, which further reflects the strong prognostic ability of our signature. ROC analysis, nomogram and calibration graphs using TCGA and GEO data sets also confirmed the robustness of our prognostic model. In addition, functional and immune exploration analysis showed that immune-related pathways were enriched. Therefore, these results indicated that our 12gene signature provides the possibility of identifying lung adenocarcinoma and using iron metabolism genes to establish a prognostic model.
The gene signature proposed in this study consists of 12 iron metabolism-related genes (HAVCR1, SPN, GAPDH, ANGPTL4, PRSS3, KRT8, LDHA, HMMR, SLC2A1, CYP24A1, LOXL2, TIMP1). Hepatitis A virus cellular receptor 1 (HAVCR-1) is mainly a susceptibility gene for asthma and allergies, which is principally expressed on Th2 cells and acts as an effective costimulatory molecule for T cell activation [27,28]. According to a report by Zheng et al. [29] the abnormal expression of HAVCR-1 is associated with the occurrence and progression of NSCLC. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is a glycolytic enzyme and one of the main housekeeping proteins, and its increased expression is correlated with the proliferation and invasion of lung cancer [30]. Angiopoietin-like protein 4 (ANGPTL4) is a glycoprotein secreted by various cells; it belongs to the Angiopoietin family (ANGPTL) and is overexpressed in nonsmall cell lung cancer [31]. According to Ma et al. [32] serine protease 3 (PRSS3) and its signal transduction pathway are related to poor prognosis in lung cancer, which may lead to the invasion and growth of lung adenocarcinoma tumor cells. Keratin 8 (KRT8) is a type II basic intermediate filament (IF) protein, which can be abnormally expressed in various human cancers (including lung adenocarcinoma tissue) [33]. Lactate dehydrogenase A (LDHA) is an enzyme that plays a particularly important role in cancer cell metabolism and tumor growth, and is connected with poor prognosis in lung adenocarcinoma [34,35]. Hyaluronan-mediated motor receptor (HMMR) is a multifunctional protein, according to Song et al. [36]. HMMR is associated with the reduction of the overall survival of lung cancer patients. In addition, it can pass HCG18/miR-34a-The 5p/ HMMR axis that can accelerate the progression of lung adenocarcinoma [37]. Glucose transporter 1 (GLUT1) is a pivotal protein in the pathway of cellular energy metabolism, also known as solute carrier family 2 member 1 (SLC2A1); it has a particularly essential role in the occurrence and progression of tumors, and may be one of the driver genes of lung cancer [38]. Cytochrome P450 family 24 subfamily A member 1 (CYP24A1) is situated at the inner mitochondrial membrane and nucleus, according to Shiratsuchi et al [39]. The expression of CYP24A1 is relevant to the poor prognosis of resected lung adenocarcinoma. Lysine oxidase-like 2 (LOXL2) pertains to the lysyl oxidase (LOX) family, and is mainly involved in the formation of cross-linked products of matrix collagen and elastin outside the cell [40]. In addition, according to the report by Peng et al. [41], LOXL2 has a driving effect on the invasion and metastasis of lung cancer, and the increase of LOXL2 expression indicates poor prognosis in patients with LUAD. The tumor/stroma TIMP-1 intensity ratio in the tissue has a particularly important predictive effect on tumor recurrence [42]. At present, the function of SPN in the occurrence and development of lung cancer is indistinct. Although some biological functions of these 12 genes have not been reported in LUAD, it provides a new direction for the study of tumorigenesis and cancer immunity.
In the past ten years, although iron has been a research hotspot of lung cancer, there are few studies on the correlation between iron metabolism and tumor immunity. For the patients in the different risk groups of LUAD, GO analysis was conducted. Unexpectedly, many biological processes related to immunity were enriched. Therefore, we speculate that iron metabolism may be closely related to tumor immunity. Moreover, we also studied and explored the interrelationship between risk groups and immune cells. Interestingly, there is a difference between high and low-risk groups, including naïve B cells, CD8 + T cells, activated CD4 + memory T cells, M1 Macrophages and activated dendritic cells. Previous research has indicated that CD8 + T cells [43,44] and macrophages [44,45] have a connection with the poor prognosis of lung cancer patients. Perhaps one of the reasons for the poorer prognosis of high-risk patients is the weakened anti-tumor immune function.
This study also has few deficiencies as well as limitations. First, our predictive model is constructed and verified by retrospective data from public databases. Therefore, it is necessary to conduct more prospective experimental studies to further verify the prognosis of our gene signature, and experimental studies on these genes may provide new insights into their biological functions. Secondly, the use of a single feature to build a predictive gene signature is actually an inherent defect. In practice, other mechanisms also affect the occurrence and development of lung adenocarcinoma.

Conclusion
In summary, our research may define a new gene signature of iron metabolism to explore the overall survival of lung adenocarcinoma. The 12-gene signature consists of promising prognostic biomarkers for lung adenocarcinoma, and also provides multiple targets for precise treatment.

Research highlights
1. We developed the first-ever signature based on iron metabolism-related gene. 2. We uncovered a prognostic signature of 12 iron metabolism-related genes. 3.Provide a novel direction for the development of precise treatment strategies. 4. High-risk individuals showed higher tumor purity and lower immune scores.