Development of cancer prognostic signature based on pan-cancer proteomics

ABSTRACT Genomic data alone are insufficient for predicting cancer prognosis. Proteomics can improve our understanding of the etiology and progression of cancer and refine the assessment of cancer prognosis. The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has generated extensive proteomics data for the vast majority of tumors, which makes a proteomic pan-cancer analysis possible. We collected the proteomics data and clinical features of cancer patients from CPTAC. We then screened 69 differentially expressed proteins (DEPs) with R software in five cancers: hepatocellular carcinoma (HCC), children’s brain tumor tissue consortium (CBTTC), clear cell renal cell carcinoma (CCRC), lung adenocarcinoma (LUAD) and uterine corpus endometrial carcinoma (UCEC). GO and KEGG analyses were performed to clarify the functions of these proteins, and their interactions were identified. A DEPs-based prognostic model for predicting overall survival (OS) was built with the least absolute shrinkage and selection operator (LASSO)-Cox regression model in the training cohort. We then used time-dependent receiver operating characteristic analysis to evaluate the ability of the prognostic model to predict overall survival and validated it in the validation cohort. The results showed that the DEPs-based prognostic model could accurately and effectively predict the survival rate of most cancers.


Introduction
As one of the most prevalent fatal diseases, cancer ranked second among all causes of death worldwide in 2017 [1]. Cancer deaths have risen year by year, from 7.62 million in 2007 to 9.56 million in 2017. In 2018, 18.1 million people worldwide were diagnosed with various types of cancer [2]. Despite significant progress in treatment, the lack of timely diagnosis and the high cost of treatment keep many patients from receiving effective therapy, which remains the reason for the low 5-year survival rate of most cancers [3]. To develop optimal anti-cancer treatment protocols and elucidate the mechanism of tumorigenesis, it is essential to estimate the prognosis of tumor patients [4]. Many studies have used RNA sequencing data from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) to identify tumor prognostic biomarkers and construct prognostic models [5,6]. However, genomic data alone are insufficient and imprecise for predicting cancer prognosis, because the molecular drivers of cancer derive not only from DNA alterations but also from protein expression, modification, and activity at the metabolic level [7].
It is widely acknowledged that tumor cells are characterized by rapid generation and abnormal proliferation. Hence, tumor tissues regulate protein expression and promote the production of proteins associated with cancer progression [8]. Moreover, proteins are the functional effectors of cellular processes as well as the targets of the vast majority of therapeutics [9]. Therefore, the study of proteomics can improve our understanding of cancer etiology and progression and strengthen the assessment of cancer prognosis [10]. Most previous studies have focused on the effect of an individual protein on cancer prognosis [11][12][13]. However, cancer is a heterogeneous disease that involves not only individual proteins but also interactions among proteins with different functions. The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project has generated a great deal of mass spectrometry-based proteomics data for the vast majority of tumors [14]. Based on the proteomics data from CPTAC, we aimed to combine multiple proteins to construct a pan-cancer prognostic model.
In the current study, we screened differentially expressed proteins (DEPs) in five cancers: hepatocellular carcinoma (HCC), uterine corpus endometrial carcinoma (UCEC), children's brain tumor tissue consortium (CBTTC), lung adenocarcinoma (LUAD) and clear cell renal cell carcinoma (CCRC). Next, we explored the role of these DEPs in cancer and the relationships among them. Finally, we developed a DEPs-based survival-predictor model for predicting survival rates in the vast majority of cancers.

Identification of DEPs between tumor tissues and adjacent nontumorous tissues
For the proteomic data from CPTAC, background correction, quantile normalization, and batch normalization were performed using R software (version 3.6.1). The protein expression values of the five cancers were normalized with the 'sva' package. The Bioconductor (http://www.bioconductor.org) package 'limma' was used for DEP screening. A |log2 fold change| > 1 and an adjusted P value < 0.05 were set as the cutoff criteria.
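The two cutoff criteria above can be illustrated with a minimal Python sketch. It is not the 'limma' workflow used in the study: it assumes per-protein log2 fold changes and raw P values are already available, and it assumes the Benjamini-Hochberg procedure for the adjustment, which the text does not specify. The protein names and numbers are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def screen_deps(records, lfc_cutoff=1.0, alpha=0.05):
    """Filter (protein, log2_fold_change, raw_p) records by both cutoffs."""
    adjusted = benjamini_hochberg([raw_p for _, _, raw_p in records])
    return [
        (name, lfc, adj_p)
        for (name, lfc, _), adj_p in zip(records, adjusted)
        if abs(lfc) > lfc_cutoff and adj_p < alpha
    ]


# Hypothetical toy values, not taken from the study.
toy_records = [
    ("PROT_A", 2.1, 0.0004),
    ("PROT_B", -1.6, 0.003),
    ("PROT_C", 0.4, 0.001),   # significant, but fold change too small
    ("PROT_D", 1.3, 0.20),    # large fold change, but not significant
]
selected = screen_deps(toy_records)
print([name for name, _, _ in selected])
```

Both up-regulated and down-regulated proteins pass the filter because the fold-change cutoff is applied to the absolute value.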

PPI network construction
The PPI network of DEPs was constructed with STRING [15] (https://string-db.org/), and a combined score >0.9 (high confidence) was set as the cutoff criterion. Cytoscape software (http://www.cytoscape.org/) was used to visualize the results from STRING.
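As a rough illustration of the confidence cutoff, the sketch below filters a list of scored protein pairs and ranks the remaining proteins by their number of partners. It does not call STRING or Cytoscape; the edge list, the 0 to 1 score scale, and the function names are assumptions made for this example, and the scores are hypothetical.

```python
from collections import defaultdict


def high_confidence_network(edges, cutoff=0.9):
    """Keep edges whose combined score exceeds the cutoff.

    Returns a dict mapping each connected protein to its set of partners.
    Proteins with no surviving edge are absent from the result.
    """
    adjacency = defaultdict(set)
    for a, b, score in edges:
        if score > cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def rank_hubs(adjacency):
    """Sort proteins by degree, highest first (ties broken by name)."""
    return sorted(adjacency, key=lambda p: (-len(adjacency[p]), p))


# Hypothetical scores for illustration only.
toy_edges = [
    ("CDK1", "MCM2", 0.95),
    ("CDK1", "MKI67", 0.93),
    ("CDK1", "RRM2", 0.97),
    ("ENO3", "FBP1", 0.92),
    ("VCAN", "DCN", 0.70),   # below the cutoff, so dropped
]
network = high_confidence_network(toy_edges)
hubs = rank_hubs(network)
print(hubs[0], len(network[hubs[0]]))
```

If an export reports scores on a different scale, they would need rescaling before the cutoff is applied.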

Construction of DEPs-based classifiers
Using univariate Cox regression models, we identified single DEPs as independent prognostic DEPs for OS at a p-value <0.05. The least absolute shrinkage and selection operator (LASSO)-Cox regression model [16] was used to identify the most accurate predictive DEPs for OS. The correlations among the prognostic DEPs were computed with the R packages 'ggcorrplot' and 'statn'.
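For orientation, the LASSO-Cox estimator can be written as an L1-penalized Cox partial log-likelihood. The notation below is ours, not the paper's: x_i is the vector of DEP expression values for patient i, delta_i is the death indicator, R(t_i) is the set of patients still at risk at time t_i, p is the number of candidate DEPs, and lambda is the penalty weight. How lambda was tuned is not stated in the text. The form shown assumes no tied event times and omits the sample-size scaling that some software applies.

```latex
\hat{\beta}
  = \arg\max_{\beta}
    \left[
      \sum_{i:\,\delta_i = 1}
        \left(
          x_i^{\top}\beta
          - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right)
        \right)
      - \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert
    \right]
```

The L1 penalty shrinks some coefficients exactly to zero, so the DEPs with non-zero coefficients are the ones retained in the signature.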

Predictive performance of the DEPs-based classifiers
A patient's risk score was obtained by multiplying the expression of each DEP selected by LASSO by its respective coefficient and summing the products. Patients were then stratified into two risk groups by the median risk score. Survival was analyzed by Kaplan-Meier analysis with the log-rank test. Time-dependent receiver operating characteristic (tdROC) analysis was used to assess the performance of single DEPs and of the classifier through the 'timeROC' package of R software. The area under the curve (AUC) of the tdROC reflected predictive accuracy. P-values <0.05 were considered statistically significant.
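The risk-score calculation and median split described above can be sketched in a few lines of Python. The coefficients and expression values are hypothetical, and assigning scores strictly above the median to the high-risk group is an assumption; the text does not say how ties at the median were handled.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor: sum over selected DEPs of coefficient * expression."""
    return sum(coef * expression[protein] for protein, coef in coefficients.items())


def stratify_by_median(scores):
    """Split patients into 'high' and 'low' risk at the median score."""
    cutoff = median(scores.values())
    groups = {
        patient: ("high" if score > cutoff else "low")
        for patient, score in scores.items()
    }
    return cutoff, groups


# Hypothetical coefficients and expression values, for illustration only.
toy_coefficients = {"CDK1": 0.5, "FBP2": -0.25}
toy_patients = {
    "P1": {"CDK1": 2.0, "FBP2": 4.0},
    "P2": {"CDK1": 4.0, "FBP2": 2.0},
    "P3": {"CDK1": 1.0, "FBP2": 8.0},
    "P4": {"CDK1": 3.0, "FBP2": 2.0},
}
toy_scores = {pid: risk_score(expr, toy_coefficients) for pid, expr in toy_patients.items()}
cutoff, groups = stratify_by_median(toy_scores)
print(cutoff, groups)
```

A positive coefficient raises the score as expression rises, and a negative coefficient lowers it, which matches the interpretation of risk-promoting and protective proteins.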

Data analysis
The Student's t-test, Wilcoxon test, and other data processing were performed in SPSS 19.0. Kaplan-Meier analysis was performed with the 'survminer' package of R software. For all tests, P < 0.05 was considered statistically significant.
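To make the Kaplan-Meier analysis concrete, here is a minimal product-limit estimator in Python. It is a sketch of the general technique, not of the 'survminer' implementation; the follow-up times and event flags are hypothetical, and the log-rank test is not included.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical follow-up data (months), for illustration only.
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients do not lower the curve but do shrink the risk set, which is why later deaths produce larger drops.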

GO analysis and KEGG analysis
To explore the role of the 69 DEPs in tumors, we conducted GO and KEGG analyses. The 69 DEPs were mainly associated with the following biological processes: carboxylic acid biosynthetic process, organic acid biosynthetic process, G1/S transition of mitotic cell cycle, cell cycle G1/S phase transition, monocarboxylic acid biosynthetic process, glucose metabolic process, hexose metabolic process, and DNA replication (Figure 2a). The DEPs were also mainly associated with the following cellular components: nuclear chromosome part, extracellular matrix, telomeric region and MCM complex (Figure 2a). In addition, the DEPs were related to molecular functions such as extracellular matrix structural constituent, carbohydrate binding, helicase activity and monosaccharide binding (Figure 2a). Similar to the GO analysis, the KEGG analysis showed that the DEPs primarily contributed to the following pathways: Cell cycle, Glycolysis/Gluconeogenesis, DNA replication, Carbon metabolism, Pentose phosphate pathway and Fructose and mannose metabolism (Figure 2b). Furthermore, combining the GO cluster diagram and the GO chord diagram, we found that the DEPs involved in DNA replication, Cell cycle, and Arginine and proline metabolism were mainly highly expressed, whereas those associated with terms such as Carbon metabolism and Fructose and mannose metabolism included both highly and lowly expressed proteins (Figure 2c,d).

DEPs interaction clusters common across five cancers
The 69 DEPs were used for network analysis, and almost half of them formed an interaction network after proteins that acted independently were eliminated (Figure 3a). These interacting proteins fell roughly into four groups, with CDK1, ENO3, argininosuccinate synthase (ASS1) and versican core protein (VCAN) as the cores (Figure 3a,b). CDK1 was observed to be the key hub protein that interacted with DNA replication licensing factor

The effect of individual DEPs on survival
To explore the effect of these proteins on cancer prognosis, Kaplan-Meier survival analyses were performed for each individual protein. Based on the median expression of each DEP, we divided the cancer patients into two clusters: high protein level and low protein level. We defined four types of cancer (HCC, CCRC, LUAD, and UCEC) as the training cohort and CBTTC as the validation cohort, and analyzed the OS of patients in the training cohort. As shown in Figure S1, only 10 of the 69 DEPs were statistically significant in the survival analysis (P < 0.05). Patients whose cancerous tissue expressed higher levels of any one of RRM2, procollagen-lysine,2-oxoglutarate 5-dioxygenase 2 (PLOD2), MKI67, MCM5, and CDK1 had lower survival rates (Figure S1A-E). Patients whose cancerous tissue expressed higher levels of any one of FBP1, FBP2, ENO3, GPD1, and ASS1 had higher survival rates (Figure

Construction of the DEPs-based survival-predictor model
To obtain a better model, we combined multiple DEPs to predict survival rates for cancer patients. We first conducted univariate Cox analyses in the training cohort and identified 33 DEPs related to survival (Figure 4a). We then fitted the LASSO-Cox regression model on the 69 DEPs in the training cohort. Based on its results, 24 prognostic DEPs with non-zero regression coefficients were finally chosen as potential prognostic biomarkers for the OS of cancer patients (Figure 4b,c). Detailed information on the DEPs used to construct the prognostic signature is summarized in Table 1 and Figure 4d,e. Among these proteins, the correlations between CDK1 and MKI67, P4HA2 and P4HA1, PGM5 and IL33, and PGM5 and DES were all greater than 0.5.
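The pairwise correlation screen mentioned above can be sketched as follows. The text does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the protein names and expression vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(expression, threshold=0.5):
    """Return (protein_a, protein_b, r) for every pair with r above the threshold."""
    pairs = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        if r > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical expression vectors across four patients.
toy_expression = {
    "PROT_A": [1.0, 2.0, 3.0, 4.0],
    "PROT_B": [2.0, 4.0, 6.0, 8.0],   # rises with PROT_A
    "PROT_C": [4.0, 3.0, 2.0, 1.0],   # falls as PROT_A rises
}
pairs = correlated_pairs(toy_expression)
print([(a, b) for a, b, _ in pairs])
```

Strongly correlated proteins carry overlapping information, which is one reason a penalized model tends to keep only some of them.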

Evaluation of the survival-predictor model
Based on the survival-predictor model, we divided the cancer patients evenly into two groups, high risk and low risk, at the median risk score of 0.250379 (Figure 5a). Patient information is shown in Tables 2 and 3. The expression heatmap of the 24 DEPs in the high-risk and low-risk groups is also shown in Figure 5a. We then estimated the accuracy of the 24-DEPs model in predicting survival. The Kaplan-Meier survival curves showed that survival rates were significantly lower in the high-risk group (P < 0.001) (Figure 5b). The ROC analysis showed that the one-, two-, and three-year AUCs of the 24-DEPs survival-predictor model were 0.764, 0.754, and 0.742, respectively (Figure 5c). Notably, the AUC of the 24-DEPs survival-predictor model was higher than the AUCs of the 10 individual proteins described above (Figure S2). Thus, compared with a single protein as a predictor, the 24-DEPs survival-prediction model had more accurate and powerful predictive capability.
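To give a feel for what an AUC at a fixed time horizon measures, the sketch below computes a deliberately naive version: patients who died by the horizon are cases, patients still under follow-up after the horizon are controls, and patients censored before the horizon are discarded. This is not the estimator implemented in the 'timeROC' package, which handles censoring more carefully; the function name and all values are hypothetical.

```python
def naive_time_auc(scores, times, events, horizon):
    """Naive AUC at a time horizon, ignoring patients censored before it.

    Returns the fraction of (case, control) pairs in which the case has the
    higher risk score; tied scores count as half.
    """
    cases = [s for s, t, e in zip(scores, times, events) if t <= horizon and e == 1]
    controls = [s for s, t in zip(scores, times) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Hypothetical risk scores and follow-up (months), for illustration only.
toy_scores = [0.9, 0.2, 0.3, 0.4, 0.5]
toy_times = [6, 30, 10, 40, 8]
toy_events = [1, 0, 1, 0, 0]   # the last patient is censored at 8 months
auc_12 = naive_time_auc(toy_scores, toy_times, toy_events, horizon=12)
print(auc_12)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of patients who die by the horizon from those who survive past it.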
To further validate the applicability of this model, we used the same 24-DEPs survival-predictor model and cutoff point to group patients in the validation cohort (CBTTC) (Figure 5d). The survival analysis again indicated that the high-risk group had worse OS (P < 0.001) (Figure 5e). The result of the ROC analysis was also In conclusion, the 24 DEPs-based classifier could accurately predict survival not only in the training cohort but also in the validation cohort.

Discussion
As a complex disease, cancer involves not only DNA alterations but also protein expression and modification [7]. With technological improvements, CPTAC has generated comprehensive mass spectrometry-based proteomic data for most cancers [14], which provides a unique opportunity for pan-cancer proteomics research with sufficient data.
In the current study, we first screened 69 differentially expressed proteins in five types of cancer tissue. More importantly, the expression trend of the DEPs was consistent across all five cancers, which indicates that these proteins are not specific to any one type of cancer. Among the DEPs, CDK1 plays an important role in progression into the mitotic phase and can drive the cell cycle in all cell types [17]. Previous studies also showed that CDK1 expression is upregulated in a majority of tumor tissues and correlates with the prognosis of cancer patients [18][19][20]. MCM2, MCM3, MCM4, MCM5, MCM6 and MCM7 form the minichromosome maintenance 2-7 (MCM2-7) complex, which is exported by the CDKs to trigger DNA replication [21]. In brief, CDK1 interacts with the MCM2-7 complex to participate in the cell cycle, which is consistent with the GO and KEGG analyses. Furthermore, we found that CDK1, as a key hub protein, interacted with other DEPs to form an interaction cluster. In addition to the MCM2-7 complex, other proteins in the cluster, such as RRM2, PRKAR2B, and MKI67, also influence the growth and division of tumor cells by participating in the cell cycle [22][23][24]. Most DEPs related to the cell cycle were upregulated, which is consistent with the vigorous growth and division of tumor cells.
The 69 DEPs were involved not only in the cell cycle but also in cell metabolism (Figure 2a,b). Because metabolic reprogramming is a well-established hallmark of cancer, alterations in the expression of metabolism-related proteins are common in tumors [25]. According to Figure 3, the metabolism-related DEPs fell roughly into two groups: carbohydrate metabolism-related proteins and amino acid metabolism-related proteins. ENO3, FBP1, FBP2, GPD1, and ALDOB are all glycolytic pathway-related proteins with inhibitory effects on tumors [26][27][28][29]. For instance, ALDOB disrupts redox homeostasis by reducing the levels of fructose 1,6-bisphosphate in tumor cells, which can inhibit tumor cell proliferation [27]. Previous research also showed that, although gluconeogenesis is frequently suppressed in tumors, re-expression of gluconeogenesis enzymes such as FBP1 can inhibit tumor growth [29]. As an enzyme responsible for the biosynthesis of arginine in most body tissues, ASS1 is downregulated in multiple diverse cancers, which reprograms arginine metabolism and makes tumor cells more aggressive [30]. Moreover, according to our results, these tumor-inhibiting metabolism-related proteins were also downregulated. In contrast, PYCR1, another protein related to amino acid metabolism, was highly expressed; it maintains the redox balance of tumor cells and prevents apoptosis by synthesizing proline [31].
Besides the DEPs associated with metabolism and cell proliferation, quite a few DEPs were associated with the extracellular matrix. As a large extracellular matrix proteoglycan, VCAN regulates proliferation, invasion, metastasis, and adhesion in a vast majority of tumor cells, and VCAN expression is associated with poor prognosis in most cancers [32][33][34]. THBS2 is also an extracellular matrix protein and promotes cell migration and angiogenesis [35]. Unlike VCAN and THBS2, DCN, although associated with the extracellular matrix, can antagonize many tyrosine kinase receptors to inhibit tumor development and progression [36]. Taken together, of the four DEP interaction clusters, one was involved in cell growth and division, one in carbohydrate metabolism, one in amino acid metabolism, and one in extracellular matrix regulation. To summarize, the functions of the 69 DEPs fell into three main categories: cell proliferation and division, cellular metabolism, and extracellular matrix regulation.
In the next step, we performed Kaplan-Meier survival analyses of the 69 DEPs one by one and found that only 10 DEPs were significantly correlated with survival across multiple cancers. Of these 10 proteins, previous studies have identified RRM2, PLOD2, MKI67, MCM5, and CDK1 as promoting cancer progression and FBP1, FBP2, ENO3, GPD1, and ASS1 as inhibiting it, which is consistent with our results (Figure S1). Nevertheless, the traditional approach of concentrating on single molecular biomarkers, such as an individual protein, has not been successful, because the development and progression of cancer are driven primarily by a set of biomolecules rather than by the dysfunction of an individual molecule [37,38]. As shown in Figure S2, the accuracy of the 10 DEPs in predicting cancer prognosis was not high. Therefore, using the LASSO regression method, we determined 24 DEPs: MKI67, LOXL2, PLIN4, IL33, MDK, P4HA2, AKR7A3, PLCXD3, CDK1, SRPX, PRPH, PRKAR2B, P4HA1, CALML3, SFN, DES, PHYHD1, GPD1, AADAT, PGM5, ADH1C, FBP2, ENO3, and EHD3. In accordance with the above classification, among the 24 proteins, CDK1, SFN, PRKAR2B, MKI67 and MDK are involved in the cell cycle [17,24,39]; AKR7A3, GPD1, ENO3, FBP2, AADAT, PGM5 and ADH1C are involved in cell metabolism [26,28,40]; and LOXL2, P4HA1, P4HA2, SRPX, DES, PRPH and CALML3 are involved in the construction and regulation of the extracellular matrix [41][42][43]. Most of these proteins have been shown to contribute to the prognosis of many cancers [19,26,[41][42][43][44]. Although IL33 and EHD3 do not belong to any of the three groups mentioned above, some studies have shown that they can inhibit the proliferation of tumor cells [45,46]. In addition to these widely studied proteins, there are still several proteins whose roles in cancer are unclear, such as PLCXD3, PHYHD1 and PLIN4, which suggests a new direction for cancer research.
Although no research has yet explored the specific ways in which they interact, correlation analysis showed that PGM5 was related to IL33 and DES. We therefore inferred that PGM5 may be involved in the regulation of tumor inflammation and the extracellular matrix by regulating metabolism. The tumor immune microenvironment is closely related to tumor prognosis, and NK cells and T cells, the main anti-tumor cells, are associated with cancer immune evasion [47][48][49]. Among the 24 DEPs, IL33 and EHD3 are associated with NK cells and play an important role in TCR-mediated T cell functions [50,51]. Edwin Wang et al. proposed a cancer hallmark network framework for modeling genome sequencing data and the associated clinical phenotypes [52,53]. Most of the 24 DEPs were linked to cancer hallmarks: CDK1, SFN, PRKAR2B, MKI67 and MDK are involved in the cell cycle; LOXL2, P4HA1, P4HA2, SRPX, DES, PRPH and CALML3 are involved in the construction and regulation of the extracellular matrix; and IL33 and EHD3 may be involved in immunity. Therefore, these DEPs could add to our understanding of tumor evolution and tumorigenesis and help predict tumors' evolutionary paths and clinical phenotypes. Using the 24 DEPs-based classification, we divided the cancer patients in the training cohort into two groups. The Kaplan-Meier survival analysis and the ROC analysis showed that the 24-DEPs survival-predictor model was a better predictor than any single protein (Figure 5b,c). We further verified this grouping method in the validation cohort, where the two groups also showed significantly different survival rates (Figure 5e). Therefore, the DEPs-based survival-predictor model showed excellent survival prediction and is applicable to most cancers, which will contribute to therapeutic decision-making.
Yet, there are several limitations in this study. Firstly, this study mainly explored the effect of differentially expressed proteins on predicting the OS of multiple cancers. It will be interesting to combine proteomics with genomics and even metabonomics to predict pan-cancer OS in the future. Secondly, the current study was a retrospective study utilizing the CPTAC database, so more prospective studies are still needed. Moreover, the protein data in this study were based on clinical tissue specimens, which limits clinical application. It would be clinically valuable to discover tumor biomarkers in more accessible samples such as blood.

Conclusion
In summary, our study screened 69 differentially expressed proteins in five cancers and confirmed that these DEPs were mainly associated with cell proliferation and division, cellular metabolism, and the extracellular matrix. Using the LASSO regression method, we determined 24 DEPs. Notably, the DEPs-based survival-predictor model could accurately predict OS in multiple cancers. This is the first study to use proteomics to construct a pan-cancer prognosis model, and the results indicate that pan-cancer analysis may complement single-cancer analysis in identifying prognostic differentially expressed proteins.

Highlights
(1) 69 differentially expressed proteins (DEPs) were identified. (2) The DEPs formed an interaction network across five cancers. (3) The 24 DEPs could accurately predict the OS in multiple cancers.