Comprehensive bioinformatics analysis reveals the hub genes and pathways associated with multiple myeloma

ABSTRACT
 Purpose While the prognosis of multiple myeloma (MM) has significantly improved over the last decade because of new treatment options, it remains incurable. Aetiological explanations and biological targets based on genomics may provide additional help for rational disease intervention. Materials and Methods Three microarray datasets associated with MM were downloaded from the Gene Expression Omnibus (GEO) database. GSE125364 and GSE39754 were used as the training set, and GSE13591 was used as the verification set. The differentially expressed genes (DEGs) were obtained from the training set, and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were performed to annotate their functions. The hub genes were derived from the combined results of a protein–protein interaction (PPI) network and weighted gene coexpression network analysis (WGCNA). The receiver operating characteristic (ROC) curves of hub genes were plotted to evaluate their clinical diagnostic value. Biological processes and signaling pathways associated with hub genes were explained by gene set enrichment analysis (GSEA). Results A total of 1759 DEGs were identified. GO and KEGG pathway analyses suggested that the DEGs were related to the process of protein metabolism. RPN1, SEC61A1, SPCS1, SRPR, SRPRB, SSR1 and TRAM1 were proven to have clinical diagnostic value for MM. The GSEA results suggested that the hub genes were widely involved in the N-glycan biosynthesis pathway. Conclusion The hub genes identified in this study can partially explain the potential molecular mechanisms of MM and serve as candidate biomarkers for disease diagnosis.


Introduction
Multiple myeloma (MM) is the second most common type of cancer in the blood system; it is characterized by abnormal clonal proliferation of malignant plasma cells in bone marrow [1]. It has been reported that nearly 140 000 new cases are diagnosed worldwide each year, and the number of cases of morbidity and mortality in 2019 alone is more than twice that in 1990 [2,3]. Due to the insidious onset of monoclonal gammopathy of undetermined significance (MGUS, a premalignant precursor condition of active MM), a considerable number of patients are diagnosed accidentally when serious complications occur. According to a report, more than 50% of patients presenting with no history of malignancy who required urgent surgery for pathological compression fracture were found to have MM or plasmacytoma [4].
Multiple genetic changes may initiate the process of myeloma and promote its progression [5]. With the development of autologous haematopoietic cell transplantation and new targeted drugs, the overall survival of MM has been considerably prolonged. However, the overall treatment effect in the world is far from satisfactory [2]. First, as a disease of unknown aetiology, the treatment of MM is still largely limited to symptomatic management. Second, under the pressure of drug selectivity, mutations in target genes lead to the development of drug resistance and disease relapse, which has been observed in patients undergoing chimeric antigen receptor T-cell therapy [6]. Third, the absence of efficient diagnostic tools makes early diagnosis difficult for asymptomatic patients, a group that is believed to benefit most from timely intervention [7,8]. Therefore, exploring the aetiology of MM and looking for efficient biomarkers to improve the recognition ability of patients will help to minimize the harm of MM to humans. Different from traditional research methods, the popularization and application of high-throughput sequencing and the establishment of a global gene database provide broader and necessary data support for the aetiological diagnosis of MM [9]. Meanwhile, the development of bioinformatics technology provides a reliable way to discover key regulatory genes of disease [10]. On this basis, an increasing number of MM-related genes and pathways have been discovered, and some of them have been proven to play an important role in the onset and progression of disease in subsequent validation [11][12][13]. However, these studies may have some defects while providing valuable information. On the one hand, the unstandardized heterogeneity of the data used may lead to the concealment of important genetic information, resulting in unreliable results. On the other hand, monotonous computer algorithms and unreasonable threshold settings may include many genes with low association with MM or exclude certain genes with small differential multiples but have an important functional role, which weakens the value of follow-up research based on these studies [14]. For these concerns, bioinformatics analysis for MM needs to be developed to a deeper and more comprehensive level.
In this study, we integrated the gene expression information of MM in the Gene Expression Omnibus (GEO) database to construct the training set and validation set of differentially expressed genes (DEGs). Comprehensive bioinformatics analysis methods, including Gene Ontology (GO) analysis, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, protein-protein interaction (PPI) network analysis, weighted gene coexpression network analysis (WGCNA), and gene set enrichment analysis (GSAE), were used to screen and annotate the candidate genes. The clinical diagnostic value of hub genes was also verified. The aim of this study was to screen the hub genes and pathways highly associated with MM and to explain their functions.

Data integration and DEGs identification
Three microarray datasets associated with MM, including the GSE125364, GSE39754, and GSE13591 datasets, were downloaded from the GEO database (http:// www.ncbi.nlm.nih.gov/geo). The GSE125364 and GSE39754 datasets were used as the training set (including 45 MM samples and 3 healthy controls, 170 MM samples and 6 healthy controls, respectively), and the GSE13591 dataset, which included 133 MM samples and 5 healthy controls, was used as the validation set. Supplementary Table S1 shows the detailed information for the 3 datasets.
First, the GSE125364 and GSE39754 datasets were standardized and integrated. The batch effect was removed to obtain the gene expression matrix of the samples. Principal component analysis (PCA) was used to visualize the spatial distribution of samples and evaluate their enrichment degree. Subsequently, the R package 'limma (version 3.5.1)' was used to identify DEGs between 215 MM samples and 9 healthy controls [15]. P-value was adjusted by the Benjamin and Hochberg method. The cut-off criteria were adjusted to Pvalue < 0.05 and |[log2FoldChange (log2FC)]| > 1. All genes were visualized by volcano plot using the R package 'ggplot2 (version 3.3.2)', and the top 100 significantly changed DEGs were shown by heatmap using the R package 'pheatmap (version 0.7.7)' [16].

PPI network construction
The STRING online database (version 11.0, https:// string-db.org/) was used to construct a PPI network to clarify the relationships among proteins encoded by DEGs. The minimum interaction score was set as 0.9 [18]. The network was visualized by Cytoscape (version 3.8.0, https://cytoscape.org/). Then, Cyto-Hubba (version 0.1) plug-ins of Cytoscape (version 3.8.0) were used to recognize the interaction degree of candidate gene clustering according to 5 types of algorithms [the Degree, the Density of Maximum Neighborhood Component (DMNC), the Edge Percolated Component (EPC), the Maximal Clique Centrality (MCC) and the Maximum Neighborhood Component (MNC)] [19]. A Venn diagram produced by the online tool Bioinformatics & Evolutionary Genomics (http:// bioinformatics.psb.ugent.be/webtools/Venn/) was used to show the overlapping genes.

WGCNA of candidate genes
WGCNA, which was implemented by the R package 'WGCNA (version 1.6.9)', can cluster genes with similar expression patterns to show the association between modules and specific traits or phenotypes [20]. Different samples were grouped to identify the module genes that were highly correlated with the phenotype. First, the samples were clustered, and outliers were removed to ensure the accuracy of the analysis. The soft threshold parameter that satisfied the scale-free coexpression network relationship was set to 15. Second, to reduce interference from spurious correlations, the adjacency matrix was replaced by a topological overlap matrix (TOM). Through the dynamics cut tree algorithms, the corresponding dissimilarity was calculated to identify hierarchical clustering genes, and the minimum module size was set to 30 to obtain large modules. Third, different module eigenvectors (MEs) were achieved to merge modules with high similarity at a height cut of 0.75. The relevance between modules and clinical traits is shown with a heatmap to determine the modules most closely associated with MM. The |correlation coefficient| > 0.5 and P-value < 0.05 were cut-offs for module screening. Finally, the module membership (mM) and gene significance (GS) of individual genes were measured, and the genes in the key modules with |mM| > 0.8 and |GS| > 0.2 were selected for further analysis.

Candidate Hub Gene Identification and evaluation
To identify the candidate hub genes, the intersection of candidate genes from the abovementioned WGCNA and PPI results was constructed, and a Venn diagram was used to display the overlapping genes. Pathway interrelation analysis of candidate hub genes was performed using the ClueGo (version 2.5.8) plug-ins of Cytoscape (version 3.8.0) [21]. To evaluate the ability of candidate hub genes to recognize MM samples and healthy controls, the expression data of genes were extracted from the training set, and receiver operator characteristic (ROC) curves were constructed using the R package 'pROC (version 1.12.1)' [22].

Hub gene validation
In the validation phase, the expression data of the candidate hub genes from the GSE13591 dataset were extracted, and the R package 'ggplot2 (version 3.3.2)' was used to visualize the consistency of the expression trends of the samples. Then, the ROC curves of these genes in the validation set were plotted, and the results with the training set were compared. Genes with inconsistent expression trends and insignificant differences (P-value ≥ 0.05) were excluded from the following analysis. Finally, the random forest (RF) model of hub genes in the training set was constructed, and the diagnostic efficiency of the model was verified by the validation set. Decision curve analysis (DCA) was used to weigh the clinical practicability of the RF model.

GSEA of hub genes
GSEA can be used to determine whether an alreadydefined gene set exhibits a statistically significant difference in two different traits [23]. In this study, GSEA software (version 3.0, http://www.gsea-msigdb. org/gsea/index.jsp) was used to explore biological processes and pathways that may be involved in the pathogenesis of MM. The KEGG pathway gene set was used as the enrichment background, and the expression values of hub genes were taken as phenotypic files to calculate Pearson correlation coefficients. The correlation coefficients between each hub gene and other genes were sorted in descending order as scanning sequences, and the abovementioned gene set was used as a background gene for enrichment analysis. The analysis was performed under the following settings: false discovery rate (FDR) < 0.25, nominal P-value < 0.05 and |normalized enrichment score (NES)| > 1. Figure 1 shows the flow chart of the study design.

DEGs screening
By integrating the GSE125264 and GSE39754 datasets, the gene expression matrices of 224 samples were obtained (including 215 MM samples and 9 healthy controls). PCA results showed complete separation between MM samples and healthy controls in the PCA distribution 3D plot ( Figure 2A). The DEGs in MM samples and healthy controls in the gene expression matrix were analysed by the R package 'limma (version 3.5.1)'. In total, 1759 DEGs were identified, including 954 upregulated genes and 805 downregulated genes. All DEGs were described by the volcano plot ( Figure 2B) and Supplementary Data S1, and the heatmap shows the expression of the top 100 DEGs ( Figure 2C).  Figure 3). For GO BP, DEGs were mainly enriched in immune response, innate immune response, inflammatory response, translation and translational initiation ( Figure 3A). For GO MF, DEGs were significantly associated with protein binding, poly(A) RNA binding, structural constituent of ribosome, enzyme binding and cadherin binding involved in cell-cell adhesion ( Figure 3B). For GO CC, the obtained results indicated that proteins encoded by DEGs were mostly located in the extracellular exosome, cytosol, membrane, endoplasmic reticulum (ER) and ER membrane ( Figure 3C). The results of KEGG pathway analysis showed that the pathways associated with protein processing in the ER, ribosome, cell adhesion molecules and osteoclast differentiation were significantly enriched in DEGs ( Figure 3D).

PPI network
A total of 1759 DEGs were imported into the STRING online database (version 11.0) to construct the PPI network. Outlier proteins were removed, and an  interaction network of 1145 proteins was obtained.
The protein network was processed by Cytoscape (version 3.8.0) and is shown in Figure 4A. Different types of algorithms were used to analyse the 1 145 genes with interactions using the Cytohubba (version 0.1) plug-ins of Cytoscape (version 3.8.0) (Supplementary Data S3). The top 100 genes obtained by each algorithm are shown by a Venn diagram, and 71 candidate genes were obtained from the overlapping part ( Figure 4B).

Candidate genes in WGCNA
The overall clustering and clinical characteristic heatmap of the training set are shown in Figure 5A. The correlation of the samples was good, and all samples were included for further analysis. As shown in Figure 5B, when the soft threshold was set to 15 (R2 = 0.85), the interaction between genes conformed to the scale-free distribution to the maximum extent, and the mean value of the adjacency function gradually approached 0. A total of 26 modules were identified by average linkage hierarchical clustering based on the expression values of 16 239 genes in the training set and are shown in Figure 5C. The cluster diagram and heatmap in Figure 5D show the correlation between different modules. Meanwhile, 1 000 genes were randomly selected to draw the topological matrix cluster diagram to further illustrate the correlation between different modules ( Figure 5E). At a height cut of 0.75, 12 modules were eventually achieved ( Figure 5C, Supplementary Data S4). Based on the correlation between MEs and clinical traits shown in the heatmap, |cor| > 0.5 and P-value < 0.05 were used as the screening conditions, and we found that the tan module (|cor| = 0.64, P-value = 2e-27) was significantly positively associated with MM ( Figure  5F). The tan module contained 209 genes. For each gene, GS and mM were calculated to draw scatterplots ( Figure 5G). According to the |mM| > 0.8 and |GS| > 0.2 conditions, a total of 29 candidate genes were obtained (Supplementary Data S5).

Candidate Hub Gene Identification
By intersecting the abovementioned 29 WGCNA module genes and 71 PPI protein-coding genes, a total of 8 candidate hub genes were obtained: RPN1, SEC11C, SEC61A1, SPCS1, SRPR, SRPRB, SSR1 and TRAM1 ( Figure 6A). Subsequently, the ClueGo (version 2.5.8) plug-ins of Cytoscape (version 3.8.0) were used for pathway interrelation analysis of the candidate hub genes ( Figure 6B). The functions of candidate hub genes are mainly enriched in the process of protein anabolism and catabolism, which involves the coregulation of one or more genes. Among them, SPCS1 was involved in several stages of this process. At the same time, ROC curves of candidate hub genes were drawn using the R package 'pROC (version 1.12.1)' to analyse the ability of these genes to distinguish MM samples from healthy controls. According to the obtained results, the areas under the curve (AUCs) of candidate hub genes were all greater than 0.7 ( Figure 6C).

Hub genes in validation set
The ability of the candidate hub genes to diagnose MM was further evaluated in the GSE13591 dataset (the SEC11C gene was excluded due to missing data in the validation set). First, a boxplot was used to visualize the expression level of genes ( Figure 7A). The expression levels of RPN1, SEC61A1, SRPR, SRPRB, and SSR1 were increased in MM samples versus healthy controls (P-value < 0.05). Subsequently, the ROC curves of these genes in the validation set were plotted. As shown in Figure 7B, the AUCs of 7 candidate hub genes (RPN1, SEC61A1, SPCS1, SRPR, SRPRB, SSR1 and TRAM1) were all greater than 0.7. The constructed diagnostic model including the 7 genes exhibited efficient discrimination and specificity between MM samples and healthy controls in both the training and validation sets, indicating that the combined diagnostic accuracy of the hub gene set was superior to the single diagnosis of some hub genes ( Figure  7C). Furthermore, we evaluated the clinical gains of the individual diagnostic genes and the diagnostic model in the training and validation sets by DCA curves. The obtained results demonstrated that the clinical gains of the diagnostic models were higher than those of the individual diagnostic markers ( Figure 7D), indicating that the constructed diagnostic models had considerable clinical application.

Hub genes in GSEA
The GSEA method was selected to investigate the biological functions and pathways associated with hub genes in MM. Table 1 shows the top 1 enrichment results of GO related to each hub gene (see Supplementary Data S6 for details). Most of the genes are positively correlated with protein synthesis and transport processes, such as protein exit from the ER, response to topologically incorrect protein, IRE1mediated unfolded protein response (UPR) and others. Meanwhile, these genes are negatively correlated with functions such as aromatic amino acid family catabolic processes, DNA methylation involved in gamete generation or positive regulation of the Gprotein coupled receptor signaling pathway. The enrichment results of the KEGG pathway with the greatest correlation with hub genes are shown in Table 2 (see Supplementary Data S7 for details). Among them, the pathways with a significant positive correlation were mainly concentrated in N-glycan biosynthesis and ubiquitin-mediated proteolysis. However, cell biochemical metabolism, neuroactive ligand receptor interaction and olfactory transduction were the main pathways with a significant negative correlation.

Discussion
Despite decades of research, the genetic changes that make MM pathogenic and relapses remain largely unknown. Abnormal levels of cell cycle pathway genes [13], cell adhesion genes [12], and DNA methylation [24] in MM patients have been reported in recent studies with encouraging prospects. However, existing studies failed to provide enough information to explain the gene regulatory mechanism of oncoprotein production. It is well-known that a common type of M protein, immunoglobulin G (IgG), has been demonstrated to be present in more than 50% of patients with MM [25]. Therefore, an in-depth study of the DEGs in MM patients is helpful to elucidate the molecular mechanism of the disease and is expected to provide potential targets for diagnosis and treatment. Here, by comprehensive bioinformatics analyses of 3 public gene expression datasets, GSE125364, GSE39754 and GSE13591, we screened the DEGs between MM samples and healthy controls and verified the diagnostic ability of hub genes as biomarkers for disease. Moreover, the biological functions of these hub genes and their significantly related signaling pathways were further explored. The obtained results showed that the expression levels of RPN1, SEC61A1, SPCS1, SRPR, SRPRB, SSR1 and TRAM1 were significantly upregulated in MM samples compared with healthy controls and could be used as candidate diagnostic biomarkers for MM. The enrichment results from GO analysis, KEGG analysis and GSEA suggested that the hub genes were widely involved in the different phases of protein biosynthesis, targeted transport and catabolism and affected the functions of ER and ribosomes. Our study provides evidence to support the regulation of these proteincoding genes in the pathological process of MM and offers a theoretical basis for further studies based on these specific targets.
RPN1 encodes proteins that form the regulatory subunit of the 26S proteasome. Phosphorylation of RPN1 is a prerequisite for proper assembly of the 26S proteasome and facilitates assembly of the 19S base. Blockage of RPN1 phosphorylation results in reduced cell proliferation and mitochondrial dysfunction [26]. As a proteasome ubiquitin-associating subunit, RPN1 participates in proteasomal-mediated substrate degradation by providing T1 and T2 receptor sites for substrate binding and deubiquitination [27,28]. Meanwhile, its stronger binding affinity with branched K11/K48-linked polyubiquitins enhances proteasomal degradation during mitosis [29]. RPN1 has also been reported to be involved in selective autophagy of the ER. The failed UFMylation of RPN1 would cause the suspension of ER autophagy and the accumulation of ER stress, which will eventually disrupt cell differentiation and homeostasis maintenance [30]. A recent review by Zheng et al. detailed the dual role of autophagy in myeloma cells [31]. Combined with our current study, we believe that RPN1mediated abnormal ER autophagy may be involved in the genesis and progression of myeloma as a cause or consequence. SEC61A1 encodes the α-subunit of the heteropolymeric SEC61 complex, which is essential in the process of secretion and insertion of polypeptides into the ER [32]. Abnormalities in SEC61A1 have been linked to the progression of kidney disease and liver cancer in humans [33][34][35]. In a B cell inactivation study, a decrease in the expression of SEC61A1 and the production of IgM and IgG were simultaneously observed, suggesting that the variation in SEC61 may be related to plasma cell deficiency [36]. Further research by Schubert et al. confirmed that B cells of SEC61A1 mutation carriers have an intrinsic deficiency to develop into plasma cells. At the same time, less immunoglobulin secretion was observed in EBV-transformed B cell lines carrying SEC61A1 mutations [37]. In our study, 7 protein-related pathways were enriched in the pathway interrelation analysis of SEC61A1, which indicated that the important regulatory role of SEC61A1 participates in protein metabolism ( Figure  6B). Therefore, further studies are necessary and urgent to clarify the roles of SEC61A1 in the pathogenic mechanism of MM.
SPCS1 encodes a microsomal signal peptidase complex subunit, which plays an important role in protein processing, localization, and secretion [38]. When new proteins are translocated into the ER, SPCS1 may assist in removing signal peptides from them. According to the existing literature, SPCS1 focuses on some viral infectious diseases [39,40]. The detailed relationship between it and the development of MM has not been investigated. The GO analysis of GSEA showed that BP of SPCS1 was significantly enriched in the process of protein exit from ER and sperm capacitation. Of note, SPCS1 is involved in more protein regulatory processes than other hub genes in pathway interrelation analysis ( Figure 6B), which indicates the indispensable role it may play in oncoprotein metabolism.
High expression levels of SRPR and SRPRB in patients with MM were also found in our study. SRPR and SRPRB are genes that encode signal recognition receptors (SRPs) and participate in the targeting and translocation of nascent secretory and membrane proteins to the ER [41]. Abnormal expression of these proteins has been observed in haematological and other tumors [42][43][44][45]. Ma et al. pointed out that overexpression of SRPRB significantly reduced the expression and phosphorylation of NF-κB [44]. Although this pathological process may be indirectly involved in immune regulation in MM [1,46], direct evidence of the relationship between SRPR/SRPRB and MM is still lacking and needs to be further clarified. SSR1 (also known as SSR-Alpha/TRAP Alpha/TRAPA) is a glycosylated membrane receptor related to protein translocation and can be induced by the IRE1α/XBP1 pathway in response to ER stress [47,48]. SSR1 is thought to be involved in the dysfunction process of β cells, affects insulin secretion and ultimately leads to the occurrence of type 2 diabetes [49]. Yan et al. found that the expression level of SSR1 in hypopharyngeal squamous cell carcinoma tumor tissues was higher than that in normal tissues [50]. In a patient with congenital disorder of glycosylation, Ng et al. found a mutation in the SSR subunit of the translocation-related protein complex [51]. In our study, the expression level of SSR1 was significantly increased in MM samples, suggesting that it may serve as a potential biomarker for MM. TRAM1, a protein involved in the translocation of nascent polypeptides, was localized in the mammalian ER [52]. TRAM1 supports the import of ER protein by modulating the phospholipid bilayer near the lateral gate of the SEC61 channel and participates in the stability of ER membrane degradation substrates [53,54]. The decreased expression of TRAM1 not only enhanced the UPR during acute ER stress but also promoted the induction of NF-κB, a protein reported to participate in the immunoregulation of MM and the development of bortezomib resistance [1,46,54,55]. The enrichment of the ubiquitin-mediated proteolysis pathway and drug metabolism cytochrome P450 pathway in GSEA in this study further supports and strengthens the understanding of the biological functions of TRAM1.
The relationship between N-glycosylation and MM was also the focus of our work. In the KEGG pathway enrichment analysis of GSEA, 5 of the 7 hub genes were enriched in a significantly positive correlation with N-glycan biosynthesis ( Table 2). As the most prevalent coand posttranslational protein modification, glycosylation is involved in many biological processes, such as protein folding, stability or degradation [56,57]. N-glycan is a type of protein glycosylated substrate that modifies approximately half of the proteins in serum [57,58]. Dysregulation of glycosylation is involved in a variety of diseases, including the progression of malignancies [59,60]. There is an indisputable relationship between N-glycosylation and MM. Zhang et al. found that the N-glycosylation levels of serum protein in patients with MM were different from those in healthy controls, suggesting that the N-glycosylation levels of serum protein are a potential biomarker of the disease [60]. IgG consists  of two heavy chains and two light chains. Each heavy chain of IgG carries a single covalently attached biantennary N-glycan at the highly conserved asparagine 297 residue in each of the CH2 domains of the Fc region of the molecule. This N-glycan structure maintains the stability and function of IgG [61]. Moreover, the modification of B cell maturation antigen by Nglycan improves its binding ability with ligands, which is of great significance for the normal functioning of plasma cells [62]. More direct evidence comes from the work of Mittermayr et al. They found similar N-glycan profiles between MGUS patients and normal controls and considerable N-glycomic heterogeneity in the malignant phase. This result provided a theoretical basis for the clinical stratification of disease degree based on N-glycan profiles [63]. Our current study confirmed the results of previous studies and proposed for the first time that RPN1, SEC61A1, SPCS1, SRPR and SRPRB may mediate the N-glycosylation process of proteins in MM. Despite these meaningful findings, our study has several potential limitations. First, although there was rigorous quality control of samples from the different datasets, it is undeniable that the effects of inherent heterogeneity may not have been completely eliminated. Second, due to the limitations of the available data, it was not possible to study changes in the expression of these genes during disease progression. Further in vitro and in vivo experimental evidence is necessary to validate the clinical value of the hub genes obtained in this study.

Conclusion
Based on comprehensive bioinformatics methods, we identified the hub genes significantly associated with MM and explored their biological functions. We found that RPN1, SEC61A1, SPCS1, SRPR, SRPRB, SSR1 and TRAM1 were highly expressed in MM samples and can be used as potential biomarkers for diagnosis. Further studies indicated that the hub genes play important roles in the regulation of protein biosynthesis, transport or catabolism. These findings provide valuable supplements for the aetiological mechanism of MM, especially the possible existence of abnormal protein metabolism processes.

Disclosure statement
No potential conflict of interest was reported by the author(s).