Identification of four key biomarkers and small molecule drugs in nasopharyngeal carcinoma by weighted gene co-expression network analysis

ABSTRACT Nasopharyngeal carcinoma (NPC) is a heterogeneous carcinoma whose underlying molecular mechanisms involved in tumor initiation, progression, and migration are largely unclear. The aim of the present study was to identify key biomarkers and small-molecule drugs for screening, diagnosing, and treating NPC via gene expression profile analysis. Raw microarray data was used to identify 430 differentially expressed genes (DEGs) in the Gene Expression Omnibus (GEO) database. The key modules associated with histological grade and tumor stage were identified using weighted gene co-expression network analysis. qRT-PCR was used to verify the differential expression of hub genes. Gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and the connectivity map database were used to identify potential mechanisms and screen small-molecule drugs targeting hub genes. Functional enrichment analysis showed that genes in the green module were enriched in the regulation of cell cycle, p53 signaling pathway, and cell part morphogenesis. Four DEG-related hub genes (CRIP1, KITLG, MARK1, and PGAP1) in the green module, which were considered potential diagnostic biomarkers, were taken as the final hub genes. The expression levels of these four hub genes were verified via qRT-PCR, and the results were consistent with findings from the GEO analysis. Screening was also conducted to identify small-molecule drugs with potential therapeutic effects against NPC. In conclusion, four potential prognostic biomarkers and several candidate small-molecule drugs, which may provide new insights for NPC therapy, were identified.


Introduction
Nasopharyngeal carcinoma (NPC) is a malignant carcinoma arising from the nasopharyngeal epithelial lining of the nasopharynx. It has a unique geographical distribution and racial prevalence. Based on GLOBOCAN cancer estimates and the International Agency for Research on Cancer, there were approximately 129,079 newly diagnosed NPC cases and 72,987 deaths in 2018, accounting for approximately 0.7% of cancer cases and deaths in that year [1,2]. The geographical distribution of global NPC incidence is extremely unbalanced; more than 80% of newly diagnosed cases are in Asia, while fewer than 0.2% of the cases occur in Oceania, resulting in an agestandardized rate of 10.0 per 100,000 persons in Southeast Asia and 0.5 per 100,000 persons in Central America [1,2]. Despite substantial improvements in NPC treatment options over the past decade, including chemotherapy, immunotherapy, and radiotherapy, the prognosis for patients with NPC remains low; approximately 20% of cases recur and the 5-year survival rate is only 50%-70% [3]. Thus, novel biomarkers and therapeutic strategies are urgently required for effective diagnosis and improved treatment for NPC patients. The remarkable geographical distribution and racial prevalence of NPC incidence has spurred studies on its risk factors, and it is thought that several etiological factors, including genetic and ethnic factors, Epstein-Barr virus (EBV) infection, and environmental factors (e.g. cigarette smoking and exposure to dust and formaldehyde) contribute to the development and progression of NPC [4][5][6][7][8]. EBV infection is perhaps the most extensively studied etiological factor in NPC pathogenesis. The study involves in situ hybridization of EBV-encoded RNA, which is found in all tumor cells but not in the normal nasopharyngeal epithelial lining of the nasopharynx, suggesting that EBV activation may play a vital role in the pathogenesis of NPC [9][10][11]. There is growing evidence that EBV is associated with multiple human cancers, including NPC [7][8][9][10]. Apart from EBV infection, genetic susceptibility is perhaps the most common cause of NPC. Studies have identified genetic susceptibility genes involved in DNA repair, and metabolic and immune response pathways are associated with the development of NPC [12][13][14]. Recent linkage and association studies have described a potential association between genetic susceptibility genes and the development of NPC, including MST1R, PTPRG, HLA, and MMP2 [12][13][14][15][16][17]. The rapid developments in high-throughput whole-genome sequencing and bioinformatic analyses have played a significant role in the screening for candidate biomarkers of several diseases, including NPC.
In this study, we sought to obtain relevant data from the Gene Expression Omnibus (GEO) database and undertook a weighted gene co-expression network analysis (WGCNA) to identify key modules. We then explored the underlying mechanisms of hub module genes using gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). Based on the connectivity map database, four potential small-molecule drugs targeting the hub genes were identified. Two NPC cohorts from the GEO database and experimental data from our study were used to verify the expression levels of these four hub genes. The aim of our study was to identify key biomarkers and validate the expression levels of four hub genes, and to further screen for smallmolecule drugs of these hub genes.

Data processing and identification of differentially expressed genes (DEGs)
The robust multiarray averaging (RMA) method was used for data pre-processing, including background adjustment, quantile normalization, and log2 transformation of expression matrices. The corresponding platforms were used to annotate each probe based on the Entrez ID. For genes with several probes, the medians of all probes were selected. The 'limma' package in the R statistical software (version 3.6) was used to identify differentially expressed genes associated with NPC, with |logFC| >1 and P-value < 0.05 set as the screening thresholds.

Construction of weighted gene co-expression networks
The median absolute deviation of each gene expression value was calculated and genes with median absolute deviation in the top 25% were chosen for WGCNA. The 'WGCNA' package in R was used to construct weighted gene coexpression networks [19]. First, the goodSamplesGenes function was used to detect good samples and good genes. Second, the pickSoftThreshold function was used to select an appropriate soft-thresholding power. Then, the coexpression similarity was raised to achieve a scalefree topology based on the appropriate softthreshold power. Finally, cox-expression gene modules were generated using the cutreeDynamic function. Additionally, we calculated gene significance (GS), module eigengene (ME), and module membership (MM) to detect the association between clinical parameters of either the genes or modules; P < 0.05 was considered statistically significant.

Functional enrichment analysis of hub module genes
To better understand the potential molecular mechanisms by which hub module genes impact correlative clinical parameters, we uploaded all hub module genes to the Metascape online tool (https://metascape.org/gp/index.html#/main/ step1) [20] where Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed. P < 0.01 was considered statistically significant.

Selection and validation of hub genes
Genes with the highest intra-modular connectivity were selected as candidate hub genes. Genes with biological significance usually have higher absolute GS values. In this study, candidate hub genes were screened based on the criteria: absolute GS value > 0.20 and MM > 0.80. The final hub genes were identified from the intersection of NPC-related DEGs with candidate hub genes. Expression levels of hub genes in patients with NPC were assessed using box plots. The diagnostic values of hub genes were evaluated using the area under the receiver operating characteristic (ROC) curve. Differential expression and diagnostic values of hub genes were validated in NPC patients using a separate external dataset (GSE53819). Three microarray datasets (GSE13597, GSE40290, and GSE64634) were then combined to further verify the expression levels of hub genes.

Identification of candidate small-molecule drugs
The connectivity map (CMAP) is an online database (https://portals.broadinstitute.org/cmap) with more than 7000 gene expression profiles of 1309 compounds that is used to predict small-molecule drugs for specific diseases [21]. To identify smallmolecule target drugs for treating NPC, the hub genes were uploaded to the CMAP database's 'query' page. Correlation between the smallmolecule drugs and hub genes was evaluated using enrichment scores ranging from −1 to 1. The chemical and molecular structures of the small-molecule drugs were downloaded from the PubChem Compound database (https://pubchem. ncbi.nlm.nih.gov/).

Tissue collection
NPC specimens and adjacent nasopharyngeal epithelial tissues were collected from 20 patients, immediately transferred into liquid nitrogen and then preserved at −80°C. This study was approved by the Ethics Committee of the Xiangya Third Hospital, Central South University and was conducted in accordance with the Declaration of Helsinki.

RNA extraction and quantitative reverse transcription PCR (qRT-PCR)
RNA was extracted from the cells using TRIzol reagent and reverse transcribed into cDNA using the following reaction mixture: 2 μL cDNA, 10 μL SYBR® Premix Ex Taq TM 11 (Takara Bio, Tokyo, Japan), 0.4 μL forward and reverse primers, and 6.4 μL ddH2O. Each sample was run in triplicate and GAPDH was used as the internal control. The reaction conditions used were: pre-denaturation at 95°C for 10 min followed by 40 cycles of denaturation at 95°C for 15 s, annealing at 60°C for 30 s, and extension at 70°C for 30 s. The data was analyzed using the ΔΔCT method and the average of the three replicate experiments for each sample was taken. The primers used are shown in Table 2.

Statistical analysis
Two-tailed Student's t-test was used to identify differentially expressed genes associated with NPC. Statistical analyses were conducted in R (version 4.0.0), MedCalc (version 19.1) or SPSS (version 24.0) software. Genes with |logFC| >1 and P < 0.05 were considered statistically significant DEGs.

Results
This study used WGCNA to identify key gene modules associated with NPC histological grade and tumor stage. Genes with the highest intramodular connectivity were selected as candidates. These hub genes were identified from the intersection of NPC-related DEGs with candidate hub genes. GO and KEGG analyses were performed to explore the mechanisms underlying the hub module genes. The connectivity map database was used to screen for potential small-molecule drug targets of the hub genes. Finally, two NPC cohorts from the GEO database were used to verify the expression levels of the four hub genes.

Identification of DEGs
Two microarray datasets, GSE12452 and GSE53819, from the GEO database, were used to identify DEGs between normal samples and NPC samples. DEGs were identified based on the thresholds |logFC| >1 and P < 0.05. Volcano plots of DEGs in the two separate datasets are shown in Figure 1a and Figure 1b. A total of 430 common genes were identified at the intersection of DEGs in the two datasets for further WGCNA.
Of the 430 genes, 165 were significantly upregulated while 265 were significantly downregulated in NPC patients (Figure 1c and Figure 1d).

Construction of weighted gene co-expression network and identification of hub modules
A total of 3949 genes with median absolute deviations in the top 25% in 31 NPC samples and 10 normal samples in GSE12452 dataset were selected. The samples were clustered to remove outlier samples. WGCNA was then used to identify genes that were highly co-expressed across the groups. To ensure a scale-free network, the power of β = 5 (scale-free R 2 = 0.89, slope = −1.80) was selected as the soft-thresholding, as shown in Figure 2.
Then, similar modules were merged based on the height of ME in the clustering equal to 0.25 ( Figure  3a). 12 co-expressed gene modules were identified.
The 'grey' module, which contained 583 genes not attributed to any modules (Figure 3b), was removed in the subsequent analysis. The relationships among the 11 modules were analyzed using clinical parameters (including histological grade and tumor stage). The green module was significantly associated with NPC tumor stage (N stage) and was chosen as the clinically significant module for subsequent analyses (Figure 4).

Functional enrichment analysis of hub module genes
GO and KEGG pathway enrichment analysis of the above green module was carried out to mine potential biological functions associated with NPC using the Metascape online tool. P < 0.01, minimum overlap genes = 3, and an enrichment factor > 1.50 were set as the cutoff criteria. GO analysis showed that genes in the green module were mainly enriched in plasma membrane-bound cell   projection assembly, smoothened signaling pathway, cellular response to nutrient levels, regulation of cell cycle process, neuro maturation, and protein kinase complex ( Figure 5). Two significantly enriched pathways, the p53 signaling pathway and proteoglycans in cancer, were associated with genes in the green module.

Identification and validation of hub genes
A total of 14 genes with the greatest connectivity in the green module were identified as candidate hub genes based on GS > 0.2 and MM > 0.8 ( Figure 6). The final four hub genes: cysteine-rich intestinal protein 1 (CRIP1), KIT ligand (KITLG), microtubule affinity regulating kinase 1 (MARK1), and post-GPI attachment to protein 1 (PGAP1), were identified from the intersection of NPC-related DEGs with candidate hub genes. Data in the GSE12452 and GSE53819 datasets were compared with that of normal tissues, and three of the final four hub genes: PGAP1, MARK1, and KITLG showed significantly higher expression in NPC tissues, while the expression of CRIP1 was significantly downregulated (Figure 7). To verify the diagnostic potential of these four hub genes, ROC analysis was performed on the GSE12452 and GSE53819 datasets. The results of the ROC analysis  demonstrated that these four genes had excellent diagnostic efficiencies in NPC patients (Figure 8). The expression levels of the four hub genes were further validated in the combined microarray datasets (Figure 9).

Identification of potential small-molecule drugs for NPC treatment
The hub genes were uploaded to the CMAP database to identify potential small-molecule drugs for NPC treatment. Based on the criteria: number of instances n > 5 and P < 0.05, the top ten most significant small-molecule drugs, which might be potential therapeutic targets for NPC treatment, were identified to be vorinostat, pioglitazone, fludrocortisone, thalidomide, bendroflumethiazide, chlorcyclizine, etiocholanolone, hyoscyamine, sulfadiazine, and biperiden (Table 3 and Figure 10).

Validation of the expression levels of four hub genes in NPC tissues and cell lines
The mRNA expression levels of MARK1, PGAP1, KITLG, and CRIP1 genes in 20 pairs of NPC and adjacent tissues were quantified via qRT-PCR.  The results showed that the expression of MARK1, PGAP1, and KITLG was upregulated in NPC samples relative to adjacent tissues (Figure 11(a-c)), while the expression of CRIP1 was downregulated in NPC samples relative to adjacent tissues ( Figure  11d). The expression of MARK1, PGAP1, KITLG, and CRIP1 genes in four NPC cell lines (CNE1, HNE3, 5-8 F, and HK-1) and normal human immortalized nasopharynx cell line (NP69) were measured via qRT-PCR. The results showed that MARK1, PGAP1, and KITLG expression was upregulated in NPC cell lines relative to the normal human nasopharynx cell line (Figure 11(e-g)), while CRIP1 expression was downregulated in NPC cell lines relative to normal human nasopharynx cell lines (Figure 11h).
Despite substantial improvements in the treatment of NPC over the past decade, the prognosis for patients with NPC diagnosis remain largely unsatisfactory due to a lack of obvious clinical signs in its early stages, with most initial diagnoses being made when the disease is at an advanced stage [3]. Therefore, elucidating the molecular mechanisms underlying NPC is of great significance. WGCNA was used to identify potential molecular biomarkers and small-molecule drugs  associated with NPC progression. We identified 430 DEGs and 11 co-expressed gene modules. The green module had the highest correlation with NPC tumor stage. Four genes, which had high functional significance in the clinically significant module, were identified as the final hub genes. Compared with normal tissues, PGAP1, MARK1, and KITLG showed significantly higher expression in patients with NPC, while CRIP1 expression was significantly lower in patients with NPC in both the test and validation datasets. These results were confirmed via qRT-PCR. The results of the ROC analysis showed that these four hub genes had excellent diagnostic efficiency in tumor and normal tissues. Furthermore, functional enrichment analysis showed that green module genes were mainly enriched in cellular response to nutrient levels, regulation of cell cycle process, and p53 signaling pathway. PGAP1, the gene encoding the post-GPI attachment to protein 1, is closely associated with the cell cycle process, cell proliferation, cell differentiation, and multiple signaling transduction pathways. PGAP1 can directly and/or indirectly induce carcinogenesis through tumor cell proliferation, migration, and resistance to multiple drugs [22][23][24][25].
MARK1, also known as partitioning defective gene 1 (Par-1), is a member of the serine-threonine kinase family, which plays a vital role in the regulation of cell migration, cell proliferation, and establishment and maintenance of cell polarity and microtubules [26][27][28]. Several studies have revealed that MARK1 plays an important role in tumorigenesis, and Natalia et al [29] found that MARK1 is a novel functional target for miR-125a-5p, with implications in the regulation of stimulated cell migration of cervical tumor cell lines. Tang et al [30] showed that MARK1 overexpression was correlated with cell migration and proliferation in colorectal cancer through its regulation of miR-23a expression. KITLG, encoding the ligand for the tyrosine kinase receptor encoded by the KIT locus, plays an essential role in the regulation of cell survival and proliferation, stem cell maintenance, mast cell development, and migration. KITLG/SCF binding can activate several signal transduction pathways [31][32][33]. KITLG gene expression levels were evaluated in testicular germ cell cancer, breast cancer, and gastrointestinal stromal tumors [34][35][36][37]. Overexpression of KITLG was correlated with increased tumor cell migration and proliferation, suggesting a tumorigenic role for KITLG [34,36]. KITLG was expressed at significantly higher levels in normal mammary gland epithelial tissue than in breast tumor tissue, indicating that it might be involved in the homeostasis of normal mammary tissue and its disruption would confer breast carcinogenesis [38]. CRIP1 is a member of the LIM/double zinc finger protein family, which is highly expressed in the intestine and immune cells [39]. In gastric cancer, CRIP1 was overexpressed in primary tumor tissues and confirmed to be an independent prognostic factor. Patients with gastric carcinoma overexpressing CRIP1 had shorter overall survival [40]. Compared with adjacent normal tissues, CRIP1 was significantly more highly expressed in cervical cancer and was associated with FIGO stage [41].
The connectivity map database was used to screen small-molecule drugs with promising therapeutic capabilities against NPC. To screen out potential small-molecule drugs for treating NPC, 66 candidate small-molecule drugs were obtained from the CMAP database prediction based on the final NPC hub genes. Among the top ten most significant potential small-molecule drugs, vorinostat and thalidomide are particularly interesting and have antitumor effects in NPC treatment. Vorinostat is a histone deacetylase inhibitor (HDACi) that has shown strong antitumor effects in various types of solid cancers when combined with other traditional chemotherapeutic drugs such as camptothecin and gemcitabine [42][43][44][45][46][47]. Thalidomide is a glutamic acid derivative and is a classic drug with immunomodulatory properties that inhibits tumor necrosis factor alpha [48][49][50]. Subsequent studies found that thalidomide has anti-angiogenic, anti-inflammatory, anti-fibrotic, and immunomodulatory effects, and the clinical effectiveness of thalidomide in the treatment of diseases such as refractory Crohn's disease, lung metastasis, multiple myeloma, hepatocellular carcinoma, and breast carcinoma has been confirmed [51][52][53][54][55][56][57][58]. The CMAP database results provide several potential small-molecule drugs for future NPC therapy. However, in vivo and in vitro studies are necessary to validate their therapeutic efficacies.

Conclusion
In summary, by constructing a weighted gene coexpression network with data from the GEO database, our study identified four hub genes correlated with prognosis in NPC tumors and several potential small-molecule drugs for NPC treatment, which may contribute to new insights for NPC therapy. Further studies, including in vivo and in vitro experiments, are necessary to elucidate the molecular mechanisms of hub genes for future clinical applications.

Highlights
(1) CRIP1, KITLG, MARK1 and PGAP1 have excellent diagnostic efficiency in NPC. (2) Four prognostic biomarkers and several candidate small-molecule drugs for NPC. (3) Ten small-molecule drugs might provide potentially therapeutic goals for NPC.