Identification of significant genes and therapeutic agents for breast cancer by integrated genomics

ABSTRACT Breast cancer is the most commonly diagnosed malignancy in women; thus, more cancer prevention research is urgently needed. The aim of this study was to predict potential therapeutic agents for breast cancer and determine their molecular mechanisms using integrated bioinformatics. Summary data from a large genome-wide association study of breast cancer were derived from the UK Biobank. The gene expression profile of breast cancer was obtained from the Oncomine database. We performed a network-wide association study and gene set enrichment analysis to identify the significant genes in breast cancer. Then, we performed Gene Ontology analysis using the STRING database and conducted Kyoto Encyclopedia of Genes and Genomes pathway analysis using Cytoscape software. We verified our results using the Gene Expression Profiling Interactive Analysis, PROGgeneV2, and Human Protein Atlas databases. Connectivity map analysis was used to identify small-molecule compounds that are potential therapeutic agents for breast cancer. We identified 10 significant genes in breast cancer based on the gene expression profile and genome-wide association study. A total of 65 small-molecule compounds were found to be potential therapeutic agents for breast cancer.


Introduction
Breast cancer is frequently diagnosed in women, particularly those with a family history [1]. It is a heterogeneous disease with different molecular subtypes and biological behaviors. Gene microarray technology and immunohistochemical techniques have classified breast cancers into different types [2]. The estrogen receptor (ER) is the most important prognostic and predictive immunohistochemical marker in breast cancer. ER-negative tumors tend to be of higher histological grade, are more sensitive to chemotherapy, and are more likely to metastasize to visceral organs [3,4]. Overall, breast cancer does not have a poor prognosis, and most subtypes do not lack therapeutic targets. ER-positive tumors represent about 70% of all breast cancers and have many therapeutic targets, as do HER2-positive tumors (about 20% of all breast cancers). The only subtype lacking targeted therapies is the triple-negative subtype [5,6]. There is an urgent need to find available drugs and clarify their molecular mechanisms in breast cancer treatment.
Most previous studies have focused on identifying novel prognostic markers and drug targets for breast cancer [7][8][9]. Sulaiman et al. [10] reported that a synthetic azaspirane targets the Janus kinase/signal transducer and activator of transcription 3 pathway in breast cancer. Huang et al. [11] demonstrated that the Gαh-PLCδ1 signaling axis drives metastatic progression in breast cancer. However, because the toxicity, cost, and chemical effects of novel prognostic markers and drug targets still need further research [12], not all previous findings have contributed to breast cancer treatment, and certain breast cancers still lack therapeutic targets and have a poorer prognosis. There is therefore still an urgent need to identify additional therapeutic and prognostic targets in breast cancer [13].
Genome-wide association studies (GWAS) are widely used to characterize the genetic mechanisms that underlie complex diseases. Integrative analyses of GWAS data are rapidly becoming a standard approach to explore the genetic basis of disease susceptibility [14]. Network-wide association studies (NetWAS) can identify relevant disease-gene associations by integrating tissue-specific networks and GWAS results [15,16]. Prior studies have shown that the network-associated analysis of GWAS data is highly efficient when used to identify novel causal genes of complex diseases [17,18].
In this study, to better understand the molecular mechanisms of breast cancer and find therapeutic agents for it, we identified novel candidate therapeutic agents for breast cancer treatment by integrating genomic data with drug database analysis. In total, 65 small-molecule compounds were identified, including trichostatin A, LY-294,002, econazole, Prestwick-1082, and vorinostat. Our study demonstrates the usefulness of this approach for evaluating the relationship among genes, diseases, and drugs. These findings will pave the way for the discovery of potential therapeutic targets for breast cancer.

Summary of GWAS datasets in breast cancer
The UK Biobank is a large, population-based prospective UK study, which was established to identify genetic and nongenetic determinants of various diseases. It comprises approximately 500,000 individuals with extensively detailed phenotypes. Their genotypes were determined using an array that included 847,441 genetic polymorphisms, enabling the identification of novel genetic variants in a uniformly genotyped and phenotyped cohort of unprecedented size [19]. Samples from the UK Biobank participants were genotyped on the UK Biobank Axiom array and the UK BiLEVE custom array. Genotype imputation was conducted with IMPUTE software against the UK10K haplotype panel and the 1000 Genomes Project phase 3 panel. GWAS analysis was performed by SNPTEST using a logistic regression model. A genome-wide gene association study was performed using the MAGMA gene analysis tool, and multiple genes and genetic variants were identified. The Icelandic GWAS dataset from the deCODE Genetics genealogical database was based on whole-genome sequencing using Illumina technology. Finally, meta-analysis of single nucleotide polymorphisms (SNPs) in the UK Biobank and deCODE samples was performed using the METAL analysis tool [20].
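As a minimal illustration of the meta-analysis step, the following Python sketch combines per-cohort effect estimates for a single SNP using fixed-effect inverse-variance weighting. METAL supports more than one weighting scheme and the scheme used in the cited analysis is not stated here, so inverse-variance weighting is an assumption; the function name and the numeric inputs are hypothetical.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance meta-analysis for one SNP.

    betas: per-cohort effect estimates (e.g., log odds ratios).
    standard_errors: per-cohort standard errors of those estimates.
    Returns (pooled beta, pooled standard error, z score, two-sided P value).
    """
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical effect estimates for one SNP in two cohorts.
pooled = inverse_variance_meta([0.10, 0.20], [0.05, 0.05])
```

With equal standard errors the pooled estimate is simply the average of the two effects, and the pooled standard error shrinks by a factor of the square root of two.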
The atlas of genetic associations in the UK Biobank (GeneATLAS, http://geneatlas.roslin.ed.ac.uk) helps researchers effectively analyze UK Biobank results without high computational costs. It also allows users to query genome-wide association results for 9,113,133 genetic variants and download GWAS summary statistics for more than 30 million imputed genetic variants (>23 billion phenotype-genotype pairs) [21]. We downloaded large-scale GWAS breast cancer summary data from the atlas of genetic associations. Detailed descriptions of sample characteristics, experimental designs, statistical analyses, and quality control can be found in previous studies.

Gene expression datasets
Oncomine (https://www.oncomine.org) is a cancer microarray database and web-based data mining platform for facilitating discovery. In this study, differentially expressed genes (DEGs) in breast cancer were identified by comparing cancer samples with their respective normal samples using the Oncomine database. The heatmap of significant DEGs in breast cancer was derived from Oncomine.
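The comparison of cancer and normal samples can be sketched for a single gene as follows. This is only an illustration under stated assumptions: expression values are taken to be log2-transformed, a Welch t statistic is used as the test statistic, and the function names and sample values are hypothetical. The statistical test and thresholds applied within Oncomine are not specified in this study.

```python
import math
from statistics import mean, variance


def log2_fold_change(tumor, normal):
    """Difference of group means; valid when inputs are already log2 values."""
    return mean(tumor) - mean(normal)


def welch_t_statistic(tumor, normal):
    """Welch t statistic (unequal variances) for tumor versus normal samples."""
    standard_error = math.sqrt(
        variance(tumor) / len(tumor) + variance(normal) / len(normal)
    )
    return (mean(tumor) - mean(normal)) / standard_error


# Hypothetical log2 expression values for one gene.
tumor_log2 = [8.0, 9.0, 10.0]
normal_log2 = [5.0, 6.0, 7.0]
fold_change = log2_fold_change(tumor_log2, normal_log2)
t_stat = welch_t_statistic(tumor_log2, normal_log2)
```

A positive fold change with a large t statistic would mark the gene as a candidate overexpressed DEG; a negative one would mark it as underexpressed.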

Identification of significant genes in breast cancer
NetWAS (https://hb.flatironinstitute.org/netwas/) integrates tissue-specific networks and significant GWAS association results, and identifies relevant disease-gene associations based on genomics. Briefly, SNP-level association statistics were converted into gene-level statistics (gene-based P values), which were then integrated with tissue-specific networks to predict the causal genes [18]. Greene et al. [13] demonstrated that NetWAS is more accurate than GWAS alone. In this study, we identified the most relevant genes in breast cancer using NetWAS.
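The idea of reprioritizing genes by combining gene-level P values with a tissue-specific network can be sketched as follows. This is not the NetWAS algorithm: the published method trains a classifier on network connectivity, whereas this sketch swaps in a much simpler mean-connectivity score. The P value threshold of 0.01, the gene names, and the edge weights are all hypothetical.

```python
def network_rerank(gene_pvalues, network, threshold=0.01):
    """Rank genes by connectivity to nominally significant genes.

    gene_pvalues: dict mapping gene name to gene-level P value.
    network: dict mapping frozenset({gene_a, gene_b}) to an edge weight.
    Returns (genes sorted from highest to lowest score, score dict).
    """
    positives = {g for g, p in gene_pvalues.items() if p < threshold}
    negatives = set(gene_pvalues) - positives

    def mean_weight(gene, group):
        # Average edge weight from `gene` to the other members of `group`.
        others = [g for g in group if g != gene]
        if not others:
            return 0.0
        total = sum(network.get(frozenset((gene, o)), 0.0) for o in others)
        return total / len(others)

    scores = {
        g: mean_weight(g, positives) - mean_weight(g, negatives)
        for g in gene_pvalues
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical gene-level P values and tissue-specific edge weights.
toy_pvalues = {"GENE_A": 0.001, "GENE_B": 0.005, "GENE_C": 0.20, "GENE_D": 0.60}
toy_network = {
    frozenset(("GENE_A", "GENE_B")): 0.9,
    frozenset(("GENE_A", "GENE_C")): 0.8,
    frozenset(("GENE_B", "GENE_C")): 0.7,
    frozenset(("GENE_A", "GENE_D")): 0.1,
    frozenset(("GENE_B", "GENE_D")): 0.1,
    frozenset(("GENE_C", "GENE_D")): 0.2,
}
ranking, scores = network_rerank(toy_pvalues, toy_network)
```

In this toy example GENE_C is not nominally significant on its own, yet it ranks first because it is tightly connected to the significant genes, which is the kind of signal a network-based reprioritization is meant to recover.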

Kyoto Encyclopedia of Genes and Genomes pathway and Gene Ontology analyses
Cytoscape is one of the most successful network biology analysis and visualization tools. It exposes more than 270 core functions and 34 applications as REST-callable functions with standardized JSON interfaces supported by Swagger documentation [22]. CluePedia, a Cytoscape plug-in, can search for Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling pathways associated with given genes by calculating linear and nonlinear statistical dependencies from experimental data [23]. KEGG signaling pathways were identified by CluePedia. The Search Tool for the Retrieval of Interacting Genes (STRING) (https://string-db.org/cgi/input.pl) is an online tool for Gene Ontology (GO) analysis of gene sets [24,25]. GO is a commonly used bioinformatics resource that provides comprehensive information on the function of individual gene products, organized into three domains: biological process (BP), cellular component (CC), and molecular function (MF) [26]. We conducted GO analysis using the STRING database.
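Enrichment of a GO term or KEGG pathway in a gene list is commonly assessed with an over-representation test. The following sketch computes a one-sided hypergeometric P value for a single term. It is a generic illustration, not the exact procedure implemented in STRING or CluePedia, and it omits multiple-testing correction; the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric P value for term over-representation.

    population: number of background genes.
    annotated: background genes annotated with the term.
    sample: size of the gene list being tested.
    overlap: genes in the list that carry the annotation.
    Returns P(X >= overlap) under random sampling without replacement.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Hypothetical counts: 20 background genes, 5 annotated, 4 tested, 2 overlapping.
p_term = enrichment_p_value(20, 5, 4, 2)
```

In practice the P values for all tested terms would then be corrected for multiple testing before any term is called enriched.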

Analysis of the correlation between significant genes and breast cancer
Gene Expression Profiling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn) is a web server for analyzing RNA-sequencing expression data of 9,736 tumors and 8,587 normal samples from The Cancer Genome Atlas and Genotype-Tissue Expression projects, using a standard processing pipeline [27]. The Human Protein Atlas (HPA, www.proteinatlas.org) is an immunohistochemistry-based map of protein expression profiles in normal tissues, cancer tissues, and cell lines, and provides a resource for pathology-based biomedical research, including protein biomarker discovery [28][29][30]. Correlations between significant genes and breast cancers were analyzed with GEPIA and HPA.

Analysis of the correlation between significant gene expression and overall survival
PROGgeneV2 (http://www.compbio.iupui.edu/proggene), a tool that can be used to predict the prognostic implication of genes in cancers, is written in PHP5 with a MySQL database backend, which stores gene expression data, covariates data, and metadata for cataloged studies in the form of relational database tables. Survival analysis in PROGgeneV2 is done using the backend R script; users can input multiple genes and use combined analysis to create survival plots for different genes of interest [31]. We used PROGgeneV2 to analyze the relationship between overall survival and genes that were overexpressed and underexpressed in breast cancer.
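Survival plots of this kind are typically built on the Kaplan-Meier estimator. The sketch below is a plain Python version for one patient group, given only as an illustration: it is not the R script used by PROGgeneV2, and the function name, follow-up times, and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate for one group.

    times: follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((current_time, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up (months) for five patients in a high-expression group.
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

Comparing such curves between patients with high and low expression of a gene, for example with a log-rank test, gives the type of survival association reported for the significant genes.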

Drug prediction analysis
CMap (https://portals.broadinstitute.org/cmap/) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections among drugs, genes, and diseases through the transitory feature of common gene expression changes [32][33][34]. We used CMap to identify small-molecule compounds as potential therapeutic agents to target the significant genes in breast cancer.
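The pattern-matching idea can be sketched with a Kolmogorov-Smirnov-style enrichment statistic of the kind described for the original CMap. This is a simplified illustration under assumptions: it computes only a raw score for one compound, omits the normalization across compound instances and the permutation testing, and uses hypothetical gene names and function names.

```python
def ks_score(tag_genes, ranked_genes):
    """Signed Kolmogorov-Smirnov-style statistic for a gene set.

    ranked_genes: genes ordered from most up-regulated to most down-regulated
    by a compound. Positive values mean the set sits near the top of the list.
    """
    n = len(ranked_genes)
    position = {gene: rank + 1 for rank, gene in enumerate(ranked_genes)}
    v = sorted(position[g] for g in tag_genes if g in position)
    t = len(v)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v[j] / n for j in range(t))
    b = max(v[j] / n - j / t for j in range(t))
    return a if a > b else -b


def connectivity_score(disease_up, disease_down, ranked_genes):
    """Raw connectivity score; negative values suggest signature reversal."""
    ks_up = ks_score(disease_up, ranked_genes)
    ks_down = ks_score(disease_down, ranked_genes)
    if ks_up * ks_down > 0:
        # Both sets move in the same direction, so no coherent connection.
        return 0.0
    return ks_up - ks_down


# Hypothetical compound profile: g1 is most up-regulated, g10 most down-regulated.
ranked = ["g%d" % i for i in range(1, 11)]
reversal = connectivity_score({"g9", "g10"}, {"g1", "g2"}, ranked)
```

Here the genes overexpressed in disease sit at the bottom of the compound profile and the underexpressed genes sit at the top, so the score is strongly negative, which is the pattern expected of a candidate therapeutic compound.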

Identification of significant DEGs in breast cancer
To identify the significant DEGs in breast cancer, we retrieved GWAS summary data (C50-C50) of breast cancer from the UK Biobank, and microarray expression profiles of breast cancer from the Oncomine database. The C50-C50 dataset contained 10,478 cases of malignant neoplasm of the breast and 235,016 controls for the analyses, and the data were consolidated and normalized (Figure 1(a,b)).

GO and KEGG enrichment analyses of significant DEGs in breast cancer
To explore the roles of the significant DEGs in breast cancer, we performed GO and KEGG enrichment analyses. BP analysis revealed that the significant genes in breast cancer were mainly enriched in the Wnt signaling pathway, calcium-modulating pathway, protein repair, gene silencing by microRNA (miRNA), mRNA cleavage involved in gene silencing by miRNA, and positive regulation of epithelial cell proliferation involved in lung morphogenesis (Table 2). MF analysis showed that significant genes were enriched in functions related to oxidoreductase activity acting on a sulfur group of donors with disulfide as acceptor, and phosphoinositide 3-kinase (PI3K) and PIK3CA activities (Table 2). CC analysis showed that significant genes were enriched in P-bodies. KEGG analysis revealed that significant genes in breast cancer were mainly involved in pathways in cancer, breast cancer, gastric cancer, melanoma, the PI3K/Akt signaling pathway, mitogen-activated protein kinase (MAPK) signaling pathway, Ras signaling pathway, tight junctions, and ubiquitin-mediated proteolysis (Figure 3).

Correlation between significant DEGs and breast cancer
To verify the significant DEGs in breast cancer, we explored them further. Consistent with the identification of significant genes, protein profiling in breast cancer samples from the HPA using immunohistochemistry showed that the expression of CLDN7, RBM33, SH3RF1, and UBE2Z was significantly enriched in breast cancer, whereas there was no significant enrichment of FGF7 and TNRC6B (Figure 4). The expression of significant DEGs (CLDN7, BMPER, FGF7, MSRB3) in breast cancer samples compared to normal samples was also consistent with the identification of significant genes (Figure 5).

Correlation between overall survival and significant DEGs in breast cancer
In the analysis of the correlation between overall survival and significant DEG expression (CLDN7,

Drug prediction analysis
To identify potential small-molecule compounds with therapeutic effects on breast cancer, drug prediction analysis was performed using CMap. A total of 65 drugs were predicted, and the 10 most significant were trichostatin A, LY-294,002, econazole, Prestwick-1082, vorinostat, lomefloxacin, clorsulon, amantadine, thiostrepton, and orciprenaline (Table 3).

Discussion
Breast cancer is the most commonly diagnosed malignancy in women worldwide and is the main cause of cancer-related death in women [35][36][37]. Although many effective therapeutic agents for breast cancer exist, it remains a major health problem and a top biomedical research priority [38][39][40], and more effective treatments are still urgently needed.
In this study, we identified 10 significant genes (CLDN7, MLLT10, RBM33, SH3RF1, SSBP4, UBE2Z, BMPER, FGF7, MSRB3, and TNRC6B) in breast cancer using combined GWAS data and profiling of DEGs. Protein profiling in breast cancer samples from the HPA using immunohistochemistry and analysis of significant DEGs in breast cancer samples compared to normal samples from GEPIA further verified the results. Significantly overexpressed genes (CLDN7, MLLT10, RBM33, SH3RF1, SSBP4, and UBE2Z) were correlated with shorter survival, whereas underexpressed genes (BMPER, FGF7, MSRB3, and TNRC6B) were correlated with longer survival in breast cancer.
Consistent with our findings, previous studies have shown that some of these genes play important roles in the development of breast cancer. For example, Bernardi et al. [41] showed that CLDN7 is associated with a shorter time to recurrence, suggesting its contribution to the aggressiveness of breast cancer. In a GWAS, Guo et al. [42] identified common genetic loci for breast cancer risk including SSBP4. Whole transcriptome analysis by Bauer et al. [43] demonstrated that BMPER plays a possible therapeutic role in breast cancer. Fu et al. [44] demonstrated that acetylation, expression and recruitment of FGF7 promoters induce cancer growth and progression. Zhu et al. [45] found that targeting FGF7 can exert oncogenic functions in breast cancer. A previous study showed that the ZEB1-MSRB3 axis is related to breast cancer genome stability [46]. Interestingly, the other DEGs in breast cancer identified in this study, including MLLT10, RBM33, SH3RF1, UBE2Z, and TNRC6B, have not been reported in previous studies of breast cancer. We believe that these are potentially novel key genes in breast cancer.
BP analysis in GO annotation indicated that the 10 significant genes are mainly enriched in the Wnt signaling pathway, which plays an important role in the occurrence and development of many cancers. Inhibiting this pathway can suppress breast cancer growth and metastasis [47][48][49]. MF analysis of GO suggested that the DEGs were most significantly enriched in functions related to oxidoreductase activity. The redox reaction is accompanied by tumor development. CC analysis of GO annotation showed that the 10 DEGs were enriched in P-bodies. A previous study suggested that P-body disassembly correlates with breast cancer progression [50].
KEGG analysis of the 10 DEGs showed their enrichment in breast cancer, gastric cancer, melanoma, the PI3K/Akt signaling pathway, MAPK signaling pathway, Ras signaling pathway, tight junctions, and ubiquitin-mediated proteolysis. Some of these pathways contribute to the development of breast cancer. For example, the PI3K pathway is found in many types of cancer and plays an important role in breast cancer cell proliferation [51]. Ras signaling is a key determinant of poor survival in breast cancer patients [52]. Abnormal MAPK signaling plays a core role in the regulation of growth and survival, and the development of drug resistance in triple-negative breast cancer [53].
The aim of this work was to identify significant genes and potential therapeutic agents for breast cancer based on genomics. We found 65 small-molecule compounds with the potential to reverse the expression of significant genes in breast cancer. The 10 most significant drugs were trichostatin A, LY-294,002, econazole, Prestwick-1082, vorinostat, lomefloxacin, clorsulon, amantadine, thiostrepton, and orciprenaline. Consistent with our study, it has been reported that trichostatin A, a histone deacetylase inhibitor, has therapeutic potential in breast cancer [54]. Jiang et al. [55] showed that trichostatin A sensitizes ER-negative breast cancer cells to tamoxifen. LY294002, a specific inhibitor of the PI3K pathway, can decrease the rate of cell growth and increase the therapeutic sensitivity in MCF7 cells expressing wild-type p53, which may be useful for the treatment of breast cancer [56]. Econazole is a novel PI3K/AKT signaling pathway inhibitor, which can be used to overcome adriamycin resistance and improve chemotherapy sensitivity in breast cancer [57]. A preclinical study showed that vorinostat can prevent the formation of brain metastases in breast cancer [58]. Yang et al. [59] suggested that thiostrepton is a promising agent for triple-negative breast cancer. Kwok et al. [60] showed that thiostrepton selectively targets breast cancer cells through inhibition of Forkhead box M1 expression.
However, some of the predicted drugs, such as Prestwick-1082, lomefloxacin, clorsulon, amantadine, and orciprenaline, have not been shown to directly play a role in breast cancer. Thus, future studies are needed to confirm our findings. Compared to previous studies [61][62][63], we conducted an analysis combining genomic data with drug database analysis to identify novel candidate therapeutic agents for breast cancer treatment. Our study demonstrates the usefulness of this approach for evaluating the relationship among genes, diseases, and drugs. These findings will pave the way for the discovery of potential therapeutic targets for breast cancer.

Conclusion
Combined analyses of network-wide association studies, gene expression profiles, and drug databases are helpful for identifying potential therapeutic agents for diseases. This method is a new paradigm that can guide future research directions.

Availability of data and materials
All materials are available from the corresponding author.

Ethical statement
Our study did not require ethics board approval because it did not involve human or animal trials.

Consent for publication
Not applicable.

Disclosure Statement
The authors declare no competing interests.

Highlights
(1) Combined analyses of network-wide association studies, gene expression profiles, and drug databases.
(2) A useful approach for evaluating the relationship among genes, diseases, and drugs.
(3) These findings will pave the way for the discovery of potential therapeutic targets for breast cancer.