Protein-protein interaction networks and different clustering analysis in Burkitt's lymphoma.

ABSTRACT Objective: Burkitt's lymphoma (BL) is a highly aggressive malignant lymphoma whose molecular mechanisms have not been fully investigated. The construction of protein-protein interaction (PPI) networks and the identification of complexes through cluster analysis are important research directions in the post-genome era. However, different clustering algorithms have their own characteristics, and any single analysis has limitations. In this study, we obtained target and pathway information for BL using several different clustering analyses. Material and Methods: First, we obtained 50 BL genes by screening the Online Mendelian Inheritance in Man (OMIM) database; their related genes were further extracted from the literature. The PPI network was constructed with the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING). The interaction data were then imported into Cytoscape 3.4.0, and related plug-ins were used to perform topology analysis and clustering analysis. Functional analysis based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database was used to characterize the biological importance of the clusters. Results: We constructed a PPI network consisting of 459 nodes (proteins) and 1399 edges (interactions); 12 genes and 8 signaling pathways were found to be closely related to BL. Conclusion: Combining several algorithms to analyse gene interactions provides a new perspective for network-based analysis. The results offer new insights into the molecular mechanisms underlying BL, identify genes and pathways that may be novel therapeutic targets for disease management and may provide a bioinformatic basis for the further understanding of BL.

Burkitt's lymphoma (BL) is a highly malignant tumour originating from follicular germinal centre cells. It is a distinct subtype of B-cell non-Hodgkin's lymphoma characterized by a short tumour cell doubling time, high aggressiveness and poor prognosis. The main molecular characteristics of BL are CD19+, CD20+, BCL6+, CD10+, BCL2- and translocation of c-myc [1]. Its pathogenesis is not clear. Intensive chemotherapy is currently the main treatment, but the considerable side effects of the drugs make research into new, effective and low-toxicity treatments urgent [2]. In recent years, with the rapid development of molecular biology, information on cell cycle proteins, apoptosis proteins, oncogene expression and cell signalling pathways in BL has gradually been revealed, but the exact pathogenesis remains to be further studied [3].
The protein interaction network is both a product of biological development and a large, complex body of interaction data [4]; research on it currently focuses on the detection of protein complexes and functional modules [5]. Protein interaction networks usually contain many densely connected protein nodes alongside sparsely connected ones, which makes clustering the primary tool for data mining in these networks [6]. However, different clustering algorithms are based on different mathematical models, and each has certain limitations [7]; analyzing the network with a single clustering method can therefore introduce errors that affect the research results.
In this study, based on existing methods, BL-related gene information was analyzed using a series of bioinformatic methods, including protein-protein interaction (PPI) network construction, topology analysis, multiple cluster analysis and pathway enrichment analysis. The results we obtained may facilitate the understanding of the molecular mechanisms underlying BL.
search term 'Burkitt lymphoma' in the 'Gene' section (Burkard et al. [12]). Sixty-eight BL-related genes were identified (up to 2016-07-26). After the related articles had been read one by one, miscellaneous data were screened out. Finally, 55 genes (up to 2016-07-26) related to the clinical, immunological and molecular definitions of BL were retained, as shown in Table 1.

PPI network construction and topology attribute analysis
The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, which also allows complex networks to be visualized, was used to retrieve the predicted interactions for the identified genes [8]. STRING is a biological database and web resource of known and predicted PPIs; version 10.0 contains information on approximately 9.6 million proteins from more than 2000 organisms. We required an interaction score of >0.4 as the threshold for analysis. Subsequently, the protein interaction data were imported into Cytoscape 3.4.0 (http://cytoscape.org/), which allows the network to be integrated with any type of attribute data and supports topology analysis [9].
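The thresholding step can be illustrated with a minimal Python sketch. It assumes the interactions are available as (protein A, protein B, score) rows with scores on a 0 to 1 scale; the function names and the example rows are hypothetical and are not taken from the study's data or from the STRING interface.

```python
from collections import defaultdict

SCORE_THRESHOLD = 0.4  # required interaction score stated in the text


def build_network(interactions, threshold=SCORE_THRESHOLD):
    """Return an undirected adjacency map from (protein_a, protein_b, score) rows."""
    adjacency = defaultdict(set)
    for a, b, score in interactions:
        if a == b or score <= threshold:
            continue  # skip self-loops and pairs that do not exceed the threshold
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def count_edges(adjacency):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbours) for neighbours in adjacency.values()) // 2


# Hypothetical rows for illustration only.
example_rows = [
    ("TP53", "MDM2", 0.99),
    ("MYC", "TP53", 0.95),
    ("MYC", "FOS", 0.35),  # at or below the threshold, so it is dropped
    ("BCL2", "BAX", 0.99),
]
network = build_network(example_rows)
```

Storing neighbours as sets means that a pair reported twice, or in both orientations, still counts as a single edge.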

Mining function modules and implementing topology analysis
Different clustering analyses (OH-PIN, IPCA, EAGLE, MCODE) were used to detect molecular complexes in the network through the plug-ins CentiScaPe 2.1 and Cytocluster 1.0.1 in Cytoscape, whereas ClusterViz 1.0.3 was used for the topology analysis of the PPI network. The parameters and the applied thresholds are shown in the notes of Figure 2.

Pathway enrichment analysis
The proteins included in the eight molecular complexes were submitted to the online DAVID system (https://david.ncifcrf.gov/), which consists of an integrated biological knowledge base and analytical tools aimed at systematically extracting biological meaning from large gene or protein lists [10,11]. The parameters were count = 2 and EASE = 0.1, with 'Homo sapiens' chosen for 'species and background'. A search of the Kyoto Encyclopaedia of Genes and Genomes (KEGG), BBID, PANTHER, BIOCARTA and REACTOME_PATHWAY databases helped identify the biological pathways regulated by BL-related genes. A P-value ≤ 0.05 was used as the cut-off criterion.
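To make the count and EASE parameters more concrete, the following Python sketch computes a one-sided Fisher exact p-value and an EASE-style score for a single pathway. It assumes the commonly described definition of the EASE score, in which one gene is removed from the list hits before the test is applied; the function names and the input numbers are hypothetical, and the sketch is not DAVID's own implementation.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / comb(population, draws)


def fisher_p(list_hits, list_total, pop_hits, pop_total):
    """One-sided Fisher exact p-value for over-representation."""
    return hypergeom_tail(list_hits, pop_total, pop_hits, list_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Assumed EASE definition: the same test after removing one list hit."""
    if list_hits < 1:
        return 1.0
    return hypergeom_tail(list_hits - 1, pop_total - 1, pop_hits - 1, list_total - 1)


def is_enriched(list_hits, list_total, pop_hits, pop_total,
                min_count=2, ease_cutoff=0.1):
    """Apply the count and EASE thresholds quoted in the text."""
    if list_hits < min_count:
        return False
    return ease_score(list_hits, list_total, pop_hits, pop_total) <= ease_cutoff


# Hypothetical numbers: 3 of 300 submitted genes fall in a pathway
# that contains 40 of the 30000 background genes.
p_value = fisher_p(3, 300, 40, 30000)
ease = ease_score(3, 300, 40, 30000)
```

Because one hit is removed, the EASE-style score is always at least as large as the plain Fisher p-value, which penalizes pathways supported by only a few genes.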

PPI network and molecular complexes
A gene/protein interaction network consisting of 459 nodes (proteins) and 1399 edges (interactions) was constructed using STRING after a comprehensive literature search of 50 related genes (Figure 1). Then, for each of the OH-PIN, IPCA, EAGLE and MCODE algorithms, the two clusters with the highest degree of association were selected for further analysis (Figure 2).

Results of network topology attribute analysis
The network topology attribute analysis showed that the connectivity of nodes in the network (the number of other nodes to which a node is connected) follows a descending distribution. That is, as connectivity increases, the number of nodes with that connectivity decreases. Thus, the gene-protein interaction network is a scale-free network [12]. We found that the number of nodes fell sharply at a degree of ≥ 35; therefore, nodes with a connectivity of ≥ 35 were selected as the key nodes (hubs): tp53, bcl2, akt1, myc, bax, mdm2, mapk8, stat1, jak1, fos, il4 and myd88, 12 genes in total (Figure 3).
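The degree-based hub selection can be sketched in a few lines of Python. The sketch assumes the network is given as a list of unique undirected edges; the function names and the toy network are hypothetical, and only the cut-off of 35 comes from the text.

```python
from collections import Counter

HUB_DEGREE = 35  # connectivity cut-off stated in the text


def node_degrees(edges):
    """Count how many edges touch each node (edges are unique undirected pairs)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def degree_distribution(degree):
    """Map each degree k to the number of nodes that have exactly k neighbours."""
    return Counter(degree.values())


def hubs(degree, cutoff=HUB_DEGREE):
    """Nodes whose degree reaches the cut-off, in alphabetical order."""
    return sorted(node for node, k in degree.items() if k >= cutoff)


# Hypothetical toy network: one node linked to 40 partners, plus one extra edge.
toy_edges = [("HUB", f"P{i}") for i in range(40)] + [("P0", "P1")]
toy_degree = node_degrees(toy_edges)
toy_distribution = degree_distribution(toy_degree)
```

In the toy network most nodes have a single neighbour and only one node passes the cut-off, which mirrors the descending distribution described above on a much smaller scale.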

Molecular complex pathway enrichment
The enriched terms obtained from the pathway enrichment analysis of BL are shown in Table 2. Only the first five enriched pathways of each molecular complex are listed in the table, but all meaningful results were included in the analysis. Finally, pathways that recurred ≥ 4 times across the molecular complexes were taken as the key pathways, giving the following eight: toll-like receptor signalling pathway, Jak-STAT signalling pathway, apoptosis signalling pathway, p53 signalling pathway, MAPK signalling pathway, colorectal cancer signalling pathway, pancreatic cancer signalling pathway and neurotrophin signalling pathway.
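The rule of keeping pathways that recur at least four times can be written as a short Python sketch. It assumes that recurrence means the number of molecular complexes in which a pathway is enriched; the function name, the complex labels and the pathway lists are hypothetical and do not reproduce Table 2.

```python
from collections import Counter

MIN_REPEATS = 4  # recurrence threshold stated in the text


def key_pathways(pathways_by_complex, min_repeats=MIN_REPEATS):
    """Return pathways enriched in at least `min_repeats` complexes."""
    counts = Counter()
    for pathways in pathways_by_complex.values():
        counts.update(set(pathways))  # count each pathway once per complex
    return sorted(name for name, n in counts.items() if n >= min_repeats)


# Hypothetical enrichment results for illustration only.
example_results = {
    "OH-PIN_1": ["p53 signalling pathway", "Apoptosis"],
    "OH-PIN_2": ["p53 signalling pathway"],
    "IPCA_1": ["p53 signalling pathway", "Apoptosis", "Apoptosis"],
    "IPCA_2": ["Apoptosis"],
    "MCODE_1": ["p53 signalling pathway"],
    "EAGLE_1": ["Cell cycle"],
}
selected = key_pathways(example_results)
```

Converting each list to a set before counting prevents a pathway that is listed twice for the same complex from being counted twice.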

Discussion
Proteins are not only the material basis of life but also the executors of life activities. They rarely perform their functions in isolation: processes such as genetic material replication, gene expression regulation, cell signal transduction, cell proliferation, apoptosis and metabolism all depend on interactions between proteins. Therefore, the study and analysis of PPI networks have naturally become a foundation for understanding cellular organization, processes and functions [13]. The PPI network has a modular structure and is made up of subnets that are relatively independent in topology or biological function. Therefore, a clustering analysis of a PPI network can reveal its structure and predict the function of unknown proteins in a cluster [14]. For example, Nanda Kumar Yellapu et al. found through PPI network analysis that PAK1 is an important target gene in breast cancer [15]; Ma [16] used microarray and systems biology methods to describe the genetic background of cold syndrome in the traditional Chinese medicine (TCM) system. Based on the protein interaction network, Harriet Keane constructed the MPP+ model to analyze and predict the crosstalk between network proteins, and identified four target nodes/proteins with a high betweenness (P62, GABARAP, GBRL1 and GBRL2) that influence Parkinson's disease through mitochondrial function and autophagy [17]. Gu et al. [18] studied the response of the inflammation and angiogenesis gene network modules, predicting that adam17, cd40, ets1, foxo1, smad3 and tlr4 were the target genes of miR-145; Pan [19] used a network biology approach to identify DEMs and predict biomarkers that distinguish HCC from MHCC.
Clustering analysis, which mines dense subgraphs of the protein network according to a given algorithm and rule, is an important method for mining complex data [20]. Traditional clustering algorithms are divided into hierarchical-based methods, graph-based methods, density-based methods and so on [21]. However, there is no reliable evidence that the results of these methods have significant biological implications. Different types of clustering algorithms have their own advantages and disadvantages: the main advantage of graph-based methods is that they are simple and efficient, whereas hierarchical-based methods can present the hierarchical modules of the whole protein network as a tree structure. The shortcoming of both is that each protein node can belong to only one cluster. The density-based method allows a protein to be visited repeatedly while the search is expanded, so the same protein can belong to different clusters, but this method cannot identify non-dense subgraph structures in the protein network [22]. In an actual protein network, each protein node may belong to several protein complexes or functional modules, have multiple functions and be involved in a number of different biological processes [23].
Therefore, taking these considerations together, we used four different clustering algorithms, OH-PIN, IPCA, MCODE and EAGLE, with the aim of reducing error and improving the reliability of the results. OH-PIN is an agglomerative hierarchical algorithm that can distinguish overlapping protein functional modules. IPCA, an improvement on the DPClus algorithm [24], is density-based: it selects the highest-weighted node as the seed and then extends the cluster, which is output once all nodes that can be added have been detected [25]. MCODE is a non-overlapping clustering algorithm based on density [26], but because nodes with large weights do not necessarily have a high degree of connectivity with other nodes, there is no guarantee that the predicted complexes will be tightly connected; furthermore, in practical applications, the coverage rate of complexes found by density-based algorithms is not high [27]. EAGLE is an agglomerative hierarchical algorithm based on maximal cliques that can identify overlapping functional modules [28]; by repeatedly merging the two clusters with the largest similarity, it finds the optimal partition for identifying functional modules [29]. In this study, we analyzed and compared the results of the four types of cluster analysis, on the basis that gene nodes with a high degree of association and signalling pathways shared across the analyses are more biologically significant.
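The seed-and-extend idea behind density-based clustering can be illustrated with a deliberately simplified Python sketch. It is not the IPCA, MCODE or DPClus implementation used in the Cytoscape plug-ins: the greedy rule, the density threshold of 0.8, the function names and the toy network are all assumptions made for illustration. Density is taken as the number of internal edges divided by the number of possible node pairs.

```python
def density(nodes, adjacency):
    """Internal edges divided by the number of possible node pairs."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = sum(len(adjacency.get(v, set()) & nodes) for v in nodes) // 2
    return 2 * internal / (n * (n - 1))


def expand_from_seed(seed, adjacency, min_density=0.8):
    """Greedily add the neighbour that keeps the cluster densest."""
    cluster = {seed}
    while True:
        frontier = set().union(*(adjacency.get(v, set()) for v in cluster)) - cluster
        best, best_density = None, 0.0
        for candidate in sorted(frontier):  # sorted for reproducible tie-breaking
            d = density(cluster | {candidate}, adjacency)
            if d > best_density:
                best, best_density = candidate, d
        if best is None or best_density < min_density:
            return cluster  # no neighbour keeps the cluster dense enough
        cluster.add(best)


# Hypothetical toy network: a four-node clique (A-D) with a sparse tail (E, F).
toy_adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
dense_cluster = expand_from_seed("D", toy_adjacency)
```

Starting from the best-connected node, the expansion stops at the clique because adding the tail node would lower the density below the threshold. This also shows why such methods miss sparsely connected modules.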
Twelve important target genes associated with BL were identified in our study. Related studies have shown that the t(8;14) translocation involving the c-myc gene is recognized as the main molecular genetic feature of BL. Dave et al. compared the gene expression profiles of BL and diffuse large B-cell lymphoma using gene chip technology and found that c-myc target genes and germinal centre subgroup genes were overexpressed [30]. However, studies have shown that the c-myc/Ig translocation may also occur in normal precancerous cells [31]. Further study found that c-myc can act as an amplifier of transcriptionally active genes, driving cell proliferation through the upregulation of cyclins D and E [32]. Epstein-Barr virus, an important factor in the occurrence of BL, is present in almost all endemic BL. It can cause abnormal expression of the apoptosis inhibitor gene BCL2 or suppress Fas-mediated apoptosis in the lymphocyte system, which inhibits lymphocyte apoptosis and leads to lymphoma [33]. Overexpression of MDM2 can inhibit the expression of p53 protein, and its abnormal expression can disrupt cell cycle regulation in BL [34]; high expression of MDM2 and MDM4 may be the main mechanism in some cases. TP53 (encoding p53) is an important tumour suppressor gene that plays an important biological role in cell cycle regulation, DNA damage repair, cell differentiation, apoptosis and senescence, and in the transcriptional regulation of key genes such as p21 and Bax, thereby preventing the occurrence and development of tumours [35,36]. The findings of Pervez S confirmed this view [37]. The Fos gene is an immediate early response gene, also known as a 'molecular switch'; it plays a dual role in tumour cell apoptosis, invasive growth and tumour angiogenesis, depending on the cell type and the type of apoptotic stimulus [38].
IL-4 is one of the key cytokines in tumour immunity, and its effect on tumours can likewise be either positive or negative. However, more studies have shown that IL-4 promotes tumour growth and can protect tumour cells from death receptor- and chemotherapy-induced apoptosis. Studies have shown that IL-4 significantly inhibits Fas/APO-I- and chemotherapeutic drug-induced apoptosis in prostate cancer, breast cancer and bladder tumour cells [39] and can induce blood monocytes to differentiate into M2 macrophages, which promote tumour growth and angiogenesis [40]. Mapk8, stat1, jak1 and myd88 are important transduction proteins that play key roles in the MAPK, Jak-STAT and toll-like receptor signalling pathways.
According to the literature, the key genes identified in this study are consistent with the target genes highlighted in previous research, which supports the value of the gene/protein interaction network we constructed.

Conclusions
In this study, 12 important target genes (tp53, bcl2, akt1, myc, bax, mdm2, mapk8, stat1, jak1, fos, il4 and myd88) and 8 molecular complex pathways (toll-like receptor signaling pathway, Jak-STAT signaling pathway, apoptosis signaling pathway, p53 signaling pathway, MAPK signaling pathway, colorectal cancer signaling pathway, pancreatic cancer signaling pathway and neurotrophin signaling pathway) associated with BL were identified. These genes and pathways might play important roles in the biological progression of BL and might become new targets for its treatment. Additionally, we applied an approach that combines multiple algorithms to analyze the interactions between genes, which is expected to be more reliable than a single method. However, further investigation is required to verify these results and the proposed mechanisms of action in BL.

Disclosure statement
No potential conflict of interest was reported by the authors.