Functional clusters analysis and research based on differential coexpression networks

ABSTRACT Differential coexpression analysis has gradually become an important approach to improve the conventional method of analysing differentially expressed genes. With this approach, it is possible to discover disease mechanisms and underlying regulatory dynamics which remain obscure in differential expression analysis. The detection of differential coexpression links and functional clusters between different disease states is a demanding task. Nevertheless, there is no gold standard for detecting differential coexpression links and functional clusters. Consequently, we developed a novel fusion algorithm FDvDe (Fusion of differential vertex and differential edge) to detect differential coexpression links by aggregating the set of ‘differential vertex’ and ‘differential edge.’ Then, we constructed differential coexpression networks between normal and tumour states by integrating the differential coexpression links. With this approach, we identified 1823 genes and 29370 links. Then, we developed the algorithms GTHC (GO term hierarchical clusters) to identify functional modules. The distance matrix used in the hierarchical process was formed by the GO semantic similarity. Furthermore, we aggregated the densities among clusters describing the connectivity among clusters and topological property analysis to discover the hub genes and hub pathways which play an important role in disease mechanism. In this paper, we showed that our approach worked well on a data set of breast cancer samples (68 tumour samples) and normal samples (73 normal samples), and revealed the crucial role and biological significance of the modules and hub genes found in this approach.


Introduction
In the last 18 years, microarrays have gradually become a common technological means for discovery of disease mechanisms based on datasets of global gene expression. Along with the emergence and development of molecular and cellular biology, we have altered the traditional way of studying disease mechanisms through differential expression analysis to differential coexpression analysis of genes participating together in a distinct functional group or a complex biological pathway. Therefore, a large number of coexpression networks include physical underlying protein-protein interactions and metabolic reactions, as well as functional associations such as epistasis of single nucleotide polymporphisms (SNP), synthetic lethality relationships and coexpression of genes [1][2][3]. These coexpression networks have been widely applied to answer and solve the most common biological questions including identifying cancer-related genes and drug discovery [3][4][5].
Coexpression networks, as a type of molecular networks, are constructed using the correlation coefficient of the whole gene set, pair by pair. However, similar to other static molecular networks, most of the algorithms based on coexpression network only reveal the gene regulatory mechanism under one state. Through experimental analysis, the molecular interactions or functional modules vary corresponding to different conditions including genetic variations and changes in the environment. Consequently, the conventional analysis based on one condition has to be modified to adapt to a variety of conditions. Only with this approach, can we understand the underlying cellular dynamics mechanism. Up till now, the differential coexpression network analysis has been developed and has gradually become one of the most important methods for discovery of the underlying mechanisms of pathogenesis, such as the Weighted Gene Coexpression Network Analysis (WGCNA) [6], Differential Coexpression profile (DCp) [7], Differential Coexpression enrichment (DCe) [7], Differential Coexpressed Links (DCELs) [3], Differential Coexpression Network (DCEN) [3], Detection of global changes of the network topology (DCglob) [8] and Detection of local changes of the network (DCloc) [8]. Differential coexpression analysis has been applied to successfully identify the differential coexpression modules, a group of genes strongly correlated under one condition but not under other, and differentially expressed genes [9][10][11][12].
Although various methods to construct differential coexpression networks have been developed, there is no gold strategy of selecting differentially coexpressed links. Based on the different angles of constructing differential networks, some authors, e.g. Hsu et al. [3] chose the differential coexpressed links by calculating the different value of the pair of genes under different conditions and obtained the pair of genes with P-values lower than a threshold t. Others, for example Ideker and Krogan [13] constructed the differential coexpression network by integrating the edges obtained in the process of edge-wise subtraction of the coexpressed gene pairs between the different disease conditions of A and B. With this approach, it is possible to discover some underlying regulation relationships that are not detected in specific networks. Meanwhile, it will allow us to discard the correlations that do not differentiate one disease state of interest from another [13]. Still others obtained sparse group-specific networks and compared the network topologies, such as modularities between groups and degrees [14,15].
Although most of the above-mentioned methods have revealed some underlying biological mechanisms of pathogenesis and addressed a number of important questions, these approaches are limited for lack of criteria of selecting differentially coexpressed links and the scientific interpretation of the hub genes. In this paper, we developed a novel method, FDvDe (Fusion of Differential vertexes and Differential edges), to select differentially coexpressed links and construct a differential coexpression network by integrating these links. With this method, we considered both the differences of gene pair correlation coefficient and the topological property of the same vertex under the two different conditions, A and B. Next, we developed the algorithm GTHC (gene ontology (GO) term hierarchical clusters) to obtain 64 clusters which are relevant to breast cancer. Afterwards, we studied the correlation among these clusters, and identified a number of hub genes that are crucial to the mechanism of disease. Finally, we found that the genes identified through these approaches occupied an important position in the topological structure of the differential coexpression network by quantitatively calculating the significance of these genes.

Methods
The algorithms FDvDe and GTHC were implemented using the statistical programming language R [16]. While the former was applied to choose the differential coexpressed edges, the latter was applied to detect the functional modules of the differential coexpression network. The flow chart of these two algorithms is illustrated in Figure 1.

Gene expression data pre-processing
The original normalized gene expression datasets of 20 362 genes across breast cancer samples (68 samples) and normal samples (73 samples) were obtained from The Cancer Genome Atlas (TCGA) website. The GO terms could be obtained from the GO website (http://purl.oboli brary.org/obo/go.obo), and we downloaded the whole GO annotations for human genes from the National Center for Biotechnology Information (NCBI) Entrez Gene database. Then we applied some steps of the method proposed by Hsu et al. [3] to conduct the data pre-processing. First, we used the process of log normal distribution to transfer the raw gene expression data to satisfy the normal distribution and then used a quantile normalization method to normalize to make sure that all the arrays were in a similar distribution. Second, according to the circumstance that there were some probes with an NA value, we discarded the probes with more than 20% of NA values and applied the KNN (k-nearest neighbours) algorithm embedded in the R package 'impute' to impute the missing data; the parameter k was set to 10 [3]. Third, the expression of genes with multiple expression arrays was calculated by the average of these expressions. Finally, the genes without GO annotations were discarded, as the information provided by GO terms of the gene is needed to obtain the semantic similarity matrix.

Constructing coexpression network
After the data pre-processing step, we obtained the final gene expression data from normal samples denoted by condition A and tumour samples denoted by condition B. We used the Pearson correlations to calculate the correlation coefficient of all pairs of genes separately for condition A and condition B, denoted by p c With this approach, we obtained the correlation coefficient matrix A and matrix B, respectively, for condition A and condition B, which were implemented in the next steps.

Differential vertex
The principle of selecting the differential vertexes was based on the topological property. For the coefficient matrix A and B, we chose the pairs of genes whose correlation coefficient values p were over the threshold T = 0.9; consequently, we obtained coexpression networks N A and N B corresponding to condition A and B. For a given gene j , we applied the method proposed by Bockmayr et al. [8] to calculate the differential vertex based on the neighbourhood of the given gene, let V i A denote the set of genes which are its neighbours in N A , whereas V i B in N B : According to this definition, we know that the greater the number of common neighbours of the given gene in both networks, the smaller the value of d i .

Differential edge
After defining d v , we modified the method proposed by Wally et al. [12] and defined the formula of differential edge d e as follows: Where p A i;j , p B i;j denoted the Pearson correlation coefficient of gene i and gene j in both networks. The parameter b was set to 0.8 if the signs of p A i;j and p B i;j are not the same, e.g. p A i;j has a negative sign, while p B i;j a positive one; on the contrary, if the sign is the same, b was set to 1. The modification was derived from the hypothesis that the alteration of the sign is more important than the invariability of the sign if the values are equal; for example, the difference of ¡0.2 and 0.2 is more significant than that of ¡0.3 and ¡0.7. The latter part of this formula aimed to solve the following situation: 0.0 to 0.5 and 0.5 to 1.0, in which we considered that the changes of the latter had more importance.
In this equation, we used the method of absolute value to make sure that the result is positive. Meanwhile, we used permutation testing to guarantee the significance of the result. After 1000 random samplings, we obtained the P-values by calculating the count of the differential edges detected in the permutation testing that were lower than that detected in the original data. Then, we adjusted the P-values using the Benjamini-Hochberg (BH) procedure [17]. Finally, we chose the differential edges (gene pairs) with P-value smaller than 2.5 £ 10 ¡4 to prepare for the next step.

FDvDe
After the above two parallel procedures, we obtained two sets of gene vertices, X and Y, where X stands for the vertices obtained in the procedure of differential vertex, and Y stands for the gene vertices existing in the differential edges (pair of gene vertex) obtained from the procedure of differential edge. The detailed algorithm FDvDe (fusion of differential vertex and edge) using pseudo-code is displayed in Figure 2.
With this procedure, we deleted the edges (gene pairs) if both of the two vertices of an edge are not in the set of V (set of genes in X \Y) simultaneously. Then we constructed the differential coexpression network with these reserved edges.

Topological properties
To analyse the differential coexpression network and explore the biological significance and correlations of the hierarchical clusters and hub genes in the point of structure properties, the following four network topologies were quantified.
(1) Degree distributions of differential coexpression network. The distributions were defined to describe the fraction of vertices in the differential network having degree k. As we know, most complicated networks satisfy the property scale-free and power law, expressed as P k ð Þ » k Àg , where we usually set g between 2 and 3 [18,19].
(2) Average path lengths for differential coexpression networks. The average path length is a concept in network topology that is defined as the average number of steps along the shortest paths for all possible pairs of network nodes; it is a measure of the efficiency of information or mass transport in a network. (3) The significance of seed nodes. We defined this concept to describe the significance of the seed nodes in network topology based on the these two correlation coefficients ECC [20] and p i;j .
where E i;j represents the number of triangles that include the edge (i, j) and min(k i , k j ), the minimum of the degree of gene i and gene j , which represents the maximal possible number of triangles including the edge (i, j). We applied the Pearson correlation coefficient p i;j ¼ pearson gene i ; gene j À Á , Where p i;j represents the Pearson correlation coefficient of gene i and gene j . Then we defined the final correlation of gene i and gene j F i;j as follows: And the significance of gene i S i was defined as where j represents all of the neighbours of gene i in the network. Then we sorted the whole S i ; the more significant the gene was, the greater the number of positions of the gene were.
(4) Connectivity densities We applied the connectivity densities proposed by Hsu et al. [3] to describe the correlation among the clusters to discover the hub genes that regulate more than one functional module and understand the mechanism of pathogenesis. It was defined as follows: where E represents the number of edges between the two clusters or within the cluster; i = j denotes that the two clusters are the same, whereas i6 ¼j that the two clusters are not the same. Then we used the z-score to transfer the value D i;j . Last, we chose a value greater than 4.57 for further research, and we can hypothesize that the chosen clusters have the strongest relations with breast cancer.

GO term hierarchical clusters
In this step, we developed a novel algorithm GTHC (GO term hierarchical clusters) to calculate the functional semantic similarity matrix of the whole genes and replace the Pearson correlation coefficient distance matrix with the GO semantic similarity matrix. The algorithm was modified from the method proposed by Schlicker et al. [21]. To calculate the functional similarity, Schlicker combined Resnik's method, which developed the concept of 'information content' [22], and Lin and Dekang's idea [23] to calculate the semantic similarity between two terms using the proportion of the common ancestors of the terms and the information to describe the terms. After the pre-processing procedure, the genes without annotation of GO term were deleted; consequently, the retained genes were all annotated by a GO term. For the given two genes, A and B, annotated by GO terms, let GO A i denote the i-th GO term of gene A in the whole N terms and GO B j denote the j-th GO term of gene B in the whole M terms. The algorithm GTHC is displayed in Figure 3.

Results and discussion
After the gene expression data pre-processing, we obtained 15 673 genes in normal samples and 15 602 genes in tumour samples for further research. The two algorithms, FDvDe and GTHC, were developed to choose the differential coexpressed links and identify the functional similarity clusters.

FDvDe
The FDvDe algorithm was based on two paratactic methods of detecting differential vertexes and differential edges. The result of the dissimilarity degree of vertex under different conditions is shown in Figure 4. We discovered that the number of dissimilarity degrees was close to 1, indicating that most of the genes were topologically dissimilar. Then, we chose the vertexes (genes) having a differential value greater than 0.85 for further research and deleted the other vertexes, which were topologically similar between the two coexpression networks, A and B.
Concurrently, we used matrix A and B obtained from the process of constructing the coexpression network to detect the differential edges. Some of the results are demonstrated in Table 1.  After the above procedure, we obtained two datasets, X and Y, which were the necessary input in algorithm FDvDe, and the number of gene vertices in X and Y is displayed in Figure 5. Next, we used the FDvDe algorithm to fuse these datasets to obtain the final differential edges. Finally, we used these edges to construct a differential network of 1823 genes and 29 370 edges (pairs of genes).

Topological properties
Through the analysis of the differential coexpression network, we discovered that the network satisfied the properties of scale-free and power law which are common in a complicated network, indicating that our method of constructing the network was rational. The degree distribution and the shortest path lengths are described in Figure 6. The top 50 genes with significant topological  property in the set of seed nodes are displayed in Table 2. The connectivity densities among these 64 clusters are displayed in Figure 7(B).

GO term hierarchical clusters and connectivity among them
In the differential coexpression network, all the 1823 genes annotated by GO terms were classified into 64 clusters shown in the heatmap of Figure 7(A). The quantitative distribution of these clusters is shown in Figure 8. The clusters classified by the GTHC algorithm were significantly highly rich in the biological themes. We performed the gene functional enrichment analysis using the bioinformatics tool DAVID [24,25], and the most important results are presented in Table 3. We performed further analysis to study the connectivity among these 64 clusters. The connectivity densities are displayed in Figure 7(B), in which we could discover that clusters 44 and 63 had the closest connection with other clusters. In order to simplify the research and detect the most meaningful and significant genes in the functional clusters, we removed the connectivity of clusters whose value is smaller than 4.57 (P-value < 0.01). With the above approach, we discovered that most of the edges existed between the two different clusters; only cluster 31 and 44 had self-correlation. According to this phenomenon, we could conclude that most of the differentially coexpressed gene pairs tended to regulate the pathway or molecular functions existing between the functional clusters. This conclusion is similar to the functional explanations of genetic correlations in previous researches [26,27]. The above phenomenon is displayed in Figure 7(C). Further, for more detailed analysis of the association relationship, we magnified Figure 7(C) to describe the correlation of clusters 52, 35 and 63 in Figure 7(D).
In these 64 clusters, there were a large number of genes significantly specifically expressed in the differential coexpression network. Cluster 9 included SPDEF, FOXA1, XBP1 and TFE3, which were differentially expressed in breast cancer [28]. We also found that FOXN2, FOXN3 and FOXP2 were included in the family of FOXA1 and the ARID3A having the same function annotation with ARID (AT-rich interaction domain). Cluster 13 included PAPSS1, AK3, AK5, CKB, HSPA12B, HYOU1, TK2 and UCK2 belonging to the same function term of adenosine triphosphate (ATP)-binding, in which PAPSS1 has been reported to further enhance MTA1s-mediated repression of ligand-ER transactivation function [29]. Meanwhile, cluster 42 included ACCAA2, ACACB, ACSM5, ACSS2 and ACSS3, five genes 100% enriched in the functional term of ATP-binding. Some observations suggest ACSS2 to be expressed to a significant extent in particular tumour types, including triple-negtive breast cancers [30]. Cluster 50 included three genes, USP19, USP43 and USP54, regulating the same pathway 'thiol-dependent ubiquitinyl hydrolase activity.' Cluster 56 included ITIH5, SERPINA3, SERPINB5, SERPINF1, SPINT1, SPINT2, TIMP2, TIMP4 and WFDC2 belonging to the same term shown in Table 3, in which the SERPINA family genes were studied by de Ronde et al. [31]. We also discovered seventeen genes SLC11A2, SLC25A16 etc belonging to the SCL family, which is significantly expressed in breast cancer [31]. Cluster 63 included PKIB, SFN and YWHAG, in which SFN may account for some of the unexplained multiple-case breast cancer families [32].
Recently, genetic interactions have been discovered to include the between-pathway mode, in which the correlation of gene pairs regulates two different pathways, and the within-pathway mode, in which the connections of gene pairs operate within a single pathway [33]. A high proportion of the interactions existing in the differential coexpression network are of the between-pathway mode [34]. Consequently, we studied the correlation among the 64 clusters, and we chose the clusters among which the connectivity densities are greater than the threshold, and constructed the clusters correlation network shown in Figure 7(C). With this approach, we detected some hub genes that regulate the network by the between-pathway mode. For example, YWHAG, was identified only in cluster 63, and linked to the other 11 clusters of the 14 ones which have interactions with cluster 63. The YWHAG gene has been testified to be important in understanding the breast cancer mechanism [35]. The BTD in cluster 44 was shown to link to 10 of all the Table 2. Top 50 genes in the set of seed nodes. 11 clusters which have interactions with cluster 63, and it was verified to be a potential breast cancer biomarker in plasma [36]. FAAH2 widely existing in the same cluster needs to be researched. Recently, CTNNB1, which was included in cluster 52, was researched by Tanabe et al. [37] in cancer and stem cells. When we mapped the hub genes to the set of seed nodes, as expected, most of the hub genes obtained by the employed approach had a significant meaning of the topological properties in the differential coexpression network.

Identification of hub genes based on the GO term and the topological properties
With the approaches described above, we identified a number of genes that existed in the clusters that contribute to different biological processes and we mapped the obtained genes to the set of seed nodes. In the analysis of topological property, we had all the 1823 items in the set of seed nodes (the bigger the number is, the more important the vertex is). Expectedly, most of the hub Figure 7. GO term hierarchical clusters and their relationships in differential coexpression networks.
genes detected in the clusters had an important position in the seed nodes. For example, SCNN1A (sodium channel epithelial 1 alpha subunit), which has been previously studied [38], existed in cluster 58 and its location was 1819 of the whole 1823 items. The PPAP2C gene (phosphatidic acid phosphatase type 2C) existed in cluster 39 and its location was 1815. The PPAPDC1B gene (phosphatidic acid phosphatase type 2 domain containing 1B) having the same function as PPAP2C, has been reported to be specifically expressed in HER2+ tumours [39]. Overall, these results verified that the genes identified by the method proposed in this paper were related to breast cancer and have very important topological properties in the differential coexpression network.

Conclusions
Although in the last ten years, a number of various methods and algorithms have been proposed to construct differential coexpression networks and identify the functional clusters and hub genes, there is still no gold standard. In this paper, we developed two novel algorithms, FDvDe and GTHC. The former one aimed to obtain the final set of differentially coexpressed edges (pairs of genes) considering both the differences in the correlation of gene expression and the neighbours of the same vertex under different conditions, whereas the latter one was proposed to detect the functional clusters based on the GO term semantic similarity matrix. Then, we conducted integration analysis of the GO term semantic similarity and topological properties in the differential coexpression network. With these approaches, we developed a rational and effective means to construct the differential coexpression network, while the research frame reflected in this experiment to detect the hub genes and testify their statistical significance and topological properties aimed to present a novel research idea to analyse the mechanism of pathogenesis of breast cancer. We believe that the analysis of differential coexpression networks will provide data for future studies on the mechanisms of cancer.  Note: The clusters are in column 'Cluster'; column 'Term' represents the function or pathway that the cluster mainly regulates belonging to the biological category shown in the column 'Category''; column 'Count' represents the count of genes in the corresponding term, while '%' represents the count divided by the number of all the genes in the cluster. The 'P-value' reflects the statistical significance.