Advanced search
1,697
Views
6
CrossRef citations to date
0
Altmetric
Research Paper

Identification of epigenetic modulators in human breast cancer by integrated analysis of DNA methylation and RNA-Seq data

, &
Pages 473-489
Received 15 Feb 2018
Accepted 19 Apr 2018
Accepted author version posted online: 25 Jun 2018
Published online: 07 Aug 2018

ABSTRACT

Human tumors undergo massive changes in DNA methylation. Recent studies showed that site-specific methylation of CpG sites is determined by the DNA sequence context surrounding the CpG site, which alludes to a possible mechanism for site-specific aberrant DNA methylation in cancer through DNA-binding proteins. In this paper, DNA methylation data and RNA-Seq data of breast tumors and normal tissues in the database of The Cancer Genome Atlas (TCGA) were integrated with information of DNA motifs in seven databases to find DNA-binding proteins and their binding motifs that were involved in aberrant DNA methylation in breast cancer. A total of 42,850 differentially methylated regions (DMRs) that include 77,298 CpG sites were detected in breast cancer. One hundred eight DNA motifs were found to be enriched in DMRs, and 109 genes encoding proteins binding to these motifs were determined. Based on these motifs and genes, 63 methylation modulator genes were identified to regulate differentially methylated CpG sites in breast cancer. A network of these 63 modulator genes and 645 transcription factors was constructed, and 20 network modules were determined. A number of pathways and gene sets related to breast cancer were found to be enriched in these network modules. The 63 methylation modulator genes identified may play an important role in aberrant methylation of CpG sites in breast cancer. They may help to understand site-specific dysregulation of DNA methylation and provide epigenetic markers for breast cancer.

Introduction

Methylation of cytosine bases in DNA mainly occurs in the context of CpG dinucleotides and are stable with successive rounds of cell division [1]. During normal development, most methyl groups derived from the gametic DNA are first erased following fertilization [2]. Then, at the time of implantation, there is a wave of denovo methylation that modifies almost all CpGs in the genome except for CpG islands. Following implantation, changes in methylation take place in a site-specific manner and can involve either denovo methylation or demethylation of some genes [2]. In normal cells, most CpGs are methylated, and methylation occurs mainly in low-density CpG regions. Genomic regions termed CpG islands that contain high CpG and C:G content are typically unmethylated. However in human cancer cells, extensive hypomethylation occurs in low-density CpG regions [3], particularly in long blocks corresponding to lamin-associated domains (LADs) and regions of large organized chromatin lysine modifications (LOCKs) [4], whereas hypermethylation occurs in a locus-specific manner in CpG islands [5] and CpG island shores [6]. As CpG islands are frequently located in the promoter regions of vertebrate genome [7], hypermethylation of such CpG islands in the promoter regions can silence expression of genes [8]. Epigenetic silencing is one important mechanism by which genes encoding for tumor suppressors, DNA repair enzymes, and proteins involved in other cellular/regulatory pathways, are inactivated in human cancers, which implies that epigenetic silencing may contribute to cancer initiation and progression. Therefore, understanding how site-specific DNA methylation is dysregulated in cancer is very important to the understanding of cancer biology.

Several recent studies [912] have demonstrated that cis-acting DNA motifs play an important role in regulating site-specific DNA methylation. Particularly, the study in [9] demonstrated that when various 1,0001,000 bp long sequences were inserted into the genome of mouse embryonic stem (ES) cells, these sequences recapitulated the DNA methylation patterns naturally found in ES cells. They also found that, by systematically reducing the length of inserted sequences below a certain length, the methylation signature could not be recapitulated. Moreover, mutations in the DNA-binding motifs in these sequences modulated the methylation level. Therefore, this study clearly showed that the underlying genetic sequence and DNA motifs in the sequences play a key role in regulating methylation of nearby CpG sites. Recent studies on methylation quantitative trait loci (meQTL) [1316] found that naturally occurring genetic variations were associated with DNA methylation levels at proximal CpG sites. This provides additional evidence that methylation of a CpG site depends on its surrounding DNA sequence context.

Although these studies [916] have provided convincing evidences that site-specific methylation of CpG sites is determined by the underlying DNA sequences, and is plausibly regulated by the proteins that bind to the motifs in the DNA sequences, the role of such DNA motifs and their binding proteins in aberrant DNA methylation in cancer has not been studied. This work aims to fill this gap by integrating methylomes and transcriptomes of both breast tumor and normal tissues available in the TCGA data portal with DNA-binding proteins and their binding motifs in 7 databases to find DNA motifs and their associated binding proteins that are involved in aberrant DNA methylation in breast cancer.

Results

Differentially methylated CpG sites in breast tumors

A total of 42,850 differentially methylated regions (DMRs) were found at false discovery rate (FDR) 103 from the 450K methylation microarray of 94 breast invasive carcinoma (BRCA) samples and 94 matched normal tissue samples, and these DMRs include 77,298 CpG sites. Among these DMRs, 12,853 DMRs are hypermethylated in tumors, i.e., their methylation levels are increased in tumors compared with those in normal tissues; 29,997 DMRs, on the other hand, are hypo-methylated in tumors, i.e., their methylation levels are decreased in tumors compared with those in normal tissues. Distribution of DMRs in different genomic regions including CpG islands, CpG shores, and open seas is showed in Table 1. CpG islands are regions of more than 500 bp whose G/C content is greater than 55% [17]; CpG shores are the regions of comparatively low CpG density, located 0–2 kb to CpG islands; and CpG open seas are regions 4 kb away from any CpG islands [18]. Fisher’s exact test based on the distribution of DMRs in Table 1 shows that hypermethylated DMRs are enriched in CpG open seas (P value < 2.2 x 10–16) and depleted in CpG islands (P value < 2.2 x 10–16). Hypomethylated DMRs exhibit a similar trend, as they are enriched in CpG open seas (p value < 2.2 x 10–16) and depleted in CpG islands (P value < 2.2 x 10–16). The distribution of DMRs in different gene regions is depicted in Table 2. Hypermethylated DMRs are in enriched in gene body (P value < 2.2 x 10–16), and depleted in promoter regions (P value < 2.2 x 10–16). Similarly, hypomethylated DMRs are enriched in gene body (P value < 2.2 x 10–16), and depleted in promoter regions (P value < 2.2 x 10–16).

Table 1. Distribution of DMRs in different genomic regions.

Table 2. Distribution of DMRs in different gene regions.

Motifs of DNA-binding proteins enriched in DMRs

A recent study [9] showed that mutations in protein-binding motifs in the DNA sequence surrounding a CpG site modulate the methylation level of the CpG site, and that DNA sequence of about 1,000 bp around a CpG site determines the methylation of the CpG site. Therefore, we hypothesize that there are certain DNA motifs present in the 1,000 bp long sequence around each DMR, and that methylation of CpG sites in an DMR is potentially regulated by proteins binding to those DNA motifs. To find these DNA motifs, we first performed clustering analysis to group correlated DMRs into clusters and then searched for DNA motifs enriched in each cluster. The reason for the clustering analysis is that correlated methylation of DMRs in a cluster is likely due to co-regulation by certain DNA-binding proteins, and, therefore, it is more likely to detect motifs of those DNA-binding proteins in a cluster. Figure 1 shows the methylation data of 42,850 DMRs in the 188 tissue samples. The average methylation level of all CpG sites in each DMR is shown. Two clusters, one hypermethylated cluster and another hypomethylated clusters, are visible. However, at a cutoff value 0.6 for the Pearson’s correlation, these two clusters are further divided into 4,211 clusters, of which 66 clusters contain 100 DMRs.

Figure 1. Methylation data and the dendrogram from hierarchical clustering of DMRs. Each row represents average methylation level of CpG sites in one of 42,850 DMRs across 188 samples. Two clusters of DMRs are visible: one hypermethylated cluster and the other hypomethylated cluster in cancer. However, 66 different clusters with 100 DMRs were found at a cutoff value of 0.6 for the Pearson’s correlation between methylation levels.

The FIMO algorithm [19] was used to find DNA motifs enriched in the 1,000 bp long DNA sequences surrounding DMRs in each of the 66 clusters. The IDs of the CpG sites in each of the 66 DMR clusters are listed in supplemental file S1. This identified 108 DNA motifs and 109 proteins binding to these motifs, which are listed in supplemental file S2. While most motifs are bound by one protein, there are three motifs bound by complexes of two or three proteins. The top 10 motifs bound by one protein and the names of genes encoding their binding proteins are given in Figure 2. All of these 10 genes except two (STAT1 and SP1) are methylation modulator genes, as will be shown in the next section, and all 10 genes are implicated in several types of cancer, as will be elaborated in the Discussion section.

Figure 2. Top 10 DNA motifs significantly enriched in DMRs and genes that encode the proteins binding to the motifs.

Modulator genes for DMRs

The 109 proteins that bind to 108 motifs significantly enriched in the DMRs potentially regulate the methylation of CpG sites in the DMRs. In order to determine if these DNA-binding proteins are functionally relevant in regulating DNA methylation, we used a linear regression model to test the correlation between the expression level of the gene encoding each of these 109 proteins and the methylation level of CpG sites in the cluster of DMRs where the motif of the protein is enriched. Among the 188 tissue samples used in detecting DMRs, 171 samples have both RNA-Seq and DNA methylation data. The gene expression and DNA methylation data of these 171 samples were used in the correlation analysis. Figure 3 shows the result of correlation analysis for 16 genes that yielded smallest P values. The correlation between expression levels of these genes and the methylation levels of corresponding DMRs is highly significant with a P value less than 1021. Overall, this test identified 79 genes whose expression levels are significantly correlated with methylation levels of corresponding DMRs at an FDR0.01. We also tested the correlation between the expression levels of these 79 genes and the methylation levels of the 11,469 non-differentially methylated CpG sites using the 171 data samples, and no significant correlation was found.

Figure 3. Correlation analysis on the expression level of the gene encoding a DNA-binding protein versus the methylation level of corresponding DMRs. The x-axis is the expression level (log2 TPM value) of a gene, and y-axis is the average methylation level (M-value) of CpG sites in a DMR cluster. Shown in the figures are 171 data points from 171 samples and the line fitted with the linear regression model.

In addition to the 171 data samples (referred to as the training data) used in the correlation analysis, there were 327 tissue samples that had both RNA-Seq and DNA methylation data. The data of these 327 samples (referred to as the validation data) were used as independent data to validate the correlation between the expression levels of the 79 genes and the methylation level of the corresponding DMRs. The correlations of 63 out of the 79 genes were still significant in the validation data. Therefore, we named these 63 genes as DNA methylation modulator genes, in line with the definition of the epigenetic modulator in [20]. Names of 16 of these 63 methylation modulator genes are shown in Figure 3, and the full list of 63 genes and their binding motifs is shown in supplemental file S3. Eight of the 10 genes whose binding motifs are enriched in DMRs as listed in Figure 2 (excluding STAT1 and SP1) are among the 63 methylation modulator genes. A recent study on the somatic mutations in 560 breast cancer genome sequences identified 93 genes that carry cancer driver mutations [21]; 18 of these cancer driver genes are among the 964 DNA-binding protein encoding genes that we compiled and used to determine if their binding motifs were enriched in DMRs. Interestingly, 3 of the 18 genes (ESR1, FOXP1, SMAD4) are included in our 63 methylation modulator genes, and as shown in supplemental figure S1, their expression levels are significantly correlated with the methylation level of their corresponding DMRs, although whether and how the mutations in these genes affect the CpG methylation levels await future studies.

In order to see if the expression levels of DNA methylation modulator genes can predict methylation levels of corresponding CpG sites, we employed a multiple linear regression model to fit the average methylation of CpG sites in each of the 66 DMR clusters to the expression levels of modulator genes whose binding motifs were enriched in the DMR cluster. We first fitted the model with the training data, and calculated the adjusted R-squared value and the P value of the F-test for the model fitness. We then used the model learned from the training data to predict the average methylation level of each DMR cluster using the validation data, and calculated the predicted R-squared value, and also the P value of the F-test. The adjusted R-squared values, the predicted R-squared values, the P values of the F-test for 5 DMR clusters are shown in Table 3, and the results for the full list of the 66 DMR clusters are in supplemental file S1. As shown in Table 3, the adjusted R-squared values and predicted R-squared values for these 5 DMR cluster are high, greater than, or equal to 0.69 and 0.40, respectively; and P values for the model fitness are very small (1.79×1032), while the number of predictors for each DMR cluster is relatively small (7). The data in supplemental file S1 show that the maximum and minimum adjusted R-squared values of the training data are 0.88 and 0.36, respectively; the maximum and minimum predicted R-squared values of the validation data are 0.62 and 0.06, respectively. More than 80% of the adjusted R-squared values are greater than 0.62, and more than 80% of the predicted R-squared values are greater than 0.137. The P values of the fitness test with the training data for all 66 DMR clusters are less than 3×1014. With the validation data, the fitness test did not yield significant result (P value 0.129) for cluster 13, and yielded P values 1.21×103 for the other 65 clusters. These results indicate that the expression levels of relevant genes have good power of predicting the methylation level of corresponding CpG sites.

Table 3. Fitness of DNA methylation modulation models. The average DNA methylation level of CpG sites and expression levels of modulator genes in a DMR cluster were fitted to a multiple linear regression model. The adjusted R2 value and the P value of the F-test for model fitness were computed with the training data set. The model learned from the training data was used to predict the average DNA methylation level of each DMR cluster with the validation data, and the predicted R2 value and the P value of the F-test were obtained. The model fitness results for the full list of 66 DMR clusters are in supplemental file S1.

Network modules of methylation modulator genes

In order to better understand the influence of the 63 methylation modulator genes on DNA methylation, we performed a network analysis. A network of 964 genes encoding DNA-binding proteins was constructed with FIMO [19] and ACRANE [22] based on DNA motifs and gene expression data. A sub-network, consisting of the 63 methylation modulator genes and genes that reach at least one methylation modulator gene through paths in the network, was extracted, which yielded a network of 708 genes and 1,275 directional edges. This extended network of 63 methylation modulator genes is depicted in Figure 4. The 63 methylation modulator genes exhibit significantly higher out-degree centrality than other genes (P value = 1.3 x 10–5, Welch two sample t-test), implying that these modulator genes have important regulatory effects on other genes. In this network, 16 genes are regulators of the 63 methylation modulator genes, and 629 are the targets of the 63 methylation modulator genes. The 16 regulator genes are ZFX, TBX1, USF2, RREB1, EGR2, SREBF2, WT1, MNT, TFAP4, ZNF740, SPIC, EGR4, SP1, CLOCK, ETV1 and THRA. The target genes of each of these 16 regulator genes are listed in supplemental file S4. Although these 16 genes do not directly regulate DNA methylation, they may still play an important role in the methylation of CpG sites in DMRs through regulation of methylation modulator genes. Moreover, 15 of the 18 genes encoding DNA-binding proteins that contain cancer driver mutations [21] are included in the 708 genes; they are TP53, MYC, GATA3, CBFB, MDM2, ESR1, FOXA1, BRCA1, CTCF, DNMT3A, FOXP1, XBP1, SMAD4, CUX1, PRDM1.

Figure 4. Extended network of 63 methylation modulator genes. The network consists of 63 methylation modulator genes and 645 genes encoding DNA binding proteins. Dark nodes are methylation modulator genes; and the size of a node reflects the out-degree centrality of the node.

Twenty network modules in the extended network of methylation modulator genes were detected, with a community detection method based on edge-betweenness. The network modules are depicted in Figure 5, and names of genes in each network module are listed in supplemental file S5. In 16 out of 20 modules, one or more methylation modulator genes are hub nodes, implying that the methylation modulator genes may play an important role in regulating other genes. In fact, among 37 hub genes in the 20 network modules, 18 are DNA methylation modulators including the 8 modulators in Figure 2, and 14 are among the 16 regulators of DNA methylation modulator genes. To get insight into the functional impact of these network modules, we searched for pathways that are enriched in each module from the 4,731 C2 human gene sets in the molecular signatures database (MSigDB) [23]. In total, 29 MSigDB C2 gene sets were found to be enriched in the 20 network modules at an FDR0.05. These enriched gene sets are listed in Table 4. The first three gene sets in Table 4 are in fact from the gene network that links breast cancer susceptibility and centrosome dysfunction [24], which was constructed centering around the four known genes encoding tumor suppressors of breast cancer, BRCA1, BRCA2, ATM, and CHEK2. The fourth gene set contains mutated transcription factors regulating genes involved in breast cancer [25]. The next three gene sets are based on the genes associated with ES cell identity [26], and, in breast cancer, the ES-like signature is associated with high-grade estrogen receptor (ER)-negative tumors, often of the basal-like subtype, and with poor clinical outcome. Some other gene sets in Table 4, such as the P53, NFAT, and Smad2&3 pathways are also related to breast cancer.

Table 4. Gene sets enriched in the 20 network modules. Enriched gene sets were found by testing 4,731 MSigDB C2 human gene sets against the gene set of each network module at an FDR0.05. Module ID column lists the IDs of the network modules that the gene set is enriched. The list of genes in each network module is in supplemental file S5.

Figure 5. Network modules in the extended network of 63 methylation modulator genes. Twenty network modules are highlighted. Dark nodes are methylation modulator genes.

Discussion

Recent studies have shown that site-specific methylation of CpGs site is determined by the underlying DNA sequence surrounding a CpG site and the motifs in the DNA sequence. Therefore, it is plausible that such DNA methylation is regulated by the proteins bound to those DNA motifs. However, the role of such DNA motifs and their binding proteins in aberrant DNA methylation in cancer has not been studied. In this paper, we developed a computational pipeline to find genes and their DNA binding motifs that may play a regulatory role in aberrant DNA methylation in breast cancer.

The TCGA DNA methylation data from breast cancer patients and matched normal tissues were used to find a total of 42,850 DMRs that include 77,298 CpG sites. Although some results based on the TCGA DNA methylation data have been published [27], no result about differentially methylated CpG sites was reported. Our result provides the insight into the landscape of dysregulated DNA methylation in breast cancer. Based on the DMRs identified, we found 108 protein-binding motifs, that are enriched in the 1,000 bp long DNA sequences surrounding the DMRs, and 109 genes that encode proteins binding to these motifs. All top ten genes listed in Figure 2 have been reported to be involved in various types of cancer. The MAZ is highly expressed in hepatocellular carcinoma (HCC), which promotes proliferation, invasion and metastasis of HCC cells [28]; it is overexpressed in breast tumors and affects the prognosis of breast cancer by upregulating miR-34a [29]. ZNF263 regulates FoxA1 expression [30], which functions in estrogen-ER signaling pathway and is associated with antiestrogen response in breast cancer [31]. STAT1 is linked to increased invasion and lymph node metastasis in triple-negative breast cancer [32] and poor prognosis of breast cancer [33]. Sp1, Sp3 and Sp4 were reported to individually play a role in the growth, survival and migration/invasion of breast, kidney, pancreatic, lung and colon cancer cell lines [34]. Overexpression of Sp2 downregulates the expression of tumor suppressor gene CEACAM1 in prostate cancer [35], and, in mouse, overexpression of Sp2 increases susceptibility to wound- and carcinogen-induced tumorigenesis [36]. The expression level of ZBTB7 was reported to be significantly correlated with histological grade of breast cancer, and the overexpression of ZBTB7 was associated with shorter recurrence-free survival [37]. Overexpression of TFDP1 was shown in meta-analysis studies [38,39] to be strongly associated with decreased overall survival, relapse-free survival, and metastasis-free interval of breast cancer patients. KLF16 expression level was reported to be strongly correlated with the size, invasion depth, lymphatic metastasis and TNM stage of gastric tumors [40].

The 63 DNA methylation modulator genes may play an important role in dysregulated methylation in breast cancer based on three lines of evidences. First, the binding sites of these genes are significantly enriched in the 1,000 bp long DNA sequences surrounding the DMRs. Second, the expression levels of these genes are significantly correlated with the DNA methylation levels in the clusters of DMRs where the binding motifs of the genes are enriched. These correlations are significant in both training data and independent validation data. Third, the expression levels of these 63 genes are not correlated with the DNA methylation levels of 11,469 non-differentially methylated regions. Of note, one in vitro study [41] showed that induction of the EBF1 gene, one of the 63 methylation modulators, in pro-B cells considerably reduced the methylation level of CpG sites near the DNA binding site of EBF1. The inverse correlation between the expression level of EBF1 and methylation levels of the corresponding DMRs in Figure 3 is consistent with the observation in the in vitro study.

Many of these methylation modulator genes have been reported to be implicated in different types of cancer. Eighteen hub genes in the 20 network modules in Figure 5 are DNA methylation modulators, among which 15 genes have been shown to play a role in carcinogenesis. These 15 genes include the 8 genes (MAZ, ZNF263, SP2, SP3, SP4, ZBTB7B, TFDP1, and KLF16) discussed earlier and the following 7 genes: EGR1, ZNF148, TBX15, ACL2, KLF4, KLF5, and KLF15. EGR1 and WT1 are two hub nodes in the same network module; they are regulators of STIM1 expression [42], which plays an important role in TGF-β-induced suppression of breast cancer cell proliferation [43]. ZNF148 modulates cell proliferation via ceRNA regulatory mechanism in colorectal cancer [44]. Another hub gene, TBX15, in the same module as ZNF148 is downregulated in ovarian carcinoma [45]. ACL2 overexpression is a prognostic indicator in lung squamous cell carcinoma [46]. KLF4, also named GKLF, isupregulated during progression of breast cancer [47]. KLF5 promotes breast cancer proliferation, migration, and invasion [48]. KLF15 expression suppresses breast cancer cell proliferation at least partially through p21upregulation and subsequent cell cycle arrest [49].

A number of other hub genes in Figure 5, for example, NFR1, Sp1, EGR4, RREB1, and ETV1, ZFX, USF2, and TFAP4, are not DNA methylation modulators but are also implicated in cancer. NFR1 is involved in regulating androgen receptor transactivation and oxidative stress in prostate cancer cells [50,51]. As mentioned earlier, Sp1 plays a role in several types of cancer. EGR4 is involved in cell proliferation of small cell lung cancer [52]. It was reported that oncogenic Kras leads to repression of the miR-143/145 cluster in pancreatic cancer and is dependent on the Ras responsive element (RRE) binding protein (RREB1), which negatively regulates miR-143/145 expression [53]. ETV1 is an androgen receptor-regulated gene that mediates prostate cancer cell invasion [54]. ZFX is overexpressed in breast cancer, which positively correlates with tumor metastasis [55], and overexpression of ZFX promotes cell proliferation, migration, and invasion in gallbladder cancer [56]. A partial or complete loss of USF2 function is a common event in breast cancer cell lines [57]. High expression of TFAP4 in primary neuroblastoma patients was associated with poor clinical outcome and suppression of TFAP4 in MYCN-expressing neuroblastoma cells impaired migration and colony formation [58].

The majority of the gene sets in Table 4, which are enriched in the network modules in Figure 5, are related to breast cancer. As mentioned earlier, the first six gene sets were obtained from genes associated with breast cancer. The P53 gene is a well known cancer repressor. Recent studies point to an important role for NFAT in modulating invasive migration, particularly in breast cancer [26]. Smad2 and Smad3 are involved in breast cancer bone metastasis by differentially affecting tumor angiogenesis [59]. Loss of VDR leads to increased incidence, higher tumor burden, and more aggressive phenotypes of cancer [60], and VDR polymorphisms are associated with breast cancer [61]. SMRT also known as NCOR2 is associated with tamoxifen resistance of breast cancer and control of ERα transcriptional activity [62]. Interestingly, seven gene sets related to the MAPK pathway and three gene sets consisting of genes with high-CpG-density promoters bearing histone H3 trimethylation at K27 (H3K27me3) and possible dimethylation at K4 (H3K4me2) were enriched in our network modules. These results indicate that the 63 methylation modulator genes and genes in network modules, particularly hub genes, in Figure 5 are highly relevant in breast cancer and/or other types of cancer.

Materials and methods

Overview of the computational pipeline

Figure 6 depicts a flow chart of the computational pipeline for identifying modulator genes that are involved in dysregulated DNA methylation in breast cancer. In the first data preprocessing step, Illumina 450K methylation microarray intensities of both breast tumors and normal tissues are normalized; the batch effect is removed, and the effect of unknown and unmeasured variables is also removed with surrogate variable analysis (SVA). In the second step, DMRs in tumors relative to normal tissues are detected. In the third step, co-regulated DMRs are determined with hierarchical clustering, and the motifs of DNA binding proteins that are significantly enriched in each cluster of DMRs are identified. In the fourth step, gene expression levels of those proteins whose binding motifs are enriched in DMRs are correlated with the DNA methylation, and based on such correlation, methylation modulator genes for DMRs are determined. Finally, network analysis is performed to find network modules connected to modulator genes. More detailed description of each step is given in the following.

Figure 6. Computational pipeline for identifying modulator genes involved in dysregulated DNA methylation in breast cancer.

Data preparation

The level 1 DNA methylation data of 516 BRCA tumors and normal tissues available in the TCGA database were downloaded using the TCGA-Assembler [63]. The clinical information of the same patients was also downloaded. The level 3 RNA-Seq data of 498 out of the 516 tissue samples available in the TCGA data portal were downloaded again with the TCGA-Assembler. The software motifDb [64] was employed to access all the DNA motifs of DNA-binding proteins in seven databases including hPDI [65], JASPAR [66], UniPROBE [67], StamLab [68], Jolma [69], CIS-BP [70], and Hocomoco [71]), which yielded 3,395 position weighted matrices (PWMs) for DNA motifs of 964 genes encoding DNA binding proteins. Interestingly, all 964 genes are transcription factors (TFs). Apparently, many PWMs are redundant, meaning that they describe the same motif. Nevertheless, all 3,395 PWMs were used in our analysis.

Preprocessing of DNA methylation data

Each level 1 TCGA DNA methylation data sample contains raw intensities of 485,512 485,512 CpG sites generated from the Illumina 450K methylation microarray. The raw data were preprocessed with software minifi [72] and normalized with SWAN [73] to obtain the methylation percentages (β-values) of 485,512 CpG sites. All β-values were transformed to M-values, as M-values yield more statistically robust results in finding differentially methylated CpG sites [74]. The batch effect in the DNA methylation data was removed with computational method ComBat [75]. More specifically, methylation data samples, the batch information (obtained from the ‘plate’ field in the TCGA bar code of each sample), and other covariates including age and tissue statues were input to the ComBat function. Additive and multiplicative batch parameters in the L/S model were estimated, and then the M-values were adjusted with these estimated parameters to correct the batch effect. To further improve accuracy and robustness of the downstream analysis, we employed the surrogate variable analysis (SVA) [76,77] to remove the effect of unknown and unmeasured variables from the methylation data. The data matrix output from ComBat and known variables including age and issue status were used by the SVA algorithm to generate values of surrogate variables, which were then used in the downstream analysis.

Detection of differentially methylated regions

Among 516 DNA methylation data samples, 94 tumor samples have matched normal tissue samples. These 188 samples were used to detect DMRs. The TCGA IDs of these 188 samples are listed in supplemental file S6. Methylation levels of CpG sites spatially closed to each other are strongly correlated [78]. Experiments showed that the methylation level of a CpG site is determined by its surrounding DNA sequence of about 1,000 bp long, which implies that methylation of CpG sites with a window of about 1,000 bp may be co-regulated, and thus methylation levels of these CpG sites may be strongly correlated. While we can detect differentially methylated CpG sites individually, correlation among methylation levels of nearby CpG sites can be exploited to improve detection power. To this end, we employed the a-clustering algorithm [79] to cluster CpG sites into co-regulated methylated region, if 1) CpG sites are within a window of 1,000 bp, and 2) Pearson’s correlation between each pair of CpG sites is0.6. Note that different regions may contain different number of CpG sites, and some regions may contain only one CpG site. We then detect DMRs as follows.

The following linear regression model is used to model the methylation level of each CpG site: (1) Mik=γi0+skβi+j=1mxjkγij+εik,(1)

where Mik is the methylation level of the ith CpG site in the kth sample, γi0 is the mean methylation level of the ith CpG site across different samples, sk represents the status of sample k (sk=1 if the sample is tumor and otherwise, sk=0), regression coefficient βi determines whether the ith CpG site is differentially methylated, xjk, j=1,,m, are m covariates including measured covariates such as age and surrogate variables identified in SVA, γij, j=1,,m, are regression coefficients, and εik is the residual error. Model (1) was inferred with the Limma algorithm [80].

Based on the co-methylated regions determined from the a-clustering algorithm and the βi‘s estimated from model (1) with Limma, the bump hunting algorithm [81] was employed to find DMRs. Specifically, each co-methylated region was assigned a new test statistic iMRjβi, where MRj stands for the jth co-methylated region. The null distribution of this statistic was determined by random permutation of the disease status of samples, and DMRs were found at a false discovery rate (FDR) of 103, which yielded a total of 42,850 DMRs. The test statistic was also used to identify a set of non-differentially methylated regions, if the test statistic yielded a q-value0.9. This gave a set of 11,46911,469 non-differentially methylated regions, which will be used later in the identification of DNA methylation modulator genes.

Identification of DNA motifs enriched in DMRs

The average methylation level of the ith DMR in the jth sample, mij, is calculated as mij=1nikDMRiMkij, where Mkij is the methylation of the kth CpG site in the ith DMR and the jth sample, and ni is the number of CpG sites in the ith DMR. This resulted in a 42,850×188 42,850 data matrix that includes mij values of 42,850 DMRs in the 188 samples. This data matrix was employed by the agglomerative hierarchical clustering method to cluster DMRs into clusters, based on the dissimilarity metric dij=0.5ρij/2, where ρij is the Pearson’s correlation between the ith and the jth rows of the data matrix. The average linkage dissimilarity was used to agglomerate DMRs into clusters. A cutoff value of 0.2 for dij, which corresponds to a value of 0.6 for ρij, was adopted to determine clusters. This resulted in 4,211 clusters, of which 66 clusters contain 100 or more DMRs.

The FIMO algorithm [19] was employed to search for motifs of DNA-binding proteins that are significantly enriched in each of the 66 clusters with 100 or more DMRs. Specifically, the DNA sequence of 1,000 bp of each DMR was extracted from the human genome. The 1,000 bp long DNA sequences of all DMRs in a cluster formed a target set where enriched motifs would be found. Similarly, the DNA sequence of 1,000 bp of each non-differentially methylated region identified earlier was extracted from the human genome, and DNA sequences of all 11,46911,469 non-differentially methylated regions formed a background set. The target set of DNA sequences together with the PWM of the motif of each DNA-binding protein were input to the FIMO algorithm to find occurrences of the motif at a P value cutoff of 105. Similarly, occurrences of the motif in the background set were found with FIMO. The binding sites of several TFs, including CTCF, E2f1, Nanog, nMyc, and Smad1, across the whole genome were identified with Chip-Seq [82]. The P value cutoff of 105 was chosen such that the numbers of binding sites of these five TFs found by FIMO across the whole genome are close to the numbers reported in [82]. Based on the numbers of occurrences of each motif in the target and background sets, the Fisher’s exact test was employed to find motifs significantly enriched in the target set at an FDR0.01. This identified 108 motifs significantly enriched in DMRs, and 109 genes that encode proteins binding to 108 motifs.

Determination of modulator genes for DMRs

The RNA-Seq data of 171 samples out of the 188 samples used to detect DMRs are available in the TCGA data portal; the level 3 RNA-Seq data of these 171 samples were downloaded. The scaled estimate of gene expression levels were multiplied by 106 to obtain transcripts per millions (TPM) values, and then a logarithm transformation was performed on these values. The batch effect was removed with the Limma software package using the batch information in the TCGA bar code of each sample. The expression levels of the 109 genes, encoding the DNA-binding proteins whose motifs were enriched in DMRs, were extracted. Suppose that the expression level of one of such genes in the jth sample is zj. The following linear regression model was employed to test the correlation between the expression level of the gene and the methylation levels of CpG sites: (2) mj=μ+zjβ+ej,(2)

where mj is the average methylation level of all CpG sites in a DMR cluster in the jth tissue sample. The P value of the hypothesis test β=0 was calculated, and the significant correlation was determined at a FDR0.01. This process identified 79 genes whose expression levels were significantly correlated with the methylation levels in the corresponding DMR clusters.

Two more tests were performed on the 79 genes to ensure that reliable methylation modulator genes were determined. First, the correlation between the expression levels of these genes and the methylation levels of non-differentially methylated regions were tested using the 171 data samples. This test did not find any significant correlation. Second, another data set independent of the 188 samples were used to test the correlation between the gene expression level and the DNA methylation level. Excluding the 188 samples from the 516 tissue samples, we had 328 samples, among which 327 samples had both RNA-Seq data and DNA methylation data. The TCGA IDs of these 327 samples were listed in supplemental file S7. These 327 data samples are referred to as the validation data, and the 188 data samples used earlier are referred to as the training data. The linear regression model (2) with the 327 validation samples was employed to validate the correlation between the expression level of each of the 79 genes and the DNA methylation level in the DMRs that the binding motif of the gene was enriched. This test identified 63 out of 79 genes whose expression levels were still significantly correlated with the DNA methylation level of the corresponding DMR, and these 63 genes were determined to be modulator genes for differential DNA methylation.

DNA methylation modulator genes whose binding motifs were enriched in each of the 66 clusters of DMRs were identified. If there are n such genes in a DMR cluster and zjk is the expression of the kth such gene in the jth tissue sample, let mj be the average DNA methylation level of all CpG sites of the DMR cluster in the jth tissue sample. Then, the methylation and gene expression data of 171 training samples were used to fit the multiple linear regression model mj=μ+sjβ0+k=1nzjkβk+ej, where sj is the disease status of the jth sample defined earlier. The adjusted R-squared value was calculated and the F-test was performed to assess the fitness of the multiple regression model. The regression model with the regression coefficients estimated from the training data was then employed to predict the average methylation level of each DMR cluster based on the validation data. The prediction error was used to calculate the adjusted R-squared value, which was referred to as the predicted R-squared value, and the F-test was performed to assess the fitness of the model with the validation data.

Gene network analysis

For the 964 DNA binding proteins that we compiled from 7 databases, a network was constructed with FIMO. Specifically, the DNA sequence of 900 bp, starting at 700 bp before the transcription start site (TSS) and ending at 200 bp after the TSS, of each gene encoding a DNA binding protein was extracted from the human genome. For each of 964 genes, FIMO was employed to search over the 900 bp long DNA sequences of all other 963 genes to determine if the DNA motif of the gene is present. The presence of the motif was determined at a P value of 10–5105. A directional edge from gene i to gene j was established if the motif of gene i was found to be present in the promoter region of gene j. This constructed a network of 964 genes.

The expression levels of 964 genes were extracted from the RNA-Seq data using the same procedure described earlier for extracting the expression levels of methylation modulator genes. A network of these 964 genes was inferred from their expression levels using the ACRANE algorithm [22] with a p value threshold of 107 for the mutual information and a tolerance equal to 0.1. The network created by FIMO (named FIMO network) was modified with the network created by the ACRANE (named ACRANE network). Specifically, an edge between two genes in the FIMO network was removed if no edge existed between the same two genes in the ACRANE network. This yielded a network of 964 genes for further analysis.

A subnetwork of the 63 methylation modulator genes was extracted from the network of 964 genes by eliminating all genes that cannot reach at least one methylation modulator gene through paths in the network. This yielded a network of 708 genes and 1,275 edges. Network modules in this gene network were identified with the community detection algorithm based on edge betweenness [83]. The idea of edge betweenness-based community detection is that edges connecting separated modules have high edge betweenness. Therefore, edges were removed one by one in the decreasing order of the edge betweenness until reaching the maximal modularity score 0.5. This resulted in the optimal separation of vertices and 20 network modules.

A gene set enrichment analysis was performed to find pathways and gene sets that are enriched in each of 20 network modules. The C2 gene sets in the molecular signatures database (MSigDB) [23] were downloaded. The C2 gene sets contain 4,731 curated human gene sets that are from major pathway databases such as KEGG, BIOCARTA, and REACTOME, and other canonical pathways. The Fisher’s exact test was employed to test all 4,731 MSigDB gene sets against the gene set of each network module as the target set and all 20,532 genes excluding the genes in the network module as the background set.

Supplemental material

Supplemental Material

Download Zip (3151 KB)

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplemental data for this article can be accessed here.

Additional information

Funding

This work was supported by the National Institute of General Medical Sciences under Grant 5R01GM104975 to XC.

References

 

Related research

People also read lists articles that other readers of this article have read.

Recommended articles lists articles that we recommend and is powered by our AI driven recommendation engine.

Cited by lists all citing articles based on Crossref citations.
Articles with the Crossref icon will open in a new tab.