Modelling the gene expression and the DNA-binding in the 3T3-L1 differentiating adipocytes

ABSTRACT The 3T3-L1 pre-adipocyte cell line is widely used to study the fat cell differentiation in vitro. Researchers also use this cell model to study obesity and insulin resistance. We surveyed the literature, the gene expression omnibus and the sequence read archive for RNA-Seq and ChIP-Seq datasets of MDI-induced 3T3-L1 differentiating cells sampled at one or more time points. The metadata of the relevant datasets were manually curated using unified language across the original studies. The raw reads were collected and pre-processed using a reproducible state-of-the-art pipeline. The final datasets are presented as reads count in genes for the RNA-Seq and reads count in peaks for the ChIP-Seq dataset. The curated datasets are available as two Bioconductor experimental data packages curatedAdipoRNA and curatedAdipoChIP. In addition, the packages document the source code of the data collection and the pre-processing pipelines. Here, we provide a descriptive analysis of the datasets with context and technical validation.


Introduction
The 3T3-L1 cell line is used as a cell model for studying the fat cell differentiation [1]. This adipocyte differentiation model has many applications in obesity and insulin-resistance research such as lipid synthesis, white vs brown adipose tissue development, insulinsensitizing drug action [2][3][4]. The most commonly investigated aspect of the molecular biology of this cell line is the gene expression and chromatin binding at the different stages of differentiation. The development of the phenotype is achieved through certain transcription factors which drive a well-defined transcriptional program [5]. High-throughput sequencing technologies are used to model the connection between gene expression and chromatin in the transcriptional regulation. The availability of sufficient sample size and good quality datasets is a necessity for successful modelling.
The increasing amount of available high-throughput sequencing data necessitates the development of standards for sharing and integrating data. The creators of the primary data are often required to adhere to the standards of the repository where they report and share the data. The development of across repositories metadata standards, ontologies and controlled vocabularies has been attempted to help researchers to share the data they generate and use the data generated by others [6]. Despite the fact that these attempts are general in purpose and intended to work across different data types, we found them to be useful in curating the metadata of the specific adipocyte model [7]. In particular, we used standard model metadata such as induction media, culturing time and the antibodies to encode the metadata necessary for understanding the experimental design across different studies. Confounder metadata such as library type and machine model has been also recorded to facilitate the analysis of the data.
We surveyed the literature, the gene expression omnibus (GEO) and the sequence read archive (SRA) for RNA-Seq and ChIP-Seq datasets of differentiating 3T3-L1 cells sampled at one or more time points [8,9]. The metadata of the relevant datasets was manually curated using unified language across the different studies. The data were processed using an updated reference genome and annotation. The final product was packaged in a versatile object format that allows for any number of downstream analysis. The curation of a large number of samples from similarly designed experiments can be useful [10]. In addition, these datasets were processed in standard pipelines to allow combining, comparing and integrating data from different studies or sources. In other words, this work increases the utility of the datasets by providing data ready for exploring and testing hypotheses.
In this article, we present the methods that were used to collect and process the raw data and provide a technical validation of the final product. We start by describing the cell line model and the induction protocol for adipocyte differentiation. Then, we state the search strategy and the inclusion criteria of the studies. Next, we present the steps and the tools for obtaining and processing the raw data. Finally, we provide links to the software environment and the code for reproducing the full process. Using a subset of the data, we perform several technical validation analyses. First, we check the separation of the samples by phenotype in multidimensional scaling (MDS) and the sample replication using similarity measures. Second, we describe the expression and binding patters of adipocyte markers and enrichment of gene sets which are expected true biology for this model. Moreover, we compared several aspects of the model to human primary adipocytes. Together, the descriptive analysis provides an assessment for the validity of the model and the appropriateness of the curation process.

3T3-L1 differentiation protocol
3T3-L1 is a mouse pre-adipocyte cell line which can be induced to differentiate into mature adipocytes when it is treated with a chemical cocktail. The most commonly used variant of these chemical cocktails contains 1-Methyl-3-isobutylxanthine, Dexamethasone and Insulin (MDI). The treatment usually starts by inducing fully confluent 3T3-L1 cell culture (pre-adipocyte) which begins to differentiate shortly after by accumulating lipid into lipid droplet structures within days [1]. Another variant of these differentiation media includes the addition of rosiglitazone, a peroxisome proliferatoractivated receptor gamma (PPARG) agonist [11]. In this dataset, we selected the studies that used the MDI differentiation protocol with minimal variations in dose and method. The differentiation course is usually divided into several stages depending on the expression of certain markers and the amount of accumulated lipid; the day 2-4 marks the early stage; up to day 7 is an intermediate stage and up to 14 d is latedifferentiation stage.

Data collection and acquisition
We surveyed GEO and SRA repositories for highthroughput sequencing data of MDI-induced 3T3-L1 pre-adipocyte samples at different time points. The data were obtained from GEO or SRA in the form of raw reads (fastq). In total, 98 RNA-Seq and 187 ChIP-Seq samples (transcription factor, co-factor and histone modification markers; referred to as factors) were included. Samples with multiple runs or paired-end runs were obtained separately and combined in later steps of the pipeline. Raw reads were downloaded from the SRA ftp server using Wget. FASTQC was used to assess the quality of the raw reads [12]. We did not remove the samples with low quality at this stage but rather added the quality information to the final product in the metadata table in the form of qc_read objects. Table 1 and 2 list the RNA-Seq and ChIP-Seq datasets included in this curation, respectively. For each dataset, we recorded The GEO/SRA ID, the number of included sample (N), the time points in hours from the point of MDI induction and the stage of differentiation (0, noninduced; 1, early; 2, intermediate; 3, late-differentiation) . Each dataset was connected to a published study of which we recorded the PubMed ID and a reference.

Data pre-processing and processing
Gene expression data processing pipeline For RNA-Seq, the raw reads were aligned to UCSC mm10 mouse genome using HISAT2 [44]. FeatureCounts was used to count the aligned reads (bam) in known genes [45]. The reads count in genes were presented as a count matrix with genes in rows and samples in columns. Together, the metadata of the samples, the gene annotations and the count matrix were packaged in a SummarizedExperiment object and deposited as a Bioconductor experimental data package (curated-AdipoRNA) [46]. Figure 1 (left) depicts the steps of processing the RNA-Seq data.

DNA-binding data processing pipeline
For ChIP-Seq, the raw reads were aligned to the same mm10 genome using BOWTIE2, peaks and signal tracks were built from the aligned reads (bam) using MACS2 [47,48]. The reads count in a peakset of replicated peaks across samples was acquired and arranged in a matrix with peaks in rows and samples in columns using BEDTOOLS [49]. The peakset was annotated and peaks were assigned to the nearest gene using ChIPseeker [50]. Genomic annotations and gene coordinates were accessed through TxDb.Mmusculus.UCSC.mm10. knownGene [51]. As described above, the metadata and the data were arranged in a SummarizedExperiment object and deposited as a Bioconductor experimental data package (curatedAdipoRNA) [46]. Figure 1 (right) depicts the steps of processing the ChIP-Seq data.

Method of technical validation
The technical validation analysis presented in this manuscript was based on the reads count in genes or peaks from the RNA-Seq (n = 98) and ChIP-Seq (n = 22, subset) samples of the curated datasets. The counts were transformed using the variance stabilization transformation (VST) to adjust the distribution of counts to be comparable across samples. MDS was applied using cmdscale (base R) [52]. The differential expression analysis was applied using DESeq2 [53]. The gene ontology (GO) terms annotation was obtained from org.Mm.eg.db and tested for enrichment using goseq [54,55]. The R packages tidyverse, xtable and ComplexHeatmap were used to transform, reshape and visualize the data [56][57][58]. The analysis was conducted in an R environment and using Bioconductor packages [59,60].
The signal tracks from histone modification ChIP-Seq samples (n = 9) were built from the aligned reads using MACS2 [48]. The scores over 10 bp windows in the promoter regions ( AE 3kb) around the transcription start sites of the genes of interest were extracted, normalized and visualized using EnrichedHeatmap [61]. Three datasets of human primary adipocytes were used to compare the gene expression and DNAbinding in 3T3-L1 model to primary cells. Isolates from the subcutaneous fat of healthy subjects (n = 24) were induced for differentiation using MDI for 10 d and profiled for gene expression by microarrays (GSE98680) [62]. Human mesenchymal stem cells (hMSC) and human multipotent adipose-derived (hMAD) cells were induced by the same medium for 6 h or 19 d and used in ChIP-Seq for CEBPB (GSE68864) or PPARG (GSE59703), respectively  [63,64]. The processed data were obtained from GEO using GEOquery [65].

Software environment and code availability
We packaged the software environment where the code was executed as docker images (https://hub.docker. com/r/bcmslab/adiporeg). The scripts used to collect, process and package the datasets are available on GitHub under GPL-3 licence (https://github.com/ MahShaaban/curatedAdipoRNA and https://github. com/MahShaaban/curatedAdipoChIP). The code for generating the technical validation figures and the metadata tables in this manuscript is available on GitHub (https://github.com/BCMSLab/curated_adipo_ describtor).

Results and discussion
The stage of differentiation explains the variance among the adipocyte samples To test whether or not the final-processed datasets represent the distinct phenotypes they are supposed to, we applied MDS analysis on the full RNA-Seq reads count in all genes and a subset of the ChIP-Seq reads count in peaks of CCAAT enhancer binding protein beta (CEBPB) and PPARG targets. The counts were first transformed using VST. The stage of differentiation of the samples (0-3) was used to represent the phenotype and it showed an appropriate separation along the first two dimensions of the MDS (Figure 2(a)). The stage of differentiation but not the dataset study origin variable explained a significant amount of the variance in gene expression. Similarly, the DNA-binding patterns were explained by the factor/antibody used in each sample (Figure 2(b)).

Adipocytes at the same stage and their replicates are similar to each other and dissimilar to other stages
We tested the relationship among different samples and replicates. We used the counts from RNA-Seq and ChIP-Seq samples to calculate the Euclidean distances among them as a measure of dissimilarity. With the exception of a few samples, most RNA-Seq samples had low dissimilarity between replicates and close-by phenotype (time point/stage of differentiation) (Figure 3(a)). This suggests adequate data filtering and pre-processing. In addition, the gene expression reflects the distinct genotype of the adipocyte at a different stage of maturation. The ChIP-Seq samples from the same ChIP antibody were also similar and they had a low dissimilarity by phenotype within each Figure 1. RNA-Seq and ChIP-Seq data processing pipelines. Raw reads were obtained from SRA using Wget. Reads quality was assessed using FASTQC. The mouse genome sequence and annotation were downloaded from UCSC. The genome indices were generated and used to align the RNA and ChIP-Seq reads using HISAT2 and BOWTIE2, respectively. The aligned RNA-Seq reads were used to count reads in genes using featureCounts. The aligned ChIP-Seq reads were used to call and annotate peaks using MACS2 and ChIPSeeker. Reads in peaks were counted using BEDTOOLS. group (Figure 3(b)). Therefore, the binding pattern from each sample can be replicated for a given factor and distinguished enough from that of other factors.

Adipocytes exhibit appropriate gene expression and binding patterns of adipogenic and lipogenic markers
The adipocyte differentiation is a well-studied process.
In response to the induction stimulus, certain adipogenic factors are turned on. They direct a well-defined transcriptional program which transforms the cell into the mature adipocyte characterized by the accumulation of lipids in the lipid droplets [5,66]. Therefore, we could use this information to confirm that the curated datasets reflect meaningful biology as expected from high-quality experiments. The expression of essential adipogenic transcription factor genes Cebpb and Pparg was turned on at the early (stage 1) and intermediatelate (stage 2-3) differentiation stages, respectively (Figure 4(a)). The expression of important lipogenic genes such as Acly, Fasn and Lpl which are essential  for lipid synthesis and accumulation was progressively increased as the cells transitioned to the mature adipocytes (Figure 4(a)). The same was also reflected by the increased binding of CEBPB and PPARG in peaks belonging to the lipogenic genes (Figure 4(b)).
Adipose, lipid and insulin-related gene sets are enriched in differentiated adipocytes compared to pre-adipocytes We used the differentially expressed (DE) genes and differentially bound peaks to perform gene set enrichment analysis of key biological processes-gene ontology (GO) termsthat are most expected to be actively regulated during the differentiation course. The terms adipose tissue development (GO:0060612); lipid catabolic process (GO:0016042); lipid storage (GO:0019915); glucose metabolic process (GO:0006006); and cellular response to insulin stimulus (GO:0032869) were enriched in most comparisons. The fraction of DE gene members of each term between induced (stage 1-3) and noninduced (stage 0) samples was significantly higher than what is expected by chance alone (Figure 5(a)). The same pattern was also observed in the binding pattern on the gene members of the same terms that are bound to CEBPB or PPARG in induced (stage 1 or 3, respectively) and the non-induced (0 stage) samples. These peaks had absolute (up or down) fold-changes more than the random gene set ( Figure 5(b)).

The regulation of adipose development genes is associated with expected histone modifications
Histone modifications play an important role in the adipocyte differentiation [67]. They assume roles in the induction and/or the repression of adipogenic genes by working either individually or in combinations [68]. Moreover, histone markers are usually found in association with transcription factors at the enhancers and promoter regions [69]. Here, we showed as an example the dynamic changes in the modification patterns of H3K27ac, H3K4me1 and H3K4me3 at the promoter regions of the members of two important gene ontology terms negative regulation of fat cell differentiation (GO:0045599) and positive regulation of fat cell differentiation (GO:0045600). The modifications varied across two variables; the stage of differentiation and the functional category of genes ( Figure 6). The changes suggest the dynamic modification of genes in key pathways where the induced and repressed expression had different signatures; however, an exhaustive study of these signatures is beyond the purpose of the current validation.
The 3T3-L1 model reflects essential aspects of the adipocyte biology The 3T3-L1 cell line provides a model for studying the development of the fat cells and the behaviour of the mature adipocytes [70]. Although differences are expected to arise between the model and the subject it poises to study, the 3T3-L1 cell model reflects the essential aspect of the biology of the adipocytes [71]. We compared the expression and binding patterns of key adipogenic transcription factors and their targets in the 3T3-L1 model to human primary adipocytes [62][63][64].
The expression of PPARG and CEBPA and their lipogenic gene targets LPL, ACLY and FASN was induced by the MDI in the human cells (Figure 7, left). In addition, PPARG had binding peaks in the promoter regions of each of the three lipogenic genes (Figure 7, right). The adipogenic factors also showed a pattern of binding  Peaks in ChIP-Seq samples (n = 22) were tested for differential peak binding between stage (1, early-differentiated, for CEBPB; 3, late-differentiated, for PPARG) and 0 non-induced stage using the reads count in peaks. The absolute fold-changes of significantly expressed peaks in genes from the three GO terms (same as above) and a random gene set (n = 50) are shown as box plots (25%, 50% and 75% percentiles).
consistent with the suggested auto-regulation and feedback loops among them [72]. These similarities support the 3T3-L1 as an in vitro model for studying the adipocytes.
The curated dataset of differentiating adipocytes is publicly available and optimized for reusability The processed RNA-Seq dataset is available as a Bioconductor experimental data package (http://bio conductor.org/packages/curatedAdipoRNA/). The package provides a SummarizedExperiment R object. The object contains three tables. The first is the metadata table, which contains the manually curated sample metadata using unified language to facilitate comparing and combining the data from different studies. The two main metadata items are the time point (hours) and the stage of differentiation of each sample (0, non-induced; 1, early; 2, intermediate; 3, late-differentiation). In addition, the metadata contain quality assessment measurements of the samples in the form of qc_read objects. The second table is a GRanges object with essential information about the genes in which reads were counted. The third table is a count matrix of all known mouse genes. Together, the tables can be used in analyses such as differential expression, gene set enrichment and/or time-course analyses.
The processed ChIP-Seq dataset is available as a Bioconductor experimental data package (http://biocon ductor.org/packages/curatedAdipoChIP/). The package provides a SummarizedExperiment R object. The object contains three tables. The first is the metadata table similar to the one described above and information of the ChIP antibodies. The second table is a GRanges object with essential annotations about the Peaks in which reads were counted. The third is a count matrix of the reads in peaks and their gene assignment in the mouse genome. Together, the tables can be used in differential peak binding and occupancy analyses.

Conclusion
We surveyed public repositories for high-throughput sequencing data on the in vitro adipocyte model 3T3-L1 and curated extensive metadata on the included samples. The raw data were collected and processed using standard pipelines. The product of this study is the construction of gene expression and DNA-binding models of the differentiating 3T3-L1. The processed data were documented and made available as an opensource Bioconductor experimental data packages. The models were assessed for quality and were found to reflect the essential aspect of known adipocyte biology from the published literature and the human primary adipocytes. Figure 7. Gene expression and binding patterns of adipogenic transcription factors and lipogenic genes in human primary adipocytes. Probe intensities from microarrays samples (n = 24) from primary pre-adipocytes and differentiating adipocytes were used to estimate the expression (low, blue; high, red) of selected genes (GSE98680). Samples were prepared from primary cells isolated from the subcutaneous fat of healthy human subjects. The isolated cells were induced for differentiation using MDI medium for 10 d. The peaks from ChIP-Seq samples from primary adipocytes were used to represent the binding (present, black; not, white) of adipogenic transcription factors on selected genes. CEBPB and PPARG ChIP antibodies were used in human mesenchymal stem cells (hMSC) (GSE68864) or human multipotent adipose-derived (hMAD) (GSE59703) cells 6 h or 19 d after MDI-induction.