Genome-wide identification and characterisation of long non-coding RNAs in two Chinese cattle breeds

Abstract Long non-coding RNAs (lncRNAs) have been essentially involved in all kinds of important biological processes, such as cell differentiation and development. In this study, we comprehensively profiled genome-wide lncRNAs of bovine skeletal muscle in two Chinese cattle breeds using RNA-seq approach. The longissimus dorsi samples were collected from three Charolais (CH) and three Tongjiang (TJ) adult cattle, respectively. Approximately 80 million reads were obtained per library, and by which we assembled 31,009 unique transcripts. We successfully identified a total of 4155 different lncRNAs in six muscle samples, consisting of 3204 intronic lncRNAs, 737 intergenic lncRNAs, 113 sense lncRNAs and 101 antisense lncRNAs. Among them, 32 lncRNAs were differentially expressed (Fold Change >1.0, FDR < 0.05) between CH and TJ. Among these lncRNAs, 1744 identified bovine lncRNAs were located adjacent to 1146 protein-coding genes (<100 kb upstream and downstream) and functionally enriched in organ development, gene regulation and signal transduction. Our results provide a catalogue of bovine skeletal muscle-related lncRNAs, which will complement the list of bovine lncRNAs and contribute to understanding of the molecular mechanism of underpinning muscle development in cattle. Highlights The first study to compare specialised beef cattle breed with Chinese local beef cattle breed on long non-coding RNA (lncRNA). The 4155 different lncRNAs in beef cattle genome were identified to complement the list of bovine lncRNAs. The 1146 protein-coding genes were found maybe in cis-regulatory relationships with lncRNAs to understand the molecular mechanism underpinning muscle development in cattle.


Introduction
Long non-coding RNA (lncRNA) is a kind of non-coding RNA longer than 200 nucleotides, and some lncRNAs are even longer than 100 kb (Ponting et al. 2009). Initially, lncRNA was regarded as the 'noise' of genomic transcription with no biological function, a by-product of RNA polymerase II transcription. Borsani et al. (1991) reported that Xist was involved in the regulation of X chromosome inactivation. Since then, increasing studies have shown that lncRNAs play important roles in many biological processes, including transcriptional regulation, epigenetic modification, cell differentiation, development, reproduction, aging, as well as in varieties of tumours and other diseases (Bonasio and Shiekhattar 2014;Meseure et al. 2015;Taylor et al. 2015;Schmitz et al. 2016;Bolha et al. 2017;Kopp and Mendell 2018;Fernandes et al. 2019;Wu et al. 2019). Thus, lncRNA has attracted extensive attention from researchers in the field of biology. IncRNA functions are mainly based on its secondary structure, which binds to proteins, can cause chromatin remodelling and affect the function of transcription factors (Fernandes et al. 2019). Also, it can bind to microRNAs at its linear level to affect the expression of messenger RNA (mRNA) indirectly ). In addition, it can bind to the mRNAs directly to affect their translation, splicing and degradation.
Besides, several studies thus far have investigated the bovine lncRNAs. Huang et al. (2012) predicted a total of 449 putative lncRNAs located in 405 intergenic regions from public bovine-specific ESTs. Weikard et al. (2013) identified 4848 potential lncRNAs in bovine skin samples (pigmented and non-pigmented) by deep next-generation sequencing. Billerey et al. (2014) defined a set of 584 different lincRNAs with 418 lincRNAs found in all nine muscle samples used paired-end RNA sequencing. Koufariotis et al. (2015) captured 9778 high-confidence putative lncRNAs from RNA-Seq data across 18 tissues of a single cow. Liu et al. (2017) identified 7188 bovine skeletal muscle lncRNAs by RNA-Seq. Kern et al. (2018) identified 7235 lncRNAs in cattle using RNA-seq from eight tissue types. Yang et al. (2018) detected 3746 lncRNAs in mammary glands of Holstein cows using RNA-seq.  detected 4995, 1568 and 4243 lncRNAs in bovine mammary gland, ileum and rumen tissues using RNA-seq. Li et al. (2018) found 28,558 lncRNAs expressed in the bovine mammary gland by RNA-seq. However, the number of all recently identified bovine lncRNAs identified is rather lower than that of human and mouse , suggesting that further efforts are needed to comprehensively discover all bovine lncRNAs.
Charolais (CH) cattle is a worldwide specialised beef cattle breed. Tongjiang (TJ) cattle is a local cattle population of Sichuan Province, China, which has long been used as work cattle. It is a small type and slowly growing cattle with high meat quality. At present, TJ cattle is in the process of herd improvement to beef. So, it is the necessary work to compare the genetic structure between the two breeds. In this study, we reported the identification and characterisation of lncRNAs in longissimus dorsi of CH cattle and TJ cattle using an Illumina HiseqX platform. Our results will provide a useful resource for understanding the function and regulation of lncRNAs in beef cattle and annotating the bovine genome, as well as contribute to better investigate skeletal muscle development in mammals.

Ethics statement
All animal experimental protocols were approved by Biological Studies Animal Care and Use Committee at Institute of Animal Sciences, Sichuan Agricultural University, Chengdu, China.

Animals and sample preparation
In this study, six bovine longissimus dorsi samples were taken from three CH cattle and three TJ adult cattle, respectively. The six cattle had no genetic relationship with each other for at least three generations. They were free access to water and food under natural lighting and had not suffered any diseases and taken any drugs in the previous 1 year. After slaughtering at 18 months old, samples were collected from longissimus dorsi muscle between 12th and 13th rib within 30 min. All samples were immediately frozen in liquid nitrogen and then stored at À80 C for RNA extraction.

RNA isolation and Illumina deep sequencing
Total RNA from six samples was isolated using RNAprep pure Tissue Kit (DP431, TIANGEN, Beijing, China) following the manufacturer's instruction. The integrity and purity of total RNA were analysed by NanoDrop 2000 spectrophotometer (Thermo Scientific, Waltham, MA) and RNA Nano 6000 Assay Kit in a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA) with RNA integrity number >8. Complementary (cDNA) libraries were constructed using approximately 3 ug of total RNA per sample. Clustering of the indexcoded samples used the TruSeq PE Cluster Kit v3-cBot-HS on the cBot Cluster Generation System (Illumina, San Diego, CA). After cluster generation, the libraries were sequenced on an Illumina HiseqX platform and 150 bp paired-end reads were generated.

Transcriptome assembly
Raw reads were first subjected to quality filtering, which removed the reads containing 5% of ploy(N), and low quality reads (>20% of the bases with Q 10) to generate clean reads. Then, the Q20, Q30 and GC% of the clean data were calculated. The clean reads were mapped to the reference genome (Btau_5.0.1, NCBI) using HISAT2 tool and default parameters (Kim et al. 2015). The mapped reads from each library were assembled with StringTie software (Pertea et al. 2015), and multiple assembled transcript files were merged to produce a unique set of transcriptions using the Taco software (Niknafs et al. 2017).

Gene expression analysis
Transcript abundance was quantified as the transcripts per million (TPM) (Li et al. 2010). The transcripts with TPM level <1 were filtered out. The significance of the differentially expressed lncRNAs was analysed using an R package DESeq2 (Love et al. 2014) and multiple comparisons were corrected for p value with the false discovery rate (FDR). A FDR value <.05 and absolute value of log2FoldChange >1 were used as the threshold to statistically define the significant difference.

Identification of putative lncRNAs
To obtain putative lncRNAs, a stringent analysis pipeline was used to filter the assembled transcripts as follows: (1) Transcripts overlapping with bovine proteincoding genes were removed. (2) Single exon transcripts were removed. (3) Transcripts less than 200 bp were removed. (4) Transcripts with coding potential score predicted by the coding potential calculator (CPC) (Kong et al. 2007) ! À1 were removed. After these steps, the remaining transcripts were considered as the putative lncRNAs.

Enrichment analysis
To predict the target genes of lncRNAs, we searched for protein-coding genes 100-kb up-and down-stream of each putative lncRNA. These target genes were subjected to functional exploration according to the Gene Ontolog (GO) and Kyoto Encyclopaedia of Genes and Genomes (KEGG) enrichment analysis using DAVID online analytical tools to predict their functional roles (Huang da et al. 2009). GO terms and KEGG pathways with p<.05 and Fold Enrichment >2 were considered to be significantly enriched.

Validation by qRT-PCR
Six lncRNAs were randomly chosen for qRT-PCR validation, whose primers were designed using Primer3 online tool (Table 1). The total RNAs for qRNA-PCR was the same as describing above for lncRNA-seq. Total RNAs were converted to cDNA using the PrimeScript RT reagent Kit (TaKaRa, Dalian, China) following the manufacturer's instructions. The PCR was performed according to the SYBR Premix Ex Taq TM II instructions (TaKaRa) on a CFX96 Real-Time PCR System (Bio-Rad, Hercules, CA). The PCR cycling condition involved 2 min incubation at 95 C, followed by 39 times of 95 C for 10 s, Tm for 10 s (Tm indicated in Table 1). The relative gene expression levels were calculated based on GAPDH gene expression using the 2 ÀDDCt method.

Overview of RNA sequencing and transcriptome assembly
To comprehensively and accurately identify lncRNAs expressed in the bovine longissimus dorsi muscle, we constructed six cDNA libraries (CH1-3, TJ1-3) from longissimus dorsi muscle of CH cattle and TJ cattle in this study. A total of 489,465,128 raw reads were generated in all six libraries with a length of 150 bases, resulting in a total of 73.42 gigabases. The result of RNA quality was shown in Table 2. After filtering out adaptors, poly-N, low quality and rRNA reads, a total of 488,266,748 clean reads were remained. The percentage of clean reads in each library ranged from 99.48% to 99.94%. The reads were then aligned into bovine Btau_5.0.1 reference genome using the HISAT2 software. Approximately 94.74% (94.32%$95.14%) of the total clean reads were mapped to the bovine reference genome and 66.66% (65.61%$68.19%) of the reads were uniquely mapped to specific regions of the bovine genome. The mapped sequences in each library were reconstructed and assembled using the StringTie and a total of 31,009 unique assembled transcripts were obtained.

Identification of putative lncRNAs in bovine skeletal muscle
To identify lncRNAs, we developed a stringent filtering pipeline to discard transcripts with one or more of the following: homology to known genes, lengths <200 bp, single exon and CPC score ! À1. A total of 4155 lncRNA candidates were identified by our pipeline (Supplementary Table S1), including 3204 intronic lncRNAs,737 intergenic lncRNAs, 113 sense lncRNAs and 101 antisense lncRNAs in the six libraries ( Figure  1(a)). The 4155 lncRNAs were distributed, but not equally, over all chromosomes (Figure 1(b)), which the highest density was on chromosome X (212 lncRNAs), and the lowest density was on chromosome Y (1 lncRNA). There were also 19 lncRNAs located in Bos taurus unplaced genomic scaffolds. The size of 4155 lncRNAs ranged from 200 to 8199 bp with an average length of 2114 bp (Figure 1(c)). Approximately 80.51% of lncRNAs were shorter than 2800 bp. Their average number of exon was 3.24 with a range of 2-16 ( Figure  1(d)). Most of these lncRNAs contained 2-4 exons, accounting for 87.32% of the 4155 lncRNAs.

Identification of differentially expressed lncRNAs
Expression levels of lncRNA transcripts were estimated by Transcripts Per Kilobase Million (TPM) using Kallisto software. A total of 263 lncRNAs were expressed specifically in longissimus dorsi of CH cattle, while 180 lncRNAs expressed specifically in that of TJ cattle (Figure 2). A total of 3712 lncRNAs were expressed in the longissimus dorsi muscle of both breeds, of which 2999 lncRNAs were expressed in longissimus dorsi muscle of all individuals. The R package DESeq was used to get differential expression levels of lncRNAs between the two breeds. Differential expression analysis (CH vs. TJ) suggested that 20 lncRNAs were identified up-regulation and 12 lncRNAs were identified down-regulation in CH cattle (Supplementary Table  S1). Principal components analysis (PCA) of differentially expressed lncRNAs showed distinct clustering of CH cattle and TJ cattle ( Figure 2).

Prediction of lncRNA functions
A number of studies have indicated that lncRNAs can act in cis to regulate nearby coding genes (Marchese et al. 2017). We predicted the potential targets of lncRNAs upon applying a cut-off of 100 kb upstream and downstream of the 4155 lncRNAs. A total of 1146 protein-coding genes were identified as the nearest neighbours of the 1744 lncRNAs. Then we performed gene ontology analysis of these 1146 protein-coding genes to explore the lncRNAs functions using DAVID Bioinformatics Resources. The analysis was performed on biological process (GOTERM_BP_DIRECT), cellular component (GOTERM_CC_DIRECT), molecular function (GOTERM_MF_DIRECT) and KEGG_PATHWAY using the functional annotation tool with a corrected Fisher exact p value <.05 and folder enrichment >2. GO functional annotation showed that 239 genes were functionally enriched in 38 GO terms, including 23 biological processes, 7 cellular component and 8 molecular function (Table 3). These terms were mainly associated with organ development, gene regulation and signal transduction, such as skeletal muscle cell differentiation  5 KEGG pathways, including gap junction (bta04540), insulin signalling pathway (bta04910), prostate cancer (bta05215), insulin resistance (bta04931) and glutamatergic synapse (bta04724).

Experimental validation of putative lncRNAs
Six differently expressed lncRNAs were randomly selected for validation by qPCR. The results shown that these six lncRNAs were expressed in all samples.    And their expression patterns were consistent with the expression levels of the RNA-seq. This result suggests that most of our putative lncRNAs would be truly expressed in vivo (Figure 3).

Discussion
lncRNA rapidly became a hot spot in the field of genetic research, since it was demonstrated as important regulators of gene expression in a variety of biological processes. An increasing number of lncRNAs have been identification and characterised in multiple species with the developed next-generation sequencing technology, especially with the advent of highthroughput RNA-seq methods (Jathar et al. 2017;Cao et al. 2019). The latest NONCODE database source contains 548,640 lncRNA transcripts, which points to 172,216 lncRNAs from the human genome, 131,697 lncRNAs from the mouse genome and only 23,515 lncRNAs from the bovine genome ).
The number of lncRNAs in the bovine genome seems to be much underestimated. In previous studies, the EST data and RNA-seq data were both used to identify lncRNAs in cattle. For example, 21,951 IncRNAs and 449 putative lncRNAs were identified from bovine ESTs data (Huang et al. 2012). On the other hand, several studies reported that thousands of lncRNAs were detected in diverse bovine tissues by RNA-seq data (Kern et al. 2018;Choi et al. 2019;Wang et al. 2019).
In this study, we identified a total of 4155 bovine lncRNAs based on transcriptomic data from six RNAseq datasets in longissimus dorsi muscle of CH and TJ cattle. This result increases in the number of bovine lncRNAs and further improves the current database of information on bovine lncRNAs. lncRNAs can be either single-exonic or multi-exonic, while it is a challenge to distinguish true lncRNAs from the abundant one exon, low expression fragments and expression noise assembled from RNA-Seq data (Cabili et al. 2011;Derrien et al. 2012). Thus, we only chose multi-exonic lncRNAs for further analysis, as was done in other studies (Tong et al. 2017;Kern et al. 2018;Yang et al. 2018;Yue et al. 2019), and depends on a highly stringent filtering pipeline to minimise the number of false-positive lncRNAs. Consequently, 4155 non-protein coding transcripts were considered as highly reliable lncRNAs, which was less than previous studies due to our stringent criteria (Billerey et al. 2014;Liu et al. 2017;Choi et al. 2019;Yue et al. 2019). These lncRNAs have different expression distribution in different chromosomes. The vast majority of putative lncRNAs were intronic or intergenic in our study, which confirmed that intragenic and intergenic regions were the major sources of non-coding RNAs (Mattick and Makunin 2006;Xu, Bai, et al. 2017). lncRNAs are known to be shorter transcript lengths and fewer exon numbers than mRNAs Sulayman et al. 2019). Approximately 80.51% of our 4155 lncRNAs was shorter than 2800 bp and 87.32% of them had 2-4 exons, which was in agreement with previous studies in cattle (Billerey et al. 2014;Liu et al. 2017;Choi et al. 2019;Yue et al. 2019). These investigations provide valuable information about the characteristics of lncRNAs in bovine.
Although thousands of lncRNAs have been identified and analysed in bovine muscle, several lncRNAs have been reported to play important roles in bovine myoblast proliferation and differentiation Jin et al. 2017;Xu, Ji, et al. 2017;Yue et al. 2019). Muscle mass and growth rate are two economically important traits in beef production. The differences in muscle mass and growth between CH and TJ cattle provide a good model for studying the mechanisms that regulate muscle development (Chen et al. 2008). In this study, 32 lncRNAs were differentially expressed between CH and TJ cattle and may be involved in muscle development. It may provide a better understanding of the dynamic gene regulation of muscle mass and growth differences between the two breeds and lays important groundwork for further research regarding the regulatory role of lncRNA in muscle development.
Emerging evidence suggests that lncRNAs often as cis-acting element are actively involved in various developmental and physiological processes. For example, the bovine lncRNAs TCONS_00041733 regulate differentially expression coding gene EFNA1, which is involved in vascular development, inflammatory response and cell proliferation and angiogenesis as well as sperm morphology and membrane functionality ). In the cis prediction, we sought out 4155 lncRNAs that neighboured 1146 protein-coding genes (<100 kb upstream and downstream). These neighbouring genes were significantly functionally enriched in 38 GO terms (23 biological process, 7 cellular component and 8 molecular function) and 5 KEGG pathways, which were mainly associated with organ development, gene regulation and signal transduction. Eight of the cis target protein-coding genes enriched in GO term associated with skeletal muscle cell differentiation (BCL9L, HIVEP3, MEF2D, RB1, MAFF, EMD, BTG2 and ASB2), indicating the important roles of lncRNAs in skeletal muscle development. It was consistent with previous studies that lncRNAs regulated nearby protein-coding genes to involve in muscle development and diseases (Zhan et al. 2016;Liu et al. 2017;Yue et al. 2019). For example, the most recent study reported that the bovine lncKBTBD10 affect its proximity gene KBTBD10 to involve in myogensis ).

Conclusions
In this study, 4155 lncRNAs were identified in six bovine longissimus dorsi muscle tissues by RNA-seq technology. The major characteristics of lncRNAs were similar to previous studies, including unequal distribution on chromosomes, relatively shorter ORF and fewer exons. Thirty-two different expression lncRNAs were found out between CH cattle and TJ cattle. The identified lncRNAs may act in cis to target 1146 protein-coding genes, which were mainly associated with organ development, gene regulation and signal transduction. This study provides a useful database of bovine muscle lncRNAs for future investigations of their regulatory roles in bovine muscle development.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was financially supported by the Earmarked Fund for Supported by Sichuan Science and Technology Programme [Grant No: 2016NYZ0050].