Non-coding transcript variants of protein-coding genes – what are they good for?

ABSTRACT The total number of protein-coding genes in the human genome is not significantly higher than that in much simpler eukaryotes, despite a general increase in genome size proportionate to organismal complexity. The large non-coding transcriptome and extensive differential splicing are increasingly being accepted as factors contributing to complex mammalian physiology and architecture. Recent studies reveal additional layers of functional complexity: some long non-coding RNAs have been redefined as micropeptide- or microprotein-encoding transcripts, and in turn some protein-coding RNAs are bifunctional and also display non-coding functions. Moreover, several protein-coding genes express long non-coding RNA splice forms and generate circular RNAs in addition to their canonical mRNA transcripts, calling into question the strict definition of a gene as coding or non-coding. In this mini review, we discuss the current understanding of these hybrid genes and their possible roles and relevance.


Introduction
The earlier understanding of the evolution of complexity in higher eukaryotes assumed a proportionate increase in the number of protein-coding genes in the human genome compared to much simpler organisms like worms and flies. Hence, the original discovery that the human genome contains barely 30,000 protein-coding genes, not many more than are found in the genome of the worm Caenorhabditis elegans, was perplexing [1]. An optimistic view at the time suggested that the low numbers could mean fast realization of the objectives of functional genomics and easy unravelling of all the complexities within a few years [2]. Notwithstanding the optimism at the dawn of the 21st century, we still have a long way to go before completely understanding the gene-protein-function correlation and the associated complexity of Homo sapiens. These miscalculations originally arose from the devaluation of the non-coding parts of the genome, which were often referred to as junk DNA. It is interesting to note that a typical bacterium harbors protein-coding genes in approximately 90% of its genome, while more than 95% of the human genome does not code for proteins. By current estimates, the short and long non-coding RNA genes in the human genome outnumber protein-coding genes. The importance of tRNAs and rRNAs as decoders of the genetic code was evident early on. However, the rest of the non-coding transcriptome is only slowly being characterized. Novel functions of the short regulatory RNAs and the more cryptic long non-coding RNAs (lncRNAs) are constantly being discovered (reviewed in [3,4]). These non-coding transcripts could be the missing link in explaining the complexity achieved by mammals, despite their having a similar number of protein-coding genes to invertebrates. Roles for lncRNAs in complex processes unique to higher eukaryotes, like neuronal development and cancer metastasis, support this notion [5][6][7].
Alternative splicing is another factor contributing to the functional diversity and complexity of the proteome as well as the non-coding transcriptome [8,9]. Differential splicing can give rise to multiple mRNAs from a single gene, coding for polypeptides displaying distinct sequences and functions. Non-coding RNA genes also undergo extensive splicing and posttranscriptional processing [10]. The importance of splicing in mammalian physiology and pathology is well documented and reviewed elsewhere [11,12].
A deeper understanding of the genomic 'dark matter' and of the role of alternative splicing has brought us to the question of whether genes can be strictly classified as coding or non-coding. Classification of genes into coding or non-coding is often based on their protein-coding potential, indicated by the presence of long, conserved and translatable open reading frames (ORFs). Since the majority of annotated eukaryotic proteins are larger than 100 amino acids, RNAs which do not harbor ORFs longer than 300 nt were classified as non-coding. However, recent evidence indicates that a large number of lncRNAs harbor small ORFs (smORFs) and encode short micropeptides [3,13]. Another, more reliable approach for the coding/non-coding distinction relies on the eventual biological function of the transcripts and depends on whether they need to be translated to protein to perform their functions or whether they act as RNA effectors. The credibility of this mode of transcript annotation is also questioned by the discovery of the so-called cncRNAs (coding and non-coding RNAs) or bifunctional RNAs, wherein a protein-coding RNA additionally performs non-coding functions (reviewed in [14]). For example, independent of the tumor suppressor function of the p53 protein, the p53-encoding mRNA directly binds MDM2 and suppresses its ubiquitin ligase activity [15]. Thus, the emergence of cncRNAs is blurring the boundaries between the coding and non-coding transcriptome [16] (Figure 1). Here, we will focus our attention on the subset of protein-coding genes which, in addition to mRNAs, also express non-coding transcript variants, often as a result of alternative splicing [17]. We will refer to these genes as bifunctional or hybrid genes, summarize the current understanding of these transcript variants and discuss their possible functions.
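The ORF-length criterion described above can be illustrated with a minimal sketch. It assumes that an ORF runs from an ATG start codon to the first in-frame stop codon, that only the three forward reading frames are scanned, and that the ORF length is counted including the stop codon; none of these details are specified in the text, and the function names `longest_orf_nt` and `classify_by_orf` are hypothetical. As the paragraph above points out, such a length cut-off misses smORF-encoded micropeptides, so the output is only a first-pass label.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_nt(seq: str) -> int:
    """Return the length (nt, start codon through stop codon) of the
    longest ATG-initiated ORF in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # look for the next ORF in this frame
    return best


def classify_by_orf(seq: str, min_orf_nt: int = 300) -> str:
    """Label a transcript by the 300 nt ORF-length rule (assumed cut-off:
    strictly longer than min_orf_nt counts as putative coding)."""
    if longest_orf_nt(seq) > min_orf_nt:
        return "putative coding"
    return "putative non-coding"


# Toy sequences, invented for illustration only.
short_transcript = "CC" + "ATG" + "GCT" * 10 + "TAA" + "G"
long_transcript = "ATG" + "GCT" * 100 + "TAA"
print(classify_by_orf(short_transcript))
print(classify_by_orf(long_transcript))
```

A transcript labeled "putative non-coding" by this rule may still carry a translated smORF, which is why experimental evidence such as ribosome profiling is needed before a transcript is treated as genuinely non-coding.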

Prevalence of non-coding transcript variants of protein-coding genes
Most mammalian genes express more than one transcript, often generated as a result of alternative splicing [9]. This is in addition to the compact arrangement of gene loci, where multiple genes overlap with each other on the same as well as on opposite strands of the chromosome. Moreover, several miRNAs and most of the small nucleolar RNAs (snoRNAs) are located in the intronic regions of protein-coding and non-coding genes and are processed from the pre-mRNA, before or after splicing [18]. Despite the presence of multiple splice variants per gene, most human genes express a single major transcript variant [19].
Interestingly, for approximately 17% of the protein-coding genes, the most prominent transcript variant is non-coding. Most of the major non-coding transcripts are predominantly nuclear, and some of these detected transcripts are splicing intermediates carrying intron remnants [19]. Non-coding variants mainly arise by the skipping of the first or last exon, potentially resulting in the loss of start or stop codons, respectively. At least a few of these predicted candidates are possible evolutionary anomalies, since the direct consequence of alternative splicing in some cases is the generation of perfect nonsense-mediated decay (NMD) substrates, which are degraded constitutively [20]. Nevertheless, several genes in GenBank possess well-annotated long non-coding transcript variants, which show moderate expression levels and could thus be stable enough to perform non-coding functions.
In addition to lncRNAs, another class of stable ncRNAs generated as a product of pre-mRNA splicing is the circular RNAs (circRNAs) [21,22]. They are usually generated by the back-splicing of one or more exons from a pre-mRNA, resulting in an end-joined circular RNA molecule [23,24]. CircRNAs are expressed from thousands of human genes and, at least in some cases, show higher expression than the corresponding linear isoforms [21,25,26]. While a large number of circRNAs are regulatory non-coding RNAs, some of them have protein-coding potential, even though they do not undergo capping and polyadenylation [27][28][29]. Thus, much like in the case of lncRNAs, direct experimental verification is the only means of assessing the translatability of circRNAs. Recent investigations have unraveled the functional relevance and interesting regulatory mechanisms of at least a few of these lncRNA/circRNA-mRNA hybrid genes (Table 1).
Figure 1. Can genes be strictly classified as coding and non-coding? A gene with multiple exons can be transcribed and spliced to generate an mRNA, which is translated to a protein that determines its cellular function. In some cases, the mRNAs can additionally have regulatory functions independent of translation. These protein-coding RNAs are classified as cncRNAs (coding/non-coding RNAs). The same gene can also generate long non-coding transcripts by alternative splicing. These lncRNAs can be genuine "non-coding RNAs" with regulatory functions or they can harbor smORFs (small ORFs) which encode micropeptides. Back-splicing can also generate circRNAs (circular RNAs) from several genes. While circRNAs are usually stable non-coding RNAs with regulatory function, recent evidence suggests that at least some of them could be translated.

Examples for functionally characterized hybrid genes transcribing mRNAs and lncRNAs/circRNA
Splicing is often coupled to transcriptional elongation, and the speed of transcription affects the outcome of splicing, resulting in the inclusion of alternative exons or exon skipping [30]. Interestingly, UV irradiation induces alternative splicing by a mechanism which involves RNA Pol II hyperphosphorylation and a subsequent slow-down of transcriptional elongation [31]. A recent study identified that the major splicing change induced by UV is the incorporation of alternative last exons (ALE) proximal to the transcriptional start site, with a general shortening of the transcripts [32]. The ASCC3 gene encodes a helicase involved in DNA repair, which is expressed in the early stages of UV-induced DNA damage and participates in UV-induced transcriptional suppression [33]. UV induces the ALE switch of the ASCC3 mRNA into a shorter lncRNA splice variant, which participates in transcription recovery in the later stages after UV-induced DNA damage [17,32]. This represents a unique case of conditional conversion of a protein-coding RNA into a non-coding RNA, synchronized with the temporal necessities of the DNA damage response.
One of the earliest and best characterized human genes which express functional lncRNA splice variants in addition to protein-coding transcripts is SRA1 (steroid receptor RNA activator 1) [34]. Alternative splicing and the use of differential transcription start sites generate mRNAs and lncRNAs from this gene locus with variations at the 5ʹ end [35]. While the lncRNA transcripts scaffold transcription co-activation complexes together with several nuclear receptors, including the estrogen receptor and PPARγ, the mRNA-derived polypeptide (SRAP) is also a potential transcriptional regulator (reviewed in [36]). The preferential upregulation of the lncRNA isoform in invasive human breast cancer cell lines indicated a role for this splicing switch in malignancy, but later studies also identified high SRAP expression as a poor prognostic factor for survival in a subset of breast cancer patients [35,37]. Furthermore, the mouse homolog of the SRA1 gene displays a similar gene architecture and expresses both coding and non-coding splice variants. Sra1−/− mice displayed defective adipogenesis, but it is not yet clear whether the loss of lncRNA or of protein expression is crucial for this phenotype [38].
Protein Phosphatase 1 Nuclear Targeting Subunit (PNUTS or PPP1R10) was originally described as a protein-coding gene encoding a regulatory subunit of protein phosphatase-1 (PP1). The PNUTS protein suppresses PP1 activity and participates in cell cycle regulation and the DNA damage response [39][40][41]. The PNUTS gene locus is highly conserved between human and mouse and expresses both coding and non-coding transcripts. In the mouse genome, lncRNA-PNUTS is generated by the use of an alternative splice site at the 5ʹ end of exon 12 and the consequent generation of a premature stop codon [42]. Interestingly, siRNA-mediated depletion of the RNA-binding protein hnRNP E1 leads to a splicing switch resulting in the preferential expression of the lncRNA [42]. Moreover, actinomycin-D and cycloheximide, pharmacological modulators of transcription and translation, respectively, also alter the mRNA-lncRNA switch at this gene locus, indicating that this process is highly dynamic. LncRNA-PNUTS seems to regulate epithelial-to-mesenchymal transition (EMT) and cell migration by acting as a competing endogenous RNA (ceRNA) for miR-205, a master regulator of EMT-related transcription factors.
CircRNAs are resistant to exonuclease digestion, which has enabled the enrichment and sequencing analysis of this class of RNAs in several physiological and pathological settings [43]. However, there are only a handful of examples of functionally characterized mRNA-circRNA hybrid genes (Table 1).
Table 1. A list of functionally characterized bifunctional genes expressing non-coding splice variants in addition to mRNA transcripts. Gene names and chromosomal locations of human genes are indicated. The lncRNA/circRNA and mRNA variants are mentioned with comments on their biological functions and related citations.

General themes for the regulatory role of the non-coding transcript variants
The coding-to-non-coding switch in the case of ASCC3 and PNUTS suggests that there could be many more genes expressing inducible non-coding RNAs and that the currently estimated number of lncRNAs could be a substantial underestimate. The SRA1 and ASCC3 genes represent interesting cases of regulatory complexity wherein the non-coding transcript variants of protein-coding genes function as genuine lncRNAs with a scaffolding function attributable to their specific folding and structure. While there could be many such candidate lncRNA-mRNA hybrid genes and interesting regulatory mechanisms waiting to be discovered, some general themes can be applied to explain the functional relevance of this category of alternative splicing products. The major functions attributable to the non-coding transcripts arising from hybrid genes are (i) the regulation of splicing and mRNA processing, (ii) miRNA sponge activity, (iii) competitive RBP-binding/sequestration and (iv) micropeptide/microprotein expression (Figure 2). LncRNAs are active regulators of splicing, and many antisense lncRNA transcripts participate in the alternative splicing of their overlapping protein-coding RNAs [44].
The alternatively spliced non-coding variants could also be involved in the specific recruitment or sequestration of spliceosome factors and RNA binding proteins (RBPs). Major non-coding splice variants arise from the retention of introns [19]. Intron retention has been accepted as a mechanism of gene regulation in lower eukaryotes and plants, while its role is less understood in mammals [45,46]. One major consequence of intron retention is the nuclear accumulation of the resulting transcripts, which may undergo signal-induced posttranscriptional splicing [19,47], (Figure 2(a)). In addition, co-transcriptional back-splicing of circRNAs seems to modulate gene expression by competing with linear-splicing [48]. Interestingly, depletion of general splicing or transcription termination factors seems to enhance circRNA biogenesis from hybrid gene loci [49].
Figure 2. Roles of "non-coding" variants arising from protein-coding genes. (a) Intron retention could lead to the generation of nuclear lncRNA intermediates which can undergo signal-induced posttranscriptional splicing to translatable mRNAs, which are then exported from the nucleus. (b) Some lncRNA transcript variants and circRNAs function as miRNA sponges, facilitating efficient translation of mRNAs. (c) lncRNAs and circRNAs compete with mRNAs for binding to regulatory RNA-binding proteins (RBPs), sequester RBPs and alter the translation and stability of mRNAs. (d) mRNAs and smORF-containing transcript variants could encode full-length proteins and truncated micropeptides, respectively. Some micropeptides could function as miPs (microProteins) and interfere with full-length protein function, owing to their common origins and sequence identity.
One of the most investigated roles of lncRNAs and circRNAs is their ability to function as ceRNAs or miRNA sponges [50][51][52]. The over-representation of ceRNA activity among functionally characterized lncRNAs/circRNAs is mainly due to the availability of sequence analysis algorithms which help in their prediction and the relative ease with which such a function can be demonstrated [53,54]. Nevertheless, these ceRNAs are important for the complex fine-tuning of the proteome by modulating miRNA-mediated gene silencing [55]. Like lncRNA-PNUTS, many of the lncRNA transcript variants could be potential ceRNAs (Figure 2(b)). One intriguing question regarding lncRNA-PNUTS that remains unanswered is why its cognate mRNA does not function as a ceRNA, despite harboring the miR-205 binding sites. It could well be argued that miRNA binding sites present within the coding region cannot efficiently perform the sponge function due to altered RNA secondary structure or ribosome occupancy. If this is true, the non-coding splice variants of coding transcripts could act as professional and specific miRNA scavengers. It is known that certain pseudogenes/lncRNAs impart miRNA resistance upon their functional protein-coding counterparts by acting as specific ceRNAs [56,57]. Due to their long half-lives, circRNAs are very efficient ceRNAs, and there are many mRNA/circRNA-expressing bifunctional genes where the protein expression is fine-tuned by the cognate circRNA [52]. In a similar manner, the lncRNA splice variants may also act as suppressors of miRNA activity on their cognate mRNAs. Alternative splicing is known to modulate miRNA-dependent gene expression by generating transcripts with variable 3ʹ ends with and without miRNA binding sites [58]. The generation of splice forms which lack protein-coding potential could be an additional level of complexity.
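The sequence-based prediction of candidate miRNA sponges mentioned above can be sketched in its simplest form as a search for seed matches. The sketch assumes the common convention that the seed comprises nucleotides 2-8 of the mature miRNA and that a candidate site is a perfect reverse complement of this seed; the function names and both toy sequences are hypothetical and are not the real miR-205 or PNUTS sequences. Published prediction tools consider further features, so this is an illustration of the principle only.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq: str) -> str:
    """Reverse complement of an RNA sequence (A/U/G/C only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def seed_site(mirna: str) -> str:
    """Target-site motif complementary to miRNA nucleotides 2-8
    (assumed seed definition)."""
    mirna = mirna.upper().replace("T", "U")
    return reverse_complement_rna(mirna[1:8])


def count_seed_sites(transcript: str, mirna: str) -> int:
    """Count perfect seed matches (overlaps allowed) in a transcript."""
    transcript = transcript.upper().replace("T", "U")
    site = seed_site(mirna)
    count = 0
    pos = transcript.find(site)
    while pos != -1:
        count += 1
        pos = transcript.find(site, pos + 1)
    return count


# Hypothetical toy sequences, invented for illustration only.
toy_mirna = "UACGGAUCCAUGCUAGCUAGCU"
toy_lncrna = "AAAGAUCCGUAAACCCGAUCCGUAAA"
print(seed_site(toy_mirna))
print(count_seed_sites(toy_lncrna, toy_mirna))
```

A count of this kind says nothing about whether a site is accessible or occupied by ribosomes, which is the distinction raised above for the coding and non-coding PNUTS transcripts.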
The non-coding variants of hybrid genes may act as competitive interactors of RBPs, thus sequestering them from their target mRNAs (Figure 2(c)). Such regulation is relevant in the context of pre-mRNA splicing as well as mRNA translation and decay. For instance, binding of circPABPN1 to ELAVL1/HuR prevents HuR from binding to the linear PABPN1 mRNA, resulting in the suppression of PABPN1 translation [59]. Competitive RBP-binding by non-coding variants could also modulate the miRNA-RBP cross-talk [60]. In addition to circRNAs generated by simple back-splicing of exons, sequencing data also confirm the existence of exon-intron circular RNAs (EIciRNAs) and circular intronic RNAs (ciRNAs) [61,62]. These intron-retaining circRNAs promote transcription of their host genes in cis by associating with RBPs and RNA Pol II [58,59].
It is increasingly being accepted that a large proportion of transcripts annotated as lncRNAs harbor smORFs and express micropeptides which may be stable enough to be functional [63][64][65]. The majority of the lncRNA products from bifunctional or hybrid genes could be of this category and may express micropeptides which are functionally related to the proteins encoded by their canonical mRNAs. Translatable exonic circRNAs also very often generate polypeptides which resemble truncated versions of the proteins translated from their cognate mRNAs [66]. By analogy to microRNAs, microproteins (miPs) are defined as small (< 120 residues), single-domain polypeptides which exert a dominant negative function by heterodimerizing with their homologous target proteins to form non-functional protein complexes [67,68]. Many of the non-coding splice forms originally predicted to be NMD targets may be actively translated, providing miPs which interfere with the biological functions of the full-length proteins encoded by their mRNA splice variants (Figure 2(d)). The short ORF database (http://www.sorfs.org) lists close to 2 million smORFs in the human transcriptome [69]. Interestingly, the database, generated by extensive filtering of ribosome profiling data, assigns half of these smORFs to genes which are already annotated as "protein-coding". If this is true, it would mean that there are close to 1 million peptides of 10-100 residues in human cells which are generated by alternatively spliced transcript variants. More importantly, around 200,000 of them show no overlap with other known ORFs, suggesting that they are unique polypeptides generated by the use of alternative exons and/or frames of translation.

Outlook
The general skepticism towards concepts which deviate from the central dogma of molecular biology was the main reason for dismissing the non-coding regions of the genome as "junk DNA". We have come a long way since then, and the non-coding transcripts are now receiving the attention they deserve. However, the lack of conserved sequences/domains and the limited predictability of non-coding RNA functions are causes for concern. The predictability of function is the main reason why antisense transcripts and ceRNAs dominate the list of functional lncRNAs/circRNAs in the literature. The same will hold for the specific set of non-coding transcripts discussed here. Novel hybrid genes and inducible ncRNA-mRNA switches could be revealed by studies which couple functional screens with splice-form-specific expression analysis. However, transcript variants with ceRNA/miRNA sponge function will be easier to identify and investigate. In this context, it should be noted that there is general skepticism regarding the ceRNA hypothesis [70]. The ceRNA hypothesis states that pseudogenes, lncRNAs and circRNAs compete with mRNA targets for binding to miRNAs, thus suppressing miRNA function [71]. Many of the past studies describing miRNA sponges have relied on over-expression approaches, and whether those lncRNAs can function as ceRNAs at physiologically relevant copy numbers is debatable. However, functional endogenous miRNA sponges have been reported, and large-scale studies have provided support for the existence of extensive ceRNA regulatory networks [52,55,72].
The distinction between the coding and non-coding transcriptome is slowly being blurred by the emergence of micropeptide-encoding smORFs. The huge potential of hybrid gene-derived lncRNAs for generating diverse micropeptides is underlined by the sORF database entries. Do these riboseq-derived smORFs really encode micropeptides? How many of them give rise to physiologically relevant micropeptides or miPs? Technical advancements in the specific and quantitative analysis of micropeptides will be key to filtering out the noise in these datasets. Another question is whether the transcripts predicted to be NMD substrates are genuine targets of NMD. Some RNAs which are indeed NMD targets could well have cancer-specific implications, as NMD is known to be inhibited in cancer [73]. Translatability of the circRNAs is also a variable factor in the analysis and functional characterization of bifunctional genes. Methylation, the presence of ORFs and additional structural elements like IRES (internal ribosome entry sites) seem to regulate circRNA translation, and algorithms have already been established to evaluate their translation potential [27][28][29][66,74]. In addition to the native mRNA-lncRNA and mRNA-circRNA hybrid genes, splice-site mutations in cancer can also potentially generate pro-oncogenic non-coding transcript variants. Identifying genuine hybrid genes and categorizing their potential non-coding transcripts into micropeptide encoders, potential ceRNAs and other classes will be the way forward.

Notes on contributor
Sonam Dhamija
Both authors contributed equally to the drafting and revision of the manuscript and approved it for publication.