Background splicing and genetic disease

We report that low level background splicing by normal genes can be used to predict the likely effect of splicing mutations upon cryptic splice site activation and exon skipping, with emphasis on the DBASS databases, BRCA1, BRCA2 and DMD. In addition we show that background RNA splice sites are also involved in pseudoexon formation, recursive splicing and aberrant splicing in cancer. We discuss how background splicing information might inform splicing therapy.


Introduction
We previously established that cryptic splice sites (css) are already active, albeit at very low levels, in normal genes. We did this by using EST data to identify rare splice sites and then comparing their positions to known css that are activated in human disease (1). However, this approach was limited to a minority of genes for which there was sufficient EST sequence data. Since that time a large amount of RNA-sequencing data has been deposited, which we reasoned would strongly increase the power of css prediction. In support of this, RNA sequencing studies have shown that normal splicing is accompanied by a background of low level or noisy splicing between a large number of hidden splice sites within introns and exons (2).
The snaptron database (http://snaptron.cs.jhu.edu/) lists all of the RNA-seq reads from over 70,000 human samples that were most probably generated by splicing (3). Here we compare the snaptron database to cryptic splice site and exon skipping databases (4,5) and conclude that background splicing reads give a good indication of the likely effect of splicing mutations upon exon skipping as well as css activation. Further comparisons show that background splice sites in normal human genes are also informative about pseudoexon formation, recursive splicing, aberrant splicing in cancer and splicing therapy.

Materials And Methods
References to experimental reports of splice site mutations that cause aberrant splicing of BRCA1, BRCA2 and DMD were obtained from the database of aberrant splice sites (DBASS), the human genome mutation database (HGMD), the Leiden Open Variation Database online (LOVD) and by searching PubMed (4,6,7).
We also analysed splicing mutations that cause a wide range of medical syndromes by using DBASS and exon skipping databases (4,5). Databases of aberrant splicing in cancer and of recursive splicing were also analysed and are described in the text. We used the HGMD http://www.hgmd.cf.ac.uk/ac/index.php and LOVD websites https://databases.lovd.nl/shared/genes (6,7) to clarify the exon nomenclature used in original reports and the BLAT tool (8) from the UCSC website http://genome.ucsc.edu/ (9) to obtain genome reference numbers for relevant splice sites.
We then compared the above experimental databases of aberrant splicing to the Snaptron database (3), which lists all RNA sequences from over 70 000 human samples that were most probably generated by splicing. Snaptron lists major splicing events between intronic ss and alternative splice sites (ass) but there are many more examples of splicing events with much lower read numbers that are referred to here as background splicing (bss) events. Background splicing may occur between 5' or 3' intron ss with bss that are not the normal intron partner (Fig 1B,C) but also occurs within introns and exons or across these boundaries. The splice sites listed in snaptron are all canonical, GT or GC for 5'ss or AG for 3'ss. Possible non-canonical ss were filtered from snaptron in order to avoid including mRNA deletions that were not generated by splicing (3).
The experimental datasets were compared to the snaptron database by a method that is explained in Figure 1 and Table 1 for BRCA1 but also applies to all other analyses. Fig 1D illustrates that mutations of intron splice sites typically activate cryptic splice sites (css) or cause exon skipping or both (4,5). Cryptic splice sites are dormant splice sites that are activated to very high levels following the mutation of intronic splice sites, and usually occur within 1000bp of the mutated ss (4). DBASS also lists mutations that create de novo splice sites (also known as de novo css) or pseudoexons. De novo ss are new ss that are directly generated by mutations and then compete for splicing with an intron ss. De novo mutations can exist on their own or may also activate pseudoexons (Figure 2A,B).
Snaptron has four different RNA sequencing databases that can be analysed. SRAv1 (hg19) and SRAv2 (hg38) are from the sequencing read archive at NCBI and contain 41M and 83M splice junctions identified by sequencing, respectively. There are also two smaller databases, TCGA (hg38) and GTEx (hg38), with 37M and 29M junctions (3). We largely analysed SRAv1 because this was the first available database and it also helped with comparisons to other hg19 databases. SRAv1 is still available but has now been superseded by SRAv2.
We downloaded snaptron data for individual genes and arranged it on worksheets in a manner that allowed us to identify background splicing events that might be activated by splicing mutations, as illustrated in Fig 1D. A protocol for this is given below for BRCA1 but is applicable to all genes. BRCA1 splicing data was downloaded from snaptron by using the link http://snaptron.cs.jhu.edu/srav1/snaptron?regions=brca1. RNA splicing data for any other gene can be obtained by changing brca1 to the required gene name, ie http://snaptron.cs.jhu.edu/srav1/snaptron?regions=dmd. To access the other spliced RNA databases of snaptron, srav1 can be changed to srav2, gtex or tcga. Downloaded snaptron data was then selected, copied and pasted into the spreadsheet LibreOffice Calc. We chose paste special, unformatted text, then the UTF-16 and tab options. Sheet 1 in the spreadsheet was copied to two further worksheets. For sheet 3 we sorted the data from the highest to the lowest sequencing reads in column O (with the extended data option). We then copied the top rows with the highest sequencing reads into sheet 4 (the more rows that are chosen from sheet 3, the greater the number of minor alternative splicing events that can be seen). In sheet 4 it is important to take note of the gene direction along the chromosome, indicated by - or + in column G (strand). If column G is + then the 5'ss are listed in column D (and the 3'ss in column E); however, if column G is - then the 5'ss are listed in column E and the 3'ss in column D. The splice sites were then ordered in a 5' to 3' direction in sheet 4, either by ordering from low to high for column D (for those genes on the + strand) or from high to low for column E (for those genes on the - strand). Worksheets 1 and 2 were used separately to order the large numbers of 5'ss or 3'ss from snaptron in a 5' to 3' direction.
By this means 5' or 3' intronic ss and major alternative ss can be identified from sheet 4 (see Fig 1A) and all of their partner background ss can be identified from either sheet 1 or from sheet 2 (see Figure 1B,C). Occasionally, column G of worksheet 4 contained rows with both - and + values, due to overlapping transcripts from both strands. The required transcript usually has the greatest number of reads and can also be identified from the UCSC genome browser.
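The spreadsheet protocol above can also be sketched in a few lines of Python. This is only an illustrative sketch, not the method used for the analyses reported here: it assumes that the downloaded file is tab-separated and that the spreadsheet columns D, E, G and O described above (start, end, strand and reads) correspond to zero-based field positions 3, 4, 6 and 14. The function names are hypothetical.

```python
BASE = "http://snaptron.cs.jhu.edu"

# Assumed zero-based positions of spreadsheet columns D, E, G and O.
COL_START, COL_END, COL_STRAND, COL_READS = 3, 4, 6, 14


def snaptron_url(gene, database="srav1"):
    """Build the download link; database may be srav1, srav2, gtex or tcga."""
    return f"{BASE}/{database}/snaptron?regions={gene}"


def parse_junctions(tsv_text):
    """Turn tab-separated junction lines into dicts, skipping header lines."""
    rows = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) <= COL_READS or not fields[COL_START].isdigit():
            continue  # header, comment or truncated line
        rows.append({
            "start": int(fields[COL_START]),
            "end": int(fields[COL_END]),
            "strand": fields[COL_STRAND],
            "reads": int(fields[COL_READS]),
        })
    return rows


def splice_sites(row):
    """Return (5'ss, 3'ss): columns D, E on the + strand, E, D on the - strand."""
    if row["strand"] == "+":
        return row["start"], row["end"]
    return row["end"], row["start"]


def top_junctions_5_to_3(rows, n, strand):
    """Equivalent of sheet 4: keep the n junctions with most reads on one
    strand, then order them in a 5' to 3' direction."""
    same_strand = [r for r in rows if r["strand"] == strand]
    top = sorted(same_strand, key=lambda r: r["reads"], reverse=True)[:n]
    if strand == "+":
        return sorted(top, key=lambda r: r["start"])
    return sorted(top, key=lambda r: r["end"], reverse=True)
```

Filtering on a single strand before sorting is one simple way to handle overlapping transcripts from both strands; choosing the strand with the most reads would be an equally reasonable alternative.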
It is also useful to identify all of the partners of the mutated splice site at the start of the analysis in case this site is normally involved in alternative splicing. Major alternative splicing can be seen in sheet 4 of the spreadsheet and minor alternative splicing can be identified by looking at the read numbers for the partner sites of the mutated splice site (in sheet 1 or 2). If there is little or no alternative splicing the analysis proceeds as outlined in Figure 1D. If there are also alternative splice sites then these should be analysed alongside the main partner ss, as outlined in Figure 1D, because these alternative splice sites may also participate significantly in aberrant splicing events. A splicing mutation of LAMP2 is a good example of this (Appendix 1).
By unrestricted search (see text) we mean searching snaptron for reads for the genome reference number of a putative splice site irrespective of its splice partners.
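As a minimal sketch of what an unrestricted search amounts to, the function below collects every junction that uses a given genome reference number at either end, whatever its partner. It assumes junctions are held as dicts with hypothetical keys start, end and reads; this representation is an illustration only and is not part of snaptron.

```python
def junctions_with_site(rows, site):
    """All junctions that use `site` as either end, irrespective of partner."""
    return [r for r in rows if site in (r["start"], r["end"])]


def total_reads_for_site(rows, site):
    """Sum of reads over every junction that involves `site`."""
    return sum(r["reads"] for r in junctions_with_site(rows, site))
```

A restricted search, by contrast, would keep only those junctions whose other end is the normal intron partner of the mutated splice site.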

BRCA1
We initially analysed BRCA1 as proof-of-principle because its mutational landscape in cancer is well described and includes splicing mutations that have been repeatedly analysed (10,11). We first downloaded the RNA-seq data for BRCA1 from snaptron into a spreadsheet (see Materials and Methods). This spreadsheet lists over 6000 differently spliced transcripts of BRCA1, although the large majority of these are background splicing events that are only supported by very low reads. Fig 1A lists the splicing events with the highest reads; these include intron removal and major alternative splicing events. At least 8 isoforms of BRCA1 have been identified (12) and the major alternative splice sites (but not the isoforms) can easily be identified in Figure 1A (shaded).
Figures 1B, C and D illustrate how background RNA sequencing data can be used to predict css or exon skipping events that are likely to result from splicing mutations of BRCA1 or any gene. Fig 1B examines the theoretical effect of mutation of the BRCA1 intron 5'ss 41222944 (shown in red). This might be expected to enhance the use of alternative 5'ss partners for the non-mutated 3'ss 41219713, as illustrated in Figure 1D. Figure 1B lists the 5'ss partners for 3'ss 41219713 that have been identified in snaptron. As expected there are a large number of reads (148299) for splicing between 3'ss 41219713 and its normal 5'ss partner 41222944 of BRCA1 (blue shading). Other 5'ss partners of the 3'ss 41219713 are also used but at much lower background levels in wild type BRCA1 transcripts. These include single and multiple exon skipping events (yellow shading) between the 3'ss 41219713 and the 5'ss of other upstream introns (compare Figure 1A and B). In addition there are 2 reads for a rare splicing event between 3'ss 41219713 and an exonic 5'ss that is located -93 bases upstream of the normal 5'ss 41222944 and further low level reads for seven background 5'ss that are located downstream within the intron.
Mutation of BRCA1 5'ss 41222944 is known to activate a css at +69 (13)(14)(15) or at +65 (16). These two css exactly match the bss with the most supporting reads (Figure 1B, red shading). The background splicing information is therefore a good match to the slightly different experimental results of both groups.
Similarly, Figure 1C examines the possible effect of mutation of the 3'ss 41203135 (red shading) by showing all of the splicing events involving its normal partner 5'ss 41209068, as illustrated in Figure 1D.
Mutation of 3'ss 41203135 is known to activate exon skipping between the normal partner 5'ss 41209068 and the downstream intronic 3'ss 41201212 plus weaker activation of the 3'css 41203127 (13,14). Figure 1C shows that these two splicing events have the most reads of the background splicing events involving the 5'ss 41209068 of the wild type BRCA1 gene.
The data for Figures 1B and C is summarised in Table 1 (rows 13 and ...). Table 1 shows that 15 of these css exactly match bss of wild type BRCA1; the two exceptions are shaded in column 3 and discussed in Table S1, which also provides references. Twelve of the 15 bss that match css have the highest reads of all candidate bss, as listed under column 4 and as illustrated in Figs 1B, C. Sites that are candidates for css activation are defined here as bss within 1000 bases of the intronic ss that is mutated (see Discussion).
Many of the splice site mutations of BRCA1 in Table 1 activate exon skipping rather than css and eight of the splice site mutations do both (Table 1, column 2). The ratio of css reads to exon skip reads from the background RNA sequencing data (Table 1, columns 5,6) appears to correlate with the experimental finding of whether splice site mutations activate css or exon skipping. There are six exceptions to this that are shaded as pairs in columns 5 & 6 and are discussed (Table S1). Also shaded are some possible false positive bss reads for both css activation (column 5 rows 5, 24, 31 and 35) and for a double exon skip (column 7 row 16), see Table S1 and Discussion. This data (Table 1) suggests that the effect of splice site mutations upon css activation and even exon skipping can be inferred from background splicing data. In order to test this hypothesis we undertook analyses of further experimental databases that include over 300 medical syndromes caused by splice site mutations.
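The selection of candidate css used for Figure 1B and Table 1 can be sketched as follows. The sketch assumes that junctions are held as dicts with hypothetical keys start, end and reads, and that offsets are counted in the direction of transcription, so that a positive offset lies downstream of the mutated splice site on either strand. It illustrates the 1000 base definition given above and is not the procedure used to build Table 1.

```python
def css_candidates(rows, mutated_site, partner_site, strand, window=1000):
    """Rank background ss that splice to the non-mutated partner site and lie
    within `window` bases of the mutated intronic ss, most reads first."""
    candidates = []
    for r in rows:
        if partner_site not in (r["start"], r["end"]):
            continue  # junction does not involve the non-mutated partner
        other = r["end"] if r["start"] == partner_site else r["start"]
        # Signed distance from the mutated site in transcript orientation.
        offset = other - mutated_site if strand == "+" else mutated_site - other
        if offset != 0 and abs(offset) <= window:
            candidates.append({"site": other, "offset": offset, "reads": r["reads"]})
    return sorted(candidates, key=lambda c: c["reads"], reverse=True)
```

Junctions that fall outside the window, such as those that join the partner site to the splice site of another intron, are left out of the candidate list and can be scored separately as exon skipping events.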

DBASS, BRCA2 and DMD
We next compared the snaptron database with the database of aberrant splice sites (DBASS). DBASS lists the experimental results for splicing mutations that cause a wide range of human genetic diseases (4). Table 2 is a summary of Table S2, which is an index of all of the splicing mutations in DBASS, and shows that the DBASS mutations are subdivided into those that activate aberrant 5' or 3' splice sites (DBASS5 and DBASS3) and that the most common mutations activate css but can also generate de novo css or pseudoexons.
We first compared the DBASS5 experimental results for 5' css activation with the snaptron RNA splicing data. Table S2 shows how 199 of the 459 mutations in DBASS5 that activate css were systematically chosen to cover every listed medical syndrome. We generated similar tables of background splicing to those illustrated in Fig 1A,B,C for each of the 199 mutations and compared these with the experimental results. Each analysis is summarised in single rows in Table S3 sheet 1. The background splicing tables (see Fig 1B or C) are not shown but the key results are recorded in Table S3 and the raw data can easily be generated as described in Materials and methods. Table 3 row DBASS5 summarises Table S3 sheet 1 and shows that 201 out of 237 (85%) of the 5'css identified by experiment (some mutations activate more than one css) exactly match bss in snaptron and are therefore already in use at low levels by normal genes. 150 out of 201 (75%) of the bss that match the position of css have the greatest number of supporting reads compared to other bss (Table 3). The reason why 15% or so of the experimentally identified 5' css or 3'css did not match a background ss was usually because there were no background ss reads for comparison (Table S3). Where background ss data was available, we found that background ss did not match the experimentally reported 5' or 3' css in only 2 to 3% of cases, listed as poor matches in Tables 3 and S3. Table 3 also includes summaries for similar analyses of BRCA1 (Table 1), BRCA2 and DMD (Tables S4, S5). DBASS5* and DBASS3* of Table 3 summarise an analysis of a subcategory of css that are activated by mutations that occur outside the highly conserved regions of the normal 5' or 3'ss (Tables 2, S2, S6). The activated css of DBASS5* and DBASS3* tend to match bss with particularly high reads (Table S6, see Discussion).
Overall the very large majority of css originate from bss (see Discussion) and usually the bss that is activated is the one with the most reads relative to other bss (Table 3).

Exon skipping
We next asked whether background splicing data can indicate whether splice site mutations might cause exon skipping rather than css activation. Some of the papers referenced in DBASS report whether or not exon skipping accompanied css activation (Table S3, column N). Table 4 column 1 summarises that there are 39 reports of both exon skipping and css activation and 71 reports of css activation only for the 5'ss mutations analysed in Table S3. For the reports of css activation only, the total number of background single exon skip reads from the 71 examples is 6621, which is much smaller than the total background skip reads (251128) from the 29 reports of both css and skip activation, so confirming the correlation seen for Table 1. Similar results were found for DBASS3 (Table 4). Table 4 also summarises an analysis of a second database of splicing mutations (Tables S7, S8) that generally cause exon skipping rather than css activation (5). Table 4 shows that we analysed 79 experimental reports of 5'ss mutations that cause exon skipping only. Of these, 71 examples have higher background splicing reads for exon skipping than reads for potential css (background ss within 1000 bases of the intronic ss). Conversely, the 71 experimental reports in DBASS5 of 5'ss mutations that only caused css activation (column 1, line 5) had higher reads for the css than for background exon skipping (Table 4). Overall these results confirm that the likely effect of splicing mutations upon css activation or exon skipping can in general be inferred from their background splicing ratios. The exceptions to this general finding are shaded in Table 4 and discussed in more detail in Tables S3 and S7. This analysis shows that when the background reads for single exon skipping are greater than the background reads for any candidate css then exon skipping preferentially occurs in response to a splice site mutation (Fig 1D).
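The comparison described above reduces to a simple rule, sketched here under the assumption that the read counts for the candidate css (background ss within 1000 bases of the mutated ss) and for the single exon skip have already been collected. The function name and labels are hypothetical, and the rule indicates only the more likely outcome: as noted above, some mutations activate both events and there are exceptions.

```python
def predict_outcome(css_reads, skip_reads):
    """Compare the best-supported candidate css with the single exon skip.

    css_reads: list of background read counts, one per candidate css.
    skip_reads: background read count for the single exon skip.
    """
    top_css = max(css_reads, default=0)
    if skip_reads > top_css:
        return "exon skipping"
    if top_css > skip_reads:
        return "css activation"
    return "undetermined"
```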

Multiple exon skipping
Table 5 lists all experimental reports of multiple exon skipping events that we found and compares these to the background splicing reads from snaptron. We also included experiments that did not detect the multiple skipping events indicated by snaptron but used RT-PCR primers that were capable of doing so (rows 33 to 42). We did not include predictions of multiple exon skipping from snaptron where experiments were restricted to single skip analyses. The first three examples are taken from a report about the LAMP2A, B and C variants, which are generated by alternative splicing from a common 5'ss and three alternative 3'ss (17). The authors report that the same mutation of the common 5'ss has different effects upon single or double exon skipping by each 3' alternative ss. It can be seen that these differences in skipping correlate well with the relevant background splicing reads (Table 5, Appendix 1). Other notable features of Table 5 include rows 8, 12, 13, 15, 18 and 30, where there is some but not exact agreement between the experimental results and background splicing reads. There are also six css listed that did not match snaptron background ss reads. For the css of row 5, snaptron has no splicing variants with which to compare and for row 2 the css has a non-consensus sequence, which is filtered from snaptron (3). The other four non-matching css are discussed at the bottom of the source tables. This analysis shows that high background reads for multiple exon skips are a good indication that these events will occur in response to splice site mutations.

De novo ss and pseudoexons
Table 6 summarises our comparison of the snaptron database with mutations in DBASS that generate de novo splice sites (also known as de novo css) or pseudoexons (Table 2). Here we have divided the de novo mutations into two types, created or enhanced. Created refers to a mutation that creates the GT or GC dinucleotides of a 5' de novo ss or that creates the AG dinucleotide of a 3' de novo ss. Enhanced refers to mutations that enhance already existing GT, GC or AG dinucleotides. As expected none of the 34 and 123 created de novo ss of DBASS5 or DBASS3 match bss in snaptron (Table 6, row 1). Even if there were reads for the original dinucleotide these would have been filtered from this database (3). There are 95 reports of mutations that enhance de novo ss in DBASS5 (Table 2) and we analysed the first 40 medical syndromes caused by this mutation type and report that 29 of these de novo css positions exactly match bss from snaptron (Table 6, row 1). Similar results were found for mutations that generated 3' de novo ss (row 2), although a far bigger proportion of the mutations created an AG dinucleotide splice site rather than enhanced existing AG sites.
Pseudoexons are most commonly generated when a mutation that creates a 5' or 3' de novo ss also causes the activation of a partner pseudoexon ss (Fig 2A,B). The 5' and 3' de novo ss that initiate pseudoexon formation matched background ss at a similar level to the de novo mutations only (Table 6). For the 3'pss that partner the 5' de novo mutations, there is a match of 59 out of 71 3'pss with background ss (Table 6). Of the twelve 3'pss that did not have a match in snaptron, ten were partnered to 5' de novo sites that were created from non-GT or non-GC dinucleotides (Table S9). Table S9 also describes that 54 out of the 59 3'pss that matched background ss were the nearest upstream background 3'ss to the downstream mutation that created the de novo 5' pss. Four of the remaining five 3'pss were only marginally more distant from an inner background 3'ss and all had far more reads than the inner 3'bss (Table S9). Table 6 row 4 shows that a smaller proportion of 5' pss matched background ss (10/22). In all cases the matching bss are the nearest of all bss to the upstream 3' de novo ss mutation (Table S9).
Pseudoexons that were created by means other than de novo css mutations (Fig 2C) had the best match to bss (Table 6 row 5, Fig 2C). These were mainly mutations within the pseudoexon, some of which are known to create splicing enhancers, but also included five mutations outside the pseudoexon that enhance the polypyrimidine tract or the branch point recognition sites for the 3'pss. In addition, some of the pseudoexons were activated by mutations of flanking 5' or 3' ss (Table S9). 25 out of 26 pairs of these pseudo splice sites matched background ss in snaptron and 48 of these 50 pss matched bss with the highest reads of all background ss within the intron in which the pseudoexon was formed (Table S9).

Spliceosome mutations and cancer
Mutations of the spliceosome, in particular of SF3B1, have been reported to activate novel aberrant splicing events in leukaemia and other cancers (18)(19)(20). We report that all tested novel cancer ss caused by SF3B1 mutations matched background ss in snaptron with relatively high read numbers (Table 7, Table S10). Nevertheless, the background read numbers for the aberrant 3' or 5' css that are activated by spliceosomal mutations are still in the order of 1000 fold less than the reads for normal intron removal, as indicated in the last column of Table 7 (see also Table S10). By contrast, the rarer exon skipping or exon inclusion events that are enhanced by SF3B1 mutations have background splicing reads only 20 to 40 fold lower on average than normal intronic splicing (Table 7).
Mutations of the splicing components U2AF and SRSF2 are reported to cause quantitative rather than qualitative changes in splicing (21,22), whereas mutations of the small non-coding RNA U1 are reported to activate novel splicing events in SHH medulloblastomas (23). However, we found that 23 out of 24 of the most novel aberrant splice sites caused by U1 mutations matched background splice sites, including matches to aberrant splice sites for PTCH1, GLI2, CCND2 and PAX5, which are implicated in this cancer (Tables 7, S10). Sixteen out of the 23 css caused by U1 mutations matched background ss within the top three background reads (Tables 7, S10).

Recursive splicing
Large introns are removed in sections by a process called recursive splicing that uses internal splice sites within introns (24)(25)(26)(27). We analysed the first 20 of over 2000 recursive splice sites discovered by a screen of the human genome (25). Tables 8 and S11 show that all of these sites matched background ss, as would be expected (26), and that in 12/20 cases the matching background ss had the highest reads of all bss within an individual intron. Similarly, Tables 8 and S11 show that 82 and 86% of 5' and 3' recursive splices identified in human DMD introns (27) matched background ss and that the background ss with the most reads matched 3' and 5'RS from DMD introns on 23/34 and 26/36 occasions.

Discussion
We report that the large majority of css that are activated in genetic disease are already in use at low levels by normal genes and are therefore a component of background or noisy splicing (1,2). Our results also indicate that the likely effect of splice site mutations upon css activation or exon skipping (Figure 1D) can often be discerned from the pattern of background splicing by normal genes. For example, the css that are experimentally reported often correspond to the background ss within 1000 bases of the mutated intronic splice site that have the most reads (Tables 3, S3, Fig 1D); when exon skipping is caused by a splice site mutation this usually correlates with higher background reads for skipping compared to candidate css reads (Table 4). Table 5 shows that the experimental reports of multiple exon skipping caused by splicing mutations also correlate reasonably well with background splicing reads. Consequently an initial consideration of background splicing is likely to give a useful indication of the primer design required to investigate the full effect of a potential splice site mutation.
It should be noted that this paper does not contribute to the large body of work designed to assess whether mutations at splice sites or outside these regions are likely to impair splicing (28), nor is it informative about intron retention, which is a common aberrant effect caused by splice site mutations.
We generally restricted our css candidates to background ss within 1000 bases of the mutated canonical ss (Figure 1D), because this is observed experimentally (4) (Table S2). However, many background ss are greater than 1000 bases from a canonical ss and in about 10% of introns, these sites have the highest number of reads (Table S7). We show here that some of these background ss have facilitated pseudoexon formation (Table 6) and that some are recursive splice sites (Table 8). Sibley et al (2015) previously established that recursive splice sites and recursive exons can be identified from RNA seq data (26).
About fifteen percent of css from the cryptic splice site database DBASS did not match background ss in the snaptron database SRAv1 (see below), usually because in these cases snaptron has no or relatively few background splice site reads with which to compare. Therefore the percentage of css that do not match bss is likely to fall as RNA sequencing databases increase in size.
There might however be a higher level of false positives, ie bss within 1000 bp of a splice site mutation that are not activated as css. The numbers in brackets in column 5 of Table 1 (rows 5, 24, 31, 35) show the reads of top bss that were not activated as css, despite having higher read numbers than those bss that matched the css. Of course these non-matching bss might be identified as css in subsequent experiments; Figure 1B provides an example of this. However, for BRCA2, two out of six top bss that were not activated as css following the mutation of 6 different splice sites (out of a total of 40) have been repeatedly analysed (Table S4).
The upper limit of top bss reads that might be false positives can be estimated from Table 3 as the proportion of css that matched bss that did not have the highest reads. For DBASS5 this is 51/201 (25%) and for DBASS3 35/97 (36%). Table S3 column T gives all of the details and also indicates those nonmatching bss with markedly higher reads than the bss that match css. This gives a false positive estimate of 22 out of 201 (11%) for DBASS5 and 19 out of 97 (20%) for DBASS3. An estimate of the level of possible false positives can also be made from Table 4 where 8 out of 79 reports (10%) of 5' exon skipping only, nevertheless have higher background reads for candidate 5' css than for the exon skips (for example Table 1 row 9). Similarly there are 10 out of 64 reports (16%) of likely false positive 3'css candidates (Table S7 sheet 2).
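The percentages quoted in this paragraph follow directly from the stated counts. The short check below recomputes them; the counts are taken from the text above and only the variable names are introduced here for illustration.

```python
def percent(numerator, denominator):
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# css that matched a bss, per database (Table 3).
matched = {"DBASS5": 201, "DBASS3": 97}
# Matched css whose bss did not have the highest reads (upper limit).
not_top = {"DBASS5": 51, "DBASS3": 35}
# Non-matching bss with markedly higher reads (Table S3 column T).
markedly_higher = {"DBASS5": 22, "DBASS3": 19}

upper_limit = {db: percent(not_top[db], matched[db]) for db in matched}
refined = {db: percent(markedly_higher[db], matched[db]) for db in matched}

# Exon skipping only reports with higher reads for a candidate css.
skip_only_5ss = percent(8, 79)
skip_only_3ss = percent(10, 64)

print(upper_limit, refined, skip_only_5ss, skip_only_3ss)
```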
For multiple exon skipping, we suggest that the level of false positives indicated by Table 5, ten out of 42, is an upper limit. We included these ten examples because the RT-PCR primers that were used were capable of but did not apparently detect the multiple exon skips indicated by snaptron (Table 5). However, there may be other reasons why some of these skipping events, if they occurred, might not have been reported.
Six mutations listed in DBASS5 and two mutations from DBASS3 (Table S2) generate more complicated patterns of aberrant splicing than those illustrated in Figure 1D. Because most of these examples did not readily fit the format of Table S3 they are separately analysed and discussed in Appendix 1.
We largely analysed background splice sites that splice to intron 5' or 3' ss (Figure 1D). Consequently background splicing events within introns and across intronic ss were not usually considered, in order to remove less relevant background splicing. We rescreened the 15% of experimentally identified css from DBASS for which we found no match to a background ss (Table 3) without this restriction and only found one clear example that we had missed by our approach (Table S3 5'css, PKP1).
Unrestricted screening increased the bss match to enhanced 5' de novo css from 29/40 to 37/40 and from 8/10 to 10/10 for enhanced 3' de novo css (Tables 6 rows 1 and 2, Table S9, sheets 1 and 2, column O). So an unrestricted screen for background splice site matches is perhaps best for assessing whether a mutation might generate a de novo css. In addition, there are many excellent in silico programmes that can already do this effectively (29,30).
There was a better match between background ss and css than between background ss and de novo css (Tables 3 and 6). In large part this is because possible reads for de novo css that originate from non-canonical splice sites are filtered from snaptron. But also the read numbers of the background ss that matched css were relatively higher than background ss that matched de novo sites (Tables S3, S9).
De novo 3' or 5' css mutations that are more distant from an intron ss often activate partner pseudo splice sites (4). For such cases we found that there was a 59/71 match between 3'pss and background ss and a lower match (10/22) between 5'pss and background ss (Table 6). The 3' and 5'pss usually matched background ss nearest to the de novo mutation (Table S9).
Pseudoexons that were not initiated by de novo ss mutations (Fig 2C) usually matched background splice sites in the host intron with the highest reads (Table 6, Table S9). Similarly, css that are activated by mutations outside the core sequence of a canonical splice site often matched background ss with relatively high reads (Table 3). We suggest that for both of these different types of aberrant splicing events, the causative mutations are relatively weak and consequently may only have a phenotypic effect through enhancement of relatively active bss.
Snaptron has four different RNA sequencing databases that can be analysed (3) (see Materials and methods). We largely analysed the first database SRAv1 but as a control we also analysed BRCA1 and BRCA2 splicing mutations using the smaller GTEx and larger SRAv2 databases (Table S12). We found 35 experimentally reported css from both BRCA1 and BRCA2 of which 29 match bss listed in SRAv1 (Table S12). Use of the larger SRAv2 database increased the number of matches to 32/35, whereas the smaller GTEx database had only 18/35 matches (Table S12). GTEx RNA is made from normal tissue samples and Table S12 shows that the ratio of intron to css reads for each of the css of BRCA1 and BRCA2 have similar values when calculated from GTEx or from the SRA databases, so demonstrating that css usage occurs at similar frequencies in the three databases. In further support of this conclusion, the css examples that matched bss from SRA databases but not GTEx usually had particularly large intron to css read ratios (Table S12), indicating that non-matching in GTEx is due to its smaller number of sequencing reads. Therefore background splicing is a property of normal genes and not just of genes from diseased tissue.
The match between bss and the experimental reports of css activation in DMD is below average (Table 3). DMD has relatively low expression levels, and use of the larger SRAv2 database increases the number of exact matches between css activation and bss only slightly, from 11/22 to 14/22. This is again below average, most probably because there are still relatively few sequencing reads, and therefore few variants, for DMD even in SRAv2 (Table S5).
Our analysis indicates that the large majority of aberrant splicing events that have been detected in cancer samples are present in the snaptron database SRAv1 (Tables 7, S10a) and in the GTEx database made from normal tissue (Table S10b). Furthermore, the background ss that matched the cancer ss had relatively high reads compared to other background ss, which is perhaps reflective of the relatively low sequencing coverage of the cancer samples. Our finding that oncogenic mutations of the spliceosome enhance strong background splice sites, rather than activate entirely novel ss, is consistent with the subtle effects of the spliceosome mutations upon splice site recognition (23,31).
There is strong evidence that two genes, EZH2 and BRD9, have a causal role in cancer as the result of mutations of the splicing components SRSF2 and SF3B1 respectively (32,33). EZH2 and BRD9 are both inactivated by aberrant splicing events that cause the inclusion of pseudoexons with stop codons (32,33). Snaptron shows that the 3' and 5' pss of the pseudoexons of both genes match the highest or second highest background reads within their host introns, at approximately 5% of normal intron splicing reads for BRD9 and 25% for EZH2 (Table S10a, sheet 9). This indicates one way that spliceosome mutations, which are likely to cause only mild changes to splicing (34), might achieve a phenotypic effect, namely by altering background or alternative splicing events that are already established at relatively high levels.

-51 (Fig 3A). The snaptron database (Fig 3B) contributes the information that the css at -389 has by far the most reads of any background ss within this intron, at 1700 reads, and that the css at -51 has 11 reads (Fig 3B).
We suggest that the CAPN3 mutation is a typical example of a relatively weak splicing mutation that only has a phenotypic effect because it is able to enhance an already strong background ss. Table S6 lists other examples of inactivating 5' and 3'ss mutations that lie outside the core ss region and that activate css with relatively high background reads; these may also be amenable to the approach used by Hu et al (35). By contrast, the use of oligonucleotides to block css that are activated by mutations of the highly conserved region of an intron ss might be less likely to restore intron removal and more likely to enhance a different css (36) or to induce exon skipping (37).
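The comparison of background ss within a single intron can be sketched as follows. This is a minimal illustration, assuming that background reads have been collected into a mapping from css position to read count. The function names are hypothetical. Only the two CAPN3 read counts quoted above (1700 reads at -389 and 11 reads at -51) come from the text; the other background ss of that intron are omitted, and the intron read count used for the percentage is an invented placeholder.

```python
from typing import Dict, List, Tuple


def rank_background_ss(reads_by_position: Dict[int, int]) -> List[Tuple[int, int]]:
    """Sort background splice sites by read count, highest first."""
    return sorted(reads_by_position.items(),
                  key=lambda item: item[1], reverse=True)


def percent_of_intron_reads(ss_reads: int, intron_reads: int) -> float:
    """Express background ss reads as a percentage of the reads that
    support normal splicing of the host intron."""
    if intron_reads <= 0:
        raise ValueError("intron_reads must be positive")
    return 100.0 * ss_reads / intron_reads


# The two css read counts quoted in the text for the CAPN3 intron.
capn3_css = {-51: 11, -389: 1700}
print(rank_background_ss(capn3_css))       # [(-389, 1700), (-51, 11)]

# Placeholder numbers (assumption): 50 pss reads against 1000 intron reads.
print(percent_of_intron_reads(50, 1000))   # 5.0
```

A css that ranks first by a wide margin, as -389 does here, is the kind of already strong background ss that a relatively weak mutation might enhance.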
Antisense oligonucleotides (ASOs) have been developed that can induce exon skipping for the treatment of Duchenne muscular dystrophy (38-40), and Wilton et al (38) have categorised individual DMD exons according to the ease with which they can be skipped, with category 4 exons as the most difficult. We wondered whether single exon skipping is favoured when background reads for this event outnumber reads for alternative background splicing events involving the same exon (Fig 4A). Fig 4B shows this type of background splicing data for all category 4 DMD exons plus the exons that we would predict to be the most difficult to skip as indicated by their background splicing reads. As can be seen, there is not a strong overlap between the experimental results and our predictions, although there are points of agreement. Targeting of exon 8 or 54 with ASOs is found experimentally to cause the skipping of both exons 8 and 9 or of both exons 54 and 55 (38). This agrees with the background splicing data, which have higher reads for the downstream double exon skip than for single skips of exons 8 and 54. Targeting of exon 10 caused multiple but variable downstream exon skipping (38), which is consistent with the dominant background reads for multiple downstream skipping of this exon (Fig 4B). However, our analysis also indicates that targeting exon 43 might induce double exon skipping (Fig 4B), which is not what is observed experimentally (Table S13).
Interestingly, Fig 4B also illustrates that some DMD intron splice sites have more background reads for skips of 6, 7, 9, 11 or 12 exons than for single exon skips or for potential css. There are also examples of highest background splicing reads for 3, 4 or 5 exon skips for some of the genes listed in Table S3.
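The reasoning applied to Fig 4B, in which single exon skipping is taken to be favoured only when its background reads outnumber those of every competing event at the same exon, can be sketched as below. This is an illustration under stated assumptions rather than the procedure used to build Fig 4B: the event labels, the function names and all read counts are hypothetical.

```python
from typing import Dict


def dominant_event(reads_by_event: Dict[str, int]) -> str:
    """Return the background splicing event with the most reads.
    Ties are resolved in favour of the event listed first."""
    if not reads_by_event:
        raise ValueError("no background events supplied")
    return max(reads_by_event, key=lambda event: reads_by_event[event])


def single_skip_favoured(reads_by_event: Dict[str, int],
                         single_key: str = "single_skip") -> bool:
    """True when single exon skipping has reads and strictly outnumbers
    every alternative background event involving the same exon."""
    single = reads_by_event.get(single_key, 0)
    others = [reads for event, reads in reads_by_event.items()
              if event != single_key]
    return single > 0 and all(single > reads for reads in others)


# Hypothetical read counts (assumption) for two exons.
exon_a = {"single_skip": 3, "double_skip_downstream": 12, "css": 1}
exon_b = {"single_skip": 9, "double_skip_downstream": 2}

print(dominant_event(exon_a))        # double_skip_downstream
print(single_skip_favoured(exon_a))  # False
print(single_skip_favoured(exon_b))  # True
```

Under this rule the hypothetical exon_a would be predicted to undergo a downstream double skip when targeted, in the same way that the background data for exons 8 and 54 point to double rather than single skipping.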
Multiple exon skipping is relatively untested and it would seem important to discover if skipping on this scale does occur in response to splice site mutations or to antisense oligonucleotides.

Tables
Tables can be found in the supplementary section.