Title Page Title: Deep sequencing of pre-translational mRNPs reveals hidden flux through evolutionarily conserved AS-NMD pathways

Background:​ The ability to generate multiple mRNA isoforms from a single gene by alternative splicing (AS) is crucial for the regulation of eukaryotic gene expression. Because different mRNA isoforms can have widely differing decay rates, however, the flux through competing AS pathways cannot be determined by traditional RNA-Seq data alone. Further, some mRNA isoforms with extremely short half-lives, such as those subject to translation-dependent nonsense-mediated decay (AS-NMD), may be completely overlooked in even the most extensive RNA-Seq analyses. Results: ​RNA immunoprecipitation in tandem (RIPiT) of exon junction complex (EJC) components allows for the purification of post-splicing mRNA-protein particles (mRNPs) not yet subject to translation (pre-translational mRNPs) and translation-dependent mRNA decay. Here we compared EJC RIPiT-Seq to whole cell and cytoplasmic RNA-Seq data from HEK293 cells. Consistent with expectations, we found that the flux through known AS-NMD pathways is substantially higher than what is captured by RNA-Seq. We also identified thousands of previously unannotated splicing events; while many can be attributed to “splicing noise”, others are evolutionarily-conserved events that produce new AS-NMD isoforms likely involved in maintenance of protein homeostasis. Several of these occur in genes whose overexpression has been linked to poor cancer prognosis. Conclusions: ​Deep sequencing of RNAs in post-splicing, pre-translational mRNPs provides a means to identify and quantify splicing events without the confounding influence of differential mRNA decay. For many known AS-NMD targets, the NMD-linked AS pathway dominates. EJC RIPiT-Seq also enabled identification of numerous conserved but previously unknown AS-NMD events.


Background
A central mechanism underlying metazoan gene expression is alternative pre-mRNA processing, which regulates the repertoire of mRNA isoforms expressed in various tissues and under different cellular conditions. Extensive deep sequencing of RNA (RNA-Seq) has revealed that ~95% of human protein-coding genes are subject to alternative splicing (AS) [1,2], with current estimates suggesting ~82,000 different protein-coding mRNA isoforms generated from ~20,000 protein coding genes [3]. Thus, production of alternative mRNA isoforms massively expands the protein repertoire that can be expressed from a much smaller number of genes [4,5]. But cells also need to control how much of each protein is made. Although transcriptional control is often considered the predominant mechanism for modulating protein abundance, emerging evidence indicates that post-transcriptional regulatory mechanisms are crucial as well.
Not all mRNA variants are protein-coding. Nearly 15,000 human mRNAs in the Ensembl database (release 93) are annotated as nonsense-mediated decay (NMD) targets [3]. NMD is a translation-dependent pathway that both eliminates aberant mRNAs with malformed coding regions (i.e., those containing premature termination codons due to mutation or missplicing) and serves as a key mechanism for maintenance of protein homeostasis [6]. This protein homeostasis function is mediated by AS linked to NMD (AS-NMD), wherein the flux through alternate splicing pathways that result in protein-coding and NMD isoforms is subject to tight control [7]. These NMD isoforms harbor a premature termination codon either due to frameshifting or inclusion of a poison cassette exon. Because NMD isoforms are rapidly eliminated after the first or "pioneer" round of translation, only protein-coding isoforms result in appreciable protein production ( Figure 1A, bottom ). Thus increasing or decreasing flux through the NMD splicing pathway decreases or increases protein production, respectively.
Although AS-NMD was originally described as a mechanism by which RNA binding proteins (e.g., SR and hnRNP proteins) could autoregulate their own synthesis, recent work indicates that AS-NMD is much more pervasive, tuning abundance of many other proteins such as those involved in chromatin modification and cellular differentiation [8].
The true extent to which AS-NMD contributes to protein homeostasis can only be appreciated by determining the flux through the protein-coding and NMD splicing pathways.
Transcriptome-wide assessment of mRNA isoform abundance generally relies on RNA-Seq of whole cell or cytoplasmic RNA. Such methods provide a static snapshot of the species present

EJC, whole cell and cytoplasmic libraries
In our recent study investigating the organizing principles of spliced RNPs [13], we generated three biological replicates from HEK293 cells of EJC-bound RNAs partially digested with RNase T1 during RNP purification ( Figure 1A ). Paired-end deep sequencing of these EJC RIPiT libraries resulted in 19-25 million mate pairs each ( Supplemental Table 1 ). For comparison to RNA-Seq libraries, we chose rRNA-depleted whole cell and cytoplasmic HEK293 RNA-Seq datasets (two biological replicates each) previously published by Sultan et al. [14]. We chose these particular libraries based on their similarity in cell treatment and library preparation to our EJC libraries, their clean cellular fractionation, and sequencing depth (51-57 million mate pairs each).
All libraries were downloaded from their respective repositories (see Declarations) and processed in parallel. Reads were aligned to the Genome Reference Consortium Human Build 38 (GRCh38.p12) [3] using STAR (v2.5.3a) [15] after first filtering out those mapping to repeat RNAs [16]. To minimize the effect of misalignment in ensuing analyses, mismatches were limited to three per read, with gaps caused by deletions or insertions being strongly penalized.
These strict mapping parameters resulted in 6-10 million and 30-43 million aligned pairs for the EJC RIPiT and RNA-Seq libraries, respectively ( Supplemental Table 1 ). For quantification, we limited all analyses to unique reads with high mapping quality (MAPQ ≥ 5). For all libraries, we used Kallisto (v0.44.0) to derive expression values for the ~200,000 annotated transcripts in GRCh38.p12 [3]. Examination of per-transcript abundance revealed high concordance ( ≥ 0.93 to 0.99) among all biological replicates ( Supplemental Figure 1A ). Therefore, all subsequent quantitative analyses utilized merged biological replicate data.

EJC libraries are enriched for spliced transcripts and translation-dependent decay targets
To assess the relative representation of NMD targets in EJC and RNA-Seq libraries, we first examined read coverage on known AS-NMD genes. The SR proteins TRA2B and U2AF2 negatively regulate their own expression by promoting inclusion of a highly-conserved poison cassette exon containing a premature termination codon ( Figure 1B, C ). Although these poison exons were detectable in all library types, they were much more abundant in the EJC libraries.
As expected due to NMD, cytoplasmic RNA-Seq libraries exhibited the lowest poison exon inclusion (percent spliced in; PSI) values (16% and 4%, respectively), with the whole cell libraries being somewhat higher (29% and 6%, respectively). Yet, the EJC RIPiT libraries indicate much higher inclusion percentages (averaging 94% and 73%, respectively). Thus, for both TRA2B and U2AF2, the predominant splicing pathway in HEK293 cells under standard growth conditions is poison exon inclusion. Similar trends were observed for other known AS-NMD targets ( Supplemental Figure 1B-D ), including hnRNPA1 where the AS-NMD isoform results from splicing in the 3′UTR as a consequence of alternative polyadenylation ( Figure 1D ).
The substantial differences between the EJC RIPiT and RNA-Seq quantitations for these previously documented AS-NMD isoforms clearly illustrate the advantage provided by the EJC RIPiT libraries for more accurately assessing flux through alternative processing pathways that result in mRNA isoforms with widely different decay rates.
In GRCh38.p12, every transcript isoform is given a specific annotation [17]; relevant annotations in protein-coding genes are "protein-coding", "NMD", "NSD", "retained intron", and "processed transcript", with the latter being a catch-all for transcripts not clearly attributable to any other category. NSD (non-stop decay) is another translation-dependent mRNA degradation pathway that eliminates transcripts having no in-frame stop codon [18]. When exported to the cytoplasm, transcripts containing one or more retained introns are also usually subject to translation-dependent decay due to the presence of in-frame stop codons in intronic regions.
For transcripts detectable in our libraries [TPM >0 in all replicates of a particular library type (EJC, whole cell or cytoplasmic)], the number of exon junctions (i.e., positions at which introns were removed) per protein-coding and NMD isoform ranged from 0 to >100 and 1 to 69, respectively ( Figure 2A ). As expected, protein-coding isoforms having no exon junctions were less abundant in the EJC libraries than in either RNA-Seq library ( Figure 2B, top ). In contrast, spliced protein-coding isoforms containing 5 or more exon junctions were enriched in EJC libraries, with the degree of enrichment increasing with exon junction number. For each exon junction number bin (i.e., 1-4, 5-10 and 10+), NMD isoforms were even more enriched in EJC libraries than were protein-coding isoforms ( Figure 2B, bottom ). EJC library enrichment was also readily discernible in per-transcript scatter plots for NMD, NSD, retained intron, and processed transcript isoforms ( Figure 2C, D and Supplemental Figure 2A, B ). All of these observations are consistent with the notion that EJC-associated RNAs are enriched for spliced transcripts subject to subsequent elimination by translation-dependent decay.
Because they are not translated, long intergenic non-coding RNAs (lincRNAs) are not subject to translation-dependent decay. As expected, lincRNAs lacking exon junctions (e.g., MALAT1, RMRP, NEAT1 and NORAD) were substantially depleted from EJC libraries, whereas those containing exon junctions were of similar or higher abundance in EJC than RNA-Seq libraries ( Figure 2E and Supplemental Figure 2C ). Particularly notable was XIST, the most highly represented Pol II transcript in our EJC libraries [13]. XIST is both spliced and exclusively nuclear. Reflecting this, median abundance of the eight major XIST isoforms was five-and forty-fold greater in EJC than in whole cell and cytoplasmic libraries, respectively.

EJC libraries capture new exon junctions
Having established that spliced transcripts known to be eliminated by translation-dependent decay are enriched in EJC libraries, we next wondered whether EJC libraries might contain new transcript isoforms that had previously eluded detection due to their low abundance in RNA-Seq. Such isoforms should contain previously unannotated exon junctions. To identify all previously annotated exon junctions, we integrated the RefSeq (hg38) [19], Ensembl (GRCh38.p12) [3], GENCODE (v29) [20] and Comprehensive Human Expressed SequenceS (CHESS) transcriptome annotations to create a comprehensive reference file containing 575,976 known introns ( Supplemental Table 2 ). CHESS is derived from 9,795 RNA-Seq samples from diverse cell types in the GTEx collection, so represents the most complete compendium of human transcripts reported to date [21]. Yet while CHESS found 118,183 new exon junctions not previously annotated in RefSeq, Ensembl or GENCODE, 82,918 other junctions present in RefSeq, Ensembl and/or GENCODE were not returned by the CHESS pipeline ( Figure 3A ). This lack of concordance with respect to annotated junctions shows that even the most comprehensive RNA-Seq data analyses are unlikely to capture all bona fide splicing events.
To identify annotated and unannotated exon junctions in our EJC, whole cell and cytoplasmic libraries, we considered only those reads that cross an exon junction. The position of an exon junction in an individual read can be found by examining the "N operation" in the CIGAR string, which indicates the locations and lengths of gaps inserted during alignment to genomic DNA ( Supplemental Figure 3A ). We further required that any candidate junction: (1) occur within an annotated gene; (2) have reads with ≥15 nt aligning on both sides of the junction (≥90% exact sequence match on each side); (3) be detectable in all replicates of a particular library type (EJC, whole cell or cytoplasmic); and (4) have a mean read count ≥2 per library type ( Supplemental Figure 3A ). Using these criteria, we identified 151,072 junctions contained in the RefSeq/Ensembl/GENCODE/CHESS reference file (annotated junctions) and 5,917 previously unannotated junctions. MEME analysis of the latter revealed the 5′ and 3′ splice site consensus motifs for the major spliceosome, although at somewhat lesser strength (bits) than annotated junctions ( Figure 3B ). To limit our analysis to events most likely representing real splicing events (as opposed to mapping artifacts), we subsequently only considered the 5,412 previously unannotated junctions where the putative intron began and ended with dinucleotides expected for either the minor (AT-AC) or major (GT-AG) spliceosome. Of these, only three had AT-AC termini, indicating that the vast majority (>99.9%) of the unannotated events we detected are due to intron excision by the major spliceosome.
The majority (73%) of previously-annotated exon junctions meeting our detection criteria in protein coding genes ( Supplemental Figure 3A ) were present in all three library types ( Figure 3C, left ). There was less concordance, however, with respect to unannotated junctions, with the EJC libraries having many more unannotated junctions than either whole cell or cytoplasmic RNA-Seq ( Figure 3C, right ). Consistent with the expectation that EJC libraries should be enriched for exon junctions, both annotated and unannotated junctions were supported by more reads per million mapped (RPM) in the EJC libraries ( Figure 3D ). Also as expected, annotated junctions were generally supported by more reads than unannotated junctions in all library types. The major class (49%) of the new junctions were new alternative 5′ or 3′ splice sites (i.e., that combined a known 3 ′ or 5 ′ splice site with a previously unannotated 5′ or 3′ splice site, respectively) ( Figure 3E ). Other categories were previously unannotated exon skipping events (34%), new cassette exons (14%) and new introns (4%).

Relationship of new splicing events to reading frame
Previous analyses of low abundance, unannotated splicing events in RNA-Seq data have revealed a strong tendency for such events to maintain reading frame [22,23]. To investigate whether this is due to some inherent ability of the splicing machinery to detect reading frame in the nucleus [24,25], or simply due to translation-dependent decay of out-of-frame events in the cytoplasm, we determined the distance from each previously unannotated splice site meeting our selection criteria to the nearest annotated splice site observed in any of our three library types. In all, 126 and 273 unannotated 5 ′ and 3 ′ splice sites, respectively, occurred within 15 nts of an annotated 5 ′ or 3 ′ splice site. Comparison of unannotated-to-annotated splice site distance aggregation plots between the three library types revealed both similarities and differences ( Figure 4A ). Around annotated 5 ′ splice sites, all three libraries displayed similar patterns, with the greatest unannotated usage being at intron position +5, consistent with the preference for a G and a T at positions +5 and +6, respectively, in the human 5 ′ splice site consensus sequence ( Figure 3B ) and the prevalence of GT dinucleotides at this position in this set of 126 5 ′ splice sites (dotted gray line in Figure 4A ). More notable was the pattern near 3 ′ splice sites, where positions +3 and +4 in the downstream exon exhibited the highest unannotated usage. Strikingly, whereas the RNA-Seq libraries were strongly skewed toward position +3, both positions +3 and +4 in the EJC libraries were highly represented, with their usage closely reflecting the number of available AG's at these positions (dotted gray line in Figure 4A ). Comparison of fractional abundance [unannotated read counts/(unannotated + annotated read counts)] at individual sites confirmed that whereas the EJC and RNA-Seq libraries exhibited similar utilization at position +3, utilization of position +4 was much more prominent in the EJC than either RNA-Seq library ( Figure 4B ). These observations strongly support a model in which out-of-frame splicing events are rapidly eliminated by NMD, resulting in their underrepresentation in both whole cell and cytoplasmic RNA-Seq libraries. Because utilization of AGs at positions +3 and +4 in the EJC libraries so closely paralleled their availability, we conclude that (at least with regard to 3 ′ splice sites) the splicing machinery has no ability to read frame.

Evolutionary conservation versus splicing noise
Regardless of reading frame, most unannotated splicing events are likely due to "splicing error" [26] or "splicing noise" [23]. Splicing noise results from spurious utilization of cryptic splice sites that are not evolutionarily conserved. To assess both evolutionary conservation and splice site strength, we calculated mean basewise phyloP 30-way vertebrate conservation [27] and MaxENT (a generally accepted measure of how well a particular splice site matches the consensus) [28] scores for both annotated and unannotated splice sites, using the same 5 ′ and 3 ′ splice site window sizes (9 and 23 nts, respectively) for both calculations ( Figure 5A ). We also calculated conservation and MaxENT scores for sequences chosen at random from inside annotated genes and containing either GT or AG at the appropriate position within the 5 ′ or 3 ′ splice site window, respectively. Plotting MaxENT versus conservation revealed markedly different distributions between annotated splice sites and random GT-and AG-containing sequences ( Figure 5B, Supplemental Figure 4 ), with annotated sites being significantly skewed toward higher values for both measures. In contrast, whereas unannotated splice sites were similarly distributed as annotated splice sites with regard to MaxENT, the majority exhibited conservation scores more similar to random than annotated splice sites ( Figure 5C ).
For the random sequences, 95% had 5 ′ and 3 ′ splice site conservation scores below 1.03 and 0.63, respectively. Using these values as cutoffs to filter out the majority of events likely due to splicing noise (although this may be unnecessarily conservative for 3 ′ splice sites due to the high degree of overlap between the annotated and random conservation scores) left us with 252 (12%) and 630 (26%) evolutionarily-conserved unannotated 5 ′ and 3 ′ splice sites, respectively ( Supplemental Table 3 ). The majority of these occurred within annotated protein-coding exons, so their conservation is likely driven by amino acid conservation and not as a requirement for recognition by the splicing machinery (see Supplemental Figure 4C for an example). Almost all of the new evolutionarily conserved introns (i.e., both the 5′ and 3′ splice sites were previously unannotated, but exhibited high conservation) also fell into this category. Thus the new introns likely constitute splicing noise due to low level spliceosome assembly on sites within exons that by happenstance resemble splice site consensus sequences. In contrast, examination of unannotated 3′ splice sites occurring within introns uncovered a conserved alternative splicing event in the HECTD4 (HECT domain E3 ubiquitin protein ligase 4) gene that adds 9 amino acids into the middle of the protein ( Supplemental Figure 4D ); this spliced isoform is currently annotated in mouse RefSeq and GENCODE, but not in humans.
Other alternative 3′ splice sites in the CNOT1 and EEA1 genes generate AS-NMD isoforms ( Figure 5D, E ), the latter due to creation of a new poison cassette exon.

New evolutionarily-conserved poison cassette exons
Having found examples of new AS-NMD isoforms generated by unannotated 3′ splice sites, we were interested to investigate which of the new cassette exons identified here might also function in this capacity. Of the 445 new cassette exons ( Figure 3E ), 412 (93%) occurred in protein-coding genes; the remainder occurred in pseudogenes and ncRNAs. Based on the data in Figure 1 , poison exons should exhibit higher abundance in EJC than in RNA-Seq libraries.
Consistent with this, 315/412 (76%) were solely detectable in the EJC libraries, with the remainder averaging 12-and 13-fold higher abundance in the EJC libraries than in whole cell or cytoplasmic RNA-Seq, respectively ( Figure 6A ). Of the 377 new cassette exons detectable in EJC libraries, 70% were frameshifting (i.e., not a multiple of 3 nts long). Individual inspection of the 25 most abundant non-frameshifting exons revealed that 80% contained an in-frame stop . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/847004 doi: bioRxiv preprint first posted online Nov. 19, 2019; codon. Therefore, as expected, the vast majority of new exons likely function as poison cassette exons.
To assess whether any of the new cassette exons constitute conserved regulatory elements, we calculated mean phyloP 30-way conservation scores across the entire exon.
Combining these exon conservation scores (white to dark blue in Figure 6B ) with the previously calculated 5′ and 3′ splice site conservation scores ( Figure 5B ) revealed a set of 20 previously unannotated cassette exons exhibiting both high internal (phyloP score ≥ 1) and high splice site (≥ 1 for both splice sites) conservation ( Figure 6B right; Supplemental Table 3 ). Among these, the most highly represented in our datasets was a new 94 nt exon within intron 8 of the 22-intron protein tyrosine phosphatase, receptor type A ( PTPRA) gene ( Figure 6C ).

Discussion
Here we demonstrate that deep sequencing of transcripts in pre-translational RNPs provides a means to identify/quantify mRNA isoforms underrepresented in or absent from RNA-Seq libraries due to their rapid elimination by translation-dependent mRNA decay. We captured this pre-translational population by tandem immunoprecipitation (RIPiT) [12] of two core EJC proteins. EJCs are stably deposited upstream of exon junctions late in the pre-mRNA splicing process, and EJCs in 5′ UTRs and coding regions (~98% of all) are necessarily removed during the first or "pioneer" round of ribosome transit. Thus the EJC provides an excellent handle by which to enrich for fully-processed, but not-yet-translated mRNAs ( Figure 1A ). Because they are specifically enriched for spliced transcripts, EJC RIPiT-Seq libraries also better capture low abundance splicing events than traditional RNA-Seq libraries. This enabled us to identify thousands of new exon junctions not currently annotated in any of four major reference datasets . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. based on RNA-Seq. Many of these new splicing events generate isoforms subject to NMD, with some being evolutionarily-conserved AS-NMD regulatory events. Thus EJC RIPiT-Seq constitutes a useful method to query the spliced transcriptome without the confounding effects of differential translation-dependent decay of individual mRNA isoforms.

Flux through AS-NMD pathways
Since its initial description [29, 30], AS-NMD has increasingly emerged as a key post-transcriptional regulatory mechanism [31-33] (refs). Due to their widely different decay rates, however, the flux through the alternative processing pathways resulting in protein-coding and NMD isoforms cannot be determined by traditional RNA-Seq methods. As shown in is a general feature of our EJC RIPiT libraries ( Figure 2 ). We note, however, that to increase the abundance of pre-translational RNPs, we exposed our HEK293 cells to 2 mg/ml harringtonine for 60 minutes prior to cell harvest and lysis [13]. At least in yeast growing under suboptimal conditions, inhibition of translation can induce rapid transcriptional upregulation of genes involved in ribosome biogenesis [34]; the extent to which this is also true in mammalian cells growing under optimal conditions, and whether transcription and pre-mRNA processing of other gene classes are affected, has yet to be thoroughly explored. One recent study in HeLa cells, however, showed that, whereas a 15 minute exposure to 100 mg/ml cycloheximide had almost no effect on mRNA abundance in whole cell RNA-Seq, multiple mRNAs encoding ribosomal proteins decreased in abundance after a 24 hour cycloheximide treatment (i.e., the opposite of yeast) [35]. Because any transcriptional effects would confound the analysis, elimination of translation inhibitors would be advisable for any future EJC RIPiT-Seq study specifically aimed at quantifying flux through alternative RNA processing pathways in non-perturbed cells.
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Identification of novel conserved splicing events
A major goal for this study was to assess the utility of EJC RIPiT-Seq libraries for identifying novel sites of exon ligation that are underrepresented in traditional RNA-Seq libraries. These could be splicing events resulting in either stable, low abundance isoforms or highly unstable transcripts such as NMD and NSD substrates. As illustrated in Figure 3A  Among this conserved set, the majority display features expected to generate an AS-NMD isoform (i.e., frameshift or in-frame stop codon).

New poison exons regulate genes linked to cancer
It has now been well established that changes to pre-mRNA splicing patterns can drive cancer  45] or the advantageous effects of NMD in eliminating mRNA isoforms encoding neoepitopes that would otherwise be recognized by the immune system [46]. But our findings suggest that decreased poison exon inclusion should also be considered as a contributor to the mechanisms underlying cancer. An obvious means to alter splicing flux is a cis-acting mutation that disrupts splice site recognition and, thereby, poison exon inclusion. (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. exons documented here, this possibility should certainly be considered in future hunts for cancer-promoting mutations. Of note, current "exome" sequencing generally captures only DNA covering and surrounding annotated exons [48]. Therefore, the unannotated cassette exons we identify here are likely absent from most DNA sequencing databases. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Deep sequencing libraries
All libraries were downloaded from the NCBI GEO GSE115788 (specifically, samples GSM3189985, GSM3189986, and GSM3189987) and the European Nucleotide Archive PRJEB4197 (specifically, runs ERR304485, ERR304486, ERR304487, and ERR304488). Each RNA-Seq replicate contained an average of 50 to 60 million mate pairs per library.

Library processing and alignment
Read counts for unprocessed libraries and for the individual processing steps detailed below are provided in Supplemental Table 1 . Prior to alignment, adaptor sequences and long stretches (≥ 20 nt) of adenosines were trimmed from the 3′ end of sequencing reads. All libraries were filtered using STAR v2.5.3a [15] for reads that aligned to repeat regions, as defined by RepeatMasker [16]. Remaining reads were aligned with STAR on two-pass mode to the human genome, release 93 [3]. This alignment allowed a maximum of 3 mismatches per pair and highly penalized deletions and insertions. Mapped reads were then filtered for low mapping quality (MAPQ < 5) and/or duplicated reads, identified with the MarkDuplicates tool (Picard v2.17.8) [49].
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Junction identification pipeline
The custom bioinformatics pipeline designed for our annotated and unannotated junction analysis ( Figures 3 -6 ) is shown in detail in Supplemental Figure 3A . Transcriptome annotation files from RefSeq (hg38) [19], Ensembl (GRCh38.p12) [3], GENCODE (v29) [20], and CHESS (v2.1) [21] were combined to create a comprehensive reference file of all annotated introns ( Supplemental Table 2 ). Any junction that appears in our libraries but is not annotated in one of the aforementioned transcriptomes is referred to as "unannotated." To identify unannotated exon junctions, all reads with CIGAR strings containing an "N" operation were isolated and then compared to the annotated intron reference file using Bedtools intersect [51]. Reads that did not match the length or location of a known intron were considered the result of potential unannotated splicing events. These junctions were further filtered based on the following criteria: (i) overlap with a known gene, (ii) reads must have ≥15 nt aligned on both sides of the potential junction, (iii) present in all replicates of any library type, (iv) major spliceosome dinucleotide consensus sequences at the 5′ and 3′ splice sites, and (v) mean read count ≥ 2 per library type.

Nearest annotated splice site analysis
For analysis of new splicing events near annotated exons ( Figure 4 ), each unannotated 5′ splice site was paired with its nearest annotated 5′ splice site based on the 3′ splice site used in both splicing events. Similarly, each unannotated 3′ splice site was paired with its nearest annotated 3′ splice site based on the 5′ splice site used in both splicing events. The number of available GT and AG dinucleotides at nucleotide positions -30 to +30 surrounding each annotated splice site in this unannotated/annotated paired dataset.

Splice site strength and conservation
Splice site strength and mean conservation scores for annotated and unannotated splice sites were calculated using MaxEntScan [28] and phyloP 30-way basewise conservation scores [27] ( Figure 5A ). Random sequences of the appropriate length (9 nts for 5′ splice sites and 23 nts for 3′ splice sites) and internal to annotated genes were obtained from the hg38 annotation file [3] using the Bedtools random function [51]. Only those random sequences containing a GT at . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/847004 doi: bioRxiv preprint first posted online Nov. 19, 2019; positions 4 and 5 or an AG at positions 19 and 20 were used to calculate MaxENT and conservation scores for comparison to 5′ and 3′ splice sites, respectively.
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Additional files
Additional file 1: Includes Figures S1-S5 and Table S1. Figure S1. Comparison of biological replicates; additional examples of AS-NMD transcript coverage. Figure S2. Comparison of EJC and cytoplasmic RNA-Seq coverage across different transcript biotypes. Figure S3. Representation of bioinformatics pipeline to classify annotated and unannotated exon junctions. Figure S4. Comparison of MaxENT and conservation scores for annotated and randomly located splice sites; additional examples of conserved unannotated splicing events. Figure S5. Additional example of an unannotated AS-NMD cassette exon. Table S1 . Sequencing and alignment information for each replicate of the analyzed libraries. Table S2. Table S2. Comprehensive file of all intron locations annotated in RefSeq (hg38), Ensembl (GRCh38.p12), GENCODE (v29), and CHESS (v2.1 ). Table S3. Table S3. Locations and characteristics of highly conserved unannotated exon junctions identified in this study.

Acknowledgments
We thank Alicia Bicknell, Athma Pai, Harleen Saini and Guramrit Singh for critical reading of the manuscript. We thank Weijun Chen for technical advice.

Funding
This work was supported by funding from HHMI and NIH RO1-GM53007. M.J.M. was an HHMI Investigator during the time this work was conducted.

Availability of data and materials
The RIPiT datasets analyzed in this study were downloaded from NCBI GEO under accession number GSE115788 (specifically, samples GSM3189985, GSM3189986, and GSM3189987). RNA-Seq datasets were downloaded from the European Nucleotide Archive under accession number PRJEB4197 (specifically, runs ERR304485, ERR304486, ERR304487, and ERR304488).

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Author details
M.J.M. is a member of the National Academy of Science (USA) and a fellow of the American Association of Arts and Sciences.
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/847004 doi: bioRxiv preprint first posted online Nov. 19, 2019; . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Figure Legends
The copyright holder for this preprint . http://dx.doi.org/10.1101/847004 doi: bioRxiv preprint first posted online Nov. 19, 2019; Supplemental Table 1 Sequencing and alignment information for each replicate of the analyzed libraries.   Supplemental Table 2 List of all introns previously annotated by RefSeq (hg38) [19], Ensembl (GRCh38.p12) [3], GENCODE (v29) [20], and/or CHESS (v2.1) [21]. Table includes information on intron location, length, strand, transcript ID (if available, [3]), and annotation origin.  . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. Supplemental Table 3 List of highly conserved unannotated splicing events (see Results for conservation score cut-offs). Table includes    . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.    CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.    B . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/847004 doi: bioRxiv preprint first posted online Nov. 19, 2019; Starting file: .fastq of Sequenced Reads Isolate reads that cross exon-exon junctions: A . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  Figure 4 A B C D . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.