Serghei Mangul, Harry Taegyun Yang, Nicolas Strauli, Franziska Gruhl, Hagit T Porath, Kevin Hsieh, Linus Chen, Timothy Daley, Stephanie Christenson, Agata Wesolowska-Andersen, Roberto Spreafico, Cydney Rios, Celeste Eng, Andrew D Smith, Ryan D Hernandez, Roel A Ophoff, Jose Rodriguez Santana, Erez Y Levanon, Prescott G Woodruff, Esteban Burchard, Max A Seibold, Sagiv Shifman, Eleazar Eskin, Noah Zaitlen.
Abstract
High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read lengths. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.
Year: 2018 PMID: 29548336 PMCID: PMC5857127 DOI: 10.1186/s13059-018-1403-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Schematic of the ROP. Human reads are identified by mapping all reads onto the reference sequences using a standard high-throughput mapping algorithm. The ROP protocol categorizes mapped reads into genomic (red colors) and repetitive (green colors) reads. Reads that fail to map are extracted and further filtered to exclude low-quality reads, low-complexity reads, and reads from ribosomal DNA (rDNA) (grey color). From the remaining unmapped reads, the ROP protocol identifies reads aligned to human references by a more sensitive alignment tool (lost human reads: red color), reads aligned to human references with excessive (“hyper”) editing (hyper-edited RNAs: cyan color), reads aligned to the repeat sequences (lost repeat elements: green color), reads spanning sequences from distant loci (non-co-linear: orange color), reads spanning antigen receptor gene rearrangement in the variable domain (V(D)J recombination of B cell receptor and T cell receptor: violet color), and reads aligned to the microbial reference genomes and marker genes (microbial reads: blue color)
Fig. 2 Genomic profile of unmapped reads across 10,641 samples and 54 tissues. The percentage of unmapped reads in each category is calculated as a fraction of the total number of reads. Bars of the plot are not scaled. Human reads (black color) are mapped to the reference genome and transcriptome via TopHat2. Unmapped reads are profiled using the seven steps of the ROP protocol, described below. (1) Low-quality/low-complexity reads (light brown) and reads matching the rDNA repeating unit (dark brown) are excluded. (2) ROP identifies lost human reads (red color) from unmapped reads using a more sensitive alignment. (3) Hyper-edited reads are captured by the hyper-editing pipeline proposed in [17]. (4) ROP identifies lost repeat sequences (green color) by mapping unmapped reads onto the reference repeat sequences. (5) Reads arising from trans-splicing, gene fusion, and circRNA events (orange color) are captured by the TopHat-Fusion and CIRCexplorer2 tools. (6) IgBlast is used to identify reads spanning B and T cell receptor gene rearrangement in the variable domain (V(D)J recombinations) (violet color). (7) Microbial reads (blue color) are captured by mapping reads onto the microbial reference genomes
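The stepwise profiling in the Fig. 2 caption can be illustrated with a minimal Python sketch. It is not the ROP implementation: the function name `profile_reads`, the toy predicates, and the toy reads are all hypothetical stand-ins for the real tools (TopHat2, IgBlast, and so on). It assumes each read is assigned to the first step that accepts it, and it reports every category as a percentage of the total number of reads, as the caption specifies.

```python
from collections import Counter

def profile_reads(reads, steps):
    """Assign each read to the first step whose predicate accepts it.

    reads: iterable of reads (here, plain strings)
    steps: ordered list of (category_name, predicate) pairs
    Returns {category: percentage of ALL reads}; reads that no step
    accepts are counted under 'unaccounted'.
    """
    counts = Counter()
    total = 0
    for read in reads:
        total += 1
        for name, accepts in steps:
            if accepts(read):
                counts[name] += 1
                break
        else:
            # No step claimed this read.
            counts["unaccounted"] += 1
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}

# Hypothetical toy predicates; real steps would call alignment tools.
toy_steps = [
    ("mapped_human", lambda r: r.startswith("H")),
    ("low_quality_or_complexity", lambda r: len(set(r)) == 1),
    ("microbial", lambda r: r.startswith("M")),
]
toy_reads = ["HACG", "HTTG", "AAAA", "MCGT", "CGTA"]
profile = profile_reads(toy_reads, toy_steps)
```

Because every percentage uses the same denominator (all reads, mapped and unmapped), the categories sum to 100% and remain comparable across samples.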
Fig. 3 Combinatorial diversity of the IGK locus differentiates disease status. a Heatmap depicting the percentage of RNA-seq samples supporting a particular VJ combination for whole blood (n = 19), nasal epithelium of healthy controls (n = 10), and asthmatic individuals (n = 9). Each row corresponds to a V gene and each column corresponds to a J gene. b Alpha diversity of nasal samples is measured using the Shannon entropy and incorporates the total number of VJ combinations and their relative proportions. Nasal epithelium of asthmatic individuals exhibits decreased combinatorial diversity of the IGK locus compared to healthy controls (p value = 1 × 10⁻³). c Compositional similarities between the nasal samples in terms of gain or loss of VJ combinations of the IGK locus are measured across paired samples from the same group (Asthma, Controls) and paired samples from different groups (Asthma vs Controls) using the Sørensen–Dice index. A lower level of similarity is observed between nasal samples of asthmatic individuals than between samples of unaffected controls (p value < 7.3 × 10⁻¹³). Nasal samples of unaffected controls are more similar to each other than to those of asthmatic individuals (p value < 2.5 × 10⁻⁹)
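The two measures named in the Fig. 3 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names and the toy read counts are hypothetical, the natural logarithm is assumed for the Shannon entropy (the caption does not state the base), and the Sørensen–Dice index is computed on the presence or absence of VJ combinations.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (natural log, an assumption) of {VJ combination: read count}."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def sorensen_dice(a, b):
    """Sørensen–Dice similarity between two sets of observed VJ combinations."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty samples are identical
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical toy samples: (V gene, J gene) -> supporting read count.
sample_a = {("IGKV1-5", "IGKJ1"): 10, ("IGKV3-20", "IGKJ2"): 10}
sample_b = {("IGKV1-5", "IGKJ1"): 5, ("IGKV4-1", "IGKJ4"): 15}

alpha_a = shannon_entropy(sample_a)        # two equally frequent combinations
beta_ab = sorensen_dice(sample_a, sample_b)  # one shared combination of two each
```

The entropy grows with both the number of VJ combinations and the evenness of their proportions, while the Sørensen–Dice index depends only on which combinations are gained or lost between two samples.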