Making the most of RNA-seq: Pre-processing sequencing data with Opossum for reliable SNP variant detection.

Abstract

Identifying variants from RNA-seq (transcriptome sequencing) data is a cost-effective and versatile alternative to whole-genome sequencing. However, current variant callers do not generally behave well with RNA-seq data due to reads encompassing intronic regions. We have developed a software programme called Opossum to address this problem. Opossum pre-processes RNA-seq reads prior to variant calling, and although it has been designed to work specifically with Platypus, it can be used equally well with other variant callers such as GATK HaplotypeCaller. In this work, we show that using Opossum in conjunction with either Platypus or GATK HaplotypeCaller maintains precision and improves the sensitivity for SNP detection compared to the GATK Best Practices pipeline. In addition, using it in combination with Platypus offers a substantial reduction in run times compared to the GATK pipeline so it is ideal when there are only limited time or computational resources available.

Keywords: RNA-seq; SNP; software tools; variant calling

Year: 2017 PMID: 28239666 PMCID: PMC5322827 DOI: 10.12688/wellcomeopenres.10501.2

Source DB: PubMed Journal: Wellcome Open Res ISSN: 2398-502X

Introduction

RNA-seq (transcriptome sequencing) [1] is routinely employed for gene expression analysis, but it can also be used alongside whole-exome (WES) and whole-genome sequencing (WGS) to identify genomic variants in expressed regions. Its potential for improving diagnostics was recently demonstrated in a clinical setting [2]. However, since the prevalent variant calling pipelines have been designed specifically for DNA data, novel tools or modifications to the existing ones are needed for processing RNA-seq data. Detecting variants in lowly expressed genes, covered by only a few reads, places strict demands on the precision and sensitivity of the method. Moreover, the method needs to be able to cope with intron-spanning RNA-seq reads.

A few pipelines for detecting SNPs in RNA-seq data have now been released to address these challenges. eSNV-detect by Tang et al. [3] employs a combination of mappers to overcome systematic errors of individual aligners, followed by variant calling with Samtools and Bcftools. SNPiR by Piskol et al. [4] relies on a single aligner (BWA) to map reads across splice junctions and filters heavily after variant calling with GATK UnifiedGenotyper, at the cost of decreased sensitivity. The developers of GATK have also released their Best Practices for calling variants from RNA-seq data online (https://software.broadinstitute.org/gatk//guide/article?id=3891). All of these pipelines mix and match parts of older pipelines developed for DNA data processing in order to make sense of RNA-seq data. The benchmarking in these studies has not been done consistently, making it difficult to compare their performance directly.

Current state-of-the-art variant calling algorithms employ a haplotype-driven strategy to achieve higher accuracy. For example, Platypus [5] performs a local de novo read assembly to generate candidate variants and reconstruct haplotypes. Variants are then called based on the estimated haplotypes. The approach works well on length scales of up to a few kilobases (typically up to 1.5–2 kb), but reads spanning longer distances (e.g. reads mapping across large introns) would disrupt it. For this reason Platypus should not be run directly on RNA-seq data.

In this work, we have developed a software tool called Opossum [6] specifically to process and filter RNA-seq data and make it suitable for (haplotype-based) variant calling. No additional processing step (e.g. base quality recalibration) or filtering is required. The presence of splice junctions in RNA-seq data means that reads which have been mapped across splice junctions must be split to remove intronic parts, which would otherwise disrupt variant calling. After splitting, the information about which shorter reads originated from the same original read would generally be lost. This, in turn, would mean that more base changes would be ignored at the variant calling stage, since bases are typically ignored at both ends of each read, and that any overlap between the reads of an original pair could no longer be detected. Opossum overcomes these issues by merging overlapping reads and by modifying the base qualities of bases at the ends of the original reads before splitting them. As a result, all information is already incorporated into the reads, and the variant caller can be run with minimal settings. Opossum can be used together with different aligners (TopHat [7], Star [8]) and provides ways of adjusting for the peculiarities of each aligner.

While it has been designed to work particularly with Platypus [5], Opossum can be used equally well with other variant callers such as GATK HaplotypeCaller [9]. Our approach shows promising results, maintaining high precision and improving sensitivity in detecting SNP variant calls compared to the GATK Best Practices pipeline. As a reference, we have used the strongly validated GIAB (Genome in a Bottle) dataset [10].

Methods

Operation

Opossum [6] is written in Python and requires Python 2.7 (or greater) along with the Python packages Pysam v0.10.0 (https://github.com/pysam-developers/pysam), itertools, argparse, os and sys. Pysam v0.10.0 wraps htslib-1.3, samtools-1.3 and bcftools-1.3 [12]. Opossum has not been tested with the Python 3.X series.

As input, Opossum requires a position-sorted BAM file, which is then processed for variant calling. When running the program, the user should specify whether the input BAM file includes any soft clips ('SoftClipsExist', default=False). The user can also decide whether only properly paired reads should be considered ('ProperlyPaired', default=True) and what the minimum acceptable mapping quality for a read pair is ('MapCutoff', default=40).

Note that in TopHat and Star, mapping qualities can only take a restricted set of values: from 0 to 3 if a read maps to multiple locations, and 50 (TopHat) or 255 (Star) if a read is uniquely mapped. (In the SAM format specification, a value of 255 indicates that a mapping quality is not available. Opossum therefore reassigns a quality value of 50 to these reads. Alternatively, Star can be run with the option '--outSAMmapqUnique 50' to modify the value assigned to uniquely mapped reads.) The precise 'MapCutoff' value is therefore not important for these mappers as long as it is between 4 and 49. However, it could become relevant if Opossum is used in conjunction with other mappers, e.g. HiSat2 [13], as quality scores can then take a wider range of values.

Opossum outputs a sorted and indexed BAM file, on which SNP variant calling can be carried out (e.g. with Platypus) using minimal settings, since Opossum has already cleaned the data. By default, Platypus flags variants that do not fulfill all of its filtering criteria [5]. These criteria have been designed to make the most out of DNA data. The same criteria can be used with RNA-seq data if the user wants to maximize precision at the cost of sensitivity. However, if the user seeks a greater balance between precision and sensitivity, it would be advisable to also include variants flagged as 'badReads', 'SC', and 'Q20' among the final variants.
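The mapping-quality cutoff and the relaxed Platypus filter described above can be illustrated with a short sketch. This is not code from Opossum or Platypus: the function names are hypothetical, and the sketch assumes that reads with a mapping quality equal to the cutoff are kept and that the VCF FILTER field lists its flags separated by semicolons.

```python
# Illustrative sketch only; the names below are hypothetical and are not
# taken from the Opossum or Platypus source code.

STAR_UNIQUE_MAPQ = 255  # Star's default value for uniquely mapped reads
REASSIGNED_MAPQ = 50    # value the text says such reads are given instead

# Platypus filter flags the text suggests tolerating for RNA-seq data.
RELAXED_FLAGS = frozenset(["PASS", "badReads", "SC", "Q20"])


def effective_mapq(mapq):
    """Map the 'not available' value 255 to 50; leave other values as-is."""
    return REASSIGNED_MAPQ if mapq == STAR_UNIQUE_MAPQ else mapq


def passes_map_cutoff(mapq, map_cutoff=40):
    """Return True if a read would survive a 'MapCutoff'-style filter.

    Assumption: a read whose quality equals the cutoff is kept.
    """
    return effective_mapq(mapq) >= map_cutoff


def keep_variant(filter_field, allowed=RELAXED_FLAGS):
    """Return True if every flag in a VCF FILTER field is in `allowed`."""
    flags = [flag for flag in filter_field.split(";") if flag]
    return bool(flags) and all(flag in allowed for flag in flags)
```

With mapping qualities restricted to 0 to 3, 50 and 255, any cutoff between 4 and 49 separates the reads identically, which is why the exact 'MapCutoff' value does not matter for TopHat and Star.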

Implementation

Opossum starts by taking several quality control measures. It discards secondary alignments and reads that have a mapping quality lower than the cutoff specified by the user (via 'MapCutoff'). Opossum also discards read pairs whose reads have been aligned in the same direction or point outwards, as well as pairs whose two reads have been mapped to different chromosomes.

Next, Opossum removes read duplicates. Duplicates are defined as read pairs having identical 5' coordinates and orientations. After duplicate reads have been collected, the primary read is chosen among the properly paired reads based on which pair has the highest sum of base qualities. The primary read is then compared with each secondary read and modified to accommodate differences in the following way: if the primary and secondary reads have a base-wise discrepancy with a very low base quality (i.e. one or both reads have a base quality of less than 10), then the higher-quality base is kept. If both base qualities are above 10, then the corresponding base quality in the primary read is set to zero to reflect the uncertainty involved. This differs from e.g. Picard MarkDuplicates (https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates), which ignores read flags and does not modify primary reads. Single reads are discarded as duplicates if they have the same starting position as a paired-end read; otherwise, a primary read is chosen among the single read duplicates.

Opossum merges overlapping paired-end reads to avoid double-counting the overlapping part during variant calling. The user can specify whether overlapping paired-end reads having at least one base mismatch within the overlap region should be kept ('KeepMismatches', default=False). If they are kept and one of the reads has a very low-quality base at a mismatch position, then the higher-quality base is kept. Otherwise, if both base qualities are above 10, then the corresponding base quality in the merged read is set to zero. Reads with intronic regions (denoted by N in the CIGAR string) are split to keep only the exonic parts, resulting in new, shorter reads. If the overlapping parts of reads in a pair have not been aligned to the same exons, the pair is discarded as the mapping cannot be trusted. The final, merged reads are always aligned on the forward strand.

Bases located at either the beginning or the end of a read are particularly vulnerable to spurious base changes. The base changes at the beginning of the reads arise during first-strand cDNA synthesis using random hexamers [14], whereas the base changes at the end result from the read quality getting worse during sequencing and/or adapter read-through. To deal with this, Opossum ignores base changes in the first N and last M bases of the original read by setting the corresponding base qualities to zero ('MinFlankStart' and 'MinFlankEnd' parameters, default=0 for both). The values for N and M can be determined by evaluating the base mismatch rates at each position of the reads in the sample, as shown in Figure 1. N and M correspond to the positions at which the mismatch rate falls below a threshold that the user considers acceptable. In the example, the threshold for the error rate was set to 1 percent and therefore the corresponding 'MinFlankStart' value to 3, as the error rate has fallen below 1% at the third base position. The same applies to the last bases, with the error rate falling clearly below 1% at the third-to-last position, so 'MinFlankEnd' was set to 3 as well. Opossum does not currently differentiate between first and second strands, and therefore the parameter values obtained for the first strand are applied to all reads. Although the second strand should have fewer base mismatches [14], it is worth checking that the chosen parameters are appropriate for it as well. We have provided the code for computing base mismatch rates on GitHub.
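The base-wise rule used when reconciling a primary read with a duplicate, or when merging an overlapping pair, can be sketched as follows. This is a minimal illustration rather than the Opossum implementation: the function names are hypothetical, and three points the text leaves open are filled in as assumptions (a quality of exactly 10 is treated as not low, ties go to the primary read, and positions where the bases agree are left unchanged).

```python
# Illustrative sketch only; not taken from the Opossum source code.

LOW_QUALITY = 10  # threshold named in the text


def reconcile_base(primary, secondary, low_quality=LOW_QUALITY):
    """Resolve one position between a primary and a secondary read.

    Each argument is a (base, quality) tuple. The return value is the
    (base, quality) pair to store in the primary (or merged) read.
    """
    p_base, p_qual = primary
    s_base, s_qual = secondary
    if p_base == s_base:
        # Assumption: agreeing positions are left as they are.
        return primary
    if p_qual < low_quality or s_qual < low_quality:
        # At least one call is very unreliable: keep the higher-quality base.
        # Assumption: on a tie the primary read wins.
        return primary if p_qual >= s_qual else secondary
    # Both calls look confident but disagree: keep the primary base and
    # set its quality to zero so that a variant caller ignores it.
    return (p_base, 0)


def reconcile_reads(primary, secondary):
    """Apply reconcile_base position by position to two aligned reads."""
    return [reconcile_base(p, s) for p, s in zip(primary, secondary)]
```

Setting the quality to zero, instead of dropping the read, keeps the rest of the read available to the variant caller while removing the disputed base from consideration.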

Figure 1.

Percentage of error nucleotides at first four positions (left column) and last four positions (right column) in the first strands.

RNA-seq data from GM12878 [11], mapped with TopHat2 v2.0.12.

The behavior of the 'MinFlank' parameters depends on whether the user has set the 'SoftClipsExist' parameter to True. If so, 'MinFlankStart' and 'MinFlankEnd' are only applied to reads containing soft clips. This is because soft clips indicate that the mapper has had more trouble aligning the read, and such a read can exhibit a much higher base mismatch rate than a read without soft clips. Whether or not the BAM file contains reads with soft clips depends on the mapper used. For instance, with default settings, Star [8] is a more aggressive mapper than TopHat [7], tolerating many more base mismatches and marking those occurring at read ends as soft clips.
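The interaction between the flank parameters and 'SoftClipsExist' can be sketched on a plain list of base qualities. This is an illustration under simplifying assumptions, not the Opossum code: the function name is hypothetical, the read is represented only by its CIGAR string and quality list, and a soft clip is detected by looking for the S operation in the CIGAR string.

```python
# Illustrative sketch only; not taken from the Opossum source code.


def mask_flanks(quals, cigar, min_flank_start=0, min_flank_end=0,
                soft_clips_exist=False):
    """Return a copy of `quals` with the flank base qualities set to zero.

    If `soft_clips_exist` is True, only reads whose CIGAR string contains
    a soft clip (operation 'S') are masked; other reads are left untouched.
    """
    masked = list(quals)
    if soft_clips_exist and "S" not in cigar:
        return masked
    n = min(min_flank_start, len(masked))
    m = min(min_flank_end, len(masked))
    masked[:n] = [0] * n
    if m:
        masked[-m:] = [0] * m
    return masked
```

For example, `mask_flanks([30] * 6, "6M", 2, 1)` zeroes the first two and the last quality, whereas the same call with `soft_clips_exist=True` returns the qualities unchanged because the read has no soft clip.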

Results

RNA-seq data from the pilot genome GM12878 (https://www.encodeproject.org/experiments//ENCSR000COQ/, GEO accession code: GSM758559) [11] was used to validate the performance of Opossum. The data consisted of 26,978,818 paired-end 76 bp reads and was mapped with two different aligners, TopHat2 (v2.0.12) [7] and Star 2-pass (v2.4.2) [8], which have been shown to be among the best aligners for RNA-seq data [15]. The aligned reads were then processed with Opossum, followed by variant calling with either Platypus (v0.8.1) [5] or GATK HaplotypeCaller (v3.4) [9]. When using Platypus, variants flagged as 'badReads', 'SC', or 'Q20' were also taken into account. The results were compared with the benchmark variant calls (v2.19) provided by GIAB (Genome in a Bottle Consortium) for NA12878 (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv2.19/, [10]). The bed file corresponding to GIAB v2.19 was used to restrict variant calls to reliable regions only.

Both precision and sensitivity were computed to evaluate the performance of each variant calling pipeline: Opossum + Platypus, Opossum + GATK HaplotypeCaller, and the GATK pipeline (following its Best Practices for RNA-seq guideline, https://software.broadinstitute.org/gatk//guide/article?id=3891). Precision is defined as the fraction of true positives out of all variant calls in RNA-seq data that are supported by at least two reads (two reads is the minimum required by Platypus and GATK HaplotypeCaller by default). For evaluation purposes, called variants that have previously been reported as RNA-editing sites [16] were excluded. Sensitivity is defined as the fraction of true positives out of all variant calls in the reference data (true positives + false negatives) that are supported by at least two reads in the original (deduplicated but otherwise unprocessed) BAM file.

Table 1 shows that pre-processing RNA-seq data with Opossum maintains high precision and improves sensitivity regardless of whether variant calling is done with GATK or Platypus. For RNA-seq data mapped with TopHat2, precision improves slightly if the data is pre-processed with Opossum, while sensitivity increases by 2–3%. For data mapped with Star 2-pass, the Opossum + Platypus pipeline stands out by improving the sensitivity by more than 4%. It is also worth noting that pre-processing with Opossum slightly improves both precision and sensitivity when used in conjunction with GATK HaplotypeCaller, even though Star is recommended by GATK Best Practices and should therefore provide optimal input for the GATK variant caller.
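The two evaluation metrics defined above can be written out as a short sketch. It is not the evaluation script used for Table 1: the function name and the (chromosome, position, ref, alt) variant keys are hypothetical, and the inputs are assumed to have already been restricted to the GIAB reliable regions and to sites supported by at least two reads. RNA-editing sites are removed from the called set only, as the text states; how they were treated in the reference set is not specified and is left untouched here.

```python
# Illustrative sketch only; not the evaluation code behind Table 1.


def precision_and_sensitivity(called, truth, editing_sites=()):
    """Return (precision, sensitivity) as fractions between 0 and 1.

    `called` and `truth` are collections of hashable variant keys, e.g.
    (chromosome, position, ref, alt) tuples. Keys listed in
    `editing_sites` are dropped from `called` before scoring.
    """
    called = set(called) - set(editing_sites)
    truth = set(truth)
    true_positives = called & truth
    # precision = TP / (TP + FP); sensitivity = TP / (TP + FN)
    precision = float(len(true_positives)) / len(called) if called else float("nan")
    sensitivity = float(len(true_positives)) / len(truth) if truth else float("nan")
    return precision, sensitivity
```

Multiplying the returned fractions by 100 gives percentages comparable in form to those reported in Table 1.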

Table 1.

Precision, sensitivity, and runtimes for the three different variant calling pipelines.

Mapper	Variant calling pipeline	Runtime	Precision (%)	Sensitivity (%)
TopHat2	GATK Best Practices	11 h 50 min	97.04	90.08
	Opossum + GATK HaplotypeCaller	13 h 35 min	97.88	92.20
	Opossum + Platypus	5 h 40 min	97.33	92.96

Star 2-pass	GATK Best Practices	14 h 45 min	96.37	88.47
	Opossum + GATK HaplotypeCaller	15 h 35 min	96.92	89.65
	Opossum + Platypus	7 h 0 min	95.23	94.07

Using Platypus also offers a substantial reduction in runtimes compared to GATK: the runtimes fell by at least 50%. This is in line with the processing times reported in the original Platypus publication [5]. Precision and sensitivity are presented as a function of the number of supporting bases in Figure 2 and Figure 3. It can be seen that sensitivities converge rapidly to their final value: approximately four supporting reads are enough to detect a variant with a very high probability. Figure 3 also shows that the superiority of the Opossum + Platypus pipeline regarding sensitivity originates from variant calls in very low-coverage areas, with only 2–3 supporting reads. According to Figure 2, precision reaches around 90% with four supporting reads and then steadily increases with higher coverage, with no major differences in performance between the three pipelines. A variant requires at least two supporting reads to be considered for either precision or sensitivity in the first place.
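The runtime claim can be checked directly against the values in Table 1. The sketch below is only a worked calculation: the helper names are hypothetical, and the runtimes are copied from the GATK Best Practices and Opossum + Platypus rows of the table.

```python
# Worked check of the runtime reduction, using the values from Table 1.
import re


def to_minutes(runtime):
    """Convert a string such as '11 h 50 min' to a number of minutes."""
    match = re.match(r"^\s*(\d+)\s*h\s*(\d+)\s*min\s*$", runtime)
    if match is None:
        raise ValueError("unrecognised runtime: %r" % runtime)
    hours, minutes = (int(value) for value in match.groups())
    return 60 * hours + minutes


def runtime_reduction(baseline, candidate):
    """Return the percentage by which `candidate` is faster than `baseline`."""
    base = to_minutes(baseline)
    return 100.0 * (base - to_minutes(candidate)) / base


# (GATK Best Practices, Opossum + Platypus) runtimes per mapper.
TABLE_1_RUNTIMES = {
    "TopHat2": ("11 h 50 min", "5 h 40 min"),
    "Star 2-pass": ("14 h 45 min", "7 h 0 min"),
}

REDUCTIONS = {
    mapper: runtime_reduction(gatk, platypus)
    for mapper, (gatk, platypus) in TABLE_1_RUNTIMES.items()
}
```

The reductions come out at about 52.1% for TopHat2 (710 to 340 minutes) and about 52.5% for Star 2-pass (885 to 420 minutes), consistent with the statement that runtimes fell by at least 50%.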

Figure 2.

Precision as a function of the number of supporting bases.

RNA-seq data mapped with TopHat2 v2.0.12. GATK HC stands for GATK HaplotypeCaller v3.4.

Figure 3.

Sensitivity as a function of the number of supporting bases.

RNA-seq data mapped with TopHat2 v2.0.12. GATK HC stands for GATK HaplotypeCaller v3.4.

In conclusion, the combination of Opossum + Platypus is recommended especially when the user aims for high sensitivity for SNPs, regardless of the mapper used. Moreover, Opossum + Platypus provides the best results with the fastest runtimes, so it is ideal when there are only limited time or computational resources available. Having validated the capability of Opossum to process RNA-seq data for SNP detection, the next logical step would be to extend its use to detecting indels in future releases. This not only poses stricter demands on the variant caller, but also specifically on the aligner used [17], and has not yet been explored extensively in the literature. Compatibility will also be tested further with other RNA-seq aligners (e.g. HiSat2 [13]) and with future developments of variant callers.

Software availability

Latest source code: https://github.com/BSGOxford/Opossum Archived source code as at the time of publication: https://dx.doi.org/10.5281/zenodo.223009

License

GNU GPL v3.

Referee reports

My concerns have been addressed appropriately in version 2 and I approve indexing of the article without further reservations. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

The authors had applied some clever techniques to remove potential false mutations introduced by splicing or RNA editing and developed a pipeline which is even better than the "gold standard" GATK best practice for RNAseq according to the benchmarking. It is a good addition to the ever-growing RNAseq tool box. However, the author should clarify few wrong claims. First and foremost, "RNA-seq provides a cost-effective alternative to whole genome sequencing (WGS) for detecting genomic variants" is a wrong claim since RNAseq only cover partial of the genome where gene are expressed. The genomics coverage provided by RNAseq is different in different tissues or under various biological conditions. RNAseq only covers about 20-40% of exome. This sentence needs to be re-written or removed. Based on this page (https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/), both TopHat and Star are using quite discrete mapping quality scores. The default cutoff of 40 doesn't make too much sense here. A cutoff of from 4 to 49 will create the same result. The author should point out this pitfall and propose a better scoring method for removing bad quality reads. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

OPOSSUM prepares RNAseq data for variant calling and addresses a very important issue in the use of RNAseq for variant calling: preprocessing. Opossum seems to be faster than GATK and provides some improvement in sensitivity. In GATK RNAseq best practice, after RNAseq data preprocessing, there is Indels realignment and base recalibration (http://gatkforums.broadinstitute.org/gatk/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail). Is this part not required for variant calling after OPOSSUM preprocessing? Minor comment: This link is broken: https://github.com/luntergroup/octopu I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Data from RNA-Seq, usually used for expression analysis, can be coopted to find DNA variants in expressed regions and sites of RNA-editing. A caveat lies in the fact that the two sources of variation can not necessarily be distinguished in a straight-forward manner, and that analyses of allele specific expression might be hampered by biases in mapping and variant calling. mRNA-levels vary by many orders of magnitude, so in order to detect variants in lowly expressed genes, the detection method has to be precise and sensitive in regions covered by only a few reads. Taking this into account and with the focus being restricted to expressed regions of the genome, RNA-Seq is a cost-effective alternative to whole genome sequencing. A tool that helps improving the process, by increasing precision, sensitivity and processing speed would be useful and, indeed, would make the most out of RNA-Seq. The authors show that Opossum meets these demands.

Rather than being a variant caller itself, Opossum is basically a preprocessing pipeline to make RNA-seq reads better suited for variant calling than the original raw data. The process executed by Opossum includes: quality control and removal of spuriously mapped read-pairs; duplicate removal and solving of variant calling conflicts between read duplicates; merging of overlapping reads; splitting of intron-spanning reads; and flagging of first N and last M bases to be ignored. This is described in the manuscript in a clear and comprehensive manner.

The authors show that there is a marked increase in sensitivity using the combination of Opossum and Platypus, compared to the GATK Best Practices Pipeline. Likewise, computation time is significantly reduced. This supports the claim that Opossum is a useful tool for variant calling of RNA-Seq data. There are a couple of points that remain to be addressed, though:

1. Opossum is a python script, so installation is not a problem. However, it uses samtools sort, and there is an incompatibility with samtools versions. The samtools version used to test the software (1.2) requires a file prefix for temporary files to be stated, which the Opossum code fails to do, causing an error. This should be fixed or at least the dependencies should be stated clearly.

2. It remains unexplained how much of the described improvement of sensitivity is due to Opossum processing or to Platypus variant calling (compared to GATK). We are only presented results with Opossum and Platypus in combination. Is it possible to use Platypus on RNA-seq data at all without the Opossum step? This is not discussed in the manuscript. The authors should make that point clearer.

3. As a minor remark, the sentence in the last paragraph "Having validated the capability of Opossum to detect SNPs" is not entirely accurate, since Opossum itself does not do the variant calling.

In conclusion, Opossum is a tool that is useful for a specific task in the variant calling process of RNA-Seq data. The Opossum/Platypus combination results in an increased sensitivity and reduced computation time compared to the GATK Best Practices pipeline. This is of potential benefit for researchers interested in genomic variation in expressed regions, especially in allele-specific expression, and in RNA editing. Therefore, this manuscript deserves to be indexed once the above mentioned points have been addressed. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

16 in total

1. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing.

Authors: Beryl B Cummings; Jamie L Marshall; Taru Tukiainen; Monkol Lek; Sandra Donkervoort; A Reghan Foley; Veronique Bolduc; Leigh B Waddell; Sarah A Sandaradura; Gina L O'Grady; Elicia Estrella; Hemakumar M Reddy; Fengmei Zhao; Ben Weisburd; Konrad J Karczewski; Anne H O'Donnell-Luria; Daniel Birnbaum; Anna Sarkozy; Ying Hu; Hernan Gonorazky; Kristl Claeys; Himanshu Joshi; Adam Bournazos; Emily C Oates; Roula Ghaoui; Mark R Davis; Nigel G Laing; Ana Topf; Peter B Kang; Alan H Beggs; Kathryn N North; Volker Straub; James J Dowling; Francesco Muntoni; Nigel F Clarke; Sandra T Cooper; Carsten G Bönnemann; Daniel G MacArthur
Journal: Sci Transl Med Date: 2017-04-19 Impact factor: 17.956

2. STAR: ultrafast universal RNA-seq aligner.

Authors: Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal: Bioinformatics Date: 2012-10-25 Impact factor: 6.937

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

Review 4. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

5. Consistent errors in first strand cDNA due to random hexamer mispriming.

Authors: Thomas P van Gurp; Lauren M McIntyre; Koen J F Verhoeven
Journal: PLoS One Date: 2013-12-30 Impact factor: 3.240

6. Indel detection from RNA-seq data: tool evaluation and strategies for accurate detection of actionable mutations.

Authors: Zhifu Sun; Aditya Bhagwate; Naresh Prodduturi; Ping Yang; Jean-Pierre A Kocher
Journal: Brief Bioinform Date: 2017-11-01 Impact factor: 11.622

7. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications.

Authors: Andy Rimmer; Hang Phan; Iain Mathieson; Zamin Iqbal; Stephen R F Twigg; Andrew O M Wilkie; Gil McVean; Gerton Lunter
Journal: Nat Genet Date: 2014-07-13 Impact factor: 38.330

8. An integrated encyclopedia of DNA elements in the human genome.

Authors: ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

9. RADAR: a rigorously annotated database of A-to-I RNA editing.

Authors: Gokul Ramaswami; Jin Billy Li
Journal: Nucleic Acids Res Date: 2013-10-25 Impact factor: 16.971

10. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.

Authors: Daehwan Kim; Geo Pertea; Cole Trapnell; Harold Pimentel; Ryan Kelley; Steven L Salzberg
Journal: Genome Biol Date: 2013-04-25 Impact factor: 13.583
