Jérôme Audoux, Mikaël Salson, Christophe F Grosset, Sacha Beaumeunier, Jean-Marc Holder, Thérèse Commes, Nicolas Philippe.
Abstract
BACKGROUND: The evolution of next-generation sequencing (NGS) technologies has led to increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq analysis, each with its own performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which tools best fit their specific needs and how those tools should be configured. To help answer these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq analysis and propose a methodology for the systematic evaluation and comparison of performance, so that users can make well-informed choices.
Keywords: Benchmark; Pipeline optimization; RNA-Seq; Transcriptomics
Year: 2017 PMID: 28969586 PMCID: PMC5623974 DOI: 10.1186/s12859-017-1831-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of the SimBA benchmarking procedure. A benchmarking pipeline implemented with SimBA is composed of three components: (i) simulation of synthetic data using SimCT, (ii) processing of the synthetic data using a pipeline manager (i.e. Snakemake [20]), and (iii) qualitative evaluation of the results using BenchCT
Fig. 2 SimCT method. SimCT takes a reference FASTA and GTF annotations as input. A first process introduces biological variations into this reference to create a mutated reference. This new reference is then passed to FluxSimulator to generate an RNA-Seq experiment. Finally, the FluxSimulator output is post-processed to transfer the coordinates from the mutated genome to the original reference
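The last step of this caption, transferring coordinates from the mutated genome back to the original reference, can be sketched as follows. This is a minimal illustration under stated assumptions, not SimCT's actual implementation: the function names `build_blocks` and `to_reference`, the 0-based coordinates, and the `(ref_pos, kind, size)` indel representation are all hypothetical. SNVs do not shift coordinates, so only indels are modelled.

```python
from bisect import bisect_right


def build_blocks(ref_length, indels):
    """Return the unchanged segments shared by both genomes.

    Each block is (mut_start, ref_start, length). `indels` is a list of
    (ref_pos, kind, size) sorted by ref_pos on the original reference:
    "ins" inserts `size` bases before ref_pos, "del" removes `size` bases
    starting at ref_pos. This representation is an assumption of the sketch.
    """
    blocks, ref_cur, mut_cur = [], 0, 0
    for ref_pos, kind, size in indels:
        if ref_pos > ref_cur:  # unchanged stretch before this indel
            blocks.append((mut_cur, ref_cur, ref_pos - ref_cur))
            mut_cur += ref_pos - ref_cur
            ref_cur = ref_pos
        if kind == "ins":
            mut_cur += size  # inserted bases exist only in the mutated genome
        else:
            ref_cur += size  # deleted bases exist only in the reference
    if ref_cur < ref_length:  # trailing unchanged stretch
        blocks.append((mut_cur, ref_cur, ref_length - ref_cur))
    return blocks


def to_reference(blocks, mut_pos):
    """Map a mutated-genome position to the reference.

    Returns None when the position lies inside an insertion (or outside the
    genome), because no reference base corresponds to it.
    """
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, mut_pos) - 1
    if i < 0:
        return None
    mut_start, ref_start, length = blocks[i]
    if mut_pos < mut_start + length:
        return ref_start + (mut_pos - mut_start)
    return None
```

With a 3-base insertion before reference position 10 and a 5-base deletion at position 50, `build_blocks(100, [(10, "ins", 3), (50, "del", 5)])` yields three aligned blocks, and positions falling inside the inserted bases map to `None`.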
Fig. 3 BenchCT evaluation procedures. BenchCT evaluates each event type with a specific procedure that allows approximate matching. For alignments, only the overlap between the prediction and the truth is evaluated. For splice junctions and fusions, the prediction must overlap a candidate in the truth database within an agreement distance limited by the threshold. For mutations (SNVs and indels), a similar procedure is used, together with a verification of the mutation itself: the mutated sequence for SNVs, and the length of the mutation for insertions and deletions
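The approximate matching described for splice junctions can be illustrated with a short sketch. It is one possible reading of the caption, not BenchCT's code: the function name `match_junctions`, the `(chrom, start, end)` tuples, the greedy one-to-one assignment, and the default `max_dist` value are all assumptions made for the example.

```python
def match_junctions(predicted, truth, max_dist=3):
    """Greedy approximate matching of junctions given as (chrom, start, end).

    A prediction counts as a true positive when an unused truth junction on
    the same chromosome has both boundaries within `max_dist` bases.
    `max_dist` is a hypothetical default, not a value taken from the paper.
    Returns (true_positives, false_positives, false_negatives).
    """
    by_chrom = {}
    for idx, (chrom, start, end) in enumerate(truth):
        by_chrom.setdefault(chrom, []).append((start, end, idx))

    used, tp, fp = set(), [], []
    for chrom, start, end in predicted:
        hit = None
        for t_start, t_end, idx in by_chrom.get(chrom, []):
            if idx in used:
                continue  # each truth event can be claimed only once
            if abs(start - t_start) <= max_dist and abs(end - t_end) <= max_dist:
                hit = idx
                break
        if hit is None:
            fp.append((chrom, start, end))
        else:
            used.add(hit)
            tp.append((chrom, start, end))

    fn = [t for i, t in enumerate(truth) if i not in used]
    return tp, fp, fn
```

In this sketch a second prediction of an already matched junction is counted as a false positive; a real evaluator might instead collapse duplicate predictions before matching.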
Software versions and parameters used to generate the results
| Software | Version | Parameters |
|---|---|---|
| HISAT2 | 2.0.4 | --max-intronlen 300000 --novel-splicesite-outfile {output.novel_splice} |
| HISAT2_2PASS | 2.0.4 | --max-intronlen 300000 --novel-splicesite-infile {input.novel_splice} --novel-splicesite-outfile {output.novel_splice} |
| STAR | v2.5.2b | --twopassMode Basic --alignMatesGapMax 300000 --alignIntronMax 300000 |
| STAR_fusion | v2.5.2b | --twopassMode Basic --alignMatesGapMax 300000 --alignIntronMax 300000 --chimSegmentMin 12 --chimJunctionOverhangMin 12 --chimSegmentReadGapMax 3 --alignSJstitchMismatchNmax 5 -1 5 5 |
| CRAC | 2.5.0 | -k 22 --detailed-sam --no-ambiguity --deep-snv |
| CRAC_fusion | 2.5.0 | -k 22 --detailed-sam --no-ambiguity --deep-snv --min-chimera-score 0 |
| GATK | 1.3.1 | -T HaplotypeCaller -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 |
| FREEBAYES | 1.0.2 | |
| SAMTOOLS mpileup | 1.3.1 | |
Summary of the data-set characteristics
| Dataset name | Read size | SNV | Insertion | Deletion | Splice | Colinear fusion | Non-colinear fusion |
|---|---|---|---|---|---|---|---|
| | 2×101 | 28554 | 1747 | 1908 | 94904 | 73 | 0 |
| | 2×150 | 28625 | 1754 | 1915 | 94855 | 64 | 0 |
| | 2×101 | 47139 | 2564 | 2716 | 132511 | 99 | 14 |
| | 2×150 | 47577 | 2648 | 2771 | 134235 | 108 | 15 |
Fig. 4 Precision and recall of SNV calling. a SNV precision/recall in the GRCh38-150bp-normal data-set. b SNV detection in the GRCh38-150bp-somatic data-set
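For readers unfamiliar with the two metrics plotted in these figures, the standard definitions can be written as a small helper. The function name `precision_recall` and the convention of returning 0.0 for an empty denominator are assumptions of this sketch; the counts in the usage note are purely illustrative and are not results from the paper.

```python
def precision_recall(n_tp, n_fp, n_fn):
    """Compute precision and recall from event counts.

    precision = TP / (TP + FP): fraction of predicted events that are true.
    recall    = TP / (TP + FN): fraction of true events that were predicted.
    Returning 0.0 for an empty denominator is a convention of this sketch.
    """
    n_predicted = n_tp + n_fp
    n_expected = n_tp + n_fn
    precision = n_tp / n_predicted if n_predicted else 0.0
    recall = n_tp / n_expected if n_expected else 0.0
    return precision, recall
```

As a made-up example, a caller reporting 90 true SNVs, 10 false SNVs, and missing 30 simulated SNVs would score a precision of 0.9 and a recall of 0.75.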
Fig. 5 Precision and recall of indel calling. a Insertion precision/recall in the GRCh38-150bp-somatic data-set. b Intersections of true-positive insertions found by the calling pipelines in the GRCh38-150bp-somatic data-set
Fig. 6 Precision and recall of gene fusion detection. Evaluation of gene fusion detection pipelines on the GRCh38-101bp-160-somatic data-set. Fusions were split into two categories, each evaluated individually. a Colinear fusions, where the fusion involves two genomic locations on the same strand of the same chromosome, separated by more than 300 kb. b Non-colinear fusions, which do not satisfy the colinear criteria
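The two fusion categories defined in this caption can be expressed as a simple rule. The sketch below is only an illustration of that rule: the function name `fusion_category`, its argument layout, and the constant name are hypothetical, and the input is assumed to be an event already known to be a fusion.

```python
COLINEAR_MIN_DISTANCE = 300_000  # 300 kb, the distance stated in the caption


def fusion_category(chrom_a, strand_a, pos_a, chrom_b, strand_b, pos_b):
    """Classify a fusion breakpoint pair as 'colinear' or 'non-colinear'.

    Colinear: both breakpoints lie on the same strand of the same chromosome
    and are more than 300 kb apart. Every other fusion is non-colinear.
    """
    same_chrom_and_strand = chrom_a == chrom_b and strand_a == strand_b
    if same_chrom_and_strand and abs(pos_a - pos_b) > COLINEAR_MIN_DISTANCE:
        return "colinear"
    return "non-colinear"
```

For instance, two breakpoints on the plus strand of the same chromosome that lie 1 Mb apart fall into the colinear category, whereas breakpoints on different chromosomes or opposite strands are non-colinear.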