Mario Stanke1,2, Willy Bruhn3, Felix Becker3,4, Katharina J Hoff3,4.
Abstract
BACKGROUND: Vast amounts of next-generation sequencing RNA data have been deposited in archives, accompanying very diverse original studies. These data are also readily available for other purposes, such as genome annotation or transcriptome assembly. However, selecting a subset of the available experiments, sequencing runs and reads for this purpose is a nontrivial task, complicated by the inhomogeneity of the data.
Keywords: Genome annotation; Online algorithm; RNA-Seq; Sample
Year: 2019 PMID: 31703556 PMCID: PMC6842140 DOI: 10.1186/s12859-019-3182-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Transcriptomic sequencing runs
| Species | Transcriptomic runs in SRA | ≤ 5% of reads align uniquely |
|---|---|---|
| | 597 | 15.5% |
| | 357 | 6.4% |
| | 1250 | 7.1% |
| | 685 | 21.2% |
| | 30207 | 22.8% |
| | 170 | 10.6% |
| | 31 | 32.2% |
| | 1252 | 9.2% |
| | 80 | 16.3% |
| | 268 | 13.2% |
| | 16139 | 30.5% |
| | 30 | 40.0% |
The second column shows the total number of RNA runs in SRA for each studied species. The last column shows the percentage of the runs sampled by VARUS whose first batch exhibited very low unique alignability: at least 95% of its reads aligned either not at all or multiple times with HISAT2. VARUS subsequently ignores such runs.
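The filtering rule described in this caption can be sketched as follows. This is a minimal illustration, not VARUS's actual implementation: the function name, the batch size of 50,000 reads, the run labels and the read counts are all hypothetical, and only the 5% threshold comes from the table.

```python
def has_low_unique_alignability(n_unique, n_reads, max_unique_fraction=0.05):
    """Return True if at most `max_unique_fraction` of a batch aligned uniquely.

    n_unique: reads in the batch with exactly one alignment
    n_reads:  total reads in the batch
    """
    if n_reads <= 0:
        raise ValueError("a batch must contain at least one read")
    return n_unique / n_reads <= max_unique_fraction


# Hypothetical first-batch results: run label -> (uniquely aligned reads, reads in batch)
first_batches = {
    "run_A": (41_000, 50_000),  # 82% unique: keep sampling from this run
    "run_B": (1_200, 50_000),   # 2.4% unique: ignore from now on
    "run_C": (2_000, 50_000),   # 4% unique: ignore from now on
}

kept = [
    run
    for run, (n_unique, n_reads) in first_batches.items()
    if not has_low_unique_alignability(n_unique, n_reads)
]
print(kept)
```

Deciding on the first batch alone keeps the cost of a poorly aligning run small, since no further reads are downloaded from it.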
Fig. 1 VARUS flowchart. VARUS itself outputs a file VARUS.bam with all spliced alignments for each species in the input list. In this study, these alignments were used to annotate the genomes with BRAKER [11].
Fig. 2 Expression diversity. Multidimensional scaling plot of the expression profiles of 1000 runs sampled from SRA for Drosophila melanogaster. The red dots mark the runs whose metadata description included the search string “gut”. The plot was created with edgeR [5].
Fig. 3 Intron accuracy. VARUS’ sensitivity (green curve) and specificity (red curve) as a function of the number of downloaded spots (= reads or read pairs), in order of download. The manual selection method (green and red dots) has a fixed number of reads; horizontal dashed lines at the manual sensitivity and specificity are drawn to facilitate comparison.
Intron accuracy
| Intron | #spots | # seq. | |||
|---|---|---|---|---|---|
| sn | sp | [M] | runs | ||
| VARUS | 50 | 566 | |||
| manual | .269 | .179 | 5 | ||
| VARUS | .365 | 50 | 357 | ||
| manual | .666 | 3 | |||
| VARUS | 618 | ||||
| manual | .68 | .26 | 88 | 5 | |
| VARUS | 643 | ||||
| manual | .91 | .262 | 126.6 | 7 | |
| VARUS | 758 | ||||
| manual | .896 | .264 | 58.5 | 5 | |
| VARUS | 170 | ||||
| manual | .823 | .322 | 71.9 | 7 | |
| VARUS | .237 | 50 | 31 | ||
| manual | .778 | 2 | |||
| VARUS | .717 | 403 | |||
| manual | .21 | 214.9 | 6 | ||
| VARUS | 80 | ||||
| manual | .83 | .352 | 71.9 | 6 | |
| VARUS | 400 | 266 | |||
| manual | .942 | .138 | 6 | ||
| VARUS | 983 | ||||
| manual | .85 | .002 | 99.8 | 5 | |
| VARUS | .222 | 30 | |||
| manual | .662 | 81.9 | 3 | ||
The sensitivity (sn) and specificity (sp) with which VARUS and the manual method find the introns of the reference genome annotation. The last two columns show the number of reads or read pairs (spots), in millions, that were downloaded from SRA and the number of different runs they stem from. Better values are typeset in boldface.
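As a rough illustration of how such intron accuracy values can be computed, the sketch below compares a set of introns supported by alignments against a reference set. It assumes the convention common in gene prediction, where specificity is the fraction of predicted introns that are correct (TP / (TP + FP)); the source does not spell out the definition. The function name and the coordinates are hypothetical.

```python
def intron_accuracy(predicted, reference):
    """Return (sensitivity, specificity) of predicted introns vs. a reference.

    Each intron is a hashable tuple such as (sequence, start, end, strand).
    Assumption: specificity = TP / (TP + FP), the gene-prediction convention.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical introns: (sequence, start, end, strand)
reference = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr2", 50, 90, "-"),
    ("chr2", 150, 250, "-"),
}
predicted = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 600, "+"),  # not in the reference
    ("chr2", 50, 90, "-"),
    ("chr3", 10, 20, "+"),    # not in the reference
}

sn, sp = intron_accuracy(predicted, reference)
print(f"sn={sn:.3f} sp={sp:.3f}")
```

An intron counts as found only if sequence, both boundaries and strand match exactly, which is why the introns are compared as whole tuples.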
Fig. 4 Relative annotation accuracy. The right side shows the difference in whole-genome annotation accuracy in percent. The F1-measure of coding exon accuracy (the harmonic mean of sensitivity and specificity) was used, whereby either annotation was compared to the respective reference annotation (see Supplementary Table 2). The left side shows the corresponding input data set sizes. The average and median number of spots chosen by the manual method are 100 and 77 million, respectively, both larger than the 50 million spots downloaded by VARUS. Averaged over the species, the F1 accuracy of BRAKER is 0.62% higher when the RNA-Seq data is selected by VARUS rather than manually.
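The F1-measure named in this caption is the harmonic mean of sensitivity and specificity, F1 = 2 · sn · sp / (sn + sp). A minimal sketch follows; the function names and the accuracy values are hypothetical, and reading the "difference in percent" as percentage points is an assumption.

```python
def f1_measure(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity (0.0 if both are 0)."""
    total = sensitivity + specificity
    if total == 0:
        return 0.0
    return 2 * sensitivity * specificity / total


def f1_difference_percent(f1_first, f1_second):
    """Difference of two F1 values in percentage points (assumed reading)."""
    return 100 * (f1_first - f1_second)


# Hypothetical coding exon accuracies for two annotations of the same genome
f1_with_varus = f1_measure(0.80, 0.60)
f1_with_manual = f1_measure(0.78, 0.60)
print(f"{f1_difference_percent(f1_with_varus, f1_with_manual):.2f}")
```

The harmonic mean is dominated by the smaller of the two values, so an annotation cannot score well by trading away most of its specificity for sensitivity, or the reverse.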