Abstract
BACKGROUND: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing.
Keywords: Error correction; Next-generation sequencing; RNA-seq; k-mers
Year: 2015 PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1. Path extension in Rcorrector. Four possible path continuations at the AGTC k-mer (k = 4) in the De Bruijn graph for the read sequence r = AAGTCATAA. Numbers in the vertices represent k-mer counts. The first (top) path corresponds to the original read's representation in the De Bruijn graph. The extension is pruned after the first step, AGTC → GTCA, as the count M(GTCA) = 4 falls below the local cutoff (determined based on the maximum k-mer count (494) of the four possible successors of AGTC). The second path (yellow) has higher k-mer counts but it introduces four corrections, changing the read into AAGTCCGTC. The third path (blue) introduces only two corrections, to change the sequence into AAGTCGTTA, and is therefore chosen to correct the read. The fourth (bottom) path is pruned as the k-mer count for GTCT does not pass the threshold. Paths 2 and 3 are likely to indicate paralogs and/or splice variants of this gene.
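The final selection step described in the caption can be illustrated with a minimal Python sketch. It only lists the k-mers of the read and counts the base changes each surviving path would introduce. It does not build the De Bruijn graph or compute the local cutoff, whose exact formula is not given here. The helper names `kmers`, `corrections`, and `pick_path` are hypothetical and are not part of Rcorrector.

```python
def kmers(seq, k):
    """Return the consecutive k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def corrections(read, candidate):
    """Count positions where the candidate differs from the read (substitutions only)."""
    if len(read) != len(candidate):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(read, candidate))


def pick_path(read, candidates):
    """Pick the candidate needing the fewest corrections.

    Assumption for illustration: ties go to the first candidate listed.
    """
    return min(candidates, key=lambda c: corrections(read, c))


read = "AAGTCATAA"
# Sequences spelled by the two unpruned paths in the figure (paths 2 and 3).
surviving = ["AAGTCCGTC", "AAGTCGTTA"]

print(kmers(read, 4))
for cand in surviving:
    print(cand, corrections(read, cand))
print(pick_path(read, surviving))
```

On the figure's example this counts four corrections for path 2 and two for path 3, so path 3 is selected, in agreement with the caption.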
Accuracy of the six error correction methods on the 100 million simulated reads
| Program | k | Recall | Precision | F-score | Run time (min) | Memory (GB) |
|---|---|---|---|---|---|---|
| SEECER | 31 | 87.13 | 96.93 | 91.77 | 177 | 61 |
| HShrec | - | 69.53 | 31.74 | 43.58 | 13641 | 30 |
| Coral | 31 | 58.35 | 85.14 | 69.25 | 1391 | 81 |
| Musket | 27 | 78.24 | 96.90 | 86.58 | 152 | |
| BFC | 27 | 80.45 | 97.91 | 88.32 | | 6 |
| Rcorrector | 27 | | | | 118 | 5 |
Best performers in each category are highlighted in italic. All programs were run multithreaded, with eight threads
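The F-score column above appears to be the harmonic mean of recall and precision. The short Python check below recomputes it from the recall and precision values that are present in the table. The function name `f_score` is a hypothetical helper, and the 0.02 tolerance is an assumption chosen to absorb rounding in the published figures.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F-score) taken from the rows with complete values.
rows = {
    "SEECER": (87.13, 96.93, 91.77),
    "HShrec": (69.53, 31.74, 43.58),
    "Coral": (58.35, 85.14, 69.25),
    "Musket": (78.24, 96.90, 86.58),
    "BFC": (80.45, 97.91, 88.32),
}

for name, (recall, precision, reported) in rows.items():
    computed = f_score(recall, precision)
    # Inputs are already rounded to two decimals, so allow a small tolerance.
    status = "ok" if abs(computed - reported) < 0.02 else "differs"
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f} ({status})")
```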
Accuracy of six error correction methods on 100 million simulated reads, by expression level of transcripts. k-mer sizes used are those in Table 1
| Program | Recall | Precision | F-score |
|---|---|---|---|
| Low expression | | | |
| SEECER | 32.78 | | 48.14 |
| HShrec | 24.77 | 0.81 | 1.56 |
| Coral | 31.88 | 64.60 | 42.69 |
| Musket | 13.88 | 33.94 | 19.71 |
| BFC | 25.18 | 58.37 | 35.19 |
| Rcorrector | | 86.62 | |
| Medium expression | | | |
| SEECER | 86.58 | 97.05 | 91.51 |
| HShrec | 70.57 | 19.57 | 30.64 |
| Coral | 89.07 | 85.12 | 87.05 |
| Musket | 72.02 | 92.16 | 80.86 |
| BFC | | 96.88 | 92.84 |
| Rcorrector | 87.73 | | |
| High expression | | | |
| SEECER | 87.39 | 96.90 | 91.90 |
| HShrec | 69.22 | 41.67 | 52.02 |
| Coral | 47.59 | 85.17 | 61.06 |
| Musket | 80.50 | 98.53 | 88.61 |
| BFC | 77.47 | 98.35 | 86.67 |
| Rcorrector | | | |
Best performers are highlighted in italic
Summary of datasets included in the evaluation
| Name | Reads | Read length (bp) | Aligned | Perfectly aligned |
|---|---|---|---|---|
| Simulated | 99,338,716 | 100 | 81,994,413 | 21,070,024 |
| Peach | 38,883,238 | 75 | 24,775,386 | 5,617,514 |
| Lung | 113,313,254 | 50 | 110,771,941 | 85,160,322 |
| Geuvadis | 65,015,656 | 75 | 59,130,806 | 26,468,128 |
Best performers are highlighted in italic
Tophat2 alignments of simulated and real reads
| Program | k | Aligned | Observed rate | Base match rate | Specificity |
|---|---|---|---|---|---|
| Simulated reads | | | | | |
| Original | - | 81,994,413 | 82.540 | 99.391 | - |
| SEECER | 31 | 85,374,347 | | | 99.619 |
| HShrec | - | 77,488,558 | 78.004 | 99.888 | 97.886 |
| Coral | 31 | 84,662,510 | 85.226 | 99.745 | 99.494 |
| Musket | 27 | 84,892,466 | 85.458 | 99.906 | 99.739 |
| BFC | 27 | 84,844,168 | 85.409 | 99.918 | 99.889 |
| Rcorrector | 27 | 85,033,277 | 85.599 | 99.986 | |
| Peach | | | | | |
| Original | - | 24,775,386 | 63.717 | 99.198 | - |
| SEECER | 27 | 29,056,747 | 74.728 | | 99.199 |
| HShrec | - | 24,496,308 | 63.000 | 99.265 | 96.027 |
| Coral | 23 | 28,974,141 | 74.516 | 99.316 | 99.027 |
| Musket | 27 | 28,345,203 | 72.898 | 99.256 | 99.677 |
| BFC | 31 | 26,553,943 | 68.291 | 99.278 | |
| Rcorrector | 23 | 30,563,388 | | 99.833 | 99.628 |
| Lung | | | | | |
| Original | - | 110,771,941 | 97.757 | 99.717 | - |
| SEECER | 23 | 111,261,651 | 98.189 | | 98.239 |
| HShrec | - | 102,121,932 | 90.124 | 99.781 | 89.786 |
| Coral | 23 | 111,107,133 | 98.053 | 99.809 | 98.330 |
| Musket | 27 | 110,907,828 | 97.877 | 99.781 | 98.698 |
| BFC | 23 | 111,427,773 | | 99.824 | 99.359 |
| Rcorrector | 23 | 111,198,587 | 98.134 | 99.830 | |
| Geuvadis | | | | | |
| Original | - | 59,130,806 | 90.949 | 99.477 | - |
| SEECER | 23 | 61,514,024 | 94.614 | | 98.530 |
| HShrec | 23 | 51,669,686 | 79.473 | 99.709 | 87.924 |
| Coral | 23 | 61,399,007 | 94.437 | 99.717 | 98.049 |
| Musket | 23 | 60,450,316 | 92.978 | 99.652 | 97.900 |
| BFC | 23 | 61,870,897 | | 99.775 | 98.790 |
| Rcorrector | 23 | 61,641,866 | 94.811 | 99.814 | |
Best performers are highlighted in italic
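The alignment rates reported for the original reads are consistent with dividing the number of aligned reads by the total number of reads given in the dataset summary. The Python sketch below recomputes them under that assumption. The function name `alignment_rate` is a hypothetical helper, and the interpretation of the rate as a simple percentage of input reads is inferred from the numbers rather than stated in the text.

```python
def alignment_rate(aligned, total_reads):
    """Percentage of input reads that aligned."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * aligned / total_reads


# (total reads, aligned reads, reported rate) for the uncorrected reads.
datasets = {
    "Simulated": (99_338_716, 81_994_413, 82.540),
    "Peach": (38_883_238, 24_775_386, 63.717),
    "Lung": (113_313_254, 110_771_941, 97.757),
    "Geuvadis": (65_015_656, 59_130_806, 90.949),
}

for name, (total, aligned, reported) in datasets.items():
    rate = alignment_rate(aligned, total)
    print(f"{name}: computed {rate:.3f}, reported {reported:.3f}")
```

The same division applied to a corrected read set (for example, 85,033,277 aligned reads out of 99,338,716 for Rcorrector on the simulated data) gives the corresponding rate for that program.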
Oases assembly of simulated and real reads
| Program | Simulated recall | Simulated precision | Peach recall | Peach precision | Lung recall | Lung precision | Geuvadis recall | Geuvadis precision |
|---|---|---|---|---|---|---|---|---|
| Original | 30.575 | 48.862 | 28.879 | | 4.957 | 10.475 | 5.997 | 16.749 |
| SEECER | 36.698 | | | 16.116 | 4.944 | 10.174 | 6.162 | 16.639 |
| HShrec | 23.334 | 47.417 | 26.132 | 13.850 | 3.608 | | 4.266 | 19.101 |
| Coral | 35.039 | 51.942 | 29.784 | 15.881 | 4.934 | 10.174 | 6.170 | 16.372 |
| Musket | 33.845 | 47.769 | 28.760 | 15.991 | 4.920 | 10.577 | 5.846 | 16.901 |
| BFC | 34.789 | 50.579 | 29.633 | 16.211 | | 10.498 | 6.166 | 16.509 |
| Rcorrector | | 52.144 | 29.355 | 15.951 | 5.012 | 10.478 | | 16.375 |
Best performers are highlighted in italic
Fig. 2. Transcripts assembled from the original and error-corrected reads at the MTMR11 gene locus. Rcorrector (bottom panel) improves upon the original reads and leads to the most complete reconstruction of the transcript.
Bowtie2 alignment of single-cell sequencing reads
| Program | k | Aligned | Rate | Base match rate | Specificity |
|---|---|---|---|---|---|
| Original | - | 27,002,682 | 92.716 | 98.863 | - |
| SPAdes | - | 27,104,190 | 93.065 | 99.675 | 41.482 |
| SEECER | 27 | 26,937,652 | 92.493 | 99.507 | 99.553 |
| Rcorrector | 19 | 27,227,855 | | | |
Best performers are highlighted in italic
SPAdes assembly of single-cell sequencing reads. NG50 is the minimum contig length such that the total number of bases in contigs this size or longer represents more than half of the length of the reference genome
| Program | NG50 | Misassembly | Edits/100 kbps | Genome coverage |
|---|---|---|---|---|
| Original | 105,623 | | | 95.054 |
| SPAdes | 109,876 | 2 | 7.52 | 94.903 |
| SEECER | | 2 | 7.26 | 95.059 |
| Rcorrector | | | 10.02 | |
Best performers are highlighted in italic
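The NG50 definition in the table caption can be sketched in a few lines of Python. The sketch reads the definition in the usual way: walk the contigs from longest to shortest and report the contig length at which the running total first exceeds half of the reference genome length. The function name `ng50` and the toy contig lengths are illustrative assumptions, not data from the evaluation.

```python
def ng50(contig_lengths, genome_length):
    """Return NG50, or None if the contigs never cover more than half the genome."""
    half = genome_length / 2
    covered = 0
    # Longest contigs first, so the first length to cross the threshold is NG50.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered > half:
            return length
    return None


# Toy example (hypothetical lengths): reference of 300 bases, half is 150.
contigs = [80, 70, 50, 40, 30, 20]
print(ng50(contigs, 300))
```

In the toy example the two longest contigs total exactly 150 bases, which is not more than half, so the third contig (length 50) is the one that crosses the threshold. Unlike N50, the threshold depends on the reference length rather than on the total assembly size, which is why an assembly can fail to have an NG50 at all.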