| Literature DB >> 23558750 |
Hai-Son Le1, Marcel H Schulz, Brenna M McCauley, Veronica F Hinman, Ziv Bar-Joseph.
Abstract
Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Read error correction can have a large impact on our ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequence data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing. Here we present SEquencing Error CorrEction in Rna-seq data (SEECER), a hidden Markov Model (HMM)-based method, which is the first to successfully address these problems. SEECER efficiently learns hundreds of thousands of HMMs and uses these to correct sequencing errors. Using human RNA-Seq data, we show that SEECER greatly improves on previous methods in terms of quality of read alignment to the genome and assembly accuracy. To illustrate the usefulness of SEECER for de novo transcriptome studies, we generated new RNA-Seq data to study the development of the sea cucumber Parastichopus parvimensis. Our corrected assembled transcripts shed new light on two important stages in sea cucumber development. Comparison of the assembled transcripts to known transcripts in other species has also revealed novel transcripts that are unique to sea cucumber, some of which we have experimentally validated. Supporting website: http://sb.cs.cmu.edu/seecer/.Entities:
Mesh:
Year: 2013 PMID: 23558750 PMCID: PMC3664804 DOI: 10.1093/nar/gkt215
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1.An overview of SEECER. Step 1: We select a random read that has not yet been assigned to any contig HMM. Next, we extract all reads with at least k consecutive nucleotides that overlap with the selected read. Step 2: We cluster all reads and then select the most coherent subset as the initial set of the contig HMM. Step 3: We learn an initial HMM using the alignment specified by the k-mer matches of selected reads. Step 4: We use the consensus sequence defined by the contig HMM to extract additional reads from our unassigned set. These additional reads are used to extend the HMM in both directions. Step 5: When no more reads can be found to extend the HMM, we determine for each of the reads that were used to construct the HMM the likelihood of being generated by this contig HMM. For those with a likelihood above a certain threshold, we use the HMM consensus to correct errors. Step 6: We remove the reads that are assigned or corrected from the unassigned pool. See ‘Materials and Methods’ section for complete details.
Evaluation using a RNA-Seq dataset of 55 M paired-end 45-bp reads of human T cells
| Method | Original | SEECER | Quake | SEED | Coral | HiTEC | Echo |
|---|---|---|---|---|---|---|---|
| Aligned reads (M) | 31.2 | 32.3(+3.6%) | – | 32.6 (+4.5%) | 31.2 (+0.0%) | 31.6 (+1.3%) | |
| Proper read pairs (M) | 22.1 | 23.4 (+5.8%) | – | 24.0 (+8.7%) | 22.1 (−0.0%) | 22.7 (+2.5%) | |
| Zero error reads (M) | 18.3 | 22 (+20.4%) | – | 23.9 (+30.7%) | 18.3 (0.1%) | 19.6 (+7.2%) | |
| Gain | – | 0.25 | – | 0.38 | 0.00 | 0.024 | |
| Assembly full length | 1749 | 1979 (+13%) | 1358 (−22%) | 2092 (+19.6%) | 1713 (−2.7%) | 1916 (+9.6%) | |
| Assembly 80% length | 13 852 | 14 267 (+3%) | 9686 (−30%) | 14 643 (+5.7%) | 13450 (−2.9%) | 14 273 (+3.0%) | |
| Memory (GB) | – | 27 | 32 | – | 34.3 | 49 | 72 |
| Time (hours) | – | 12.25 | 7.25 | – | 2.42 | 6.33 | 13.7 |
Percentages in brackets denote performance compared with original data.
‘–’ means not applicable.
The evaluation is based on Ensembl 65 annotation.
Figure 2.The distribution of mismatches to the reference of pair-mapped reads (using TopHat alignment) of the 55 M paired-end 45 bp reads of human T cells dataset: only reads that are aligned both before and after error correction are shown.
Figure 3.An illustrating example how Oases benefits from SEECER error correction. Top: Tophat read alignments in the EIF3CL gene for exons 9–13 before (first track) and after (second track) SEECER correction with human data. Colored dots highlight positions with deviations to the reference sequence in the gray read alignments. Bottom: Summary view of the whole region displaying the longest transfrag assembled. Oases assembled the transcript ENST00000380876 (EIF3CL) to 95% of its length with SEECER corrected data (red transfrag), whereas it was only assembled to 45% of its length when using the original data (blue transfrag).
Evaluation using a RNA-Seq dataset of 64 M paired-end 76-bp reads of HeLa cell lines
| Method | Original | SEECER | Quake | Coral |
|---|---|---|---|---|
| Aligned reads (M) | 28.9 | 30.6 (+5.9%) | 29.5 (+2.1%) | |
| Proper pairs (M) | 19.4 | 20.8 (+7.2%) | 20.0 (+2.8%) | |
| Zero error reads (M) | 13.7 | 15.5 (+12.7%) | 14.9 (+8.7%) | |
| Gain | – | 0.11 | 0.07 | |
| Assembly full | 4067 | 4113 (+1.1%) | 4378 (+7.65%) | |
| Assembly 80% | 25 647 | 25 644 (−0.0%) | 26 414 (+2.99%) | |
| Memory (GB) | – | 52 | 32 | 37.3 |
| Time (hours) | – | 20.33 | 1 | 3.5 |
Percentages in brackets denote performance compared with original data.
‘–’ means not applicable.
The evaluation is based on Ensembl 65 annotation.
Evaluation using a RNA-Seq dataset of 145 M paired-end 101-bp reads from the long RNA-seq of IMR90 cell lines from ENCODE Consortium
| Method | Original | SEECER | Quake | Coral |
|---|---|---|---|---|
| Aligned reads (M) | 119.0 | 121.9 (+2.46%) | – | |
| Proper pairs (M) | 81.1 | 83.5 (+2.9%) | – | |
| Zero error reads (M) | 76.2 | 92.4 (+21.3%) | – | |
| Gain | – | 0.32 | – | |
| Assembly full | 13 148 | 14 968 (+13.84%) | – | |
| Assembly 80% | 61 522 | 61 178 (−0.6%) | – | |
| Memory (GB) | – | 113 | 60 | >130 |
| Time (hours) | – | 40.25 | 3 | – |
Percentages in brackets denote performance compared with original data.
‘–’ means not applicable.
The evaluation is based on Ensembl 65 annotation.
Figure 4.De novo assembly of sea cucumber data. (A) A living sea cucumber Parastichopus parvimensis. (B) Distribution of BlastX matches of sea cucumber transfrags to known sea urchin peptides. The percentages represent the subset of sea urchin peptides that we have significantly matched to at least one transfrag in time point 1 and/or time point 2 and those that were not matched to any transfrag. (C) Ethidium bromide–stained image of PCR products amplified from sea cucumber cDNA. Primer pairs were designed against 14 assembled transfrags, seven of which matched to known peptides of RNAs (top row), and seven other that had no match in the database (bottom row). Standard ladders of 100-bp size are in the first and last lanes. Each lane is followed by the appropriate no template control to demonstrate that amplification was not due to non-specific contamination.