Matteo Comin, Andrea Leoni, Michele Schimd.
Abstract
BACKGROUND: The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that now challenges the storage and data processing capacities of modern computer systems. In this context, an important aspect is the reduction of data complexity by collapsing redundant reads into a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mer counts, have been used to cluster reads. Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%).
Keywords: Alignment-free measures; Reads clustering; Reads quality values
Year: 2015 PMID: 25691913 PMCID: PMC4331138 DOI: 10.1186/s13015-014-0029-x
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
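The abstract refers to alignment-free measures based on k-mer counts, and the result tables list KL and symmetrized KL among the measures compared. As a rough illustration only, the sketch below builds a k-mer frequency profile per read and compares two profiles with a symmetrized Kullback-Leibler divergence. The function names, the pseudocount smoothing, the choice of summing the two directed divergences, and the toy reads are all assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import log


def kmer_profile(read, k, alphabet="ACGT", pseudocount=1.0):
    """Smoothed k-mer frequency profile of one read.

    The pseudocount is an assumption of this sketch: it keeps every
    frequency positive so the logarithms below are always defined.
    """
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] + pseudocount for w in kmers)
    return {w: (counts[w] + pseudocount) / total for w in kmers}


def kl(p, q):
    """Kullback-Leibler divergence between two profiles over the same k-mers."""
    return sum(p[w] * log(p[w] / q[w]) for w in p)


def symmetric_kl(p, q):
    """Symmetrized KL, taken here as the sum of both directions (one common convention)."""
    return kl(p, q) + kl(q, p)


# Toy reads, hypothetical and not drawn from the paper's datasets.
read_a = "ACGTACGTACGTACGT"
read_b = "ACGTACGAACGTACGT"   # read_a with a single substitution
read_c = "GGGGCCCCGGGGCCCC"   # very different composition

p_a, p_b, p_c = (kmer_profile(r, k=2) for r in (read_a, read_b, read_c))

# Reads with similar k-mer composition should be closer than dissimilar ones.
print(symmetric_kl(p_a, p_b) < symmetric_kl(p_a, p_c))
```

A clustering step would then group reads whose profiles are close under such a measure; how quality values enter the counts is not covered by this sketch.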
Example of quality value redistribution of the word TGACCA
| Original Word | T | G | A | C | C | A |
|---|---|---|---|---|---|---|
| Accuracy | X | X | 70% | X | X | X |
| Possible Word 1 | T | G | C | C | C | A |
| Accuracy | X | X | 11.25% | X | X | X |
| Possible Word 2 | T | G | G | C | C | A |
| Accuracy | X | X | 11.25% | X | X | X |
| Possible Word 3 | T | G | T | C | C | A |
| Accuracy | X | X | 7.5% | X | X | X |
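In the table above, the third base has accuracy 70%, and the remaining 30% is split across the three alternative bases as 11.25%, 11.25% and 7.5%, that is, in the ratio 3:3:2. The sketch below reproduces that arithmetic. The function name and the `weights` argument are hypothetical, and the weights are chosen only to match the table; this excerpt does not state where they come from (base frequencies would be one possibility).

```python
def redistribute_accuracy(observed, accuracy, weights):
    """Split the error mass (1 - accuracy) over the non-observed bases.

    `weights` maps each alternative base to a relative weight; the values
    used below are an assumption inferred from the table, not a stated rule.
    """
    others = {b: w for b, w in weights.items() if b != observed}
    total = sum(others.values())
    probs = {observed: accuracy}
    for base, w in others.items():
        probs[base] = (1.0 - accuracy) * w / total
    return probs


# Third base of TGACCA: observed 'A' with accuracy 0.70.
probs = redistribute_accuracy("A", 0.70, {"C": 3, "G": 3, "T": 2})
for base in "ACGT":
    print(base, round(probs[base] * 100, 2))
```

Up to floating-point rounding, the printed percentages match the 70 / 11.25 / 11.25 / 7.5 split shown for the original word and the three possible words.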
Recall rates of clustering of mRNA simulated reads (10000 reads of length 200) for different measures, error rates, number of clusters and parameter k
| Measure | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | 0.813 | 0.810 | 0.801 | | 0.819 | 0.814 | 0.794 |
| | | | | | | | | |
| | | | | | | | | 0.807 |
| | 0.809 | 0.806 | 0.805 | 0.802 | 0.809 | 0.807 | 0.805 | 0.802 |
| | 0.809 | 0.806 | 0.805 | 0.802 | 0.809 | 0.807 | 0.805 | 0.802 |
| | 0.811 | 0.807 | 0.806 | 0.801 | 0.810 | 0.806 | 0.805 | 0.801 |
| KL | 0.812 | 0.809 | 0.807 | 0.802 | 0.812 | 0.809 | 0.807 | 0.802 |
| Symm, KL | 0.812 | 0.809 | 0.807 | 0.802 | 0.812 | 0.808 | 0.806 | 0.802 |
| | 0.811 | 0.807 | 0.806 | 0.801 | 0.809 | 0.806 | 0.805 | 0.800 |

| Measure | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | 0.689 | 0.683 | 0.662 | | 0.707 | 0.697 | 0.668 |
| | | | | 0.689 | | 0.711 | | 0.679 |
| | | | | | | | 0.704 | |
| | 0.653 | 0.646 | 0.646 | 0.638 | 0.668 | 0.662 | 0.655 | 0.646 |
| | 0.653 | 0.646 | 0.645 | 0.637 | 0.668 | 0.662 | 0.655 | 0.644 |
| | 0.682 | 0.673 | 0.671 | 0.657 | 0.685 | 0.677 | 0.674 | 0.663 |
| KL | 0.694 | 0.687 | 0.685 | 0.672 | 0.696 | 0.689 | 0.687 | 0.675 |
| Symm, KL | 0.693 | 0.686 | 0.684 | 0.669 | 0.695 | 0.688 | 0.685 | 0.673 |
| | 0.675 | 0.668 | 0.662 | 0.654 | 0.675 | 0.671 | 0.665 | 0.655 |

| Measure | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | 0.613 | 0.606 | 0.574 | 0.627 | 0.616 | 0.591 | 0.551 |
| | 0.622 | 0.621 | 0.618 | 0.602 | | | 0.602 | 0.572 |
| | 0.622 | | | | | | | |
| | 0.580 | 0.563 | 0.566 | 0.535 | 0.582 | 0.571 | 0.572 | 0.555 |
| | 0.580 | 0.560 | 0.565 | 0.533 | 0.582 | 0.570 | 0.570 | 0.555 |
| | 0.554 | 0.551 | 0.547 | 0.540 | 0.568 | 0.565 | 0.553 | 0.543 |
| KL | 0.555 | 0.548 | 0.545 | 0.536 | 0.566 | 0.558 | 0.547 | 0.537 |
| Symm, KL | 0.556 | 0.549 | 0.546 | 0.538 | 0.562 | 0.554 | 0.547 | 0.539 |
| | 0.553 | 0.547 | 0.547 | 0.538 | 0.556 | 0.549 | 0.548 | 0.540 |

| Measure | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 0.553 | 0.539 | 0.532 | 0.500 | 0.560 | 0.534 | 0.512 | 0.462 |
| | | | | 0.532 | 0.560 | 0.544 | 0.524 | |
| | 0.553 | 0.544 | 0.550 | | | | | 0.487 |
| | 0.483 | 0.475 | 0.470 | 0.463 | 0.509 | 0.494 | 0.485 | 0.470 |
| | 0.483 | 0.475 | 0.470 | 0.461 | 0.509 | 0.494 | 0.482 | 0.470 |
| | 0.478 | 0.472 | 0.465 | 0.453 | 0.500 | 0.495 | 0.486 | 0.465 |
| KL | 0.498 | 0.488 | 0.484 | 0.468 | 0.507 | 0.501 | 0.492 | 0.476 |
| Symm, KL | 0.498 | 0.488 | 0.484 | 0.468 | 0.507 | 0.500 | 0.491 | 0.474 |
| | 0.470 | 0.464 | 0.457 | 0.449 | 0.488 | 0.482 | 0.476 | 0.455 |
Best results are in bold.
Recall rates of clustering of mRNA simulated reads (reads of length 200, k=2 and 2 clusters) for different measures, different types of errors and number of reads
| Measure | | | | |
|---|---|---|---|---|
| | 0.86445887 | 0.83981814 | 0.79073482 | 0.80640363 |
| | 0.86441326 | | | |
| | 0.86441326 | 0.86375045 | 0.85782736 | 0.85818320 |
| | 0.86723257 | 0.85428665 | 0.84756397 | 0.85088665 |
| | 0.86723257 | 0.85613671 | 0.85305013 | 0.85504185 |
| | 0.86114263 | 0.85504302 | 0.85105192 | 0.85118905 |
| | 0.86258900 | 0.85247832 | 0.84995366 | 0.85110380 |
| KL | | 0.85916040 | 0.85026923 | 0.85475077 |
| Symm, KL | 0.86712365 | 0.85695963 | 0.84730941 | 0.85418699 |

| Measure | | | | |
|---|---|---|---|---|
| | 0.86594479 | 0.83906192 | 0.78782226 | 0.80686962 |
| | 0.86599548 | | | |
| | 0.86600096 | 0.86099042 | 0.85469494 | 0.85441545 |
| | 0.86790093 | 0.85433807 | 0.84230775 | 0.84839892 |
| | 0.86790093 | 0.85770704 | 0.85062824 | 0.85104321 |
| | 0.86216987 | 0.85477261 | 0.84904670 | 0.85024936 |
| | 0.86058645 | 0.85312555 | 0.84767965 | 0.85043005 |
| KL | | 0.85667036 | 0.85002398 | 0.85088847 |
| Symm, KL | 0.86919513 | 0.85488101 | 0.84896184 | 0.84950072 |

| Measure | | | | |
|---|---|---|---|---|
| | 0.86307749 | 0.83460148 | 0.78680210 | 0.81273009 |
| | 0.86306541 | | | |
| | 0.86306541 | 0.86129411 | 0.85330127 | 0.85111236 |
| | 0.86305839 | 0.85432677 | 0.84295441 | 0.85043303 |
| | 0.86306276 | 0.85799349 | 0.84868427 | 0.85289041 |
| | 0.86125521 | 0.85265296 | 0.84487856 | 0.84694314 |
| | 0.85971734 | 0.85283644 | 0.84325115 | 0.84899721 |
| KL | | 0.85621086 | 0.84559916 | 0.85108524 |
| Symm, KL | 0.86827273 | 0.85433859 | 0.84321338 | 0.85010800 |

| Measure | | | | |
|---|---|---|---|---|
| | 0.86131992 | 0.83027426 | 0.79355066 | 0.81057286 |
| | 0.86134064 | | | |
| | 0.86128705 | 0.85978356 | 0.85252267 | 0.85262847 |
| | 0.86477422 | 0.85334750 | 0.84374378 | 0.84947286 |
| | 0.86477422 | 0.85637033 | 0.84850933 | 0.85162186 |
| | 0.86370337 | 0.85297951 | 0.84525794 | 0.84901375 |
| | 0.86242736 | 0.85271505 | 0.84384526 | 0.84832590 |
| KL | | 0.85488377 | 0.84531374 | 0.85014251 |
| Symm, KL | 0.86580244 | 0.85353783 | 0.84308462 | 0.84878825 |

| Measure | | | | |
|---|---|---|---|---|
| | 0.86179886 | 0.83217374 | 0.79345107 | 0.80917623 |
| | 0.86166330 | | | |
| | 0.86166519 | 0.85559541 | 0.85133437 | 0.85345570 |
| | 0.86317541 | 0.85224352 | 0.84168072 | 0.84837070 |
| | 0.86317541 | 0.85543020 | 0.84770910 | 0.85121979 |
| | 0.86262435 | 0.85243814 | 0.84436053 | 0.84898583 |
| | 0.86122271 | 0.85167640 | 0.84308556 | 0.84801094 |
| KL | | 0.85473650 | 0.84431637 | 0.84985690 |
| Symm, KL | 0.86488656 | 0.85297623 | 0.84262083 | 0.84815285 |
Best results are in bold.
Recall rates for clustering of mRNA simulated reads (10000 reads, k=3, 4 clusters) for different measures, error rates and read length
| Measure | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | 0.667 | 0.658 | 0.625 | | 0.700 | 0.697 | 0.672 |
| | | | | | | | 0.710 | 0.693 |
| | | 0.671 | | | | 0.711 | | |
| | 0.616 | 0.610 | 0.608 | 0.601 | 0.643 | 0.636 | 0.632 | 0.623 |
| | 0.616 | 0.610 | 0.607 | 0.602 | 0.643 | 0.635 | 0.631 | 0.622 |
| | 0.610 | 0.600 | 0.602 | 0.581 | 0.638 | 0.630 | 0.624 | 0.614 |
| KL | 0.617 | 0.604 | 0.601 | 0.577 | 0.649 | 0.632 | 0.628 | 0.618 |
| Symm, KL | 0.613 | 0.603 | 0.599 | 0.576 | 0.647 | 0.632 | 0.627 | 0.616 |
| | 0.601 | 0.593 | 0.588 | 0.575 | 0.626 | 0.618 | 0.615 | 0.604 |
Best results are in bold.
Comparison of assembly with and without clustering preprocessing (k=3, 2 clusters)
| Measure | Mapped contigs | N50 | Number of contigs | Genome coverage |
|---|---|---|---|---|
| No Clustering | 93.55% | 112 | 22823 | 0.828 |
| | 94.13% | | | |
| | 93.97% | 138 | 28701 | 0.914 |
| | 94.24% | 135 | 28297 | 0.904 |
| KL | 94.19% | 135 | 28171 | 0.903 |
| Symm, KL | 94.27% | 134 | 27999 | 0.902 |
| | | 134 | 28019 | 0.903 |
The assembly with Velvet is evaluated in terms of mapped contigs, N50, number of contigs and genome coverage. The dataset used is SRR017901 (23.5M bases, 10x coverage) that contains reads of Zymomonas mobilis. Best results are in bold.
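Since N50 is one of the evaluation metrics named above, a minimal sketch of how it is usually computed may help: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly length. The function name and the example lengths are hypothetical; this is the standard definition, not code from the paper or from Velvet.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty input: no contigs


# Hypothetical contig lengths, for illustration only.
example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example))
```

For these lengths the total is 54, and the three longest contigs (10, 9, 8) already sum to 27, so the function returns 8.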
Comparison of assembly with and without clustering preprocessing (k=3, 3 clusters)
| Measure | Mapped contigs | N50 | Number of contigs | Genome coverage |
|---|---|---|---|---|
| No Clustering | 96.97% | 122 | 16724 | 0.729 |
| | | 175 | | |
| | 98.38% | 174 | 40156 | |
| | 98.16% | 175 | 36798 | 0.986 |
| KL | 98.28% | 178 | 37717 | 0.990 |
| Symm, KL | 98.30% | 182 | 37217 | 0.990 |
| | 98.22% | | 34866 | 0.987 |
The assembly with Velvet is evaluated in terms of mapped contigs, N50, number of contigs and genome coverage. The dataset used is SRR023794 (117M bases) that contains reads of Helicobacter pylori. Best results are in bold.
Metagenomic reads classification of ( ), ( ), ( ) and ( )
| Measure | | |
|---|---|---|
| | 0.79782297 | 0.79129356 |
| | 0.79775189 | 0.76920676 |
| | | |
| | 0.64335292 | 0.73455525 |
| KL | 0.78663484 | 0.80525234 |
| Symm, KL | 0.77196713 | 0.79216786 |
| | 0.73917085 | 0.77062424 |
The recall rates for different measures with k = 4, for 3 and 4 clusters. Best results are in bold.