Antonio Sérgio Cruz Gaia, Pablo Henrique Caracciolo Gomes de Sá, Mônica Silva de Oliveira, Adonney Allan de Oliveira Veras.
Abstract
Next-generation sequencing (NGS) platforms are a major source of short reads, producing millions per sample. NGS has been used in a wide range of analyses, such as determining genome sequences, analyzing evolutionary processes, measuring gene expression, and resolving metagenomic analyses. The quality of NGS data usually affects the final conclusions of a study, so quality assessment is generally considered the first step in data analysis, ensuring that only reliable reads are used in further studies. A major issue in NGS platforms is the presence of duplicated reads (redundancy), which are usually introduced during library sequencing. Redundant reads can seriously affect research applications because they complicate subsequent analyses (e.g., de novo genome assembly). Herein, we present NGSReadsTreatment, a computational tool for removing duplicated reads from paired-end or single-end datasets. NGSReadsTreatment can handle reads from any platform, with the same or different sequence lengths. Using the probabilistic Cuckoo Filter structure, redundant reads are identified and removed by comparing the reads with one another; thus, nothing is required beyond the set of reads itself. NGSReadsTreatment was compared with other redundancy removal tools on different sets of reads. The results demonstrated that NGSReadsTreatment was better than the other tools in both the amount of redundancy removed and the use of computational memory for all analyses performed. Available at https://sourceforge.net/projects/ngsreadstreatment/.
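To illustrate the idea of filtering reads against a Cuckoo Filter, the following is a minimal Python sketch and not the NGSReadsTreatment implementation. The class `CuckooFilter`, the function `remove_duplicates`, the 16-bit fingerprint size, the bucket parameters, and the choice of BLAKE2b as the hash are all assumptions made for this example. A read is kept only if the filter has not already seen its sequence; because the structure is probabilistic, a rare false positive could discard a read that is not truly a duplicate.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter: approximate set membership via fingerprints."""

    def __init__(self, num_buckets=1 << 16, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range.
        if num_buckets <= 0 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _alt_index(self, index: int, fingerprint: int) -> int:
        # Partial-key cuckoo hashing: the other bucket depends only on the
        # current bucket and the fingerprint, so entries can be relocated.
        return (index ^ self._hash(fingerprint.to_bytes(2, "big"))) % self.num_buckets

    def _locate(self, item: str):
        h = self._hash(item.encode())
        fingerprint = (h & 0xFFFF) or 1  # 16-bit fingerprint, never zero
        i1 = (h >> 16) % self.num_buckets
        return fingerprint, i1, self._alt_index(i1, fingerprint)

    def contains(self, item: str) -> bool:
        fingerprint, i1, i2 = self._locate(item)
        return fingerprint in self.buckets[i1] or fingerprint in self.buckets[i2]

    def insert(self, item: str) -> bool:
        fingerprint, i1, i2 = self._locate(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fingerprint)
                return True
        # Both buckets are full: evict a resident fingerprint and move it
        # to its alternate bucket, repeating up to max_kicks times.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fingerprint, self.buckets[i][slot] = self.buckets[i][slot], fingerprint
            i = self._alt_index(i, fingerprint)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fingerprint)
                return True
        return False  # the filter is treated as full


def remove_duplicates(reads, cuckoo=None):
    """Return reads in input order, skipping sequences the filter has seen."""
    if cuckoo is None:
        cuckoo = CuckooFilter()
    unique = []
    for read in reads:
        if cuckoo.contains(read):
            continue  # probable duplicate (false positives are possible)
        if not cuckoo.insert(read):
            raise RuntimeError("cuckoo filter is full; use more buckets")
        unique.append(read)
    return unique


if __name__ == "__main__":
    example_reads = ["ACGT", "TTGA", "ACGT", "GGCC", "TTGA"]  # toy sequences
    print(remove_duplicates(example_reads))
```

For paired-end data, one option under the same assumptions is to pass a single string per pair (for example, both mates joined by a separator), so that a pair counts as a duplicate only when both mates repeat. Storing short fingerprints rather than full sequences is what keeps the memory footprint of such a filter largely independent of read length.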
Year: 2019 PMID: 31406180 PMCID: PMC6690869 DOI: 10.1038/s41598-019-48242-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Percentage of read redundancy removal per tool for each organism. NP - not processed owing to errors.
| Organism (SRA accession) | FastUniq 1.1 | ParDRe 2.2.5 | MarDRe 1.3 | CD-HIT-DUP 4.6.8 | Clumpify (bbmap) | NGSReadsTreatment |
|---|---|---|---|---|---|---|
| SRR2014554 | NP | NP | NP | NP | NP | 0.29% |
| ERR007646 | 50.55% | 50.55% | 50.55% | 50.55% | 50.65% | 50.50% |
| SRR2000272 | 0.82% | 0.81% | NP | 0.81% | 0.95% | 1.91% |
| SRR1424625 | 0% | 0% | 0% | 0% | 0.20% | 0.94% |
| SRR933487 | 0.72% | 0.72% | 0.72% | 0.72% | 1.11% | 1.90% |
| SRR6479489 | 0.14% | 0.14% | 0.13% | 0.14% | 0.19% | 1.21% |
| SRR6479482 | 0.14% | 0.14% | 0.14% | 0.14% | 0.18% | 1.17% |
| SRR974839 | 48.74% | 48.74% | 48.74% | 48.74% | 49.07% | 49.08% |
| SRR1144800 | 0.06% | 0.06% | 0.06% | 0.06% | 0.10% | 0.94% |
| SRR7587111 | NP | 0.37% | 0.37% | 0.37% | 0.43% | 0.87% |
| SRR7819959 | NP | 0.74% | 0.74% | NP | 0.80% | 1.36% |
| ERR2375157 | NP | 0.07% | 0.07% | NP | 0.08% | 1.55% |
| SRR6799098 | NP | 2.11% | 2.11% | 2.1% | 2.22% | 2.22% |
| SRR7905974 | NP | 0% | NP | NP | 0% | 0.13% |
| SRR7739756 | NP | 0% | NP | NP | 0% | 0.08% |
| ERR2162181 | NP | 0% | 0% | 0% | NP | 0.33% |
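The percentages in the table above can be read as the share of input reads that a tool discarded. A minimal sketch of that arithmetic follows; the function name `redundancy_percent` and the read counts in the usage line are hypothetical, and the exact counting convention used by the authors (for example, reads versus read pairs) is an assumption here.

```python
def redundancy_percent(total_reads: int, reads_kept: int) -> float:
    """Percentage of reads removed as redundant (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= reads_kept <= total_reads:
        raise ValueError("reads_kept must be between 0 and total_reads")
    return 100.0 * (total_reads - reads_kept) / total_reads


if __name__ == "__main__":
    # Hypothetical counts: 1000 input reads, 981 kept after deduplication.
    print(f"{redundancy_percent(1000, 981):.2f}%")
```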
Memory used by each tool, in megabytes (MB). NP - not processed owing to errors.
| Organism (SRA accession) | FastUniq 1.1 | ParDRe 2.2.5 | MarDRe 1.3 | CD-HIT-DUP 4.6.8 | Clumpify (bbmap) | NGSReadsTreatment |
|---|---|---|---|---|---|---|
| SRR2014554 | NP | NP | NP | NP | NP | 549 |
| ERR007646 | 3987 | 5387 | 1653 | 5393 | 870 | 543 |
| SRR2000272 | 1722 | 2278 | NP | 3076 | 423 | 537 |
| SRR1424625 | 2571 | 3586 | 1676 | 4097 | 455 | 539 |
| SRR933487 | 1449 | 1950 | 1411 | 2063 | 2215 | 538 |
| SRR6479489 | 2629 | 3629 | 1652 | 4313 | 3454 | 538 |
| SRR6479482 | 2783 | 3850 | 1647 | 4501 | 3616 | 538 |
| SRR974839 | 2725 | 3825 | 1634 | 4499 | 2625 | 540 |
| SRR1144800 | 2633 | 3730 | 1653 | 4351 | 3250 | 540 |
| SRR7587111 | NP | 888 | 1118 | 1561 | 769 | 533 |
| SRR7819959 | NP | 3197 | 1665 | NP | 659 | 537 |
| ERR2375157 | NP | 1989 | 1394 | NP | 744 | 537 |
| SRR6799098 | NP | 247 | 913 | 481 | 373 | 531 |
| SRR7905974 | NP | 3796 | NP | NP | 961 | 526 |
| SRR7739756 | NP | 1704 | NP | NP | 700 | 532 |
| ERR2162181 | NP | 625 | 947 | 779 | NP | 535 |
Percentage of read redundancy removal per tool for each simulated dataset. NP - not processed owing to errors.
| Organism | FastUniq 1.1 | ParDRe 2.2.5 | MarDRe 1.3 | CD-HIT-DUP 4.6.8 | Clumpify (bbmap) | NGSReadsTreatment |
|---|---|---|---|---|---|---|
| | 0.08% | NP | 0.08% | 0.08% | 0.20% | 0.77% |
| | 0.08% | NP | 0.08% | 0.08% | 0.20% | 0.05% |
| | 0.09% | NP | 0.09% | 0.09% | 0.23% | 1.15% |
| | 0.08% | NP | 0.08% | 0.08% | 0.22% | 0.10% |
| | 0% | NP | NP | 0% | 0% | 1.16% |
| | 0% | NP | NP | 0% | 0% | 0.08% |
| | 0% | NP | NP | 0% | 0% | 1.43% |
| | 0% | NP | NP | 0% | 0% | 0.13% |
Memory used by each tool, in megabytes (MB), for each simulated dataset. NP - not processed owing to errors.
| Organism | FastUniq 1.1 | ParDRe 2.2.5 | MarDRe 1.3 | CD-HIT-DUP 4.6.8 | Clumpify (bbmap) | NGSReadsTreatment |
|---|---|---|---|---|---|---|
| | 1272 | NP | 1474 | 2173 | 771 | 537 |
| | 1278 | NP | 1533 | 2153 | 538 | 538 |
| | 1583 | NP | 1632 | 2660 | 558 | 537 |
| | 832 | NP | 1250 | 1363 | 569 | 536 |
| | 143 | NP | NP | 477 | 337 | 534 |
| | 143 | NP | NP | 476 | 262 | 534 |
| | 177 | NP | NP | 598 | 400 | 533 |
| | 96 | NP | NP | 321 | 264 | 536 |
Figure 1. Evaluation of memory usage for each computational tool in the processing of simulated datasets.
Organisms and SRA accession numbers used to validate NGSReadsTreatment.
| Organism | SRA accession number | File size per dataset | Total reads per dataset | Library type | Platform |
|---|---|---|---|---|---|
| | SRR2014554 | 8192 MB | 24248885 | Paired | Illumina HiSeq 2000 |
| | ERR007646 | 2406 MB | 14110696 | Paired | Illumina Genome Analyzer |
| | SRR2000272 | 1350 MB | 2990758 | Paired | Illumina MiSeq |
| | SRR1424625 | 1682 MB | 6886668 | Paired | Illumina HiSeq 2000 |
| | SRR933487 | 1070 MB | 3214312 | Paired | Illumina Genome Analyzer IIx |
| | SRR6479489 | 2048 MB | 5641334 | Paired | Illumina HiSeq 2500 |
| | SRR6479482 | 2168 MB | 5971022 | Paired | Illumina HiSeq 2500 |
| | SRR974839 | 1936 MB | 7279254 | Paired | Illumina HiSeq 2000 |
| | SRR1144800 | 1884 MB | 7033428 | Paired | Illumina HiSeq 2000 |
| | SRR7587111 | 588 MB | 670813 | Paired | 454 Titanium |
| | SRR7819959 | 1984 MB | 3207713 | Single | Ion Torrent |
| | ERR2375157 | 1201 MB | 2106268 | Single | Ion Torrent |
| | SRR6799098 | 157 MB | 160403 | Single | 454 Junior |
| | SRR7905974 | 2990 MB | 163468 | Single | PacBio |
| | SRR7739756 | 1336 MB | 86389 | Single | Oxford Nanopore MinION |
| | ERR2162181 | 246 MB | 1146696 | Single | SOLiD 5500 |
Generation of simulated data with different coverage.
| Organism | Coverage | Dataset Name |
|---|---|---|
| | 300x | HS25MicoKorea1168P_300 |
| | 200x | HS25MicoKorea1168P_200 |
| | 100x | HS25MicoKorea1168P_100 |
| | 300x | HS25MicoKZN_4207_300 |
| | 200x | HS25MicoKZN_4207_200 |
| | 100x | HS25MicoKZN_4207_100 |
| | 300x | HS25EcoliO103_H2_300 |
| | 200x | HS25EcoliO103_H2_200 |
| | 100x | HS25EcoliO103_H2_100 |