Martin Hunt, Jeremy Swann, Bede Constantinides, Philip W Fowler, Zamin Iqbal.
Abstract
SUMMARY: Viral sequence data from clinical samples frequently contain contaminating human reads, which must be removed prior to sharing for legal and ethical reasons. To enable host read removal for SARS-CoV-2 sequencing data on low-specification laptops, we developed ReadItAndKeep, a fast, lightweight tool for Illumina and nanopore data that keeps only reads matching the SARS-CoV-2 genome. Peak RAM usage is typically below 10 MB, and runtime is less than one minute. We show that by excluding the polyA tail from the viral reference, ReadItAndKeep prevents bleed-through of human reads, whereas mapping to the human genome lets some reads escape. We believe our test approach (including all possible reads from the human genome, human samples from each of the 26 populations in the 1000 Genomes data, and a diverse set of SARS-CoV-2 genomes) will also be useful for others.
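The effect of the polyA tail can be illustrated with a toy sketch. This is not ReadItAndKeep's implementation: exact k-mer matching stands in for read mapping, reverse-complement reads are ignored, and every name, sequence, and threshold below (`strip_poly_a`, `keep_matching_reads`, `k`, `min_fraction`) is hypothetical. It only shows why a reference that ends in a long run of A bases can attract unrelated polyA-containing reads, and why trimming that run removes the problem.

```python
def strip_poly_a(reference: str) -> str:
    """Remove the trailing run of A bases (the polyA tail) from a reference."""
    return reference.rstrip("A")


def kmer_list(seq: str, k: int) -> list[str]:
    """Return every k-mer of seq, one per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def keep_matching_reads(reads, reference, k=8, min_fraction=0.5):
    """Keep reads whose k-mers are mostly found in the reference.

    Assumption: exact k-mer lookup is a crude stand-in for alignment.
    """
    ref_kmers = set(kmer_list(reference, k))
    kept = []
    for read in reads:
        read_kmers = kmer_list(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to compare
        hits = sum(km in ref_kmers for km in read_kmers)
        if hits / len(read_kmers) >= min_fraction:
            kept.append(read)
    return kept


# Hypothetical toy data, not real genome sequence.
toy_reference = "GATTACAGGCTTCGATCCGTAGCTAGGCTA" + "A" * 20
viral_read = "CAGGCTTCGATCCGTAGC"       # taken from the toy reference body
poly_a_read = "TTGACC" + "A" * 18       # unrelated read ending in polyA

reads = [viral_read, poly_a_read]
kept_with_tail = keep_matching_reads(reads, toy_reference)
kept_without_tail = keep_matching_reads(reads, strip_poly_a(toy_reference))

print(kept_with_tail)
print(kept_without_tail)
```

In this toy setting the polyA read passes the filter only when the reference still carries its tail; the viral read passes in both cases.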
Year: 2022 PMID: 35551365 PMCID: PMC9191204 DOI: 10.1093/bioinformatics/btac311
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Summary of testing ReadItAndKeep and Dehumanizer on human and SARS-CoV-2 reads
| Dataset | Samples | Total reads | Reads retained, Dehumanizer (%) | Reads retained, ReadItAndKeep (%) | Mean run time, Dehumanizer | Mean run time, ReadItAndKeep | Peak RAM, Dehumanizer (MB) | Peak RAM, ReadItAndKeep (MB) |
|---|---|---|---|---|---|---|---|---|
| Human 75-mers | 1 | 3 182 668 318 | 0.763 | 0.0 | 5082 min | 83 min | 30 378 | 6 |
| Human 150-mers | 1 | 3 181 197 226 | 0.034 | 0.0 | 7332 min | 149 min | 30 376 | 6 |
| Human Illumina | 27 | 20 772 464 024 | 0.903 | 0.0 | 2860 min | 59 min | 11 831 | 8 |
| Human ONT | 1 | 15 666 887 | 10.283 | 0.0 | 9591 min | 103 min | 10 839 | 243 |
| SARS-CoV-2 75-mers | 1 | 29 796 | 100.0 | 100.0 | 48.0 s | 0.3 s | 11 305 | 7 |
| SARS-CoV-2 150-mers | 1 | 29 721 | 100.0 | 100.0 | 48.0 s | 0.3 s | 11 305 | 7 |
| SARS-CoV-2 Illumina | 246 | 610 451 014 | 99.994 | 99.894 | 102.4 s | 49.6 s | 11 330 | 9 |
| SARS-CoV-2 ONT | 189 | 30 422 462 | 100.0 | 99.992 | 52.1 s | 14.3 s | 8387 | 9 |
Note: Percent reads retained is calculated by summing reads across all samples in the dataset. Mean run time is the mean wall clock time across all samples in the dataset.
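The two summary statistics in the note can be sketched as follows. The per-sample records and field names (`reads_in`, `reads_kept`, `seconds`) are hypothetical and are not taken from the study data. The sketch also shows that the pooled percentage, which weights each sample by its read count, can differ substantially from a plain mean of per-sample percentages.

```python
from statistics import mean


def pooled_percent_retained(samples):
    """Percent retained after summing reads over all samples in a dataset."""
    total_in = sum(s["reads_in"] for s in samples)
    total_kept = sum(s["reads_kept"] for s in samples)
    return 100.0 * total_kept / total_in


def mean_run_time(samples):
    """Mean wall clock time per sample, in seconds."""
    return mean(s["seconds"] for s in samples)


# Hypothetical per-sample results for one dataset.
samples = [
    {"reads_in": 1_000_000, "reads_kept": 999_000, "seconds": 60.0},
    {"reads_in": 1_000, "reads_kept": 500, "seconds": 2.0},
]

pooled = pooled_percent_retained(samples)
# For contrast only: an unweighted mean of per-sample percentages.
per_sample_mean = mean(100.0 * s["reads_kept"] / s["reads_in"] for s in samples)
avg_seconds = mean_run_time(samples)

print(f"pooled: {pooled:.3f}%  per-sample mean: {per_sample_mean:.3f}%")
print(f"mean run time: {avg_seconds:.1f} s")
```

With these made-up numbers the large sample dominates the pooled figure, while the small, poorly retained sample pulls the unweighted mean down.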