Sasha K Ames, Shea N Gardner, Jose Manuel Marti, Tom R Slezak, Maya B Gokhale, Jonathan E Allen.
Abstract
Identifying causative disease agents in human patients from shotgun metagenomic sequencing (SMS) presents a powerful tool to apply when other targeted diagnostics fail. Numerous technical challenges remain, however, before SMS can move beyond the role of research tool. Accurately separating the known and unknown organism content remains difficult, particularly when SMS is applied as a last resort. The true amount of human DNA that remains in a sample after screening against the human reference genome and filtering nonbiological components left from library preparation has previously been underreported. In this study, we create the most comprehensive collection of microbial and reference-free human genetic variation available in a database optimized for efficient metagenomic search by extracting sequences from GenBank and the 1000 Genomes Project. The results reveal new human sequences found in individual Human Microbiome Project (HMP) samples. Individual samples contain up to 95% human sequence, and 4% of the individual HMP samples contain 10% or more human reads. Left unidentified, human reads can complicate and slow further analysis, lead to inaccurately labeled microbial taxa, and ultimately raise privacy concerns as more human genome data are collected.
Year: 2015 PMID: 25926546 PMCID: PMC4484388 DOI: 10.1101/gr.184879.114
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Summary of examined LMAT databases
Number of new human k-mers (k = 20) added to searchable database and total k-mers in the database
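The count of new k-mers described in this table title can be illustrated with a minimal sketch. It assumes the searchable database can be modeled as a plain Python set of 20-mers; the function names `kmers` and `count_new_kmers` are hypothetical, and the sketch ignores reverse complements and the indexing that the actual LMAT database uses.

```python
def kmers(seq, k=20):
    """Yield every length-k substring of seq made up only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            yield kmer


def count_new_kmers(database, sequences, k=20):
    """Add k-mers from sequences to database (a set); return how many were new."""
    new = 0
    for seq in sequences:
        for kmer in kmers(seq, k):
            if kmer not in database:
                database.add(kmer)
                new += 1
    return new
```

After a call, `len(database)` gives the total number of k-mers and the return value gives the number newly added, which mirrors the two quantities named in the table title.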
Figure 1. Average percentage of reads identified as human sequence in HMP samples, using LMAT-Ref, LMAT-GenBank, or LMAT-Grand by body site.
Figure 2. Histogram showing how often different amounts of human reads are found across the collection of sequencer runs. The x-axis displays human read abundance in sequencer runs in bins of 2%. The y-axis shows, on a log scale, the percentage of sequencer runs with the amount of human reads specified on the x-axis. The highest fraction of human reads in a sequencer run is 94%, found in a single run.
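The binning described in the Figure 2 caption can be sketched as follows. This is an illustration only: the function name `human_read_histogram` is hypothetical, the input is assumed to be one human-read percentage per sequencer run, and the log scaling of the y-axis is left to the plotting step.

```python
from collections import Counter


def human_read_histogram(human_pct_per_run, bin_width=2):
    """Return {bin_start: percent_of_runs} for bins [start, start + bin_width)."""
    if not human_pct_per_run:
        return {}
    # A run at exactly 100% is folded into the last bin.
    top_bin = 100 - bin_width
    counts = Counter(
        min(int(pct // bin_width) * bin_width, top_bin)
        for pct in human_pct_per_run
    )
    total = len(human_pct_per_run)
    return {start: 100.0 * n / total for start, n in sorted(counts.items())}
```

For example, four made-up runs with 0.5%, 1.0%, 3.0%, and 94.0% human reads would put half of the runs in the 0 to 2% bin and a quarter each in the 2 to 4% and 94 to 96% bins.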
Figure 3. Sensitive BLAST-based assignment of reads from an HMP sample reported to have a high abundance of newly labeled human reads. The left panel shows the distribution of taxonomic assignments after reads were binned into clusters of similar reads. The right panel shows the raw abundance based on read counts for each read assignment. Taxonomic assignments with a 0% abundance label reflect percentages <1%.
Percentage of unlabeled reads
Figure 4. Fraction of shared genus (left) and species (right) calls. ROC curves are shown for different minimum abundance thresholds used to make organism calls. The taxonomy calling methods compared are HMP DACC, MetaPhlAn, and LMAT with different database types: LMAT-RefSeq (RefSeq), LMAT-ML (ML), and LMAT-ML-Human (ML+humanNoprune).
BLAST-based taxonomic assignments of reads with different labels assigned by LMAT and HMP DACC output