Graham Rose1, David J Wooldridge1, Catherine Anscombe1, Edward T Mee2, Raju V Misra1, Saheer Gharbia1.
Abstract
The availability of fast, high-throughput and low-cost whole genome sequencing holds great promise within public health microbiology, with applications ranging from outbreak detection and tracking transmission events to understanding the role played by microbial communities in health and disease. Within clinical metagenomics, identifying microorganisms against a complex and host-enriched background remains a central computational challenge. As proof of principle, we used benchtop technology to sequence two metagenomic samples: a known viral mixture of 25 human pathogens and an unknown complex biological model. The datasets were then analysed using a bioinformatic pipeline developed around recent fast classification methods. A targeted approach was able to detect 20 of the viruses against a background of host contamination from multiple sources and bacterial contamination. An alternative untargeted identification method was highly correlated with these classifications, and over 1,600 species were identified when it was applied to the complex biological model, including several species captured at over 50% genome coverage. In summary, this study demonstrates the great potential of applying metagenomics within the clinical laboratory setting and shows that this can be achieved using infrastructure available to nondedicated sequencing centres.
Year: 2015 PMID: 26451363 PMCID: PMC4584244 DOI: 10.1155/2015/292950
Source DB: PubMed Journal: Int J Genomics ISSN: 2314-436X Impact factor: 2.326
Figure 1. The rise of NGS data and taxonomic classification tools. (a) Data in terabytes (TB) based on NCBI SRA deposited data (data taken in January 2015). (b) A survey of metagenomic classification tools with peer-reviewed publications until February 2015.
Identification of the 25 target viruses within the viral sample using the two taxonomic classification approaches. Genome lengths based on reference genome used within the study and relative viral load shown by qPCR cycle threshold (Ct). If Ct > 37 this is shown as ND (not detectable). Percentages are based on the total reads within each dataset after preprocessing stages.
| Target virus | Length (bp) | Ct | Mapping approach: number of reads | Mapping approach: % | Kmer approach: number of reads | Kmer approach: % |
|---|---|---|---|---|---|---|
| Adenovirus 2 | 35,937 | 29.71 | 262,395 | 2.3 | 263,857 | 2.3 |
| Adenovirus 41 | 34,188 | ND | 0 | 0.0 | 0 | 0.0 |
| Astrovirus | 6,813 | 30.53 | 19,190 | 0.2 | 17,227 | 0.2 |
| Coronavirus 229E | 27,317 | 36.48 | 0 | 0.0 | 0 | 0.0 |
| Coxsackievirus B4 | 7,397 | 30.72 | 20,706 | 0.2 | 14,422 | 0.1 |
| Cytomegalovirus | 230,290 | 28.95 | 580,431 | 5.2 | 584,018 | 5.2 |
| Epstein-Barr virus | 171,823 | 31.27 | 43,600 | 0.4 | 43,882 | 0.4 |
| Herpes simplex virus 1 | 152,261 | 30.59 | 10,189 | 0.1 | 10,220 | 0.1 |
| Herpes simplex virus 2 | 154,675 | 32.48 | 36,786 | 0.3 | 36,865 | 0.3 |
| Influenza A H1N1 | 10,982 | 32.02 | 157 | 0.0 | 451 | 0.0 |
| Influenza A H3N2 | 12,990 | ND | 0 | 0.0 | 4 | 0.0 |
| Influenza B virus | 14,452 | ND | 0 | 0.0 | 1 | 0.0 |
| Metapneumovirus A | 13,335 | 31.86 | 30,256 | 0.3 | 30,588 | 0.3 |
| Norovirus GI | 7,623 | ND | 0 | 0.0 | 0 | 0.0 |
| Norovirus GII | 7,535 | ND | 3 | 0.0 | 0 | 0.0 |
| Parainfluenza virus 1 | 15,600 | 34.43 | 37,489 | 0.3 | 33,569 | 0.3 |
| Parainfluenza virus 2 | 15,646 | 33.87 | 5,226 | 0.0 | 5,091 | 0.0 |
| Parainfluenza virus 3 | 15,462 | ND | 309 | 0.0 | 270 | 0.0 |
| Parainfluenza virus 4 | 17,304 | 31.83 | 9,272 | 0.1 | 8,728 | 0.1 |
| Parechovirus 3 | 7,322 | 29.35 | 3,985,296 | 35.4 | 2,371,771 | 21.1 |
| Respiratory syncytial virus A | 15,158 | 34.33 | 1,530 | 0.0 | 1,551 | 0.0 |
| Rhinovirus A39 | 7,137 | 31.16 | 13,335 | 0.1 | 13,797 | 0.1 |
| Rotavirus A RVA | 18,562 | 24.49 | 655 | 0.0 | 13 | 0.0 |
| Sapovirus | 7,429 | 33.37 | 1,455 | 0.0 | 507 | 0.0 |
| Varicella-zoster virus | 125,144 | 29.02 | 1,398,178 | 12.4 | 1,409,763 | 12.5 |
| Total classified | | | 6,456,450 | 57.3 | 4,846,595 | 43.0 |
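The percentage columns can be reproduced from the read counts. The following is a minimal sketch, assuming the denominator is the 11,265,758 reads that remained in the viral dataset after the host screen (the value given in the sequence output table); the function name and the selection of rows are illustrative and not part of the original study.

```python
# Assumption: the denominator is the number of reads left in the viral
# dataset after preprocessing (11,265,758, from the sequence output table).
TOTAL_PREPROCESSED_READS = 11_265_758

# A few rows copied from the table: virus -> (mapping reads, kmer reads).
READ_COUNTS = {
    "Adenovirus 2": (262_395, 263_857),
    "Cytomegalovirus": (580_431, 584_018),
    "Parechovirus 3": (3_985_296, 2_371_771),
    "Varicella-zoster virus": (1_398_178, 1_409_763),
}


def percent_of_reads(count, total=TOTAL_PREPROCESSED_READS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


for virus, (mapped, kmer) in READ_COUNTS.items():
    print(f"{virus}: mapping {percent_of_reads(mapped)}%, "
          f"kmer {percent_of_reads(kmer)}%")
```

Under this assumption the computed values agree with the tabulated percentages, including the two "Total classified" figures.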
Figure 2. Metagenomic pipeline constructed to analyse the two datasets. Following shared sequence preprocessing steps, QC-passed sequences from the known viral sample entered reference-based mapping, with all unmapped reads de novo assembled then classified, whilst the nonhuman model sample entered a kmer-based classification pathway.
Figure 3. Bacterial taxa identified within the viral sample. Bacteria identified following de novo assembly of all unmapped reads (42.7%) from the targeted mapping approach. The Krona chart represents abundances of taxa within the 0.5 M reads corresponding to classified bacterial contigs, with abundances based on the number of reads mapping to contigs. Taxa are coloured by average contig BLAST log e-value.
Figure 4. Nonhuman model sample classified using Kraken. (a) Classification of all reads represented by Krona chart. (b) Genome coverage across the 1.9 Mb genome of a reference strain from the most dominant bacterial species identified, L. reuteri.
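Kraken classifies reads by looking up exact k-mer matches in a reference database. The toy sketch below illustrates only that general idea and is not Kraken's implementation: it uses a simple majority vote over uniquely matching k-mers rather than lowest-common-ancestor assignment, a very short k, and hypothetical reference sequences and labels invented for the example.

```python
from collections import Counter

K = 4  # toy k-mer length; real classifiers use much longer k-mers


def kmers(seq, k=K):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k=K):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k=K):
    """Return the label with the most uniquely matching k-mers, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer, ())
        if len(labels) == 1:  # skip k-mers shared between references
            votes[next(iter(labels))] += 1
    if not votes:
        return None  # no informative k-mer: the read stays unclassified
    return votes.most_common(1)[0][0]


# Hypothetical toy references; these are not real genome sequences.
REFERENCES = {
    "taxon_A": "ACGTACGGTCAAGT",
    "taxon_B": "TTGACCGATTGCAA",
}
INDEX = build_index(REFERENCES)

for read in ("ACGGTCAA", "CCGATTGC", "GGGGGGGG"):
    print(read, "->", classify(read, INDEX))
```

Reads with no k-mer in the index are left unclassified, which mirrors how a proportion of reads in a metagenomic dataset typically receives no taxonomic label.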
Genome capture of Lactobacillus, the most dominant genus identified within the nonhuman model dataset. Reads were mapped against a reference genome of the most abundant strain within the respective species identified.
| Species identified (strain name) | Mapped reads | Bases (Mb) | Depth of coverage (bp) | Genome coverage (%) |
|---|---|---|---|---|
| | 983,484 | 457.1 | 234.7 | 98.6 |
| | 51,947 | 24.2 | 12.3 | 87.8 |
| | 66,840 | 31.7 | 15.3 | 68.9 |
| | 58,678 | 27.3 | 13.2 | 55.4 |
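The "Bases" and "Depth of coverage" columns are linked through the reference genome length. The sketch below assumes that depth was calculated as mapped bases divided by reference length (the source does not state the formula) and back-calculates the implied reference size for each row; the function name is illustrative.

```python
# Rows copied from the table: (bases mapped in Mb, depth of coverage).
ROWS = [
    (457.1, 234.7),
    (24.2, 12.3),
    (31.7, 15.3),
    (27.3, 13.2),
]


def implied_genome_length_mb(bases_mb, depth):
    """Assumption: depth = mapped bases / reference genome length."""
    return bases_mb / depth


for bases_mb, depth in ROWS:
    length = implied_genome_length_mb(bases_mb, depth)
    print(f"{bases_mb} Mb at depth {depth} -> ~{length:.2f} Mb reference")
```

For the first row this gives a reference of roughly 1.9 Mb, which is consistent with the genome size quoted for L. reuteri in the Figure 4 caption.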
Sequence output and data storage for the two datasets. The number of sequences surviving the common preprocessing stages is shown, whilst classified sequences are based on the targeted then assembly approach within the viral dataset, and on the kmer-based approach within the nonhuman model dataset. Percentages are based on the expected number of PE sequences generated for each sequencing chemistry kit used. Storage (in GB) consists of all fastq and intermediate files, including bam and bed format files, generated throughout the analysis.
| Sample | Dataset 1 (viral panel): reads within set | Dataset 1: % | Dataset 1: data (GB) | Dataset 2 (nonhuman model): reads within set | Dataset 2: % | Dataset 2: data (GB) |
|---|---|---|---|---|---|---|
| Predicted reads | 15,000,000 | — | — | 25,000,000 | — | — |
| Sequenced reads | 13,537,917 | 90.3 | 9.1 | 12,734,165 | 50.9 | 13.6 |
| Preprocessing: trimming | 12,223,513 | 81.5 | 15.8 | 11,520,499 | 46.1 | 24.5 |
| Preprocessing: host screen | 11,265,758 | 75.1 | | 11,517,217 | 46.1 | |
| Classified sequences | 8,006,562 | 53.4 | 7.3 | 2,788,450 | 11.2 | 5.5 |
| Total storage | | | 32.2 | | | 43.6 |
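The percentages and storage totals in this table can be checked with simple arithmetic. The sketch below assumes, as the caption states, that each percentage is relative to the predicted read count for the kit, and that total storage is the sum of the three listed storage figures per dataset; the dictionary layout and function names are illustrative.

```python
# Predicted reads per sequencing kit, as given in the table.
PREDICTED_READS = {"viral panel": 15_000_000, "nonhuman model": 25_000_000}

# Read counts copied from the table, in pipeline order.
READS = {
    "viral panel": {
        "Sequenced reads": 13_537_917,
        "Preprocessing: trimming": 12_223_513,
        "Preprocessing: host screen": 11_265_758,
        "Classified sequences": 8_006_562,
    },
    "nonhuman model": {
        "Sequenced reads": 12_734_165,
        "Preprocessing: trimming": 11_520_499,
        "Preprocessing: host screen": 11_517_217,
        "Classified sequences": 2_788_450,
    },
}

# Storage figures (GB) listed for each dataset.
STORAGE_GB = {
    "viral panel": [9.1, 15.8, 7.3],
    "nonhuman model": [13.6, 24.5, 5.5],
}


def percent_of_predicted(reads, predicted):
    """Return reads as a percentage of predicted, to one decimal place."""
    return round(100.0 * reads / predicted, 1)


def total_storage_gb(parts):
    """Sum the per-stage storage figures, to one decimal place."""
    return round(sum(parts), 1)


for dataset, stages in READS.items():
    for stage, reads in stages.items():
        pct = percent_of_predicted(reads, PREDICTED_READS[dataset])
        print(f"{dataset} | {stage}: {pct}%")
    print(f"{dataset} | Total storage: {total_storage_gb(STORAGE_GB[dataset])} GB")
```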