Samier Merchant, Derrick E Wood, Steven L Salzberg.
Abstract
The raw data from a genome sequencing project sometimes contains DNA from contaminating organisms, which may be introduced during sample collection or sequence preparation. In some instances, these contaminants remain in the sequence even after assembly and deposition of the genome into public databases. As a result, searches of these databases may yield erroneous and confusing results. We used efficient microbiome analysis software to scan the draft assembly of domestic cow, Bos taurus, and identify 173 small contigs that appeared to derive from microbial contaminants. In the course of verifying these findings, we discovered that one genome, Neisseria gonorrhoeae TCDC-NG08107, although putatively a complete genome, contained multiple sequences that actually derived from the cow and sheep genomes. Our findings illustrate the need to carefully validate findings of anomalous DNA that rely on comparisons to either draft or finished genomes.
Keywords: Bioinformatics; DNA sequencing; Genome assembly; Genomics; Microbiome; Sequence analysis
Year: 2014 PMID: 25426337 PMCID: PMC4243333 DOI: 10.7717/peerj.675
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Locations of foreign DNA in Neisseria gonorrhoeae TCDC-NG08107 genome. E-values in column 4 were computed by the BLAST program in a search against the NCBI comprehensive sequence database.
| Genome coordinates | Length | True species | BLAST E-Value |
|---|---|---|---|
| 499351–499709 | 359 | Cow | 3 × 10⁻¹⁶⁸ |
| 1267185–1267393 | 209 | Cow | 1 × 10⁻⁷¹ |
| 1371560–1371932 | 373 | Cow | 2 × 10⁻¹³⁰ |
| 1635755–1635954 | 200 | Cow | 3 × 10⁻⁹³ |
| 2118014–2118647 | 634 | Sheep | 0.0 |
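
As a quick consistency check on the table above, the Length column can be recomputed from the genome coordinates. The sketch below assumes the coordinate ranges are inclusive at both ends, which is the convention that reproduces the listed lengths; the names `regions` and `region_length` are illustrative and do not come from the paper.

```python
# Coordinate ranges and source species copied from the table above.
regions = [
    (499351, 499709, "Cow"),
    (1267185, 1267393, "Cow"),
    (1371560, 1371932, "Cow"),
    (1635755, 1635954, "Cow"),
    (2118014, 2118647, "Sheep"),
]


def region_length(start, end):
    """Length of an interval whose start and end are both included (assumed convention)."""
    return end - start + 1


# Per-region lengths, in table order, and the total amount of foreign sequence.
lengths = [region_length(start, end) for start, end, _ in regions]
total_foreign = sum(lengths)

for (start, end, species), length in zip(regions, lengths):
    print(f"{start}-{end}\t{length}\t{species}")
print("total foreign bases:", total_foreign)
```

Under that assumption the five regions have lengths 359, 209, 373, 200, and 634, matching the table, for a total of 1,775 bases.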
Results of screening 8 publicly available draft genomes for microbial contaminants. GenBank accession numbers are shown for each genome along with the number of contigs and the size of the draft assembly. The last column shows the sequencing technology used for each project.
| Genome | # of contaminant contigs | Total contaminant length | Range of E-values | Total # of contigs | Genome size | Technology |
|---|---|---|---|---|---|---|
| | 4 | 9,415 | 4E-71–0.0 | 49,195 | 35 | Illumina |
| | 2 | 904 | 1E-6–5E-22 | 62,912 | 470 | Illumina |
| | 2 | 19,677 | 0.0 | 13,373 | 190 | ABI SOLiD sequencing |
| | 227 | 254,869 | 0.0 | 11,385 | 120 | Sanger |
| | 0 | 0 | N/A | 8,962 | 301 | Sanger |
| | 0 | 0 | N/A | 13,857 | 174 | 454 |
| | 0 | 0 | N/A | 4,884 | 75 | Illumina |
| | 0 | 0 | N/A | 118,433 | 701 | Illumina |
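
To put the contaminant counts in proportion, the share of each assembly's contigs that was flagged can be computed from the contaminant contig counts and total contig counts in the table above. This is only an illustrative calculation, not part of the original analysis; the table as reproduced here does not show the genome names, so rows are identified by their position, and the variable names are hypothetical.

```python
# (contaminant contigs, total contigs) for the eight rows, in table order.
rows = [
    (4, 49195),
    (2, 62912),
    (2, 13373),
    (227, 11385),
    (0, 8962),
    (0, 13857),
    (0, 4884),
    (0, 118433),
]

# Percentage of contigs flagged as contaminant in each assembly.
percent_flagged = [100.0 * contaminant / total for contaminant, total in rows]

# How many of the screened assemblies contained at least one contaminant contig.
assemblies_with_contaminants = sum(1 for contaminant, _ in rows if contaminant > 0)

for position, pct in enumerate(percent_flagged, start=1):
    print(f"row {position}: {pct:.3f}% of contigs flagged")
print("assemblies with contaminants:", assemblies_with_contaminants, "of", len(rows))
```

By this measure, four of the eight assemblies contain at least one flagged contig, and the fourth row stands out with roughly 2% of its contigs flagged.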