Mark S Longo, Michael J O'Neill, Rachel J O'Neill.
Abstract
During routine screens of the NCBI databases using human repetitive elements, we discovered an unlikely level of nucleotide identity across a broad range of phyla. To ascertain whether databases containing DNA sequences, genome assemblies and trace archive reads were contaminated with human sequences, we performed an in-depth search for sequences of human origin in non-human species. Using a primate-specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio), with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise and present data that give insight into how this may be occurring.
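As a quick check on the scale reported above, the contaminated fraction follows directly from the two counts given in the abstract. This is only a minimal arithmetic sketch; the variable names are illustrative and do not come from the paper.

```python
# Counts taken from the abstract; the variable names are illustrative.
screened_databases = 2749      # non-primate public databases screened
contaminated_databases = 492   # databases found to contain human sequence

contaminated_fraction = contaminated_databases / screened_databases
print(f"{contaminated_fraction:.1%} of screened databases were contaminated")
```

By these counts, roughly 18% of the screened non-primate databases contained human sequence.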
Year: 2011 PMID: 21358816 PMCID: PMC3040168 DOI: 10.1371/journal.pone.0016410
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Human sequences found in non-primate databases.
A) Representative Clustal alignment of NCBI non-primate trace archive reads to a consensus primate-specific AluY. B) Representative NCBI non-primate trace archive reads mapped to a single human locus. C) Summary of non-primate databases contaminated with human sequence. D) Percent of public databases identified as contaminated.
Figure 2. Chimeric NCBI Trace Archive read containing both Pseudomonas and human sequence.
Approximately one half of this 920 bp sequence entry is >99% identical to human (blue and purple) while the other half is >99% identical to Pseudomonas aeruginosa (green). The Alu alignment used to identify this sequence is shown in purple.
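The idea of a chimeric read, where each half matches a different genome at >99% identity, can be illustrated with a toy sketch. This is not the authors' pipeline. It assumes the read and both references are already aligned to the same coordinates with no gaps, and the sequences, function names, and half-way split point are hypothetical choices made for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def label_halves(read: str, refs: dict, threshold: float = 99.0) -> list:
    """Label each half of the read with the reference it matches.

    Assumption: every reference in `refs` is pre-aligned to the read
    (same length, same coordinates). A half gets the name of a reference
    whose identity exceeds `threshold`, or None if no reference does.
    """
    mid = len(read) // 2
    labels = []
    for start, end in ((0, mid), (mid, len(read))):
        label = None
        for name, ref in refs.items():
            if percent_identity(read[start:end], ref[start:end]) > threshold:
                label = name
        labels.append(label)
    return labels


def is_chimeric(labels: list) -> bool:
    """A read is called chimeric if both halves match, but different sources."""
    return None not in labels and len(set(labels)) > 1


# Toy sequences (hypothetical, far shorter than a real 920 bp read).
read = "ACGTACGTAC" + "TTGGCCAATT"
refs = {
    "human": "ACGTACGTAC" + "GGGGGGGGGG",
    "pseudomonas": "CCCCCCCCCC" + "TTGGCCAATT",
}

labels = label_halves(read, refs)
print(labels, is_chimeric(labels))
```

In practice each portion of a read would be compared against reference genomes with a gapped alignment or database search tool rather than a fixed midpoint split; the sketch only shows the decision logic.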