Kirill Kryukov, Tadashi Imanishi.
Abstract
Contamination in a genome assembly can lead to wrong or confusing results when that genome is used as a reference in sequence comparison. Although bacterial contamination is well known, the problem of human-originated contamination has received little attention. In this study we surveyed 45,735 available genome assemblies for evidence of human contamination. We used lineage specificity to distinguish between contamination and conservation. We found that 154 genome assemblies contain fragments that, with high confidence, originate from contaminating human DNA. The majority of the contaminating human sequences have been present in the reference human genome assembly for over a decade. We recommend that existing contaminated genomes be revised to remove the contaminating sequence, and that new assemblies be thoroughly checked for the presence of human DNA before they are submitted to public databases.
Year: 2016 PMID: 27611326 PMCID: PMC5017631 DOI: 10.1371/journal.pone.0162424
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Mammalian genomes containing at least 2 kbp of likely human originated (LHO) sequence.
| Organism | Assembly accession | Assembly date | Regions | Total LHO length (bp) | LHO bp / genome Mbp |
|---|---|---|---|---|---|
| | GCF_000181335.2 | 2014-11-07 | 42 | 15,056 | 5.79 |
| | GCF_000001895.5 | 2014-07-01 | 24 | 5,907 | 2.16 |
| | GCF_000754665.1 | 2014-10-08 | 8 | 2,750 | 1.04 |
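The last column of these tables normalizes the amount of LHO sequence by assembly size. The sketch below shows one way to compute that density and, going the other way, to estimate the assembly size implied by a table row. It assumes that "Mbp" means 10^6 bp and that the column is simply total LHO length divided by assembly size in Mbp; the function names are hypothetical and are not taken from the paper.

```python
def lho_density(total_lho_bp: int, genome_size_bp: int) -> float:
    """Return LHO base pairs per Mbp of assembly (assumes 1 Mbp = 1,000,000 bp)."""
    return total_lho_bp / (genome_size_bp / 1_000_000)


def implied_genome_mbp(total_lho_bp: int, density_bp_per_mbp: float) -> float:
    """Estimate the assembly size in Mbp implied by a table row."""
    return total_lho_bp / density_bp_per_mbp


# Rows copied from the mammalian table: (accession, total LHO bp, LHO bp / genome Mbp)
mammal_rows = [
    ("GCF_000181335.2", 15_056, 5.79),
    ("GCF_000001895.5", 5_907, 2.16),
    ("GCF_000754665.1", 2_750, 1.04),
]

for accession, lho_bp, density in mammal_rows:
    size_mbp = implied_genome_mbp(lho_bp, density)
    print(f"{accession}: about {size_mbp:,.0f} Mbp implied assembly size")
```

Because the published densities are rounded to two decimals, the implied sizes are approximate. The same calculation explains why small prokaryote assemblies with only a few kbp of LHO sequence show densities in the hundreds or thousands.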
Prokaryote genomes containing at least 2 kbp of likely human originated (LHO) sequence.
| Organism | Assembly accession | Assembly date | Regions | Total LHO length (bp) | LHO bp / genome Mbp |
|---|---|---|---|---|---|
| | GCA_000825745.1 | 2014-12-01 | 34 | 7,301 | 1,146.80 |
| | GCA_000171775.1 | 2007-10-23 | 10 | 4,221 | 460.72 |
| | GCA_000170635.1 | 2007-06-08 | 14 | 3,099 | 1,082.27 |
| | GCA_000955865.1 | 2015-03-18 | 5 | 2,946 | 503.02 |
Fig 1. Phylogenetic trees comparing close homologs of human-originated regions.
Fig 2. Proportion of LHOs that have hits in older builds of the human genome.
Fig 3. Schematic flowchart of the method.
Summary of the genomes used.
| Group | Genomes | Sequences | Total size (bp) |
|---|---|---|---|
| Primates | 24 | 4,517,145 | 70,885,537,497 |
| Non-primate mammals | 82 | 9,207,268 | 210,526,504,865 |
| Non-mammal vertebrates | 133 | 13,837,380 | 146,702,887,277 |
| Non-vertebrate eukaryotes | 1,884 | 49,326,244 | 295,414,011,587 |
| Prokaryotes | 43,636 | 5,220,198 | 165,007,759,182 |
| Viruses | 4,843 | 6,379 | 189,550,904 |
| Total | 50,602 | 82,114,614 | 888,726,251,312 |
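As a quick consistency check on the summary table, the sketch below re-adds each column and compares it with the reported totals. It also computes the genome count that remains after leaving out primates and viruses. That count equals the 45,735 assemblies quoted in the abstract, but reading this as the survey's exclusion rule is an assumption: this record does not state it.

```python
# (genomes, sequences, total size in bp), copied from the summary table
groups = {
    "Primates": (24, 4_517_145, 70_885_537_497),
    "Non-primate mammals": (82, 9_207_268, 210_526_504_865),
    "Non-mammal vertebrates": (133, 13_837_380, 146_702_887_277),
    "Non-vertebrate eukaryotes": (1_884, 49_326_244, 295_414_011_587),
    "Prokaryotes": (43_636, 5_220_198, 165_007_759_182),
    "Viruses": (4_843, 6_379, 189_550_904),
}
reported_total = (50_602, 82_114_614, 888_726_251_312)

# Sum each column across the groups and compare with the "Total" row.
column_sums = tuple(sum(column) for column in zip(*groups.values()))
totals_match = column_sums == reported_total

# Assumption (not stated in this record): primates and viruses are excluded.
surveyed = reported_total[0] - groups["Primates"][0] - groups["Viruses"][0]

print("Column sums match reported totals:", totals_match)
print("Genomes excluding primates and viruses:", surveyed)
```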
Non-mammal vertebrate genomes containing at least 2 kbp of likely human originated (LHO) sequence.
| Organism | Assembly accession | Assembly date | Regions | Total LHO length (bp) | LHO bp / genome Mbp |
|---|---|---|---|---|---|
| | GCF_000344595.1 | 2013-03-18 | 130 | 33,289 | 15.77 |
| | GCA_000699945.1 | 2014-06-11 | 57 | 14,273 | 12.50 |
| | GCF_000534875.1 | 2014-01-15 | 111 | 13,488 | 11.97 |
| | GCF_000146605.2 | 2014-11-24 | 13 | 3,189 | 2.92 |
| | GCA_000233375.4 | 2015-06-10 | 9 | 3,141 | 1.03 |
| | GCA_000787105.1 | 2014-12-02 | 13 | 2,359 | 3.43 |
Non-vertebrate eukaryote genomes containing at least 2 kbp of likely human originated (LHO) sequence.
| Organism | Assembly accession | Assembly date | Regions | Total LHO length (bp) | LHO bp / genome Mbp |
|---|---|---|---|---|---|
| | GCA_000576715.1 | 2014-02-03 | 1,404 | 335,303 | 20,566.80 |
| | GCA_000181115.1 | 2008-06-27 | 576 | 150,537 | 1,018.43 |
| | GCA_000167175.1 | 2004-04-29 | 129 | 35,036 | 1,174.04 |
| | GCA_000338675.1 | 2013-02-07 | 127 | 26,468 | 415.54 |
| | GCA_000167115.1 | 2004-03-15 | 76 | 18,422 | 1,688.22 |
| | GCA_000192455.1 | 2011-03-18 | 60 | 11,195 | 916.47 |
| | GCA_000256725.1 | 2012-03-27 | 53 | 9,963 | 158.19 |
| | GCA_001077415.1 | 2014-07-08 | 76 | 9,391 | 5.77 |
| | GCA_000166955.1 | 2003-03-28 | 34 | 7,498 | 631.54 |
| | GCA_000875885.1 | 2015-02-13 | 20 | 6,350 | 53.96 |
| | GCA_000333975.2 | 2012-12-19 | 19 | 6,145 | 751.22 |
| | GCA_000166975.1 | 2003-03-28 | 10 | 3,793 | 330.68 |
| | GCA_000582825.1 | 2013-08-30 | 19 | 3,607 | 2.55 |
| | GCF_000002595.1 | 2007-10-15 | 11 | 3,505 | 33.25 |
| | GCF_000005005.1 | 2013-10-24 | 14 | 2,854 | 1.39 |
| | GCA_000978595.1 | 2015-04-22 | 6 | 2,782 | 5.18 |
| | GCA_000464645.1 | 2014-06-06 | 9 | 2,672 | 80.36 |
| | GCF_000209225.1 | 2007-08-22 | 10 | 2,567 | 8.63 |
| | GCA_000006565.2 | 2013-08-02 | 10 | 2,242 | 34.25 |
| | GCA_000259835.1 | 2011-08-26 | 10 | 2,085 | 33.83 |