Clementine M Francois, Faustine Durand, Emeric Figuet, Nicolas Galtier.
Abstract
Thanks to huge advances in sequencing technologies, genomic resources are increasingly being generated and shared by the scientific community. The quality of such public resources is therefore of critical importance. Errors due to contamination are particularly worrying; they are widespread, propagate across databases, and can compromise downstream analyses, especially the detection of horizontally-transferred sequences. However, we still lack consistent and comprehensive assessments of contamination prevalence in public genomic data. Here we applied a standardized procedure for foreign sequence annotation to 43 published arthropod genomes from the widely used Ensembl Metazoa database. This method combines information on sequence similarity and synteny to identify contaminant and putative horizontally-transferred sequences in any genome assembly, provided that an adequate reference database is available. We uncovered considerable heterogeneity in quality among arthropod assemblies, some being devoid of contaminant sequences, whereas others included hundreds of contaminant genes. Contaminants far outnumbered horizontally-transferred genes and were a major confounder of their detection, quantification and analysis. We strongly recommend that automated standardized decontamination procedures be systematically embedded into the submission process to genomic databases.
Keywords: automated detection pipeline; contaminant sequences; curation of genomic databases; horizontal gene transfer
Year: 2020 PMID: 31862787 PMCID: PMC7003083 DOI: 10.1534/g3.119.400758
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. A simplified flow diagram of the pipeline developed for this study. Each species assembly is evaluated independently through this pipeline, which requires the set of coding sequences (CDS) as well as the genomic scaffolds of each species, and an appropriate reference database. In this diagram, boxes referring to ‘data’, ‘reference database’ and ‘tools’ are colored in blue, green, and red, respectively. See the main text for detailed explanations.
Figure 2. Prevalence of contaminant and HGT candidates in the 43 arthropod genomes. Contaminant CDS are classified according to their taxonomic group (i.e., originating from eubacteria, archaea, viridiplantae, fungi and ‘protists’). Images courtesy of PhyloPic.
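The abstract describes combining sequence similarity with synteny to separate contaminants from putative horizontally-transferred genes. The Python sketch below illustrates one plausible reading of that idea and is not the authors' pipeline. It assumes each CDS already carries a taxonomic label from a similarity search against the reference database, and it uses a deliberately simplified synteny rule: a foreign-looking CDS that shares a scaffold with at least one native (metazoan) CDS is flagged as an HGT candidate, while one found only alongside other foreign CDS is flagged as a contaminant. The names `Cds`, `classify_foreign_cds`, `contaminant_counts_by_group` and the group labels are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

NATIVE_GROUP = "metazoa"  # assumed label for hits to the expected (host) lineage


@dataclass(frozen=True)
class Cds:
    cds_id: str
    scaffold: str        # genomic scaffold carrying this CDS
    best_hit_group: str  # taxonomic group of the best similarity hit (assumed precomputed)


def classify_foreign_cds(cds_list):
    """Label each CDS as 'native', 'hgt_candidate', or 'contaminant'.

    Simplified rule (an assumption, not the published criteria): a foreign
    CDS on a scaffold that also holds a native CDS is an HGT candidate;
    a foreign CDS on a scaffold with no native CDS is a contaminant.
    """
    by_scaffold = defaultdict(list)
    for cds in cds_list:
        by_scaffold[cds.scaffold].append(cds)

    labels = {}
    for members in by_scaffold.values():
        has_native = any(c.best_hit_group == NATIVE_GROUP for c in members)
        for c in members:
            if c.best_hit_group == NATIVE_GROUP:
                labels[c.cds_id] = "native"
            elif has_native:
                labels[c.cds_id] = "hgt_candidate"
            else:
                labels[c.cds_id] = "contaminant"
    return labels


def contaminant_counts_by_group(cds_list, labels):
    """Tally contaminant CDS per taxonomic group, as in a per-genome summary."""
    return Counter(
        c.best_hit_group for c in cds_list if labels[c.cds_id] == "contaminant"
    )


# Toy input: scaf1 mixes native and foreign CDS, scaf2 is entirely foreign.
example = [
    Cds("g1", "scaf1", "metazoa"),
    Cds("g2", "scaf1", "eubacteria"),
    Cds("g3", "scaf2", "eubacteria"),
    Cds("g4", "scaf2", "fungi"),
]
labels = classify_foreign_cds(example)
counts = contaminant_counts_by_group(example, labels)
print(labels)
print(dict(counts))
```

In this toy input, `g2` is treated as an HGT candidate because it sits next to a native gene, whereas `g3` and `g4` are counted as contaminants from eubacteria and fungi. A real analysis would need stricter criteria (for example, hit-score thresholds and a minimum number of flanking native genes), which this sketch leaves out.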