Janus Borner, Thorsten Burmester.
Abstract
BACKGROUND: Contamination from various exogenous sources is a common problem in next-generation sequencing. Endogenous parasites are another possible source of contaminating DNA. On the one hand, undiscovered contamination of animal sequence assemblies may lead to erroneous interpretation of data; on the other hand, when identified, parasite-derived sequences may provide a valuable source of information.
Keywords: Apicomplexa; Coccidia; Contamination; Database analysis; Gregarinasina; Haemosporida; Malaria; Parasites; Phylogeny; Piroplasmida
Year: 2017; PMID: 28103801; PMCID: PMC5244568; DOI: 10.1186/s12864-017-3504-1
Source DB: PubMed; Journal: BMC Genomics; ISSN: 1471-2164; Impact factor: 3.969
Fig. 1 Schematic overview of the ContamFinder pipeline. a All contigs from an assembly were searched against apicomplexan proteomes from the Eukaryotic Pathogen Database (EuPathDB [19, 20]). Sequences without a significant hit were discarded. b Amino acid sequences were predicted using the best-hitting apicomplexan protein. Low-complexity regions and repeats in the sequence were masked. c The predicted amino acid sequences were searched against the EuPathDB and UniProt databases. Sequences with their best hit outside of Apicomplexa were discarded. d Unprocessed contigs corresponding to the hits from the previous step were searched against the EuPathDB and UniProt databases. Sequences that had their best hit outside of Apicomplexa were discarded. Contigs and sequence regions that were kept and used in the next step are shown in green; sequences that were discarded are shown in red. Parasite-derived proteins in the search database are shown in blue, others in yellow
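The filtering logic of steps a, c and d can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the function names and the toy contig identifiers are all hypothetical, the search results are assumed to have been parsed already into per-sequence hit lists, and step b (protein prediction and masking) is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hit:
    """One search hit (hypothetical stand-in for parsed search-tool output)."""
    subject: str           # identifier of the matched database protein
    is_apicomplexan: bool  # True if the matched protein belongs to Apicomplexa
    bitscore: float        # higher score means a better match


def best_hit(hits: List[Hit]) -> Optional[Hit]:
    """Return the highest-scoring hit, or None when there are no hits."""
    return max(hits, key=lambda h: h.bitscore, default=None)


def keep_if_best_is_apicomplexan(ids: List[str],
                                 hits_by_id: Dict[str, List[Hit]]) -> List[str]:
    """Keep only sequences whose best hit is an apicomplexan protein."""
    kept = []
    for seq_id in ids:
        top = best_hit(hits_by_id.get(seq_id, []))
        if top is not None and top.is_apicomplexan:
            kept.append(seq_id)
    return kept


def contamfinder_sketch(contigs: List[str],
                        apicomplexan_hits: Dict[str, List[Hit]],
                        protein_hits: Dict[str, List[Hit]],
                        contig_hits: Dict[str, List[Hit]]) -> List[str]:
    """Apply the three discard rules from the figure caption in order."""
    # Step a: discard contigs without any hit against apicomplexan proteomes.
    candidates = [c for c in contigs if apicomplexan_hits.get(c)]
    # Step b is omitted; protein_hits is assumed to be keyed by contig id.
    # Step c: predicted proteins searched against the combined database.
    after_protein = keep_if_best_is_apicomplexan(candidates, protein_hits)
    # Step d: unprocessed contigs searched against the combined database.
    return keep_if_best_is_apicomplexan(after_protein, contig_hits)


# Toy data: only "c4" survives all three filters.
api = Hit("api_prot", True, 200.0)
host = Hit("host_prot", False, 300.0)
contigs = ["c1", "c2", "c3", "c4"]
apicomplexan_hits = {"c2": [api], "c3": [api], "c4": [api]}
protein_hits = {"c2": [api, host], "c3": [api], "c4": [api]}
contig_hits = {"c3": [api, host], "c4": [api]}

kept = contamfinder_sketch(contigs, apicomplexan_hits, protein_hits, contig_hits)
print(kept)
```

Running the cheap apicomplexan-only search first means the more expensive searches against the combined database only see the contigs that remain as candidates.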
Performance of the ContamFinder pipeline employing three different sequence similarity search tools compared to an all-vs-all blastx search
| Assembly type | Assembly size | all-vs-all blastx search (BLAST+) | ContamFinder (BLAST+) | ContamFinder (RAPsearch2) | ContamFinder (GHOSTX) |
|---|---|---|---|---|---|
| transcriptome | 25.1 Mb | 82 h 14 min | 15 h 57 min | 40 min | 25 min |
| genome | 14.3 Mb | 36 h 9 min | 1 h 12 min | 8 min | 3 min |
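To make the runtimes in the table easier to compare, the short Python sketch below converts them to minutes and derives the speedup of each ContamFinder configuration relative to the all-vs-all blastx search. The runtimes are taken from the table; the dictionary layout and the helper names `to_minutes` and `speedup` are illustrative assumptions.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H h M min' runtime to whole minutes."""
    return hours * 60 + minutes


BASELINE = "all-vs-all blastx (BLAST+)"

# Runtimes in minutes, copied from the table above.
runtimes = {
    "transcriptome": {
        BASELINE: to_minutes(82, 14),
        "ContamFinder (BLAST+)": to_minutes(15, 57),
        "ContamFinder (RAPsearch2)": to_minutes(0, 40),
        "ContamFinder (GHOSTX)": to_minutes(0, 25),
    },
    "genome": {
        BASELINE: to_minutes(36, 9),
        "ContamFinder (BLAST+)": to_minutes(1, 12),
        "ContamFinder (RAPsearch2)": to_minutes(0, 8),
        "ContamFinder (GHOSTX)": to_minutes(0, 3),
    },
}


def speedup(assembly: str, method: str) -> float:
    """Baseline runtime divided by the runtime of the given method."""
    row = runtimes[assembly]
    return row[BASELINE] / row[method]


for assembly, row in runtimes.items():
    for method in row:
        if method != BASELINE:
            print(f"{assembly:13s} {method:26s} {speedup(assembly, method):7.1f}x")
```

These ratios describe only the two assemblies in the table and say nothing about how the runtimes scale with assembly size.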
Fig. 2 Venn diagrams showing shared and unique hits from analyses using different search strategies on the assemblies of Capra hircus (a) and Odocoileus virginianus (b)
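The shared and unique hit counts that such a Venn diagram displays come down to set operations on the contig identifiers reported by each search strategy. The Python sketch below shows one way to compute them; the strategy names and contig identifiers are made up for illustration and are not taken from the study.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def venn_regions(hit_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, ...], Set[str]]:
    """Map each combination of strategies to the hits found by exactly that combination."""
    names = sorted(hit_sets)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Hits reported by every strategy in the combination ...
            inside = set.intersection(*(hit_sets[n] for n in combo))
            # ... minus hits also reported by any strategy outside it.
            outside = set().union(*(hit_sets[n] for n in names if n not in combo))
            regions[combo] = inside - outside
    return regions


# Hypothetical hit lists for three search strategies.
hit_sets = {
    "blast": {"c1", "c2", "c3"},
    "ghostx": {"c2", "c3", "c4"},
    "rapsearch": {"c3", "c5"},
}

regions = venn_regions(hit_sets)
for combo, members in regions.items():
    print(" & ".join(combo), len(members))
```

Each contig lands in exactly one region, so the region sizes sum to the number of distinct contigs found by any strategy.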
Numbers of parasite-derived contigs in publicly available genome and transcriptome assemblies
| Host species | WGS/TSA ID | Assembly type | # parasite-derived contigs | # sequences in dataset 1 | # sequences in dataset 2 |
|---|---|---|---|---|---|
| | GBTA01 | transcriptome | 8347 | 370 | 208 |
| | AWGU01 | genome | 4013 | 793 | 244 |
| | AWGT01 | genome | 3098 | - | - |
| | AAPN01 | genome | 1397 (119) | 540 | 178 |
| | GBDM01 | transcriptome | 1137 | 160 | 102 |
| | GBBP01 | transcriptome | 919 | 339 | 171 |
| | GAOJ01 | transcriptome | 405 | 107 | 63 |
| | GATX01 | transcriptome | 226 | 81 | 57 |
| | CABD02 (CABD03) | genome | 148 (3) | 33 | 15 |
| | GADZ01 | transcriptome | 148 | 35 | 25 |
| | GBBS01 | transcriptome | 120 | 54 | 33 |
| | GAFN01 | transcriptome | 119 | - | - |
| | GAFI01 | transcriptome | 114 | - | - |
| | GBCX01 | transcriptome | 104 | 29 | 21 |
| | AEGY01 | genome | 98 | 34 | 11 |
| | AEGZ01 | genome | 98 | - | - |
| | ALWT01 | genome | 66 | 9 | - |
| | GAFD01 | transcriptome | 62 | - | - |
| | GAMM01 | transcriptome | 61 | 30 | 27 |
| | GADI01 | transcriptome | 56 | - | - |
| | GADH01 | transcriptome | 41 | 18 | - |
| | GAXQ01 | transcriptome | 39 | 18 | 17 |
| | GADZ01 | transcriptome | 24 | - | - |
| | ABJB01 | genome | 26 | 7 | - |
| | AADC01 | genome | 24 | 6 | - |
| | GBKF01 | transcriptome | 21 | 12 | - |
| | GAFW01 | transcriptome | 15 | 6 | - |
| | GAGD01 | transcriptome | 10 | 4 | - |
| | GBCG01 | transcriptome | 8 | - | - |
| | GAOE01 | transcriptome | 8 | - | - |
| | GANP01 | transcriptome | 7 | 5 | - |
| | GAEY01 | transcriptome | 7 | 2 | |
| | GAFX01 | transcriptome | 6 | - | - |
| | AMDV01 | genome | 5 | 2 | - |
| | JNOX01 | genome | 5 | 2 | - |
| | AGSK01 | transcriptome | 5 | 1 | - |
| | GACU01 | transcriptome | 4 | 3 | - |
| | JJRN01 | genome | 4 | 2 | - |
| | GAAX01 | transcriptome | 4 | 3 | - |
| | CAVT01 | genome | 3 | 2 | - |
| | GAFC01 | transcriptome | 3 | 2 | - |
| | BAUQ01 | genome | 2 | 1 | - |
| | GBID01 | transcriptome | 2 | - | - |
| | GAMN01 | transcriptome | 2 | - | - |
| | GACW01 | transcriptome | 1 | - | - |
| | GAOG01 | transcriptome | 1 | - | - |
| | GAAV01 | transcriptome | 1 | - | - |
| | GADN01 | transcriptome | 1 | - | - |
| | GAPU01 | transcriptome | 1 | - | - |
| | GDAP01 | transcriptome | 1 | - | - |
| | ADMZ02 | genome | 1 | - | - |
a Assembly was not used in phylogenetic analyses because it is based on the same raw data as another assembly
b Assembly was not used in phylogenetic analyses because it contains sequences from multiple parasite species
c Data based on a superseded assembly version; the number of parasite-derived contigs in the current version is given in parentheses
Fig. 3 Maximum likelihood tree based on a RAxML analysis of dataset 1 (1,420 genes, 67 taxa). The tree was rooted with Chromerida
Fig. 4 Majority-rule consensus tree based on a PhyloBayes analysis of dataset 2 (301 genes, 49 taxa). Bootstrap support values from a RAxML analysis were mapped onto the tree topology. Bayesian posterior probabilities < 1.00 and bootstrap support values < 100% are given at the nodes; n.s.: split was not supported in the ML analysis; splits that have 1.00 posterior probability and 100% bootstrap support are denoted by a dark circle. The tree was rooted with Chromerida