PhylOligo: a package to identify contaminant or untargeted organism sequences in genome assemblies.

Ludovic Mallet¹, Tristan Bitard-Feildel², Franck Cerutti¹, Hélène Chiapello¹.

Abstract

MOTIVATION: Genome sequencing projects sometimes uncover more organisms than expected, especially for complex and/or non-model organisms. It is therefore useful to develop software that identifies mixtures of organisms in genome sequence assemblies.
RESULTS: Here we present PhylOligo, a new package including tools to explore, identify and extract organism-specific sequences in a genome assembly using the analysis of their DNA compositional characteristics.
AVAILABILITY AND IMPLEMENTATION: The tools are written in Python3 and R under the GPLv3 Licence and can be found at https://github.com/itsmeludo/Phyloligo/.
CONTACT: ludovic.mallet@inra.fr.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28637232 PMCID: PMC5860033 DOI: 10.1093/bioinformatics/btx396

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The development of sequencing technologies has made it possible to target the genomes of complex non-model organisms and communities of organisms. Some of these non-model organisms are challenging to isolate from their environment or cannot be cloned in vitro. They may alternatively be obligately associated with cognate commensal or parasitic organisms, or even embedded within a host. Consequently, genome assembly datasets sometimes include DNA from unexpected sources, such as a mixture of untargeted species, and may also contain organelles or even laboratory contaminants. The presence of additional untargeted species has indeed been reported in several recent genome assemblies, for instance in the draft assembly of the domestic cow (Merchant et al., 2014), in several isolates of the phytopathogenic fungus Magnaporthe oryzae (Chiapello et al., 2015) and recently in the tardigrade genome (Delmont and Eren, 2016). Such mixed assemblies can bias downstream bioinformatics analyses, which raises the need for tools able to deal with mixed-organism DNA assemblies. Several tools were recently designed or used to detect and filter untargeted organisms from sequence datasets. A first type of approach, used by the khmer software (Crusoe et al., 2015), is to compute k-mer frequencies on short reads to pre-process and filter read datasets prior to de novo sequence assembly. Other packages such as Blobtools (Kumar et al., 2013) and Anvi’o (Eren et al.) combine sequence properties (GC content, oligonucleotide profile) with additional information such as depth of coverage, similarity to public databases and reference gene sets to identify untargeted species using both the raw reads and the assembled contigs of a genomic dataset.
Finally, a last type of approach is to use software dedicated to metagenomic species read binning, such as CONCOCT (Alneberg et al.), which uses sequence composition and coverage across multiple samples to automatically cluster contigs into genomes, or Kraken (Wood and Salzberg, 2014), which relies on exact alignment of k-mers to a k-mer reference database to assign taxonomic labels to metagenomic DNA. Here we present PhylOligo, a toolset designed to explore, segment and subtract untargeted material from assembled sequences using an ab initio, alignment-free approach relying only on the intrinsic oligonucleotide signature of an assembled genomic dataset. Compared to existing software, PhylOligo provides several features to explore assemblies, including: (i) a customizable oligonucleotide pattern, including continuous and spaced-pattern k-mers (Břinda et al.; Leimeister et al.; Noé and Martin, 2014); (ii) handling of bare contig-level assemblies (raw reads and coverage information are not required for detecting untargeted species); (iii) an interactive cladogram-based visualization of contig signature similarity and cumulative size, used to explore the signature clusters and profile putative additional material; and (iv) an effective sliding-window partitioning scan of the assembly based on supervised learning and a double-threshold system, in which regions are labelled as belonging to an untargeted organism only when they meet two criteria: (a) being distant enough from the host sequence oligonucleotide profile (first threshold) and (b) being close enough to a cluster of untargeted sequences previously selected by supervised learning (second threshold). Our strategy presents several advantages.
(i) Unlike approaches that process short-read datasets prior to de novo sequence assembly and use sequence homology information, PhylOligo can identify potentially uncharacterised and distantly related sequences in already assembled genomic datasets and can handle any type of genome assembly, without depending on the availability of raw sequencing reads or additional data, or on the patchy knowledge in databases. (ii) The double-threshold species-specific filtration prevents the removal of horizontal gene transfers (HGTs) and the subsequent fragmentation of the assembly. (iii) Learning the compositional profile on longer, assembled sequences such as contigs, rather than on unassembled reads, yields a refined oligonucleotide profile that is not biased by heterogeneous sequencing depth along the sequence. Moreover, the partitioning process of PhylOligo makes it possible to detect and split chimeric sequences or mis-scaffolded regions.

2 Workflow strategy

Our strategy includes three main steps: (i) assembly exploration using an interactive tree visualization based on oligonucleotide profiles computed from all genomic contigs, (ii) oligonucleotide profile prototype learning based on contig subsets selected by the user at nodes of the tree and (iii) assembly partitioning to locate organism-specific regions and classify contigs or segments according to the learned prototypes.

2.1 Assembly exploration

PhylOligo allows a visual exploration of the distribution and structure of compositional similarity among the contigs of an assembly, based on either continuous (k-mer) or spaced-pattern oligonucleotide frequencies. The oligonucleotide profile of each contig is computed, and a pairwise distance matrix is produced using a metric such as the Euclidean distance or the Jensen-Shannon divergence (Fig. 1A) to generate an interactive Neighbour-Joining tree. Branch width is drawn proportional to the cumulative length of the contigs in a clade, allowing the user to track where the main part of the assembly clusters (assumed to correspond to the targeted organisms) and which significant clades branch out, hinting at separate organisms (see Fig. 1B). Thanks to the Ape package (Paradis et al.), sequences from a clade can be interactively selected on the tree and exported to FASTA files to learn a prototype of their oligonucleotide profile. An alternative unsupervised clustering method relying on HDBSCAN (Campello et al.), with t-SNE (van der Maaten and Hinton, 2008) for visualization, is also implemented (Fig. 1C).
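
As an illustration of the profile and distance computation described above, the following Python sketch counts the words read through a 0/1 pattern (for example 1111 or 110101) and compares two profiles with the Jensen-Shannon divergence. The function names, the single-strand counting and the base-2 logarithm are assumptions made for this example; they are not taken from the PhylOligo source code.

```python
from collections import Counter
from itertools import product
import math


def pattern_words(seq, pattern):
    """Yield the words read through a 0/1 pattern ('1' = position kept)."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    seq = seq.upper()
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in keep)
        if set(word) <= set("ACGT"):  # skip words containing N or other symbols
            yield word


def profile(seq, pattern="1111"):
    """Return the frequency vector over all 4**k words, in a fixed order."""
    k = pattern.count("1")
    counts = Counter(pattern_words(seq, pattern))
    total = sum(counts.values())
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two profiles."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distance_matrix(contigs, pattern="1111"):
    """Pairwise divergence between the profiles of a list of sequences."""
    profiles = [profile(s, pattern) for s in contigs]
    return [[jensen_shannon(a, b) for b in profiles] for a in profiles]
```

A matrix of this kind is the input that a Neighbour-Joining implementation would need; the tree building and interactive display are left out of the sketch.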

Fig. 1

Visualization and interactive exploration of assemblies. (A) Pairwise compositional divergence of contigs produced by PhylOligo. Contigs are reordered by hierarchical clustering. (B) Contig tree produced by PhylOligo on the tardigrade genome. The clade in red is the current selection made by the user. (C) Contigs clustered by HDBSCAN on oligonucleotide frequencies. Data from Magnaporthe oryzae. Red and blue are predicted clusters, grey are unclassified. The hyperspace is reduced to 2 dimensions with t-SNE. (D) Determination of the untargeted threshold in ContaLocate based on the distribution of distances between the untargeted clade and the scanning windows over the whole assembly. (Color version of this figure is available at Bioinformatics online.)

2.2 Prototype learning

ContaLocate then learns oligonucleotide profiles for the main organism and the presumed additional organisms from the subsets selected by the user in the previous step. These subsets must total at least 50 kb in order to generate an accurate prototype sufficiently representative of an organism. Learning subsets can be generated or complemented from public sequences, specialised databases (Ménigaud et al., 2012) or other tools (Alneberg et al.; Eren et al.; Kumar et al., 2013).
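
The learning step can be pictured with the short Python sketch below, which pools k-mer counts over a user-selected subset of contigs and refuses subsets shorter than the 50 kb minimum stated above. Pooling raw counts into a single frequency profile, the function name `learn_prototype` and the use of continuous k-mers are assumptions for this example, not a description of ContaLocate's actual code.

```python
from collections import Counter
from itertools import product

MIN_TRAINING_BP = 50_000  # the 50 kb minimum stated in the text


def learn_prototype(contigs, k=4, min_bp=MIN_TRAINING_BP):
    """Pool k-mer counts over the selected contigs into one frequency profile."""
    total_bp = sum(len(s) for s in contigs)
    if total_bp < min_bp:
        raise ValueError(
            f"training subset has {total_bp} bp, need at least {min_bp}"
        )
    counts = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # ignore words with ambiguous bases
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous k-mer found in the training subset")
    all_words = ("".join(w) for w in product("ACGT", repeat=k))
    return {w: counts[w] / total for w in all_words}
```

Pooling counts before normalising gives long contigs more weight than short ones, which is one plausible way to obtain a profile representative of the whole subset.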

2.3 Assembly partitioning

The assembly is then scanned with sliding windows to locate organism-specific regions using oligonucleotide divergences computed against the targeted and the additional profiles. The distributions of the divergences against both profiles are used to establish two thresholds that best separate the different modes in the density functions (see Fig. 1D). The thresholds are visually validated by the user and can also be adjusted manually. Genomic regions whose divergences simultaneously cross the respective thresholds for the targeted and the additional profiles are labelled as part of the additional organism and exported as a GFF file.
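
The double-threshold scan can be sketched in Python as follows. A window is kept only when it is farther than `t_host` from the targeted profile and closer than `t_extra` to the additional profile, and overlapping hits are merged before being written as GFF lines. The window and step sizes, the Euclidean distance, the hand-set thresholds and all function names are assumptions for this example; the tool itself derives the thresholds from the divergence distributions. Sequences are assumed to be upper-case.

```python
from collections import Counter
import math


def kmer_freqs(seq, k=4):
    """Sparse k-mer frequency profile of an upper-case sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def euclidean(p, q):
    """Euclidean distance between two sparse profiles."""
    words = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def scan(seq, host, extra, t_host, t_extra, window=5000, step=500, k=4):
    """Return merged 0-based (start, end) regions that are far from the
    host profile AND close to the additional profile."""
    regions = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        freqs = kmer_freqs(seq[start:start + window], k)
        far_from_host = euclidean(freqs, host) > t_host
        close_to_extra = euclidean(freqs, extra) < t_extra
        if far_from_host and close_to_extra:
            end = min(start + window, len(seq))
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions


def to_gff(seq_id, regions, source="sketch", label="untargeted"):
    """Format regions as 9-column GFF lines (1-based, inclusive coordinates)."""
    return [
        f"{seq_id}\t{source}\tregion\t{s + 1}\t{e}\t.\t.\t.\tNote={label}"
        for s, e in regions
    ]
```

Requiring both conditions is what keeps a window that merely differs from the host, without resembling the learned additional profile, from being labelled.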

3 Results

3.1 Synthetic datasets

We evaluated the performance of PhylOligo by generating artificial contaminations on 32 contig datasets generated by GRINDER (Angly et al.) from real RefSeq genome data (see section 6.1 of the Supplementary Material for the detailed protocol). The species were chosen to cover the main domains of life (archaea, bacteria, fungi, protozoa and vertebrates) and different degrees of genome complexity, content, length and composition. We benchmarked the automatic version of PhylOligo, which uses the unsupervised HDBSCAN clustering, and evaluated performance using three indicators: (i) the cluster specificity, i.e. the maximum fraction of contaminant in a cluster; (ii) the cluster sensitivity, i.e. the fraction of the whole contaminant aggregated in the cluster; and (iii) a hybrid score, which is the best computed value of the product of cluster specificity and sensitivity. We used PhylOligo default parameters on all the combinations of the 32 simulated genome assemblies. We also evaluated the impact of the k-mer parameter by testing a panel of continuous and spaced-pattern k-mers on a focus subset of ten pairs (see results in Table 1). Complete results are detailed in section 6.2 of the Supplementary Material. Overall, the benchmark demonstrates a strong ability to discriminate contaminant clusters with very high specificity and good sensitivity, which suits the requirements of supervised learning and partitioning. Concerning the impact of the k-mer pattern, we obtained the best results according to our hybrid score for two spaced patterns: 11001 (mean: 0.8133, median: 0.9499) and 110101 (mean: 0.8344, median: 0.9459). Continuous k-mers of length 4 and 5 also performed well but with slightly lower scores (median scores of 0.7909 and 0.9305 for k = 4 and 5, respectively).
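
To make the three indicators concrete, the Python sketch below computes them from a cluster label per contig and a known contaminant flag. It assumes that fractions are measured in base pairs and that contigs left unclassified by the clustering (label -1, the HDBSCAN convention) are excluded from clusters but still count towards the total contaminant; both choices, and the function names, are assumptions for this example rather than details taken from the Supplementary Material.

```python
from collections import defaultdict


def cluster_scores(labels, is_contaminant, lengths, noise_label=-1):
    """Per cluster, return (specificity, sensitivity, product).

    specificity = contaminant bp in the cluster / total bp in the cluster
    sensitivity = contaminant bp in the cluster / contaminant bp overall
    """
    total = defaultdict(int)
    contam = defaultdict(int)
    for lab, flag, n in zip(labels, is_contaminant, lengths):
        if lab == noise_label:
            continue  # unclassified contigs belong to no cluster
        total[lab] += n
        if flag:
            contam[lab] += n
    all_contam = sum(n for flag, n in zip(is_contaminant, lengths) if flag)
    scores = {}
    for lab in total:
        spec = contam[lab] / total[lab]
        sens = contam[lab] / all_contam if all_contam else 0.0
        scores[lab] = (spec, sens, spec * sens)
    return scores


def hybrid_score(scores):
    """Best product of specificity and sensitivity over all clusters."""
    return max((h for _, _, h in scores.values()), default=0.0)
```

For example, a cluster holding 80 bp of contaminant out of 100 bp, when the assembly contains 100 bp of contaminant in total, has specificity 0.8, sensitivity 0.8 and a product of 0.64.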

Table 1.

Impact of k-mer pattern on the hybrid score (best computed value of the product of cluster specificity and sensitivity) for 10 pairs of simulated data

	K-mer pattern
Species mix	111	1111	11111	11001	110101	111001
S.enterica in A.fumigatus	0.39	0.79	0.94	0.45	0.93	0.97
B.cereus in C.canadensis	0.99	0.99	0.98	0.99	0.98	0.99
B.mallei in H.sapiens	1.00	0.99	0.99	0.99	0.99	0.99
A.fulgidus in P.tetraurelia	0.99	0.99	0.99	0.99	0.99	0.99
A.fumigatus in P.tigris	0.96	0.96	0.93	0.95	0.95	0.95
G.intestinalis in X.tropicalis	0.95	0.99	0.95	0.99	0.96	0.98
S.enterica in T.vaginalis	0.41	0.50	0.70	0.72	0.71	0.72
B.cereus in A.australis	0.73	0.73	0.71	0.72	0.71	0.72
S.cerevisiae in T.vaginalis	0.60	0.56	0.49	0.57	0.51	0.58
S.pombe in T.vaginalis	0.65	0.61	0.43	0.58	0.47	0.58
Mean	0.69	0.74	0.75	0.81	0.83	0.78
Median	0.73	0.79	0.93	0.95	0.95	0.95
Min	0.01	0.01	0.15	0.45	0.47	0.05
Max	1.00	0.99	0.99	0.99	0.99	0.99

3.2 Real datasets

PhylOligo has been successfully applied to identify large untargeted bacterial regions in four out of nine fungal genomic datasets of Magnaporthe (Chiapello et al., 2015). GOHTAM (Ménigaud et al., 2012) taxonomic assignment of these additional regions confirmed their homogeneity and their origin from Burkholderiales. Targeted BLAST comparisons indicated that some of these supplementary regions were almost identical to Burkholderia fungorum sequences (100% identity for the 16S, recA and gyrB genes), suggesting an origin from, or relatedness to, one or several bacterial isolate(s) of this species. PhylOligo was applied to filter the genome assemblies, and the filtering was validated with BUSCO (Simão et al., 2015) and DOGMA (Dohmen et al., 2016) (see Supplementary Material); this made it possible to continue further bioinformatics analyses without repeating the costly initial genome assembly and annotation processes. PhylOligo was also used to explore the scaffolds of the tardigrade assembly (Boothby et al., 2015), for which multiple contaminations were previously proposed (Delmont and Eren, 2016; Koutsovoulos et al., 2016). We compared the topology of the compositional cladograms established with PhylOligo on both the initial assembly and the filtered assembly obtained with Anvi’o (Eren et al.). Our results showed that the cladogram produced with PhylOligo exhibited a topology in which the curated assembly was monophyletic, with a sequence subset and topology highly concordant with the results of Anvi’o (see Supplementary Material Section 5.2).

3.3 PhylOligo performances

PhylOligo handles assemblies of up to several tens of thousands of contigs within minutes on a modern workstation, and up to a few hundred thousand on a high-memory server. Input sequences can be quality-filtered or sub-sampled with a preprocessing tool to improve the signal and allow quick tests. Several parallel computation optimizations and data compression methods, including HDF5, are available to improve performance on larger datasets.

References (10 of 18 shown)

1. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

2. DOGMA: domain-based transcriptome and proteome quality assessment.

Authors: Elias Dohmen; Lukas P M Kremer; Erich Bornberg-Bauer; Carsten Kemena
Journal: Bioinformatics Date: 2016-05-05 Impact factor: 6.937

3. Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade.

Authors: Thomas C Boothby; Jennifer R Tenlen; Frank W Smith; Jeremy R Wang; Kiera A Patanella; Erin Osborne Nishimura; Sophia C Tintori; Qing Li; Corbin D Jones; Mark Yandell; David N Messina; Jarret Glasscock; Bob Goldstein
Journal: Proc Natl Acad Sci U S A Date: 2015-11-23 Impact factor: 11.205

4. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini.

Authors: Georgios Koutsovoulos; Sujai Kumar; Dominik R Laetsch; Lewis Stevens; Jennifer Daub; Claire Conlon; Habib Maroon; Fran Thomas; Aziz A Aboobaker; Mark Blaxter
Journal: Proc Natl Acad Sci U S A Date: 2016-03-24 Impact factor: 11.205

5. GOHTAM: a website for 'Genomic Origin of Horizontal Transfers, Alignment and Metagenomics'.

Authors: Sabine Ménigaud; Ludovic Mallet; Géraldine Picord; Cécile Churlaud; Alexandre Borrel; Patrick Deschavanne
Journal: Bioinformatics Date: 2012-03-15 Impact factor: 6.937

6. Unexpected cross-species contamination in genome sequencing projects.

Authors: Samier Merchant; Derrick E Wood; Steven L Salzberg
Journal: PeerJ Date: 2014-11-20 Impact factor: 2.984

7. The khmer software package: enabling efficient nucleotide sequence analysis.

Authors: Michael R Crusoe; Hussien F Alameldin; Sherine Awad; Elmar Boucher; Adam Caldwell; Reed Cartwright; Amanda Charbonneau; Bede Constantinides; Greg Edvenson; Scott Fay; Jacob Fenton; Thomas Fenzl; Jordan Fish; Leonor Garcia-Gutierrez; Phillip Garland; Jonathan Gluck; Iván González; Sarah Guermond; Jiarong Guo; Aditi Gupta; Joshua R Herr; Adina Howe; Alex Hyer; Andreas Härpfer; Luiz Irber; Rhys Kidd; David Lin; Justin Lippi; Tamer Mansour; Pamela McA'Nulty; Eric McDonald; Jessica Mizzi; Kevin D Murray; Joshua R Nahum; Kaben Nanlohy; Alexander Johan Nederbragt; Humberto Ortiz-Zuazaga; Jeramia Ory; Jason Pell; Charles Pepe-Ranney; Zachary N Russ; Erich Schwarz; Camille Scott; Josiah Seaman; Scott Sievert; Jared Simpson; Connor T Skennerton; James Spencer; Ramakrishnan Srinivasan; Daniel Standage; James A Stapleton; Susan R Steinman; Joe Stein; Benjamin Taylor; Will Trimble; Heather L Wiencko; Michael Wright; Brian Wyss; Qingpeng Zhang; En Zyme; C Titus Brown
Journal: F1000Res Date: 2015-09-25

8. Deciphering Genome Content and Evolutionary Relationships of Isolates from the Fungus Magnaporthe oryzae Attacking Different Host Plants.

Authors: Hélène Chiapello; Ludovic Mallet; Cyprien Guérin; Gabriela Aguileta; Joëlle Amselem; Thomas Kroj; Enrique Ortega-Abboud; Marc-Henri Lebrun; Bernard Henrissat; Annie Gendrault; François Rodolphe; Didier Tharreau; Elisabeth Fournier
Journal: Genome Biol Evol Date: 2015-10-09 Impact factor: 3.416

9. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots.

Authors: Sujai Kumar; Martin Jones; Georgios Koutsovoulos; Michael Clarke; Mark Blaxter
Journal: Front Genet Date: 2013-11-29 Impact factor: 4.599

10. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies.

Authors: Tom O Delmont; A Murat Eren
Journal: PeerJ Date: 2016-03-29 Impact factor: 2.984
