Literature DB >> 30044797

Consensus assessment of the contamination level of publicly available cyanobacterial genomes.

Luc Cornet^1,2, Loïc Meunier¹, Mick Van Vlierberghe¹, Raphaël R Léonard^1,3, Benoit Durieu⁴, Yannick Lara⁴, Agnieszka Misztak^1,5, Damien Sirjacobs¹, Emmanuelle J Javaux², Hervé Philippe⁶, Annick Wilmotte⁴, Denis Baurain¹.

Abstract

Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all the six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that >5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make mandatory the submission of raw read data along with genome assemblies in order to facilitate the detection of contaminants in sequence databases.

Entities: Chemical Disease Species

Mesh：

Substances：

Year: 2018 PMID： 30044797 PMCID： PMC6059444 DOI： 10.1371/journal.pone.0200323

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Publicly available genomes are the basic ingredient of numerous studies, from single-gene functional studies to multi-genome phylogenetic inferences. Their quality (e.g., completeness, structural and functional annotation, contamination level) is thus of primary importance. Completeness and annotation have attracted some attention [1-3] but, surprisingly, the issue of contaminating sequences has remained untackled at large scale (i.e., for a whole phylum), despite the well known evidence that contaminants are frequently introduced during experiments [4,5] or stem from natural associations and insufficient purification [6]. Overlooked contaminants may have major detrimental effects on biological conclusions [5]. For instance, the disappearance of two well-accepted monophyletic clades of charophycean green algae (Coleochaetales and Zygnematales) was initially reported [7], but later revealed to be due to cross-contaminating sequences in the dataset made of transcriptomes of green algae and plants [8]. Similarly, another phylogenomic study [9] wrongly concluded to a basal emergence of bilaterian animals, owing to a combination of contamination and taxonomic misidentification [10]. In practice, contaminants arise at different steps, from sampling to sequencing, and anywhere in between [5,11,12]. Yet, for many (microbial) organisms, an aggravating factor is the difficulty to obtain and keep axenic (i.e., pure) cultures [13], explaining why a number of sequenced microbial strains are not devoid of contaminants. While some tools can assess the technical quality of genome assemblies (e.g., QUAST, [2]), or their completeness in terms of gene content (e.g., BUSCO, [1], CheckM [14], ProDeGe [15], acdc [16]) or even their contamination level (e.g., CheckM, [14]), bioinformatic procedures are still needed to recognize and eliminate the contaminating sequences. As their efficiency ultimately depends on the quality and representativity of the reference databases used for taxonomic classification [3,17], the identification of the genome regions contaminated by foreign sequences (and their elimination) is an important endeavor, both to avoid wrong conclusions and to prevent such genomes from polluting the reference databases, which would lead to misclassifying newly obtained (e.g., metagenomic) sequences. Recognizing contaminant sequences in genome assemblies, whether prior or after public release, is not a trivial task. When such foreign sequences originate from expected sources (e.g., bacterial cloning vectors, human contamination), they are easy to detect and remove, for example by mapping raw sequencing reads against complete reference genomes of usual contaminant organisms, as part of read quality control before genome assembly (e.g., BBsplit available at https://sourceforge.net/projects/bbmap/). A complication arises with contamination sources closely related to the organism(s) of interest (e.g., human diversity studies or ancient DNA studies). These often require special precautions, both in the laboratory and in downstream analyses [18]. Similarly, parallel processing of multiple organisms, evolutionary related or not, is expected to result in cross-contamination events. Yet, in this case, new sequences can be attributed to their true source based on the comparison of their sequencing coverage across the different samples (Simion et al., 2018, “in press”). With organisms belonging to taxonomic groups for which (high-quality) representative genomes are already available, identifying the genuinely homologous sequences is easy. In contrast, sequences for which there exists no close counterpart in the corresponding reference genomes can be anything from divergent paralogues to new genes, possibly acquired horizontally, or even foreign regions from co-sequenced organisms. Finally, when both the genome under study and its potential contaminants belong to new or scarcely sampled groups, as in the context of large-scale phylogenomic studies, separating the wheat from the chaff can become very challenging [19]. Cyanobacteria, traditionally called blue-green algae, form a large and morphologically diverse group of bacteria [13], which are of primary interest in ecological, palaeobiogeology and evolutionary studies. Hence, the appearance of oxygenic photosynthesis in this phylum had a critical impact on early Earth, its evolution, and on the early biosphere by increasing the level of free oxygen in the ocean and atmosphere, creating new ecological niches [20-22]. Today, they colonize a wide range of illuminated ecosystems, from human-managed to extremophile, and from marine or freshwater habitats (as picoplankton or benthic mats) to hypersaline environments or hot springs, in polar, temperate and tropical regions, with the exception of acidic waters [23]. Usually, contaminants are organisms that live in close proximity to the sequenced organism (i.e., parasites, symbionts, epiphytes) [6,10] or that are simultaneously studied [6,8]. It is for instance well known that growth of picocyanobacteria (Synechococcus and Prochlorococcus) is improved by mutualistic interactions with heterotrophic organisms, which notably remove toxic reactive oxygen species from these phototroph/heterotroph systems [24-27]. Cyanobacteria are also known to excrete different compounds, including polysaccharides, proteins, nucleic acids, osmolytes, in their immediate environment [28,29]. Bacteria from other phyla, such as Bacteroidetes and Proteobacteria, feed on those extracellular productions and thus live in close relationship with Cyanobacteria [28,30,31]. Because of this trophic coupling, the majority of available cyanobacterial strains are not axenic, hence resulting in potential genome contamination. In this respect, we recently noticed that six genome assemblies (GCA_000472885.1, GCA_000817745.1, GCA_000817775.1, GCA_000817785.1, GCA_000817735.1, GCA_000828075.1) presented two or even three copies of many proteins usually encoded by a single gene, one being the likely genuine cyanobacterial protein and the other one(s) being closely related to non-cyanobacterial species (data not shown). As these observations prompted us to investigate the issue thoroughly, we designed a novel strategy, based on state-of-the-art tools, to evaluate the contamination level of publicly available genomes. Altogether, our analyses allowed us to establish a global ranking of 440 cyanobacterial genome assemblies, from the most contaminated to the least contaminated. These results suggest that genome producers should strive to upload genomic sequences devoid of contamination and that genome consumers should be cautious when analyzing such data in downstream studies. Moreover, we advocate the stance that raw sequencing reads should be released along with newly assembled genomes, so as to help genome scientists to evaluate the quality of primary data.

Results and discussion

Case study strategy developed for the assessment of the contamination level of cyanobacteria

To evaluate the contamination status of public cyanobacterial genome assemblies, we implemented a strategy based on combining evidence across multiple independent methods (Fig 1). The rationale was that, given the complexity of the task, no single method would be expected to reach both maximal sensitivity and specificity. Moreover, we were interested not only in quantifying the overall contamination level of cyanobacterial genomes, but also in recognizing and eliminating foreign regions or scaffolds, so as to generate “decontaminated” assemblies. Hence, we first considered a few reliable loci, i.e., genes that have an extremely low probability of horizontal gene transfer (HGT), such as SSU rRNA (16S) and ribosomal proteins. We used RNAmmer/SINA [32,33], an approach based on a reference SSU rRNA (16S) database, and “42”, a BLASTX-based program coupled with a reference ribosomal protein database to detect sequences in cyanobacterial genomes that are phylogenetically related to non-cyanobacterial organisms. Second, we used CheckM [14], a comprehensive package based on the phylogenetic labelling of lineage-specific marker genes, thereby allowing us to probe a larger part of the cyanobacterial assemblies. Third, we explored three genome-wide methods to maximize our detection power. We investigated their parameterization by comparing their results with those based on ribosomal genes, considered as the gold standard. Hence, we tested Kraken [34], a widely used metagenomic classifier based on long (signature) DNA kmers (21–31 nt), against a curated reference database derived from Ensembl Bacteria, and CONCOCT [35], a short DNA kmer (4–6 nt) metagenomic binning package that does not require a reference database, but designed to take advantage of sequencing coverage. As a third genome-wide method, we turned to DIAMOND blastx [36], using the same reference database as with Kraken. Altogether, our analyses allowed us to establish a global ranking of the 440 publicly available cyanobacterial genome assemblies, from the most contaminated to the least contaminated. We conclude that about 20 assemblies are highly contaminated by sequences from foreign phyla, whereas >200 additional assemblies are likely to be at least slightly contaminated. Finally, we provide download links to alternative versions of these assemblies, in which contaminating regions (or whole scaffolds) have been masked.

Fig 1

Graphical abstract of the study.

Properties of the cyanobacterial genome assemblies

We downloaded 440 cyanobacterial genome assemblies, altogether representing the eight orders defined in Komarek et al. [37], as well as the newly erected Gloeomargaritales [38]. These genomes correspond to 421 different organisms, the remaining 19 assemblies being updated versions of existing genomes or independent assemblies (or sequencing plus assembly) of the same strain (S1 Table). An overview of the morphology, the habitat and the taxonomy of the strains composing our dataset is given in S1 Fig. QUAST analyses allowed us to probe the size and the fragmentation level of the assemblies (Fig 2). While many unicellular Cyanobacteria have a genome size around 2 Mbp, filamentous, and especially heterocystous, Cyanobacteria are among the bacterial organisms featuring the largest genomes (Fig 2A). However, these analyses also revealed that 11 assemblies were <500 kbp, which strongly suggests that they do not correspond to complete genomes, whereas some others are suspiciously large for Cyanobacteria (5 assemblies >15 Mbp, including 2 >68 Mbp but with >90% “N” nucleotides). Regarding the fragmentation level, even if some genomes are assembled into a low number of scaffolds (e.g., one chromosome and a few plasmids), a large part of our dataset consists in genomes represented by more fragmented assemblies, with 57% of the genomes having >20 scaffolds (Fig 2B and S1 Table).

Fig 2

Overview of public cyanobacterial genome assemblies.

The 440 strains were classified into four morphologies (unicellular, filamentous, heterocystous, unknown). a. Distribution of genome sizes (total length in scaffolds >1000 nt). b. Distribution of the numbers of scaffolds (>1000 nt) by assembly. Note the logarithmic scale of the X axis. Details about strain habitats are available in S1 Fig.

Overview of public cyanobacterial genome assemblies.

Ribosomal genes as a first estimator of the contamination level

Since C. Woese and co-workers [39], the SSU rRNA (16S) gene is considered the “gold standard” taxonomic marker. Consequently, large, broadly-sampled and trustworthy reference databases have been available for a long time [40], as well as specialized software to predict (e.g., RNAmmer, [32]) and classify (e.g., SINA, [33]) rRNA genes of unknown origin with both high sensitivity and specificity. Strikingly, 85 assemblies of our dataset appeared devoid of any SSU rRNA (16S) sequence, including 6 of the 13 ultrasmall “genomes” (S1 Table). Moreover, 4 assemblies contained only unclassified sequences and 1 assembly only one non-cyanobacterial sequence. Among the 350 assemblies featuring at least one cyanobacterial sequence, 15 contained at least one sequence of non-cyanobacterial origin (7 assemblies had 1, 5 had 2, and 3 had 3). Albeit the presence of one or more foreign SSU rRNA (16S) gene(s) almost certainly reveals a contaminated genome assembly, it only represents a single gene [41], even if it can be found in multiple copies (up to four loci) in Cyanobacteria [42]. To increase the odds of identifying sequences of non-cyanobacterial sources, we turned to a reference database of ribosomal protein genes [43]. While the latter are much less sampled than SSU rRNA (16S) genes, they have the advantage of representing about 50 loci spread over about 10 operons [44]. However, they are not totally immune to HGT (e.g., rps14 [45]). That is why we first inferred phylogenetic trees for all alignments built from the database to identify and remove xenologous reference sequences. Mining of the cyanobacterial genome assemblies against ribosomal protein alignments with our own software “42” (which controls for orthology relationships; available at https://bitbucket.org/dbaurain/42/) showed that 21 assemblies contained at least one ribosomal protein gene of foreign origin (8 assemblies between 1 and 8 proteins, 9 assemblies between 16 and 41, and 4 assemblies between 56 and 80; S2 Table). As expected, almost all assemblies featuring at least one foreign SSU rRNA (16S) gene also showed at least one foreign ribosomal protein gene (14 out of 16 assemblies). This allowed us to classify assemblies into three categories, based on the number of ribosomal gene methods identifying contaminating sequences in each assembly: 2 (14 assemblies), 1 (9 assemblies) and 0 (417 assemblies).

Lineage-specific marker genes as an extended estimator of the contamination level

Even if ribosomal gene methods are sensitive, their power is limited because ribosomal genes represent a small fraction of a cyanobacterial genome (about 0.4–1.0%). Indeed, should ribosomal genes be partially or completely missing from the contaminating fraction of an assembly, it would lead to an underestimation of the contamination level. To mitigate this risk, we turned to CheckM [14], a two-step contamination detection tool based on lineage-specific marker genes, 104 bacterial genes and 150 archeal genes (both including ribosomal proteins, as with “42”). According to the classification used by Parks et al. [14], CheckM results indicated that 12 cyanobacterial genome assemblies were very highly contaminated (>15%), 2 highly contaminated (>10% to ≤15%), 7 moderately contaminated (>5% to ≤10%), 301 lowly contaminated (≤5%), whereas only 118 assemblies were not contaminated (= 0%). Interestingly, three assemblies (GCA_000472885.1, GCA_000828085.1, GCA_000817785.1) showed a level of contamination >100%, along with a completeness of 100%, thereby suggesting the presence of more than two contaminant organisms in each of them (multiple occurrences of the same markers; Table 1). All the 29 assemblies flagged as contaminated by at least one of the two ribosomal gene methods were all also tagged by CheckM. However, 4 assemblies (GCF_000828075.2, GCA_000341585.1, PRJNA165539, GCA_000775285.1) contaminated at >10% in terms of ribosomal proteins were only tagged as lowly contaminated by CheckM (3.41%, 1.96%, 0.88%, 0.84%, respectively). The reason for this discrepancy is likely that CheckM organizes marker genes into sets of collocated genes for estimating contamination. Indeed, genes that reside in close proximity to each other do not provide independent information regarding the overall level of contamination within a genome.

Table 1

Global ranking of cyanobacterial genome assemblies.

Genome assembly		Assembly propreties		Ranking results
Accession	Organism name	Scaffolds.ge.1000.nt	Total.length.ge.1000.nt	rRNA	rprot	CheckM	Kraken	DIAMOND	CONCOCT	rank.avg.6.na	rank.6.na
*+GCA_000472885.1	Mastigocoleus testarum BC008	974	15866152	66.67	62.5	200	31.84	35.15	44.27	8.83	1
GCF_001482745.1	Oscillatoriales cyanobacterium MTP1	68	7647882	50	40	80.53	44.93	32.99	50.41	10.25	2
GCA_000817745.1	Aphanocapsa montana BDHKU210001	296	11500044	50	28.17	32.76	25.16	21.4	47.55	13.42	3
GCA_000817785.1	Hassallia byssoidea VB512170	62	13096531	40	41.86	104.63	NA	20.35	38.95	16.10	4
GCA_000963755.2	Trichodesmium erythraeum 21–75	1320	7900996	50	53.33	85.48	6.49	17.75	49.8	16.67	5
GCF_000963755.1	Trichodesmium erythraeum 21–75	1320	7900996	50	53.33	85.48	6.49	17.76	49.13	16.83	6
GCA_001458455.1	Chrysosporum ovalisporum strain UAM-MAO	336	4815140	50	61.21	22.19	11.45	27.1	39.3	17.08	7
GCA_000817775.1	Lyngbya confervoides BDU141951	298	8799693	50	13.73	23.11	21.24	15.94	36.71	18.00	8
GCA_000817735.1	Scytonema millei VB511283	118	11627246	33.33	44.57	44.54	NA	18.07	34.98	19.00	9
GCA_000828085.1	Scytonema tolypothrichoides VB-61278	214	10008488	50	4.35	104.63	NA	5.51	25.69	32.90	10
°GCA_000634395.1	Prochlorococcus sp. scB245a_518D8	76	1282892	100	35.56	11.96	11.36	41.55	21.74	33.00	11
°GCF_001637395.1	Leptolyngbya valderiana BDU 20041	420	6991351	0	11.94	54.29	34.21	15.13	49.63	49.92	12
GCA_000828075.1	Tolypothrix campylonemoides VB511288	135	10627177	75	44.44	5.39	NA	12.95	14.19	51.00	13
°GCF_001637315.1	Phormidium willei BDU 130791	171	4600567	0	27.87	5.42	11.28	6.66	85.01	55.58	14
GCF_000828075.2	Tolypothrix campylonemoides VB511288	61	9468441	75	40.79	3.41	0	6.23	92.28	57.00	15
GCA_000341585.2	Prochlorothrix hollandica PCC 9006	10	5525469	75	4.35	4.39	6.45	3.75	10.91	58.58	16
°GCA_000934435.1	Mastigocladus laminosus UU774	174	8560182	0	38.46	5.53	12.63	11.15	16.43	81.42	17
GCF_000346485.2	Scytonema hofmannii PCC 7110	27	12284271	20	0	2.77	0.69	1.35	86.56	93.17	18
°-PRJNA165539	Cyanobacteria bacterium JGI 0000014-E08	59	327995	0	10	0.88	4.45	18.13	24.59	93.83	19
GCF_001904775.1	Phormidium tenue NIES-30	44	5821893	0	0	1.45	13.12	2.38	96.83	98.67	20
GCA_000341585.1	Prochlorothrix hollandica PCC 9006	1354	2669044	0	11.11	1.96	6.02	3.32	8.65	102.08	21

(*) indicates assemblies for which raw read data are in principle available for download from NCBI SRA;

(˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria;

(+) indicates assemblies that are too large (>15,000 kbp);

(-) indicates assemblies that are too small (<500 kbp).

(*) indicates assemblies for which raw read data are in principle available for download from NCBI SRA; (˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria; (+) indicates assemblies that are too large (>15,000 kbp); (-) indicates assemblies that are too small (<500 kbp).

Genome-wide estimation of the contamination level using long DNA kmers

To confirm marker-based results and improve our detection power, we looked for a genome-wide method and used a metagenomic software package relying on the classification of long (21–31 nt) signature DNA kmers, Kraken [34]. Kraken splits genomes into kmers that it organizes in a taxonomic tree that is then queried to classify raw sequencing reads to taxa of increasing ranks, depending on the conservation of their component kmers (see [34] for details). In this work, we “simulated” raw sequencing reads by splitting the cyanobacterial scaffolds into pseudo-reads of 250 nt and fed them to Kraken in order to analyze the taxonomic composition of each assembly. To minimize potential issues due to incomplete or aberrant genomes, we limited the genome-wide analyses reported in this section to the 343 assemblies that were not too small (≥500 kbp) nor too large (≤15,000 kbp) and that contained at least one cyanobacterial SSU rRNA (16S) gene(s). However, all 440 assemblies were eventually analyzed and the corresponding results are shown in S2 Table, in which those 97 atypical assemblies are denoted by various symbols. Three variables affect Kraken ability to classify sequences: the kmer size, the reference genome database, and the confidence parameter. To maximize the fraction of classified sequences (including those from evolutionarily isolated assemblies containing many unique kmers), we used a kmer size of 21 nt and built a curated database (27,762 genomes) from the release 30 of Ensembl Bacteria [46]. These important methodological choices are discussed in S1 Appendix (see also S2, S3A and S3B Figs). The confidence threshold is a parameter meant to adjust the trade-off between specificity and precision (in terms of taxonomic ranks). Because it has a large impact on Kraken behavior, fine-tuning this parameter is a crucial step. To this end, we compared Kraken results obtained on the 343 typical assemblies at three different confidence thresholds (0.02, 0.04 and 0.06) to the results obtained on the same assemblies with the ribosomal gene methods (categories 0, 1 and 2), here considered as the gold standard (Fig 3A). We derived a contamination level from Kraken results for each assembly by summing the classifications that corresponded to non-cyanobacterial sources. For some assemblies, a non-negligible pool of classified sequences were actually classified to the high-ranking “Bacteria” taxon (or to the lower “Terrabacteria” taxon). While these sequences were genuinely part of the total (100%), we did not include them in our contaminating fraction, nor in our cyanobacterial fraction and labelled them as “unknown”. For our purposes, lowering Kraken precision thus affects its sensitivity, since a reduced precision increases the share of these classified-yet-unknown sequences.

Fig 3

Comparison of genome-wide estimates of the contamination level with ribosomal gene results, considered as the gold standard.

Based on Fig 3A, and with the aim of minimizing the fraction of unclassified/unknown sequences (sensitivity) without wrongly tagging as contaminated too many apparently clean genomes (specificity), we chose to work with the 0.04 confidence threshold to analyze our dataset (S2 Table). Overall, Kraken analyses provided an independent confirmation that 137 cyanobacterial genome assemblies are contaminated at a level ≥1% (255 at a level >0%). However, a number of these assemblies appear to have contaminating fractions well above the median, despite the SSU rRNA (16S) and ribosomal protein analyses suggesting these genomes are free of contamination (outliers in Fig 3A). Contaminated fractions of the 343 cyanobacterial genome assemblies (expressed in %) were estimated with Kraken using three different confidence thresholds (a), or with DIAMOND blastx using three different hit number thresholds (b), or with CONCOCT using three different kmer sizes (c), then partitioned into three sets of assemblies, based on the number of ribosomal gene methods (SSU rRNA (16S) and ribosomal proteins) identifying contaminating sequences in each assembly (0, 1 or 2). Upper and lower whiskers extend from the hinge to the largest and lowest value no further than 1.5 * IQR from the hinge, respectively. Data points beyond the end of the whiskers are outliers and are plotted individually. The values given on top of each series in panels a and b are the median and IQR for the unclassified fractions across the 343 cyanobacterial assemblies, independently of the ribosomal category.

Genome-wide estimation of the contamination level using whole proteomes

Our thorough optimization of Kraken analyses led us to conclude that it is of limited use for our purpose, because it looks for known signature kmers. Given the fast evolutionary rate of nucleotide sequences, genomes of organisms not closely related to any reference genome cannot be classified. Moreover, combining all signature kmers into a single data structure makes it impossible to exclude self-matches when probing an assembly for contamination. This makes Kraken results uninformative for any genome used to build the reference database. To address this problem, we developed a detection protocol based on DIAMOND blastx [36], an ultrafast clone of the BLAST algorithm. In contrast to Kraken, DIAMOND blastx can classify organisms that are distant from all those of the reference database (protein sequences being much more conserved than nucleotide sequences), while still running fast (20,000 times faster than the NCBI BLAST+ implementation) [47]. Besides, it allows skipping self-matches when parsing its output for last-common ancestor (LCA) classification of pseudo-reads. We used the same curated database as for Kraken, except that reference sequences were conceptual translations of the genes instead of whole genomic sequences. Our LCA inference algorithm was inspired from MEGAN [48]. Using a tree-like collapsing of best hits taxonomy down to a bit score threshold expressed as a percentage of the highest bit score, it often allows labelling pseudo-reads based on more than a single best hit. The better sensitivity of protein searches improved the classification of distant assemblies (S3C Fig; slope = -243, Pearson r = -0.76, P-value = 7.92e-35). However, the unclassified fraction of the assemblies was larger with DIAMOND blastx (median = 14.5%, IQR = 14.6%) than with Kraken (median = 0.2%, IQR = 11.4%). This reduced power of DIAMOND blastx is due to the fact that protein-based searches only target coding sequences. Accordingly, the unclassified fraction fits well with the fraction of non-coding sequences in the assemblies (median = 15.9%, IQR = 7.3%). The potential of DIAMOND blastx was then evaluated exactly as for Kraken, using the 343 cyanobacterial genome assemblies and the ribosomal gene results as the gold standard. A striking difference was that the contamination levels estimated with DIAMOND blastx were much narrower, while still showing a good agreement with ribosomal gene categories (Fig 3B). Since the bit-score threshold did not make any obvious difference in our setup, we settled on a value of 5% of the best hit bit score, as in MEGAN (S2 Table). Even though DIAMOND blastx contamination levels were in line with Kraken levels across all ribosomal gene categories (compare medians of the middle series between panels a and b of Fig 3), all cyanobacterial genome assemblies appeared to contain a non-null fraction of foreign sequences with DIAMOND blastx. In contrast, 181 assemblies were devoid of any foreign sequence with Kraken, including the 170 untestable assemblies (NA in S2 Table) that are part of its reference database. As a contribution towards better publicly available genome assemblies, we provide download links to our curated reference database derived from Ensembl Bacteria 30, a post-processor script for DIAMOND blastx output (https://figshare.com/s/344be979f5f2237ad735), and “decontaminated” versions of the 440 cyanobacterial genome assemblies (based on DIAMOND blastx results) (https://figshare.com/s/08193cb71f0a65d99c69).

Genome-wide estimation of the contamination level using short DNA kmers

As explained in S1 Appendix, an issue with database-based methods, such as Kraken and DIAMOND blastx, is that assemblies from Cyanobacteria that are isolated from a phylogenetic point of view can be very difficult to investigate due to the proper lack of related reference genomes in the database. To estimate the contamination level while accounting for sequences originating from not yet sampled organisms, we resorted to a program that does not require a reference database: CONCOCT [35]. This software clusters assembly sequences into non-hierarchical groups based on a Principal Component Analysis (PCA) of short (4–6 nt) DNA kmer frequencies. We hypothesized that a high number of CONCOCT groups would hint to a mixture of several genomes into a single assembly. However, we failed to demonstrate any correlation between the number of CONCOCT groups based on kmer frequencies and the DIAMOND blastx contamination level (e.g., for kmer size = 4: Pearson r = -0.04, P-value = 0.31). By looking at detailed CONCOCT results, we noticed that many cyanobacterial genome assemblies had an overwhelming share of their genomic sequence included in the most abundant CONCOCT group (median = 94.45%, IQR = 15.28%). We thus confidently assumed that this group represented the genuine organism. For practical purposes, we further considered all the other groups as contaminants, which allowed us to compute contamination level estimates by summing the latter fractions. Of course, this is an approximation, because some minor groups might actually correspond to atypical regions of the genuine organism (e.g., repeated elements, plasmids). Nevertheless, when applying this simple metric to tetramer frequencies of the 343 typical assemblies, it performed well in comparison to ribosomal gene methods (Fig 3C), in spite of a non-negligible number of assemblies wrongly flagged as highly contaminated (outliers in Fig 3C). In contrast, it did not work with pentamers and hexamers, because the most abundant CONCOCT group at these kmer sizes often corresponds to a (much) lower share of the genomic sequences (median = 90.94%, IQR = 21.56% for kmer size = 5 and median = 1.89%, IQR = 62.76% for kmer size = 6).

Global ranking of cyanobacterial genome assemblies

As the three genome-wide methods had largely confirmed the results obtained with ribosomal genes and lineage-specific marker genes on a subset of 343 cyanobacterial genome assemblies, we decided to use the contamination levels estimated through all six methods to build a consensus global ranking of the 440 assemblies composing our dataset (S2 Table). Congruence between individual methods and with this global ranking was investigated using Spearman rank correlations (S3 Table). Ribosomal gene methods correlated weakly when considering either 440 (rho = 0.32/0.37) or 343 assemblies (rho = 0.34/0.35), but worked much better on the top-50 (rho = 0.74/0.79). This is hardly surprising given that, being less sensitive than genome-wide methods for their limited number of interrogated loci, they can only find contaminants in highly contaminated assemblies. On the opposite, the database-free CONCOCT correlated well on the 440 assemblies (rho = 0.63) and this also result held for the 343 assemblies (rho = 0.61). Yet, it did not when focusing on the top-50 (rho = 0.14), as expected from its reasonable but rough metric. The situation for Kraken was similar to that of CONCOCT, with rho = 0.59 (440), rho = 0.59 (343), rho = 0.32 (50), suggesting that in spite of the shortcomings of its monolithic reference database (see S1 Appendix), Kraken remains useful for real applications. CheckM was the method that correlated the best on average—rho = 0.75 (440), rho = 0.76 (343), rho = 0.73 (50)—followed by DIAMOND blastx—rho = 0.61 (440), rho = 0.68 (343), rho = 0.75 (50). Interestingly, even the best methods had their outliers (e.g., CheckM: GCA_001456025.1 at rank 83, Kraken: GCA_001039265.1 at rank 36, DIAMOND blastx: GCA_000484535.1 at rank 119). This seems unavoidable as pipelines involving several programs and databases are difficult to develop and test exhaustively [49]he contamination level is more valuable than relying on a single approach. Based on this global ranking, we defined three sets of assemblies. The top-21 (19 different genomes) contains assemblies tagged as contaminated by at least five of the six detection methods (except GCF_001904775.1, which is tagged by four methods) and thus corresponds to those with the highest level of contaminants. Three assemblies (GCA_000472885.1, GCA_000817785.1, GCA_000828085.1) even show CheckM contamination levels >100%, suggesting contamination by more than one organism. For these 21 assemblies, we explored the distribution of contaminants over the genomic scaffolds using DIAMOND blastx (S4 Fig). In many cases, it is very clear that at least two types of scaffolds co-exist, one corresponding to cyanobacterial segments (in green) and the other to contaminant segments (in red). Such a partition is often reflected in the GC-content of the scaffolds (e.g., GCA_000472885.1, GCA_000817745.1, GCF_001637395.1), and ribosomal gene (both SSU rRNA (16S) and ribosomal proteins) hits nearly always co-localize with the expected type of scaffold (e.g., GCA_000341585.2). This indicates that the different methods are congruent even within a single genome assembly and that their consensus faithfully reflects the underlying contamination level. After the top-21, a second part of the ranking (down to rank 230) contains assemblies that are only slightly contaminated. This is where discrepancies between individual methods are the most common (e.g., GCF_001456025.1 at rank 83, GCA_000291825.1 at rank 227). Finally, the last part of the ranking (210 assemblies) contains genomes that have a very low level of contaminants (generally <1% for both CheckM and DIAMOND blastx estimates).

Taxonomic analysis of contaminants

According to DIAMOND blastx, the two most frequent contaminant phyla in our top-21 ranking of contaminated assemblies are Proteobacteria and Bacteroidetes (Fig 4; see S5 Fig for a similar analysis for all 440 assemblies). Proteobacteria contaminate 19 assemblies of the top-21 at a level ≥1%, Bacteroidetes contaminate 5 assemblies, whereas Firmicutes and Chloroflexi contaminate 1 assembly. Other contaminants (Acidobacteria, Actinobacteria) are only found at <1% in cyanobacterial genome assemblies. The detection of Proteobacteria can be explained by the presence of these bacteria in cyanobacterial polysaccharide sheaths, which are a source of nutrition for them [28]. Such sheath colonization often hinders efforts to make cyanobacterial cultures axenic [13]. Hence, the complete genome of Blastomonas sp., a heterotrophic proteobacterium living in close association with Cyanobium sp., has been recently obtained from a non-axenic culture of the latter [29]. Regarding Bacteroidetes, members of this phylum can degrade polymeric compounds, and one isolate was hypothesized to feed on cyanobacterial biomass [50].

Fig 4

Taxonomic distribution of contaminating sequences in contaminated genomes, based on DIAMOND blastx estimates.

Fig 4 further shows that some assemblies have a high level of unclassified sequences (up to 58%). As aforementioned, with methods based on reference databases, the efficiency of classification heavily depends of the diversity of the database. It means that sequences from uncommon organisms, such as extremophiles, are difficult to classify because they are under-represented in reference databases. As an example, Cyanobacteria bacterium JGI 0000014-E08 (PRJNA165539; unpublished), which was collected in a lagoon with a reportedly unique microbial diversity (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA165539), displays 58.4% of unclassified sequences. Moreover, a part of the unclassified fraction of this genome is probably of cyanobacterial origin, since cyanobacterial ribosomal gene hits are located on scaffolds with a high level of unclassified segments (S4 Fig), which also highlights the usefulness of combining the results of multiple methods, especially those of genome-wide approaches that output detailed taxonomic reports (i.e., per scaffold). The 21 top-ranking contaminated cyanobacterial genome assemblies (see Table 1) were analyzed with DIAMOND blastx, as explained in the main text (using a LCA approach against a protein version of our curated Ensembl 30 database). Taxonomic classifications (expressed in % of the genomic sequence) are summarized at the phylum level. The “unclassified” classification corresponds to sequences that do not match any reference protein in the database, whereas “unknown” corresponds to high-ranking LCAs (Bacteria or Terrabacteria). “others” include the following phyla (in descending order of frequency): Spirochaetes, Deinococcus-Thermus, Candidatus Tectomicrobia, Verrucomicrobia, Nitrospirae, Tenericutes, Armatimonadetes, Fusobacteria, Thermotogae, Gemmatimonadetes, Synergistetes, Chlamydiae, Thaumarchaeota, Thermodesulfobacteria, Crenarchaeota, Chrysiogenetes, Dictyoglomi. Complete results for the 440 assemblies are available in S2 Fig, whereas an overview of the contaminant genera found in the 21 top-ranking assemblies in given in S5 Table.

Validation using the sequencing coverage

On average, contaminated genome assemblies have a higher number of scaffolds than non-contaminated assemblies, owing to both the decrease in sequencing coverage and the increase in assembly complexity associated with the mixing of several organisms in the same “genome” (Pearson r = 0.36, P-value = 2.71e-15). In some cases, however, chimerical scaffolds containing a mixture of cyanobacterial and non-cyanobacterial segments can be observed (e.g., GCF_001482745.1, GCA_000643395.1). Confirmed by detailed taxonomic analyses of the scaffolds (S4 Table), this result suggests that these scaffolds are probably the products of a too greedy concatenation of contigs during the finishing steps of genome sequencing. Ideally, such a bold interpretation should be supported by additional evidence, so as to exclude false positives due to methodological artifacts (e.g., reference database issues, lack of sensitivity of BLAST heuristics) and/or HGT. To distinguish between genuine contamination and other events, we tried to take advantage of the sequencing coverage, i.e., by monitoring the number of reads mapping at every position of an assembly. Indeed, contaminant DNA is expected to be less abundant than genuine DNA (or at least not in equal proportion), which should result into non-cyanobacterial scaffolds and/or sub-scaffolds having a lower (or at least different) coverage than their cyanobacterial counterparts [35,51]. In principle, the coverage along a genome can easily be obtained by remapping the raw sequencing reads to the corresponding assembly. Unfortunately, a careful search in NCBI SRA for raw sequencing reads corresponding to the 440 cyanobacterial genome assemblies revealed that such data files were only available for 83 (19%) assemblies, of which 63 (14%) were really usable for computing the sequencing coverage. In particular, coverage could be computed for only one (yet the most) highly contaminated assembly (GCA_000472885.1 at rank 1). This large “genome” (15.8 Mbp) is composed of two populations of scaffolds, each one characterized by a distinct value of GC-content (Fig 5A). As for most other assemblies of the top-21 (see S4 Fig), this parameter agrees with our classification of scaffolds as either cyanobacterial (in green) or contaminant (in red), whether using ribosomal genes or DIAMOND blastx for classification. Hence, scaffolds characterized by a high GC-content are those that likely originate from the contaminating organism, rather than from the supposedly sequenced cyanobacterium. As expected, differences in sequencing coverage correlate well with the split into two populations, cyanobacterial scaffolds having a median coverage of 68 (IQR = 9), whereas contaminant scaffolds drop to 26 (IQR = 13). In contrast, when running similar analyses on a non-contaminated cyanobacterial genome assembly (GCF_000214075.1, rank = 379), both the GC-content and the sequencing coverage remain largely uniform, except for plasmid segments visible at extremities of the GC-content range (Fig 5B). This confirms that this assembly only contains cyanobacterial scaffolds, as predicted without using the sequencing coverage.

Fig 5

Validation of our methods for detecting contaminants using the sequencing coverage.

For non-cyanobacterial segments in assemblies for which coverage cannot be obtained, it might be difficult to decide between a contamination (i.e., the experimentally-dependent introduction of a foreign DNA sequence in a genome assembly), a HGT event (i.e., the natural introduction of a foreign DNA fragment into a genome to be sequenced) or a methodological artefact (e.g., a chimerical or contaminated genome in the reference database). With respect to artefacts, we are quite confident that they should rarely affect assemblies in our top-21, because of the high level of congruence observed between the detection methods, including published approaches (e.g., CheckM) and in-house protocols based on robust algorithms (e.g., MEGAN-like LCA inference), using a large and curated reference database. Regarding HGT, it is assumed to be common in prokaryotes [52]. However, recent studies do not necessarily agree on the extent of the phenomenon (e.g., [53,54]), and most authors argue that HGT is more frequent among closely related organisms [55-58]. Hence, HGT has been shown to occur within Cyanobacteria [59-62], often at different rates [63,64], depending on the evolutionary constraints acting on the transferred genes, i.e., “stand-alone” functions vs operon-encoded functions [61,65-69]. In the context of the present study, aimed at assessing the contamination level of publicly available genomes, our opinion is that HGT should not be a major issue because we only consider as contaminants the segments that are clearly non-cyanobacterial (i.e., we do not address intra-cyanobacterial contamination) and because real HGT events are more likely to be found in the classified-yet-unknown fraction of the assemblies. Indeed, as HGT makes look highly similar sequences from organisms potentially very distant from a taxonomic point of view, LCA-based methods (Kraken, DIAMOND blastx) label them with high-ranking taxa, such as Bacteria, that our post-processors then convert to “unknown”. Nevertheless, studies specifically designed to assess the extent of HGT in existing genomes should probably double-check that some of their results are not the product of contamination rather than transfer, whether in reference genomes or in genome assemblies under investigation. Two representative cyanobacterial genome assemblies, the highly contaminated GCA_000472885.1 (A) and the non-contaminated GCF_000214075.1 (B), were split into segments of 10,000 nt for detailed analysis. On the X axis, segment-containing scaffolds were sorted by increasing average GC-content, with vertical thin lines representing scaffold boundaries. Top panels: segments were classified as either cyanobacterial (green), contaminated (red), unknown (grey) or unclassified (white), using DIAMOND blastx. The overlaid light blue curve is the GC-content of the corresponding segments. Open symbols give the locations of the detected ribosomal genes (circles: SSU rRNA (16S), squares: ribosomal proteins), either from cyanobacterial sources (green) or from non-cyanobacterial sources (red). Bottom panels: the light red curve gives the sequencing coverage of the segments.

Conclusion

Among the 440 cyanobacterial genome assemblies surveyed in this study, we concluded that 21 were highly contaminated (corresponding to 19 different genomes). Proteobacteria and Bacteroidetes are the most common contaminants, probably due to their close proximity with some Cyanobacteria in the environment, and maybe because of trophic interactions complicating the generation of axenic strains [28]. The six genomes that initially raised our attention, and motivated us to design a consensus pipeline for detecting contaminants, all occupy the top-tier of our global ranking (GCA_000472885.1, rank 1; GCA_000817745.1, rank 3; GCA_000817785.1, rank 6 GCA_000817775.1, rank 8; GCA_000817735.1, rank 9; GCA_000828075.1, rank 13). Moreover, 10 of the top-21 contaminated assemblies discovered in this study have now been removed from RefSeq, or have at least been flagged as suspicious in their metadata by NCBI curators, suggesting that our ranking procedure is accurate. The vast majority of publicly available cyanobacterial genomes are only slightly contaminated (209 assemblies) or likely non-contaminated (210 assemblies). However, the impact of an error (in particular, a contamination) can be very detrimental [3]. For instance, researchers may infer that a particular enzyme is present in some cyanobacterium, whereas in reality it is absent from the phylum, or that some Cyanobacteria are much more prone to HGT events than others. As researchers, we are more interested by exceptions than by those that play along the rules. Hence, finding an orthologous gene in >400 cyanobacterial genomes attracts less attention (and impact) than a gene specific to Bacteroidetes in a handful of Cyanobacteria. In other words, we are much more prone to ingenuously study erroneous than correct data, thereby amplifying small errors and polluting the scientific literature with incorrect statements resulting from the analysis of contaminated genomes. It is therefore of prime importance to eliminate as many errors as possible from genomic sequences that now constitute a mainstay of numerous research domains, and this is especially important for newly generated genomes making their way up to reference databases used for taxonomic classification. To this end, releasing the raw reads at the same time as the assembled genome should become mandatory. The availability of such data would allow others not only to repeat (or improve) the assembly process, but also spot obvious contaminants based on discrepancies in sequencing coverage. However, such a policy would likely be insufficient, because one cannot expect every genomicist to look for contaminants in tens of thousand of bacterial genomes by downloading and re-mapping raw reads on the corresponding assemblies. To ease the task of the large scientific community using genomic data, the sequencing coverage of each position, as well as other key statistics obtained during the assembly process (e.g., the number of non-mapping reads, the number of incongruently mapping paired ends) should be published along with the genome assembly. This would allow researchers to discard from their homology searches all the regions with a low coverage (probable contaminants) or with a high coverage (probable repeated elements). Technically, making available the .bam files in addition to the raw read files would provide most of the information needed by others. Implementing this constraint is then simply a matter of editorial policy, exactly as when, decades ago, Molecular Biology and Evolution decided to reject any manuscript featuring a phylogenetic tree devoid of statistical support (e.g., bootstrap). In contrast, factoring genomic meta-data (sequencing coverage, paired-end incongruence, etc.) into homology searches is likely to be more difficult. Indeed, it will require devising a standard format and modifying numerous software packages to take advantage of it, but it is certainly worth the effort if we want to make the most of genomic data.

Materials and methods

Cyanobacterial genome assemblies

Cyanobacterial genome assemblies have been collected from four databases using a custom Perl script. The latter uses the assembly number to query public sequence databases according to a predetermined order, so that the genome redundancy across banks is handled consistently. Genomes were first collected from the release 30 of Ensembl Bacteria [46], then from the NCBI (with a priority on RefSeq [70], then from PATRIC [71], and finally from the bacterial part of the JGI genome portal [72]). A first batch of downloads was carried out on the 14th of April 2016 and a second batch on the 11th of December 2016. Assemblies were first analyzed using QUAST 2.3 [2] with default values. The number of scaffolds, the number of scaffolds >1000 nt, the total length of the genomes, the total length of the genomes based on scaffolds >1000 nt, the length of the largest scaffold, the GC-content, the N50/N75 values, the L50/L75 values and the number of N’s per 100 kbp were collected and summarized in S1 Table, along with the morphology, the habitat and the taxonomy of each corresponding strain.

SSU rRNA (16S) analyses

SSU rRNA (16S) genes in cyanobacterial genome assemblies were predicted with RNAmmer 1.2, a rRNA predictor using hidden Markov models [32]. The putative extracted SSU rRNA sequences were taxonomically classified by SINA 1.2.11, a multiple sequence alignment and classifier tool (associated with SILVA rRNA database) [33]. SINA was used in combination with the non-redundant and curated release 123.1 of SILVA, which contains 645,151 SSU rRNA (16S) sequences for the classification [40]. A rough estimate of the contamination level in each assembly was computed by taking the ratio of the number of SSU rRNA (16S) genes classified to foreign (i.e., non-cyanobacterial) Bacteria to the total number of predicted SSU rRNA (16S) genes. The distance matrix used for benchmarking reference databases was computed by aligning the SSU rRNA (16S) sequences with MAFFT 7.273 [73], then selecting the conserved positions of the alignment with BMGE 1.12 [74] and finally calculating the distance matrix with IQPNNI 3.3.2, a tree-reconstruction algorithm based on maximum likelihood (ML) and used here with the GTR+Γ4 model [75].

Ribosomal protein analyses

The contamination level estimates of cyanobacterial genome assemblies were refined using a broadly-sampled dataset of 52 multiple sequence alignments (MSAs) of ribosomal proteins aligned with MAFFT. The dataset was assembled from a ribosomal protein database for prokaryotes, RiboDB 1.4.0 [43], available at https://ribodb.univ-lyon1.fr/. These protein sequences were used as a reference to detect and classify contaminating sequences with “42”, a program whose aim is to add (and to optionally align) sequences to a pre-existing MSA while controlling for orthology relationships (Baurain et al., to be published elsewhere; https://bitbucket.org/dbaurain/42/). Here, we used “42”'s ability to spot contaminants by attempting to add ribosomal proteins from each of our 440 cyanobacterial genomes to each MSA of our dataset of ribosomal MSAs. In practice, “42” searches each genome for sequences that are homologous to sequences of the current MSA and then, through advanced heuristics, sort out orthologues from possible paralogues. Each orthologous sequence is further classified by computing the LCA of the three sequences of the MSA that are the most similar to the newly added orthologue, with a minimum identity threshold of 70% and a minimum alignment length threshold of 30 amino acids. By default, we expect that the inferred LCA is in agreement with the taxonomy of the genome from which comes the orthologue, i.e., of cyanobacterial origin. If not, the orthologue is considered as a contaminant of the genome assembly (or remains unclassified if not matching anything close enough in the reference database). We used RAxML trees inferred under the LG+F model [76] and the CAT approximation to check whether reference Cyanobacteria were monophyletic for all 52 ribosomal proteins. When automated parsing of the bipartitions suggested that it was not the case, visual inspection of the ML tree allowed us to decide between HGT (6 genes) and sequence divergence/reconstruction error (3 genes) as the cause for the non-monophyly. Before proceeding, a few xenologous sequences were thus removed from 6 MSAs (L12, L31, L36, S16, S21, uS4). Moreover, 6 MSAs (L10, L25, L28, L29, L32, S21) were completely discarded due to poor performance of “42” (i.e., no or too few added sequences, orthology assessment issues, for further details, see S6 Table). The remaining 46 MSAs represented 178,634 sequences sampled across a variety of 3474 different bacterial and archaeal strains in total. They were used to estimate the contamination levels as with SSU rRNA (16S) genes, but by summing over all individual ribosomal proteins found in each cyanobacterial genome assembly, i.e., ignoring their possible collocation on the chromosome.

CheckM analyses

We analysed the 440 cyanobacterial genome assemblies with CheckM 1.0.7. CheckM uses a concatenation of predicted ribosomal proteins to automatically place the assembly under study in a reference genome tree. Then, it estimates the completeness and the contamination level of the assembly by searching the sequence for lineage-specific marker genes provided with the software (see [14]). We used the typical automatical workflow option lineage_wf. The contamination percentage extracted from the output was directly considered as the contamination level estimate.

Kraken analyses

Kraken 0.10.5-beta is a program that assigns taxonomic labels to genomic DNA sequences [34]. It uses exact (signature) kmer (21, 25, 31 nt) alignment to associate a taxonomic tree with a kmer database that it builds itself according to a default or user-specified genome set [24]. We used a newly created curated database derived from Ensembl Bacteria 30 (see S1 Appendix for details) with three different confidence thresholds (0.02, 0.04 and 0.06) to analyze cyanobacterial pseudo-reads. Kraken labels were further post-processed to classify pseudo-reads as either cyanobacterial, contaminating (non-cyanobacterial), unknown (too high-ranking LCA, such as “Bacteria” or “Terrabacteria”) or unclassified (no hit in the database). Finally, a global contamination level was computed for each assembly by taking the ratio of the number of non-cyanobacterial pseudo-reads to the total number of pseudo-reads.

DIAMOND blastx analyses

To check the effect of using proteins instead of whole nucleotide genomes on detection sensitivity, we repeated most Kraken analyses with DIAMOND 0.8.22.84 blastx [36]. For this, we built a DIAMOND database from the corresponding 27,762 complete proteomes of our curated Ensembl 30 genome set. Cyanobacterial genome assemblies, still split into pseudo-reads of 250 nt, were BLASTed against this protein database. Each pseudo-read was then labelled by computing the LCA of its best hits (excluding self-matches), provided they had a bit-score ≥80 and within 95% of the bit-score of the first hit (MEGAN-like algorithm [48]). We tested three different thresholds for the minimal number of hits required for LCA inference (1, 2 or 3) and chose to use a minimal number of 1 hit. From there, pseudo-read classification and contamination level estimation were carried out exactly as for Kraken. Decontaminated cyanobacterial genome assemblies, in which regions corresponding to contaminating pseudo-reads have been removed (or only masked with N’s), are available for download at https://figshare.com/s/08193cb71f0a65d99c69. For S5 Table, the MEGAN-like algorithm was run using a bit-score threshold of 99% (instead of 95%).

CONCOCT analyses

We also analyzed cyanobacterial genome assemblies with CONCOCT 0.4.1 [35] using three default values for kmer size (4, 5 and 6). For each assembly, the script cut_up_fasta.py (provided with CONCOCT) was used to cut the genome sequences into non-overlapping segments of 10,000 nt. Dummy coverage files with a value of 1 for every segment were also created. CONCOCT was used on the resulting files with default values: a length threshold of 1000 nt for the segments and a number of groups of 400 for the clustering based on the principal component analysis (PCA) of kmer frequencies. For each assembly, we computed the total number of segments for the whole assembly and counted the number of segments associated with every group determined by CONCOCT. Then, we selected the group containing the largest number of segments and considered it to be the “core” genome without any contamination or recent HGT. To estimate the contamination level of an assembly, we used the simple following formula: one minus the ratio between the number of segments belonging to the largest group and the total number of segments. Importantly, CONCOCT was designed for metagenomic analyses and requires coverage data to realize a meaningful clustering of sequencing reads after the PCA [35]. This is also the case for most other programs created for metagenomics, such as MetaBat [51], VizBin [77], GroopM [78], MaxBin [79] and the older CompostBin [80]. Unfortunately, sequencing coverage could be obtained for only 63 assemblies (see Validation using the sequencing coverage for details). Yet, when we reran CONCOCT using coverage files, we did not observe any large difference in clustering with respect to the results obtained without coverage (data not shown). However, none of these 63 assemblies (including 7 from the 97 atypical assemblies), except Mastigocoleus testarum BC008 (GCA_000472885.1, unpublished), were highly contaminated (DIAMOND blastx level <2%). Thus, the use of real coverage values might have had little effect on the clustering results. This interpretation is strengthened by the observation that real coverage values were indeed relatively constant within each of the 62 non-contaminated assemblies, as assessed through the corresponding quartile coefficients of dispersion (median = 0.055, IQR = 0.053).

Global ranking

Since all methods were in general agreement, we decided to retain all of them in our final global ranking of cyanobacterial genome assemblies. This consensus ranking was computed by averaging the ranks of each assembly in the individual rankings obtained for each of the six methods (S2 Table). For the 170 assemblies included in the reference genome database, Kraken results were considered as missing values (NA) when taking the rank average. Spearman rank correlations were computed for all methods and the global ranking, considering all 440 assemblies, only the 343 typical assemblies, or only the 50 top-ranking assemblies out of the 440 (S3 Table), using the R option pairwise.complete.obs to handle missing values. To confirm that foreign sequences in contaminated cyanobacterial genomes were genuine contaminants, we checked that their sequencing coverage was different from the coverage of cyanobacterial sequences. Coverage was estimated using BBMap (http://bbmap.sourceforge.net/) from raw read data (when available). However, this was possible for only 63 of the 83 available SRA files: 12 genomes had avf_fold values <5 and 8 genomes returned no result at all. Among the 63 assemblies with coverage, only one (GCA_000472885.1) was part of our top-21 ranking, all other assemblies corresponding to slightly or non-contaminated genomes. For each assembly, sequencing coverage, GC-content (computed with a custom Perl script) and taxonomic classification (cyanobacterial, contaminated, unknown and unclassified sequence fractions, as determined by DIAMOND blastx) were computed on 10,000-nt segments (as for CONCOCT) and used for generating combined genome maps. In the corresponding plots, scaffolds are arranged by GC-content, which allows highlighting the covariation of the latter with contamination level and sequencing coverage. As an additional congruence test, the positions of the SSU rRNA (16S) and ribosomal protein genes classified to Cyanobacteria and/or to contaminant organisms were also plotted below their corresponding scaffolds.

Optimization of Kraken database.

(PDF) Click here for additional data file. The 440 strains were classified into the eight orders defined in Komarek et al. 2014 [Syn: Synechococcales, Nos: Nostocales, Osc: Oscillatoriales, Chc: Chroococcales, Pleu: Pleurocapsales, Spi: Spirulinales, Glb: Gloeobacterales, Chd: Chroococcidiopsidales (and Glm: Gloeoemargaritales)], and further broken into either four morphologies (left panel: unicellular, filamentous, heterocystous, unknown) or six habitats (right panel: seawater, freshwater, terrestrial, extremophile, symbiotic and unknown). Horizontal axes are in number of assemblies. (PDF) Click here for additional data file.

Sensitivity of Kraken according to kmer size and confidence threshold.

The classified fractions of the 440 cyanobacterial genomes (expressed in %) are summarized as one box-and-whiskers plot for every combination of Kraken parameters. Boxes correspond to interquartile ranges (IQR = Q3–Q1), whereas medians (Q2) are shown as thick horizontal black lines. Upper and lower whiskers extend from the hinge to the largest and lowest value no further than 1.5 * IQR from the hinge, respectively. Data points beyond the end of the whiskers are outliers and are plotted individually. (PDF) Click here for additional data file.

Sensitivity of Kraken and DIAMOND blastx as a function of the distance to reference genomes using two different databases.

For each of the 343 cyanobacterial genome assemblies, the classified fraction using Kraken (at a kmer size of 21, expressed in %) (a,b) or DIAMOND blastx (c) is plotted against its evolutionary distance (expressed in substitutions per site) to the most closely related genome in the default Kraken database (a) or in the curated Ensembl 30 database (b,c). Distances were estimated by predicting and comparing SSU rRNA (16S) genes under the GTR+Γ4 model, and the minimum distance was selected for every genome of our dataset. r is the Pearson correlation coefficient between the two variables. (PDF) Click here for additional data file.

Validation of our methods for detecting contaminants.

The top-21 ranking assemblies were split into segments of 10,000 nt for detailed analysis. On the X axis, segment-containing scaffolds were sorted by increasing average GC-content, with vertical thin lines representing scaffold boundaries. Segments were classified as either cyanobacterial (green), contaminated (red), unknown (grey) or unclassified (white), using DIAMOND blastx. The overlaid light blue curve is the GC-content of the corresponding segments. Open symbols give the locations of the detected ribosomal genes (circles: SSU rRNA (16S), squares: ribosomal proteins), either from cyanobacterial sources (green) or from non-cyanobacterial sources (red). (PDF) Click here for additional data file.

Taxonomic distribution of contaminating sequences in all cyanobacterial genome assemblies, based on DIAMOND blastx estimates.

The 440 assemblies (in the global ranking order of S2 Table) were analyzed with DIAMOND blastx, as explained in the main text (using a LCA approach against a protein version of our curated Ensembl 30 database). Taxonomic classifications (expressed in % of the genomic sequence) are summarized at the phylum level. The “unclassified” classification corresponds to sequences that do not match any reference protein in the database, whereas “unknown” corresponds to high-ranking LCAs (Bacteria or Terrabacteria). “others” include the following phyla (in descending order of frequency): Euryarchaeota, Nitrospinae, Deinococcus-Thermus, Candidatus Tectomicrobia, Verrucomicrobia, Spirochaetes, Nitrospirae, Armatimonadetes, Aquificae, Synergistetes, Gemmatimonadetes, Chlamydiae, Thermotogae, Fusobacteria, Deferribacteres, Thaumarchaeota, Thermodesulfobacteria, Tenericutes, Chrysiogenetes, Crenarchaeota, Lentisphaerae, Caldiserica, Dictyoglomi. (PDF) Click here for additional data file.

List of the public cyanobacterial genome assemblies used in this study.

The table gives the accession, the name (binomial incl. strain), the taxonomy (order, NCBI lineage), the ecology (morphology and habitat), and the assembly properties (as determined by QUAST) of the 440 assemblies surveyed. Assemblies are sorted by NCBI lineage. (*) indicates assemblies for which raw read data are in principle available for download from NCBI SRA; (˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria; (+) indicates assemblies that are too large (>15,000 kbp); (-) indicates assemblies that are too small (<500 kbp). (XLSX) Click here for additional data file.

Global ranking of the cyanobacterial genome assemblies.

The table gives the accession, the name (binomial incl. strain), the main assembly properties (as determined by QUAST), and the ranking results of the 440 assemblies surveyed. Assemblies are sorted from the most contaminated to the less contaminated. (*) indicates assemblies for which raw read data are in principle available for download from NCBI SRA; (˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria; (+) indicates assemblies that are too large (>15,000 kbp); (-) indicates assemblies that are too small (<500 kbp). Assemblies included in the reference database have their Kraken contamination level set to missing data (NA). (XLSX) Click here for additional data file.

Correlation of all six methods with the global ranking.

Spearman rank correlations were computed for all six methods and the global ranking, considering either all 440 assemblies, or only the 343 typical assemblies, or only the 50 top-ranking assemblies out of the 440. (XLSX) Click here for additional data file.

Taxonomic analysis of the scaffolds of the assemblies in the top-21 ranking.

The table gives the DIAMOND blastx taxonomic classifications (in %, at the phylum level) of each scaffold of the top-21 assemblies (one sheet per assembly). For each scaffold, the GC-content (in %) and the scaffold length (in nt) are also reported. (XLSX) Click here for additional data file.

Taxonomic overview of the contaminant genera of the assemblies in the top-21 ranking.

Only contaminants that could be identified at least at the genus level are considered. Redundant assemblies were only analyzed once. Absolute counts are in pseudo-reads, whereas relative counts are expressed with respect to the total number of pseudo-reads indentified at least at the genus level. (XLSX) Click here for additional data file.

Summary of the optimization of the “42” analyses.

The table gives the number of ribosomal proteins added to each MSA using different combinations of “42” parameters. The final combination is highlighted on a red background. (XLSX) Click here for additional data file.

72 in total

1. Multigene phylogeny of the green lineage reveals the origin and diversification of land plants.

Authors: Cédric Finet; Ruth E Timme; Charles F Delwiche; Ferdinand Marlétaz
Journal: Curr Biol Date: 2010-12-09 Impact factor: 10.834

2. The Paleoproterozoic snowball Earth: a climate disaster triggered by the evolution of oxygenic photosynthesis.

Authors: Robert E Kopp; Joseph L Kirschvink; Isaac A Hilburn; Cody Z Nash
Journal: Proc Natl Acad Sci U S A Date: 2005-08-01 Impact factor: 11.205

3. Pathway evolution by horizontal transfer and positive selection is accommodated by relaxed negative selection upon upstream pathway genes in purple bacterial carotenoid biosynthesis.

Authors: Jonathan L Klassen
Journal: J Bacteriol Date: 2009-10-09 Impact factor: 3.490

4. The plastid ancestor originated among one of the major cyanobacterial lineages.

Authors: Jesús A G Ochoa de Alda; Rocío Esteban; María Luz Diago; Jean Houmard
Journal: Nat Commun Date: 2014-09-15 Impact factor: 14.919

5. An Early-Branching Freshwater Cyanobacterium at the Origin of Plastids.

Authors: Rafael I Ponce-Toledo; Philippe Deschamps; Purificación López-García; Yvan Zivanovic; Karim Benzerara; David Moreira
Journal: Curr Biol Date: 2017-01-26 Impact factor: 10.834

6. Domain enhanced lookup time accelerated BLAST.

Authors: Grzegorz M Boratyn; Alejandro A Schäffer; Richa Agarwala; Stephen F Altschul; David J Lipman; Thomas L Madden
Journal: Biol Direct Date: 2012-04-17 Impact factor: 4.540

7. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes.

Authors: Elmar Pruesse; Jörg Peplies; Frank Oliver Glöckner
Journal: Bioinformatics Date: 2012-05-03 Impact factor: 6.937

8. Fermentation couples Chloroflexi and sulfate-reducing bacteria to Cyanobacteria in hypersaline microbial mats.

Authors: Jackson Z Lee; Luke C Burow; Dagmar Woebken; R Craig Everroad; Mike D Kubo; Alfred M Spormann; Peter K Weber; Jennifer Pett-Ridge; Brad M Bebout; Tori M Hoehler
Journal: Front Microbiol Date: 2014-02-26 Impact factor: 5.640

9. GroopM: an automated tool for the recovery of population genomes from related metagenomes.

Authors: Michael Imelfort; Donovan Parks; Ben J Woodcroft; Paul Dennis; Philip Hugenholtz; Gene W Tyson
Journal: PeerJ Date: 2014-09-30 Impact factor: 2.984

10. The Ensembl gene annotation system.

Authors: Bronwen L Aken; Sarah Ayling; Daniel Barrell; Laura Clarke; Valery Curwen; Susan Fairley; Julio Fernandez Banet; Konstantinos Billis; Carlos García Girón; Thibaut Hourlier; Kevin Howe; Andreas Kähäri; Felix Kokocinski; Fergal J Martin; Daniel N Murphy; Rishi Nag; Magali Ruffier; Michael Schuster; Y Amy Tang; Jan-Hinnerk Vogel; Simon White; Amonida Zadissa; Paul Flicek; Stephen M J Searle
Journal: Database (Oxford) Date: 2016-06-23 Impact factor: 3.451

13 in total

1. Molecular characteristics of global β-lactamase-producing Enterobacter cloacae by genomic analysis.

Authors: Jincao Hu; Jia Li; Chang Liu; Yan Zhang; Hui Xie; Chuchu Li; Han Shen; Xiaoli Cao
Journal: BMC Microbiol Date: 2022-10-21 Impact factor: 4.465

2. Metagenomic assembly of new (sub)polar Cyanobacteria and their associated microbiome from non-axenic cultures.

Authors: Luc Cornet; Amandine R Bertrand; Marc Hanikenne; Emmanuelle J Javaux; Annick Wilmotte; Denis Baurain
Journal: Microb Genom Date: 2018-08-23

3. ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data.

Authors: Andrew J Low; Catherine D Carrillo; Adam G Koziol; Paul A Manninger; Burton Blais
Journal: PeerJ Date: 2019-05-31 Impact factor: 2.984

Review 4. Cyanobacteria evolution: Insight from the fossil record.

Authors: Catherine F Demoulin; Yannick J Lara; Luc Cornet; Camille François; Denis Baurain; Annick Wilmotte; Emmanuelle J Javaux
Journal: Free Radic Biol Med Date: 2019-05-09 Impact factor: 7.376

5. Obtaining Genome Sequences of Mutualistic Bacteria in Single Microcystis Colonies.

Authors: Jing Tu; Liang Chen; Shen Gao; Junyi Zhang; Changwei Bi; Yuhan Tao; Na Lu; Zuhong Lu
Journal: Int J Mol Sci Date: 2019-10-11 Impact factor: 5.923

6. Filling the Gaps in the Cyanobacterial Tree of Life-Metagenome Analysis of Stigonema ocellatum DSM 106950, Chlorogloea purpurea SAG 13.99 and Gomphosphaeria aponina DSM 107014.

Authors: Pia Marter; Sixing Huang; Henner Brinkmann; Silke Pradella; Michael Jarek; Manfred Rohde; Boyke Bunk; Jörn Petersen
Journal: Genes (Basel) Date: 2021-03-09 Impact factor: 4.096

7. ORPER: A Workflow for Constrained SSU rRNA Phylogenies.

Authors: Luc Cornet; Anne-Catherine Ahn; Annick Wilmotte; Denis Baurain
Journal: Genes (Basel) Date: 2021-10-29 Impact factor: 4.096

8. De Novo Transcriptome Meta-Assembly of the Mixotrophic Freshwater Microalga Euglena gracilis.

Authors: Javier Cordoba; Emilie Perez; Mick Van Vlierberghe; Amandine R Bertrand; Valérian Lupo; Pierre Cardol; Denis Baurain
Journal: Genes (Basel) Date: 2021-05-29 Impact factor: 4.096

9. Day and Night: Metabolic Profiles and Evolutionary Relationships of Six Axenic Non-Marine Cyanobacteria.

Authors: Sabine Eva Will; Petra Henke; Christian Boedeker; Sixing Huang; Henner Brinkmann; Manfred Rohde; Michael Jarek; Thomas Friedl; Steph Seufert; Martin Schumacher; Jörg Overmann; Meina Neumann-Schaal; Jörn Petersen
Journal: Genome Biol Evol Date: 2019-01-01 Impact factor: 3.416

10. Prevalence and Implications of Contamination in Public Genomic Resources: A Case Study of 43 Reference Arthropod Assemblies.

Authors: Clementine M Francois; Faustine Durand; Emeric Figuet; Nicolas Galtier
Journal: G3 (Bethesda) Date: 2020-02-06 Impact factor: 3.154