Luc Cornet1,2, Loïc Meunier1, Mick Van Vlierberghe1, Raphaël R Léonard1,3, Benoit Durieu4, Yannick Lara4, Agnieszka Misztak1,5, Damien Sirjacobs1, Emmanuelle J Javaux2, Hervé Philippe6, Annick Wilmotte4, Denis Baurain1.
Abstract
Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that >5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make the submission of raw read data mandatory along with genome assemblies in order to facilitate the detection of contaminants in sequence databases.
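The abstract states that CONCOCT relies on differences in DNA tetramer frequencies rather than on a reference database. As a rough illustration of what a tetramer composition profile is, the following minimal sketch computes the relative frequency of each of the 256 possible 4-mers in a sequence and a simple distance between two profiles. It is not CONCOCT's code; the function names `tetramer_frequencies` and `profile_distance` are hypothetical, and the choice of Euclidean distance and of skipping windows containing ambiguous bases are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def tetramer_frequencies(seq: str) -> dict[str, float]:
    """Relative frequency of each of the 256 DNA 4-mers in `seq`.

    Assumption for this sketch: windows containing anything other than
    A, C, G or T (e.g. N) are simply ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return {k: 0.0 for k in kmers}
    return {k: counts[k] / total for k in kmers}


def profile_distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance between two tetramer profiles (illustrative choice)."""
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5


if __name__ == "__main__":
    host = tetramer_frequencies("ACGTACGTACGTACGT")
    other = tetramer_frequencies("GGGGCCCCGGGGCCCC")
    print(profile_distance(host, other))
```

Scaffolds from a foreign organism would be expected to show a profile that sits far from the bulk of the host scaffolds, which is the intuition behind composition-based binning.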
Year: 2018 PMID: 30044797 PMCID: PMC6059444 DOI: 10.1371/journal.pone.0200323
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Graphical abstract of the study.
Fig 2. Overview of public cyanobacterial genome assemblies.
The 440 strains were classified into four morphologies (unicellular, filamentous, heterocystous, unknown). a. Distribution of genome sizes (total length in scaffolds >1000 nt). b. Distribution of the numbers of scaffolds (>1000 nt) by assembly. Note the logarithmic scale of the X axis. Details about strain habitats are available in S1 Fig.
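The two quantities plotted in Fig 2 (number of scaffolds and total length, both restricted to scaffolds above a length cutoff) can be derived directly from an assembly FASTA file. The sketch below is only an illustration under stated assumptions: the function name `assembly_stats` is hypothetical, the input is FASTA text already read into memory, and the cutoff is applied as "at least `min_len` nt", following the `.ge.1000.nt` column names in the ranking table (the caption itself says ">1000 nt").

```python
def assembly_stats(fasta_text: str, min_len: int = 1000) -> tuple[int, int]:
    """Return (number of scaffolds, total length) for scaffolds >= min_len nt."""
    lengths = []
    current = None  # length of the record being read, None before first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    kept = [n for n in lengths if n >= min_len]
    return len(kept), sum(kept)


if __name__ == "__main__":
    demo = ">scaffold_1\n" + "A" * 600 + "\n" + "C" * 600 + "\n>scaffold_2\n" + "G" * 10 + "\n"
    print(assembly_stats(demo))
```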
Global ranking of cyanobacterial genome assemblies.
| Genome assembly | | Assembly properties | | Ranking results | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accession | Organism name | Scaffolds.ge.1000.nt | Total.length.ge.1000.nt | rRNA | rprot | CheckM | Kraken | DIAMOND | CONCOCT | rank.avg.6.na | rank.6.na |
| * | | 974 | 15866152 | 66.67 | 62.5 | 200 | 31.84 | 35.15 | 44.27 | 8.83 | 1 |
| GCF_001482745.1 | Oscillatoriales cyanobacterium MTP1 | 68 | 7647882 | 50 | 40 | 80.53 | 44.93 | 32.99 | 50.41 | 10.25 | 2 |
| GCA_000817745.1 | | 296 | 11500044 | 50 | 28.17 | 32.76 | 25.16 | 21.4 | 47.55 | 13.42 | 3 |
| GCA_000817785.1 | | 62 | 13096531 | 40 | 41.86 | 104.63 | NA | 20.35 | 38.95 | 16.10 | 4 |
| GCA_000963755.2 | | 1320 | 7900996 | 50 | 53.33 | 85.48 | 6.49 | 17.75 | 49.8 | 16.67 | 5 |
| GCF_000963755.1 | | 1320 | 7900996 | 50 | 53.33 | 85.48 | 6.49 | 17.76 | 49.13 | 16.83 | 6 |
| GCA_001458455.1 | | 336 | 4815140 | 50 | 61.21 | 22.19 | 11.45 | 27.1 | 39.3 | 17.08 | 7 |
| GCA_000817775.1 | | 298 | 8799693 | 50 | 13.73 | 23.11 | 21.24 | 15.94 | 36.71 | 18.00 | 8 |
| GCA_000817735.1 | | 118 | 11627246 | 33.33 | 44.57 | 44.54 | NA | 18.07 | 34.98 | 19.00 | 9 |
| GCA_000828085.1 | | 214 | 10008488 | 50 | 4.35 | 104.63 | NA | 5.51 | 25.69 | 32.90 | 10 |
| °GCA_000634395.1 | | 76 | 1282892 | 100 | 35.56 | 11.96 | 11.36 | 41.55 | 21.74 | 33.00 | 11 |
| °GCF_001637395.1 | | 420 | 6991351 | 0 | 11.94 | 54.29 | 34.21 | 15.13 | 49.63 | 49.92 | 12 |
| GCA_000828075.1 | | 135 | 10627177 | 75 | 44.44 | 5.39 | NA | 12.95 | 14.19 | 51.00 | 13 |
| °GCF_001637315.1 | | 171 | 4600567 | 0 | 27.87 | 5.42 | 11.28 | 6.66 | 85.01 | 55.58 | 14 |
| GCF_000828075.2 | | 61 | 9468441 | 75 | 40.79 | 3.41 | 0 | 6.23 | 92.28 | 57.00 | 15 |
| GCA_000341585.2 | | 10 | 5525469 | 75 | 4.35 | 4.39 | 6.45 | 3.75 | 10.91 | 58.58 | 16 |
| °GCA_000934435.1 | | 174 | 8560182 | 0 | 38.46 | 5.53 | 12.63 | 11.15 | 16.43 | 81.42 | 17 |
| GCF_000346485.2 | | 27 | 12284271 | 20 | 0 | 2.77 | 0.69 | 1.35 | 86.56 | 93.17 | 18 |
| ° | Cyanobacteria bacterium JGI 0000014-E08 | 59 | 327995 | 0 | 10 | 0.88 | 4.45 | 18.13 | 24.59 | 93.83 | 19 |
| GCF_001904775.1 | | 44 | 5821893 | 0 | 0 | 1.45 | 13.12 | 2.38 | 96.83 | 98.67 | 20 |
| GCA_000341585.1 | | 1354 | 2669044 | 0 | 11.11 | 1.96 | 6.02 | 3.32 | 8.65 | 102.08 | 21 |
(*) indicates assemblies for which raw read data are in principle available for download from NCBI SRA;
(°) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria;
(+) indicates assemblies that are too large (>15,000 kbp);
(-) indicates assemblies that are too small (<500 kbp).
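The `rank.avg.6.na` and `rank.6.na` columns suggest that each assembly is ranked separately by each of the six methods and that these ranks are then averaged into a global ranking. The sketch below shows one way such a rank-averaging consensus could be computed. It is an illustration, not the authors' pipeline: the function names are hypothetical, the inputs are assumed to be per-method contamination estimates where a larger value means more contamination, ties are assumed to share their average rank, and missing values (`NA`, represented here as `None`) are assumed to be left out of the average.

```python
import math


def rank_descending(values):
    """Rank values so that the largest gets rank 1.

    Ties share the average of their positions; None (NA) stays None.
    """
    present = sorted((v for v in values if v is not None), reverse=True)
    positions = {}
    for pos, v in enumerate(present, start=1):
        positions.setdefault(v, []).append(pos)
    avg = {v: sum(p) / len(p) for v, p in positions.items()}
    return [None if v is None else avg[v] for v in values]


def consensus_rank(estimates):
    """Average per-method ranks for each assembly.

    `estimates` maps an assembly name to a list with one estimate per
    method (None for NA). Returns (names ordered by mean rank, mean ranks).
    """
    names = list(estimates)
    n_methods = len(next(iter(estimates.values())))
    per_method = [
        rank_descending([estimates[n][m] for n in names])
        for m in range(n_methods)
    ]
    mean_rank = {}
    for i, name in enumerate(names):
        ranks = [r[i] for r in per_method if r[i] is not None]
        # Assumption: an assembly with no usable estimate sorts last.
        mean_rank[name] = sum(ranks) / len(ranks) if ranks else math.inf
    ordered = sorted(names, key=lambda n: mean_rank[n])
    return ordered, mean_rank


if __name__ == "__main__":
    # Toy values for three hypothetical assemblies and three methods.
    toy = {"A": [50, 40, 80], "B": [40, 42, None], "C": [0, 0, 1]}
    order, means = consensus_rank(toy)
    print(order, means)
```

Averaging ranks rather than raw estimates keeps methods with very different scales (for example a CheckM value above 100 next to a Kraken value below 10) from dominating the consensus.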
Fig 3. Comparison of genome-wide estimates of the contamination level with ribosomal gene results, considered as the gold standard.
Fig 4. Taxonomic distribution of contaminating sequences in contaminated genomes, based on DIAMOND blastx estimates.
Fig 5. Validation of our methods for detecting contaminants using the sequencing coverage.