Adonney Allan de Oliveira Veras1, Bruno Merlin1, Pablo Henrique Caracciolo Gomes de Sá2.
Abstract
The availability of biological information in public databases has increased exponentially. To ensure the accuracy of this information and avoid disseminating errors, researchers have adopted several methods and refinements; for example, several automated tools are available for annotation. However, manual curation is still needed to verify and enrich biological information. Additionally, the genome finishing process is complex, which has led to increased deposition of draft genomes. This introduces bias in other omics analyses because incomplete genomic content is used. The same problem is observed for complete genomes: for example, genomes generated by reference assembly may not include products that are new to the sequenced organism, and errors or bias can occur during the assembly process. Thus, we developed ImproveAssembly, a tool capable of identifying new products missing from genomic sequences, which can be used for both complete and draft genomes. The identified products can improve the annotation of complete and draft genomes while significantly reducing bias when the information is used in other omics analyses.
Year: 2018 PMID: 30365512 PMCID: PMC6203371 DOI: 10.1371/journal.pone.0206000
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. ImproveAssembly workflow.
Complete genomes (green arrows), draft genomes (red arrows) and both (black arrows).
Organisms and SRA accession numbers used to validate ImproveAssembly.
| Organism | SRA Accession Number |
|---|---|
| | SRR974839 |
| | SRR1144793 |
| | SRR974846 |
| | SRR857301 |
| | SRR6479489 |
| | SRR6479482 |
| | SRR2014554 |
| | SRR1424625 |
| | SRR2000272 |
| | SRR2537294 |
| | SRR3223744 |
| | SRR4240341 |
| | ERR007646 |
Results of the assembly process with SPAdes.
| Organism | N50 | Largest Contig | Smallest Contig | Contigs | Total bases |
|---|---|---|---|---|---|
| | 140.760 | 409.569 | 181 | 144 | 4.529.368 |
| | 148.396 | 326.983 | 81 | 98 | 4.644.355 |
| | 94.587 | 244.438 | 184 | 180 | 4.757.688 |
| | 132.490 | 347.926 | 220 | 170 | 4.578.480 |
Example of E. coli RR1 tabular file containing the locus_tag and products identified by ImproveAssembly.
| Locus_Tag | Function predicted with RAST |
|---|---|
| SRR2014554_complete_assembly_1590 | FIG00640785: hypothetical protein |
| SRR2014554_complete_assembly_2482 | hypothetical protein |
| SRR2014554_complete_assembly_1492 | hypothetical protein |
| SRR2014554_complete_assembly_0217 | hypothetical protein |
| SRR2014554_complete_assembly_0701 | FIG00640293: hypothetical protein |
| SRR2014554_complete_assembly_1789 | hypothetical protein |
| SRR2014554_complete_assembly_0308 | hypothetical protein |
| SRR2014554_complete_assembly_0604 | hypothetical protein |
| SRR2014554_complete_assembly_0573 | FIG01045439: hypothetical protein |
| SRR2014554_complete_assembly_3104 | Gene D protein |
| SRR2014554_complete_assembly_1077 | hypothetical protein |
| SRR2014554_complete_assembly_3348 | hypothetical protein |
| SRR2014554_complete_assembly_0043 | hypothetical protein |
| SRR2014554_complete_assembly_1275 | hypothetical protein |
| SRR2014554_complete_assembly_3613 | hypothetical protein |
| SRR2014554_complete_assembly_4327 | hypothetical protein |
| SRR2014554_complete_assembly_2682 | hypothetical protein |
| SRR2014554_complete_assembly_0962 | Hypothetical response regulatory protein ygeK |
| SRR2014554_complete_assembly_2451 | hypothetical protein |
| SRR2014554_complete_assembly_4223 | FIG00641106: hypothetical protein |
| SRR2014554_complete_assembly_2991 | C4-dicarboxylate transporter DcuC (TC 2.A.61.1.1) |
| SRR2014554_complete_assembly_3322 | Mobile element protein |
| SRR2014554_complete_assembly_4169 | Ferredoxin |
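The split between hypothetical proteins and products with a described function can be tallied directly from a tabular file like this one. The following is a minimal sketch and not part of ImproveAssembly. It assumes the file is tab-separated with two columns (locus tag, predicted function), and it assumes that a product counts as hypothetical when its description contains the phrase "hypothetical protein". The function name `summarize_products` is hypothetical.

```python
import csv
import io


def summarize_products(handle):
    """Tally hypothetical vs. functionally described products.

    Assumes a two-column, tab-separated layout: locus tag, predicted function.
    """
    counts = {"total": 0, "with_function": 0, "hypothetical": 0}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, malformed rows, and the header row.
        if len(row) < 2 or row[0].strip().lower() == "locus_tag":
            continue
        description = row[1].strip().lower()
        # Assumed rule: the phrase "hypothetical protein" marks a hypothetical product.
        if "hypothetical protein" in description:
            counts["hypothetical"] += 1
        else:
            counts["with_function"] += 1
        counts["total"] += 1
    return counts


# Four rows copied from the E. coli RR1 table, joined with tabs (assumed delimiter).
sample = "\n".join([
    "Locus_Tag\tFunction predicted with Rast",
    "SRR2014554_complete_assembly_1590\tFIG00640785: hypothetical protein",
    "SRR2014554_complete_assembly_3104\tGene D protein",
    "SRR2014554_complete_assembly_0962\tHypothetical response regulatory protein ygeK",
    "SRR2014554_complete_assembly_4169\tFerredoxin",
])
print(summarize_products(io.StringIO(sample)))
```

Applied under the same rule to all 23 rows listed in the table, this classification gives 18 hypothetical proteins and 5 products with a described function. Note that "Hypothetical response regulatory protein ygeK" falls in the second group because it does not contain the exact phrase.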
Quantity of new products for all thirteen organisms.
For each organism, the table lists the total number of products in the input genome, the number of new products identified, how many of those have an already described function, how many are hypothetical proteins, and the genome status.
| Organism | Total products in the input genome | Total identified new products | Products with function | Hypothetical proteins | Genome status |
|---|---|---|---|---|---|
| | 4323 | 23 | 5 | 18 | Complete |
| | 4306 | 21 | 1 | 20 | Complete |
| | 4298 | 47 | 12 | 35 | Complete |
| | 4306 | 17 | 6 | 11 | Complete |
| | 5317 | 98 | 36 | 62 | Complete |
| | 4154 | 10 | 5 | 5 | Complete |
| | 4369 | 38 | 16 | 22 | Draft |
| | 4501 | 33 | 2 | 31 | Complete |
| | 4920 | 28 | 2 | 26 | Complete |
| | 4353 | 169 | 98 | 71 | Complete |
| | 5756 | 23 | 2 | 21 | Complete |
| | 4374 | 4 | 1 | 3 | Complete |
| | 5157 | 31 | 9 | 22 | Draft |
| | 4323 | 18 | 8 | 10 | Draft |
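The columns of this table are tied together by simple arithmetic: the number of new products should equal the products with function plus the hypothetical proteins, and dividing the new products by the input total gives the relative gain in annotated content. The sketch below checks this on the rows exactly as listed. The function and variable names are illustrative, and the percentage is a derived figure that is not reported in the source.

```python
# Rows copied from the table, in order:
# (products in input genome, new products, with function, hypothetical)
rows = [
    (4323, 23, 5, 18),
    (4306, 21, 1, 20),
    (4298, 47, 12, 35),
    (4306, 17, 6, 11),
    (5317, 98, 36, 62),
    (4154, 10, 5, 5),
    (4369, 38, 16, 22),
    (4501, 33, 2, 31),
    (4920, 28, 2, 26),
    (4353, 169, 98, 71),
    (5756, 23, 2, 21),
    (4374, 4, 1, 3),
    (5157, 31, 9, 22),
    (4323, 18, 8, 10),
]


def check_row(total, new, with_function, hypothetical):
    """Return True when the two sub-counts add up to the new-product count."""
    return new == with_function + hypothetical


def percent_gain(total, new):
    """New products as a percentage of the products already in the input genome."""
    return 100.0 * new / total


for total, new, with_function, hypothetical in rows:
    status = "ok" if check_row(total, new, with_function, hypothetical) else "MISMATCH"
    print(f"{total:>5} {new:>4} {percent_gain(total, new):6.2f}% {status}")
```

Every listed row passes the consistency check, and the derived gain stays below 4% of the input products in each case.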
ImproveAssembly results for the E. coli strains used as drafts.
For each organism, the table lists the total number of products in the input genome, the number of new products identified, how many of those have an already described function, how many are hypothetical proteins, and the genome status.
| Organism | Total products in the input genome | Total identified new products | Products with function | Hypothetical proteins | Genome status |
|---|---|---|---|---|---|
| | 4339 | 11 | 3 | 8 | Draft |
| | 4468 | 13 | 4 | 9 | Draft |
| | 4676 | 21 | 5 | 16 | Draft |
| | 4353 | 13 | 2 | 11 | Draft |
Fig 2. Pangenomic analysis of seven E. coli strains without the new products identified by ImproveAssembly.
Fig 3. Pangenomic analysis of seven E. coli strains with the addition of new products identified by ImproveAssembly.