Allan Veras, Fabricio Araujo, Kenny Pinheiro, Luis Guimarães, Vasco Azevedo, Siomar Soares, Artur da Costa da Silva, Rommel Ramos.
Abstract
High-throughput sequencing technologies are a milestone in molecular biology: they have driven major advances in genomics by enabling the deposit of large volumes of biological data in public databases. The availability of such data has made comparative genomic analysis possible through pipelines that use the entire gene repertoire of genomes. However, public databases contain a large number of unfinished genomes, approximately 16-fold more than complete genomes, which biases comparative analyses. The present work therefore proposes Pan4Draft, an automated pipeline for pan-genomic analysis of draft prokaryotic genomes that uses reads from sequencing data to maximize the representation and accuracy of the gene repertoire of unfinished genomes. Pan4Draft supports comparative analyses that combine complete and draft genomes, or that use only draft genomes or only complete genomes. Pan4Draft is available at http://www.computationalbiology.ufpa.br/pan4drafts and the test dataset is available at https://sourceforge.net/projects/pan4drafts.
Year: 2018 PMID: 29942087 PMCID: PMC6018222 DOI: 10.1038/s41598-018-27800-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Pan4Draft pipeline. Processes highlighted in orange add gene content that may have been left out of the contigs file due to assembly artifacts. Processes highlighted in blue are frameshift adjustment steps.
Evaluation of accuracy between draft and complete genomes for different strains before and after using Pan4Draft (core and accessory genomes).
| Organism | Total products from complete genome | Before Pan4Draft: total products found | Before Pan4Draft: products with 100% similarity (core) | Before Pan4Draft: products with 100% similarity (accessory) | After Pan4Draft: total products found | After Pan4Draft: products with 100% similarity (core) | After Pan4Draft: products with 100% similarity (accessory) |
|---|---|---|---|---|---|---|---|
| | 4920 | 4664 | 2937 (62.97%) | 800 (17.15%) | 4437 | 2895 (65.25%) | 760 (17.13%) |
| | 4396 | 4356 | 3109 (71.37%) | 974 (22.36%) | 4188 | 3041 (72.61%) | 911 (21.75%) |
| | 4369 | 4339 | 3106 (71.58%) | 962 (22.17%) | 4186 | 3052 (72.91%) | 896 (21.40%) |
| | 4501 | 4461 | 3092 (69.31%) | 1009 (22.62%) | 4330 | 3017 (69.68%) | 947 (21.87%) |
| | 5130 | 5150 | 3090 (60.00%) | 931 (18.08%) | 4927 | 3029 (61.48%) | 900 (18.27%) |
| | 5007 | 8882 | 3036 (34.18%) | 871 (9.81%) | 6684 | 3010 (45.03%) | 861 (12.88%) |
| | 5032 | 5032 | 3138 (62.36%) | 965 (19.18%) | 5032 | 3082 (61.25%) | 962 (19.12%) |
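The percentages in this table appear to be each product count divided by the "total products found" value of the same block (before or after Pan4Draft). The following minimal sketch reproduces that arithmetic for the first row; the function name `share` and the dictionary layout are illustrative assumptions, not part of Pan4Draft.

```python
def share(count, total_found):
    """Percentage of the found products, rounded to two decimals."""
    return round(100.0 * count / total_found, 2)

# Values copied from the first data row of the table above.
before = {"total_found": 4664, "core": 2937, "accessory": 800}
after = {"total_found": 4437, "core": 2895, "accessory": 760}

for label, row in (("before", before), ("after", after)):
    core_pct = share(row["core"], row["total_found"])
    acc_pct = share(row["accessory"], row["total_found"])
    print(f"{label}: core {core_pct:.2f}%, accessory {acc_pct:.2f}%")
```

Read this way, a core percentage can rise after Pan4Draft even when the absolute core count falls slightly, because the denominator (total products found) shrinks as well.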
Number of frameshifts found before and after running the pipeline.
| Organism | Before Pipeline | After Pipeline |
|---|---|---|
| | 349 | 227 |
| | 227 | 194 |
| | 273 | 174 |
| | 287 | 177 |
| | 374 | 274 |
| | 998 | 855 |
| | 367 | 367 |
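To make the effect of the frameshift adjustment easier to compare across rows, the counts can be expressed as a relative reduction. This is a small sketch over the values in the table above; the helper name `reduction_pct` is hypothetical, and the relative reduction is a derived quantity for illustration rather than a metric defined by the source.

```python
# (before, after) frameshift counts, copied row by row from the table above.
FRAMESHIFTS = [
    (349, 227),
    (227, 194),
    (273, 174),
    (287, 177),
    (374, 274),
    (998, 855),
    (367, 367),
]

def reduction_pct(before, after):
    """Relative reduction in frameshift count, as a percentage of the 'before' value."""
    return round(100.0 * (before - after) / before, 2)

for before, after in FRAMESHIFTS:
    print(f"{before} -> {after}: {reduction_pct(before, after):.2f}% fewer frameshifts")

# Absolute number of frameshifts removed across all rows.
total_removed = sum(b - a for b, a in FRAMESHIFTS)
print(f"total removed: {total_removed}")
```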
Figure 2. Synteny analysis of the genomes against their reference. For each panel, the top part shows the genome before using Pan4Draft and the bottom part shows it after using Pan4Draft. The X axis represents the reference and the Y axis represents the ordered multi-FASTA file. (a) E. coli P12b (SRA SRX1012260, run SRR2000272). (b) E. coli K-12 str. GM4792 Lac+ (SRA SRX1295865, run SRR2537294). (c) E. coli RR1 (SRA SRX1021885, run SRR2014554). (d) E. coli KLY (SRA SRX610250, run SRR1424625). (e) E. coli 042 (SRA ERX002221, run ERR007646). (f) E. coli 042 (SRA ERX002221, run ERR007646).
Figure 3. Ortholog cluster analysis of the pan-genome, showing that the clusters of the complete genome are more similar to those of the genome after using Pan4Draft. (a) Ortholog clusters of the complete genomes. (b) Ortholog clusters of the genome before using Pan4Draft. (c) Ortholog clusters of the genome after using Pan4Draft.
Figure 4. Similarity values observed before and after using Pan4Draft for the analysis of the core (a) and accessory (b) genomes. The controls (complete genomes) are shown in yellow.
Processing time, in seconds, for the local execution of each step: Step 1, Bowtie mapping against the input file; Step 2, de novo assembly using Velvet; Step 3, BLAST of contigs; Step 4, adding contigs; Step 8, frameshift identification; Step 11, standardization of archives; Step 12, comparative analysis. Step 9 (remote UniProt BLAST) and Step 10 (consensus generation of products with frameshifts based on reads) run in remote environments, so their processing times are not shown.
| Organism | Step 1 | Step 2 | Step 3 | Step 4 | Total Contigs | Step 8 | Step 11 | Step 12 |
|---|---|---|---|---|---|---|---|---|
| | 645.47 | 2563.29 | 71.69 | 1.29 | 5 | 0.35 | 16.23 | 2033.01 |
| | 1131.82 | 1175.06 | 71.25 | 1.91 | 0 | 0.32 | 23.55 | 2033.01 |
| | 1046.82 | 1205.18 | 71.70 | 1.20 | 6 | 0.36 | 18.50 | 2033.01 |
| | 735.57 | 1296.20 | 71.39 | 1.10 | 0 | 0.37 | 25.71 | 2033.01 |
| | 582.22 | 1508.30 | 71.34 | 2.23 | 3 | 0.62 | 24.95 | 2033.01 |
| | 389.06 | 2166.90 | 71.40 | 2.54 | 229 | 0.59 | 28.15 | 2033.01 |
| | — | — | — | — | — | — | 24.10 | — |
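The per-step timings can be combined into a local runtime per genome. The sketch below sums Steps 1, 2, 3, 4, 8 and 11 for the six rows that report all of them, leaving out the "Total Contigs" column because it is a count rather than a time. Step 12 shows the same value in every row, so the sketch assumes (this is not stated in the source) that it is a single comparative run shared by all genomes and adds it only once. All names are illustrative.

```python
# Local timings in seconds for steps 1, 2, 3, 4, 8 and 11, copied from the table above.
# The "Total Contigs" column is a count, not a time, and is omitted.
PER_GENOME_STEPS = [
    (645.47, 2563.29, 71.69, 1.29, 0.35, 16.23),
    (1131.82, 1175.06, 71.25, 1.91, 0.32, 23.55),
    (1046.82, 1205.18, 71.70, 1.20, 0.36, 18.50),
    (735.57, 1296.20, 71.39, 1.10, 0.37, 25.71),
    (582.22, 1508.30, 71.34, 2.23, 0.62, 24.95),
    (389.06, 2166.90, 71.40, 2.54, 0.59, 28.15),
]

# Assumption: step 12 (comparative analysis) is one run shared by all genomes.
SHARED_STEP_12 = 2033.01

def per_genome_total(row):
    """Sum of the local per-genome steps, in seconds."""
    return round(sum(row), 2)

def overall_total(rows, shared_step):
    """All per-genome steps plus the shared comparative step counted once."""
    return round(sum(sum(row) for row in rows) + shared_step, 2)

for row in PER_GENOME_STEPS:
    print(f"per-genome local time: {per_genome_total(row):.2f} s")
print(f"overall local time: {overall_total(PER_GENOME_STEPS, SHARED_STEP_12):.2f} s")
```

In these rows Step 2 (de novo assembly) is the largest per-genome cost, so it is the natural first place to look when estimating runtime for a larger dataset.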