Ahmed M Moustafa, Paul J Planet.
Abstract
To understand diversity in enormous collections of genome sequences, we need computationally scalable tools that can quickly contextualize individual genomes based on their similarities and identify the features of each genome that make it unique. We present WhatsGNU, a tool based on exact match proteomic compression that, in seconds, classifies any new genome and provides a detailed report of protein alleles that may have novel functional differences. We use this technique to characterize the total allelic diversity (panallelome) of Salmonella enterica, Mycobacterium tuberculosis, Pseudomonas aeruginosa, and Staphylococcus aureus. It could be extended to other species. WhatsGNU is available from https://github.com/ahmedmagds/WhatsGNU.
Keywords: Compression; M. tuberculosis; Microbial genomics; P. aeruginosa; Panallelome; Pangenome; S. aureus; S. enterica; blastp
Year: 2020 PMID: 32138767 PMCID: PMC7059281 DOI: 10.1186/s13059-020-01965-w
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Workflow and performance of WhatsGNU. a Workflow for the WhatsGNU tool and its compression technique. The tool first compresses the database of proteins. It then matches each protein from a query genome to an exact match in the compressed database. Finally, it produces a report with a GNU (Gene Novelty Unit) score for each protein. b Compressed databases available in WhatsGNU. c A collector’s curve showing the number of exact matches (unique alleles) as a function of the number of genomes sequenced. The sizes of the panallelomes of the S. aureus genomes available on GenBank and on Staphopia were compared. Subsets of 1000, 2000, 4000, 8524, 10,350, 20,000, and 30,000 genomes were randomly selected from the 43,914 S. aureus genomes available on Staphopia. The random sampling was repeated three times, independently. Error bars are shown in green. d Effect of the number of isolates on the wall-clock running time of WhatsGNU and blastp. Both WhatsGNU and blastp were run on a single CPU with 16 GB of RAM. The S. aureus database used for WhatsGNU had been processed beforehand and serialized using the Python3 pickle module. The time needed to find exact matches for each of the 2893 proteins of S. aureus NCTC 8325 was recorded for WhatsGNU and blastp. For WhatsGNU, running time was measured with 1, 100, and 1000 copies of the NCTC 8325 genome. For blastp, to reduce computational costs, the running time for one NCTC 8325 genome was multiplied by 100 and 1000, respectively. Running times will differ on machines with different specifications. Blastp running time can be reduced by using multiple threads if more than one CPU is available.
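The three steps in panel a can be illustrated with a minimal Python sketch. It is not the WhatsGNU implementation: the function names `compress` and `gnu_report` are hypothetical, the input layout (a dict of genome id to protein sequences) is an assumption, and the GNU score is assumed here to be the number of times an identical sequence occurs in the database.

```python
from collections import Counter


def compress(genomes):
    """Collapse identical protein sequences into one entry with a count.

    `genomes` maps a genome id to a list of protein sequences
    (a hypothetical input layout chosen for this sketch).
    """
    counts = Counter()
    for proteins in genomes.values():
        counts.update(proteins)  # each identical sequence is stored once
    return counts


def gnu_report(query, db):
    """Look up each query protein by exact match in the compressed database.

    Returns protein name -> assumed GNU score (0 means no exact match,
    i.e. an allele not seen in the database).
    """
    return {name: db.get(seq, 0) for name, seq in query.items()}


# Toy data, invented for illustration only.
genomes = {
    "g1": ["MKT", "MAA"],
    "g2": ["MKT", "MAV"],
    "g3": ["MKT", "MAA"],
}
db = compress(genomes)
report = gnu_report({"p1": "MKT", "p2": "MAA", "p3": "MXX"}, db)
```

Because the lookup is a hash-table access on the full sequence, the cost per query protein does not grow with the number of genomes in the database, which is in line with the scaling contrast with blastp described in panel d.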
Fig. 2 Visualization methods of WhatsGNU. a Box showing some potential WhatsGNU uses and options. b A histogram of GNU scores of a clinical CC8 USA300 S. aureus genome, “SSTI_179_1”. c Percentage of genomes of clonal complexes (CC) 5, 8, 22, 398, 30, and 1 with an exact match for three proteins, SbnD (staphyloferrin B export MFS transporter), TraG (transfer complex protein), and ArcB (ornithine carbamoyltransferase), from the same genome used in b. d A heatmap of GNU scores for key components of the TCA cycle, the glycolytic pathway, and terminal components of the electron transport chain in eighteen different clinical S. aureus isolates. Proteins are listed on the left and isolate numbers on the bottom. In annotated cells, ‘r’ marks ortholog variant rarity index (OVRI) scores below 0.045, which indicates that GNU scores this low or lower are very rare in the ortholog group. e Volcano plot showing proteins with a lower average GNU score in a case group (atopic dermatitis, AD) compared to a control group (skin and soft tissue infection, SSTI). Proteins with a lower average GNU score in the AD case group of 18 CC8 S. aureus isolates are shown in red. Proteins with a lower average GNU score in the SSTI control group of 49 CC8 S. aureus isolates are shown in green. The P value is from a Mann–Whitney–Wilcoxon test. A second volcano plot with OVRI on the Y-axis is shown in supplementary figure 3. Example WhatsGNU reports are in the supplementary data.
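The per-clonal-complex percentages in panel c can be sketched as follows. This is an illustrative calculation only, under stated assumptions: the function `percent_with_exact_match`, the `genomes` layout (genome id to a set of protein sequences), and the `cc_of` lookup (genome id to clonal complex label) are all hypothetical and are not taken from WhatsGNU.

```python
from collections import defaultdict


def percent_with_exact_match(query_seq, genomes, cc_of):
    """Percentage of genomes in each clonal complex that carry an
    exact copy of `query_seq`.

    `genomes` maps genome id -> set of protein sequences and
    `cc_of` maps genome id -> clonal complex label (both assumed).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for genome_id, proteins in genomes.items():
        cc = cc_of[genome_id]
        totals[cc] += 1
        if query_seq in proteins:  # exact sequence match only
            hits[cc] += 1
    return {cc: 100.0 * hits[cc] / totals[cc] for cc in totals}


# Toy data, invented for illustration only.
genomes = {
    "a": {"MKT"},
    "b": {"MKT", "MAA"},
    "c": {"MAA"},
    "d": {"MAV"},
}
cc_of = {"a": "CC8", "b": "CC8", "c": "CC5", "d": "CC5"}
shares = percent_with_exact_match("MKT", genomes, cc_of)
```

An allele found in nearly all genomes of one clonal complex and almost none of the others would show up here as a single dominant percentage.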