Ernesto Borrayo-Carbajal, Isaias May-Canche, Omar Paredes, J Alejandro Morales, Rebeca Romo-Vázquez, Hugo Vélez-Pérez.
Abstract
Alignment-free k-mer-based algorithms in whole-genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility of using topic modeling for whole-genome comparisons between organisms. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered a document and its 13-mer nucleotide representations were considered words. Latent Dirichlet allocation was used as the probabilistic model of the corpus. We were able to identify the topic distribution among the analyzed genomes, which is highly consistent with traditional hierarchical classification. Topic modeling may therefore be applicable to establishing relationships between genome composition and biological phenomena.
Keywords: Alignment-Free; Bacteria Genome Comparison; Topic Model
Year: 2020 PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Genomic information from bacteria selected for Whole-genome k-mer topic modeling association.
| Accession No. | Family | Organism | Genome Size (bp) |
|---|---|---|---|
| AE001273.1 | | | 1,042,519 |
| AE002160.2 | | | 1,072,950 |
| AE009440.1 | | | 1,225,935 |
| AE015925.1 | | | 1,173,390 |
| AP006861.1 | | | 1,166,239 |
| CP002549.1 | | | 1,171,660 |
| CP002608.1 | | | 1,106,197 |
| CP006571.1 | | | 1,041,170 |
| CP015840.1 | | | 1,059,583 |
| CR848038.1 | | | 1,144,377 |
| BA000031.2 | | | 3,288,558 |
| BA000037.2 | | | 3,354,505 |
| CP000020.2 | | | 2,897,536 |
| CP000626.1 | | | 1,108,250 |
| CP000789.1 | | | 3,765,351 |
| CP002284.1 | | | 3,063,912 |
| CP002377.1 | | | 3,294,546 |
| CR354531.1 | | | 4,085,304 |
| FM178379.1 | | | 3,325,165 |
| FM954972.2 | | | 3,299,303 |
| AL590842.1 | | | 4,653,728 |
| CP000720.1 | | | 4,723,306 |
| CP000826.1 | | | 5,448,853 |
| CP002505.1 | | | 4,864,217 |
| CP002774.1 | | | 5,443,009 |
| CP006250.1 | | | 5,328,010 |
| CP016940.1 | | | 4,593,248 |
| CP017236.1 | | | 3,856,634 |
| HG738868.1 | | | 5,123,091 |
| LN890288.1 | | | 650,317 |
Figure 1. Schematic procedure of whole-genome k-mer topic modeling association. The genomes to be compared are retrieved either from databases or from experimental procedures (1), decomposed into k-mers (2), and analyzed to determine an adequate number of topics (3); the topic classification is then performed, as summarized in (4).
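Step (2) of this procedure, with each genome treated as a document and its k-mers as words, can be illustrated with a minimal Python sketch. The function names, the toy sequences, the overlapping-window scheme, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration; the record does not state how the authors handled these details.

```python
from collections import Counter

def kmer_words(sequence, k=13):
    """Yield overlapping k-mers, skipping any that contain non-ACGT symbols.

    Assumption: a sliding window with step 1 on a single strand.
    """
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            yield word

def document_term_counts(genomes, k=13):
    """Map each genome name (a 'document') to a Counter of its k-mer 'words'."""
    return {name: Counter(kmer_words(seq, k)) for name, seq in genomes.items()}

# Toy sequences (hypothetical, far shorter than a real genome); k=4 keeps
# the example readable, whereas the study used k=13.
toy_genomes = {"genome_a": "ACGTACGTAC", "genome_b": "ACGTNACGTT"}
counts = document_term_counts(toy_genomes, k=4)
```

The resulting per-genome counts form the rows of a document-term matrix, which is the usual input to a latent Dirichlet allocation implementation.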
Figure 2. Box plot of cumulative relative entropy for different k sizes across the 30 bacterial genomes. The suggested threshold is 0.1; values below it maximize the differences between genomes. Note that k = 13 is the first k at which no value is above the threshold.
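The selection rule in this caption (take the first k at which no genome's value exceeds the threshold) can be sketched as follows. The function name and the numbers in `toy_values` are invented for illustration and are not the study's data; computing cumulative relative entropy itself is outside the scope of this sketch.

```python
def first_k_within_threshold(values_by_k, threshold=0.1):
    """Return the smallest k whose values are all at or below the threshold.

    values_by_k maps each k to a list of per-genome values.
    Returns None if no k satisfies the rule.
    """
    for k in sorted(values_by_k):
        if all(v <= threshold for v in values_by_k[k]):
            return k
    return None

# Made-up values for three genomes at three k sizes (illustrative only).
toy_values = {
    11: [0.30, 0.12, 0.08],
    12: [0.11, 0.07, 0.05],
    13: [0.06, 0.04, 0.03],
}
chosen_k = first_k_within_threshold(toy_values)
```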
Figure 3. Phylogenomic classification of the bacterial families Chlamydiales, Vibrionaceae, and Yersiniaceae based on topic modeling (three topics in this work).
Figure 4. Phylogenomic classification of the bacterial families Chlamydiales, Vibrionaceae, and Yersiniaceae based on the methodology of Sims and Kim [3], including over-represented words.