Matyas Cserhati, Peng Xiao, Chittibabu Guda.
Abstract
Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5' UTR, intron, and 3' UTR regions from 58 insect species belonging to three genera of Diptera: Anopheles, Drosophila, and Glossina. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome, as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7-9 bp for the whole genome, the 5' and 3' UTR regions, and the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated more tightly with each other than with species of other genera. The genomes of two Aedes and one Culex species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature to each of the 58 Dipteran species by making all-to-all pairwise comparisons of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.
Year: 2019 PMID: 31827584 PMCID: PMC6881769 DOI: 10.1155/2019/4259479
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. Flowchart depicting the algorithm. First, the whole-genome sequences or subgenomic regions of interest for all species are analyzed, and the whole-genome k-mer signature (WGKS) is produced. This is a list of all possible k-mers together with their normalized score values. These WGKSs are compared in an all-versus-all manner, using the Pearson correlation coefficient (CC). This produces a CC matrix, which is then visualized in a heatmap, depicting species relationships.
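The pipeline in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration only: the paper's score normalization is not given in this record, so the sketch substitutes plain relative k-mer frequency as the score (an assumption), and the function names `kmer_signature`, `pearson`, and `cc_matrix` as well as the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_signature(seq, k):
    """Score every possible k-mer; relative frequency stands in for the
    paper's normalized score, which is not specified here (assumption)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing non-ACGT characters (e.g. N) are ignored.
    total = sum(counts[m] for m in all_kmers) or 1
    return [counts[m] / total for m in all_kmers]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def cc_matrix(genomes, k):
    """All-versus-all CC matrix over the k-mer signatures."""
    names = sorted(genomes)
    sigs = {n: kmer_signature(genomes[n], k) for n in names}
    return names, [[pearson(sigs[a], sigs[b]) for b in names] for a in names]

# Toy sequences (made up for illustration; real input would be genomes).
genomes = {
    "sp_A": "ACGTACGTTAGCATCGATCGGATCCGATAAGC",
    "sp_B": "ACGTACGTTAGCTTCGATCGGATCCGTTAAGC",
    "sp_C": "TTTTTTAAAAAATTTTTAAAAATTTTAAAATT",
}
names, cc = cc_matrix(genomes, k=3)
for name, row in zip(names, cc):
    print(name, [round(v, 3) for v in row])
```

In this toy run, `sp_A` and `sp_B` differ by two substitutions and correlate more strongly with each other than either does with the A/T-only `sp_C`. The resulting matrix is the kind of input a heatmap would visualize.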
Figure 2. Heatmap depicting CC values calculated in an all-versus-all pairwise manner between the 63 species included in the analysis based on the whole-genome k-mer signature for octamers. Colors closer to yellow or white indicate higher CC values, while those closer to red indicate lower CC values. The range of the CC values in this matrix is from 0.259 to 1.0.
Figure 3. UPGMA, WPGMA, and NJ trees for 1−CC values for all species pairs from each of the three genera: (a) Drosophila, (b) Glossina, and (c) Anopheles.
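To illustrate how a tree can be built from 1−CC distances, the following is a minimal pure-Python sketch of UPGMA (average linkage) only; WPGMA and NJ are not shown. The function name `upgma`, the species labels, and the CC values are made up for illustration and are not taken from the study.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.
    Returns the tree topology as nested tuples (branch lengths omitted)."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        for m in clusters:
            dim = d.pop((min(i, m), max(i, m)))
            djm = d.pop((min(j, m), max(j, m)))
            # Size-weighted mean equals the average over all member pairs.
            d[(m, next_id)] = (ni * dim + nj * djm) / (ni + nj)
        del d[(i, j)]
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Illustrative CC matrix (hypothetical values, not data from the paper).
names = ["sp1", "sp2", "sp3", "sp4"]
cc = [
    [1.00, 0.95, 0.64, 0.62],
    [0.95, 1.00, 0.63, 0.65],
    [0.64, 0.63, 1.00, 0.86],
    [0.62, 0.65, 0.86, 1.00],
]
dist = [[1 - v for v in row] for row in cc]  # 1 - CC as the distance
tree = upgma(names, dist)
print(tree)
```

With these toy values, the two highly correlated pairs are joined first and then merged at the root, giving `(('sp1', 'sp2'), ('sp3', 'sp4'))`.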
Number of statistically significant genome k-mers and minimum score for all species.
| Species | No. of significant k-mers | Min. score | No. of hits in JASPAR database |
|---|---|---|---|
| Anopheles_albimanus | 1646 | 0.383 | NA |
| Anopheles_arabiensis | 1629 | 0.349 | NA |
| Anopheles_atroparvus | 1425 | 0.346 | NA |
| Anopheles_christyi | 1366 | 0.414 | NA |
| Anopheles_cracens | 1523 | 0.433 | NA |
| Anopheles_culicifacies | 1440 | 0.371 | NA |
| Anopheles_darlingi | 1646 | 0.413 | NA |
| Anopheles_dirus | 1648 | 0.387 | NA |
| Anopheles_epiroticus | 1562 | 0.375 | NA |
| Anopheles_farauti | 1397 | 0.435 | NA |
| Anopheles_funestus | 1579 | 0.340 | NA |
| Anopheles_gambiae | 1509 | 0.394 | NA |
| Anopheles_koliensis | 1309 | 0.461 | NA |
| Anopheles_maculatus | 1613 | 0.377 | NA |
| Anopheles_melas | 1551 | 0.379 | NA |
| Anopheles_merus | 1755 | 0.281 | NA |
| Anopheles_minimus | 1406 | 0.379 | NA |
| Anopheles_nili | 1206 | 0.427 | NA |
| Anopheles_punctulatus | 1276 | 0.456 | NA |
| Anopheles_quadriannulatus | 1771 | 0.270 | NA |
| Anopheles_sinensis | 1381 | 0.419 | NA |
| Anopheles_stephensi | 1666 | 0.369 | NA |
| Drosophila_albomicans | 2279 | 0.428 | 23 |
| Drosophila_americana | 2209 | 0.428 | 22 |
| Drosophila_ananassae | 2067 | 0.481 | 22 |
| Drosophila_arizonae | 2293 | 0.405 | 19 |
| Drosophila_biarmipes | 1899 | 0.475 | 19 |
| Drosophila_bipectinata | 1934 | 0.449 | 15 |
| Drosophila_busckii | 2406 | 0.442 | 25 |
| Drosophila_elegans | 1768 | 0.519 | 19 |
| Drosophila_erecta | 2047 | 0.470 | 17 |
| Drosophila_eugracilis | 1838 | 0.424 | 21 |
| Drosophila_ficusphila | 1591 | 0.435 | 19 |
| Drosophila_grimshawi | 2377 | 0.465 | 16 |
| Drosophila_kikkawai | 1834 | 0.468 | 16 |
| Drosophila_melanogaster | 1805 | 0.472 | 20 |
| Drosophila_miranda | 1973 | 0.429 | 28 |
| Drosophila_mojavensis | 2435 | 0.435 | 17 |
| Drosophila_nasuta | 1981 | 0.468 | 15 |
| Drosophila_navojoa | 2239 | 0.508 | 19 |
| Drosophila_obscura | 2029 | 0.500 | 22 |
| Drosophila_persimilis | 2111 | 0.423 | 24 |
| Drosophila_pseudoobscura | 2046 | 0.393 | 26 |
| Drosophila_rhopaloa | 1757 | 0.427 | 17 |
| Drosophila_sechellia | 1883 | 0.456 | 20 |
| Drosophila_serrata | 1820 | 0.410 | 15 |
| Drosophila_simulans | 1758 | 0.496 | 21 |
| Drosophila_suzukii | 1937 | 0.442 | 21 |
| Drosophila_takahashi | 1834 | 0.364 | 20 |
| Drosophila_virilis | 2415 | 0.475 | 23 |
| Drosophila_willistoni | 2223 | 0.425 | 21 |
| Drosophila_yakuba | 1843 | 0.410 | 19 |
| Glossina_austeni | 1741 | 0.367 | NA |
| Glossina_brevipalpis | 1973 | 0.360 | NA |
| Glossina_fuscipes | 1787 | 0.370 | NA |
| Glossina_pallidipes | 1732 | 0.373 | NA |
| Glossina_palpalis_gambiensis | 1810 | 0.342 | NA |
| Glossina_morsitans_morsitans | 1735 | 0.377 | NA |
CC statistics for k-mers of lengths 7–9 bp for different combinations of the genera under study.
| Group comparison | Min | Median | Mean | Max | Std. dev. | No. of comparisons |
|---|---|---|---|---|---|---|
| Heptamers | | | | | | |
| Anopheles | 0.913 | 0.957 | 0.955 | 0.999 | 0.022 | 231 |
| Non-Anopheles | 0.590 | 0.833 | 0.837 | 0.999 | 0.087 | 630 |
| Drosophila | 0.590 | 0.874 | 0.869 | 0.999 | 0.072 | 435 |
| Non-Drosophila | 0.677 | 0.938 | 0.882 | 0.999 | 0.104 | 378 |
| Glossina | 0.965 | 0.994 | 0.986 | 0.999 | 0.014 | 15 |
| Non-Glossina | 0.441 | 0.739 | 0.772 | 0.999 | 0.144 | 1326 |
| Anopheles vs. Drosophila | 0.441 | 0.648 | 0.644 | 0.770 | 0.059 | 660 |
| Anopheles vs. Glossina | 0.677 | 0.740 | 0.744 | 0.787 | 0.027 | 132 |
| Drosophila vs. Glossina | 0.642 | 0.749 | 0.745 | 0.812 | 0.033 | 180 |
| Outlier 1 vs. Anopheles | 0.528 | 0.559 | 0.562 | 0.643 | 0.030 | 22 |
| Outlier 1 vs. Drosophila | 0.266 | 0.620 | 0.573 | 0.667 | 0.102 | 30 |
| Outlier 1 vs. Glossina | 0.485 | 0.492 | 0.499 | 0.534 | 0.018 | 6 |
| Outlier 2 vs. Anopheles | 0.568 | 0.617 | 0.629 | 0.702 | 0.043 | 22 |
| Outlier 2 vs. Drosophila | 0.242 | 0.484 | 0.474 | 0.567 | 0.065 | 30 |
| Outlier 2 vs. Glossina | 0.570 | 0.590 | 0.589 | 0.617 | 0.017 | 6 |
| Octamers | | | | | | |
| Anopheles | 0.904 | 0.950 | 0.948 | 0.999 | 0.023 | 231 |
| Non-Anopheles | 0.588 | 0.824 | 0.822 | 0.998 | 0.089 | 630 |
| Drosophila | 0.588 | 0.858 | 0.857 | 0.997 | 0.069 | 435 |
| Non-Drosophila | 0.655 | 0.930 | 0.869 | 0.999 | 0.113 | 378 |
| Glossina | 0.948 | 0.988 | 0.978 | 0.998 | 0.020 | 15 |
| Non-Glossina | 0.443 | 0.723 | 0.761 | 0.999 | 0.143 | 1326 |
| Anopheles vs. Drosophila | 0.443 | 0.637 | 0.633 | 0.760 | 0.055 | 660 |
| Anopheles vs. Glossina | 0.655 | 0.716 | 0.719 | 0.755 | 0.026 | 132 |
| Drosophila vs. Glossina | 0.621 | 0.728 | 0.723 | 0.791 | 0.034 | 180 |
| Outlier 1 vs. Anopheles | 0.521 | 0.554 | 0.556 | 0.634 | 0.029 | 22 |
| Outlier 1 vs. Drosophila | 0.279 | 0.610 | 0.567 | 0.652 | 0.094 | 30 |
| Outlier 1 vs. Glossina | 0.477 | 0.484 | 0.490 | 0.522 | 0.017 | 6 |
| Outlier 2 vs. Anopheles | 0.564 | 0.611 | 0.624 | 0.696 | 0.042 | 22 |
| Outlier 2 vs. Drosophila | 0.259 | 0.481 | 0.477 | 0.565 | 0.061 | 30 |
| Outlier 2 vs. Glossina | 0.564 | 0.585 | 0.583 | 0.608 | 0.016 | 6 |
| Nonamers | | | | | | |
| Anopheles | 0.886 | 0.939 | 0.938 | 0.996 | 0.025 | 231 |
| Non-Anopheles | 0.577 | 0.805 | 0.801 | 0.993 | 0.092 | 630 |
| Drosophila | 0.577 | 0.838 | 0.839 | 0.992 | 0.069 | 435 |
| Non-Drosophila | 0.629 | 0.919 | 0.852 | 0.996 | 0.121 | 378 |
| Glossina | 0.919 | 0.975 | 0.961 | 0.993 | 0.028 | 15 |
| Non-Glossina | 0.439 | 0.705 | 0.747 | 0.996 | 0.143 | 1326 |
| Anopheles vs. Drosophila | 0.439 | 0.624 | 0.619 | 0.746 | 0.053 | 660 |
| Anopheles vs. Glossina | 0.629 | 0.689 | 0.691 | 0.724 | 0.024 | 132 |
| Drosophila vs. Glossina | 0.589 | 0.697 | 0.694 | 0.766 | 0.034 | 180 |
| Outlier 1 vs. Anopheles | 0.510 | 0.544 | 0.545 | 0.619 | 0.027 | 22 |
| Outlier 1 vs. Drosophila | 0.285 | 0.594 | 0.553 | 0.636 | 0.086 | 30 |
| Outlier 1 vs. Glossina | 0.464 | 0.470 | 0.475 | 0.503 | 0.014 | 6 |
| Outlier 2 vs. Anopheles | 0.555 | 0.602 | 0.615 | 0.685 | 0.041 | 22 |
| Outlier 2 vs. Drosophila | 0.270 | 0.475 | 0.474 | 0.558 | 0.058 | 30 |
| Outlier 2 vs. Glossina | 0.551 | 0.572 | 0.570 | 0.592 | 0.014 | 6 |
CC values were calculated within the genera Anopheles, Drosophila, and Glossina, between these three genera, and between each of the two outlier species, Apis mellifera and Caenorhabditis elegans, and the three genera. For each combination, the minimum, median, mean, and maximum CC values were calculated, as well as the standard deviation and the number of species comparisons.
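The numbers of comparisons in the table are consistent with simple pair counting over the group sizes listed in the species table (22 Anopheles, 30 Drosophila, and 6 Glossina species). The short check below is an illustrative sketch; the variable names are hypothetical.

```python
from math import comb

# Group sizes counted from the species table.
sizes = {"Anopheles": 22, "Drosophila": 30, "Glossina": 6}
total = sum(sizes.values())  # 58 species

# Pairs within one genus: n choose 2.
within = {g: comb(n, 2) for g, n in sizes.items()}

# Pairs among all species outside a genus ("Non-" rows).
non = {g: comb(total - n, 2) for g, n in sizes.items()}

# Pairs between two different genera: product of the group sizes.
between = {
    ("Anopheles", "Drosophila"): sizes["Anopheles"] * sizes["Drosophila"],
    ("Anopheles", "Glossina"): sizes["Anopheles"] * sizes["Glossina"],
    ("Drosophila", "Glossina"): sizes["Drosophila"] * sizes["Glossina"],
}

print(within)   # {'Anopheles': 231, 'Drosophila': 435, 'Glossina': 15}
print(non)      # {'Anopheles': 630, 'Drosophila': 378, 'Glossina': 1326}
print(between)  # 660, 132, 180
```

A single outlier species compared against a genus yields as many comparisons as the genus has species, which matches the rows with 22, 30, and 6 comparisons.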
Figure 4. Heatmap depicting similarity of the mitochondrial genomes across 28 species. Lower similarity values, closer to 0%, are shown in darker, redder colors, whereas higher similarity values, closer to 100%, are shown in brighter, yellow/white colors. The range of similarity values is between 0 and 100%.
Figure 5. Pearson correlation coefficient (CC) values between species of Anopheles, Drosophila, and Glossina as well as the two control species, A. mellifera and C. elegans, for octamers. The first three columns represent CC values between all pairs of species within the genera Anopheles, Drosophila, and Glossina, respectively; columns 4–6 represent comparisons across species from the three genera; columns 7–9 represent comparisons between C. elegans and the three genera; and columns 10–12 represent comparisons between A. mellifera and the three genera.
Figure 6. Common nonrepetitive (nondimer and nontrimer) octamer content between 11 Anopheles, 15 Drosophila, and 5 Glossina species. Each included octamer had a minimum score of 0.5.
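The filtering behind Figure 6 can be sketched as follows. The exact definition of "nondimer and nontrimer" is not given in this record; the sketch assumes it means that an octamer is not a pure repeat of a 2-bp or 3-bp unit. The function names, octamers, and scores below are hypothetical.

```python
def is_repetitive(kmer, unit_lengths=(2, 3)):
    """True if kmer is a pure repeat of a 2-bp or 3-bp unit
    (assumed reading of 'dimer'/'trimer' repeats; homopolymers count too)."""
    for u in unit_lengths:
        unit = kmer[:u]
        repeated = (unit * (len(kmer) // u + 1))[:len(kmer)]
        if repeated == kmer:
            return True
    return False

def common_kmers(signatures, min_score=0.5):
    """k-mers that pass the score cutoff and the repeat filter in every species."""
    kept = [
        {m for m, s in sig.items() if s >= min_score and not is_repetitive(m)}
        for sig in signatures.values()
    ]
    return set.intersection(*kept)

# Toy signatures: octamer -> score (made-up values for illustration).
signatures = {
    "sp1": {"ACGTTGCA": 0.70, "ATATATAT": 0.90, "GGCATTCA": 0.60, "TTGACCAA": 0.40},
    "sp2": {"ACGTTGCA": 0.55, "ATATATAT": 0.80, "GGCATTCA": 0.30, "TTGACCAA": 0.90},
}
shared = common_kmers(signatures)
print(shared)
```

Here only `ACGTTGCA` survives: `ATATATAT` is a dinucleotide repeat, and the other two octamers fall below the 0.5 cutoff in one of the species.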
Figure 7. Comparison of the similarity in the 5′ and 3′ UTRs within the genus Drosophila and between the species of Drosophila and A. gambiae, using k-mers of lengths 7–9 bp: (a) 5′ UTR and (b) 3′ UTR. Yellow bars represent comparisons among Drosophila species, and green bars represent comparisons between Drosophila species and A. gambiae.
CC statistics for 5′, 3′ UTRs and introns for k-mer lengths k = 7–9 bp between A. gambiae and Drosophila.
| Comparison | Region | k | Min | Median | Mean | Max | St. dev. | No. of comparisons | p value |
|---|---|---|---|---|---|---|---|---|---|
| Within Drosophila | 5′ UTR | 7 | 0.692 | 0.862 | 0.841 | 0.975 | 0.100 | 21 | NA |
| Drosophila vs. A. gambiae | 5′ UTR | 7 | 0.623 | 0.734 | 0.722 | 0.774 | 0.050 | 7 | 5.1 |
| Within Drosophila | 3′ UTR | 7 | 0.651 | 0.828 | 0.809 | 0.963 | 0.101 | 21 | NA |
| Drosophila vs. A. gambiae | 3′ UTR | 7 | 0.524 | 0.620 | 0.599 | 0.644 | 0.043 | 7 | 6.2 |
| Within Drosophila | Introns | 7 | 0.759 | 0.894 | 0.895 | 0.996 | 0.058 | 66 | NA |
| Within Drosophila | 5′ UTR | 8 | 0.503 | 0.786 | 0.737 | 0.940 | 0.153 | 21 | NA |
| Drosophila vs. A. gambiae | 5′ UTR | 8 | 0.422 | 0.643 | 0.620 | 0.694 | 0.090 | 7 | 0.024 |
| Within Drosophila | 3′ UTR | 8 | 0.487 | 0.705 | 0.688 | 0.908 | 0.125 | 21 | NA |
| Drosophila vs. A. gambiae | 3′ UTR | 8 | 0.392 | 0.513 | 0.498 | 0.562 | 0.055 | 7 | 1.2 |
| Within Drosophila | Introns | 8 | 0.392 | 0.690 | 0.676 | 0.981 | 0.135 | 66 | NA |
| Within Drosophila | 5′ UTR | 9 | 0.280 | 0.626 | 0.569 | 0.854 | 0.183 | 21 | NA |
| Drosophila vs. A. gambiae | 5′ UTR | 9 | 0.201 | 0.453 | 0.431 | 0.512 | 0.104 | 7 | 0.023 |
| Within Drosophila | 3′ UTR | 9 | 0.334 | 0.526 | 0.524 | 0.795 | 0.122 | 21 | NA |
| Drosophila vs. A. gambiae | 3′ UTR | 9 | 0.242 | 0.360 | 0.356 | 0.422 | 0.056 | 7 | 5.9 |
| Within Drosophila | Introns | 9 | 0.721 | 0.855 | 0.854 | 0.978 | 0.062 | 66 | NA |
Minimum, median, mean, and maximum CC values were calculated for the 5′ UTR, 3′ UTR, and intron regions of different Drosophila species compared to A. gambiae. The number of comparisons and the p value are also included. †For 5′ and 3′ UTRs, the following Drosophila species were examined: D. ananassae, erecta, grimshawi, melanogaster, mojavensis, pseudoobscura, and simulans. ‡For introns, the following Drosophila species were examined: D. ananassae, erecta, grimshawi, melanogaster, mojavensis, persimilis, pseudoobscura, sechellia, simulans, virilis, willistoni, and yakuba.
Figure 8. Number of common k-mers of lengths 7–9 bp for all seven Drosophila species for 5′ and 3′ UTRs and introns.
Figure 9. Range of Pearson correlation coefficient values from the all-versus-all comparison of the twelve Drosophila species that have intron data, for k-mer lengths 7–9 bp.