Yan-Ting Jin1,2, Cong Ma1, Xin Wang1, Shu-Xuan Wang1, Kai-Yue Zhang1, Wen-Xin Zheng3, Zixin Deng2, Ju Wang4, Feng-Biao Guo5.
Abstract
In 2002, our research group observed a gene clustering pattern based on the base frequency of A versus T at the second codon position in the genome of Vibrio cholerae and found that the functional category distribution of genes differed between the two clusters. With a large number of sequenced genomes now available, we performed a systematic investigation of the A2-T2 distribution and found that 2694 out of 2764 prokaryotic genomes have an optimal clustering number of two, indicating a consistent pattern. Analysis of the functional categories of the coding genes in each cluster in 1483 prokaryotic genomes indicated that 99.33% of the genomes exhibited a significant difference (p < 0.01) in function distribution between the two clusters. Specifically, functional category P was overrepresented in the small cluster in 98.65% of genomes, whereas categories J, K, and L were overrepresented in the large cluster in over 98.52% of genomes. Lineage analysis showed that these preferences appear consistently across all phyla. Overall, our work revealed an almost universal clustering pattern based on the relative frequency of A2 versus T2 and its role in functional category preference. These findings should improve understanding of the rationale for theoretically predicting the functional classes of genes from their nucleotide sequences and of how protein function is determined by DNA sequence.
Keywords: A2 versus T2; Base frequency; Protein function preference; The second codon position; Two unequal clusters
Year: 2021 PMID: 34817803 PMCID: PMC9124167 DOI: 10.1007/s12539-021-00493-w
Source DB: PubMed Journal: Interdiscip Sci ISSN: 1867-1462 Impact factor: 3.492
The protein coding genes of 3799 genomes collected for analysis
| Domains | Genomes | Detail |
|---|---|---|
| Prokaryotes | 2764 | 164 archaea and 2600 bacteria |
| Eukaryotes | 1035 | 68 metazoa, 186 protists, 735 fungi, 44 plants, 1 |
| All genomes | 3799 | |
The 26 functional categories can be grouped into four super-categories
| Super-category | Number | Code letter |
|---|---|---|
| Information storage and processing | 5 | J, K, L, A, B |
| Cellular processes and signaling | 11 | D, Y, V, T, M, N, Z, W, U, O, X |
| Metabolism | 8 | C, G, E, F, H, I, P, Q |
| Poorly characterized | 2 | R, S |
Fig. 1 Coding genes are divided into two unequal clusters by the base frequencies of A and T at the second position of codons. Scatter plots of 12 representative genomes from the three domains, with f(A2) on the x axis and f(T2) on the y axis, both ranging from 0 to 0.7. The clustering phenomenon in archaea and bacteria is significant: a small cluster with much higher f(T2) and a large cluster with similar f(T2) and f(A2). This phenomenon was not significant in eukaryotes.
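The two axes of Fig. 1 are per-gene base frequencies at the second codon position. The following is a minimal sketch of how f(A2) and f(T2) could be computed for one coding sequence. The function name `second_position_frequencies` is hypothetical, and the handling of ambiguous bases, partial codons, and stop codons is an assumption, since this record does not describe the authors' preprocessing.

```python
from collections import Counter


def second_position_frequencies(cds: str) -> dict:
    """Return the frequency of A, C, G and T at the second codon position.

    Assumptions (not taken from the source): the sequence is in frame,
    a trailing partial codon is ignored, and codons whose second base is
    ambiguous (e.g. N) are skipped.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # drop a trailing partial codon
    second_bases = seq[1:usable:3]        # bases at codon positions 2, 5, 8, ...
    counts = Counter(b for b in second_bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous codons")
    return {base: counts[base] / total for base in "ACGT"}


# Toy sequence with codons ATG GCT TTT AAA (second bases: T, C, T, A).
freqs = second_position_frequencies("ATGGCTTTTAAA")
f_a2, f_t2 = freqs["A"], freqs["T"]
print(f_a2, f_t2)
```

Each gene then becomes one point (f(A2), f(T2)) in the scatter plot, and a genome is the cloud of such points.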
Fig. 2 The best choice of cluster number is 2. A Taking all three domains as a whole, 78.63% of genomes had an optimal K of 2. B The overall distribution of optimal K values indicates that two clusters are the best choice for prokaryotes and for some eukaryotes (Table S1).
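This record does not say which clustering algorithm or which criterion was used to select the optimal K. As an illustration only, the sketch below assumes k-means on the (f(A2), f(T2)) points and picks the K with the highest mean silhouette score. The helper names `kmeans`, `mean_silhouette`, and `best_k` are hypothetical, and the gene coordinates are synthetic rather than data from the paper.

```python
import math


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centre if a cluster ends up empty.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; higher means better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates=range(2, 6)):
    """Return the candidate K with the highest mean silhouette score."""
    return max(candidates, key=lambda k: mean_silhouette(points, kmeans(points, k)))


# Synthetic genes: a large cluster with f(A2) close to f(T2), and a small
# cluster with much higher f(T2), loosely mimicking the shape in Fig. 1.
large = [(0.30 + 0.01 * i, 0.30 + 0.01 * j) for i in range(5) for j in range(5)]
small = [(0.20 + 0.01 * i, 0.50 + 0.01 * j) for i in range(3) for j in range(3)]
genes = large + small
k_opt = best_k(genes)
print(k_opt)
```

Repeating such a selection per genome and tallying the chosen K values would give a distribution of the kind summarized in panels A and B.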
Chi-squared test results for 1483 genomes on the difference in protein function between the two unequal clusters
| Genome number | 8 | 1475 | 10 | 1473 |
| Frequency | 0.54% | 99.46% | 0.67% | 99.33% |
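The per-genome comparison behind this table is a chi-squared test on a contingency table of gene counts, with one row per cluster and one column per functional category. The sketch below computes the Pearson chi-squared statistic and its degrees of freedom for such a table. The function name `chi_squared_statistic` is hypothetical, and the counts are invented for illustration, not taken from the paper.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows (one per cluster), each a list of gene counts
    per functional category. Empty rows and columns are excluded from the
    degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    rows = sum(1 for t in row_totals if t > 0)
    cols = sum(1 for t in col_totals if t > 0)
    return statistic, (rows - 1) * (cols - 1)


# Hypothetical counts for two categories in one genome:
# rows = small cluster, large cluster; columns = category P, category J.
counts = [[30, 10],
          [20, 40]]
stat, dof = chi_squared_statistic(counts)
print(round(stat, 4), dof)
```

With one degree of freedom, the critical value at p = 0.01 is about 6.635, so a statistic above that would count as significant at the threshold quoted in the abstract. In practice a library routine such as `scipy.stats.chi2_contingency` returns the p-value directly.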
Fig. 3 The distribution and difference in COG functional categories in the two unequal clusters of 1483 genomes. A In two representative genomes in prokaryotes, P-related genes prevailed in the small cluster, while J-, K- and L-related genes were observed at a higher proportion in the large cluster. B Cumulative overrepresented genome numbers of 26 functional categories. The 26 functional categories are listed clockwise, beginning with A and ending with S, according to super-category: information genes, cellular processes and signaling, metabolism and poorly characterized genes. C Overrepresentation in large or small clusters for each functional category at the phylum level.