Lars Juhl Jensen, Marie Skovgaard, Thomas Sicheritz-Pontén, Merete Kjaer Jørgensen, Christiane Lundegaard, Corinna Cavan Pedersen, Nanna Petersen, David Ussery.
Abstract
BACKGROUND: For most sequenced prokaryotic genomes, about a third of the annotated protein-coding genes are "orphan proteins", that is, they lack homology to known proteins. These hypothetical genes are typically short and randomly scattered throughout the genome. This trend is seen for most of the bacterial and archaeal genomes published to date.
Year: 2003 PMID: 12697059 PMCID: PMC156604 DOI: 10.1186/1471-2164-4-12
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Atlas of the entire genome. Properties are shown as colored concentric circles representing the chromosome. All properties have been smoothed by calculating 5,000 bp running averages.
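The running-average smoothing mentioned in the caption can be illustrated with a minimal sketch. It assumes one numeric value per base pair, a window centred on each position, and a circular chromosome so the window wraps around the ends; the function name `circular_running_average` is hypothetical, and the source does not state how the authors handled window placement or the chromosome ends.

```python
def circular_running_average(values, window):
    """Smooth per-position values with a sliding mean that wraps around.

    Assumption: the chromosome is circular, so positions near the ends
    borrow values from the opposite end. The window starts `window // 2`
    positions before the current one.
    """
    n = len(values)
    if n == 0 or window < 1:
        raise ValueError("need at least one value and a window of 1 or more")
    window = min(window, n)  # a window longer than the sequence is clamped
    half = window // 2

    # Sum of the window belonging to position 0.
    total = sum(values[(i - half) % n] for i in range(window))

    smoothed = []
    for pos in range(n):
        smoothed.append(total / window)
        # Slide one step: drop the leftmost value, add the next one.
        total -= values[(pos - half) % n]
        total += values[(pos - half + window) % n]
    return smoothed
```

For the atlas described above the call would use `window=5000`; a small window on a short list is enough to see the wrap-around behaviour.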
Figure 2. Comparison of protein length and codon usage for the unknown and known regions. (a) The length distributions of annotated proteins are visualized as Gaussian kernel density estimates. Based on the distributions we see no reason to suspect that the protein sequences from the two unknown regions are the result of random ORFs. (b) No differences in the relative usage of alternative codons for amino acids are observed between annotated CDSs from either of the two unknown regions and CDSs annotated in the known regions. This strongly indicates that the majority of the annotated novel genes in the unknown regions are true protein coding genes.
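As an illustration of panel (a), a Gaussian kernel density estimate places a normal curve on every observed protein length and averages them. The sketch below uses only the standard library; the function name `gaussian_kde` and the explicit `bandwidth` argument are assumptions for illustration, since the caption does not say which bandwidth or software the authors used.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples`.

    Each sample contributes a normal curve with standard deviation
    `bandwidth`; the estimate is the average of those curves.
    The bandwidth is chosen by the caller (an assumption here).
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

Evaluating the returned function over a grid of lengths for each region gives curves that can be overlaid and compared, as in the figure.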
Figure 3. Position of the patterns GGGGG and GGAGG relative to translation start. Despite GGAGG being the reverse complement of the 3' end of M. kandleri 16S rRNA, the pattern GGGGG is found to have a much stronger preference for being located just upstream of translation start.
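A positional profile like the one in this figure can be tallied from upstream sequences. The sketch below assumes each input string ends immediately before the start codon, so offset -1 is the base directly upstream of translation start; the function name `pattern_offsets` and the choice to count overlapping matches are illustrative assumptions, not details taken from the source.

```python
from collections import Counter


def pattern_offsets(upstream_seqs, pattern):
    """Count where `pattern` begins, relative to translation start.

    Assumption: every sequence ends right before the start codon, so a
    match beginning at index i in a sequence of length n gets offset i - n.
    Overlapping matches are counted separately.
    """
    pattern = pattern.upper()
    counts = Counter()
    for seq in upstream_seqs:
        seq = seq.upper()
        start = seq.find(pattern)
        while start != -1:
            counts[start - len(seq)] += 1
            start = seq.find(pattern, start + 1)
    return counts
```

Running it once with `"GGGGG"` and once with `"GGAGG"` over the same upstream sequences yields two offset histograms that can be compared position by position.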
Distribution of M. kandleri-specific protein families. Only the subset of the 45 protein families showing a preference for either region I or II is shown. The presence of specific protein families within each region suggests that the two regions serve different functions.
| Family | Region I | Region II | Known |
| MK-9 | 5 | - | - |
| MK-7 | 3 | - | 1 |
| MK-6 | 3 | - | 4 |
| MK-5 | 3 | 1 | 4 |
| MK-17 | 2 | - | - |
| MK-23 | 2 | - | - |
| MK-26 | 2 | - | - |
| MK-27 | 2 | - | - |
| MK-8 | 2 | - | 1 |
| MK-22 | 2 | - | 1 |
| MK-1 | 3 | 6 | 11 |
| MK-10 | - | 5 | - |
| MK-2 | - | 4 | 1 |
| MK-3 | - | 4 | 3 |
| MK-11 | - | 3 | - |
| MK-12 | 1 | 3 | 1 |
| MK-14 | - | 3 | 2 |
| MK-37 | - | 3 | 3 |
| MK-31 | - | 2 | - |
| MK-34 | - | 2 | 1 |
| MK-28 | - | 2 | 2 |
Figure 4. Atlases of the two unknown regions. The properties are visualized as in Figure 1 except that a smoothing window of 1,000 bp was applied to all parameters.
Figure 5. Amino acid compositional biases in the unknown regions. The bias of each amino acid is represented by a bar with length proportional to the log-ratio between its amino acid frequency within one of the unknown regions and its frequency within known regions. Regions I and II are represented by gray and black bars respectively. The amino acids have been sorted by their codon AT-content, which is both listed and visualized as bars.
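The log-ratio described in this caption can be sketched as follows. The function name `aa_log_ratios` is hypothetical, and the use of base-2 logarithms is an assumption, since the caption does not state the base; amino acids missing from either set are skipped because their ratio is undefined.

```python
import math
from collections import Counter


def aa_log_ratios(region_proteins, known_proteins):
    """Log-ratio of amino acid frequencies: region versus known.

    Positive values mean the amino acid is over-represented in the
    region. Base 2 is an assumption; the source does not give the base.
    """
    region = Counter("".join(region_proteins))
    known = Counter("".join(known_proteins))
    region_total = sum(region.values())
    known_total = sum(known.values())

    ratios = {}
    # Only amino acids seen in both sets have a defined log-ratio.
    for aa in sorted(set(region) & set(known)):
        region_freq = region[aa] / region_total
        known_freq = known[aa] / known_total
        ratios[aa] = math.log2(region_freq / known_freq)
    return ratios
```

Calling it once per unknown region against the same set of known-region proteins gives the two series of bars the caption describes.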
Fraction of proteins assignable to cellular role categories. The estimated number of genes in each genome is compared to the number of genes for which a cellular role could be assigned using EUCLID.
| Organism | No. genes estimated | No. genes assigned | % of estimate |
| | 1,477 | 653 | 44 |
| | 1,376 | 684 | 50 |
| | 1,706 | 867 | 51 |
| | 2,035 | 1,045 | 51 |
| | 2,288 | 1,186 | 52 |
| | 2,686 | 1,420 | 53 |
| | 3,456 | 1,850 | 54 |
| | 1,683 | 911 | 54 |
| | 1,448 | 786 | 54 |
| | 1,573 | 895 | 57 |
| | 1,818 | 1,074 | 58 |
| | 1,350 | 781 | 58 |
| | 1,497 | 855 | 58 |
| | 1,466 | 867 | 60 |
| | 1,250 | 783 | 63 |
| | 1,243 | 792 | 64 |