A systematic survey of mini-proteins in bacteria and archaea.

Fengyu Wang, Jingfa Xiao, Linlin Pan, Ming Yang, Guoqiang Zhang, Shouguang Jin, Jun Yu.

Abstract

BACKGROUND: Mini-proteins, defined as polypeptides containing no more than 100 amino acids, are ubiquitous in prokaryotes and eukaryotes. They play significant roles in various biological processes, and their regulatory functions are gradually attracting the attention of scientists. However, the functions of the majority of mini-proteins are still largely unknown due to the constraints of experimental methods and bioinformatic analysis.
METHODOLOGY/PRINCIPAL FINDINGS: In this article, we extracted a total of 180,879 mini-proteins from the annotations of 532 sequenced genomes, including 491 strains of Bacteria and 41 strains of Archaea. The average proportion of mini-proteins among all genomic proteins is approximately 10.99%, but different strains exhibit remarkable fluctuations. These mini-proteins display two notable characteristics. First, the majority are species-specific proteins, with an average proportion of 58.79% among six representative phyla. Second, an even larger proportion (70.03% among all strains) consists of hypothetical proteins. However, a fraction of highly conserved hypothetical proteins potentially play crucial roles in organisms. Among mini-proteins with known functions, regulatory and metabolic proteins seem to be more abundant than essential structural proteins. Furthermore, domains in mini-proteins seem to be more widely distributed in Bacteria than in Eukarya. Analysis of the evolutionary progression of these domains reveals that they have diverged to new patterns from a single ancestor.
CONCLUSIONS/SIGNIFICANCE: Mini-proteins are ubiquitous in bacterial and archaeal species and play significant roles in various functions. The number of mini-proteins in each genome displays remarkable fluctuation, likely resulting from the differential selective pressures that reflect the respective life-styles of the organisms. The answers to many questions surrounding mini-proteins remain elusive and need to be resolved experimentally.

Year:  2008        PMID: 19107199      PMCID: PMC2602986          DOI: 10.1371/journal.pone.0004027

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Mini-proteins are polypeptides consisting of no more than 100 amino acids (AA). They are widespread in both prokaryotes and eukaryotes and play important roles in a variety of functions. Mini-proteins usually contain a single domain. In prokaryotes, well-known mini-proteins include chaperonin Hsp10, translation initiation factor IF-1, ribosomal proteins and others. In eukaryotes, certain important signalling molecules, animal toxins and protease inhibitors belong to the mini-protein family [1]. James Kastenmayer reported that the Saccharomyces cerevisiae genome codes for 299 mini-proteins based on experimental approaches and computational analysis [2]. Some mini-proteins have been used as model systems to study the determinants of protein folding and stability because of their simple and typical structures [3], [4]. Moreover, some exhibit structural scaffolds valuable to the study of binding activities, the identification of frameworks for peptidomimetic design, or the search for novel drug candidates [5]. Besides their importance in structural studies, reports on the regulatory functions of mini-proteins have recently aroused extensive interest, especially in Bacteria. For instance, Wu et al. [6], [7] have elucidated the functions of two mini-proteins from Pseudomonas aeruginosa. These proteins are expressed in response to specific environmental stresses and actively participate in the suppression of the type III secretion system, achieving coordinated gene expression and thus playing a critical role in host infection. Within dormant spores of Bacillus, Clostridium and related species, a group of small, acid-soluble spore proteins (SASP) are the crucial factors enabling spores to survive for years, protecting spore DNA from damaging agents [8]. According to binding studies of peptides of various sizes, the minimal size of a functional epitope is around 8 AA, with an average size of 15–20 AA.
Therefore, a mini-protein as short as 8 AA is capable of binding targets and exhibiting biological functions. It is not surprising then that mini-proteins with sizes up to 100 AA can perform a variety of relevant functions and participate in the regulation of various biological processes. However, little effort has been made to explore their functions; instead, most research focuses on large proteins that are conserved and/or essential among organisms [9]. The characterization of mini-proteins presents difficulties for both experimental and bioinformatic approaches. Experimentally, mini-proteins are difficult to isolate and identify due to their small sizes; likewise, in bioinformatic analyses, short genes are the most difficult to predict. Therefore, to provide clues to their functions, it is necessary to conduct in-depth and systematic studies of mini-proteins. In this report, we analyzed all annotated protein sequences of ≤100 amino acids (AA) from 532 completed genomes, including 491 genomes of Bacteria and 41 genomes of Archaea, deposited in the Microbial Genome Database at the National Center for Biotechnology Information (NCBI) [10]. We focused our attention on three aspects: the component distribution of mini-proteins (including length, number, and conservation), the characteristics of mini-proteins in bacterial and archaeal species, and the possible reasons why they possess such characteristics. The results indicate that mini-proteins account for an average of 10.99% of all annotated sequences in Bacteria and Archaea, comprising numerous species-specific proteins and hypothetical proteins. The functions of very few mini-proteins are known, but these involve many important biological processes. Moreover, hypothetical mini-proteins contain a fraction of highly conserved sequences, indicating that they play important functional roles.

Results

Mini-protein length distribution

We downloaded the sequenced genome data of 532 prokaryotes, consisting of 491 strains of Bacteria and 41 strains of Archaea, from the National Center for Biotechnology Information (NCBI). A total of 180,879 annotated protein sequences with no more than 100 amino acids were extracted. The length distribution of these mini-proteins shows an increase in frequency for progressively longer sequences (Figure 1A). Mini-proteins with ≤30AA are the minority in all data, representing merely 1,897 sequences and accounting for 1.05% of all mini-proteins. The longest sequences, 90AAtransposase-like protein B (remnant) in Clostridium difficile 630. Proteins of 100 amino acids are the most abundant, with 4,092 sequences.
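The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the FASTA parser, the function names (`read_fasta`, `mini_proteins`, `length_histogram`) and the toy records are all assumptions made for the example. Only the 100 AA cut-off and the counts 1,897 and 180,879 come from the text.

```python
from collections import Counter

MAX_LEN = 100  # cut-off used in the text: no more than 100 amino acids


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mini_proteins(records, max_len=MAX_LEN):
    """Keep only the records whose sequence length is <= max_len."""
    return [(h, s) for h, s in records if len(s) <= max_len]


def length_histogram(minis):
    """Count how many mini-proteins exist at each length."""
    return Counter(len(s) for _, s in minis)


# Toy input (hypothetical records, for illustration only).
toy = """>p1 hypothetical protein
MKT
>p2 ribosomal protein
MKTAYIAKQR
QISFVKSHFS
>p3 long protein
""" + "A" * 150

minis = mini_proteins(read_fasta(toy))
hist = length_histogram(minis)

# Share of mini-proteins with <=30 AA, from the counts given in the text.
share_le30 = 1897 / 180879 * 100
print([h.split()[0] for h, _ in minis], dict(hist), round(share_le30, 2))
```

A histogram of this kind, built over all 532 genomes, is the sort of data that would underlie a plot like Figure 1A.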
Figure 1

A: Mini-protein length distribution. B: Distribution of all mini-proteins.


Mini-protein overview in phylum

The 532 sequenced genomes we collected from NCBI belong to species classified in 18 distinct phyla, 3 from Archaea and 15 from Bacteria. Four phyla were each represented by a single genome sequence, i.e., Nanoarchaeota from Archaea and Aquificae, Fusobacteria and Planctomycetes from Bacteria. Moreover, we treated the five classes of Proteobacteria (Alpha-, Beta-, Delta-, Epsilon- and Gamma-) as separate phyla, because they are represented by the largest number of genomes, with 258 strains accounting for nearly half of the sequenced genomes. As shown in Table 1, the overall proportion of mini-proteins among all annotated genomic proteins is 10.99%. Planctomycetes has the highest proportion of mini-proteins, 26.54% (1,944 sequences). In contrast, Aquificae has the lowest, encoding merely 48 mini-proteins in the whole genome (3.08%). However, these two phyla contain only one genome each, Rhodopirellula baltica and Aquifex aeolicus, respectively. Apart from these two extremes, other phyla encode similar proportions of mini-proteins, although greater variability is observed when considering individual strains. For instance, the Alphaproteobacterium Anaplasma phagocytophilum contains 33.39% mini-proteins, more than any other genome. At the other extreme, in the genome of Clostridium tetani (Firmicutes), a human pathogen causing tetanus, there are no mini-proteins annotated except for 4 sequences on its plasmid. Genomes from nine other species of Clostridium have been sequenced. In sharp contrast to C. tetani, these nine genomes contain a normal proportion of mini-proteins, ranging from 8.27% to 14.25%. Moreover, similar average proportions of mini-proteins, 11.28%, 11.30% and 9.33%, respectively, are annotated in the genomes of the three archaeal phyla.
Table 1

Overview of mini-proteins in phylum.

Domain | Phylum | Sum | Average% | Range% | Average Length | Minimum Length | Organism Number | Average Sum
Archaea | Crenarchaeota | 2953 | 11.28 | 8.36–18.23 | 77 | 18 | 12 | 246
Archaea | Euryarchaeota | 7642 | 11.30 | 7.83–15.00 | 76 | 16 | 28 | 273
Archaea | Nanoarchaeota | 50 | 9.33 | 9.33 | 81 | 54 | 1 | 50
Bacteria | Acidobacteria | 696 | 5.58 | 5.35–5.80 | 84 | 37 | 2 | 348
Bacteria | Actinobacteria | 14195 | 8.01 | 4.53–15.88 | 76 | 10 | 42 | 338
Bacteria | Aquificae | 48 | 3.08 | 3.08 | 80 | 47 | 1 | 48
Bacteria | Bacteroidetes | 3582 | 8.89 | 5.51–13.11 | 74 | 27 | 11 | 326
Bacteria | Chlamydiae | 1296 | 9.98 | 6.22–17.73 | 73 | 30 | 11 | 118
Bacteria | Chlorobi | 1295 | 11.79 | 6.79–21.80 | 72 | 30 | 5 | 259
Bacteria | Chloroflexi | 979 | 13.29 | 6.31–18.16 | 74 | 27 | 4 | 245
Bacteria | Cyanobacteria | 12057 | 17.31 | 7.80–30.83 | 71 | 15 | 26 | 464
Bacteria | Firmicutes | 36465 | 12.60 | 0.16–25.11 | 70 | 6 | 113 | 323
Bacteria | Fusobacteria | 232 | 11.22 | 11.22 | 72 | 20 | 1 | 232
Bacteria | Planctomycetes | 1944 | 26.54 | 26.54 | 66 | 35 | 1 | 1944
Bacteria | Alphaproteobacteria | 21246 | 11.03 | 5.09–33.39 | 74 | 20 | 65 | 327
Bacteria | Betaproteobacteria | 21347 | 10.02 | 5.02–24.49 | 73 | 13 | 44 | 485
Bacteria | Deltaproteobacteria | 5664 | 10.12 | 1.80–18.99 | 72 | 18 | 15 | 378
Bacteria | Epsilonproteobacteria | 2099 | 11.03 | 7.35–16.75 | 69 | 12 | 11 | 191
Bacteria | Gammaproteobacteria | 42626 | 9.75 | 4.58–26.91 | 73 | 9 | 123 | 347
Bacteria | Spirochaetes | 3225 | 13.63 | 5.49–28.71 | 61 | 14 | 9 | 358
Bacteria | Thermi | 761 | 7.46 | 5.75–9.29 | 78 | 11 | 4 | 190
Bacteria | Thermotogae | 477 | 8.62 | 7.28–9.28 | 73 | 30 | 3 | 159
Total | | 180879 | 10.99% | | 74 | | 532 | 346
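As a quick consistency check on Table 1, the per-phylum "Average Sum" column can be recomputed from "Sum" and "Organism Number". The sketch below assumes that Average Sum is simply Sum divided by Organism Number, rounded to the nearest integer. That reading is an inference from the numbers, not a definition stated in the text, and the function name `average_sum` is introduced here for illustration.

```python
# (Sum, Organism Number, reported Average Sum) for a few rows of Table 1.
rows = {
    "Crenarchaeota": (2953, 12, 246),
    "Firmicutes": (36465, 113, 323),
    "Gammaproteobacteria": (42626, 123, 347),
    "Planctomycetes": (1944, 1, 1944),
}


def average_sum(total, organisms):
    """Mean number of mini-proteins per genome, rounded to an integer."""
    return round(total / organisms)


for phylum, (total, organisms, reported) in rows.items():
    print(phylum, average_sum(total, organisms), reported)

# True when every recomputed value matches the reported column.
consistent = all(
    average_sum(total, organisms) == reported
    for total, organisms, reported in rows.values()
)
```

For the four rows shown, the recomputed values agree with the reported column.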

Specific and shared mini-proteins

To investigate conservation among mini-proteins, we took several representative phyla and determined the proportion of their mini-proteins that are specific or shared at each taxonomic level (species, genus, family, order, class, phylum and domain). Conservation was established by sequence similarity as determined by BLAST comparisons (see Table 2). Our criteria for the definition of specific vs. shared include the following: (i) Except for species-specific proteins, specificity at the other taxonomic levels must meet two conditions: the proteins must be restricted to a single taxon at that level, and they must also exist in all categories at the lower levels. For instance, if a query mini-protein belongs to a certain species and genus, and the results indicate that its homologs are present only in that genus and in all of its species, we call it a “genus-specific” protein. Similarly, if its homologs are found in other genera in the same family, then we name it “genus-shared”; (ii) Given that a genus might have only one sequenced species, a mini-protein named “species-specific” does not automatically become genus-specific. This rule also applies to other levels; (iii) Because of filtration by various parameters, the entries shown in the results are fewer than the number of mini-proteins used in the initial searches.
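The specific/shared criteria above can be made concrete with a small sketch. This is one possible reading of rules (i) and (ii), not the authors' implementation. The function `classify`, the helper `lineage` and the toy taxa are hypothetical. The sketch checks only the level immediately below the enclosing taxon, whereas the text may intend all lower levels. It also assumes that the list of sequenced genomes is known.

```python
LEVELS = ["species", "genus", "family", "order", "class", "phylum", "domain"]


def lineage(*names):
    """Build a lineage dict from names ordered as in LEVELS."""
    return dict(zip(LEVELS, names))


def classify(query, hits, genomes):
    """Label a mini-protein as '<level>-specific' or '<level>-shared'.

    query   -- lineage of the genome encoding the query protein
    hits    -- lineages of the genomes in which homologs were found
    genomes -- lineages of all sequenced genomes (the search space)
    """
    lineages = [query] + list(hits)
    for i, level in enumerate(LEVELS):
        # Find the narrowest taxon that encloses the query and every hit.
        if all(l[level] == query[level] for l in lineages):
            if i == 0:
                return "species-specific"
            child = LEVELS[i - 1]
            expected = {g[child] for g in genomes if g[level] == query[level]}
            found = {l[child] for l in lineages}
            # Present in every child taxon: specific at this level.
            # Otherwise it is merely shared beyond the child level.
            return f"{level}-specific" if found == expected else f"{child}-shared"
    # Homologs in both Archaea and Bacteria.
    return "domain-shared"


# Toy taxonomy (hypothetical names).
s1 = lineage("s1", "g1", "f1", "o1", "c1", "p1", "Bacteria")
s2 = lineage("s2", "g1", "f1", "o1", "c1", "p1", "Bacteria")
s3 = lineage("s3", "g2", "f1", "o1", "c1", "p1", "Bacteria")
s4 = lineage("s4", "g3", "f1", "o1", "c1", "p1", "Bacteria")
a1 = lineage("a1", "ga", "fa", "oa", "ca", "pa", "Archaea")
genomes = [s1, s2, s3, s4, a1]

print(classify(s1, [], genomes))
print(classify(s1, [s2], genomes))
print(classify(s1, [s3], genomes))
print(classify(s3, [], genomes))
print(classify(s1, [a1], genomes))
```

In this toy setting, a protein from `s3`, the only sequenced species of its genus, stays "species-specific" rather than being promoted to genus-specific, which mirrors rule (ii).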
Table 2

Specific or shared mini-proteins in phyla.

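Mini-Proteins Categories | Euryarchaeota | % | Actinobacteria | % | Cyanobacteria | % | Firmicutes | % | Gammaproteobacteria | % | Spirochaetes | %
Sum | 7643 | | 14195 | | 12057 | | 36465 | | 42626 | | 3225 |
Blast-result | 7638 | | 14184 | | 12055 | | 36383 | | 42534 | | 3212 |
Species-specific | 4854 | 63.55 | 7551 | 53.24 | 7245 | 60.10 | 20304 | 55.81 | 18711 | 43.99 | 2443 | 76.06
Species-shared | 238 | 3.12 | 2619 | 18.46 | 907 | 7.52 | 5930 | 16.30 | 4684 | 11.01 | 69 | 2.15
Genus-specific | 431 | 5.64 | 661 | 4.66 | | | 2354 | 6.47 | 1965 | 4.62 | 524 | 16.31
Genus-shared | 413 | 5.41 | | | 23 | 0.19 | 945 | 2.60 | 5371 | 12.63 | |
Family-specific | 409 | 5.35 | 66 | 0.47 | 512 | 4.25 | 604 | 1.66 | 367 | 0.86 | 8 | 0.25
Family-shared | 39 | 0.51 | 1972 | 13.90 | | | 844 | 2.32 | 243 | 0.57 | 0 |
Order-specific | 237 | 3.10 | | | | | 127 | 0.35 | 28 | 0.07 | 2 | 0.06
Order-shared | | | | | 1941 | 16.10 | 51 | 0.14 | 3938 | 9.26 | |
Class-specific | 47 | 0.62 | 206 | 1.45 | 194 | 1.61 | 878 | 2.41 | | | |
Class-shared | 487 | 6.38 | | | | | 972 | 2.67 | 3951 | 9.29 | |
Phylum-specific | 5 | 0.07 | | | 542 | 4.50 | 15 | 0.04 | | | |
Phylum-shared | 219 | 2.87 | 1087 | 7.66 | 663 | 5.50 | 3143 | 8.64 | 3180 | 7.48 | 163 | 5.07
Domain-specific | 26 | 0.34 | | | | | | | | | |
Domain-shared | 233 | 3.05 | 22 | 0.16 | 28 | 0.23 | 216 | 0.59 | 96 | 0.23 | 3 | 0.09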

Note: The averages of species-specific, phylum-shared and domain-shared are 58.79%, 6.20% and 0.73%, respectively. Because Actinobacteria contains only one class, its class-specific mini-proteins are the same as its phylum-specific ones; similarly, Spirochaetes contains one class and one order, so order-specific there means phylum-specific. The hypothetical proteins account for 82.80%, 86.51%, 85.77%, 79.91%, 84.49% and 95.37% of the species-specific mini-proteins in Euryarchaeota, Actinobacteria, Cyanobacteria, Firmicutes, Gammaproteobacteria and Spirochaetes, respectively.
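The averages quoted in the note can be reproduced as unweighted means of the six per-phylum percentages in Table 2. The sketch below makes that arithmetic explicit. The variable names are chosen for this example, and treating the averages as unweighted means is an inference from the numbers rather than something the text states.

```python
from statistics import mean

# Percentages for Euryarchaeota, Actinobacteria, Cyanobacteria, Firmicutes,
# Gammaproteobacteria and Spirochaetes, in that order (from Table 2 and its note).
species_specific = [63.55, 53.24, 60.10, 55.81, 43.99, 76.06]
phylum_shared = [2.87, 7.66, 5.50, 8.64, 7.48, 5.07]
domain_shared = [3.05, 0.16, 0.23, 0.59, 0.23, 0.09]
hypothetical_among_specific = [82.80, 86.51, 85.77, 79.91, 84.49, 95.37]

avg_specific = mean(species_specific)
avg_phylum_shared = mean(phylum_shared)
avg_domain_shared = mean(domain_shared)
avg_hypothetical = mean(hypothetical_among_specific)

# Reported values: 58.79, 6.20, 0.73 and 85.81 (percent).
print(avg_specific, avg_phylum_shared, avg_domain_shared, avg_hypothetical)
```

The unweighted means agree with the reported figures to within rounding, which suggests that each phylum contributes equally regardless of how many genomes it contains.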

From Table 2, it is clear that species-specific mini-proteins are the majority in all of the phyla (the average proportion is 58.79%), suggesting that these proteins potentially take on some unique functions that contribute to the adaptation of organisms to different habitats. However, 85.81% of them are annotated as “hypothetical protein” and the authenticity of their existence has not been confirmed. In contrast, shared or conserved proteins account for a small fraction, with 6.20% phylum-shared and 0.73% domain-shared (conserved in both Archaea and Bacteria). It is noteworthy that Firmicutes contains a larger proportion of these shared mini-proteins than any other bacterial phylum. In addition, although the proportion of hypothetical proteins is low among the conserved proteins, some hypothetical proteins are phylum-shared or domain-shared. Most phylum-shared proteins, however, are well characterized: the phylum-shared class includes various ribosomal proteins, cold-shock protein and translation initiation factor IF-1, and the domain-shared class includes rubredoxin, transcriptional regulator and gas vesicle protein (see Table 3 for a complete list).
Table 3

Phylum-shared and domain-shared mini-proteins in phyla.

Phylum-shared
Euryarchaeota (sum 219): hypothetical protein 39; ribosomal protein 135; enzyme or subunit 26; other 19
Actinobacteria (sum 1087): hypothetical protein 190; ribosomal protein 482; enzyme or subunit 91; cold-shock protein 82; redoxin 58; translation initiation factor IF-1 37; 10 KD chaperonin 22; other 125
Cyanobacteria (sum 663): hypothetical protein 216; ribosomal protein 214; enzyme or subunit 52; redoxin 29; acyl carrier protein 24; translation initiation factor IF-1 22; S4-like RNA binding protein 15; other 91
Firmicutes (sum 3143): hypothetical protein 541; ribosomal protein 1470; phage protein 231; cold-shock protein 179; DNA-binding protein 110; translation initiation factor IF-1 100; enzyme or subunit 90; redoxin 63; acyl carrier protein 45; sporulation protein S 38; other 276
Gammaproteobacteria (sum 3180): hypothetical protein 602; ribosomal protein 1132; cold-shock protein 387; enzyme or subunit 219; redoxin 175; DNA-binding protein 144; translation initiation factor IF-1 112; acyl carrier protein 105; other 304
Spirochaetes (sum 163): hypothetical protein 33; ribosomal protein 83; translation initiation factor IF-1 10; acyl carrier protein 7; carbon storage regulator 5; GroES chaperone 4; other 21

Domain-shared
Euryarchaeota (sum 223): hypothetical protein 99; enzyme or subunit 34; redoxin 28; transcriptional regulator 14; gas vesicle protein 11; other 47
Actinobacteria (sum 22): hypothetical protein 7; redoxin 5; transcriptional regulator 4; YHS domain protein 3; other 3
Cyanobacteria (sum 28): hypothetical protein 12; gas vesicle protein 8; rubredoxin 4; enzyme or subunit 4
Firmicutes (sum 216): hypothetical protein 78; transcriptional regulator 49; DNA-binding protein 32; redoxin 14; enzyme or subunit 13; other 30
Gammaproteobacteria (sum 96): hypothetical protein 21; rubredoxin 32; transcriptional regulator 15; cation transport regulator 7; other 21
Spirochaetes (sum 3): hypothetical protein 1; rubredoxin 2

Conservation of hypothetical proteins

The aforementioned results show that hypothetical proteins account for a large proportion of mini-proteins, even among the conserved phylum-shared and domain-shared ones. In fact, about 70.03% of mini-proteins (126,670) are designated as hypothetical proteins, while merely 29.97% (54,209) possess functional or structural annotations. Moreover, 25,394 mini-proteins have been classified in the COG (Clusters of Orthologous Groups) database [11], and approximately 17.81% of them are of unknown function (see Figure 2 for details). We further focused on these hypothetical proteins (also including uncharacterized proteins and proteins of unknown function, here together referred to as “hypothetical proteins”) to search for more conserved mini-proteins for better classification. From each genus in all phyla (Table 1), we selected the strain containing the most mini-proteins as a representative and analyzed the conservation of its mini-proteins among all data (see Materials and Methods). We then picked out the mini-proteins whose homologous proteins are present in at least five of the phyla. As before, the five classes of Proteobacteria were treated as distinct phyla.
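The two numerical steps in this paragraph, the hypothetical-protein shares and the "at least five phyla" filter, can be illustrated as follows. This is a sketch under stated assumptions: `percent`, `widely_conserved` and the toy protein names are hypothetical, and the homolog search itself (BLAST in the paper) is represented only by its outcome, a list of phyla per query protein.

```python
TOTAL_MINI = 180879    # all mini-proteins (from the text)
HYPOTHETICAL = 126670  # designated as hypothetical proteins
ANNOTATED = 54209      # with functional or structural annotation


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(part / whole * 100, 2)


def widely_conserved(homolog_phyla, min_phyla=5):
    """Return proteins whose homologs occur in at least `min_phyla` phyla.

    homolog_phyla maps a protein id to the phyla of its homologs; the
    Proteobacteria classes are listed as separate 'phyla', as in the text.
    """
    return sorted(
        protein
        for protein, phyla in homolog_phyla.items()
        if len(set(phyla)) >= min_phyla
    )


# Toy search results (hypothetical protein ids).
toy_hits = {
    "miniA": ["Firmicutes", "Cyanobacteria", "Alphaproteobacteria",
              "Betaproteobacteria", "Gammaproteobacteria"],
    "miniB": ["Firmicutes", "Firmicutes", "Chlorobi"],
}

print(percent(HYPOTHETICAL, TOTAL_MINI), percent(ANNOTATED, TOTAL_MINI))
print(widely_conserved(toy_hits))
```

Here `miniA` passes the filter because its homologs span five distinct phyla, while `miniB` does not, since repeated hits within Firmicutes count only once.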
Figure 2

Mini-proteins in COG.

As a result, we found many new groups of conserved mini-proteins and obtained 28 groups of proteins conforming to the above conditions. We then compiled the data and searched for their functional domains on the Pfam [12] or InterProScan [13] websites (see Table 4 for details). These 28 groups of mini-proteins can be divided into three types. The first type comprises well-studied mini-proteins with detailed functional and/or structural information, including groups 01–07 and groups 27–28. The second comprises mini-proteins with domains named DUF (Domain of Unknown Function) or UPF (Uncharacterized Protein Family). The third type is less conserved than the other two, and its domains are found only in Pfam-B, which supplements the database's principal body (Pfam-A) and contains small families of proteins. The conservation of the mini-proteins represented in Table 4 is generally high, with a lowest similarity of 42% among these groups. Mini-proteins assembled in a group usually belong to the same domain of life, but groups 07, 18 and 20 include representatives from both Bacteria and Archaea. However, proteins of groups 18 and 20 are poorly characterized.
Table 4

Conservation analysis of hypothetical proteins in mini-proteins.

Serial Number | Domain Name | Blast Sum | Phylum | Positives | Identity | Annotation | Domain Description
group01 | BMC | 103 | Acidobacteria, Actinobacteria, Cyanobacteria, Firmicutes, Fusobacteria, Planctomycetes, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria | 66.40% | 7.40% | microcompartments protein; carboxysome shell protein; propanediol/ethanolamine utilization protein | Bacterial microcompartments are primitive organelles composed entirely of protein subunits. The microcompartment is the carboxysome, a protein shell for sequestering carbon fixation reactions.
group02 | PAAR-motif | 16 | Acidobacteria, Bacteroidetes, Chloroflexi, Cyanobacteria, Planctomycetes, Alphaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria | 86.40% | 24.30% | PAAR repeat-containing protein; hypothetical protein | This motif is found usually in pairs in a family of bacterial membrane proteins. It is also found as a triplet of tandem repeats comprising the entire length in another family of hypothetical proteins.
group03 | Plasmid-killer | 20 | Actinobacteria, Cyanobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 74.20% | 22.70% | plasmid maintenance system killer protein; hypothetical protein | Several plasmids with proteic killer gene systems have been reported. All of them encode a stable toxin and an unstable antidote. The activation of those systems result in cell filamentation and cessation of viable cell production.
group04 | Plasmid-Txe | 40 | Acidobacteria, Actinobacteria, Chloroflexi, Cyanobacteria, Firmicutes, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria, Spirochaetes | 70.80% | 5.70% | addiction module toxin (Txe/YoeB family); hypothetical protein | The plasmid encoded Axe-Txe proteins act as an antitoxin-toxin pair.
group05 | RHH-2 | 9 | Actinobacteria, Cyanobacteria, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria | 74.20% | 28.10% | putative transcriptional regulators (CopG/Arc/MetJ family); hypothetical protein | This family of proteins is about 80 amino acids in length and their function is unknown. The proteins contain a conserved GRY motif. This family appears to be related to ribbon-helix-helix DNA-binding proteins.
group06 | YcfA-like | 11 | Firmicutes, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Spirochaetes | 60.70% | 17.90% | YcfA-like protein; hypothetical protein | This family is similar to the YcfA protein expressed by E. coli. Most of these proteins are hypothetical proteins of unknown function.
group07 | zf-UBP | 12 | Acidobacteria, Actinobacteria, Cyanobacteria, Euryarchaeota, Deltaproteobacteria, Gammaproteobacteria | 71.20% | 26.00% | putative Zn-finger domain; hypothetical protein | Zn-finger in ubiquitin-hydrolases and other protein
group08 | DUF37 | 144 | Acidobacteria, Actinobacteria, Bacteroidetes, Chlorobi, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria, Spirochaetes, Thermi, Thermotogae | 46.50% | 7.10% | alpha-hemolysin; protein of unknown function DUF37; hypothetical protein | This domain is found in short (75 amino acid) hypothetical proteins from various bacteria. The domain contains three conserved cysteine residues.
group09 | DUF196 | 13 | Chlorobi, Firmicutes, Alphaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 86.60% | 13.40% | CRISPR-associated protein; protein of unknown function DUF196; hypothetical protein | This domain describes proteins of unknown function. Trm112p-like protein; The bacterial members are about 60–70 amino acids in length and the eukaryotic examples are about 120 amino acids in length. The C terminus contains the strongest conservation.
group10 | DUF343 | 132 | Actinobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria | 49.50% | 5.30% | tetraacyldisaccharide-1-P 4-kinase; protein of unknown function DUF343; hypothetical protein |
group11 | DUF370 | 46 | Firmicutes, Cyanobacteria, Thermotogae, Chloroflexi, Deltaproteobacteria | 81.40% | 20.60% | protein of unknown function DUF370; hypothetical protein | Domain of unknown function
group12 | DUF427 | 17 | Actinobacteria, Bacteroidetes, Chloroflexi, Cyanobacteria, Thermi, Betaproteobacteria, Gammaproteobacteria | 85.40% | 27.10% | protein of unknown function DUF427; hypothetical protein | Domain of unknown function
group13 | DUF433 | 6 | Acidobacteria, Chlorobi, Chloroflexi, Cyanobacteria, Alphaproteobacteria | 70.20% | 19.00% | protein of unknown function DUF433; hypothetical protein | Domain of unknown function
group14 | DUF528 | 43 | Acidobacteria, Cyanobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 75.30% | 10.40% | accessory protein involved in assembly of Fe-S clusters; protein of unknown function DUF528; hypothetical protein | Domain of unknown function
group15 | DUF891 | 11 | Cyanobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 65.50% | 13.60% | protein of unknown function DUF891; hypothetical protein | This family consists of hypothetical bacterial proteins of unknown function as well as phage Gp49 proteins.
group16 | DUF1328 | 78 | Acidobacteria, Bacteroidetes, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 57.00% | 7.00% | putative inner membrane protein; hypothetical protein | This family consists of several hypothetical bacterial proteins of around 50 residues in length. The function of this family is unknown.
group17 | DUF1458 | 37 | Actinobacteria, Chlorobi, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria | 56.10% | 5.60% | protein of unknown function DUF1458; hypothetical protein | Members of this family are typically of around 70 residues in length. The function of this family is unknown.
group18 | UPF0150 | 14 | Acidobacteria, Chloroflexi, Cyanobacteria, Euryarchaeota, Firmicutes, Deltaproteobacteria | 70.70% | 6.10% | protein of unknown function UPF0150; hypothetical protein | This domain is found next to a DNA binding helix-turn-helix domain, which suggests that this is some kind of ligand binding domain.
group19 | pfam-B_840931 | | Chlorobi, Alphaproteobacteria, Betaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria, Thermi | 73.80% | 17.80% | predicted membrane protein; hypothetical protein |
group20 | pfam-B_1121327 | | Bacteroidetes, Euryarchaeota, Firmicutes, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria, Thermi | 63.50% | 22.40% | hypothetical protein |
group21 | pfam-B_2081327 | | Bacteroidetes, Cyanobacteria, Planctomycetes, Betaproteobacteria, Gammaproteobacteria | 50.00% | 13.40% | hypothetical protein |
group22 | pfam-B_2088515 | | Actinobacteria, Firmicutes, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria | 44.60% | 11.90% | uncharacterized conserved small protein like protein; hypothetical protein |
group23 | pfam-B_4995522 | | Bacteroidetes, Alphaproteobacteria, Betaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria | 64.30% | 27.10% | oxygen-sensitive ribonucleoside-triphosphate reductase; hypothetical protein |
group24 | pfam-B_1086295 | | Bacteroidetes, Acidobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria | 47.40% | 4.20% | hypothetical protein |
group25 | pfam-B_1393367 | | Chlorobi, Cyanobacteria, Alphaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 75.30% | 15.70% | hypothetical protein |
group26 | pfam-B_6607; pfam-B_9422 [1] | 35 | Bacteroidetes, Chlorobi, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria | 51.80% | 15.50% | hypothetical protein |
group27 | signal-peptide; transmembrane-regions | 43 | Acidobacteria, Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria | 55.10% | 10.10% | conserved hypothetical membrane protein; hypothetical protein |
group28 | TRASH; zf-HIT [2] | 31 | Cyanobacteria, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Thermi | 42.00% | 12.50% | zinc finger protein; hypothetical protein | TRASH: metallochaperone-like domain. zf-HIT: This presumed zinc finger contains up to 6 cysteine residues that could coordinate zinc.

Note: [1] Pfam-B_6607 and pfam-B_9422 are continuous; [2] Different domains are searched through different sequences.


Evolutionary analysis of domains

We further investigated the domains (or motifs and conserved regions) within the conserved hypothetical proteins in Table 4 as well as the phylum-shared and domain-shared proteins in Table 3, and observed four patterns in the process of their evolution (see Figure 3). We noticed that (i) these domains are highly conserved and widespread. Four domains, Plasmid-killer, Plasmid-Txe, RHH-2 and DUF370, were specific to Bacteria; other domains were conserved in Bacteria as well as in Viruses, Archaea and Eukarya. Except for Zf-UBP, which is mainly represented in eukaryotes, all other domains mainly exist in Bacteria. This suggests that the domains in mini-proteins are more likely to contribute to bacterial species than to eukaryotes; (ii) these domains seem to have evolved independently in mini-proteins, except for the PAAR-motif, which is often observed in tandem repeats. However, with the extension of protein lengths, the domains developed at least two patterns, except for those with independent evolution such as RHH-2, DUF37, DUF196, DUF370 and DUF528; (iii) independent domains seem to be more frequent than any of the other three patterns, whereas self-tandem is the major pattern for PAAR-motifs, and chimera with other domains is the major pattern for Zf-UBP and YHS domains.
Figure 3

Patterns of domains.

Dashed lines indicate that the domain has evolved into part of another domain or of a protein family's conserved region.

Domains represent the functional and evolutionary units of proteins, and almost all mini-proteins contain one domain. Results of our analysis indicate that individual domains evolve independently. Most domains develop new patterns during long-term evolution, although independent domains account for the majority in terms of number. In the course of evolution, proteins have a general tendency to fuse from single units into two-domain or multi-domain units, which may help proteins develop new functions. As shown in Figure 3, proteins in pattern 2 achieve functional integration through combination with different domains, which is a predominant route of protein evolution. Pattern 3 is also a relatively common route of protein evolution from single to multiple domains. The number of self-tandem domains is variable in proteins. For example, BMC (bacterial microcompartment) is always in tandem with two repeats, whereas the number of CSD (cold-shock domain) repeats is not fixed and can reach six. A typical example of independent evolution is DUF37, which originates from group 08 in Table 4. This group of mini-proteins contains the largest number of retrieved sequences and covers all phyla of Bacteria, with 144 sequences in total from the 15 phyla. The majority of them are hypothetical proteins or proteins of unknown function, except for 4 proteins annotated as alpha-hemolysin, a bacterial toxin that can assemble a transmembrane pore. In the InterPro database [14], we detected 653 proteins possessing this domain, including one sequence in a virus, 9 sequences in green plants and 643 sequences in Bacteria. Also, these proteins contain no other domains, which suggests that DUF37 evolved independently. In addition, many domains consist of at least two patterns.
A good example is BMC within group 01 of mini-proteins, which involves 103 sequences from 11 phyla. We found that 843 proteins contain this domain in the InterPro database and summarized its evolutionary patterns. From Figure 4A, we can see that besides the independent domain (62.51%), BMC has developed two other patterns: self-tandem (18.04%) and chimera with another domain (19.45%). In spite of the different patterns, the proteins still possess similar functions, which indicates that a single BMC domain is sufficient to exert its function, without requiring a tandem of two BMC domains. We further investigated its phylogeny and used Cyanobacteria as an example (Figure 4B). It is clearly observed that the self-tandem and chimera-with-other-domain patterns are divergent from the independent domain, because the BMC domains in pattern 3 or 4 and in pattern 2 or 5 form two independent clusters, respectively. Within pattern 3 or 4, the left and right domains each cluster separately. This implies that the existence of tandem domains may not be the result of domain duplication, but rather of the transfer of domains between proteins.
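The three broad categories discussed here (independent, self-tandem, chimera with other domains) can be expressed as a simple rule over a protein's ordered domain list. The sketch below is an illustrative simplification, not the authors' method. The functions `pattern` and `pattern_shares` and the placeholder domain "X" are hypothetical, and a protein that combines tandem BMC copies with another domain is counted as a chimera by assumption.

```python
from collections import Counter


def pattern(domains, target="BMC"):
    """Classify a protein's domain architecture relative to `target`."""
    copies = domains.count(target)
    if copies == 0:
        return "absent"
    if any(d != target for d in domains):
        return "chimera with other domain"
    return "independent" if copies == 1 else "self-tandem"


def pattern_shares(proteins, target="BMC"):
    """Percentage of each pattern among proteins that contain `target`."""
    counts = Counter(pattern(domains, target) for domains in proteins)
    counts.pop("absent", None)
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Toy architectures; "X" stands for any non-BMC domain.
toy_proteins = [["BMC"], ["BMC"], ["BMC", "BMC"], ["BMC", "X"], ["X"]]
shares = pattern_shares(toy_proteins)
print(shares)

# The three shares reported in the text cover all BMC-containing proteins.
reported_total = round(62.51 + 18.04 + 19.45, 2)
print(reported_total)
```

Applied to the 843 BMC-containing InterPro proteins, a tally of this kind would yield three shares comparable to those quoted above, which together sum to 100%.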
Figure 4

A: Patterns of BMC domain. Dashed lines indicate that the domain has evolved into part of another domain or of a protein family's conserved region. Green represents IPR011238 (polyhedral organelle shell protein PduT) and IPR009193 (polyhedral organelle shell protein, EutL/PduB type) in pattern 4; blue represents IPR009193 (polyhedral organelle shell protein, EutL/PduB type) and IPR009307 (ethanolamine utilization EutS) in pattern 5. They are all polyhedral organelle shell proteins. B: Phylogeny of BMC domain in Cyanobacteria. Letters L and R represent the left and right domains in pattern 3 or 4, respectively. Letter r represents the domains in pattern 2 or 5.


Discussion

Our study collected all annotated mini-protein sequences from sequenced genomes and carried out a comprehensive and systematic analysis; previously there had been only a few sporadic reports on the structural and functional analysis of mini-proteins [2]–[8]. We found that the number of mini-proteins gradually increases with their length in amino acids. In particular, mini-proteins in the range of 70AA15]. This is why we chose this length as the cut-off for proteins in our analysis. With regard to smaller proteins, it has been suggested that mini-proteins (40–50AA) can exhibit a well-defined three-dimensional structure through disulfide bridges, metal ion binding and specific hydrophobic interactions [4]. However, Samuel et al. reported that mini-proteins with just 20 amino acids can also adopt well-defined globular shapes [16]. Surprisingly, our analysis indicated that the number of mini-proteins ≤30AA is very low. It is possible that many small mini-proteins were filtered out during annotation, so that the actual number of mini-proteins is grossly underestimated. Our results indicate that mini-proteins are numerous, accounting on average for 10.99% of all genomic proteins in Bacteria and Archaea. Despite this large total, the distribution of mini-proteins varies remarkably among strains. For example, mini-proteins make up more than 30% of the proteins encoded by the genome in two strains of Prochlorococcus marinus (30.83% and 30.30%) and in Anaplasma phagocytophilum HZ (33.39%), which has the greatest percentage of mini-proteins of any genome. By contrast, Clostridium tetani, in Firmicutes, is a unique strain with no known mini-proteins encoded on its genome. Interestingly, both the maximum and the minimum belong to the bacterial domain. Consequently, the range of mini-protein content is much wider in Bacteria (0.16%–33.39%) than in Archaea (7.83%–18.23%).
Although the concrete biological significance is unknown, we speculate that this phenomenon may relate to the fact that the ecological conditions of bacterial species are more diverse and complicated than those of archaeal species, most of which live in constant but extreme environments [17], [18]. In addition, even among closely related species, the relative proportions of mini-proteins vary greatly. In Clostridium, C. tetani encodes no mini-proteins, whereas the other nine strains all do, at 8.27% to 14.25% of their total proteins. Species of this genus are ubiquitous in soils, aquatic sediments and the intestinal tracts of animals and humans; hence they display metabolic and biological diversity. Surprisingly, ferredoxin, ATP synthase subunit C and 50S ribosomal protein L27 are shorter than 100 amino acids, and thus count as mini-proteins, in the nine strains, but in C. tetani they are 290AA, 333AA and 101AA long, respectively. It is plausible that even subtle changes in the environment may act as a selective pressure on mini-proteins, and that the differences among Clostridium strains result from multiple factors. However, it is difficult to determine which environmental factors affect the evolution of mini-proteins. Nonetheless, a few examples provide some clues. For instance, the aforementioned genus Anaplasma includes two sequenced species, A. phagocytophilum HZ (33.39% mini-proteins) and A. marginale str. St. Maries (7.59% mini-proteins). Both are obligate intracellular pathogens, but they inhabit granulocytes and erythrocytes, respectively [19], [20]; differences in the host intracellular environments might therefore account for the marked difference in the relative proportion of mini-proteins.
Moreover, the proportion of mini-proteins can differ even between isolates of the same species: two strains in Spirochaetes, Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130 and L. interrogans serovar Lai str. 56601, contain 10.74% and 28.71% mini-proteins, respectively. We speculate that this difference may reflect the different selection pressures to which the two strains are exposed, which have produced different leptospiral serovars derived from structural heterogeneity in the carbohydrate component of lipopolysaccharides [21]. Our results reveal that one characteristic of the mini-protein data is that species-specific proteins predominate, whereas conserved proteins are the minority; this is probably the chief reason for the fluctuations in mini-protein content. Why are species-specific proteins so numerous? We suggest several possible reasons. First, mini-proteins help organisms adapt to diverse and distinctive ecological niches, so many of them are species-specific. In Bacteria particularly, some species live freely in various aqueous or terrestrial environments, while others are intracellular parasites or obligate or facultative parasites of animals and plants. Second, some mini-proteins are short remnants of longer genes that were present in early ancestors. Third, some proteins probably evolved too rapidly to retain recognizable homologues and intermediate sequences. Fourth, some similar proteins have been incorrectly annotated [22]. In fact, mini-proteins are very good candidates for species-specific proteins. On the one hand, the vast majority of mini-proteins contain a single domain, which lets them exert their functions simply and directly through protein-protein interactions or by binding DNA or RNA sequences.
On the other hand, since mini-proteins cost less to translate and fold, organisms use them to regulate relevant pathways and respond to subtle changes in the environment, which accords with the hypothesis that organisms tend to minimize the costs of protein biosynthesis [23]. Additionally, conserved proteins are fewer, but most of them are necessary for the survival of organisms, especially the phylum-shared and domain-shared ones. Another characteristic of the mini-protein data is that, although hypothetical proteins are the majority and proteins with known functions the minority, the functions of mini-proteins are diverse. As shown in Figure 2, mini-proteins are involved in broad functional classes, including information storage and processing, cellular processes and signalling, and metabolism. In fact, they are distributed across nearly all subclasses of these three classes, except for RNA processing and modification, nuclear structure, cytoskeleton and extracellular structures (data not shown). This result suggests that regulatory and metabolic proteins are more common than constitutive or structural proteins, which can also be observed clearly among phylum-shared and domain-shared proteins. As previously mentioned, some of the 299 mini-proteins in the S. cerevisiae genome are required for growth under genotoxic conditions, including exposure to hydroxyurea (HU), bleomycin and ultraviolet (UV) radiation, suggesting that they play important roles in the response to harsh environmental conditions [2]. Furthermore, the proportion of hypothetical proteins is very high, about 70.03%. This might be because (i) a great many mini-proteins are species-specific; (ii) some mini-proteins might be incorrectly annotated; and (iii) there are technical difficulties in identifying the functions of mini-proteins.
However, we discovered that even among hypothetical proteins there is still a fraction of conserved sequences, including conserved proteins at each taxonomic level and 28 groups of proteins spanning more than five phyla. These will help us annotate proteins correctly and further explore the function and evolution of mini-proteins, especially the highly conserved sequences listed in Table 4, which are more biologically significant and will be a focus of our future studies. Mini-proteins have received considerably less attention from researchers because of the constraints of experimental and bioinformatic approaches. Here, we investigated the annotated mini-proteins from sequenced genomes and discovered some general rules that could establish a foundation for further studies. However, the answers to many questions remain elusive and await resolution. These include (i) how to identify potentially more mini-proteins in various genomes; (ii) how to confirm the functions of identified mini-proteins; (iii) what the biological functions of the mini-proteins are; and (iv) what the driving forces for the evolution of the mini-proteins are.

Materials and Methods

We collected 532 completed prokaryotic genomes from the National Center for Biotechnology Information (NCBI), available as of 2 July 2007, and extracted all annotated protein sequences ≤100 amino acids from their chromosomes and plasmids, as the length of a single domain is usually below that cut-off value. Every strain was classified according to the NCBI taxonomy database. We first analyzed the overall length distribution of mini-proteins and described the main characteristics of each phylum. Then, to detect the specific or shared mini-proteins of six representative phyla, we carried out a BLASTP search of every mini-protein sequence in one phylum against all the mini-protein data we extracted. From the results, we recorded the matches for each protein sequence with an E-value lower than 10⁻⁵ and sequence identity higher than 60%, with low-complexity sequences filtered. In addition, to explore the conservation of mini-proteins across all phyla, we carried out BLASTP searches using mini-protein queries from a representative species of every genus against all mini-protein data, with the parameters described above. Amino acid sequence alignments were obtained with the ClustalX software [24]. For the domain analysis, we used the Pfam or InterPro websites; the protein sequences considered included all prokaryotic and eukaryotic data. Phylogenetic reconstructions were performed with the MEGA software [25] (bootstrapped neighbor-joining method).
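
The two filtering steps described above (keeping proteins of ≤100 amino acids, then keeping BLASTP matches with an E-value below 10⁻⁵ and identity above 60%) can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used in this study: the function names are hypothetical, the input is assumed to be protein FASTA text and BLAST tabular output with the default twelve columns (`-outfmt 6`), self-matches are dropped as an assumption, and the low-complexity filtering is assumed to be handled by BLAST itself rather than reproduced here.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

MAX_LEN = 100        # length cut-off in amino acids, from the text
MAX_EVALUE = 1e-5    # E-value threshold, from the text
MIN_IDENTITY = 60.0  # percent identity threshold, from the text


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Use the first word of the header line as the identifier.
            header = line[1:].split()[0] if line[1:].strip() else ""
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mini_proteins(records: Iterable[Tuple[str, str]], max_len: int = MAX_LEN) -> Dict[str, str]:
    """Keep non-empty sequences no longer than max_len residues."""
    return {name: seq for name, seq in records if 0 < len(seq) <= max_len}


def filter_hits(tab_lines: Iterable[str],
                max_evalue: float = MAX_EVALUE,
                min_identity: float = MIN_IDENTITY) -> List[Tuple[str, str, float, float]]:
    """Keep tabular BLAST hits that pass both thresholds, skipping self-matches."""
    kept = []
    for line in tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if query != subject and evalue < max_evalue and identity > min_identity:
            kept.append((query, subject, identity, evalue))
    return kept
```

A mini-protein with no retained hit outside its own species would then be a candidate species-specific protein, while hits spanning several phyla would mark conserved groups; assigning hits to taxa requires a separate identifier-to-taxonomy table that is not shown.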
  21 in total

Review 1.  Uniquely folded mini-protein motifs.

Authors:  B Imperiali; J J Ottesen
Journal:  J Pept Res       Date:  1999-09

2.  Multiple sequence alignment in parallel on a workstation cluster.

Authors:  Justin Ebedes; Amitava Datta
Journal:  Bioinformatics       Date:  2004-02-05       Impact factor: 6.937

3.  MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment.

Authors:  Sudhir Kumar; Koichiro Tamura; Masatoshi Nei
Journal:  Brief Bioinform       Date:  2004-06       Impact factor: 11.622

4.  Complete genome sequencing of Anaplasma marginale reveals that the surface is skewed to two superfamilies of outer membrane proteins.

Authors:  Kelly A Brayton; Lowell S Kappmeyer; David R Herndon; Michael J Dark; David L Tibbals; Guy H Palmer; Travis C McGuire; Donald P Knowles
Journal:  Proc Natl Acad Sci U S A       Date:  2004-12-23       Impact factor: 11.205

Review 5.  Brain peptides: what, where, and why?

Authors:  D T Krieger
Journal:  Science       Date:  1983-12-02       Impact factor: 47.728

6.  Comparative analysis of the LPS biosynthetic loci of the genetic subtypes of serovar Hardjo: Leptospira interrogans subtype Hardjoprajitno and Leptospira borgpetersenii subtype Hardjobovis.

Authors:  A de la Peña-Moctezuma; D M Bulach; T Kalambaheti; B Adler
Journal:  FEMS Microbiol Lett       Date:  1999-08-15       Impact factor: 2.742

7.  PtrB of Pseudomonas aeruginosa suppresses the type III secretion system under the stress of DNA damage.

Authors:  Weihui Wu; Shouguang Jin
Journal:  J Bacteriol       Date:  2005-09       Impact factor: 3.490

8.  Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae.

Authors:  James P Kastenmayer; Li Ni; Angela Chu; Lauren E Kitchen; Wei-Chun Au; Hui Yang; Carole D Carter; David Wheeler; Ronald W Davis; Jef D Boeke; Michael A Snyder; Munira A Basrai
Journal:  Genome Res       Date:  2006-03       Impact factor: 9.043

9.  Protein length in eukaryotic and prokaryotic proteomes.

Authors:  Luciano Brocchieri; Samuel Karlin
Journal:  Nucleic Acids Res       Date:  2005-06-10       Impact factor: 16.971

10.  Comparative genomics of emerging human ehrlichiosis agents.

Authors:  Julie C Dunning Hotopp; Mingqun Lin; Ramana Madupu; Jonathan Crabtree; Samuel V Angiuoli; Jonathan A Eisen; Jonathan Eisen; Rekha Seshadri; Qinghu Ren; Martin Wu; Teresa R Utterback; Shannon Smith; Matthew Lewis; Hoda Khouri; Chunbin Zhang; Hua Niu; Quan Lin; Norio Ohashi; Ning Zhi; William Nelson; Lauren M Brinkac; Robert J Dodson; M J Rosovitz; Jaideep Sundaram; Sean C Daugherty; Tanja Davidsen; Anthony S Durkin; Michelle Gwinn; Daniel H Haft; Jeremy D Selengut; Steven A Sullivan; Nikhat Zafar; Liwei Zhou; Faiza Benahmed; Heather Forberger; Rebecca Halpin; Stephanie Mulligan; Jeffrey Robinson; Owen White; Yasuko Rikihisa; Hervé Tettelin
Journal:  PLoS Genet       Date:  2006-02-17       Impact factor: 5.917
