Daniel Straub, Stephan Wenkel.
Abstract
MicroProteins are small single-domain proteins that act by engaging their targets in different, sometimes nonproductive, protein complexes. To identify novel microProteins in any sequenced genome of interest, we have developed miPFinder, a program that identifies and classifies potential microProteins. In recent years, several microProteins have been discovered in plants, where they are mainly involved in regulating development by fine-tuning transcription factor activities. The miPFinder algorithm identifies all plant microProteins known to date and extends the microProtein concept beyond transcription factors to other protein families. Here, we reveal potential microProtein candidates in several plant and animal reference genomes. A large number of these microProteins are species-specific, while others evolved early and are evolutionarily highly conserved. Most known microProtein genes originated from large ancestral genes by gene duplication, mutation and subsequent degradation. Gene ontology analysis shows that putative microProtein ancestors are often located in the nucleus and are involved in DNA binding and the formation of protein complexes. Additionally, microProtein candidates act in plant transcriptional regulation, signal transduction and anatomical structure development. MiPFinder is freely available to find microProteins in any genome and will aid in the identification of novel microProteins in plants and animals.
Keywords: metazoa; miPFinder; microProteins; plants; protein–protein interaction
Year: 2017 PMID: 28338802 PMCID: PMC5381583 DOI: 10.1093/gbe/evx041
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Flow chart of miPFinder. Mandatory steps have a light gray background. Orange, databases; green, data packages; gray, tools; blue, lists; white, custom functions.
Overview of miPFinder Results
| Species | Protein sequences: Original | Protein sequences: Filtered | Filtered % | Candidate families: Total | Trans-miP | % | PPID | % | PPI | % | ≥50% ≥250aa | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 35386 | 35364 | 99.94 | 551 | 531 | 96 | 193 | 35 | 80 | 15 | 328 | 60 |
| | 34727 | 34415 | 99.10 | 1767 | 1767 | 100 | 419 | 24 | 140 | 8 | 1344 | 76 |
| | 51472 | 50631 | 98.37 | 2772 | 1422 | 51 | 587 | 21 | 106 | 4 | 2011 | 73 |
| | 47205 | 46544 | 98.60 | 866 | 861 | 99 | 206 | 24 | n.d. | n.d. | 557 | 64 |
| | 52424 | 52417 | 99.99 | 1661 | 1578 | 95 | 305 | 18 | 62 | 4 | 1090 | 66 |
| | 88760 | 80694 | 90.91 | 5132 | 3673 | 72 | 1007 | 20 | 195 | 4 | 3688 | 72 |
| | 101933 | 48542 | 47.62 | 1235 | 320 | 26 | 340 | 28 | 44 | 4 | 850 | 69 |
| | 56337 | 31983 | 56.77 | 526 | 221 | 42 | 186 | 35 | 48 | 9 | 346 | 66 |
| | 44487 | 32031 | 72.00 | 371 | 201 | 54 | 165 | 44 | 28 | 8 | 253 | 68 |
| | 30362 | 30152 | 99.31 | 218 | 128 | 59 | 74 | 34 | 24 | 11 | 124 | 57 |
| | 30939 | 30925 | 99.95 | 768 | 372 | 48 | 168 | 22 | 41 | 5 | 551 | 72 |
Coding sequence length is a multiple of 3 and the sequence contains a start and a stop codon; for H. sapiens and M. musculus, protein-coding sequences of GENCODE basic that are not flagged as lacking any transcription evidence were used.
MicroProtein candidates do not contain a cis-miP.
Percentage of total microProtein candidate families (column “Total”).
Sequences with annotated protein–protein interaction domain (PPID).
Protein–protein interaction of at least one microProtein candidate with at least one putative ancestor according to STRING data.
≥50% of related sequences are ≥250aa in length.
n.d., not determined.
Known MicroProteins Identified by miPFinder
| MicroProtein group members | Ancestor count | Known miPs | Rating | Min. E-value | cis-miP | % small | % medium | % large | Pfam | PPID | PPI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AT2G45450.1; AT3G60890.1; | 4 | ZPR3 | 147 | 3.9E-06 | no | 50 | 0 | 50 | bZIP transcription factor | Yes | Yes |
| | 125 | TCL1, TCL2, ETC1, ETC2, CPC, ETC3, TRY | 223 | 9.5E-27 | no | 8 | 16 | 76 | Myb-like DNA-binding domain | Yes | Yes |
| | 56 | PAR1, PAR2 | 182 | 1.3E-09 | no | 9 | 25 | 66 | Helix-loop-helix DNA-binding domain | Yes | Yes |
| | 10 | PRE3, PRE5, BNQ3, KDR, BNQ2, PRE1 | 147 | 9.3E-06 | no | 33 | 11 | 56 | Helix-loop-helix DNA-binding domain | Yes | Yes |
| | 22 | MIP1A, MIP1B | 229 | 2.0E-15 | no | 6 | 29 | 65 | B-box zinc finger | Yes | Yes |
| | 8 | MIF1, MIF2, MIF3 | 288 | 1.0E-31 | no | 18 | 35 | 47 | ZF-HD protein dimerization region | No | Yes |
| | 8 | KNATM | 174 | 5.0E-17 | no | 17 | 17 | 67 | KNOX2 domain | No | No |
Only one protein identifier per gene is shown. Gene identifiers of known microProteins are in italics.
Whether microProtein candidates contain cis-miPs.
Percent of related sequences (BLAST or hmmsearch) that are ≤140aa in length.
Percent of related sequences (BLAST or hmmsearch) that are 141–249aa in length.
Percent of related sequences (BLAST or hmmsearch) that are ≥250aa in length.
Pfam domain with the highest score.
Whether the Pfam domain is annotated as a protein–protein interaction domain.
Protein–protein interaction of at least one microProtein with at least one related large sequence according to STRING database.
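The length classes used in the footnotes above (small, ≤140 aa; medium, 141–249 aa; large, ≥250 aa) can be illustrated with a minimal Python sketch. The function names and the example lengths are hypothetical and chosen only for illustration; this is not miPFinder's own code, and the rounding may differ from the published tables.

```python
from collections import Counter


def length_class(length_aa):
    """Assign a sequence length (in amino acids) to a size class.

    Thresholds follow the table footnotes: <=140 small,
    141-249 medium, >=250 large.
    """
    if length_aa <= 140:
        return "small"
    if length_aa <= 249:
        return "medium"
    return "large"


def length_class_percentages(lengths):
    """Return the rounded percentage of sequences in each size class."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    counts = Counter(length_class(n) for n in lengths)
    total = len(lengths)
    return {
        name: round(100 * counts[name] / total)
        for name in ("small", "medium", "large")
    }


def majority_large(lengths):
    """True if at least 50% of the sequences are >=250 aa long."""
    large = sum(1 for n in lengths if n >= 250)
    return 2 * large >= len(lengths)


# Hypothetical lengths of sequences related to one candidate family.
example_lengths = [95, 120, 840, 852]
print(length_class_percentages(example_lengths))
print(majority_large(example_lengths))
```

With these made-up lengths, half of the sequences fall in the small class and half in the large class, so the family would also satisfy the "≥50% ≥250aa" criterion.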
Circos plot of individual microProtein candidates. Links indicate conservation between species based on OrthoFinder. Red, in all 11 species; dark blue, exclusively in all five metazoans; light blue, only in metazoans; dark green, exclusively in all six plants; light green, only in plants.
Conserved MicroProtein Candidates
| Species | miP candidates: Total | % of filtered | PPI | % of total | Excl. in metazoa: PRT | GRP | Excl. in all metazoa: PRT | GRP | Excl. in plants: PRT | GRP | Excl. in all plants: PRT | GRP | In all species: PRT | GRP | Total PRT | % of total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 1589 | 5 | 751 | 47 | | | | | 461 | 151 | 105 | 30 | 22 | 7 | 588 | 37 |
| | 4160 | 12 | 1399 | 34 | | | | | 1554 | 641 | 108 | 31 | 24 | 7 | 1686 | 41 |
| | 6215 | 12 | 1784 | 29 | | | | | 1874 | 902 | 116 | 32 | 21 | 7 | 2011 | 32 |
| | 1990 | 4 | 733 | 37 | | | | | 444 | 165 | 87 | 30 | 17 | 7 | 548 | 28 |
| | 3447 | 7 | 945 | 27 | | | | | 678 | 299 | 103 | 30 | 29 | 7 | 810 | 23 |
| | 10591 | 13 | 3315 | 31 | | | | | 2055 | 955 | 119 | 33 | 33 | 7 | 2207 | 21 |
| | 2841 | 6 | 1107 | 39 | 530 | 161 | 14 | 3 | | | | | 15 | 7 | 559 | 20 |
| | 1209 | 4 | 681 | 56 | 358 | 132 | 8 | 4 | | | | | 16 | 7 | 382 | 32 |
| | 907 | 3 | 576 | 64 | 203 | 81 | 5 | 3 | | | | | 14 | 7 | 222 | 24 |
| | 567 | 2 | 235 | 41 | 73 | 24 | 8 | 3 | | | | | 22 | 7 | 103 | 18 |
| | 1324 | 4 | 416 | 31 | 58 | 41 | 6 | 3 | | | | | 11 | 7 | 75 | 6 |
| total | 34840 | | 11942 | | 1222 | 439 | 41 | 16 | 7066 | 3113 | 638 | 186 | 224 | 77 | 9191 | |
Exclusively in the specified group but not conserved among all.
Percentage of total microProtein candidate sequences (column “Total”).
Percentage of filtered sequences (table 2, column “Filtered”).
Sequences with annotated protein–protein interaction domain.
PRT, number of protein sequences; GRP, number of miP candidate families (groups).
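The columns of this table appear to be linked by simple arithmetic: the "Total PRT" value matches the sum of the three PRT counts in a row, and the percentages match ratios against the "Total" column. This relation is inferred from the numbers, not stated in the source. The sketch below checks it for the first data row, read as Total = 1589, PPI = 751, exclusive to some plants = 461, exclusive to all plants = 105 and in all species = 22; the function and key names are hypothetical.

```python
def conserved_total(excl_some, excl_all, in_all_species):
    """Sum the protein counts (PRT) of the three conservation categories."""
    return excl_some + excl_all + in_all_species


def percent_of(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


# Values copied from the first data row of the table; the key names
# are illustrative labels, not identifiers from miPFinder.
row = {
    "total": 1589,
    "ppi": 751,
    "excl_some": 461,
    "excl_all": 105,
    "in_all_species": 22,
}

total_prt = conserved_total(
    row["excl_some"], row["excl_all"], row["in_all_species"]
)
print(total_prt)                            # 588, as in column "Total PRT"
print(percent_of(total_prt, row["total"]))  # 37, as in the last column
print(percent_of(row["ppi"], row["total"])) # 47, as in the PPI % column
```

The same check can be repeated for any other row by substituting its values.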
MicroProtein candidates in transcription factor families in metazoans. The presence of microProtein candidates in human (upper left, red), mouse (upper right, blue), zebrafish (right, yellow), fruit fly (bottom, green) and roundworm (left, gray) in the respective transcription factor family is indicated by a bold line.
Gene Ontology and protein class analysis of microProtein subsets. For all sets, only the most significant ancestor of a microProtein candidate family was analyzed. (A and B) The subsets represent microProtein candidate families conserved in: a: all species; b: all plants; c: all dicots; d: some dicots; e: all monocots; f: some monocots; g: some plants; h: all metazoa; i: all vertebrates; j: some vertebrates; k: nonvertebrates; l: some metazoa. (A) The GO terms are sorted in descending order of their average abundance across all subsets and color coded by the subset-specific percentage of genes with GO annotation. (B) Selected GO terms extracted from A, as indicated by dashed lines. NF, not found. (C) Protein classes that are regulated by Arabidopsis (left) and human (right) high-probability microProteins.