Alexander A Kanapin, Nicola Mulder, Vladimir A Kuznetsov.
Abstract
We consider the problem of biological complexity via a projection of the protein-coding genes of complex organisms onto the functional space of the proteome. The latter can be defined as the set of all functions performed by the proteins of an organism. Alternative splicing (AS) allows an organism to generate diverse mature RNA transcripts from a single pre-mRNA, and thus it could be one of the key mechanisms for increasing the functional complexity of the organism's proteome and a driving force of biological evolution. The projection of transcription units (TU) and alternative splice-variant (SV) forms onto the proteome functional space could therefore generate new types of relational networks (e.g. SV-protein function networks, SFN) and lead to the discovery of novel evolutionarily conserved functional modules. Such networks might provide new, reliable characteristics of organism complexity and a better understanding of the evolutionary integration and plasticity of the interconnections among genome, transcriptome and proteome functions.
Year: 2010 PMID: 20158875 PMCID: PMC2822532 DOI: 10.1186/1471-2164-11-S1-S4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Database sources and statistics of Functional labels (FLs), proteins, and fractions of poly- and monoform TUs in different species.
| Organism | Database | # FLs | # Proteins | Polyform/monoform TUs (%) |
|---|---|---|---|---|
| ARATH | TAIR/RIKEN | 2463 | 27247 | 6/8920 (0.07) |
| CAEEL | WormBase | 1150 | 7854 | 65/4200 (1.55) |
| DROME | FlyBase | 2335 | 11129 | 141/7185 (1.96) |
| CIOIN | Ensembl | 2934 | 12417 | 505/8567 (5.89) |
| FUGRU | Ensembl | 3306 | 24245 | 843/10760 (7.83) |
| MOUSE | RIKEN/FANTOM | 5172 | 52957 | 2353/18574 (12.67) |
| HUMAN | RIKEN/FANTOM | 5183 | 49829 | 2315/15944 (14.52) |
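
The percentage in the last column appears to be the number of polyform TUs divided by the number of monoform TUs, multiplied by 100 (for example, 6/8920 ≈ 0.07%). The following minimal Python sketch recomputes that column from the counts in the table; the helper name `poly_mono_percent` is illustrative and not from the source.

```python
# Counts copied from the table above: (polyform TUs, monoform TUs).
TU_COUNTS = {
    "ARATH": (6, 8920),
    "CAEEL": (65, 4200),
    "DROME": (141, 7185),
    "CIOIN": (505, 8567),
    "FUGRU": (843, 10760),
    "MOUSE": (2353, 18574),
    "HUMAN": (2315, 15944),
}


def poly_mono_percent(poly: int, mono: int) -> float:
    """Polyform-to-monoform TU ratio as a percentage, rounded to 2 decimals."""
    return round(100.0 * poly / mono, 2)


percentages = {
    organism: poly_mono_percent(poly, mono)
    for organism, (poly, mono) in TU_COUNTS.items()
}

for organism, pct in percentages.items():
    print(f"{organism}: {pct:.2f}%")
```

Under this reading the value is a ratio to monoform TUs rather than a share of all TUs; a share of all TUs would use the sum of polyform and monoform counts as the denominator.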
Top 10 functional labels for human and mouse data sets
| Human | | | Mouse | | |
|---|---|---|---|---|---|
| Number of occurrences | FL ID | Keywords | Number of occurrences | FL ID | Keywords |
| 172 | FLh5145 | RNA-binding | 245 | FLm4392 | Kinase |
| 178 | FLh4376 | Kinase | 258 | FLm4574 | Membrane |
| 207 | FLh2670 | DNA-binding | 267 | FLm3271 | G-protein coupled receptor |
| 303 | FLh4819 | Metal-binding | 321 | FLm4818 | Metal-binding |
| 316 | FLh4994 | Nuclear protein | 342 | FLm3276 | G-protein coupled receptor |
| 324 | FLh5132 | Ribonucleoprotein | 394 | FLm4993 | Nuclear protein |
| 332 | FLh3280 | G-protein coupled receptor | 501 | FLm5125 | Ribonucleoprotein |
| 489 | FLh3273 | G-protein coupled receptor | 902 | FLm2595 | DNA-binding regulation |
| 882 | FLh2637 | DNA-binding | 950 | FLm4724 | Membrane |
| 1011 | FLh4717 | Membrane | 1091 | FLm3270 | G-protein coupled receptor |
Keyword over-representation statistics. (1) Monoform TUs and (2) polyform TUs.
(1) Monoform TUs
| Human: order by p-value | Human: p-value | Mouse: order by p-value | Mouse: p-value | Keyword |
|---|---|---|---|---|
| 1 | 7.81E-20 | 1 | 4.24E-51 | G-PROTEIN COUPLED RECEPTOR |
| 2 | 1.38E-18 | 3 | 4.18E-44 | TRANSDUCER |
| 3 | 2.21E-16 | 2 | 1.58E-50 | OLFACTION |
| 4 | 9.54E-09 | 6 | 4.64E-07 | RIBOSOMAL PROTEIN |
| 5 | 8.06E-08 | 4 | 1.58E-32 | SENSORY TRANSDUCTION |
| 6 | 5.4E-06 | 7 | 0.000989 | RIBONUCLEOPROTEIN |
| 7 | -- | 5 | 2.29E-14 | RECEPTOR |

(2) Polyform TUs
| Human: order by p-value | Human: p-value | Mouse: order by p-value | Mouse: p-value | Keyword |
|---|---|---|---|---|
| 1 | 2.66E-45 | 2 | 3.06E-71 | ATP-BINDING |
| 2 | 2.36E-39 | 1 | 2.04E-78 | NUCLEOTIDE-BINDING |
| 3 | 5.43E-22 | 7 | 4.29E-20 | METAL-BINDING |
| 4 | 3.88E-19 | 4 | 3.15E-27 | TRANSFERASE |
| 5 | 6.36E-17 | 3 | 1.34E-31 | KINASE |
| 6 | 1.11E-15 | 9 | 4.27E-18 | TYROSINE-PROTEIN KINASE |
| 7 | 8.91E-15 | 5 | 2.69E-25 | SERINE-THREONINE PROTEIN KINASE |
| 8 | 1.07E-14 | 8 | 1.12E-19 | HYDROLASE |
| 9 | 1.05E-10 | -- | | IRON |
| 10 | 1.25E-10 | 6 | 3.08E-24 | NUCLEAR PROTEIN |
| 11 | -- | 10 | 1.44E-13 | TRANSCRIPTION |
Only the top 10 keywords are included in the polyform statistics.
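
This record does not state which statistical test produced the p-values above. One common choice for keyword over-representation is a one-sided hypergeometric test, and the sketch below illustrates that technique only as an assumption, not as the authors' method. The function name `hypergeom_upper_tail` and all counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population: int, annotated: int, sample: int, hits: int) -> float:
    """Return P(X >= hits) for a hypergeometric draw.

    `sample` items are drawn without replacement from `population` items,
    of which `annotated` carry the keyword of interest.
    """
    max_hits = min(annotated, sample)
    # Count the draws that contain at least `hits` annotated items.
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, max_hits + 1)
    )
    return favorable / comb(population, sample)


# Hypothetical toy counts, not taken from the paper:
# 10 TUs in total, 4 carry the keyword, 3 TUs sampled, 2 of them carry it.
p_value = hypergeom_upper_tail(population=10, annotated=4, sample=3, hits=2)
print(f"p-value = {p_value:.4f}")
```

A smaller p-value means the keyword occurs in the sampled set more often than a random draw of the same size would be expected to produce.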
Figure 1. Correlation between the average number of splice variants produced per TU and the fraction of polyform TUs. The linear trend line and correlation coefficient are displayed. The red point represents the Arabidopsis data. The Y axis represents the fraction of polyform TUs; the X axis represents the average number of transcripts (splice variants) produced by a TU for a given organism.
Figure 2. Distribution of the top 30 keywords from the set of TUs shared among all organisms analyzed.
Figure 3. Best-fit statistics of three transcript-protein relation functions in the mouse (left) and human (right) data sets. A and B: best-fit frequency distributions of the number of FLs in a given TU. C and D: best-fit frequency distributions of the number of distinct TUs attributed with a given FL in a proteome subset related to selected TUs. E and F: frequency distributions of the number of splice variant events per TU. The mixture probabilistic model (1) was used for identification of the empirical frequency distributions. Blue symbols: data used for parameterisation of the first model (P); blue lines: best-fit function P. Red symbols: data used for parameterisation of the first model (P); blue lines: best-fit function P. SigmaPlot analytical and graphical tools were used.
Descriptive statistics of the three studied transcript-protein function relations in mouse and human. m1: number of one-to-one relationships (singletons); p1: % singletons; Skewness: estimated skewness of the empirical frequency distribution.
| Species | | | | | m1 | p1 (%) | Skewness |
|---|---|---|---|---|---|---|---|
| mouse | 23640 | 20928 | 1.13 | 9 | 18575 | 88.8 | 4.05 |
| human | 20929 | 18260 | 1.15 | 9 | 15945 | 87.3 | 3.99 |
| mouse | 23640 | 5172 | 4.58 | 1091 | 3198 | 61.8 | 25.96 |
| human | 20929 | 5183 | 4.04 | 1011 | 3280 | 63.3 | 28.85 |
| mouse | 52957 | 20928 | 2.53 | 74 | 8920 | 42.6 | 4.01 |
| human | 49828 | 18260 | 2.73 | 73 | 6957 | 38.09 | 5.55 |
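
Several column headers of this table did not survive extraction. The numbers are consistent with the following reading, which is an assumption and not something the record states: the second column is the total number of relations, the third is the number of units over which they are distributed, the fourth is their ratio (the mean), and p1 is m1 as a percentage of the third column. The minimal Python sketch below checks that reading; the names `total` and `units` are illustrative labels for the two unlabelled count columns.

```python
# Rows copied from the table above.
# `total` and `units` are assumed labels; m1 and p1 follow the caption.
ROWS = [
    # (species, total, units, reported_mean, m1, reported_p1)
    ("mouse", 23640, 20928, 1.13, 18575, 88.8),
    ("human", 20929, 18260, 1.15, 15945, 87.3),
    ("mouse", 23640, 5172, 4.58, 3198, 61.8),
    ("human", 20929, 5183, 4.04, 3280, 63.3),
    ("mouse", 52957, 20928, 2.53, 8920, 42.6),
    ("human", 49828, 18260, 2.73, 6957, 38.09),
]


def recompute(total: int, units: int, m1: int) -> tuple[float, float]:
    """Return (mean relations per unit, percentage of singleton units)."""
    return total / units, 100.0 * m1 / units


# Largest absolute gaps between recomputed and reported values.
max_mean_gap = 0.0
max_p1_gap = 0.0
for species, total, units, reported_mean, m1, reported_p1 in ROWS:
    mean, p1 = recompute(total, units, m1)
    max_mean_gap = max(max_mean_gap, abs(mean - reported_mean))
    max_p1_gap = max(max_p1_gap, abs(p1 - reported_p1))
    print(f"{species}: mean={mean:.2f} (reported {reported_mean}), "
          f"p1={p1:.2f}% (reported {reported_p1})")
```

Under this assumed reading, the recomputed means agree with the reported ones to within 0.01 and the recomputed p1 values to within 0.05 percentage points.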
A mixture probabilistic model and best-fit parameters of the model for Splice variant-TU relationships based on available mouse and human data
| Splice variants-TUs | Mouse | Human |
|---|---|---|
| Model 3: | | |
| | 13898 ± 9509.3 | 11279 ± 119.9 |
| | 0.5 ± 0.25 | 0.49 ± 0.006 |
| | 0.07 | <0.0001 |
| | 0.0836 | <0.0001 |
| Std Error of Estimator | 3749 | 48.8 |
| | NAN | 4.16 ± 0.127 |
| | NAN | 8.75 ± 0.282 |
| | NAN | <0.0001 |
| | NAN | <0.0001 |
| Std Error of Estimator | NAN | 2.7032 |
Figure 4. SFN representation for the organisms analyzed.
Figure 5. The largest cluster in the human-mouse common FL space. The size of the nodes is proportional to the number of connecting edges.
SFN statistics and its relation to organism complexity
| Organism | # nodes | # edges | Heterogeneity | Average # neighbors |
|---|---|---|---|---|
| ARATH | 11 | 7 | 0.35 | 1.27 |
| CAEEL | 68 | 53 | 0.7 | 1.56 |
| DROME | 254 | 155 | 0.48 | 1.22 |
| CIOIN | 676 | 503 | 0.88 | 1.49 |
| FUGRU | 997 | 848 | 1.07 | 1.7 |
| MOUSE | 2562 | 2594 | 1.67 | 2.03 |
| HUMAN | 2511 | 2573 | 1.89 | 2.05 |
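
For an undirected network, the average number of neighbors equals twice the number of edges divided by the number of nodes, and the last column of the table is consistent with that identity. The minimal Python sketch below recomputes it from the node and edge counts; the function name `average_neighbors` is illustrative and not from the source.

```python
# Values copied from the table above: (nodes, edges, reported average neighbors).
SFN_STATS = {
    "ARATH": (11, 7, 1.27),
    "CAEEL": (68, 53, 1.56),
    "DROME": (254, 155, 1.22),
    "CIOIN": (676, 503, 1.49),
    "FUGRU": (997, 848, 1.7),
    "MOUSE": (2562, 2594, 2.03),
    "HUMAN": (2511, 2573, 2.05),
}


def average_neighbors(nodes: int, edges: int) -> float:
    """Mean degree of an undirected graph: each edge contributes to two nodes."""
    return 2.0 * edges / nodes


recomputed = {
    organism: average_neighbors(nodes, edges)
    for organism, (nodes, edges, _) in SFN_STATS.items()
}

for organism, (_, _, reported) in SFN_STATS.items():
    print(f"{organism}: {recomputed[organism]:.3f} (reported {reported})")
```

The recomputed values agree with the reported column to within 0.01. Heterogeneity is not recomputed here: if, as is common, it measures the spread of node degrees, it cannot be recovered from node and edge counts alone.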
Functional labels with the highest number of links.
| FL ID | Number of links | Keywords |
|---|---|---|
| Flc3553 | 11 | Membrane |
| Flc3674 | 11 | Membrane |
| Flc2789 | 14 | GTP-binding |
| Flc3759 | 15 | Metal-binding |
| Flc3914 | 15 | Nuclear protein |
| Flc4038 | 15 | RNA-binding |
| Flc3398 | 16 | Kinase |
| Flc2047 | 17 | DNA-binding |
| Flc2018 | 22 | DNA-binding |
| Flc3505 | 40 | Membrane |
Data presented for FLs shared between human and mouse.
Figure 6. Data flow in the Functional Label generation algorithm. The diagram describes a general approach to FL generation for a given protein sequence (splice variant) via conserved domains and sequence similarity.