Mingyi Liu, Heiko Walch, Shaoping Wu, Andrei Grigoriev.
Abstract
We present evidence of remarkable genome-wide mobility and evolutionary expansion for a class of protein domains whose borders locate close to the borders of their encoding exons. These exon-bordering domains are more numerous and widely distributed in the human genome than other domains. They also co-occur with more diverse domains to form a larger variety of domain architectures in human proteins. A systematic comparison of nine animal genomes from nematodes to mammals revealed that exon-bordering domains expanded faster than other protein domains in both abundance and distribution, as well as the diversity of co-occurring domains and the domain architectures of harboring proteins. Furthermore, exon-bordering domains exhibited a particularly strong preference for class 1-1 intron phase. Our findings suggest that exon-bordering domains were amplified and interchanged within a genome more often and/or more successfully than other domains during evolution, probably the result of extensive exon shuffling and gene duplication events. The diverse biological functions of these domains underscore the important role they play in the expansion and diversification of animal proteomes.
Year: 2005 PMID: 15640447 PMCID: PMC546140 DOI: 10.1093/nar/gki152
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Exon-bordering domains. (a) Nomenclature. The exon borders were examined for their positions in relation to domain borders as described previously (14). Nucleotide positions of exon borders were translated into amino acid positions with codon phase information preserved. The first amino acid outside the domain was designated as position −1, the first amino acid inside the domain as position +1, and so on. Domain border boxes were defined as short ranges covering the proximity of the start or end position of a domain, e.g. a border box of [−10, +10] covers 10 amino acids outside and 10 amino acids inside the domain. A hypothetical 4-exon transcript, its protein translation, a domain instance on the protein and the two border boxes for the domain are illustrated. Inside the two border boxes, partial nucleotide/amino acid sequences covering the junctions between exons 1 (purple) and 2 (yellow), the domain (green) and exon 4 (light brown) are shown. Inside the border boxes, the exon junctions are indicated by the arrowheads and by font colors in the nucleotide sequences (black and red). Domain border positions in the border boxes are indicated by dotted lines as well as by background and font colors. In the illustration, the particular domain instance correlates with exon 2 at the N-terminus at amino acid position −1, and with exon 3 at the C-terminus at position +1. Based on the position of the exon border in relation to the triplet codon, the N-terminal exon border for the domain is of phase 1 (after the first nucleotide of the triplet codon for serine at amino acid position −1), while the C-terminal exon border is of phase 2. A domain is considered an exon-bordering domain only when the observed number of exon borders inside all of its border boxes is significantly higher than expected (as detailed in the Methods).
(b) EGF-like domain examples. The cDNA and protein domain structures of three EGF-like domain-containing proteins are illustrated proportionally. Each mosaic protein is annotated with its gene name and EnsEMBL protein ID at the top left. Exons are alternately colored pink and green. In each case, the EGF-like domain (abbreviated as EGF_L) is encoded by one exon. A few other exon-matching exon-bordering domain instances, such as Sushi and Fibronectin type III (abbreviated as FN_III) domains, are also illustrated. TM, transmembrane domain; SP, signal peptide; Lectin_C, Lectin C-type domain; TSPN, Laminin G domain; VWF_C, von Willebrand factor type C domain; TSP1, Thrombospondin type 1 domain; TSP3, Thrombospondin type 3 repeat; and TSP_C, Thrombospondin C-terminal region.
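The nomenclature of panel (a) can be sketched in code. This is a minimal illustration and not the authors' implementation: the function names, the 1-based residue indexing, and the choice to assign a phase-0 border to the codon that follows it are all assumptions made here.

```python
def exon_border_to_residue(coding_nt_upstream):
    """Map an exon border to (residue index, phase).

    coding_nt_upstream: number of coding nucleotides before the border.
    Phase is the number of nucleotides of the split codon that lie
    upstream of the border (0 means the border falls between codons).
    Assumption: a phase-0 border is assigned to the codon that follows it.
    """
    phase = coding_nt_upstream % 3
    residue = coding_nt_upstream // 3 + 1  # 1-based residue index
    return residue, phase


def border_position(residue, domain_start, domain_end, terminus):
    """Position of a residue relative to one domain border.

    +1 is the first residue inside the domain, -1 the first residue
    outside it; there is no position 0.
    """
    if terminus == "N":
        offset = residue - domain_start
    elif terminus == "C":
        offset = domain_end - residue
    else:
        raise ValueError("terminus must be 'N' or 'C'")
    return offset + 1 if offset >= 0 else offset


def in_border_box(position, half_width=10):
    """True if a position lies inside the [-half_width, +half_width] box."""
    return position != 0 and abs(position) <= half_width


# Hypothetical domain spanning residues 12..50 (illustrative numbers only).
residue, phase = exon_border_to_residue(31)  # border after 31 coding nt
position = border_position(residue, 12, 50, "N")
```

With these hypothetical numbers the border falls after the first nucleotide of the codon for residue 11, so it is a phase 1 border at N-terminal position −1, the same situation as the N-terminal border in the illustration.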
Figure 2. Naming convention for domain instances. The exon-border match instances (2END-ALL) include all the domain instances in the genome whose border boxes on both ends contain exon borders. Exon-bordering domain instances also include instances with a one-sided exon-border match. 2END-SIGNIFICANT (2END-SIG in the illustration) is the intersection of the exon-bordering domain instances and 2END-ALL, while 2END-RANDOM denotes the set obtained by subtracting 2END-SIGNIFICANT from 2END-ALL.
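The relationships among these sets are plain set operations. The sketch below uses made-up instance identifiers (`inst1` to `inst5`) and a hypothetical helper name purely for illustration.

```python
def partition_two_end_instances(two_end_all, exon_bordering_instances):
    """Split 2END-ALL into 2END-SIGNIFICANT and 2END-RANDOM."""
    two_end_significant = two_end_all & exon_bordering_instances
    two_end_random = two_end_all - two_end_significant
    return two_end_significant, two_end_random


# Hypothetical instance IDs; inst5 is a one-sided match, so it belongs to an
# exon-bordering domain but is not part of 2END-ALL.
two_end_all = {"inst1", "inst2", "inst3", "inst4"}
exon_bordering_instances = {"inst2", "inst3", "inst5"}

two_end_sig, two_end_random = partition_two_end_instances(
    two_end_all, exon_bordering_instances
)
```

By construction the two resulting sets are disjoint and their union is 2END-ALL.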
Exon-bordering domains are abundant and widely distributed
| Domain name | Score | Peak score | P-value rank | Domain # rank | Gene # rank | Co-occur # rank | Co-occur type rank | Architecture # rank |
|---|---|---|---|---|---|---|---|---|
| KRAB box | 0 | 0 | 1 | 14 | 5 | 1 | 70 | 28 |
| EGF-like domain | 0 | 1.5E−251 | 2 | 5 | 13 | 2 | 2 | 1 |
| Sushi domain (SCR repeat) | 2.0E−308 | 5.2E−197 | 3 | 17 | 56 | 77 | 53 | 34 |
| Fibronectin type III domain | 2.4E−213 | 2.3E−119 | 4 | 11 | 18 | 8 | 13 | 10 |
| Low-density lipoprotein receptor domain class A | 2.4E−105 | 3.1E−77 | 5 | 26 | 81 | 17 | 24 | 20 |
| Ankyrin repeat | 2.9E−104 | 1.4E−43 | 6 | 3 | 10 | 21 | 8 | 9 |
| Leucine Rich Repeat | 5.1E−93 | 5.6E−49 | 7 | 10 | 16 | 16 | 14 | 13 |
| Immunoglobulin domain | 9.1E−80 | 5.7E−11 | 8 | 2 | 3 | 5 | 6 | 8 |
| B-box zinc finger | 5.4E−74 | 5.4E−74 | 9 | 52 | 37 | 43 | 82 | 37 |
| Laminin EGF-like (Domains III and V) | 5.8E−73 | 2.1E−38 | 10 | 24 | 101 | 22 | 22 | 27 |
We list the top 10 exon-bordering domains, selected by the statistical significance of their correlation with exons (Materials and Methods). Score is the overall P-value calculated for the correlation of the given domain with exons, as described in Methods. Peak score is the lowest P-value over all positions inside the [−10, +10] domain border boxes. The P-value rank orders all exon-bordering domains by the P-value of their exon-domain correlation; Domain # rank orders all human domains by the total number of domain instances in the genome; Gene # rank orders domains by the total number of human genes containing the given domain; Co-occur # rank orders domains by the total number of co-occurring domains; Co-occur type rank orders domains by the number of different types of co-occurring domains; Architecture # rank orders domains by the number of different domain architectures among the genes containing the domain. All of these categories measure the abundance and distribution of a given domain.
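To make the five categories concrete, the following sketch computes them from a toy mapping of genes to ordered domain lists. The data, the function names and the exact counting rules are assumptions for illustration (in particular, "Co-occur #" is taken here to mean the number of other domain instances in the same genes); the paper's Methods define the actual procedure.

```python
from collections import defaultdict


def domain_statistics(architectures):
    """Compute per-domain counts from {gene ID: ordered list of domain names}."""
    raw = defaultdict(lambda: {"domains": 0, "genes": 0, "cooccur": 0,
                               "types": set(), "archs": set()})
    for domains in architectures.values():
        for name in set(domains):
            entry = raw[name]
            n_self = domains.count(name)
            entry["domains"] += n_self                 # Domain #
            entry["genes"] += 1                        # Gene #
            entry["cooccur"] += len(domains) - n_self  # Co-occur # (assumed rule)
            entry["types"].update(d for d in domains if d != name)
            entry["archs"].add(tuple(domains))
    return {
        name: {
            "domains": e["domains"],
            "genes": e["genes"],
            "cooccur": e["cooccur"],
            "cooccur_types": len(e["types"]),
            "architectures": len(e["archs"]),
        }
        for name, e in raw.items()
    }


def rank_by(stats, category):
    """Rank domains by one category, 1 = largest; ties are broken arbitrarily."""
    ordered = sorted(stats, key=lambda name: stats[name][category], reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


# Toy input, not real annotation data.
toy = {
    "geneA": ["EGF", "EGF", "Sushi"],
    "geneB": ["EGF", "FN3"],
    "geneC": ["Sushi", "FN3", "FN3"],
}
stats = domain_statistics(toy)
cooccur_rank = rank_by(stats, "cooccur")
```

In this toy set, Sushi has the most co-occurring domain instances and so ranks first in that category, even though EGF and FN3 have more instances of their own.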
Exon-bordering domains were preferentially amplified during evolution
| Category | Human | Worm | Insect | Fish | Rodent |
|---|---|---|---|---|---|
| (a) P-values in Wilcoxon tests | | | | | |
| Domain # | <2.2E−16 | 4.9E−10 | 2.3E−12 | 0.52 | 4.7E−05 |
| Gene # | <2.2E−16 | 1.6E−09 | 1.2E−12 | 0.83 | 5.9E−04 |
| Co-occur # | <2.2E−16 | 5.4E−12 | 2.3E−11 | 0.047 | 0.11 |
| Co-occur type | 5.2E−13 | 3.9E−08 | 3.9E−07 | 0.61 | 0.67 |
| Architecture # | <2.2E−16 | <2.2E−16 | <2.2E−16 | 0.069 | 5.5E−03 |
| (b) Percentage of exon-bordering domains | | | | | |
| Domain # | 42 | 19 | 15 | 31 | 35 |
| Gene # | 24 | 15 | 6 | 18 | 20 |
| Co-occur # | 39 | 15 | 11 | 26 | 33 |
| Co-occur type | 20 | 7 | 5 | 16 | 18 |
| Architecture # | 26 | 7 | 5 | 18 | 20 |
(a) P-values in Wilcoxon tests. We collected data in five categories (see Table 1 legend) that gauge the abundance and distribution of domains in the genome. For the column ‘Human’, we compared the exon-bordering domain group with the background group (all the remaining human domains) in a Wilcoxon test with the alternative hypothesis that exon-bordering domains had a higher mean. For the other columns, we first computed, for each domain, the ratio of its number in human to its total number in the labeled evolutionary group. This ratio indicates the amplification fold for a domain in the data category. Only domains present in both the evolutionary group and human were used in generating the ratios. A higher mean amplification ratio in the Wilcoxon test suggests a preferential amplification of the exon-bordering domain group in that category. (b) Percentage of exon-bordering domains in different species. For each of the five data categories, we report the number contributed by exon-bordering domains as a percentage of the total number. For example, 19% of all domain instances in worm are exon-bordering domain instances, compared with 42% in human.
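The two quantities described in this legend, the per-domain amplification ratio and the percentage contributed by exon-bordering domains, can be sketched as follows. The counts and function names are invented for illustration, and the Wilcoxon test itself is not reproduced here; the ratios would be the input to such a test.

```python
def amplification_ratios(human_counts, group_counts):
    """Human-to-group ratio for domains present in both (and nonzero in the group)."""
    shared = human_counts.keys() & group_counts.keys()
    return {
        domain: human_counts[domain] / group_counts[domain]
        for domain in shared
        if group_counts[domain] > 0
    }


def exon_bordering_percentage(counts, exon_bordering):
    """Share of a category's total contributed by exon-bordering domains."""
    total = sum(counts.values())
    contributed = sum(n for d, n in counts.items() if d in exon_bordering)
    return 100.0 * contributed / total


# Hypothetical domain-instance counts, not taken from the paper.
human = {"EGF": 40, "KRAB": 30, "Kinase": 30}
worm = {"EGF": 10, "Kinase": 20}
exon_bordering = {"EGF", "KRAB"}

ratios = amplification_ratios(human, worm)
human_pct = exon_bordering_percentage(human, exon_bordering)
```

KRAB is absent from the hypothetical worm counts, so it is excluded from the ratios, mirroring the rule that only domains present in both human and the evolutionary group are compared.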
Strong preference for the 1-1 phase class
| Phase class | Observed | % | Expected | P-value |
|---|---|---|---|---|
| (a) Set 2END-ALL | | | | |
| 1-1 | 1535 | 32 | 1063 | 3.7E−60 |
| 0-0 | 1018 | 21 | 685 | 6.0E−43 |
| 0-1 | 573 | 12 | 901 | 9.1E−34 |
| 1-0 | 439 | 9 | 808 | 8.8E−46 |
| 0-2 | 296 | 6 | 301 | 0.77 |
| 2-0 | 313 | 6 | 278 | 0.029 |
| 1-2 | 252 | 5 | 355 | 1.4E−08 |
| 2-2 | 230 | 5 | 122 | 4.1E−23 |
| 2-1 | 222 | 5 | 365 | 6.2E−15 |
| Symmetrical | 2783 | 57 | 1870 | 3.1E−159 |
| (b) Set 2END-SIGNIFICANT | | | | |
| 1-1 | 1253 | 45 | 603 | 3.8E−197 |
| 0-0 | 513 | 19 | 388 | 8.1E−12 |
| 0-1 | 300 | 11 | 511 | 4.9E−25 |
| 1-0 | 154 | 6 | 458 | 1.7E−54 |
| 0-2 | 106 | 4 | 171 | 3.3E−07 |
| 2-0 | 123 | 4 | 157 | 0.0048 |
| 1-2 | 96 | 4 | 201 | 1.3E−14 |
| 2-2 | 109 | 4 | 69 | 1.2E−06 |
| 2-1 | 111 | 4 | 207 | 3.8E−12 |
| Symmetrical | 1875 | 68 | 1060 | 5.1E−223 |
| (c) Set 2END-RANDOM | | | | |
| 1-1 | 282 | 13 | 461 | 5.0E−21 |
| 0-0 | 505 | 24 | 297 | 6.2E−39 |
| 0-1 | 273 | 13 | 390 | 4.6E−11 |
| 1-0 | 285 | 13 | 350 | 1.5E−04 |
| 0-2 | 190 | 9 | 130 | 7.0E−08 |
| 2-0 | 190 | 9 | 120 | 5.7E−11 |
| 1-2 | 156 | 7 | 154 | 0.85 |
| 2-2 | 121 | 6 | 53 | 2.2E−21 |
| 2-1 | 111 | 5 | 158 | 9.3E−05 |
| Symmetrical | 908 | 43 | 810 | 1.2E−05 |
For domain instances that correlated with exons on both ends (sets 2END-ALL, 2END-SIGNIFICANT and 2END-RANDOM in Figure 2), we recorded the phase class as x-y, where x is the phase of the exon border at the N-terminus of the domain and y is the phase at the C-terminus. The observed numbers for each phase class were tallied in human. The percentage of each phase class among all cases in the set is displayed in the column marked ‘%’. The expected numbers and P-values for the significance of each phase class were calculated as described in Methods. The ‘Symmetrical’ class is the total of phase classes 0-0, 1-1 and 2-2.
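As a consistency check on the observed counts, the short sketch below recomputes the ‘Symmetrical’ rows and derives 2END-RANDOM by subtracting 2END-SIGNIFICANT from 2END-ALL, following the set definitions in Figure 2. The counts are copied from the table; the variable and function names are illustrative. Expected values and P-values are not recomputed because their calculation is described only in the Methods.

```python
# Observed counts per phase class, copied from panels (a) and (b).
observed_all = {"1-1": 1535, "0-0": 1018, "0-1": 573, "1-0": 439, "0-2": 296,
                "2-0": 313, "1-2": 252, "2-2": 230, "2-1": 222}
observed_sig = {"1-1": 1253, "0-0": 513, "0-1": 300, "1-0": 154, "0-2": 106,
                "2-0": 123, "1-2": 96, "2-2": 109, "2-1": 111}


def symmetrical_total(observed):
    """Sum the classes whose N- and C-terminal phases are equal."""
    total = 0
    for phase_class, count in observed.items():
        n_phase, c_phase = phase_class.split("-")
        if n_phase == c_phase:
            total += count
    return total


def percentage(part, observed):
    """Part as a rounded percentage of all instances in the set."""
    return round(100 * part / sum(observed.values()))


# 2END-RANDOM = 2END-ALL minus 2END-SIGNIFICANT, class by class.
observed_random = {c: observed_all[c] - observed_sig[c] for c in observed_all}

sym_all = symmetrical_total(observed_all)
sym_sig = symmetrical_total(observed_sig)
sym_random = symmetrical_total(observed_random)
```

The derived counts agree with the observed column of panel (c), and the symmetrical totals agree with the ‘Symmetrical’ rows of all three panels.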
Figure 3. Intron phase distribution in all nine species. Phase classes were collected in all nine species for the sets (a) 2END-ALL and (b) 2END-SIGNIFICANT (see Figure 2 legend). Each of the nine possible phase classes was plotted as a percentage of the total phase classes in all species. The exon-bordering domain subgroup strongly favors phase class 1-1.