Rex T Nelson, Randy Shoemaker.
Abstract
BACKGROUND: Large-scale gene analysis of most organisms is hampered by incomplete genomic sequences. In many organisms, such as soybean, the best source of sequence information is expressed sequence tag (EST) libraries. Soybean has a large (1115 Mbp) genome that has yet to be fully sequenced. However, it has the sixth-largest EST collection, composed of ESTs from a variety of soybean genotypes. Many EST libraries were constructed from RNA extracted from various genetic backgrounds; gene identification from these sources is therefore complicated by the existence of both gene and allele sequence differences. We used the ESTminer suite of programs to identify potential soybean gene transcripts from a single genetic background, allowing us to observe functional classifications between gene families as well as structural differences between genes and gene paralogs within families. The identification of potential gene sequences (pHaps) from soybean allows us to begin to build a picture of the genomic history of the organism and to observe the evolutionary fates of gene copies in this highly duplicated genome.
Year: 2006 PMID: 16899135 PMCID: PMC1557498 DOI: 10.1186/1471-2164-7-204
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Distribution of pHaps Among 12,702 Gene-families
| Potential Haplotypes per Gene family (45,255 total pHaps) | Frequency in the Gene families (% of total) |
| 1 | 8,535 (67%) |
| 2 | 1,316 (10%) |
| 3 | 706 (6%) |
| 4 | 472 (4%) |
| 5 | 320 (3%) |
| 6 | 205 (2%) |
| 7 | 190 (2%) |
| 8 | 134 (1%) |
| 9 | 89 (< 1%) |
| 10 | 85 (< 1%) |
| >10 | 651 (5%) |
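The percentage column follows from dividing each family count by the stated total of 12,702 gene families. The snippet below is a minimal sketch of that arithmetic, reported to one decimal place; the names `family_counts`, `TOTAL_FAMILIES`, and `percent_of_total` are illustrative and are not part of ESTminer.

```python
# Family counts transcribed from the table above; keys are pHaps per gene family.
family_counts = {
    "1": 8535, "2": 1316, "3": 706, "4": 472, "5": 320, "6": 205,
    "7": 190, "8": 134, "9": 89, "10": 85, ">10": 651,
}
TOTAL_FAMILIES = 12702  # total stated in the table title


def percent_of_total(count, total=TOTAL_FAMILIES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


percentages = {size: percent_of_total(n) for size, n in family_counts.items()}

# Share of families holding more than one pHap, derived from the first row.
multi_phap_share = round(100.0 - percentages["1"], 1)

for size, pct in percentages.items():
    print(f"{size:>3} pHaps: {family_counts[size]:>5} families ({pct}%)")
print(f"families with more than one pHap: {multi_phap_share}%")
```

On these counts, roughly two thirds of the families contain a single pHap and about one third contain two or more.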
Figure 1. Distribution of insertion lengths and consecutive substitutions within gene families. A) Consecutive base substitutions. Single-base substitutions are the primary size class, making up 90% of all substitutions; longer runs decline rapidly in number. The largest consecutive stretch of substitutions was 12. B) Insertion lengths as a percentage of the total number of insertion events. Insertions of 3, 6, and 9 bases are in excess; however, the largest size class is single-base insertions, which make up 58% of all insertion events. The data shown are for insertions less than 16 bases in length.
Base Substitution and Indel Frequencies within Gene families
| Type of LDSDS | Transcriptome Wide | Restricted to Families with >1 pHap |
| Base Substitutions | 15.8/kb | 40.3/kb |
| In/Dels | 1.3/kb | 3.4/kb |
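The two columns can be compared directly by simple division. The sketch below only restates the tabulated rates; the dictionary layout and the helper names `fold_increase` and `substitution_to_indel_ratio` are assumptions made for illustration.

```python
# Rates per kb transcribed from the table above.
rates_per_kb = {
    "base_substitutions": {"transcriptome_wide": 15.8, "multi_phap_families": 40.3},
    "indels": {"transcriptome_wide": 1.3, "multi_phap_families": 3.4},
}


def fold_increase(kind):
    """Ratio of the rate in families with >1 pHap to the transcriptome-wide rate."""
    row = rates_per_kb[kind]
    return row["multi_phap_families"] / row["transcriptome_wide"]


def substitution_to_indel_ratio(scope):
    """Number of base substitutions per indel within one column of the table."""
    return rates_per_kb["base_substitutions"][scope] / rates_per_kb["indels"][scope]


for kind in rates_per_kb:
    print(f"{kind}: {fold_increase(kind):.2f}-fold higher in families with >1 pHap")
for scope in ("transcriptome_wide", "multi_phap_families"):
    print(f"{scope}: {substitution_to_indel_ratio(scope):.1f} substitutions per indel")
```

Both rates rise by a factor of about 2.6 when the calculation is restricted to families with more than one pHap, and substitutions outnumber indels roughly twelve to one in either column.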
Distribution of GOslim terms among ESTminer gene families
| GOslim Category | Multiple (families with multiple genes) | Single (families composed of a single gene) |
| chromatin binding | 0 | 8 |
| carbohydrate binding | 2 | 0 |
| motor activity | 2 | 15 |
| receptor activity | 3 | 11 |
| nuclease activity | 7 | 15 |
| signal transducer activity | 11 | 32 |
| transcription regulator activity | 15 | 41 |
| oxygen binding | 25 | 46 |
| enzyme regulator activity | 31 | 298 |
| lipid binding | 42 | 35 |
| protein binding | 55 | 73 |
| translation factor activity | 56 | 78 |
| translation regulator activity | 0 | 1 |
| receptor binding | 63 | 160 |
| nucleotide binding | 72 | 109 |
| RNA binding | 97 | 85 |
| kinase activity | 111 | 330 |
| binding | 123 | 149 |
| transferase activity | 166 | 323 |
| transporter activity | 196 | 346 |
| transcription factor activity | 287 | 490 |
| hydrolase activity | 294 | 591 |
| structural molecule activity | 308 | 84 |
| catalytic activity | 431 | 596 |
| molecular function unknown | 771 | 1569 |
| No Functional Annotation | 999 | 3050 |
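One quick way to read this table is to list the categories in which the Multiple count exceeds the Single count, and to total each column. The following is a minimal sketch over the counts as transcribed from the table; `family_counts` and `multiple_majority` are hypothetical names introduced here.

```python
# (multiple, single) gene-family counts per GOslim category, from the table above.
family_counts = {
    "chromatin binding": (0, 8),
    "carbohydrate binding": (2, 0),
    "motor activity": (2, 15),
    "receptor activity": (3, 11),
    "nuclease activity": (7, 15),
    "signal transducer activity": (11, 32),
    "transcription regulator activity": (15, 41),
    "oxygen binding": (25, 46),
    "enzyme regulator activity": (31, 298),
    "lipid binding": (42, 35),
    "protein binding": (55, 73),
    "translation factor activity": (56, 78),
    "translation regulator activity": (0, 1),
    "receptor binding": (63, 160),
    "nucleotide binding": (72, 109),
    "RNA binding": (97, 85),
    "kinase activity": (111, 330),
    "binding": (123, 149),
    "transferase activity": (166, 323),
    "transporter activity": (196, 346),
    "transcription factor activity": (287, 490),
    "hydrolase activity": (294, 591),
    "structural molecule activity": (308, 84),
    "catalytic activity": (431, 596),
    "molecular function unknown": (771, 1569),
    "No Functional Annotation": (999, 3050),
}


def multiple_majority(counts):
    """Return categories where the Multiple count exceeds the Single count."""
    return [name for name, (multiple, single) in counts.items() if multiple > single]


majority = multiple_majority(family_counts)
total_multiple = sum(multiple for multiple, _ in family_counts.values())
total_single = sum(single for _, single in family_counts.values())

print(majority)
print(total_multiple, total_single)
```

On these counts the majority list contains carbohydrate binding, lipid binding, RNA binding, and structural molecule activity, although carbohydrate binding rests on only two families. The column totals are 4167 and 8535, which together account for all 12,702 gene families.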
Distribution of GOslim terms among individual ESTminer potential gene sequences (pHaps).
| GOslim Category | Multiple (genes from multiple gene families) | Single (genes from families with few members) |
| chromatin binding | 0 | 10 |
| translation regulator | 0 | 1 |
| motor activity | 2 | 14 |
| carbohydrate binding | 5 | 0 |
| receptor activity | 10 | 11 |
| nuclease activity | 27 | 12 |
| signal transducer | 32 | 16 |
| transcription regulator | 40 | 40 |
| oxygen binding | 100 | 47 |
| enzyme regulator | 298 | 36 |
| kinase activity | 402 | 297 |
| lipid binding | 407 | 34 |
| translation factor | 431 | 78 |
| nucleotide binding | 525 | 104 |
| RNA binding | 659 | 83 |
| protein binding | 898 | 84 |
| transferase activity | 1160 | 335 |
| transcription factor | 1311 | 441 |
| binding | 1408 | 156 |
| transporter | 1460 | 334 |
| hydrolase activity | 1950 | 558 |
| structural molecule | 2982 | 83 |
| catalytic activity | 3569 | 634 |
| receptor binding | 3725 | 165 |
| function unknown | 4861 | 1573 |
| No Functional Annotation | 10458 | 3389 |
Figure 2. Distribution of GOslim terms among gene families. Histogram of GOslim terms associated with all gene families. Red bars indicate gene families with multiple genes and blue bars indicate gene families composed of a single gene. A single asterisk indicates a significant departure from independence in a Chi-square test (1 df, p ≤ 0.05) and a double asterisk indicates p ≤ 0.01. In general, families composed of few genes (single) made up the majority of families in each category, with the exception of structural molecule activity, RNA binding, and lipid binding, where multiple gene families appear to be in the majority.
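A 1-df Chi-square test of independence of the kind mentioned in the caption can be sketched as follows. The exact contingency layout behind the figure is not stated, so this example assumes a 2x2 table comparing one GOslim category against all other gene families, with no continuity correction. The function name `chi_square_2x2` is hypothetical, and the totals are the column sums of the gene-family GOslim table.

```python
import math

# Column sums of the gene-family GOslim table (Multiple and Single columns).
TOTAL_MULTIPLE = 4167
TOTAL_SINGLE = 8535


def chi_square_2x2(multiple, single,
                   total_multiple=TOTAL_MULTIPLE, total_single=TOTAL_SINGLE):
    """Pearson chi-square (1 df) for one category versus all other families.

    Assumed layout: rows are (this category, every other category) and
    columns are (multiple, single). Returns (statistic, p_value).
    """
    observed = [
        [multiple, single],
        [total_multiple - multiple, total_single - single],
    ]
    col_totals = [total_multiple, total_single]
    row_totals = [sum(row) for row in observed]
    grand_total = total_multiple + total_single

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


for name, counts in {"structural molecule activity": (308, 84),
                     "nuclease activity": (7, 15)}.items():
    stat, p = chi_square_2x2(*counts)
    print(f"{name}: chi2 = {stat:.2f}, p = {p:.3g}")
```

Under these assumptions, structural molecule activity departs strongly from independence, while nuclease activity does not at the 0.05 level. A different contingency layout or a continuity correction would change the exact values.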
Figure 3. Distribution of GOslim terms among individual genes. Histogram of GOslim terms associated with all genes. Red bars indicate genes from multiple gene families (multiple) and blue bars represent genes from families with few members (single). Asterisks indicate comparisons where multiple gene families contained more (red) or fewer (blue) members than expected. Significance was judged at the 0.05 probability level (single asterisk) using a Chi-square test in each category; double asterisks indicate significance at the 0.01 probability level. Genes from multiple-member families are the predominant class of genes in each category. The pHaps were not randomly distributed among the GO categories: families with kinase, hydrolase, oxygen binding, transcription regulator, nuclease, signal transducer, and transcription factor activities appear to contain fewer members than the average multiple gene family, while families in the categories of enzyme regulator, structural molecule, and catalytic activity, and of receptor, protein, and lipid binding, appear to have larger than average multiple gene families.
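The "more or fewer members than expected" comparison can be illustrated by computing, for a category, the Multiple count expected if class membership were independent of GOslim category. This is a sketch under that assumption and not necessarily the authors' exact procedure; the names `phap_counts` and `expected_multiple` are hypothetical, only a subset of categories is shown, and the totals are the column sums of the pHap table.

```python
# (multiple, single) pHap counts for a few categories, from the pHap GOslim table.
phap_counts = {
    "kinase activity": (402, 297),
    "hydrolase activity": (1950, 558),
    "transcription factor": (1311, 441),
    "structural molecule": (2982, 83),
    "catalytic activity": (3569, 634),
    "receptor binding": (3725, 165),
}
TOTAL_MULTIPLE = 36720  # sum of the Multiple column over all rows of the pHap table
TOTAL_SINGLE = 8535     # sum of the Single column over all rows of the pHap table


def expected_multiple(multiple, single):
    """Expected Multiple count for a category under independence."""
    overall_share = TOTAL_MULTIPLE / (TOTAL_MULTIPLE + TOTAL_SINGLE)
    return (multiple + single) * overall_share


directions = {}
for name, (multiple, single) in phap_counts.items():
    expected = expected_multiple(multiple, single)
    directions[name] = "more" if multiple > expected else "fewer"
    print(f"{name}: observed {multiple}, expected {expected:.1f} -> {directions[name]}")
```

For the categories shown, the direction agrees with the caption: kinase, hydrolase, and transcription factor activities fall below expectation, while structural molecule, catalytic activity, and receptor binding fall above it.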
Enriched GO terms in gene families with few members (single)
| GO:0016740 | transferase activity |
| GO:0016772 | transferase activity, transferring phosphorus-containing groups |
| GO:0008170 | N-methyltransferase activity |
| GO:0008276 | protein methyltransferase activity |
| GO:0042054 | histone methyltransferase activity |
| GO:0016757 | transferase activity, transferring glycosyl groups |
| GO:0016773 | phosphotransferase activity, alcohol group as acceptor |
| GO:0016758 | transferase activity, transferring hexosyl groups |
| GO:0016279 | protein-lysine N-methyltransferase activity |
| GO:0018024 | histone-lysine N-methyltransferase activity |
| GO:0046974 | histone lysine N-methyltransferase activity (H3-K9 specific) |
| GO:0046976 | histone lysine N-methyltransferase activity (H3-K27 specific) |
| GO:0016278 | lysine N-methyltransferase activity |
| GO:0005554 | molecular function unknown |
| GO:0008757 | S-adenosylmethionine-dependent methyltransferase activity |
| GO:0008194 | UDP-glycosyltransferase activity |
| GO:0016410 | N-acyltransferase activity |
| GO:0016407 | acetyltransferase activity |
| GO:0008080 | N-acetyltransferase activity |
| GO:0003682 | chromatin binding |
| GO:0030554 | adenyl nucleotide binding |
| GO:0005524 | ATP binding |
| GO:0003700 | transcription factor activity |
| GO:0030515 | snoRNA binding |
| GO:0030599 | pectinesterase activity |
| GO:0016301 | kinase activity |
| GO:0004672 | protein kinase activity |
| GO:0004674 | protein serine/threonine kinase activity |
| GO:0016538 | cyclin-dependent protein kinase regulator activity |
| GO:0004428 | inositol or phosphatidylinositol kinase activity |
| GO:0015291 | porter activity |
| GO:0015290 | electrochemical potential-driven transporter activity |
| GO:0015171 | amino acid transporter activity |
| GO:0005275 | amine transporter activity |
| GO:0015203 | polyamine transporter activity |
| GO:0005279 | amino acid-polyamine transporter activity |
| GO:0015359 | amino acid permease activity |
| GO:0005342 | organic acid transporter activity |
| GO:0046943 | carboxylic acid transporter activity |
| GO:0016789 | carboxylic ester hydrolase activity |
| GO:0016788 | hydrolase activity, acting on ester bonds |
| GO:0016810 | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds |
| GO:0016811 | hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, in linear amides |
| GO:0004722 | protein serine/threonine phosphatase activity |
| GO:0004721 | phosphoprotein phosphatase activity |
| GO:0008026 | ATP-dependent helicase activity |
| GO:0004386 | helicase activity |
| GO:0008236 | serine-type peptidase activity |
| GO:0004803 | transposase activity |
| GO:0003777 | microtubule motor activity |