Brian J Haas, Jennifer R Wortman, Catherine M Ronning, Linda I Hannick, Roger K Smith, Rama Maiti, Agnes P Chan, Chunhui Yu, Maryam Farzad, Dongying Wu, Owen White, Christopher D Town.
Abstract
BACKGROUND: Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications.
Year: 2005 PMID: 15784138 PMCID: PMC1082884 DOI: 10.1186/1741-7007-3-7
Source DB: PubMed Journal: BMC Biol ISSN: 1741-7007 Impact factor: 7.431
Statistics for Arabidopsis reannotation Release 5.
| | Chr. 1 | Chr. 2 | Chr. 3 | Chr. 4 | Chr. 5 | Total |
| Length (Mb) | 30.269 | 19.702 | 23.465 | 18.582 | 26.978 | 118.998 |
| %GC | | | | | | |
| overall | 35.9 | 35.9 | 36.3 | 36.2 | 35.9 | 36.0 |
| coding | 44.1 | 44.2 | 44.3 | 44.2 | 44.1 | 44.2 |
| intronic | 32.4 | 32.3 | 32.6 | 32.4 | 32.3 | 32.4 |
| intergenic | 30.8 | 31.4 | 31.6 | 31.6 | 31.1 | 31.2 |
| # genes | 6,772 | 4,104 | 5,233 | 3,985 | 6,113 | 26,207 |
| gene density (kb/gene) | 4.47 | 4.80 | 4.48 | 4.66 | 4.41 | 4.5 |
| Avg. gene length (bp) [a] | 2,287 | 2,156 | 2,197 | 2,269 | 2,227 | 2,232 |
| Avg. protein length | 425 | 398 | 417 | 421 | 419 | 417 |
| # genes in protein families | 4,834 | 2,884 | 3,803 | 2,839 | 4,281 | 18,641 |
| # genes duplicated via segmental chromosome duplications | 1,868 | 961 | 1,315 | 1,147 | 1,291 | 6,582 |
| # genes found tandemly duplicated | 993 | 545 | 750 | 636 | 813 | 3,737 |
| # genes with alt splicing isoforms | 600 | 412 | 444 | 357 | 517 | 2,330 |
| # genes with annotated UTRs | 4,717 | 2,936 | 3,575 | 2,724 | 4,147 | 18,099 |
| # transposons and pseudogenes | 748 | 817 | 837 | 652 | 732 | 3,786 |
| # tRNA genes | 240 | 96 | 93 | 79 | 123 | 631 |
| Exons | | | | | | |
| # exons | 37,710 | 21,428 | 27,937 | 21,800 | 33,255 | 142,130 |
| total length (Mb) | 10.378 | 5.919 | 7.812 | 6.011 | 9.170 | 39.290 |
| avg exons/gene | 5.57 | 5.22 | 5.34 | 5.47 | 5.44 | 5.42 |
| avg exon size | 275 | 276 | 280 | 276 | 276 | 276 |
| Introns | | | | | | |
| # introns | 30,938 | 17,324 | 22,704 | 17,814 | 27,191 | 115,921 |
| total length (Mb) | 5.060 | 2.903 | 3.657 | 3.016 | 4.416 | 19.053 |
| avg size | 164 | 168 | 161 | 169 | 163 | 164 |
| # distinct proteins | 7,176 | 4,451 | 5,540 | 4,231 | 6,457 | 27,855 |
| # proteins with interpro domains | 6,142 | 3,686 | 4,676 | 3,573 | 5,441 | 23,518 |
| # with TM domain | 2,047 | 1,429 | 1,599 | 1,316 | 1,768 | 8,159 |
| Signal peptides | | | | | | |
| secretory | 1,262 | 797 | 974 | 773 | 1,103 | 4,909 |
| chloroplast | 1,062 | 681 | 845 | 666 | 1,021 | 4,275 |
| mitochondria | 820 | 490 | 612 | 430 | 736 | 3,088 |
[a] Length of genomic sequence from annotated transcriptional start to stop.
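Several rows of the Release 5 statistics table are ratios of other rows, so they can be cross-checked by simple arithmetic. The following is a minimal sketch, assuming the "Total" column values as printed above; the variable and function names are illustrative and do not come from the original work.

```python
# Totals copied from the "Total" column of the Release 5 statistics table.
genome_mb = 118.998        # genome length (Mb)
n_genes = 26_207           # protein-coding genes
n_exons = 142_130          # exons
exon_total_mb = 39.290     # total exon length (Mb)
n_introns = 115_921        # introns
intron_total_mb = 19.053   # total intron length (Mb)


def derived_stats():
    """Recompute the ratio rows of the table from the raw totals."""
    return {
        # kb of genome per annotated gene
        "gene_density_kb_per_gene": round(genome_mb * 1_000 / n_genes, 2),
        # mean number of exons per gene
        "avg_exons_per_gene": round(n_exons / n_genes, 2),
        # mean exon and intron sizes in bp
        "avg_exon_size_bp": round(exon_total_mb * 1_000_000 / n_exons),
        "avg_intron_size_bp": round(intron_total_mb * 1_000_000 / n_introns),
    }


stats = derived_stats()
print(stats)
```

The recomputed gene density agrees with the 4.54 kb per gene listed for Release 5 in the release summary table; the statistics table reports the same quantity rounded to 4.5.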
Figure 1. The Arabidopsis genome as depicted in release 5 of the Arabidopsis genome annotation. Each BAC sequence region within each chromosome is shown colored according to the original sequencing group. The unsequenced NOR and 5S rDNA clusters are colored black and centromeric regions are colored red, both with rounded edges and drawn to scale based on their estimated sizes.
Summary statistics for TIGR Arabidopsis annotation releases.
| | Nature (12/00) | Release 1 (8/01) | Release 2 (1/02) | Release 3 (8/02) | Release 4 (4/03) | Release 5 (1/04) |
| Genome size (Mb) | 115.410 | 116.238 | 117.227 | 117.077 | 119.055 | 118.998 |
| protein-coding genes | 25,498 | 25,554 | 26,156 | 27,117 | 27,170 | 26,207 |
| transposons and pseudogenes | NA | 1,274 | 1,305 | 1,967 | 2,218 | 3,786 |
| Genes annotated as alternatively spliced | NA | 0 | 28 | 162 | 1,267 | 2,330 |
| genes with UTRs | NA | 4,140 | 10,219 | 11,691 | 17,060 | 18,099 |
| Protein-coding genes similar to transposon ORFs [a] | NA | 487 | 485 | 528 | 531 | 6 |
| gene density (kb per gene) | 4.5 | 4.55 | 4.48 | 4.32 | 4.38 | 4.54 |
| exons / gene | 5.2 | 5.23 | 5.25 | 5.24 | 5.31 | 5.42 |
| average exon length (bp) | 250 | 256 | 265 | 266 | 279 | 276 |
| average intron length (bp) | 168 | 168 | 167 | 166 | 166 | 164 |
| Gene structures altered since previous release. | NA | - | u: 2,853 | u:1,366 | u: 2,347 | u: 2,858 |
Gene structure modifications from each previous release are represented by u: updated, a: added, d: deleted, m: merged, and s: split.
[a] Annotated protein-coding genes with a BLASTP match having an E-value <= 1e-20.
Genes classified by alternative splicing variation.
| Genes with isoform type | ||
| Alternative acceptor and/or donor | 1,050 | 70% |
| Unspliced introns | 926 | 67% |
| Alternate terminal exons | 99 | 28% |
| Exon skipping | 130 | 68% |
| Start or end within intron | 520 | 47% |
Figure 2. Screenshot of the Annotation Station gene editor. The evidence for gene identification and gene modeling is viewed using proprietary software called Annotation Station, developed by Neomorphic and now maintained by Affymetrix. This tool, similar to Apollo, which was developed at Berkeley and Sanger [105], is used by human annotators as a genome navigation tool and gene structure modeling tool. The gene models, proteins and transcript alignments are shown for an approximately 4.5 kb window along the minus strand of BAC F10O3 in the region encoding the 3-methylcrotonyl-CoA carboxylase 1 (At1g03090). The curated gene structures are shown in dark green on the white background towards the bottom of the view, with exons filled, and introns and UTRs unfilled. Above this curation within the black background, evidence is shown from bottom to top as follows: splice site predictions, computational gene predictions, protein alignments shown in orange, EST alignments from searching the various plant Gene Indices in varied colors, regions of homology to the genome of Brassica oleracea shown in dark blue at the top of the view, and PASA Arabidopsis transcript alignment assemblies at the top shown in bright pink. The vertical marker line indicates the position of a skipped exon (supported by both PASA FL-cDNA and protein alignments) that results in two protein isoforms.
Figure 3. Distribution of proteins within families constructed using two distinct family-building methods: our currently employed domain-composition-based clustering versus the single-linkage BLASTP-based clustering method originally described. A: Frequency distribution of family sizes created by the two methods. B: Difference between the two methods evaluated on a per-protein basis. The difference in family size between domain-based clustering and the single-linkage clustering method (DBC - SLC) was calculated for each protein that was included in a family using both methods. The histogram shows the total number of proteins found at each size difference displayed on the abscissa, binned at increments of 10.
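The per-protein comparison described for panel B can be illustrated with a short sketch. It assumes each method's output is available as a list of sets of protein identifiers; the function names, the toy identifiers, and the floor-based binning convention are all hypothetical, since the caption does not specify how the bins were delimited.

```python
from collections import Counter


def family_sizes(families):
    """Map each protein to the size of the family that contains it."""
    return {protein: len(members) for members in families for protein in members}


def size_difference_histogram(dbc_families, slc_families, bin_width=10):
    """Histogram of (DBC size - SLC size) for proteins placed by both methods."""
    dbc = family_sizes(dbc_families)
    slc = family_sizes(slc_families)
    hist = Counter()
    # Only proteins assigned to a family by both methods are compared.
    for protein in dbc.keys() & slc.keys():
        diff = dbc[protein] - slc[protein]
        # Assumed convention: floor each difference to the lower bin edge.
        hist[(diff // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))


# Toy input (hypothetical identifiers, not data from the paper).
dbc_toy = [{"p1", "p2", "p3"}, {"p4", "p5"}, {"p6", "p7"}]
slc_toy = [{"p1", "p2", "p3", "p4", "p5"}, {"p6", "p7"}]
toy_hist = size_difference_histogram(dbc_toy, slc_toy)
print(toy_hist)
```

In this toy case the single-linkage method merges two domain-based families into one, so five proteins show a negative size difference, while the two proteins whose family is identical under both methods fall in the zero bin.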
Figure 4. The distribution of genes in major categories of the Gene Ontologies. Each of the 26,207 protein-coding genes was assigned to at least one GO term, with our primary focus being the assignment of genes to Molecular Function terms. The ontology categories illustrated correspond to those of the plant GO slim obtained from
Transposon classification.
| gypsy-like retrotransposon family (Athila) | 511 |
| gypsy-like retrotransposon family | 374 |
| copia-like retrotransposon family | 494 |
| non-LTR retrotransposon family (LINE) | 264 |
| other | 9 |
| hAT-like transposase family (hobo/Ac/Tam3) | 77 |
| CACTA-like transposase family (En/Spm) | 69 |
| CACTA-like transposase family (Ptta/En/Spm) | 127 |
| CACTA-like transposase family (Tnp1/En/Spm) | 37 |
| CACTA-like transposase family (Tnp2/En/Spm) | 102 |
| Mutator-like transposase family | 268 |
| Mariner-like transposase family | 9 |
| other | 14 |