Abstract
The public expressed sequence tag collections are continually being enriched with high-quality sequences that represent an ever-expanding range of taxonomically diverse plant species. While these sequence collections provide biased insight into the populations of expressed genes available within individual species and their associated tissues, the information is conceivably of wider relevance in a comparative context. When we consider the available expressed sequence tag (EST) collections of summer 2004, most of the major plant taxonomic clades are at least superficially represented. Investigation of the five million available plant ESTs provides a wealth of information that has applications in modelling the routes of plant genome evolution and the identification of lineage-specific genes and gene families. Over four million ESTs from over 50 distinct plant species have been collated within an EST analysis pipeline called openSputnik. The ESTs were resolved down into approximately one million unigene sequences. These have been annotated using orthology-based annotation transfer from reference plant genomes and using a variety of contemporary bioinformatics methods to assign peptide, structural and functional attributes. The openSputnik database is available at http://sputnik.btk.fi.
Year: 2005 PMID: 15608275 PMCID: PMC539994 DOI: 10.1093/nar/gki040
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. A depiction of the phylogenetic relationships among the major plant lineages as published previously (23). The evolutionary tree has been overlaid with the names of plant species having large EST collections (>5000 sequences) that are available in the current release of openSputnik. The symbol ‘**’ denotes the plant groups where either small EST collections (>1000 ESTs) are available or as-yet unreleased sequences are known to exist. This figure reveals the taxonomic distribution of large plant EST collections, but also highlights the strong bias towards the agriculturally important species.
Figure 2. A simplification of the directed acyclic graph that describes the analytical pipeline used to build the openSputnik database. As starting material, species-specific EMBL flat files are imported and all annotations are retained. This creates a sequence source ‘EST collection’. This source is used to derive two other annotative sources, the ‘UNIGENE collection’ and the ‘PEPTIDE collection’ (sources shown in red). When the sources have been built, they are annotated using a variety of methods highlighted in green. The analyses anchored to the schema are used to create derived annotations including Funcat and GO terms (shown in orange). All analyses are made available to the database user via the openSputnik interface.
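The dependency structure described in the Figure 2 caption can be sketched as a small directed acyclic graph. This is an illustration only, not the openSputnik implementation: the node names for the annotation steps are hypothetical, and the edge from the UNIGENE collection to the PEPTIDE collection is an assumption based on the table legend, which says peptide sequences were prepared for each unigene.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a pipeline stage; its value is the set of stages it depends on.
# Stage names for the annotation steps are hypothetical placeholders.
pipeline = {
    "EST collection": {"EMBL flat files"},
    "UNIGENE collection": {"EST collection"},
    # Assumption: peptides are derived from the unigenes.
    "PEPTIDE collection": {"UNIGENE collection"},
    "unigene annotation": {"UNIGENE collection"},
    "peptide annotation": {"PEPTIDE collection"},
    "derived annotations (Funcat, GO)": {
        "unigene annotation",
        "peptide annotation",
    },
}

# A topological order is one valid sequence in which the stages could be run.
order = list(TopologicalSorter(pipeline).static_order())

for step, stage in enumerate(order, start=1):
    print(step, stage)
```

Any ordering produced this way starts with the imported flat files and ends with the derived annotations, which matches the flow the caption describes.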
Table summarizing the sequence content of the openSputnik database
| Organism name | No. of ESTs | EST sequence (bp) | No. of singletons | No. of assemblies | Unigene sequence (bp) | Redundancy | Peptide sequence (aa) | Protein coding potential (%) |
|---|---|---|---|---|---|---|---|---|
| | 19 582 | 13 016 289 | 7252 | 4020 | 8 544 747 | 1.5 | 2 531 519 | 88.9 |
| | 190 741 | 84 128 065 | 17 675 | 20 109 | 22 482 688 | 3.7 | 6 135 202 | 81.9 |
| | 20 151 | 10 184 665 | 9244 | 3706 | 7 368 791 | 1.4 | 2 015 990 | 82.1 |
| | 37 159 | 21 438 036 | 8041 | 5447 | 8 389 217 | 2.6 | 2 403 184 | 85.9 |
| | 22 433 | 10 226 020 | 7326 | 3056 | 5 496 951 | 1.9 | 1 477 080 | 80.6 |
| | 154 600 | 82 230 382 | 18 211 | 10 989 | 23 178 755 | 3.5 | 2 388 596 | 30.9 |
| | 23 337 | 12 738 998 | 5311 | 3416 | 5 474 795 | 2.3 | 1 473 294 | 80.7 |
| | 7128 | 3 624 193 | 3202 | 1203 | 2 457 784 | 1.5 | 579 834 | 70.8 |
| | 5952 | 2 873 079 | 2230 | 697 | 1 597 282 | 1.8 | 349 001 | 65.5 |
| | 5468 | 2 529 150 | 3146 | 741 | 1 908 962 | 1.3 | 564 147 | 88.7 |
| | 344 524 | 158 703 384 | 28 963 | 24 892 | 33 585 032 | 4.7 | 8 648 792 | 77.3 |
| | 38 915 | 26 139 867 | 10 007 | 6076 | 13 043 919 | 2.0 | 2 958 835 | 68.1 |
| | 13 571 | 8 414 112 | 5934 | 1914 | 5 367 083 | 1.6 | 1 334 901 | 74.6 |
| | 5416 | 2 476 009 | 3595 | 641 | 2 022 087 | 1.2 | 450 943 | 66.9 |
| | 4875 | 2 228 284 | 3313 | 530 | 1 830 094 | 1.2 | 402 306 | 65.9 |
| | 59 841 | 25 553 028 | 11 900 | 6050 | 8 654 947 | 3.0 | 2 086 806 | 72.3 |
| | 12 787 | 4 929 193 | 4646 | 1029 | 2 309 089 | 2.1 | 516 763 | 67.1 |
| | 10 340 | 4 149 627 | 3844 | 1012 | 1 997 115 | 2.1 | 458 465 | 68.9 |
| | 372 431 | 198 114 717 | 25 405 | 23 033 | 37 345 565 | 5.3 | 9 139 515 | 73.4 |
| | 25 899 | 15 289 506 | 4572 | 4829 | 6 252 258 | 2.4 | 1 682 965 | 80.8 |
| | 68 188 | 35 969 889 | 12 427 | 7998 | 13 090 218 | 2.7 | 3 527 514 | 80.8 |
| | 36 311 | 13 987 475 | 7646 | 4248 | 5 529 908 | 2.5 | 1 635 214 | 88.7 |
| | 150 228 | 75 468 371 | 13 178 | 14 870 | 19 372 969 | 3.9 | 5 380 403 | 83.3 |
| | 8346 | 3 842 358 | 2408 | 901 | 1 770 921 | 2.2 | 503 014 | 85.2 |
| | 187 763 | 101 662 463 | 19 448 | 17 189 | 27 597 708 | 3.7 | 6 630 342 | 72.1 |
| | 25 803 | 15 782 659 | 4831 | 3137 | 5 941 245 | 2.7 | 1 541 786 | 77.9 |
| | 10 323 | 5 104 499 | 8710 | 630 | 4 738 148 | 1.1 | 952 839 | 60.3 |
| | 5268 | 2 367 832 | 2756 | 591 | 1 658 572 | 1.4 | 452 963 | 81.9 |
| | 260 901 | 136 090 821 | 30 971 | 20 934 | 34 467 815 | 3.9 | 8 593 185 | 74.8 |
| | 12 121 | 7 911 359 | 3043 | 1526 | 3 439 590 | 2.3 | 894 960 | 78.1 |
| | 20 120 | 8 487 980 | 4419 | 2431 | 3 269 096 | 2.6 | 886 986 | 81.4 |
| | 102 219 | 54 477 833 | 10 114 | 13 309 | 15 177 696 | 3.6 | 3 521 525 | 69.6 |
| | 15 719 | 7 679 661 | 4974 | 2452 | 4 209 291 | 1.8 | 1 036 699 | 73.9 |
| | 110 622 | 51 626 003 | 14 632 | 11 610 | 15 972 215 | 3.2 | 3 945 832 | 74.1 |
| | 6390 | 4 107 970 | 1644 | 1209 | 2 220 609 | 1.8 | 568 758 | 76.8 |
| | 10 446 | 5 769 749 | 3856 | 1480 | 3 192 053 | 1.8 | 862 949 | 81.1 |
| | 30 296 | 14 140 412 | 7031 | 3664 | 5 503 910 | 2.6 | 1 522 330 | 83.0 |
| | 70 091 | 30 629 346 | 14 699 | 7954 | 11 475 126 | 2.7 | 3 192 054 | 83.5 |
| | 13 050 | 6 174 206 | 2634 | 2218 | 2 413 573 | 2.6 | 706 585 | 87.8 |
| | 20 979 | 9 801 783 | 2774 | 2045 | 2 853 651 | 3.4 | 681 731 | 71.7 |
| | 11 452 | 6 496 591 | 3206 | 1588 | 3 135 288 | 2.1 | 883 165 | 84.5 |
| | 246 301 | 156 538 942 | 29 895 | 25 089 | 45 845 406 | 3.4 | 11 003 162 | 72.0 |
| | 8807 | 4 377 943 | 4784 | 1155 | 3 165 611 | 1.4 | 788 520 | 74.7 |
| | 9194 | 4 313 461 | 3793 | 1346 | 2 687 830 | 1.6 | 662 342 | 73.9 |
| | 94 525 | 51 346 134 | 6651 | 15 983 | 16 752 895 | 3.1 | 4 715 299 | 84.4 |
| | 161 766 | 83 411 684 | 16 955 | 17 704 | 23 132 774 | 3.6 | 6 004 630 | 77.9 |
| | 21 387 | 9 750 610 | 5371 | 3507 | 4 673 286 | 2.1 | 1 209 822 | 77.7 |
| | 5548 | 3 242 045 | 2498 | 713 | 2 048 965 | 1.6 | 578 303 | 84.7 |
| | 6562 | 2 607 871 | 1988 | 753 | 1 103 776 | 2.4 | 276 188 | 75.1 |
| | 511 732 | 257 643 801 | 49 171 | 33 666 | 51 549 049 | 5.0 | 12 964 652 | 75.5 |
| | 9973 | 4 956 308 | 3941 | 1681 | 3 212 869 | 1.5 | 810 910 | 75.7 |
| | 6533 | 3 604 678 | 1032 | 1052 | 1 385 939 | 2.6 | 349 250 | 75.6 |
| | 135 712 | 74 769 503 | 9616 | 12 893 | 16 019 102 | 4.7 | 4 176 665 | 78.2 |
| | 384 391 | 173 945 698 | 24 266 | 25 725 | 29 187 808 | 6.0 | 7 017 868 | 72.1 |
| | 9783 | 4 896 796 | 6536 | 1456 | 4 140 824 | 1.2 | 890 004 | 64.5 |
A total of 55 plant species, representing a broad taxonomic distribution, are included in the current release. For each species, the table shows the number of ESTs and the total nucleotide length of all EST sequences, followed by the number of resulting singleton unigenes and multi-member assemblies and the summed length of all unigene sequence. The difference between the total nucleotide length of the EST and unigene sequences is summarized as apparent redundancy. Since peptide sequences have been prepared for each of the unigenes, the total length of all derived peptides is also shown, along with a measure of apparent coding potential across the whole unigene set.
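The two derived columns can be reproduced from the other columns with simple arithmetic. The sketch below uses formulas inferred from the tabulated values rather than stated in the source: redundancy as total EST length divided by total unigene length, and coding potential as the percentage of unigene nucleotides covered by the derived peptides at three nucleotides per amino acid. The function names are illustrative.

```python
def redundancy(est_bp: int, unigene_bp: int) -> float:
    """Apparent redundancy: EST nucleotides per unigene nucleotide (inferred formula)."""
    return est_bp / unigene_bp


def coding_potential(peptide_aa: int, unigene_bp: int) -> float:
    """Percentage of unigene sequence covered by peptide, 3 nt per residue (inferred formula)."""
    return 100.0 * 3 * peptide_aa / unigene_bp


# Values taken from the first data row of the table.
est_bp = 13_016_289
unigene_bp = 8_544_747
peptide_aa = 2_531_519

print(round(redundancy(est_bp, unigene_bp), 1))            # 1.5, as tabulated
print(round(coding_potential(peptide_aa, unigene_bp), 1))  # 88.9, as tabulated
```

The same formulas also reproduce the tabulated values for the last row (6.0 and 72.1 from 173 945 698 bp of EST sequence, 29 187 808 bp of unigene sequence and 7 017 868 aa of peptide).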