Emmet A O'Brien, Liisa B Koski, Yue Zhang, LiuSong Yang, Eric Wang, Michael W Gray, Gertraud Burger, B Franz Lang.
Abstract
The TBestDB database contains approximately 370,000 clustered expressed sequence tag (EST) sequences from 49 organisms, covering a taxonomically broad range of poorly studied, mainly unicellular eukaryotes, and includes experimental information, consensus sequences, gene annotations and metabolic pathway predictions. Most of these ESTs have been generated by the Protist EST Program, a collaboration among six Canadian research groups. EST sequences are read from trace files up to a minimum quality cut-off, vector and linker sequence is masked, and the ESTs are clustered using phrap. The resulting consensus sequences are automatically annotated by using the AutoFACT program. The datasets are automatically checked for clustering errors due to chimerism and potential cross-contamination between organisms, and suspect data are flagged in or removed from the database. Access to data deposited in TBestDB by individual users can be restricted to those users for a limited period. With this first report on TBestDB, we open the database to the research community for free processing, annotation, interspecies comparisons and GenBank submission of EST data generated in individual laboratories. For instructions on submission to TBestDB, contact tbestdb@bch.umontreal.ca. The database can be queried at http://tbestdb.bcm.umontreal.ca/.
Year: 2007 PMID: 17202165 PMCID: PMC1899108 DOI: 10.1093/nar/gkl770
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Publicly available sequence content of TBestDB (July 1, 2006)
| Organism name | No. of ESTs | No. of clusters |
|---|---|---|
| | 13 814 | 5262 |
| | 3464 | 2573 |
| | 5073 | 2149 |
| | 3623 | 1557 |
| | 2376 | 700 |
| | 2730 | 1718 |
| | 3462 | 2318 |
| | 12 759 | 3330 |
| | 8863 | 2516 |
| | 5124 | 1388 |
| | 9867 | 2448 |
| | 4673 | 1478 |
| | 4791 | 3664 |
| | 17 236 | 8651 |
| | 8745 | 2831 |
| | 9505 | 4986 |
| | 1188 | 701 |
| | 6804 | 2038 |
| | 4009 | 1763 |
| | 2756 | 1762 |
| | 12 205 | 6095 |
| | 4323 | 2286 |
| | 5452 | 2565 |
| | 16 544 | 11 903 |
| | 4437 | 2314 |
| | 9798 | 4505 |
| | 19 182 | 4438 |
| | 5615 | 1771 |
| | 3662 | 2004 |
| | 6433 | 2677 |
| | 126 | 115 |
| | 2272 | 1230 |
| | 7590 | 3383 |
| | 9684 | 3078 |
| | 4445 | 1247 |
| | 5062 | 2151 |
| | 5641 | 1542 |
| | 17 644 | 6797 |
| | 12 570 | 5105 |
| | 3840 | 1008 |
| | 9300 | 3520 |
| | 6615 | 2666 |
| | 5256 | 2217 |
| | 8006 | 2763 |
| | 5365 | 2079 |
| | 4475 | 2595 |
| | 3919 | 1435 |
| | 31 548 | 9050 |
| | 9615 | 2686 |
| Total | 371 484 | 149 058 |
Figure 1. EST processing pipeline. EST tracefiles are accepted in .scf or .abi format via a dedicated sftp server. Any EST for which phred cannot read more than 60 nt of high-quality sequence is discarded. The default value for quality is 99% certainty of identification of each residue (ABI sequence technology), but this value has been set to slightly lower thresholds in certain instances where justified by the effective quality. The parameters used for cross_match have been adjusted slightly from the defaults: the minscore value has been changed from 20 to 17, to allow for slightly more relaxed matches, as this was found to give the best identification and masking of short linker sequences. At this point any EST sequence containing fewer than 60 unmasked residues is removed from further consideration. AutoFACT combines the most informative of the top 10 BLAST hits from the European Ribosomal Database (BLASTN), UniRef90 (BLASTX), KEGG (BLASTX), COG (BLASTX), Pfam (RPS-BLAST), and NCBI's nr (BLASTX) and est_others (TBLASTX) databases. Default parameters bitscore >40 and E-value <1 × 10⁻⁴ were used. Rapid Annotation is performed using BLASTX against a specialized set of sequences (see Annotation in text) with an E-value cut-off of 1 × 10⁻⁴. Top-BLAST-hit annotations are from TBLASTX hits to NCBI's nr database using an E-value cut-off of 1 × 10⁻⁴. ORF prediction is performed by translating the consensus sequence in all frames, identifying stop codons and marking any potential ORF longer than 20 residues.
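The ORF prediction step described at the end of the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the TBestDB code: the function names (`find_orfs`, `translate`, `reverse_complement`) are hypothetical, the standard genetic code is assumed (several protists use variant codes, which a real pipeline would need to handle), and an ORF is taken here to be any stop-free stretch of more than 20 residues in one of the six reading frames, with no start codon required.

```python
from itertools import product

# Standard genetic code (an assumption; some protists use variant codes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown symbols are left unchanged."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate complete codons; ambiguous codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(consensus, min_residues=20):
    """Return (frame, residue_offset, peptide) for every stop-free stretch
    longer than min_residues in the six reading frames."""
    seq = consensus.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(strand_seq[offset:])
            pos = 0
            for segment in protein.split("*"):  # '*' marks a stop codon
                if len(segment) > min_residues:
                    orfs.append((f"{strand}{offset + 1}", pos, segment))
                pos += len(segment) + 1  # skip the segment and its stop
    return orfs
```

Calling `find_orfs` on a cluster consensus sequence returns one tuple per candidate ORF; the frame label (`+1` to `-3`) and the residue offset within that frame's translation are conventions chosen for this sketch only.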
Databases searched and classification information assigned by AutoFACT
| Database | Classification Information | Reference |
|---|---|---|
| European Ribosomal Database | Large subunit (LSU) ribosomal RNAs | ( |
| | Small subunit (SSU) ribosomal RNAs | |
| UniProt's UniRef90 | Gene Ontology terms | ( |
| | Enzyme Commission numbers | |
| | Locus names | |
| Clusters of Orthologous Groups (COG) | Functional categories | ( |
| Kyoto Encyclopedia of Genes and Genomes (KEGG) | Metabolic pathways | ( |
| | Enzyme Commission numbers | |
| | Locus names | |
| Protein Families Database (Pfam) | Protein domains | ( |
| NCBI's non-redundant database (nr) | N/A | ( |
| NCBI's est_others database | | |
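As a rough illustration of how the thresholds quoted in the Figure 1 caption (bitscore >40, E-value <1 × 10⁻⁴, top 10 hits) might be applied to hits from the databases listed above, the following Python sketch filters a list of hits per database. It is not AutoFACT's implementation: the `Hit` record, the `significant_hits` function and the choice to rank hits by bitscore are assumptions made for this example, and AutoFACT's selection of the "most informative" hit is not modelled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit (not AutoFACT's data model)."""
    database: str
    description: str
    bitscore: float
    evalue: float


MIN_BITSCORE = 40.0  # caption: bitscore > 40
MAX_EVALUE = 1e-4    # caption: E-value < 1 x 10^-4


def significant_hits(hits, min_bitscore=MIN_BITSCORE,
                     max_evalue=MAX_EVALUE, top_n=10):
    """Group hits by database, keeping at most top_n hits per database
    (ranked by bitscore, an assumption) that pass both thresholds."""
    kept = {}
    for hit in sorted(hits, key=lambda h: h.bitscore, reverse=True):
        if hit.bitscore > min_bitscore and hit.evalue < max_evalue:
            bucket = kept.setdefault(hit.database, [])
            if len(bucket) < top_n:
                bucket.append(hit)
    return kept
```

A database with no hit passing both thresholds is simply absent from the returned dictionary, which makes it easy for a caller to fall through to the next database in its preferred order.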
Figure 2. Cluster information page. The head of the cluster information page contains the cluster consensus sequence, links to the ESTs assembled within the cluster and all annotation information. The lower half of the page contains an image illustrating the structure of the cluster. The positions of each EST are indicated. ESTs originating from different libraries are shown in different colours. The read direction of each EST is shown with an arrowhead when that information is available, and ESTs that have been internally reverse-complemented by phrap in the process of cluster assembly are indicated in outline. A multiple alignment is then shown depicting the ESTs and clustered consensus sequence in the same pattern (the right-hand portion of the sequence alignment is truncated in order to improve readability of the figure).