Robin P Smith, William J Buchser, Marcus B Lemmon, Jose R Pardinas, John L Bixby, Vance P Lemmon.
Abstract
BACKGROUND: Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds.
Year: 2008 PMID: 18402700 PMCID: PMC2322989 DOI: 10.1186/1471-2105-9-186
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Data Pipeline. Raw sequence data is imported into EST Express along with phred scores, where it is then screened for contaminating vector sequences and masked for quality. Good quality sequences are then batch BLASTed against a local UniGene database and the top hit is assigned to each sample. A local copy of the Entrez Gene database is then linked to the UniGene identifier and used to annotate each sequence with a description, Gene Ontology identifiers, RefSeq mRNA and protein links, and genomic context. Oligo(dT)-primed sequences can then be analyzed for full-length status using a local copy of the RefSeq protein database and the Entrez Gene cross references. The user interface then provides several ways to browse and visualize the results from the pipeline.
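The first stages of the pipeline in Figure 1 can be illustrated with a minimal Python sketch. This is not the EST Express implementation: the function names, the `BlastHit` record, and the phred cutoff of 20 are assumptions chosen for illustration, and the BLAST search itself is represented only by its already-parsed hits.

```python
from typing import Dict, List, NamedTuple, Optional


class BlastHit(NamedTuple):
    """One parsed BLAST hit (hypothetical record layout)."""
    sample_id: str
    unigene_id: str
    score: float


def mask_low_quality(seq: str, phred: List[int], min_q: int = 20) -> str:
    """Replace bases whose phred score is below min_q with 'N'.

    The cutoff of 20 is an assumed default, not a value from the paper.
    """
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, phred))


def top_hits(hits: List[BlastHit]) -> Dict[str, BlastHit]:
    """Keep only the highest-scoring hit for each sample."""
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        current = best.get(hit.sample_id)
        if current is None or hit.score > current.score:
            best[hit.sample_id] = hit
    return best


def annotate(
    best: Dict[str, BlastHit],
    gene_by_unigene: Dict[str, str],
) -> Dict[str, Optional[str]]:
    """Link each sample's UniGene identifier to a gene description.

    gene_by_unigene stands in for the local Entrez Gene lookup;
    samples whose cluster has no entry map to None.
    """
    return {
        sample: gene_by_unigene.get(hit.unigene_id)
        for sample, hit in best.items()
    }
```

For example, `mask_low_quality("ACGT", [30, 10, 25, 5])` masks the second and fourth bases, and `top_hits` followed by `annotate` mirrors the "top hit, then Entrez Gene link" order described in the caption.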
Figure 2. Screenshots from EST Express. A: Screenshot of the "Plate Viewer" page showing details for plate JN02001X1 owned by user "rsmith" in project "robinDNQ". For each matched sample in the plate a UniGene identifier is listed, along with the BLAST score and Entrez Gene and full-length annotations. B: Capture from the "Project Viewer" page showing a graphical breakdown of ESTs within a project. "Vector" refers to sequences designated vector-only. "Bad_sequence" refers to sequences with low-quality reads. "Unknown" refers to samples that are neither vector-only nor low quality, but do not match against the UniGene database. "Uniques" refers to the number of unique UniGene clusters in the project and "Repeats" refers to additional instances of those unique clusters. C: Capture from the "New Gene" library tool, showing the number of new unique UniGene clusters found with each successive round of sequencing. Further rounds of sequencing produce progressively fewer unique clusters. Both B and C were produced dynamically using the JpGraph PHP graphics library.
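The counts behind panels B and C can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the tool's own PHP code: the `(status, cluster)` tuple encoding, the `"matched"` status label, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Each sample is (status, cluster); cluster is a UniGene ID only when
# status == "matched". This encoding is an assumption for illustration.
Sample = Tuple[str, Optional[str]]


def project_breakdown(samples: Iterable[Sample]) -> Dict[str, int]:
    """Count samples per category as in panel B.

    The first occurrence of a cluster counts as a "Uniques" entry,
    every later occurrence of the same cluster as a "Repeats" entry.
    """
    counts: Counter = Counter()
    seen: Set[str] = set()
    for status, cluster in samples:
        if status != "matched" or cluster is None:
            counts[status] += 1          # Vector, Bad_sequence, Unknown
        elif cluster in seen:
            counts["Repeats"] += 1
        else:
            seen.add(cluster)
            counts["Uniques"] += 1
    return dict(counts)


def new_clusters_per_round(rounds: Iterable[Iterable[str]]) -> List[int]:
    """Count clusters not seen in any earlier round, as in panel C."""
    seen: Set[str] = set()
    new_counts: List[int] = []
    for clusters in rounds:
        fresh = set(clusters) - seen
        new_counts.append(len(fresh))
        seen |= fresh
    return new_counts
```

Calling `new_clusters_per_round` on successive sequencing rounds yields the per-round totals that the "New Gene" plot displays; a declining series corresponds to the diminishing returns noted in the caption.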
Figure 3. Results of analyses on the subtracted data set. A: Distribution of identifications made by EST Express for all 2,016 samples. B: Distribution of associations made for 1,068 distinct UniGene entries.