Byungwook Lee, Gwangsik Shin.
Abstract
The EST division of GenBank, dbEST, is widely used in many applications such as gene discovery and verification of exon-intron structure. However, the use of EST sequences in the dbEST libraries is often hampered by inconsistent terminology used to describe the library sources and by the presence of contaminated sequences. Here, we describe CleanEST, a novel database server that classifies dbEST libraries and removes contaminants. We classified all dbEST libraries according to species and sequencing center. In addition, we further classified human EST libraries by anatomical and pathological systems according to eVOC ontologies. For each dbEST library, we provide two different sets of cleansed sequences: 'pre-cleansed' and 'user-cleansed'. To generate pre-cleansed sequences, we cleansed sequences in dbEST by aligning them against well-known contamination sources: UniVec, Escherichia coli, mitochondria and chloroplast (for plants). To provide user-cleansed sequences, we built an automatic user-cleansing pipeline, in which sequences of a user-selected library are cleansed on-the-fly according to user-selected options. The server is available at http://cleanest.kobic.re.kr/ and the database is updated monthly.
Year: 2008 PMID: 18832365 PMCID: PMC2686460 DOI: 10.1093/nar/gkn648
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Distribution of contaminated EST sequences in dbEST libraries
| Frequency of contaminated ESTs (%) | Number of libraries |
|---|---|
| >90 | 171 |
| 90–80 | 80 |
| 80–70 | 91 |
| 70–60 | 122 |
| 60–50 | 214 |
| 50–40 | 218 |
| 40–30 | 381 |
| 30–20 | 835 |
| 20–10 | 1963 |
| <10 | 18 382 |
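The classes in this table can be reproduced from per-library contamination percentages with a simple binning step. The following is a minimal sketch, not the CleanEST implementation: the names `contamination_class`, `tally` and `TABLE_COUNTS` are hypothetical, and the handling of boundary values (for example, whether exactly 90% belongs to ">90" or "90–80") is an assumption, since the table does not specify it.

```python
from collections import Counter

# Class labels in the same order as the table above (highest first).
LABELS = [">90", "90–80", "80–70", "70–60", "60–50",
          "50–40", "40–30", "30–20", "20–10", "<10"]

# Library counts copied from the table above.
TABLE_COUNTS = dict(zip(
    LABELS, [171, 80, 91, 122, 214, 218, 381, 835, 1963, 18382]))


def contamination_class(percent: float) -> str:
    """Map a contamination percentage (0-100) to a table class label.

    Assumption: ">90" and "<10" are strict; every other value is placed
    in the decade starting at its floor, and exactly 90 goes to "90–80".
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent > 90:
        return ">90"
    if percent < 10:
        return "<10"
    decade = min(int(percent // 10), 8)  # 1..8
    return f"{(decade + 1) * 10}–{decade * 10}"


def tally(percents) -> Counter:
    """Count how many libraries fall into each class."""
    return Counter(contamination_class(p) for p in percents)


# Summary figures derived from the table itself.
total_libraries = sum(TABLE_COUNTS.values())
share_below_10 = TABLE_COUNTS["<10"] / total_libraries
```

Summing the table's counts gives 22 457 libraries, of which the "<10" class accounts for about 82%.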
Figure 1. Distribution of average contamination ratio of sequencing centers. Sequencing centers were classified according to their total number of sequences in dbEST, and an average contamination ratio was calculated for each class. The x-axis represents the classes of sequencing centers and the y-axis represents their contamination ratio. The average contamination ratio is lower for centers that have submitted a larger number of sequences. Small sequencing centers (<10 000 ESTs) have more than double the contamination of large sequencing centers (>1 000 000 ESTs).
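The grouping described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: only the smallest (<10 000) and largest (>1 000 000) thresholds appear in the caption, so the intermediate classes are hypothetical; the average is assumed to be the mean of per-center ratios rather than a pooled ratio; and the function names and sample numbers are made up.

```python
from collections import defaultdict
from statistics import mean


def size_class(total_ests: int) -> str:
    """Assign a sequencing center to a size class by its EST count.

    Assumption: the two middle classes are hypothetical; the caption
    only names the "<10 000" and ">1 000 000" classes.
    """
    if total_ests < 10_000:
        return "<10 000"
    if total_ests <= 100_000:
        return "10 000–100 000"
    if total_ests <= 1_000_000:
        return "100 000–1 000 000"
    return ">1 000 000"


def average_ratio_by_class(centers):
    """Average the per-center contamination ratio within each size class.

    `centers` is an iterable of (total_ests, contaminated_ests) pairs.
    Centers with no sequences are skipped.
    """
    ratios = defaultdict(list)
    for total, contaminated in centers:
        if total <= 0:
            continue
        ratios[size_class(total)].append(contaminated / total)
    return {label: mean(values) for label, values in ratios.items()}


# Made-up example data: (total ESTs, contaminated ESTs) per center.
example_centers = [(5_000, 1_000), (8_000, 800), (2_000_000, 100_000)]
example_averages = average_ratio_by_class(example_centers)
```

Taking the mean of per-center ratios gives every center equal weight; a pooled ratio (total contaminated ESTs divided by total ESTs per class) would instead weight larger centers more heavily.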
Figure 2. Screenshots of CleanEST showing the query ‘lung AND cancer’. (A) Query of ‘lung’ in the Anatomy ontology and ‘cancer’ in the Pathology ontology of the eVOC search menu. (B) Library list showing the results of the query.