MOTIVATION: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in such analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated. RESULTS: We identify and quantify two types of EST clustering error, namely Type I and Type II, in EST clustering using the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error rate for 5' EST clustering is approximately 10 times higher than for 3' EST clustering (30% versus 3%). An over-stringent identity rule, e.g., P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that approximately 80% of the Type I error in 5' EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and so provide more accurate estimates of the true gene cluster profile.
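The two error types defined above can be illustrated with a minimal Python sketch. It assumes the true gene of origin is known for every EST, and it counts a gene as a Type I error when its ESTs are spread over more than one cluster, and a cluster as a Type II error when it holds ESTs from more than one gene. The function name `clustering_error_rates`, the toy data, and the choice of denominators (genes for Type I, clusters for Type II) are assumptions made for illustration; the abstract does not state how the reported rates were normalized.

```python
from collections import defaultdict


def clustering_error_rates(true_gene, cluster_of):
    """Return (type1_rate, type2_rate) for one clustering result.

    true_gene  : dict mapping EST id -> gene it truly comes from
    cluster_of : dict mapping EST id -> cluster id assigned by the assembler
                 (a singleton EST is treated as a cluster of size one)

    Assumed definitions (illustrative, not taken from the paper):
      type1_rate = fraction of genes whose ESTs fall into more than one cluster
      type2_rate = fraction of clusters that contain ESTs from more than one gene
    """
    if not true_gene:
        raise ValueError("no ESTs supplied")

    clusters_per_gene = defaultdict(set)
    genes_per_cluster = defaultdict(set)
    for est, gene in true_gene.items():
        cluster = cluster_of[est]
        clusters_per_gene[gene].add(cluster)
        genes_per_cluster[cluster].add(gene)

    # Type I: sibling ESTs from one gene failed to form a single cluster.
    split_genes = sum(1 for c in clusters_per_gene.values() if len(c) > 1)
    # Type II: ESTs from distinct genes were merged into one cluster.
    mixed_clusters = sum(1 for g in genes_per_cluster.values() if len(g) > 1)

    return (split_genes / len(clusters_per_gene),
            mixed_clusters / len(genes_per_cluster))


# Toy data: gene A is split across c1 and c2 (Type I),
# while genes B and C are merged in c3 (Type II).
true_gene = {"e1": "A", "e2": "A", "e3": "A", "e4": "B", "e5": "B", "e6": "C"}
cluster_of = {"e1": "c1", "e2": "c1", "e3": "c2", "e4": "c3", "e5": "c3", "e6": "c3"}

type1_rate, type2_rate = clustering_error_rates(true_gene, cluster_of)
print(f"Type I rate: {type1_rate:.3f}, Type II rate: {type2_rate:.3f}")
```

In this toy case one of three genes is split and one of three clusters is mixed. Each split gene adds at least one extra cluster, which is how Type I errors inflate the apparent number of unique genes.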