MOTIVATION: The numbers of finished and ongoing genome projects are increasing at a rapid rate, and providing the catalog of genes for these new genomes is a key challenge. Obtaining a set of well-characterized genes is a basic requirement in the initial steps of any genome annotation process. An accurate set of genes is needed in order to learn about species-specific properties, to train gene-finding programs, and to validate automatic predictions. Unfortunately, many new genome projects lack comprehensive experimental data to derive a reliable initial set of genes.

RESULTS: In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of conserved protein families that occur in a wide range of eukaryotes, and present a mapping procedure that accurately identifies their exon-intron structures in a novel genomic sequence. CEGMA includes the use of profile-hidden Markov models to ensure the reliability of the gene structures. Our procedure allows one to build an initial set of reliable gene annotations in potentially any eukaryotic genome, even those in draft stages.

AVAILABILITY: Software and data sets are available online at http://korflab.ucdavis.edu/Datasets.