A race through the maze of genomic evidence.

Timothy R Hughes, Frederick P Roth.

Year: 2008 PMID: 18613945 PMCID: PMC2447543 DOI: 10.1186/gb-2008-9-s1-s1

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

One of the most surprising aspects of the completed human and mouse genome sequences [1-3] has been the relatively small number of protein-coding genes. The current estimate of fewer than 24,000 protein-coding genes in human and mouse is only four times that of budding yeast [4]. A complete encyclopedia of biochemical, cellular, and physiological gene functions is now an immediate rather than a long-term goal.

The unifying theme for papers in this supplement to Genome Biology is the automated inference of the molecular function of gene products, and of their membership within cellular components and biological processes. In each study, thousands of variables describing genes and gene-gene relationships have been integrated using machine learning methods to infer Gene Ontology (GO) terms for essentially all genes in the studied genome.

Systematic biochemical and genetic experimentation in simpler model organisms has contributed to the rapid increase in the proportion of characterized genes. For example, about 80% of yeast genes have some annotated function [5]. Beyond simpler model systems such as yeast, however, cost and time requirements may preclude many systematic analyses, for example, resource-intensive phenotype assays in adult animals. Fortunately, efforts in simpler model organisms have illustrated that gene functions can be inferred on the basis of data types that are easier to collect systematically. Protein sequence features, expression patterns, and protein-protein interactions, for example, can provide powerful clues to function. This raises the prospect of directing resource-intensive experimentation toward the genes most likely to yield positive results. In yeast, this concept is well established, and the tradeoffs between performance measures, the efficacy of combinations of different data types in making different types of predictions, and the applicability of diverse inference algorithms are topics of active research.

Large-scale experimentation in mammals is coming of age. A wide variety of mRNA expression analysis experiments are available in public data repositories, for example, the Gene Expression Omnibus [6]. The majority of human genes have at least one moveable open reading frame clone [7-9], enabling expression studies in vitro and in model systems. 'Knockdown' reagents targeting most mouse and human genes are now available [10], facilitating analysis of biochemical and cellular gene functions. Efforts are underway to create a mutant allele of each mouse gene [11], which will enable analysis of physiological and developmental roles.

The first paper in this supplemental issue [12] describes the 'MouseFunc' challenge, in which nine bioinformatics teams independently predicted mouse GO terms. Importantly, each team used a common collection of training data and common benchmarks, which allowed comparison among the inference methods, data sets, and categories of gene functions. Predictions were tested using cross-validation (annotation for a subset of genes was hidden from the participants). Predictions were further tested by two forms of prospective evaluation: first, by using GO annotations that had been added to the database since the inception of the study; and second, by having experienced mouse biologists intensively investigate the literature related to top-scoring novel predictions.

Each of the companion papers in this issue is connected to the MouseFunc challenge, either in the nature of the algorithms used, in the datasets employed, or both. Guan and coworkers (led by Olga Troyanskaya) apply a support vector machine approach to predict mouse gene function [13]. They go on to apply this approach to the more tractable model eukaryote Saccharomyces cerevisiae and to test specific predictions experimentally. Mostafavi and colleagues (led by Quaid Morris) apply a ridge regression approach to predict mouse and yeast gene function [14].
The approach is quite fast, permitting their 'GeneMania' software to perform predictions 'on the fly' with a training set provided by the user. Kim and coworkers (led by Edward Marcotte) infer mouse gene function both directly and via a functional linkage network [15]. Functional linkage graphs contain connections between genes, weighted by the confidence that the connected genes are functionally related [16]. Obozinski and colleagues (led by William Noble) investigate the possibility of inconsistency between predictions of different functions [17]. For example, it is possible for some approaches to assign a higher prediction score to 'DNA helicase activity' than to its logical parent term 'helicase activity'. They show that 'reconciliation' methods that enforce consistency between different GO term predictions can improve performance. Tian and coworkers [18] and Tasan and colleagues [19], both teams led by one of us (FPR), each combine guilt-by-profiling and guilt-by-association approaches to make predictions. Tian and coworkers describe the methodology and apply it to predict S. cerevisiae gene functions, while Tasan and colleagues apply the methodology to predict both functions and phenotypes for mouse genes.

Many other quantitative fields have benefited from standardization of training and test sets. For example, the Critical Assessment of Techniques for Protein Structure Prediction (CASP) challenge [20] has enabled rigorous comparisons among protein structure predictions. This special issue suggests the value of similar standardization in the arena of function prediction.

Importantly, inferences about function and phenotype made in this issue are not black or white, but rather are expressed in shades of gray. Biology will long remain in the 'working model' phase, in which each statement about a gene's role must be accompanied by some uncertainty.
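The consistency requirement described above, in which a parent term such as 'helicase activity' should score at least as high as its child 'DNA helicase activity', can be illustrated with a minimal Python sketch. This is one simple heuristic (raising each ancestor's score to the highest score among its descendants) and is not claimed to be any of the reconciliation methods evaluated in [17]. The function name `reconcile_scores`, the simplified three-term hierarchy, and all score values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def reconcile_scores(
    scores: Dict[str, float],
    parents: Dict[str, List[str]],
) -> Dict[str, float]:
    """Return scores in which no term outscores any of its ancestors."""
    reconciled = dict(scores)
    changed = True
    # Repeat until stable so that raised scores propagate up every level.
    while changed:
        changed = False
        for term, term_parents in parents.items():
            child_score = reconciled.get(term, 0.0)
            for parent in term_parents:
                if reconciled.get(parent, 0.0) < child_score:
                    reconciled[parent] = child_score
                    changed = True
    return reconciled


# Toy hierarchy and scores (hypothetical values, not real predictions).
toy_parents = {
    "DNA helicase activity": ["helicase activity"],
    "helicase activity": ["catalytic activity"],
    "catalytic activity": [],
}
toy_scores = {
    "DNA helicase activity": 0.7,
    "helicase activity": 0.4,
    "catalytic activity": 0.9,
}

reconciled = reconcile_scores(toy_scores, toy_parents)
for term, score in reconciled.items():
    print(f"{term}: {score:.2f}")
```

In this toy case the score for 'helicase activity' is raised from 0.4 to 0.7 to match its child, while 'catalytic activity' is left unchanged because it already scores higher. Other reconciliation choices are possible, for example lowering child scores to match their parents.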
An honest assessment of our uncertainties could allow us to direct resources efficiently to those experiments most likely to resolve these uncertainties. Quantitative predictions allow users who require highly stringent predictions to impose a high prediction score threshold, while users who wish to cast a wide net and catch more true positives may lower their threshold and accept additional false positives.

The approaches taken in this issue have common limitations. To reduce the scope of the computational problem and eliminate the potential for inflated performance estimates due to circular reasoning, participants did not have access to GO annotations from other species. Although the training data did incorporate many previous transfers of annotation from other species by orthology, such transfer methods could also benefit from a similar standardization and benchmarking strategy. We also note that identifying the best strategies does not always help us to understand why the best strategies worked well. Because of the computationally intensive nature of function prediction, only a limited number of variant approaches were evaluated. A full factorial analysis of variations on the most successful strategies will help provide this understanding and allow future optimization.

The high precision of top predictions for many GO terms illustrates the richness and value of data sources that have become available for mammals over recent years. However, one lesson learned is that it is difficult to achieve both high precision and high recall. Currently, no algorithms achieve both for most functional categories. Improvements in the inference methods, the problem setup, or the information content of the data sets themselves will be needed in order to make a major dent in the more than 10,000 currently uncharacterized mouse and human genes.
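The tradeoff between precision and recall that comes with choosing a prediction score threshold can be made concrete with a small Python sketch. The function name `precision_recall`, the scores, and the true/false labels below are hypothetical toy values, not results from any of the studies in this issue. Returning a precision of 1.0 when nothing is predicted is a convention assumed here for simplicity.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float],
    labels: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall when genes scoring >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Assumed convention: no predictions at all counts as precision 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy prediction scores for six genes and whether each truly has the GO term.
toy_scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
toy_labels = [True, True, False, True, False, True]

for threshold in (0.9, 0.3):
    p, r = precision_recall(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With these toy values, the stringent threshold of 0.9 gives precision 1.00 and recall 0.50, whereas the permissive threshold of 0.3 gives precision 0.60 and recall 0.75: casting a wider net recovers more true positives at the cost of additional false positives.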

Abbreviations

GO, Gene Ontology.

Competing interests

The authors declare that they have no competing interests.

References

1. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

2. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

3. The sequence of the human genome.

Authors: J C Venter; M D Adams; E W Myers; P W Li; R J Mural; G G Sutton; H O Smith; M Yandell; C A Evans; R A Holt; J D Gocayne; P Amanatides; R M Ballew; D H Huson; J R Wortman; Q Zhang; C D Kodira; X H Zheng; L Chen; M Skupski; G Subramanian; P D Thomas; J Zhang; G L Gabor Miklos; C Nelson; S Broder; A G Clark; J Nadeau; V A McKusick; N Zinder; A J Levine; R J Roberts; M Simon; C Slayman; M Hunkapiller; R Bolanos; A Delcher; I Dew; D Fasulo; M Flanigan; L Florea; A Halpern; S Hannenhalli; S Kravitz; S Levy; C Mobarry; K Reinert; K Remington; J Abu-Threideh; E Beasley; K Biddick; V Bonazzi; R Brandon; M Cargill; I Chandramouliswaran; R Charlab; K Chaturvedi; Z Deng; V Di Francesco; P Dunn; K Eilbeck; C Evangelista; A E Gabrielian; W Gan; W Ge; F Gong; Z Gu; P Guan; T J Heiman; M E Higgins; R R Ji; Z Ke; K A Ketchum; Z Lai; Y Lei; Z Li; J Li; Y Liang; X Lin; F Lu; G V Merkulov; N Milshina; H M Moore; A K Naik; V A Narayan; B Neelam; D Nusskern; D B Rusch; S Salzberg; W Shao; B Shue; J Sun; Z Wang; A Wang; X Wang; J Wang; M Wei; R Wides; C Xiao; C Yan; A Yao; J Ye; M Zhan; W Zhang; H Zhang; Q Zhao; L Zheng; F Zhong; W Zhong; S Zhu; S Zhao; D Gilbert; S Baumhueter; G Spier; C Carter; A Cravchik; T Woodage; F Ali; H An; A Awe; D Baldwin; H Baden; M Barnstead; I Barrow; K Beeson; D Busam; A Carver; A Center; M L Cheng; L Curry; S Danaher; L Davenport; R Desilets; S Dietz; K Dodson; L Doup; S Ferriera; N Garg; A Gluecksmann; B Hart; J Haynes; C Haynes; C Heiner; S Hladun; D Hostin; J Houck; T Howland; C Ibegwam; J Johnson; F Kalush; L Kline; S Koduru; A Love; F Mann; D May; S McCawley; T McIntosh; I McMullen; M Moy; L Moy; B Murphy; K Nelson; C Pfannkoch; E Pratts; V Puri; H Qureshi; M Reardon; R Rodriguez; Y H Rogers; D Romblad; B Ruhfel; R Scott; C Sitter; M Smallwood; E Stewart; R Strong; E Suh; R Thomas; N N Tint; S Tse; C Vech; G Wang; J Wetter; S Williams; M Williams; S Windsor; E Winn-Deen; K Wolfe; J Zaveri; K Zaveri; J F Abril; R Guigó; M J Campbell; K V Sjolander; B Karlak; A Kejariwal; H Mi; B Lazareva; T Hatton; A Narechania; K Diemer; A Muruganujan; N Guo; S Sato; V Bafna; S Istrail; R Lippert; R Schwartz; B Walenz; S Yooseph; D Allen; A Basu; J Baxendale; L Blick; M Caminha; J Carnes-Stine; P Caulk; Y H Chiang; M Coyne; C Dahlke; A Deslattes Mays; M Dombroski; M Donnelly; D Ely; S Esparham; C Fosler; H Gire; S Glanowski; K Glasser; A Glodek; M Gorokhov; K Graham; B Gropman; M Harris; J Heil; S Henderson; J Hoover; D Jennings; C Jordan; J Jordan; J Kasha; L Kagan; C Kraft; A Levitsky; M Lewis; X Liu; J Lopez; D Ma; W Majoros; J McDaniel; S Murphy; M Newman; T Nguyen; N Nguyen; M Nodell; S Pan; J Peck; M Peterson; W Rowe; R Sanders; J Scott; M Simpson; T Smith; A Sprague; T Stockwell; R Turner; E Venter; M Wang; M Wen; D Wu; M Wu; A Xia; A Zandieh; X Zhu
Journal: Science Date: 2001-02-16 Impact factor: 47.728

4. Exploration of human ORFeome: high-throughput preparation of ORF clones and efficient characterization of their protein products.

Authors: Takahiro Nagase; Hisashi Yamakawa; Shinichi Tadokoro; Daisuke Nakajima; Shinichi Inoue; Kei Yamaguchi; Yasuhide Itokawa; Reiko F Kikuno; Hisashi Koga; Osamu Ohara
Journal: DNA Res Date: 2008-03-02 Impact factor: 4.458

5. Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function.

Authors: Weidong Tian; Lan V Zhang; Murat Taşan; Francis D Gibbons; Oliver D King; Julie Park; Zeba Wunderlich; J Michael Cherry; Frederick P Roth
Journal: Genome Biol Date: 2008-06-27 Impact factor: 13.583

6. Consistent probabilistic outputs for protein function prediction.

Authors: Guillaume Obozinski; Gert Lanckriet; Charles Grant; Michael I Jordan; William Stafford Noble
Journal: Genome Biol Date: 2008-06-27 Impact factor: 13.583

7. Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy.

Authors: Wan Kyu Kim; Chase Krumpelman; Edward M Marcotte
Journal: Genome Biol Date: 2008-06-27 Impact factor: 13.583

8. An en masse phenotype and function prediction system for Mus musculus.

Authors: Murat Taşan; Weidong Tian; David P Hill; Francis D Gibbons; Judith A Blake; Frederick P Roth
Journal: Genome Biol Date: 2008-06-27 Impact factor: 13.583

9. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence.

Authors: Lourdes Peña-Castillo; Murat Tasan; Chad L Myers; Hyunju Lee; Trupti Joshi; Chao Zhang; Yuanfang Guan; Michele Leone; Andrea Pagnani; Wan Kyu Kim; Chase Krumpelman; Weidong Tian; Guillaume Obozinski; Yanjun Qi; Sara Mostafavi; Guan Ning Lin; Gabriel F Berriz; Francis D Gibbons; Gert Lanckriet; Jian Qiu; Charles Grant; Zafer Barutcuoglu; David P Hill; David Warde-Farley; Chris Grouios; Debajyoti Ray; Judith A Blake; Minghua Deng; Michael I Jordan; William S Noble; Quaid Morris; Judith Klein-Seetharaman; Ziv Bar-Joseph; Ting Chen; Fengzhu Sun; Olga G Troyanskaya; Edward M Marcotte; Dong Xu; Timothy R Hughes; Frederick P Roth
Journal: Genome Biol Date: 2008-06-27 Impact factor: 13.583

10. Predicting gene function in a hierarchical context with an ensemble of classifiers.

Authors: Yuanfang Guan; Chad L Myers; David C Hess; Zafer Barutcuoglu; Amy A Caudy; Olga G Troyanskaya
Journal: Genome Biol Date: 2008-06-27 Impact factor: 13.583
