David R Nelson. The University of Tennessee Health Sciences Center, Memphis, TN 38163, USA. dnelson@utmem.edu
Abstract
The current proliferation of mammalian genomes is creating a nomenclature issue caused by naming genes based on their best BLAST hit to a gene in another annotated genome. The rat genome is relying heavily on the mouse genome for nomenclature, but not all rat genes have direct orthologues in the mouse; often, there are paralogous groups of genes--due to expansions of that gene subfamily in one or the other genome. Many of these genes have already been assigned names in the rat, so that renaming them based on BLAST scores leads to duplicate sets of names. The supposed orthology created by name sharing across genomes is not always found. These inaccurate names are appearing in frequently used sites, such as the University of California Santa Cruz Genome Browser. The example of rat cytochrome P450 (Cyp) genes is presented here, but other gene families are also likely to be affected.
The rat genome has been sequenced and assembled [1], creating a need for rat gene nomenclature. The obvious source of gene nomenclature for the rat would seem to be the mouse genome. Ideally, orthologues should have the same name. This logic has led to an automated naming of rat genes, which causes problems of two kinds. First, the rat has long been an experimental animal. Genes from both rat and mouse were sequenced and named for nearly 20 years before the genomes were sequenced. In the example of cytochromes P450, the first mammalian sequences, Cyp2b1 and Cyp2b2, were determined in the rat [2]. The mouse sequences began to appear two years later with Cyp1a1 and Cyp1a2 [3,4]. The established nomenclature for CYP genes has been in place since 1987 [5-7], and these names have been used in publications for several years. Because the names were assigned independently, mostly in chronological order, orthologues do not always carry the same name.
The second nomenclature problem has to do with divergence over time between species' genomes. Here, mouse and rat will be discussed, but the same applies to other species such as human and rhesus monkey. When similar genes appear in gene clusters, the one-to-one relationship of the genes between mouse and rat is often broken, meaning that the orthology is broken. Compared with the 57 CYP genes of the human, the mouse has greatly expanded its set of Cyp genes to 102 full-length genes [8]; the rat has been a little more conservative, with 87 Cyp genes [9]. The solo genes in a mammalian CYP subfamily -- those that occur without related neighbours -- are strict orthologues, and so nomenclature by best reciprocal BLAST hit between mouse and rat is a viable strategy for them. This works for 31 mouse-rat gene pairs and one pseudogene. Eighty-seven rat genes cannot be matched up to 102 mouse genes as orthologue pairs, however, and this nomenclature method can be seen to fail in the gene clusters.
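The difference between solo genes and clustered genes can be illustrated with a minimal Python sketch. It is not the pipeline used by any genome browser. The function name and the best-hit tables are hypothetical, and the hits are illustrative values rather than real BLAST output. The only details taken from the text are that Cyp1a1 pairs cleanly and that mouse Cyp2d13 and Cyp2d40 both resemble rat Cyp2d3.

```python
from collections import defaultdict


def classify_best_hits(mouse_to_rat, rat_to_mouse):
    """Split rat genes into clean 1:1 pairs and cases needing manual curation.

    mouse_to_rat / rat_to_mouse map each gene to its single best hit in the
    other genome (hypothetical inputs, e.g. parsed from BLAST reports).
    """
    # Collect every mouse gene whose best hit is a given rat gene.
    rat_to_candidates = defaultdict(list)
    for mouse_gene, rat_gene in mouse_to_rat.items():
        rat_to_candidates[rat_gene].append(mouse_gene)

    one_to_one, needs_curation = {}, {}
    for rat_gene, mouse_genes in rat_to_candidates.items():
        reciprocal = rat_to_mouse.get(rat_gene) in mouse_genes
        if len(mouse_genes) == 1 and reciprocal:
            # Solo gene: reciprocal best hit with no competing paralogue.
            one_to_one[rat_gene] = mouse_genes[0]
        else:
            # Several mouse genes claim the same rat gene, or the hit is
            # not reciprocal: do not name automatically.
            needs_curation[rat_gene] = sorted(mouse_genes)
    return one_to_one, needs_curation


# Illustrative best-hit tables (assumed values, not real BLAST output).
mouse_to_rat = {"Cyp1a1": "Cyp1a1", "Cyp2d13": "Cyp2d3", "Cyp2d40": "Cyp2d3"}
rat_to_mouse = {"Cyp1a1": "Cyp1a1", "Cyp2d3": "Cyp2d13"}  # tie broken arbitrarily

one_to_one, needs_curation = classify_best_hits(mouse_to_rat, rat_to_mouse)
print("1:1 pairs:", one_to_one)
print("needs curation:", needs_curation)
```

A plain reciprocal-best-hit rule would pair rat Cyp2d3 with whichever mouse gene happened to win the tie. Flagging the many-to-one case instead leaves the decision to a curator.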
Results and Discussion
Not all Cyp gene clusters are disordered between mouse and rat. For example, the Cyp4f gene cluster has nine genes in both species and there is a clean 1:1 mapping between orthologous pairs (Table 1). In fact, there are 33 such pairs in the Cyp gene clusters (Table 1); two of these pairs involve matches to pseudogenes in the other species. After these 64 pairs are subtracted and a correction is made in the count for pseudogenes, there are still 40 mouse genes remaining to pair with 24 rat genes. These genes either have no orthologues (paired with an 'x' in Table 1) or they are in paralogous gene sets (shaded and boldened in Table 1).
Table 1
Orthology between mouse and rat Cyp genes
Mouse
Rat
Mouse
Rat
Mouse
Rat
1a1
1a1
2e1
2e1
4f13
4f6
1a2
1a2
2f2
2f4
4f14
4f1
1b1
1b1
2g1
2g1
4f15
4f4
4f16
4f5
2a4
x
2j5
2j5-ps
4f17
4f17
2a5
2a3
2j6
2j4
4f18
4f18
2a12
2a2
2j7
4f37
4f37
2a22
2a1
2j8
2j16
4f39
4f39
2j11
4f40
4f40
2b9
2b3
2j9
2j3
2b10
2b1
2j12
2j10
4v3
4v?
2b13
2b2
2j13
2j13
4x1
4x1
2b12
2b19
2b15
2r1
2r1
5a1
5a1
2b31
2s1
2s1
7a1
7a1
2b23
2b21
2t4
2t1
7b1
7b1
2u1
2u1
8a1
8a1
2c37
x
2w1
2w1
8b1
8b1
2c29
2ab1
2ab1
11a1
11a1
2c38
2c6
2ac1-ps
2ac1
11b1
11b1
2c39
2c7
11b2
11b2
2c44
2c23
3a11
x
11b3
2c50
x
3a16
3a1/3a23
17a1
17a1
2c52-ps
2c11
3a41
3a2
19a1
19a1
2c54
x
3a44
3a73
20a1
20a1
2c55
2c24
3a13
3a9
21a1
21a1
2c80
3a25
24a1
24a1
3a57
3a18
26a1
26a1
2c65
3a59
26b1
26b1
2c66
2c78
x
3a62
26c1
26c1
2c70
2c22
27a1
27a1
2c40
4a12a
27b1
27b1
2c67
2c12
4a12b
4a8
39a1
39a1
2c68
2c13
46a1
46a1
2c69
4a29
x
51a1
51a1
4a30b
x
2d9
2d10
4a14
4a2
2d11
2d1
4a3
2d12
2d5
2d34
4a10
2d22
2d4
4a31
4a1
2d26
2d2
4a32
2d13
2d40
2d3
4b1
4b1
Paralogues are boldened
The discrepancy is explained by extra duplications, mostly in the mouse but, in at least some cases, there is duplication in the rat that is not seen in the mouse. Figures 1 and 2 illustrate a comparison of two such gene clusters between rat and mouse in detail. The rat Cyp2d cluster has five genes and two pseudogenes. The mouse has nine genes and 17 pseudogenes. Cyp2d5 and Cyp2d1 in the rat are most similar by BLAST searches to the five mouse genes -- Cyp2d11, Cyp2d10, Cyp2d9, Cyp2d12 and Cyp2d34 -- that are boxed in the mouse cluster; these represent paralogous sets of genes. Mouse Cyp2d13 and Cyp2d40 are almost equally similar to Cyp2d3 in the rat. In between these genes there are six pseudogenes. This whole cluster of genes and pseudogenes may have been derived from a Cyp2d3-like ancestor that expanded in the mouse. Of course, more complicated scenarios are also possible.
Figure 1
The rat and mouse Cyp2d clusters. Expansion in the mouse leads to non-orthologous relationships between these genes. These rat genes have had official CYP names for more than 15 years; in fact, they were the first five genes in the Cyp2d subfamily to be identified. For more details on rat P450 nomenclature, see http://drnelson.utmem.edu/cytochromeP450.html. Abbreviation: bp, base pairs.
Figure 2
The Cyp4abx clusters. There has been differential expansion in both species. Strict orthology does not exist, except on the outer edges of the cluster. This is often a feature of genes in gene clusters: the edges of the cluster are more likely to be conserved.
Figure 2 shows the Cyp4abx clusters. Notice how the rat Cyp4a1 gene has given rise to three Cyp4a genes in the mouse. By contrast, mouse Cyp4a14 has duplicated, making Cyp4a2 and Cyp4a3 in the rat, based on BLAST similarities. The mouse cluster is further complicated by an approximately 100 kilobase duplication involving the Cyp4a12 and Cyp4a30 genes. This did not happen in the rat and there does not seem to be a Cyp4a30 equivalent in that animal -- unless it might be the rat Cyp4a33-ps pseudogene. There are seven Cyp gene clusters in the rat, some even more complex than the Cyp2d and Cyp4abx clusters described here.
The mouse versus rat Cyp genes chosen as the example in this paper are by no means the only gene sets that will have this problem. In the 5th December, 2002 issue of Nature, in which the mouse genome was reported [10], Table 11 (p. 542) shows the top 50 InterPro domain families in mouse compared with those in human, fish, worm and fly. Cytochrome P450 is ranked 46th in the mouse and 52nd in human. The 45 other families that are more abundant than Cyp in the mouse will potentially have similar nomenclature issues. Fortunately, some of these groups (eg the homeobox genes) have a firmly established nomenclature and will not be renamed. It is not so clear what confusion will descend on the ATPase, kinase, zinc finger protein and the many other gene families.
The point made by these figures and tables is that naming genes cannot be an automatic process, unless one wishes to create confusion. Best reciprocal BLAST hits can be used in assigning names, but they should not be used indiscriminately -- if they are, the result presented in Figure 3 might occur. In fact, it has occurred. Figure 3 is a screenshot of the University of California Santa Cruz (UCSC) browser showing the rat Cyp2d cluster, with its five genes. Note that these genes are named Cyp2d22, Cyp2d10, Cyp2d9, Cyp2d13 and Cyp2d26. From Figure 1, it can be seen that these rat genes had already been named Cyp2d4, Cyp2d5, Cyp2d1, Cyp2d3 and Cyp2d2. These names were assigned between 1987 and 1989 by the Committee on Standardized Cytochrome P450 Nomenclature and are official names used in dozens or hundreds of publications. The two outer genes, labelled 'rat Cyp2d22' and 'rat Cyp2d26' in Figure 3, carry the names of the mouse orthologues of rat Cyp2d4 and Cyp2d2 (Table 1), but the three genes between them -- labelled Cyp2d10, Cyp2d9 and Cyp2d13 -- do not form orthologous pairs with the mouse genes of those names. Thus, rat Cyp genes that already have official names have been renamed to match seemingly orthologous mouse Cyp genes. On other views in the UCSC browser, rat Cyp genes that already have official names have been renamed for human CYP genes that are not their orthologues. These names are wrong, yet because they appear in the GenBank database they will probably be used by companies making microarrays and by genome browsers like UCSC and ENSEMBL. This is a very unfortunate practice that may require considerable effort to correct.
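The kind of check that would have caught the Figure 3 problem can be sketched in a few lines of Python. This is an illustration only, not a description of how UCSC or GenBank assign names. The function name is hypothetical, and pairing the two name lists by position is an assumption based on the order in which the text lists them. The two confirmed orthologue pairs come from Table 1.

```python
# Official rat names and the labels shown in the browser, in the order
# given in the text (positional pairing is an assumption of this sketch).
official_rat = ["Cyp2d4", "Cyp2d5", "Cyp2d1", "Cyp2d3", "Cyp2d2"]
browser_labels = ["Cyp2d22", "Cyp2d10", "Cyp2d9", "Cyp2d13", "Cyp2d26"]

# Rat gene -> mouse orthologue, as stated for the two outer genes.
confirmed_orthologues = {"Cyp2d4": "Cyp2d22", "Cyp2d2": "Cyp2d26"}


def audit_labels(official, labels, orthologues):
    """Report, for each gene, whether an automated label replaced an
    official name and whether the label is at least a true orthologue."""
    report = []
    for official_name, label in zip(official, labels):
        if label == official_name:
            status = "ok"
        elif orthologues.get(official_name) == label:
            status = "official name replaced by mouse orthologue name"
        else:
            status = "official name replaced by non-orthologous name"
        report.append((official_name, label, status))
    return report


report = audit_labels(official_rat, browser_labels, confirmed_orthologues)
for official_name, label, status in report:
    print(f"{official_name:8s} shown as {label:8s} -> {status}")
```

A naming pipeline that consulted a registry of official names before accepting a best-hit label would leave all five rat names unchanged.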
Figure 3
The rat Cyp2d cluster. Note that the gene nomenclature being used follows the existing mouse gene nomenclature, which is incorrect. Mouse Cyp2d22 is actually the orthologue of rat Cyp2d4. Mouse Cyp2d26 is actually the orthologue of rat Cyp2d2.
Conclusions
Gene nomenclature committees have been established to impose order on gene families and on whole genomes, to prevent duplication of names and multiple uses of the same root symbol, and to provide an authority that can be trusted. Ignoring the existence of naming systems in order to assign hundreds, or thousands, of names quickly to rat genes to match genes in other genomes will come with a price, and the price will be failed communication and widespread confusion. These problems are not so different from those that must occur when a carefully constructed language is corrupted.
Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander Journal: Nature Date: 2002-12-05 Impact factor: 49.962
Authors: D R Nelson; T Kamataki; D J Waxman; F P Guengerich; R W Estabrook; R Feyereisen; F J Gonzalez; M J Coon; I C Gunsalus; O Gotoh Journal: DNA Cell Biol Date: 1993 Jan-Feb Impact factor: 3.311
Authors: D W Nebert; M Adesnik; M J Coon; R W Estabrook; F J Gonzalez; F P Guengerich; I C Gunsalus; E F Johnson; B Kemper; W Levin Journal: DNA Date: 1987-02
Authors: D R Nelson; L Koymans; T Kamataki; J J Stegeman; R Feyereisen; D J Waxman; M R Waterman; O Gotoh; M J Coon; R W Estabrook; I C Gunsalus; D W Nebert Journal: Pharmacogenetics Date: 1996-02
Authors: Richard A Gibbs; George M Weinstock; Michael L Metzker; Donna M Muzny; Erica J Sodergren; Steven Scherer; Graham Scott; David Steffen; Kim C Worley; Paula E Burch; Geoffrey Okwuonu; Sandra Hines; Lora Lewis; Christine DeRamo; Oliver Delgado; Shannon Dugan-Rocha; George Miner; Margaret Morgan; Alicia Hawes; Rachel Gill; Robert A Holt; Mark D Adams; Peter G Amanatides; Holly Baden-Tillson; Mary Barnstead; Soo Chin; Cheryl A Evans; Steve Ferriera; Carl Fosler; Anna Glodek; Zhiping Gu; Don Jennings; Cheryl L Kraft; Trixie Nguyen; Cynthia M Pfannkoch; Cynthia Sitter; Granger G Sutton; J Craig Venter; Trevor Woodage; Douglas Smith; Hong-Mei Lee; Erik Gustafson; Patrick Cahill; Arnold Kana; Lynn Doucette-Stamm; Keith Weinstock; Kim Fechtel; Robert B Weiss; Diane M Dunn; Eric D Green; Robert W Blakesley; Gerard G Bouffard; Pieter J De Jong; Kazutoyo Osoegawa; Baoli Zhu; Marco Marra; Jacqueline Schein; Ian Bosdet; Chris Fjell; Steven Jones; Martin Krzywinski; Carrie Mathewson; Asim Siddiqui; Natasja Wye; John McPherson; Shaying Zhao; Claire M Fraser; Jyoti Shetty; Sofiya Shatsman; Keita Geer; Yixin Chen; Sofyia Abramzon; William C Nierman; Paul H Havlak; Rui Chen; K James Durbin; Amy Egan; Yanru Ren; Xing-Zhi Song; Bingshan Li; Yue Liu; Xiang Qin; Simon Cawley; Kim C Worley; A J Cooney; Lisa M D'Souza; Kirt Martin; Jia Qian Wu; Manuel L Gonzalez-Garay; Andrew R Jackson; Kenneth J Kalafus; Michael P McLeod; Aleksandar Milosavljevic; Davinder Virk; Andrei Volkov; David A Wheeler; Zhengdong Zhang; Jeffrey A Bailey; Evan E Eichler; Eray Tuzun; Ewan Birney; Emmanuel Mongin; Abel Ureta-Vidal; Cara Woodwark; Evgeny Zdobnov; Peer Bork; Mikita Suyama; David Torrents; Marina Alexandersson; Barbara J Trask; Janet M Young; Hui Huang; Huajun Wang; Heming Xing; Sue Daniels; Darryl Gietzen; Jeanette Schmidt; Kristian Stevens; Ursula Vitt; Jim Wingrove; Francisco Camara; M Mar Albà; Josep F Abril; Roderic Guigo; Arian Smit; Inna Dubchak; Edward M Rubin; Olivier Couronne; Alexander 
Poliakov; Norbert Hübner; Detlev Ganten; Claudia Goesele; Oliver Hummel; Thomas Kreitler; Young-Ae Lee; Jan Monti; Herbert Schulz; Heike Zimdahl; Heinz Himmelbauer; Hans Lehrach; Howard J Jacob; Susan Bromberg; Jo Gullings-Handley; Michael I Jensen-Seaman; Anne E Kwitek; Jozef Lazar; Dean Pasko; Peter J Tonellato; Simon Twigger; Chris P Ponting; Jose M Duarte; Stephen Rice; Leo Goodstadt; Scott A Beatson; Richard D Emes; Eitan E Winter; Caleb Webber; Petra Brandt; Gerald Nyakatura; Margaret Adetobi; Francesca Chiaromonte; Laura Elnitski; Pallavi Eswara; Ross C Hardison; Minmei Hou; Diana Kolbe; Kateryna Makova; Webb Miller; Anton Nekrutenko; Cathy Riemer; Scott Schwartz; James Taylor; Shan Yang; Yi Zhang; Klaus Lindpaintner; T Dan Andrews; Mario Caccamo; Michele Clamp; Laura Clarke; Valerie Curwen; Richard Durbin; Eduardo Eyras; Stephen M Searle; Gregory M Cooper; Serafim Batzoglou; Michael Brudno; Arend Sidow; Eric A Stone; J Craig Venter; Bret A Payseur; Guillaume Bourque; Carlos López-Otín; Xose S Puente; Kushal Chakrabarti; Sourav Chatterji; Colin Dewey; Lior Pachter; Nicolas Bray; Von Bing Yap; Anat Caspi; Glenn Tesler; Pavel A Pevzner; David Haussler; Krishna M Roskin; Robert Baertsch; Hiram Clawson; Terrence S Furey; Angie S Hinrichs; Donna Karolchik; William J Kent; Kate R Rosenbloom; Heather Trumbower; Matt Weirauch; David N Cooper; Peter D Stenson; Bin Ma; Michael Brent; Manimozhiyan Arumugam; David Shteynberg; Richard R Copley; Martin S Taylor; Harold Riethman; Uma Mudunuri; Jane Peterson; Mark Guyer; Adam Felsenfeld; Susan Old; Stephen Mockrin; Francis Collins Journal: Nature Date: 2004-04-01 Impact factor: 49.962
Authors: Michael E Cusick; Haiyuan Yu; Alex Smolyar; Kavitha Venkatesan; Anne-Ruxandra Carvunis; Nicolas Simonis; Jean-François Rual; Heather Borick; Pascal Braun; Matija Dreze; Jean Vandenhaute; Mary Galli; Junshi Yazaki; David E Hill; Joseph R Ecker; Frederick P Roth; Marc Vidal Journal: Nat Methods Date: 2009-01 Impact factor: 28.547