Joseph Cheung, Michael D Wilson, Junjun Zhang, Razi Khaja, Jeffrey R MacDonald, Henry H Q Heng, Ben F Koop, Stephen W Scherer.
Abstract
BACKGROUND: The high-quality mouse genome draft sequence and its associated annotations are an invaluable biological resource. Identifying recent duplications in the mouse genome, especially in regions containing genes, may highlight important events in recent murine evolution. In addition, detecting recent sequence duplications can reveal potentially problematic regions of the genome assembly. We use BLAST-based computational heuristics to identify large (≥ 5 kb) and recent (≥ 90% sequence identity) segmental duplications in the mouse genome sequence. Here we present a database of recently duplicated regions of the mouse genome found in the mouse genome sequencing consortium (MGSC) February 2002 and February 2003 assemblies.
Year: 2003 PMID: 12914656 PMCID: PMC193640 DOI: 10.1186/gb-2003-4-8-r47
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Recent segmental duplication in the mouse genome
| Chromosome | Chromosome length (bp) | Intrachromosomal duplication (bp) | % of chromosome | Interchromosomal duplication (bp) | % of chromosome | Total duplication (bp) | % of chromosome |
| 1 | 195,869,683 | 1,392,568 | 0.7 | 238,739 | 0.1 | 1,552,908 | 0.8 |
| 2 | 181,423,755 | 1,106,879 | 0.6 | 173,602 | 0.1 | 1,184,304 | 0.7 |
| 3 | 160,674,399 | 790,500 | 0.5 | 158,011 | 0.1 | 948,511 | 0.6 |
| 4 | 152,921,959 | 1,743,027 | 1.1 | 647,795 | 0.4 | 1,921,970 | 1.3 |
| 5 | 149,719,773 | 1,102,772 | 0.7 | 761,950 | 0.5 | 1,560,683 | 1.0 |
| 6 | 149,950,539 | 2,042,585 | 1.4 | 562,415 | 0.4 | 2,339,839 | 1.6 |
| 7 | 134,401,573 | 1,655,438 | 1.2 | 713,287 | 0.5 | 2,038,845 | 1.5 |
| 8 | 128,923,138 | 738,203 | 0.6 | 331,970 | 0.3 | 1,005,575 | 0.8 |
| 9 | 124,467,299 | 437,352 | 0.4 | 188,427 | 0.2 | 623,089 | 0.5 |
| 10 | 130,738,012 | 345,768 | 0.3 | 258,429 | 0.2 | 604,197 | 0.5 |
| 11 | 122,862,689 | 900,355 | 0.7 | 127,774 | 0.1 | 1,012,479 | 0.8 |
| 12 | 114,462,600 | 1,139,786 | 1.0 | 374,365 | 0.3 | 1,404,279 | 1.2 |
| 13 | 116,242,670 | 855,835 | 0.7 | 547,462 | 0.5 | 1,349,974 | 1.2 |
| 14 | 115,844,145 | 450,161 | 0.4 | 451,782 | 0.4 | 748,465 | 0.6 |
| 15 | 104,111,694 | 443,805 | 0.4 | 43,937 | 0.0 | 487,742 | 0.5 |
| 16 | 98,986,639 | 389,255 | 0.4 | 67,290 | 0.1 | 456,545 | 0.5 |
| 17 | 93,529,596 | 1,329,664 | 1.4 | 660,440 | 0.7 | 1,760,982 | 1.9 |
| 18 | 91,041,441 | 162,916 | 0.2 | 58,996 | 0.1 | 210,422 | 0.2 |
| 19 | 61,093,376 | 328,909 | 0.5 | 193,387 | 0.3 | 479,687 | 0.8 |
| X | 149,996,094 | 2,592,361 | 1.7 | 574,950 | 0.4 | 3,018,682 | 2.0 |
| chrUn* | 117,911,829 | 6,049,538 | 5.1 | 5,710,057 | 4.8 | 8,885,604 | 7.5 |
| Total | 2,695,172,903 | 25,997,677 | 1.0 | 12,845,065 | 0.5 | 33,594,782 | 1.2 |
The analysis is based on the February 2003 mouse genome assembly. *chrUn, unmapped chromosome sequence.
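The percentage columns appear to be each duplication count divided by the chromosome length. The following is a minimal sketch of that arithmetic for three rows copied from the table. The function names are illustrative, and the overlap calculation assumes that the Total column counts each duplicated base once (the union of the two categories), which the table suggests but the source does not state.

```python
# Values copied from the table above (February 2003 assembly).
# Columns: chromosome length, intrachromosomal bp, interchromosomal bp, total bp.
ROWS = {
    "1": (195_869_683, 1_392_568, 238_739, 1_552_908),
    "3": (160_674_399, 790_500, 158_011, 948_511),
    "X": (149_996_094, 2_592_361, 574_950, 3_018_682),
}


def percent(duplicated_bp: int, length_bp: int) -> float:
    """Duplicated bases as a percentage of chromosome length, to one decimal."""
    return round(100 * duplicated_bp / length_bp, 1)


def shared_bp(intra_bp: int, inter_bp: int, total_bp: int) -> int:
    """Bases counted in both categories, ASSUMING total is their union."""
    return intra_bp + inter_bp - total_bp


for name, (length, intra, inter, total) in ROWS.items():
    # Print chromosome, total % duplicated, and bases shared by both categories.
    print(name, percent(total, length), shared_bp(intra, inter, total))
```

Under that assumption, chromosome 3 has no bases that are both intra- and interchromosomally duplicated, while chromosomes 1 and X do, which would explain why Total is smaller than the sum of the two columns on most rows.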
Figure 1. Intrachromosomal segmental duplications identified in the mouse genome (chromosomes 1-X; results are based on the February 2003 assembly). Each line represents a duplicated module and connects a paralogous duplicon pair. Red, 99-100% sequence identity; purple, 96-98%; green, 93-95%; and blue, 90-92%. Correspondences to chromosome ideograms (obtained from Ensembl) are only crude. Graphics were produced using GenomePixelizer [34].
Comparison between genome assemblies
| Sequence identity level | February 2002 assembly* | February 2003 assembly |
| Duplication content (bp) | | |
| 90-92% | 4,966,470 | 3,543,429 |
| 92-94% | 15,685,840 | 13,981,642 |
| 94-96% | 17,533,730 | 17,970,287 |
| 96-98% | 11,539,392 | 11,731,958 |
| 98-99.5% | 5,865,024 | 5,487,899 |
| †Potential sequence misassignment error detected (bp) | | |
| 99.5-100% | 4,832,594 | 18,456,096 |
The comparison is of duplication content by sequence identity and potential sequence misassignment errors between the February 2002 (MGSCv3) and February 2003 (a hybrid assembly of MGSCv3 with 705 Mb finished BAC sequence) genome assemblies. *Analysis of the duplication content for the February 2002 assembly can be found at [14]. †Sequences showing extremely high percent identity duplications are likely to be genome assembly artifacts and were not included in the duplication content shown in Table 1.
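The thresholds from the abstract (≥ 5 kb, ≥ 90% identity) and the identity classes in this table can be combined into a simple classifier for a single alignment. This is only an illustrative sketch, not the authors' BLAST-based pipeline: the function and constant names are hypothetical, and treating each class as including its lower bound and excluding its upper bound is an assumption, since the source does not say how boundary values were assigned.

```python
from typing import Optional

MIN_LENGTH_BP = 5_000      # "large" threshold from the abstract
MIN_IDENTITY = 90.0        # "recent" threshold from the abstract
ARTIFACT_IDENTITY = 99.5   # at or above this, flagged as potential misassignment

# Lower edges of the identity classes in the table, highest first.
# ASSUMPTION: each class includes its lower edge and excludes its upper edge.
BINS = [
    (98.0, "98-99.5%"),
    (96.0, "96-98%"),
    (94.0, "94-96%"),
    (92.0, "92-94%"),
    (90.0, "90-92%"),
]


def classify(length_bp: int, identity: float) -> Optional[str]:
    """Return the identity class for one alignment, or None if it is filtered out."""
    if length_bp < MIN_LENGTH_BP or identity < MIN_IDENTITY:
        return None
    if identity >= ARTIFACT_IDENTITY:
        return "potential misassignment"
    for lower, label in BINS:
        if identity >= lower:
            return label
    return None


print(classify(12_000, 97.1))
print(classify(8_000, 99.7))
print(classify(4_000, 97.1))
```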
Examples of recent mouse gene duplications
| Locus1* | Gene | Percent identity† | Annotation | Locus2* | Gene | Percent identity† | Annotation | Duplication % identity‡ |
| 1 F | NM_009888 | 99.6 | Cfh (Complement component factor h) | 1 F | M29010 | 99.0 | Complement factor H-related protein mRNA | 97.1 |
| 3 G1 | NM_009669 | 100 | Amy2 (Amylase 2, pancreatic) | 3 G1 | M11896 | 99.6 | Pancreatic amylase B-1 | 97.6 |
| 5 E2 | NM_053184 | 99.9 | Ugt2a1 (UDP glycosyltransferase 2 A1) | 5 E2 | BF144793 | 99.6 | cDNA clone IMAGE:4021939 | 95.7 |
| 5 E2 | NM_009467 | 100 | Ugt2b5 (UDP-glucuronosyl-transferase 2b5) | 5 E2 | NM_053215 | 100 | RIKEN cDNA 0610033E06 gene | 93.3 |
| 5 E4 | NM_008620 | 99.9 | Mpa2 (macrophage activation 2) | 5 E4 | BC007143 | 99.5 | Similar to macrophage activation 2 | 90.6 |
| 5 G1 | NM_029693 | 100 | RIKEN cDNA 1700123K08 | 7 B2 | NM_027702 | 100 | RIKEN cDNA 4933421I07 gene | 91.1 |
| 6 C1 | NM_053238 | 100 | V1rc8 (Vomeronasal 1 receptor, C8) | 6 C1 | NM_053239 | 99.7 | V1rc9 (Vomeronasal 1 receptor, C9) | 95.1 |
| 6 D1 | NM_011467 | 99.9 | Spr (sepiapterin reductase) | 6 D1 | BE862957 | 99.5 | EST sequence | 95.8 |
| 6 F1 | AI505330 | 100 | Similar to initiation factor eIF-4AI | 6 F1 | AI503670 | 99.8 | Similar to initiation factor eIF-4AI | 98.9 |
| 6 F2 | NM_008646 | 99.9 | Mug2 (Murinoglobulin 2) | 6 F2 | NM_008645 | 99.9 | Mug1 (Murinoglobulin 1) | 94.6 |
| 6 F3 | NM_020257 | 99.8 | Dcl1 (c-type lectin 1) | 6 F3 | NM_027562 | 99.9 | 4632413B12Rik (C-lectin related protein) | 90.8 |
| 6 F3 | NM_008463 | 99.7 | Klra5 (Killer cell lectin-like receptor, A5) | 6 F3 | NM_008464 | 99.5 | Klra6 (Killer cell lectin-like receptor, A6) | 90.0 |
| 6 F3 | NM_010649 | 99.8 | Klra4 (Killer cell lectin-like receptor A4) | 6 F3 | NM_016659 | 99.8 | Klra1 (Killer cell lectin-like receptor A1) | 91.4 |
| 6 F3 | NM_010737 | 99.8 | Klrb1b (Killer cell lectin-like receptor 1b) | 6 F3 | NM_008527 | 99.9 | Klrb1c (Killer cell lectin-like receptor 1c) | 90.8 |
| 7 A2 | NM_011860 | 100 | Mater (Maternal effect gene) | 7 A1 | AK016782 | 100 | Similar to Mater protein | 96.6 |
| 7 B1 | NM_032541 | 100 | Hamp (Hepcidin antimicrobial peptide) | 7 B1 | AK007975 | 99.8 | Prohepcidin homolog | 92.8 |
| 7 B2 | NM_010115 | 99 | Klk13 (Kallikrein 13) | 7 B2 | NM_008454 | 99.9 | Klk16 (Kallikrein 16) | 92.2 |
| 8 D1 | L11333 | 99.9 | Carboxylesterase | 8 D1 | NM_144511 | 100 | Es31 | 95.2 |
| 9 F4 | NM_130864 | 99.6 | Acaa (Acetyl-Coenzyme A acyltransferase) | 9 F4 | BC019882 | 100 | Similar to acetyl-CoA acyltransferase | 96.6 |
| 10 B3 | NM_013532 | 99.9 | Gp49a (Glycoprotein 49A) | 10 B3 | NM_008147 | 100 | Gp49b (glycoprotein 49B) | 96.7 |
| 10 D2 | NM_017372 | 100 | Lyzs (Lysozyme) | 10 D2 | NM_013590 | 99.8 | Lzp-s (P lysozyme structural) | 95.3 |
| 11 A3.2 | NM_172792 | 100 | hypothetical protein 4932414J04 | 17 D | AK03001 | 100 | Tyrosine protein kinase/cysteine-rich region | 94.0 |
| 11 B1.3 | NM_011396 | 99.9 | Slc22a5 (Solute carrier family 22) | 11 B1.3 | NM_019723 | 100 | Slc22a9 (solute carrier family 22) | 91.2 |
| 11 D | NM_021347 | 100 | Gsdm (Gasdermin) | 11 D | NM_029727 | 99.9 | 2200001G21Rik | 94.2 |
| 12 F1 | BC002065 | 99.6 | Serine protease inhibitor 2-1 | 12 F1 | BY761363 | 99.9 | EST sequence | 92.2 |
| 12 F1 | NM_013772 | 100 | Tcl1b3 (T-cell leukemia/lymphoma 1B, 3) | 12 F1 | NM_013776 | 100 | Tcl1b5 (T-cell leukemia/lymphoma 1B, 5) | 95.3 |
| 13 A1 | NM_013778 | 99.5 | Akr1c13 (Aldo-keto reductase 1, C13) | 13 A1 | NM_013777 | 99.5 | Akr1c12 (Aldo-keto reductase 1, C12) | 96.1 |
| 13 A3 | NM_008864 | 99.2 | Csh1 (chorionic somatomammotrophin 1) | 13 A3.3 | AK082929 | 100 | Similar to placental lactogen 1 | 98.8 |
| 13 A4 | NM_011456 | 100 | Spi14 (Serine Protease Inhibitor 14) | 13 A4 | NM_011455 | 100 | Spi13 (serine protease inhibitor 13) | 95.2 |
| 13 D1 | NM_010872 | 100 | Birc1b (Neuronal apoptosis inhibitory 2) | 13 D1 | NM_008670 | 99.9 | Birc1a (Neuronal apoptosis inhibitory 1) | 90.0 |
| 14 C1 | NM_010373 | 99.7 | Gzme (Granzyme E) | 14 C1 | NM_010372 | 99.9 | Gzmd (Granzyme D) | 94.7 |
| 14 C2 | NM_172603 | 100 | 4933417L10Rik | 14 C3 | BE381578 | 100 | EST sequence | 94.0 |
| 15 E2 | NM_007781 | 100 | Csf2rb2 (Colony stimulating factor 2, β-2) | 15 E2 | NM_007780 | 99.8 | Csf2rb1 (Colony stimulating factor 2, β-1) | 95.0 |
| 15 E2 | NM_010005 | 100 | Cyp2d10 (Cytochrome P450, 2d10) | 15 E2 | NM_010006 | 100 | Cyp2d9 (cytochrome P450, 2d9) | 92.1 |
| 16 B1 | NM_023125 | 100 | Kng (Kininogen) | 16 B1 | BI330914 | 99.1 | EST sequence | 90.0 |
| 16 B3 | M92418 | 99.8 | MS2 (Cysteine proteinase inhibitor) | 16 B3 | BB654253 | 100 | EST sequence | 95.2 |
| 17 B2 | NM_009780 | 99.9 | C4 (Complement component 4) | 17 B2 | M21576 | 99.5 | Slp (MHC sex-limited protein) | 96.3 |
| X A2 | NM_008955 | 100 | Psx1 (Placenta specific homeobox 1) | X A2 | NM_023894 | 100 | Homeobox protein GPBOX | 91.6 |
*Locations of duplicons by mouse chromosome banding; locus 1 and locus 2 represent a duplication pair. †Alignment percent identity between gene and genomic sequences showing correct matches. ‡Duplication % identity: average DNA percent identity between paralogous gene/transcript sequences in locus 1 and locus 2 (the duplicated pair).
Protein domain enrichment found in recently duplicated mouse genes*
| InterPro entry ID | Protein domain description | Number found in 608 duplicated genes | Number found in all 16,515 annotated genes in genome | Enrichment† |
| IPR000276 | Rhodopsin-like GPCR superfamily | 135 | 1229 | 3.0 |
| IPR000725 | Olfactory receptor | 103 | 861 | 3.3 |
| IPR003006 | Immunoglobulin/major histocompatibility complex | 46 | 372 | 3.4 |
| IPR004072 | Vomeronasal receptor, type 1 | 31 | 108 | 7.8 |
| IPR001909 | KRAB box | 23 | 103 | 6.1 |
| IPR001254 | Serine protease, trypsin family | 21 | 117 | 4.9 |
| IPR002401 | E-class P450, group I | 20 | 61 | 8.9 |
| IPR001128 | Cytochrome P450 | 20 | 68 | 8.0 |
| IPR007086 | Zn-finger, C2H2 subtype | 20 | 139 | 3.9 |
| IPR001314 | Chymotrypsin serine protease, family S1 | 19 | 108 | 4.8 |
| IPR002403 | E-class P450, group IV | 17 | 56 | 8.2 |
| IPR002397 | B-class P450 | 13 | 29 | 11.9 |
| IPR001304 | C-type lectin | 13 | 96 | 3.7 |
| IPR000215 | Serpin | 12 | 48 | 6.8 |
| IPR002402 | E-class P450, group II | 9 | 14 | 18.5 |
| IPR006046 | Glycoside hydrolase family 13 | 7 | 8 | 23.0 |
| IPR006047 | Alpha amylase, catalytic domain | 7 | 10 | 19.2 |
| IPR001400 | Somatotropin hormone | 7 | 32 | 6.1 |
| IPR006048 | Alpha amylase, C-terminal all-beta domain | 6 | 7 | 24.7 |
| IPR002018 | Carboxylesterase, type B | 6 | 13 | 12.3 |
| IPR004073 | Vomeronasal receptor, type 2 | 6 | 13 | 12.3 |
| IPR001039 | Major histocompatibility complex protein, class I | 6 | 17 | 9.9 |
| IPR001828 | Extracellular ligand-binding receptor | 6 | 29 | 5.5 |
| IPR002213 | UDP-glucuronosyl/UDP-glucosyl transferase | 5 | 12 | 11.8 |
| IPR002448 | Odour-binding protein | 4 | 9 | 13.2 |
| IPR000068 | Extracellular calcium-sensing receptor | 4 | 10 | 11.0 |
*Only Ensembl gene annotation (608 genes) was used in this analysis. †All results shown are statistically significant with p-values < 10⁻⁵ (χ² test).
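One plausible reading of the Enrichment column is the fraction of duplicated genes carrying a domain divided by the fraction of all annotated genes carrying it. The sketch below uses that ratio and a standard Pearson χ² statistic for a 2×2 table. Both are assumptions: the source does not give the exact formula or contingency layout, and while this ratio reproduces rows such as IPR000276 (3.0), IPR004072 (7.8) and IPR002401 (8.9), it differs slightly from the table on some low-count rows. The function names are illustrative.

```python
DUPLICATED_GENES = 608   # Ensembl genes in recent duplications (from the table)
ALL_GENES = 16_515       # all annotated genes in the genome (from the table)


def enrichment(in_duplicated: int, in_genome: int) -> float:
    """Ratio of domain frequency in duplicated genes to its genome-wide frequency."""
    return (in_duplicated / DUPLICATED_GENES) / (in_genome / ALL_GENES)


def chi_square_2x2(in_duplicated: int, in_genome: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction).

    ASSUMPTION: the duplicated genes are a subset of all annotated genes, so the
    table compares duplicated vs. non-duplicated genes, with vs. without the domain.
    """
    a = in_duplicated                         # duplicated, has domain
    b = DUPLICATED_GENES - a                  # duplicated, lacks domain
    c = in_genome - a                         # not duplicated, has domain
    d = ALL_GENES - DUPLICATED_GENES - c      # not duplicated, lacks domain
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rhodopsin-like GPCR superfamily (IPR000276): 135 of 608 vs. 1229 of 16,515.
print(round(enrichment(135, 1229), 1))
print(chi_square_2x2(135, 1229))
```

For one degree of freedom, a χ² statistic above roughly 19.5 corresponds to p < 10⁻⁵, which is the threshold quoted in the footnote.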
Figure 2. The genomic organization of the Mater duplication. (a) Location of the Mater duplication. A snapshot view of GMOD browser (details can be found at [14]). (b) Chromosomal view (mouse chromosome 7) of the three Mater duplication locations (DUP1, DUP2, MaterP). (c) Graphical view of the sequence similarity between DUP1 and DUP2 shown by GenomePixelizer. DUP2 is situated in an inverse orientation with respect to DUP1. Red, 99-100% sequence identity; purple, 96-98%; green, 93-95%; blue, 90-92%; black, 85-89%. (d) Graphical view of the sequence similarity between DUP1 and the MaterP region. As shown, MaterP is an intron-less, retrotransposed pseudogene. Blue, 90-92% sequence identity; black, 85-89%.
Figure 3. FISH detection of Mater duplication. (a) Metaphase FISH showing three pairs of signals (yellow) detected on mouse chromosome 7 using BAC clone RP23-225F5 (detection frequency of 70%) mapping to duplicated Mater regions. (b) DAPI banding of the same partial mitotic figures for the identification of mouse chromosome 7. A control probe RP23-464L20 was mapped to a single location in the F2 region (data not shown).
Genes that have undergone recent duplication in both the mouse and human genome*
| Refseq† | Gene description |
| NM_007558 | Bone morphogenetic protein 8a (Bmp8a) |
| NM_007812 | Cytochrome P450, 2a5 (Cyp2a5) |
| NM_009467 | UDP-glucuronosyltransferase 2 family, member 5 (Ugt2b5) |
| NM_009888 | Complement component factor h (Cfh) |
| NM_010856 | Myosin heavy chain, cardiac muscle, adult (Myhca) |
| NM_013778 | Aldo-keto reductase family 1, member C13 (Akr1c13) |
| NM_031176 | Tenascin XB (Tnxb) |
| NM_130864 | Acetyl-coenzyme A acyltransferase (peroxisomal 3-oxoacyl-Coenzyme A thiolase) (Acaa) |
| NM_031170 | Keratin complex 2, basic, gene 8 (Krt2-8) |
*Six hundred and seventy-five duplicated mouse gene sequences were aligned to the June 2002 human genome assembly by BLAST (with an initial expected value cutoff of < 10⁻¹⁰). The best aligned human genes were subsequently used for reciprocal BLAST alignments (against the mouse genome sequence) to establish a putative orthologous relationship between the mouse and human gene pairs. Using results from our human genome duplication analysis [9,37], we examined regions of the human genome where the human genes are involved in recent segmental duplication. †Italics represent genes that are entirely within a duplication in the mouse genome.
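The reciprocal BLAST step in the footnote can be sketched as a reciprocal-best-hit filter. This is a simplified illustration under stated assumptions, not the authors' procedure: hits are represented as plain tuples rather than parsed BLAST output, the "best" hit is taken to be the one with the highest bit score, the same E-value cutoff is applied in both directions, and all gene identifiers below are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float, float]  # (query, subject, e-value, bit score)


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-10) -> Dict[str, str]:
    """Keep, for each query, the subject with the highest bit score below the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:
            continue  # fails the expected-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Dict[str, str],
                         reverse: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pairs where each gene is the other's best hit (putative orthologs)."""
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


# Hypothetical toy data; not taken from the paper.
mouse_vs_human = [
    ("mouse_A", "human_A", 1e-80, 300.0),
    ("mouse_A", "human_B", 1e-40, 150.0),
    ("mouse_B", "human_B", 1e-60, 220.0),
    ("mouse_C", "human_C", 1e-5, 40.0),   # removed by the e-value cutoff
]
human_vs_mouse = [
    ("human_A", "mouse_A", 1e-78, 295.0),
    ("human_B", "mouse_A", 1e-42, 155.0),
]

pairs = reciprocal_best_hits(best_hits(mouse_vs_human), best_hits(human_vs_mouse))
print(pairs)
```

In the toy data, mouse_B's best human hit points back to a different mouse gene, so only the mouse_A/human_A pair is retained.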