Joseph Cheung, Xavier Estivill, Razi Khaja, Jeffrey R MacDonald, Ken Lau, Lap-Chee Tsui, Stephen W Scherer.
Abstract
BACKGROUND: Previous studies have suggested that recent segmental duplications, which are often involved in chromosome rearrangements underlying genomic disease, account for some 5% of the human genome. We have developed rapid computational heuristics based on BLAST analysis to detect segmental duplications, as well as regions containing potential sequence misassignments in the human genome assemblies.
Year: 2003 PMID: 12702206 PMCID: PMC154576 DOI: 10.1186/gb-2003-4-4-r25
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Segmental duplication content of the human genome
| Chromosome | Size (bp) | Intrachromosomal duplication (bp) | % chromosome (previous)* | Interchromosomal duplication (bp) | % chromosome (previous)* | Total duplications (bp) | % chromosome (previous)* | Errors† (bp) | % chromosome |
| 1 | 246,874,334 | 5,278,549 | 2.1 (4.4) | 2,854,898 | 1.2 (2.3) | 7,056,274 | 2.9 (5.7) | 4,369,406 | 1.8 |
| 2 | 240,681,600 | 4,917,160 | 2.0 (2.4) | 3,298,723 | 1.4 (1.6) | 6,892,585 | 2.9 (3.2) | 2,311,522 | 1.0 |
| 3 | 194,908,136 | 2,128,493 | 1.1 (2.3) | 1,654,201 | 0.8 (2.0) | 3,146,570 | 1.6 (3.2) | 3,979,610 | 2.0 |
| 4 | 192,019,378 | 2,599,650 | 1.4 (2.3) | 2,164,382 | 1.1 (2.2) | 4,061,432 | 2.1 (3.4) | 2,482,740 | 1.3 |
| 5 | 180,966,400 | 3,519,480 | 1.9 (2.0) | 1,464,945 | 0.8 (1.3) | 4,530,406 | 2.5 (2.8) | 2,297,998 | 1.3 |
| 6 | 170,309,517 | 2,358,252 | 1.4 (2.3) | 743,875 | 0.4 (1.3) | 2,877,392 | 1.7 (3.4) | 569,918 | 0.3 |
| 7 | 157,432,793 | 8,636,434 | 5.5 (6.3) | 2,614,326 | 1.7 (2.9) | 10,139,669 | 6.4 (7.8) | 205,130 | 0.1 |
| 8 | 143,874,322 | 2,318,984 | 1.6 (2.2) | 1,125,241 | 0.8 (2.0) | 2,612,280 | 1.8 (3.0) | 3,956,756 | 2.8 |
| 9 | 132,438,756 | 7,248,232 | 5.5 (7.1) | 4,801,871 | 3.6 (4.7) | 8,341,767 | 6.3 (8.2) | 1,589,734 | 1.2 |
| 10 | 134,416,750 | 5,279,301 | 3.9 (4.3) | 1,375,341 | 1.0 (1.9) | 6,334,458 | 4.7 (5.7) | 1,250,157 | 0.9 |
| 11 | 137,442,545 | 3,622,080 | 2.6 (3.3) | 1,670,412 | 1.2 (1.8) | 4,363,619 | 3.2 (4.4) | 2,028,875 | 1.5 |
| 12 | 131,300,572 | 1,894,547 | 1.4 (2.3) | 971,490 | 0.7 (1.2) | 2,816,187 | 2.1 (3.3) | 3,383,730 | 2.6 |
| 13 | 113,446,104 | 918,255 | 0.8 (1.9) | 1,202,102 | 1.1 (2.3) | 1,855,806 | 1.6 (3.4) | 146,198 | 0.1 |
| 14 | 104,324,908 | 531,219 | 0.5 (0.7) | 820,880 | 0.8 (1.6) | 1,335,177 | 1.3 (2.1) | 13,814 | 0.0 |
| 15 | 99,217,355 | 4,593,233 | 4.6 (6.2) | 2,344,618 | 2.4 (3.9) | 5,634,201 | 5.7 (8.2) | 1,739,894 | 1.8 |
| 16 | 81,671,585 | 4,917,218 | 6.0 (8.3) | 2,228,116 | 2.7 (3.9) | 6,012,178 | 7.4 (9.8) | 2,113,843 | 2.6 |
| 17 | 80,052,782 | 4,775,137 | 6.0 (7.1) | 646,968 | 0.8 (2.5) | 5,274,195 | 6.6 (8.5) | 2,145,614 | 2.7 |
| 18 | 77,516,809 | 525,636 | 0.7 (1.2) | 700,654 | 0.9 (1.9) | 1,226,290 | 1.6 (3.1) | 1,443,775 | 1.9 |
| 19 | 60,013,307 | 2,700,984 | 4.5 (6.8) | 704,757 | 1.2 (2.6) | 3,156,687 | 5.3 (8.1) | 335,190 | 0.6 |
| 20 | 62,842,997 | 592,441 | 0.9 (1.1) | 873,152 | 1.4 (1.8) | 1,052,248 | 1.7 (2.1) | 147,940 | 0.2 |
| 21 | 44,626,493 | 481,879 | 1.1 (1.4) | 1,303,776 | 2.9 (5.1) | 1,504,333 | 3.4 (5.2) | 0 | 0.0 |
| 22 | 47,748,585 | 1,741,766 | 3.6 (6.7) | 1,374,363 | 2.9 (7.4) | 2,770,386 | 5.8 (10.9) | 0 | 0.0 |
| X | 149,249,818 | 2,625,206 | 1.8 (3.6) | 2,927,714 | 2.0 (2.3) | 5,518,712 | 3.7 (5.5) | 2,185,046 | 1.5 |
| Y | 58,368,225 | 5,959,836 | 10.2 (28.4) | 3,524,276 | 6.0 (25.0) | 8,461,355 | 14.5 (40.7) | 56,204 | 0.1 |
| Un‡ | 1,391,854 | 179,709 | 12.9 (20.4) | 378,110 | 27.2 (32.6) | 407,013 | 29.2 (36.5) | 116,923 | 8.4 |
| Total | 3,043,135,925 | 80,343,681 | 2.6 (3.8) | 43,769,191 | 1.4 (2.6) | 107,381,220 | 3.5 (5.2) | 38,870,017 | 1.3 |
*Previous data on segmental duplications distributed by chromosomes as reported in [8]. †Errors represent data that were detected as potential sequence misassignments. ‡Un, unmapped chromosome sequence.
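The percentage columns in the table above appear to be each duplicated length divided by the chromosome size. A minimal Python sketch of that arithmetic, using three rows copied from the table (the function name `percent_of_chromosome` and the `ROWS` layout are illustrative and not from the paper):

```python
def percent_of_chromosome(duplicated_bp: int, chromosome_bp: int) -> float:
    """Return duplicated_bp as a percentage of chromosome_bp, to one decimal."""
    return round(100.0 * duplicated_bp / chromosome_bp, 1)


# chromosome -> (size, intrachromosomal, interchromosomal, total), all in bp.
# Values are copied from the table; the dictionary layout is an assumption.
ROWS = {
    "1": (246_874_334, 5_278_549, 2_854_898, 7_056_274),
    "7": (157_432_793, 8_636_434, 2_614_326, 10_139_669),
    "Y": (58_368_225, 5_959_836, 3_524_276, 8_461_355),
}

for chrom, (size, intra, inter, total) in ROWS.items():
    print(
        chrom,
        percent_of_chromosome(intra, size),
        percent_of_chromosome(inter, size),
        percent_of_chromosome(total, size),
    )
```

In these three rows the total is smaller than the sum of the intrachromosomal and interchromosomal lengths, presumably because some sequence is counted in both categories.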
Figure 1. Intrachromosomal segmental duplications identified in the human genome. Three panels of results are displayed for each chromosome. Left panel: graphical views of the paralogous relationships between recent segmental duplications (graphics produced using GenomePixelizer [29,30]; each line represents a duplicated module; coloring scheme, red = 99% to 100% sequence identity, purple = 96% to 98%, green = 93% to 95%, and blue = 90% to 92%). Middle panel: segmental duplications as detected by BLAST analysis (size of duplication in kb plotted against the length of chromosome in Mb). Right panel: ambSNPs density plot (number of ambSNPs plotted against the length of chromosome in Mb). All analyses were done using the June 2002 human genome sequence assembly.
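The coloring scheme in the Figure 1 caption can be written as a small lookup. The sketch below is only an illustration: the caption lists whole-percent bins, so truncating a fractional identity to its whole percent (98.7% falls in the 96% to 98% bin) is an assumption, and `identity_color` is a hypothetical name rather than part of GenomePixelizer.

```python
import math


def identity_color(percent_identity: float) -> str:
    """Map a percent identity to the color named in the Figure 1 caption."""
    if not 90.0 <= percent_identity <= 100.0:
        raise ValueError("the caption only defines colors for 90% to 100% identity")
    whole = math.floor(percent_identity)  # assumption: bin by whole percent
    if whole >= 99:
        return "red"      # 99% to 100%
    if whole >= 96:
        return "purple"   # 96% to 98%
    if whole >= 93:
        return "green"    # 93% to 95%
    return "blue"         # 90% to 92%


for identity in (99.60, 98.22, 95.19, 91.81):
    print(identity, identity_color(identity))
```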
Figure 2. An example of sequence misassignment error as indicated by e-PCR analysis. AC121339 is incorrectly mapped to 3q13.13 in the June 2002 human genome assembly as shown by a consensus number of chromosome X STS markers.
Examples of sequence misassignment errors
| Clone* | Location | Size of region involved (bp) | e-PCR results |
| AC121339† | 3q13.13 | 193,190 | chrX |
| AC016003 | 17q21.31 | 181,582 | chr9 |
| AC119723 | 3q22.1 | 159,924 | chr6 |
| AC093007 | 3q12.1 | 169,882 | chr6 |
| AC110578 | 8p23.2 | 160,554 | chr15 |
| AC108862 | 11p15.3 | 156,150 | chr18 |
| AC113009 | 8q23.1 | 155,171 | chr11 |
| AC104765 | 8q12.1 | 150,029 | chr18 |
| AC105412 | 2p13.1 | 144,924 | chr5 |
| AC092744 | 12p12.3 | 144,009 | chr4 |
| AC099061 | 1p21.3 | 140,516 | chr15 |
| AC108735 | 3p24.3 | 136,005 | chr16 |
| AC122689 | 3q23 | 120,057 | chr12 |
| AC017027 | 1q32.1 | 116,265 | chr5 |
| AC013530 | 3q26.1 | 99,768 | chr8 |
| AC115093 | 11p15.4 | 98,715 | chr1 |
| AC112921 | Xp22.22 | 96,272 | chr3 |
| AC108094 | 16q21 | 94,953 | chr17 |
| AC079186 | 8q12.1 | 78,771 | chr7 |
| AC024573 | Unmapped | 56,016 | chr2 |
| AC115093 | 11p15.4 | 53,858 | chr1 |
*A full list can be obtained from [12]. †See Figure 2 for e-PCR results supporting sequence misassignment.
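The e-PCR column above reports the chromosome supported by a consensus of STS markers on each clone. One way such a consensus call could be sketched in Python is a simple majority vote. This is not the authors' procedure: the tie handling, the function names, and the marker list in the example are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_chromosome(marker_chromosomes: Sequence[str]) -> Optional[str]:
    """Return the chromosome named by most markers, or None if empty or tied."""
    if not marker_chromosomes:
        return None
    ranked = Counter(marker_chromosomes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no clear consensus
    return ranked[0][0]


def is_possible_misassignment(
    assembly_chromosome: str, marker_chromosomes: Sequence[str]
) -> bool:
    """Flag a clone whose marker consensus disagrees with its assembly placement."""
    consensus = consensus_chromosome(marker_chromosomes)
    return consensus is not None and consensus != assembly_chromosome


# Hypothetical marker hits for a clone placed on chromosome 3 in the assembly.
example_markers = ["X", "X", "X", "3"]
print(consensus_chromosome(example_markers))
print(is_possible_misassignment("3", example_markers))
```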
Comparison of duplications and potential sequence misassignment errors in genome assemblies
| Chromosome | December 2001: Length | December 2001: Duplications | December 2001: Errors | April 2002: Length | April 2002: Duplications | April 2002: Errors | June 2002: Length | June 2002: Duplications | June 2002: Errors |
| Chr1 | 2,564 | 99 | 115 | 2,459 | 68 | 60 | 2,469 | 71 | 44 |
| Chr2 | 2,413 | 70 | 45 | 2,468 | 79 | 57 | 2,407 | 69 | 23 |
| Chr3 | 2,048 | 49 | 90 | 2,047 | 29 | 73 | 1,949 | 31 | 40 |
| Chr4 | 1,914 | 39 | 44 | 1,970 | 51 | 49 | 1,920 | 41 | 25 |
| Chr5 | 1,848 | 55 | 90 | 1,896 | 55 | 112 | 1,810 | 45 | 23 |
| Chr6 | 1,783 | 58 | 56 | 1,828 | 43 | 153 | 1,703 | 29 | 6 |
| Chr7 | 1,638 | 130 | 48 | 1,605 | 119 | 27 | 1,574 | 101 | 2 |
| Chr8 | 1,457 | 35 | 66 | 1,484 | 33 | 43 | 1,439 | 26 | 40 |
| Chr9 | 1,330 | 83 | 38 | 1,291 | 75 | 27 | 1,324 | 83 | 16 |
| Chr10 | 1,421 | 74 | 51 | 1,385 | 72 | 39 | 1,344 | 63 | 13 |
| Chr11 | 1,414 | 51 | 84 | 1,341 | 43 | 36 | 1,374 | 44 | 20 |
| Chr12 | 1,396 | 30 | 83 | 1,342 | 24 | 32 | 1,313 | 28 | 34 |
| Chr13 | 1,151 | 29 | 21 | 1,136 | 22 | 15 | 1,134 | 19 | 1 |
| Chr14 | 1,065 | 27 | 8 | 1,054 | 23 | 10 | 1,043 | 13 | 0 |
| Chr15 | 991 | 62 | 30 | 1,000 | 54 | 20 | 992 | 56 | 17 |
| Chr16 | 938 | 65 | 44 | 932 | 67 | 32 | 817 | 60 | 21 |
| Chr17 | 839 | 66 | 46 | 811 | 46 | 29 | 801 | 53 | 21 |
| Chr18 | 818 | 16 | 59 | 809 | 12 | 32 | 775 | 12 | 14 |
| Chr19 | 769 | 45 | 28 | 730 | 34 | 12 | 600 | 32 | 3 |
| Chr20 | 630 | 10 | 5 | 628 | 12 | 4 | 628 | 11 | 1 |
| Chr21 | 446 | 18 | 3 | 446 | 16 | 2 | 446 | 15 | 0 |
| Chr22 | 478 | 28 | 0 | 477 | 29 | 1 | 477 | 28 | 0 |
| ChrX | 1,517 | 54 | 40 | 1,518 | 61 | 23 | 1,492 | 55 | 22 |
| ChrY | 584 | 86 | 2 | 584 | 95 | 2 | 584 | 85 | 1 |
| ChrUn | 74 | 10 | 1 | 125 | 11 | 43 | 14 | 4 | 1 |
| Total | 31,526 | 1,290 | 1,097 | 31,366 | 1,175 | 932 | 30,431 | 1,074 | 389 |
| % identity range* | December 2001: Duplication | December 2001: Error | April 2002: Duplication | April 2002: Error | June 2002: Duplication | June 2002: Error |
| 90-92% | 135 | 0 | 137 | 0 | 117 | 0 |
| 92-94% | 334 | 0 | 334 | 0 | 311 | 0 |
| 94-96% | 391 | 0 | 382 | 0 | 367 | 0 |
| 96-98% | 451 | 0 | 444 | 0 | 418 | 0 |
| 98-100% | 884 | 1,097 | 724 | 932 | 665 | 389 |
All numbers shown in the table are × 100 kb. *Sequence similarity between duplications, grouped into five levels of percent identity.
Segmental duplications involved in known genomic disorders and chromosome rearrangements identified by BLAST and ambSNP analyses
| Disorders | Band | First copy: Start* | First copy: Size* | First copy: ambSNPs† | Second copy(s): Start* | Second copy(s): Size* | Second copy(s): ambSNPs† | Identity (%) | Celera‡ |
| Gaucher disease | 1q22 | 148108965 | 10,649 | 7 | 152776301 | -10,479 | 10 | 95.19 | S |
| Spinal muscular atrophy | 5p14/5q13 | 21621854 | 79,183 | 1,032 | 69175603 | -79,149 | 1,190 | 98.22 | M |
| Williams-Beuren syndrome | 7q11.23 | 70970126 | 359,416 | 380 | 72927299 | 111,773 | 56 | 99.60 | P |
| | | | | | 73383317 | -227,260 | 355 | 99.20 | P |
| t(4;8)(p16;p23) Wolf-Hirschhorn syndrome | 4p16/8p23 | 8769778 | 99,609 | 3† | 7156209 | -51,677 | 18 | 95.65 | P |
| | | | | | 7470072 | -82,189 | 387 | 95.81 | P |
| inv dup(8p) der(8)(8p23.1::p23.2-pter) del(8)(p23.1p23.2) | 8p23.1 | 7084847 | 138,560 | 123 | 7756853 | -126,769 | 229 | 99.16 | M |
| | | | | | 7651975 | 54,807 | 463 | 96.93 | M |
| Prader-Willi syndrome and Angelman syndrome | 15q11/15q13 | 19709020 | 75,325 | 102 | 19961243 | 34,902 | 55 | 98.70 | P |
| | | | | | 20029574 | 41,965 | 83 | 98.79 | P |
| | | 19802418 | 251,245 | 548 | 20064937 | 74,780 | 65 | 99.01 | P |
| Polycystic kidney disease 1 | 16p13 | 2164789 | 38,034 | 136 | 16249164 | 24,076 | 243 | 98.32 | P |
| Charcot-Marie-Tooth1A/Hereditary neuropathy with pressure palsies | 17p12/17p12 | 14440158 | 23,599 | 272 | 15837032 | 23,585 | 286 | 98.42 | P |
| Smith-Magenis syndrome/ dup(17)(p11.2-p11.2) | 17p12 | 18524425 | 152,700 | 547 | 20492073 | -147,255 | 539 | 99.06 | M |
| | | | | | 25811482 | 28,239 | 24 | 99.20 | M |
| Neurofibromatosis type 1 | 17q11.2 | 28686414 | 63,356 | 163 | 28952984 | -32,619 | 129 | 98.65 | P |
| DiGeorge syndrome and velocardiofacial syndrome | 22q11.21 | 15662253 | 155,811 | 471 | 18221385 | 155,996 | 322 | 99.42 | P |
| | | | | | 17742343 | 9,740 | 62 | 97.84 | P |
| | | | | | 18164371 | -39,696 | 21 | 99.37 | P |
| Chronic myeloid leukemia t(9;22)(pq34;q11) | 9q34/ 22q11 | 123263651 | 36,956 | NA | 20552124 | 26,424 | NA | 91.81 | S |
| Emery-Dreifuss muscular dystrophy | Xq28 | 147627873 | 11,030 | 2 | 147676529 | 11,034 | 2 | 99.61 | S |
| Shwachman-Diamond syndrome | 7q11.21 | 65091051 | 325,140 | 665 | 70647188 | 302,881 | 652 | 97.43 | P |
| Red green color blindness | Xq28 | 148439480 | 21,144 | 61 | 148476598 | 21,834 | 58 | 99.82 | S |
| | 17q21 | 40983970 | 43,221 | 66 | 62252214 | 43,152 | 66 | 99.85 | P |
| Male infertility AZFc microdeletion region 2 | Yq11.22 | 23322362 | 190,336 | 391§ | 23680552 | -185,149 | 393§ | 99.88 | P |
| | Yq11.22 | 23908727 | 94,194 | 282§ | 24794944 | -93,690 | 284§ | 99.92 | P |
| | Yq11.22 | 24794944 | 93,690 | 247§ | 27460935 | -94,218 | 248§ | 99.93 | P |
This table represents a partial list of all known genomic disorders and chromosome rearrangements. *Only the start coordinates (based on June 2002 assembly) for duplicons are shown. Results from BLAST analysis with chromosome coordinates and size of duplicon. For several genomic mutations (Williams-Beuren syndrome, Prader-Willi syndrome and Angelman syndromes, and DiGeorge syndrome) the duplicons shown are incomplete, most of which are composed of several duplication modules. The '-' sign indicates that the second duplicon is in the inverse orientation. †The number of ambSNPs (ambiguously mapped single-nucleotide polymorphisms) found within the genomic segment. NA, not applicable. The ambSNP analysis defines regions containing high densities of contiguous ambSNPs. For some of the segmental duplications involved in genomic disorders, the contiguous lengths of ambSNPs are much larger than those detected by BLAST. The specific sizes of the segmental duplications have to be resolved by detailed characterization of the different modules. ‡Celera representation: S, both copies found in large (> 500 kb) sequence scaffolds; P, partially hit, single copy found, or less than perfect alignments; M, missing from large sequence scaffolds, hitting numerous fragments. § SNPs with multiple locations were used for evaluating the density of ambSNPs.
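Because the table encodes orientation in the sign of the second-copy size, a reader loading these rows may want to separate length from orientation. A minimal Python sketch under that reading of the footnote; the `Duplicon` type and `parse_second_copy` function are illustrative names, not part of the paper:

```python
from typing import NamedTuple


class Duplicon(NamedTuple):
    start: int       # start coordinate as printed (June 2002 assembly)
    length_bp: int   # unsigned size in bp
    inverted: bool   # True when the table shows a '-' sign on the size


def parse_second_copy(start: str, size: str) -> Duplicon:
    """Split a signed, comma-grouped size such as '-10,479' into length and orientation."""
    signed_size = int(size.replace(",", ""))
    return Duplicon(int(start), abs(signed_size), signed_size < 0)


# Second-copy cells copied from the Gaucher disease and Polycystic kidney disease 1 rows.
print(parse_second_copy("152776301", "-10,479"))
print(parse_second_copy("16249164", "24,076"))
```

Only start coordinates are given in the table, so the sketch does not attempt to derive end coordinates.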