| Literature DB >> 21396103 |
Todd A Johnson1, Yoshihito Niimura, Hiroshi Tanaka, Yusuke Nakamura, Tatsuhiko Tsunoda.
Abstract
The analysis of contiguous homozygosity (runs of homozygous loci) in human genotyping datasets is critical in the search for causal disease variants in monogenic disorders, studies of population history and the identification of targets of natural selection. Here, we report methods for extracting homozygous segments from high-density genotyping datasets, quantifying their local genomic structure, identifying outstanding regions within the genome and visualizing results for comparative analysis between population samples.Entities:
Mesh:
Year: 2011 PMID: 21396103 PMCID: PMC3129671 DOI: 10.1186/gb-2011-12-3-r21
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Comparison of segment overlap counts between hzAnalyzer and PLINK homozygous segment/runs-of-homozygosity detection routines
| Number of intersecting segments in other dataset | |||||||
|---|---|---|---|---|---|---|---|
| Dataset | Total segments | 0 | 1 | 2 | 3-5 | 6-10 | >10 |
| hzAnalyzer | 5,781 | 440 | 5,240 | 93 | 7 | 0 | 1 |
| PLINK | 8,040 | 30 | 5,059 | 1,777 | 1,108 | 66 | 0 |
Runs of homozygosity were detected using PLINK's default settings (ROH >1 Mb), and a corresponding set of homozygous segments with >1 Mb length selected from the complete hzAnalyzer dataset. The PLINK set was intersected with segments with ≥50 SNP/segment from the complete hzAnalyzer dataset, and the reverse intersection was performed between the hzAnalyzer (>1 Mb) and PLINK (>1 Mb) sets.
Figure 1Identification and summary of putative autozygous segments. (a) High MAD score homozygous segments originate from low frequency haplotypes: for each homozygous segment, a length-based MAD score was calculated and the frequency of haplotypes matching a segment's founder haplotypes estimated within each sample population. A two-dimensional density estimate between the two variables used R's densCols function with nbin = 1,024. (b) Concordance between 1000G data and putative autozygous segments: putative autozygous segments' SNP counts in HapMap Phase 2 compared with heterozygosity in 1000 Genomes Project genotypes. (c,d) Boxplot summaries of putative autozygous segments: (c) genome-wide percent coverage by individual; (d) segment length (outliers not shown). Putative autozygous segments defined as MAD score >10 and founder haplotype frequency = 0.0000. Asterisks mark values that are above the y-axis limit.
Figure 2Chromosomal coverage by homozygous segments as a function of segment size. For each chromosome, the cumulative sum of segment length (sorted in decreasing or increasing order) was calculated for each individual, values interpolated for a set of length values between 0 and 1,000 kb, and the median value curve calculated across each sample population. (a) Cumulative total length (sorted by increasing segment size) as the proportion of mappable chromosomal length. (b) Cumulative total length (sorted by decreasing segment size) as the proportion of mappable chromosomal length. (c) Total segment length ≥MISLgw versus each chromosome's total mappable length (r shown excludes chromosome X).
Figure 3Schematic workflow for summarizing and quantifying contiguous homozygosity. Example 10-Mb region on chromosome 1 illustrating how observed patterns of homozygous segments are processed to create intersecting segment length matrix (ISLMcm), calculate extAUC, and define peaks and outlier peak regions. (a) Coordinates of homozygous segments with length ≥49 kb (regional MISL). (b) Intersecting segment lengths for each locus are combined into ISLMcm. (c) An intersecting segment length vector (ISLV) is extracted from the ISLM, the sign of the values reversed, and extAUC is calculated by integrating the area-under-the-curve of the empirical cumulative distribution function (ECDF) using those values. Dashed red lines mark interval of integration after masking. (d) extAUC peak detection and processing: peaks are detected from a smooth spline function applied to extAUC values, peaks with extreme peak heights selected (outlier peaks), and neighboring outlier peaks that are not well separated are merged into peak regions.
Genome-wide peak counts at different stages of peak processing
| Peak dataset type | YRI | CEU | CHB | JPT |
|---|---|---|---|---|
| Complete peaks | 25,723 | 25,142 | 25,413 | 25,418 |
| Merged peaks | 15,325 | 15,815 | 16,117 | 16,119 |
| Outlier peaks | 873 | 908 | 1,007 | 1,047 |
| Outlier peak regions | 349 | 358 | 401 | 416 |
Examples of top outlier peak regions for each population
| Chr | Position | Low valley | High valley | SNPct | W(bp) | W(cm) | Extentmin | Freqhap-max | Top 5 gene(s) | GeneCt | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YRI | 11 | 48,984,887 | 46,179,339 | 51,434,161 | 0.7583 | 2,260 | 5,254,822 | 7.1256 | 1,944,689 | 0.2966 | 48 | |
| X | 64,291,347 | 62,718,942 | 66,950,662 | 0.5043 | 1,283 | 4,231,720 | 4.8877 | 1,706,793 | 0.2045 | 12 | ||
| 19 | 21,348,594 | 20,644,142 | 21,607,663 | 0.476 | 728 | 963,521 | 2.0049 | 278,309 | 0.4492 | 8 | ||
| 15 | 42,487,755 | 40,203,270 | 42,922,650 | 0.4277 | 1,542 | 2,719,380 | 4.4896 | 280,917 | 0.6695 | 47 | ||
| 13 | 56,734,306 | 54,206,759 | 58,262,174 | 0.3587 | 3,933 | 4,055,415 | 5.588 | 543,152 | 0.4322 | 2 | ||
| 1 | 172,725,722 | 171,321,975 | 173,400,872 | 0.3306 | 1,303 | 2,078,897 | 2.7759 | 1,077,265 | 0.2288 | 16 | ||
| 3 | 49,131,477 | 46,848,671 | 51,894,338 | 0.3307 | 1,993 | 5,045,667 | 6.1529 | 646,401 | 0.3729 | 108 | ||
| 14 | 66,177,344 | 65,574,279 | 67,039,563 | 0.3157 | 901 | 1,465,284 | 2.1116 | 451,154 | 0.3814 | 7 | ||
| 2 | 62,928,490 | 62,640,536 | 64,139,379 | 0.3138 | 864 | 1,498,843 | 1.9062 | 665,460 | 0.3898 | 7 | ||
| 16 | 34,336,349 | 34,040,995 | 35,126,826 | 0.2836 | 446 | 1,085,831 | 1.8457 | 407,250 | 0.3559 | No overlapping gene symbols | 0 | |
| CEU | X | 56,905,680 | 54,032,865 | 58,499,973 | 2.2637 | 1,588 | 4,467,108 | 7.0248 | 2,146,728 | 0.6705 | 30 | |
| 4 | 33,984,210 | 32,226,578 | 34,781,662 | 1.1807 | 1,754 | 2,555,084 | 3.6574 | 1,063,099 | 0.5847 | No overlapping gene symbols | 0 | |
| 10 | 74,474,938 | 73,363,734 | 76,877,918 | 1.1538 | 1,994 | 3,514,184 | 4.2036 | 1,225,977 | 0.5847 | 45 | ||
| 17 | 56,029,088 | 54,786,548 | 56,697,157 | 1.1855 | 950 | 1,910,609 | 2.8676 | 650,495 | 0.7542 | 17 | ||
| 11 | 47,998,372 | 46,181,418 | 51,434,161 | 1.021 | 2,256 | 5,252,743 | 7.1227 | 2,476,725 | 0.3898 | 48 | ||
| 2 | 136,299,164 | 134,916,186 | 137,368,521 | 0.8663 | 2,261 | 2,452,335 | 2.5254 | 983,710 | 0.7373 | 13 | ||
| 12 | 33,861,148 | 32,686,226 | 38,560,061 | 0.8322 | 3,319 | 5,873,835 | 10.0596 | 1,347,796 | 0.3898 | 12 | ||
| 6 | 28,454,924 | 26,007,118 | 30,140,896 | 0.829 | 4,884 | 4,133,778 | 5.9709 | 1,925,294 | 0.1102 | 116 | ||
| 1 | 35,416,859 | 35,088,324 | 36,679,373 | 0.8304 | 716 | 1,591,049 | 1.9898 | 957,224 | 0.6525 | 27 | ||
| 15 | 41,082,380 | 40,198,534 | 43,641,969 | 0.8478 | 2,088 | 3,443,435 | 5.6849 | 620,196 | 0.5424 | 60 | ||
| CHB | X | 65,578,703 | 62,500,211 | 68,093,466 | 5.0603 | 1,811 | 5,593,255 | 6.4603 | 3,865,015 | 1 | 16 | |
| 16 | 46,816,951 | 45,019,628 | 47,582,293 | 1.7237 | 1,197 | 2,562,665 | 4.3969 | 1,239,919 | 0.5682 | 16 | ||
| 3 | 49,185,837 | 46,688,461 | 52,084,708 | 1.7352 | 2,120 | 5,396,247 | 6.5804 | 1,411,092 | 0.7614 | 124 | ||
| 20 | 33,906,145 | 31,887,721 | 34,457,027 | 1.1757 | 1,515 | 2,569,306 | 4.7755 | 641,849 | 0.6477 | 44 | ||
| 17 | 56,257,611 | 54,786,430 | 56,802,151 | 1.0563 | 1,040 | 2,015,721 | 3.0254 | 563,530 | 0.7159 | 17 | ||
| 1 | 50,585,736 | 48,934,817 | 53,018,060 | 1.0833 | 2,299 | 4,083,243 | 5.1065 | 1,010,684 | 0.5227 | 25 | ||
| 2 | 72,358,317 | 72,139,001 | 73,325,161 | 1.0369 | 914 | 1,186,160 | 1.5085 | 785,525 | 0.8523 | 10 | ||
| 15 | 61,947,157 | 61,420,087 | 63,327,879 | 1.0094 | 1,241 | 1,907,792 | 3.1497 | 757,196 | 0.6136 | 27 | ||
| 5 | 43,332,643 | 41,569,404 | 46,432,729 | 0.9655 | 3,211 | 4,863,325 | 7.1409 | 940,407 | 0.3182 | 18 | ||
| 4 | 33,761,093 | 32,224,418 | 34,856,502 | 0.9703 | 1,854 | 2,632,084 | 3.7676 | 705,955 | 0.4773 | No overlapping gene symbols | 0 | |
| JPT | X | 65,574,377 | 62,541,924 | 68,119,789 | 4.6884 | 1,830 | 5,577,865 | 6.4425 | 2,974,147 | 1 | 16 | |
| 3 | 49,116,120 | 46,984,286 | 52,086,091 | 1.5554 | 1,926 | 5,101,805 | 6.2214 | 1,411,982 | 0.6744 | 117 | ||
| 16 | 46,405,964 | 45,019,628 | 47,589,442 | 1.4543 | 1,203 | 2,569,814 | 4.4092 | 1,013,458 | 0.6512 | 16 | ||
| 11 | 51,354,159 | 45,478,264 | 51,434,161 | 1.1994 | 2,899 | 5,955,897 | 8.0762 | 1,844,210 | 0.3837 | 57 | ||
| 1 | 50,634,574 | 49,236,023 | 52,884,998 | 1.1586 | 1,797 | 3,648,975 | 4.5634 | 727,447 | 0.7442 | 22 | ||
| 20 | 33,684,569 | 31,842,856 | 34,417,872 | 1.0984 | 1,512 | 2,575,016 | 4.7861 | 982,049 | 0.4186 | 45 | ||
| 2 | 72,358,024 | 71,951,517 | 73,081,576 | 1.0483 | 956 | 1,130,059 | 1.4372 | 771,138 | 0.8721 | 5 | ||
| 15 | 62,853,070 | 61,443,788 | 63,335,076 | 1.0036 | 1,219 | 1,891,288 | 3.1224 | 868,839 | 0.5465 | 27 | ||
| 14 | 65,950,544 | 65,443,441 | 67,097,562 | 0.9667 | 1,112 | 1,654,121 | 2.3838 | 987,976 | 0.407 | 9 | ||
| 17 | 55,887,456 | 53,750,682 | 56,801,264 | 0.9595 | 1,451 | 3,050,582 | 4.5786 | 555,447 | 0.8023 | 33 |
Outlier peak regions were sorted by peak height and ten representative examples chosen from the top. Columns with abbreviated names not referred to in the text are: Pop., population; Chr, chromosome; Pk. Ht, peak region height; SNPct, number of SNPs in region; W(bp), peak region width in base pairs; W(cm), peak region width in centimorgans; GeneCt, number of genes in region. Genes listed are sorted by length and the top five shown.
Figure 4Measures of contiguous homozygosity surrounding outlier peak regions. Two of the top four peak regions were chosen from Table 3 for each population and centimorgan values from the percentile extent matrix (PEmat) for the surrounding 10-Mb chromosomal area plotted as a grayscale image. Grayscale levels are adjusted relative to the maximum centimorgan value in the 90th percentile and values above that level set to black; correspondence between gray levels and cM is indicated at the top of each panel. Red line: smoothed extAUC values were down-sampled before plotting. The left-hand y-axis labels refer to percentile levels of the PEmat data, and the right-hand y-axis labels are for the line plot of extAUC values.
Figure 5Diagram of the method for comparing the extent and frequency of homozygous segments with haplotypes underlying ext. From a consensus set of genotypes, homozygous segments were detected, extAUC calculated and smooth spline interpolation performed, and peaks detected. Complementary quantiles (Pr(X >Extent) = (1 - Pr(X ≤ Extent)) = (1 - Percentile/100)) were calculated from each ISLV's percentile values underlying a particular peak. Values of Extent and Pr(X >Extent) that maximized the value of Extent × Pr(X >Extent) were extracted as parameters Extentand p, segments with length greater than Extentselected, and the maximum founder haplotype frequency (Freq) obtained across those segments. The expected haplotype frequency (Freq) was calculated as the square-root of p.
Figure 6Comparison of the extent and frequency of homozygous segments with founder haplotypes underlying ext. Minimum segment length (Extent), expected haplotype frequency (Freq), and maximum haplotype frequency (Freq) were calculated as diagrammed in Figure 5 for peaks dichotomized into non-outlier and outlier peaks. Data points were colored using a two-dimensional density estimate using R's function densCols with nbin = 1,024. Data shown are for CEU with plots for all populations in Additional file 9.
Figure 7Genomic regions of high-frequency extended homozygosity possess low recombination rates. Recombination rate was calculated for the central half-width of each autosomal peak in the merged peak dataset using population-specific genetic maps based on the 1000G pilot data. (a) Recombination rate versus extAUC peak height. (b,c) Boxplot summaries for non-outlier and outlier peaks for each sample population for (b) recombination rate, and (c) recombination rate transformed into cumulative probabilities and adjusting for peak width.
Figure 8Peaks with high cross-population homozygosity and high Fst/θ identify genomic regions with dissimilar haplotype structure. Outlier peaks for each base population were chosen, filtered to include those with high ranking extAUC across all four populations, and sorted based on average Fst/θ between the base population and the other three populations (or with YRI and CEU if the base population was CHB or JPT). Phased haplotypes were plotted for an example peak for YRI with high differentiation (mean Fst/θ = 0.83).
Figure 9Peaks with high cross-population homozygosity and low Fst/θ identify genomic regions with similar haplotype structure. Outlier peaks for each base population were chosen, filtered to include those with high ranking extAUC across all four populations, and sorted based on average Fst/θ between the base population and the other three populations (or with YRI and CEU if the base population was CHB or JPT). The structure of phased haplotypes is plotted for an example peak for JPT with low differentiation (mean Fst/θ = 0.0235).
Figure 10Peak regions harbouring areas at or approaching fixation. (a-c) Boxplots by population: (a) RCL0 SNP counts; (b) RCL0 length (kb); (c) RCL0 extent value (cM). (d) One each of the top autosomal and top chromosome X candidate fixation peak regions from Table S7a-c in Additional file 2 for each population, and the centimorgan values from PEmat for the chromosomal area ± 1 peak region width from the center plotted as a grayscale image. Grayscale levels are adjusted relative to the maximum value in the 90th percentile and values above that level set to black; legend key between gray levels and cM is indicated at the top of each panel. Red line: smoothed extAUC values were down-sampled at a 10% frequency and plotted. The left-hand y-axis labels refer to percentile levels of the PEmat data, and the right-hand y-axis labels are for the line plot of extAUC values. Outlier points in (a-c) are randomly jittered to reduce overlap and datapoints off the figure plotted as diamonds.
Summary of RCL0 in candidate fixed areas
| Minimum | Maximum | Total | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| SNP count | Extent (cM) | Extent (bp) | SNP count | Extent (cM) | Extent (bp) | Run count | SNP count | Extent (cM) | Extent (bp) | |
| YRI | 14 | 0.0128 | 9,359 | 14 | 0.0128 | 9,359 | 1 | 14 | 0.0128 | 9,359 |
| CEU | 17 | 0.0188 | 5,675 | 73 | 0.3135 | 258,163 | 21 | 711 | 2.1398 | 1,410,226 |
| CHB | 13 | 0.0252 | 8,033 | 716 | 3.0799 | 1,798,574 | 139 | 5,771 | 16.1761 | 9,815,425 |
| JPT | 15 | 0.0247 | 8,544 | 385 | 2.0276 | 872,011 | 205 | 8,589 | 24.9607 | 13,597,432 |
Pop., population.
Figure 11Agglomerative haplotype analysis: testing for differences between two population samples' homozygous segment length distributions. We compared the homozygous segment length distributions between JPT and CHB at each locus position in the chromosome 20 masked ISLMcm matrix by calculating the Cramer-von Mises two-sample minimum distance statistic ω2. A representative position (Chr 20:43.82 Mb) with a high ω2 value was chosen to illustrate the AHA permutation testing procedure. (a) ω2 values between JPT and CHB for chromosome 20. (b) Segment length ECDFs for JPT and CHB at example peak ω2 position. (c) Determination of significance level using permutation test: ECDFs from 100,000 resamples after mixing JPT and CHB data and randomly reassigning group labels. ω2 was calculated for each permutation resample, and the achieved significance level calculated as the number of permutations with ω2 values greater than or equal to the original two-sample ω2 divided by the total number of permutations.