Despite their importance in gene innovation and phenotypic variation, duplicated regions have remained largely intractable owing to difficulties in accurately resolving their structure, copy number and sequence content. We present an algorithm (mrFAST) to comprehensively map next-generation sequence reads, which allows for the prediction of absolute copy-number variation of duplicated segments and genes. We examine three human genomes and experimentally validate genome-wide copy number differences. We estimate that, on average, 73-87 genes vary in copy number between any two individuals and find that these genic differences overwhelmingly correspond to segmental duplications (odds ratio = 135; P < 2.2 x 10(-16)). Our method can distinguish between different copies of highly identical genes, providing a more accurate assessment of gene content and insight into functional constraint without the limitations of array-based technology.
Despite their importance in gene innovation and phenotypic variation, duplicated regions have remained largely intractable owing to difficulties in accurately resolving their structure, copy number and sequence content. We present an algorithm (mrFAST) to comprehensively map next-generation sequence reads, which allows for the prediction of absolute copy-number variation of duplicated segments and genes. We examine three human genomes and experimentally validate genome-wide copy number differences. We estimate that, on average, 73-87 genes vary in copy number between any two individuals and find that these genic differences overwhelmingly correspond to segmental duplications (odds ratio = 135; P < 2.2 x 10(-16)). Our method can distinguish between different copies of highly identical genes, providing a more accurate assessment of gene content and insight into functional constraint without the limitations of array-based technology.
The human genome is enriched for gene-rich segmental duplications that vary extensively in copy number 1-4. Variation in the content and copy of these duplicated genes has been associated with recurrent genomic rearrangements as well as a variety of diseases, including color blindness, psoriasis, HIV susceptibility, Crohn's disease, and lupus glomerulonephritis 5-10. Despite recent technological advances in copy-number detection, a global assessment of genetic variation of these regions has remained elusive. Commercial SNP microarrays frequently bias against probe selection within these regions 11-13. Array comparative genomic hybridization (arrayCGH) approaches have limited power to discern copy-number differences especially as the underlying number of duplicated genes increases and the differential in copy with respect to a reference genome becomes vanishingly small 3,14,15. Even sequence-based strategies such as paired-end mapping 16,17 frequently fail to unambiguously assign end-sequences in duplicated regions, making it impossible to distinguish allelic and paralogous variation. Consequently duplicated regions have been largely refractory to standard human genetic analyses.One promising approach for assessing copy-number variation has involved measuring the depth-of-coverage of whole-genome shotgun (WGS) sequencing reads aligned to the human reference genome 1. Recent applications of this approach to next-generation sequencing technology 18-22 have provided high-resolution mapping of copy-number alterations. Most of these approaches, however, assay only the “unique” regions of the genome 21,23,24. For example, MAQ reports only unique alignments and arbitrarily selects one position in the case of tied map positions, reporting no sequence variation 23. Although it is possible to run MAQ with an option to return all possible map locations of the sequence reads, it reports only the anchoring position and no sequence variation information is returned. Here, we develop a read-mapping algorithm to rapidly assay copy-number variation and experimentally verify its ability to accurately predict copy number in some of the most complex and duplicated regions of three human genomes.
RESULTS
Algorithm development
We developed mrFAST (micro-read fast alignment search tool) to effectively map large amounts of short sequence reads to the human genome reference assembly, to calculate accurate read-depth and to return all possible single nucleotide differences within both unique and duplicated portions of the genome (Supplementary Figures 1 and 2a). We have shown previously that the ability to place reads to all possible locations in the reference genome is a key requirement to accurately predicting the absolute copy number of duplicated sequences 1.mrFAST is designed for short (>25 bp) sequence reads, employs a seed-and-extend method similar to BLAST 25, and implements a hash table to create indices (n=300 indices of 10 Mbp each) of the reference genome that can efficiently utilize the main memory of the system. The overall scheme of the mrFAST algorithm is illustrated in Supplementary Figure 1. For each read, the first, middle, and last k-mers are interrogated in the hash table to place initial seeds where k is the ungapped seed length (we set k=12 by default). A rapid version of edit distance 26 computation as described by Ukkonen 27 is then performed to extend the seed to discover all possible map locations, allowing 1-2 bp indels. We optionally exclude most of the “non-extendable” seeds, bypassing the high cost of edit distance computation. For this analysis, we selected an edit-distance threshold of two mismatches or indels to account for allelic variants and sequencing error. Moreover, querying three distinct k-mers guarantees discovery of all possible locations of reads within an edit distance of two if the length is >=35 and k=12. As a benchmark, mapping of one human genome (21-fold) against the repeat masked reference genome was achieved in 13.5 hours using a 100-CPU cluster.
Personal duplication maps
We tested the utility of mrFAST to accurately construct duplication maps by obtaining whole-genome shotgun sequence data from three human males from the NCBI short-read archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) and European Read Archive (ftp://ftp.era.ebi.ac.uk/). These included the genome sequence data of an individual of European descent (JDW) generated using 454 FLX sequence data 20 as well as two genomes generated with Illumina WGS data (a Yoruba African (NA18507) and a Han Chinese individual (YH) 18,22 (Table 1)). All loci were first masked for high copy common repeat elements (retroposons and short high copy repeats) using RepeatMasker 28, Tandem Repeats Finder29, and WindowMasker 30. We initially assessed the dynamic range response of shotgun sequence data mapped by mrFAST by determining the read-depth for a set of 32 duplicated and unique loci where copy-number status had been previously confirmed using experimental methods 1. Using these benchmark loci, we determined the average read-depth and variance for 5-kbp (unmasked) regions for autosomal and X chromosomal loci (Table 1). For each of the three libraries we found that read-depth strongly correlated with the known copy number (R2=0.83-0.90, Figure 1a). Due to the known sequencing biases of high throughput sequencing technologies in GC-rich and GC-poor regions 31, we also applied a statistical correction to normalize the read-depth based on the GC content of each window (see Methods and Supplementary Note).
Table 1
Summary statistics of the WGS libraries studied
We inspected three different WGS libraries from three individuals. Approximately 74 million 454-based reads from the JDW genome were rendered into 36-bp reads (see Supplementary Note). In total, we processed 4.9 billion reads (206 gigabases), and approximately 1.4 billion reads (~28.5%) were mapped to repeat masked human genome build35, with different levels of repeat masking (see text and Supplementary Note).
Figure 1
Correlation of predicted and known segmental duplications (NA18507)
a) mrFAST sequence read-depth per 5-kbp window along the human genome correlates well (R2=0.87) with the known copy number of duplicated sequences. b) Predicted duplication interval length versus the assembly-based length intervals of known duplications (Whole Genome Assembly Comparison; WGAC, ≥94% sequence identity) 34 shows that boundaries of duplications can be accurately predicted. A few intervals show discrepancy in boundary prediction, however, this is largely due to deletion polymorphism in the NA18507 genome within duplications (supported by arrayCGH). c) A cumulative plot of the fraction of duplication intervals detected as a function of various read-depth sequence coverage. The segmental duplication (SD) size is given in cumulative intervals (≥5 kbp, ≥10 kbp, etc.) and represents the set of intervals identified both within the public reference assembly (build35) and the Celera whole-genome shotgun sequence reads. As expected, the sensitivity of our method increases with more genome coverage; the most dramatic difference in detection is observed between 3- to 4-fold coverage.
We next assessed the ability of mrFAST read-depth to accurately predict the boundaries of known duplicated sequences. We selected a set of 961 autosomal duplication intervals (745 intervals ≥20 kbp) that were predicted both by the analysis of the human genome assembly 32 and by an independent assessment of Celera capillary WGS sequences 1,33 where the 20-kbp threshold was applied. We reasoned that duplications detected by both methods likely represented a set of true positive duplications whose boundaries would remain largely invariant in additional human genomes. We mapped each of the three WGS sequence libraries (JDW, NA18507 and YH) to the human reference genome (build35) using mrFAST and identified all intervals where at least 6 out of 7 consecutive windows showed an excess depth-of-coverage (number of reads ≥ average + 3 standard deviations). A threshold of 3 standard deviations corresponds to a diploid copy number of approximately 3.5, which means that a fraction of sequences with a hemizygous duplication may be missed by this approach. We compared the predicted sizes of intervals in each genome with the duplications predicted from the assembly34 and determined that the boundaries of known duplications could be accurately predicted (R2=0.92, Figure 1b). Since sequence coverage directly affects the power to detect duplications by read-depth, we computed the fraction of high-confidence duplication intervals that could be detected at various WGS sequence coverages (Figure 1c). Our results show that at 20-fold sequence coverage, >90% of segmental duplications larger than 20 kbp can be accurately predicted. Interestingly, the most significant increase in yield occurs between 3- to 4-fold sequence coverage suggesting that the majority of copy-number variable sequences in excess of 20 kbp in length will be accurately predicted from the 1000 Genomes Project (http://www.1000genomes.org) where at least 4-fold of WGS sequence data are available. We also performed benchmark analyses to compare the segmental duplication detection power of mrFAST with different edit distance parameters, as well as against some of the other available read mapping tools (Supplementary Note).As an independent and more sensitive test within unique regions of the genome, we compared copy-number variant (CNV) genotype calls for NA18507, with calls recently assessed by McCarroll and colleagues using the Affymetrix 6.0 platform35. We found that 250/282 (88.7%) of CNVs >10 kbp and 120/128 (93.8%) of CNVs >20 kbp were consistent between the two platforms (see Supplementary Note). In two of the most extreme cases of discrepancy, we found that the Affymetrix 6.0 genotypes likely misassigned absolute copy numbers, possibly due to an incorrect assignment of the population average genotype based on fluorescent intensities. These results highlight the potential of mrFAST read-depth to provide precise estimates of copy number across all genomic regions.We constructed duplication maps for each of the three genomes and estimated the absolute copy number of each duplication interval larger than 20 kbp in length. We considered a given segment to be duplicated within an individual if the median of estimated copy number for that individual was greater than 2.5 (diploid copy number; see Supplementary Note). We compared the extent of overlap among duplicated sequences (Figure 2, Methods) and reclassified duplicated sequences as shared or individual-specific based on the predicted copy numbers in the analysis of these three genomes (Supplementary Note). We defined a total of 725 non-overlapping duplication intervals across the three individuals that total 84.76 Mbp. Only 25 duplication intervals were not predicted in all three individuals suggesting that the vast majority (97% of the intervals and 98% by base pair) of large segmental duplications are shared (Figure 3 and Supplementary Figure 3).
Figure 2
Computational prediction and arrayCGH validation of segmental duplication copy-number differences for three human genomes
Regions of excess read-depth (average+3std) are shown in red in contrast to regions of intermediate read-depth (gray; average + 2std-3std) or normal read-depth (green, average +/− 2std).The absolute copy number and arrayCGH results for specific individual genome comparisons are shown in the context of RefSeq annotated genes. Oligonucleotide relative log2 ratios are depicted as red/green histograms and correspond to an increase and decrease in signal intensity when test/reference is reverse labeled. a) A known copy-number polymorphism on 17q21.31 that is associated with the H2 haplotype among Europeans (build35 coordinates chr17: 41,000,000–42,300,000). The JDW genome shows an increase of 1-2 copies of a 459-kbp segmental duplication mapping to 17q21.31 when compared to NA18507. b) An expansion of the complement factor H related gene family (chr1:193,350,000–193,700,000) within JDW. c) An increase in NA18507 copy number for the defensin gene cluster in 8p23.1 is confirmed by arrayCGH.
Figure 3
Validation of individual-specific segmental duplications
The number of duplicated base pairs predicted and validated in NA18507, JDW, and YH (autosomes only) are shown. The height of the bars represents the sum of computationally predicted interval lengths, and the blue color bars correspond to the experimentally validated portion. Only duplicated intervals >20 kbp were considered for validation.
Experimental validation
We designed two targeted oligonucleotide microarrays to validate predicted differences in copy number by arrayCGH. Using DNA from each of the sequenced genomes, we performed three pairwise arrayCGH experiments. We validated 68% (17/25) of duplication intervals not shared in all three individuals, which implied that only 1.1 Mbp of duplicated regions would be unique to at least to one of them (Figure 3, Supplementary Note). Interestingly, ~80% of these validated “individual-specific” duplications mapped within 2 kbp of shared human duplications suggesting that sequences adjacent to ancestral duplication blocks have the highest probability of segmental duplication. We also performed a reciprocal analysis of intervals (>20 kbp) predicted to be deleted in one or more of the individuals and confirmed 28 deletions (or 1.4 Mbp of deletion) (Supplementary Note).Irrespective of the next-generation sequence (NGS) platform, the pattern of read-depth was remarkably reproducible for 48% of the shared duplications (44711/94070 Supplementary Figure 4). However among the remaining 52% of duplications, read-depth did not correlate between individuals. This suggests that shared duplications show the greatest extremes of copy-number variation between individuals (Supplementary Figure 5). Using absolute estimates of copy number, we calculated an in silico log2 ratio for each of the three genome-wide comparisons and compared it to the experimental values as determined by arrayCGH (Figure 4, Supplementary Figure 6). Overall, we found a positive correlation with copy-number predictions (R2=~0.52–0.63 depending on the pairwise comparison). We note that the ability of arrayCGH to discriminate absolute differences diminishes as the duplication copy number increases 14.
Figure 4
Correlation between computational and experimental copy number for NA18507 vs. JDW
We computed the copy number for each shared (gray) and individual specific duplication interval (blue or orange) based on the depth-of-coverage of aligned WGS against the human reference assembly (build35). Based on this computational estimates of copy number, we calculated a predicted log2 copy-number ratio for each autosomal duplication interval >20 kbp in length (and with less than 80% of total common repeat content). These values were plotted against the experimental log2 ratios determined by oligonucleotide arrayCGH. The vertical red lines indicate the threshold used for the validated calls (see Supplementary Note).
We selected eleven duplicated loci that showed copy-number differences between the YH and NA18507 genomes and performed fluorescence in situ hybridization (FISH) analysis on interphase nuclei (Figure 5, Supplementary Note) from immortalized cell lines from YH and NA18507. These results show remarkable consistency between the absolute copy number predicted by mrFAST and FISH. For cases where the copy number is higher than 15, FISH was unable to provide a precise estimate of copy-number difference due to the technical limitations of this procedure (Figure 5d, Supplementary Note). With one exception, interphase FISH analysis showed that differences in copy number involved local changes in copy number suggesting that duplicative transpositions to new locations were exceedingly rare.
Figure 5
FISH validation
a) Sequence read-depth predicts 5 copies of this particular 17q21.31 segment in the YH genome and 2 copies (unique) in NA18507. ArrayCGH shows an increase in the YH genome and interphase nuclei FISH confirms the absolute copy-number difference between the two genomes. b) Similarly, interphase FISH confirms copy-number difference of 5 vs. 12 copies for the NPEPPS gene. c) YH is predicted and validated to have two more copies of the defensin gene family cluster of 8p23.1. d) Due to the known mosaic architecture 38 for this high copy locus (>30 copies), both arrayCGH and FISH methods fail to accurately estimate copy-number difference between NA18507 and YH genomes: despite the fact that sequence depth predicts ~2 more copies in NA18507.
Copy-number polymorphic genes
This analysis validated 68 gene families as being completely or partially copy-number variable among these three individual genomes (Supplementary Table 1). This includes a complete duplication of the complement factor H-related complex (consisting of four genes, CFHR1 through CFHR4) within the JDW genome (Figure 2b). We also confirm one additional copy of the 8p23.1 defensin gene family (DEFB103B) within the YH genome when compared to NA18057 and in NA18507 when compared to JDW. We predict about twice as many copies of the amylase (AMY1) gene family in NA18507 (n=9) and YH (n=10) when compared to JDW (n=5). As expected 7, the African genome (NA18507) showed the greatest number of CCL3L1 copies (n=7) when compared to either JDW (n=3) or YH (n=5). We also validate increases in gene segments of functional relevance. For example, we find ten fewer copies of the kringle IV domain of the lipoprotein A gene (LPA) in NA18507 (22 copies vs. 35 in JDW and 26 in YH)—a polymorphism known to be protective against coronary heart disease 36.While many of these differences are consistent with previous studies, the analysis also confirmed differences in rapidly evolving human and great ape gene families that have been previously difficult to ascertain. For example, our results suggest an increase in copy of the TBC1D3 gene family within NA18507 (29 copies) when compared to the other two genomes (JDW=26, YH=17). Similarly, we predicted absolute differences in the morpheus/NPIP copy number between different humans. Unlike FISH or arrayCGH, sequencing data provides exquisite specificity for assessing the presence or absence of individual paralogous genes. We examined three gene families (morpheus, opsin and CFHR) in more detail by identifying single nucleotide variants that distinguish the different paralogs. Despite the high degree of sequence identity among the duplicated genes, we found approximately 300 distinct paralogous sequence variants per duplicated gene (1 variant/91 bp) (Supplementary Table 2). We determined which specific duplicate genes were present in each individual, providing for the first time an accurate census of specific genes (as opposed to copy-number differences in the aggregate) (Supplementary Figure 7). Since we track all single nucleotide differences using mrFAST, we can also assess the relative proportion of disruptive stop codons providing a first-pass approximation of the functional constraint on each polymorphic gene family (Supplementary Table 3). These data suggest that the systematic identification of unique paralogous sequence variants for all duplicated gene families combined with next-generation sequence data will be a powerful approach to genotype these complex regions of the genome. Longer sequence reads, however, will be necessary to accurately assess phase.Our experimental analysis found that 97% (66/68) of the validated genic copy-number differences among the three genomes corresponded to regions annotated as segmental duplications (providing strong evidence that functional copy-number polymorphisms will be similarly biased in their genomic distribution). Since we considered only the largest (>20 kbp) regions in our initial analysis, we repeated the copy-number estimate on a gene-by-gene basis removing the length threshold. We analyzed 17,610 non-redundant RefSeq transcripts 37 (Supplementary Note) and calculated the absolute copy number for each sample based on the median depth-of-coverage for each of the corresponding gene segments in the genome (Supplementary Note). Based on this computational analysis, we predict that 3.8% of genes (662/17601) show a difference of at least one copy (Supplementary Tables 4, 5), with an average of 394 predicted gene copy-number differences between two individuals (see Table 2 for the 30 validated genes with the largest copy-number differences). In order to validate these predicted gene differences, many of which are smaller than 20 kbp, we interrogated the three samples using a customized oligonucleotide microarray targeted toward these gene regions. We conservatively validate 113 genes (Supplementary Table 6) as being variable in copy number among these three individuals (73-87 genes between two human genomes). Although there are almost certainly real copy number differences that were not validated by array-CGH (see Supplementary Note), we note that 84% (95/113) of the validated changes map to segmental duplications. Thus, genes that are duplicated (having a 50% overlap with annotated duplications of at least 90% identity) are significantly more likely to show copy-number difference (OR=135; p< 2.2e-16 Fisher's Exact Test). Moreover, these variably duplicated genes show a greater copy-number range than the non-duplicated CNV genes (median copy-number difference of 2.8 vs. median copy-number difference of 1.2). Notably, 97% (69/71) of the genes with a copy-number difference of two or greater map to previously reported segmental duplications 1,32,34 (Figure 6).
Table 2
Most variable copy number genes among 3 human genomes
Based on the estimated and validated copy numbers of genes in the RefSeq database, we calculated the maximum copy-number difference between each pair of the three genomes analyzed. The top 30 validated genes with the maximum copy-number range are shown.
Gene Name
Transcript ID
Gene Size
Duplicated Bp
JDW Copy Number
NA18507 Copy Number
YH Copy Number
Copy Number Range
DUX4
NM_033178
8205
8205
248
97
196
151
DUB3
NM_201402
1593
1593
139
186
122
64
FAM90A7
NM_001136572
18865
18865
7
44
36
38
PRR20
NM_198441
3022
3022
28
22
11
17
HRNR
NM_001009931
12112
7721
19
8
15
12
TBC1D3
NM_032258
10897
10897
26
29
17
11
TP53TG3
NM_016212
3200
3200
16
7
6
10
WASH1
NM_182905
15229
15229
26
16
20
10
ZNF717
NM_001128223
48227
24791
36
27
32
9
OR4F17
NM_001005240
918
918
18
13
9
9
C2orf78
NM_001080474
32959
26245
9
7
14
7
PCDHB8
NM_019120
2590
2508
12
6
8
6
TCEB3C
NM_145653
1877
1877
23
18
17
6
PCDHB7
NM_018940
3714
2333
10
4
7
6
OR4F16
NM_001005277
937
937
18
17
12
6
FOXD4L5
NM_001126334
3109
3108
40
43
38
6
FOXD4L4
NM_199244
3107
3106
40
43
38
6
MST1
NM_020998
4816
4776
11
6
11
5
MGC50273
NM_214461
6701
6701
33
28
32
5
AMY1A
NM_004038
8871
8871
6
11
10
5
AMY2A
NM_000699
8395
8395
6
10
11
5
POTEB
NM_207355
31415
31415
17
21
22
5
NPEPPS
NM_006310
92199
62993
4
8
3
5
NBPF1
NM_017940
49571
49533
43
48
46
5
OR2A1
NM_001005287
931
931
6
5
9
4
FAM86B2
NM_001137610
10727
10727
20
17
21
4
GOLGA6
NM_001038640
12694
12694
17
13
17
4
LOC283767
NM_001001413
9757
9757
26
22
26
4
FLG
NM_002016
23029
10606
13
9
13
4
BAGE
NM_001187
41142
40371
18
17
14
4
Duplicated bp corresponds to the number of base pairs of the gene that intersects with segmental duplications. This value is equal to the gene size for genes that are completely in segmental duplications, and smaller if the gene partially overlaps with known duplications.
Figure 6
Copy-number differences between unique and duplicated regions
The 113 genes that vary in copy number are partitioned based on the range of copy-number difference and their intersection with annotated segmental duplications. Duplicated genes show a greater extent of copy-number variation when compared to genes mapping to unique regions of the genome.
DISCUSSION
Next-generation sequencing platforms are fundamentally altering the way genetics and genomics research is performed. Compared to other methods, these platforms offer the ability to obtain an unprecedented amount of sequence information in a low-cost, high-throughput fashion. The main draw back of existing technologies is the comparably short sequence read-lengths they produce. As a result, some regions of the human genome—particularly duplication or repeat-rich regions—have already begun to be excluded as part of standard NGS analyses. We specifically designed our new mapping algorithm, mrFAST, to address this limitation. By considering all possible map locations for a read in an efficient manner, we have been able to apply the high potential of NGS to some of the most structurally complex and dynamic regions of the human genome. By including these regions, we provide one of the first comprehensive estimates of absolute copy-number differences among three human genomes.There are three major conclusions from our computational and experimental analyses. First, we show that NGS read-depth can be used to accurately predict absolute copy number, such that even multi-copy differences (5 vs. 12; see Figure 5) can be reliably predicted between different individuals. Second, our results suggest that the duplication status of the largest segmental duplications (>20 kbp in length) is largely invariant with only 3% of the duplications being specific to an individual. Third, our analysis reveals that the most extreme copy-number variation corresponds to genes embedded within segmental duplications and that most of these differences involve tandem changes in copy as opposed to duplications to new locations. We validated 113 complete genes as copy-number variable among these three individuals. Several of the validated loci are of known biomedical relevance related to color blindness (e.g. opsin variation, Supplementary Figure 2d; psoriasis, Supplementary Note; and age-related macular degeneration, Figure 2b). It is also interesting that several of the most variable human copy-number genes (Table 2, Supplementary Figures 2b, 2f) correspond to rapidly evolving gene families that emerged within the common ancestor of human and African great apes (e.g. TBC1D, LRRC37, GOLGA, NBPF). These genes correspond to the core duplicons that have been implicated in the expansion of intrachromosomal segmental duplications during hominid evolution 38. While the function of these genes is largely unknown, the ability to use NGS to accurately predict their copy number provides the ability to make genotype and phenotype correlations in these complex areas of the genome.Copy-number differences, including variable duplications of entire genes, are now recognized as making substantial contributions to variation in human phenotypes. The ability to accurately and systematically determine the absolute copy number for any genomic segment is an important first step toward a true and complete picture of individual genomes and phenotypes. In light of the sensitivity and specificity of read-depth approaches, we anticipate that this strategy will eventually replace arrayCGH based methods. The next challenge will be defining variation in the sequence content and structural organization of these dynamic and important regions of the human genome.
METHODS
Computational Analyses
Details regarding the mrFAST algorithm are described at length in the Supplementary Note. mrFAST can be downloaded from (http://mrfast.sourceforge.net) and is freely available to not-for-profit institutions. Segmental duplication maps were constructed from approximately 6X 454 sequence coverage of the JDW genome, 42X Illumina sequence coverage of NA18507 and 40X Illumina from YH. 454-based JDW WGS sequence reads (average length 266 bp) were broken into 36-bp sequences to make the read-length properties comparable among the three sequence libraries (see Supplementary Note). Sequence reads were mapped using mrFAST against the human genome reference build35 (Supplementary Note), to define duplication intervals and calculate absolute copy numbers. Read-depth was normalized with respect to their GC content via a LOESS-based smoothing technique (Supplementary Note). For cross-sample comparisons, the duplication status of each individual over each interval was reassessed based on the estimated absolute copy number (Supplementary Note).
arrayCGH Validation
We performed array comparative genomic hybridization (arrayCGH) to confirm individual-specific duplications and to confirm copy-number differences for shared duplications. A total of six experiments were performed in replicate with dye-reversals performed between test and reference: NA18507 vs. JDW, NA18507 vs. YH and JDW vs. YH. Log2 relative hybridization intensity was calculated for each probe. In this analysis, we restricted our analysis to those regions that were greater than 20 kbp in length and contained at least 20 probes. We used a heuristic approach to calculate log2 thresholds of significance for each comparison dynamically adjusting the thresholds for each hybridization to result in a false discovery rate of <1% in the control regions 39.
FISH Analysis
Metaphase spreads were obtained from lymphoblast cell lines from NA18507 (Coriell Cell Repository, Camden, NJ) and YH (Han Chinese) 18. FISH experiments were performed using fosmid clones 4 (Table 3) directly labeled by nick-translation with Cy3-dUTP (Perkin-Elmer) as described previously 40 with minor modifications (see Supplementary Note). Digital images were obtained using a Leica DMRXA2 epifluorescence microscope equipped with a cooled CCD camera (Princeton Instruments). DAPI and Cy3 fluorescence signals, detected with specific filters, were recorded separately as grayscale images. Pseudo coloring and merging of images were performed using Adobe Photoshop software. A minimum of 50 interphase cells were scored for each probe.
Authors: Enrique Gonzalez; Hemant Kulkarni; Hector Bolivar; Andrea Mangano; Racquel Sanchez; Gabriel Catano; Robert J Nibbs; Barry I Freedman; Marlon P Quinones; Michael J Bamshad; Krishna K Murthy; Brad H Rovin; William Bradley; Robert A Clark; Stephanie A Anderson; Robert J O'connell; Brian K Agan; Seema S Ahuja; Rosa Bologna; Luisa Sen; Matthew J Dolan; Sunil K Ahuja Journal: Science Date: 2005-01-06 Impact factor: 47.728
Authors: Peter J Campbell; Philip J Stephens; Erin D Pleasance; Sarah O'Meara; Heng Li; Thomas Santarius; Lucy A Stebbings; Catherine Leroy; Sarah Edkins; Claire Hardy; Jon W Teague; Andrew Menzies; Ian Goodhead; Daniel J Turner; Christopher M Clee; Michael A Quail; Antony Cox; Clive Brown; Richard Durbin; Matthew E Hurles; Paul A W Edwards; Graham R Bignell; Michael R Stratton; P Andrew Futreal Journal: Nat Genet Date: 2008-04-27 Impact factor: 38.330
Authors: Devin P Locke; Richard Segraves; Lucia Carbone; Nicoletta Archidiacono; Donna G Albertson; Daniel Pinkel; Evan E Eichler Journal: Genome Res Date: 2003-03 Impact factor: 9.043
Authors: Manuela Fanciulli; Penny J Norsworthy; Enrico Petretto; Rong Dong; Lorraine Harper; Lavanya Kamesh; Joanne M Heward; Stephen C L Gough; Adam de Smith; Alexandra I F Blakemore; Philippe Froguel; Catherine J Owen; Simon H S Pearce; Luis Teixeira; Loic Guillevin; Deborah S Cunninghame Graham; Charles D Pusey; H Terence Cook; Timothy J Vyse; Timothy J Aitman Journal: Nat Genet Date: 2007-05-21 Impact factor: 38.330
Authors: Jeffrey M Kidd; Gregory M Cooper; William F Donahue; Hillary S Hayden; Nick Sampas; Tina Graves; Nancy Hansen; Brian Teague; Can Alkan; Francesca Antonacci; Eric Haugen; Troy Zerr; N Alice Yamada; Peter Tsang; Tera L Newman; Eray Tüzün; Ze Cheng; Heather M Ebling; Nadeem Tusneem; Robert David; Will Gillett; Karen A Phelps; Molly Weaver; David Saranga; Adrianne Brand; Wei Tao; Erik Gustafson; Kevin McKernan; Lin Chen; Maika Malig; Joshua D Smith; Joshua M Korn; Steven A McCarroll; David A Altshuler; Daniel A Peiffer; Michael Dorschner; John Stamatoyannopoulos; David Schwartz; Deborah A Nickerson; James C Mullikin; Richard K Wilson; Laurakay Bruhn; Maynard V Olson; Rajinder Kaul; Douglas R Smith; Evan E Eichler Journal: Nature Date: 2008-05-01 Impact factor: 49.962
Authors: David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith Journal: Nature Date: 2008-11-06 Impact factor: 49.962
Authors: Timothy J Aitman; Rong Dong; Timothy J Vyse; Penny J Norsworthy; Michelle D Johnson; Jennifer Smith; Jonathan Mangion; Cheri Roberton-Lowe; Amy J Marshall; Enrico Petretto; Matthew D Hodges; Gurjeet Bhangal; Sheetal G Patel; Kelly Sheehan-Rooney; Mark Duda; Paul R Cook; David J Evans; Jan Domin; Jonathan Flint; Joseph J Boyle; Charles D Pusey; H Terence Cook Journal: Nature Date: 2006-02-16 Impact factor: 49.962
Authors: Fereydoun Hormozdiari; Iman Hajirasouliha; Andrew McPherson; Evan E Eichler; S Cenk Sahinalp Journal: Genome Res Date: 2011-11-02 Impact factor: 9.043
Authors: Andy Itsara; Lisenka E L M Vissers; Karyn Meltz Steinberg; Kevin J Meyer; Michael C Zody; David A Koolen; Joep de Ligt; Edwin Cuppen; Carl Baker; Choli Lee; Tina A Graves; Richard K Wilson; Robert B Jenkins; Joris A Veltman; Evan E Eichler Journal: Am J Hum Genet Date: 2012-04-06 Impact factor: 11.025
Authors: Aaron R Quinlan; Royden A Clark; Svetlana Sokolova; Mitchell L Leibowitz; Yujun Zhang; Matthew E Hurles; Joshua C Mell; Ira M Hall Journal: Genome Res Date: 2010-03-22 Impact factor: 9.043