
Using population admixture to help complete maps of the human genome.

Giulio Genovese, Robert E Handsaker, Heng Li, Nicolas Altemose, Amelia M Lindgren, Kimberly Chambert, Bogdan Pasaniuc, Alkes L Price, David Reich, Cynthia C Morton, Martin R Pollak, James G Wilson, Steven A McCarroll.

Abstract

Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces using the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning 4 million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified 8 new large interchromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed at the RNA level and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.

Year: 2013 PMID: 23435088 PMCID: PMC3683849 DOI: 10.1038/ng.2565

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330

Physical maps of the human genome, including the sequence of most of its euchromatic portions[1,2], are basic resources in human genetics and genomics research: they provide the framework for analysis of sequence data; and they enable genome-scale analysis of single nucleotide polymorphisms (SNPs), copy number variants (CNVs), epigenetic phenomena, and gene expression. Yet physical maps of the human genome remain incomplete. Almost 30 million base pairs (Mbp) of euchromatic genome sequence that are apparently human – observed in human whole-genome sequence data[3,4], containing human expressed sequence tags[5,6] (ESTs), and homologous to other mammalian genome sequences – are either absent from, or have no assigned locations in, current assemblies of the human genome[7,8]. These “missing pieces” of the reference human genome are a likely source of mistaken inference in today's analyses of genome sequence data[9]. Sequence reads arising from the missing pieces may be discarded as non-alignable, or incorrectly assumed to arise from paralogous sequences in the known, assembled part of the human genome. Sequences missing from the reference human genome might also help answer questions in human genetics research, such as the source of the genetic signals that have been ascertained (but not yet fine-mapped to causal variation or causal genes) by linkage, association, and CNVs. Here we describe an approach for “admixture mapping” the human genome's missing pieces at megabase pair scales, by utilizing the patterns of sequence variation that have been created by isolation and subsequent re-mixture of human populations. We report the successful mapping of ~5Mbp of unplaced human euchromatic sequences, including many protein-coding genes. 
We find that most of these sequences are euchromatic islands within the genome's heterochromatic oceans, including centromeres and the short arms of the acrocentric chromosomes, and that they almost always consist of segmental duplications (sometimes recent, sometimes millions of years old) of sequence present elsewhere in the reference genome.

An approach for admixture mapping unplaced sequence

The construction of large-scale genome models (“assemblies”) utilizes physical sequence overlaps between genomic clones[10]. Clones are assembled into larger scaffolds based on overlapping sequences at their ends. By contrast, mapping based on statistical relationships among variants can provide information that is complementary to physical mapping, as it does not require a continuous path of sequences to be cloned and uniquely assembled. Before physical mapping was feasible, linkage among alleles was used to construct the first genetic maps of the human genome based on restriction fragment length polymorphisms[11,12], and later to build and improve genetic maps based on microsatellite markers[13,14]. A unique kind of long-range information – finer in resolution than linkage in families, yet longer in reach than linkage disequilibrium (LD) in populations – is present in many of the world's admixed populations. Whenever human populations have been reproductively isolated for long periods of time (such as Africans and Europeans) and then re-mixed (such as among African Americans), the genomes of the descendants are mosaics of segments that derive from ancestors from the two ancestral populations. The divergence of the sequences in the ancestral populations gives rise to sequence variation that is informative about the ancestry of each segment. Long-range “admixture LD” has been used to map genetic factors that segregate at different frequencies in different populations[15,16] and to identify genomic sites of recombination in African Americans[17,18]. We reasoned that population admixture could also be used to map the locations of unmapped human genome sequences. 
Provided that the sequence in a genomic missing piece is variable, that this variation was subject to genetic drift, and that the extent of this drift is known in the two ancestral populations, we could infer the ancestral origin of a missing piece – whether it has been inherited from each individual's European or African ancestors – with varying levels of statistical certainty, in a large panel of admixed individuals. By comparing such ancestry profiles for the genome's missing pieces to similar determinations across the known/mapped/assembled majority of these individuals’ genomes, each missing piece could in principle be connected to the genomic location at which it resides, even if we lack a continuous path of cloned, assembled sequence with which to make such a connection. Specifically, we can test ancestry-informative SNPs for correlation between their genotypes and inferred local ancestry across the genome, estimated using available genome-wide genotypes[19]. This is different from, and potentially much more powerful than, detecting LD between genotypes at two SNPs, since the correlation between genotypes and local ancestry is expected to be much stronger (than that between SNPs) at genetic distances up to a few cM, and the distance between unmapped “missing pieces” and the nearest parts of the reference genome may be substantial. Furthermore, we estimated statistical mapping power from allele frequencies in the ancestral populations and found that it was substantial, even for admixed population samples of only a few hundred individuals. Thus, admixture mapping could in principle connect sequences that are physically farther apart than most genomic clones (20-180kbp) and LD blocks (15-50kbp).
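
As a rough illustration of why allele frequencies in the ancestral populations determine mapping power, the sketch below computes the expected squared correlation between genotype and local ancestry for a SNP in a two-way admixed sample. It is a simplified model and not the power calculation used in this study: it assumes a single admixture proportion shared by all individuals, independent haplotypes, and error-free ancestry calls. The function names and the scaling of the test statistic (sample size times r²) are illustrative assumptions.

```python
def expected_r2(p_afr, p_eur, afr_fraction):
    """Expected squared correlation between allele count and local ancestry.

    Simplified model (an assumption of this sketch): each haplotype is of
    African ancestry with probability `afr_fraction`, and carries the allele
    with probability `p_afr` or `p_eur` depending on its ancestry.
    """
    # Allele frequency in the admixed population.
    p_bar = afr_fraction * p_afr + (1.0 - afr_fraction) * p_eur
    var_allele = p_bar * (1.0 - p_bar)
    if var_allele == 0.0:
        return 0.0  # monomorphic in the admixed sample: nothing to map
    cov_sq = (afr_fraction * (1.0 - afr_fraction) * (p_afr - p_eur)) ** 2
    var_ancestry = afr_fraction * (1.0 - afr_fraction)
    return cov_sq / (var_ancestry * var_allele)


def expected_chi2(r2, n_samples):
    """Approximate expected test statistic: sample size times r squared."""
    return n_samples * r2


if __name__ == "__main__":
    # Hypothetical SNP: frequency 0.8 in the African and 0.2 in the European
    # ancestral population, with 80% African ancestry in the admixed sample.
    r2 = expected_r2(0.8, 0.2, 0.8)
    print(round(r2, 3), round(expected_chi2(r2, 380), 1))
```

Under this model the expected statistic grows linearly with sample size, which is consistent with the observation above that a few hundred admixed individuals can suffice for SNPs with large frequency differences between the ancestral populations.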

RESULTS

Sources of the missing pieces

We used three sources of unplaced genome sequence: (i) the current reference genome (hg19), which contains 59 unplaced contigs (~5Mbp of euchromatic sequence) for which the correct location is either only known at the chromosomal level or not known at all; (ii) the HuRef genome[20], assembled by random shotgun sequencing of a single individual, containing an even larger number of unplaced scaffolds (~3.5Mbp of euchromatic sequence in 28 scaffolds >100kbp and ~7Mbp of euchromatic sequence in 698 scaffolds >10kbp); and (iii) sequence from bacterial artificial chromosome (BAC) and fosmid clones available from GenBank[21] (Online Methods).

Mapping the human genome's missing pieces

If an ancestry-informative SNP resides on an unmapped contig, we can map the location of the contig by admixture mapping the SNP. We (i) aligned all unmapped sequence reads from the 1000 Genomes Project[22,23] to unplaced scaffolds from HuRef, (ii) identified polymorphic sites across these unplaced sequences, and (iii) computed genotypes at each locus in all European (CEU) and West African (YRI) samples (Online Methods). We selected 314 ancestry-informative SNPs whose genotypes were estimated to have Pearson correlation r²>15% with local ancestry. We then genotyped these SNPs in a cohort of 380 African American participants from the Jackson Heart Study[24] (JHS), selecting this sample size based on initial analyses of the predicted power to map each SNP as a function of the number of available genotypes (Online Methods). We successfully admixture-mapped 139 SNPs, assigning locations for 70 previously unlocalized scaffolds. We never observed SNPs from the same scaffold mapping to different locations, as could be the case if the scaffold were itself misassembled. Sequences mapped by this approach comprised a total of ~4Mbp of euchromatic sequence that had not been included or mapped in hg19. We describe the properties of these mapped locations below.

Identifying additional, cryptic missing pieces

An additional set of cryptic missing pieces might be missing entirely from human genome reference sequences (i.e. might not even be described as unlocalized sequences, nor present in HuRef) but exist instead as cryptic segmental duplications (or paralogs) of known genomic sequences and have been incorrectly assumed to represent the same genomic sequence as their known paralogs. We reasoned that admixture mapping could also be used to identify cryptic segmental duplications. A SNP that is annotated in the assembled part of the human genome might in fact exist on a cryptic paralogous sequence elsewhere. Therefore, identification of SNPs that admixture-map to a different genomic location than their annotated location might indicate the presence of these SNPs at another genomic location on a cryptic segmental duplication. To identify mismapped SNPs we analyzed genome-wide SNP data from two large African American cohorts. Among the 906,703 SNPs from the Affymetrix 6.0 array genotyped in ~7,800 individuals from the Candidate gene Association Resource (CARe) cohort[25] and the 566,714 SNPs from the Illumina HumanHap550 array genotyped in ~1,800 individuals from the ICDB cohort, we identified, respectively, 121 and 15 SNPs that admixture-mapped to genomic locations far from their HapMap[26] annotations of physical location. Approximately half of these mismapped SNPs belonged to a single region, a ~360kbp segmental duplication from 16q22.2 to 1q21.1 involving the HYDIN gene[27-29], confirmed by fluorescence in situ hybridization (FISH) and previously found to give rise to false genome-wide association signals at 16q22.2 that in fact arise from true association at the Duffy locus at 1q23.2[30]. Excluding the HYDIN paralog, incorrect mapping for ~30 SNPs can be explained by known segmental duplications[31-37], while for the remaining ~40 mismapped SNPs, the most likely explanation is that they reside on sequence missing from the reference genome. 
(Among the ~30 SNPs that we simply re-mapped from one known segmental duplication copy to another, ten corresponded to sites previously used as single unique nucleotides[38] (SUNs) to distinguish among known segmental duplications. By definition, none of the re-mapped SNPs with which we identified novel segmental duplications corresponded to SUNs.) To understand the relationships between these cryptic paralogs and unplaced scaffolds from large sequencing efforts, we cross-referenced the locations of these SNPs with alignments of unlocalized sequence from HuRef and GenBank. We identified 18 sequences >40kbp each containing one or more of the mismapped SNPs. Twelve of these 18 regions (spanning ~1.3Mbp of euchromatic sequence) could not be explained by segmental duplications already annotated in the reference genome; these indicate the presence of cryptic segmental duplications. To critically evaluate these findings by an independent method, we utilized the principle that cryptic segmental duplications should give rise, for SNPs called from sequencing data, to excess heterozygosity that does not follow simple models of Hardy-Weinberg equilibrium between pairs of alleles. We searched for such a signal – annotated SNPs that behave more like paralogous sequence variants (PSVs) – in data from the 1000 Genomes Project pilot, and were able to confirm all of these regions (Online Methods). For 8 of these 12 cryptic segmental duplications we could find no mention in the literature. We further confirmed six of them by inter-chromosomal LD analysis using HapMap genotypes. We determined for each region whether the alternate allele for any of the mismapped SNPs was present in any of the BAC clones aligning to that region, by aligning sequences from BAC clones retrieved from GenBank to the hg19 reference genome. 
For SNPs in six of these regions we could identify BAC clones carrying the alternate allele, suggesting that these clones harbor the sequence where these SNPs actually reside. For one of these regions containing the gene PRIM2, further analysis indicated an intra-chromosomal duplication in the pericentromeric region of chromosome 6 and an additional inter-chromosomal duplication in the pericentromeric region of chromosome 3. We confirmed the existence of this triplication by the existence of excess sequence read depth across this region in low-coverage data from the 1000 Genomes Project and by FISH analysis. We also observed that the copy in the reference genome is a hybrid of the two copies on chromosome 6 due to a misassembly.
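
The excess-heterozygosity principle used above can be illustrated with a simple Hardy-Weinberg chi-square check. The sketch below is only an illustration with hard genotype counts and hypothetical numbers; the analysis in this study used probabilistic genotypes from low-pass sequencing (Online Methods). When a fixed difference between two paralogs is called as a SNP, reads from both copies pile up at one reference position and nearly every individual appears heterozygous.

```python
def hwe_chi2(n_ref_hom, n_het, n_alt_hom):
    """Chi-square statistic for Hardy-Weinberg equilibrium at a biallelic site.

    Returns (chi2, heterozygote_excess), where the excess is the observed
    minus the expected number of heterozygotes.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    # Skip classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, n_het - expected[1]


if __name__ == "__main__":
    print(hwe_chi2(25, 50, 25))  # an ordinary SNP in equilibrium
    print(hwe_chi2(0, 100, 0))   # a PSV-like site: everyone looks heterozygous
```

A large statistic combined with a positive heterozygote excess is the PSV-like signature; a large statistic with a heterozygote deficit would point to a different cause.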

Pericentromeric locations of the missing pieces

Despite the fact that most of the 300 or so gaps[8] in the reference human genome exist in interstitial regions, most of the sequence we were able to localize mapped not to interstitial gaps but to cytogenetically-defined heterochromatic regions of the human genome. Among the mapped scaffolds, 57 of 70 mapped to pericentromeric regions. Among the re-mapped SNPs identifying cryptic segmental duplications, 40 of 70 mapped to pericentromeric regions. (In all these cases the resolution of the mapping was limited to the pericentromeric region identified.) We sought to confirm these pericentromeric mappings using both published and new cytogenetic data. Among the 70 scaffolds we mapped successfully, 17 were among 29 scaffolds that were previously analyzed by FISH (Supplementary Information of [39] and Supplementary Table S8 of [20]). All 17 of these admixture-based mappings were consistent with one of the often multiple locations suggested by FISH. While confirmatory, this result also emphasizes the discerning power of admixture mapping over techniques based on hybridization, as the latter can yield ambiguous results when clones contain segmental duplications or other kinds of repeats. We also performed additional FISH experiments to critically evaluate the mappings of five novel cryptic paralogous sequences for which no previous FISH data existed. In all (5/5) cases, FISH confirmed the presence of the additional copy in the predicted pericentromeric region (Online Methods). A further prediction of these mappings to pericentromeric regions involves the sequence content of the respective scaffolds. If these genomic missing pieces are indeed euchromatic islands within heterochromatic oceans, then they might frequently contain heterochromatic beaches consisting of the satellite sequences associated with human centromeres. To evaluate this prediction, we measured the amount of sequence classified as heterochromatic satellite on each scaffold. 
The great majority (50/57) of the scaffolds that admixture-mapped to pericentromeric regions contained more than 5% satellite sequence (Online Methods), compared with almost none (1/13) of the scaffolds that admixture-mapped to interstitial regions (p=0.003). Another prediction of these pericentromeric mappings is that, given earlier data indicating that recombination within centromeres is likely to be heavily repressed[40], scaffolds mapping to the same pericentromeric regions might show LD with one another. We identified pairs of SNPs (from distinct scaffolds) with significant linkage disequilibrium not due to admixture, and were able to identify ~500 SNPs from distinct scaffolds mapping to the same genomic regions. In no instance did these LD-based relationships among scaffolds disagree with our mappings from admixture. To understand how the pericentromeric missing pieces relate to the known human genome, we aligned their sequences to hg19; virtually all scaffolds mapping to pericentromeric regions were found to consist of one or more segmental duplications of mapped euchromatic sequence, with 2-5% sequence divergence. This suggests that a large fraction of these sequences arrived at their current locations by a process of segmental duplication in primate ancestors[41]. Our mapping of these cryptic segmental duplications to centromeric regions is consistent with an earlier finding that most chromosome arms (35 of 43) have an increasing number of known interchromosomal duplications in the proximity of the centromeres[42]; both results appear to reflect a tendency of interchromosomal duplications to deposit sequence at and around centromeres.

Are the missing pieces copy number variable?

Although the cryptic, pericentromeric euchromatic regions described here have not been purposefully interrogated in earlier CNV studies, they may have been indirectly interrogated via assays that targeted paralogous sequences in the known, assembled parts of the human genome. This seems the likely scenario, as almost all of the mismapped SNPs we identified from genotyping arrays (63 of 70, not including the HYDIN locus) fall within CNVs reported in the Database of Genomic Variants (DGV)[43], despite the fact that DGV CNVs together cover less than a third of the human genome. Given the sequence divergence over the identified cryptic paralogs (often greater than 2%), these additional copies are likely to have fixed in the ancestors of all humans. Identifying CNVs over these sequences at a greater rate than for the rest of the genome may therefore indicate the instability of sequences in pericentromeric regions rather than a persistent state of polymorphism of these additional copies in the human population after the duplication event. To evaluate CNV of four selected paralogous region-pairs, we analyzed read depth of coverage and paralogous sequence variation using data from the 1000 Genomes Project (Online Methods). We identified common CNVs affecting the segmental duplications from the 2p22.2, 4q35.2, and DUSP22 loci, and we found evidence for CNVs affecting either of the PRIM2 cryptic paralogs. In each case we could confirm using PSVs that the cryptic paralogs, rather than the paralogs present in the reference genome, account for the observed copy number variation, consistent with CNVs having arisen in the pericentromeric paralogs.

Expression of protein-coding genes from pericentromeric regions

Cryptic, pericentromeric paralogs of known protein-coding genes could in principle be either pseudogenes or expressed, intact genes. To test whether cryptic paralogs of coding genes are expressed at an RNA level, we analyzed RNA-seq data from the Human BodyMap 2.0 project. We focused on reads aligning to the genes DUSP22, PRIM2, HYDIN, MAP2K3, and KCNJ12, all of which appear to have cryptic paralogs in pericentromeric regions. To distinguish RNA arising from reference gene copies from RNA arising from the cryptic paralogs, we focused on reads covering PSVs identifiable from genomic DNA sequence (many of which were previously mis-annotated as SNPs); this makes it extremely likely that sequence differences observed in RNA have a genomic origin (Online Methods). We identified expressed RNA for all of the paralogs except MAP2K3. The expression of cryptic, pericentromeric gene copies showed several kinds of relationship to expression of their paralogs. Both DUSP22 and its recently duplicated paralog were expressed and exhibited similar distribution across tissues. In contrast, the cryptic paralogs of PRIM2, which contain only exons 6-14 of the original transcript, give rise to shorter transcripts that are expressed exclusively in brain and testes. For HYDIN, which is expressed in brain and several other tissues, this analysis indicated that the cryptic paralog at 1q21.1 is expressed in the brain, consistent with its earlier observation in a brain cDNA library[28]. For KCNJ12 we could detect expression of the paralog KCNJ18 in testis tissue, though we did not detect this KCNJ12 paralog in skeletal muscle where KCNJ18 is known to be expressed at a level sufficient to cause a phenotype[44]. The tissue specificity observed for paralogous copies is also evidence that these observations are not the result of sequencing errors at putative PSV sites. 
These results suggest that many of these cryptic, pericentromeric gene paralogs are expressed genes, and that their expression patterns can differ from those of their known paralogs.

DISCUSSION

We have described a population-based approach for helping to assemble the rest of the euchromatic human genome, even when missing pieces are separated from known euchromatic sequence by extensive heterochromatic sequence. Because our approach uses data that are widely available or are quickly becoming so, its power will increase quickly in the coming years. We anticipate that this approach will help complete physical maps of the human genome. Analysis of ancestry-informative markers in unlocalized scaffolds can be used to map the genomic locations of these scaffolds with a physical resolution comparable to that of FISH but with unambiguous mapping to individual loci, and in a highly scalable way that will become inherently more powerful as sequence data sets grow. (Many aspects of the genome assembly will continue to require other methods – for example, our approach does not determine the physical orientation of novel sequence with respect to the chromosome.) Using this approach we mapped ~4Mbp of unplaced euchromatic sequence, most of which we found to be embedded in the heterochromatic regions of the genome. These regions are not included in the current human reference genome and, with two exceptions, they do not overlap with any of the current patches included in the latest revision. One limitation of our approach is that it relies on novel sequence having been correctly assembled and distinguished from paralogous sequence. Most sequences from HuRef unplaced scaffolds have a divergence greater than 2% with their closest paralogs; due to limitations of shotgun sequencing assembly, paralogous segments with <2% sequence divergence are likely to be under-represented in human genome assemblies[45]. Unfortunately, due to their short read lengths, current whole-genome next-generation sequencing approaches do not provide better assemblies for such regions than those obtained with capillary-based sequencing approaches[46]. 
Nonetheless, we showed that admixture mapping the SNPs ascertained in such regions can still allow the discovery and mapping of these cryptic paralogous sequences. Our results have several potential implications for disease gene mapping in humans, particularly wherever genetic signals map near pericentromeric regions, assembly gaps, and segmental duplications. Copy number variations (CNVs) frequently straddle or are flanked by ambiguous regions of the genome assembly. For example, deletions and duplications at 1q21.1 reported to affect ~1.5Mbp of genomic sequence associate with cardiac developmental defects[47], schizophrenia[48,49], mental retardation, autism, congenital anomalies[50], and abnormal head size[51]. Fully defining the gene content of these CNVs will require interrogating the missing sequence hidden in the assembly gaps at 1q21.1. Some regions implicated in genome-wide association studies may require re-analysis in light of the results here. For example, human height associates with rs17511102 and other markers in a lincRNA-containing segment of 2p22.2[52] for which we found a cryptic segmental duplication (and paralogous lincRNA) in the pericentromeric region of chromosome 22. Following up this association will require that markers throughout the region be re-assigned to the correct paralogous gene copies. The gene SERPINB6 was associated with a clinical phenotype through homozygosity mapping, by the identification of a homozygous region terminated by the heterozygous genotype of SNP rs7762811[53]; our results suggest that this SNP is incorrectly assigned to 6p25.3 and in fact resides at 16p11.2, leading to a slight underestimation of the correct homozygous region. The genes affected by cryptic segmental duplications may be functionally important and critical to include and explicitly model in exome sequencing studies. 
For example, mutations in KCNJ18, a gene missing from the reference genome, have been shown to cause thyrotoxic hypokalemic periodic paralysis[44]. An admixture mapping study found that African Americans with multiple sclerosis have elevated European ancestry around the centromere of chromosome 1[15], a region to which our work has assigned more than a megabase of novel sequence. We showed that CNVs are more common over cryptic paralogs missing from the reference genome, most likely due to the physical instability of pericentromeric regions. We also showed that paralogous genes in these cryptic pericentromeric duplications are transcribed, sometimes with patterns of expression that diverge from those of their paralogs, and therefore potentially serving unique biological functions. The presence of duplicated regions complicates genome assembly as well as SNP and CNV discovery. Notably, HYDIN and PRIM2 are among the most difficult genes to reconstruct using de novo assembly from short sequence reads[54]. PRIM2 and KCNJ12 are among the genes with the largest number of mis-identified non-synonymous SNPs[55], most likely due to identification of PSVs as SNPs. Approximately 6% of the human genome reference is currently considered unreliable for variant discovery by the 1000 Genomes Project[23], due to dearth or excess of read coverage or poor alignment of sequence reads. Most of the regions we identified as harboring a cryptic segmental duplication fall in this inaccessible part of the human genome. While waiting for a more complete version of the human genome reference, the 1000 Genomes Project now aligns sequence data to an expanded genome reference that includes additional unlocalized sequences (termed “decoy sequences”), to reduce false alignments in regions with cryptic segmental duplications. 
These additional sequences consist mainly of sequenced clones discarded by the Human Genome Project and sequence from the HuRef assembly (~30% of decoy sequences consist of HuRef unlocalized scaffolds). Of course, the eventual goal of such projects will be the alignment of all human sequence reads to their actual physical locations. In completing maps of the human genome, the important remaining challenges include mapping the human genome's structure at all scales, fully cataloging the genome's sequence content, and appreciating how sequences are ordered and arranged along chromosomes. As the scientific community works toward a complete reference assembly of the human genome[56], analysis of genome-wide data from admixed populations will add unique value and help complete our understanding of the human genome's structure and evolution.

URLs

HuRef unplaced scaffolds: ftp://ftp.tigr.org/pub/data/huref/
GenBank database: ftp://ftp.ncbi.nih.gov/genbank/
Database of Genotypes and Phenotypes (dbGaP): http://www.ncbi.nlm.nih.gov/gap
Illumina iControlDB: http://www.illumina.com/science/icontroldb.ilmn
HapMap inter-chromosomal LD: ftp://ftp.ncbi.nlm.nih.gov/hapmap/inter_chr_ld/
Illumina Human BodyMap 2.0 data: http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611
Decoy sequences: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/

ONLINE METHODS

Alignment of HuRef genome and GenBank BAC and fosmid clones

To align the HuRef genome and sequenced BAC and fosmid clones to the hg19 reference genome we first downloaded all available sequence from The Institute for Genomic Research and GenBank websites (the scaffold-not-in-chromosome.fasta file from ftp://ftp.tigr.org/pub/data/huref/ and all gbpri* files from ftp://ftp.ncbi.nih.gov/genbank/), and then used BWA[57] (using bwa bwasw) for the alignments against hg19. We identified repeats classified as satellite sequences over HuRef unplaced scaffolds using RepeatMasker[58]. Satellite sequence consists of large arrays of tandemly repeated units of non-coding DNA. The amount of satellite and missing sequence is reported for each unplaced scaffold. To identify within these resources the presence of cryptic segmental duplications, that is, sequence missing from the current reference genome but present in a diverged duplicated form, we aligned all available contigs from HuRef and GenBank clones against hg19.

Alignment and variant calls for 1000 Genomes Project data

For genotyping from sequence reads, we selected all the CEU and YRI samples available in the 1000 Genomes Project[22,23]. Unmapped reads were aligned against the HuRef unplaced scaffolds using BWA[59] (using bwa aln/sampe). Genotype calling in the unplaced contigs was performed using the Genome Analysis Toolkit[60] (GATK), using default settings for the UnifiedGenotyper walker.

Strategy for admixture mapping

To map the location of a SNP, the genotypes were first adjusted by regressing on the amount of global West African ancestry for each sample. The adjusted genotypes were then tested for correlation with local ancestry across the genome using a one-tailed Pearson correlation test. If the correlation of the genotypes with global West African ancestry is positive, a right-tailed test is chosen; otherwise a left-tailed test is chosen. The location corresponding to the smallest p-value is then recorded for each SNP, together with the location corresponding to the smallest p-value on a different chromosome. All these steps were performed using custom MATLAB (2011b, The MathWorks, Natick, MA) scripts. One might expect genotypes at SNPs that lie in paralogous sequences, only one copy of which is polymorphic, to be often incorrect, because the homozygous state for the alternate allele cannot be correctly inferred; this leads, among other things, to departures from Hardy-Weinberg equilibrium. This is not always the case for genotyping arrays, however, as genotyping of SNPs is often based on a two-dimensional Gaussian mixture model over summarized probe intensities for each of the two alleles[61], enabling the correct distinction of the three possible genotypes even without modeling the presence of a cryptic paralog.
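
The mapping steps described above can be sketched in Python as follows. This is a minimal illustration, not the custom MATLAB code used in the study: the data layout (one list of local-ancestry values per locus) and all function names are assumptions, and the one-tailed p-value uses the Fisher z approximation rather than an exact test.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def residualize(y, x):
    """Residuals of the simple linear regression y ~ x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def one_tailed_p(r, n, right_tail):
    """Approximate one-tailed p-value via the Fisher z transform (needs n > 3)."""
    r = max(min(r, 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    if not right_tail:
        z = -z
    return 0.5 * math.erfc(z / math.sqrt(2))


def admixture_map(genotypes, global_ancestry, local_ancestry):
    """Rank loci by one-tailed p-value; the first entry is the best location.

    `local_ancestry` maps a locus label to per-sample local ancestry values
    (for example, the number of West African chromosomes at that locus).
    """
    right_tail = pearson(genotypes, global_ancestry) > 0
    adjusted = residualize(genotypes, global_ancestry)
    n = len(adjusted)
    ranked = [
        (locus, one_tailed_p(pearson(adjusted, anc), n, right_tail))
        for locus, anc in local_ancestry.items()
    ]
    return sorted(ranked, key=lambda item: item[1])
```

Returning the full ranking rather than only the top hit makes it straightforward to also report the best location on a different chromosome, which the study used to calibrate the false discovery rate.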

SNP selection, sample selection, and Sequenom genotyping

From all detected SNPs in hg19 unplaced contigs and HuRef unplaced scaffolds we filtered out SNPs at loci for which the number of reads with mapping quality 0 was at least 4 and at least 10% of reads. We also filtered out clusters of four SNPs within a window size of 10bp. The rationale is that in loci with ambiguous alignment, it is possible to call SNPs which actually belong to a paralogous region of the genome. Variants called in loci where many SNPs cluster together have a higher chance of being an artifact of misaligned reads originating from paralogous regions that are not present in the reference genome used for the alignment. This methodology maximizes the chances that a SNP belongs to the unplaced scaffold where it is called. From the filtered list, we chose for each contig up to 7 ancestry-informative SNPs whose genotypes were estimated to have a Pearson correlation coefficient with the amount of local European ancestry satisfying r²>15%. SNPs were further filtered to fit within 10 Sequenom plexes, prioritizing degree of correlation with ancestry. We selected 380 samples from the Jackson Heart Study (JHS)[24], which had been genotyped with the Affymetrix 6.0 array and analyzed with HAPMIX[62]. To achieve the maximum possible mapping resolution, we exclusively selected samples with at least 62 detected crossovers between ancestries (maximum was 115). Most likely due to the repetitiveness of the flanking sequences for which primers were designed, 86 assays failed completely; of the remainder, 53 failed the Hardy-Weinberg equilibrium test (p<10⁻⁶), and 175 passed. Nevertheless, we could still reliably identify the location of 139 SNPs (Pearson correlation test p<10⁻⁶), 106 of which had passed and 33 of which had failed the Hardy-Weinberg equilibrium test, showing that SNPs with unreliable genotypes can still be informative for mapping purposes. 
By analyzing for each successfully mapped SNP the best correlation between adjusted genotype and local ancestry on chromosomes other than the one where the SNP mapped, we estimated that the selected conservative threshold of 10-6 for the p-value gives a false discovery rate lower than 1%.
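The two read-level SNP filters described in this section can be sketched as follows. The record layout (dictionaries with `contig`, `pos`, `mq0`, and `depth` keys) is hypothetical, and two readings of the text are assumed: that "at least 10%" refers to the fraction of reads covering the locus, and that four SNPs fall "within a window of 10bp" when the first and last are fewer than 10 bp apart.

```python
def passes_mq0_filter(mq0_reads, total_reads, min_count=4, min_fraction=0.10):
    """Reject a SNP when mapping-quality-0 reads are both numerous and frequent."""
    if total_reads == 0:
        return False
    flagged = mq0_reads >= min_count and mq0_reads / total_reads >= min_fraction
    return not flagged

def clustered_positions(positions, cluster_size=4, window=10):
    """Positions belonging to any run of `cluster_size` SNPs inside one window."""
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster_size + 1):
        if pos[i + cluster_size - 1] - pos[i] < window:
            flagged.update(pos[i:i + cluster_size])
    return flagged

def filter_snps(snps):
    """Apply both filters; `snps` is a list of dicts (hypothetical layout)."""
    by_contig = {}
    for s in snps:
        by_contig.setdefault(s["contig"], []).append(s["pos"])
    clustered = {c: clustered_positions(p) for c, p in by_contig.items()}
    return [
        s for s in snps
        if passes_mq0_filter(s["mq0"], s["depth"])
        and s["pos"] not in clustered[s["contig"]]
    ]
```

Both filters are conservative by design: they discard some genuine SNPs in exchange for a lower chance of carrying forward a variant that really belongs to a paralogous sequence.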

Analysis of cryptic paralogs from 1000 Genomes Project pilot data

To identify regions with an excess of PSVs suggesting the presence of large cryptic segmental duplications, we searched for SNPs across the reference genome whose probabilistic genotype from 1000 Genomes Project pilot low-pass sequencing data failed the Hardy-Weinberg equilibrium test[63] (using bcftools view -c). We identified variants that failed the equilibrium test (p<10⁻⁶) in CEU and YRI samples, grouped them together if they were <5kbp apart (using custom MATLAB scripts), and listed all resulting regions >40kbp.
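The grouping step can be illustrated with a short Python sketch. It is not the MATLAB code used in the study; the function name and the (chromosome, position) input format are assumptions, and the input is taken to be the list of variants that already failed the equilibrium test.

```python
def group_into_regions(variants, max_gap=5_000, min_length=40_000):
    """Chain variants closer than `max_gap` and keep regions longer than `min_length`.

    variants: iterable of (chrom, pos) tuples.
    Returns a list of (chrom, start, end, n_variants) tuples.
    """
    regions = []
    current = None  # [chrom, start, end, count]
    for chrom, pos in sorted(variants):
        if current and chrom == current[0] and pos - current[2] < max_gap:
            # Extend the open region to this variant.
            current[2] = pos
            current[3] += 1
        else:
            if current:
                regions.append(tuple(current))
            current = [chrom, pos, pos, 1]
    if current:
        regions.append(tuple(current))
    return [r for r in regions if r[2] - r[1] > min_length]
```

Because variants are chained pairwise, a region can be far longer than 5kbp as long as no gap between consecutive failing variants reaches that distance.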

Fluorescence in situ hybridization

Peripheral blood mononuclear cells were stimulated with phytohemagglutinin and harvested. Metaphase spreads were prepared by standard protocols. Fosmid clones spanning the regions of interest were selected for FISH mapping using the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/). Fosmids were labeled with either SpectrumOrange or SpectrumGreen conjugated dUTP using a nick translation kit (Abbott Molecular, Des Plaines, IL). Labeled pairs were hybridized overnight to metaphase chromosome preparations. Following 4× SSC/0.1% Tween, 2× SSC/0.3% Tween, and phosphate-buffered detergent washes, chromosomes were counterstained with DAPI and analyzed by epi-fluorescence with a Zeiss Axioplan2 microscope (Thornwood, NY) and Applied Imaging CytoVision software (Santa Clara, CA).

Analysis of sequence read depth from 1000 Genomes Project data

To assess the copy number variability of the missing reference segments, we used an updated version of Genome STRiP[64] to analyze read depth. Normalized read depth was measured by comparing the number of DNA fragments with sequencing reads aligned to the reference genome in a given region to the expected read depth per haploid copy based on (i) the total sequencing depth for each sample, (ii) the alignability of each position, based on whether it would be uniquely mapped by a perfect 36bp read, and (iii) sequencing bias due to GC content. We performed normalization for GC-bias empirically, similar to [38]. We first identified a 588Mbp subset of the autosomal reference sequence with no known evidence of copy number variation to use as a baseline. We removed all positions within 200bp of annotated CNV regions listed in DGV, segmental duplications listed in the UCSC browser, repeats annotated by RepeatMasker[58], and assembly gaps, yielding a subset that is highly likely to be copy number invariant in the majority of people. This reference subset was divided into 400bp windows, stratified by GC fraction within each window, and the observed read depth at each GC fraction was compared to the total read depth across all windows to yield a GC-normalization curve for each sequencing library. Given a genomic locus, estimation of diploid copy number for each sample was done by fitting a Gaussian mixture model with sample-specific variance to the observed and expected read depth for each sample[64], allowing the model to fit as many copy number classes as needed at each locus. To analyze genome regions with known paralogs in sequences not in the hg19 reference (notably 2p22.2), we used BWA[59] (using bwa aln/sampe) to realign the 1000 Genomes reads from the genomic region to a synthetic reference containing the original reference sequence plus the sequence for the extra paralog. Estimation of copy number was then carried out as described above.
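The GC-normalization idea can be sketched in Python as below. This is a simplified illustration rather than the Genome STRiP implementation: it ignores alignability, treats each window as an already computed (GC fraction, read count) pair, and uses a hypothetical choice of 20 GC bins. The `per_copy_depth` argument stands for the expected reads per window per haploid copy and is assumed to be supplied by the caller.

```python
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence window."""
    seq = seq.upper()
    return sum(seq.count(base) for base in "GC") / len(seq)

def gc_bin(gc, n_bins=20):
    """Index of the GC stratum that a window falls into."""
    return min(int(gc * n_bins), n_bins - 1)

def gc_normalization_curve(windows, n_bins=20):
    """Relative depth per GC stratum over copy-number-invariant baseline windows.

    windows: list of (gc_fraction, read_count) for equal-size windows.
    Returns {bin index: mean depth in bin / mean depth over all windows}.
    """
    overall = sum(count for _, count in windows) / len(windows)
    sums, counts = defaultdict(float), defaultdict(int)
    for gc, count in windows:
        b = gc_bin(gc, n_bins)
        sums[b] += count
        counts[b] += 1
    return {b: (sums[b] / counts[b]) / overall for b in sums}

def normalized_depth(region_windows, curve, per_copy_depth, n_bins=20):
    """Observed reads divided by GC-corrected expected reads per haploid copy."""
    observed = sum(count for _, count in region_windows)
    expected = sum(
        per_copy_depth * curve.get(gc_bin(gc, n_bins), 1.0)
        for gc, _ in region_windows
    )
    return observed / expected
```

In this sketch, a normalized depth near 2 corresponds to the usual diploid state. The values from many samples are what a mixture model would then cluster into copy number classes.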

Analysis of RNA sequence expression data

To compare expression of different paralogs of the genes DUSP22, PRIM2, HYDIN, MAP2K3, and KCNJ12, we first identified PSVs over the predicted mRNA for these genes by examining all heterozygous loci called for 1000 Genomes Project pilot high coverage samples NA12878/CEU, NA12891/CEU, NA12892/CEU, NA19238/YRI, NA19239/YRI, and NA19240/YRI, and then determined, when possible, which allele belongs to each paralog. Once we obtained a list of all PSVs, we counted reads from the Illumina Human BodyMap 2.0 project for each of the alleles observed at the locus using the GATK[60] (using default settings for the UnifiedGenotyper walker and custom scripts). To validate the findings and filter out possible artifacts, sequence reads were further manually analyzed using the Integrative Genomics Viewer[65] (IGV).
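Once allele counts at PSVs are available, the comparison between paralogs reduces to summing reads per paralog. The Python sketch below illustrates this under assumed data structures: the dictionary layouts, PSV identifiers, and paralog labels are hypothetical, and the counts would come from an upstream tool rather than from this code.

```python
def paralog_expression(psv_alleles, read_counts):
    """Fraction of PSV-overlapping reads assigned to each paralog in one tissue.

    psv_alleles: {psv_id: {allele: paralog_label}}
    read_counts: {psv_id: {allele: n_reads}}
    Alleles not assigned to any paralog are ignored.
    """
    totals = {}
    for psv, alleles in psv_alleles.items():
        for allele, paralog in alleles.items():
            n_reads = read_counts.get(psv, {}).get(allele, 0)
            totals[paralog] = totals.get(paralog, 0) + n_reads
    assigned = sum(totals.values())
    return {p: (n / assigned if assigned else 0.0) for p, n in totals.items()}
```

Comparing these fractions across tissues indicates whether the paralogs have diverged in expression pattern, although any such call would still need manual inspection of the reads, as described above.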

Table 1

Segmental duplications localized by admixture mapping.

CHR	FROM	TO	BAND	GENE	SIZE	CHR’	FROM’	TO’	BAND’	SCAFFOLD	DIV	CARE	ICDB	HAPMAP	FISH
chr1	83,598,160	83,955,427	1p31.1	POMZP3	~400kbp	chr7	76,182,346	76,575,579	7q11.23	NA	~1.4%	6	1	yes	no
chr1	206,072,708	206,558,788	1q32.1	FAM72/SRGAP2	~240kbp	chr1	143,880,004	144,095,783	1q21.1	NA	~0.6%	3	0	no	no
chr2	37,958,019	38,003,219	2p22.2	NA	~45kbp	chr22	NA	NA	22q11.1	SCAF_1103279187616	~4.0%	3	0	yes	no
chr2	91,737,476	91,880,745	2p11.1	OTOP1	~140kbp	chr1	NA	NA	1q21.1	RP11-247L13	~1.2%	2	0	yes	no
chr2	133,005,020	133,120,083	2q21.2	NA	~115kbp	chr20	NA	NA	20q11.21	RP11-462H3	>2.0%	1	1	yes	yes
chr3	612,223	663,367	3p26.3	NA	~50kbp	chr22	NA	NA	22q11.1	GL000217	~2.0%	1	0	no	yes
chr3	75,761,051	75,871,577	3p12.3	ZNF717	>110kbp	chr21	NA	NA	21q11.2	RP4-813B7	>5.0%	1	0	no	no
chr4	25,709	68,702	4p16.3	ZNF595	~40kbp	chr22	NA	NA	22q11.1	RP11-85C8	~0.5%	1	0	no	no
chr4	3,536,207	3,636,136	4p16.3	FLJ35424	~100kbp	chr9	NA	NA	9p11.2	SCAF_1103279188214	~3.0%	1	0	yes	yes
chr4	190,470,115	190,684,480	4q35.2	NA	~215kbp	chr21	NA	NA	21q11.2	GL000193	>2.0%	2	0	no	no
chr5	21,506,326	21,573,437	5p14.3	NA	~65kbp	chr6	58,137,660	58,139,549	6p11.2	CH17-92N24	~1.5%	0	0	yes	yes
chr6	256,518	382,461	6p25.3	DUSP22	~125kbp	chr16	NA	NA	16p11.2	NA	~0.1%	0	1	no	no
chr6	57,204,729	57,435,462	6p11.2	PRIM2	~230kbp	chr6	NA	NA	6p11.2	SCAF_1103279188350	~2.0%	0	0	no	yes
chr6	57,204,729	57,608,453	6p11.2	PRIM2	~400kbp	chr6	NA	NA	6q11.1	SCAF_1103279188263	~2.0%	0	0	no	yes
chr6	57,369,236	57,608,453	6p11.2	PRIM2	~240kbp	chr3	NA	NA	3p11.1	SCAF_1103279180085	~2.0%	3	0	yes	yes
chr6	57,401,565	57,570,618	6p11.2	PRIM2	>170kbp	chr3	NA	NA	3p11.1	RP1-216J23	~2.0%	3	0	yes	yes
chr6	57,447,574	57,575,919	6p11.2	PRIM2	~130kbp	chr6	NA	NA	6p11.2	SCAF_1103279188406	~2.0%	0	0	no	yes
chr12	147,380	188,194	12p13.33	FAM138	>40kbp	chr20	62,947,067	62,965,512	20q13.33	SCAF_1103279187960	~1.2%	1	0	no	no
chr13	19,020,001	19,167,977	13q11	ANKRD30BP2	~200kbp	chr21	14,447,204	14,594,419	21q11.2	NA	~0.8%	3	0	yes	no
chr14	19,817,857	20,194,548	14q11.2	POTEH/POTEM	~400kbp	chr22	16,085,071	16,459,525	22q11.1	NA	~0.6%	8	0	yes	no
chr16	70,845,287	71,202,573	16q22.2	HYDIN	~360kbp	chr1	146,341,167	146,400,000	1q21.1	GL000192	~0.6%	58	8	yes	no
chr21	10,971,951	11,032,242	21p11.1	TPTE	>60kbp	chr13	NA	NA	13q11	RP5-1039L24	~0.2%	1	1	no	no
chr21	11,083,847	11,156,072	21p11.1	BAGE	>80kbp	chr13	NA	NA	13q11	NA	NA	2	0	no	no

CHR, FROM, TO, BAND: chromosome, hg19 coordinates, and localization of the ancestral copy of the duplication; GENE: protein coding gene(s) overlapping the duplication; SIZE: estimated size of the duplication; CHR’, FROM’, TO’, BAND’: chromosome, hg19 coordinates, and localization of the derived copy of the duplication; SCAFFOLD: genomic scaffold containing the sequence in the derived copy of the duplication; DIV: estimated sequence divergence between the ancestral and the derived copies of the duplication; CARE: number of Affymetrix 6.0 SNPs re-mapped in the CARe dataset; ICDB: number of Illumina SNPs re-mapped in the ICDB dataset; HAPMAP: whether independent evidence of the cryptic duplication was confirmed by inter-chromosomal LD from HapMap genotypes; FISH: whether a FISH experiment was performed to validate the duplication.

63 in total

1. Genomic structure of a copy of the human TPTE gene which encompasses 87 kb on the short arm of chromosome 21.

Authors: M Guipponi; M L Yaspo; L Riesselman; H Chen; A De Sario; G Roizès; S E Antonarakis
Journal: Hum Genet Date: 2000-08 Impact factor: 4.132

2. Interchromosomal segmental duplications of the pericentromeric region on the human Y chromosome.

Authors: Stefan Kirsch; Birgit Weiss; Tracie L Miner; Robert H Waterston; Royden A Clark; Evan E Eichler; Claudia Münch; Werner Schempp; Gudrun Rappold
Journal: Genome Res Date: 2005-01-14 Impact factor: 9.043

Review 3. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome.

Authors: J Zhang; L Feuk; G E Duggan; R Khaja; S W Scherer
Journal: Cytogenet Genome Res Date: 2006 Impact factor: 1.636

4. Toward resolution of cardiovascular health disparities in African Americans: design and methods of the Jackson Heart Study.

Authors: Herman A Taylor; James G Wilson; Daniel W Jones; Daniel F Sarpong; Asoka Srinivasan; Robert J Garrison; Cheryl Nelson; Sharon B Wyatt
Journal: Ethn Dis Date: 2005 Impact factor: 1.847

5. Islands of euchromatin-like sequence and expressed polymorphic sequences within the short arm of human chromosome 21.

Authors: Robert Lyle; Paola Prandini; Kazutoyo Osoegawa; Boudewijn ten Hallers; Sean Humphray; Baoli Zhu; Eduardo Eyras; Robert Castelo; Christine P Bird; Sarantos Gagos; Carol Scott; Antony Cox; Samuel Deutsch; Catherine Ucla; Marc Cruts; Sophie Dahoun; Xinwei She; Frederique Bena; Sheng-Yue Wang; Christine Van Broeckhoven; Evan E Eichler; Roderic Guigo; Jane Rogers; Pieter J de Jong; Alexandre Reymond; Stylianos E Antonarakis
Journal: Genome Res Date: 2007-09-25 Impact factor: 9.043

6. Physical and genetic mapping of the human X chromosome centromere: repression of recombination.

Authors: M M Mahtani; H F Willard
Journal: Genome Res Date: 1998-02 Impact factor: 9.043

7. A 360-kb interchromosomal duplication of the human HYDIN locus.

Authors: Norman A Doggett; Gary Xie; Linda J Meincke; Robert D Sutherland; Mark O Mundt; Nicolas S Berbari; Brian E Davy; Michael L Robinson; M Katharine Rudd; James L Weber; Raymond L Stallings; Cliff Han
Journal: Genomics Date: 2006-08-30 Impact factor: 5.736

8. A whole-genome admixture scan finds a candidate locus for multiple sclerosis susceptibility.

Authors: David Reich; Nick Patterson; Philip L De Jager; Gavin J McDonald; Alicja Waliszewska; Arti Tandon; Robin R Lincoln; Cari DeLoa; Scott A Fruhan; Philippe Cabre; Odile Bera; Gilbert Semana; M Ann Kelly; David A Francis; Kristin Ardlie; Omar Khan; Bruce A C Cree; Stephen L Hauser; Jorge R Oksenberg; David A Hafler
Journal: Nat Genet Date: 2005-09-25 Impact factor: 38.330

9. A second generation human haplotype map of over 3.1 million SNPs.

Authors: Kelly A Frazer; Dennis G Ballinger; David R Cox; David A Hinds; Laura L Stuve; Richard A Gibbs; John W Belmont; Andrew Boudreau; Paul Hardenbol; Suzanne M Leal; Shiran Pasternak; David A Wheeler; Thomas D Willis; Fuli Yu; Huanming Yang; Changqing Zeng; Yang Gao; Haoran Hu; Weitao Hu; Chaohua Li; Wei Lin; Siqi Liu; Hao Pan; Xiaoli Tang; Jian Wang; Wei Wang; Jun Yu; Bo Zhang; Qingrun Zhang; Hongbin Zhao; Hui Zhao; Jun Zhou; Stacey B Gabriel; Rachel Barry; Brendan Blumenstiel; Amy Camargo; Matthew Defelice; Maura Faggart; Mary Goyette; Supriya Gupta; Jamie Moore; Huy Nguyen; Robert C Onofrio; Melissa Parkin; Jessica Roy; Erich Stahl; Ellen Winchester; Liuda Ziaugra; David Altshuler; Yan Shen; Zhijian Yao; Wei Huang; Xun Chu; Yungang He; Li Jin; Yangfan Liu; Yayun Shen; Weiwei Sun; Haifeng Wang; Yi Wang; Ying Wang; Xiaoyan Xiong; Liang Xu; Mary M Y Waye; Stephen K W Tsui; Hong Xue; J Tze-Fei Wong; Luana M Galver; Jian-Bing Fan; Kevin Gunderson; Sarah S Murray; Arnold R Oliphant; Mark S Chee; Alexandre Montpetit; Fanny Chagnon; Vincent Ferretti; Martin Leboeuf; Jean-François Olivier; Michael S Phillips; Stéphanie Roumy; Clémentine Sallée; Andrei Verner; Thomas J Hudson; Pui-Yan Kwok; Dongmei Cai; Daniel C Koboldt; Raymond D Miller; Ludmila Pawlikowska; Patricia Taillon-Miller; Ming Xiao; Lap-Chee Tsui; William Mak; You Qiang Song; Paul K H Tam; Yusuke Nakamura; Takahisa Kawaguchi; Takuya Kitamoto; Takashi Morizono; Atsushi Nagashima; Yozo Ohnishi; Akihiro Sekine; Toshihiro Tanaka; Tatsuhiko Tsunoda; Panos Deloukas; Christine P Bird; Marcos Delgado; Emmanouil T Dermitzakis; Rhian Gwilliam; Sarah Hunt; Jonathan Morrison; Don Powell; Barbara E Stranger; Pamela Whittaker; David R Bentley; Mark J Daly; Paul I W de Bakker; Jeff Barrett; Yves R Chretien; Julian Maller; Steve McCarroll; Nick Patterson; Itsik Pe'er; Alkes Price; Shaun Purcell; Daniel J Richter; Pardis Sabeti; Richa Saxena; Stephen F Schaffner; Pak C Sham; Patrick Varilly; David Altshuler; Lincoln D 
Stein; Lalitha Krishnan; Albert Vernon Smith; Marcela K Tello-Ruiz; Gudmundur A Thorisson; Aravinda Chakravarti; Peter E Chen; David J Cutler; Carl S Kashuk; Shin Lin; Gonçalo R Abecasis; Weihua Guan; Yun Li; Heather M Munro; Zhaohui Steve Qin; Daryl J Thomas; Gilean McVean; Adam Auton; Leonardo Bottolo; Niall Cardin; Susana Eyheramendy; Colin Freeman; Jonathan Marchini; Simon Myers; Chris Spencer; Matthew Stephens; Peter Donnelly; Lon R Cardon; Geraldine Clarke; David M Evans; Andrew P Morris; Bruce S Weir; Tatsuhiko Tsunoda; James C Mullikin; Stephen T Sherry; Michael Feolo; Andrew Skol; Houcan Zhang; Changqing Zeng; Hui Zhao; Ichiro Matsuda; Yoshimitsu Fukushima; Darryl R Macer; Eiko Suda; Charles N Rotimi; Clement A Adebamowo; Ike Ajayi; Toyin Aniagwu; Patricia A Marshall; Chibuzor Nkwodimmah; Charmaine D M Royal; Mark F Leppert; Missy Dixon; Andy Peiffer; Renzong Qiu; Alastair Kent; Kazuto Kato; Norio Niikawa; Isaac F Adewole; Bartha M Knoppers; Morris W Foster; Ellen Wright Clayton; Jessica Watkin; Richard A Gibbs; John W Belmont; Donna Muzny; Lynne Nazareth; Erica Sodergren; George M Weinstock; David A Wheeler; Imtaz Yakub; Stacey B Gabriel; Robert C Onofrio; Daniel J Richter; Liuda Ziaugra; Bruce W Birren; Mark J Daly; David Altshuler; Richard K Wilson; Lucinda L Fulton; Jane Rogers; John Burton; Nigel P Carter; Christopher M Clee; Mark Griffiths; Matthew C Jones; Kirsten McLay; Robert W Plumb; Mark T Ross; Sarah K Sims; David L Willey; Zhu Chen; Hua Han; Le Kang; Martin Godbout; John C Wallenburg; Paul L'Archevêque; Guy Bellemare; Koji Saeki; Hongguang Wang; Daochang An; Hongbo Fu; Qing Li; Zhen Wang; Renwu Wang; Arthur L Holden; Lisa D Brooks; Jean E McEwen; Mark S Guyer; Vivian Ota Wang; Jane L Peterson; Michael Shi; Jack Spiegel; Lawrence M Sung; Lynn F Zacharia; Francis S Collins; Karen Kennedy; Ruth Jamieson; John Stewart
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

10. The diploid genome sequence of an individual human.

Authors: Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029
