
Characterization of missing human genome sequences and copy-number polymorphic insertions.

Jeffrey M Kidd, Nick Sampas, Francesca Antonacci, Tina Graves, Robert Fulton, Hillary S Hayden, Can Alkan, Maika Malig, Mario Ventura, Giuliana Giannuzzi, Joelle Kallicki, Paige Anderson, Anya Tsalenko, N Alice Yamada, Peter Tsang, Rajinder Kaul, Richard K Wilson, Laurakay Bruhn, Evan E Eichler.

Abstract

The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18-37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.

Year: 2010 PMID: 20440878 PMCID: PMC2875995 DOI: 10.1038/nmeth.1451

Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547

Introduction

The human genome reference assembly is a mosaic of distinct haplotypes sampled from multiple individuals1. As a result of both gaps in the assembled sequence and the structural differences that exist among different humans, individual genome projects are expected to uncover human sequences present in some (or all) individuals that are not represented in the assembly. Consistent with this prediction, the first sequences of individual genomes2, 3 revealed 23–29 Mb of sequence that does not map against the reference assembly. The short-read, high-throughput approaches currently being employed are also expected to uncover unrepresented insertions4–7. However, these sequences often assemble only as short (median length of 220 to 314 bp 7) contiguous sequences (contigs) that are difficult to anchor and incorporate into existing genome assemblies. Thus, while thousands of novel sequences may be discovered over the next few years, their annotation and complete integration into the human genome will remain a significant bottleneck 8. Since genotyping and expression microarrays are fundamentally dependent upon the reference genome for array probe design, a small fraction of the human genome effectively cannot be assayed. We recently reported efforts to systematically map and sequence human genome structural variation using a fosmid end-sequence pair mapping approach9–11. We fragmented genomic DNA from nine human individuals and subcloned 40-kb segments. Using standard capillary sequencing, reads were generated from both ends of each fragment (end-sequence pairs) and clones were mapped to the human reference genome. Structural differences (inversions, deletions, insertions and translocations) between the reference genome assembly and library source were identified based on the mapped location of the end-sequence pairs. 
Since the individual fosmid clones were retained, the procedure allowed simultaneous discovery and complete sequence characterization of a subset of structural variant loci including novel insertion sequences common to most individuals but not represented in the human reference genome. Here, we present a detailed sequence and copy-number analysis of these segments missing from the human reference genome.

Results

Discovery

We systematically searched 9.7 million end-sequence pairs, corresponding to 92-fold physical coverage of the human genome, for sequences that failed to map to the reference sequence (NCBI build35). The end-sequence dataset was derived from nine individual genomes (4 Yoruba individuals from Ibadan, Nigeria (YRI), 2 individuals with European ancestry (CEU), 2 individuals with Han Chinese or Japanese ancestry (CHB+JPT), and 1 individual of unknown ethnicity). We distinguished clones that mapped onto the assembly with only one end (one-end anchored, or OEA, clones) from orphan clones where neither end mapped. After eliminating low-quality sequence and obvious viral and bacterial contaminants, we identified 44,415 high-quality fosmid end sequences that do not map onto the genome reference sequence (NCBI build35)11. This set includes individual sequences from 26,001 OEA clones and 9,207 orphan clones. Using phrap (http://www.phrap.org), we initially assembled these individual sequences into 3,963 sequence contigs (total size = 4.47 Mb, N50 = 1,148 bp) (Table 1), but after applying additional experimental and computational filters, this was reduced to 2,363 distinct sequence contigs (Supplementary Note).
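The bookkeeping in this paragraph can be illustrated with a minimal sketch. It assumes that each OEA clone contributes one unmapped end sequence and each orphan clone contributes two, which is consistent with the reported total of 44,415, and it uses the usual definition of N50. The function name `n50` and the toy contig lengths are hypothetical and are not data from the study.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Assumed decomposition of the unmapped end sequences reported in the text:
# one unmapped end per OEA clone, two unmapped ends per orphan clone.
oea_clones = 26_001
orphan_clones = 9_207
unmapped_end_sequences = oea_clones + 2 * orphan_clones

# Hypothetical toy contig lengths in bp (illustration only).
toy_lengths = [2000, 1500, 1148, 900, 600, 300]
toy_n50 = n50(toy_lengths)
```

In the toy set, the two longest contigs already hold more than half of the bases, so the N50 is the length of the second contig.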

Table 1

Assembling novel sequence contigs

The number of novel sequence contigs, their size, and the number of corresponding loci with contributions from each sample is shown. Results are given for the initial set of 3,963 assembled contigs as well as for the 2,363 contigs that pass all filters. The sample origin of 222 sequenced clones (corresponding to 192 distinct loci) is also shown.

Sample	Population	Assembled contigs	Assembled contig size (Mb)	Assembled loci	ArrayCGH-typed contigs	ArrayCGH-typed contig size (Mb)	ArrayCGH-typed loci	Sequenced clones
NA15510	--	768	0.904	345	387	0.512	177	9
NA18517	Yoruba	726	0.925	307	529	0.700	229	15
NA18507	Yoruba	1,386	1.752	534	904	1.208	363	22
NA18956	Japan	885	1.140	342	597	0.815	243	65
NA19240	Yoruba	1,034	1.295	400	682	0.910	295	44
NA18555	China	953	1.187	380	615	0.825	269	20
NA12878	CEPH	977	1.232	386	653	0.879	279	26
NA19129	Yoruba	990	1.277	359	678	0.932	266	13
NA12156	CEPH	996	1.278	377	667	0.914	266	8
Total (non-redundant)		3,963	4.465	1,182	2,363	2.834	720	192

43% (1,019/2,363) of the contigs contain sequence contributed by at least one orphan clone, suggesting that these contigs represent segments longer than 40 kb (Supplementary Table 1). Using OEA anchoring information and the mate-pair relationships from the orphan clones, we identified 720 loci (400 of which have a mapped genomic position) corresponding to ~2.8 Mbp of sequence with a median contig size of 1 kb (Supplementary Note). Interestingly, 80 of the 400 anchored loci (20%) map within 5 Mb of the ends of a chromosome (a significant 2.9-fold subtelomeric enrichment, p=1.0e-18, binomial test) (Supplementary Fig. 1, Supplementary Table 2). In addition to these 720 loci, we identified 19,038 singleton OEA sequences (average length 790 bp) as well as 5,654 orphan clones that did not contribute to any contigs. By convention, we refer to these sequences as “novel insertions” based on the fact that they are not present within the public reference genome assembly.
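As a rough illustration of the binomial test mentioned above, the following sketch computes a one-sided upper-tail probability for observing 80 or more subtelomeric loci among 400. The null rate is not stated in the text; here it is back-calculated from the reported 2.9-fold enrichment, which is an assumption, so the result need not reproduce the published p-value exactly. The function name `binomial_upper_tail` is hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term built in log space
    so that very small probabilities do not overflow or lose precision."""
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return total


observed_loci = 80
anchored_loci = 400
observed_fraction = observed_loci / anchored_loci

# Assumption: expected subtelomeric fraction implied by a 2.9-fold enrichment.
expected_fraction = observed_fraction / 2.9

p_value = binomial_upper_tail(observed_loci, anchored_loci, expected_fraction)
```

Under this assumed null rate the tail probability is extremely small, in line with the strong enrichment the text describes.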

FISH Analysis

Our analysis distinguished two different types of novel human sequences: 400 loci that were anchored within euchromatin based on OEA assignments and 320 unassigned loci where a clear anchor position could not be identified. We explored the genomic distribution and assessed the accuracy of our assigned locations using individual fosmid clones as FISH probes. Although limited to larger regions, this analysis provided us valuable high-level mapping information with respect to the distribution of insertions in heterochromatin and euchromatin. We selected 33 contigs derived only from orphan clones (assigned to seven distinct unmapped loci) and mapped these loci to metaphase chromosomes by FISH. Three loci mapped separately to telomeric regions on chromosomes 10q, 7p, and Xp; one locus mapped to 6q1; and three loci mapped to the p-arms of the acrocentric chromosomes (Supplementary Table 1). As a complement to these studies we also tested an additional 68 large orphan contigs, which were constructed based on a detailed fingerprint analysis of all orphan clones from a single individual human genome library (NA15510) (Supplementary Note). After excluding 31 contigs assigned to genome assembly gaps12, we found that 15 of the contigs mapped interstitially, with the remainder (22/37) mapping to telomeric, pericentromeric or acrocentric positions (Supplementary Table 3, Supplementary Fig. 2). Finally, we considered sequence contigs that had been anchored by OEA clones, but also had contributions from at least one orphan clone, to positions in human euchromatin by using 37 fosmid clones (20 OEA and 17 orphan clones) as FISH probes. We found that 78% (29/37) of the clones support the predicted position while 11% (4/37) map to a different interstitial location and 11% (4/37) map to the p-arms of the acrocentric chromosomes. 
We additionally tested a limited number (n = 3) of smaller insertions (<30 kb) that had been completely sequenced and confirmed all three by metaphase oligo-FISH, finding that 2/3 were copy-number polymorphic among the four individuals tested (Supplementary Note). Our FISH results indicate that megabases of uncharacterized sequence remain within the heterochromatin and euchromatin-heterochromatin transition regions of the human genome, but they also confirm the presence of missing euchromatic sequences that are copy-number polymorphic.

Assembly Comparisons

We searched for evidence of the identified 2,363 sequence contigs in other human and non-human primate genome assemblies. 600 contigs (71 loci) have a match against the newest human reference genome assembly, GRCh37, and 1,467 contigs (54 loci) have a match against the HuRef assembly 2 (Supplementary Note). We find partial support for 1,700–2,000 of the contigs in sequence data from the JDW, YH, and NA18507 genomes3–5 (Supplementary Note). One of the genomes in our study, NA18507, was sequenced to high coverage using the Illumina platform4 and subjected to a SOAP de novo sequence assembly 7. Surprisingly, we found that 97% of our smallest insertions identified from single unmapped reads (~790 bp) were not identified as part of the de novo assembly (Supplementary Note). 32% of our larger contigs had no representation and only 25% had complete sequence coverage (defined as more than 95% bp representation). When we restricted our analysis to insertions from sequenced NA18507 clones, we found that 52% (11/21 sequenced fragments) were either not present (n = 4) or mapped to different scaffolds (n = 7) in the de novo assembly. We find that this fragmentation often corresponds to the presence of large common repeat sequences that disrupt the contiguity and complicate map assignment. Regions largely devoid of common repeats or segmental duplications showed the greatest correspondence in length and coverage. In order to determine the ancestral state of each of these sequences, we also searched the 2,363 contigs against available whole-genome sequence data from chimpanzee and orangutan13. 74% (1,745/2,363) of the contigs had a match against one of these datasets with 68% (1,599/2,363) of the contigs identified within chimpanzee. 
We were concerned that these sequences may have characteristics leading to their underrepresentation in genome-sequencing datasets, so we performed an arrayCGH experiment using DNA from a single chimpanzee and tested whether the DNA in fact hybridized. This experiment indicated that 84% (1,985/2,363) of the contigs were present in the single chimpanzee analyzed (Supplementary Note). This includes 624 contigs that do not have a match to the chimpanzee genome sequence data. In total, we find experimental or computational support for 94% (2,223/2,363) of the contigs in the chimpanzee and 96% (2,266/2,363) in either chimpanzee or orangutan. The absence of these new insertions in the current reference genome represents either genome assembly errors or deletions that have emerged within the human lineage and are now copy-number polymorphic in our species.

Copy-number Polymorphism

We designed two customized oligonucleotide microarrays in order to provide an assessment of copy-number polymorphism among these novel insertion sequences. In the first, we designed a microarray targeting the 19,038 single OEA sequences that did not assemble into sequence contigs and tested them against the sample genomes used for discovery. After filtering additional contaminants, we found that 38% (7,240/19,038) of these unassembled sequences were represented by at least three probes with signal intensities sufficiently above the background level. Based on a comparison of the intensity values for the eight analyzed samples, we estimate that 31% (2,228/7,240) of the assayable single OEA sequences are copy-number polymorphic (Supplementary Table 4). In the second design, we investigated copy-number polymorphism for the 2,363 sequence contigs that had been assigned to 720 distinct loci and tested a larger collection of 28 unrelated HapMap individuals (9 CEU, 11 YRI, 8 JPT+CHB). These experiments clearly identified sets of sequences that are copy-number polymorphic or apparently fixed among the analyzed individuals (Fig. 1). Polymorphic contigs were identified using two alternative calling schemes: a noise-multiplier approach that compares the median probe log-ratios for each contig with the results of a control self-self hybridization (using reference sample NA15510, Supplementary Table 5) and a clustering approach that assigns contigs to log-ratio clusters14 that are then fitted to distinct, small integer copy-number states (Fig. 2, Supplementary Table 6). The noise-multiplier approach identifies 37% of the contigs as being copy-number polymorphic. 518 contigs could be fitted to a copy-number state, of which 461 contigs are fitted to two or more distinct copy-number states. 443 contigs (18.7%) were identified as polymorphic by both approaches, an indication of the challenges in assigning discrete copy numbers to all copy-number variable loci.
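The noise-multiplier idea described above can be sketched as follows. The exact noise estimate and multiplier used in the study are given only in the Supplementary material, so both are assumptions here: noise is taken as the median absolute log2 ratio of the self-self control, and the multiplier defaults to 3. The function name `call_polymorphic` and all probe values are hypothetical.

```python
from statistics import median


def call_polymorphic(contig_log_ratios, self_self_log_ratios, multiplier=3.0):
    """Return True if any sample's median probe log2 ratio for a contig
    exceeds `multiplier` times the noise seen in the control hybridization.

    contig_log_ratios: dict mapping sample name -> list of probe log2 ratios
    self_self_log_ratios: probe log2 ratios from the self-self control
    """
    # Assumed noise estimate: median absolute log2 ratio in the control.
    noise = median(abs(x) for x in self_self_log_ratios)
    threshold = multiplier * noise
    sample_medians = {s: median(v) for s, v in contig_log_ratios.items()}
    return any(abs(m) > threshold for m in sample_medians.values())


# Hypothetical probe values (illustration only).
control = [0.02, -0.03, 0.01, -0.01]
variable_contig = {"sampleA": [-0.9, -1.0, -1.1], "sampleB": [0.0, 0.02, -0.01]}
fixed_contig = {"sampleA": [0.01, -0.02, 0.0], "sampleB": [0.0, 0.02, -0.01]}

variable_call = call_polymorphic(variable_contig, control)
fixed_call = call_polymorphic(fixed_contig, control)
```

Using the per-contig median rather than individual probes keeps a single noisy probe from triggering a call, which matches the text's emphasis on median probe log-ratios.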

Figure 1

Copy-number polymorphism of novel insertions

ArrayCGH intensity data is displayed for novel sequences ordered along (a) chromosome 5 and (b) chromosome 14 based on anchored map locations (build35 coordinates, UCSC). Copy-number gains (orange) and losses (blue) are shown relative to the reference sample (NA15510). Each column in the heat map represents a probe on the array, and each row represents a sample ordered and separated (yellow lines) by corresponding HapMap population (CEU, CHB, JPT and YRI). The bottom row depicts a reference self-self hybridization as control. The red brackets group multiple contigs into loci that generally show a consistent hybridization pattern by arrayCGH.

Figure 2

Sequencing and genotyping insertions

(a) The complete sequence of a clone (AC205876) carrying a 4.8-kbp novel insertion sequence is compared to the corresponding segment from chromosome 20 using miropeats (black lines connect segments of matching sequence; colored arrows correspond to common repeats; green: LINEs; purple: SINEs; orange: LTR elements; pink: DNA elements). The magenta lines denote the insertion breakpoints. The brown boxes correspond to the mapped position of three assembled novel sequence contigs. (b) ArrayCGH hybridization results represented as a heat map suggest that the deletion is fixed in CEU and CHB populations. The brown-red lines correspond to the three sequence contigs depicted in part (a) and are represented by 16, 15, and 18 arrayCGH probes respectively. The median log2 ratios (c) and single channel intensities (d) are shown for all probes matching AC205876. Note that the reference (blue bars) channel shows similar intensity across hybridizations. For this example the reference sample is inferred to have a copy number of 1. The signals form three distinct clusters that are assigned integer copy-number states of 0, 1, and 2. The dotted red, green, and blue lines correspond to the median intensities of each defined cluster. Using these genotypes an FST of 0.70 is calculated for this insertion. (e–h) A second example as described above depicting a 3.9-kb insertion (AC216083) within the first intron of the LCT (lactase) gene (red boxes represent exons as indicated).

We assessed the extent of population differentiation for these sequences using both the FST and VST statistics15, 16. For 189 loci with a simple autosomal insertion-deletion variant, we found 20 loci having an FST greater than 0.35 (Fig. 3, Supplementary Table 7, Supplementary Table 8, and Supplementary Fig. 3). Among these, we identified a 3.9-kb insertion sequence within the first intron of the lactase gene (LCT) (Fig. 2). Interestingly, this 3.9-kb insertion is prevalent among the YRI samples tested (allele frequency = 0.86) but is largely absent among the CEPH Europeans (allele frequency = 0.11), where it is in complete linkage disequilibrium (D’ = 1) with the functional SNP that has been associated with lactase persistence17. We repeated the analysis using the VST statistic for all 720 loci (Supplementary Fig. 4). We identified 27 loci that have a VST value greater than 0.35, with ten having a value greater than 0.5 (Supplementary Table 9). Fosmid clones corresponding to several of the most stratified loci have been completely sequenced, including a 4.8-kb insertion on chr20 (AC205876, Fig. 2, VST = 0.73, FST = 0.70) and an 11.4-kb insertion on chr1 near the ATP6V1G3 gene (AC212752, VST = 0.48, FST = 0.37). These sites represent structures that show a high level of differentiation among human populations but are absent from the genome reference.
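To make the two differentiation statistics concrete, the sketch below uses simple textbook forms: FST as (H_T - H_S)/H_T from per-population allele frequencies with equal population weights, and VST as (V_T - V_S)/V_T from log2 ratios with within-population variance weighted by sample size. The estimators actually used in the study (refs. 15, 16) may differ in weighting and bias correction, so the values produced here are not expected to match the published ones. The function names are hypothetical; the three LCT insertion frequencies are those quoted in this article, and the log2 ratios are toy values.

```python
from statistics import pvariance


def fst_from_frequencies(freqs):
    """Wright-style FST = (H_T - H_S) / H_T for one biallelic locus,
    weighting all populations equally (an assumption)."""
    mean_p = sum(freqs) / len(freqs)
    h_total = 2 * mean_p * (1 - mean_p)
    if h_total == 0:
        return 0.0  # allele fixed everywhere: no differentiation to measure
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def vst_from_log_ratios(groups):
    """VST = (V_T - V_S) / V_T, where V_T is the variance of log2 ratios over
    all individuals and V_S is the size-weighted mean within-group variance."""
    all_values = [x for group in groups for x in group]
    v_total = pvariance(all_values)
    if v_total == 0:
        return 0.0
    v_within = sum(len(g) * pvariance(g) for g in groups) / len(all_values)
    return (v_total - v_within) / v_total


# Insertion allele frequencies for the LCT intron insertion quoted in the text
# (YRI, CEPH Europeans, Asian samples).
lct_fst = fst_from_frequencies([0.86, 0.11, 0.63])

# Hypothetical log2 ratios for two perfectly separated populations.
toy_vst = vst_from_log_ratios([[0.0, 0.0], [1.0, 1.0]])
```

Both statistics range from 0 (no differentiation) toward 1 (populations fully separated), which is why the text can apply similar cutoffs such as 0.35 to either one.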

Figure 3

Insertion allele frequency distribution

The frequency of the insertion allele is shown for 189 loci that are fitted to distinct copy numbers and are consistent with a simple autosomal insertion-deletion variant. Values are shown for all 28 individuals (black bars) and separately for each HapMap population as indicated.

Sequencing and Genotyping Novel Insertions

The complete sequence of insertions smaller than 40 kb can be directly obtained by sequencing an appropriate fosmid clone, while an iterative strategy is required to capture the sequence of larger insertions. 222 fosmid clones (53 OEA clones and 169 spanned insertions) were sequenced using a traditional capillary sequencing and assembly approach (Supplementary Table 10, Supplementary Fig. 5). The 222 clones correspond to 192 distinct genomic loci and contain a total of 1.67 Mb of inserted sequence (Supplementary Fig. 6) subsuming 475 of our original 2,363 contigs. Four of the completely sequenced insertions, ranging in size from 41–65 kb, were larger than a single clone insert (Supplementary Note). The sequenced insertions are similar in composition to segments sampled from the reference genome assembly, with a slight enrichment for common repeats, particularly LINEs (Supplementary Table 11). Only five of the 192 loci (Supplementary Table 12) have been updated in GRCh37; thus the majority (97%) of these insertions await integration into the next version of the human genome. We searched the sequenced insertions against the RefSeq gene database18 to identify previously uncharacterized exons. We found that segments from 22 genes matched 21 of the insertions (Supplementary Table 13), including support for structures not represented in the build36 assembly (e.g., the MINK1, FSCN2, PECAM1 and VPRBP genes; Fig. 4). We further searched for expressed elements using mRNA-seq data derived from multiple human tissues that do not map onto the build36 genome assembly19. We mapped these previously unmapped reads onto the sequenced clones and found that 26 insertions contained segments supported by at least three mRNA-seq reads (Supplementary Fig. 5, Supplementary Note). We searched against an alignment of nine mammalian genomes to identify segments matching the sequenced insertions (Ensembl Compara 51) 20, 21. 
Using these alignments, we identified 477 constrained elements from 104 different loci (Fig. 4), a signature that identifies segments of possible functional importance22. Six of the constrained elements intersect with mapped RefSeq exons with the remainder having an unknown functional importance. Using Genomic Evolutionary Rate Profiling (GERP) scores as a metric, we note that the conserved elements found in the insertions show a similar level of constraint compared to elements identified across the rest of the alignments (Supplementary Fig. 7).

Figure 4

Annotation of conserved and functional elements

(a) The complete sequence of an OEA clone carrying 29 kbp of novel sequence is compared by miropeats to the reference genome. We identify a 95-bp conserved element within this sequence (green rectangles) as defined by a GERP analysis of 8 species (see Online Methods). A multiple sequence alignment of one of these conserved elements (black arrow) is highlighted. (b) A novel exon is predicted within the sequence of a 4.3-kbp insertion based on comparison with the PECAM1 transcript (NM_000442.3), as shown in blue. This alternate exon is supported by RNA-seq data and corresponds to a conserved element identified by alignment comparisons.

High-quality sequence across the variant breakpoints permitted a detailed assessment of exact variant boundaries and associated sequences. We used the breakpoint sequence data obtained from 152 insertions spanned by individual fosmid clones to identify a set of unique, diagnostic k-mers specific to the insertion and deletion alleles of each variant (Fig. 5). We found that 108 of the sequenced loci could be uniquely identified using a k-mer length of 36 and a search stringency of one substitution. 29% of the loci (44/152) could not be uniquely identified using this approach, although we note that this method assumes that the genome reference assembly accurately represents the structure of the deletion allele and that all instances of the variant have identical breakpoints. Even if the k-mer length were increased to 100 bp, five loci would remain recalcitrant to analysis using this approach (Fig. 5b). We determined genotypes for 106 loci by searching Illumina sequence data from NA18507 against these diagnostic k-mers4. We observed agreement at 94.3% of the genotypes determined for this individual by arrayCGH (Fig. 5c, Supplementary Table 14). We simulated the effect of genome coverage by sampling subsets of the total sequence data from NA18507 (Fig. 5d). We found a rapid increase in the number and accuracy of the sites genotyped with increasing coverage, followed by a plateau of approximately 94% genotype agreement once sequence coverage reached 10-fold. This indicates that high-quality breakpoint sequence data can be used to genotype structural variants in samples that have been analyzed by next-generation sequencing.
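The selection of diagnostic k-mers can be sketched under simplifying assumptions: matching is exact (the study allowed one substitution), reverse complements are ignored, and a short string stands in for the whole reference genome. Deletion-allele k-mers must occur once in the reference and never in the fosmid, and insertion-allele k-mers must occur once in the fosmid and never in the reference. All function names and sequences are hypothetical toy examples.

```python
def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def diagnostic_kmers(reference, fosmid, k):
    """Return (deletion_kmers, insertion_kmers) that are unique to one allele."""
    ref_counts = kmer_counts(reference, k)
    fos_counts = kmer_counts(fosmid, k)
    deletion = {km for km, n in ref_counts.items()
                if n == 1 and km not in fos_counts}
    insertion = {km for km, n in fos_counts.items()
                 if n == 1 and km not in ref_counts}
    return deletion, insertion


def count_supporting_reads(reads, deletion, insertion, k):
    """Count reads containing at least one diagnostic k-mer for each allele."""
    del_reads = ins_reads = 0
    for read in reads:
        read_kmers = set(kmer_counts(read, k))
        if read_kmers & deletion:
            del_reads += 1
        if read_kmers & insertion:
            ins_reads += 1
    return del_reads, ins_reads


# Toy alleles: the "fosmid" carries a 6-bp insertion between the two flanks.
K = 5
reference_allele = "AACCG" + "GTTAC"
fosmid_allele = "AACCG" + "CTCTGA" + "GTTAC"
deletion_set, insertion_set = diagnostic_kmers(reference_allele, fosmid_allele, K)

toy_reads = ["CCGCTCTG", "ACCGGTT", "GGGGGGG"]
del_support, ins_support = count_supporting_reads(
    toy_reads, deletion_set, insertion_set, K)
```

A variant is usable only when both sets are non-empty, which mirrors the requirement that each locus have at least one deletion k-mer and one insertion k-mer.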

Figure 5

Genotyping sequenced variants through unique k-mer matches

(a) Unique diagnostic k-mer sequences were identified for each variant using sequence-resolved breakpoints. For the deletion breakpoint, k-mers were required to have a single match to the reference genome and no matches to the fosmid sequences. For the insertion breakpoints, k-mers were required to have no matches to the genome and a single match to the fosmid. In order to be uniquely identifiable, a variant must have at least one deletion k-mer and at least one insertion k-mer that meet these criteria. (b) Effect of k-mer length and search stringency on ability to uniquely identify a variant. 71% (108/152) of the sequenced sites are uniquely identifiable with criteria of k = 36 and one substitution, while 97% (147/152) are assayable if the k-mer length is increased to 100 bp. (c) A comparison of genotypes determined using arrayCGH and breakpoint k-mer matching is depicted for sample NA18507. The search database consists of unique 36-mers (one substitution). Genotypes for 54 variants were successfully determined by both arrayCGH and breakpoint k-mer matching. Partitioning the breakpoint scores into distinct genotypes at 0.5 and 1.5 (red lines) results in 94.3% genotype agreement between the two methods. (d) Effect of sequence coverage on breakpoint k-mer genotyping. The number of variants genotyped (at least one matching read, solid line, left axis) and the percent agreement with arrayCGH results (dashed line, right axis) are shown at various sequence coverage levels (1–42X).
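The partitioning step in panel (c) can be sketched as below. The caption does not define the breakpoint score, so this example assumes a score of twice the fraction of diagnostic reads that support the insertion allele, giving values between 0 and 2 that are then cut at 0.5 and 1.5. The function names, locus labels and read counts are all hypothetical.

```python
def breakpoint_score(ins_reads, del_reads):
    """Assumed score: 2 x fraction of diagnostic reads supporting the insertion.
    Returns None when no diagnostic read was observed."""
    total = ins_reads + del_reads
    if total == 0:
        return None
    return 2.0 * ins_reads / total


def genotype_from_score(score, low=0.5, high=1.5):
    """Partition a breakpoint score into 0, 1 or 2 insertion copies."""
    if score < low:
        return 0
    if score < high:
        return 1
    return 2


def percent_agreement(calls_a, calls_b):
    """Percent of variants called by both methods that have equal genotypes."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return None
    matches = sum(calls_a[v] == calls_b[v] for v in shared)
    return 100.0 * matches / len(shared)


# Hypothetical (insertion reads, deletion reads) per locus.
read_support = {"locus1": (0, 12), "locus2": (6, 7),
                "locus3": (15, 0), "locus4": (10, 1)}
kmer_calls = {locus: genotype_from_score(breakpoint_score(i, d))
              for locus, (i, d) in read_support.items()}

# Hypothetical arrayCGH genotypes for the same loci.
array_calls = {"locus1": 0, "locus2": 1, "locus3": 2, "locus4": 1}
agreement = percent_agreement(kmer_calls, array_calls)
```

Only variants genotyped by both methods enter the comparison, as in the caption, where 54 variants had calls from both arrayCGH and k-mer matching.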

Discussion

Over the past five years the extent of structural variation among individual human genomes has become increasingly clear. Array-based approaches, for example, have systematically discovered and genotyped more than 50% of common copy-number polymorphic deletions23, 24. Sequence-based approaches have begun to more fully explore the size spectrum, cataloging an increasing number of smaller deletions and moving toward personalized duplication maps for individual genomes9, 11, 25, 26. The characterization of other classes of structural variation, including inversions and insertions, however, has lagged due to technical biases in their discovery and difficulties associated with their validation. The study of new insertions is limited, in particular, by the genetics community’s reliance on a single mosaic reference genome, which at some positions represents rare structural configurations and entirely omits sequences that are found in the majority of individuals. The absence of these sequences from the reference genome hinders their functional characterization, leading to a less-than-complete understanding of the sequence content present in the majority of humans. We used a fosmid clone strategy to specifically focus on the characterization of human sequences that are not in the reference assembly and have therefore not been annotated for functional elements or systematically genotyped. In this study we identified 720 distinct loci ranging from 1 to 20 kbp in length as well as several thousand additional smaller segments <1 kbp in length. We have determined that more than half map to the euchromatin with a disproportionate fraction mapping within the last 5 Mbp of human chromosomes (Supplementary Fig. 1). A remarkable feature of these sequences is their degree of copy-number polymorphism. 
ArrayCGH analysis indicates that 18–37% of the assembled sequence contigs vary in copy number, with 80% of the genotyped variants having a minor allele frequency >10% among the 28 individuals surveyed (Fig. 3). Experimental and computational comparisons with chimpanzee DNA suggest that at least 94% arose as a result of deletions that occurred within the human lineage. Many of the common insertions show striking differences in allele frequency among populations, a pattern suggestive of either selection or genetic drift since the migration of humans out of Africa (Fig. 2, Supplementary Table 8, Supplementary Table 9). We observe that the average insertion allele frequency for the variable loci was significantly greater in African populations when compared to European or Asian populations (YRI versus CEU p = 0.0003 and YRI versus ASN p = 0.005, one-sided t-test). The 3.9-kb novel insertion within the first intron of the LCT gene is illustrative. Our initial survey suggests that this insertion sequence is prevalent among the Yoruba (86%) and Asian samples (63%) but is present at a much lower frequency among CEPH Europeans (11%). These findings raise the possibility that the additional sequence within this haplotype may play a role in regulating expression of this gene. The complete sequence of this insertion (AC216083) now allows this hypothesis to be directly tested. An important question going forward is how well de novo assembly methods using next-generation sequence data compare to the clone-based approach we have described here. We had the opportunity to compare an Illumina SOAP de novo assembly 7 against the clone-based discovery on the same individual genome (Supplementary Note). We found that many of the larger novel contigs were only partially represented (50–60%) in a 30X de novo assembly, and in more than a third of studied cases novel contigs were fragmented, mapping to two or more scaffolds instead of being placed in the same region. 
In many cases, the fragmentation corresponded to common repeats disrupting the contiguity of the novel sequence. In regions largely devoid of retrotransposons, de novo sequence assemblies using NGS datasets perform quite well. These results highlight both the limitation of de novo sequence assembly using NGS and the value of high-quality clone-based data to resolve and integrate these sequences into the reference genome. Nevertheless, there are advantages to de novo assembly. The de novo sequence assembly identifies 2–3 times more novel sequence per genome when compared to our results from 0.3X sequence coverage per genome, suggesting that the methods are complementary. Surprisingly, only 2.9% of our singletons from NA18507 (average size ~790 bp) were identified in the de novo assembly. Since these smaller insertions require more characterization, the significance of this discrepancy is unclear. The major benefit of our approach is the ability to directly obtain high-quality sequence for the insertion loci by complete sequencing of corresponding clone inserts at a quality commensurate with that of the human reference genome. While no complete missing genes were discovered, we did identify 477 elements that have been conserved over evolutionary time, six of which appear to correspond to exons from RefSeq genes as well as 26 loci having support from multiple mRNA-seq reads. Moreover, we demonstrate that these high-quality sequences can be utilized to accurately genotype these regions using next-generation sequence sets produced from the 1000 Genomes and other projects. The complete sequence of these and other loci will facilitate their functional characterization as they can now be incorporated into future genotyping platforms, expression microarrays, and ultimately future genome assemblies to provide a more accurate representation of the organization and genetic variation of the human genome.

AC222570	AC225984	AC225712
AC209232	AC208009	AC210970
AC216120	AC203638	AC210756
AC209007	AC208058	AC220966
AC225707	AC210765	AC203606
AC233719	AC231962	AC225889
AC231989	AC211712	AC207300
AC232310	AC221036	AC221038
AC223408	AC208950	AC208590
AC206609	AC233753	AC195766
AC213121	AC221035	AC208064
AC212752	AC226767	AC208582
AC217414	AC225829	AC216823
AC234425	AC226143	AC222569
AC206484	AC231982	AC214181
AC225099	AC231988	AC214074
AC207442	AC208324	AC234851
AC231953	AC216971	AC217009
AC226171	AC212910	AC208066
AC231273	AC217140	AC226007
AC226804	AC209310	AC233754
AC234230	AC232302	AC231198
AC234305	AC232224	AC231536
AC233758	AC236073	AC216138
AC225822	AC210437	AC226697
AC205940	AC209307	AC231958
AC208502	AC216281	AC209234
AC231414	AC207999	AC226139
AC233314	AC236778	AC209420
AC226621	AC208323	AC231117
AC211399	AC210886	AC233764
AC209551	AC213472	AC231964
AC208056	AC231980	AC234039
AC231540	AC236964	AC231118
AC226696	AC222568	AC212491
AC203610	AC213471	AC233714
AC207607	AC213468	AC208190
AC208786	AC216083	AC210438
AC213240	AC226495	AC217018
AC212901	AC208170	AC233712
AC213223	AC196541	AC232309
AC226593	AC212794	AC203636
AC208169	AC225768	AC231646
AC213029	AC225989	AC237148
AC225617	AC236926	AC237106
AC217954	AC208871	AC210544
AC203617	AC204980	AC226699
AC234852	AC205876	AC217012
AC226116	AC217326	AC196515
AC208716	AC233721	AC233722
AC226108	AC226724	AC207611
AC235759	AC232307	AC215339
AC203605	AC232301	AC206437
AC203644	AC231780	AC203630
AC208069	AC229891	AC231287
AC195745	AC206743	AC215799
AC206479	AC225034	AC214824
AC212759	AC231189	AC216089
AC204972	AC206474	AC215710
AC225710	AC203640	AC213440
AC207713	AC207981	AC209283
AC206930	AC233768	AC233720
AC215288	AC217064
AC215700	AC158320
AC235087	AC196513
AC225603	AC209546
AC234232	AC209618
AC217515	AC207588
AC233755	AC225728
AC223433	AC233756
AC158324	AC232304
AC209298	AC231288
AC231276	AC204974
AC223423	AC207173
AC226762	AC208103
AC217628	AC193150
AC226140	AC204971
AC234142	AC207777
AC203665	AC207366
AC231649	AC204963
