Literature DB >> 20164091

Accurate SNP and mutation detection by targeted custom microarray-based genomic enrichment of short-fragment sequencing libraries.

Michal Mokry¹, Harma Feitsma, Isaac J Nijman, Ewart de Bruijn, Pieter J van der Zaag, Victor Guryev, Edwin Cuppen.

Abstract

Microarray-based enrichment of selected genomic loci is a powerful method for genome complexity reduction for next-generation sequencing. Since the vast majority of exons in vertebrate genomes are smaller than 150 nt, we explored the use of short fragment libraries (85-110 bp) to achieve higher enrichment specificity by reducing carryover and adverse effects of flanking intronic sequences. High enrichment specificity (60-75%) was obtained with a relative even base coverage. Up to 98% of the target-sequence was covered more than 20x at an average coverage depth of about 200x. To verify the accuracy of SNP/mutation detection, we evaluated 384 known non-reference SNPs in the targeted regions. At approximately 200x average sequence coverage, we were able to survey 96.4% of 1.69 Mb of genomic sequence with only 4.2% false negative calls, mostly due to low coverage. Using the same settings, a total of 1197 novel candidate variants were detected. Verification experiments revealed only eight false positive calls, indicating an overall false positive rate of less than 1 per approximately 200,000 bp. Taken together, short fragment libraries provide highly efficient and flexible enrichment of exonic targets and yield relatively even base coverage, which facilitates accurate SNP and mutation detection. Raw sequencing data, alignment files and called SNPs have been submitted into GEO database http://www.ncbi.nlm.nih.gov/geo/ with accession number GSE18542.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2010 PMID： 20164091 PMCID： PMC2879533 DOI： 10.1093/nar/gkq072

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

The need for detection of SNPs and mutations in large genomic segments is increasing rapidly, partially as a result of genome-wide association studies (GWAS) that have pinpointed many genomic loci of interest for specific diseases and disease susceptibilities (1–5). Furthermore, the vastly increased throughput of massively parallel next-generation sequencing technologies enables the interrogation of unprecedented numbers of genes in a single analysis, permitting, for instance, the investigation of the complete protein-coding transcriptome (6). Enrichment of genomic loci by microarray hybridization followed by massively parallel sequencing has become an important method for targeted re-sequencing. The approach is based on hybridization of fragmented and adapter-ligated DNA to capturing probes printed on microarray slides (7–11), present in solution (12,13) or PCR products immobilized on filters (14) that are specifically designed for the regions of interest. After hybridization, non-targeted fragments are washed away and only the captured fragments are eluted for deep sequencing on any of the next-generation sequencing platforms (7–10). Generally, DNA fragment libraries with 500-bp fragment size are recommended and used for optimal efficiency because shorter fragments were reported to increase the number of off-target reads (8). However, many re-sequencing projects are focused on exons of protein coding genes. Because the median size of a human exon is only 120 bp (with 70% of all exons shorter than 200 bp) (Figure 1), long-fragment libraries can have severe limitations. Most importantly, many of the specifically captured DNA from long-fragment libraries will consist of sequences derived from introns flanking the exons of interest, which decreases the effective sequencing yield. To address this issue, we have explored the efficiency of enrichment of DNA fragment libraries with much shorter fragments (85–110 bp). We used human genomic DNA and an exon-centric capturing design on commercially available custom microarray slides. We developed an effective probe design strategy with tiled oligonucleotides for capturing both genomic strands, resulting in targeting of 1.69 Mb of exonic sequences on a 244 K array. To validate the performance of the microarray-based enrichment strategy, we have tested the specificity of enrichment and evaluated the effectiveness of retrieval of known SNPs that were present in the targeted regions. When applying highly stringent filtering criteria, we were able to evaluate 93% of the target regions without any false-negatives. Moreover, independent verification experiments on newly discovered polymorphisms revealed an overall false-positive rate of less than 1 per 200 kb. These results indicate that the use of short fragment libraries results firstly in highly efficient enrichment with relatively even base coverage over the targeted region and that secondly the sensitivity and specificity of SNP/mutation calling is very high.

Figure 1.

Size distribution of human exons. The median size of human exons is only 120 bp (with 70% of all exons shorter than 200 bp). Therefore many of the specifically captured DNA from long-fragment libraries will consist of sequences derived from introns flanking the exons of interest, which decreases the effective sequencing yield.

MATERIALS AND METHODS

Enrichment array design

Exonic sequences of 1621 genes were selected from hg18 build of the human genome spanning a total length of targeted sequence of 1.69 Mb. Genes were selected based on the potential presence of exonic SNPs with presumed functional effects and thus potential clinical relevance (15). Sixty nucleotide long probes were designed with an average tilling density of 10 bp for both negative and positive strands. The probe selection strategy was set using the following rules: all possible 60-mer probes starting in a 10 bases long window were collected and a single probe with the lowest penalty score (see below) was selected. This procedure was repeated for every 10 nt bin in the region of interest, which presented all coding exons of selected genes. Penalty scores were calculated as follows: 4 points if Tm is <77°C or >81°C with Tm defined as: where nG is total number of guanidines, nC is total number of cytosines and N represents oligonucleotides length, 2 points per homo-polymer longer than 5 bp, 1 point per each base over or below the limit (C or G fraction <15% and >25%, A or T fraction <25% and >35%). To exclude potentially repetitive elements from the design, all probes were compared to the reference genome using BLAST and those returning more than one hit (as defined by E-value cutoff <0.01) were discarded from the design. Probes were synthesized on custom 244 k Agilent arrays with randomized positions.

Library preparation

DNA was fragmented for 6 min using a Covaris S2 sonicator (6 × 16 mm AFA fiber Tube, duty cycle: 20%, intensity: 5, cycles/burst: 200, frequency sweeping). After fragmentation, fragments were blunt-ended and phosphorylated at the 5′-end using End-it Kit (Epicentre) according to the manufacturer’s instructions, followed by ligation of double-stranded short adapters (adapter 1: pre-annealed duplex of 5′-CTA TGG GCA GTC GGT GAT-3′ and 5′-ATC ACC GAC TGC CCA TAG TTT-3′ and adapter 2: pre-annealed duplex of 5′-CGC CTT GGC CGT ACA GCA G-3′ and 5′-GCT GTA CGG CCA AGG CG-3′; all oligo’s were acquired through Integrated DNA Technologies (Coralville, IA, USA) and pre-annealing was done by mixing complementary oligonucleotides at 500 µM concentration and running on thermocycler with the following program: 95°C for 3 min, 80°C for 3 min, 70°C for 3 min, 60°C for 3 min, 50°C for 3 min, 40°C for 3 min and 4°C hold). Ligation was performed using Quick ligation kit (New England Biolabs) with 1 µg of fragmented DNA, 750 nM adaptor 1 and adaptor 2, 150 µl of 2× Quick ligation buffer and 5 µl Quick Ligase in a total volume of 300 µl. Samples were purified on Ampure beads (Agencourt) and run on a native 6% polyacrylamide gel. Fragments ranging from 125 to 150 bp were excised; the piece of gel containing fragments was shredded and dispersed into 400 µl of Platinum PCR Supermix with 750 nM of both amplification PCR primers (provide sequence of amplification primers), 2.5 U of Pfu DNA polymerase (Stratagene) and 5 U Taq DNA polymerase (Bioline). Before ligation-mediated amplification, the PCR sample was incubated at 72°C for 20 min in PCR mix to let the DNA diffuse from the gel and to perform nick translation on non-ligated 3′-ends. After eight cycles of amplification, the library DNA was purified on Ampure beads and the quality was checked on a gel for the proper size range and the absence of adapter dimers and heterodimers. This library served as a stock for all subsequent hybridization experiments.

Enrichment hybridization and elution

Prior to hybridization, 50 ng of stock library was amplified using 10 cycles in 1000 µl of Platinum PCR Supermix with 750 nM of both amplification PCR primers to produce a sufficient amount of library DNA necessary for enrichment (Table 1). Amplified library DNA was subsequently purified using a MinElute Reaction Cleanup Kit (Qiagen). Amplified DNA was mixed with 5 × weight excess of human Cot-1 DNA (Invitrogen) and concentrated using a speedvac to a final volume of 12.3 μl. DNA was mixed with 31.7 μl Nimblegen aCGH hybridization solution and denatured at 95°C for 5 min. After denaturing, the sample was hybridized for 65 h at 42°C on a 4-bay MAUI hybridization station using an active mixing MAUI AO chamber (MAUI). After hybridization, the array was washed using the Nimblegen Wash Buffer Kit according to the user’s guide for aCGH hybridization. The temperature of Wash buffer I for Library 2 was 42°C instead of room temperature. Elution was performed using 800 μl of elution buffer (10 mM Tris pH 8.0) in an Agilent Microarray Hybridization Chamber at 95°C for 30 min. After 30 min, the chamber was quickly disassembled and elution buffer collected into a separate 1.5 ml tube. Microarray slides were dipped into re-distilled water and stored for re-use. Eluted library DNA was concentrated in a speedvac to a final volume of 50 μl and amplified with a limited number of PCR cycles (12–14 cycles) with full-length primers (amp-P1: 5′-CCA CTA CGC CTC CGC TTT CCT CTC TAT GGG CAG TCG GTG AT and amp-P2: 5′-CTG CCC CGG GTT CCT CAT TCT CTN NNN NNN NNN CTG CTG TAC GGC CAA GGC G, where N represent unique barcode sequence for each library) to introduce barcode sequences as well as adapter sequences required for SOLiD sequencing.

Table 1.

Enrichment statistics

	Library1	Library2	Library3	^cLibrary1
Length of sequenced tags	35	35	35	50
Microarray slide	new	new	reused	new
Amount of hybridized DNA (µg)	3	3	6.5	3
Washing temperature	RT	42°C	RT	RT
Mappable tags (millions)	6.64	18.88	17.05	14.10
Uniquely mappable tags (millions)	4.97	12.70	13.98	10.78
Mappable sequence on target (%)	56	40	61	53
Uniquely mappable sequence on target (%)	67	60	75	69
Bases covered 1× (%)	99.59	99.38	99.88	99.97
Bases covered 10× (%)	92.49	93.18	98.08	99.37
Bases covered 20× (%)	83.17	86.77	95.46	98.13
Bases covered with 10% of average coverage^a (%)	95	90	95	98
Bases covered with 25% of average coverage^a (%)	86	77	86	92
Bases covered with 50% of average coverage^a (%)	69	60	69	76
Evenness score E (%)^b	70.2	62.4	70.1	74.8

aThe percentage of bases covered with a given percentage of the average coverage is a better measurement for comparison of coverage evenness than the percentage of bases covered with a certain depth, because it is independent on overall depth of sequencing.

bEvenness score represents the fraction of sequenced bases that do not have to be redistributed from above-average coverage to below-average coverage positions to obtain completely even coverage for all targeted positions. This is a measurement that is relatively independent on sequencing depth (see text).

cThe last column gives the results of an experiment in which the library resulting from a first enrichment experiment was sequenced using the Solid V3 update, which provides 50-bp read lengths.

Enrichment statistics aThe percentage of bases covered with a given percentage of the average coverage is a better measurement for comparison of coverage evenness than the percentage of bases covered with a certain depth, because it is independent on overall depth of sequencing. bEvenness score represents the fraction of sequenced bases that do not have to be redistributed from above-average coverage to below-average coverage positions to obtain completely even coverage for all targeted positions. This is a measurement that is relatively independent on sequencing depth (see text). cThe last column gives the results of an experiment in which the library resulting from a first enrichment experiment was sequenced using the Solid V3 update, which provides 50-bp read lengths.

SOLiD sequencing

To achieve clonal amplification of library fragments on the surface of sequencing beads, emulsion PCR (emPCR) was performed according to the manufacturer’s instructions (Applied Biosystems). A total of 600 pg of double stranded library DNA was added to 5.6 ml of PCR mix containing 1× PCR Gold Buffer (Applied Biosystems), 3000 U AmpliTaq Gold, 20 nM emPCR primer 1, 3 µM of emPCR primer 2, 3.5 mM of each deoxynucleotide, 25 mM MgCl2 and 1.6 billion SOLiD sequencing beads (Applied Biosystems). PCR mix was added to SOLiD emPCR Tube containing 9 ml of oil phase and emulsified using ULTRA-TURRAX Tube Drive (IKA). The PCR emulsion was dispensed into 96-well plate and cycled for 60 cycles. After amplification, the emulsion was broken with butanol, beads were enriched for template-positive beads, 3′-end extended and covalently attached onto sequencing slides. Four physically separated samples were deposited on one sequencing slide and sequenced using SOLiD system version 2 to produce 35-base long reads. Library 1 has been additionally sequenced using the SOLiD version 3 system to produce 50-base long reads in a barcoded experimental setup.

Mapping of sequencing data and SNP calling

Sequencing reads were mapped against the reference genome (hg18 assembly, NCBI build 36) using the Maq package (16), which allows mapping in SOLiD color space corresponding to dinucleotide encoding of the sequenced DNA with following settings: number of maximum mismatches that can always be found −n 3, threshold on the sum of mismatching base qualities −e 150. Raw variant positions were called by the Maq package and filtered using custom scripts (available upon request). For stringent SNP calling, we used the following filtering settings: (i) positions with <20× and >5000× coverage were excluded, (ii) each of non-reference alleles had to be supported by at least three independent reads (as determined by different read start positions) separately on positive and negative strand with quality >10, (iii) the non-reference allele should account for at least 20% of the reads covering the polymorphic position, and (iv) the ratio between + and – strand reads should be between 1/9 and 9 (Table 2). Positions that passed these filtering settings were considered as SNPs. A SNP was qualified as homozygous when the fraction of non-reference alleles was above 95% and heterozygous when the fraction of non-reference alleles was between 20% and 95%.

Table 2.

SNP calling statistics

	Library 1	Library 2	Library 3	Library 1
Length of sequenced tags	35	35	35	50
Average coverage	67×	148×	204×	213×
SNP positions validated	384	384	384	384
SNP positions filtered due to low coverage (<20×) (%)	18.8	25.8	3.4	1.8
SNPs with enough coverage filtered out for other reasons (%)^a	26.5	21.1	27.9	6.0
SNPs identified after filtering (with at least 20× coverage) (%)	54.7	53.1	68.7	92.2
False-negative discovery rate (%)	45.3	46.9	31.3	7.8

aDue to low base quality, or large strand bias.

SNP calling statistics aDue to low base quality, or large strand bias.

Calculation of evenness score

A crucial parameter for assessing the effectiveness of any enrichment method is the evenness of coverage (12). Here, we introduce a dedicated parameter to represent the evenness of coverage score, E. This score intends to describe the uniformity of base coverage over targeted regions. Together with the percentage of sequenced bases on target, which determines the enrichment level, this score can be used as an objective measure to compare different enrichment experiments as the parameter E is quite insensitive to sequencing depth (Figure 2C). The evenness score, E, represents the fraction of whole-sequencing throughput that is correctly distributed. Consequently, 1-E represents the fraction of the (whole) sequencing output that still has to be redistributed from positions with coverage above average to positions with coverage below average (by better enrichment) to get the ideal even coverage over all targeted positions. The more even the coverage, the higher the evenness score: E will be 100% for completely uniform coverage of every base in the targeted regions and approaches 0% in case of extreme non-uniform distributions. From Figure 2B, one can appreciate that E is equivalent to the area under the curve. Hence, a formula for E can be readily arrived at by summing for all percentage positions up to the normalized coverage of 1, as in the ideal case all positions will have a coverage at least equal to the average coverage. Thus the evenness score, E is defined as: Where P is defined as number of targeted positions with at least coverage C, Cave is defined as the average coverage through all targeted positions and NTP is defined as a total number of targeted positions. This formula can be rewritten in a form which is numerically more attractive, as Cave may not be an integer number: Where F(i) is the fraction of positions with normalize coverage of at least C(i)/Cave. This fraction equals P/NTP, where P is percentage of position with a coverage of at least C(i) and NTP is the total number of targeted positions. Cave is the average coverage over all targeted positions. This integral corresponds to the area under the curve of the graph between a normalized coverage of 0 and 1 (Figure 2B).

Figure 2.

Comparison of sequence coverage evenness after enrichment. The fraction of target positions with at least that coverage was as compared to the average coverage. (A) Comparison of various enrichment (washing temperature and input DNA) and sequencing (35- versus 50-mer) conditions. Library 1 sequenced by 50-mer reads results in the most even coverage compared to other libraries. The brown curve depicts the best possible evenness for an ideal evenly enriched sample with 100 × average coverage, where the unevenness is purely caused by statistical randomness in the coverage assuming a Poisson distribution of the sequencing reads. (B) The evenness score, E, represents the fraction of whole sequencing throughput that is correctly distributed (marked area below the curve). Consequently, 1-E represents the fraction of the (whole) sequencing output that has to be redistributed from positions with coverage above average to positions with coverage below average (by better enrichment) to get the ideal even coverage over all targeted positions. The more even the coverage, the higher the evenness score. (C) Correlation of evenness score E for randomized sets to the sequencing depth. In this simulation, the unevenness of these datasets is purely caused by the random distribution of reads and fits a Poisson distribution of sequence coverage. When the discrete character of the data is reduced by sufficient depth of coverage, E changes only slightly with increasing average coverage and thus can be characterized as relatively independent of sequencing depth. (D) Comparison between + and – strand coverage for 35- and 50-mer reads. In the case of 50-mer reads, the coverage is more even with fewer positions covered by extremely low (or high) numbers of sequencing tags. This difference is more prominent when the coverage is determined separately for the positive or negative strand. Independent strand coverage is better for 50-mer than for 35-mer sequencing.

llumina SNP genotyping

The DNA sample that was used in this study was genotyped using an Illumina HumanHap550+ Genotyping BeadChip through 23andMe services (http://www.23andme.com). A total of 384 genotyped SNPs, which are either heterozygous or homozygous non-reference in the sample, are located in the 1.69 Mb region of our interest. These positions were used as a reference set for identifying false negatives in our sequencing dataset.

RESULTS AND DISCUSSION

Specificity of enrichment

Short fragment (85–110 nt) libraries were made using focused acoustic fragmentation (Covaris) and used for enrichment on custom-designed 244 K Agilent microarrays. The specificity of the enrichment was determined by sequencing on an ABI SOLiD sequencer version 2 with a quadrant slide capacity per sample (loaded at various densities). This resulted in 219–676 millions of mappable bases out of which 40–61% mapped directly to the targeted regions (Table 1). Increasing the stringency by raising the washing temperature from room temperature to 42°C (Library 2) did not increase the enrichment efficiency (Table 1). In contrast, the evenness of coverage E did decrease appreciably from 70% to 62 % (Table 1 and Figure 2A), suggesting selective loss of specific target regions. Increasing the amount of DNA for hybridization (Library 3) had no effect at all. The percentage of on-target bases increases to 60–75% when only taking into account bases from reads that could be placed completely uniquely on the genome. Our results contrast with previously reported results (8), where the use of 100–200 bp fragments resulted in only 29% percent of reads mapping to targeted exons. The observed difference could possibly be explained by differences in probe design strategy and/or array platform (Nimblegen versus Agilent). A different approach for enrichment using 170-nt long biotinylated RNA probes (12), resulted in 42–50% of sequenced bases mapping directly to targeted exons. An alternative solution-based approach used molecular inversion probes (MIPs) (13) for selective capturing of 55 000 human exons and resulted in >99% of all reads mapping to the targeted regions. However, since the first 20 bp of each sequenced read are always coming from the MIPs only the remainder of the sequence is informative for genotyping and a substantial proportion of the overall sequencing output thus has to be discarded as non-informative. Moreover, exons longer than the maximum read length of the sequencing platform will need multiple probe designs for capturing and sequencing. As no commercial solutions are available for cost-effective synthesis of high-quality, long oligonucleotides (>100 nt), which are required for this approach, widespread implementation of this method is questionable.

Evenness of coverage

During enrichment not all DNA hybridizes to capture probes with the same efficiency. As a result, targeted sequences are covered unevenly with sequencing reads. For a limited number of regions there even is no coverage at all. Generally, the more even the coverage distribution is, the less overall sequencing depth is required for variant detection. In our experiments 60–76% of targeted regions are covered with >50% of average coverage, 77–92% of targeted regions are covered with >25% of average coverage and 90–98% of reads have >10% of average coverage (Table 1 and Figure 2A). In addition, we covered 99.38–99.97% of targeted positions with at least one read suggesting significantly better evenness of coverage compared to other studies, where 82% (10); 95% (12) or 99.21% (17) of targeted positions were covered with at least one read. For better comparison of coverage evenness, we introduce a new parameter, the evenness score E (see section ‘Materials and Methods’ section and Figure 2B). E is relatively independent of sequence coverage and thus enables the comparison of different libraries with varying sequencing depth. The relative independence of the evenness score E to sequencing depth is shown in Figure 2C. In this figure, the correlation of E for a randomized read sets assuming a Poisson read distribution, is calculated as a function of the sequencing depth. Figure 2C shows that once the sequencing coverage is sufficiently high (>50×), the evenness score E changes only slightly with increasing average coverage and thus can be characterized as relatively independent of the sequencing depth. The evenness score represents the area under the curve in a fraction of the positions with at least that coverage vs normalized coverage at X = 1 (see in Figure 2B). In other words, E represents the fraction of sequenced bases that do not have to be redistributed from above-average coverage to below-average coverage positions to obtain completely even coverage for all targeted positions. The evenness score, E = 100% for perfectly uniform coverage and approaches 0% in cases of extreme non-uniform distributions. By using short-fragment libraries, we achieved even distributions with evenness scores (E) ranging from 62.4% (Library 2) to 74.8% (Library 1 sequenced with 50-nt read length). Sequencing of the same enriched library on SOLiD V3 with 50-nt long reads instead of 35-nt long reads with SOLiD V2 resulted in a better evenness score with E, rising from 70.2% to 74.8% (Table 1 and Figure 2A). The most obvious explanation for this observation would be that the longer reads resulted in improved bridging of regions with lower capture efficiency due to fragments captured by well-performing flanking probes. In addition, due to low complexity of genomic regions, 35-mer tags cannot be mapped uniquely to a substantial part of the genome and this fraction is reduced with 50-mers. To illustrate that the described approach universally results in high evenness, we analyzed additional experiments that were performed with the same experimental protocol, but with different array designs and/or species (Supplementary Table S1 and Supplementary Figure S1). We do find a consistent high evenness score, even between organisms (human, rat and Arabidopsis). Moreover, we reanalyzed publically available datasets from recently published genomic enrichment experiments (6,11,18) showing that our experiments result consistently in more even coverage, especially when considering strand-specific coverage (Supplementary Table S1 and Supplementary Figure S1). The latter is especially important, as we observed in our dataset that proper coverage on both strands is instrumental for reducing false positive heterozygote SNP calling. This is most likely due to systematic errors that are introduced by the sequencing process due to platform-specific biases in sequencing chemistry, which is in all cases context dependent and thus different for the + and – strand.

Sequencing of positive and negative DNA strand

We found that having sequencing data mapping to both positive and negative strand is an important factor in reducing false positive variant calls. Therefore, we evaluated the evenness of coverage with respect to DNA strand. A substantial part of the targeted sequence was covered by sequencing tags coming from only one strand in case of libraries sequenced as 35-mers with SOLiD V2 chemistry (Figures 2D and 3). Increasing the read length to 50-mers improved double-stranded coverage markedly, in line with the observed overall base coverage (Figures 2D and 3).

Figure 3.

Exemplary representation of target coverage after enrichment. Sequencing results of Library 1 are shown for 35-mer (green) and 50-mer (purple) sequencing. Total, positive and negative strand reads are shown independently. Coverage is more equal and better represented by both strands for the longer sequencing reads.

Improvements of sequencing coverage

Since the contribution of the random character of sequencing to unevenness of coverage was minor (Figure 2C), further improvements in the evenness of the sequencing coverage could be obtained by improvement of the enrichment procedure. The strategy used in our array design resulted in overlapping probes and did not provide much opportunity for redesigning probes for poorly enriched regions. Therefore, we included various probes at variable quantities in our test design. We found a very strong correlation between the number of probes and eventual base coverage (Figure 4). Spotting more copies of the same probe for underperforming regions therefore seems an effective strategy for improving E. Furthermore, these results also indicate that a limiting factor for DNA yield after elution could be the number of probe molecules and their saturation after 65 h of hybridization and not the depletion of targeted library molecules. This is supported by the observation that increasing the amount of library DNA during hybridization had no effect on enrichment efficiency and coverage. However, we cannot exclude that the efficiency of mixing during hybridization is limited and that local depletion of target sequences occurred, without saturating the capturing probes. Presence of capture probes at physically separated locations (which is the case in the random design used in these experiments) would in such case also improve capturing efficiency.

Figure 4.

Correlation of probe density and sequencing coverage. Each genomic region was represented on the array with a variable number of capture probes throughout the region. The sequencing coverage per base (blue line) linearly correlates with probe density (red line). Yet another possibility for improvement of sequence coverage can be expected from further increasing sequencing read length, in line with our results obtained for 35- versus 50-mer reads.

Identification of polymorphic positions

To determine the accuracy of SNP detection, we compared SNPs called from sequencing data obtained from enriched short-fragment libraries and those genotyped by an Illumina HumanHap550+ Genotyping BeadChip. A total of 384 SNP genotyped positions were different from the reference allele (either heterozygous or homozygous) and were located in the targeted regions. From those 384 SNP positions, 53.1– 92.2% passed the stringent criteria of our SNP filtering pipeline, which required sufficient high-quality base coverage on both DNA strands. In concordance with evenness of coverage, Library 1 sequenced on SOLiD V3 with 50-nt long reads gave the best results with only a 7.8% false negative discovery rate over the complete targeted region. From this total of 7.8% false-negative SNP positions, 3.6% had not been covered with (i) at least 20 reads, and (ii) at least 3 reads from each of the strands or (iii) did not have a coverage ratio from both strands within the limits set at 1/9 and 9. Consequently, this part of the targeted regions could already be marked as not surveyed, even prior to SNP calling, since this part would not pass our minimal requirements of SNP calling. Such a prediction will be important for clinical diagnostic purposes because this enables one to predict which regions have not been sequenced sufficiently deep for reliable SNP calling. Taken together, at ∼200× average sequence coverage, we were able to survey 96.4% of 1.69 Mb of genomic sequence with only 4.2% of false negative calls, while 3.6% of targeted regions had to be marked as unsurveyed. The better performance of Library 1 could be explained by better evenness of coverage (more positions have sufficient base coverage for reliable SNP calling) as well as by better strand balance where more positions have good coverage coming from tags mapping to both negative and positive strands (Figures 2D and 3). Another explanation for the better performance of longer reads could be a reduction of mappability bias. While mapping sequencing reads, non-reference alleles are more likely to be discarded due to low mapping quality, since they already have one mismatch (two mismatches in SOLiD color space) compared to reads coming from reference alleles. Additionally, capture of non-reference DNA molecules to reference capture probes may be slightly less effective. Indeed, the overall frequency of non-reference allele reads for heterozygous positions was shifted downward form the 50% position, although not dramatically (Figure 5). The mappability issue is even more serious for SNPs that have additional linked SNPs in close vicinity. This bias is less prone for longer reads since one mismatch contributes less to overall mapping accuracy in 50-mers compared to 35-mers.

Figure 5.

Distribution of non-reference allele reads. The percentage of non-reference allele reads was calculated for every heterozygous and homozygous non-reference allele position in the targeted region (n = 1197) and is represented in bins of 5%. For heterozygous calls, the distribution is skewed towards reference allele reads.

Detection of novel variants

We analyzed the results of Library 1 (50-mer reads) for the presence of polymorphisms. A total of 1197 SNPs were identified within the targeted regions plus 30-bp flanking intronic regions using our SNP detection pipeline. This set included the 384 previously genotyped SNPs as well as 759 other polymorphisms that were already present in dbSNP129 or the Ensembl database. We considered these 1143 (95.5%) SNPs as validated and set out to validate the remaining 54 SNPs by PCR-based dideoxy resequencing. We failed to develop working assays for 10 candidate SNPs, most likely due to the repetitive nature of the genomic environment. We found that only eight of the remaining candidates, all heterozygote scores, were identified as false positives. Altogether, our results indicate a false positive rate of less than one per ∼200 000 bp (0.0005%). All eight false positive SNPs tended to have a lower than average percentage of non-reference allele reads and/or low overall base coverage compared to true positives (Figure 6). Although we only used planar microarrays in our experiments, we believe that the characteristics of short-fragment libraries described here will be equally applicable to any hybridization-based approach, including in-solution methods.

Figure 6.

Sequencing coverage and percentage of non-reference allele distribution for validated and non-validated SNPs. All polymorphic and non-reference positions that were identified by the SNP detection pipeline are plotted as a function of total base coverage versus non-reference read frequency. Validated SNPs (either by their presence in dbSNP or by resequencing) are indicated in blue, non-validated SNPs are shown in red and positions for which no working validation assay could be designed in green. False-positive SNPs tend to have a lower percentage of non-reference allele reads and/or low overall coverage.

Input DNA requirements

Relatively high amounts of DNA are normally used for enrichment procedures, which can be a limiting factor for many clinical applications. In our experiments, we used only 1 μg of genomic DNA for all experiments shown. The fragment library was made and amplified with only eight PCR cycles to produce a stock library, which was sufficient for at least 20 independent enrichment procedures as described here. Before enrichment a small proportion (50 ng) of the initial library was amplified with 10 PCR cycles to produce sufficient material for hybridization. After enrichment the eluted library was amplified by an additional 13 cycles to produce amounts sufficient for accurate quantification and sequencing. By using these logistics, the stock library could be used multiple times for different enrichment experiments, taking away the need for re-isolating DNA or re-preparing sequencing libraries. In addition, most of the amplification cycles (18 out of 31 in total) were done before hybridization. Potential biases in coverage caused by PCR could, in theory, be normalized again during the hybridization step. However, this requires that capturing probes, rather than target molecules, are the limiting factor in this step. Detailed analysis of our deep-sequencing results revealed no unexpected clonality bias due to the amplification steps.

CONCLUSIONS

Short-fragment libraries provide highly efficient enrichment characteristics for exonic targets. The relative even base and strand coverage facilitates accurate SNP and mutation retrieval and discovery. To measure the evenness of coverage, which is relatively independent of sequencing depth, we have introduced the parameter E [see Equations (2 and 3) and Figure 2]. This score can be applied to any genomic enrichment experiment and in combination with the percentage of reads on target it provides the possibility to compare the efficiency of different approaches.

ACCESSION NUMBER

Raw sequencing data, alignment files and called SNPs have been submitted into GEO database http://www.ncbi.nlm.nih.gov/geo/ with accession number GSE18542.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Hubrecht Institutes will pay for page charges from its primary funding from the Royal Dutch Academy of Sciences (KNAW). Conflict of interest statement. None declared.

17 in total

1. Genome-wide in situ exon capture for selective resequencing.

Authors: Emily Hodges; Zhenyu Xuan; Vivekanand Balija; Melissa Kramer; Michael N Molla; Steven W Smith; Christina M Middle; Matthew J Rodesch; Thomas J Albert; Gregory J Hannon; W Richard McCombie
Journal: Nat Genet Date: 2007-11-04 Impact factor: 38.330

2. Direct selection of human genomic loci by microarray hybridization.

Authors: Thomas J Albert; Michael N Molla; Donna M Muzny; Lynne Nazareth; David Wheeler; Xingzhi Song; Todd A Richmond; Chris M Middle; Matthew J Rodesch; Charles J Packard; George M Weinstock; Richard A Gibbs
Journal: Nat Methods Date: 2007-10-14 Impact factor: 28.547

3. Massively parallel exon capture and library-free resequencing across 16 genomes.

Authors: Emily H Turner; Choli Lee; Sarah B Ng; Deborah A Nickerson; Jay Shendure
Journal: Nat Methods Date: 2009-04-06 Impact factor: 28.547

4. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

5. Targeted next-generation sequencing by specific capture of multiple genomic loci using low-volume microfluidic DNA arrays.

Authors: Stephan Bau; Nadine Schracke; Marcel Kränzle; Haiguo Wu; Peer F Stähler; Jörg D Hoheisel; Markus Beier; Daniel Summerer
Journal: Anal Bioanal Chem Date: 2008-10-29 Impact factor: 4.142

6. Genome-wide association study identifies a new breast cancer susceptibility locus at 6q25.1.

Authors: Wei Zheng; Jirong Long; Yu-Tang Gao; Chun Li; Ying Zheng; Yong-Bin Xiang; Wanqing Wen; Shawn Levy; Sandra L Deming; Jonathan L Haines; Kai Gu; Alecia Malin Fair; Qiuyin Cai; Wei Lu; Xiao-Ou Shu
Journal: Nat Genet Date: 2009-02-15 Impact factor: 38.330

7. Targeted capture and massively parallel sequencing of 12 human exomes.

Authors: Sarah B Ng; Emily H Turner; Peggy D Robertson; Steven D Flygare; Abigail W Bigham; Choli Lee; Tristan Shaffer; Michelle Wong; Arindam Bhattacharjee; Evan E Eichler; Michael Bamshad; Deborah A Nickerson; Jay Shendure
Journal: Nature Date: 2009-08-16 Impact factor: 49.962

8. Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2.

Authors: Shahana Ahmed; Gilles Thomas; Maya Ghoussaini; Catherine S Healey; Manjeet K Humphreys; Radka Platte; Jonathan Morrison; Melanie Maranian; Karen A Pooley; Robert Luben; Diana Eccles; D Gareth Evans; Olivia Fletcher; Nichola Johnson; Isabel dos Santos Silva; Julian Peto; Michael R Stratton; Nazneen Rahman; Kevin Jacobs; Ross Prentice; Garnet L Anderson; Aleksandar Rajkovic; J David Curb; Regina G Ziegler; Christine D Berg; Saundra S Buys; Catherine A McCarty; Heather Spencer Feigelson; Eugenia E Calle; Michael J Thun; W Ryan Diver; Stig Bojesen; Børge G Nordestgaard; Henrik Flyger; Thilo Dörk; Peter Schürmann; Peter Hillemanns; Johann H Karstens; Natalia V Bogdanova; Natalia N Antonenkova; Iosif V Zalutsky; Marina Bermisheva; Sardana Fedorova; Elza Khusnutdinova; Daehee Kang; Keun-Young Yoo; Dong Young Noh; Sei-Hyun Ahn; Peter Devilee; Christi J van Asperen; R A E M Tollenaar; Caroline Seynaeve; Montserrat Garcia-Closas; Jolanta Lissowska; Louise Brinton; Beata Peplonska; Heli Nevanlinna; Tuomas Heikkinen; Kristiina Aittomäki; Carl Blomqvist; John L Hopper; Melissa C Southey; Letitia Smith; Amanda B Spurdle; Marjanka K Schmidt; Annegien Broeks; Richard R van Hien; Sten Cornelissen; Roger L Milne; Gloria Ribas; Anna González-Neira; Javier Benitez; Rita K Schmutzler; Barbara Burwinkel; Claus R Bartram; Alfons Meindl; Hiltrud Brauch; Christina Justenhoven; Ute Hamann; Jenny Chang-Claude; Rebecca Hein; Shan Wang-Gohrke; Annika Lindblom; Sara Margolin; Arto Mannermaa; Veli-Matti Kosma; Vesa Kataja; Janet E Olson; Xianshu Wang; Zachary Fredericksen; Graham G Giles; Gianluca Severi; Laura Baglietto; Dallas R English; Susan E Hankinson; David G Cox; Peter Kraft; Lars J Vatten; Kristian Hveem; Merethe Kumle; Alice Sigurdson; Michele Doody; Parveen Bhatti; Bruce H Alexander; Maartje J Hooning; Ans M W van den Ouweland; Rogier A Oldenburg; Mieke Schutte; Per Hall; Kamila Czene; Jianjun Liu; Yuqing Li; Angela Cox; Graeme Elliott; Ian Brock; Malcolm W R Reed; Chen-Yang Shen; Jyh-Cherng Yu; Giu-Cheng Hsu; Shou-Tung Chen; Hoda Anton-Culver; Argyrios Ziogas; Irene L Andrulis; Julia A Knight; Jonathan Beesley; Ellen L Goode; Fergus Couch; Georgia Chenevix-Trench; Robert N Hoover; Bruce A J Ponder; David J Hunter; Paul D P Pharoah; Alison M Dunning; Stephen J Chanock; Douglas F Easton
Journal: Nat Genet Date: 2009-03-29 Impact factor: 38.330

9. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants.

Authors: Sekar Kathiresan; Benjamin F Voight; Shaun Purcell; Kiran Musunuru; Diego Ardissino; Pier M Mannucci; Sonia Anand; James C Engert; Nilesh J Samani; Heribert Schunkert; Jeanette Erdmann; Muredach P Reilly; Daniel J Rader; Thomas Morgan; John A Spertus; Monika Stoll; Domenico Girelli; Pascal P McKeown; Chris C Patterson; David S Siscovick; Christopher J O'Donnell; Roberto Elosua; Leena Peltonen; Veikko Salomaa; Stephen M Schwartz; Olle Melander; David Altshuler; Diego Ardissino; Pier Angelica Merlini; Carlo Berzuini; Luisa Bernardinelli; Flora Peyvandi; Marco Tubaro; Patrizia Celli; Maurizio Ferrario; Raffaela Fetiveau; Nicola Marziliano; Giorgio Casari; Michele Galli; Flavio Ribichini; Marco Rossi; Francesco Bernardi; Pietro Zonzin; Alberto Piazza; Pier M Mannucci; Stephen M Schwartz; David S Siscovick; Jean Yee; Yechiel Friedlander; Roberto Elosua; Jaume Marrugat; Gavin Lucas; Isaac Subirana; Joan Sala; Rafael Ramos; Sekar Kathiresan; James B Meigs; Gordon Williams; David M Nathan; Calum A MacRae; Christopher J O'Donnell; Veikko Salomaa; Aki S Havulinna; Leena Peltonen; Olle Melander; Goran Berglund; Benjamin F Voight; Sekar Kathiresan; Joel N Hirschhorn; Rosanna Asselta; Stefano Duga; Marta Spreafico; Kiran Musunuru; Mark J Daly; Shaun Purcell; Benjamin F Voight; Shaun Purcell; James Nemesh; Joshua M Korn; Steven A McCarroll; Stephen M Schwartz; Jean Yee; Sekar Kathiresan; Gavin Lucas; Isaac Subirana; Roberto Elosua; Aarti Surti; Candace Guiducci; Lauren Gianniny; Daniel Mirel; Melissa Parkin; Noel Burtt; Stacey B Gabriel; Nilesh J Samani; John R Thompson; Peter S Braund; Benjamin J Wright; Anthony J Balmforth; Stephen G Ball; Alistair S Hall; Heribert Schunkert; Jeanette Erdmann; Patrick Linsel-Nitschke; Wolfgang Lieb; Andreas Ziegler; Inke König; Christian Hengstenberg; Marcus Fischer; Klaus Stark; Anika Grosshennig; Michael Preuss; H-Erich Wichmann; Stefan Schreiber; Heribert Schunkert; Nilesh J Samani; Jeanette Erdmann; Willem Ouwehand; Christian Hengstenberg; Panos Deloukas; Michael Scholz; Francois Cambien; Muredach P Reilly; Mingyao Li; Zhen Chen; Robert Wilensky; William Matthai; Atif Qasim; Hakon H Hakonarson; Joe Devaney; Mary-Susan Burnett; Augusto D Pichard; Kenneth M Kent; Lowell Satler; Joseph M Lindsay; Ron Waksman; Christopher W Knouff; Dawn M Waterworth; Max C Walker; Vincent Mooser; Stephen E Epstein; Daniel J Rader; Thomas Scheffold; Klaus Berger; Monika Stoll; Andreas Huge; Domenico Girelli; Nicola Martinelli; Oliviero Olivieri; Roberto Corrocher; Thomas Morgan; John A Spertus; Pascal McKeown; Chris C Patterson; Heribert Schunkert; Erdmann Erdmann; Patrick Linsel-Nitschke; Wolfgang Lieb; Andreas Ziegler; Inke R König; Christian Hengstenberg; Marcus Fischer; Klaus Stark; Anika Grosshennig; Michael Preuss; H-Erich Wichmann; Stefan Schreiber; Hilma Hólm; Gudmar Thorleifsson; Unnur Thorsteinsdottir; Kari Stefansson; James C Engert; Ron Do; Changchun Xie; Sonia Anand; Sekar Kathiresan; Diego Ardissino; Pier M Mannucci; David Siscovick; Christopher J O'Donnell; Nilesh J Samani; Olle Melander; Roberto Elosua; Leena Peltonen; Veikko Salomaa; Stephen M Schwartz; David Altshuler
Journal: Nat Genet Date: 2009-02-08 Impact factor: 38.330

10. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing.

Authors: Andreas Gnirke; Alexandre Melnikov; Jared Maguire; Peter Rogov; Emily M LeProust; William Brockman; Timothy Fennell; Georgia Giannoukos; Sheila Fisher; Carsten Russ; Stacey Gabriel; David B Jaffe; Eric S Lander; Chad Nusbaum
Journal: Nat Biotechnol Date: 2009-02-01 Impact factor: 54.908

44 in total

1. Multiplexed array-based and in-solution genomic enrichment for flexible and cost-effective targeted next-generation sequencing.

Authors: Magdalena Harakalova; Michal Mokry; Barbara Hrdlickova; Ivo Renkens; Karen Duran; Henk van Roekel; Nico Lansu; Mark van Roosmalen; Ewart de Bruijn; Isaac J Nijman; Wigard P Kloosterman; Edwin Cuppen
Journal: Nat Protoc Date: 2011-11-03 Impact factor: 13.491

2. Mutation discovery by targeted genomic enrichment of multiplexed barcoded samples.

Authors: Isaäc J Nijman; Michal Mokry; Ruben van Boxtel; Pim Toonen; Ewart de Bruijn; Edwin Cuppen
Journal: Nat Methods Date: 2010-10-17 Impact factor: 28.547

3. Dominant missense mutations in ABCC9 cause Cantú syndrome.

Authors: Magdalena Harakalova; Jeske J T van Harssel; Paulien A Terhal; Stef van Lieshout; Karen Duran; Ivo Renkens; David J Amor; Louise C Wilson; Edwin P Kirk; Claire L S Turner; Debbie Shears; Sixto Garcia-Minaur; Melissa M Lees; Alison Ross; Hanka Venselaar; Gert Vriend; Hiroki Takanari; Martin B Rook; Marcel A G van der Heyden; Folkert W Asselbergs; Hans M Breur; Marielle E Swinkels; Ingrid J Scurr; Sarah F Smithson; Nine V Knoers; Jasper J van der Smagt; Isaac J Nijman; Wigard P Kloosterman; Mieke M van Haelst; Gijs van Haaften; Edwin Cuppen
Journal: Nat Genet Date: 2012-05-18 Impact factor: 38.330

Review 4. Next-generation sequencing in the clinic: promises and challenges.

Authors: Jiekun Xuan; Ying Yu; Tao Qing; Lei Guo; Leming Shi
Journal: Cancer Lett Date: 2012-11-19 Impact factor: 8.679

5. Evaluation of the evenness score in next-generation sequencing.

Authors: Konrad Oexle
Journal: J Hum Genet Date: 2016-04-14 Impact factor: 3.172

Review 6. Biological and nanotechnological applications using interactions between ionic liquids and nucleic acids.

Authors: Hisae Tateishi-Karimata; Naoki Sugimoto
Journal: Biophys Rev Date: 2018-04-23

7. Inferring the evolution of the major histocompatibility complex of wild pigs and peccaries using hybridisation DNA capture-based sequencing.

Authors: Carol Lee; Marco Moroldo; Alvaro Perdomo-Sabogal; Núria Mach; Sylvain Marthey; Jérôme Lecardonnel; Per Wahlberg; Amanda Y Chong; Jordi Estellé; Simon Y W Ho; Claire Rogel-Gaillard; Jaime Gongora
Journal: Immunogenetics Date: 2017-12-18 Impact factor: 2.846

8. Next generation sequencing-based molecular diagnosis of retinitis pigmentosa: identification of a novel genotype-phenotype correlation and clinical refinements.

Authors: Feng Wang; Hui Wang; Han-Fang Tuan; Duy H Nguyen; Vincent Sun; Vafa Keser; Sara J Bowne; Lori S Sullivan; Hongrong Luo; Ling Zhao; Xia Wang; Jacques E Zaneveld; Jason S Salvo; Sorath Siddiqui; Louise Mao; Dianna K Wheaton; David G Birch; Kari E Branham; John R Heckenlively; Cindy Wen; Ken Flagg; Henry Ferreyra; Jacqueline Pei; Ayesha Khan; Huanan Ren; Keqing Wang; Irma Lopez; Raheel Qamar; Juan C Zenteno; Raul Ayala-Ramirez; Beatriz Buentello-Volante; Qing Fu; David A Simpson; Yumei Li; Ruifang Sui; Giuliana Silvestri; Stephen P Daiger; Robert K Koenekoop; Kang Zhang; Rui Chen
Journal: Hum Genet Date: 2013-10-24 Impact factor: 4.132

Review 9. Novel sequencing-based strategies for high-throughput discovery of genetic mutations underlying inherited antibody deficiency disorders.

Authors: Hong-Ying Wang; Ashish Jain
Journal: Curr Allergy Asthma Rep Date: 2011-10 Impact factor: 4.806

10. Next-generation sequencing-based molecular diagnosis of a Chinese patient cohort with autosomal recessive retinitis pigmentosa.

Authors: Qing Fu; Feng Wang; Hui Wang; Fei Xu; Jacques E Zaneveld; Huanan Ren; Vafa Keser; Irma Lopez; Han-Fang Tuan; Jason S Salvo; Xia Wang; Li Zhao; Keqing Wang; Yumei Li; Robert K Koenekoop; Rui Chen; Ruifang Sui
Journal: Invest Ophthalmol Vis Sci Date: 2013-06-14 Impact factor: 4.799