Andreas Gnirke, Alexandre Melnikov, Jared Maguire, Peter Rogov, Emily M LeProust, William Brockman, Timothy Fennell, Georgia Giannoukos, Sheila Fisher, Carsten Russ, Stacey Gabriel, David B Jaffe, Eric S Lander, Chad Nusbaum.
Abstract
Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.
Year: 2009 PMID: 19182786 PMCID: PMC2663421 DOI: 10.1038/nbt.1523
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Overview of hybrid selection method. Illustrated are steps involved in the preparation of a complex pool of biotinylated RNA capture probes (“bait”; top left), whole-genome fragment input library (“pond”; top right) and hybrid-selected enriched output library (“catch”; bottom). Two sequencing targets and their respective baits are shown in red and blue. Thin and thick lines represent single and double strands, respectively. Universal adapter sequences are grey. The excess of single-stranded non-self-complementary RNA (wavy lines) drives the hybridization. See main text and Methods for details.
Figure 2. Coverage profiles of exon targets by end sequencing and shotgun sequencing. Shown are cumulative coverage profiles that sum the per-base sequencing coverage along 7,052 single-bait target exons. Only free-standing baits that were not within 500 bases of another one were included in this analysis. End sequencing of exon capture 1 with 36-base reads (a) produced a bimodal profile with high sequence coverage near and slightly beyond the ends of the 170-base baits (indicated by the horizontal bar). Shotgun sequencing of capture 2 from a different pond library (containing fragments with generic rather than Illumina-specific adapters) with 36-base reads after concatenating and re-shearing (b) gave more coverage on bait (shaded area) than near bait. Re-sequencing of capture 1 with 76-base end reads (c) had a similar effect, although the peak was slightly wider and the on-bait fraction of the peak area slightly smaller. Note that the scale of the Y-axis, and hence the absolute peak height, differs in each case. The different scales reflect the different numbers of sequenced bases, which are much lower for the GA-I lanes (a, b) than for the GA-II lane (c).
Detailed breakdown of Illumina sequences generated from exon catches
| Length and kind of Illumina sequencing reads | 36-base GA-I end sequences | 36-base GA-I shotgun sequences | 76-base GA-II end sequences |
|---|---|---|---|
| Aggregate length of target | 2.5 Mb | 2.5 Mb | 2.5 Mb |
| Aggregate length of baits | 3.7 Mb | 3.7 Mb | 3.7 Mb |
| Total raw unfiltered sequence | 152 Mb | 219 Mb | 851 Mb |
| Raw sequence not aligned uniquely to genome | 67 Mb | 116 Mb | 358 Mb |
| Uniquely aligned human sequence | 85 Mb | 102 Mb | 492 Mb |
| Uniquely aligned sequence on target | 36 Mb | 51 Mb | 235 Mb |
| Uniquely aligned sequence near target | 40 Mb | 38 Mb | 210 Mb |
| Uniquely aligned sequence on or near target | 76 Mb | 90 Mb | 445 Mb |
| Fraction of uniquely aligned sequence on or near target | 89% | 88% | 90% |
| Fraction of raw bases uniquely aligned on or near target | 50% | 41% | 52% |
| Fraction of uniquely aligned bases on target | 42% | 50% | 48% |
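Notes to the table:
Aggregate length of target: protein-coding exon sequence only.
Total raw unfiltered sequence, shotgun column (219 Mb): each unit of concatenated catch contains 44–46 bases (~18%) of generic adapter sequence. Therefore, ~18% (39 Mb) of the 219 Mb is not of human origin.
Raw sequence not aligned uniquely to genome: all raw sequence that fails to align uniquely to the human reference genome, including low-quality sequence.
Near target: outside but within 500 bp of a target exon.
Fraction of uniquely aligned sequence on or near target: upper bound for estimating the specificity of hybrid selection.
Fraction of raw bases uniquely aligned on or near target: lower bound for estimating the specificity of hybrid selection.
Fraction of raw bases uniquely aligned on or near target, shotgun column (41%): the denominator (219 Mb) includes ~39 Mb of sequence from the generic adapters. Excluding these 39 Mb, the lower bound for the estimated specificity of this catch is 90/180 = 50%.
Fraction of uniquely aligned bases on target: upper bound for the overall specificity of targeted exon sequencing.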
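The three percentage rows of the table follow from the Mb rows by simple division, and the adapter correction for the shotgun catch is one further step. The sketch below recomputes them from the rounded Mb values printed in the table. The dictionary keys and function names are illustrative and not from the source, and small discrepancies against unrounded data are to be expected.

```python
# Values in Mb, copied from the table (already rounded in the source).
# Keys and helper names are illustrative assumptions.
runs = {
    "36-base GA-I end": {"raw": 152, "unique": 85, "on": 36, "on_or_near": 76},
    "36-base GA-I shotgun": {"raw": 219, "unique": 102, "on": 51, "on_or_near": 90},
    "76-base GA-II end": {"raw": 851, "unique": 492, "on": 235, "on_or_near": 445},
}


def pct(numerator, denominator):
    """Return a ratio as a whole-number percentage."""
    return round(100 * numerator / denominator)


def summarize(run):
    """Recompute the three percentage rows of the table for one run."""
    return {
        # Upper bound on hybrid-selection specificity.
        "on_or_near_of_unique": pct(run["on_or_near"], run["unique"]),
        # Lower bound on hybrid-selection specificity.
        "on_or_near_of_raw": pct(run["on_or_near"], run["raw"]),
        # Upper bound on overall specificity of targeted exon sequencing.
        "on_target_of_unique": pct(run["on"], run["unique"]),
    }


# Adapter correction for the shotgun catch: about 39 Mb of the 219 Mb raw
# sequence is generic adapter, so it is removed from the denominator.
ADAPTER_MB = 39
shotgun = runs["36-base GA-I shotgun"]
shotgun_lower_bound_corrected = pct(
    shotgun["on_or_near"], shotgun["raw"] - ADAPTER_MB
)

for name, run in runs.items():
    print(name, summarize(run))
print("shotgun lower bound, adapters excluded:", shotgun_lower_bound_corrected)
```

The "on or near" figures are taken from the table's own row rather than summed from the "on" and "near" rows, because the rounded components do not always add up exactly (51 + 38 = 89 against a reported 90 Mb).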
Figure 3. Sequence coverage along a contiguous target. Shown is base-by-base sequence coverage along a typical 11-kb segment (chr4:118635000-118646000) out of 1.7 Mb. Sequence corresponding to bait is marked in blue. Segments that had more than 40 repeat-masked bases per 170-base window were not targeted by baits and received little or no coverage with sequencing reads aligning uniquely to the genome except directly adjacent to a bait.
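The repeat filter mentioned in this caption (no bait where a 170-base window holds more than 40 repeat-masked bases) can be illustrated with a short sketch. It assumes a soft-masked sequence in which lowercase letters mark repeat-masked bases, and it checks every window start; the authors' actual tiling scheme and masking format are not given here, so the function name, the `step` parameter and the toy sequence are all hypothetical.

```python
def eligible_bait_starts(seq, bait_len=170, max_masked=40, step=1):
    """Return start offsets of windows that pass the repeat filter.

    Assumption: lowercase bases in `seq` are repeat-masked (soft-masking).
    A window is eligible when it holds at most `max_masked` masked bases.
    """
    starts = []
    for start in range(0, len(seq) - bait_len + 1, step):
        window = seq[start:start + bait_len]
        masked = sum(1 for base in window if base.islower())
        if masked <= max_masked:
            starts.append(start)
    return starts


# Toy sequence: 170 unique bases, a 100-base repeat, 170 unique bases.
toy = "A" * 170 + "a" * 100 + "C" * 170
starts = eligible_bait_starts(toy)
print(starts[0], starts[40], starts[41], starts[-1])
```

In the toy sequence, windows overlapping the repeat by more than 40 bases are rejected, which leaves a gap in bait coverage over the repeat, consistent with the uncovered segments described in the caption.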
Figure 4. Normalized coverage-distribution plots. Shown is the fraction of bait-covered bases in the genome achieving coverage with uniquely aligned sequence equal to or greater than the normalized coverage indicated on the X-axis. The absolute per-base coverage was divided by the mean coverage of all bait positions (18 in a; 221 in b). The curve for the shotgun-sequenced exon capture (a) is steeper than the curve for the regional capture (b), indicating a less uniform representation of sequencing targets in the exon catch. Dashed lines point to the fraction of bases achieving at least half or one fifth the mean coverage.
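One point on such a normalized coverage-distribution curve can be computed as follows. This is a minimal sketch of the normalization described in the caption (per-base coverage divided by the mean over all bait positions). The function name and the toy coverage values are illustrative, not data from the study.

```python
def fraction_at_least(coverage, normalized_threshold):
    """Fraction of positions whose coverage is >= threshold x mean coverage.

    `coverage` holds per-base coverage at bait positions; dividing by the
    mean puts runs of different sequencing depth on a common X-axis.
    """
    mean_cov = sum(coverage) / len(coverage)
    cutoff = normalized_threshold * mean_cov
    return sum(1 for c in coverage if c >= cutoff) / len(coverage)


# Toy per-base coverage with mean 10 (illustrative values only).
toy_coverage = [0, 5, 10, 10, 20, 15, 10, 10]

half_mean = fraction_at_least(toy_coverage, 0.5)   # at least half the mean
fifth_mean = fraction_at_least(toy_coverage, 0.2)  # at least one fifth
full_mean = fraction_at_least(toy_coverage, 1.0)   # at least the mean
print(half_mean, fifth_mean, full_mean)
```

Evaluating the function over a range of thresholds traces the full curve; the thresholds 0.5 and 0.2 correspond to the dashed lines in the figure.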
Figure 5. Reproducibility of hybrid selection. For each exon (n = 15,565), the ratio of the mean coverage in two independent hybrid selection experiments performed on the same source DNA (NA15510) was plotted over its mean coverage in one experiment (a). Coverage was normalized to adjust for the different number of sequencing reads. The average ratio (black line) is close to 1. Standard deviations are indicated by purple lines. The graph on the right (b) shows base-by-base sequence coverage along one target in three independent hybrid selections, two of them performed on NA15510 (purple and teal lines) and one on NA11994 source DNA (black). Note the similarities at this fine resolution of the three profiles, which were normalized to the same height. The position of the target exon (ENSE00000968562) and bait is indicated by red and blue bars, respectively.
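The per-exon ratio plotted in panel (a) can be sketched as below. The caption does not say how coverage was normalized for read count, so this example assumes a simple rescaling by total coverage across all exons; that choice, the function name and the toy numbers are assumptions for illustration only.

```python
from statistics import mean, stdev


def normalized_ratios(cov_a, cov_b):
    """Per-exon coverage ratio between two experiments after depth scaling.

    Assumption: experiment B is rescaled so both experiments have the same
    total coverage, standing in for the unspecified normalization.
    Exons with zero coverage in B are skipped to avoid division by zero.
    """
    scale = sum(cov_a) / sum(cov_b)
    return [a / (b * scale) for a, b in zip(cov_a, cov_b) if b > 0]


# Toy mean coverage per exon in two experiments (illustrative values only).
experiment_a = [10.0, 20.0, 30.0, 40.0]
experiment_b = [22.0, 38.0, 60.0, 80.0]  # roughly twice the depth of A

ratios = normalized_ratios(experiment_a, experiment_b)
print(round(mean(ratios), 3), round(stdev(ratios), 3))
```

A mean ratio near 1 with a small standard deviation would indicate that exons are captured with similar relative efficiency in both experiments, which is what the caption reports.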