Literature DB >> 19668202

BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.

Ken Chen¹, John W Wallis, Michael D McLellan, David E Larson, Joelle M Kalicki, Craig S Pohl, Sean D McGrath, Michael C Wendl, Qunyuan Zhang, Devin P Locke, Xiaoqi Shi, Robert S Fulton, Timothy J Ley, Richard K Wilson, Li Ding, Elaine R Mardis.

Abstract

Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods with which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including insertion-deletions (indels), inversions and translocations. We examined BreakDancer's performance in simulation, in comparison with other methods and in analyses of a sample from an individual with acute myeloid leukemia and of samples from the 1,000 Genomes trio individuals. BreakDancer sensitively and accurately detected indels ranging from 10 base pairs to 1 megabase pair that are difficult to detect via a single conventional approach.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
DNA

Year: 2009 PMID： 19668202 PMCID： PMC3661775 DOI： 10.1038/nmeth.1363

Source DB: PubMed Journal: Nat Methods ISSN： 1548-7091 Impact factor: 28.547

Introduction

Genomic structural variation is commonly considered to be any DNA sequence alteration other than a single nucleotide substitution[1]. Instances of structural variants in germ and somatic cells contribute respectively to heritable genetic diseases[2,3] and cancers[4-6]. Numerous types of structural variation exist, including indels, copy number variants (CNVs), inversions, and translocations. Many inherited CNVs (> 30 kb) have been discovered using array comparative genomic hybridization (CGH)[7] and high density SNP arrays[8]. Alignment of DNA sequences from different sources has been used to identify small or balanced rearrangements not detectable by arrays[9,10]. Recent sequencing and assembly of individual genomes have revealed larger numbers of structural variants than originally expected, especially in the smaller size range (< 1 kb)[11,12]. However, precise characterization and genotyping of structural variants are still difficult and expensive due to limitations in sequencing technology and detection methods. Much of the recent advance in structural variation detection can be attributed to next-generation sequencing (NGS) instruments[13], which have dramatically economized paired-end, whole-genome sequencing. One widely used instrument, the Illumina Genome Analyzer (GA) II, employs DNA fragments between 100 and 500 bp and requires little input DNA (∼1 μg) for sufficient genome-wide coverage. Recent whole genome resequencing projects[14,15] have obtained paired end sequence coverage of 20-40 × and have predicted thousands of structural variants using end sequencing profiling (ESP) methods that discerns variants via perceived anomalies in the separation lengths or orientation of aligned read pairs[16,17]. Many substantive issues regarding the analysis of paired-end data, however, remain unresolved. Open questions include whether the procedures and heuristics established for fosmids and BACs can be extrapolated to short inserts, how the expected false positive and negative rates vary with coverage, insert size, and read length, and how prediction confidence should be established. As NGS data begin to dominate whole genome resequencing projects, there is a pressing need both to obtain precise answers and to provide practical solutions for data analysis. Here, we address these questions using a combination of computational and experimental approaches. Our software package, collectively called BreakDancer consists of two complementary algorithms. The first, BreakDancerMax, provides genome-wide detection of five types of structural variants: deletions, insertions, inversions, intra-chromosomal and inter-chromosomal translocations from one or a pool of DNA samples sequenced by GA II (Fig .1). The second, BreakDancerMini, focuses on detecting small indels (typically between 10-100 bp) that are not routinely detected by BreakDancerMax. Together, they provide sensitive and accurate detection for a wide variety of structural variants, as demonstrated in both simulation and real data analysis[14,18,19].

Figure 1

Overview of BreakDancer algorithm. (a) The workflow. (b) Five types of anomalous read pairs recognized by BreakDancerMax. A pair of arrows represents the location and the orientation of a read pair. A dotted line represents a chromosome in the subject genome. A solid line represents a chromosome in the reference genome.

Results

Simulation

To quantify BreakDancer's performance with respect to different parameter settings, we produced synthetic data based on 844 structural variants identified on chromosome 17 of J. Craig Venter's genome[11], which include 425 deletions, 415 insertions, and 4 inversions ranging from 20 bp to 7953 bp. We excluded indels shorter than 20 bp since they are relatively easy to detect via Smith-Waterman algorithm (Supplementary Fig. 1). Many variants in this set occur in repetitive regions that are difficult to map or assemble (Supplementary Notes). We considered a deletion or an inversion as detected if it overlapped 50% reciprocally with a predicted variant. We considered an insertion as detected if its single breakpoint overlaps a predicted variant. We simulated 50 bp paired-end reads from the chromosome 17 nucleotide sequence of Venter's genome using MAQ-0.7.1[20] with normally distributed insert size of a 200 bp mean and a 20 bp standard deviation (s.d.). We analyzed the set of reads that were confidently mapped (MAQ mapping quality > 10) using BreakDancerMax at a separation threshold of 3 s.d. Among the 365 (43.2%) variants whose flanking regions contain 2 or more anomalously mapped reads at 100 ×, 324 (89%) were detected with a 1.48% false positive rate (FPR) including 147 that are shorter than 60 bp (Fig. 2 and Supplementary Table 1).

Figure 2

Performance of BreakDancer in simulation. TPR and FPR of BreakDancerMax (BDMax) at the confidence threshold of Q ≥ 30 are shown. TPR analytic refers to the percent of variants that can hypothetically be detected by BDMax under an analytic model (Online Methods). TPR detectable is the percent of variants hose flanking regions (300 bp both to the left and to the right) contain 2 or more confidently mapped ARPs in the MAQ alignment. The performance of BreakDancerMini (BDMini) is characterized by its TPR and FPR. The combined performance (BD all) is obtained by merging the results of these to programs.

The 324 detected SVs included 214 deletions, 109 insertions, and 3 inversions with varying true positive rate (TPR) in different size ranges and coverages (Online Methods and Supplementary Fig. 2). Of the 214 deletions, 203 (95%) were correctly predicted as deletions with accurate sizes (Pearson's r = 0.92) (Supplementary Fig. 3a). In comparison, only 72/109 (66%) known insertions were correctly predicted as insertions with less accurate sizes (r = 0.65) and breakpoints (Supplementary Fig. 3a,b). Longer deletions were more accurately predicted in terms of both size and breakpoint. The confidence score we derived to prioritize BreakDancerMax predictions (Online Methods) demonstrated improved statistical properties when compared to simply using the number of anomalously mapped read pairs (ARPs), which remains the de facto standard metric[21-23]. It provides finer distinction among variants that are supported by identical number of ARPs (Supplementary Fig. 4). It also reduces the result's dependency on the separation threshold and leads to relatively consistent TPRs and FPRs. (Supplementary Fig. 5). We ran BreakDancerMini on the same data and required the anomalous regions having two-sample Kolmogorov–Smirnov test statistics D ≥ 2.3 (Online Methods and Supplementary Fig. 6). We observed dramatic improvement in detecting small indels (Fig. 2). At 100 × physical coverage, BreakDancerMini detected 543 (64.3%) variants with a 7.3% FPR, including 407 (75.0%) that are shorter than 60 bp. We merged the indels (< 100 bp) detected by BreakDancerMini with those detected by BreakDancerMax and obtained a non-redundant set of 683 variants, including 365 deletions, 290 insertions, and 21 inversions. Altogether, 621 (74%) of the known variants were detected with a 9.1% FPR. We repeated this simulation under identical conditions but included indels between 10 and 20 bp. On this set, BreakDancerMax alone only detected 24% of the 1897 known variants with a 7% FPR. However, in combination with BreakDancerMini, we detected 68.0% with a 10.3% FPR, 62.6% of which are between 10 and 20 bp. The size of indels appeared to be reasonably accurately predicted throughout the range (Supplementary Fig. 7a,b).

Comparison with other methods

We compared BreakDancer with to recently published structural variant detection tools VariationHunter[24] and MoDIL[25]. Noticeably, these tools both use a different mapping algorithm, MrFast (http://mrfast.sourceforge.net/) than BreakDancer. MoDIL and BreakDancerMini both utilize the Kolmogorov–Smirnov test[26], but differ in many algorithmic details. We ran BreakDancerMax and BreakDancerMini on the obtained MAQ map files of the Yoruban genome[14] (Online Methods) with a conservative threshold of 4 s.d. for BreakDancerMax, D ≥ 2.3. for BreakDancerMini and MAQ mapping quality > 10. BreakDancerMax returned a set of 9,202 deletions, 4,901 insertions, and 665 inversions while BreakDancerMini returned a set of 21,433 deletions, 17,029 insertions that are shorter than 100 bp. After merging them by position, we obtained a non-redundant set of 27,092 deletions, 19,305 insertions, and 665 inversions. We examined the overlap between the predicted variants with those obtained through alternative approaches (Table 1). Altogether, BreakDancer detected a total of 59/92 (64.1%) large fosmid deletions[21], which is comparable to VariationHunter under identical conditions[24]. Among the deletions predicted by BreakDancerMini, 21.1% overlapped at least 1 bp with 4528 known deletion polymorphisms[21], 34.4% with dbSNP v129, and 43.6% with the intra-contig deletions produced by Beijing Genome Institute (BGI) through whole genome de novo assembly (unpub. data). Among the insertions predicted by BreakDancerMini, 16.9% overlapped with 2876 known insertion polymorphisms[21], 29.8% with dbSNP v129, and 22.8% with BGI insertions. Indels < 10bp in the dbSNP and in the BGI sets were excluded in the comparison. All these percentages are substantially higher than those obtained by VariationHunter or MoDIL. The variant sizes estimated by BreakDancerMini were highly correlated with the deletion or insertion polymorphisms[21] (r > 0.8).

Table 1

Comparison of BreakDancer with other tools. Structural variants predicted by BreakDancer on the Yoruban (NA18507) sample were compared to sets of variants discovered by alternative approaches[14,21]. ESP (large structural variants that were found by analyzing discordant fosmid clone-end alignment), DIP (small deletion/insertion polymorphisms found as gaps in the paired alignment between the fosmid end sequences and the reference). The MPSV weighted, MPSV unweighted, Probabilistic, and MoDIL refer to sets of SVs predicted by VariationHunter[24] and by MoDIL[25] respectively. Call sets for these tools were downloaded from http://compbio.cs.sfu.ca/strvar.htm and http://compbio.cs.toronto.edu/modil/. The dbSNP v129 set refers to indels that are 10 bp or longer in dbSNP version 129. The BGI set refers to 10 bp or longer intra-contig indels produced by Beijing Genome Institute through whole genome de novo assembly on the same sample. The Strict* criteria require the length of the intersection between the validated and the predicted variants to overlap at least 50% of the length of the union of the intervals, or the predicted variants to be entirely encompassed by the fosmid interval. Before the slash sign (/) are the numbers of overlapping variants, after are the number of predictions in the corresponding category.

Type	Deletion	Deletion	Deletion	Deletion	Deletion	Insertion	Insertion	Insertion	Inversion
Method	ESP	DIP		Assembly	ESP	DIP		Assembly	ESP
From	ref. 21	ref. 21	dbSNP v129	BGI	ref. 14	ref. 21	dbSNP v129	BGI	ref. 21
Size filtering			>=10bp	>=10bp			>=10bp	>=10bp
Reported	92	116,395	82,956	107,760	5,704	107,458	82,956	41,134	13
Criteria	strict*	1bp	1bp	1bp	1bp	1bp	1bp	1bp	1bp
BreakDancerMax	55/9,202	955/9,202	2,039/9,202	3,123/9,202	5,015/9,202	339/4,901	903/4,901	827/4,901	2/665
BreakDancerMini	21/21,433	4528/21,433	7379/21,433	9,344/21,433	1,598/21,433	2,876/17,029	5,083/17,029	3,878/17,029	N/A
BreakDancer merged	59/27,092	4970/27,092	7998/27,092	10,792/27,092	5,064/27,092	2,983/19,305	5,336/19,305	4,104/19,305	2/655
MPSV weighted	57/8,959	711/8,959	1332/8,959	2,246/8,959	4,819/8,959	121/5,575	192/5,575	192/5,575	2/504
MPSV unweighted	55/7,599	588/7,599	1022/7,599	1,835/7,599	4,537/7,599	70/3,772	88/3,772	93/3,772	4/433
Probabilistic	58/8,537	703/8,537	1217/8,537	2,061/8,537	4,703/8,537	100/7,142	124/7,142	131/7,142	1/181
MoDIL	20/13,147	622/13,147	967/13,147	1,162/13,147	540/13,147	282/3,981	687/3,981	571/3,981	N/A

In addition, 54.3% of the deletions predicted by BreakDancerMax overlapped with 87.7% of the deletions originally reported[14]. Both percentages are higher in comparison to those obtained by VariationHunter[24], possibly because BreakDancerMax uses algorithms similar to the in the original article[14].

Detecting variants in an AML sample

We performed variant detection using data obtained from the tumor and the normal samples of an individual with cytogenetically normal AML[19]. we obtained 21 × paired-end haploid coverage for both the tumor and the normal genomes, corresponding to 63.5 × and 39.9 × physical coverage, respectively. We jointly analyzed data from six libraries using BreakDancerMax with library specific separation thresholds and MAQ mapping quality > 35. At a confidence score threshold of Q ≥ 60, 7087 variants were predicted, including 3170 deletions, 1570 insertions, 1382 inversions, and 965 intra-chromosomal translocations (Fig. 2 and Supplementary Table 2). 46.4% of these deletions overlapped (50% interval) with known inherited CNVs in the database of genomic variants v5 (DGV). The percent of overlap became 5-8% higher hen culling variants based on the confidence scores instead of the number of ARPs alone (Supplementary Fig. 8). A recent study using Affymetrix 6.0 array identified 116 inherited CNVs on the same individual[27], 37 (31.90%) of which overlap with our predictions. These overlapping CNVs range from 131 bp to 1.5 Mbp with no noticeable bias in size. We extracted variants that were detected only in the tumor and derived a set of 223 putative somatic variants including 100 deletions, 67 insertions (< 100 bp), 22 inversions, and 34 intra-chromosomal translocations. We attempted a local assembly for each of the 167 indels, using the reads mapped to the predicted variant interval (Methods). We were able to call variants from the assemblies in 153 of the 167 instances, with 100 confirming the variants (79 both in the tumor and in the normal, 17 only in the tumor, and 4 only in the normal). We submitted the set of 167 indels for experimental validation (Methods). 110 (69 deletions and 41 insertions) were validated both in the tumor and in the normal, 31 were not validated either in the tumor or in the normal, and 26 were not called due to lo data quality (Supplementary Table 3). This suggested a 78% validation rate, excluding the no-calls. Noticeably, 16 of the 20 deletions that were not validated received a confidence score below 80 (Supplementary Fig. 9). Therefore, the validation rate became 89% at Q ≥ 80. The size of the deletions determined by BreakDancerMax shoed good correlation with those determined independently from the validation experiment (r = 0.867). Local assembly clearly improved overall accuracy in that 79 variants were correctly identified in both the tumor and the normal. Although the false negative rate of the assembly calls as relatively high: 26 (49%) of the 53 non-variant calls were validated in the experiment, the FPR as fairly lo: only six (6%) variant calls could not be validated. This observation suggested using assembly in a confirmatory role, rather than as a mechanism to limit false negatives. The assembly also improved the size estimation of small indels (Fig. 3).

Figure 3

Size distribution of deletions detected in an AML genome. 3170 deletions were detected from the sequence data by BreakDancerMax ranging from 58 bp to 959,498 bp. To signature peaks at 300 bp and at 6,000 bp correspond respectively to the AluY and the L1Hs retro-transposon. In comparison, only 116 inherited CNVs were detected using Affymetrix 6.0 array on this sample.

Among the identified insertions, three appeared to be ancient alleles that are closer to chimp than to the human reference. In at least 4 inherited deletions we identified, there are stretches of 10-20 bp AT-rich microhomologous sequences inserted between the deletion breakpoints, likely formed by transposons hen they inserted into the genome. We were only able to obtain high quality validation data for 13 inversions and 6 intra-chromosomal translocations. Of these, four inversions and to intra-chromosomal translocations were validated both in the tumor and in the normal (Supplementary Fig. 10a-f).

Detecting variants in a 1,000 Genomes dataset

We applied BreakDancerMax to the 1,000 Genomes Project[18] data and compared our deletion calls with those that were previously known via fosmid ESP[21] and array CGH[28] on chromosome 5 of the CEU and the YRI trio individuals. Each CEU individual had reads from to paired-end libraries with ∼15 × physical coverage (Supplementary Table 4). At the threshold of 4 s.d., mapping quality > 35 and Q ≥ 40, 125 deletions were detected in NA12878, 79 (63%) of which overlap DGV. Around 25-35% of known deletions were present in our calls (Supplementary Table 5). This percentage increased substantially to 35%-45% after lowering mapping quality threshold to 10, while the DGV concordance dropped to 54%. Reducing the separation distance cutoff from 4 s.d. to 3 s.d. increased the total number of Q ≥ 40 predictions by about 20%, but did not increase the numbers of known variants that were detected. Interestingly, 40-57% of known variants were detected when we jointly analyzed reads from all three individuals with library specific separation thresholds. There as a substantial overlap among the predicted deletions of the trio individuals: 88/120 (73%) deletions in the father (NA12891) and 98/133 (74%) in the mother (NA12892) were independently detected in the child (NA12878). We repeated the same set of analyses using data from the YRI trio individuals. Each individual had reads from to paired-end libraries with about 50 × to 70 × physical coverage (Supplementary Table 4). At the threshold of 4 s.d., mapping quality > 35 and Q ≥ 40, 246 deletions were detected in NA19240, 123 (50%) of which overlapped DGV. Around 50%-100% known deletions were present in our calls (Supplementary Table 6). No additional known variants were detected after lowering mapping quality threshold to 10 or by performing pooled analysis. There as a substantial overlap among the deletions of the trio individuals: 168/235 (72%) deletions in the father (NA19239) and 126/164 (77%) in the mother (NA19238) were also independently detected in the child (NA19240). In contrast to these substantial familial overlaps, the degree of overlap between individuals in different families as noticeably lower (31-37%).

Discussion

Our study indicates that BreakDancer has achieved accurate and sensitive structural variant discovery based on short-insert paired-end read mapping. The pooled analysis framework implemented in BreakDancer produces unified segmentation across pooled samples and libraries. In a family-/population-based study, it enhances the detection of common variants, as demonstrated in our analysis of the CEU trio. In a tumor-normal paired study, it improves the specificity of somatic variant prediction through effective elimination of inherited variants. This is particularly important hen discovery power is not matched in the paired genomes due to different insert size. It is possible to further improve BreakDancer's performance by systematically integrating more information in confidence scoring. For example, it may be beneficial to incorporate the mapping quality rather than applying a fixed threshold. Moreover, there is evidence suggesting that integrating read depth may help improve segmentation and genotyping[29], although an effective integration method is yet to be discovered. Our goal is to derive phred-style quality scores that accurately predict the error probability. Some types of structural variants, such as inversions and translocations, appeared to be more difficult to detect and validate. Many putative predictions overlapped with regions of tandem or inverted repeat and required further sequence analysis and filtering, or the use of additional longer reads and longer inserts. Nonetheless, BreakDancer as able to identify bona fide instances of inversions and intra-chromosomal translocations in this study, and somatic inter-chromosomal translocations in our study of glioblastoma multiforme, ovarian, and other AML samples (data not shown). The algorithms we implemented in BreakDancer are generic and can potentially be expanded to analyze data of different insert sizes or produced by different sequencing technologies. It can also be expanded to analyze paired-end data obtained from mRNA sequencing to identify instances of gene fusion and alternative splicing.

Methods

BreakDancerMax

Our first algorithm BreakDancerMax starts with the map files produced by MAQ[20] (Fig. 1a). Read pairs mapped to a reference genome with sufficient mapping quality are independently classified into six types (Fig. 1b): normal, deletion, insertion, inversion, intra-chromosomal translocation, and inter-chromosomal translocation. This classification process is based on 1) the separation distance and alignment orientation between the paired reads, 2) the user-specified threshold, and 3) the empirical insert size distribution estimated from the alignment of each library contributing genome coverage. The algorithm then searches for genomic regions that anchor significantly more anomalous read pairs (ARPs) than expected on average. A putative structural variant is derived from the identification of one or more regions that are interconnected by at least to ARPs. A confidence score is estimated for each variant based on a Poisson model that takes into consideration the number of supporting ARPs, the size of the anchoring regions and the coverage of the genome. The dominant type of associated ARPs in a particular region determines the type of structural variant. The start and the end coordinates are defined as the inner boundaries of the constituent regions that are closest to the suspected breakpoints, while the size is estimated by subtracting the mean insert size from the average spanning distance in each library and then averaging across libraries.

Confidence score estimation

It is important to derive confidence scores that quantify the underlying error probabilities of the predicted structural variants. The accuracy of the score depends on many factors, for example, whether the set of reads represents an unbiased sampling of the genome and all alleles, whether the reads are mapped to correct locations, and whether the amount of observed evidence is significant. One of the primary signals for the presence of a structural variant is the clustering of ARPs. Therefore, it is important to measure the degree of clustering from the perspective of both depth and breadth. We assume that under the null hypothesis of no variant, the genomic location of one particular type of insert is uniformly distributed[14]. For studies that define more than one insert type, the number of inserts at a particular location forms a mixture Poisson distribution with each mixture component representing one of the insert types. The statistic that summarizes the degree of clustering of a particular insert type is the probability of having more than the observed number of inserts in a given region: Where n denotes a Poisson random variable with mean equal to λ, i the type of the insert, and k the number of observed type i inserts. The estimation of λ is straightforward based on uniform assumption: Where s represents the cumulative size of the regions that the ARPs anchor to, N the total number of type i inserts in the entire data set, and G the length of the reference genome. N is counted directly from the data without assuming any form of insert size distribution. To detect indels, one could define three types of inserts: long, medium, and short defined by pre-determined thresholds. The task of indel detection is to find deletions from regions that contain significantly more long inserts and insertions from regions that contain significantly more short-inserts. This probabilistic scoring system can conveniently integrate information from multiple libraries from the same or different individuals using Fisher's method[30] assuming that the m libraries are produced independently: Where χ denotes a chi-square distribution of 2m degree of freedom and P the P value obtained from the j library. This makes it straightforward to compute a combined P value from a set of variable insert-size libraries or from one or multiple individuals to fully harness the statistical power of the pooled data. For convenience of representation, we convert the combined P value to phred scale using: However it should be noted that this Q score is not necessarily a phred quality score although they should have good correlation.

An analytic model of true positive rate (TPR) in simulation

Assuming that all the reads can be confidently mapped and that the ARPs cannot intersect with the variant breakpoint, we can analytically estimate the number of ARPs that a known structural variant may possess Where G(·) represents the insert size distribution function with mean μ and standard deviation σ, size of the deletions θ, size of the insertions θ, threshold that defines the long inserts c, threshold that defines the short inserts c, read length l, physical coverage R, and allele frequency a. We can compute the TPR in our simulation using this analytic model that summarizes information about the insert size, read length, coverage, and the variant size (Fig. 1). with a 200 bp insert library (s.d. 20 bp and read length 50 bp), 493 (58.69%) of 844 known variants (≥ 20 bp) on the chromosome 17 of J. Craig Venter's genome would possess 2 or more ARPs (≥ 3 s.d.) at 100 × physical coverage. This analytic TPR approaches an asymptote at 180 × here all deletions are detected and at 220 × here 307 (74%) of 415 insertions are detected (Supplementary Table 7). For a 400 bp insert library (s.d. 40 bp and read length 50 bp), the analytic TPR approaches an asymptote at 430 × here all deletions are detected and at 470 × here 87.5% of insertions are detected (Supplementary Table 8). We can explicitly characterize the analytic TPR as a function of variant size and coverage based on the Poisson coverage model: Where P(·) represents the Poisson distribution function. With these formulas, it can be shown that insertions and deletions shorter than 40 bp are difficult to detect using the above 200 bp insert library due to the 20 bp standard deviation. Deletions longer than 60 bp took about 30 × coverage to reach an asymptote and those longer than 100 bp took only 20 × (Supplementary Fig. 11a). Insertions ranging from 60 bp to 80 bp were relatively easier to detect (Supplementary Fig. 11b), but those longer than 100 bp cannot be detected at all, as their detection is limited by the insert size and read length of DNA fragments.

BreakDancerMini

Our second algorithm BreakDancerMini analyzes the normally mapped read pairs that were ignored by BreakDancerMax. A genomic region of size equivalent to the mean insert size is classified as either normal or anomalous based on a sliding indo test that examines the difference of the separation distances between read pairs that are mapped within the indo versus those in the entire genome. Similar to BreakDancerMax, a putative structural variant can be derived from the anomalous genomic regions that are interconnected by at least to common read pairs. A confidence score is assigned based on the significance value of the sliding indo test. The start and the end coordinates are decided as the outer boundaries of the constituent regions, while the size is estimated using the same approach as BreakDancerMax.

The sliding Window test

We applied a sliding indo test to identify anomalous regions that contain read pairs significantly different from the entire genome. By default, BreakDancerMini using a fixed indo size of w = μ + 3σ - 2l bp and a step size of 1 bp, here μ and σ are the mean and the standard deviation estimated from the separation distance of normally and confidently (mapping quality > 40) mapped read pairs, and l is the average read length. A to-sample Kolmogorov–Smirnov (KS) test statistic[26] is computed for each indo, here F(x) and F(x) are the empirical cumulative distribution function (ECDF) estimated from the normal reads in the indo and in the entire genome respectively, and n and n′ are the number of reads in each set; x is the separation distance from 1 bp to a maximum size (∼300 bp); sup denotes the supremum of the set. Obviously, D objectively measures the difference between the to ECDFs in terms of both location and shape. To model alignment orientation, we computed to statistics D and D per indo using reads that are mapped to the plus and the minus strands respectively. A genomic region is classified as anomalous in either the plus or the minus orientation if the corresponding KS statistic exceeds a user-selected threshold. Overlapping anomalous regions in the same orientation are filtered and only the highest scoring one is kept. For small indels, the anomalous regions that support the same variant are required to be in the opposite orientations. In principle, this approach works with any insert size distribution and does not require any predetermined cutoff on the separation distance.

Variant calling based on local assembly

A local assembly of the breakpoints within a suspected variant region can confirm the existence of the structural variant, precisely define the breakpoint locations, and determine any inserted sequences that may be present. In our AML study, we assembled reads mapped by MAQ to within 500 bp of the predicted variant boundaries, including unaligned reads hose mates mapped within the region using both Velvet[31] and phrap. we found that using more than one assembly algorithm increased the chance of assembling a structural variant. If the derived contig sequences cumulatively covered over 75% of the region from which the reads were extracted, we aligned the contigs to a region of the human reference sequence containing the structural variant and 1 kbp of flanking sequence on either side using cross-match. The resulting pair-wise alignments were examined for the existence of breakpoints or gaps. A variant as called if there is a gap or if the tumor and the normal contigs contain consistent breakpoint.

Experimental validation

Experimental validation as performed on putative structural variants in the AML tumor and normal genomes. Primer3 as used in conjunction with internal software to design and select tailed PCR primers for structural variant validation. Efforts were made to avoid designing primers in repetitive regions and to select primers with average GC-content close to 50% and a predicted Tm of 60° C. Primers were selected by hand hen automated methods indicated a lo likelihood of success. For small insertions, small inversions, and deletions of most sizes, PCR primers were designed approximately 100-200 bp outside of the boundaries of the breakpoints defined by BreakDancer. For large inversions and intrachromosomal translocations, primers were designed with the same orientation as, but 10-200 bp upstream of any variant supporting read pairs. If a structural variant as supported by both forward and reverse read pairs across both breakpoints, a total of four primers were designed and to separate attempts were made to validate the variant with PCR amplification and Sanger sequencing. Structural variants were considered validated if any single resulting read sequence spanned the predicted breakpoints. No primers were designed for complex events, e.g., if conserved repeats spanned or flanked both ends of the predicted breakpoints. Genomic DNA from the tumor and a matched normal blood sample were amplified using standard PCR protocols. Putative small insertions, small inversions, and deletions of all sizes were amplified using Amplitaq Gold polymerase. Putative large inversions and intrachromosomal translocations were amplified using Accutaq Hotstart polymerase. All PCR products were evaluated on a 2% agarose gel. Regardless of yield, all products were sequenced in both directions using Big Dye Terminator reactions and subsequently loaded on an AB 3730xl capillary sequencer. The resulting traces were assembled to a reference sequence extracted from the region surrounding the predicted variant site on NCBI build 36 with an additional 1 kbp of flanking 3′ and 5′ sequence. All resulting diploid trace data were manually reviewed and those traces showing unambiguous evidence of homozygous or heterozygous SV were classified as either somatic or germline events, or alternatively, labeled as variants if the somatic status could not be determined due to lack of sequence data from the matched normal sample.

The NA18507 data

We downloaded approximately 3.5 billion end sequences (1.7 billion pairs) of length 36 to 41 bp and insert size 200bp from the NCBI Short Read Archive. This constituted about 42 × sequence and 120 × physical coverage of the human genome. we mapped all reads from the 200 bp library to the NCBI build 36.1 reference using MAQ-0.7.1 and obtained 37.2 × haploid coverage after removing the duplicated reads that have identical outer coordinates. Consistent with the previous reports[24], the obtained insert size distribution is approximately normal with a mean of 209 bp and a s.d. of 13 bp.

The AML data

We constructed four Illumina paired-end libraries from the genomic DNA of the primary tumor cells and to libraries from the normal skin cells. The mean insert sizes range from 95 bp to 268 bp based on the empirical insert size distributions estimated from the alignment (Supplementary Table 1). All libraries had unimodal insert size distributions although the normal DNA libraries had a relatively larger s.d. than the tumor libraries (Supplementary Fig. 12). Some libraries have distributions clearly diverged from Gaussian and these can be problematic for variant detection methods that assume normality. For both the tumor and the skin genomes, we obtained 21 × haploid sequence coverage, corresponding to 63.5× and 39.9× physical coverage, respectively. Of the paired-end reads obtained, 67% were 50 bp and the rest between 35 bp and 36 bp. All reads were mapped to the NCBI build 36 human reference sequence using MAQ-0.7.1.

System Requirements and Software Availability

BreakDancer is currently written in Perl and is available at http://genome.ustl.edu/tools/cancer-genomics/. It usually takes three to five hours and between 200 MB to 500 MB memory to analyze one human chromosome at around 50-fold sequence redundancy. Supplementary Figure 1: The size of indels detectable by MAQ paired end Smith-Waterman alignment algorithm Supplementary Figure 2: The true positive rate (TPR) .r.t. coverage in different size bins Supplementary Figure 3: Size and Breakpoint Accuracy of BreakDancerMax in simulation Supplementary Figure 4: Receiver operator characteristics (ROC) .r.t. the number of anomalous read pairs (n) and the confidence score (Q) Supplementary Figure 5: The true positive rate (TPR) and the false positive rate (FPR) given various separation thresholds Supplementary Figure 6: The Receiver Characteristic (ROC) of BreakDancerMini at different threshold of D Supplementary Figure 7: Accuracy of variant size estimated by BreakDancerMini Supplementary Figure 8: Percent overlap between the AML structural variants and the DGV (v5) as functions of the confidence score or number of anomalous read pairs Supplementary Figure 9: AML validation result .r.t variant size and confidence score Supplementary Figure 10: Plots of identified inversions and intra-chromosomal translocations in the AML genome Supplementary Figure 11: The analytic true positive rate .r.t. variant size and coverage Supplementary Figure 12: The insert size distribution of the libraries in the AML project Supplementary Table 1: List of structural variants detected in simulation Supplementary Table 2: Number and type of structural variants detected by BreakDancerMax in the tumor-normal paired AML genome. Supplementary Table 3: A list of AML2 structural variants detected by BreakDancer, refined by local assembly, and validated via PCR resequencing Supplementary Table 4: The paired end libraries used for the structural variation detection of the 1000 genomes trio individuals Supplementary Table 5: Deletions detected by BreakDancerMax on chromosome 5 of NA12878 Supplementary Table 6: Deletions detected by BreakDancerMax on chromosome 5 of NA19240 Supplementary Table 7: Analytic true positive rate in simulated structural variant detection using a 200 insert library Supplementary Table 8: Analytic true positive rate in simulated structural variant detection using a 400 insert library Supplementary Note: Additional results in simulation

28 in total

1. End-sequence profiling: sequence-based analysis of aberrant genomes.

Authors: Stanislav Volik; Shaying Zhao; Koei Chin; John H Brebner; David R Herndon; Quanzhou Tao; David Kowbel; Guiqing Huang; Anna Lapuk; Wen-Lin Kuo; Gregg Magrane; Pieter De Jong; Joe W Gray; Colin Collins
Journal: Proc Natl Acad Sci U S A Date: 2003-06-04 Impact factor: 11.205

2. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions.

Authors: Seunghak Lee; Fereydoun Hormozdiari; Can Alkan; Michael Brudno
Journal: Nat Methods Date: 2009-05-31 Impact factor: 28.547

3. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

4. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes.

Authors: Fereydoun Hormozdiari; Can Alkan; Evan E Eichler; S Cenk Sahinalp
Journal: Genome Res Date: 2009-05-15 Impact factor: 9.043

5. The diploid genome sequence of an Asian individual.

Authors: Jun Wang; Wei Wang; Ruiqiang Li; Yingrui Li; Geng Tian; Laurie Goodman; Wei Fan; Junqing Zhang; Jun Li; Juanbin Zhang; Yiran Guo; Binxiao Feng; Heng Li; Yao Lu; Xiaodong Fang; Huiqing Liang; Zhenglin Du; Dong Li; Yiqing Zhao; Yujie Hu; Zhenzhen Yang; Hancheng Zheng; Ines Hellmann; Michael Inouye; John Pool; Xin Yi; Jing Zhao; Jinjie Duan; Yan Zhou; Junjie Qin; Lijia Ma; Guoqing Li; Zhentao Yang; Guojie Zhang; Bin Yang; Chang Yu; Fang Liang; Wenjie Li; Shaochuan Li; Dawei Li; Peixiang Ni; Jue Ruan; Qibin Li; Hongmei Zhu; Dongyuan Liu; Zhike Lu; Ning Li; Guangwu Guo; Jianguo Zhang; Jia Ye; Lin Fang; Qin Hao; Quan Chen; Yu Liang; Yeyang Su; A San; Cuo Ping; Shuang Yang; Fang Chen; Li Li; Ke Zhou; Hongkun Zheng; Yuanyuan Ren; Ling Yang; Yang Gao; Guohua Yang; Zhuo Li; Xiaoli Feng; Karsten Kristiansen; Gane Ka-Shu Wong; Rasmus Nielsen; Richard Durbin; Lars Bolund; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang
Journal: Nature Date: 2008-11-06 Impact factor: 49.962

6. High-resolution mapping of copy-number alterations with massively parallel sequencing.

Authors: Derek Y Chiang; Gad Getz; David B Jaffe; Michael J T O'Kelly; Xiaojun Zhao; Scott L Carter; Carsten Russ; Chad Nusbaum; Matthew Meyerson; Eric S Lander
Journal: Nat Methods Date: 2008-11-30 Impact factor: 28.547

7. Whole-genome shotgun assembly and comparison of human genome assemblies.

Authors: Sorin Istrail; Granger G Sutton; Liliana Florea; Aaron L Halpern; Clark M Mobarry; Ross Lippert; Brian Walenz; Hagit Shatkay; Ian Dew; Jason R Miller; Michael J Flanigan; Nathan J Edwards; Randall Bolanos; Daniel Fasulo; Bjarni V Halldorsson; Sridhar Hannenhalli; Russell Turner; Shibu Yooseph; Fu Lu; Deborah R Nusskern; Bixiong Chris Shue; Xiangqun Holly Zheng; Fei Zhong; Arthur L Delcher; Daniel H Huson; Saul A Kravitz; Laurent Mouchard; Knut Reinert; Karin A Remington; Andrew G Clark; Michael S Waterman; Evan E Eichler; Mark D Adams; Michael W Hunkapiller; Eugene W Myers; J Craig Venter
Journal: Proc Natl Acad Sci U S A Date: 2004-02-09 Impact factor: 11.205

8. Recurring mutations found by sequencing an acute myeloid leukemia genome.

Authors: Elaine R Mardis; Li Ding; David J Dooling; David E Larson; Michael D McLellan; Ken Chen; Daniel C Koboldt; Robert S Fulton; Kim D Delehaunty; Sean D McGrath; Lucinda A Fulton; Devin P Locke; Vincent J Magrini; Rachel M Abbott; Tammi L Vickery; Jerry S Reed; Jody S Robinson; Todd Wylie; Scott M Smith; Lynn Carmichael; James M Eldred; Christopher C Harris; Jason Walker; Joshua B Peck; Feiyu Du; Adam F Dukes; Gabriel E Sanderson; Anthony M Brummett; Eric Clark; Joshua F McMichael; Rick J Meyer; Jonathan K Schindler; Craig S Pohl; John W Wallis; Xiaoqi Shi; Ling Lin; Heather Schmidt; Yuzhu Tang; Carrie Haipek; Madeline E Wiechert; Jolynda V Ivy; Joelle Kalicki; Glendoria Elliott; Rhonda E Ries; Jacqueline E Payton; Peter Westervelt; Michael H Tomasson; Mark A Watson; Jack Baty; Sharon Heath; William D Shannon; Rakesh Nagarajan; Daniel C Link; Matthew J Walter; Timothy A Graubert; John F DiPersio; Richard K Wilson; Timothy J Ley
Journal: N Engl J Med Date: 2009-08-05 Impact factor: 91.245

9. Reconstructing tumor genome architectures.

Authors: Benjamin J Raphael; Stanislav Volik; Colin Collins; Pavel A Pevzner
Journal: Bioinformatics Date: 2003-10 Impact factor: 6.937

10. Comprehensive genomic characterization defines human glioblastoma genes and core pathways.

Authors:
Journal: Nature Date: 2008-09-04 Impact factor: 49.962

689 in total

1. Pancreatic intraductal tubulopapillary neoplasm is genetically distinct from intraductal papillary mucinous neoplasm and ductal adenocarcinoma.

Authors: Olca Basturk; Michael F Berger; Hiroshi Yamaguchi; Volkan Adsay; Gokce Askan; Umesh K Bhanot; Ahmet Zehir; Fatima Carneiro; Seung-Mo Hong; Giuseppe Zamboni; Esra Dikoglu; Vaidehi Jobanputra; Kazimierz O Wrzeszczynski; Serdar Balci; Peter Allen; Naoki Ikari; Shoko Takeuchi; Hiroyuki Akagawa; Atsushi Kanno; Tooru Shimosegawa; Takanori Morikawa; Fuyuhiko Motoi; Michiaki Unno; Ryota Higuchi; Masakazu Yamamoto; Kyoko Shimizu; Toru Furukawa; David S Klimstra
Journal: Mod Pathol Date: 2017-08-04 Impact factor: 7.842

2. How Can Next-Generation Sequencing (Genomics) Help Us in Treating Colorectal Cancer?

Authors: Kristen K Ciombor; Sigurdis Haraldsdottir; Richard M Goldberg
Journal: Curr Colorectal Cancer Rep Date: 2014-12-01

3. Simultaneous structural variation discovery among multiple paired-end sequenced genomes.

Authors: Fereydoun Hormozdiari; Iman Hajirasouliha; Andrew McPherson; Evan E Eichler; S Cenk Sahinalp
Journal: Genome Res Date: 2011-11-02 Impact factor: 9.043

4. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion.

Authors: Ruibin Xi; Angela G Hadjipanayis; Lovelace J Luquette; Tae-Min Kim; Eunjung Lee; Jianhua Zhang; Mark D Johnson; Donna M Muzny; David A Wheeler; Richard A Gibbs; Raju Kucherlapati; Peter J Park
Journal: Proc Natl Acad Sci U S A Date: 2011-11-07 Impact factor: 11.205

5. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

Review 6. A survey of sequence alignment algorithms for next-generation sequencing.

Authors: Heng Li; Nils Homer
Journal: Brief Bioinform Date: 2010-05-11 Impact factor: 11.622

7. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome.

Authors: Aaron R Quinlan; Royden A Clark; Svetlana Sokolova; Mitchell L Leibowitz; Yujun Zhang; Matthew E Hurles; Joshua C Mell; Ira M Hall
Journal: Genome Res Date: 2010-03-22 Impact factor: 9.043

8. Primer-initiated sequence synthesis to detect and assemble structural variants.

Authors: Andreas Massouras; Korneel Hens; Carine Gubelmann; Swapna Uplekar; Frederik Decouttere; Jacques Rougemont; Stewart T Cole; Bart Deplancke
Journal: Nat Methods Date: 2010-06-13 Impact factor: 28.547

9. Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing.

Authors: Akihiro Fujimoto; Hidewaki Nakagawa; Naoya Hosono; Kaoru Nakano; Tetsuo Abe; Keith A Boroevich; Masao Nagasaki; Rui Yamaguchi; Tetsuo Shibuya; Michiaki Kubo; Satoru Miyano; Yusuke Nakamura; Tatsuhiko Tsunoda
Journal: Nat Genet Date: 2010-10-24 Impact factor: 38.330

10. Systems-level analysis of nitrogen starvation-induced modifications of carbon metabolism in a Chlamydomonas reinhardtii starchless mutant.

Authors: Ian K Blaby; Anne G Glaesener; Tabea Mettler; Sorel T Fitz-Gibbon; Sean D Gallaher; Bensheng Liu; Nanette R Boyle; Janette Kropat; Mark Stitt; Shannon Johnson; Christoph Benning; Matteo Pellegrini; David Casero; Sabeeha S Merchant
Journal: Plant Cell Date: 2013-11-26 Impact factor: 11.277