Steven N Hart, Vivekananda Sarangi, Raymond Moore, Saurabh Baheti, Jaysheel D Bhavsar, Fergus J Couch, Jean-Pierre A Kocher.
Abstract
BACKGROUND: Structural variation (SV) represents a significant, yet poorly understood contribution to an individual's genetic makeup. Advanced next-generation sequencing technologies are widely used to discover such variations, but there is no single detection tool that is considered a community standard. In an attempt to fulfil this need, we developed an algorithm, SoftSearch, for discovering structural variant breakpoints in Illumina paired-end next-generation sequencing data. SoftSearch combines multiple strategies for detecting SV, including split-read, discordant read-pair, and unmated pairs. Co-localized split-reads and discordant read pairs are used to refine the breakpoints.
Year: 2013 PMID: 24358278 PMCID: PMC3865185 DOI: 10.1371/journal.pone.0083356
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The general strategy for SoftSearch.
A) A left-clipped read is one whose clipped portion lies at a smaller genome coordinate than the opposite end of the read; a right-clipped read is the reverse. For a left-clipped read located on the “+” strand, SoftSearch looks upstream for a discordant read pair in which the read is oriented in the “-” direction. SoftSearch then links the first region to the location and orientation of that read's mate. To increase the likelihood of detecting the breakpoint exactly, it next looks upstream for a cluster of right-clipped reads. If none is found, the breakpoint defaults to the location of the discordant read's mate; otherwise it is the soft-clipping position of the right-clipped read.
B) SoftSearch identifies discordant read pairs by their insert size and orientation and places them in a temporary BAM file. It also scans the input BAM file for soft-clipped reads and converts them to a BED file. Overlapping soft-clip locations are counted to identify putative breakpoints, which are then queried against the discordant read pair BAM file for properly oriented supporting reads. Supported breakpoints are output in VCF format.
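The clip classification in panel A and the soft-clip counting step in panel B can be sketched in a few lines of Python. This is an illustrative sketch only, not SoftSearch's implementation: the function names `soft_clip_sites` and `putative_breakpoints`, the `min_support` threshold, and the choice to report the first or last aligned base as the clip position are all assumptions made for the example. It relies only on the SAM convention that `POS` is the 1-based leftmost aligned base and that the CIGAR string is written in reference orientation, so a leading `S` operation marks a left clip and a trailing `S` marks a right clip.

```python
import re
from collections import Counter

# CIGAR operations as defined by the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def soft_clip_sites(pos, cigar):
    """Return a list of (side, position) tuples for a read's soft clips.

    pos is the SAM POS field (1-based leftmost aligned base).
    A left clip is reported at the first aligned base, a right clip
    at the last aligned base (an assumption made for this sketch).
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips may sit outside soft clips; they carry no bases, so drop them.
    ops = [(n, op) for n, op in ops if op != "H"]
    sites = []
    if ops and ops[0][1] == "S":
        sites.append(("left", pos))
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        sites.append(("right", pos + ref_len - 1))
    return sites


def putative_breakpoints(reads, min_support=2):
    """Count soft-clip positions and keep those seen at least min_support times.

    reads is an iterable of (chrom, pos, cigar) tuples.
    min_support is a hypothetical threshold chosen for illustration.
    """
    counts = Counter()
    for chrom, pos, cigar in reads:
        for side, bp in soft_clip_sites(pos, cigar):
            counts[(chrom, side, bp)] += 1
    return {key: n for key, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    example_reads = [
        ("chr13", 100, "20S80M"),  # left clipped at 100
        ("chr13", 100, "35S65M"),  # left clipped at 100
        ("chr13", 21, "80M20S"),   # right clipped at 100
        ("chr13", 50, "100M"),     # no clipping
    ]
    print(putative_breakpoints(example_reads))
```

In a full pipeline, each surviving position would still have to be checked against the discordant read pairs for properly oriented support, as the caption describes; that step is omitted here.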
Summary Statistics from the Synthetic Whole Genome (40X).
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 2,907 | 2,884 | | 3,483 | 3,527 | 2,686 | 2,972 | 3,154 |
| | 1,022 | 1,045 | 371 | 446 | 402 | 1,243 | 957 | 775 |
| | 385 | 3 | 666 | 58 | 601 | 21 | 381 | 176 |
| | 0.88 | | 0.84 | 0.98 | 0.85 | 0.99 | 0.89 | 0.95 |
| | 0.74 | 0.73 | | 0.89 | 0.90 | 0.68 | 0.76 | 0.80 |
| | 0.81 | 0.85 | 0.87 | | 0.88 | 0.81 | 0.82 | 0.87 |
| | 7.56 | -- | -- | 13.06 | -- | 162.64 | 72.16 | |
| | | -- | -- | 0.58 | -- | 5.40 | 6.21 | 3.18 |
TP=True positive; FN=False negative; FP=False positive; Bold indicates best result.
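The relationship between the count rows and the ratio rows can be checked with the standard definitions of precision, sensitivity, and F1 score. The sketch below assumes that the three count rows are TP, FN, and FP in the order given in the footnote, and that the three ratio rows correspond to these three metrics; both are inferences from the numbers, not labels taken from the table. The function name `summary_metrics` is hypothetical.

```python
def summary_metrics(tp, fn, fp):
    """Return (precision, sensitivity, f1) from raw call counts."""
    precision = tp / (tp + fp)        # fraction of calls that are true
    sensitivity = tp / (tp + fn)      # fraction of true events that were called
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of the two
    return precision, sensitivity, f1


if __name__ == "__main__":
    # First entries of the three count rows (assumed to be TP, FN, FP).
    p, s, f = summary_metrics(tp=2907, fn=1022, fp=385)
    print(f"precision={p:.2f} sensitivity={s:.2f} f1={f:.2f}")
```

Under these assumptions the result rounds to 0.88, 0.74, and 0.81, which match the first entries of the three ratio rows.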
Figure 2. Overlap of true positive calls for the NA12878 and NA18507 datasets.
Figure 3. Example IGV screenshot of a 71 bp tandem duplication in the BRCA2 gene identified by SoftSearch.
Discordant reads are blue (plus strand) or red (minus strand). Soft clipped bases appear as multicolour “rainbows”.