Jan Schröder, Arthur Hsu, Samantha E Boyle, Geoff Macintyre, Marek Cmero, Richard W Tothill, Ricky W Johnstone, Mark Shackleton, Anthony T Papenfuss.
Abstract
MOTIVATION: Methods for detecting somatic genome rearrangements in tumours using next-generation sequencing are vital in cancer genomics. Available algorithms use one or more sources of evidence, such as read depth, paired-end reads or split reads, to predict structural variants. However, the problem remains challenging due to the significant computational burden and high false-positive or false-negative rates.
Year: 2014 PMID: 24389656 PMCID: PMC3982158 DOI: 10.1093/bioinformatics/btt767
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Clusters of split reads spanning a break point. Three cases are shown: (A) blunt end joining, (B) micro-homology at the break points and (C) untemplated sequence inserted between the break points.
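The three cases in Fig. 1 can be told apart from where the two aligned parts of a split read meet. The following is a minimal sketch under a simplified model, not the method used by Socrates: it assumes the anchored part of the read ends at read position `anchor_end` (exclusive) and the realigned soft clip starts at read position `clip_start`. Both names and the function `classify_junction` are hypothetical.

```python
from typing import Tuple


def classify_junction(anchor_end: int, clip_start: int) -> Tuple[str, int]:
    """Classify a break point junction from read coordinates.

    Coordinates are 0-based positions within the read (an assumption of
    this sketch). anchor_end is exclusive; clip_start is inclusive.
    Returns the junction type and the length of the homology or insertion.
    """
    if anchor_end < 0 or clip_start < 0:
        raise ValueError("read coordinates must be non-negative")

    gap = clip_start - anchor_end
    if gap == 0:
        # The two aligned segments abut exactly: blunt end joining (case A).
        return ("blunt", 0)
    if gap < 0:
        # The segments overlap on the read: those bases match both sides
        # of the fusion, i.e. micro-homology (case B).
        return ("micro-homology", -gap)
    # Unaligned read bases sit between the segments: untemplated
    # sequence inserted at the junction (case C).
    return ("untemplated insertion", gap)


if __name__ == "__main__":
    print(classify_junction(60, 60))
    print(classify_junction(60, 57))
    print(classify_junction(60, 64))
```

In this model a negative gap is reported as a homology length and a positive gap as an insertion length, so a single integer captures all three cases.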
Fig. 2. Comparison of different methods on simulated SV data. (A) Precision and recall of BreakDancer, CLEVER, CREST, DELLY, Pindel, PRISM and Socrates on simulated structural variations in E. coli and human chromosome 12. The mean precision and recall from the simulated series are plotted at 7.5, 15 and 30× coverage. (B) Detailed analysis of feature type (deletion, translocation, inversion, tandem duplication) and size [small (S), medium (M), large (L) and extra large (XL)] showing specific biases for each method on the 30× chromosome 12 data. Note that all methods are tested on a consistent set of variants, but some methods (CLEVER and Pindel) do not make predictions in all categories, which penalizes their performance overall in (A) and for specific classes in (B). See the text for results on novel insertions.
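Precision and recall as plotted in Fig. 2 can be computed by matching predicted break points against the simulated truth set. The sketch below is illustrative only: the matching rule (same chromosome, position within `tolerance` bases, each true break point matched at most once) and the default tolerance of 10 bp are assumptions, since the evaluation criteria are not stated in this section.

```python
from typing import List, Tuple

Breakpoint = Tuple[str, int]  # (chromosome, position)


def precision_recall(
    predicted: List[Breakpoint],
    truth: List[Breakpoint],
    tolerance: int = 10,  # assumed matching window in bases
) -> Tuple[float, float]:
    """Return (precision, recall) for predicted break points.

    A prediction is a true positive if it lies on the same chromosome as
    a not-yet-matched true break point and within `tolerance` bases.
    """
    matched_truth = set()
    true_positives = 0
    for chrom, pos in predicted:
        for i, (t_chrom, t_pos) in enumerate(truth):
            if i in matched_truth:
                continue  # each true event may be claimed only once
            if chrom == t_chrom and abs(pos - t_pos) <= tolerance:
                matched_truth.add(i)
                true_positives += 1
                break

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = len(matched_truth) / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data for illustration, not taken from the paper.
    truth = [("chr12", 1000), ("chr12", 5000), ("chr12", 9000), ("chr12", 12000)]
    predicted = [("chr12", 1003), ("chr12", 5000), ("chr12", 20000)]
    print(precision_recall(predicted, truth))
```

A method that makes no predictions in a category scores zero recall there, which is how missing categories penalize a method's overall result.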
Fig. 3. Structure of the Eμ–myc transgene. Sizes of regions are not to scale, but are indicative. A–H indicate fusions; B–F are known fusions; A and H are novel. Fusion G is the end of the pUC12 reference sequence and is due to the circular topology of the cloning vector. The break point at A is promiscuous, linking chr15 to chr9 and H.
Detection of fusions associated with the Eμ–myc transgene by different methods
| Algorithm | A (19–15) | B (15–12) | C (12–15) | F (15–pUC12) | G (pUC12–pUC12) | H (pUC12–15) |
|---|---|---|---|---|---|---|
| Socrates | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| CREST | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| BreakDancer | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ |
| DELLY | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
Note: Fusions B–F are known (Fig. 3); others are novel or inferred.
Comparison of Socrates and deFuse results
| Sample | All | deFuse | Socrates | Overlap (%) |
|---|---|---|---|---|
| High confidence | | | | |
| 1 | 340 | 84 | 6 | 5 (83%) |
| 2 | 183 | 38 | 4 | 1 (25%) |
| 3 | 431 | 134 | 3 | 3 (100%) |
| 4 | 276 | 72 | 2 | 0 (0%) |
| 5 | 220 | 68 | 1 | 1 (100%) |
| 6 | 440 | 121 | 2 | 1 (50%) |
| Average | 315 | 86 | 3 | 2 (60%) |
Note: Socrates adds increased sensitivity and single-nucleotide resolution to the RNA-seq data analysis.
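The percentages in the Overlap column are consistent with dividing the overlap count by the number of Socrates predictions for each sample. That interpretation is inferred from the numbers, not stated in the table. The short check below recomputes the column and the Average row from the tabulated counts.

```python
# Per-sample (Socrates predictions, overlapping predictions), copied
# from the table above.
rows = {
    1: (6, 5),
    2: (4, 1),
    3: (3, 3),
    4: (2, 0),
    5: (1, 1),
    6: (2, 1),
}


def overlap_percent(socrates: int, overlap: int) -> int:
    """Overlap as a whole-number percentage of Socrates predictions."""
    return round(100 * overlap / socrates)


percents = {sample: overlap_percent(s, o) for sample, (s, o) in rows.items()}
mean_percent = round(sum(percents.values()) / len(percents))

if __name__ == "__main__":
    print(percents)
    print(mean_percent)
```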
Summary of Socrates predictions in different classes of repetitive regions
| Repeat class | Sensitive, normal | Sensitive, somatic | Filtered, normal | Filtered, somatic |
|---|---|---|---|---|
| Non-repetitive | 5509 | 10 957 (13%) | 3176 | 226 (33%) |
| LINE | 2338 | 8630 (10%) | 1114 | 123 (18%) |
| Low complexity | 195 | 1483 (2%) | 63 | 17 (3%) |
| LTR | 1052 | 2726 (3%) | 555 | 42 (6%) |
| Satellite | 8759 | 36 315 (43%) | 214 | 69 (10%) |
| Simple repeat | 2734 | 11 069 (13%) | 571 | 122 (18%) |
| SINE | 1532 | 12 461 (15%) | 621 | 79 (12%) |
Note: Sensitive means at least one long soft clip and one short soft clip supporting the fusion. Filtered means at least two long soft clips on each side.
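The two thresholds in the note can be written as simple predicates over soft-clip counts. This is a minimal sketch of one literal reading of the note, not the filtering code used by Socrates: the `FusionSupport` record and its field names are hypothetical, and the sensitive rule is taken to count soft clips across both sides of the fusion, which the note does not specify.

```python
from dataclasses import dataclass


@dataclass
class FusionSupport:
    """Soft-clip evidence for one predicted fusion (hypothetical record)."""
    long_side1: int       # long soft clips anchored on side 1
    long_side2: int       # long soft clips anchored on side 2
    short_side1: int = 0  # short soft clips on side 1
    short_side2: int = 0  # short soft clips on side 2


def passes_sensitive(s: FusionSupport) -> bool:
    """At least one long and one short soft clip support the fusion.

    Counts are pooled across both sides (an assumption of this sketch).
    """
    total_long = s.long_side1 + s.long_side2
    total_short = s.short_side1 + s.short_side2
    return total_long >= 1 and total_short >= 1


def passes_filtered(s: FusionSupport) -> bool:
    """At least two long soft clips on each side of the fusion."""
    return s.long_side1 >= 2 and s.long_side2 >= 2


if __name__ == "__main__":
    weak = FusionSupport(long_side1=1, long_side2=0, short_side2=1)
    strong = FusionSupport(long_side1=3, long_side2=2, short_side1=1)
    print(passes_sensitive(weak), passes_filtered(weak))
    print(passes_sensitive(strong), passes_filtered(strong))
```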
Resource consumption comparison between the competing algorithms on simulated data and real sequencing reads from a cancer data set
| Algorithm | Chr12 30× simulation: run time (min) | Chr12 30× simulation: max memory (MB) | Eμ–myc: run time (h) | Eμ–myc: max memory (GB) |
|---|---|---|---|---|
| CREST | 87 | 483 | >50 | NA |
| DELLY | 53 | 337 | 33 | 8.6 |
| Socrates | 3 | 550 | 4 | 10.1 |
Note: CREST failed during the soft clip extraction stage after running for over 50 h, so its timing represents a lower bound. Run times are wall-clock measurements.