Deniz Yorukoglu, Faraz Hach, Lucas Swanson, Colin C Collins, Inanc Birol, S Cenk Sahinalp.
Abstract
MOTIVATION: Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention. In this article, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
Year: 2012 PMID: 22689759 PMCID: PMC3371846 DOI: 10.1093/bioinformatics/bts214
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Structural alteration events considered in this article. T represents the transcript; G and S represent two genomic regions. G′ is the complementary strand of G. Boundaries between red and green blocks indicate event breakpoints; arrows represent the corresponding genomic transitions in the alignment. Apart from the event types shown in the figure, duplication events can also be non-tandem, and fusions can be between two different strands.
Fig. 2. Fragment chaining in the presence of a rearrangement and an inversion. The fragments involved include two segments from T associated with segments from G and another segment from T associated with a segment from G′. The figure depicts how the fragments reveal themselves in the alignment tables and how they can be chained to obtain the overall alignment.
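The chaining idea in the Fig. 2 caption can be illustrated with a small dynamic program. This is only a minimal sketch under stated assumptions, not Dissect's actual formulation: the `Fragment` type, the `transition_label` and `chain_fragments` functions, the flat `event_penalty`, and the toy coordinates are all hypothetical, and positions on G′ are assumed to be measured along G′ itself so that a colinear step always moves forward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A gap-free match between transcript T and a genomic strand (hypothetical type)."""
    t_start: int  # start in T, inclusive
    t_end: int    # end in T, exclusive
    g_start: int  # start on the matched strand, inclusive
    strand: str   # "+" for G, "-" for G' (coordinates measured along that strand)

    @property
    def length(self) -> int:
        return self.t_end - self.t_start

    @property
    def g_end(self) -> int:
        return self.g_start + self.length


def transition_label(prev: Fragment, cur: Fragment) -> str:
    """Classify the genomic transition between two consecutive fragments."""
    if prev.strand != cur.strand:
        return "inversion"
    if cur.g_start >= prev.g_end:
        return "colinear"
    # Same strand but the genomic position moves backwards or overlaps.
    return "rearrangement"


def chain_fragments(fragments, event_penalty=5):
    """Pick the best-scoring chain of fragments that do not overlap in T.

    Score = total matched length minus a flat penalty (an assumption)
    for every non-colinear transition.
    """
    frags = sorted(fragments, key=lambda f: (f.t_start, f.t_end))
    if not frags:
        return 0, [], []
    best = [f.length for f in frags]
    back = [None] * len(frags)
    for j, cur in enumerate(frags):
        for i in range(j):
            prev = frags[i]
            if prev.t_end > cur.t_start:
                continue  # fragments overlap in the transcript
            cost = 0 if transition_label(prev, cur) == "colinear" else event_penalty
            score = best[i] + cur.length - cost
            if score > best[j]:
                best[j], back[j] = score, i
    # Trace back from the best-scoring end fragment.
    k = max(range(len(frags)), key=best.__getitem__)
    order = []
    while k is not None:
        order.append(k)
        k = back[k]
    chained = [frags[k] for k in reversed(order)]
    labels = [transition_label(a, b) for a, b in zip(chained, chained[1:])]
    return best[order[0]], chained, [e for e in labels if e != "colinear"]


# Toy input loosely mirroring Fig. 2: two fragments on G, one on G', one spurious hit.
example = [
    Fragment(0, 40, 100, "+"),
    Fragment(10, 18, 500, "+"),   # short spurious match
    Fragment(40, 70, 20, "+"),    # genomic position jumps backwards
    Fragment(70, 100, 200, "-"),  # switches to the complementary strand
]
score, chained, events = chain_fragments(example)
print(score, events)
```

On this toy input the short spurious fragment is left out of the chain, and the two breakpoints are reported as a rearrangement followed by an inversion. A flat penalty is the simplest possible choice; a realistic scoring scheme would also need to handle gaps, mismatches and overlapping fragments.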
Alignment results of Dissect for the simulated wild-type transcriptome dataset with novel insertions
| Insertion length | Total | WT | All events | A. D/I | N.A. |
|---|---|---|---|---|---|
| 6–20 bases | 8365 | 8335 | 12 | 16 | 2 |
| 21–35 bases | 8365 | 8284 | 52 | 23 | 6 |
| 36–50 bases | 8365 | 8223 | 106 | 24 | 13 |
| 51–65 bases | 8365 | 8117 | 204 | 20 | 24 |
Rows represent the length interval of the novel insertion distribution (e.g. insertions reported in the first row are uniformly distributed between 6 and 20 nucleotides). Columns indicate the output labels of Dissect: the All events column gives the total number of transcripts that Dissect identified as containing a structural alteration; the A. D/I column gives the number of alignments that contain a short ambiguous interval that cannot be verified with certainty as an insertion or a duplication; and the N.A. column gives the number of transcript sequences for which Dissect did not return a valid high-similarity alignment.
The number of structural alterations detected by Dissect for the simulation datasets
| Experiment | Tot. | Tot-E. | Fusion | Inv. | F. Dup. | F. Rea. |
|---|---|---|---|---|---|---|
| Exp. 1 | 5234 | 5099 | 1 | 5 | 5092 | 1 |
| Exp. 2 | 5234 | 5172 | 0 | 1 | 5171 | 0 |
| Exp. 3 | 5234 | 5093 | 0 | 0 | 5093 | 0 |
| Exp. 4 | 4788 | 4762 | 0 | 4762 | 0 | 0 |
| Exp. 5 | 4788 | 4331 | 0 | 4331 | 0 | 0 |
| Exp. 6 | 3188 | 3125 | 0 | 3125 | 0 | 0 |
| Exp. 7 | 4654 | 4501 | 0 | 4501 | 0 | 0 |
| Exp. 8 | 4788 | 4512 | 2 | 8 | 3 | 4499 |
| Exp. 9 | 4788 | 4623 | 0 | 8 | 2 | 4613 |
| Exp. 10 | 4316 | 4255 | 0 | 14 | 4 | 4237 |
| Exp. 11 | 1312 | 1237 | 1232 | 5 | 0 | 0 |
| Exp. 12 | 1558 | 1433 | 1433 | 0 | 0 | 0 |
| Exp. 13 | 2363 | 562 | 562 | 0 | 0 | 0 |
Tot., total number of transcript sequences; Tot-E., total number of transcripts in which a structural event was discovered; Fusion, total number of fusions; Inv., inversion events, including inverted duplications, inverted rearrangements, in-place inversions and suffix-inversions; F. Dup., forward duplications; F. Rea., forward rearrangement events.
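As a reading aid, the fraction of transcripts in which an event was reported (Tot-E. divided by Tot.) can be computed directly from the table above. This is a small sketch; the variable names are illustrative, and the ratio is only the reported fraction, not a sensitivity figure, since the legend does not state how many transcripts in each experiment truly contain an event.

```python
# (Tot., Tot-E.) pairs copied from the simulation table above.
simulation_counts = {
    "Exp. 1": (5234, 5099),
    "Exp. 2": (5234, 5172),
    "Exp. 3": (5234, 5093),
    "Exp. 4": (4788, 4762),
    "Exp. 5": (4788, 4331),
    "Exp. 6": (3188, 3125),
    "Exp. 7": (4654, 4501),
    "Exp. 8": (4788, 4512),
    "Exp. 9": (4788, 4623),
    "Exp. 10": (4316, 4255),
    "Exp. 11": (1312, 1237),
    "Exp. 12": (1558, 1433),
    "Exp. 13": (2363, 562),
}

# Fraction of transcripts per experiment in which Dissect reported an event.
reported_fraction = {
    name: with_event / total
    for name, (total, with_event) in simulation_counts.items()
}

for name, fraction in reported_fraction.items():
    print(f"{name}: {fraction:.1%}")

# Experiment with the smallest reported fraction.
lowest = min(reported_fraction, key=reported_fraction.get)
print("lowest:", lowest)
```

Every experiment except Exp. 13 has a reported fraction above 90%; Exp. 13 is the clear outlier, at roughly 24%.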