| Literature DB >> 21233167 |
Alexej Abyzov1, Mark Gerstein.
Abstract
MOTIVATION: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is an important problem, as it is a prerequisite for classifying SVs, evaluating their functional impact and reconstructing personal genome sequences. Given approximate breakpoint locations and a bridging assembly or split read, the problem essentially reduces to finding a correct sequence alignment. Classical algorithms for alignment and their generalizations guarantee finding the optimal (in terms of scoring) global or local alignment of two sequences. However, they cannot generally be applied to finding the biologically correct alignment of genomic sequences containing SVs because of the need to simultaneously span the SV (e.g. make a large gap) and perform precise local alignments at the flanking ends.Entities:
Mesh:
Year: 2011 PMID: 21233167 PMCID: PMC3042181 DOI: 10.1093/bioinformatics/btq713
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.Schematics of the expected optimal alignment around a structural variation (left) and alignments produced by global Needleman–Wunsch (NW) and local Smith–Waterman (SW) algorithms (right). The structural variation, i.e. deletion, is in red. In (B), the deletion is accompanied by a small insertion (blue). Throughout the figure, alignable flanking regions are shown in green and orange. Both SW and NW algorithms generally cannot arrive at a biologically correct alignment.
Fig. 2.Sequence similarity (see A) around SV breakpoints (shades of green) can mislead local and/or global alignment(s) to produce incorrect alignment (see B). Conceptually, to produce correct alignment one has to find an optimal jump between overlapping local alignments. However, local alignment calculation and jump finding have to be done simultaneously rather than successively to guarantee finding the optimal alignment (Supplementary Fig. S4).
Fig. 3.Schematics of the algorithm. (A) Two alignment score matrices for the alignment of the 5′ ends (S matrix) and the 3′ ends (S matrix) are constructed a la the Smith–Waterman local alignment. In this example, scoring is as follows: match = 1, mismatch = −1, gap open penalty = 4 and gap extend penalty = 2. Orange arrows represent trace-back information. The best alignment maximizes the sum of the maximum in the leading submatrix (highlighted buff) in S and the maximum in the paired trailing submatrix (highlighted cyan) in S. (B) Matrices are converted so that each cell contains the maximum score of the leading submatrix in S and the trailing submatrix in S. The location of score maxima is traced (blue arrows), just like the score is traced in scoring matrices. Maximum score sum (red) can now be found in one pass through the matrices. (C) Alignment is constructed by, first, tracing back the maximum location (red arrows) in each matrix and then, tracing back alignments for the 5′ and 3′ ends (green arrow). The resulting alignment is the sum of the alignments at the 5′ and 3′ ends, with the unaligned region in-between. The maximum score can be redundant (bold rectangles). However, the resulting alignment will be the same.
Fig. 4.Schematics for aligning sequences with tandem duplication and inversion. Color gradient reflects the directionality of sequences from the 5′-end to the 3′-end. (A) Aligning a contig/read spanning a duplication breakpoint can be no different than aligning sequences with a deletion/insertion. (B) For contigs spanning other duplication breakpoints, the order of the aligned fragments is different in the two se-quences. (C) Optimal split-alignment for sequences with inversions is calculated by aligning the 5′-end of the first sequence to the 5′-end of the second one and aligning the 3′-end of the first sequence to the 3′-end of the reverse complement of the second one.
Fig. 5.Comparison of assembled contig alignments in the region of predicted deletions. The first line in each alignment is the sequence for the genomic region, while the second is for the contig sequence. Nucleotide numbering is sequential, starting from one in both compared sequences. Each alignment is accompanied by a schematic representation underneath. (A) The predicted deletion is chr20:2,969,769-2,970,056. The contig that is 614 bp in length has been aligned by the AGE, GAP3, CrossMatch and Blat programs to the predicted region of deletion, which is extended by 1 kb in each direction, i.e. from 2,968,769-2,971,056. The first sequence (genomic region) has two pairs of homologous sequences: orange to yellow and dark green to light green. AGE alignment clearly identifies a large unaligned region, confirms a predicted deletion, and derives deletion breakpoints as chr20:2,969,756-2,970,052 (coordinates are for the first and the last deleted bases). Note that the resulting breakpoints are in excellent agreement (within 13 bp) with the prediction. No other program was able to produce the correct alignment. (B) Predicted deletion is chr8:118,292,728-118,292,987. The contig of 530 bp in length has been aligned by the AGE and GAP3 programs to the predicted region of deletion, which is extended by 1 kb in each direction, i.e. from 118 291 728 to 118 293 987. AGE alignment clearly identifies a large unaligned region, confirms a predicted deletion, and derives deletion breakpoints as chr8:118,292,711-118,292,990 (coordinates are for first and last deleted bases). GAP3 is not able to align the left flanking sequence, as the penalty for a long gap outweighs the matches at the left flanking sequence. All coordinates are for human hg18 reference.