Bonnie Berger, Michael S Waterman, Yun William Yu.
Abstract
Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. We then describe how those algorithms led to the heuristics employed in the most widely used software in bioinformatics, BLAST, a program that searches DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates have resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders-of-magnitude acceleration of biological similarity search.
Keywords: Levenshtein distance; dynamic programming; metric entropy; sequence comparison; similarity search
Year: 2020 PMID: 34257466 PMCID: PMC8274556 DOI: 10.1109/tit.2020.2996543
Source DB: PubMed Journal: IEEE Trans Inf Theory ISSN: 0018-9448 Impact factor: 2.501
Fig. 1. Smith-Waterman algorithm.
Start with a scoring matrix. As an example, consider the alignment of two strings GCA and AGCT, where correctly aligned letters give a score of +1, substitutions a score of −1, and insertions or deletions a score of −2. The matrix is filled in recursively, with a base case of 0s in the leftmost column and top row. Moving to the right in the matrix corresponds to an insertion of a character from the left string, moving down corresponds to an insertion of a character from the top string, and moving diagonally down and to the right corresponds to either a correctly aligned letter or a substitution. Each cell need only consider three neighbors: the cell above it, the cell to its left, and the cell diagonally above and to the left. The score in a cell is the maximum score that can be achieved by coming from one of those three cells, with a floor of 0. After filling in the matrix, we scan it for high scores and reconstruct the optimal path by tracing back from the maximum-scoring cells. In this example, the optimal local alignment is GC to GC.
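The fill-and-traceback procedure in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `smith_waterman`, its parameter names, and the tie-breaking order during traceback (diagonal first, then up, then left) are assumptions made for the example. The scores are the ones given in the caption.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    `a` indexes the rows and `b` the columns of the scoring matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Base case: the top row and leftmost column are all zeros.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + sub   # match or substitution
            up = H[i - 1][j] + gap         # gap in b
            left = H[i][j - 1] + gap       # gap in a
            H[i][j] = max(0, diag, up, left)  # floor of 0 makes it local
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GCA", "AGCT"))  # (2, 'GC', 'GC')
```

With the caption's strings and scores, the maximum cell has score 2 and the traceback recovers GC aligned to GC.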
Fig. 2. Cartoon illustration of coverings and low fractal dimension.
In this example, points and balls are in a 2D space. The metric ball covers are the gray circles, the green triangle represents a search query q, the inner red circle is B(q, s), and the outer red circle corresponds to B(q, s + r). The red circles illustrate the desired search radius for similarity search and the wider search radius needed to find any ball that might contain a point within the desired search radius. The number of metric ball coverings represents the covering number of the data, which is proportional to the Shannon entropy. Low fractal dimension can be understood intuitively as there not being too many neighboring balls surrounding the one containing the query, so that the covering looks tree-like. The theory generalizes to points in a high-dimensional space, for which the balls would be hyperspheres. Figure taken from [20].
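The two-radius search in this caption can be sketched as follows, using Levenshtein distance as the metric. This is a rough illustration under stated assumptions, not the method of [20]: the greedy cover construction and the names `levenshtein`, `build_cover`, and `search` are hypothetical choices made for the example. The only property it relies on is the triangle inequality: if a point p lies within radius r of a center c and within s of the query q, then c lies within s + r of q, so balls whose centers are farther away can be skipped.

```python
def levenshtein(x, y):
    """Edit distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cx != cy)))  # match or substitution
        prev = cur
    return prev[-1]


def build_cover(points, r, dist):
    """Greedy cover (an assumption for this sketch): a point joins the first
    center within distance r, otherwise it becomes a new center."""
    clusters = {}
    for p in points:
        for c in clusters:
            if dist(p, c) <= r:
                clusters[c].append(p)
                break
        else:
            clusters[p] = [p]
    return clusters


def search(clusters, q, s, r, dist):
    """Return all points within distance s of q.

    Coarse step: keep only balls whose center is within s + r of q.
    Fine step: compare q against the members of those balls.
    """
    hits = []
    for center, members in clusters.items():
        if dist(q, center) <= s + r:
            hits.extend(p for p in members if dist(q, p) <= s)
    return hits


db = ["ACGT", "ACGA", "ACCT", "TTTT", "TTTA", "GGGG"]
cover = build_cover(db, r=1, dist=levenshtein)
print(search(cover, "ACGG", s=1, r=1, dist=levenshtein))  # ['ACGT', 'ACGA']
```

In this toy database the ball centered at TTTT is discarded after a single distance computation. The fewer balls that fall inside the outer radius s + r, the more work the coarse step saves, which is the intuition behind the low fractal dimension described in the caption.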