F Ferrè1, P Clote. 1. Department of Biology, Boston College, Chestnut Hill, MA, USA.
Abstract
UNLABELLED: Dynamic time warping (DTW) is a well-known quadratic time algorithm to determine the smallest distance and optimal alignment between two numerical sequences, possibly of different length. Originally developed for speech recognition, this method has been used in data mining, medicine and bioinformatics. For gene expression time series data, time warping distance is arguably a more flexible tool to determine genes having similar temporal expression, hence possibly related biological function, than either Euclidean distance or correlation coefficient--especially since time warping accommodates sequences of different length. The BTW web server allows a user to upload two tab-separated text files A,B of gene expression data, each possibly having a different number of time intervals of different durations. BTW then computes time warping distance between each gene of A with each gene of B, using a recently developed symmetric algorithm which additionally computes the Boltzmann partition function and outputs Boltzmann pair probabilities. The Boltzmann pair probabilities, not available with any other existent software, suggest possible biological significance of certain positions in an optimal time warping alignment. AVAILABILITY: http://bioinformatics.bc.edu/clotelab/BTW/.
UNLABELLED: Dynamic time warping (DTW) is a well-known quadratic time algorithm to determine the smallest distance and optimal alignment between two numerical sequences, possibly of different length. Originally developed for speech recognition, this method has been used in data mining, medicine and bioinformatics. For gene expression time series data, time warping distance is arguably a more flexible tool to determine genes having similar temporal expression, hence possibly related biological function, than either Euclidean distance or correlation coefficient--especially since time warping accommodates sequences of different length. The BTW web server allows a user to upload two tab-separated text files A,B of gene expression data, each possibly having a different number of time intervals of different durations. BTW then computes time warping distance between each gene of A with each gene of B, using a recently developed symmetric algorithm which additionally computes the Boltzmann partition function and outputs Boltzmann pair probabilities. The Boltzmann pair probabilities, not available with any other existent software, suggest possible biological significance of certain positions in an optimal time warping alignment. AVAILABILITY: http://bioinformatics.bc.edu/clotelab/BTW/.
Dynamic time warping (DTW), described in the text of Kruskal and Liberman (1), is an algorithm to compute the optimal alignment of numerical sequences. Using dynamic programming (2), DTW was first introduced by Vintsyuk (3) and applied in speech recognition by Sakoe and Chiba (4). While most applications of DTW are in the area of speech recognition (5), DTW has recently been applied to the fields of data mining (6–8), medicine (electrocardiograms) (9) and bioinformatics (10,11). In bioinformatics, Aach and Church (10) implement some variants of classical DTW along with interpolative DTW, investigated robustness and statistical significance of time warping alignments, and analyzed published cell-cycle gene expression data of Saccharomyces cerevisiae. Very recently, Criel and Tsiporkova (11) describe their publicly available Java program, GenT χ Warper, which implements classical time warping and provides a user-friendly graphical user interface suitable for biological investigation of gene expression data.DTW is a variant of sequence alignment for numerical sequences, as opposed to sequences of amino acids or nucleotides, where the analogue of a linear gap penalty in sequence alignment is played by time expansion or contraction. Despite the similarity with sequence alignment, DTW is subtly different and thus warrants a short presentation of details.Let a = a1, … , a be numerical values measured at times 0, τ, 2τ, … , (n − 1)τ, and let b = b1, … , b be numerical values measured at times 0, μ, 2μ, … , (m − 1)μ. For each 1 ≤ i ≤ n, 1 ≤ j ≤ m, define
With this notation, (classical) time warping distance between a and b is defined to be D; the optimal alignment is obtained using tracebacks. Clearly the computation of time warping distance can be performed in quadratic time using quadratic memory resources. Using the method of Hirschberg (12), it is possible to reduce memory resources to a linear factor; however, given the current number of time points in gene expression data, this additional complication is unwarranted.As in sequence alignment, time warping can alternatively be viewed as a method to determine the minimum cost path, which proceeds in a left-to-right and bottom-to-top manner from the point (1,1) to the point (n,m), such that each path edge is one step in the horizontal, vertical or diagonal direction. See Figure 1 for an example of the optimal path graph between two small numerical sequences, and notice that unlike the case for sequence alignment, DTW is not symmetric—i.e. time warping distance between sequences a,b is not necessarily equal to that of the reversal of a with the reversal of b. A Sakoe–Chiba band (4) is used to define a warping window that constrains the path close to the diagonal, restricting the portion of the matrix that the path is allowed to visit to
given a parameter p between 0 and 1.
Figure 1
Path graph for classic DTW for (toy) sequences a = (a1,a2,a3,a4,a5) = (2,1,3,6,8) and (b1,b2,b3,b4) = (2,2,7,8) of unequal length. Here |a − b| is Euclidean distance, and time expansion/compression intervals are τ = 10 = μ. (Left) Optimal path graph with distance 25.0 for aligning a with b. (Right) Optimal path graph with distance 20.0 for aligning the reversal of a with the reversal of b. Note that classic DTW is not symmetric, unlike the situation for sequence alignment algorithm of Needleman and Wunsch (20).
In (13), Clote and Straubhaar describe a new variant of time warping, which is proved to be symmetric in the sense just described. The paper (13) includes a quadratic time computation of the Boltzmann forward partition function FZ = ∑ exp(−D/RT), where the sum is over all time warpings of a1, … , a with b1, … , b, D is the (new, symmetric and modified) time warping distance between a1, … , a and b1, … , b, R is the universal gas constant and T is absolute temperature. In the context of time warping, T has no physical significance and can be chosen arbitrarily. Symmetry allows the unambiguous computation of the backward partition function BZ, where the sum is over all time warpings of a, … , a with b, … , b. Together we obtain the Boltzmann pair probability Pr[a,b] that a is aligned with b, defined by
This expression is a slight simplification—see (13) for more details. Boltzmann pair probabilities indicate possible biological significance for certain alignment positions in time warping of gene expression data. For instance, it could be that the alignment of expression values in cell-cycle phases G1 and S is more significant than those in phases G2 and M. Indeed, in the context of sequence alignment, (14) has shown that Boltzmann pair probabilities do indicate biological significance of certain positions in sequence alignments; see also (15). Although currently available gene expression time series data include only a small number of time points, future datasets are certain to include more time points, as costs decrease and accuracy is improved. Hence, our BTW web server should prove increasingly useful in genomics research.
WEB SERVER
Figure 2 displays a screen shot of the BTW web server which computes the optimal symmetric time warping distance between two numerical sequences, and allows the user to obtain the Boltzmann pair probabilities Pr[a,b] that a and b are aligned. Additionally graphical output is available, as depicted in Figure 3.
Figure 2
Output from BTW: top ranking time warping distance for genes from file A with those from file B. Small time warping distance could indicate a related biological function, as determined in the Gene Ontology. For each gene pair, the user can retrieve a graphical description of the optimal alignment, as well as text and graphical output for the Boltzmann pair probabilities Pr[a,b]. Optimal alignment and pair probabilities graphical output is shown in Figure 3.
Figure 3
Optimal alignment (upper panel) and Boltzmann pair probabilities (lower panel) for alignment positions in optimal time warping between yeast gene YGL116W (17 time points) and human gene U05340 (13 time points). Boltzmann probabilities are computed with k = 1 and for the values 0.5, 1.0 and 2.0 of T. With increasing values of T, the Boltzmann probability values decrease; this enhances the depiction of presumed biological significance of aligned positions in this optimal time warping.
The BTW web server currently allows a user to upload two tab-separated text files, A and B, each containing gene expression time series data. The first line is required to contain the time points, measured in minutes—for instance, file A might contain values t1, … , t equal to 0, 10, 20, … , 160 as in Cho's expression data for S.cerevisiae (16), and file B might contain values t′1, … , t′ equal to 0, 120, 240, … , 1440 as in Cho's expression data for Homo sapiens (17). Time intervals need not be constant; for instance, a time series might consist of values measured at times 0, 20, 40, 80, 160. The user may enter constants c resp. d, set by default to 1, in order to allow scaling between τ and μ; i.e. τ′ = c · τ resp. μ′ = d · μ. For instance, since approximately two cell-cycles for S.cerevisiae occur in 160 min for the data of (16) with intervals of 10 min, and since approximately two cell-cycles for H.sapiens occur in 1440 min (24 h) or the data of (17) with intervals of 120 min, there are 16 non-zero time points for yeast compared with 12 time points for human. Hence one could take c = 1/10, d = 4/3·1/120 or alternatively c = 1, d = 4/3·1/12. The leftmost column of both A and B contains distinct gene names.Every subsequent line of file A resp. B is required to contain normalized log expression values, with no missing data—the user is assumed to complete missing data by using interpolation, splines (18) or another method. The BTW web server computes the (symmetric) time warping distance between each gene of A and each gene of B, where due to time and space constraints, file A resp. B has an upper bound of 100 resp. 10 000 genes. Time warping scores for gene pairs are sorted by increasing distance, or available in lexicographic order of gene pairs. For up to 100 user-specified gene pairs, Boltzmann pair probabilities are available in both text and graphical format, the latter depicted in Figure 3. Since Clote and Straubhaar (13) proved the symmetry of one of the four variants of DTW in (10), the BTW web server allows the user to choose either the algorithm of Clote or that of Aach. [After seeing a preprint of (13) Aach conjectured that his DTW program genewarp with flag -a 2 is symmetric. Aach's conjecture is proved in (13)].Current hardware supporting the BTW web server consists of a Beowulf-style cluster comprising 6 Dell 1650, 2 × 1300 MHz Pentium III, 2 GB RAM with 4 Apple XServe, 2 × 1333 MHz G4, 2 GB RAM and finally 12 Dell 1850, 2 × 2800 MHz Xeon EM64T, 2 GB RAM. Interconnect is 1 Gbit Ethernet. Pentium III nodes are running 32-bit CentOS 4.2, Xeon EM64T nodes are running 64-bit CentOS 4.2 and G4 nodes are running MacOS 10.2.8.
DISCUSSION
A histogram containing 1000 classes, produced over 40 million [6601 yeast × 7077 human (16,17)] time warping distances between each yeast and each human gene is depicted in Figure 4A. This figure suggests that time warping distances could follow an extreme value distribution, known by (19) to be the distribution of BLAST hits. From the method of moments, we fit the histogram of Figure 4A to an extreme value distribution with cumulative distribution function . This would allow us to compute a P-value for significance of human genes, whose time warping distance with a given yeast gene is less than a fixed threshold.
Figure 4
(A) Histogram of (symmetric) time warping distances between each yeast and human gene, using gene expression data from Cho et al. (16,17); mean μ = 100.91, SD σ = 15.647, max = 221.74, min = 19.99. (B) Histogram of (symmetric) time warping distances between each yeast and human gene from a sample of 188 homologous pairs of yeast/human genes; mean μ = 101.73, SD σ = 17.75, max = 167.41, min = 48.03. Homologous pairs determined using NCBI HomoloGene (J. Straubhaar personal communication).
Authors: R J Cho; M Huang; M J Campbell; H Dong; L Steinmetz; L Sapinoso; G Hampton; S J Elledge; R W Davis; D J Lockhart Journal: Nat Genet Date: 2001-01 Impact factor: 38.330
Authors: E G Caiani; A Porta; G Baselli; M Turiel; S Muzzupappa; M Pagani; A Malliani; S Cerutti Journal: Med Biol Eng Comput Date: 2002-03 Impact factor: 2.602
Authors: R J Cho; M J Campbell; E A Winzeler; L Steinmetz; A Conway; L Wodicka; T G Wolfsberg; A E Gabrielian; D Landsman; D J Lockhart; R W Davis Journal: Mol Cell Date: 1998-07 Impact factor: 17.970