Topological data analysis, including persistent homology, has undergone significant development in recent years. However, one outstanding challenge is to build a coherent statistical inference procedure on persistent diagrams. The paired dependent data structure, which are the births and deaths in persistent diagrams, adds complexity to statistical inference. In this paper, we present a new lattice path representation for persistent diagrams. A new exact statistical inference procedure is developed for lattice paths via combinatorial enumerations. The proposed lattice path method is applied to study the topological characterization of the protein structures of the COVID-19 virus. We demonstrate that there are topological changes during the conformational change of spike proteins, a necessary step in infecting host cells.
Topological data analysis, including persistent homology, has undergone significant development in recent years. However, one outstanding challenge is to build a coherent statistical inference procedure on persistent diagrams. The paired dependent data structure, which are the births and deaths in persistent diagrams, adds complexity to statistical inference. In this paper, we present a new lattice path representation for persistent diagrams. A new exact statistical inference procedure is developed for lattice paths via combinatorial enumerations. The proposed lattice path method is applied to study the topological characterization of the protein structures of the COVID-19 virus. We demonstrate that there are topological changes during the conformational change of spike proteins, a necessary step in infecting host cells.
Despite its rigorous mathematical foundation developed since its introduction 20 years ago [11], persistent homology still suffers from numerous statistical and computational problems. It has not yet become a standard tool in medical imaging. This approach has been applied to a wide variety of data including brain networks [9], protein structures [12], and RNA viruses [5]. However, most of these methods only serve as exploratory tools that provide descriptive summary statistics rather than formal inference. The main difficulty is due to the heterogeneous nature of topological features, which do not have a one-to-one correspondence across persistent diagrams. Motivated by these challenges, we propose a more principled topological inference procedure through lattice paths.Lattice paths are widely studied algebraic objects in combinatorics and may have potential applications in persistent homology [2,16,18]. Previously, they have been used to differentiate 0th and 1st Betti numbers in functional brain networks [9]. However, the method is limited to a specific type of filtration that produces only up to the 1st Betti numbers. Here, we propose a procedure that is generalizable to arbitrary persistent diagrams. The main advantage of the lattice path approach is that it provides probabilistic statements about the similarity of two persistent diagrams. This is often needed to produce some baseline quantitive measure, such as the p-value, commonly used in biomedical research [7,9]. Existing methods for computing p-values usually rely on approximate resampling techniques: jackknife [8], bootstrap [1] and the permutation test [9]. Our approach here is different because it is analytic and thus compute the exact probability.The main contributions of this study are the following: (1) a new data representation via Dyck and lattice paths; (2) the first analytic approach for computing probabilities without resampling; (3) the first topological study on the shape of COVID-19 virus spikes proteins. The proposed lattice path method was use din differentiating the conformational changes of the COVID-19 virusspike proteins (Figure 1). This demonstration is particularly relevant due to the potential for advancing vaccine development [19,4].
Fig. 1:
Left: Schematic representation of COVID-19 virus with spike proteins (red). Right: Spike proteins of the three different corona viruses. The spike proteins consist of three similarly shaped interwinding substructures identified as A (blue), B (red) and C (green) domains.
Preliminary: simplicial homology for protein structures
High dimensional objects, such as proteins and molecules, can be modeled as a point cloud data V consisting of p number of points (atoms) indexed as V = {1, 2, ⋯, p}. Suppose that the distance ρ between points i and j satisfies the metric properties. For proteins, we can simply use the Euclidean distance between atoms in a molecule. Then is a metric space where we can build a filtration necessary for persistent homology. If we connect points following some criterion on the distance, they will form a simplicial complex which will follow the topological structure of the molecule [10,15,21]. Note that the k-simplex is the convex hull of k + 1 points in V. A simplicial complex is a finite collection of simplices such as points (0-simplex), lines (1-simplex), triangles (2-simplex) and higher dimensional counterparts. In particular, the Rips complex
is a simplicial complex, whose k-simplices are formed by (k + 1) points which are pairwise within distance ϵ [13]. While a graph has at most 1-simplices, the Rips complex has at most (p − 1)-simplices. The Rips complex induces a hierarchical nesting structure called the Rips filtration for filtration values 0 = ϵ0 < ϵ1 < ϵ2 < ⋯. The filtration is quantified through k-cycles where 0-cycles are the connected components, 1-cycles are loops while 2-cycles are a 3-simplices (tetrahedron) without interior. The Betti numbers β, count the number of independent k-cycles. During the Rips filtration, the i-th k-cycles are born at filtration value b and die at d. The collection of all the paired filtration values {(b1, d1), ⋯, (b, d)} displayed as 1D intervals is called the barcode and displayed as scatter points in 2D plane is called the persistent diagram. Since b < d, the scatter points in the persistent diagram are traditionally displayed above the line y = x line by taking births in the x-axis and deaths in the y-axis.
Methods
Dyck paths.
The first step in the proposed lattice path method is to sort the set of all the birth and death values in the filtration as an order statistic c : c(1) < c(2) < ⋯ < c(2, where c( is one of the birth or death values. The subscript ( denotes the i-th smallest value. We will simply call such sequence as the birth-death process. Every possible valid sequence of birth and death values can be viewed as forming a probability space, where each valid sequence is likely to happen with equal probability. During the filtration, the sequence of birth and death occurs somewhat randomly but still maintains a specific pairing structure.There exists a one-to-one relation between the ordering information and Dyck paths if we identify births with ↗ and deaths with ↘ [2,16]. If we trace the arrows, we obtain the Dyck path (Figure 2) [18]. A valid Dyck path always starts at y = 0 and ends at y = 0. At any moment during the filtration, a Dyck path cannot go below y = 0. The total number of Dyck paths is called the Catalan number . The first few Catalan numbers are κ1 = 1, κ2 = 2, κ3 = 5 and κ4 = 14. More rapid changes in the direction of Dyck paths imply more fleeting fluctuations indicative of smaller topological signals. Less fluctuations indicate larger persistence and thus larger topological structures. The first path in Figure 2 has larger persistence while the last path has smaller persistence.
Fig. 2:
Top: 4 different Dyck paths out of 14 possible paths for = 4. Bottom: corresponding lattice paths. Right: Weighted lattice path equivalent to a persistent diagram.
Lattice paths.
If we rotate the Dyck paths clockwise at 45° and flip vertically, we obtain equivalent monotone lattice paths consisting of a sequence of → (uparrow) and ↑ (downarrow). Figure 2 displays corresponding monotone lattice paths between (0, 0) and (q, q) in where the path does not pass above the diagonal line y = x [18]. During the filtration, there cannot be more deaths than births and thus the path must lie below the diagonal line. The total area below Dyck paths can be used to quantify the Dyck paths [6]. Since
we will simply use lattice paths for quantification. If we tabulate how the area of boxes change over the x-coordinate in a lattice path, it is monotone. In the first path in Figure 2, the number of boxes below the first and the last lattice paths are (0, 0, 0, 0) and (0, 1, 2, 3). The area below the path is related to persistence. A barcode with smaller persistences (last path in Figure 2) will have more boxes (dark gray boxes) while longer persistences will have fewer boxes (first path in Figure 2). Given the sequence of heights of piled-up boxes, we can recover the corresponding lattice path by tracing the outline of boxes. We can further recover the original pairing information about births and deaths.Remarks. In the situation where the birth values are identical at −∞, persistent diagrams line up vertically as (0, d() and hence we simply augment them as ((i − 1)δ, d(). For instance, the Rips filtration for 0-cycles line up vertically.
Weighted lattice paths.
The lattice and Dyck path representations only encode the ordering information about how births and deaths are paired, and do not encode the actual filtration values. This is remedied by adaptively weighting the length of arrows in lattice paths. We sort the set of birth values b and death values d as the order statistics:
We start at origin (0, 0). When we encounter a birth b(, we take the horizontal step to b(. When we encounter a death d(, take the vertical step to d( (Figure 2). The weighted lattice paths contains the same topological information as the original persistent diagram. Using the weighted lattice paths, we map a birth-death process to a monotone function ϕ:Theorem 1.
There exists a one-to-one map from a birth-death process to a monotone function ϕ with ϕ(0) = 0 and ϕ(1) = q.Proof. We explicitly construct such a function ϕ. Consider the sequence of heights (or death values) as we traverse the weighed lattice path: h : h1 ≤ h2 ≤ ⋯ ≤ h. The heights may not strictly increase (Figure 2-right). If births occurs r times sequentially in the birth-death process, the heights h will have r repeated identical death values d( as a subsequence. To make the subsequence strictly increasing, we simply add δ(0, 1, 2, ⋯, r − 1)(d( − d() to the repetition for (Figure 3 for example). Denote the transformed sequence as . ϕ(t) is given as a step function
From ϕ(t), the original sequence h and the original birth-death process can be recovered exactly. Such map from a birth-death process to ϕ is one-to-one. □
Fig. 3:
Left: The heights of boxes below lattice paths h (dotted line) is made into strictly increasing to h′ (solid line). Right: The problem of lattice path enumeration between (0, 0) and (q1, q2) with the constraint |x/q1 − y/q2| < d.
Remarks. Note ‖h − h′‖2 → 0 as δ → 0. So by making δ as small as possible, we can construct a strictly monotone h′ to be arbitrarily close to h. The normalized step function ϕ(t)/q can be viewed as an empirical cumulative distribution and many statistical tools for analyzing distributions can be readily applied. Figure 4-bottom displays the lattices paths and the normalized step functions of 1-cycles corresponding to the spike proteins used in the study. The proposed lattice path method uses ϕ(t)/q as a new topological functional descriptor – which is one of the main contributions in this paper.
Fig. 4:
Top: persistent diagrams of three different spike proteins. The red dots are 0-cycles and the black dots are 1-cycles. The units are in Å (angstrom). Bottom: the corresponding lattice paths and normalized step functions ϕ(t)/q.
Exact topological inference.
The task now is to develop a method for testing the topological equivalence of two birth-death processes:
where each element is either a birth or death. Let ϕ1 and ϕ2 be the step functions corresponding to C1 and C2. The topological distance
will be used as the test statistic for testing the equivalence of C1 and C2. The normalizing denominators q1 and q2 ensures that the value of step functions are in [0, 1]. The statistic D(ϕ1, ϕ2) is the upper bound of area difference under ϕ1(t)/q1 and ϕ2(t)/q2:Theorem 2.
Under the null hypothesis of the equivalence of
C1
and
C2,
where A = A + A
with the boundary condition
within the band |u/q1 − v/q2| < d.Proof. The statement can be proved similarly as the combinatorial construction of the Kolmogorov-Smirnov test that also uses the lattice path enufmeration [3,14,9]. First, we combine monotonically increasing sequences C1 and C2 and sort them into a bigger monotone sequence of size q1 + q2. Then, we represent the combined sequence as the sequence of → and ↑ respectively depending on if they are coming from C1 or C2. Under the null, there is no preference and they equally likely come from C1 or C2. If we follow the sequence of arrows, it forms a monotone lattice path from (0, 0) to (q1, q2). In total, there are possible equally likely lattice paths that forms the sample space. This is equivalent to the total number of permutations in a two-sample test.From Theorem 1, the values of ϕ1(t) and ϕ2(t) are integers from 0 to q. Thus, (ϕ1(t), ϕ2(t)) are the (x, y)-coordinates in lattice paths. Note
if and only if |ϕ1(t)/q1 − ϕ2(t)/q2| < d for all t. Such points are all the lattice points satisfying within the band bounded by two lines y/q2 = x/q1 ± d (dotted red lines in Figure 3). Thus, the probability of the event D < d is equal the number of valid paths over the every possible paths:
where A is the total number of valid paths from (0, 0) to (u, v). Since there are only two paths leading to (u, v), we write
If there is no constraint, . However, due to the constraint, A has to be iteratively computed with the boundary condition A = A0, = 1 for all u and v. □For numerical implementation, we start from the bottom left corner (0, 0) and move toward the top right corner (q1, q2) in solving the recursion numerically:
Computing iteratively requires at most q1 · q2 operations while the permutation test will cause a computational bottleneck for large q1 and q2. Thus, the proposed lattice path method computes the exact p-value substantially faster than the permutation test. Since most protein molecules consist of thousands of atoms, q1 and q2 should be sufficiently large to apply the asymptotic:Theorem 3.
.The result follows by treating ϕ1/q1 and ϕ2/q2 as the empirical cumulative distribution functions and using the asymptotic expansion of the KS-test [14,17]. Subsequently, the p-value under null is given by
where d is the observed value of . For any large observed value d ≥ 2, the second term is in the order of 10−14 and insignificant. Even for a small value of d d, the expansion converges quickly and 5 terms should be more than sufficient to use Theorem 3.
Experiment: spike proteins of COVID-19 virus
The proposed lattice path method is used to study the topological structure of the severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2), which is often called COVID-19. Since the start of the global pandemic (approximately December 2019), COVID-19 has already caused 3.85 million deaths in the world as of June 2021. The COVID-19 virus is specific member of a much broader coronavirus family, which all have spike proteins surrounding the spherically shaped virus similar to the sun’s corona. The glycoprotein spikes that bind with receptors on cells and consequently cause severe infection. The atomic structure of spike proteins can be determined through the cryogenic electron microscopy (cryo-EM) [19,4]. Figure 1-left illustrates spike proteins (colored red) that surrounds the spherically shaped virus. Each spike consists of three similarly shaped protein molecules with rotational symmetry often identified as A, B and C domains. The spike proteins have two distinct conformations identified as open and closed states, where the domain’s opening is necessary for interfacing with the host cell’s surface for membrane fusion and viral entry (Figure 1-right). Indeed, most current vaccine efforts focus on preventing the open state from interfacing with the host cell. Hence, this line of research is of prime importance in vaccine development and therapeutics [4].In this study, we analyzed the spikes of three different coronaviruses identified as 6VXX, 6VYB [19] and 6JX7 [20]. The 6VXX and 6VYB are the closed and open states of SARS-Cov-2 from human while 6JX7 is feline coronavirus (Figure 1). All the domains of 6VXX have exactly 7604 atoms and are expected to be topologically identical. Applying the lattice path method to 1-cycles, we tested the topological equivalence of the B-domain and the A- and C-domains within 6VXX. The normalized step functions are almost identical and the observed topological distances are 0.0090 and 0.0936, which give the p-value of 1.00 each. As expected, the method concludes that they are topologically equivalent. Figure 4- bottom displays the lattice paths and the normalized step functions for domain B of 6VXX. The plots for other domains are visually almost inseparable and hence not shown. The closed domain B of 6VXX is also compared against the open domain B of 6VYB. The open state has a significantly reduced number of atoms at 6865 due to the conformational change that may change the topology as well. The observed topological distance is 0.2 and the test shows the extremely small p-values of 8.1123 × 10−38 confirming the topological change. The persistent diagrams of both closed and open states are almost identical in smaller birth and death values below 6 Å (angstrom) (Figure 4-top). The major difference is in the scatter points with larger brith and death values. The lattice path method confirms that the local topological structures are almost identical while the global topological structures are different.The domain B of 6VXX is also compared against the domain B of feline coronavirus 6JX7 consisting of 9768 atoms. Since 6JX7 is not from human, it is expected that they are different. The topological distance is 0.9194 and p-values is 0.00 × 10−38 confirming that the topological nature of spikes are different. This shows the biggest difference among all the comparisons done in this study.
Conclusions
In this paper, we proposed a new representation of persistent diagrams using the Dyck and lattice paths. The novel representation enables us to perform the statistical inference combinatorially through the proposed lattice path method by enumerating every possible valid lattice paths analytically. The lattice path method is subsequently used to analyze the coronavirus spike proteins. The normalized step function ϕ(t)/q for all the spike proteins show fairly stable consistent global monotone pattern but with localized differences. We demonstrated the lattice path method has the ability to statistically discriminate between the conformational changes of the spike protein that is needed in the transmission of the virus. We hope that the our new representation enables scientists in their effort to automatically identify the different types and states of coronaviruses in a more principled manner.
Authors: Moo K Chung; Jamie L Hanson; Jieping Ye; Richard J Davidson; Seth D Pollak Journal: IEEE Trans Med Imaging Date: 2015-03-24 Impact factor: 10.048
Authors: Alexandra C Walls; Young-Jun Park; M Alejandra Tortorici; Abigail Wall; Andrew T McGuire; David Veesler Journal: Cell Date: 2020-03-09 Impact factor: 41.582