Nina Otter, Mason A Porter, Ulrike Tillmann, Peter Grindrod, Heather A Harrington.
Abstract
Persistent homology (PH) is a method used in topological data analysis (TDA) to study qualitative features of data that persist across multiple scales. It is robust to perturbations of input data, independent of dimensions and coordinates, and provides a compact representation of the qualitative features of the input. The computation of PH is an open area with numerous important and fascinating challenges. The field of PH computation is evolving rapidly, and new algorithms and software implementations are being updated and released at a rapid pace. The purposes of our article are to (1) introduce theory and computational methods for PH to a broad range of computational scientists and (2) provide benchmarks of state-of-the-art implementations for the computation of PH. We give a friendly introduction to PH, navigate the pipeline for the computation of PH with an eye towards applications, and use a range of synthetic and real-world data sets to evaluate currently available open-source implementations for the computation of PH. Based on our benchmarking, we indicate which algorithms and implementations are best suited to different types of data sets. In an accompanying tutorial, we provide guidelines for the computation of PH. We make publicly available all scripts that we wrote for the tutorial, and we make available the processed version of the data sets used in the benchmarking. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1140/epjds/s13688-017-0109-5) contains supplementary material.
Keywords: networks; persistent homology; point-cloud data; topological data analysis
Year: 2017 PMID: 32025466 PMCID: PMC6979512 DOI: 10.1140/epjds/s13688-017-0109-5
Source DB: PubMed Journal: EPJ Data Sci ISSN: 2193-1127 Impact factor: 3.184
Figure 1. Example of persistent homology for a point cloud. (a) A finite set of points and a nested sequence of spaces obtained from it. (b) Barcode for the nested sequence of spaces illustrated in (a). Solid lines represent the lifetime of components, and dashed lines represent the lifetime of holes.
Figure 2. Example of persistent homology for a gray-scale digital image. (a) A gray-scale image, (b) the matrix of gray values, (c) the filtered cubical complex associated to the digital image, and (d) the barcode for the nested sequence of spaces in panel (c). A solid line represents the lifetime of a component, and a dashed line represents the lifetime of a hole.
Figure 3. A simple example. (a) A simplicial complex, (b) a map of simplicial complexes, and (c) a geometric realization of the simplicial complex in (a).
Figure 4. Examples to illustrate simplicial homology. (a) Computation of simplicial homology for the simplicial complex in Figure 3(a) and (b) induced map in 0th homology for the map of simplicial complexes in Figure 3(b).
Overview of existing software packages for the computation of PH that have an accompanying peer-reviewed publication (and also [68], because of its performance)
| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| (a) Language | Java | C++ | Java | C++ | C++ | C++ | C++ | C++ | C++ |
| (b) Algorithms for PH | standard, dual, zigzag | Morse reductions, standard | standard (uses | standard, dual, zigzag | standard, dual, twist, chunk, spectral sequence | twist, dual, distributed | dual, multifield | simplicial map | twist, dual |
| (c) Coefficient field | | | | | | | | | |
| (d) Homology | simplicial, cellular | simplicial, cubical | simplicial | simplicial | simplicial, cubical | simplicial, cubical | simplicial, cubical | simplicial | simplicial |
| (e) Filtrations computed | VR, W, | VR, lower star of cubical complex | WRCF | VR, | - | VR, lower star of cubical complex | VR, | - | VR |
| (f) Filtrations as input | simplicial complex, zigzag, CW | simplicial complex, cubical complex | - | simplicial complex, zigzag | boundary matrix of simplicial complex | boundary matrix of simplicial complex | - | map of simplicial complexes | - |
| (g) Additional features | Computes some homological algebra constructions, homology generators | weighted points for VR | - | vineyards, circle-valued functions, homology generators | - | - | - | - | - |
| (h) Visualization | barcodes | persistence diagram | - | - | - | persistence diagram | - | - | - |
The symbol ‘-’ signifies that the associated feature is not implemented. For each software package, we indicate the following items. (a) The language in which it is implemented. (b) The implemented algorithms for the computation of barcodes from the boundary matrix. (c) The coefficient fields for which PH is computed, where the letter p denotes any prime number in the coefficient field F_p. (d) The type of homology computed. (e) The filtered complexes that are computed, where VR stands for Vietoris–Rips complex, W stands for the weak witness complex, a separate symbol stands for parametrized witness complexes, WRCF stands for the weight rank clique filtration, α stands for the alpha complex, and Č stands for the Čech complex. Perseus, DIPHA, and Gudhi implement the computation of the lower-star filtration [160] of a weighted cubical complex: one inputs data in the form of a d-dimensional array, the data are then interpreted as a d-dimensional cubical complex, and its lower-star filtration is computed. (See the Tutorial in Additional file 2 of the SI for more details.) Note that DIPHA and Gudhi use the efficient representation of cubical complexes presented in [55], so the size of the cubical complex that is computed by these libraries is smaller than the size of the resulting complex with Perseus. (f) The filtered complexes that one can give as input. javaPlex supports the input of a filtered CW complex for the computation of cellular homology [78]; in contrast with simplicial complexes, there do not currently exist algorithms to assign a cell complex to point-cloud data. (g) Additional features implemented by the library. javaPlex supports the computation of some constructions from homological algebra (see [66] for details), and Perseus implements the computation of PH with the VR for points with different ‘birth times’ (see Section 5.1.3). The library Dionysus implements the computation of vineyards [155] and circle-valued functions [127]. Both javaPlex and Dionysus support the output of representatives of homology classes for the intervals in a barcode. (h) Whether visualization of the output is provided.
Figure 5. Example of persistent homology for a finite filtered simplicial complex. (a) We start with a finite filtered simplicial complex. (b) At each filtration step i, we draw as many vertices as the dimension of the homology of the complex at step i, in each of the two degrees shown (left and right columns). We label the vertices by basis elements, the existence of which is guaranteed by the Fundamental Theorem of Persistent Homology, and we draw an edge between two vertices to represent the induced maps in homology, as explained in the main text. We thus obtain a well-defined collection of disjoint half-open intervals called a ‘barcode.’ We interpret each interval in degree p as representing the lifetime of a p-homology class across the filtration. (c) We rewrite the diagram in (b) in the conventional way. We represent classes that are born but do not die at the final filtration step using arrows that start at the birth of that feature and point to the right. (d) An alternative graphical way to represent barcodes (which gives exactly the same information) is to use persistence diagrams, in which an interval [x, y) is represented by the point (x, y) in the extended plane, where the extended real line includes the value ∞. Therefore, a persistence diagram is a finite multiset of points in the extended plane. We use squares to signify the classes that do not die at the final step of a filtration, and the size of dots or squares is directly proportional to the number of points being represented. For technical reasons, which we discuss briefly in Section 5.4, one also adds points on the diagonal to the persistence diagrams. (Each of the points on the diagonal has infinite multiplicity.)
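The correspondence between barcodes and persistence diagrams described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not code from the article or its tutorial: the function name `barcode_to_diagram` and the example intervals are hypothetical, and the diagonal points mentioned in the caption are left out.

```python
import math
from collections import Counter


def barcode_to_diagram(barcode):
    """Map each half-open interval [birth, death) to the point (birth, death).

    The result is a Counter, used here as a multiset: the count stored for a
    point is its multiplicity. Classes that never die have death = math.inf.
    """
    diagram = Counter()
    for birth, death in barcode:
        if not birth < death:
            raise ValueError(f"empty interval: [{birth}, {death})")
        diagram[(birth, death)] += 1
    return diagram


# Hypothetical barcode: one class that never dies, two identical finite
# intervals, and one further finite interval.
example_barcode = [(0, math.inf), (0, 1), (0, 1), (2, 3)]
diagram = barcode_to_diagram(example_barcode)

# Points with infinite death coordinate (drawn as squares in the figure).
essential = {point for point in diagram if math.isinf(point[1])}
```

A multiset is needed rather than a plain set because two intervals with the same endpoints give the same point, which the figure draws as a larger dot.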
Figure 6. PH pipeline.
We summarize several types of complexes that are used for PH
| Complex | Worst-case size | Theoretical guarantee |
|---|---|---|
| Čech | | Nerve theorem |
| Vietoris–Rips (VR) | | Approximates Čech complex |
| Alpha | | Nerve theorem |
| Witness | | For curves and surfaces in Euclidean space |
| Graph-induced complex | | Approximates VR complex |
| Sparsified Čech | | Approximates Čech complex |
| Sparsified VR | | Approximates VR complex |
We indicate the theoretical guarantees and the worst-case sizes of the complexes as functions of the cardinality N of the vertex set. For the witness complexes (see Section 5.2.4), L denotes the set of landmark points, while Q denotes the subsample set for the graph-induced complex (see Section 5.2.5).
Figure 7. Example of PH computation with the standard algorithm (see Algorithm 1).
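For readers without the figure at hand, the standard algorithm can be sketched as a left-to-right column reduction of the boundary matrix. The following is a minimal sketch and not a transcription of Algorithm 1: it assumes coefficients in the field with two elements, stores each column as the set of its nonzero row indices, and uses hypothetical names (`reduce_boundary_matrix`, `triangle`).

```python
def reduce_boundary_matrix(columns):
    """Standard column reduction over the field with two elements.

    columns[j] is the set of row indices of the nonzero entries of column j,
    with simplices indexed in filtration order (faces before cofaces).
    Returns (pairs, unpaired): pairs holds (birth index, death index) tuples,
    and unpaired lists the simplices whose classes never die.
    """
    reduced = [set(col) for col in columns]
    low_to_col = {}  # lowest nonzero row index -> column that owns it
    for j, col in enumerate(reduced):
        # Add earlier columns while another column shares the same lowest entry.
        while col and max(col) in low_to_col:
            col ^= reduced[low_to_col[max(col)]]  # addition modulo 2
        if col:
            low_to_col[max(col)] = j
    pairs = sorted(low_to_col.items())
    paired = {index for pair in pairs for index in pair}
    unpaired = [j for j in range(len(columns)) if j not in paired]
    return pairs, unpaired


# Filled triangle, indexed in filtration order:
# 0: a, 1: b, 2: c, 3: ab, 4: bc, 5: ac, 6: abc
triangle = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
pairs, unpaired = reduce_boundary_matrix(triangle)
```

In this example, the edge `ac` (index 5) reduces to the zero column, so it creates a loop, which the triangle (index 6) then fills. The single unpaired simplex is the vertex `a`, which represents the connected component that never dies.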
Figure 8. Two well-known examples. (a) Plot of the image of the figure-8 immersion of the Klein bottle and (b) the reconstruction of the Stanford Dragon (retrieved from [164]).
Performance of the software packages measured in wall-time (i.e., elapsed time), and CPU seconds (for the computations running on the cluster)
| | | | | | | |
|---|---|---|---|---|---|---|
| Size of complex | 4.4 × 10⁶ | 1.1 × 10⁷ | 2.1 × 10⁸ | 1.3 × 10⁹ | 3.1 × 10⁹ | 4.5 × 10⁸ |
| Max. dim. | 2 | 2 | 2 | 2 | 8 | 2 |
| | 84 | 747 | - | - | - | - |
| | 474 | 1,830 | - | - | - | - |
| | 6 | 90 | 1,631 | 142,559 | - | 9,110 |
| | 543 | 1,978 | - | - | - | - |
| | 513 | 145 | - | - | - | - |
| | 4 | 6 | 81 | 2,358 | 5,096 | 232 |
| | 36 | 89 | 1,798 | 14,368 | - | 4,753 |
| | 1 | 1 | 2 | 6 | 349 | 3 |
For each data set, we indicate the size of the simplicial complex and the maximum dimension up to which we construct the VR complex. For all data sets, we construct the filtered VR complex up to the maximum distance between any two points. We indicate the implementation of the standard algorithm using the abbreviation ‘st’ following the name of the package, and we indicate the implementation of the dual algorithm using the abbreviation ‘d.’ The symbol ‘-’ signifies that we were unable to finish computations for this data set, because the machine ran out of memory. Perseus implements only the standard algorithm, and Gudhi and Ripser implement only the dual algorithm. (a), (b) We run DIPHA on one node and 16 cores for the data sets eleg, Klein, and genome; on 2 nodes of 16 cores for the HIV data set; on 2 and 3 nodes of 16 cores for the dual and standard implementations, respectively, for drag 2; and on 8 nodes of 16 cores for random. (The maximum number of processes that we could use at any one time was 128.) (c) We run DIPHA on a single core.
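To make the two parameters in this note concrete (the maximum dimension and the maximum distance), the following brute-force sketch builds a filtered VR complex from a distance matrix. It is an illustration under stated assumptions, not how any of the benchmarked packages work: the function name `vietoris_rips` and the three-point example are hypothetical, and enumerating all vertex subsets is feasible only for very small inputs.

```python
from itertools import combinations


def vietoris_rips(dist, max_dim, max_scale):
    """Filtered Vietoris-Rips complex from a symmetric distance matrix.

    A simplex enters the filtration at the largest pairwise distance among
    its vertices (0.0 for a single vertex). Returns a list of
    (filtration value, vertex tuple) sorted so that faces precede cofaces.
    """
    n = len(dist)
    simplices = []
    for k in range(1, max_dim + 2):  # k vertices span a simplex of dimension k - 1
        for verts in combinations(range(n), k):
            value = max(
                (dist[i][j] for i, j in combinations(verts, 2)), default=0.0
            )
            if value <= max_scale:
                simplices.append((value, verts))
    # Ties in filtration value are broken by dimension, then lexicographically.
    simplices.sort(key=lambda s: (s[0], len(s[1]), s[1]))
    return simplices


# Three hypothetical points on a line at positions 0, 1, and 2.
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
max_distance = max(max(row) for row in dist)
filtration = vietoris_rips(dist, max_dim=2, max_scale=max_distance)
```

When the maximum scale equals the largest pairwise distance, every subset of at most `max_dim + 1` points is a simplex. For N points and maximum dimension 2, the complex therefore has N + C(N, 2) + C(N, 3) simplices, which is why the complex sizes in the table grow so quickly with the number of points.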
Memory usage in GB for the computations summarized in Table 3
| | | | | | | |
|---|---|---|---|---|---|---|
| Size of complex | 4.4 × 10⁶ | 1.1 × 10⁷ | 2.1 × 10⁸ | 1.3 × 10⁹ | 3.1 × 10⁹ | 4.5 × 10⁸ |
| Max. dim. | 2 | 2 | 2 | 2 | 8 | 2 |
| | <5 | <15 | >64 | >64 | >64 | >64 |
| | 1.3 | 11.6 | - | - | - | - |
| | 0.1 | 0.2 | 2.7 | 4.9 | - | 4.8 |
| | 5.1 | 12.7 | - | - | - | - |
| | 0.5 | 1.1 | - | - | - | - |
| | 0.1 | 0.2 | 1.8 | 13.8 | 9.6 | 6.3 |
| | 0.2 | 0.5 | 8.5 | 62.8 | - | 21.5 |
| | 0.007 | 0.02 | 0.06 | 0.2 | 24.7 | 0.07 |
For javaPlex, we indicate the value of the maximum heap size that was sufficient to perform the computation. The value that we give is an upper bound on the memory usage. For DIPHA, we indicate the maximum memory used by a single core (considering all cores). See Table 3 for details on the number of cores used.
For each software package in (a), we indicate in (b) the maximal size of the simplicial complex supported by it thus far in our tests
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| (b) | 4.5 × 10⁸ | 1.1 × 10⁷ | 2.8 × 10⁹ | 1.3 × 10⁹ | 3.4 × 10⁹ | 1 × 10⁷ | 3.4 × 10⁹ | 3.4 × 10⁹ |
Guidelines for which implementation is best-suited for which data set, based on our benchmarking
| | | |
|---|---|---|
| networks | WRCF | |
| image data | cubical | |
| distance matrix | VR | |
| distance matrix | W | |
| points in Euclidean space | VR | |
| points in Euclidean space | Č | |
| points in Euclidean space | | |
Recall that we indicate the implementation of the dual algorithm using the abbreviation ‘d’ following the name of a package, and similarly we indicate the implementation of the standard algorithm by ‘st’. Note that for smaller data sets one can also use javaPlex to compute PH with VR complexes from points in Euclidean space, and Perseus to compute PH with cubical complexes for image data, and with VR complexes for distance matrices. The library jHoles can only handle networks with density much less than 1.