Topological Data Analysis

Mathematics is not just a language for equations; it is also a language for shape. In many modern data sets, whether from biology, climate science, or social networks, important information is hidden in the way the data is arranged, connected, or clustered. Topological Data Analysis (TDA) is a growing field that applies ideas from algebraic topology to study the “shape of data.” We do so by building combinatorial objects (such as simplicial complexes) from point clouds and examining how their features (connected components, loops, voids) persist across scales. TDA produces summaries of the lifespans of these features, capturing global structure in the data. These summaries, called persistence diagrams or barcodes, can then be used for classification, visualization, and other statistical and machine learning tasks.

One way to do this is to build a filtration of simplicial complexes whose vertices lie on a given data set. In one-parameter persistence, we typically create this filtration by growing an \(\epsilon\)-ball around each data point and connecting two vertices when their \(\epsilon\)-balls intersect. This induces a filtered simplicial complex (the Vietoris-Rips filtration). We then choose a dimension of features we’d like to track over this filtration: dimension 0 features are connected components, dimension 1 features are loops or holes, and dimension 2 features are voids. Higher-dimensional voids may be tracked as well, although they are harder to visualize. Applying homology in the chosen degree to the filtered simplicial complex yields a persistence module: a linearly ordered family of vector spaces with linear maps between them; equivalently, a functor from the category \((\mathbb{R}, \leq )\) to \(Vec\). The structure theorem ensures that these persistence modules decompose into direct sums of interval modules (persistence modules which are 1-dimensional on a fixed interval, and \(0\) elsewhere). Each such interval records the birth and death of a feature in the filtration. For example, in homology dimension 1, an interval module of the form \([0.2, 0.7)\) is interpreted as: “There is a loop which exists when \(\epsilon = 0.2\) but not before. This loop exists whenever \(0.2 \leq \epsilon < 0.7\), but no longer exists when \(\epsilon \geq 0.7\).” The collection of intervals that make up the decomposition of a persistence module is called the persistence barcode.
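In degree 0 this story can be made concrete without any TDA library: a connected component of the Vietoris-Rips filtration is born at \(\epsilon = 0\) and dies exactly when the edge merging it into another component appears, so the degree-0 barcode can be computed with a Kruskal-style union-find over the sorted pairwise distances. A minimal sketch (the function name `h0_barcode` is our own, not a library API):

```python
from itertools import combinations
import math

def h0_barcode(points):
    """Degree-0 persistence barcode of the Vietoris-Rips filtration
    on a point cloud, via Kruskal-style union-find.  Every point is
    born at epsilon = 0; a component dies when the edge merging it
    into another component enters the filtration."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Find the root of x's component, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each edge enters the Rips complex at epsilon = its length,
    # so process edges in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Two components merge: one bar [0, eps) dies here.
            parent[ri] = rj
            bars.append((0.0, eps))
    # One component survives to the end of the filtration.
    bars.append((0.0, math.inf))
    return bars
```

On two well-separated pairs of points, this produces two short bars (the within-pair merges), one long bar dying when the pairs connect, and one infinite bar, matching the interval-module picture above.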

This mapping from data sets to barcodes is stable with respect to perturbation of the data: small changes in the data set yield small changes in the barcode. However, the mapping is highly sensitive to outliers. For example, an arrangement of data in the shape of a large circle has one major feature in homology degree 1, and the death time of this feature is drastically reduced if we add a single point at the center of the circle to our data. To address this, many researchers study multiparameter persistence. In this setting, the simplicial complex is filtered not only by distance between points, but also by some other parameter(s), such as density of points. In general, a d-filtered simplicial complex yields a d-parameter persistence module: a functor from \((\mathbb{R}^d, \leq )\) into the category \(Vec\).
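One common choice of second parameter is a density proxy. The sketch below (our own illustration, not a library API; the name `bifiltration_grades` and the parameter `k` are assumptions) assigns each vertex and edge of a point cloud a bigrade (codensity, diameter) for a density-Rips-style bifiltration: a simplex is present at all parameter values at or above its bigrade. Codensity, here the distance to the k-th nearest neighbour, is small in dense regions, so points in dense regions enter first along the density axis and isolated outliers enter late.

```python
from itertools import combinations
import math

def bifiltration_grades(points, k=2):
    """Assign bigrades (codensity, diameter) to the vertices and
    edges of a point cloud, as the entry grades of a density-Rips
    bifiltration.  Codensity of a point = distance to its k-th
    nearest neighbour; an edge inherits the worse (larger) codensity
    of its endpoints and enters the Rips direction at its length."""
    n = len(points)
    codensity = []
    for i in range(n):
        dists = sorted(
            math.dist(points[i], points[j]) for j in range(n) if j != i
        )
        codensity.append(dists[k - 1])

    grades = {}
    for i in range(n):
        # Vertices have diameter 0: they appear as soon as the
        # density parameter reaches their codensity.
        grades[(i,)] = (codensity[i], 0.0)
    for i, j in combinations(range(n), 2):
        grades[(i, j)] = (
            max(codensity[i], codensity[j]),
            math.dist(points[i], points[j]),
        )
    return grades
```

An outlier far from the rest of the data receives a large codensity, so along the density axis it (and every edge touching it) appears only at coarse density scales, which is precisely how the bifiltration mitigates the outlier sensitivity described above.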

Unlike in the one-parameter setting, there is no structure theorem for multiparameter persistence modules guaranteeing decomposition into interval modules. An active area of research is the creation of suitable variations of barcodes for the multiparameter setting. One such variation is the signed rank barcode, as defined by Botnan, Oppermann, and Oudot (2024). These can take the form of signed rectangle modules in \(\mathbb{R}^d\).