C++ implementation of Sonathalia & Gilbert "Tree! I am no Tree! I am a Low Dimensional Hyperbolic Embedding," 1 with a thin python 3 wrapper.
Treerep takes an N-point metric (matrix of pairwise distances) and computes a weighted tree that approximates it. You can use this tree directly, or embed it in hyperbolic space2. If the original metric is hyperbolic (roughly "tree-like") then Treerep produces an embedding with low distortion. Recent research3 4 5 shows hyperbolic embeddings outperform Euclidean embeddings for many types of hierarchical data.
The C++11 source can be used with no external dependencies. The python API requires pybind11, which can be downloaded by cloning the submodule
$ git clone --recursive https://github.com/meiji163/treerep.git
then compile with cmake
$ cd treerep && mkdir build && cd build
$ cmake ..
$ make
$ make install #if you want to install to your python lib
To save space, the N-point metric is represented as a length N*(N-1)/2 array (see scipy pdist). Input can be a python list or numpy array.
Here is Sarich's "immunological distance" example from Sonthalia & Gilbert's paper.
from trep import *
from trep_utils import load_mat
metric, N = load_mat("../data/immune.mtx")
print(metric)
# array([ 32., 48., 51., 50., 48., 98., 148., 26., 34., 29., 33.,
# 84., 136., 42., 44., 44., 92., 152., 44., 38., 86., 142.,
# 42., 89., 142., 90., 142., 148.])
tree, weights = treerep(metric,N)
print(weights)
# {(0, 10): 24.0, (1, 10): 8.0, (2, 11): 19.25, (3, 9): 18.25, (4, 12): 21.25,
# (5, 8): 17.5, (6, 9): 67.75, (7, 13): 121.9375, (8, 9): 2.25, (8, 13): 1.0625,
# (10, 11): 1.75, (11, 12): 2.0, (12, 13): 1.6875}
the tree looks like this (edges not to scale)
Treerep is a randomized algorithm, so you can produce multiple trees and choose the best.
- multithread sorting
- efficient all pairs shortest path for trees
- distortion/ MAP