This repository contains the source code to reproduce the results and figures presented in the paper "ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale".
This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
-
Create an Amazon EC2 c5.metal instance with Ubuntu Server 24.04 LTS and 20GiB of storage.
-
Clone the repository including submodules:
git clone https://github.com/dynatrace-research/exaloglog-paper.git && cd exaloglog-paper && git submodule init && git submodule update
-
Install all required packages:
sudo apt update && sudo apt --yes install openjdk-21-jdk python-is-python3 python3-pip texlive texlive-latex-extra texlive-fonts-extra texlive-science && pip install -r python/requirements.txt --break-system-packages
-
To reproduce the estimation error results
results/error/*.csv
run thesimulateEstimationErrors
task (takes ~35min):./gradlew simulateEstimationErrors
-
To reproduce the empirically determined memory-variance product (MVP) values based on the actual allocated memory and the serialization size of different data structure implementations for approximate distinct counting run the
runEmpiricalMVPComputation
task (takes ~1h10min):./gradlew runEmpiricalMVPComputation
The results can be found in the
results\comparison-empirical-mvp
folder. -
To reproduce the performance benchmark results in the folder
results/benchmarks
disable Turbo Boost (set P-state to 1), run therunBenchmarks
task (takes ~8h15min), and enable Turbo Boost again:sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"; ./gradlew runBenchmarks; sudo sh -c "echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo"
-
To calculate theoretical MVP constants as well as constants used in the Java implementation run the
calculateConstants
task (takes ~9min, not needed for the figures):./gradlew calculateConstants
The output can then be found in the
results/constants
folder. -
To (re-)generate all figures in the
paper
directory execute thepdfFigures
task (takes ~1m30s):./gradlew pdfFigures
-
The unit tests are executed by the
test
task (takes ~13min, not needed for the figures):./gradlew test