ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale

This repository contains the source code to reproduce the results and figures presented in the paper "ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale", which was accepted at EDBT 2025.

Abstract

This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
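The properties listed in the abstract can be illustrated with a toy sketch. The following is a minimal illustration only: it uses a plain HyperLogLog-style register array that keeps a per-register maximum, not the ExaLogLog register layout, and the names `ToySketch` and `build` are hypothetical and do not come from this repository.

```python
import hashlib


class ToySketch:
    """HyperLogLog-style register array (assumption: NOT the ExaLogLog layout)."""

    def __init__(self, p=4):
        self.p = p                       # the sketch has 2**p registers
        self.registers = [0] * (1 << p)

    def add(self, item):
        # Derive a 64-bit hash value from the item.
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                 # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # 1-based position of the first 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        # Keeping only the maximum makes inserts order-independent and idempotent.
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Element-wise maximum yields the sketch of the union of both inputs.
        merged = ToySketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


def build(items, p=4):
    sketch = ToySketch(p)
    for item in items:
        sketch.add(item)
    return sketch


# Commutative and idempotent: order and duplicates do not change the state.
same_state = build(["a", "b", "c"]).registers == build(["c", "a", "b", "a", "c"]).registers
# Mergeable: merging partial sketches equals sketching the union directly.
merge_ok = build(["a", "b"]).merge(build(["c", "d"])).registers == build(["a", "b", "c", "d"]).registers
print(same_state, merge_ok)  # True True
```

Because the state depends only on per-register maxima, both checks hold for any input, which is the reason such sketches can be built in parallel and combined afterwards.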

Steps to reproduce the results and figures presented in the paper

Create an Amazon EC2 c5.metal instance with Ubuntu Server 24.04 LTS and 20GiB of storage.

Clone the repository including submodules:

git clone https://github.com/dynatrace-research/exaloglog-paper.git && cd exaloglog-paper && git submodule init && git submodule update

Install all required packages:

sudo apt update && sudo apt --yes install openjdk-21-jdk python-is-python3 python3-pip texlive texlive-latex-extra texlive-fonts-extra texlive-science black && pip install -r python/requirements.txt --break-system-packages

To reproduce the estimation error results in the folder results/error, run the simulateEstimationErrors task (takes ~35min):
```
./gradlew simulateEstimationErrors
```
To reproduce the empirical memory-variance product (MVP) values, which are computed from both the allocated memory and the serialized size of different approximate distinct counting implementations, run the runEmpiricalMVPComputation task (takes ~1h):
```
./gradlew runEmpiricalMVPComputation
```
The results can be found in the results/comparison-empirical-mvp folder.
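As a rough guide to reading MVP values: the MVP is the memory footprint in bits multiplied by the relative variance of the estimate, so a lower MVP means less memory for the same error. The sketch below works through that arithmetic. Both constants are assumptions made for illustration: `HLL_MVP` uses the textbook HyperLogLog figures (6-bit registers, standard error of about 1.04/sqrt(m)), and `ELL_MVP` is simply derived from the 43% space saving stated in the abstract rather than taken from the paper's tables. The function names are hypothetical.

```python
import math


def bits_for_error(mvp, rel_std_error):
    """Memory in bits needed to reach a relative standard error, given an MVP."""
    return mvp / rel_std_error ** 2


def error_for_bits(mvp, bits):
    """Relative standard error reached with a given memory budget in bits."""
    return math.sqrt(mvp / bits)


# Assumption: textbook HyperLogLog, 6 bits per register, error ~ 1.04 / sqrt(m).
HLL_MVP = 6 * 1.04 ** 2
# Assumption: derived from the "43% less space" claim, not an exact paper constant.
ELL_MVP = HLL_MVP * (1 - 0.43)

target_error = 0.01  # 1% relative standard error
hll_bytes = bits_for_error(HLL_MVP, target_error) / 8
ell_bytes = bits_for_error(ELL_MVP, target_error) / 8
print(f"HyperLogLog: {hll_bytes:.0f} bytes, ExaLogLog (assumed MVP): {ell_bytes:.0f} bytes")
```

Under these assumptions, a 1% error target needs roughly 8.1 kB for HyperLogLog and roughly 4.6 kB for a structure with 43% lower MVP. The empirical values produced by the task above can differ, since they also account for implementation overhead.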
To reproduce the performance benchmark results in the folder results/benchmarks, disable Turbo Boost (write 1 to the intel_pstate no_turbo setting), run the runBenchmarks task (takes ~8h), and enable Turbo Boost again:
```
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"; ./gradlew runBenchmarks; sudo sh -c "echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
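Since the benchmark run takes hours, it can be useful to guarantee that the Turbo Boost setting is restored even if the run is interrupted or fails. The following is one possible wrapper, not part of this repository: the names `turbo_boost_disabled` and `run_benchmarks` are hypothetical, and the script must run as root to write to the sysfs file.

```python
import contextlib
import pathlib
import subprocess

# Same sysfs file as in the command above (Intel P-state driver only).
NO_TURBO = pathlib.Path("/sys/devices/system/cpu/intel_pstate/no_turbo")


@contextlib.contextmanager
def turbo_boost_disabled(path=NO_TURBO):
    """Write 1 to the no_turbo file and restore the previous value on exit."""
    previous = path.read_text().strip()
    path.write_text("1")
    try:
        yield
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl+C.
        path.write_text(previous)


def run_benchmarks():
    """Run the Gradle benchmark task with Turbo Boost disabled."""
    with turbo_boost_disabled():
        subprocess.run(["./gradlew", "runBenchmarks"], check=True)
```

Calling `run_benchmarks()` from the repository root would then replace the one-line command. Unlike the one-liner, this restores whatever value was set before instead of always writing 0.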
To calculate theoretical MVP constants as well as constants used in the Java implementation, run the calculateConstants task (takes ~9min, not needed for the figures):
```
./gradlew calculateConstants
```
The output can then be found in the results/constants folder.
To (re-)generate all figures in the paper directory, execute the pdfFigures task (takes ~90s):
```
./gradlew pdfFigures
```
The unit tests are executed by the test task (takes ~13min, not needed for the figures):
```
./gradlew test
```
