This repository contains our implementation of state-of-the-art methods and GXBits for estimating set difference cardinalities. In comparison with existing methods such as Odd Sketch, Tug-of-War Sketch, and HyperLogLog Sketch, our proposed GXBits is more accurate and memory-efficient, which builds a bit array for each set to embed its corresponding elements and estimates the set difference cardinality between two sets based on the joint of their sketches. Each bit in the array is updated following the geometric distribution and the update probability varies across different bits. Afterwards, given a random initial value, our GXBits utilizes the Newton-Raphson methods to further optimize the estimated set difference cardinalities.
The detailed process of generating synthetic datasets is included in loader.py, which can be used to generate two kinds of synthetic datasets:
- Balanced Synthetic Dataset: the cardinalities of any two sets are equal to each other.
- Skewed Synthetic Dataset: the cardinalities of any two sets are difference from each other.
In addition to synthetic datasets, our GXBits can also be used for real-world datasets. In our experiments, four online social network datasets are used to evaluate the performance:
Dataset | #Sets | Max-Cardinality | Total-Cardinality |
---|---|---|---|
Youtube | 570,774 | 28,644 | 4,945,382 |
Flickr | 1,441,431 | 26,185 | 22,613,980 |
Orkut | 2,997,376 | 31,949 | 223,534,301 |
LiveJournal | 4,590,650 | 9,186 | 76,937,805 |
Method | Data Structure | Shortcomings | Reference |
---|---|---|---|
Odd Sketch | bit array | limited estimation range | odd.py |
Tug-of-War Sketch | counter array | high computational costs and memory usage | tow.py |
HyperLogLog Sketch | counter array | high computational costs and memory usage | hll.py |
GXBits | bit array with geometric distribution | / | gxbits.py |
- Python >= 3.9
- mmh3 >= 3.0.0 (use the command "pip install mmh3")