Cristian Boldrin and Fabio Vandin, "Fast and Accurate Triangle Counting in Graph Streams Using Predictions", appeared at ICDM 2024.
Link for Arxiv Extended Version of the paper
Here are the instructions for running Tonic algorithm: Tiangles cOuntiNg with predICtions, for both insertion-only and fully-dynamic graph streams. Code is deployed in C++ 17 under gcc 9.4.0 compiler. Additionally, CMake 3.16+ is required.
We provide some examples of datasets and oracles inside the respective folders, so that you can skip directly to step (4) after having compiled the sources.
-
Compile the code
bash compile.sh
The binaries will be generated inside thebuild
folder. -
Preprocess the raw dataset
./build/DataPreprocessing <dataset_path> <delimiter> <skip> <output_path>
where dataset_path is the filepath to the dataset to be preprocessed, delimiter is the character used to separate the rows in the dataset, skip is the number of lines to skip before starting to read the dataset, and output_path is the path where the preprocessed dataset will be saved. -
Build the Oracle
./build/BuildOracle <preprocessed_dataset_path> <type of the oracle = {Exact, noWR, Node}> <percentage_retain> <output_path> [wr_size]
where preprocessed_dataset_path is the path to the preprocessed dataset at point (2), type of the oracle is the type of oracle to be built (Exact, noWR, Node), percentage_retain is the fraction of top heavies edges/nodes to be retained in the oracle, output_path is the path where the oracle will be saved, and wr_size is the size of the waiting room for excluding the counts (only for the noWR oracle). -
Run Tonic Algorithm:
./build/Tonic Tonic <flag: 0: insertion-only stream, 1: fully-dynamic stream> <random_seed> <memory_budget> <alpha> <beta> <dataset_path> <oracle_path> <oracle_type = [nodes, edges]> <output_path>
where flag is the type of the input stream (0 for insertion-only, 1 for fully-dynamic), random_seed is the seed for the random number generator, memory_budget is the memory budget for the algorithm, alpha and beta are the parameters for fraction of size of WR and H, preprocessed_dataset_path is the path to the preprocessed dataset at point (2), oracle_path is the path to the oracle at point (3), oracle_type is the type of oracle used (nodes or edges), and output_path is the path where the output will be saved.
Here are the links to the datasets we used to perform the experiments.
Dataset | Nodes | Edges | Triangles |
---|---|---|---|
Edit EN Wikibooks | 133k | 386k | 178k |
SOC YouTube Growth | 3.2M | 9.3M | 12.3M |
Cit US Patents | 3.7M | 16.5M | 7.5M |
Actors Collaborations | 382k | 15M | 346M |
Stackoverflow | 2.5M | 28M | 114M |
SOC Live Journals | 4.8M | 42.8M | 285.7M |
Twitter-merged | 41M | 1.5B | 34.6B |
This kind of datasets consists in a collection of consecutive graphs registered each in a given time of interval, meaning that the topology betweeen each graph is only slightly varying in terms of nodes and edges (and, consequently triangles).
Dataset | Number of Graphs | Description | Max Nodes | Max Edges | Max Triangles |
---|---|---|---|---|---|
Autonomous systems - Oregon-1 | 9 | 9 graphs of Autonomous Systems (AS) peering information inferred from Oregon route-views between March 31 2001 and May 26 2001 | 11k | 23k | 19k |
CAIDA AS Relationships Datasets | 122 | CAIDA AS graphs, derived from a set of RouteViews BGP table snapshots, from January 2004 to November 2007 | 26k | 106k | 36k |
Autonomous systems AS-733 | 733 | 733 Daily instances which span an interval of 785 days from November 8 1997 to January 2 2000, from the BGP logs | 6k | 13k | 6k |
Twitter snapshots | 4 | 4 graphs of the Twitter following/followers network | 29.9M | 373M | 4.4B |
To reproduce experiments and results of the paper, please refer to the scripts
folder.