Caleb Belth, Xinyi Zheng, and Danai Koutra. Mining Persistent Activity in Continually Evolving Networks. Knowledge Discovery and Data Mining (KDD), August 2020.
If used, please cite:
@inproceedings{belth2020mining,
title={Mining Persistent Activity in Continually Evolving Networks},
author={Belth, Caleb and Zheng, Xinyi and Koutra, Danai},
booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
pages={934--944},
year={2020}
}
$ git clone git@github.com:GemsLab/PENminer.git
$ cd PENminer/test
$ python tester.py
$ cd ../src
Python 3
numpy
scipy
rrcf
eu_email.txt: EU Email network
columbus_bike.txt: Columbus Bike network
reddit.txt: Reddit network
darpa_ip.txt: DARPA IP network (zipped in darpa_ip.zip)
darpa_ip_without_labels: DARPA IP network with attack edges not marked for fair anomaly detection (zipped in darpa_ip.zip)
The other datasets used in the paper were too big to share via GitHub. Alternatives are being considered.
Each row (edge update) has the format: {1/-1},{u},{v},{w},{u_label},{v_label},{edge_label},...,{timestamp}
Here 1 or -1 specifies insert or delete, u is the id of the first node, and v is the id of the second. w is a weight (1 if unweighted), u_label and v_label are the nodes' labels (ignored if view != label), and edge_label is the edge's label (if the edges are unlabeled, any value works, as long as it is the same for all edges). The ... means that other information can be stored (e.g., a string version of the timestamp or some helpful description) that will be ignored by the code. timestamp is an integer timestamp in seconds. Edge updates (rows) are assumed to be sorted by timestamp.
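The row format above can be read with a few lines of Python. This is a minimal sketch, not the parser used in src/: the names EdgeUpdate, parse_update, and parse_stream are hypothetical, and it assumes that fields are comma-separated, that no field contains a comma, and that node ids can be kept as strings.

```python
from typing import Iterable, List, NamedTuple


class EdgeUpdate(NamedTuple):
    """One row of a stream file (hypothetical container, not from the repo)."""
    op: int           # 1 = insert, -1 = delete
    u: str            # id of the first node
    v: str            # id of the second node
    w: float          # weight (1 if unweighted)
    u_label: str
    v_label: str
    edge_label: str
    timestamp: int    # integer seconds


def parse_update(row: str) -> EdgeUpdate:
    fields = row.strip().split(",")
    # Seven leading fields plus the trailing timestamp; anything between
    # edge_label and the timestamp is extra information and is skipped.
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    op = int(fields[0])
    if op not in (1, -1):
        raise ValueError(f"update type must be 1 or -1, got {op}")
    return EdgeUpdate(op, fields[1], fields[2], float(fields[3]),
                      fields[4], fields[5], fields[6], int(fields[-1]))


def parse_stream(rows: Iterable[str]) -> List[EdgeUpdate]:
    updates = [parse_update(r) for r in rows if r.strip()]
    # Rows are assumed to be sorted by timestamp; fail early if they are not.
    for prev, cur in zip(updates, updates[1:]):
        if cur.timestamp < prev.timestamp:
            raise ValueError("edge updates are not sorted by timestamp")
    return updates
```

Running parse_stream over the lines of a stream file before handing it to PENminer is a cheap way to catch a malformed row or an unsorted file.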
For the reddit dataset, with k_max = 1, delta_max = 1, alpha = 1, beta = 0.2, gamma = 5.0, and view = 'id':
sPENminer:
python main.py -s reddit --view id -ms 1 -ws 1 --alpha 1.0 --beta 0.2 --gamma 5.0
oPENminer:
python main.py -s reddit --view id -ms 1 -ws 1 --alpha 1.0 --beta 0.2 --gamma 5.0 -o True
--stream / -s (Required)
Expects {stream}.txt to be in the data/ directory, in the format described above.
--verbose / -v True/False (Optional; Default = True)
Whether or not to print logs while running.
--window_size / -ws [1, infinity) (Optional; Default = 1)
The window size in seconds (integer) (equivalently the maximum snippet duration delta_max).
--max_size / -ms {1, 2, 3} (Optional; Default = 1)
The maximum snippet size (k_max). Only implemented for k_max in {1, 2, 3}.
--view / -v {id, label, order}
The view of the snippet to use.
--alpha / -alpha (0, infinity) (Optional; Default = 1)
The exponent for W(.).
--beta / -beta (0, infinity) (Optional; Default = 1)
The exponent for F(.).
--gamma / -gamma (0, infinity) (Optional; Default = 1)
The exponent for S(.).
--offline / -o (Optional; Default = False)
Whether to use oPENminer (if True) or sPENminer (if False).
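When running many parameter settings, it can help to assemble the command line programmatically. The sketch below is an illustration only: build_command is a hypothetical helper, not part of the repository. It uses only the long option names listed above, checks the documented value ranges, and passes the offline flag with an explicit True, as in the oPENminer example command.

```python
import shlex
from typing import List


def build_command(stream: str, view: str = "id", max_size: int = 1,
                  window_size: int = 1, alpha: float = 1.0,
                  beta: float = 1.0, gamma: float = 1.0,
                  offline: bool = False) -> List[str]:
    """Build an argument list for main.py (hypothetical helper)."""
    if view not in ("id", "label", "order"):
        raise ValueError("view must be one of: id, label, order")
    if max_size not in (1, 2, 3):
        raise ValueError("max_size (k_max) is only implemented for 1, 2, 3")
    if window_size < 1:
        raise ValueError("window_size must be an integer >= 1")
    if min(alpha, beta, gamma) <= 0:
        raise ValueError("alpha, beta, and gamma must be positive")
    args = ["python", "main.py",
            "--stream", stream,
            "--view", view,
            "--max_size", str(max_size),
            "--window_size", str(window_size),
            "--alpha", str(alpha),
            "--beta", str(beta),
            "--gamma", str(gamma)]
    if offline:
        # Offline mode selects oPENminer, as in the example command.
        args += ["--offline", "True"]
    return args


if __name__ == "__main__":
    # Print the command for the reddit example; run it from src/.
    print(shlex.join(build_command("reddit", beta=0.2, gamma=5.0)))
```

The returned list can be passed to subprocess.run with cwd set to the src/ directory.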
If your network has no deletions, weights, or labels, you can still use PENminer by setting the update type to 1 (insert) for all updates and setting the weight and node labels arbitrarily (PENminer ignores them as long as you don't use view = label). Make sure the edge label is the same across all updates; its value doesn't matter.
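As a sketch of that conversion, the snippet below turns a plain list of timestamped edges into rows in the PENminer format. The function name to_penminer_rows, the placeholder node label "_", and the default edge label "e" are arbitrary choices made for this illustration; the input is assumed to be (u, v, timestamp) tuples with timestamps in seconds.

```python
from typing import Iterable, List, Tuple


def to_penminer_rows(edges: Iterable[Tuple[str, str, int]],
                     edge_label: str = "e") -> List[str]:
    """Convert (u, v, timestamp) tuples into PENminer rows (hypothetical helper)."""
    rows = []
    # Rows must be sorted by timestamp, so sort before formatting.
    for u, v, t in sorted(edges, key=lambda e: e[2]):
        # Update type 1 (insert), weight 1, placeholder node labels,
        # and the same edge label on every row.
        rows.append(f"1,{u},{v},1,_,_,{edge_label},{int(t)}")
    return rows


if __name__ == "__main__":
    for row in to_penminer_rows([("b", "c", 7), ("a", "b", 3)]):
        print(row)
```

Writing the returned rows, one per line, to data/{stream}.txt makes the stream available to the --stream option; run it with a view other than label so the placeholder node labels are ignored.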
Contact Caleb Belth with comments or questions: cbelth@umich.edu