Graph datasets for vertex classification tasks, including modified versions with synthetic noise edges.
Several datasets collected from different sources are included in this project. Each dataset is assigned a prefix. The folder structure and file names then follow these rules:
.
├── prefix
│ ├── prefix.content
│ ├── prefix.feature
│ ├── prefix.cites
│ ├── prefix.label
│ ├── prefix.meta
│
│ ├── prefix.cites.add10
│ ├── prefix.cites.add5
│ ├── prefix.cites.reduce10
│ ├── prefix.cites.reduce5
│
│ ├── README
│
└── ...
- prefix.content: original data content in tab-separated format, one document per row; the first element is the document ID, the last element is its class label, and the elements in between form a binary (0/1) word vector indicating the absence/presence of each word in the document
- prefix.cites: links in tab-separated format, one edge per row; the two tab-separated elements are the two vertices connected by the link
- prefix.meta: dictionary of all possible class labels in this dataset
- prefix.label (optional): labels parsed from the content file; each row contains the tab-separated entry ID and class label
- prefix.feature (optional): feature matrix serialized as a torch.Tensor
- prefix.cites.add5: link data with an additional 5% of random links
- prefix.cites.add10: link data with an additional 10% of random links
- prefix.cites.reduce5: link data with 5% of the original links randomly removed
- prefix.cites.reduce10: link data with 10% of the original links randomly removed
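The file formats above can be read with a few lines of plain Python. The following is a minimal sketch, not part of the original project: the function names `parse_content`, `parse_cites`, and `build_adjacency` are hypothetical, the word-vector entries are assumed to be integers, and edges are treated as undirected, which is an assumption rather than something the files state.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_content(lines: Iterable[str]) -> Tuple[List[str], List[List[int]], List[str]]:
    """Parse rows shaped like: ID <tab> w_1 <tab> ... <tab> w_n <tab> label."""
    ids, features, labels = [], [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        ids.append(fields[0])                             # first element: document ID
        features.append([int(v) for v in fields[1:-1]])   # assumed 0/1 integers
        labels.append(fields[-1])                         # last element: class label
    return ids, features, labels


def parse_cites(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse rows shaped like: vertex <tab> vertex; malformed rows are skipped."""
    edges = []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue
        edges.append((fields[0], fields[1]))
    return edges


def build_adjacency(ids: Iterable[str], edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an adjacency map, treating every edge as undirected (assumption)."""
    adjacency: Dict[str, Set[str]] = {i: set() for i in ids}
    for u, v in edges:
        # Ignore edges that mention a vertex missing from the content file.
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


# Tiny in-memory example in the same layout as prefix.content / prefix.cites.
content_rows = ["d1\t1\t0\t1\tTheory\n", "d2\t0\t1\t0\tCase Based\n"]
cites_rows = ["d1\td2\n"]

doc_ids, doc_features, doc_labels = parse_content(content_rows)
graph = build_adjacency(doc_ids, parse_cites(cites_rows))
```

Because all three functions accept any iterable of lines, an open file handle can be passed directly, for example one opened on `prefix/prefix.content` or on a noise variant such as `prefix/prefix.cites.add5`, which shares the format of `prefix.cites`.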
Cora is a citation dataset consisting of 2708 vertices (scientific publications) and 5429 edges (citations). All publications are classified into 7 classes (research topics):
* Rule Learning
* Genetic Algorithms
* Reinforcement Learning
* Neural Networks
* Probabilistic Methods
* Case Based
* Theory
Citeseer is a citation dataset consisting of 3312 vertices (scientific publications) and 4723 edges (citations). All publications are classified into 6 classes:
* Agents
* AI
* DB
* IR
* ML
* HCI
WebKB is a website dataset collected from the computer science departments of four universities. It contains 877 vertices (web pages) and 1608 edges (hyperlinks). All web pages are classified into 5 classes:
* faculty
* students
* project
* course
* other
Please cite the related publications if you use these datasets in your work.
@inproceedings{xu2017attentive,
title={Attentive graph-based recursive neural network for collective vertex classification},
author={Xu, Qiongkai and Wang, Qing and Xu, Chenchen and Qu, Lizhen},
booktitle={Proceedings of the 2017 ACM on Conference on Information and Knowledge Management},
pages={2403--2406},
year={2017}
}
@article{mccallum2000automating,
title={Automating the construction of internet portals with machine learning},
author={McCallum, Andrew Kachites and Nigam, Kamal and Rennie, Jason and Seymore, Kristie},
journal={Information Retrieval},
volume={3},
number={2},
pages={127--163},
year={2000},
publisher={Springer}
}
@inproceedings{giles1998citeseer,
title={CiteSeer: An automatic citation indexing system},
author={Giles, C Lee and Bollacker, Kurt D and Lawrence, Steve},
booktitle={Proceedings of the third ACM conference on Digital libraries},
pages={89--98},
year={1998},
organization={ACM}
}
@inproceedings{Craven:1998:LES:295240.295725,
author = {Craven, Mark and DiPasquo, Dan and Freitag, Dayne and McCallum, Andrew and Mitchell, Tom and Nigam, Kamal and Slattery, Se\'{a}n},
title = {Learning to Extract Symbolic Knowledge from the World Wide Web},
booktitle = {Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence},
series = {AAAI '98/IAAI '98},
year = {1998},
location = {Madison, Wisconsin, USA},
pages = {509--516},
numpages = {8},
publisher = {American Association for Artificial Intelligence},
address = {Menlo Park, CA, USA},
}