Noise Graph Dataset

Graph datasets for the vertex classification task, including modified versions with synthetic noise edges.

Dataset format

Several datasets collected from different sources are included in this project. Each dataset is assigned a prefix, which determines the folder structure and file names according to the following rules:

.
├── prefix
│   ├── prefix.content
│   ├── prefix.feature
│   ├── prefix.cites
│   ├── prefix.label
│   ├── prefix.meta
│   
│   ├── prefix.cites.add10
│   ├── prefix.cites.add5
│   ├── prefix.cites.reduce10
│   ├── prefix.cites.reduce5
│   
│   ├── README
│   
└── ...
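
The naming rules above can be sketched as a small path helper. This is only an illustration: the function name `dataset_files` and the `root` argument are hypothetical, and the suffix lists simply mirror the tree shown above.

```python
from pathlib import Path

# Suffixes copied from the folder layout above.
CORE_SUFFIXES = ("content", "feature", "cites", "label", "meta")
NOISE_SUFFIXES = ("cites.add5", "cites.add10", "cites.reduce5", "cites.reduce10")


def dataset_files(root, prefix):
    """Map each file suffix to its expected path under <root>/<prefix>/.

    Hypothetical helper: it only builds paths and does not check
    that the files exist (label and feature files are optional).
    """
    folder = Path(root) / prefix
    return {s: folder / f"{prefix}.{s}" for s in CORE_SUFFIXES + NOISE_SUFFIXES}
```

For example, `dataset_files("data", "cora")["cites.add5"]` would point at `data/cora/cora.cites.add5`, assuming the repository was placed under a `data` directory.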

Components

prefix.content: original data content in tab-separated format, one document per row. The first element is the document ID, the last element is its class label, and the elements in between form a binary word vector indicating the absence/presence of each word in the document
prefix.cites: links in tab-separated format with each row denoting a single edge; the two elements separated by the tab are the two vertices connected by this link
prefix.meta: dictionary of all possible class labels in this dataset
prefix.label (optional): labels parsed from the content file; each row contains the tab-separated entry ID and class label
prefix.feature (optional): feature matrix serialized as a torch.Tensor
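
A minimal sketch of reading the two main files, based only on the format described above. The function names `parse_content` and `parse_cites` are hypothetical, and the sketch assumes that the word-vector entries are written as integers (0 or 1) and that IDs can be kept as strings.

```python
def parse_content(lines):
    """Parse rows of `<id> TAB <w1> ... <wn> TAB <label>`.

    Returns three parallel lists: ids, feature vectors, labels.
    Assumes the word-vector entries are integers (0 or 1).
    """
    ids, features, labels = [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        ids.append(fields[0])
        features.append([int(v) for v in fields[1:-1]])
        labels.append(fields[-1])
    return ids, features, labels


def parse_cites(lines):
    """Parse rows of `<vertex> TAB <vertex>` into a list of edge tuples."""
    edges = []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) == 2:
            edges.append((fields[0], fields[1]))
    return edges
```

Both functions accept any iterable of text lines, so an open file handle can be passed directly, for example `parse_content(open("cora/cora.content"))` if the Cora folder is in the working directory.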

Noisy variants

prefix.cites.add5: the original links plus an additional 5% of random links
prefix.cites.add10: the original links plus an additional 10% of random links
prefix.cites.reduce5: the original links with 5% of them randomly removed
prefix.cites.reduce10: the original links with 10% of them randomly removed
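
One way such variants could be produced is sketched below. This is not necessarily the procedure used to generate the files in this repository: the function names, the fixed seed, the rounding of the edge count, and the choice to skip self-loops and already-present edges are all assumptions made for illustration.

```python
import random


def add_random_edges(edges, nodes, fraction, seed=0):
    """Return the original edges plus `fraction` * len(edges) random new edges.

    Assumptions: `nodes` is a list, no self-loops are added, and an edge
    already present in either direction is not added again. The graph is
    assumed sparse enough that new vertex pairs are easy to find.
    """
    rng = random.Random(seed)
    existing = set(edges)
    n_new = round(len(edges) * fraction)
    noisy = list(edges)
    while n_new > 0:
        u, v = rng.sample(nodes, 2)  # two distinct vertices
        if (u, v) in existing or (v, u) in existing:
            continue
        existing.add((u, v))
        noisy.append((u, v))
        n_new -= 1
    return noisy


def remove_random_edges(edges, fraction, seed=0):
    """Return the edges with `fraction` * len(edges) of them dropped at random."""
    rng = random.Random(seed)
    n_drop = round(len(edges) * fraction)
    drop = set(rng.sample(range(len(edges)), n_drop))
    return [e for i, e in enumerate(edges) if i not in drop]
```

With these definitions, `add_random_edges(edges, nodes, 0.05)` would correspond to an `add5`-style variant and `remove_random_edges(edges, 0.10)` to a `reduce10`-style variant.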

Datasets

Cora

Cora is a citation dataset consisting of 2708 vertices (scientific publications) and 5429 edges (citations). All publications are classified into 7 classes (research topics):

* Rule Learning
* Genetic Algorithms
* Reinforcement Learning
* Neural Networks
* Probabilistic Methods
* Case Based
* Theory

Citeseer

Citeseer is a citation dataset consisting of 3312 vertices (scientific publications) and 4723 edges (citations). All publications are classified into 6 classes:

* Agents
* AI
* DB
* IR
* ML
* HCI

WebKB

WebKB is a website dataset collected from four computer science departments in different universities. It contains 877 vertices (web pages) and 1608 edges (hyperlinks). All web pages are classified into 5 classes:

* faculty
* students
* project
* course
* other

Acknowledgement

Please cite the related publications if you use these datasets in your work.

Noisy and synthetic settings

@inproceedings{xu2017attentive,
    title={Attentive graph-based recursive neural network for collective vertex classification},
    author={Xu, Qiongkai and Wang, Qing and Xu, Chenchen and Qu, Lizhen},
    booktitle={Proceedings of the 2017 ACM on Conference on Information and Knowledge Management},
    pages={2403--2406},
    year={2017}
}

Cora

@article{mccallum2000automating,
    title={Automating the construction of internet portals with machine learning},
    author={McCallum, Andrew Kachites and Nigam, Kamal and Rennie, Jason and Seymore, Kristie},
    journal={Information Retrieval},
    volume={3},
    number={2},
    pages={127--163},
    year={2000},
    publisher={Springer}
}

Citeseer

@inproceedings{giles1998citeseer,
    title={CiteSeer: An automatic citation indexing system},
    author={Giles, C Lee and Bollacker, Kurt D and Lawrence, Steve},
    booktitle={Proceedings of the third ACM conference on Digital libraries},
    pages={89--98},
    year={1998},
    organization={ACM}
}

WebKB

@inproceedings{Craven:1998:LES:295240.295725,
    author = {Craven, Mark and DiPasquo, Dan and Freitag, Dayne and McCallum, Andrew and Mitchell, Tom and Nigam, Kamal and Slattery, Se\'{a}n},
    title = {Learning to Extract Symbolic Knowledge from the World Wide Web},
    booktitle = {Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence},
    series = {AAAI '98/IAAI '98},
    year = {1998},
    location = {Madison, Wisconsin, USA},
    pages = {509--516},
    numpages = {8},
    publisher = {American Association for Artificial Intelligence},
    address = {Menlo Park, CA, USA},
}
