- [2024.11.01] We have uploaded the source datasets.
- [2024.11.03] We have uploaded the target datasets and the pretrained checkpoint.
- How to apply GraphCLIP to customized datasets.
```bash
conda create -n graphclip python=3.10
conda activate graphclip
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric
```
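After installing, a quick sanity check confirms that the CUDA build of PyTorch and PyG import correctly (a minimal sketch; the expected versions match the pip commands above):

```python
# Verify the install: versions and CUDA availability.
import torch
import torch_geometric

print(torch.__version__)             # expect 2.4.1+cu121
print(torch.cuda.is_available())     # expect True on a CUDA machine
print(torch_geometric.__version__)
```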
- This repository includes the smallest source dataset, i.e., pubmed. For larger-scale source datasets, please download the generated graph summaries:
Datasets | Links |
---|---|
OGBN-ArXiv | Google Drive |
ArXiv_2023 | Google Drive |
Reddit | Google Drive |
OGBN-Products | Google Drive |
- Once downloaded, unzip the files and place them in the `summary` directory.
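To confirm the summaries landed where the code expects them, the following sketch checks for one folder per source dataset; the folder names are assumptions derived from the dataset names passed to train.py, not guaranteed by the repository.

```python
# A minimal sanity check for the summary directory
# (per-dataset folder names are assumed, not guaranteed by the repo).
from pathlib import Path

expected = ["pubmed", "ogbn-arxiv", "arxiv_2023", "reddit", "ogbn-products"]
for name in expected:
    path = Path("summary") / name
    print(f"{path}: {'found' if path.exists() else 'missing'}")
```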
- For convenience, we also provide the processed data, which includes the graph structure and node features. Please download them from the links below:
Datasets | Links |
---|---|
OGBN-ArXiv | Google Drive |
ArXiv_2023 | Google Drive |
Reddit | Google Drive |
OGBN-Products | Google Drive |
- Once downloaded, unzip the files and place them in the `processed_data` directory.
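If you want to verify a download before training, the sketch below assumes the processed graphs are torch-serialized PyG `Data` objects; both the file name `ogbn-arxiv.pt` and the format are assumptions for illustration.

```python
# A hedged sketch for inspecting processed data; the file name and
# serialization format are assumptions, so adapt them to the unzipped files.
import torch
from torch_geometric.data import Data

data = torch.load("processed_data/ogbn-arxiv.pt")  # hypothetical file name
if isinstance(data, Data):
    print(data.num_nodes, data.num_edges)  # graph structure
    print(data.x.shape)                    # node feature matrix
```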
- For target datasets, you only need to download the processed data, unzip the files, and place them in the `processed_data` directory:
Datasets | Links |
---|---|
WikiCS | Google Drive |
Instagram | Google Drive |
Ele-Photo | Google Drive |
Ele-Computers | Google Drive |
Books-History | Google Drive |
- To generate subgraphs for each target dataset, run `bash gen_target_subg.sh`.
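The exact sampling logic lives in `gen_target_subg.sh`; as a hedged illustration of the general idea, PyG's `k_hop_subgraph` extracts a local subgraph around a seed node (the hop count, seed node, and toy edges below are placeholders, not the script's actual parameters).

```python
# Illustrative k-hop subgraph extraction with PyG; not the repo's script.
import torch
from torch_geometric.utils import k_hop_subgraph

edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])  # toy undirected path graph
subset, sub_edge_index, mapping, edge_mask = k_hop_subgraph(
    node_idx=0, num_hops=2, edge_index=edge_index, relabel_nodes=True)
print(subset)          # nodes kept in the 2-hop neighborhood of node 0
print(sub_edge_index)  # edges reindexed to the subgraph
```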
To get started, download our released checkpoint, unzip it, and place the extracted files in the `checkpoints` directory. You can then use this checkpoint directly on your target datasets, as outlined in the next section.
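To confirm the checkpoint unpacked correctly, you can peek at its contents; the `.pt` file name below is an assumption based on the `--ckpt pretrained_graphclip` value used later, so adjust it to whatever the archive actually contains.

```python
# A minimal sketch for inspecting the released checkpoint
# (the .pt file name is a guess; use whatever the archive contains).
import torch

state = torch.load("checkpoints/pretrained_graphclip.pt", map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:5])  # peek at the first few entries
```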
```bash
# We provide the smallest source dataset (pubmed) for running our code
# single GPU
CUDA_VISIBLE_DEVICES=0 python train.py --source_data pubmed --batch_size 1024 --epochs 30
# multiple GPUs
CUDA_VISIBLE_DEVICES=0,1 python train.py --source_data pubmed --batch_size 1024 --epochs 30
# reproduce our results
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --source_data ogbn-arxiv+arxiv_2023+pubmed+ogbn-products+reddit --batch_size 7200 --epochs 30
```
We use 8 A100 (40GB) GPUs for pretraining, which takes about 7 hours. The code supports DataParallel, so you can assign multiple GPUs as above; the pattern is sketched below.
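For reference, the multi-GPU path follows the standard `torch.nn.DataParallel` pattern, which replicates the module on every visible GPU and splits each batch across them; the toy model below is illustrative only.

```python
# The standard DataParallel pattern (toy model for illustration).
import torch
import torch.nn as nn

model = nn.Linear(128, 64)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across all visible GPUs
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```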
We provide a sample target dataset (citeseer) for running our code. By default, this will load your pretrained checkpoint.
```bash
CUDA_VISIBLE_DEVICES=0 python eval.py --target_data citeseer
```
More target datasets can be evaluated by concatenating their names: `--target_data cora+citeseer+wikics+instagram+photo+computer+history`.
To reproduce our experiments, use the `--ckpt` flag to specify the pretrained checkpoint, passing the name of the downloaded checkpoint:

```bash
CUDA_VISIBLE_DEVICES=0 python eval.py --target_data cora+citeseer+wikics+instagram+photo+computer+history --ckpt pretrained_graphclip
```
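Conceptually, CLIP-style zero-shot evaluation scores each (sub)graph embedding against text embeddings of the candidate labels by cosine similarity. The sketch below illustrates only that scoring step, with random tensors standing in for the real graph and text encoders; it is not the logic of `eval.py`.

```python
# Conceptual CLIP-style zero-shot scoring (random stand-in embeddings).
import torch
import torch.nn.functional as F

graph_emb = F.normalize(torch.randn(32, 256), dim=-1)  # 32 subgraph embeddings
label_emb = F.normalize(torch.randn(7, 256), dim=-1)   # 7 class-prompt embeddings
logits = graph_emb @ label_emb.t()                     # cosine similarities
pred = logits.argmax(dim=-1)                           # predicted class per subgraph
print(pred)
```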