This is the official implementation of the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.
The repository structure is as follows:
LLM_BP (Root Directory)
│── dataset/ # Contains dataset files
│── model/ # Stores model implementation of LLM-BP and LLM-BP (appr.)
│── results/ # Contains generated results from GPT-4o (predictions on the test set) and GPT-4o-mini (predictions of the homophily ratio)
│── zero_shot.py # Zero-shot inference
│── few_shot.py # Few-shot inference
│── run_gpt.py # Run OpenAI GPT to predict labels from the raw node texts
│── pred_h.py # Predict the homophily ratio r by sampling edges
│── generate_llm.py # Generate embeddings with the vanilla LLM2Vec or the task-adaptive encoder
│── generate_lm.py # Generate embeddings with SBERT or RoBERTa
│── generate_llm_gpt.py # Generate embeddings with text-embedding-3-large
│── README.md # Documentation file
To set up the environment, follow these steps:
conda create -n llmbp python==3.8.18
conda activate llmbp
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install pyg_lib==0.3.1+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_sparse==0.6.18+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_cluster==1.6.3+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_spline_conv==1.2.2+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install transformers==4.46.3
pip install sentence_transformers==2.2.2
pip install dgl==2.4.0+cu121 -f https://data.dgl.ai/wheels/torch-2.1/cu121/repo.html
pip install openai
pip install torch_geometric==2.5.0
pip install protobuf
pip install accelerate
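After installation, you can optionally run a quick sanity check. This is a minimal sketch, assuming the versions above installed correctly; it only confirms that PyTorch, PyTorch Geometric, and DGL import and that CUDA is visible.

import torch
import torch_geometric
import dgl

# Print the installed versions and confirm that CUDA is visible.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch_geometric:", torch_geometric.__version__)
print("dgl:", dgl.__version__)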
The dataset structure should be organized as follows:
/dataset/
│── [dataset_name]/
│ │── processed_data.pt # Contains labels and graph information
│ │── [encoder]_x.pt # Features extracted by different encoders
│ │── categories.csv # raw texts of the label names
│ │── raw_texts.pt # raw text of each node
processed_data.pt: A PyTorch file storing the processed dataset, including the graph structure and node labels. Note that for heterophilic datasets this file is named [Dataset].pt, where Dataset can be Cornell, etc., and should be opened with DGL.
[encoder]_x.pt: Feature matrices extracted using different encoders, where [encoder] represents the encoder name.
categories.csv: The raw label names.
raw_texts.pt: The raw text of each node. Note that for heterophilic datasets this file is named [Dataset].csv, where Dataset can be Cornell, etc.
[dataset_name] should be one of the following:
cora
citeseer
pubmed
bookhis
bookchild
sportsfit
wikics
cornell
texas
wisconsin
washington
[encoder] can be one of the following:
sbert (the Sentence-BERT encoder)
roberta (the RoBERTa encoder)
llmicl_primary (the vanilla LLM2Vec)
llmicl_class_aware (the task-adaptive encoder)
llmgpt_text-embedding-3-large (the text-embedding-3-large embedding API by OpenAI)
Ensure the datasets are placed correctly for smooth execution. They can be found in the Hugging Face repository; download them directly and place them under the /dataset/ folder.
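To verify that a dataset is in place, the files can be inspected with a short script. This is only an illustrative sketch: the attribute names inside processed_data.pt follow the usual PyTorch Geometric conventions and may differ from the actual saved object, and heterophilic datasets are stored as DGL graphs as noted above.

import torch

dataset = "cora"                    # one of the dataset names listed above
encoder = "llmicl_class_aware"      # one of the encoder names listed above

# Load the processed graph/labels, the encoder features, and the raw node texts.
data = torch.load(f"dataset/{dataset}/processed_data.pt")
feats = torch.load(f"dataset/{dataset}/{encoder}_x.pt")
raw_texts = torch.load(f"dataset/{dataset}/raw_texts.pt")

print(data)                         # graph structure and labels
print(feats.shape)                  # [num_nodes, embedding_dim]
print(raw_texts[0][:200])           # beginning of the first node's raw text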
python generate_llm.py --dataset [DATASET] --version [VERSION]
CUDA_VISIBLE_DEVICES=0,1 python generate_llm.py --dataset cora --version class_aware
[DATASET]: The name of the dataset.
[VERSION]: primary → Vanilla LLM2Vec; class_aware → Task-adaptive encoding.
Ensure that the appropriate CUDA devices are set before running the script.
We have provided the pre-computed embeddings for the encoders in the Hugging Face repository; you may download them directly and put them under the /dataset folder.
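For reference, the snippet below sketches how text embeddings are typically produced with the llm2vec package (installed separately). The checkpoint names and the instruction string are placeholders rather than the exact ones used in generate_llm.py, which also implements the task-adaptive (class_aware) prompting.

import torch
from llm2vec import LLM2Vec

# Placeholder checkpoints; see generate_llm.py for the actual model and prompts.
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

instruction = "Classify the node text into one of the given categories:"  # placeholder instruction
texts = ["Title: ... Abstract: ..."]
embeddings = l2v.encode([[instruction, t] for t in texts])  # tensor of shape [len(texts), dim]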
python run_gpt.py --mode [MODE] --model [MODEL] --dataset [DATASET]
[MODEL]: The model selection (e.g., 4o for GPT-4o).
[DATASET]: The name of the dataset.
[MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it evaluates the saved results of the model.
We have provided the pre-computed predictions from GPT-4o in the Hugging Face repository; you may download them directly and put them under the /results folder.
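For intuition, the call made by run_gpt.py is roughly of the following form. This is only a sketch: the prompt text, label parsing, and batching are assumptions here and are implemented in run_gpt.py; it requires OPENAI_API_KEY to be set in the environment.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_node(raw_text, label_names):
    # Hypothetical prompt: ask GPT-4o to pick exactly one label for the node text.
    prompt = (
        "Given the text of a node:\n" + raw_text + "\n\n"
        "Choose exactly one category from: " + ", ".join(label_names) + ".\n"
        "Answer with the category name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()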
python pred_r.py --mode [MODE] --dataset [DATASET] --model [MODEL]
[DATASET]: The name of the dataset.
[MODEL]: The model selection (e.g., 4o_mini).
[MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it computes the predicted homophily ratio from the saved results.
Fill the predicted value into H_dict in zero_shot.py or few_shot.py.
We have provided the pre-computed predictions from GPT-4o-mini in the Hugging Face repository; you may download them directly and put them under the /results folder.
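Conceptually, the homophily ratio is estimated by sampling edges and checking how often the two endpoints are judged to belong to the same class. The sketch below illustrates that estimate; same_class_fn stands in for the LLM judgment made by the script, and the function name and sampling details are assumptions rather than the script's exact implementation.

import random
import torch

def estimate_homophily(edge_index, same_class_fn, num_samples=100):
    # edge_index: [2, num_edges] tensor; same_class_fn(u, v) -> bool is a placeholder
    # for the LLM's judgment of whether nodes u and v share a class.
    num_edges = edge_index.size(1)
    sampled = random.sample(range(num_edges), min(num_samples, num_edges))
    agree = sum(same_class_fn(int(edge_index[0, e]), int(edge_index[1, e])) for e in sampled)
    return agree / len(sampled)   # estimated edge homophily ratio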
python zero_shot.py --dataset [DATASET] --encoder [ENCODER] --model 4o
[DATASET]: The name of the dataset.
[ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
4o: Specifies that GPT-4o predictions are used to compute the averaged class embeddings.
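As a simplified illustration of the embedding-similarity step (not the full LLM-BP inference implemented in zero_shot.py, which also exploits the graph structure), each node can be assigned to the class whose embedding is most cosine-similar; the class embeddings could be, e.g., the average embedding of nodes that GPT-4o assigns to each class. All names below are hypothetical.

import torch.nn.functional as F

def zero_shot_predict(node_emb, class_emb):
    # node_emb: [num_nodes, dim]; class_emb: [num_classes, dim], e.g. the average
    # embedding of nodes that GPT-4o assigns to each class.
    node_emb = F.normalize(node_emb, dim=-1)
    class_emb = F.normalize(class_emb, dim=-1)
    return (node_emb @ class_emb.T).argmax(dim=-1)   # predicted label per node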
python few_shot.py --dataset [DATASET] --encoder [ENCODER]
[DATASET]: The name of the dataset.
[ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
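In the few-shot setting, class representations can instead be built from a handful of labeled support nodes. The sketch below averages the support-node embeddings per class; it is a simplified stand-in for few_shot.py, with all names hypothetical, and prediction then proceeds as in the zero-shot sketch above.

import torch
import torch.nn.functional as F

def few_shot_prototypes(node_emb, labels, support_idx, num_classes):
    # node_emb: [num_nodes, dim]; labels: [num_nodes]; support_idx: LongTensor of
    # indices of the few labeled nodes. Returns one averaged embedding per class.
    protos = torch.zeros(num_classes, node_emb.size(1))
    for c in range(num_classes):
        members = support_idx[labels[support_idx] == c]
        if len(members) > 0:
            protos[c] = F.normalize(node_emb[members], dim=-1).mean(dim=0)
    return protos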
The dataset pre-processing, formats and code implementations are inspired by or built upon GLBench, Text-space graph foundation model, and LLaGA.
If you find our work helpful, please consider citing:
@article{wang2025model,
  title={Model Generalization on Text Attribute Graphs: Principles with Large Language Models},
  author={Wang, Haoyu and Liu, Shikun and Wei, Rongzhe and Li, Pan},
  journal={arXiv preprint arXiv:2502.11836},
  year={2025}
}