This is the official implementation of the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.
The repository structure is as follows:
LLM_BP (Root Directory)
│── dataset/ # Contains dataset files
│── model/ # Stores model implementation of LLM-BP and LLM-BP (appr.)
│── results/ # Contains generated results from GPT-4o (predictions on the test set) and GPT-4o-mini (predictions of the homophily ratio)
│── zero_shot.py # Zero-shot inference
│── few_shot.py # Few-shot inference
│── run_gpt.py # Run OpenAI GPT to predict labels from the raw node texts
│── pred_h.py # Predict the homophily ratio r by sampling edges
│── generate_llm.py # Generate embeddings with the vanilla LLM2Vec or the task-adaptive encoder
│── generate_lm.py # Generate embeddings with SBERT or RoBERTa
│── generate_llm_gpt.py # Generate embeddings with text-embedding-3-large
│── README.md # Documentation file
To set up the environment, follow these steps:
conda create -n llmbp python==3.8.18
conda activate llmbp
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install pyg_lib==0.3.1+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_sparse==0.6.18+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_cluster==1.6.3+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install torch_spline_conv==1.2.2+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install transformers==4.46.3
pip install sentence_transformers==2.2.2
pip install dgl==2.4.0+cu121 -f https://data.dgl.ai/wheels/torch-2.1/cu121/repo.html
pip install openai
pip install torch_geometric==2.5.0
pip install protobuf
pip install accelerate
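After installation, you can optionally run a quick sanity check. This is a minimal sketch, assuming the versions above installed correctly; it only confirms that PyTorch, PyTorch Geometric, and DGL import and that CUDA is visible.

import torch
import torch_geometric
import dgl

# Print the installed versions and confirm that CUDA is visible.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch_geometric:", torch_geometric.__version__)
print("dgl:", dgl.__version__)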
The dataset structure should be organized as follows:
/dataset/
│── [dataset_name]/
│ │── processed_data.pt # Contains labels and graph information
│ │── [encoder]_x.pt # Features extracted by different encoders
│ │── categories.csv # raw texts of the label names
│ │── raw_texts.pt # raw text of each node
processed_data.pt: A PyTorch file storing the processed dataset, including the graph structure and node labels. Note that for heterophilic datasets this file is named [Dataset].pt, where Dataset can be Cornell, etc., and should be opened with DGL.
[encoder]_x.pt: Feature matrices extracted using different encoders, where [encoder] represents the encoder name.
categories.csv: The raw label names.
raw_texts.pt: The raw text of each node. Note that for heterophilic datasets this file is named [Dataset].csv, where Dataset can be Cornell, etc.
[dataset_name] should be one of the following:
cora
citeseer
pubmed
bookhis
bookchild
sportsfit
wikics
cornell
texas
wisconsin
washington
[encoder] can be one of the following:
sbert (the Sentence-BERT encoder)
roberta (the RoBERTa encoder)
llmicl_primary (the vanilla LLM2Vec)
llmicl_class_aware (the task-adaptive encoder)
llmgpt_text-embedding-3-large (the text-embedding-3-large embedding API by OpenAI)
Ensure the datasets are placed correctly for smooth execution. They can be found in the Hugging Face repository; download them directly and place them under the /dataset/ folder.
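To verify that a dataset is in place, the files can be inspected with a short script. This is only an illustrative sketch: the attribute names inside processed_data.pt follow the usual PyTorch Geometric conventions and may differ from the actual saved object, and heterophilic datasets are stored as DGL graphs as noted above.

import torch

dataset = "cora"                    # one of the dataset names listed above
encoder = "llmicl_class_aware"      # one of the encoder names listed above

# Load the processed graph/labels, the encoder features, and the raw node texts.
data = torch.load(f"dataset/{dataset}/processed_data.pt")
feats = torch.load(f"dataset/{dataset}/{encoder}_x.pt")
raw_texts = torch.load(f"dataset/{dataset}/raw_texts.pt")

print(data)                         # graph structure and labels
print(feats.shape)                  # [num_nodes, embedding_dim]
print(raw_texts[0][:200])           # beginning of the first node's raw text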
python generate_llm.py --dataset [DATASET] --version [VERSION]
CUDA_VISIBLE_DEVICES=0,1 python generate_llm.py --dataset cora --version class_aware
[DATASET]: The name of the dataset.
[VERSION]: primary → Vanilla LLM2Vec; class_aware → Task-adaptive encoding.
Ensure that the appropriate CUDA devices are set before running the script.
We have provided the pre-computed embeddings for the encoders in the Hugging Face repository; you may download them directly and put them under the /dataset folder.
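For reference, the snippet below sketches how text embeddings are typically produced with the llm2vec package (installed separately). The checkpoint names and the instruction string are placeholders rather than the exact ones used in generate_llm.py, which also implements the task-adaptive (class_aware) prompting.

import torch
from llm2vec import LLM2Vec

# Placeholder checkpoints; see generate_llm.py for the actual model and prompts.
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

instruction = "Classify the node text into one of the given categories:"  # placeholder instruction
texts = ["Title: ... Abstract: ..."]
embeddings = l2v.encode([[instruction, t] for t in texts])  # tensor of shape [len(texts), dim]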
python run_gpt.py --mode [MODE] --model [MODEL] --dataset [DATASET]
[MODEL]: The model selection (e.g., 4o for GPT-4o).
[DATASET]: The name of the dataset.
[MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it evaluates the saved results of the model.
We have provided the pre-computed predictions from GPT-4o in the Hugging Face repository; you may download them directly and put them under the /results folder.
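For intuition, the call made by run_gpt.py is roughly of the following form. This is only a sketch: the prompt text, label parsing, and batching are assumptions here and are implemented in run_gpt.py; it requires OPENAI_API_KEY to be set in the environment.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_node(raw_text, label_names):
    # Hypothetical prompt: ask GPT-4o to pick exactly one label for the node text.
    prompt = (
        "Given the text of a node:\n" + raw_text + "\n\n"
        "Choose exactly one category from: " + ", ".join(label_names) + ".\n"
        "Answer with the category name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()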
python pred_r.py --mode [MODE] --dataset [DATASET] --model [MODEL]
[DATASET]: The name of the dataset.
[MODEL]: The model selection (e.g., 4o_mini).
[MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it computes the predicted homophily ratio from the saved results.
Fill the predicted value into H_dict in zero_shot.py or few_shot.py.
We have provided the pre-computed predictions from GPT-4o-mini in the Hugging Face repository; you may download them directly and put them under the /results folder.
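Conceptually, the homophily ratio is estimated by sampling edges and checking how often the two endpoints are judged to belong to the same class. The sketch below illustrates that estimate; same_class_fn stands in for the LLM judgment made by the script, and the function name and sampling details are assumptions rather than the script's exact implementation.

import random
import torch

def estimate_homophily(edge_index, same_class_fn, num_samples=100):
    # edge_index: [2, num_edges] tensor; same_class_fn(u, v) -> bool is a placeholder
    # for the LLM's judgment of whether nodes u and v share a class.
    num_edges = edge_index.size(1)
    sampled = random.sample(range(num_edges), min(num_samples, num_edges))
    agree = sum(same_class_fn(int(edge_index[0, e]), int(edge_index[1, e])) for e in sampled)
    return agree / len(sampled)   # estimated edge homophily ratio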
python zero_shot.py --dataset [DATASET] --encoder [ENCODER] --model 4o
[DATASET]: The name of the dataset.
[ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
4o: Specifies that GPT-4o predictions are used to compute the averaged class embeddings.
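As a simplified illustration of the embedding-similarity step (not the full LLM-BP inference implemented in zero_shot.py, which also exploits the graph structure), each node can be assigned to the class whose embedding is most cosine-similar; the class embeddings could be, e.g., the average embedding of nodes that GPT-4o assigns to each class. All names below are hypothetical.

import torch.nn.functional as F

def zero_shot_predict(node_emb, class_emb):
    # node_emb: [num_nodes, dim]; class_emb: [num_classes, dim], e.g. the average
    # embedding of nodes that GPT-4o assigns to each class.
    node_emb = F.normalize(node_emb, dim=-1)
    class_emb = F.normalize(class_emb, dim=-1)
    return (node_emb @ class_emb.T).argmax(dim=-1)   # predicted label per node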
python few_shot.py --dataset [DATASET] --encoder [ENCODER]
[DATASET]: The name of the dataset.
[ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
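In the few-shot setting, class representations can instead be built from a handful of labeled support nodes. The sketch below averages the support-node embeddings per class; it is a simplified stand-in for few_shot.py, with all names hypothetical, and prediction then proceeds as in the zero-shot sketch above.

import torch
import torch.nn.functional as F

def few_shot_prototypes(node_emb, labels, support_idx, num_classes):
    # node_emb: [num_nodes, dim]; labels: [num_nodes]; support_idx: LongTensor of
    # indices of the few labeled nodes. Returns one averaged embedding per class.
    protos = torch.zeros(num_classes, node_emb.size(1))
    for c in range(num_classes):
        members = support_idx[labels[support_idx] == c]
        if len(members) > 0:
            protos[c] = F.normalize(node_emb[members], dim=-1).mean(dim=0)
    return protos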
The dataset pre-processing, formats and code implementations are inspired by or built upon GLBench, Text-space graph foundation model, and LLaGA.
If you find our work helpful, please consider citing:
@article{wang2025model,
  title={Model Generalization on Text Attribute Graphs: Principles with Large Language Models},
  author={Wang, Haoyu and Liu, Shikun and Wei, Rongzhe and Li, Pan},
  journal={arXiv preprint arXiv:2502.11836},
  year={2025}
}