# Pretraining Language Models with Text-Attributed Heterogeneous Graphs

## Data Preparation

Please download the datasets from DatasetsForTHLM and put them into `./Data`.
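
If you are setting up from scratch, a minimal sketch (the archive name below is hypothetical; the actual layout depends on the downloaded files):

```
mkdir -p ./Data
# unpack the downloaded DatasetsForTHLM archive here, e.g.
# unzip DatasetsForTHLM.zip -d ./Data
```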

## Model Pretraining

Example of training THLM on the Patents dataset:

```
python main.py --dataset_name Patents
```
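
The same entry point should work for the other two datasets; the values below assume `--dataset_name` accepts the dataset names used elsewhere in this README:

```
python main.py --dataset_name GoodReads
python main.py --dataset_name OAG_Venue
```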

## Get Node Embeddings

Obtain node embeddings for Patents, GoodReads, and OAG_Venue using the scripts in `./Downstream/preprocess_data`.

Example of obtaining node embeddings for Patents:

```
python Patent_features.py
```
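
To sanity-check the result, a minimal sketch for loading the saved embeddings, assuming the script writes a NumPy array (the file name below is hypothetical; check `Patent_features.py` for the actual output path and format):

```python
import numpy as np

# Hypothetical output file; see Patent_features.py for the real path/format.
emb = np.load("./Downstream/preprocess_data/patent_node_embeddings.npy")
print(emb.shape)  # expected: (num_nodes, hidden_dim)
```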

## Model Evaluation

- Link Prediction for OAG_Venue: `./Downstream/Link-Train-OAG`
- Link Prediction for Patents/GoodReads: `./Downstream/Link-Train-Patent`
- Node Classification for OAG_Venue: `./Downstream/train-OAG`
- Node Classification for Patents/GoodReads: `./Downstream/train-Patent`

## Pre-trained Language Models

We also provide the language models pre-trained on these three datasets on HuggingFace.
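
A minimal sketch for loading one of them with `transformers` (the repository id below is a placeholder; substitute the actual HuggingFace repo name):

```python
from transformers import AutoModel, AutoTokenizer

# "<hf-org>/THLM-Patents" is a placeholder; use the actual repository id.
tokenizer = AutoTokenizer.from_pretrained("<hf-org>/THLM-Patents")
model = AutoModel.from_pretrained("<hf-org>/THLM-Patents")
```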

## Environment

- PyTorch 2.0.0
- transformers 4.23.1
- dgl 0.9.1
- tqdm
- numpy
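
A one-line install for the dependencies above (GPU builds of PyTorch and DGL may need the platform-specific commands from their websites):

```
pip install torch==2.0.0 transformers==4.23.1 dgl==0.9.1 tqdm numpy
```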