GraDaSE is agraph-based method for Dataset Search with Examples task.
The repository is organised as follows:
data/: the original data of two test collections: DataFinder-E and DSEBench.
code/: the code files of the implementation of GraDaSE.
-
code/run.py: run reranking of GraDaSE.
-
code/model.py: implementation of GraDaSE.
-
code/utils/: contains tool functions.
"data/" directory contains the information of dataset graph and data for two test collections: DataFinder-E and DSEBench.
DataFinder-E and DSEBench are test collections for Dataset Search with Examples, which is the task of reranking a list of candidate datasets based on a keyword query and target datasets. The DataFinder-E is adapted from Datafinder.
The "./data/{test collection}/graph/" directory provides the ID and metadata of each dataset, the relationship between datasets and the relationships between datasets and tags in CSV format.
The "./data/{test collection}/queries.tsv" provides the keywords queries. The first column is the id of query, and the second column is the text of query.
The "./data/{test collection}/pairs.json" provides the pair ID, keyword query and target datasets in JSON format.
{
pairID: {"query": query_id, "targets": [dataset_id, ...]}, ...
}
Take the "./data/{test collection}/train.json" file for example. The train.json file contains pair id and candidate datasets list in JSON format.
{pair_id: {dataset_id: rel_score, ...}
The retrieval results of BM25 are in the "./data/{test collection}/bm25_test.json" file.
"code/" directory contains the implementation of GraDaSE. Before running the code, download and extract the data according to "data/README.md".
Run the following commands in the project root "GraDaSE/" directory.
cd data/
wget https://zenodo.org/records/14876878/files/DataFinder-E.zip
wget https://zenodo.org/records/14876878/files/DSEBench.zip
unzip DataFinder-E.zip
unzip DSEBench.zip
Before running the code, install the required libraries in the following order.
python==3.10.15
torch==2.4.0+cu121
dgl==2.4.0+cu121
scikit-learn==1.5.2
transformers==4.44.2
rank_bm25==0.2.2
pytrec_eval==0.5
FlagEmbedding==1.3.3
You can refer to the following commands to install the aforementioned packages:
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install dgl -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html
pip install scikit-learn==1.5.2 transformers==4.44.2 rank_bm25==0.2.2 pytrec_eval==0.5 FlagEmbedding==1.3.3
We train our model using NVIDIA GeForce RTX 4090 with CUDA 12.2.
For reranking on DataFinder-E:
cd code/
bash DataFinder.sh
For reranking on DSEBench:
cd code/
bash DSEBench.sh