This repository contains the code for the paper "Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning". In this paper, we propose SC-Prompt, a novel divide-and-conquer strategy for effectively supporting Text-to-SQL translation in the few-shot scenario.
Clone the repository and create the output directories:

```bash
git clone git@github.com:ruc-datalab/SC-prompt.git
cd SC-prompt
mkdir -p -m 777 experimental_outputs
mkdir -p -m 777 transformers_cache
cd experimental_outputs
mkdir -p -m 777 spider
mkdir -p -m 777 cosql
mkdir -p -m 777 geoquery
cd ..
```
Download the datasets and put them under the corresponding folders:

- Spider: put it under `src/datasets/spider` (see the sketch after this list).
- CoSQL: put it under `src/datasets/cosql`.
- Geoquery: put it under `src/datasets/geoquery`.
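For example, if the Spider dataset was downloaded as a zip archive that unpacks to a `spider/` directory (the archive name `spider.zip` and its internal layout are assumptions; adjust to match your download), placing it would look like:

```bash
# Unpack a downloaded Spider archive into the expected folder.
# The archive name and its internal layout are assumptions about your download.
unzip spider.zip -d src/datasets/
ls src/datasets/spider   # the dataset files should now be here
```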
The repository is organized as follows:

```
|-- experimental_outputs  # saves the fine-tuned models and evaluation results
|-- scripts               # the train/inference scripts
|-- src
    |-- datasets          # the classes to preprocess the datasets
    |-- metrics           # the classes to evaluate the prediction results
    |-- utils             # main code
    |-- run.py            # the entry point to train/inference the few-shot text-to-sql model
```
Our constrained decoding method is based on the parser provided by Picard. Please use the Docker image provided by the official repository to build the container:

```bash
docker run -itd --gpus '"device=<your_available_gpu_ids>"' --rm --user 13011:13011 \
  --mount type=bind,source=<your_base_dir>/transformers_cache,target=/transformers_cache \
  --mount type=bind,source=<your_base_dir>/scripts,target=/app/scripts \
  --mount type=bind,source=<your_base_dir>/experimental_outputs,target=/app/experimental_outputs \
  --mount type=bind,source=<your_base_dir>/src,target=/app/src \
  tscholak/text-to-sql-eval:6a252386bed6d4233f0f13f4562d8ae8608e7445
```
Set `<your_available_gpu_ids>` to the GPU ids you want to expose to the container, and `<your_base_dir>` to the absolute path of this repository.
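For example, with a single GPU (id `0`) and the repository cloned at `/home/user/SC-prompt` (both values are assumptions for illustration), the command becomes:

```bash
# Example with concrete placeholder values; adjust to your environment.
docker run -itd --gpus '"device=0"' --rm --user 13011:13011 \
  --mount type=bind,source=/home/user/SC-prompt/transformers_cache,target=/transformers_cache \
  --mount type=bind,source=/home/user/SC-prompt/scripts,target=/app/scripts \
  --mount type=bind,source=/home/user/SC-prompt/experimental_outputs,target=/app/experimental_outputs \
  --mount type=bind,source=/home/user/SC-prompt/src,target=/app/src \
  tscholak/text-to-sql-eval:6a252386bed6d4233f0f13f4562d8ae8608e7445
```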
Download the fine-tuned models and put them under the corresponding folders.
| Dataset | #Train | Model | Folder |
|---|---|---|---|
| Spider | 0.05 (350) | link | `experimental_outputs/spider/` |
| Spider | 0.1 (700) | link | `experimental_outputs/spider/` |
| CoSQL | 0.05 (475) | link | `experimental_outputs/cosql/` |
| CoSQL | 0.1 (950) | link | `experimental_outputs/cosql/` |
| Geoquery | 1.0 (536) | link | `experimental_outputs/geoquery/` |
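For example, assuming the downloaded Spider (0.1) checkpoint unpacks to a directory named `checkpoint-best` (a hypothetical name; use whatever the download actually contains), placing it would look like:

```bash
# Move the downloaded checkpoint into the folder the eval scripts expect.
mv checkpoint-best experimental_outputs/spider/
```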
Use the following scripts to run inference:
```bash
# Inference on Spider
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_spider_scprompt.sh 0.1
# Inference on CoSQL
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_cosql_scprompt.sh 0.1
# Inference on Geoquery
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_geoquery_scprompt.sh 1.
```
- The second argument is the proportion of the official training set that was used for fine-tuning (e.g., `0.1` for 10%).
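Since the scripts and outputs are mounted into the container under `/app`, one way to run these commands is from a shell inside the running container (a sketch; `<container_id>` is a placeholder taken from `docker ps`):

```bash
# Open a shell in the running evaluation container and launch inference.
docker ps                            # note the ID of the text-to-sql-eval container
docker exec -it <container_id> bash  # attach to it
cd /app
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_spider_scprompt.sh 0.1
```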
Use the following scripts to train:

```bash
# Train on Spider
CUDA_VISIBLE_DEVICES=0 bash scripts/train_spider_scprompt.sh 0.1
# Train on CoSQL
CUDA_VISIBLE_DEVICES=0 bash scripts/train_cosql_scprompt.sh 0.1
# Train on Geoquery
CUDA_VISIBLE_DEVICES=0 bash scripts/train_geoquery_scprompt.sh 1.
```
- The second argument is the proportion of the official training set to use for training.
The best model will be saved automatically under `experimental_outputs/`. Please note that training does not use the fine-grained constrained decoding strategy, which is only needed for evaluation; refer to Quick Inference above to evaluate the fine-tuned model.
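After training finishes, you can check what was saved; a quick sketch (the exact file names depend on the run and are not guaranteed):

```bash
# Inspect the saved checkpoints and evaluation results for Spider.
ls -R experimental_outputs/spider/
```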