BIRD-SQL Mini-Dev

Overview

Here, we provide a Lite version of developtment dataset: Mini-Dev. This mini-dev dataset is designed to facilitate efficient and cost-effective development cycles, especially for testing and refining SQL query generation models. This dataset results from community feedback, leading to the compilation of 500 high-quality text2sql pairs derived from 11 distinct databases in a development environment. To further enhance the practicality of the BIRD system in industry settings and support the development of text-to-SQL models, we make the Mini-Dev dataset available in both MySQL and PostgreSQL.

Additionally, we introduce two new evaluation metrics for the Mini-Dev dataset: the Reward-based Valid Efficiency Score (R-VES) and the Soft F1-Score. These metrics aim to evaluate the efficiency and accuracy of text-to-SQL models, respectively. It is important to note that the both metrics, currently in their beta version, applies exclusively to the Mini-Dev dataset using baseline models.

We welcome contributions and suggestions for enhancing these metrics, particularly regarding their integration into existing leaderboards. Please do not hesitate to contact us if you are interested in these developments or have any proposals for improvements.

Below are some key statistics of the mini-dev dataset:

Difficulty Distribution

Simple: 30%
Moderate: 50%
Challenging: 20%

Database Distribution

Debit Card Specializing: 30 instances
Student Club: 48 instances
Thrombosis Prediction: 50 instances
European Football 2: 51 instances
Formula 1: 66 instances
Superhero: 52 instances
Codebase Community: 49 instances
Card Games: 52 instances
Toxicology: 40 instances
California Schools: 30 instances
Financial: 32 instances

Keywords Statistic

Main Body Keywords •SELECT •FROM •WHERE •AND •OR •NOT •IN •EXISTS •IS •NULL •IIF •CASE •CASE WHEN.
Join Keywords • INNER JOIN • LEFT JOIN • ON • AS.
Clause Keywords • BETWEEN • LIKE • LIMIT • ORDER BY • ASC • DESC • GROUP BY •HAVING •UNION •ALL •EXCEPT •PARTITION BY •OVER.
Aggregation Keywords • AVG • COUNT • MAX • MIN • ROUND • SUM.
Scalar Keywords • ABS • LENGTH • STRFTIME • JULIADAY • NOW • CAST • SUBSTR • INSTR.
Comparison Keywords •= •> •< •>= •<= •!=.
Computing Keywords •- •+ •* •/.

Dataset Introduction

The dataset contains the main following resources:

database: The database should be stored under the ./mini_dev_data/dev_databases/. In each database folder, it has two components:
- database_description: the csv files are manufactured to describe database schema and its values for models to explore or references.
- sqlite: The database contents in BIRD.

Note

You have to download the latest dev databases in order to construct database in the MySQL and PostgreSQL. If you use the SQLite version only, you can use the original dev databases.

data: Each text-to-SQL pairs with the oracle knowledge evidence is stored as a json file, i.e., mini_dev_sqlite.json is stored on ./mini_dev_data/mini_dev_sqlite.json. In each json file, it has three main parts:
- db_id: the names of databases
- question: the questions curated by human crowdsourcing according to database descriptions, database contents.
- evidence: the external knowledge evidence annotated by experts for assistance of models or SQL annotators.
- SQL: SQLs annotated by crowdsource referring to database descriptions, database contents, to answer the questions accurately.
ground-truth SQL file: The SQL file should be stored at ./llm/mini_dev_data/mini_dev_sqlite_gold.sql.
llm: It contains source codes to convert texts to SQLs by calling APIs from LLMs, such as GPT35-turbo-instruct, gpt-35-turbo, gpt-4, gpt-4-32k, and gpt-4-turbo.

Mini-Dev Dataset in MySQL and PostgreSQL

You can locate the SQL queries within the mini_dev_mysql.json and mini_dev_postgresql.json files. These queries have been transpiled from the original SQLite versions using the sqlglot package, then refined manually and with GPT-4 Turbo. After downloading the Mini-Dev dataset, each database folder will contain .sql and command.script files. Follow the instructions below to set up the database in MySQL and PostgreSQL:

MySQL

Download and install the MySQL from the official website: https://dev.mysql.com/downloads/mysql/
Set the environment variables:

export PATH=$PATH:/usr/local/mysql/bin

Start the MySQL server:

sudo /usr/local/mysql/support-files/mysql.server start

Login to the MySQL server and create the database (password will be the one you set during the installation)

mysql -u root -p
CREATE DATABASE BIRD;

Construct the database by run the following command (You can find MySQL version database: BIRD_dev.sql in the MINIDEV_mysql folder):

mysql -u root -p BIRD < BIRD_dev.sql

Examples that how to run mysql query in the Python (with pymysql) can be find in the examples/mysql_example.ipynb file.
If you encounter the error: "this is incompatible with sql_mode=only_full_group_by", you can run the following command to disable the sql_mode:

select @@global.sql_mode;
SET GLOBAL sql_mode='{EVERYTHING SHOW IN THE ABOVE COMMAND EXCEPT ONLY_FULL_GROUP_BY}';

PostgreSQL

Download and install the postgresql from the official website: https://www.postgresql.org/download/
Download the pgAdmin4 from the official website: https://www.pgadmin.org/download/ (Recommended to monitor the database)
In pgADmin4/terminal create a new database called BIRD
Construct the database by run the following command (You can find PostgreSQL version database:BIRD_dev.sql in the MINIDEV_postgresql folder):

psql -U USERNAME -d BIRD -f BIRD_dev.sql

Examples that how to run mysql query in the Python (with Psycopg) can be find in the examples/postgresql_example.ipynb file.

In-Context Learning (ICL):

Environment Setup:

First, you need install openai in your python environment by:

conda create -n BIRD python=3.11.5
pip install requirements.txt

Collect results

Use this script to run the OpenAI model on the Azure cloud. (you may need to adjust parameters and paths with your preference):

cd ./llm/
sh ./run/run_gpt.sh

Evaluation:

Execution (EX) Evaluation:

Please post-process your collected results as the format: SQL and its db_id, which is splitted by '\t----- bird -----\t'. The examples are shown in the ./llm/exp_result/turbo_output/predict_mini_dev_gpt-4-turbo_cot_SQLite.json. Put the ground-truth sql file in the ./data/. And you may need to design a ChatGPT tag by your own. The main file for ex evaluation is located at ./llm/src/evaluation_ex.py.
Then you could evaluate the results by the following command line :

cd ./llm/
sh ./run/run_evaluation.sh

Reward-based Valid Efficiency Score (R-VES):

The main file for R-VES evaluation is located at ./llm/src/evaluation_ves.py. R-VES and EX can be evaluated in the same shell, so you can eval your efficiency via:

cd ./llm/
sh ./run/run_evaluation.sh

(For stable R-VES, you may need to enlarge timeout or repeat and average results. In our test evaluation, we will enlarge timeout to 3 s/ex; then we repeat 5 times for VES computation, only the highest results will be reported.)

In the latest version, we adjust the VES evaluation to be more stable and reliable. Instead of simply measuring the time ratio between predict and ground-truth SQLs, we now assign reward point based on the time ratio. The R-VES are calculated as follows:

Soft F1-Score Evaluation:

The main file for Soft F1-Score evaluation is located at ./llm/src/evaluation_f1.py. Soft-F1, VES and EX can be evaluated in the same shell, so you can eval your efficiency via:

cd ./llm/
sh ./run/run_evaluation.sh

Soft F1-Score:

Alongside the update to the Mini-Dev set, we introduced a new evaluation metric—the soft F1-score. This metric is specifically designed to assess the performance of text-to-SQL models by measuring the similarity between the tables produced by predicted SQL queries and those from the ground truth. In a nutshell, the soft F1-score is a more lenient metric that reduces the impact of column order and missing values in the tables produced by predicted SQL queries.

The following demonstrate how we calculate the soft F1-score.

Ground truth SQL resulted table:

Row
1	'Apple'	325
2	'Orange'
3	'Banana'	119

Predicted SQL resulted table:

Row
1	325	'Apple'
2	191	'Orange'
3		'Banana'

The soft F1-score is calculated as follows:

	Matched	Pred_only	Gold_only
Row 1	2	0	0
Row 2	1	1	0
Row 3	1	0	1

tp = SUM(Matched) = 4
fp = SUM(Pred_only) = 1
fn = SUM(Gold_only) = 1
Precision = tp / (tp + fp) = 4 / 5 = 0.8
Recall = tp / (tp + fn) = 4 / 5 = 0.8
F1 = 2 * Precision * Recall / (Precision + Recall) = 0.8

Baseline performance on Mini-Dev Dataset

EX Evaluation

	SQLite	MySQL	PostgreSQL
mixtral-8x7b	21.60	13.60	12.40
llama3-8b-instruct	24.40	24.60	18.40
phi-3-medium-128k-instruct	30.60	25.00	21.60
gpt-35-turbo-instruct	33.60	31.20	26.60
gpt-35-turbo	38.00	36.00	27.40
llama3-70b-instruct	40.80	37.00	29.40
TA + gpt-35-turbo	41.60	-	-
TA + llama3-70b-instruct	42.80	-	-
gpt-4-turbo	45.80	41.00	36.00
gpt-4-32k	47.00	43.20	35.00
gpt-4	47.80	40.80	35.80
TA + gpt-4-turbo	58.00	-	-
TA + gpt-4o	63.00	-	-

R-VES Evaluation

	SQLite	MySQL	PostgreSQL
mixtral-8x7b	20.41	12.99	14.16
llama3-8b-instruct	23.27	23.66	17.90
phi-3-medium-128k-instruct	29.54	24.12	21.07
gpt-35-turbo-instruct	32.28	30.39	26.14
gpt-35-turbo	37.33	34.94	26.80
llama3-70b-instruct	39.02	35.82	28.80
TA + gpt-35-turbo	40.59	-	-
TA + llama3-70b-instruct	41.37	-	-
gpt-4-turbo	44.79	39.37	35.23
gpt-4-32k	45.29	42.79	34.59
gpt-4	45.91	39.92	35.24
TA + gpt-4-turbo	56.44	-	-
TA + gpt-4o	60.86	-	-

Soft F1-Score Evaluation

	SQLite	MySQL	PostgreSQL
mixtral-8x7b	22.95	13.79	14.70
llama3-8b-instruct	27.87	27.49	19.35
phi-3-medium-128k-instruct	35.33	28.73	24.11
gpt-35-turbo-instruct	36.34	33.85	28.30
gpt-35-turbo	41.84	40.75	30.22
TA + gpt-35-turbo	44.25	-	-
llama3-70b-instruct	44.38	40.95	31.43
TA + llama3-70b-instruct	46.66	-	-
gpt-4-turbo	50.08	45.96	38.36
gpt-4-32k	51.92	47.38	39.55
gpt-4	52.69	45.78	38.96
TA + gpt-4-turbo	62.40	-	-
TA + gpt-4o	66.97	-	-

Predict SQLs

We drop the predicted SQLs of baseline models under ./llm/exp_result/sql_output_kg/ for reference.

Time Ratio Distribution

Acknowledgement

Main contributors to the Mini-Dev project: Xiaolong Li, Jinyang Li, Ge Qu, Binyuan Hui, Reynold Cheng, Chenhao Ma.

We extend our sincere gratitude to the invaluable feedbacks from the open community, including github reviewers (@freiz @nnarodytska @josem7 @wbbeyourself @ronch99 @tshu-w ) and those who reached out to us via email with their valuable suggestions.

For any questions, please contact us by bird.bench23@gmail.com.

My To-Do List

Release BIRD MINI DEV
Implement TA-SQL as ICL Reasoning Baseline.
Implement more open-source LLMs.
Release cleaner dev.json.
Release cleaner train.json.
BIRD-SQL 2024-2025, a broad new research problem.

Citation

Please cite the repo if you think our work is helpful to you.

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

BIRD-SQL Mini-Dev

Overview

Difficulty Distribution

Database Distribution

Keywords Statistic

Dataset Introduction

Mini-Dev Dataset in MySQL and PostgreSQL

MySQL

PostgreSQL

In-Context Learning (ICL):

Environment Setup:

Collect results

Evaluation:

Execution (EX) Evaluation:

Reward-based Valid Efficiency Score (R-VES):

Soft F1-Score Evaluation:

Soft F1-Score:

Baseline performance on Mini-Dev Dataset

EX Evaluation

R-VES Evaluation

Soft F1-Score Evaluation

Predict SQLs

Time Ratio Distribution

Acknowledgement

My To-Do List

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

BIRD-SQL Mini-Dev

Overview

Difficulty Distribution

Database Distribution

Keywords Statistic

Dataset Introduction

Mini-Dev Dataset in MySQL and PostgreSQL

MySQL

PostgreSQL

In-Context Learning (ICL):

Environment Setup:

Collect results

Evaluation:

Execution (EX) Evaluation:

Reward-based Valid Efficiency Score (R-VES):

Soft F1-Score Evaluation:

Soft F1-Score:

Baseline performance on Mini-Dev Dataset

EX Evaluation

R-VES Evaluation

Soft F1-Score Evaluation

Predict SQLs

Time Ratio Distribution

Acknowledgement

My To-Do List

Citation