Slight edits to README files (#12)
wendy-aw committed Aug 17, 2023
1 parent 73b6623 commit eb6a474
Showing 2 changed files with 22 additions and 18 deletions.
34 changes: 19 additions & 15 deletions README.md
@@ -2,23 +2,25 @@

[![tests](https://github.com/defog-ai/sql-generation-evaluation/actions/workflows/main.yml/badge.svg)](https://github.com/defog-ai/sql-generation-evaluation/actions/workflows/main.yml)

This repository contains the code that Defog uses for sql generation evaluation. It is based off the [spider](https://github.com/taoyds/spider) datasets' schema, but with a new set of hand-selected questions and queries grouped by query category.
This repository contains the code that Defog uses for the evaluation of LLM-generated SQL. It's based on the schema from the [Spider](https://github.com/taoyds/spider) dataset, but with a new set of hand-selected questions and queries grouped by query category.

## Introduction

Our testing procedure comprises the following steps. For each question/query pair:
1. We generate a query (could be from a LLM).
2. We run both the "gold" query and the generated query on their respective postgres database and obtain 2 dataframes with the results.
1. We generate a SQL query from an LLM.
2. We run both the "gold" query and the generated query on their respective Postgres database to obtain 2 dataframes with the results.
3. We compare the 2 dataframes using an "exact" and a "subset" match. TODO add link to blogpost.
4. We log these alongside other metrics of interest (eg tokens used, latency) and aggregate the results for reporting.
4. We log these alongside other metrics of interest (e.g. tokens used, latency) and aggregate the results for reporting.
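
To make steps 3 and 4 concrete, here is a rough pandas sketch of what an "exact" and a "subset" match could look like. The helper names and matching rules below are illustrative assumptions, not the repository's actual implementation.

```python
import pandas as pd

def is_exact_match(df_gold: pd.DataFrame, df_gen: pd.DataFrame) -> bool:
    # Illustrative: same shape and same values, ignoring column names and row order.
    if df_gold.shape != df_gen.shape:
        return False
    gold = df_gold.sort_values(by=list(df_gold.columns)).reset_index(drop=True)
    gen = df_gen.sort_values(by=list(df_gen.columns)).reset_index(drop=True)
    return bool((gold.values == gen.values).all())

def is_subset_match(df_gold: pd.DataFrame, df_gen: pd.DataFrame) -> bool:
    # Illustrative: every column of the gold result appears (as a multiset of
    # values) somewhere among the generated result's columns.
    gen_cols = [sorted(df_gen[c].astype(str)) for c in df_gen.columns]
    return all(sorted(df_gold[c].astype(str)) in gen_cols for c in df_gold.columns)

gold = pd.DataFrame({"name": ["Alice", "Bob"], "total": [3, 5]})
gen = pd.DataFrame({"full_name": ["Bob", "Alice"], "total": [5, 3], "extra": [1, 2]})
print(is_exact_match(gold, gen))   # False: the generated result has an extra column
print(is_subset_match(gold, gen))  # True: both gold columns are present
```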

## Getting Started

This is a comprehensive set of instructions that assumes basic familiarity with the command line, docker, running SQL queries on a database, and common python data manipulation libraries involved (`pandas`).
This is a comprehensive set of instructions that assumes basic familiarity with the command line, Docker, running SQL queries on a database, and common Python data manipulation libraries (e.g. pandas).

### Start Postgres Instance

Firstly, you would need to setup the databases used to run the queries on. We use postgres here, since it is the most common OSS database with the widest distribution and usage in production. In addition, we would recommend using docker to do this, as it is the easiest way to get started. You can install docker [here](https://docs.docker.com/get-docker/). Once you have docker installed, you can create the docker container, and then start the postgres database using the following commands. We recommend mounting a volume on `data/postgres` to persist the data, as well as `data/export` to make it easier to import the data. To create the container, run:
First, you need to set up the databases that the queries are executed on. We use Postgres here, since it is the most common OSS database with the widest distribution and usage in production. We also recommend using Docker for this, as it is the easiest way to get started. You can install Docker [here](https://docs.docker.com/get-docker/).

Once you have Docker installed, you can create the Docker container and start the Postgres database using the following commands. We recommend mounting a volume on `data/postgres` to persist the data, as well as `data/export` to make it easier to import the data. To create the container, run:

```bash
mkdir data/postgres data/export
@@ -30,35 +32,37 @@ To start the container, run:
docker start postgres-sql-eval
```

If you want to reset the postgres server instance's state (eg memory leaks from transient connections), you can turn it off (and start it back up after):
If you want to reset the Postgres server instance's state (e.g. memory leaks from transient connections), you can turn it off (and start it back up after):
```bash
docker stop postgres-sql-eval
# see that the container is still there:
docker container list -a
```

Some notes:
- You would need to stop other postgres instances listening on port 5432 before running the above command.
- You would need to stop other Postgres instances listening on port 5432 before running the above command.
- You only need to run the `docker create ...` step once to create the container; after that, `docker start/stop postgres-sql-eval` is all you need.
- The data is persisted in `data/postgres`, so turning it off isn't critical. On the other hand, if you delete the `data/postgres` folder, then all is lost T.T
- While we will use docker for deploying postgres and the initialization, you are free to modify the scripts/instructions to work with your local installation.
- While we will use Docker for deploying Postgres and the initialization, you are free to modify the scripts/instructions to work with your local installation.


### Import data into Postgres
### Import Data into Postgres

The data for importing is already in the exported sql dumps in the `data/export` folder. Each sql file corresponds to its own database (eg `data/export/academic.sql` contains all the data required to reload the academic database). We will create a new database for each database, in `postgres-sql-eval`.
The data for importing is already in the exported SQL dumps in the `data/export` folder. Each SQL file corresponds to a single database (e.g. `data/export/academic.sql` contains all the data required to reload the 'academic' database). We will create a new database in `postgres-sql-eval` for each of the 7 SQL files with the following command.

```bash
./data/init_db.sh
```
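
`./data/init_db.sh` is the supported way to do this. Purely to illustrate the idea, a rough Python sketch of the same loop might look like the following; the container-internal path `/export` and the `createdb`/`psql` invocations are assumptions, not necessarily what the script does.

```python
import pathlib
import subprocess

# Illustrative sketch only -- assumes the postgres-sql-eval container is running
# and that data/export is mounted inside it at /export (an assumption).
for dump in sorted(pathlib.Path("data/export").glob("*.sql")):
    db_name = dump.stem  # e.g. "academic" for data/export/academic.sql
    # Create a fresh database for this dump...
    subprocess.run(
        ["docker", "exec", "postgres-sql-eval", "createdb", "-U", "postgres", db_name],
        check=True,
    )
    # ...then load the dump into it.
    subprocess.run(
        ["docker", "exec", "postgres-sql-eval", "psql", "-U", "postgres",
         "-d", db_name, "-f", f"/export/{dump.name}"],
        check=True,
    )
```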

### Query Generator

To test your own query generator with our framework, you would need to extend `QueryGenerator` and implement the `generate_query` method returning the query of interest. We create a new class for each question/query pair to isolate each pair's runtime state against the others when running concurrently. You can see a sample `OpenAIQueryGenerator` in `query_generators/openai.py` implementing it and using a simple prompt to send a message over to openai's api. Feel free to extend it for your own use. If there are functions that are generally useful for all query generators, it can be put in the `utils` folder. If you need to incorporate specific verbose templates (e.g. for prompt testing), you can put them in the `prompts` folder, and import them. Being able to version control the prompts in a central place has been a productivity win for our team.
To test your own query generator with our framework, you would need to extend `QueryGenerator` and implement the `generate_query` method to return the query of interest. We create a new instance for each question/query pair to isolate each pair's runtime state from the others when running concurrently. You can see the sample `OpenAIQueryGenerator` in `query_generators/openai.py`, which implements this and uses a simple prompt to send a message to OpenAI's API. Feel free to extend it for your own use.
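
As a rough sketch of the shape of such a subclass (the base-class stand-in, constructor arguments and return type below are assumptions for illustration; see `query_generators/openai.py` for the real interface):

```python
class QueryGenerator:  # stand-in for the repo's actual base class
    def generate_query(self, question: str) -> str:
        raise NotImplementedError


def call_my_llm(prompt: str) -> str:
    """Placeholder for your actual model call."""
    return "SELECT COUNT(*) FROM singer;"


class MyQueryGenerator(QueryGenerator):
    """Illustrative: one instance per question/query pair, so per-pair state
    (prompts, token counts, timings) never leaks across concurrent runs."""

    def __init__(self, prompt_template: str, db_name: str):
        self.prompt_template = prompt_template
        self.db_name = db_name

    def generate_query(self, question: str) -> str:
        # Build the prompt, call the model, and return just the SQL string.
        prompt = self.prompt_template.format(user_question=question)
        return call_my_llm(prompt)
```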

If there are functions that are generally useful for all query generators, they can be placed in the `utils` folder. If you need to incorporate specific verbose templates (e.g. for prompt testing), you can store them in the `prompts` folder, and later import them. Being able to version control the prompts in a central place has been a productivity win for our team.

### Runner

Having implemented the query generator, the next piece of abstraction would be the runner. The runner calls the query generator, and is responsible for handling the configuration of work (e.g. parallelization/batching/model selected etc) to the query generator for each question/query pair. We have provided 2 most common runners: `eval/openai_runner.py` for calling openai's api (with parallelization support) and `eval/hf_runner.py` for calling a local huggingface model. When testing your own query generator with an existing runner, you can replace the `qg_class` in the runner's code with your own query generator class.
Having implemented the query generator, the next piece of abstraction is the runner. The runner calls the query generator and is responsible for passing the configuration of work (e.g. parallelization, batching, model selection) to the query generator for each question/query pair. We provide the 2 most common runners: `eval/openai_runner.py` for calling OpenAI's API (with parallelization support) and `eval/hf_runner.py` for calling a local Hugging Face model. When testing your own query generator with an existing runner, you can replace the `qg_class` in the runner's code with your own query generator class.
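
Conceptually, a stripped-down runner might look like the sketch below; the real runners also handle batching, retries, query execution and metric logging, and the column names here are assumptions.

```python
def run_eval(questions: list[dict], qg_class, qg_kwargs: dict) -> list[dict]:
    """Illustrative control flow only -- see eval/openai_runner.py for the real thing."""
    results = []
    for item in questions:  # one question/query pair per row of the eval file
        qg = qg_class(**qg_kwargs)  # fresh generator per pair -> isolated state
        generated_sql = qg.generate_query(item["question"])
        results.append(
            {
                "question": item["question"],
                "gold_query": item["query"],
                "generated_query": generated_sql,
            }
        )
    # The real runners would now execute both queries against Postgres,
    # compare the resulting dataframes, and aggregate the metrics.
    return results
```

Swapping in your own generator is then just a matter of passing a different `qg_class`.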

### Running the test

@@ -78,7 +82,7 @@ python main.py \
-v
```

#### HuggingFace
#### Hugging Face
To test it out with just 10 questions (instead of all 175):

```bash
@@ -94,7 +98,7 @@ python -W ignore main.py \
-n 10
```

You can explore the results generated and aggregated the various metrics that you care about to understand your query generator's performance. Happy iterating!
To better understand your query generator's performance, you can explore the results generated and aggregated for the various metrics that you care about. Happy iterating!

## Misc

6 changes: 3 additions & 3 deletions prompts/README.md
@@ -1,5 +1,5 @@
# Defining your prompt
You can define your prompt in the following structure way
You can define your prompt using the following structure.

```
### Instructions:
@@ -16,7 +16,7 @@ THE RESPONSE TEXT FOR THE MODEL
```

# Adding variables
You can add variables using curly braces - like so `{user_question}`. Then, these can be updated at runtime using Python's `.format()` function for strings. Like [here](../eval/hf_runner.py#L18)
You can add variables using curly braces, like so: `{user_question}`. These can then be filled in at runtime using Python's `.format()` method for strings, as done [here](../eval/hf_runner.py#L18).
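
For example, a minimal illustration of the pattern (the template text and the `{table_metadata_string}` placeholder are just stand-ins for your actual prompt file):

```python
prompt_template = """### Instructions:
Your task is to convert a question into a SQL query, given a database schema.

### Input:
Generate a SQL query that answers the question `{user_question}`.
The query will run on a database with the following schema:
{table_metadata_string}

### Response:
"""

# Fill in the placeholders at runtime:
prompt = prompt_template.format(
    user_question="How many singers do we have?",
    table_metadata_string="CREATE TABLE singer (singer_id INT, name TEXT);",
)
print(prompt)
```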

# Translating to OpenAI's messages prompt
If evaluating OpenAI's chat models, please ensure that your prompt always has the keywords `### Instructions:`, `### Input:`, and `### Response:` in them. This will help ensure that the model is automatically converted to OpenAI's `system`, `user`, and `assistant` prompts. The section under Instructions is mapped to the `system` prompt, the section under Input is mapped to the `user` prompt, and the section under Response is mapped to the `assistant` prompt
If you're performing evaluation with OpenAI's chat models, please ensure that your prompt contains the keywords `### Instructions:`, `### Input:`, and `### Response:`. This will help ensure that the prompt sections are automatically mapped to OpenAI's different prompt roles. The text under Instructions, Input and Response will be converted to the `system`, `user` and `assistant` prompts respectively.
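
A rough sketch of how such a mapping could be implemented (an illustration of the idea, not the repository's actual conversion code):

```python
def prompt_to_openai_messages(prompt: str) -> list[dict]:
    """Split an Instructions/Input/Response prompt into OpenAI chat messages."""
    keywords = [
        ("### Instructions:", "system"),
        ("### Input:", "user"),
        ("### Response:", "assistant"),
    ]
    messages = []
    for i, (keyword, role) in enumerate(keywords):
        start = prompt.index(keyword) + len(keyword)
        end = prompt.index(keywords[i + 1][0]) if i + 1 < len(keywords) else len(prompt)
        messages.append({"role": role, "content": prompt[start:end].strip()})
    return messages

# Applied to the `prompt` built in the sketch above, this yields a system,
# a user, and an (empty) assistant message.
```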
