I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation

TL;DR: We propose a framework for LLMs to seek user support, design evaluation metrics to measure the trade-off between performance boost and user burden, and empirically assess this ability on Text-to-SQL generation.

Paper link: https://arxiv.org/abs/2407.14767

Figure 1: Overview of our experiments on text-to-SQL. LLMs struggle to determine when they need help based solely on the instruction (x) or their output (y). They require external feedback, such as the execution results (r) from the database, to outperform random baselines.

Main Results

| Methods / LLMs   | Wizard | Llama3 | DPSeek | GPT-3.5 | Mixtral | GPT-4t | GPT-4o |
|------------------|--------|--------|--------|---------|---------|--------|--------|
| Random Baseline  | 0.5000 | 0.5000 | 0.5000 | 0.5000  | 0.5000  | 0.5000 | 0.5000 |
| Direct Ask       | 0.4915 | 0.4834 | 0.4976 | 0.4390  | 0.5301  | 0.5758 | 0.5479 |
| Write then Ask   | 0.4759 | 0.4497 | 0.4857 | 0.4735  | 0.5677  | 0.5807 | 0.5740 |
| Execute then Ask | 0.5096 | 0.4987 | 0.5848 | 0.6313  | 0.6242  | 0.6641 | 0.5989 |

Table 1: Area Under Delta-Burden Curve (AUDBC) across different methods and LLMs. Text in bold denotes the method with the best performance, while underlined text means better than random (uniform sampling of â ∈ [0, 1]). For the details of AUDBC, please refer to our paper.
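
AUDBC summarizes a trade-off curve as an area. As a rough illustration only (the exact definition and normalization are given in the paper), the area under a delta-burden curve can be approximated with the trapezoidal rule; the curve points below are made-up placeholders, not values from Table 1:

import numpy as np

# Hypothetical trade-off curve: x = user burden (fraction of examples where the
# model asks for help), y = accuracy gain over never asking. Illustrative values only.
burden = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
delta = np.array([0.0, 0.08, 0.13, 0.17, 0.20])

# Trapezoidal approximation of the area under the curve.
audbc = float(np.sum(0.5 * (delta[1:] + delta[:-1]) * np.diff(burden)))
print(f"illustrative area under the delta-burden curve: {audbc:.4f}")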

How to Run Experiments

Install Dependencies

pip install -r requirements.txt

Download the Text-to-SQL Databases

python download_text2sql_data.py

The script automatically downloads and extracts the BIRD Text-to-SQL databases into the ./data directory.
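
To sanity-check the download, you can list the extracted SQLite databases and open one of them. The glob pattern below assumes the BIRD databases end up as .sqlite files somewhere under ./data; adjust it if the actual layout differs:

import glob
import sqlite3

# Find SQLite database files extracted under ./data (path layout is an assumption).
db_paths = glob.glob("./data/**/*.sqlite", recursive=True)
print(f"found {len(db_paths)} SQLite databases")

if db_paths:
    # Open the first database read-only and list its tables.
    conn = sqlite3.connect(f"file:{db_paths[0]}?mode=ro", uri=True)
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    print(db_paths[0], "->", [t[0] for t in tables])
    conn.close()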

Run the Main Script

Before running the script, make sure to set your OpenAI API key:

export OPENAI_API_KEY=<your-api-key>

Suppose you want to test the performance of gpt-4o-mini-2024-07-18:

python src/run.py \
    --series "openai" \
    --model_name "gpt-4o-mini-2024-07-18" \
    --method "EA"  # ["DA", "WA", "EA"]

Abbreviations:

  • DA: Direct Ask
  • WA: Write then Ask
  • EA: Execute then Ask
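
Conceptually, the three methods differ only in what evidence the model sees before deciding whether to ask for user support: the instruction x alone (DA), plus its drafted SQL y (WA), or additionally the execution result r from the database (EA). The sketch below only illustrates that difference; llm and execute_sql are hypothetical stand-ins, not functions from src/run.py, and the prompts are not the repository's actual prompts:

# Illustrative only: how the three ask strategies differ in their inputs.

def direct_ask(llm, question):
    # DA: decide from the instruction (x) alone.
    return llm(f"Question: {question}\nDo you need the user's clarification? Answer Yes or No.")

def write_then_ask(llm, question):
    # WA: first draft the SQL (y), then decide with the draft in context.
    sql = llm(f"Write a SQL query for: {question}")
    return llm(f"Question: {question}\nDraft SQL: {sql}\nDo you need the user's clarification?")

def execute_then_ask(llm, execute_sql, question):
    # EA: additionally show the execution result (r) from the database.
    sql = llm(f"Write a SQL query for: {question}")
    result = execute_sql(sql)
    return llm(
        f"Question: {question}\nDraft SQL: {sql}\nExecution result: {result}\n"
        "Do you need the user's clarification?"
    )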

As the script runs, the results are written to the file ./results/<series>_<model_name>.jsonl.
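
Each line of that file is a JSON record for one example. The snippet below loads the records and prints one so you can inspect the actual schema before writing any analysis code (no field names are assumed here):

import json

# Load the per-example records written by src/run.py (path from the example above).
path = "./results/openai_gpt-4o-mini-2024-07-18.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records")
# Print one record to see the actual fields.
print(json.dumps(records[0], indent=2))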

Visualize the Results

To visualize the performance curves (Delta-Burden Curve, PR Curve, and Flip Rate Curve) and inspect the performance of each method, run:

python src/visualize.py \
    --jsonl "./results/openai_gpt-4o-mini-2024-07-18.jsonl" \  # path to the jsonl file
    --methods "Random EA"  # specify the methods to plot

If you want to plot all methods, you can specify --methods "Random DA WA EA".

Playground

See playground.ipynb for a step-by-step walkthrough of how to obtain the "need-user-support probability" with toy examples.
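
The notebook walks through the exact procedure. As a rough sketch of one common approach (an assumption, not a copy of the notebook), a yes/no judgment can be turned into a probability by reading the log-probabilities of the first answer token via the OpenAI Chat Completions API:

import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Question: List the names of schools with more than 500 students.\n"
    "Draft SQL: SELECT name FROM schools WHERE enrollment > 500;\n"
    "Do you need the user's clarification to answer correctly? Reply Yes or No."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# Sum the probability mass that the first answer token is a "Yes"-like token.
top = resp.choices[0].logprobs.content[0].top_logprobs
p_yes = sum(math.exp(t.logprob) for t in top if t.token.strip().lower().startswith("yes"))
print(f"need-user-support probability (illustrative): {p_yes:.3f}")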
