Compress database schemata to reduce cost for LLM processing
Tested with c5.4xlarge EC 2 instance with Ubuntu 22.04 installed. All commands are executed from Ubuntu user home directory.
- Download benchmark schemata here, here, and here.
- Install Gurobi for Python:
sudo pip install gurobipy
- Download Gurobi solver (tested with version 10.0.3):
wget https://packages.gurobi.com/10.0/gurobi10.0.3_linux64.tar.gz
- Unpack Gurobi solver:
tar xvfz gurobi10.0.3_linux64.tar.gz
- Add the following to your
.bashrc
file:
export GUROBI_HOME="/home/ubuntu/gurobi1003/linux64"
export PATH="${PATH}:${GUROBI_HOME}/bin"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${GUROBI_HOME}/lib"
- Re-read the changed file:
source .bashrc
. - Check that Gurobi is installed:
gurobi_cl --version
. - Install a license to enable solving large problems. E.g., the experiments used a Gurobi academic license WLS. For this license, download the
gurobi.lic
file and copy it into the home directory of the server executing optimization.
To extract schemata from the SPIDER benchmark, use the file src/sc/spider.py
with the following parameters:
Parameter | Explanation |
---|---|
inpath | Path to schema.json in the SPIDER directory |
top_k | How many schemata to extract |
outdir | Write extracted schemata into this directory |
Use src/sc/benchmark/performance.py
to compare different schema compression methods in terms of their run time and compression ratio. The script takes the following command line parameters:
Parameter | Explanation |
---|---|
inputdir | Path to directory containing .sql files with schema definitions |
timeout_s | Timeout in seconds per test case and per compression baseline |
outpath | Path to .json file with benchmark results to be created |
Optionally, users can specify the following flags for ablation studies:
Flag | Explanation |
---|---|
--nostart | Do not use greedy solutions as ILP start |
--nohints | Do not specify hints for ILP variables |
--nomerge | Do not merge column annotations together |
--noilp | Do not execute ILP approach |
E.g., assuming that python3.10
is the Python interpreter, generate results via the following command on Ubuntu:
PYTHONPATH=src python3.10 src/sc/benchmark/performance.py /home/ubuntu/publicbi 1200 publicbi.json &> publicbiLog &