Skip to content

yizenov/l1-error

Repository files navigation

Sub-optimal Join Order Identification with L1-error

Table of Contents

  1. Experimental Setup
  2. L1-error Evaluation
  3. IMDB Dataset
  4. JOB-light and Join Order Benchmark (JOB) workloads
  5. JCC-H and Join DSB workloads
  6. Install PostgreSQL and COMPASS
  7. Questions
  8. Acknowledgments
  9. References
  10. Citation

1. Experimental Setup

L1-error was developed in the following environment:

  • Machine: 2x Intel(R) Xeon(R) CPU E5-2660 v4 (56 CPU cores and 256GB memory)
  • NVIDIA Tesla K80 GPU, CUDA v11.4
  • OS: Ubuntu 22.04 LTS, Clang/LLVM v14.0.0, gcc v11.3.0
  • Python 3.10.6 (sklearn v1.1.2, numpy v1.21.5, pandas v1.4.3, matplotlib v3.5.3)
  • Docker v20.10.12, NVIDIA Docker v2.12.0

2. L1-error Evaluation

We provide Python scripts, which one can use to replicate L1-error results. The results can be used to reproduce the figures and tables in the paper. All the figures are stored in ods files and provided here. The input data for the Python scripts are already provided here with instructions. In case one decides to reproduce the input data, corresponding Python scripts are also provided.

Step-by-step instructions are provided here.

3. IMDB Dataset

The dataset that was used is Internet Movie Data Base (IMDB). The original data is publicly available (ftp://ftp.fu-berlin.de/pub/misc/movies/database/) in txt files, and the open-source imdbpy package was used to transform txt files to CSV files in [1]. See more details here. This 3.6GB snapshot is from May 2013, and it can be downloaded from here or here. The dataset includes 21 CSV files i.e., 21 relations in total. The package also includes queries to create the necessary relations written in schema.sql or schematext.sql files. Lastly, in addition to primary keys, there are queries to create foreign keys in case one decides to use them.

However, there were issues in bulk loading the original dataset, SQL syntax errors, and missing primary key values in referenced tables. To make sure PostgreSQL and COMPASS have the same dataset, we fixed those errors in data, schema, and indexes.

An alternative solution to bulk load the original dataset, follow the commands below:

  • sed -e 's/\\\\\"/"/g' -e 's/\\"//g' company_name.csv > company_name_temp.csv. This eliminates \\" and \" characters that cause the PostgreSQL to fail ignoring commas within two double quotes during the bulk loading. There are 10 out of 21 tables have this issue (company_name, aka_title, aka_name, title, movie_companies, person_info, char_name, name, movie_info, and cast_info). The remaining 11 tables have no issues.
  • \COPY company_name FROM 'company_name_temp.csv' DELIMITER ',' NULL '' CSV;. It is necessary to include CSV in the command to ignore commas between two double quotes.

4. JOB-light and Join Order Benchmark (JOB)

The workloads used to evaluate L1-error are JOB-light and Join Order Benchmark (JOB).

  • JOB-light consists of 70 star-structure queries with equijoins. Join sizes 2-5, join predicates 1-4, and tables 2-5. The first three queries consist of only two tables.
  • JOB consists of 113 queries in total, including 33 query families with equijoins, and each family's queries differ only in selection predicates. Join sizes 2-17, join predicates 4-28, and tables 2-17.

There were differences in SQL syntax and their execution. Thus the results of selection predicates in PostgreSQL and COMPASS were different. Thus we adjusted the queries so that both systems have the same selectivities and cardinality.

5. JCC-H and DSB

Additionally, we evaluated L1-error on two other benchmarks, JCC-H and DSB. All necessary data including queries, query plans, and cardinality estimates are provided here and here, respectively.

6. Install PostgreSQL and COMPASS

We collected cardinality estimations from one well-known and one recently proposed query optimizer:

  • PostgreSQL v15.1
  • COMPASS is built as an extension on top of a clone of MapD System, version 3.6.1, rebranded to OmniSciDB and then to HeavyDB

To replicate the PostgreSQL docker image for runtime comparison, one can follow these instructions.

7. Questions

If you have questions, please contact:

8. Acknowledgments

This work is supported by NSF award (number 2008815).

9. References

[1] Query optimization through the looking glass, and what we found running the Join Order Benchmark
[2] JCC-H: adding join crossing correlations with skew to TPC-H
[3] DSB: a decision support benchmark for workload-driven and traditional database systems

10. Citation

@misc{l1-github,
  author = {Yesdaulet Izenov},
  title = "{Sub-optimal Join Order Identification with L1-error}",
  howpublished = "\url{https://github.com/yizenov/l1-error}"
}