Skip to content

Latest commit

 

History

History
404 lines (309 loc) · 15 KB

2_intro-sql-2.md

File metadata and controls

404 lines (309 loc) · 15 KB
jupytext kernelspec
formats text_representation
md:myst
extension format_name format_version jupytext_version
.md
myst
0.13
1.16.2
display_name language name
Python 3
python
python3
Open In Colab

2. Learn to quack SQL with DuckDB: Group by, Joins and Subqueries

To start off, install the latest version of duckdb and magic-duckdb to run this notebook.

!pip install --upgrade duckdb magic-duckdb --quiet
%load_ext magic_duckdb

Example Tables

Let's start with two datasets:

  • {Download}birds.csv<./data/birds.csv>: a list of measurements of individuals of different bird species
  • {Download}ducks.csv<./data/ducks.csv>: a list of scientific names of bird species that are ducks

To download the datasets directly from GitHub, run:

!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/birds.csv -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/ducks.csv -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_2.zip -q 
!unzip -o answers_2.zip -d answers 

To create the tables in your database, run:

%%dql
CREATE TABLE birds AS SELECT * FROM read_csv('birds.csv');
CREATE TABLE ducks AS SELECT * FROM read_csv('ducks.csv');

To begin understanding the data contained in these tables, you can run:

%%dql
SUMMARIZE birds;
%%dql
SUMMARIZE ducks;
Create a new table `birds_measurements` from the file `birds.csv` (this file contains the names and measurements of individuals from over 10k bird species).
# Uncomment and run to show solution
# !cat ./answers/answer_2.01.sql
Create a new table `ducks_species` from the file `ducks.csv` (this file contains species names and common names of ducks).
# Uncomment and run to show solution
# !cat ./answers/answer_2.02.sql

1. Aggregate Functions

The functions we saw previously when building calculated columns operated on each row of the table individually. In contrast, aggregate functions summarize many rows of the table. By default, they will summarize all rows (stay tuned though!). For example, let's find the minimum and maximum Beak_Width of any bird in the dataset.

%%dql
SELECT 
    MIN(Beak_Width) AS Min_Beak_Width,
    MAX(Beak_Width) AS Max_Beak_Width
FROM birds;

However, aggregating an entire table all the way up to just a single row isn't always what we are looking for. Next, we will use the GROUP BY clause to apply aggregate functions to groups of rows instead of all rows.

2. Group Rows (GROUP BY Clause)

To group the rows based on a specific column (or columns) and perform aggregate functions, you can use the GROUP BY clause. For example, if you want to group the birds by their species name and calculate the average Beak_Width, Beak_Depth and Beak_Length_Culmen for each group, you can run this query:

%%dql
SELECT
    Species_Common_Name,
    AVG(Beak_Width) AS Avg_Beak_Width,
    AVG(Beak_Depth) AS Avg_Beak_Depth,
    AVG(Beak_Length_Culmen) AS Avg_Beak_Length_Culmen
FROM birds
GROUP BY Species_Common_Name;

This command groups the rows by the Species_Common_Name column and calculates the average Beak_Width, Beak_Depth and Beak_Length_Culmen for the individuals in each bird species group.

Multiple columns can be included within a GROUP BY clause, separated by commas. In this example, we measure the maximum wing_length by Country_WRI and Source. This example shows that these columns do not have to be hierarchically related - the GROUP BY will show all combinations of data in those columns.

%%dql 
SELECT 
    Country_WRI,
    Source,
    MAX(wing_length)
FROM birds 
GROUP BY
    Country_WRI,
    Source

Run a query that gets the average `Beak_Length_Culmen`, `Wing_Length` and `Tail_Length` for all birds.
# Uncomment and run to show solution
# !cat ./answers/answer_2.03.sql
Run a query that finds the average `Tail_Length` by `Species_Common_Name` and by `Country_WRI`. 
# Uncomment and run to show solution
# !cat ./answers/answer_2.04.sql

Getting the 95th percentile of a column value

We've used GROUP BY to group by a certain column, and used an aggregate function to combine other columns in our query, for instance, by taking the average. But, what if we want to get the 95th percentile of a column value? DuckDB has a built-in aggregate function for that too! For instance, to get the 95th percentile value of the Beak_Length_Culmen of all birds, run:

%%dql
SELECT 
    QUANTILE_CONT(Beak_Length_Culmen, 0.95)
FROM birds;
Run a query that gets the 95<sup>th</sup> percentile and 99<sup>th</sup> percentile of `Beak_Length_Culmen` for all birds.
# Uncomment and run to show solution
# !cat ./answers/answer_2.05.sql
Run a query that gets the 99<sup>th</sup> percentile of `Wing_Length` by `Species_Common_Name`.
# Uncomment and run to show solution
# !cat ./answers/answer_2.06.sql

3. Understanding SQL Joins

INNER JOIN (JOIN)

In SQL, a Join operation allows you to combine rows from two or more tables based on a related column between them. This is incredibly useful when you need to pull together related information that is stored in different tables.

Let's combine the birds and ducks tables to find the Beak_Length_Culmen of all birds that are ducks. To do this, we'll use a SQL Join operation. Specifically, we'll use an INNER JOIN, which combines rows from both tables only when there is a match in the Species_Common_Name column.

%%dql
SELECT
    birds.column00 as id,
    birds.Species_Common_Name,
    birds.Beak_Length_Culmen,
    ducks.author
FROM birds
INNER JOIN ducks ON birds.Species_Common_Name = ducks.name;

Step-by-Step Explanation

Let's break down the SQL query step by step:

SELECT birds.column00 as id, birds.Species_Common_Name, birds.Beak_Length_Culmen, ducks.author: We're selecting the species id, name and beak length from the birds table, and the duck species author from the ducks table.

Up until now, we haven't needed to specify which table a column is coming from since we have been working with just one table! Now we prefix column names with the name of the table they come from. As a note, this is not required if the column names in the two tables are completely different from one another, but it is a good best practice.

FROM birds: We're starting with the birds table.

INNER JOIN ducks ON birds.Species_Common_Name = ducks.name: We're joining the birds table to the ducks table where the species' common name matches in both tables. We are using table prefixes again for clarity.

INNER JOIN Gotchas

NOTE: When using an INNER JOIN, we only show output rows where there are matching values in both tables. This has dramatically reduced the number of output rows since now we are only looking at ducks!

NOTE: If a join between 2 tables results in multiple matches, all matches will be returned. This can mean that your result can actually return more rows after a join, in some cases! (Imagine that we had messy data in ducks.csv, and one species mistakenly had multiple authors. We would have 1 row in our result for each author.)

NOTE: INNER JOIN is the default kind of join in SQL. So if you see a query that just says ... table1 JOIN table2 ..., then it is using an INNER JOIN! It is common practice to omit INNER.

NOTE: It is possible to join on multiple columns. For example, imagine wanting to connect two tables by matching both a first name column and last name column. Inequality conditions are also possible (as we will see later!).

Run a query that gets the name, `Beak_Length_Culmen`, `Wing_Length` and `Tail_Length` of birds that are ducks.
# Uncomment and run to show solution
# !cat ./answers/answer_2.07.sql
Let's run a similar query, but group the ducks by species. Run a query that gets the `Species_Common_Name`, _average_ `Beak_Length_Culmen`, `Wing_Length` and `Tail_Length` of birds that are ducks, and sort the results by `Species_Common_Name`.
# Uncomment and run to show solution
# !cat ./answers/answer_2.08.sql

LEFT OUTER JOIN (LEFT JOIN)

A LEFT OUTER JOIN will keep all rows from the LEFT table in the join (the table before the LEFT JOIN keywords), even if there is not a match in the table on the right. Any rows that do not have a match in the right table will have the value NULL for all columns from the right table. NULL is the missing value indicator in SQL.

This can be useful when adding optional details. For example, in our situtation, ducks will have an author, but all other birds will not.

%%dql
SELECT
    birds.column00 as id,
    birds.Species_Common_Name,
    birds.Beak_Length_Culmen,
    ducks.author
FROM birds
LEFT JOIN ducks ON birds.Species_Common_Name = ducks.name;

Notice how the LEFT JOIN query has 90371 rows in the result (the same number of rows as the birds table), and the INNER JOIN query only had 662 rows.

The author column contains the Python missing value indicator of None, which is equivalent to SQL's NULL.

LEFT JOIN Gotchas

NOTE: A LEFT JOIN usually, but not always, will result in the same number of rows as the left table. Cases where this is not true include:

  • Duplicates in the columns that are being joined in the right table
  • A WHERE clause that filters the result
Modify the `LEFT JOIN` query above to filter to only rows that are **NOT** ducks. 

Hint: In Python (like in SQL), nothing equals None! 
Just like in Python, we use the `IS` keyword to check if a value is missing.
# Uncomment and run to show solution
# !cat ./answers/answer_2.09.sql

3. Subqueries

What is a Subquery?

A subquery, also known as a nested query, is a query within another SQL query. It's like a query inside a query! Subqueries are used to perform operations that require multiple steps, such as filtering data based on a complex condition or aggregating data before using it in the main query. In other words, instead of creating multiple new tables as intermediate steps, you can define these steps within the scope of a larger query.

Types of Subquery

A Subquery can return a single value (one row and one column), an entire column of values, or a table of values. These each can be used in the location within a query where a static value, column, or table would otherwise be.

Using Subqueries in DuckDB

Let's start by looking at our previously example query to understand how subqueries work in DuckDB.

Finding the top Beak_Length_Culmen

Suppose we want to find the individual ducks with the largest Beak_Length_Culmen. We can use a subquery to calculate the 99th percentile of Beak_Length_Culmen first, and then use that result in our main query:

%%dql
SELECT
    birds.column00 as id,
    birds.Species_Common_Name,
    birds.Beak_Length_Culmen
FROM birds
INNER JOIN ducks ON birds.Species_Common_Name = ducks.name
WHERE birds.Beak_Length_Culmen > (
    SELECT QUANTILE_CONT(birds.Beak_Length_Culmen, 0.99)
    FROM birds 
    INNER JOIN ducks ON birds.Species_Common_Name = ducks.name
)
ORDER BY birds.Beak_Length_Culmen DESC;

In this example, the subquery (SELECT QUANTILE_CONT(birds.Beak_Length_Culmen, 0.99) FROM birds INNER JOIN ducks ON birds.Species_Common_Name = ducks.name) calculates the 99th percentile of beak length for all birds that are ducks. The main query then selects the names and beak measurements of individual ducks who have a beak length above this value.


Find the duck species that have a `Wing_Length` larger than the 99<sup>th</sup> percentile of all ducks.
# Uncomment and run to show solution
# !cat ./answers/answer_2.10.sql
Can you find any duck species that have both a `Wing_Length` _and_ `Beak_Length_Culmen` larger than the 95<sup>th</sup> percentile of all duck species?
# Uncomment and run to show solution
# !cat ./answers/answer_2.11.sql
NOTE: This is extra credit!

Instead of individual ducks, find the duck species that _on average_ have a measured beak size that is larger than the 95<sup>th</sup> percentile of all ducks.
# Uncomment and run to show solution
# !cat ./answers/answer_2.12.sql

Using the WITH Clause

The WITH clause is an alternative to a subquery that has 2 key advantages: it can increase readability, and it allows for reusability. The technical term for a WITH clause is a Common Table Expression (abbreviated CTE), which describes how it can be reusable.

Now, let's see how we can use the WITH clause to make our queries more readable. Suppose we want to find the individual ducks that have a beak length above the 99th percentile of duck beaks. Here's how we can do it using the WITH clause:

%%dql
WITH
    duck_beaks AS (
        SELECT
            column00 as id,
            Species_Common_Name,
            Beak_Length_Culmen
        FROM birds
        INNER JOIN ducks ON name = Species_Common_Name
    ),

    pc99_beak_len AS (
        SELECT QUANTILE_CONT(Beak_Length_Culmen, 0.99) AS Top_Beak_Length 
        FROM duck_beaks
    )

SELECT
    duck_beaks.id,
    duck_beaks.Species_Common_Name,
    duck_beaks.Beak_Length_Culmen
FROM duck_beaks
INNER JOIN pc99_beak_len ON duck_beaks.Beak_Length_Culmen > pc99_beak_len.Top_Beak_Length
ORDER BY duck_beaks.Beak_Length_Culmen DESC;

In this example, the WITH clause creates two temporary result sets called duck_beaks and pc99_beak_len. The main query then selects the names and beak measurements of ducks with Beak_Length_Culmen above the top 99th percentile beak length.

Find the duck species that have an average `Wing_Length` larger than the 95<sup>th</sup> percentile of all duck species.
# Uncomment and run to show solution
# !cat ./answers/answer_2.13.sql
What about the duck species that have both a `Wing_Length` _or_ `Beak_Length_Culmen` larger than the 95sup>th</sup> percentile of all duck species?
# Uncomment and run to show solution
# !cat ./answers/answer_2.14.sql