feat: builtin descriptions & examples #2178

universalmind303 · 2023-11-29T00:10:54Z

closes #2177.

Still need to finish adding examples and descriptions to all builtin functions, but this gives us a simple example & brief description for each builtin function.

example:

> select function_name, example, description from glare_catalog.functions where example is not null ORDER BY function_name asc;
┌────────────────────────────────────┬───────────────────────────────────────┬─────────────────────────────────────────────────────────────────────┐
│ function_name                      │ example                               │ description                                                         │
│ ──                                 │ ──                                    │ ──                                                                  │
│ Utf8                               │ Utf8                                  │ Utf8                                                                │
╞════════════════════════════════════╪═══════════════════════════════════════╪═════════════════════════════════════════════════════════════════════╡
│ approx_distinct                    │ approx_distinct(a)                    │ Gives the approximate count of distinct elements using HyperLogLog. │
│ approx_median                      │ approx_median(a)                      │ Gives the approximate median of a column                            │
│ approx_percentile_cont             │ approx_percentile_cont(a)             │ Gives the approximate percentile of a column                        │
│ approx_percentile_cont_with_weight │ approx_percentile_cont_with_weight(a) │ Gives the approximate percentile of a column with a weight column   │
│ array_agg                          │ array_agg(a)                          │ Returns a list containing all the values of a column                │
│ avg                                │ avg(a)                                │ Returns the average of a column                                     │
│ bit_and                            │ bit_and(a)                            │ Returns the bitwise AND of a column                                 │
│ bit_or                             │ bit_or(a)                             │ Returns the bitwise OR of a column                                  │
│ bit_xor                            │ bit_xor(a)                            │ Returns the bitwise XOR of a column                                 │
│ bool_and                           │ bool_and(a)                           │ Returns the boolean AND of a column                                 │
│ …                                  │ …                                     │ …                                                                   │
│ regr_r2                            │ regr_r2(x, y)                         │ Returns the coefficient of determination (R-squared) for non-null … │
│ regr_slope                         │ regr_slope(y, x)                      │ Returns the slope of the linear regression line for non-null pairs… │
│ regr_sxx                           │ regr_sxx(y, x)                        │ Returns the sum of squares of the independent variable for non-nul… │
│ regr_sxy                           │ regr_sxy(y, x)                        │ Returns the sum of products of independent times dependent variabl… │
│ regr_syy                           │ regr_syy(y, x)                        │ Returns the sum of squares of the dependent variable for non-null … │
│ stddev                             │ stddev(a)                             │ Returns the sample standard deviation of a column                   │
│ stddev_pop                         │ stddev_pop(a)                         │ Returns the population standard deviation of a column               │
│ sum                                │ sum(a)                                │ Returns the sum of a column                                         │
│ variance                           │ variance(a)                           │ Returns the sample variance of a column                             │
│ variance_pop                       │ variance_pop(a)                       │ Returns the population variance of a column                         │
└────────────────────────────────────┴───────────────────────────────────────┴─────────────────────────────────────────────────────────────────────┘

crates/protogen/proto/metastore/catalog.proto

crates/sqlbuiltins/src/functions/aggregates.rs

crates/sqlbuiltins/src/functions/scalars.rs

universalmind303 · 2023-11-30T15:44:59Z

@tychoish

The approach I've taken is consistent with similar tools in our field. In fact, many of these are copied from those tools (polars, spark, duckdb, ...).

I'm open to making necessary changes. However, given the volume of comments, it would be helpful to prioritize the most critical ones. If there are specific changes you feel strongly about, could you make PR suggestions for those specific ones? The suggestions are much easier to apply for small changes like this.

tychoish · 2023-11-30T17:54:32Z

The approach I've taken is consistent with similar tools in our field. In fact, many of these are copied from those tools.

I mean I don't think this makes them good, or helpful, and I think we (maybe?) have the opportunity to help and make using our interface easier? If having help text for our methods is important, then having it be clear and useful, should also be important.

I'm open to making necessary changes. However, given the volume of comments, it would be helpful to prioritize the most critical ones. If there are specific changes you feel strongly about, could you make PR suggestions for those specific ones? The suggestions are much easier to apply for small changes like this.

I feel moderately strongly about it: while I did have some that were stylistic, many of my comments point out unclear or confusing semantics or situations where the string is actively confusing.

Improving things is going to take one of us a few hours of editing. Maybe the best path forward is:

read through things and spend some amount of (time boxed, 90 minutes? 2 hours) making changes.
copy anything you don't want to address to TODOs or similar?
I'll make a PR after you merge that (and this) to further clean things up?

universalmind303 · 2023-11-30T18:58:20Z

I feel moderately strongly about it: while I did have some that were stylistic, many of my comments point out unclear or confusing semantics or situations where the string is actively confusing.

What i was suggesting is using the pr suggestions instead of comments for the comments that are suggesting a direct change such as "use Return instead of Compute". This makes it much easier to apply the changes (just click accept suggestion) instead of going through every comment & having to reason about what the change should be.

I also don't think we want overly verbose examples & descriptions in the catalogs. I do think the "extended" examples are helpful, but i don't think the functions table is the best spot for that level of information.

IMO, this is very readable from the command line, and gives you at least a starting point on what the function is, and what it does.

> select function_name, example, description from glare_catalog.functions where function_type = 'scalar' order by function_name asc;
┌────────────────┬──────────────────────────────────┬───────────────────────────────────────────────────────────────────────────┐
│ function_name  │ example                          │ description                                                               │
│ ──             │ ──                               │ ──                                                                        │
│ Utf8           │ Utf8                             │ Utf8                                                                      │
╞════════════════╪══════════════════════════════════╪═══════════════════════════════════════════════════════════════════════════╡
│ abs            │ abs(-1)                          │ Compute the absolute value of a number.                                   │
│ acos           │ acos(0.5)                        │ Compute the inverse cosine (arc cosine) of a number.                      │
│ acosh          │ acosh(1)                         │ Compute the inverse hyperbolic cosine of a number.                        │
│ array_append   │ array_append([1, 2], 3)          │ Append an element to the end of an array.                                 │
│ array_concat   │ array_concat([1, 2], [3, 4])     │ Concatenate two arrays.                                                   │
│ array_contains │ array_contains([1, 2], 1)        │ Returns true if the array contains the specified element                  │
│ array_dims     │ array_dims([1, 2])               │ Returns the dimensions of an array.                                       │
│ array_element  │ array_element([1, 2], 1)         │ Returns the element of an array at the specified index.                   │
│ array_has_all  │ array_has_all([1, 2], [1, 2, 3]) │ Returns true if the first array contains all elements of the second array │
│ array_has_any  │ array_has_any([1, 2], [1, 2, 3]) │ Returns true if the first array contains any elements of the second array │
│ …              │ …                                │ …                                                                         │
│ to_hex         │ TODO                             │ TODO                                                                      │
│ to_timestamp   │ to_timestamp('2020-09-08T12:00:… │ Converts a string to a timestamp (Timestamp<ns, UTC>). Alias for `TIMEST… │
│ to_timestamp_… │ to_timestamp_micros('2020-09-08… │ Converts a string to a timestamp with microsecond precision (Timestamp<…  │
│ to_timestamp_… │ to_timestamp_millis('2020-09-08… │ Converts a string to a timestamp with millisecond precision (Timestamp<m… │
│ to_timestamp_… │ to_timestamp_seconds('2020-09-0… │ Converts a string to a timestamp with second precision (Timestamp<s, UTC… │
│ translate      │ translate('hello', 'el', '12')   │ Replace all occurrences of a set of characters in a string with a new se… │
│ trim           │ trim(' hello ')                  │ Remove all spaces from the start and end of a string.                     │
│ trunc          │ trunc(1.1111, 2)                 │ Truncate a number to the nearest integer. If a second argument is provid… │
│ upper          │ upper('hello')                   │ Convert a string to uppercase.                                            │
│ uuid           │ uuid()                           │ Generate a random UUID.                                                   │
└────────────────┴──────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘

anything more than that, such as what I have for the table functions (but want to find a way to slim down) isn't as easy to grok.

> select function_name, example, description from glare_catalog.functions where function_type = 'table' order by function_name asc;
┌────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────┐
│ function_name      │ example                                                                                 │ description                                                │
│ ──                 │ ──                                                                                      │ ──                                                         │
│ Utf8               │ Utf8                                                                                    │ Utf8                                                       │
╞════════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╪════════════════════════════════════════════════════════════╡
│ csv_scan           │ ↵                                                                                       │ ↵                                                          │
│                    │ -- Read a relative path.↵                                                               │ Syntax:↵                                                   │
│                    │ SELECT * FROM csv_scan('./my_data.csv');↵                                               │ -- Single url or path.↵                                    │
│                    │ -- Read all csv files in a directory.↵                                                  │ csv_scan(<url>)↵                                           │
│                    │ SELECT * FROM csv_scan('./directory_of_data/*.csv');↵                                   │ -- Multiple urls or paths.↵                                │
│                    │ -- Read csv files from multiple directories.↵                                           │ csv_scan([<url>])↵                                         │
│                    │ SELECT * FROM csv_scan('./**/*.csv');↵                                                  │ -- Using a cloud credentials object.↵                      │
│                    │ -- Read multiple explicitly provided files.↵                                            │ csv_scan(<url>, <credential_object>)↵                      │
│                    │ SELECT * FROM csv_scan(['./directory_of_data/1.csv', './directory_of_data/2.csv']);↵    │ -- Required named argument for S3 buckets.↵                │
│                    │                                                                                         │ csv_scan(<url>, <credentials_object>, region =>↵           │
│                    │                                                                                         │ '<aws_region>')↵                                           │
│                    │                                                                                         │ -- Pass S3 credentials using named arguments.↵             │
│                    │                                                                                         │ csv_scan(<url>, access_key_id => '<aws_access_key_id>',↵   │
│                    │                                                                                         │ secret_access_key => '<aws_secret_access_key>', region =>↵ │
│                    │                                                                                         │ '<aws_region>')↵                                           │
│                    │                                                                                         │ -- Pass GCS credentials using named arguments.↵            │
│                    │                                                                                         │ csv_scan(<url>, service_account_key =>↵                    │
│                    │                                                                                         │ '<gcp_service_account_key>'                                │
│ delta_scan         │ NULL                                                                                    │ NULL                                                       │
│ read_bigquery      │ NULL                                                                                    │ NULL                                                       │
│ read_excel         │ NULL                                                                                    │ NULL                                                       │
│ read_mongodb       │ NULL                                                                                    │ NULL                                                       │
│ read_mysql         │ NULL                                                                                    │ NULL                                                       │
│ read_postgres      │ NULL                                                                                    │ NULL                                                       │
│ read_snowflake     │ NULL                                                                                    │ NULL                                                       │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
>

I like the idea of the suggested help function. Maybe we could use that to encapsulate more detailed information

maybe something like

\help digest

## Syntax
`digest(<expr>, <method>)`

## Arguments
<expr>     A string or binary expression, the message to be digested. 
<method>  the algorithm to use for digest. Valid options are 'md5' | 'sha224' | 'sha256' | 'sha384' | 'sha512' | 'blake2s' | 'blake2b' | 'blake3'

## Returns
`Binary`

## Examples
> select digest('hello', 'md5');
┌───────────────────────────────────┐
│ digest(Utf8("hello"),Utf8("md5")) │
│ ──                                │
│ Binary                            │
╞═══════════════════════════════════╡
│ 5d41402abc4b2a76b9719d911017c592  │
└───────────────────────────────────┘

universalmind303 · 2023-11-30T19:34:05Z

@tychoish i don't mean to be argumentative or combative. I do agree that we should really strive to have good documentation, i think we are just a bit misaligned on the manifestation of those. I'm of the mind that the table fields should be concise & serve as an "entry point". (The ones addressed in this PR). Anything beyond that (such as ones you'd find in the function reference) should not be part of the glare_catalog.functions table. IMO It's just information overload.

I do really like how snowflake has their functions documented. They provide examples, outputs, syntax & more. Doing something like that via \help or select help(<function>) seems like a better solution to me. That way users only get the "detailed" information if it's explicitly requested, not by default.

tychoish

lgtm moduo a few minor comments.

crates/sqlbuiltins/src/functions/mod.rs

crates/sqlbuiltins/src/functions/scalars.rs

crates/sqlbuiltins/src/builtins.rs

crates/sqlbuiltins/src/functions/table/object_store.rs

closes #2213 Some context.. After finishing up the docs in #2178, I realized that we were missing **all** of our udfs. _kdl_*, and all of our postgres specific functions._ It wasn't really obvious why either. They were in a separate part of the codebase hidden away. This pr moves all of the udfs out of sqlexec and puts them into sqlbuiltins alongside our udtfs (user defined table functions). --------- Co-authored-by: Sean Smith <scsmithr@gmail.com>

universalmind303 force-pushed the universalmind303/builtin-descriptions branch from 3c78427 to 8e160c6 Compare November 29, 2023 18:08

universalmind303 changed the title ~~wip: builtin descriptions & examples~~ feat: builtin descriptions & examples Nov 29, 2023

universalmind303 marked this pull request as ready for review November 29, 2023 18:11

universalmind303 requested review from scsmithr and vrongmeal November 29, 2023 18:11

scsmithr approved these changes Nov 29, 2023

View reviewed changes

crates/protogen/proto/metastore/catalog.proto Outdated Show resolved Hide resolved

tychoish reviewed Nov 30, 2023

View reviewed changes

universalmind303 requested a review from tychoish December 1, 2023 16:53

universalmind303 force-pushed the universalmind303/builtin-descriptions branch from 251b254 to f1d4d87 Compare December 1, 2023 17:05

universalmind303 requested a review from scsmithr December 1, 2023 17:05

universalmind303 added 11 commits December 1, 2023 14:29

wip

9f91f6a

wip

45e7189

add more docs & examples

8047bb5

add simple slts

c55140f

clippy

11e5b35

update bad comment string

a192656

fix merge

5095073

address some comments

05c2f74

address some comments

90dbfc7

address some more comments and some refactoring

db2a2a2

clippy

b8a65d7

universalmind303 force-pushed the universalmind303/builtin-descriptions branch from cffae1a to b8a65d7 Compare December 1, 2023 20:30

merge conflicts

8796467

tychoish approved these changes Dec 1, 2023

View reviewed changes

universalmind303 added 2 commits December 4, 2023 10:40

pr feedback

a128027

add docs for to_hex

de78bdb

universalmind303 enabled auto-merge (squash) December 4, 2023 17:05

universalmind303 merged commit b00ab67 into main Dec 4, 2023
10 checks passed

universalmind303 deleted the universalmind303/builtin-descriptions branch December 4, 2023 17:09

universalmind303 mentioned this pull request Dec 6, 2023

chore: refactor udfs into sqlbuiltins #2214

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: builtin descriptions & examples #2178

feat: builtin descriptions & examples #2178

universalmind303 commented Nov 29, 2023

universalmind303 commented Nov 30, 2023 •

edited

Loading

tychoish commented Nov 30, 2023

universalmind303 commented Nov 30, 2023 •

edited

Loading

universalmind303 commented Nov 30, 2023

tychoish left a comment

feat: builtin descriptions & examples #2178

feat: builtin descriptions & examples #2178

Conversation

universalmind303 commented Nov 29, 2023

universalmind303 commented Nov 30, 2023 • edited Loading

tychoish commented Nov 30, 2023

universalmind303 commented Nov 30, 2023 • edited Loading

universalmind303 commented Nov 30, 2023

tychoish left a comment

Choose a reason for hiding this comment

universalmind303 commented Nov 30, 2023 •

edited

Loading

universalmind303 commented Nov 30, 2023 •

edited

Loading