Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: builtin descriptions & examples #2178

Merged
merged 14 commits into from
Dec 4, 2023

Conversation

universalmind303
Copy link
Contributor

closes #2177.

Still need to finish adding examples and descriptions to all builtin functions, but this gives us a simple example & brief description for each builtin function.

example:

> select function_name, example, description from glare_catalog.functions where example is not null ORDER BY function_name asc;
┌────────────────────────────────────┬───────────────────────────────────────┬─────────────────────────────────────────────────────────────────────┐
│ function_name                      │ example                               │ description                                                         │
│ ──                                 │ ──                                    │ ──                                                                  │
│ Utf8                               │ Utf8                                  │ Utf8                                                                │
╞════════════════════════════════════╪═══════════════════════════════════════╪═════════════════════════════════════════════════════════════════════╡
│ approx_distinct                    │ approx_distinct(a)                    │ Gives the approximate count of distinct elements using HyperLogLog. │
│ approx_median                      │ approx_median(a)                      │ Gives the approximate median of a column                            │
│ approx_percentile_cont             │ approx_percentile_cont(a)             │ Gives the approximate percentile of a column                        │
│ approx_percentile_cont_with_weight │ approx_percentile_cont_with_weight(a) │ Gives the approximate percentile of a column with a weight column   │
│ array_agg                          │ array_agg(a)                          │ Returns a list containing all the values of a column                │
│ avg                                │ avg(a)                                │ Returns the average of a column                                     │
│ bit_and                            │ bit_and(a)                            │ Returns the bitwise AND of a column                                 │
│ bit_or                             │ bit_or(a)                             │ Returns the bitwise OR of a column                                  │
│ bit_xor                            │ bit_xor(a)                            │ Returns the bitwise XOR of a column                                 │
│ bool_and                           │ bool_and(a)                           │ Returns the boolean AND of a column                                 │
│ …                                  │ …                                     │ …                                                                   │
│ regr_r2                            │ regr_r2(x, y)                         │ Returns the coefficient of determination (R-squared) for non-null … │
│ regr_slope                         │ regr_slope(y, x)                      │ Returns the slope of the linear regression line for non-null pairs… │
│ regr_sxx                           │ regr_sxx(y, x)                        │ Returns the sum of squares of the independent variable for non-nul… │
│ regr_sxy                           │ regr_sxy(y, x)                        │ Returns the sum of products of independent times dependent variabl… │
│ regr_syy                           │ regr_syy(y, x)                        │ Returns the sum of squares of the dependent variable for non-null … │
│ stddev                             │ stddev(a)                             │ Returns the sample standard deviation of a column                   │
│ stddev_pop                         │ stddev_pop(a)                         │ Returns the population standard deviation of a column               │
│ sum                                │ sum(a)                                │ Returns the sum of a column                                         │
│ variance                           │ variance(a)                           │ Returns the sample variance of a column                             │
│ variance_pop                       │ variance_pop(a)                       │ Returns the population variance of a column                         │
└────────────────────────────────────┴───────────────────────────────────────┴─────────────────────────────────────────────────────────────────────┘

@universalmind303 universalmind303 force-pushed the universalmind303/builtin-descriptions branch from 3c78427 to 8e160c6 Compare November 29, 2023 18:08
@universalmind303 universalmind303 changed the title wip: builtin descriptions & examples feat: builtin descriptions & examples Nov 29, 2023
@universalmind303 universalmind303 marked this pull request as ready for review November 29, 2023 18:11
crates/protogen/proto/metastore/catalog.proto Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/aggregates.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/aggregates.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/aggregates.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/aggregates.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
@universalmind303
Copy link
Contributor Author

universalmind303 commented Nov 30, 2023

@tychoish

The approach I've taken is consistent with similar tools in our field. In fact, many of these are copied from those tools (polars, spark, duckdb, ...).

I'm open to making necessary changes. However, given the volume of comments, it would be helpful to prioritize the most critical ones. If there are specific changes you feel strongly about, could you make PR suggestions for those specific ones? The suggestions are much easier to apply for small changes like this.

@tychoish
Copy link
Collaborator

The approach I've taken is consistent with similar tools in our field. In fact, many of these are copied from those tools.

I mean I don't think this makes them good, or helpful, and I think we (maybe?) have the opportunity to help and make using our interface easier? If having help text for our methods is important, then having it be clear and useful, should also be important.

I'm open to making necessary changes. However, given the volume of comments, it would be helpful to prioritize the most critical ones. If there are specific changes you feel strongly about, could you make PR suggestions for those specific ones? The suggestions are much easier to apply for small changes like this.

I feel moderately strongly about it: while I did have some that were stylistic, many of my comments point out unclear or confusing semantics or situations where the string is actively confusing.

Improving things is going to take one of us a few hours of editing. Maybe the best path forward is:

  • read through things and spend some amount of (time boxed, 90 minutes? 2 hours) making changes.
  • copy anything you don't want to address to TODOs or similar?
  • I'll make a PR after you merge that (and this) to further clean things up?

@universalmind303
Copy link
Contributor Author

universalmind303 commented Nov 30, 2023

I feel moderately strongly about it: while I did have some that were stylistic, many of my comments point out unclear or confusing semantics or situations where the string is actively confusing.

What i was suggesting is using the pr suggestions instead of comments for the comments that are suggesting a direct change such as "use Return instead of Compute". This makes it much easier to apply the changes (just click accept suggestion) instead of going through every comment & having to reason about what the change should be.

I also don't think we want overly verbose examples & descriptions in the catalogs. I do think the "extended" examples are helpful, but i don't think the functions table is the best spot for that level of information.

IMO, this is very readable from the command line, and gives you at least a starting point on what the function is, and what it does.

> select function_name, example, description from glare_catalog.functions where function_type = 'scalar' order by function_name asc;
┌────────────────┬──────────────────────────────────┬───────────────────────────────────────────────────────────────────────────┐
│ function_name  │ example                          │ description                                                               │
│ ──             │ ──                               │ ──                                                                        │
│ Utf8           │ Utf8                             │ Utf8                                                                      │
╞════════════════╪══════════════════════════════════╪═══════════════════════════════════════════════════════════════════════════╡
│ abs            │ abs(-1)                          │ Compute the absolute value of a number.                                   │
│ acos           │ acos(0.5)                        │ Compute the inverse cosine (arc cosine) of a number.                      │
│ acosh          │ acosh(1)                         │ Compute the inverse hyperbolic cosine of a number.                        │
│ array_append   │ array_append([1, 2], 3)          │ Append an element to the end of an array.                                 │
│ array_concat   │ array_concat([1, 2], [3, 4])     │ Concatenate two arrays.                                                   │
│ array_contains │ array_contains([1, 2], 1)        │ Returns true if the array contains the specified element                  │
│ array_dims     │ array_dims([1, 2])               │ Returns the dimensions of an array.                                       │
│ array_element  │ array_element([1, 2], 1)         │ Returns the element of an array at the specified index.                   │
│ array_has_all  │ array_has_all([1, 2], [1, 2, 3]) │ Returns true if the first array contains all elements of the second array │
│ array_has_any  │ array_has_any([1, 2], [1, 2, 3]) │ Returns true if the first array contains any elements of the second array │
│ …              │ …                                │ …                                                                         │
│ to_hex         │ TODO                             │ TODO                                                                      │
│ to_timestamp   │ to_timestamp('2020-09-08T12:00:… │ Converts a string to a timestamp (Timestamp<ns, UTC>). Alias for `TIMEST… │
│ to_timestamp_… │ to_timestamp_micros('2020-09-08… │ Converts a string to a timestamp with microsecond precision (Timestamp<…  │
│ to_timestamp_… │ to_timestamp_millis('2020-09-08… │ Converts a string to a timestamp with millisecond precision (Timestamp<m… │
│ to_timestamp_… │ to_timestamp_seconds('2020-09-0… │ Converts a string to a timestamp with second precision (Timestamp<s, UTC… │
│ translate      │ translate('hello', 'el', '12')   │ Replace all occurrences of a set of characters in a string with a new se… │
│ trim           │ trim(' hello ')                  │ Remove all spaces from the start and end of a string.                     │
│ trunc          │ trunc(1.1111, 2)                 │ Truncate a number to the nearest integer. If a second argument is provid… │
│ upper          │ upper('hello')                   │ Convert a string to uppercase.                                            │
│ uuid           │ uuid()                           │ Generate a random UUID.                                                   │
└────────────────┴──────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘

anything more than that, such as what I have for the table functions (but want to find a way to slim down) isn't as easy to grok.

> select function_name, example, description from glare_catalog.functions where function_type = 'table' order by function_name asc;
┌────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────┐
│ function_name      │ example                                                                                 │ description                                                │
│ ──                 │ ──                                                                                      │ ──                                                         │
│ Utf8               │ Utf8                                                                                    │ Utf8                                                       │
╞════════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╪════════════════════════════════════════════════════════════╡
│ csv_scan           │ ↵                                                                                       │ ↵                                                          │
│                    │ -- Read a relative path.↵                                                               │ Syntax:↵                                                   │
│                    │ SELECT * FROM csv_scan('./my_data.csv');↵                                               │ -- Single url or path.↵                                    │
│                    │ -- Read all csv files in a directory.↵                                                  │ csv_scan(<url>)↵                                           │
│                    │ SELECT * FROM csv_scan('./directory_of_data/*.csv');↵                                   │ -- Multiple urls or paths.↵                                │
│                    │ -- Read csv files from multiple directories.↵                                           │ csv_scan([<url>])↵                                         │
│                    │ SELECT * FROM csv_scan('./**/*.csv');↵                                                  │ -- Using a cloud credentials object.↵                      │
│                    │ -- Read multiple explicitly provided files.↵                                            │ csv_scan(<url>, <credential_object>)↵                      │
│                    │ SELECT * FROM csv_scan(['./directory_of_data/1.csv', './directory_of_data/2.csv']);↵    │ -- Required named argument for S3 buckets.↵                │
│                    │                                                                                         │ csv_scan(<url>, <credentials_object>, region =>↵           │
│                    │                                                                                         │ '<aws_region>')↵                                           │
│                    │                                                                                         │ -- Pass S3 credentials using named arguments.↵             │
│                    │                                                                                         │ csv_scan(<url>, access_key_id => '<aws_access_key_id>',↵   │
│                    │                                                                                         │ secret_access_key => '<aws_secret_access_key>', region =>↵ │
│                    │                                                                                         │ '<aws_region>')↵                                           │
│                    │                                                                                         │ -- Pass GCS credentials using named arguments.↵            │
│                    │                                                                                         │ csv_scan(<url>, service_account_key =>↵                    │
│                    │                                                                                         │ '<gcp_service_account_key>'                                │
│ delta_scan         │ NULLNULL                                                       │
│ read_bigquery      │ NULLNULL                                                       │
│ read_excel         │ NULLNULL                                                       │
│ read_mongodb       │ NULLNULL                                                       │
│ read_mysql         │ NULLNULL                                                       │
│ read_postgres      │ NULLNULL                                                       │
│ read_snowflake     │ NULLNULL                                                       │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
> 

I like the idea of the suggested help function. Maybe we could use that to encapsulate more detailed information

maybe something like

\help digest

## Syntax
`digest(<expr>, <method>)`

## Arguments
<expr>     A string or binary expression, the message to be digested. 
<method>  the algorithm to use for digest. Valid options are 'md5' | 'sha224' | 'sha256' | 'sha384' | 'sha512' | 'blake2s' | 'blake2b' | 'blake3'

## Returns
`Binary`

## Examples
> select digest('hello', 'md5');
┌───────────────────────────────────┐
│ digest(Utf8("hello"),Utf8("md5")) │
│ ──                                │
│ Binary                            │
╞═══════════════════════════════════╡
│ 5d41402abc4b2a76b9719d911017c592  │
└───────────────────────────────────┘

@universalmind303
Copy link
Contributor Author

@tychoish i don't mean to be argumentative or combative. I do agree that we should really strive to have good documentation, i think we are just a bit misaligned on the manifestation of those. I'm of the mind that the table fields should be concise & serve as an "entry point". (The ones addressed in this PR). Anything beyond that (such as ones you'd find in the function reference) should not be part of the glare_catalog.functions table. IMO It's just information overload.

I do really like how snowflake has their functions documented. They provide examples, outputs, syntax & more. Doing something like that via \help or select help(<function>) seems like a better solution to me. That way users only get the "detailed" information if it's explicitly requested, not by default.

@universalmind303 universalmind303 force-pushed the universalmind303/builtin-descriptions branch from cffae1a to b8a65d7 Compare December 1, 2023 20:30
Copy link
Collaborator

@tychoish tychoish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm moduo a few minor comments.

crates/sqlbuiltins/src/functions/mod.rs Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/functions/scalars.rs Outdated Show resolved Hide resolved
crates/sqlbuiltins/src/builtins.rs Show resolved Hide resolved
crates/sqlbuiltins/src/functions/table/object_store.rs Outdated Show resolved Hide resolved
@universalmind303 universalmind303 enabled auto-merge (squash) December 4, 2023 17:05
@universalmind303 universalmind303 merged commit b00ab67 into main Dec 4, 2023
10 checks passed
@universalmind303 universalmind303 deleted the universalmind303/builtin-descriptions branch December 4, 2023 17:09
universalmind303 added a commit that referenced this pull request Dec 6, 2023
closes #2213

Some context.. 

After finishing up the docs in
#2178, I realized that we were
missing **all** of our udfs. _kdl_*, and all of our postgres specific
functions._

It wasn't really obvious why either. They were in a separate part of the
codebase hidden away. This pr moves all of the udfs out of sqlexec and
puts them into sqlbuiltins alongside our udtfs (user defined table
functions).

---------

Co-authored-by: Sean Smith <scsmithr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add example and description columns to glare_catalog.functions
3 participants