-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(format): add additional features to 1.1.0 spec #765
feat(format): add additional features to 1.1.0 spec #765
Conversation
|
df64b7c
to
04ecc45
Compare
c0486ed
to
f91c0db
Compare
@zeroshade this should be the rest of the requested APIs for 1.1.0, if this looks reasonable I'll port to Java and Go |
adbc.h
Outdated
/// \brief Get a hierarchical view of system and user-defined functions. | ||
/// | ||
/// The result is an Arrow dataset with the following schema: | ||
/// | ||
/// | Field Name | Field Type | | ||
/// |--------------------------|-------------------------| | ||
/// | catalog_name | utf8 | | ||
/// | catalog_db_schemas | list<DB_SCHEMA_SCHEMA> | | ||
/// | ||
/// DB_SCHEMA_SCHEMA is a Struct with fields: | ||
/// | ||
/// | Field Name | Field Type | | ||
/// |--------------------------|-------------------------| | ||
/// | db_schema_name | utf8 | | ||
/// | db_schema_functions | list<FUNCTION_SCHEMA> | | ||
/// | ||
/// FUNCTION_SCHEMA is a Struct with fields: | ||
/// | ||
/// | Field Name | Field Type | Comments | | ||
/// |--------------------------|-------------------------|----------| | ||
/// | function_name | utf8 not null | | | ||
/// | remarks | utf8 | (1) | | ||
/// | function_type | int16 | (2) | | ||
/// | specific_name | utf8 | (3) | | ||
/// | function_columns | list<COLUMN_SCHEMA> | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the difficulty and complexity we've seen with GetObjects
, do we want to follow the same path here for UDFs and Stored procedures? Or do we want to try to find a less complex / less nested route for this to avoid the same problems / complaints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opted to keep it for consistency. I think this being difficult is a reflection of the Arrow APIs still being immature, since all things considered this is a fairly straightforward nested schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said you could say GetTableStatistics doesn't follow it; I can adjust that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the difficulty and complexity we've seen with
GetObjects
, do we want to follow the same path here for UDFs and Stored procedures? Or do we want to try to find a less complex / less nested route for this to avoid the same problems / complaints?
Can you elaborate on the difficulty? Is it about easily accessing nested data from C++? If so, perhaps some separate helpers could be added for ease of navigating struct arrays and scalars...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building and traversing these structures is a bit verbose and easy to mess up. Also this is all via Nanoarrow which of course is a more minimal API. I've been prototyping a code generator (generating iterators/builders for fixed schemas, sort of like Protobuf) as a possible solution but it isn't anywhere near ready yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping some of the helpers defined in utils.h via https://github.com/apache/arrow-adbc/pull/769/files#diff-3a15518d6cfd7263838e981957bf0280c06324e662700a6e8409969e3c5db6ed could help here. These are still probably too strict for a lot of use cases today but wondering if we can't iterate on those to help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my idea with the code generator, to be able to just generate the code for helpers along those lines to build/access these structures and check that the schema is the expected schema
f992660
to
6b188e5
Compare
Updated to include some additional common statistics after reviewing Hive. It may also be good to include the 'histogram' statistic (Oracle, Hive, PostgreSQL) but the encoding for this gets very messy with Arrow: either you need a union[list[int], list[double], ...], or you have to devise some way to pack it into a binary column. (Or possibly we can just include list[binary] and declare that things need to be packed there.) Regardless, the statistic schema is now fairly complicated with nesting, unions, and dictionary-encoding. |
adbc.h
Outdated
/// 2. A dictionary-encoded statistic name. Values in [0, 1024) are | ||
/// reserved for ADBC use. Other values are free for | ||
/// implementation-specific statistics. For the names and | ||
/// definitions of predefined statistic types, see \ref | ||
/// adbc-table-statistics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit weird. Does it mean the dictionary must have more than 1024 entries if impl-specific stats are desired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah...right, dictionaries aren't sparse. I will probably abandon this idea then and just have it contain strings, and take the potential space hit (likely won't be a big deal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think you want to make lookups easy (and reasonably efficient?) as well. As is "I want the min and max of column a.b.c".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just make it an int column with a separate call to get the "dictionary" (without making it a formal Arrow dictionary). Non-standardized statistic names require special handling by callers anyways.
We could also make the schema wider; this is often the actual implementation underneath. But that makes it less extensible (unless we have a generic list[struct[string, union[...]]]
column at the end)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to just int16
with a separate call to get driver-specific statistic values.
adbc.h
Outdated
@@ -907,6 +1008,34 @@ AdbcStatusCode AdbcDatabaseNew(struct AdbcDatabase* database, struct AdbcError* | |||
AdbcStatusCode AdbcDatabaseGetOption(struct AdbcDatabase* database, const char* key, | |||
const char** value, struct AdbcError* error); | |||
|
|||
/// \brief Get a bytestring option of the database. | |||
/// | |||
/// This must always be thread-safe (other operations are not). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... but if AdbcDatabaseGetOptionBytes
is called from multiple threads, then it is undefined if the return value
is still valid when the calling thread consumes it, right?
(same problem as with POSIX getenv
, actually: https://pubs.opengroup.org/onlinepubs/9699919799/functions/getenv.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah. Possibly "thread-safe with respect to other functions"?
The only reason why I wanted it to be thread-safe was to allow for a progress indicator (on GetDouble). Maybe we can split that out into a dedicated function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you switch to malloc-allocated output, or you let the caller pass a suitable buffer area...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far things have avoided assuming libc malloc. We could return a struct with a release callback?
Caller-allocated buffers get annoying because you have to somehow tell the caller the size...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caller-allocated buffers get annoying because you have to somehow tell the caller the size...
But at least the lifetime is safe and it makes things simpler for the producer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. I'm also not sure why functions like vsnprintf choose to do that in the first place; maybe some POSIX convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getenv
is slightly different as global state where the application may not know what libraries it uses are also doing with the global state; here the operation is scoped to a specific handle that presumably an application can manage itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, how about something like this?
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, I suppose we should also update the base GetOption
(returns char*) though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated that too.
Intends to tackle: - apache#621 - apache#685 - apache#736 - apache#755
ca6ad0a
to
c8ba22f
Compare
There's been no movement here so I'll update this PR with Go and Java definitions |
Ok, Java and Go definitions are now present. |
I'm going to merge this for now so I can start drafting #319; we can come back and revisit the Go/Java APIs if needed (I know zeroshade is out for a bit so I don't want this held up) |
Intends to tackle: