Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial draft of policies and guidelines for libcudf usage. #11853

Merged
merged 6 commits into from
Oct 18, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -346,6 +346,44 @@ the device view can be obtained via function `column_device_view::create(column_
data, a specialized device view for list columns can be constructed via
`lists_column_device_view(column_device_view)`.

# libcudf policies and design principles
vyasr marked this conversation as resolved.
Show resolved Hide resolved

Some aspects of libcudf usage may be surprising to new users.
As a performance-oriented library, many design decisions prioritize flexibility and performance in ways that require some additional work or input from client code.
vyasr marked this conversation as resolved.
Show resolved Hide resolved
We document these policies and the reasons behind them here:
vyasr marked this conversation as resolved.
Show resolved Hide resolved

## libcudf does not introspect data

libcudf APIs generally do not perform deep introspection and validation of input data.
There are numerous reasons for this:
1. It violates the single responsibility principle: validation is separate from execution.
2. Since libcudf data structures store data on the GPU, any validation incurs the minimal overhead of a kernel launch.
vyasr marked this conversation as resolved.
Show resolved Hide resolved
3. API promises around data introspection often significantly complicate implementation.

Users are therefore responsible for passing valid data into such APIs.
_Note that this policy does not mean that libcudf performs no validation whatsoever_.
libcudf APIs should still perform any validation that does not require introspection, such as checking whether input column/table sizes or dtypes are appropriate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extends to output sizes and results as well.

In general there is no validation that an operation will produce a column that fits in the 2GB limit. There are a few minor exceptions where the validation does not require an extra kernel launch.

Also, operations that may produce overflow (integer or floating-point arithmetic) are also not checked. Overflow produces undefined results. For example, integer overflow to negative values is not guaranteed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent here was not to provide exhaustive lists of things we will and will not check, but rather to provide some examples of cases to illustrate the difference between introspective and non-introspective operations. I am wary of trying to maintain a complete list that will inevitably go out of date the day after it is published. Do you think it's worth trying to keep a larger list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I would like to keep a list of things that keep being brought up (like these two here) so that I can point to this document to say this is by design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this does not need to be exhaustive, but it would be nice if this policy guide can serve as a mini-FAQ of "why is libcudf designed this way?" I think that the concrete examples @davidwendt mentioned help illustrate the philosophy of the library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some, and I've reworded this section to reflect that we want to keep a list but make no promises that it's exhaustive.


**TODO: What about checking null counts? The most likely path to fixing the implicit kernel invocation in `null_count` (which we have to remove in order to present a safe stream-ordered API) is forcing specification of the null count on construction, which would make checking null counts a cheap operation.**
vyasr marked this conversation as resolved.
Show resolved Hide resolved
vyasr marked this conversation as resolved.
Show resolved Hide resolved

## libcudf will not sanitize nulls for nested types
vyasr marked this conversation as resolved.
Show resolved Hide resolved

Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
Specifically:
- Null elements of list columns should also be empty. They should not contain arbitrary data.
vyasr marked this conversation as resolved.
Show resolved Hide resolved
- Null elements of struct columns should also be null elements in the underlying structs.
vyasr marked this conversation as resolved.
Show resolved Hide resolved

**TODO: Do we still need to ensure the above with the new nested comparators? Since the nested comparators no longer operate based on flattening struct columns, I suspect that we will no longer need to superimpose parent nulls. We'll need to check that, though.**
vyasr marked this conversation as resolved.
Show resolved Hide resolved
vyasr marked this conversation as resolved.
Show resolved Hide resolved

libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized data.
Therefore, the only problem is if users construct input columns that are not correctly sanitized and then pass those into libcudf APIs.

## libcudf does not synchronize streams
vyasr marked this conversation as resolved.
Show resolved Hide resolved

libcudf APIs may not synchronize the used stream before returning.
It is incumbent on calling code to perform any syncs needed before accessing returned data.
vyasr marked this conversation as resolved.
Show resolved Hide resolved

# libcudf++ API and Implementation

## Streams
Expand Down