From dfedb1d47f9c3aeafb5f42dcfdc05e3ea03892d4 Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani <vyasr@nvidia.com>
Date: Mon, 3 Oct 2022 14:05:11 -0700
Subject: [PATCH 1/6] Initial draft of policies and guidelines for libcudf
 usage.

---
 .../developer_guide/DEVELOPER_GUIDE.md        | 38 +++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index b3774aeda38..c56cdb6f25c 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -346,6 +346,44 @@ the device view can be obtained via function `column_device_view::create(column_
 data, a specialized device view for list columns can be constructed via
 `lists_column_device_view(column_device_view)`.
 
+# libcudf policies and design principles
+
+Some aspects of libcudf usage may be surprising to new users.
+As a performance-oriented library, many design decisions prioritize flexibility and performance in ways that require some additional work or input from client code.
+We document these policies and the reasons behind them here:
+
+## libcudf does not introspect data
+
+libcudf APIs generally do not perform deep introspection and validation of input data.
+There are numerous reasons for this:
+1. It violates the single responsibility principle: validation is separate from execution.
+2. Since libcudf data structures store data on the GPU, any validation incurs the minimal overhead of a kernel launch.
+3. API promises around data introspection often significantly complicate implementation.
+
+Users are therefore responsible for passing valid data into such APIs.
+_Note that this policy does not mean that libcudf performs no validation whatsoever_.
+libcudf APIs should still perform any validation that does not require introspection, such as checking whether input column/table sizes or dtypes are appropriate.
+
+**TODO: What about checking null counts? The most likely path to fixing the implicit kernel invocation in `null_count` (which we have to remove in order to present a safe stream-ordered API) is forcing specification of the null count on construction, which would make checking null counts a cheap operation.**
+
+## libcudf will not sanitize nulls for nested types
+
+Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
+In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
+Specifically:
+- Null elements of list columns should also be empty. They should not contain arbitrary data.
+- Null elements of struct columns should also be null elements in the underlying structs.
+
+**TODO: Do we still need to ensure the above with the new nested comparators? Since the nested comparators no longer operate based on flattening struct columns, I suspect that we will no longer need to superimpose parent nulls. We'll need to check that, though.**
+
+libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized data.
+Therefore, the only problem is if users construct input columns that are not correctly sanitized and then pass those into libcudf APIs.
+
+## libcudf does not synchronize streams
+
+libcudf APIs may not synchronize the used stream before returning.
+It is incumbent on calling code to perform any syncs needed before accessing returned data.
+
 # libcudf++ API and Implementation
 
 ## Streams

From ee89e842d935a1446a8e2529dbbe061c26ff3bfc Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani <vyasr@nvidia.com>
Date: Mon, 17 Oct 2022 14:06:08 -0700
Subject: [PATCH 2/6] Address PR comments and add notes on ordering.

---
 .../developer_guide/DEVELOPER_GUIDE.md        | 34 ++++++++++++++-----
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index c56cdb6f25c..0906c452f74 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -346,7 +346,7 @@ the device view can be obtained via function `column_device_view::create(column_
 data, a specialized device view for list columns can be constructed via
 `lists_column_device_view(column_device_view)`.
 
-# libcudf policies and design principles
+# libcudf Policies and Design Principles
 
 Some aspects of libcudf usage may be surprising to new users.
 As a performance-oriented library, many design decisions prioritize flexibility and performance in ways that require some additional work or input from client code.
@@ -357,32 +357,50 @@ We document these policies and the reasons behind them here:
 libcudf APIs generally do not perform deep introspection and validation of input data.
 There are numerous reasons for this:
 1. It violates the single responsibility principle: validation is separate from execution.
-2. Since libcudf data structures store data on the GPU, any validation incurs the minimal overhead of a kernel launch.
+2. Since libcudf data structures store data on the GPU, any validation incurs _at minimum_ the overhead of a kernel launch, and may in general be prohibitively expensive..
 3. API promises around data introspection often significantly complicate implementation.
 
 Users are therefore responsible for passing valid data into such APIs.
 _Note that this policy does not mean that libcudf performs no validation whatsoever_.
-libcudf APIs should still perform any validation that does not require introspection, such as checking whether input column/table sizes or dtypes are appropriate.
+libcudf APIs should still perform any validation that does not require introspection.
+To give some idea of what should or should not be validated, here are (non-exhaustive) lists of examples.
+
+**Things that libcudf should validate**:
+- Input column/table sizes or dtypes
+
+**Things that libcudf should not validate**:
+- Integer overflow
+- Ensuring that outputs will not exceed the 2GB size limit for a given set of inputs
 
 **TODO: What about checking null counts? The most likely path to fixing the implicit kernel invocation in `null_count` (which we have to remove in order to present a safe stream-ordered API) is forcing specification of the null count on construction, which would make checking null counts a cheap operation.**
 
-## libcudf will not sanitize nulls for nested types
+## libcudf expects nested types to have sanitized null masks
 
 Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
 In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
 Specifically:
-- Null elements of list columns should also be empty. They should not contain arbitrary data.
+- Null elements of list columns should also be empty. The starting offset of a null element should be equal to the ending offset.
 - Null elements of struct columns should also be null elements in the underlying structs.
+- Nulls should only ever be present at the level of the parent column. Child columns should never contain nulls.
+- Slice operations on nested columns do not propagate offsets to child columns.
 
 **TODO: Do we still need to ensure the above with the new nested comparators? Since the nested comparators no longer operate based on flattening struct columns, I suspect that we will no longer need to superimpose parent nulls. We'll need to check that, though.**
 
 libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized data.
 Therefore, the only problem is if users construct input columns that are not correctly sanitized and then pass those into libcudf APIs.
 
-## libcudf does not synchronize streams
+## libcudf does not synchronize CUDA streams
+
+libcudf APIs called on the host do not guarantee that the stream is synchronized before returning.
+Work in cudf occurs on `cudf::get_default_stream().value`, which defaults to the CUDA default stream (stream 0).
+Note that the stream 0 behavior differs if [per-thread default stream is enabled](https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html) via `CUDF_USE_PER_THREAD_DEFAULT_STREAM`.
+Any data provided to or returned by libcudf that uses a separate non-blocking stream requires synchronization with the default libcudf stream to ensure stream safety.
+
+## libcudf generally does not make ordering guarantees
 
-libcudf APIs may not synchronize the used stream before returning.
-It is incumbent on calling code to perform any syncs needed before accessing returned data.
+Functions like merge or groupby in libcudf make no guarantees about the order of entries in the output.
+Promising deterministic ordering is not, in general, conducive to fast parallel algorithms.
+Calling code is responsible for performing sorts after the fact if sorted outputs are needed.
 
 # libcudf++ API and Implementation
 

From cb57bdcd174eb859fccf006493a6f18fd16d331c Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Date: Mon, 17 Oct 2022 16:22:37 -0700
Subject: [PATCH 3/6] Apply suggestions from code review

Co-authored-by: Jake Hemstad <jhemstad@nvidia.com>
---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index 0906c452f74..b0edf923d8f 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -348,8 +348,10 @@ data, a specialized device view for list columns can be constructed via
 
 # libcudf Policies and Design Principles
 
-Some aspects of libcudf usage may be surprising to new users.
-As a performance-oriented library, many design decisions prioritize flexibility and performance in ways that require some additional work or input from client code.
+`libcudf` is designed to provide GPU-accelerated algorithm primitives for solving a wide variety of problems that arise in data science.
+Our goal is to enable diverse use cases like Spark or Pandas to benefit from the performance of GPUs. 
+As a result, libcudf prioritizes performance and flexibility, which sometimes may come at the cost of convenience.
+While we welcome users to use libcudf directly, we design with the expectation that most users will be consuming libcudf through higher-level layers like Spark or cuDF Python that handle some of details that direct users of libcudf must handle on their own.
 We document these policies and the reasons behind them here:
 
 ## libcudf does not introspect data
@@ -389,7 +391,7 @@ Specifically:
 libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized data.
 Therefore, the only problem is if users construct input columns that are not correctly sanitized and then pass those into libcudf APIs.
 
-## libcudf does not synchronize CUDA streams
+## Treat libcudf APIs as if they were asynchronous
 
 libcudf APIs called on the host do not guarantee that the stream is synchronized before returning.
 Work in cudf occurs on `cudf::get_default_stream().value`, which defaults to the CUDA default stream (stream 0).

From 14f21c8a4493f5c37551b3be272bf8ab25f36d22 Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Date: Mon, 17 Oct 2022 17:14:19 -0700
Subject: [PATCH 4/6] Apply suggestions from code review

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index b0edf923d8f..328e7961309 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -359,7 +359,7 @@ We document these policies and the reasons behind them here:
 libcudf APIs generally do not perform deep introspection and validation of input data.
 There are numerous reasons for this:
 1. It violates the single responsibility principle: validation is separate from execution.
-2. Since libcudf data structures store data on the GPU, any validation incurs _at minimum_ the overhead of a kernel launch, and may in general be prohibitively expensive..
+2. Since libcudf data structures store data on the GPU, any validation incurs _at minimum_ the overhead of a kernel launch, and may in general be prohibitively expensive.
 3. API promises around data introspection often significantly complicate implementation.
 
 Users are therefore responsible for passing valid data into such APIs.
@@ -374,7 +374,6 @@ To give some idea of what should or should not be validated, here are (non-exhau
 - Integer overflow
 - Ensuring that outputs will not exceed the 2GB size limit for a given set of inputs
 
-**TODO: What about checking null counts? The most likely path to fixing the implicit kernel invocation in `null_count` (which we have to remove in order to present a safe stream-ordered API) is forcing specification of the null count on construction, which would make checking null counts a cheap operation.**
 
 ## libcudf expects nested types to have sanitized null masks
 
@@ -383,7 +382,7 @@ In this context, sanitization refers to ensuring that the null elements in a col
 Specifically:
 - Null elements of list columns should also be empty. The starting offset of a null element should be equal to the ending offset.
 - Null elements of struct columns should also be null elements in the underlying structs.
-- Nulls should only ever be present at the level of the parent column. Child columns should never contain nulls.
+- For compound columns, nulls should only be present at the level of the parent column. Child columns should not contain nulls.
 - Slice operations on nested columns do not propagate offsets to child columns.
 
 **TODO: Do we still need to ensure the above with the new nested comparators? Since the nested comparators no longer operate based on flattening struct columns, I suspect that we will no longer need to superimpose parent nulls. We'll need to check that, though.**
@@ -394,7 +393,7 @@ Therefore, the only problem is if users construct input columns that are not cor
 ## Treat libcudf APIs as if they were asynchronous
 
 libcudf APIs called on the host do not guarantee that the stream is synchronized before returning.
-Work in cudf occurs on `cudf::get_default_stream().value`, which defaults to the CUDA default stream (stream 0).
+Work in libcudf occurs on `cudf::get_default_stream().value`, which defaults to the CUDA default stream (stream 0).
 Note that the stream 0 behavior differs if [per-thread default stream is enabled](https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html) via `CUDF_USE_PER_THREAD_DEFAULT_STREAM`.
 Any data provided to or returned by libcudf that uses a separate non-blocking stream requires synchronization with the default libcudf stream to ensure stream safety.
 

From 11a579bc8644e983ea23a058ec5b177807dcca87 Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Date: Tue, 18 Oct 2022 10:31:46 -0700
Subject: [PATCH 5/6] Apply suggestions from code review

---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index 328e7961309..07df46a4021 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -385,8 +385,6 @@ Specifically:
 - For compound columns, nulls should only be present at the level of the parent column. Child columns should not contain nulls.
 - Slice operations on nested columns do not propagate offsets to child columns.
 
-**TODO: Do we still need to ensure the above with the new nested comparators? Since the nested comparators no longer operate based on flattening struct columns, I suspect that we will no longer need to superimpose parent nulls. We'll need to check that, though.**
-
 libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized data.
 Therefore, the only problem is if users construct input columns that are not correctly sanitized and then pass those into libcudf APIs.
 

From 5b6c90eeae216635205518259d065f166319bce9 Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Date: Tue, 18 Oct 2022 11:32:02 -0700
Subject: [PATCH 6/6] Apply suggestions from code review

---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index 07df46a4021..52c443cd764 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -348,11 +348,13 @@ data, a specialized device view for list columns can be constructed via
 
 # libcudf Policies and Design Principles
 
-`libcudf` is designed to provide GPU-accelerated algorithm primitives for solving a wide variety of problems that arise in data science.
-Our goal is to enable diverse use cases like Spark or Pandas to benefit from the performance of GPUs. 
-As a result, libcudf prioritizes performance and flexibility, which sometimes may come at the cost of convenience.
+`libcudf` is designed to provide thread-safe, single-GPU accelerated algorithm primitives for solving a wide variety of problems that arise in data science.
+APIs are written to execute on the default GPU, which can be controlled by the caller through standard CUDA device APIs or environment variables like `CUDA_VISIBLE_DEVICES`.
+Our goal is to enable diverse use cases like Spark or Pandas to benefit from the performance of GPUs, and libcudf relies on these higher-level layers like Spark or Dask to orchestrate multi-GPU tasks.
+
+To best satisfy these use-cases, libcudf prioritizes performance and flexibility, which sometimes may come at the cost of convenience.
 While we welcome users to use libcudf directly, we design with the expectation that most users will be consuming libcudf through higher-level layers like Spark or cuDF Python that handle some of details that direct users of libcudf must handle on their own.
-We document these policies and the reasons behind them here:
+We document these policies and the reasons behind them here.
 
 ## libcudf does not introspect data