Add protocol API to Sphinx doc
rgommers committed Aug 24, 2021
1 parent 6cc4401 commit eead53a
Showing 4 changed files with 76 additions and 68 deletions.
72 changes: 72 additions & 0 deletions protocol/API.md
@@ -0,0 +1,72 @@
# API of the `__dataframe__` protocol

Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the `__dataframe__` method on a library's
data frame object.

For guiding requirements, see {ref}`design-requirements`.


## Concepts in this design

1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
only thing that actually maps to a 1-D array in the sense that it could be
converted to NumPy, CuPy, et al.
2. A `Column` class. A *column* has a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is again modeled as a `Column` instance, and
contains one data *buffer* and (optionally) one *mask* for missing data.
3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
which are identified with names that are unique strings. All the data
frame's rows are the same length. It can consist of multiple *chunks*. A
single chunk of a data frame is again modeled as a `DataFrame` instance.
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
to a *data frame* or a *column*.

Note that the only way to access these objects is through a call to
`__dataframe__` on a data frame object. This is NOT meant as public API;
the classes here exist only to describe the API of what is returned by a
call to `__dataframe__`. They are the concepts needed to capture the memory
layout and data access of a data frame.
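
As a rough illustration of how these concepts nest, the sketch below models them with plain Python classes. It is not the protocol specification: the method names (`get_chunks`, `get_buffers`, `column_names`, `get_column_by_name`), the `bufsize` property, and the use of `bytes` as a stand-in for memory are assumptions chosen for this example.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class Buffer:
    """A contiguous block of memory; a bytes object stands in for it here."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    @property
    def bufsize(self) -> int:
        return len(self.data)


class Column:
    """A single-dtype column made of one or more chunks."""

    def __init__(self, dtype: str,
                 chunks: List[Tuple[Buffer, Optional[Buffer]]]) -> None:
        self.dtype = dtype
        self._chunks = chunks  # one (data, mask) pair per chunk

    def num_chunks(self) -> int:
        return len(self._chunks)

    def get_chunks(self) -> Iterator["Column"]:
        # Each chunk is modeled as a Column again, now with num_chunks() == 1.
        for pair in self._chunks:
            yield Column(self.dtype, [pair])

    def get_buffers(self) -> Dict[str, Optional[Buffer]]:
        # Only a single-chunk column maps directly onto buffers.
        if self.num_chunks() != 1:
            raise ValueError("get_buffers() needs a single-chunk column")
        data, mask = self._chunks[0]
        return {"data": data, "mask": mask}


class DataFrame:
    """An ordered collection of columns keyed by unique string names."""

    def __init__(self, columns: Dict[str, Column]) -> None:
        self._columns = dict(columns)  # dict keeps insertion order

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> Column:
        return self._columns[name]


# A two-chunk int8 column; only the second chunk carries a mask buffer.
col = Column("int8", [(Buffer(b"\x01\x02"), None),
                      (Buffer(b"\x03\x04"), Buffer(b"\x01\x00"))])
df = DataFrame({"a": col})
first, second = df.get_column_by_name("a").get_chunks()
```

Here a chunk of a column is itself a `Column`, and buffers are reachable only from a single-chunk column, which mirrors concepts 2 and 4 above. Chunking of the data frame itself is left out to keep the sketch short.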


## Design decisions

1. Use a separate column abstraction in addition to a dataframe interface.

Rationales:

- This is how it works in R, Julia and Apache Arrow.
- Semantically, most existing applications and users treat a column similarly to a 1-D array.
- We should be able to connect a column to the array data interchange mechanism(s).

Note that this does not imply a library must have such a public user-facing
abstraction (e.g. ``pandas.Series``) - the column may be accessible only via
``__dataframe__``.

2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory.

This is better for implementations that may rely on, for example, lazy
computation.

3. No row names. If a library uses row names, use a regular column for them.

See discussion at
[wesm/dataframe-protocol/pull/1](https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241).
Optional row names are not a good idea, because people will assume they're
present (see the cuDF experience: it was forced to add them because pandas
has them). Requiring row names seems worse than leaving them out. Note that
row labels could be added in the future - right now there are no clear
requirements for more complex row labels that cannot be represented by a
single column. These do exist, for example Modin has table and tree-based
row labels.
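
Decisions 2 and 3 can be illustrated together with a minimal sketch. Everything in it is hypothetical: `LabeledFrame` stands in for a library data frame that has row labels, `InterchangeView` for the opaque object it hands out, and the `row_label` column name and the argument-free `__dataframe__` signature are simplifications made for this example.

```python
from typing import Callable, Dict, List


class InterchangeView:
    """Opaque object: consumers call methods instead of reading dict keys."""

    def __init__(self, producers: Dict[str, Callable[[], List[object]]]) -> None:
        # Each column is a zero-argument callable, so nothing is computed
        # until a consumer actually asks for that column.
        self._producers = producers

    def column_names(self) -> List[str]:
        return list(self._producers)

    def get_column_by_name(self, name: str) -> List[object]:
        return self._producers[name]()


class LabeledFrame:
    """Hypothetical library frame that keeps row labels next to its data."""

    def __init__(self, row_labels: List[str],
                 data: Dict[str, List[object]]) -> None:
        self.row_labels = row_labels
        self.data = data

    def __dataframe__(self) -> InterchangeView:
        # Row labels are exposed as a regular column, not as row names.
        producers = {"row_label": lambda: list(self.row_labels)}
        for name, values in self.data.items():
            # Bind values as a default argument to avoid late-binding bugs.
            producers[name] = lambda values=values: list(values)
        return InterchangeView(producers)


frame = LabeledFrame(["x", "y"], {"price": [1.5, 2.5]})
view = frame.__dataframe__()
```

Because the consumer only sees methods, the producer is free to defer work until `get_column_by_name` is called, which a nested dictionary describing memory up front would not allow.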

## Interface



```{literalinclude} dataframe_protocol.py
---
language: python
---
```
67 changes: 0 additions & 67 deletions protocol/dataframe_protocol.py
@@ -1,70 +1,3 @@
"""
Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the ``__dataframe__`` method on a library's
data frame object.
For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35
Concepts in this design
-----------------------
1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
only thing that actually maps to a 1-D array in a sense that it could be
converted to NumPy, CuPy, et al.
2. A `Column` class. A *column* has a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
contains 1 data *buffer* and (optionally) one *mask* for missing data.
3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
which are identified with names that are unique strings. All the data
frame's rows are the same length. It can consist of multiple *chunks*. A
single chunk of a data frame is modeled as again a `DataFrame` instance.
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
to a *data frame* or a *column*.
Note that the only way to access these objects is through a call to
``__dataframe__`` on a data frame object. This is NOT meant as public API;
only think of instances of the different classes here to describe the API of
what is returned by a call to ``__dataframe__``. They are the concepts needed
to capture the memory layout and data access of a data frame.
Design decisions
----------------
**1. Use a separate column abstraction in addition to a dataframe interface.**
Rationales:
- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)
Note that this does not imply a library must have such a public user-facing
abstraction (ex. ``pandas.Series``) - it can only be accessed via ``__dataframe__``.
**2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory**
This is better for implementations that may rely on, for example, lazy
computation.
**3. No row names. If a library uses row names, use a regular column for them.**
See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
Optional row names are not a good idea, because people will assume they're present
(see cuDF experience, forced to add because pandas has them).
Requiring row names seems worse than leaving them out.
Note that row labels could be added in the future - right now there's no clear
requirements for more complex row labels that cannot be represented by a single
column. These do exist, for example Modin has table and tree-based row
labels.
"""


class Buffer:
"""
Data in the buffer is guaranteed to be contiguous in memory.
4 changes: 3 additions & 1 deletion protocol/design_requirements.md
@@ -1,4 +1,4 @@
# The `__dataframe__` protocol
# Design concepts and requirements

This document aims to describe the design requirements and principles of the
dataframe interchange protocol, and the functionality it needs to support.
@@ -20,6 +20,8 @@ A column or a dataframe can be "chunked"; a **chunk** is a subset of a column
or dataframe that contains a set of (neighboring) rows.


(design-requirements)=

## Protocol design requirements

1. Must be a standard Python-level API that is unambiguously specified, and
1 change: 1 addition & 0 deletions protocol/index.rst
@@ -10,4 +10,5 @@ Contents

purpose_and_scope
design_requirements
API
