[DISCUSS] Python ListDtype #5610

shwina · 2020-06-30T16:41:13Z

Edit: Please see #5629 for the initial implementation

As a first step towards enabling list types in cuDF, we need to introduce a ListDtype. What should this look like?

In Pandas, columns of lists have 'object' dtype:

In [14]: a = pd.Series([[1, 2], [1, 2, 3], [4,]])

In [15]: a
Out[15]:
0       [1, 2]
1    [1, 2, 3]
2          [4]
dtype: object

In [16]: a.dtype
Out[16]: dtype('O')

We can readily reject this approach for two reasons:

We currently use 'object' as the dtype for string columns:

In [2]: a = cudf.Series(['a', 'b', 'c'])

In [3]: a.dtype == 'object'
Out[3]: True

List columns in cuDF are homogeneous, in that all the values in a list column are of the same dtype. Our dtype object needs to be able to store that information somewhere.

PyArrow has ListArrays, which are of type ListType:

In [6]: a = pa.ListArray.from_pandas(pd.Series([[1, 2], [1, 2, 3], [1,]]))

In [7]: a
Out[7]:
<pyarrow.lib.ListArray object at 0x7f473f75b830>
[
  [
    1,
    2
  ],
  [
    1,
    2,
    3
  ],
  [
    1
  ]
]

In [8]: a.type
Out[8]: ListType(list<item: int64>)

In [9]: a.type.value_type
Out[9]: DataType(int64)

Something like PyArrow's ListType is much better suited to our needs.

Here's a skeleton ListDtype with some questions inline:

class ListDtype(object):  # should we inherit from Pandas ExtensionDtype?
    def __init__(self, value_type):  # should value_type have a default?
        self._value_type = value_type

    @property
    def value_type(self):
        return self._value_type

    @property
    def name(self):
        return "list"  # should we choose something else?

    @classmethod
    def from_arrow(cls, typ):
        pass

    def to_arrow(self):
        pass

    def __eq__(self, other):
        pass

    # What other methods, if any, could we define here?
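For illustration only, here is one way the skeleton above could be filled in. This is a sketch under stated assumptions, not the implementation that landed in cuDF: it assumes value_type is either a type name string such as "int64" or another ListDtype (for nested lists), and the `__hash__` and `__repr__` methods are additions that the skeleton does not mention. PyArrow is imported lazily, so the class itself can be used without it. The Arrow round trip only covers types whose string form is a valid PyArrow alias.

```python
class ListDtype:
    """Sketch of a dtype for homogeneous list columns (assumptions noted above)."""

    def __init__(self, value_type):
        # Assumption: a type name such as "int64", or a nested ListDtype.
        self._value_type = value_type

    @property
    def value_type(self):
        return self._value_type

    @property
    def name(self):
        return "list"

    @classmethod
    def from_arrow(cls, typ):
        import pyarrow as pa  # lazy import: only needed for Arrow conversion

        if not pa.types.is_list(typ):
            raise TypeError(f"expected an Arrow list type, got {typ}")
        child = typ.value_type
        if pa.types.is_list(child):
            return cls(cls.from_arrow(child))
        # str(pa.int64()) is "int64"; parameterized types are not handled here.
        return cls(str(child))

    def to_arrow(self):
        import pyarrow as pa  # lazy import: only needed for Arrow conversion

        if isinstance(self._value_type, ListDtype):
            return pa.list_(self._value_type.to_arrow())
        return pa.list_(pa.type_for_alias(self._value_type))

    def __eq__(self, other):
        # Two list dtypes are equal only if their element types are equal.
        if not isinstance(other, ListDtype):
            return NotImplemented
        return self._value_type == other._value_type

    def __hash__(self):
        # Defined alongside __eq__ so instances stay hashable.
        return hash(("list", self._value_type))

    def __repr__(self):
        return f"ListDtype({self._value_type!r})"


ints = ListDtype("int64")
nested = ListDtype(ListDtype("int64"))
```

Because equality is defined on the element type, `ListDtype("int64")` and `ListDtype("float64")` compare unequal, which is exactly the information a plain 'object' dtype cannot carry.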

cc: @brandon-b-miller as you've been thinking about dtypes more generally :)


kkraus14 · 2020-06-30T16:47:09Z

> We currently use 'object' as the dtype for string columns:

As of Pandas 1.0 there is a new string dtype (StringDtype) that we should move our strings to use. Implementation-wise it still uses objects under the hood, but ideally we can move both ourselves internally and users externally away from using object dtypes.

brandon-b-miller · 2020-06-30T18:08:21Z

I agree that we should probably move away from using object where we can; I feel it's a little misleading in a few ways, especially for people who are used to Python. I will probably have more thoughts on this once I'm a little clearer on how the data is actually arranged in memory in libcudf and how things like strings are handled. Off the top of my head, should a column containing lists that are restricted to being the same length (like points on an N-D grid) be its own datatype? Would that even be useful if it were? (Maybe it could follow more efficient codepaths, with less scanning and checking in this case, etc.)

kkraus14 · 2020-06-30T18:15:00Z

> I will probably have more thoughts on this once I'm a little clearer on how the data is actually arranged in memory in libcudf and how things like strings are handled.

We follow the Arrow memory layout for Strings / Lists: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
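As a rough illustration of the variable-size list layout linked above, the sketch below flattens a list of lists into a single values buffer plus an offsets buffer, where row i spans values[offsets[i]:offsets[i + 1]]. It uses plain Python lists in place of device buffers and leaves out the validity (null) bitmap; the function names are hypothetical and are not libcudf or Arrow APIs.

```python
def to_offsets_layout(rows):
    """Flatten a list of lists into (offsets, values)."""
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        # Each offset records where the next row starts in the values buffer.
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Recover row i from the flattened representation."""
    return values[offsets[i]:offsets[i + 1]]


offsets, values = to_offsets_layout([[1, 2], [1, 2, 3], [4]])
print(offsets)  # [0, 2, 5, 6]
print(values)   # [1, 2, 1, 2, 3, 4]
```

A column of n rows therefore carries n + 1 offsets, which is the same shape of bookkeeping a string column needs for its character data.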

> Off the top of my head, should a column containing lists that are restricted to being the same length (like points on an N-D grid) be its own datatype? Would that even be useful if it was? (maybe following more efficient codepaths/less scanning and checking in this case, etc)

That would definitely offer optimization opportunities and reduce the memory footprint, but it is a future optimization. For now there will always be an offsets column (similar to string columns).
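To make the potential saving concrete: when every list has the same length, the start of row i can be computed as i * list_size, so no offsets need to be stored. The sketch below shows that arithmetic on plain Python lists. It is only an illustration of the idea (Arrow describes a comparable fixed-size list layout); the names `get_fixed_row` and `list_size` are hypothetical and nothing here reflects what cuDF does today.

```python
def get_fixed_row(values, list_size, i):
    """Recover row i when every row has exactly list_size elements."""
    start = i * list_size  # computed, not looked up in an offsets buffer
    return values[start:start + list_size]


# Three 2-D points stored back to back in one flat buffer.
points = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
flat = [coord for point in points for coord in point]

print(get_fixed_row(flat, 2, 2))  # [4.0, 5.0]
```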

quasiben · 2020-07-06T19:37:25Z

@TomAugspurger I thought you might be interested in this issue, where cuDF is building out dtype extensions.

TomAugspurger · 2020-07-06T19:47:34Z

Thanks for the ping. I suspect that pandas will grow a dedicated List type based on Arrow at some point.

We'll want to coordinate on what new APIs we'll have specifically for nested data (like Series.explode).
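For readers unfamiliar with explode, the sketch below mimics its row-wise semantics on an offsets-plus-values representation (row i is values[offsets[i]:offsets[i + 1]]): each list element becomes its own output row tagged with the original row index, and an empty list yields a single missing value, as pandas does. This is a plain-Python illustration with a hypothetical function name, not a proposal for the cuDF or pandas API.

```python
def explode(offsets, values):
    """Return (row_index, value) pairs, one per list element."""
    out = []
    for i in range(len(offsets) - 1):
        row = values[offsets[i]:offsets[i + 1]]
        if not row:
            # An empty list still produces one row, holding a missing value.
            out.append((i, None))
        for value in row:
            out.append((i, value))
    return out


# Rows: [1, 2], [], [4]
print(explode([0, 2, 2, 3], [1, 2, 4]))
# [(0, 1), (0, 2), (1, None), (2, 4)]
```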

shwina · 2020-07-06T20:52:04Z

Thanks @TomAugspurger - any feedback you might have on the associated PR would also be appreciated :)

#5628

kkraus14 · 2020-08-27T18:36:46Z

Fixed by #5629. If there are any further discussions or concerns, we can follow up in a new GitHub issue.

shwina added the "feature request", "Needs Triage", and "Python" labels and removed the "Needs Triage" label on Jun 30, 2020

shwina mentioned this issue Jul 2, 2020

[WIP] Add ListDtype #5628

Closed

TomAugspurger mentioned this issue Jul 8, 2020

ENH: ListDtype / ListArray pandas-dev/pandas#35176

Closed

shwina mentioned this issue Jul 27, 2020

Reusable Numba extension for CUDA target? scikit-hep/awkward#359

Closed

kkraus14 closed this as completed Aug 27, 2020
