[DISCUSS] Python ListDtype #5610

Closed
shwina opened this issue Jun 30, 2020 · 7 comments
Labels
feature request New feature or request Python Affects Python cuDF API.


@shwina
Contributor

shwina commented Jun 30, 2020

Edit: Please see #5629 for the initial implementation

As a first step towards enabling list types in cuDF, we need to introduce a ListDtype. What should this look like?

In Pandas, columns of lists have 'object' dtype:

In [14]: a = pd.Series([[1, 2], [1, 2, 3], [4,]])

In [15]: a
Out[15]:
0       [1, 2]
1    [1, 2, 3]
2          [4]
dtype: object

In [16]: a.dtype
Out[16]: dtype('O')

We can readily reject this approach for two reasons:

  1. We currently use 'object' as the dtype for string columns:
In [2]: a = cudf.Series(['a', 'b', 'c'])

In [3]: a.dtype == 'object'
Out[3]: True
  2. List columns in cuDF are homogeneous, in that all the values in a list column are of the same dtype. Our dtype object needs to be able to store that information somewhere.

PyArrow has ListArrays, which are of type ListType:

In [6]: a = pa.ListArray.from_pandas(pd.Series([[1, 2], [1, 2, 3], [1,]]))

In [7]: a
Out[7]:
<pyarrow.lib.ListArray object at 0x7f473f75b830>
[
  [
    1,
    2
  ],
  [
    1,
    2,
    3
  ],
  [
    1
  ]
]

In [8]: a.type
Out[8]: ListType(list<item: int64>)

In [9]: a.type.value_type
Out[9]: DataType(int64)

Something like PyArrow's ListType is much better suited to our needs.


Here's a skeleton ListDtype with some questions inline:

class ListDtype(object): # should we inherit from Pandas ExtensionDtype?
    def __init__(self, value_type):  # should value_type have a default?
        self._value_type = value_type

    @property
    def value_type(self):
        return self._value_type
    
    @property
    def name(self):
        return "list" # should we choose something else?
   
    @classmethod
    def from_arrow(cls, typ):
        pass

    def to_arrow(self):
        pass
    
    def __eq__(self, other):
        pass

    # What other methods, if any, could we define here?
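To make the questions above concrete, here is a minimal sketch of how the skeleton could be filled in. It is not the implementation from #5629. It assumes, purely for illustration, that value_type is either a plain dtype name such as "int64" or another ListDtype (for nested lists), and it leaves out the Arrow conversion methods.

```python
class ListDtype:
    """Illustrative dtype for a column whose rows are lists of one element type."""

    def __init__(self, value_type):
        # Assumption for this sketch: value_type is a dtype name such as
        # "int64", or another ListDtype to describe nested lists.
        self._value_type = value_type

    @property
    def value_type(self):
        return self._value_type

    @property
    def name(self):
        return "list"

    def __eq__(self, other):
        # Two list dtypes are equal only when their element types match.
        if not isinstance(other, ListDtype):
            return NotImplemented
        return self.value_type == other.value_type

    def __hash__(self):
        # Defining __eq__ removes the default __hash__, so provide one.
        return hash((self.name, self.value_type))

    def __repr__(self):
        return f"ListDtype({self.value_type!r})"


ints = ListDtype("int64")
nested = ListDtype(ListDtype("int64"))
```

With this shape, ints == ListDtype("int64") holds, ints == "object" is False, and nested.value_type == ints, so the element type survives in the dtype rather than being erased the way it is with 'object'.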

cc: @brandon-b-miller as you've been thinking about dtypes more generally :)

@kkraus14
Collaborator

  1. We currently use 'object' as the dtype for string columns:

As of pandas 1.0 there is a new string dtype (StringDtype) that we should move our strings to use. Implementation-wise it still uses objects under the hood, but ideally we can move both ourselves internally and users externally away from object dtypes.

@brandon-b-miller
Contributor

I agree that we should probably move away from using object where we can. I feel it's a little misleading in a few ways, especially for people who are used to Python. I will probably have more thoughts on this once I'm a little clearer on how the data is actually arranged in memory in libcudf and how things like strings are handled. Off the top of my head, should a column containing lists that are restricted to being the same length (like points on an N-D grid) be its own datatype? Would that even be useful if it was? (Maybe it could follow more efficient codepaths, with less scanning and checking in this case, etc.)

@kkraus14
Collaborator

I will probably have more thoughts on this once I'm a little clearer on how the data is actually arranged in memory in libcudf and how things like strings are handled.

We follow the Arrow memory layout for Strings / Lists: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout

Off the top of my head, should a column containing lists that are restricted to being the same length (like points on an N-D grid) be its own datatype? Would that even be useful if it was? (Maybe it could follow more efficient codepaths, with less scanning and checking in this case, etc.)

It would definitely open up optimization opportunities and reduce the memory footprint, but that is a future optimization. For now there will always be an offset column (similar to String columns).
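For anyone unfamiliar with the layout, here is a rough pure-Python sketch of the idea behind the Arrow variable-size list format: one flat values buffer plus an offsets buffer with one more entry than there are rows. The function names are made up for illustration and this ignores validity (null) bitmaps entirely; it is not how libcudf implements it.

```python
def to_offsets_and_values(rows):
    """Flatten a list of lists into (offsets, values)."""
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        # Each offset marks where the next row starts in the flat buffer.
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Row i is the slice between two consecutive offsets."""
    return values[offsets[i]:offsets[i + 1]]


def get_fixed_row(values, n, i):
    """With a fixed row length n, no offsets buffer is needed at all."""
    return values[i * n:(i + 1) * n]


offsets, values = to_offsets_and_values([[1, 2], [1, 2, 3], [4]])
```

Here offsets is [0, 2, 5, 6] and values is [1, 2, 1, 2, 3, 4]. The fixed-length case in get_fixed_row shows where the memory saving would come from: the row boundaries are computed from n instead of being stored.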

@quasiben
Member

quasiben commented Jul 6, 2020

@TomAugspurger I thought you might be interested in this issue, where cuDF is building out dtype extensions.

@TomAugspurger
Contributor

Thanks for the ping. I suspect that pandas will grow a dedicated List type based on Arrow at some point.

We'll want to coordinate on what new APIs we'll have specifically for nested data (like Series.explode).

@shwina
Contributor Author

shwina commented Jul 6, 2020

Thanks @TomAugspurger - any feedback you might have on the associated PR would also be appreciated :)

#5628

@kkraus14
Collaborator

Fixed by #5629. If there are any further discussions or concerns, we can follow up in a new GitHub issue.
