[DISCUSS] Python ListDtype #5610
Comments
As of Pandas 1.0 they have a new |
I agree that we should probably move away from using object where we can; I feel it's a little misleading in a few ways, especially for people who are used to Python. I will probably have more thoughts on this once I'm a little clearer on how the data is actually arranged in memory in libcudf and how things like strings are handled. Off the top of my head, should a column containing lists that are restricted to being the same length (like points on an N-D grid) be its own datatype? Would that even be useful if it were? (Maybe it could follow more efficient codepaths, with less scanning and checking in this case, etc.) |
We follow the Arrow memory layout for Strings / Lists: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
A fixed-length list type would definitely offer optimization opportunities and reduce the memory footprint, but that is a future optimization. For now there will always be an offset column (similar to String columns). |
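To make the offset column concrete, here is a minimal pure-Python sketch of the variable-size list layout described in the linked Arrow document: one flat values buffer plus an offsets buffer with one more entry than there are rows. The helper names `to_offsets_and_values` and `get_row` are made up for illustration, and the validity (null) bitmap is left out; this is not how libcudf is implemented.

```python
from typing import List, Sequence, Tuple


def to_offsets_and_values(rows: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Flatten a list of lists into (offsets, values).

    offsets has len(rows) + 1 entries; row i occupies
    values[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    values: List[int] = []
    for row in rows:
        values.extend(row)
        # The running length of the values buffer marks where this row ends.
        offsets.append(len(values))
    return offsets, values


def get_row(offsets: List[int], values: List[int], i: int) -> List[int]:
    """Recover row i by slicing the flat values buffer with two adjacent offsets."""
    return values[offsets[i]:offsets[i + 1]]


rows = [[1, 2], [], [3, 4, 5]]
offsets, values = to_offsets_and_values(rows)
print(offsets)                      # [0, 2, 2, 5]
print(values)                       # [1, 2, 3, 4, 5]
print(get_row(offsets, values, 2))  # [3, 4, 5]
```

If every row had the same length, the offsets would be fully determined by that length, which is where the memory saving mentioned above would come from.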
@TomAugspurger I thought you might be interested in this issue, where cuDF is building out dtype extensions |
Thanks for the ping. I suspect that pandas will grow a dedicated List type based on Arrow at some point. We'll want to coordinate on what new APIs we'll have specifically for nested data (like |
Thanks @TomAugspurger - any feedback you might have on the associated PR would also be appreciated :) |
Fixed by #5629. If there's any further discussion or concerns, we can follow up in a new GitHub issue. |
Edit: Please see #5629 for the initial implementation
As a first step towards enabling list types in cudf, we need to introduce a ListDtype in cuDF. What should this look like?
In Pandas, columns of lists have 'object' dtype:
We can readily reject this approach for two reasons:
'object' as the dtype for string columns:
dtype object needs to be able to store that information somewhere.
PyArrow has ListArrays, which are of type ListType:
Something like PyArrow's ListType is much better suited to our needs.
Here's a skeleton ListDtype with some questions inline:
cc: @brandon-b-miller as you've been thinking about dtypes more generally :)
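For illustration only, a dtype in the spirit of PyArrow's ListType could record its element type, which may itself be a list type, so that nesting information has somewhere to live. The class name `ListDtypeSketch` and its attributes `element_type`, `leaf_type` and `depth` are hypothetical and written in plain Python; this is not cuDF's implementation (see #5629 for that).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ListDtypeSketch:
    """Hypothetical list dtype: stores the type of its elements.

    element_type is either a scalar type name (e.g. "int64") or another
    ListDtypeSketch, which is how nested lists are described.
    """

    element_type: Union[str, "ListDtypeSketch"]

    @property
    def leaf_type(self) -> str:
        """Innermost scalar type, found by unwrapping nested list dtypes."""
        t = self.element_type
        while isinstance(t, ListDtypeSketch):
            t = t.element_type
        return t

    @property
    def depth(self) -> int:
        """Nesting depth: 1 for a list of scalars, 2 for a list of lists, ..."""
        d = 1
        t = self.element_type
        while isinstance(t, ListDtypeSketch):
            d += 1
            t = t.element_type
        return d

    def __str__(self) -> str:
        return f"list<{self.element_type}>"


flat = ListDtypeSketch("int64")
nested = ListDtypeSketch(ListDtypeSketch("float64"))
print(flat)              # list<int64>
print(nested)            # list<list<float64>>
print(nested.leaf_type)  # float64
print(nested.depth)      # 2
```

Unlike a bare 'object' dtype, two such dtypes compare equal only when their element types match, so a list-of-ints column and a list-of-floats column remain distinguishable.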