Convenience function to turn an Awkward Array into a NumPy array in any way that it can #336
Comments
Just to piggyback, I feel like |
Isn't the fact that
In [9]: ak.fill_none(ak.pad_none(array.a, 2, clip=True), 0.)
Out[9]: <Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * float64'>
casts the integers in |
I just took a look at this and I agree that it could be a better interface. But before developing a new function, perhaps I should throw some more ideas into the mix.

The real issue here is that the padding and filling aren't going all the way down to the numeric level: they're applying to the records. That's why we get Nones in the place of the records (and the

>>> ak.pad_none(array, 2, clip=True)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * ?{"a": int64, "b": fl...'>
>>> ak.pad_none(array, 2, clip=True).tolist()
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}], [None, None], [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

Then when these get filled with zeros, they're zeros in the place of records, which has to be a union.

>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0).tolist()
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}], [0, 0], [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

What you really want are zeros in place of the numeric fields, which do unify the array elements with the fill value. To get at the fields individually, we can ak.unzip the records (remembering that breaking and merging records is an O(1) operation that we can do freely).

>>> ak.unzip(array)
(<Array [[1, 2], [], [3, 4, 5]] type='3 * var * int64'>,
<Array [[1.1, 2.2], [], [3.3, 4.4, 5.5]] type='3 * var * float64'>)

So what we really need to do is apply the padding and filling to each of these arrays. We can do it independently of the number of record fields with a list comprehension,

>>> [ak.fill_none(ak.pad_none(x, 2, clip=True), 0) for x in ak.unzip(array)]
[<Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * int64'>,
<Array [[1.1, 2.2], [0, 0], [3.3, 4.4]] type='3 * 2 * float64'>]

and then to wrap the whole thing up, we can reverse the unzip with ak.zip.

>>> regularized = ak.zip(dict(zip(
... ak.keys(array),
... [ak.fill_none(ak.pad_none(x, 2, clip=True), 0) for x in ak.unzip(array)]
... )))
>>> ak.type(regularized)
3 * 2 * {"a": int64, "b": float64}
>>> ak.to_list(regularized)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
[{'a': 0, 'b': 0.0}, {'a': 0, 'b': 0.0}],
[{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

Maybe this should have a high-level function? Combining

>>> only_padded = ak.zip(dict(zip(
... ak.keys(array), [ak.pad_none(x, 2, clip=True) for x in ak.unzip(array)]
... )))
>>> ak.type(only_padded)
3 * 2 * {"a": ?int64, "b": ?float64}
>>> ak.to_list(only_padded)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
[{'a': None, 'b': None}, {'a': None, 'b': None}],
[{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]
>>>
>>> regularized = ak.fill_none(only_padded, 0)
>>> ak.type(regularized)
3 * 2 * {"a": int64, "b": float64}
>>> ak.to_list(regularized)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
[{'a': 0, 'b': 0.0}, {'a': 0, 'b': 0.0}],
[{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

So the missing functionality is Actually, the |
@nsmith- No, that's intentional:

>>> ak.fill_none(ak.Array([1, 2, None, 4]), 3)
<Array [1, 2, 3, 4] type='4 * int64'>
>>> ak.fill_none(ak.Array([1, 2, None, 4]), 3.0)
<Array [1, 2, 3, 4] type='4 * float64'>

What's happening here is that Nones are first replaced by a temporary UnionArray that combines whatever is in the array with whatever the replacement value is:

>>> np.concatenate([np.array([1, 2, 3]), np.array([4])])
array([1, 2, 3, 4])
>>> np.concatenate([np.array([1, 2, 3]), np.array([4.0])])
array([1., 2., 3., 4.])

(In fact, ak.concatenate calls does this through a UnionArray In @nikoladze's case, the UnionArray of records and numbers (zero) could not be |
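The NumPy analogy in the comment above comes down to dtype promotion: combining integers with a float value yields a float result. The sketch below only shows NumPy's own promotion rules with explicitly chosen dtypes; it makes no claim about how Awkward Array implements fill_none internally.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)

# Integer data combined with an integer value stays integer.
same = np.concatenate([ints, np.array([4], dtype=np.int64)])

# Integer data combined with a float value is promoted to float.
mixed = np.concatenate([ints, np.array([4.0], dtype=np.float64)])

# np.result_type reports the promoted dtype without building an array.
promoted = np.result_type(np.int64, np.float64)

print(same.dtype, mixed.dtype, promoted)
```

This is why filling an integer field with 0 keeps it an integer, while filling it with 0. turns it into a float.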
In case you're wondering what all of this is about, I'm going through all of our open issues from oldest to newest to decide what should be done with them, post-2.0. In this case, @nikoladze's array can be converted to NumPy if you pay attention to all the details of which The point of this is to remember that sometimes, we don't care about structure and don't want to think about it: we just want a NumPy array somehow. This would be a good function to develop with |
Currently it seems a bit cumbersome to create a contiguous NumPy array (after padding and filling, e.g. for input into ML models) from records with fields of different numeric types (e.g. int and float, or float and double). I'm looking for behaviour similar to .values or .to_numpy() in pandas:

There are two obstacles when trying this with awkward:
ak.fill_none
this will result in a union type that can't be converted to numpy, e.g.

I believe @nsmith- also ran into this when trying to show the padding and filling features of awkward in his tutorial on NanoEvents yesterday.
Not sure how best to implement convenience functions for this, but maybe one could add extra options to ak.fill_none and ak.to_numpy, roughly like the following (+ figure out how to deal with nested records)