Convenience function to turn an Awkward Array into a NumPy array in any way that it can #336

Open
nikoladze opened this issue Jul 14, 2020 · 5 comments
Labels
feature New feature or request one-hour-fix It can probably be done in an hour

Comments

@nikoladze
Contributor

Currently it seems a bit cumbersome to create a contiguous NumPy array (after padding and filling, e.g. for input into ML models) from records with fields of different numeric types (e.g. int and float, or float and double). I'm looking for behaviour similar to .values or .to_numpy() in pandas:

>>> df = pd.DataFrame({"a" : [1, 2, 3], "b" : [1.1, 2.2, 3.3]})
>>> df.dtypes
a      int64
b    float64
dtype: object
>>> df.to_numpy()
array([[1. , 1.1],
       [2. , 2.2],
       [3. , 3.3]])
>>> df.to_numpy().dtype
dtype('float64')

There are two obstacles when trying this with awkward:

  • When I call ak.fill_none on padded records, the result is a union type that can't be converted to NumPy, e.g.
>>> import awkward1 as ak
>>> array = ak.zip({"a" : [[1, 2], [], [3, 4, 5]], "b" : [[1.1, 2.2], [], [3.3, 4.4, 5.5]]})
>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> padded = ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
>>> padded
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> ak.type(padded)
3 * 2 * union[{"a": int64, "b": float64}, int64]
  • When I have a record array that can be converted to NumPy, the result is a structured NumPy array, which I still have to cast to a consistent dtype for many ML applications.

I believe @nsmith- also ran into this when trying to show the padding and filling features of awkward in his tutorial on NanoEvents yesterday.

I'm not sure how best to implement convenience functions for this, but maybe one could add extra options to ak.fill_none and ak.to_numpy, roughly like the following (one would also have to figure out how to deal with nested records):

def new_fill_none(array, value, cast_value=False, **kwargs):
    if cast_value and len(ak.keys(array)) > 0:
        # having this as a fill value won't result in a union array
        value = {k : value for k in ak.keys(array)}
    return ak.fill_none(array, value, **kwargs)

def new_to_numpy(array, consistent_dtype=None, **kwargs):
    np_array = ak.to_numpy(array, **kwargs)
    if consistent_dtype is not None:
        if len(ak.keys(array)) == 0:
            raise ValueError("Can't use `consistent_dtype` when array has no fields")
        np_array = np_array.astype(
            [(k, consistent_dtype) for k in ak.keys(array)], copy=False
        ).view((consistent_dtype, len(ak.keys(array))))
    return np_array

>>> import awkward1 as ak
>>> array = ak.zip({"a" : [[1, 2], [], [3, 4, 5]], "b" : [[1.1, 2.2], [], [3.3, 4.4, 5.5]]})
>>> new_to_numpy(new_fill_none(ak.pad_none(array, 2, clip=True), 0, cast_value=True), consistent_dtype="float64")
array([[[1. , 1.1],
        [2. , 2.2]],

       [[0. , 0. ],
        [0. , 0. ]],

       [[3. , 3.3],
        [4. , 4.4]]])
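For the second obstacle alone (a structured NumPy array that needs one consistent dtype), NumPy already ships `numpy.lib.recfunctions.structured_to_unstructured`. The following is only a minimal sketch on a hand-built structured array; the field names and values are copied from the pandas example above, and nothing here involves Awkward Array itself.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# A structured array with one int64 field and one float64 field,
# similar to what converting a record array to NumPy produces.
structured = np.array(
    [(1, 1.1), (2, 2.2), (3, 3.3)],
    dtype=[("a", np.int64), ("b", np.float64)],
)

# With no explicit dtype, the fields are promoted to a common dtype
# (float64 here), and the fields become a new trailing axis.
dense = structured_to_unstructured(structured)

# An explicit target dtype can be requested instead.
dense32 = structured_to_unstructured(structured, dtype=np.float32)

print(dense.dtype, dense.shape)
print(dense32.dtype, dense32.shape)
```

This covers the dtype unification step only; the padding and filling of the records still has to happen before the conversion to NumPy.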
@nikoladze nikoladze added the feature New feature or request label Jul 14, 2020
@nsmith-
Contributor

nsmith- commented Jul 14, 2020

Just to piggyback, I feel like ak.pad is a well-deserved function that could combine the arguments of ak.pad_none and ak.fill_none.

@nsmith-
Contributor

nsmith- commented Jul 14, 2020

Isn't the fact that

In [9]: ak.fill_none(ak.pad_none(array.a, 2, clip=True), 0.)
Out[9]: <Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * float64'>

casts the integers in array.a into floats a bug?

@jpivarski
Member

I just took a look at this and I agree that it could be a better interface. But before developing a new function, perhaps I should throw some more ideas into the mix.

The real issue here is that the padding and filling aren't going all the way down to the numeric level: they're applying to the records. That's why we get Nones in the place of the records (and the ? is on the record type, not the numeric fields within the record):

>>> ak.pad_none(array, 2, clip=True)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * ?{"a": int64, "b": fl...'>
>>> ak.pad_none(array, 2, clip=True).tolist()
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}], [None, None], [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

Then when these get filled with zeros, they're zeros in the place of records, which has to be a union.

>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0)
<Array [[{a: 1, b: 1.1}, ... a: 4, b: 4.4}]] type='3 * 2 * union[{"a": int64, "b...'>
>>> ak.fill_none(ak.pad_none(array, 2, clip=True), 0).tolist()
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}], [0, 0], [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

What you really want are zeros in place of the numeric fields, which do unify the array elements with the fill value. To get at the fields individually, we can ak.unzip the records (remembering that breaking and merging records is an O(1) operation that we can do freely).

>>> ak.unzip(array)
(<Array [[1, 2], [], [3, 4, 5]] type='3 * var * int64'>,
 <Array [[1.1, 2.2], [], [3.3, 4.4, 5.5]] type='3 * var * float64'>)

So what we really need to do is apply the padding and filling to each of these arrays. We can do it independently of the number of record fields with a list comprehension,

>>> [ak.fill_none(ak.pad_none(x, 2, clip=True), 0) for x in ak.unzip(array)]
[<Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * int64'>,
 <Array [[1.1, 2.2], [0, 0], [3.3, 4.4]] type='3 * 2 * float64'>]

and then to wrap the whole thing up, we can reverse the unzip with ak.zip.

>>> regularized = ak.zip(dict(zip(
...     ak.keys(array),
...     [ak.fill_none(ak.pad_none(x, 2, clip=True), 0) for x in ak.unzip(array)]
... )))
>>> ak.type(regularized)
3 * 2 * {"a": int64, "b": float64}
>>> ak.to_list(regularized)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
 [{'a': 0, 'b': 0.0}, {'a': 0, 'b': 0.0}],
 [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

Maybe this should have a high-level function? ak.pad_fields?

Combining ak.pad_none and ak.fill_none into a single ak.pad makes sense (the implementation would just combine the operations on the Python side), but this ak.pad_fields is a different thing: it operates at the field level. Perhaps there needs to be ak.pad_fields_none and ak.fill_fields_none as well? No, because ak.fill_fields_none, at least, isn't any different from the ak.fill_none operation (which recursively replaces None values).

>>> only_padded = ak.zip(dict(zip(
...     ak.keys(array), [ak.pad_none(x, 2, clip=True) for x in ak.unzip(array)]
... )))
>>> ak.type(only_padded)
3 * 2 * {"a": ?int64, "b": ?float64}
>>> ak.to_list(only_padded)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
 [{'a': None, 'b': None}, {'a': None, 'b': None}],
 [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]
>>> 
>>> regularized = ak.fill_none(only_padded, 0)
>>> ak.type(regularized)
3 * 2 * {"a": int64, "b": float64}
>>> ak.to_list(regularized)
[[{'a': 1, 'b': 1.1}, {'a': 2, 'b': 2.2}],
 [{'a': 0, 'b': 0.0}, {'a': 0, 'b': 0.0}],
 [{'a': 3, 'b': 3.3}, {'a': 4, 'b': 4.4}]]

So the missing functionality is ak.pad_fields_none (distinct from ak.pad_none's axis parameter because axis is the number of nested list depths, not record depths) and maybe convenience functions that merge ak.pad_none/ak.pad_fields_none and ak.fill_none.

Actually, the ak.pad_none/ak.pad_fields_none thing feels like it ought to be a function parameter. Then the convenience function ak.pad would have that same parameter.

@jpivarski
Member

Isn't the fact that

In [9]: ak.fill_none(ak.pad_none(array.a, 2, clip=True), 0.)
Out[9]: <Array [[1, 2], [0, 0], [3, 4]] type='3 * 2 * float64'>

casts the integers in array.a into floats a bug?

@nsmith- No, that's intentional:

>>> ak.fill_none(ak.Array([1, 2, None, 4]), 3)
<Array [1, 2, 3, 4] type='4 * int64'>
>>> ak.fill_none(ak.Array([1, 2, None, 4]), 3.0)
<Array [1, 2, 3, 4] type='4 * float64'>

What's happening here is that Nones are first replaced by a temporary UnionArray that combines whatever is in the array with whatever the replacement value is: union[int64, int64] and union[int64, float64] in the two cases above. Then we attempt to simplify the temporary UnionArray. Unions of two numeric types can be unified to a numeric type, which is the broadest of the numeric choices: int64 and float64 in the two cases above. It is equivalent to the type unification that NumPy performs when concatenating:

>>> np.concatenate([np.array([1, 2, 3]), np.array([4])])
array([1, 2, 3, 4])
>>> np.concatenate([np.array([1, 2, 3]), np.array([4.0])])
array([1., 2., 3., 4.])

(In fact, ak.concatenate does this through a UnionArray simplify, too. PR #337, which you motivated by finding NumPy dtype bugs, ensures that we now use exactly the same unification rules as NumPy.)

In @nikoladze's case, the UnionArray of records and numbers (zero) could not be simplified.
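The numeric unification rule described in this comment can be checked on the NumPy side with `np.result_type`. This is just an illustrative sketch; the explicit int64 dtypes are chosen here so that the result does not depend on the platform's default integer size.

```python
import numpy as np

# Unifying two integer types keeps the integer type.
int_case = np.result_type(np.int64, np.int64)

# Unifying an integer type with a float type gives the broader float type.
float_case = np.result_type(np.int64, np.float64)

# Concatenation applies the same promotion to the array contents.
ints = np.array([1, 2, 3], dtype=np.int64)
mixed = np.concatenate([ints, np.array([4.0])])

print(int_case, float_case, mixed.dtype)
```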

@jpivarski jpivarski changed the title from "Convenience functions for padding and filling records of mixed numeric types" to "Convenience function to turn an Awkward Array into a NumPy array in anyway that it can" on Dec 12, 2022
@jpivarski
Member

In case you're wondering what all of this is about, I'm going through all of our open issues from oldest to newest to decide what should be done with them, post-2.0.

In this case, @nikoladze's array can be converted to NumPy if you pay attention to all the details of which axis needs to be padded and with which numeric fill value (i.e. don't try to fill missing records with a number). There ought to be a function that makes some reasonable choices (applies standardized rules) to make anything rectilinear, using a given fill value that defaults to 0. Another function argument could choose between clipping to the smallest list length and padding to the longest (the latter being the default).

The point of this is to remember that sometimes, we don't care about structure and don't want to think about it: we just want a NumPy array somehow. This would be a good function to develop with ak.transform; the hardest part might be naming it...
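To make the proposed rules concrete, here is a rough NumPy-only sketch for the simplest case: one level of variable-length lists of numbers. The function name `to_rectilinear` and its parameters are hypothetical; this is not an Awkward Array function and it does not use ak.transform. It only illustrates the default fill value of 0 and the choice between padding to the longest list (the default) and clipping to the shortest.

```python
import numpy as np


def to_rectilinear(lists, fill_value=0, clip=False):
    """Hypothetical helper: turn ragged lists of numbers into a 2D array.

    clip=False pads every list to the longest length with fill_value;
    clip=True truncates every list to the shortest length.
    """
    lengths = [len(x) for x in lists]
    if not lengths:
        width = 0
    else:
        width = min(lengths) if clip else max(lengths)

    # Unify the fill value with the dtypes of the non-empty lists,
    # following NumPy's promotion rules.
    dtypes = [np.asarray(x).dtype for x in lists if len(x) > 0]
    dtype = np.result_type(fill_value, *dtypes)

    out = np.full((len(lists), width), fill_value, dtype=dtype)
    for i, x in enumerate(lists):
        row = x[:width]
        if len(row) > 0:
            out[i, : len(row)] = row
    return out


ragged = [[1, 2], [3], [4, 5, 6]]
padded = to_rectilinear(ragged)              # pad to the longest list
clipped = to_rectilinear(ragged, clip=True)  # clip to the shortest list
print(padded)
print(clipped)
```

Note that clipping to the shortest length yields a zero-width array as soon as one list is empty, which is one reason padding is the more natural default.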

@agoose77 agoose77 added the one-hour-fix It can probably be done in an hour label Feb 1, 2023
Projects
Status: Set aside (don't do)
Development

No branches or pull requests

4 participants