Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve Array initialisation performance #1700

Merged
merged 6 commits into from
Sep 14, 2022

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Sep 14, 2022

@jpivarski
Copy link
Member

Wait—are we sure that

        if where in dir(type(self)):

is equivalent to

        if hasattr(type(self), where):

The record fields are accessible via get-attribute (like my_array.my_physics_column). We want __getattr__, __setattr__, __dir__ to agree on the set of attributes as including these record fields. If they disagree, it gets confusing.

Is this PR breaking that relationship?

@agoose77
Copy link
Collaborator Author

agoose77 commented Sep 14, 2022

Wait—are we sure that

        if where in dir(type(self)):

is equivalent to

        if hasattr(type(self), where):

Well, this is the tricky part. dir is only supposed to show relevant attributes rather than a comprehensive set (although, oftentimes it uses the default impl which is comprehensive). NB it's also the type.__dir__ that we look at here, not self.__dir__

This code allows the user to set existing type-declared fields, or private attributes. Otherwise, they need to be trying to set an existing field name. This seems like the right approach; we want only fields to be immutable, and we don't want fields to shadow existing Array attributes (otherwise third-party consumers of arrays might run into trouble, e.g. if the user adds a fields field)!

So, the specification defined here is:

  • allow users to write to the private attributes (_XXX), which are prohibited by convention for external use
  • allow users to write to public type-level attributes (including attributes that override the ak.Array attributes, e.g. if a user defined their own behavior with def fields). Users should not do this, rather than us trying to validate things?
  • prevent users from writing any other fields

So, we don't go as far as preventing records from containing private-shadowing fields (e.g. "_layout_) or behaviours overriding the base ak.Array type interface.

@jpivarski
Copy link
Member

We were careful to make sure that it doesn't overshadow:

>>> import awkward as ak
>>> array = ak._v2.Array([{"fields": 1, "field": 2}])
>>> array.fields
['fields', 'field']
>>> array.field
<Array [2] type='1 * int64'>

This is in the order of tests in

if where in dir(type(self)):
return super().__getattribute__(where)
else:
if where in self._layout.fields:
try:
return self[where]
except Exception as err:
raise ak._v2._util.error(
AttributeError(
"while trying to get field {}, an exception "
"occurred:\n{}: {}".format(repr(where), type(err), str(err))
)
) from err
else:
raise ak._v2._util.error(AttributeError(f"no field named {where!r}"))

(We're taking advantage of the difference between __getattr__ and __getattribute__, cf here.)

@agoose77
Copy link
Collaborator Author

We were careful to make sure that it doesn't overshadow:

>>> import awkward as ak
>>> array = ak._v2.Array([{"fields": 1, "field": 2}])
>>> array.fields
['fields', 'field']
>>> array.field
<Array [2] type='1 * int64'>

This is in the order of tests in

Yes, indeed. What I mean here is that we don't do any validation at any stage to prevent this, we just implement the get/set behaviour such that users can't break anything. That means in this context, setting an attribute always succeeds if it's the name of an ak.Array-"approved" attribute. This means that users could still confuse themselves if they create a "bad" record-array, but unless we validate early, we just need to use docs for this.

@agoose77 agoose77 force-pushed the agoose77/perf-improve-array-init branch from 9ea085d to c201a1e Compare September 14, 2022 13:59
@jpivarski
Copy link
Member

Okay, we're in agreement about what it's supposed to do. When the dust settles, let's just make sure that all of your bullet points in #1700 (comment) are still satisfied.

@codecov
Copy link

codecov bot commented Sep 14, 2022

Codecov Report

Merging #1700 (241cb2e) into main (e692946) will increase coverage by 0.41%.
The diff coverage is 84.74%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_v2/_connect/jax/__init__.py 90.47% <ø> (+1.58%) ⬆️
src/awkward/_v2/config.py 0.00% <0.00%> (ø)
src/awkward/_v2/operations/ak_broadcast_arrays.py 100.00% <ø> (+4.34%) ⬆️
src/awkward/_v2/operations/ak_transform.py 65.51% <ø> (+56.89%) ⬆️
src/awkward/_v2/types/uniontype.py 84.61% <ø> (ø)
src/awkward/_v2/behaviors/categorical.py 73.13% <60.00%> (-8.94%) ⬇️
src/awkward/_v2/contents/bitmaskedarray.py 71.42% <60.00%> (-0.11%) ⬇️
src/awkward/_v2/contents/bytemaskedarray.py 88.32% <60.00%> (-0.03%) ⬇️
src/awkward/_v2/contents/listarray.py 91.89% <60.00%> (-0.02%) ⬇️
src/awkward/_v2/contents/unmaskedarray.py 70.76% <60.00%> (-0.13%) ⬇️
... and 54 more

So, the specification defined here is:
- allow users to write to the private attributes (`_XXX`), which are prohibited by convention for external use
- allow users to write to public type-level attributes (including attributes that override the `ak.Array` attributes, e.g. if a user defined their own behavior with `def fields`). Users should not do this, rather than us trying to validate things?
- prevent users from writing any other fields
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests check all the cases I was worried about. Someone might want to set

record.x = 10

expecting the 10 to be broadcasted over the array, but the error message is right in saying that they have to

record["x"] = 10

It's also correct that Array and Record subclasses can override these.

@jpivarski
Copy link
Member

So this is an approval; go ahead and merge!

If I don't hear back, I'll "update branch" and "enable auto-merge (squash)", to do extra testing and give about 10 minutes to intercept, if you have something else to add.

@jpivarski jpivarski enabled auto-merge (squash) September 14, 2022 15:29
@jpivarski jpivarski merged commit 40e32dd into main Sep 14, 2022
@jpivarski jpivarski deleted the agoose77/perf-improve-array-init branch September 14, 2022 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants