Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rejection of invalid layouts. #1333

Closed
agoose77 opened this issue Mar 2, 2022 · 4 comments
Closed

Rejection of invalid layouts. #1333

agoose77 opened this issue Mar 2, 2022 · 4 comments
Labels
feature New feature or request policy Choice of behavior

Comments

@agoose77
Copy link
Collaborator

agoose77 commented Mar 2, 2022

In #1308 there was a conversation about the policy for accepting redundant (unsimplified) layouts. There have been at least two cases (#713, #1308 discussion) where non-simplified layouts produce strange results when passed to high level functions, and until now I've raised these as bugs.

After the discussion with @jpivarski in #1308 I've realised that this is a policy issue, as Awkward functions themselves do/should not produce such layouts. Accepting redundant layouts is transparent in some functions (like ak.unflatten and ak.packed), but not those where the high-level function needs to match a particular layout structure. Such functions are made far simpler by the requirement of simplified layouts.

The take-away is:

  • Awkward should not produce non-simplified Arrays
  • Users of the mid-layer should check that they don't produce non-simplified Arrays

To support external users of the mid layer, I think it would be helpful to catch cases where this invariant fails; bugs in the Awkward code-base will more easily found than those in external users' code (because we will have a greater number of users overall).

@agoose77 agoose77 added feature New feature or request policy Choice of behavior labels Mar 2, 2022
@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 2, 2022

We discussed the idea of a per-layout flag that determines whether a layout is in its simplest form. This would add a reasonable amount of code to check something that internally should never happen.

A less robust approach would be to tackle #516 at the same time and create a mid-level API that resembles ak._util, e.g.

def wrap(layout, behavior):
    assert_simplified(layout)
    return ak._v2._util.wrap(layout, behavior)

...

The issue with this approach is that it only checks for validity at wrap time, and doesn't remember layouts it has visited before. Users that pass a layout to ak.XXX functions with highlevel=True would then receive Arrays that have not been checked.

The more I think about it, the less confident I am that we want to go to all this trouble. We don't enforce check_valid by default, and that is the same class of error. Maybe we should include a simplification check when check_valid is set to True, and call it a day. Perhaps this is just a documentation problem?

I appreciate that I've gone round in a few circles here, thanks for coming for the ride!

@jpivarski
Copy link
Member

The more I think about it, the less confident I am that we want to go to all this trouble. We don't enforce check_valid by default, and that is the same class of error. Maybe we should include a simplification check when check_valid is set to True, and call it a day.

At least in the two Content subclasses I checked yesterday, the check_valid implementation also checks for this redundancy. So if a user is concerned about the validity of a layout, this option will fix it.

All in all, there are three ways for a layout to be invalid:

  1. Invalid metadata: property types, property value ranges—any state depending on the Content instance itself that is not in any arrays (Indexes or NumpyArray data). Examples: BitMaskedArray lsb_order must be boolean; RegularArray size must be non-negative.
  2. Invalid data: values in Indexes that are out of range, inconsistent with each other, or inconsistent with len(content). Examples: IndexedArray index with values that are negative or greater than or equal to len(content); ListArray stops[i] < starts[i].
  3. Invalid nesting: any two directly nested {IndexedArray, IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray} or directly nested UnionArray or a UnionArray containing mergeable contents.

All of this discussion is about (3), which could be caught in O(1) time with O(n) flags where n is the number of nodes in a tree (not very large).

However, just as a mid-level user (someone going into the ak.layout.* module) can make a layout that is invalid in way (3), they can make a layout that is invalid in way (2). Making an ak.Array with check_valid=True checks for types (2) and (3) errors. This is opt-in because checking for type (2) is O(N) where N is the number of elements in the array (more expensive).

Type (1) errors are always checked by Content subclass constructors, so it's never possible to make one without knowing about the error. This checking is O(1) in time and memory use.

So it comes down to how much we care about this: we'd be able to check for type (3) errors without requiring user opt-in at the cost of adding O(n) flags (not a big cost). But the layouts could still have type (2) errors. The current policy is that if an operation relies on some type (2) error not being wrong, it has to check for those errors just before using them: for example, checking stops[i] >= starts[i] inside a kernel just before using starts[i] or stops[i], and it should check for a type (3) error just before using it, too. For instance, if an operation needs to assume that two option-type nodes are not nested, it can do the O(1) check for the content directly inside the option-type node in question and raise an error if it is. It does not need to "deal" with it as though it were a valid state.

This PR is about a change in policy to detect type (3) errors more proactively: to raise an error when constructing the ak.Array in a way that users don't have to opt into, because the cost is small and there may be erroneous cases in the wild. There's a positive value to that—it would be better to have this than not have this—but I don't see it as a high priority. If you have other, more pressing items, I'd put this on pause in favor of those.

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 2, 2022

At least in the two Content subclasses I checked yesterday, the check_valid implementation also checks for this redundancy. So if a user is concerned about the validity of a layout, this option will fix it.

(:facepalm:)

You're completely right. I don't know what I was looking at yesterday.

Yes, this issue is about if/when to check (3). The easiest solution is to check this when constructing Array vs doing this in the layout constructor. Whilst users could then have erroneous layouts, they will nearly always end up converting them to Array at some point and then discover the error.

So, this is a wishlist item for when we get around to splitting the structure checking into its own routine.

@agoose77
Copy link
Collaborator Author

I think @jpivarski work means that we can close this! We now have an explicit separation between simplified and __init__ that signals intent, and most checks are lightweight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request policy Choice of behavior
Projects
None yet
Development

No branches or pull requests

2 participants