Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support using @dataclass for engine params #8181

Merged
merged 8 commits into from
Oct 20, 2019

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Aug 17, 2019

Problem

A first step to fixing #7074. We would like to be able to make use of python 3 @dataclasses, partially because they are the canonical way to make typed tuples in python now, and partially because it allows us to use mypy to resolve type errors at "compile" time instead of performing type checks at runtime when the tuple is constructed, which is what we do now, e.g.:

field_value = getattr(this_object, field_name)
try:
field_constraint.validate_satisfied_by(field_value)
except TypeConstraintError as e:
type_failure_msgs.append(
f"field '{field_name}' was invalid: {e}")

Solution

  • Edit the engine README.md to demonstrate how to use @dataclass and remove mentions of the old datatype().
  • Add an integration test to verify that mypy will correctly detect type errors when constructing @dataclass objects.

Result

Users are able to use a very canonical @dataclass declaration syntax for engine params which is statically type-checked with mypy!

Copy link
Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EPIC. I love it!

Exactly(FieldType))`). The bare type name (e.g. `FieldType`) can also be used as a shorthand for
`Exactly(FieldType)`. If the tuple form is used, the constructor will create your object, then raise
an error if the field value does not satisfy the type constraint.
[`DatatypeMixin`](../util.objects.py) is intended to decouple the *declaration* of engine `Param` types from their *usage*. In particular, `DatatypeMixin` declares a few `@abstractmethod`s such as `get_type_dict()`, which it uses to implement several generic methods, including `__str__()`, `__repr__()`, and `copy()` (which returns a new instance of the same class, with one or more fields modified).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be feasible to get rid of DatatypeMixin and @dataclasstype at some point? It's a bit of a weird appendage. Any special methods on a datatype that the engine requires it can presumably implement as functions (e.g., copy(dataclass_instance, **kwargs) instead of dataclass_instance.copy(**kwargs)).

It seems best if the datatypes have little or no API at all, other than their declared fields. And being able to say "your pants products can be any frozen dataclass, with no extra boilerplate or syntax" is pretty hot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other conditions described in the Requirements for an Engine Param are pretty important: must have useful eq and a useful hash. If there is a way to guarantee that on a dataclass, then maybe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All dataclasses get a useful __eq__ by default (using, effectively, tuple comparison of fields), unless you set eq=False. And they all get a hash if frozen=True and eq=True.

True, both of these being meaningful depend on the types of the fields, but we would still be validating those, so we can ensure that.

It would be nice if our @datatype was just an alias for @dataclass(frozen=True) (with eq=True by default), as long as we can use the latter as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if our @datatype was just an alias for @DataClass(frozen=True) (with eq=True by default), as long as we can use the latter as well.

I like this! I think just exposing a single uniform @datatype annotation sounds great -- I'll use methods in pants.base.deprecated and introduce a temporary annotation named @datatype_wrapper or something to phase out the existing datatype.

Any special methods on a datatype that the engine requires it can presumably implement as functions (e.g., copy(dataclass_instance, **kwargs) instead of dataclass_instance.copy(**kwargs)).

I hadn't thought of this and I think this is a great solution that accomplishes the goals I was looking for! Thank you!

It seems best if the datatypes have little or no API at all, other than their declared fields. And being able to say "your pants products can be any frozen dataclass, with no extra boilerplate or syntax" is pretty hot.

I agree in the extreme! I will work to implement this now!

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can @dataclasstype imply @dataclass(frozen=True)? The boilerplate of the two annotations is a bummer.

Exactly(FieldType))`). The bare type name (e.g. `FieldType`) can also be used as a shorthand for
`Exactly(FieldType)`. If the tuple form is used, the constructor will create your object, then raise
an error if the field value does not satisfy the type constraint.
[`DatatypeMixin`](../util.objects.py) is intended to decouple the *declaration* of engine `Param` types from their *usage*. In particular, `DatatypeMixin` declares a few `@abstractmethod`s such as `get_type_dict()`, which it uses to implement several generic methods, including `__str__()`, `__repr__()`, and `copy()` (which returns a new instance of the same class, with one or more fields modified).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other conditions described in the Requirements for an Engine Param are pretty important: must have useful eq and a useful hash. If there is a way to guarantee that on a dataclass, then maybe.

@cosmicexplorer cosmicexplorer changed the title introduce @dataclasstype to wrap @dataclasses with datatype() methods support using @dataclass for engine params Aug 27, 2019
@cosmicexplorer
Copy link
Contributor Author

Can @dataclasstype imply @dataclass(frozen=True)? The boilerplate of the two annotations is a bummer.

I've just been able to remove @dataclasstype entirely!

@@ -122,7 +122,7 @@ print(x.content) # 'a string'
# datatype objects can be easily inspected:
print(x) # 'FormattedInt(content="a string")'

# Most mypy types, including parameterized types, may be used in field definitions.
# All mypy types, including parameterized types, may be used in field definitions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what "mypy types" are. Do we mean the types in the typing module?

PS We still limit to dataclasses whose fields are (str|bytes|bool|int|a few other scalar types|collections of valid field types|dataclasses of field types), right? So you can't literally use any dataclass? I fear this sentence might read as contradicting that.

Copy link
Contributor Author

@cosmicexplorer cosmicexplorer Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what "mypy types" are. Do we mean the types in the typing module?

It was intended to read as "anything that can be used as the type of a type annotation", which would include the typing module, however:

PS We still limit to dataclasses whose fields are (str|bytes|bool|int|a few other scalar types|collections of valid field types|dataclasses of field types), right? So you can't literally use any dataclass? I fear this sentence might read as contradicting that.

I think you're totally right here. You'll see in

class _type_field_constraint(TypeConstraint):
def __init__(self):
super().__init__('<@dataclass `frozen=True` attribute-checking constraint>')
def satisfied_by(self, obj):
if not issubclass(type(obj), type):
return False
if dataclasses.is_dataclass(obj) and not obj.__dataclass_params__.frozen:
raise TypeConstraintError(
f'type {obj} is a dataclass declared without `frozen=True`! '
'The engine requires that fields in params are immutable for stable hashing!')
return True
that we validate that @dataclass classes used in a UnionRule, RootRule, or TaskRule have set frozen=True -- I think we can perform a little more validation here to avoid confusing error messages from the engine about unhashability. I'll do that now, then update the README to indicate which @dataclass field types that pants accepts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is hashability enough? Don't we also need serializability? How do things get cached otherwise?

Copy link
Contributor Author

@cosmicexplorer cosmicexplorer Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS We still limit to dataclasses whose fields are (str|bytes|bool|int|a few other scalar types|collections of valid field types|dataclasses of field types), right? So you can't literally use any dataclass? I fear this sentence might read as contradicting that.

Hm. Thinking about it again, I would prefer to try to avoid whitelisting specific types and instead just require hashability, which I think is the only real constraint on engine params? If you try to hash a dataclass which uses a List[_] field, for example, you get this:

>>> @dataclass(frozen=True)
... class C:
...   x: int
...   y: Tuple[str]
...   z: List[int]
...
>>> C(3, ("a",), [3])
C(x=3, y=('a',), z=[3])
>>> hash(C(3, ("a",), [3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 2, in __hash__
TypeError: unhashable type: 'list'

This does not occur for dataclasses with just Tuple[_] fields, for example. So I think we might be able to amend this section to say just that the only requirement on dataclass fields is hashability? What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end don't we need to be able to serialize all engine datatypes, for the cache? This is an area of the engine I am fuzzy on, but I was under the strong impression that all Params can be cached, which presumably means they can be serialized. Although I don't know how this serialization happens today, if at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So say we codegen Foo.proto to Foo.scala. Foo.proto will be in the CAS because it was an on-disk source, it sounds like. Foo.scala is generated by remote process execution, so now it's sitting on disk on the remote machine. What happens to it then that allows it to participate in a subsequent process execution say to compile it to Foo.class ?

Copy link
Member

@stuhood stuhood Aug 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uploads and downloads to a remote CAS happen on-demand when you execute a process remotely.

^ this. So after the process finishes executing remotely, we sync the output digests locally.

EDIT: Oh. Hm. Sorry: I went looking for code to confirm this, and then realized that we apparently download lazily currently: downloading eagerly would look like calling

///
/// Download a directory from Remote ByteStore recursively to the local one. Called only with the
/// Digest of a Directory.
///
pub fn ensure_local_has_recursive_directory(
eagerly after executing a process (cc @illicitonion). In the absence of fetching eagerly, any lookup for a digest will later on potentially go to the network to grab that digest.

But particularly in the context of remoting, you might not want to fetch something locally. So I cannot now remember whether this laziness was intentional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I'm very happy with the laziness... Why fetch a file unless you need it. Imagine a large .jar or something...

But then can the second process execution (that takes the Foo.java as input) just reference it via the CAS, so we never even see the full file locally? I.e., the first remote process execution creates it in the CAS, the second reads it from there, and we never need to download the intermediate file to the Pants-running machine, just its digest?

And my other question was more about this: in the remote execution of the proto compiler, something has to know to take the resulting Foo.scala file from the local filesystem and stick it in the CAS (regardless of whether we then fetch it eagerly or lazily). How does that decision get made? I.e., what knows that "Foo.scala" is a file we care about, a product, but "incidental log file.txt" is not...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then can the second process execution (that takes the Foo.java as input) just reference it via the CAS, so we never even see the full file locally? I.e., the first remote process execution creates it in the CAS, the second reads it from there, and we never need to download the intermediate file to the Pants-running machine, just its digest?

Correct. But whether you download it or not (ie, lazy or no), it already exists in the remote CAS. Before executing the next process we do (confirmed) eagerly call

///
/// Ensures that the remote ByteStore has a copy of each passed Fingerprint, including any files
/// contained in any Directories in the list.
///
/// Returns a structure with the summary of operations.
///
pub fn ensure_remote_has_recursive(
: if the inputs already exist remotely, this is a noop.

And my other question was more about this: in the remote execution of the proto compiler, something has to know to take the resulting Foo.scala file from the local filesystem and stick it in the CAS (regardless of whether we then fetch it eagerly or lazily). How does that decision get made? I.e., what knows that "Foo.scala" is a file we care about, a product, but "incidental log file.txt" is not...

The high-level (python) ExecuteProcessRequest type has fields to define what should be captured from the process-execution sandbox, and these map ~1:1 with the bazel remoting API:

('output_files', hashable_string_list),
('output_directories', hashable_string_list),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so, to sum up:

Say we're running Pants on machine P1, and executing processes remotely on machines M1, M2.

  1. Foo.proto is a local source file, so it's already in the CAS and P1 has its fingerprint.

  2. P1 calls ExecuteProcess(protoc, Fingerprint(Foo.proto)), sending the request to M1.

  3. M1 fetches Foo.proto from the CAS, runs protoc on it (I'm eliding the question of where M1 gets a protoc binary...) thus generating Foo.scala.

  4. M1 puts Foo.scala in the CAS, returning just Fingerprint(Foo.scala) to P1.

  5. P1 calls ExecuteProcess(scalac, Fingerprint(Foo.scala)), sending the request to M2.

  6. M2 fetches Foo.scala from the CAS, runs scalac on it, thus generating Foo.class.

  7. M2 puts Foo.class in the CAS, returning just Fingerprint(Foo.class) to P1.

So P1 does not have the content of Foo.scala or Foo.class locally at any point.

Does that about sum it up reasonably?

If so, one more question: what does that mean for running ensure_remote_has_recursive before step 5 above? P1 doesn't have Foo.scala's content, so it seems like it can't ensure anything. It can I guess error out if the CAS is missing the file, but then so can M2 directly.

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks!

@@ -122,7 +122,7 @@ print(x.content) # 'a string'
# datatype objects can be easily inspected:
print(x) # 'FormattedInt(content="a string")'

# Most mypy types, including parameterized types, may be used in field definitions.
# All mypy types, including parameterized types, may be used in field definitions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cosmicexplorer: that's correct. I try to differentiate between them by calling the in-memory approach "memoization in the graph", vs calling the persistent approach "caching". Memoization happens for all rules and requires python hash/eq, caching happens only for process executions, and uses the protobuf definitions.


# TODO(#7074): Migrate to python 3 dataclasses!
@deprecated('1.22.0.dev0', hint_message='Use @dataclasstype to declare typed named tuples!')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to target dev2 with deprecations where possible. See:

https://groups.google.com/d/topic/pants-devel/rO1K7j1iCwk/discussion

I made a change to try and encourage this, so I'm surprised it's not taking effect here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe what change you made to encourage this? I'm not seeing any code in deprecated.py that relates to this, and I'm wondering if I should make that change in a separate PR.

@cosmicexplorer cosmicexplorer force-pushed the python-dataclasses branch 2 times, most recently from b9511ae to cb48ca8 Compare August 31, 2019 03:06
@cosmicexplorer cosmicexplorer force-pushed the python-dataclasses branch 2 times, most recently from fc9866b to c6ff796 Compare September 12, 2019 17:32
@cosmicexplorer cosmicexplorer force-pushed the python-dataclasses branch 2 times, most recently from 77901f4 to bcf8e5d Compare September 26, 2019 15:57
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool!

src/python/pants/engine/README.md Outdated Show resolved Hide resolved
definitions to provide a unique and descriptive type is highly recommended:
disambiguate between various types of data in `@rule` signatures, so declaring small, unique classes
to encapsulate different states can help to make `@rule`sets with complex control flow between rules
more declarative and self-documenting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good rewrite!

src/python/pants/engine/README.md Show resolved Hide resolved
src/python/pants/engine/README.md Outdated Show resolved Hide resolved
src/python/pants/engine/README.md Outdated Show resolved Hide resolved
src/python/pants/engine/README.md Outdated Show resolved Hide resolved
"""Holds a normalized index of Rules used to instantiate Nodes."""

rules: Dict[Any, Any]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be tightened at all? I think rules are seen as Callable? Don't let this block landing the PR though. We can come back to this and I'd rather land the great work you've done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a better version of lines 573-575 - should delete the original.

Thanks for adding type hints! Is Any representative or because we don't know how to make it more precise? If the latter, it's better to do Dict and Set with no types specified so that we know to later go back and make the type hint more precise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a followup PR that addresses those more-precise type hints!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #8496!

testprojects/src/python/mypy/BUILD Outdated Show resolved Hide resolved
tests/python/pants_test/engine/test_scheduler.py Outdated Show resolved Hide resolved
tests/python/pants_test/util/test_objects_mypy.py Outdated Show resolved Hide resolved
Eric-Arellano added a commit that referenced this pull request Oct 1, 2019
#8181 demonstrated that it's possible to use `@dataclass(frozen=True)` as an alternative to `datatype()`. As argued there, this brings several benefits, including aligning us more to the stdlib and allowing MyPy to understand our code for free.

This PR takes a first pass to convert >75% of `datatype` usages to `@dataclass`. It naively does this translation:

* Does not attempt to add any more type hints than were previously used.
* Leaves untouched certain `datatypes()` that inherit normal classes with class variables, which would get overridden
* Leaves untouched certain `datatypes()` with custom `__new__()` that is not easily replicated
Eric-Arellano added a commit that referenced this pull request Oct 9, 2019
… with V2 (#8431)

### Problem
In #8423 (comment), we realized the need to be able to not always use `frozen=True` when using dataclasses for the V2 engine. Specifically, when we need a custom `__init__()`, we cannot have `frozen=True` because `__init__()` must be able to mutate the object's attributes.

However, we still want to keep the benefits of `frozen=True` as much as possible. After we create the object, we do not want to allow any additional modifications. This is important because the V2 engine uses the `__hash__()` of an object to determine its node ID - if the object mutates after creation, the ID will become invalid.

### Solution
As suggested in #8423 (comment), introduce a new decorator `@frozen_after_init` that only allows setting attributes in the `__init__()` and then marks the object as frozen after.

This is done by setting a flag in the `__init__()` to mark that the object should there-on-out be frozen. Then, we modify `__setattr__()` to check for that flag to prohibit any mutations other than those in the `__init__()`.

### Result
`@dataclass(unsafe_hash=True)` can now be marked with `@frozen_after_init` so that we can have the custom `__init__()`s but not lose the benefits of `frozen=True`.

This also works on any arbitrary class.

We do not yet enforce that dataclasses _must_ use this attribute (or `frozen=True`), which is left for #8181.
@cosmicexplorer cosmicexplorer force-pushed the python-dataclasses branch 2 times, most recently from 332cfd8 to 3fa073a Compare October 16, 2019 16:21
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! Only holding off on approval because I didn't have enough time to closely review this yet.

"""Holds a normalized index of Rules used to instantiate Nodes."""

rules: Dict[Any, Any]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a better version of lines 573-575 - should delete the original.

Thanks for adding type hints! Is Any representative or because we don't know how to make it more precise? If the latter, it's better to do Dict and Set with no types specified so that we know to later go back and make the type hint more precise.


# TODO(#7074): Migrate to python 3 dataclasses!
@deprecated('1.24.0.dev2', hint_message='Use @dataclass to declare typed named tuples instead!')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to remove this without a deprecation cycle once possible under the same logic as the enum() removal, i.e. that this API was experimental (albeit highly useful!).

Maybe hold off on deprecating this so that we can have a dedicated PR for the removal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that deprecating it is exactly what makes it easier to remove eventually? If we don't want to have a full deprecation cycle, we can kill it early. But if we're planning to remove it, I think a deprecation cycle sounds reasonable?

cosmicexplorer and others added 8 commits October 18, 2019 14:23
deprecate datatype()!

add test for rules with python dataclasses!

remove all the conversion to custom datatypes!!!

fix the readme!!!

ensure @DataClass types have frozen=True when used in the engine

clarify mypy compatibility

don't display deprecation warnings in CI

bump deprecation version

fix import ordering

make lint and unit tests pass

fix mypy line number

remove deprecation warning ignore filter in travis

fix weird mypy errors

try to see if removing the deprecation stops the timeouts!

fix test failure and uncomment deprecation!
Co-Authored-By: Eric Arellano <ericarellano@me.com>
@cosmicexplorer cosmicexplorer merged commit 00c4f8c into pantsbuild:master Oct 20, 2019
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you!

stuhood added a commit to twitter/pants that referenced this pull request Oct 25, 2019
stuhood added a commit that referenced this pull request Oct 25, 2019
Fixes #8539 and #8520.

----

Revert "fix mypy type error in meta.py (#8509)"

This reverts commit 9ad14d3.

----

Revert "support using @DataClass for engine params (#8181)"

This reverts commit 00c4f8c.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants