From 33d0edb1a76bff01a165243b572a510d25cca324 Mon Sep 17 00:00:00 2001 From: Danny McClanahan <1305167+cosmicexplorer@users.noreply.github.com> Date: Thu, 26 Apr 2018 23:31:27 -0700 Subject: [PATCH] add "datatypes in depth" section --- src/python/pants/engine/README.md | 108 ++++++++++++++++++++++++++++++ 1 file changed, 108 insertions(+) diff --git a/src/python/pants/engine/README.md b/src/python/pants/engine/README.md index f74380a15037..abdd85a8b134 100644 --- a/src/python/pants/engine/README.md +++ b/src/python/pants/engine/README.md @@ -105,6 +105,9 @@ constructor will create your object, then raise an error if of a *subclass* of a field's declared type will **fail** this type check in the constructor! +Please see [Datatypes in Depth](#datatypes-in-depth) for further discussion on +using `datatype` objects with the v2 engine. + ### Selectors and Gets As demonstrated above, the `Selector` classes select `@rule` inputs in the context of a particular @@ -222,3 +225,108 @@ in the context of scala and mixed scala & java builds. Twitter spiked on a proj a target-level scheduling system scoped to just the jvm compilation tasks. This bore fruit and served as further impetus to get a "tuple-engine" designed and constructed to bring the benefits seen in the jvm compilers to the wider pants world of tasks. + +## Datatypes in Depth + +`datatype` objects can be used to colocate multiple dependencies of an +`@rule`. For example, to compile C code, you typically require both source code +and a C compiler: + +``` python +class CCompileRequest(datatype(['c_compiler', 'c_sources'])): + pass + +class CObjectFiles(datatype(['files_snapshot'])): + pass + +# The engine ensures this is the only way to get from +# CCompileRequest -> CObjectFiles. 
+@rule(CObjectFiles, [Select(CCompileRequest)])
+def compile_c_sources(c_compile_request):
+  c_compiler, c_sources = c_compile_request
+  compiled_object_files = c_compiler.compile(c_sources)
+  return CObjectFiles(compiled_object_files)
+```
+
+Encoding each stage of a build process as its own `datatype` subclass, carrying
+exactly the information that stage needs and no more, makes it easier to add
+functionality to the build: new `@rule`s can consume and/or produce types from a
+concise, shared set of `datatype` definitions. For example:
+
+``` python
+# "Vendoring" refers to checking a source or binary copy of a 3rdparty
+# library into source control. In this case, we assume the snapshot contains
+# _only_ binary object files for the current platform.
+class VendoredLibrary(datatype(['files_snapshot'])):
+  pass
+
+@rule(CObjectFiles, [Select(VendoredLibrary)])
+def get_vendored_object_files(vendored_library):
+  return CObjectFiles(vendored_library.files_snapshot)
+```
+
+This adds the ability to depend on checked-in binary object files with an
+extremely small amount of code. Because a `VendoredLibrary` is constructed with
+a snapshot containing only object files, the `CObjectFiles` we construct from it
+upholds that same guarantee. The key to making that assumption possible is
+encoding assumptions about our objects into specific types, and letting the
+engine invoke the correct sequence of rules.
+
+### Encoding Assumptions into Types
+
+Passing around an instance of a primitive type such as `str` or `int` can
+sometimes require significant mental overhead to keep track of assumptions that
+the code makes about the object's value.
If the `str` needs to be formatted a
+specific way or the `int` must fall within a certain range, using those types
+directly can require re-validating the object everywhere it's used, for example
+to avoid injection attacks from user-provided strings, or to avoid attempting to
+read a negative number of bytes from a file. Beyond the variable name, a bare
+`str` carries no context about what validation or transformations have been
+performed on it or how it will be used.
+
+One way to keep track of assumptions made about an object's value is to make a
+wrapper type for that object, and then control the ways that instances of the
+wrapper type can be created: override the wrapper type's constructor and raise
+an exception if the object's value is invalid. Declaring a typed field for a
+`datatype` takes this approach, and it can be extended to perform arbitrary
+input validation:
+
+``` python
+# Declare a datatype with a single field 'int_value',
+# which must be an int when the datatype is constructed.
+class NonNegativeInt(datatype([('int_value', int)])):
+  def __new__(cls, *args, **kwargs):
+    # Call the superclass constructor first to check the type of `int_value`.
+    this_object = super(NonNegativeInt, cls).__new__(cls, *args, **kwargs)
+
+    if this_object.int_value < 0:
+      raise cls.make_type_error("value is negative: {!r}"
+                                .format(this_object.int_value))
+
+    return this_object
+```
+
+`make_type_error()` creates an exception object which can be raised in a
+`datatype`'s constructor to signal a type checking failure, and it automatically
+includes the type name in the error message. However, any other exception type
+can be raised as well.
+
+For `NonNegativeInt`, the input is extremely simple (we're not calling any
+methods on the `int`), and the validation is extremely straightforward (it can
+be expressed in a single `if`). 
These characteristics make it natural
+to declare a specific type for the field in the call to `datatype()` and to
+ensure validity with a check in the `__new__()` method. Using type checking in
+this way makes types like `NonNegativeInt` usable in many different scenarios
+without additional boilerplate for the user.
+
+`VendoredLibrary` and `CObjectFiles` are the opposite: their inputs are much
+more complex to construct, and the result is much more difficult to validate, so
+synchronously scanning every file in a `VendoredLibrary`'s `files_snapshot` each
+time we construct one, to verify that they are all indeed object files for the
+correct platform, would be difficult to justify. In this case, making simple,
+focused `datatype` definitions makes it easier to correctly consume, manipulate,
+and produce them within a common set of `@rule` definitions. The engine ensures
+that there is at most one sequence of rules transforming type A to type B, and
+it automatically links those rules together to perform the conversion. Making a
+set of rules maximally composable implicitly helps to ensure correctness by
+reusing logic as much as possible.
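The wrapper-type pattern above can be sketched outside of pants using only the
standard library. The `datatype` helper is pants-specific, so this hypothetical
standalone version substitutes a plain `namedtuple` base class; `read_prefix` is
likewise a hypothetical consumer, included to show that the invariant only needs
to be checked once, at construction time:

``` python
from collections import namedtuple

class NonNegativeInt(namedtuple('NonNegativeInt', ['int_value'])):
  """Standalone sketch of the wrapper-type pattern described above."""

  def __new__(cls, int_value):
    # Mirror the typed-field check that datatype() would perform...
    if not isinstance(int_value, int):
      raise TypeError("int_value must be an int, got: {!r}".format(int_value))
    # ...then enforce the value-level invariant.
    if int_value < 0:
      raise ValueError("value is negative: {!r}".format(int_value))
    return super(NonNegativeInt, cls).__new__(cls, int_value)

def read_prefix(data, count):
  # Because constructing a NonNegativeInt is the only way to obtain one,
  # there is no need to re-check for a negative length here.
  return data[:count.int_value]
```

Since construction is the only way to obtain a `NonNegativeInt`, every consumer
can rely on the invariant without re-validating it, which is the same property
the engine exploits when it chains `@rule`s together by type.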