add "datatypes in depth" section
cosmicexplorer committed Apr 27, 2018
1 parent ce893b5 commit 33d0edb
Showing 1 changed file with 108 additions and 0 deletions: src/python/pants/engine/README.md
@@ -105,6 +105,9 @@ constructor will create your object, then raise an error if
of a *subclass* of a field's declared type will **fail** this type check in the
constructor!
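
For example (a minimal sketch, assuming `datatype` is imported from
`pants.util.objects`; `FileCount` and `MyInt` are illustrative names), the
exact-type check means an instance of a subclass is rejected:

``` python
from pants.util.objects import datatype


class FileCount(datatype([('count', int)])):
  pass


class MyInt(int):
  pass


FileCount(3)         # ok: 3 is exactly an int.
FileCount(MyInt(3))  # raises: MyInt(3) is an instance of a subclass of int.
```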

Please see [Datatypes in Depth](#datatypes-in-depth) for further discussion on
using `datatype` objects with the v2 engine.

### Selectors and Gets

As demonstrated above, the `Selector` classes select `@rule` inputs in the context of a particular
@@ -222,3 +225,108 @@ in the context of scala and mixed scala & java builds. Twitter spiked on a project to develop
a target-level scheduling system scoped to just the jvm compilation tasks. This bore fruit and
served as further impetus to get a "tuple-engine" designed and constructed to bring the benefits
seen in the jvm compilers to the wider pants world of tasks.

## Datatypes in Depth

`datatype` objects can be used to colocate multiple dependencies of an
`@rule`. For example, to compile C code, you typically require both source code
and a C compiler:

``` python
from pants.engine.rules import rule
from pants.engine.selectors import Select
from pants.util.objects import datatype


class CCompileRequest(datatype(['c_compiler', 'c_sources'])):
  pass


class CObjectFiles(datatype(['files_snapshot'])):
  pass


# The engine ensures this is the only way to get from
# CCompileRequest -> CObjectFiles.
@rule(CObjectFiles, [Select(CCompileRequest)])
def compile_c_sources(c_compile_request):
  c_compiler, c_sources = c_compile_request
  compiled_object_files = c_compiler.compile(c_sources)
  return CObjectFiles(compiled_object_files)
```
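
As a usage sketch, the `@rule` function can also be called directly, as one
might in a unit test (`gcc_wrapper` and `sources_snapshot` are hypothetical
stand-ins for a compiler object and a snapshot of C sources):

``` python
request = CCompileRequest(c_compiler=gcc_wrapper, c_sources=sources_snapshot)
object_files = compile_c_sources(request)  # Returns a CObjectFiles.
```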

Encoding each stage of a build process as its own `datatype` subclass, carrying
exactly the information that stage needs and no more, makes it easier to add
functionality to the build: new `@rule`s can consume and/or produce types from
a concise, shared set of `datatype` definitions. For example:

``` python
# "Vendoring" refers to checking a source or binary copy of a 3rdparty
# library into source control. In this case, we assume the snapshot contains
# _only_ binary object files for the current platform.
class VendoredLibrary(datatype(['files_snapshot'])):
  pass


@rule(CObjectFiles, [Select(VendoredLibrary)])
def get_vendored_object_files(vendored_library):
  return CObjectFiles(vendored_library.files_snapshot)
```

With a very small amount of code, we have added the ability to depend on
checked-in binary object files. Because a `VendoredLibrary` is assumed to be
constructed with a snapshot containing only object files, the `CObjectFiles` we
construct from it upholds that same guarantee. The key to making this
assumption safe is encoding it into a specific type, and letting the engine
invoke the correct sequence of rules.
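
To see the payoff, consider a hypothetical downstream rule (a sketch:
`CLinkRequest`, `CExecutable`, and the `linker.link()` method are illustrative
names, not engine APIs). Because both `compile_c_sources` and
`get_vendored_object_files` produce `CObjectFiles`, a single linking rule
covers freshly-compiled and vendored inputs alike, and the engine selects the
applicable path:

``` python
class CLinkRequest(datatype(['linker', 'object_files'])):
  pass


class CExecutable(datatype(['binary_snapshot'])):
  pass


@rule(CExecutable, [Select(CLinkRequest)])
def link_c_executable(c_link_request):
  linker, object_files = c_link_request
  return CExecutable(linker.link(object_files))
```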

### Encoding Assumptions into Types

Passing around an instance of a primitive type such as `str` or `int` can
require significant mental overhead to keep track of the assumptions the code
makes about the object's value. If the `str` must be formatted a specific way,
or the `int` must fall within a certain range, using those types directly can
mean re-validating the object everywhere it's used: for example, to avoid
injection attacks from user-provided strings, or to avoid attempting to read a
negative number of bytes from a file. Aside from its variable name, a bare
`str` carries no context about what validation or transformations have already
been performed on it, or how it will be used.

One way to keep track of the assumptions made about an object's value is to
create a wrapper type for that object and control the ways instances of the
wrapper type can be created, for example by overriding the wrapper type's
constructor and raising an exception if the value is invalid. Declaring a typed
field for a `datatype` takes this approach, and it can be extended to perform
arbitrary input validation:

``` python
# Declare a datatype with a single field 'int_value',
# which must be an int when the datatype is constructed.
class NonNegativeInt(datatype([('int_value', int)])):

  def __new__(cls, *args, **kwargs):
    # Call the superclass constructor first to check the type of `int_value`.
    this_object = super(NonNegativeInt, cls).__new__(cls, *args, **kwargs)

    if this_object.int_value < 0:
      raise cls.make_type_error(
        "value is negative: {!r}".format(this_object.int_value))

    return this_object
```

`make_type_error()` creates an exception object that can be raised in a
`datatype`'s constructor to signal a type checking failure, and it
automatically includes the type name in the error message. However, any other
exception type can be raised as well.
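
As a usage sketch (exact exception messages may differ):

``` python
NonNegativeInt(3)     # ok.
NonNegativeInt(-1)    # raises: our __new__() override calls make_type_error().
NonNegativeInt('12')  # raises: the declared field type requires an int.
```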

For `NonNegativeInt`, the input is extremely simple (we don't call any methods
on the `int`), and the validation is extremely straightforward (it fits in a
single `if`). These characteristics make it natural to declare a specific type
for the field in the call to `datatype()` and to enforce validity with a check
in the `__new__()` method. Using type checking in this way makes types like
`NonNegativeInt` usable in many different scenarios without additional
boilerplate for the user.
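
For example (a hypothetical sketch: `ReadBytesRequest` is an illustrative name,
not an engine type), other `datatype` definitions can require `NonNegativeInt`
as a field type and inherit its validation for free:

``` python
class ReadBytesRequest(datatype([
    ('path', str),
    ('num_bytes', NonNegativeInt),
])):
  pass


# This datatype can never represent a request for a negative number of bytes:
# the field's type check only accepts a NonNegativeInt, and NonNegativeInt
# itself rejects negative values at construction.
request = ReadBytesRequest(path='/tmp/file.bin', num_bytes=NonNegativeInt(128))
```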

`VendoredLibrary` and `CObjectFiles` are the opposite: their inputs are much
more complex to construct, and the result is much more difficult to validate.
Synchronously scanning every file in a `VendoredLibrary`'s `files_snapshot` on
each construction, to verify that they are all indeed object files for the
correct platform, would be difficult to justify. In this case, making simple,
focused `datatype` definitions makes it easier to correctly consume,
manipulate, and produce those types across a common set of `@rule` definitions.
The engine ensures that there is at most one sequence of rules transforming
type A into type B, and it automatically links those rules together. Making a
set of rules maximally composable in this way implicitly helps to ensure
correctness, by reusing logic as much as possible.
