Description
I was attempting to write documentation for IEstimator
and ITransformer
. One of the core responsibilities of these is, as expressed in #581, schema propagation, so as to, among other benefits, do the sort of "pre-fit validation" scenario that has proven so troublesome to us in the past (this w.r.t. #267).
However, what I find is that I struggled to detect a meaningful unifying "principle" behind IEstimator
and ITransformer
. I think the people that worked on it thought there was a principle, but when I encountered what people though tit was (something about annotations being a 'subset' of one or the other), it seemed like there were numerous "exceptions" that made the principle meaningless. . Indeed, I'm starting to suspect that there was no scenario by the original designer aside from "work until things compile and run on what we have so far," which is not really an acceptable state for something like IEstimator
and ITransformer
which are core concepts of this API, especially if we hope to describe them in such a way that people can implement these interfaces on their own.
In this issue I outline the trouble that I have observed. The points here are subtle, but they are important insofar as resolving them is, I think, crucial for us to define the relationship between IEstimator
and the ITransformer
returned by fitting that estimator.
CCing @artidoro, @shauheen since I know they had some interest in this problem... going to mark as the unusual step of both bug and code-sanitation, since it both touches upon our need to have a consistent architectural story among these key structures, as well as there having been some genuine bugs that have been uncovered.
A Troublesome Example
So, I'll start with the same seemingly innocuous code that led me to think there is something architecturally wrong at stake here: Consider the IsNormalized
annotation, in the KeyToVectorMappingEstimator
, here is the condition where this annotation is mentioned as coming:
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Lines 803 to 804 in 73e29c8
Meanwhile, the resulting transformer has subtly different logic. If the input source value count is 1
, which of course happens when something is scalar, but also happens when the input is vector and happens to have vector size of one. (A concept that has no meaning in SchemaShape
, and is basically unknowable.)
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Lines 339 to 345 in 73e29c8
This leads to an interesting decision, because from a certain perspective, the addition of the metadata in the ITransformer
code is correct (and, at the time it was originally written, some years I believe before the concept of IEstimator
and SchemaShape
was introduced, was in fact unambiguously correct, since there was no concept as we have now of pre-Fit
schema validation.) But in the world of IEstimator
, we are using some information that in some cases could be used post-Fit
in the sense that it can only be expressed in a DataViewSchema
, not a SchemaShape
-- that this is a vector type that happens to have length exactly one -- and that in many cases could not even be known pre-Fit
under some circumstnaces. (E.g., consider a value-to-key mapping estimator that winds up finding only one value.)
So, the IEstimator
makes no claim that the IsNormalized
data is there, but due to runtime logic that had existed prior to this time, it is in fact there.
This was known, I believe, the originators of the concept of IEstimator
and SchemaShape
, which at one point was deliberately made as, "all right, we consider the claim in SchemaShape
to be, in the area of annotations specifically, a subset of what is actually present among the non-hidden columns of DatavViewSchema
. This is in fact what the relevant method called from TestEstimatorCore
does. (Though, it is reversed from what is actually in the comment -- what it actually tests is that what is delivered is a superset of what is promised).
machinelearning/test/Microsoft.ML.TestFramework/DataPipe/TestDataPipeBase.cs
Lines 151 to 163 in 3b576fe
This notion of "subsetting" being the requirement is mentioned in the documentation and testing I could find on the relationship between IEstimator
and ITransformer
(see also here). But is superset actually the right thing? Let's consider another very important transformer: key-to-value. In that case, the estimators relies on a complete description of the KeyValue
annotation to detect what the input type is! So in that case this mere talk of superset and subset becomes increasingly complex, because sometimes it is necessary, and sometimes it is not.
If we want an ITransformer
and IEstimator
to be paired, this suggests a mutuality of information. One of the core tenants of IDataView
is composability, but the foundation of that composibility isn't because we've just created a system that works to just accomodate the set of components we ourselves wrote. (Indeed, this is one thing that I argue in one of our documents about the IDataView
system and why it works.)
So right now we have this system where sometimes annotations are subsets, but possibly not, depending on whether we consider a combination of annotations and estimators and transformers valuable. But, that seems like a pretty ad hoc way of engineering a system. How is it possible at all to make any meaningful statements about such a thing like, "we have this requirement, but possibly not, it only depends on if we've written anything yet that cares."
What does that mean? Are we prepared to make as part of the contract of IEstimator
and ITransformer
that some sorts of annotations are allowed to impact the schema (KeyValues
must!), but that some are not (we effectively were having the assumption in a few places that SlotNames
and IsNormalized
could not, based on our uneven propagation of them)? That doesn't seem like a good idea. Are we allowed to evolve that definition in any way? So I wanted something much simpler, which leads me to this:
Proposed Schema Relationship Between an Estimator and its Transformer
Edit: This principle is wrong, see this comment. But I still think enforcing the principle on our own tests with our own components makes sense and does way more help than harm, but we can't insist on it as a principle of the estimators and transformers themselves.
So, to make the proposed relationship concrete, I'll consider the following code (for any est
and data
):
IEstimator est = ...;
IDataView data = ...;
var schema = data.Schema;
var promisedShape = est.GetOutputSchema(SchemaShape.Create(schema));
var deliveredShape = SchemaShape.Create(est.Fit(data).GetOutputSchema(schema));
-
If both constructions of
promisedShape
anddeliveredShape
succeed, they should be the same, that is, the two constructedSchemaShape
objects should be indistinguishable (besides, of course, say, tests on reference equality). -
If the construction of
deliveredShape
would succeed, then the construction ofpromisedShape
should succeed as well. (That is, the schema-propagation logic of an estimator should be no more strict than the schema-propagation logic of its produced transformer.)
Both represent a somewhat more "restrictive" view of the relationship of paired IEstimator
and ITransformer
. The previous state was sort of an ad hoc free for all. I have not yet
established that it will be completely possible to do this; I know it will require some changes of behavior of some ITransformer
s, especially around their annotations, but so far I have not seen any "show stoppers." Maybe I will though.
Bugs
Unfortunately due to the fact that this "requirement" is meaningless unless we test for it, it is best if I introduce the stricter test, and test for this, all at the same time. Doing so results in several test failures -- some were due to stricter requirements, but some were due to the fact that some IEstimator
implementations were just plain wrong. I will outline in comments below those bugs that I have found, since I have not completed that work yet. (I have only fixed two tests so far, and each takes me a few hours. Regrettably, given the subtlety of the issues at stake, this is one of those situations that requires more finesse and consideration than other changes.)/+
+