Skip to content

Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout #3380

Closed
@TomFinley

Description

@TomFinley

I was attempting to write documentation for IEstimator and ITransformer. One of the core responsibilities of these is, as expressed in #581, schema propagation, so as to, among other benefits, do the sort of "pre-fit validation" scenario that has proven so troublesome to us in the past (this w.r.t. #267).

However, what I find is that I struggled to detect a meaningful unifying "principle" behind IEstimator and ITransformer. I think the people that worked on it thought there was a principle, but when I encountered what people though tit was (something about annotations being a 'subset' of one or the other), it seemed like there were numerous "exceptions" that made the principle meaningless. . Indeed, I'm starting to suspect that there was no scenario by the original designer aside from "work until things compile and run on what we have so far," which is not really an acceptable state for something like IEstimator and ITransformer which are core concepts of this API, especially if we hope to describe them in such a way that people can implement these interfaces on their own.

In this issue I outline the trouble that I have observed. The points here are subtle, but they are important insofar as resolving them is, I think, crucial for us to define the relationship between IEstimator and the ITransformer returned by fitting that estimator.

CCing @artidoro, @shauheen since I know they had some interest in this problem... going to mark as the unusual step of both bug and code-sanitation, since it both touches upon our need to have a consistent architectural story among these key structures, as well as there having been some genuine bugs that have been uncovered.

A Troublesome Example

So, I'll start with the same seemingly innocuous code that led me to think there is something architecturally wrong at stake here: Consider the IsNormalized annotation, in the KeyToVectorMappingEstimator, here is the condition where this annotation is mentioned as coming:

if (!colInfo.OutputCountVector || (col.Kind == SchemaShape.Column.VectorKind.Scalar))
metadata.Add(new SchemaShape.Column(AnnotationUtils.Kinds.IsNormalized, SchemaShape.Column.VectorKind.Scalar, BooleanDataViewType.Instance, false));

Meanwhile, the resulting transformer has subtly different logic. If the input source value count is 1, which of course happens when something is scalar, but also happens when the input is vector and happens to have vector size of one. (A concept that has no meaning in SchemaShape, and is basically unknowable.)

if (!_parent._columns[iinfo].OutputCountVector || srcValueCount == 1)
{
ValueGetter<bool> getter = (ref bool dst) =>
{
dst = true;
};
builder.Add(AnnotationUtils.Kinds.IsNormalized, BooleanDataViewType.Instance, getter);

This leads to an interesting decision, because from a certain perspective, the addition of the metadata in the ITransformer code is correct (and, at the time it was originally written, some years I believe before the concept of IEstimator and SchemaShape was introduced, was in fact unambiguously correct, since there was no concept as we have now of pre-Fit schema validation.) But in the world of IEstimator, we are using some information that in some cases could be used post-Fit in the sense that it can only be expressed in a DataViewSchema, not a SchemaShape -- that this is a vector type that happens to have length exactly one -- and that in many cases could not even be known pre-Fit under some circumstnaces. (E.g., consider a value-to-key mapping estimator that winds up finding only one value.)

So, the IEstimator makes no claim that the IsNormalized data is there, but due to runtime logic that had existed prior to this time, it is in fact there.

This was known, I believe, the originators of the concept of IEstimator and SchemaShape, which at one point was deliberately made as, "all right, we consider the claim in SchemaShape to be, in the area of annotations specifically, a subset of what is actually present among the non-hidden columns of DatavViewSchema. This is in fact what the relevant method called from TestEstimatorCore does. (Though, it is reversed from what is actually in the comment -- what it actually tests is that what is delivered is a superset of what is promised).

private void CheckSameSchemaShape(SchemaShape promised, SchemaShape delivered)
{
Assert.True(promised.Count == delivered.Count);
var sortedCols1 = promised.OrderBy(x => x.Name);
var sortedCols2 = delivered.OrderBy(x => x.Name);
foreach (var (x, y) in sortedCols1.Zip(sortedCols2, (x, y) => (x, y)))
{
Assert.Equal(x.Name, y.Name);
// We want the 'promised' metadata to be a superset of 'delivered'.
Assert.True(y.IsCompatibleWith(x), $"Mismatch on {x.Name}");
}
}

This notion of "subsetting" being the requirement is mentioned in the documentation and testing I could find on the relationship between IEstimator and ITransformer (see also here). But is superset actually the right thing? Let's consider another very important transformer: key-to-value. In that case, the estimators relies on a complete description of the KeyValue annotation to detect what the input type is! So in that case this mere talk of superset and subset becomes increasingly complex, because sometimes it is necessary, and sometimes it is not.

If we want an ITransformer and IEstimator to be paired, this suggests a mutuality of information. One of the core tenants of IDataView is composability, but the foundation of that composibility isn't because we've just created a system that works to just accomodate the set of components we ourselves wrote. (Indeed, this is one thing that I argue in one of our documents about the IDataView system and why it works.)

So right now we have this system where sometimes annotations are subsets, but possibly not, depending on whether we consider a combination of annotations and estimators and transformers valuable. But, that seems like a pretty ad hoc way of engineering a system. How is it possible at all to make any meaningful statements about such a thing like, "we have this requirement, but possibly not, it only depends on if we've written anything yet that cares."

What does that mean? Are we prepared to make as part of the contract of IEstimator and ITransformer that some sorts of annotations are allowed to impact the schema (KeyValues must!), but that some are not (we effectively were having the assumption in a few places that SlotNames and IsNormalized could not, based on our uneven propagation of them)? That doesn't seem like a good idea. Are we allowed to evolve that definition in any way? So I wanted something much simpler, which leads me to this:

Proposed Schema Relationship Between an Estimator and its Transformer

Edit: This principle is wrong, see this comment. But I still think enforcing the principle on our own tests with our own components makes sense and does way more help than harm, but we can't insist on it as a principle of the estimators and transformers themselves.

So, to make the proposed relationship concrete, I'll consider the following code (for any est and data):

IEstimator est = ...;
IDataView data = ...;
var schema = data.Schema;

var promisedShape = est.GetOutputSchema(SchemaShape.Create(schema));
var deliveredShape = SchemaShape.Create(est.Fit(data).GetOutputSchema(schema));
  1. If both constructions of promisedShape and deliveredShape succeed, they should be the same, that is, the two constructed SchemaShape objects should be indistinguishable (besides, of course, say, tests on reference equality).

  2. If the construction of deliveredShape would succeed, then the construction of promisedShape should succeed as well. (That is, the schema-propagation logic of an estimator should be no more strict than the schema-propagation logic of its produced transformer.)

Both represent a somewhat more "restrictive" view of the relationship of paired IEstimator and ITransformer. The previous state was sort of an ad hoc free for all. I have not yet
established that it will be completely possible to do this; I know it will require some changes of behavior of some ITransformers, especially around their annotations, but so far I have not seen any "show stoppers." Maybe I will though.

Bugs

Unfortunately due to the fact that this "requirement" is meaningless unless we test for it, it is best if I introduce the stricter test, and test for this, all at the same time. Doing so results in several test failures -- some were due to stricter requirements, but some were due to the fact that some IEstimator implementations were just plain wrong. I will outline in comments below those bugs that I have found, since I have not completed that work yet. (I have only fixed two tests so far, and each takes me a few hours. Regrettably, given the subtlety of the issues at stake, this is one of those situations that requires more finesse and consideration than other changes.)/+
+

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingcode-sanitationCode consistency, maintainability, and best practices, moreso than any public API.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions