Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout

I was attempting to write documentation for `IEstimator`  and `ITransformer`. One of the core responsibilities of these is, as expressed in #581, schema propagation, so as to, among other benefits, do the sort of "pre-fit validation" scenario that has proven so troublesome to us in the past (this w.r.t. #267).

However, what I find is that I struggled to detect a meaningful unifying "principle" behind `IEstimator` and `ITransformer`. I think the people that worked on it thought there was a principle, but when I encountered what people though tit was (something about annotations being a 'subset' of one or the other), it seemed like there were numerous "exceptions" that made the principle meaningless. . Indeed, I'm starting to suspect that there was no scenario by the original designer aside from "work until things compile and run on what we have so far," which is not really an acceptable state for something like `IEstimator` and `ITransformer` which are core concepts of this API, especially if we hope to describe them in such a way that people can implement these interfaces on their own.

In this issue I outline the trouble that I have observed. The points here are ***subtle***, but they are important insofar as resolving them is, I think, crucial for us to define the relationship between `IEstimator` and the `ITransformer` returned by fitting that estimator.

CCing @artidoro, @shauheen since I know they had some interest in this problem... going to mark as the unusual step of both bug and code-sanitation, since it both touches upon our need to have a consistent architectural story among these key structures, as well as there having been some genuine bugs that have been uncovered.

# A Troublesome Example

So, I'll start with the same seemingly innocuous code that led me to think there is something architecturally wrong at stake here: Consider the `IsNormalized` annotation, in the `KeyToVectorMappingEstimator`, here is the condition where this annotation is mentioned as coming:

https://github.com/dotnet/machinelearning/blob/73e29c8acf63418ce1f89ef78296d52d775dafe3/src/Microsoft.ML.Data/Transforms/KeyToVector.cs#L803-L804

Meanwhile, the resulting transformer has subtly different logic. If the input source value count is `1`, which of course happens when something is scalar, but also happens when the input is vector and happens to have vector size of one. (A concept that has no meaning in `SchemaShape`, and is basically unknowable.)

https://github.com/dotnet/machinelearning/blob/73e29c8acf63418ce1f89ef78296d52d775dafe3/src/Microsoft.ML.Data/Transforms/KeyToVector.cs#L339-L345

This leads to an interesting decision, because from a certain perspective, the addition of the metadata in the `ITransformer` code is correct (and, at the time it was originally written, some years I believe before the concept of `IEstimator` and `SchemaShape` was introduced, was in fact unambiguously correct, since there was no concept as we have now of pre-`Fit` schema validation.) But in the world of `IEstimator`, we are using some information that in some cases could be used *post*-`Fit` in the sense that it can only be expressed in a `DataViewSchema`, not a `SchemaShape` -- that this is a vector type that happens to have length exactly one -- and that in many cases could not even be known pre-`Fit` under some circumstnaces. (E.g., consider a value-to-key mapping estimator that winds up finding only one value.)

So, the `IEstimator` makes no claim that the `IsNormalized` data is there, but due to runtime logic that had existed prior to this time, it *is* in fact there.

This was known, I believe, the originators of the concept of `IEstimator` and `SchemaShape`, which at one point was deliberately made as, "all right, we consider the claim in `SchemaShape` to be, in the area of annotations specifically, a *subset* of what is actually present among the non-hidden columns of `DatavViewSchema`. This is in fact what the relevant method called from `TestEstimatorCore` does. (Though, it is reversed from what is actually in the comment -- what it actually tests is that what is delivered is a superset of what is promised).

https://github.com/dotnet/machinelearning/blob/3b576fe058ed4f4331018bbc3eabc1ac26219644/test/Microsoft.ML.TestFramework/DataPipe/TestDataPipeBase.cs#L151-L163

This notion of "subsetting" being the requirement is mentioned in the documentation and testing I could find on the relationship between `IEstimator` and `ITransformer` (see also here). But is superset actually the right thing? Let's consider another very important transformer: key-to-value. In that case, the estimators *relies* on a complete description of the `KeyValue` annotation to detect what the input type is! So in that case this mere talk of superset and subset becomes increasingly complex, because sometimes it is necessary, and sometimes it is not.

If we want an `ITransformer` and `IEstimator` to be paired, this suggests a mutuality of information. One of the core tenants of `IDataView` is composability, but the foundation of that composibility isn't because we've just created a system that works to just accomodate the set of components we ourselves wrote. (Indeed, this is [one thing that I argue](https://github.com/dotnet/machinelearning/blob/master/docs/code/IDataViewImplementation.md#urgency-in-adhering-to-invariants) in one of our documents about the `IDataView` system and why it works.)

So right now we have this system where *sometimes* annotations are subsets, but possibly not, depending on whether we consider a combination of annotations and estimators and transformers valuable. But, that seems like a pretty ad hoc way of engineering a system. How is it possible at all to make any meaningful statements about such a thing like, "we have this requirement, but possibly not, it only depends on if we've written anything yet that cares."

What does that mean? Are we prepared to make as part of the contract of `IEstimator` and `ITransformer` that some sorts of annotations are allowed to impact the schema (`KeyValues` must!), but that some are not (we effectively were having the assumption in a few places that `SlotNames` and `IsNormalized` could not, based on our uneven propagation of them)? That doesn't seem like a good idea. Are we allowed to evolve that definition in any way? So I wanted something much simpler, which leads me to this:

# Proposed Schema Relationship Between an Estimator and its Transformer

Edit: This principle is wrong, see [this comment](https://github.com/dotnet/machinelearning/issues/3380#issuecomment-484722270). But I still think enforcing the principle on our own tests with our own components makes sense and does way more help than harm, but we can't insist on it as a principle of the estimators and transformers themselves.

So, to make the proposed relationship concrete, I'll consider the following code (for any `est` and `data`):

```csharp
IEstimator est = ...;
IDataView data = ...;
var schema = data.Schema;

var promisedShape = est.GetOutputSchema(SchemaShape.Create(schema));
var deliveredShape = SchemaShape.Create(est.Fit(data).GetOutputSchema(schema));
```

1. If both constructions of `promisedShape` and `deliveredShape` succeed, they should be the same, that is, the two constructed `SchemaShape` objects should be indistinguishable (besides, of course, say, tests on reference equality).

2. If the construction of `deliveredShape` would succeed, then the construction of `promisedShape` should succeed as well. (That is, the schema-propagation logic of an estimator should be no more strict than the schema-propagation logic of its produced transformer.)

Both represent a somewhat more "restrictive" view of the relationship of paired `IEstimator` and `ITransformer`. The previous state was sort of an ad hoc free for all. I have not yet
 established that it will be completely possible to do this; I know it will require some changes of behavior of some `ITransformer`s, *especially* around their annotations, but so far I have not seen any "show stoppers." Maybe I will though.

# Bugs

Unfortunately due to the fact that this "requirement" is meaningless unless we test for it, it is best if I introduce the stricter test, *and* test for this, all at the same time. Doing so results in several test failures -- some were due to stricter requirements, but some were due to the fact that some `IEstimator` implementations were just plain wrong. I will outline in comments below those bugs that I have found, since I have not completed that work yet. (I have only fixed two tests so far, and each takes me a few hours. Regrettably, given the subtlety of the issues at stake, this is one of those situations that requires more finesse and consideration than other changes.)/+
+


	if (!_parent._columns[iinfo].OutputCountVector \|\| srcValueCount == 1)
	{
	ValueGetter<bool> getter = (ref bool dst) =>
	{
	dst = true;
	};
	builder.Add(AnnotationUtils.Kinds.IsNormalized, BooleanDataViewType.Instance, getter);

	private void CheckSameSchemaShape(SchemaShape promised, SchemaShape delivered)
	{
	Assert.True(promised.Count == delivered.Count);
	var sortedCols1 = promised.OrderBy(x => x.Name);
	var sortedCols2 = delivered.OrderBy(x => x.Name);

	foreach (var (x, y) in sortedCols1.Zip(sortedCols2, (x, y) => (x, y)))
	{
	Assert.Equal(x.Name, y.Name);
	// We want the 'promised' metadata to be a superset of 'delivered'.
	Assert.True(y.IsCompatibleWith(x), $"Mismatch on {x.Name}");
	}
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout #3380

A Troublesome Example

Proposed Schema Relationship Between an Estimator and its Transformer

Bugs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	if (!colInfo.OutputCountVector \|\| (col.Kind == SchemaShape.Column.VectorKind.Scalar))
	metadata.Add(new SchemaShape.Column(AnnotationUtils.Kinds.IsNormalized, SchemaShape.Column.VectorKind.Scalar, BooleanDataViewType.Instance, false));

Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout #3380

Description

A Troublesome Example

Proposed Schema Relationship Between an Estimator and its Transformer

Bugs

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions