-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Check for IEstimator/ITransformer schema consistency, fix bugs uncovered #3408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…etadata occur only on scalar inputs, not vector of size 1 inputs.
…hether invert hashing is used.
…ators conditional on input.
… claims to provide slot names or not.
* Calibrator estimators and transformers now agree on what the annotation data they produce. Before, they were inconsistent. * Calibrator transformers used to provide no annotation data. Now, the will provide the usual prediction kind annotations of the score column if that score column has that annotation. They also correctly identify their output as normalized unconditionally.
// We want the 'promised' metadata to be a superset of 'delivered'. | ||
Assert.True(y.IsCompatibleWith(x), $"Mismatch on {x.Name}"); | ||
Assert.True(d.IsCompatibleWith(p), $"Mismatch on {p.Name}, there was a mismatch, or some unexpected annotations was present."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert.True(d.IsCompatibleWith(p), $"Mismatch on {p.Name}, there was a mismatch, or some unexpected annotations was present."); [](start = 16, length = 127)
I'm trying to come up with some example where this wouldn't be true....
I think previous statement was made based on fact what what SchemaShape
has less information than Schema
and we don't have actual data, so in some cases we will add annotations which was impossible to predict during GetOutputSchema time.
But nothing comes to my mind, and since tests are not failing, I would guess this is our current state of annotation business. #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so in some cases we will add annotations which was impossible to predict during GetOutputSchema time.
Maybe. The way the test worked, it made sure the promise was a superset of what was delivered -- that is, the estimator's annotation list would be a sort of "semi-promise" like, "ok, this annotation may be there or not, but if it is there, it will have this type," and so on. At least, so I must interpret this test as previously written.
There is one case in which this could potentially have been valuable, the key-to-vector annotation, as discussed in depth in issue #3380 (the special case where it was a known size vector that was of length precisely one). But that was such an obscure situation and such a micro-optimization that I hardly felt it worthwhile.
And anyway, the IEstimator
did things in the wrong way, it only said the slot names would be there if it was scalar, whereas if it actually wanted to capture this "possibility" it should have said that the slot names could be there if scalar or vector (but not var-vector!). But if you're going to go that far, I'd argue that the usefulness of SchemaShape
annotations data is utterly useless.
I could have imagined a "richer" schema shape where there were differing degrees of certainty -- "this will definitely be here, this might be here," etc., but in the end I decided with saying, "you know what, let's just make it a simple promise."
One nice thing is that IEstimator
is still not possible to fully implement, so hypothetical future authors might be able to refine this. Or come up with a future successor concept -- certainly not every ITransformer
is the result of fitting an IEstimator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #3408 +/- ##
==========================================
+ Coverage 72.69% 72.73% +0.03%
==========================================
Files 807 807
Lines 145171 145206 +35
Branches 16225 16230 +5
==========================================
+ Hits 105536 105614 +78
+ Misses 35220 35176 -44
- Partials 4415 4416 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -533,7 +533,7 @@ public override SchemaShape GetOutputSchema(SchemaShape inputSchema) | |||
throw Host.ExceptParam(nameof(inputSchema), $"Input column '{colInfo.inputColumnName}' doesn't contain key values metadata"); | |||
|
|||
SchemaShape metadata = null; | |||
if (col.Annotations.TryFindColumn(AnnotationUtils.Kinds.SlotNames, out var slotCol)) | |||
if (col.HasSlotNames() && col.Annotations.TryFindColumn(AnnotationUtils.Kinds.SlotNames, out var slotCol)) | |||
metadata = new SchemaShape(new[] { slotCol }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it was just something I noticed, that we were propagating that we were going to do this even if it wasn't the right type, which actually led to a test failure. So I just changed the condition to be a bit tighter is all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @TomFinley! Nice change! |
Fixes #3380.
The commits are one first commit where I change the test of estimators, and subsequent commits are where I fix each of the bugs in
IEstimator
/ITransformer
pairs that were revealed through that stricter test.Note that that calibrator change (the last commit at the time of writing), and the
KeyToVector
/KeyToBinaryVector
change (third commit) result in actual changes to the annotation data produced by the transformers, so could be considered behavioral changes. However I do not anticipate a negative effect.Some tests were also changed a little bit to give greater coverage.