Skip to content

Check for IEstimator/ITransformer schema consistency, fix bugs uncovered #3408

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 18, 2019

Conversation

TomFinley
Copy link
Contributor

Fixes #3380.

The commits are one first commit where I change the test of estimators, and subsequent commits are where I fix each of the bugs in IEstimator/ITransformer pairs that were revealed through that stricter test.

Note that that calibrator change (the last commit at the time of writing), and the KeyToVector/KeyToBinaryVector change (third commit) result in actual changes to the annotation data produced by the transformers, so could be considered behavioral changes. However I do not anticipate a negative effect.

Some tests were also changed a little bit to give greater coverage.

…etadata occur only on scalar inputs, not vector of size 1 inputs.
* Calibrator estimators and transformers now agree on what the annotation data they produce. Before, they were inconsistent.

* Calibrator transformers used to provide no annotation data. Now, the will provide the usual prediction kind annotations of the score column if that score column has that annotation. They also correctly identify their output as normalized unconditionally.
@TomFinley TomFinley self-assigned this Apr 18, 2019
@TomFinley TomFinley changed the title Doc estimator Check for IEstimator/ITransformer schema consistency, fix bugs uncovered Apr 18, 2019
// We want the 'promised' metadata to be a superset of 'delivered'.
Assert.True(y.IsCompatibleWith(x), $"Mismatch on {x.Name}");
Assert.True(d.IsCompatibleWith(p), $"Mismatch on {p.Name}, there was a mismatch, or some unexpected annotations was present.");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.True(d.IsCompatibleWith(p), $"Mismatch on {p.Name}, there was a mismatch, or some unexpected annotations was present."); [](start = 16, length = 127)

I'm trying to come up with some example where this wouldn't be true....
I think previous statement was made based on fact what what SchemaShape has less information than Schema and we don't have actual data, so in some cases we will add annotations which was impossible to predict during GetOutputSchema time.

But nothing comes to my mind, and since tests are not failing, I would guess this is our current state of annotation business. #ByDesign

Copy link
Contributor Author

@TomFinley TomFinley Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in some cases we will add annotations which was impossible to predict during GetOutputSchema time.

Maybe. The way the test worked, it made sure the promise was a superset of what was delivered -- that is, the estimator's annotation list would be a sort of "semi-promise" like, "ok, this annotation may be there or not, but if it is there, it will have this type," and so on. At least, so I must interpret this test as previously written.

There is one case in which this could potentially have been valuable, the key-to-vector annotation, as discussed in depth in issue #3380 (the special case where it was a known size vector that was of length precisely one). But that was such an obscure situation and such a micro-optimization that I hardly felt it worthwhile.

And anyway, the IEstimator did things in the wrong way, it only said the slot names would be there if it was scalar, whereas if it actually wanted to capture this "possibility" it should have said that the slot names could be there if scalar or vector (but not var-vector!). But if you're going to go that far, I'd argue that the usefulness of SchemaShape annotations data is utterly useless.

I could have imagined a "richer" schema shape where there were differing degrees of certainty -- "this will definitely be here, this might be here," etc., but in the end I decided with saying, "you know what, let's just make it a simple promise."

One nice thing is that IEstimator is still not possible to fully implement, so hypothetical future authors might be able to refine this. Or come up with a future successor concept -- certainly not every ITransformer is the result of fitting an IEstimator.

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@codecov
Copy link

codecov bot commented Apr 18, 2019

Codecov Report

Merging #3408 into master will increase coverage by 0.03%.
The diff coverage is 92.68%.

@@            Coverage Diff             @@
##           master    #3408      +/-   ##
==========================================
+ Coverage   72.69%   72.73%   +0.03%     
==========================================
  Files         807      807              
  Lines      145171   145206      +35     
  Branches    16225    16230       +5     
==========================================
+ Hits       105536   105614      +78     
+ Misses      35220    35176      -44     
- Partials     4415     4416       +1
Flag Coverage Δ
#Debug 72.73% <92.68%> (+0.03%) ⬆️
#production 68.27% <90.32%> (+0.04%) ⬆️
#test 88.98% <100%> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Core/Data/AnnotationUtils.cs 82% <0%> (ø) ⬆️
src/Microsoft.ML.Data/Transforms/KeyToValue.cs 79.16% <0%> (-0.65%) ⬇️
src/Microsoft.ML.Data/Transforms/KeyToVector.cs 81.81% <100%> (ø) ⬆️
...rc/Microsoft.ML.Data/Scorers/RowToRowScorerBase.cs 87.5% <100%> (ø) ⬆️
...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs 73.73% <100%> (+0.02%) ⬆️
src/Microsoft.ML.Data/Transforms/Hashing.cs 92.32% <100%> (+0.14%) ⬆️
...soft.ML.Transforms/Text/NgramHashingTransformer.cs 88.79% <100%> (+0.01%) ⬆️
...ML.Tests/TrainerEstimators/CalibratorEstimators.cs 100% <100%> (ø) ⬆️
...soft.ML.Tests/Transformers/CategoricalHashTests.cs 100% <100%> (ø) ⬆️
...ML.Tests/TrainerEstimators/MetalinearEstimators.cs 100% <100%> (ø) ⬆️
... and 10 more

Copy link
Contributor

@glebuk glebuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@@ -533,7 +533,7 @@ public override SchemaShape GetOutputSchema(SchemaShape inputSchema)
throw Host.ExceptParam(nameof(inputSchema), $"Input column '{colInfo.inputColumnName}' doesn't contain key values metadata");

SchemaShape metadata = null;
if (col.Annotations.TryFindColumn(AnnotationUtils.Kinds.SlotNames, out var slotCol))
if (col.HasSlotNames() && col.Annotations.TryFindColumn(AnnotationUtils.Kinds.SlotNames, out var slotCol))
metadata = new SchemaShape(new[] { slotCol });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it was just something I noticed, that we were propagating that we were going to do this even if it wasn't the right type, which actually led to a test failure. So I just changed the condition to be a bit tighter is all.

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@TomFinley TomFinley merged commit 082ab77 into dotnet:master Apr 18, 2019
@glebuk
Copy link
Contributor

glebuk commented Apr 19, 2019

Thanks, @TomFinley! Nice change!

@TomFinley TomFinley deleted the DocEstimator branch April 19, 2019 20:05
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout
4 participants