-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix column purpose for PipelineSweeperMacro #461
Conversation
Sync with latest ML.NET
Merge with latest master of dotnet/machinelearning
Merge with latest dotnet master
@justinormont is added to the review. #Closed |
@yaeldekel . Could you kindly take a look at this PR ? #Closed |
var allPipelines = autoMlState.GetAllEvaluatedPipelines(); | ||
var bestPipeline = autoMlState.GetBestPipeline(); | ||
Assert.Equal(allPipelines.Length, numIterations); | ||
Assert.True(bestPipeline.PerformanceSummary.MetricValue > 0.87); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend a range. Perhaps 87-89?
In the future, we may cause a Label to be leaked, and I'd like that caught. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Justin. To catch issues with leaky label I have added ranges on the training AUC.
Also added ranges for the test AUC.
In reply to: 199331069 [](ancestors = 199331069)
… GroupId and Name columns
@@ -370,19 +371,21 @@ public void UpdateTerminator(ITerminator terminator) | |||
TransformInference.SuggestedTransform[] existingTransforms = null) | |||
{ | |||
// Infer transforms using experts | |||
var levelTransforms = TransformInference.InferTransforms(_env, data, args); | |||
var levelTransforms = TransformInference.InferTransforms(_env, data, args, this._columnPurpose); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this [](start = 91, length = 4)
we tend to omit usage of "this" word. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -36,6 +36,33 @@ public sealed class Arguments | |||
|
|||
[Argument(ArgumentType.AtMostOnce, HelpText = "Output datasets from previous iteration of sweep.", SortOrder = 7, Hide = true)] | |||
public IDataView[] CandidateOutputs; | |||
|
|||
[Argument(ArgumentType.MultipleUnique, HelpText = "Column(s) to use as purpose 'Ignore'", SortOrder = 8, Hide = true)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hide = true [](start = 117, length = 11)
why you hide them? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Any thoughts on what's the implication of the 'Hide' argument for a Macro? Wasn't clear to me.
In reply to: 199589626 [](ancestors = 199589626)
@@ -98,6 +186,9 @@ public static Output ExtractSweepResult(IHostEnvironment env, ResultInput input) | |||
"Must have a valid AutoML State, or pass arguments to create one."); | |||
env.Check(input.BatchSize > 0, "Batch size must be > 0."); | |||
|
|||
// Get the user-defined column purposes (if any) | |||
var columnPurpose = SetColumnPurpose(env, input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SetColumnPurpose [](start = 32, length = 16)
as you mention in comment you use this method to "Get' purposes, so it's better to name this method GetColumnPurpose #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach (var transform in pipeline.Transforms) | ||
{ | ||
transformK += 1; | ||
ch.Info($"Transform {transformK} : {transform.Transform.ToString()}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.ToString() [](start = 83, length = 11)
i'm not sure you need to call ToString here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currentBatchSize, | ||
this._columnPurpose); | ||
|
||
var h = _env.Register("AutoMlMlState"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var h = _env.Register("AutoMlMlState"); [](start = 15, length = 40)
you should have _host field, no reason to create new host variable. #Resolved
this._columnPurpose); | ||
|
||
var h = _env.Register("AutoMlMlState"); | ||
using (var ch = h.Start("GetNextCandidates")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetNextCandidates [](start = 41, length = 17)
Print suggested transforms? #Resolved
{ | ||
foreach (var pipeline in BatchCandidates) | ||
{ | ||
ch.Info("AutoInference Suggested Transforms."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AutoInference Suggested Transforms. [](start = 33, length = 35)
either add pipeline identification, or remove it, what's the point to repeat same phrase all the time? #Resolved
/// <returns>The result includes the array of auto-detected column purposes.</returns> | ||
public static InferenceResult InferPurposes(IHostEnvironment env, IDataView data, IEnumerable<int> columnIndices, Arguments args) | ||
public static InferenceResult InferPurposes(IHostEnvironment env, IDataView data, IEnumerable<int> columnIndices, Arguments args, | ||
Dictionary<string, ColumnPurpose> columnPurpose = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dictionary<string, ColumnPurpose> columnPurpose = null [](start = 12, length = 54)
Why is this needed here? Should we intersect the result of this, with the purposes we have, in the places that call this? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need it here because we want to first honor the column purpose set by the user -- In line #344 we set the 'SuggestedPurpose' to be what the user provided. By doing this we ensure the 'Experts' below do not do purpose inference (and hence transform inference) on those user provided columns. Auto inference will only be done on columns which do not have user intent
In reply to: 199592844 [](ancestors = 199592844)
"Default": null | ||
}, | ||
{ | ||
"Name": "NumericFeatureColumn", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumericFeatureColumn [](start = 19, length = 20)
I'd keep Column plural: NumericFeatureColumns, in all of those. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -30,12 +30,13 @@ public UniformRandomEngine(IHostEnvironment env) | |||
: base(env, env.Register("UniformRandomEngine(AutoML)")) | |||
{} | |||
|
|||
public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates) | |||
public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates, | |||
Dictionary<string, ColumnPurpose> columnPurpose = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columnPurpose [](start = 46, length = 13)
any chance you can move them to base class PipelineOptimizerBase so you don't have to pass them around all the time? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the engines, I have created a field in PipelineOptimizerBase. This reduces some clutter.
In reply to: 199593664 [](ancestors = 199593664)
new ImportTextData.Input { InputFile = inputFileTest, CustomSchema = schema }).Data.Take(numOfSampleRows); | ||
#pragma warning restore 0618 | ||
// Define entrypoint graph | ||
string inputGraph = @" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what most of tests written in way of "Create manually json string and execute it", but it doesn't mean it has to be this way.
check TestSimpleExperiment
any chance I can convince you to use C# classes instead of composing JSON string manually? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a 4th test case (TestPipelineSweeperMacroColumnPurpose) which follows this paradigm. Does it look good ?
The other 3 tests I would like to keep as is if possible
In reply to: 199594792 [](ancestors = 199594792)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also vote to keep some of them as strings, because they test the codepath of translating from string to components.
In reply to: 199963125 [](ancestors = 199963125,199594792)
@@ -30,8 +30,10 @@ public UniformRandomEngine(IHostEnvironment env) | |||
: base(env, env.Register("UniformRandomEngine(AutoML)")) | |||
{} | |||
|
|||
public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates) | |||
public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates, | |||
Dictionary<string, ColumnPurpose> colPurpose = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
colPurpose [](start = 46, length = 10)
I am somewhat unenthusiastic about this, since it basically seems like role-mappings, except in reverse (which also makes it, incidentally, less capable). Is there any possibility of re-using that concept, rather than inventing another thing that acts more or less just like it? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the 'recipe framework' added the concept of 'Purpose' previously. Not quite sure why the decision was made at that time to introduce Purpose as an additional concept, instead of re-using the concept of Roles and extending it.
<snip from InferenceUtil.cs>
public enum ColumnPurpose
{
Ignore = 0,
Name = 1,
Label = 2,
NumericFeature = 3,
CategoricalFeature = 4,
TextFeature = 5,
Weight = 6,
GroupId = 7,
ImagePath = 8
}
Is there a simple way to extend Roles to include purpose (e.g. 'TextFeature', 'Ignore') ? My understanding is that it would be a bigger change that warrants a separate design / PR. Let me know if that sounds reasonable.
This PR is to address the particular case where 'recipe framework' inferred the purpose incorrectly. In sich cases the user has no way of overriding the inferred purpose when using the PipelineSweeperMacro. This PR unblocks such scenarios.
In reply to: 199661532 [](ancestors = 199661532)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't extend roles, you simply use them. If you want to introduce roles appropriate for your task, that is totally fine and supported. I wonder though if the issue is that you do not have an actual ISchema
implementation we're working on? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ColumnPurpose and Roles the same things? Feels to me they are not. There is a ColumnPurpose of "ImagePath" but not such a Role, and @justin has ideas about adding more purposes. Purposes would be used to just communicate to Experts what transforms to suggest for the pipelines. Learners would not neccessarly need to know about them.
I have been wondering whether it makes more sense to extend the Loader columns to have a field/property of type ColumnPurpose, rather than pass yet another list of purposes, so they become part of the Schema, and whatever needs the purposes (the inference code) can just look it up in the columns.
In reply to: 200406010 [](ancestors = 200406010)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RecipeInference uses ColumnPurpose to infer the set of transforms to use (and also the columns used for each transform) .
In the GUI users can modify the ColumnPurpose through the wizard. We are adding arguments to the Macro so users can specify the purpose when using the macro programmatically.
Am not sure which ISchema
implementation is being referred to. If you are referring to an ISchema
implementation at the end of all transformations then yes, we do not have the final ISchema
implemenation at this point in the computation. Hence the need to pass ColumnPurpose so we can infer the set of transforms and get to the final ISchema
implementation.
Or, is the suggestion to completely do away with the notion of ColumnPurpose ?
In reply to: 200410413 [](ancestors = 200410413,200406010)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a couple of changes to address this:
- Dropped ‘Ignore’ from Arguments to the PipelineSweepMacro.
- Using RoleMappedData in the API (not exposing ColumnPurpose anymore in the APIs)
In reply to: 200723741 [](ancestors = 200723741,200410413,200406010)
@@ -44,6 +45,7 @@ public abstract class PipelineOptimizerBase : IPipelineOptimizer | |||
protected IDataView OriginalData; | |||
protected IDataView FullyTransformedData; | |||
protected AutoInference.DependencyMap DependencyMapping; | |||
protected Dictionary<string, ColumnPurpose> columnPurpose; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columnPurpose; [](start = 52, length = 14)
C# naming conventions, please. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is simply copying what we do in the GUI Wizard to the PipelineSweeperMacro; the overall style for this was laid out in Recipes previously.
@codemzs : Since you created Recipes, can you take a look?
Merge latest code from dotnet/master
Merge latest code from agoswamiazureml/master
…amiazureml/machinelearning into agoswami/purpose_inference
@@ -36,6 +36,30 @@ public sealed class Arguments | |||
|
|||
[Argument(ArgumentType.AtMostOnce, HelpText = "Output datasets from previous iteration of sweep.", SortOrder = 7, Hide = true)] | |||
public IDataView[] CandidateOutputs; | |||
|
|||
[Argument(ArgumentType.MultipleUnique, HelpText = "Column(s) to use as purpose 'Label'", SortOrder = 8)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
purpose [](start = 83, length = 7)
do we want to use "role" rather than purpose. #Resolved
… consistent with Role 'Group'
merge with dotnet/master
update to latest master
public string[] TextFeatureColumns; | ||
|
||
[Argument(ArgumentType.MultipleUnique, HelpText = "Column(s) to use as Role 'ImagePath'", SortOrder = 15)] | ||
public string[] ImagePathColumns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also log a bug about finding a more scalable solution? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
var roles = new List<KeyValuePair<RoleMappedSchema.ColumnRole, string>>(); | ||
|
||
if (input.LabelColumns != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (input.LabelColumns != null) [](start = 12, length = 31)
would you want to make specifying this mandatory? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one scenario that needs to be supported is when user does not specify anything. in that case case we will do AutoInference on all columns.
That's why these arguments (including Label) are not mandatory. These arguments are meant to give user the ability to override the auto inference (if needed)
In reply to: 202428925 [](ancestors = 202428925)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Adding arguments to PipelineSweep Macro * updating the unit tests * taking care of review comments; adding validations for Label, Weight, GroupId and Name columns * taking care of some review comments * some code cleanup * addressing PR comments * API changes to use RoleMappedData * taking care of review comments * using pipeline.UniqueId * taking care of review comments. update ColumnPurpose 'Group' so it is consistent with Role 'Group'
Fixes #460