Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix column purpose for PipelineSweeperMacro #461

Merged
18 commits merged into from Jul 13, 2018
Merged

Fix column purpose for PipelineSweeperMacro #461

18 commits merged into from Jul 13, 2018

Conversation

ghost
Copy link

@ghost ghost commented Jun 30, 2018

Fixes #460

  • Adding arguments to the PipelineSweeperMacro. Users will be able to specify the columns for a particular purpose.
  • Updating the methods in the PurposeInference and TransformInference classes to pass the user provided column purposes as optional arguments.
  • Added unit tests for the sweeping engines we have currently (Defaults, Rocket, UniformRandom)

@dnfclas
Copy link

dnfclas commented Jun 30, 2018

CLA assistant check
All CLA requirements met. #Closed

@ghost
Copy link
Author

ghost commented Jun 30, 2018

@justinormont is added to the review. #Closed

@ghost
Copy link
Author

ghost commented Jun 30, 2018

@yaeldekel . Could you kindly take a look at this PR ? #Closed

var allPipelines = autoMlState.GetAllEvaluatedPipelines();
var bestPipeline = autoMlState.GetBestPipeline();
Assert.Equal(allPipelines.Length, numIterations);
Assert.True(bestPipeline.PerformanceSummary.MetricValue > 0.87);
Copy link
Contributor

@justinormont justinormont Jun 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend a range. Perhaps 87-89?

In the future, we may cause a Label to be leaked, and I'd like that caught. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Justin. To catch issues with leaky label I have added ranges on the training AUC.

Also added ranges for the test AUC.


In reply to: 199331069 [](ancestors = 199331069)

@@ -370,19 +371,21 @@ public void UpdateTerminator(ITerminator terminator)
TransformInference.SuggestedTransform[] existingTransforms = null)
{
// Infer transforms using experts
var levelTransforms = TransformInference.InferTransforms(_env, data, args);
var levelTransforms = TransformInference.InferTransforms(_env, data, args, this._columnPurpose);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this [](start = 91, length = 4)

we tend to omit usage of "this" word. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


In reply to: 199589164 [](ancestors = 199589164)

@@ -36,6 +36,33 @@ public sealed class Arguments

[Argument(ArgumentType.AtMostOnce, HelpText = "Output datasets from previous iteration of sweep.", SortOrder = 7, Hide = true)]
public IDataView[] CandidateOutputs;

[Argument(ArgumentType.MultipleUnique, HelpText = "Column(s) to use as purpose 'Ignore'", SortOrder = 8, Hide = true)]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide = true [](start = 117, length = 11)

why you hide them? #Resolved

Copy link
Author

@ghost ghost Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Any thoughts on what's the implication of the 'Hide' argument for a Macro? Wasn't clear to me.


In reply to: 199589626 [](ancestors = 199589626)

@@ -98,6 +186,9 @@ public static Output ExtractSweepResult(IHostEnvironment env, ResultInput input)
"Must have a valid AutoML State, or pass arguments to create one.");
env.Check(input.BatchSize > 0, "Batch size must be > 0.");

// Get the user-defined column purposes (if any)
var columnPurpose = SetColumnPurpose(env, input);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetColumnPurpose [](start = 32, length = 16)

as you mention in comment you use this method to "Get' purposes, so it's better to name this method GetColumnPurpose #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


In reply to: 199590108 [](ancestors = 199590108)

foreach (var transform in pipeline.Transforms)
{
transformK += 1;
ch.Info($"Transform {transformK} : {transform.Transform.ToString()}");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ToString() [](start = 83, length = 11)

i'm not sure you need to call ToString here. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.


In reply to: 199591671 [](ancestors = 199591671)

currentBatchSize,
this._columnPurpose);

var h = _env.Register("AutoMlMlState");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var h = _env.Register("AutoMlMlState"); [](start = 15, length = 40)

you should have _host field, no reason to create new host variable. #Resolved

this._columnPurpose);

var h = _env.Register("AutoMlMlState");
using (var ch = h.Start("GetNextCandidates"))
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetNextCandidates [](start = 41, length = 17)

Print suggested transforms? #Resolved

{
foreach (var pipeline in BatchCandidates)
{
ch.Info("AutoInference Suggested Transforms.");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoInference Suggested Transforms. [](start = 33, length = 35)

either add pipeline identification, or remove it, what's the point to repeat same phrase all the time? #Resolved

/// <returns>The result includes the array of auto-detected column purposes.</returns>
public static InferenceResult InferPurposes(IHostEnvironment env, IDataView data, IEnumerable<int> columnIndices, Arguments args)
public static InferenceResult InferPurposes(IHostEnvironment env, IDataView data, IEnumerable<int> columnIndices, Arguments args,
Dictionary<string, ColumnPurpose> columnPurpose = null)
Copy link
Member

@sfilipi sfilipi Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionary<string, ColumnPurpose> columnPurpose = null [](start = 12, length = 54)

Why is this needed here? Should we intersect the result of this, with the purposes we have, in the places that call this? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need it here because we want to first honor the column purpose set by the user -- In line #344 we set the 'SuggestedPurpose' to be what the user provided. By doing this we ensure the 'Experts' below do not do purpose inference (and hence transform inference) on those user provided columns. Auto inference will only be done on columns which do not have user intent


In reply to: 199592844 [](ancestors = 199592844)

"Default": null
},
{
"Name": "NumericFeatureColumn",
Copy link
Member

@sfilipi sfilipi Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumericFeatureColumn [](start = 19, length = 20)

I'd keep Column plural: NumericFeatureColumns, in all of those. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using plural form as suggested


In reply to: 199593350 [](ancestors = 199593350)

@@ -30,12 +30,13 @@ public UniformRandomEngine(IHostEnvironment env)
: base(env, env.Register("UniformRandomEngine(AutoML)"))
{}

public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates)
public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates,
Dictionary<string, ColumnPurpose> columnPurpose = null)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnPurpose [](start = 46, length = 13)

any chance you can move them to base class PipelineOptimizerBase so you don't have to pass them around all the time? #Resolved

Copy link
Author

@ghost ghost Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the engines, I have created a field in PipelineOptimizerBase. This reduces some clutter.


In reply to: 199593664 [](ancestors = 199593664)

new ImportTextData.Input { InputFile = inputFileTest, CustomSchema = schema }).Data.Take(numOfSampleRows);
#pragma warning restore 0618
// Define entrypoint graph
string inputGraph = @"
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what most of tests written in way of "Create manually json string and execute it", but it doesn't mean it has to be this way.
check TestSimpleExperiment
any chance I can convince you to use C# classes instead of composing JSON string manually? #Resolved

Copy link
Author

@ghost ghost Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a 4th test case (TestPipelineSweeperMacroColumnPurpose) which follows this paradigm. Does it look good ?

The other 3 tests I would like to keep as is if possible


In reply to: 199594792 [](ancestors = 199594792)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also vote to keep some of them as strings, because they test the codepath of translating from string to components.


In reply to: 199963125 [](ancestors = 199963125,199594792)

@@ -30,8 +30,10 @@ public UniformRandomEngine(IHostEnvironment env)
: base(env, env.Register("UniformRandomEngine(AutoML)"))
{}

public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates)
public override PipelinePattern[] GetNextCandidates(IEnumerable<PipelinePattern> history, int numberOfCandidates,
Dictionary<string, ColumnPurpose> colPurpose = null)
Copy link
Contributor

@TomFinley TomFinley Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

colPurpose [](start = 46, length = 10)

I am somewhat unenthusiastic about this, since it basically seems like role-mappings, except in reverse (which also makes it, incidentally, less capable). Is there any possibility of re-using that concept, rather than inventing another thing that acts more or less just like it? #Resolved

Copy link
Author

@ghost ghost Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the 'recipe framework' added the concept of 'Purpose' previously. Not quite sure why the decision was made at that time to introduce Purpose as an additional concept, instead of re-using the concept of Roles and extending it.

<snip from InferenceUtil.cs>
public enum ColumnPurpose
{
Ignore = 0,
Name = 1,
Label = 2,
NumericFeature = 3,
CategoricalFeature = 4,
TextFeature = 5,
Weight = 6,
GroupId = 7,
ImagePath = 8
}

Is there a simple way to extend Roles to include purpose (e.g. 'TextFeature', 'Ignore') ? My understanding is that it would be a bigger change that warrants a separate design / PR. Let me know if that sounds reasonable.

This PR is to address the particular case where 'recipe framework' inferred the purpose incorrectly. In sich cases the user has no way of overriding the inferred purpose when using the PipelineSweeperMacro. This PR unblocks such scenarios.


In reply to: 199661532 [](ancestors = 199661532)

Copy link
Contributor

@TomFinley TomFinley Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't extend roles, you simply use them. If you want to introduce roles appropriate for your task, that is totally fine and supported. I wonder though if the issue is that you do not have an actual ISchema implementation we're working on? #Resolved

Copy link
Member

@sfilipi sfilipi Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ColumnPurpose and Roles the same things? Feels to me they are not. There is a ColumnPurpose of "ImagePath" but not such a Role, and @justin has ideas about adding more purposes. Purposes would be used to just communicate to Experts what transforms to suggest for the pipelines. Learners would not neccessarly need to know about them.

I have been wondering whether it makes more sense to extend the Loader columns to have a field/property of type ColumnPurpose, rather than pass yet another list of purposes, so they become part of the Schema, and whatever needs the purposes (the inference code) can just look it up in the columns.


In reply to: 200406010 [](ancestors = 200406010)

Copy link
Author

@ghost ghost Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RecipeInference uses ColumnPurpose to infer the set of transforms to use (and also the columns used for each transform) .

In the GUI users can modify the ColumnPurpose through the wizard. We are adding arguments to the Macro so users can specify the purpose when using the macro programmatically.

Am not sure which ISchema implementation is being referred to. If you are referring to an ISchema implementation at the end of all transformations then yes, we do not have the final ISchema implemenation at this point in the computation. Hence the need to pass ColumnPurpose so we can infer the set of transforms and get to the final ISchema implementation.

Or, is the suggestion to completely do away with the notion of ColumnPurpose ?


In reply to: 200410413 [](ancestors = 200410413,200406010)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a couple of changes to address this:

  • Dropped ‘Ignore’ from Arguments to the PipelineSweepMacro.
  • Using RoleMappedData in the API (not exposing ColumnPurpose anymore in the APIs)

In reply to: 200723741 [](ancestors = 200723741,200410413,200406010)

@@ -44,6 +45,7 @@ public abstract class PipelineOptimizerBase : IPipelineOptimizer
protected IDataView OriginalData;
protected IDataView FullyTransformedData;
protected AutoInference.DependencyMap DependencyMapping;
protected Dictionary<string, ColumnPurpose> columnPurpose;
Copy link
Contributor

@TomFinley TomFinley Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnPurpose; [](start = 52, length = 14)

C# naming conventions, please. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 199662160 [](ancestors = 199662160)

Copy link
Contributor

@justinormont justinormont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is simply copying what we do in the GUI Wizard to the PipelineSweeperMacro; the overall style for this was laid out in Recipes previously.

@codemzs : Since you created Recipes, can you take a look?

@shauheen shauheen requested a review from codemzs July 10, 2018 07:45
@@ -36,6 +36,30 @@ public sealed class Arguments

[Argument(ArgumentType.AtMostOnce, HelpText = "Output datasets from previous iteration of sweep.", SortOrder = 7, Hide = true)]
public IDataView[] CandidateOutputs;

[Argument(ArgumentType.MultipleUnique, HelpText = "Column(s) to use as purpose 'Label'", SortOrder = 8)]
Copy link
Member

@sfilipi sfilipi Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

purpose [](start = 83, length = 7)

do we want to use "role" rather than purpose. #Resolved

public string[] TextFeatureColumns;

[Argument(ArgumentType.MultipleUnique, HelpText = "Column(s) to use as Role 'ImagePath'", SortOrder = 15)]
public string[] ImagePathColumns;
Copy link
Member

@sfilipi sfilipi Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also log a bug about finding a more scalable solution? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a bug for that already


In reply to: 202428704 [](ancestors = 202428704)

{
var roles = new List<KeyValuePair<RoleMappedSchema.ColumnRole, string>>();

if (input.LabelColumns != null)
Copy link
Member

@sfilipi sfilipi Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (input.LabelColumns != null) [](start = 12, length = 31)

would you want to make specifying this mandatory? #ByDesign

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one scenario that needs to be supported is when user does not specify anything. in that case case we will do AutoInference on all columns.

That's why these arguments (including Label) are not mandatory. These arguments are meant to give user the ability to override the auto inference (if needed)


In reply to: 202428925 [](ancestors = 202428925)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@ghost ghost merged commit 3053f3d into dotnet:master Jul 13, 2018
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
* Adding arguments to PipelineSweep Macro

* updating the unit tests

* taking care of review comments; adding validations for Label, Weight, GroupId and Name columns

* taking care of some review comments

* some code cleanup

* addressing PR comments

* API changes to use RoleMappedData

* taking care of review comments

* using pipeline.UniqueId

* taking care of review comments. update ColumnPurpose 'Group' so it is consistent with Role 'Group'
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants