Code generate TextLoader API and make it backward compatible with existing TextLoader API. #38
…test data file instead of train file.
@glebuk is added to the review. #Closed
/// <summary>
/// Import a dataset from a text file
/// </summary>
public sealed partial class CustomTextLoader
CustomTextLoader [](start = 36, length = 16)
Why not just TextLoader? Is this a user-visible name? #Resolved
TextLoader is the name of the new entry point for the text loader, which does not use the custom schema. I have renamed the old text loader entry point to CustomTextLoader, as suggested by @TomFinley. #Resolved
Should it be used directly by end users? Is it visible? Should it be deprecated? How can we avoid confusion about which one to use?
In reply to: 186324368 [](ancestors = 186324368)
TextLoader (the new entry point API) can be used directly by users; see Scenario3, which shows that. I can get rid of CustomTextLoader, but I believe you didn't want it removed... we should chat offline about this :) #Resolved
I think we should keep it for now, but I'd like to find a way to mark it as obsolete for a while, clarify who was using it (if anyone), and give them a fair chance to change their code or raise objections as to why we should keep it. We can then remove it in a separate PR if we judge that to be appropriate. #Resolved
}

public sealed class TextLoaderRange
TextLoaderRange [](start = 28, length = 15)
nice! #Pending
Arguments.TrimWhitespace = trimWhitespace;
}

private string TypeToName(Type type)
TypeToName [](start = 23, length = 10)
Don't we already have this function for all types in utils?
Also, isn't the text loader capable of loading other types, like ints? #Resolved
Ah... I see TlcModule.GetDataType. That does the mappings listed here, plus a few more.
Even if for some reason we cannot use that, let's have this method use DataKind directly. We are specifying these things like "R4" and "BL", but these are not random string literals; they are the names of enums in our codebase, so we should just use the enums. The reason is that direct references are easier to track with VS/ReSharper, Roslyn analyzers, and other similar tools.
In reply to: 186311810 [](ancestors = 186311810)
Thanks but it seems we need to use TryGetDataKind.
In reply to: 186565505 [](ancestors = 186565505,186311810)
@@ -317,7 +317,7 @@ public bool IsValid()
}
}

- public sealed class Arguments : ArgumentsCore
+ public class Arguments : ArgumentsCore
public class Arguments : ArgumentsCore [](start = 7, length = 39)
revert #Resolved
@@ -43,5 +53,16 @@ public static Output ImportText(IHostEnvironment env, Input input)
var loader = host.CreateLoader(string.Format("Text{{{0}}}", input.CustomSchema), new FileHandleSource(input.InputFile));
return new Output { Data = loader };
}

[TlcModule.EntryPoint(Name = "Data.TextLoader", Desc = "Import a dataset from a text file", NoSeal = true)]
NoSeal = true [](start = 100, length = 13)
Just curious, what is the utility of having the class unsealed? Do we expect/want people to inherit from the auto-generated code? #Resolved
It is for the convenience API for the text loader, so that it can inherit all the loader settings. Once we get rid of that API, there will be no need to unseal the class.
In reply to: 186547720 [](ancestors = 186547720)
public float InitWtsDiameter { get; set; }

/// <summary>
/// Whether to shuffle for each training iteration
/// </summary>
- [TlcModule.SweepableDiscreteParamAttribute("Shuffle", new object[]{false, true})]
+ [TlcModule.SweepableDiscreteParamAttribute("Shuffle", new object[] { false, true })]
ct[] { fal [](start = 74, length = 10)
While we should probably change the code generator to produce more properly formatted code, as we see here, I wonder if you intended to make this change? I suspect you did not, since I do not see a corresponding change in CSharpApiGenerator.cs that would explain this format change. Perhaps something happened to this file that made your Visual Studio reformat it? I might choose to revert these changes in that file, since the next time someone generates this file we would revert back anyway, leading to a somewhat dirty blame. #Resolved
writer.WriteLine("{");
writer.Indent();
writer.WriteLine();
if (classBase.Contains("ILearningPipelineLoader"))
ILearningPipelineLoader [](start = 36, length = 23)
So I'm curious: I might have expected "special" kinds like these to be detected using direct type comparisons on entryPointInfo.InputKinds. That code would be considerably safer and simpler than searching strings, but does something make that impossible?
Having these detections be string comparisons makes me uneasy. This makes the usage invisible to any refactoring utilities (either within VS or ReSharper), so from that perspective even a nameof would make me feel better. But this .Contains strategy will also catch things where the name is either a prefix or suffix of a type. (E.g., testing against IFoo will match IFooBar, which is not intentional.) #Resolved
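To make the prefix/suffix concern concrete, here is a minimal sketch contrasting a substring test with a type-based test. The interfaces IFoo and IFooBar come from the comment above; the loader classes and the KindDetection helper are hypothetical names used only for illustration, not code from this PR.

```csharp
using System;

// Hypothetical marker interfaces, named after the IFoo / IFooBar example above.
public interface IFoo { }
public interface IFooBar { }

// Hypothetical implementations, one per interface.
public sealed class FooLoader : IFoo { }
public sealed class FooBarLoader : IFooBar { }

public static class KindDetection
{
    // Fragile: matches any base-class text that merely contains "IFoo",
    // so "IFooBar" is reported as a match as well.
    public static bool IsFooByName(string classBase)
    {
        return classBase.Contains("IFoo");
    }

    // Safer: asks the type system directly. The typeof reference is also
    // visible to rename and find-usages tooling.
    public static bool IsFooByType(Type type)
    {
        return typeof(IFoo).IsAssignableFrom(type);
    }
}
```

With these definitions, KindDetection.IsFooByName("IFooBar") returns true even though IFooBar is unrelated to IFoo, while KindDetection.IsFooByType(typeof(FooBarLoader)) returns false and KindDetection.IsFooByType(typeof(FooLoader)) returns true.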
public sealed class Output
{
    [TlcModule.Output(Desc = "The resulting data view", SortOrder = 1)]
    public IDataView Data;
}

- [TlcModule.EntryPoint(Name = "Data.TextLoader", Desc = "Import a dataset from a text file")]
+ [TlcModule.EntryPoint(Name = "Data.CustomTextLoader", Desc = "Import a dataset from a text file")]
public static Output ImportText(IHostEnvironment env, Input input)
We are keeping this around, I suppose, but I wonder if we shouldn't have some mechanism for its eventual deprecation, since I imagine we don't plan to include it forever.
Typically I think this is accomplished using a System.ObsoleteAttribute. I wonder if the code-gen step could check whether the entry point is marked with that attribute and, if so, carry the same attribute over to the autogenerated class or method (what is currently in CSharpApi.cs). I see you have changed CSharpApiGenerator.cs... that is, if there is an ObsoleteAttribute with message "foo" in the attribute list, declare [Obsolete("foo")] on the generated class.
It strikes me as a good time to maybe add that... it should, I hope, be a few lines of code, unless I'm mistaken, since you could get the obsolete attribute, if any, from EntryPointInfo.Method.Attributes. #Resolved
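A rough sketch of that idea, assuming the entry point's method is available as a System.Reflection.MethodInfo. The ObsoleteCodeGen class, its two sample entry points, and GetObsoleteDeclaration are hypothetical names for illustration; this is not the code generator's actual implementation.

```csharp
using System;
using System.Reflection;

public static class ObsoleteCodeGen
{
    // Hypothetical stand-in for an entry point that is being phased out.
    [Obsolete("Use TextLoader instead.")]
    public static void OldEntryPoint() { }

    // Hypothetical stand-in for an entry point that is not obsolete.
    public static void NewEntryPoint() { }

    // Returns the attribute line a generator could write above the generated
    // class, or null when the method carries no ObsoleteAttribute.
    public static string GetObsoleteDeclaration(MethodInfo method)
    {
        ObsoleteAttribute obsolete = method.GetCustomAttribute<ObsoleteAttribute>();
        if (obsolete == null)
            return null;
        if (string.IsNullOrEmpty(obsolete.Message))
            return "[Obsolete]";

        // Escape backslashes and quotes so the message stays a valid C# string literal.
        string escaped = obsolete.Message.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "[Obsolete(\"" + escaped + "\")]";
    }
}
```

For example, passing typeof(ObsoleteCodeGen).GetMethod("OldEntryPoint") yields the text [Obsolete("Use TextLoader instead.")], and passing the NewEntryPoint method yields null, so the generator would write nothing extra.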
public sealed class Output
{
    [TlcModule.Output(Desc = "The resulting data view", SortOrder = 1)]
    public IDataView Data;
}

- [TlcModule.EntryPoint(Name = "Data.TextLoader", Desc = "Import a dataset from a text file")]
+ [TlcModule.EntryPoint(Name = "Data.CustomTextLoader", Desc = "Import a dataset from a text file")]
CustomTextLoader [](start = 43, length = 16)
Why do we need it if we have a generic, strictly typed version? #Resolved
This existing code contains two more examples of things that should really be direct type checks, rather than looking for substrings. #Resolved Refers to: src/Microsoft.ML/Runtime/Internal/Tools/CSharpApiGenerator.cs:844 in 51d5658. [](commit_id = 51d5658, deletion_comment = False)
@@ -191,6 +191,7 @@ public static string FindColumn(IExceptionContext ectx, ISchema schema, Optional
/// </summary>
public static class CommonInputs
{
+
This stray newline was probably added accidentally and should be reverted. #Resolved
This should be changed so that programmatically people just specify a Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:298 in 51d5658. [](commit_id = 51d5658, deletion_comment = False) |
pipeline.Add(new TextLoader(dataPath)
{
    Arguments = new TextLoaderArguments
I thought we were going to go with the API that was spec'd in https://github.com/dotnet/machinelearning/blob/master/Documentation/specs/mvp.md#training?
var options = CsvOptions.CreateFrom<HousePriceData>(separator: '\t', header: true);
namespace Microsoft.MachineLearning
{
    public enum DataType
    {
        Boolean,
        Int32,
        Single,
        Double,
        String
    }
    public class CsvColumn
    {
        public CsvColumn(string name, DataType dataType, int ordinal);
        public string Name { get; }
        public DataType DataType { get; }
        public int Ordinal { get; }
    }
    public class CsvOptions
    {
        public static CsvOptions CreateFrom(char separator = ',', bool hasHeader = true, Type rowType);
        public static CsvOptions CreateFrom<T>(char separator = ',', bool hasHeader = true);
        public CsvOptions();
        public char Separator { get; set; }
        public bool HasHeader { get; set; }
        public CsvColumnCollection Columns { get; }
    }
    public class CsvColumnCollection : Collection<CsvColumn>
    {
        public CsvColumn Add(string name, DataType dataType);
        public CsvColumn Add(string name, DataType dataType, int ordinal);
    }
}
Do we still have plans on implementing this API?
On this format, should the types be just Type instead of a custom enum? How about locales while reading or writing? Should these separate CSV-specific types, such as CsvColumn and CsvColumnCollection, be rolled into some more generic structure? I mean, it would feel plausible to think of it as just a collection of data in row or column format (or some other format), while the data in files can be in any format (JSON, XML, whatnot). This is a simple example (though not that good, since it is not heterogeneous): https://gist.github.com/veikkoeeva/50c8f38ec46b0a3ce70467d16af00dc1, then ADO.NET is fuller, taking types into account too: https://docs.microsoft.com/en-us/dotnet/api/system.data.datatable?view=netframework-4.7.2 (https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/ado-net-datasets). I don't advocate taking a dependency on ADO.NET, but it might offer food for thought.
In general, reading or writing CSV is "easy", as in https://stackoverflow.com/questions/5116604/read-csv-using-linq, but in practice there are locale- and format-specific things (and just problems), and one might consider taking a dependency on a library (i.e. https://www.nuget.org/packages?q=CSV) and then having, say, an extension method such as .FromCsv(this FileInfo...): IEnumerable<TRecord> (and other appropriate convenience overloads). Thinking larger, if one could pull records from sources in IAsyncEnumerable fashion, when it becomes widely available, it might build nicely on many sources of data.
@veikkoeeva the types being a Type instead of an enum might be a good idea -- maybe something along the lines of a helper create method. I'd still like to have an enum in any such convenience, since it becomes clear what types are supported.
Regarding your other notes about more types, I certainly agree with that, but we have elsewhere addressed these problems not by changing TextLoader, but rather by introducing new implementations of IDataLoader (in the case where we're reading new formats, see e.g., PR 61), or just an IDataView in the case that nothing really is being loaded per se (see e.g., PR 106). For something like ADO.NET, you could imagine phrasing it as an IDataView implementation, then just plugging it in, and it should work. Of course, someone would have to write that implementation. (The name TextLoader is somewhat unfortunate, since it has, I've noticed, created the impression that it is meant to read any conceivable type of text, rather than conforming to its own specific format.)
@eerhardt thanks for sharing this. If the suggestion was that we take this document as a guideline, then we should consider this PR, or more precisely its successor PR 142, as an attempt to produce something in that spirit, in that it solves the problem; but being an actual attempt to solve the problem that undergoes peer review, it will necessarily diverge from a document like this.
I get the sense though that you're thinking this proposal should be taken literally. I also get the sense this proposal was written somewhat hurriedly. The following three flaws occur to me on cursory examination: Ordinal as int whiffed on vector-valued columns; DataType needlessly duplicates the existing concept of DataKind (but less completely: it missed many types, most seriously those related to dates and times, which people have already opened issues to point out); and finally it has Csv in its name despite the fact that the separator/delimiter is configurable. Anyway, perhaps open an issue, so these and other problems can be worked out more deliberatively. (Mayhap one is already opened on this subject, I don't know.)
You could also have a public collection SupportedTypes and have a check in construction that throws an error with an appropriate error message if unsupported types are tried. The benefit of Type is that in all the operations it's clear what the type (and its semantics) is after it has been constructed. I.e., checks and transformations happen only in the source -- possibly in the sink -- and this plays nicely with the rest of the ecosystem.
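As a rough illustration of the SupportedTypes suggestion, the sketch below validates a System.Type at construction time. ColumnSpec and its particular set of supported types are assumptions made up for this example; they are not part of the proposed spec or of any existing API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical column description; the class and its members are illustrative only.
public sealed class ColumnSpec
{
    // Assumed set of supported types, chosen for illustration.
    private static readonly HashSet<Type> Supported = new HashSet<Type>
    {
        typeof(bool), typeof(int), typeof(float), typeof(double), typeof(string)
    };

    // Public, read-only view so callers can discover what is supported.
    public static IReadOnlyCollection<Type> SupportedTypes
    {
        get { return Supported; }
    }

    public string Name { get; }
    public Type Type { get; }

    public ColumnSpec(string name, Type type)
    {
        if (name == null) throw new ArgumentNullException(nameof(name));
        if (type == null) throw new ArgumentNullException(nameof(type));

        // Reject unsupported types up front, with a message listing the valid ones.
        if (!Supported.Contains(type))
        {
            throw new ArgumentException(
                "Unsupported column type '" + type.FullName + "'. Supported types: " +
                string.Join(", ", Supported) + ".",
                nameof(type));
        }

        Name = name;
        Type = type;
    }
}
```

Here new ColumnSpec("Price", typeof(float)) succeeds, while new ColumnSpec("SoldOn", typeof(DateTime)) throws an ArgumentException, so every later operation can rely on the column's Type being one of the supported ones.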
By the way, if you look at ADO.NET, you notice there are vendor-specific types and then just general ADO.NET types, and there's some sort of a mapping between them when necessary. It looks like this. Notably there it's known that's all that's needed, but a more general-purpose mapping would take the vendor invariant and use that to map types better (Oracle, for instance, doesn't have a bool type).
In a way it looks like this is going in the direction of ADO.NET. If you look here, it's like opening a source, reading it, transforming the results into some other format, and then processing it in some way. (And you can wrap that one like here and then offer a simple API like here (see ReflectionSelector, which uses the mapping referred to earlier), and by playing with that idea, maybe many sources could be handled that way by supplying a lambda.)
I apologize if it's exhausting that I push various viewpoints. I suspect, though, you need to explain these things in any event later to people asking -- at least if MachineLearning.NET becomes popular as we hope. :) I'd also like to ask (and did ponder briefly in Gitter): if you're building a composition pipeline with reflection, and want to use something less fragile (judging from the comments), would MEF be an option? And if it would, then maybe use dependency injection like many other frameworks in the .NET ecosystem? And if not, maybe there needs to be documentation explaining how to do similar things.
Can you close this one? Or is it not a duplicate of #142?
This change was completed and checked in as PR #142.
* upgrade to latest ML.NET public surface (#246) * Upgrade to ML.NET 0.11 (#247) * initial changes * fix lightgbm * changed normalize method * added tests * fix tests * fix test * Private preview final API changes (#250) * .NET framework design guidelines applied to public surface * WhitelistedTrainers -> Trainers * Add estimator to public API iteration result (#248) * LightGBM pipeline serialization fix (#251) * Change order that we search for TextLoader's parameters (#256) * CLI IFileInfo null exception fix (#254) * Averaged Perceptron pipeline serialization fix (#257) * Upgrade command-line-api and default folder name change (#258) * change in defautl folderName * upgrade command line * Update src/mlnet/Program.cs Co-Authored-By: srsaggam <41802116+srsaggam@users.noreply.github.com> * eliminate IFileInfo from CLI (#260) * Rev samples towards private preview; ignored columns fix (#259) * remove unused methods in consolehelper and nit picks in generated code (#261) * nit picks * change in console helper * fix tests * add space * fix tests * added nuget sources in generated csproj (#262) * added nuget sources in csproj * changed the structure in generated code * space * upgrade to mlnet 0.11 (#263) * Formatting CLI metrics (#264) Ensures space between printed metrics (also model counter). Right aligned metrics. Extended AUC to four digits. 
* Add implementation of non -ova multi class trainers code gen (#267) * added non ova multi class learners * added tests * test cases * Add caching (#249) * AdvancedExperimentSettings sample nits (#265) * Add sampling key column (#268) * Initial work for multi-class classification support for CLI (#226) * Initial work for multi-class classification support for CLI * String updates * more strings * Whitelist non-OVA multi-class learners * Refactor the orchestration of AutoML calls (#272) * Do not auto-group columns with suggested purpose = 'Ignore' (#273) * Fix: during type inferencing, parse whitespace strings as NaN (#271) * Printing additional metrics in CLI for binary classification (#274) * Printing additional metrics in CLI for binary classification * Update src/mlnet/Utilities/ConsolePrinter.cs * Add API option to store models on disk (instead of in memory); fix IEstimator memory leak (#269) * Print failed iterations in CLI (#275) * change the type to float from double (#277) * cache arg implementation in CLI (#280) * cache implementation * corrected the null case * added tests for all cases * Remove duplicate value-to-key mapping transform for multiclass string labels (#283) * Add post-trainer transform SDK infra; add KeyToValueMapping transform to CLI; fix: for generated multiclass models, convert predicted label from key to original label column type (#286) * Implement ignore columns command line arg (#290) * normalize line endings * added --ignore-columns * null checks * unit tests * Print winning iteration and runtime in CLI (#288) * Print best metric and runtime * Print best metric and runtime * Line endings in AutoMLEngine.cs * Rename time column to duration to match Python SDK * Revert to MicroAccuracy and MacroAccuracy spellings * Revert spelling of BinaryClassificationMetricsAgent to BinaryMetricsAgent to reduce merge conflicts * Revert spelling of MulticlassMetricsAgent to MultiMetricsAgent to reduce merge conflicts * missed some files * Fix merge 
conflict * Update AutoMLEngine.cs * Add MacOS & Linux to CI; MacOS & Linux test fixes (#293) * MicroAccuracy as default for multi-class (#295) Change default optimization metric for multi-class classification to MicroAccuracy (accuracy). Previously it was set to MacroAccuracy. * Null exception for ignorecolumns in CLI (#294) * Null exception for ignorecolumns in CLI * Check if ignore-columns array has values (as the default is now a empty array) * Emit caching flag in pipeline object model. (Includes SuggestedPipelineBuilder refactor & debug string fixes / refactor) (#296) * removed sln (#297) * Caching enabling in code gen part -2 (#298) * add * added caching codegen * support comma separated values for --ignore-columns (#300) * default initialization for ignore columns (#302) * default initialization * adde null check * Codegen for multiclass non-ova (#303) * changes to template * multicalss codegen * test cases * fix test cases * Generated Project new structure. (#305) * added new templates * writing files to disck * change path * added new templates * misisng braces * fix bugs * format code * added util methods for solution file creation and addition of projects to it * added extra packages to project files * new tests * added correct path for sln * build fix * fix build * include using system in prediction class (#307) * added using * fix test * Random number generator is not thread safe (#310) * Random number generator is not thread safe * Another local random generator * Missed a few references * Referncing AutoMlUtils.random instead of a local RNG * More refs to mail RNG; remove Float as per #1669 * Missed Random.cs * Fix multiclass code gen (#314) * compile error in codegen * removes scores printing * fix bugs * fix test * Fix compile error in codegen project (#319) * removed redundant code * fix test case * Rev OVA pipeline node SDK output: wrap binary trainers as children inside parent OVA node (#317) * Ova Multi class codegen support (#321) * dummy * 
multiova implementation * fix tests * remove inclusion list * fix tests and console helper * Rev run result trainer name for OVA: output different trainer name for each OVA + binary learner combination (#322) * Rev run result trainer name for Ova: output different trainer name for each Ova + binary learner combination * test fixes * Console helper bug in generated code for multiclass (#323) * fix * fix test * looping perlogclass * fix test * Initial version of Progress bar impl and CLI UI experience (#325) * progressbar * added progressbar and refactoring * reverted * revert sign assembly * added headers and removed exception rethrow * Setting model directory to temp directory (#327) * Suggested changes to progress bar (#335) * progressbar * added progressbar and refactoring * reverted * revert sign assembly * added headers and removed exception rethrow * bug fixes and updates to UI * added friendly name printing for metric * formatting * Rev Samples (#334) * Telemetry2 (#333) * Create test.txt * Create test.txt * changes needed for benchmarking * forgot one file * merge conflict fix * fix build break * back out my version of the fix for Label column issue and fix the original fix * bogus file removal * undo SuggestedPipeline change * remove labelCol from pipeline suggester * fix build break * rename AutoML to Microsoft.ML.Auto everywhere and a shot at publishing nuget package (will probably need tweaks once I try to use the pipleline) * tweak queue in vsts-ci.yml * CLI telemetry implementation * Telemetry implementation * delete unnecessary file and change file size bucket to actually log log2 instead of nearest ceil value * add headers, remove comments * one more header missing * Fix progress bar in linux/osx (#336) * progressbar * added progressbar and refactoring * reverted * revert sign assembly * added headers and removed exception rethrow * bug fixes and updates to UI * added friendly name printing for metric * formatting * change from task to thread * 
Update src/mlnet/CodeGenerator/CodeGenerationHelper.cs Co-Authored-By: srsaggam <41802116+srsaggam@users.noreply.github.com> * Mem leak fix (#328) * Create test.txt * Create test.txt * changes needed for benchmarking * forgot one file * merge conflict fix * fix build break * back out my version of the fix for Label column issue and fix the original fix * bogus file removal * undo SuggestedPipeline change * remove labelCol from pipeline suggester * fix build break * rename AutoML to Microsoft.ML.Auto everywhere and a shot at publishing nuget package (will probably need tweaks once I try to use the pipleline) * tweak queue in vsts-ci.yml * there is still investigation to be done but this fix works and solves memory leak problems * minor refactor * Upgrade ML.NET package (#343) * Add cross-validation (CV), and auto-CV for small datasets; push common API experiment methods into base class (#287) * restore old yml for internal pipeline so we can publish nuget again to devdiv stream (#344) * Polishing the CLI UI part-1 (#338) * formatting of pbar message * Polishing the UI * optimization * rename variable * Update src/mlnet/AutoML/AutoMLEngine.cs Co-Authored-By: srsaggam <41802116+srsaggam@users.noreply.github.com> * Update src/mlnet/CodeGenerator/CodeGenerationHelper.cs Co-Authored-By: srsaggam <41802116+srsaggam@users.noreply.github.com> * new message * changed hhtp to https * added iteration num + 1 * change string name and add color to artifacts * change the message * build errors * added null checks * added exception messsages to log file * added exception messsages to log file * CLI ML.NET version upgrade (#345) * Sample revs; ColumnInformation property name revs; pre-featurizer fixes (#346) * CLI -- consume logs from AutoML SDK (#349) * Rename RunDetails --> RunDetail (#350) * command line api upgrade and progress bar rendering bug (#366) * added fix for all platforms progress bar * upgrade nuget * removed args from writeline * change in the version (#368) * fix 
few bugs in progressbar and verbosity (#374) * fix few bugs in progressbar and verbosity * removed unused name space * Fix for folders with space in it while generating project (#376) * support for folders with spaces * added support for paths with space * revert file * change name of var * remove spaces * SMAC fix for minimizing metrics (#363) * Formatting Regression metrics and progress bar display days. (#379) * added progress bar day display and fix regression metrics * fix formatting * added total time * formatted total time * change command name and add pbar message (#380) * change command name and add pbar message * fix tests * added aliases * duplicate alias * added another alias for task * UI missing features (#382) * added formatting changes * added accuracy specifically * downgrade the codepages (#384) * Change in project structure (#385) * initial changes * Change in project structure * correcting test * change variable name * fix tests * fix tests * fix more tests * fix codegen errors * adde log file message * changed name of args * change variable names * fix test * FileSizeBuckets in correct units (#387) * Minor telemetry change to log in correct units and make our life easier in the future * Use Ceiling instead of Round * changed order (#388) * prep work to transfer to ml.net (#389) * move test projects to top level test subdir * rename some projects to make naming consistent and make it build again * fix test project refs * Add AutoML components to build, fix issues related to that so it builds
…age sources. (dotnet#38) * Added sequential grouping of columns * removed nuget.config and have only props mentions the nuget sources * reverted the file
* Add code generation support for IDataLoader.
* Code-generate the TextLoader API so that it is at functional parity with the text loader in the ML.NET infrastructure.
* Move the TextLoader API under the Microsoft.ML.Data namespace.
* Make the TextLoader API backward compatible.
* Add error checking for invalid loader arguments such as ordinals and column names.
* Update baselines.
* Update samples to use the new loader API and to show backward compatibility with the old loader API.