-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert PcaTransform to Estimator API #1333
Conversation
separator: ';', hasHeader: true) | ||
.Read(dataSource); | ||
.Read(new MultiFileSource(_dataSource)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new MultiFileSource(_dataSource) [](start = 22, length = 32)
I would prefer to have string path rather than MultiFileSource #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All snippets I had seen were using MultiFileSource. Didn't know that string input was an option, which is much cleaner. Will change to string.
In reply to: 227046972 [](ancestors = 227046972)
public PcaTests(ITestOutputHelper helper) | ||
: base(helper) | ||
{ | ||
_env = new ConsoleEnvironment(seed: 1, conc: 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_env = new ConsoleEnvironment(seed: 1, conc: 1); [](start = 12, length = 48)
How necessary is to have this environment?
I believe you specify seed in arguments, so you don't have to specify seed here.
Does concurrency affect results of PCA? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -15,19 +13,29 @@ | |||
using Microsoft.ML.Runtime.Internal.Utilities; | |||
using Microsoft.ML.Runtime.Model; | |||
using Microsoft.ML.Runtime.Numeric; | |||
using Microsoft.ML.Core.Data; | |||
using Microsoft.ML.StaticPipe; | |||
using Microsoft.ML.StaticPipe.Runtime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ctrl+K+G #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
[assembly: LoadableClass(PcaTransform.Summary, typeof(PcaTransform), null, typeof(SignatureLoadModel), | ||
PcaTransform.UserName, PcaTransform.LoaderSignature)] | ||
|
||
[assembly: LoadableClass(typeof(IRowMapper), typeof(PcaTransform), null, typeof(SignatureLoadRowMapper), | ||
PcaTransform.UserName, PcaTransform.LoaderSignature)] | ||
|
||
[assembly: LoadableClass(typeof(void), typeof(PcaTransform), null, typeof(SignatureEntryPointModule), PcaTransform.LoaderSignature)] | ||
|
||
namespace Microsoft.ML.Runtime.Data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runtime.Data [](start = 23, length = 12)
Can you rename it to Microsoft.ML.Transforms? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
if (!inputSchema.TryFindColumn(colInfo.Input, out var col)) | ||
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input); | ||
|
||
if (!(col.Kind == SchemaShape.Column.VectorKind.Vector && col.ItemType.IsNumber)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
col.ItemType.IsNumber [](start = 74, length = 21)
Just want to double check. Do we really support all number types? Doubles, longs? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original check was with .IsNumber, but you're right, it only works with floats. I changed the check.
In reply to: 227048656 [](ancestors = 227048656)
_transformInfos[i] = new TransformInfo(args.Column[i], args, Infos[i].TypeSrc.ValueCount); | ||
_center[i] = args.Column[i].Center ?? args.Center; | ||
_oversampling[i] = args.Column[i].Oversampling ?? args.Oversampling; | ||
Host.CheckUserArg(_oversampling[i] >= 0, nameof(args.Oversampling), "Oversampling must be non-negative"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oversampling must be non-negative" [](start = 85, length = 34)
can you put this check inside ColumnInfo constructor? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
Oversampling = overSampling; | ||
Center = center; | ||
Seed = seed; | ||
Contracts.CheckUserArg(Oversampling >= 0, nameof(Oversampling), "Oversampling must be non-negative."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contracts [](start = 16, length = 9)
Can you add
Contracts.CheckParam(0 < Rank && Rank <= Dimension, nameof(Rank), "Rank must be positive, and at most the dimension of untransformed data");
as well?
Also change it to CheckParam
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That check already exists in TransformInfo constructor. The Dimension information is kept in that class and is not available in ColumnInfo class, because it depends on the input schema.
In reply to: 227573435 [](ancestors = 227573435)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's part of TransformInfo constructor, but in same time we hit that constructor only during training stage.
You can create estimator, put it as part of chain, and never call it, or call it few screens later, and only after it hit TransformInfo it would throw exception.
If you put it in ColumnInfo as well, it will throw immediately.
In reply to: 227624838 [](ancestors = 227624838,227573435)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Dimension
is equal to the vector size of the column, and it's only known at the time of Fit
, not before. So there is no way to throw early.
In reply to: 227869834 [](ancestors = 227869834,227624838,227573435)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least you can check rank negativity.
In reply to: 227869834 [](ancestors = 227869834,227624838,227573435)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added positivity check for rank
In reply to: 227879168 [](ancestors = 227879168,227879084,227877885,227869834,227624838,227573435)
/// <summary> | ||
/// Describes how the transformer handles one column pair. | ||
/// </summary> | ||
public ColumnInfo(string input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you copy comments about params from pigstension to here? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string dataSource = GetDataPath("generated_regression_dataset.csv"); | ||
var data = TextLoader.CreateReader(env, | ||
c => (label: c.LoadFloat(11), features: c.LoadFloat(0, 10)), | ||
var data = TextLoader.CreateReader(_env, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data [](start = 16, length = 4)
nit: you can create third data with same schema but loading floats from 1 to 6, and pass it to validForFitNotValidForTransformInput #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what you mean?
c.LoadFloat(11), weight: c.LoadFloat(0), features: c.LoadText(1, 6)
that sounds like validFitInput to me.
In reply to: 227574746 [](ancestors = 227574746)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you replace features:c.LoadText(1,6) to features:c.LoadFloat(1,6) then yes (I doubt you can fit on top of text vector)
And that's exactly my point, it's a valid fit but if you already train you data on bigger array, it should fail on smaller one, this is why it called validForFitNotValidForTransformInput :)
In reply to: 227626841 [](ancestors = 227626841,227574746)
return dstGetter; | ||
} | ||
|
||
private static void TransformFeatures(IExceptionContext ectx, ref VBuffer<float> src, ref VBuffer<float> dst, TransformInfo transformInfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static void TransformFeatures [](start = 20, length = 29)
nit: I don't see reason why it's static, you can just make it part of class and stop accept IExceptionContext and use Host instead #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
} | ||
|
||
protected override ColumnType GetColumnTypeCore(int iinfo) | ||
internal static void ValidatePcaInput(IHost host, string name, ColumnType type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static [](start = 17, length = 6)
any reason why it's static? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it reflects the fact that the PCA check doesn't depends any instance state. It only cares about a column type. It also gets used inside the Mapper.
In reply to: 227576143 [](ancestors = 227576143)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
} | ||
|
||
protected override ColumnType GetColumnTypeCore(int iinfo) | ||
internal static void ValidatePcaInput(IHost host, string name, ColumnType type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IHost [](start = 46, length = 5)
let's use the narrowest possible interface. In this case it should be IExceptionContext
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
Host.Check(0 <= iinfo & iinfo < Utils.Size(_types)); | ||
return _types[iinfo]; | ||
if (!type.IsVector) | ||
throw host.Except($"Pca transform can only be applied to vector columns. Column ${name} is of type ${type}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
host.Except( [](start = 22, length = 12)
I suggest to refactor this slightly to fit into ExceptSchemaMismatch
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There wasn't a good type to to put for the expected type. The best I could come up with was this, which didn't look good. Do you have a better version?
string inputSchema; // just used for the excpections
if (!type.IsVector)
throw ectx.ExceptSchemaMismatch(nameof(inputSchema), "input", name, "Vector", type.ToString());
In reply to: 227587808 [](ancestors = 227587808)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you are expecting a fixed-size vector of float. You can either spell it out 'fixed size vector of float', or create a SchemaShape.Column
and call GetTypeString
In reply to: 227630278 [](ancestors = 227630278,227587808)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
} | ||
} | ||
|
||
public ColumnType InputType => _schema[_input].Type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public ColumnType InputType => _schema[_input].Type; [](start = 16, length = 52)
instead of properties over internal state, let's adopt a pattern where all of the properties are computed once in the ctor, and there is no private fields. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
if (colSchemaInfo.InputType.VectorSize != _parent._transformInfos[i].Dimension) | ||
{ | ||
var msg = $"Dimension of column ${colPair.input} is ${colSchemaInfo.InputType.VectorSize}, which doesn't match the expected size ${_parent._transformInfos[i].Dimension}"; | ||
throw Host.Except(msg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except [](start = 35, length = 6)
this message also should be ExceptSchemaMismatch
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExceptSchemaMismatch only produces this type of message $"Schema mismatch for {columnRole} column '{columnName}': expected {expectedType}, got {actualType}"; I don't know how to turn this message to fit that.
In reply to: 227591272 [](ancestors = 227591272)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema mismatch for input column 'Foo': expected V<R4, 4>, got V<R4, 5>
Produced by ExceptSchemaMismatch("input", colPair.input, new VectorType(NumberType.R4, _parent._transformInfos[i].Dimension).ToString(), colSchemaInfo.InputType.ToString())
In reply to: 227631956 [](ancestors = 227631956,227591272)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
if (!inputSchema.TryFindColumn(colInfo.Input, out var col)) | ||
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input); | ||
|
||
if (col.Kind != SchemaShape.Column.VectorKind.Vector || col.ItemType.RawKind != DataKind.R4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ItemType [](start = 76, length = 8)
!col.ItemType.Equals(NumberType.R4)
is a more direct comparison to R4
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
IReadOnlyDictionary<PipelineColumn, string> outputNames, | ||
IReadOnlyCollection<string> usedNames) | ||
{ | ||
// Only one column is allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Only one column is allowed. [](start = 15, length = 31)
remove this comment #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.PCA/PcaTransform.cs
Outdated
} | ||
} | ||
|
||
/// <summary>Compute the principal components of the input column. Can significantly reduce size of vector.</summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compute the principal components of the input column. [](start = 21, length = 53)
describe the transformation, not the training procedure. 'Replaces vector with its projection to the principal component subspace' or something along these lines #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is proper conversion of PcaTransform to estimator API (a work item related to #754):