-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify API for advanced settings (RandomizedPcaTrainer) #2390
Modify API for advanced settings (RandomizedPcaTrainer) #2390
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2390 +/- ##
==========================================
+ Coverage 71.26% 71.27% +0.01%
==========================================
Files 797 799 +2
Lines 141292 141377 +85
Branches 16118 16118
==========================================
+ Hits 100692 100768 +76
- Misses 36138 36147 +9
Partials 4462 4462
|
@@ -103,23 +103,23 @@ public class Arguments : UnsupervisedLearnerInputBaseWithWeight | |||
|
|||
} | |||
|
|||
internal RandomizedPcaTrainer(IHostEnvironment env, Arguments args) | |||
:this(env, args, args.FeatureColumn, args.WeightColumn) | |||
internal RandomizedPcaTrainer(IHostEnvironment env, Options options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RandomizedPcaTrainer [](start = 17, length = 20)
It's strange... I noticed that renaming Arguments to Options did not modify anything in the mlContext catalog. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked it up, and I don't think there is an entry for this trainer in mlContext. Can you add it?
In reply to: 253319239 [](ancestors = 253319239)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. i noticed couple more components which do not have mlcontext extension.
will add
In reply to: 253319255 [](ancestors = 253319255,253319239)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i added mlcontext extension for this. Also added a test for it that exercises the Fit() and Transform() APIs.
Evaluate() API currently missing from Anomaly Detection. i will create a separate issue for that.
In reply to: 253584603 [](ancestors = 253584603,253319255,253319239)
@@ -49,7 +49,7 @@ public sealed class RandomizedPcaTrainer : TrainerEstimatorBase<AnomalyPredictio | |||
internal const string Summary = "This algorithm trains an approximate PCA using Randomized SVD algorithm. " | |||
+ "This PCA can be made into Kernel PCA by using Random Fourier Features transform."; | |||
|
|||
public class Arguments : UnsupervisedLearnerInputBaseWithWeight | |||
public class Options : UnsupervisedLearnerInputBaseWithWeight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options [](start = 21, length = 7)
xml docs are coming later? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -91,7 +91,7 @@ public class Arguments : UnsupervisedLearnerInputBaseWithWeight | |||
/// <param name="oversampling">Oversampling parameter for randomized PCA training.</param> | |||
/// <param name="center">If enabled, data is centered to be zero mean.</param> | |||
/// <param name="seed">The seed for random number generation.</param> | |||
public RandomizedPcaTrainer(IHostEnvironment env, | |||
internal RandomizedPcaTrainer(IHostEnvironment env, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just make the class Internal
/BestFriend
and keep this public
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really.
we would want to expose these through mlcontext. not via constructors
In reply to: 254391018 [](ancestors = 254391018)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved. I hadn't seen the pattern for trainable transforms where the class is public
and methods are internal
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe in ML.NET terms, this is "trainer estimator" (for anomaly detection tasks)
most other "trainer estimator"s follow the same pattern e.g. KMeansPlusPlusTrainer
In reply to: 255587702 [](ancestors = 255587702)
@@ -347,14 +347,14 @@ protected override AnomalyPredictionTransformer<PcaModelParameters> MakeTransfor | |||
Desc = "Train an PCA Anomaly model.", | |||
UserName = UserNameValue, | |||
ShortName = ShortName)] | |||
internal static CommonOutputs.AnomalyDetectionOutput TrainPcaAnomaly(IHostEnvironment env, Arguments input) | |||
internal static CommonOutputs.AnomalyDetectionOutput TrainPcaAnomaly(IHostEnvironment env, Options input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above; these can be kept public
and the whole class can be made internal
/BestFriend
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Options
class above should be public. Hence we cannot make entire class internal.
In reply to: 254391534 [](ancestors = 254391534)
var trainData = reader.Read(GetDataPath(TestDatasets.mnistOneClass.trainFilename)); | ||
var testData = reader.Read(GetDataPath(TestDatasets.mnistOneClass.testFilename)); | ||
|
||
var pipeline = mlContext.AnomalyDetection.Trainers.RandomizedPca(featureColumn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AnomalyDetection [](start = 37, length = 16)
Anomaly detection currently does not support evaluation metrics
Issue #2471 #Resolved
/// <param name="score">The name of the score column in <paramref name="data"/>.</param> | ||
/// <param name="predictedLabel">The name of the predicted label column in <paramref name="data"/>.</param> | ||
/// <returns>The evaluation results for these outputs.</returns> | ||
public AnomalyDetectionMetrics Evaluate(IDataView data, string label, string score, string predictedLabel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaults for label
, score
, and predictedLabel
? #Resolved
{ | ||
public sealed class AnomalyDetectionMetrics | ||
{ | ||
public double Auc { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summaries, Remarks, and links to relevant documentation. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added basic summaries for now.
wanted to also add the remarks from TLC website., but the explanations there were not clear esp. for the detection rate metrics.
In reply to: 255583277 [](ancestors = 255583277)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these summaries, check in with @shmoradims ; he's building a set of generic docs for things like AUC, F1, RMSE, etc.
In reply to: 255703503 [](ancestors = 255703503,255583277)
@@ -35,5 +36,25 @@ public static class PcaCatalog | |||
/// <param name="columns">Input columns to apply PrincipalComponentAnalysis on.</param> | |||
public static PrincipalComponentAnalysisEstimator ProjectToPrincipalComponents(this TransformsCatalog.ProjectionTransforms catalog, params PrincipalComponentAnalysisEstimator.ColumnInfo[] columns) | |||
=> new PrincipalComponentAnalysisEstimator(CatalogUtils.GetEnvironment(catalog), columns); | |||
|
|||
public static RandomizedPcaTrainer RandomizedPca(this AnomalyDetectionCatalog.AnomalyDetectionTrainers catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs xml docs with remarks and links to a sample. Here or add to #1209 . #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -91,7 +91,7 @@ public class Arguments : UnsupervisedLearnerInputBaseWithWeight | |||
/// <param name="oversampling">Oversampling parameter for randomized PCA training.</param> | |||
/// <param name="center">If enabled, data is centered to be zero mean.</param> | |||
/// <param name="seed">The seed for random number generation.</param> | |||
public RandomizedPcaTrainer(IHostEnvironment env, | |||
internal RandomizedPcaTrainer(IHostEnvironment env, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved. I hadn't seen the pattern for trainable transforms where the class is public
and methods are internal
. #Resolved
// Evaluate | ||
var metrics = ML.AnomalyDetection.Evaluate(transformedData); | ||
|
||
Assert.Equal(0.99, metrics.Auc, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know that these numbers are correct? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried out the same dataset in TLC with the same trainer, the numbers are close. Not exact hough
in general, in this PR i am only exposing the trainer / evaluators as they exist currently in the codebase. the PR does not have any algorithmic changes or changes in evaluation metrics themselves.
In reply to: 255589211 [](ancestors = 255589211)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the big question is, what do we want to test here?
- Do we want a baseline test to make sure future versions don't change the numerics? (e.g. AUC is always 0.99 with 2 decimals of precision?
- Do we want a functionality test, like we have in Functional.Tests? (e.g. check that 0 <= AUC <= 1)
- Do we want correctness tests? (e.g. We know what the answer should be and we want to make sure this matches it.)
If we do want a baseline test, can we mark it as such, and check to further decimal places?
As an aside, are there correctness tests on these metrics that we can migrate from the internal repo? If so, can you file it as an issue to be done later?)
In reply to: 255680814 [](ancestors = 255680814,255589211)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
as per the classification above, these seem like "baseline" tests, I have increased the precision to 5 places of decimal.
as for test migration from internal repo, it seems we ported over the PcaAnomalyTest
already . that should suffice for correctness.
In reply to: 255796608 [](ancestors = 255796608,255680814,255589211)
/// <param name="label">The name of the label column in <paramref name="data"/>.</param> | ||
/// <param name="score">The name of the score column in <paramref name="data"/>.</param> | ||
/// <param name="predictedLabel">The name of the predicted label column in <paramref name="data"/>.</param> | ||
/// <returns>The evaluation results for these calibrated outputs.</returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calibrated [](start = 54, length = 10)
What is calibrated here? Could you explain in a few more words? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// </summary> | ||
public double DrAtK { get; } | ||
/// <summary> | ||
/// Detection rate at fraction p false positives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p [](start = 39, length = 1)
What is p
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me the Detection rate metrics look suspicious. perhaps we should not expose these for V1.0 . only expose AUC
In reply to: 255719011 [](ancestors = 255719011)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one last question about tests! |
var metrics = ML.AnomalyDetection.Evaluate(transformedData, k: 10); | ||
|
||
Assert.Equal(0.98558, metrics.Auc, 5); | ||
Assert.Equal(0.90, metrics.DrAtK, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert.Equal(0.90, metrics.DrAtK, 2); [](start = 11, length = 38)
This one @ 5 places too, please :) #Resolved
/// RandomizedPcaTrainer test | ||
/// </summary> | ||
[Fact] | ||
public void RandomizedPcaTrainer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RandomizedPcaTrainer [](start = 20, length = 20)
RandomizedPcaTrainerBaselineTest
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small comments.
Thanks for the reviews! |
Towards #1798 .
This PR addresses the following algos
The following changes have been made: