
Include all categorical split points in feature gain map and clean up regression tree predictor for categorical splits. #320

Closed
wants to merge 6 commits

Conversation

Member

@codemzs codemzs commented Jun 6, 2018

fixes #319

… regression tree predictor for categorical splits.
@codemzs codemzs requested a review from TomFinley June 6, 2018 20:09
double gainResidual = (gainShift * trust + usePenalty) / splitInfo.CategoricalFeatureIndices.Length;
for (int i = 0; i < splitInfo.CategoricalFeatureIndices.Length; i++)
{
//REVIEW mzs: For categorical split points some gain values can be negative because when gain
Member Author

@codemzs codemzs Jun 6, 2018


mzs

remove #Resolved

@codemzs codemzs changed the title Include all categorical split points in feature gain map and clean up… Include all categorical split points in feature gain map and clean up regression tree categorical predictor Jun 6, 2018
@codemzs codemzs changed the title Include all categorical split points in feature gain map and clean up regression tree categorical predictor Include all categorical split points in feature gain map and clean up regression tree predictor for categorical splits. Jun 6, 2018
: FeatureReusePenalty * Math.Log(FeatureUseCount[firstFlockFeature] + 1);

//During feature split evaluation step the accumalated gain is subtracted from gainShift and multiplied with trust and
//penalty. The below equation seperates out this quantity from each feature gain so that the cummalative
Contributor

@TomFinley TomFinley Jun 8, 2018


cummalative

cumulative #Resolved
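
The comment above describes the bookkeeping: the quantity (gainShift * trust + usePenalty) is subtracted once per split during evaluation, so when the split's gain is spread over its categorical features, an equal share of that quantity is removed from each feature's entry. Below is a minimal standalone sketch of that idea, with illustrative numbers and hypothetical names such as shiftedGainPerFeature and totalSplitGain; it is not the PR's actual code.

using System;
using System.Linq;

static class GainDistributionSketch
{
    static void Main()
    {
        // Illustrative values only.
        double gainShift = 0.7, trust = 1.0, usePenalty = 0.1;
        double totalSplitGain = 2.5;                     // plays the role of splitInfo.Gain
        int[] categoricalFeatureIndices = { 3, 5, 9 };   // features participating in the categorical split
        // Per-feature share of the once-per-split quantity.
        double gainResidual = (gainShift * trust + usePenalty) / categoricalFeatureIndices.Length;
        // Hypothetical per-feature shifted gains accumulated during split evaluation.
        double[] shiftedGainPerFeature = { 1.2, 1.0, 1.1 };
        var featureGains = new double[16];
        for (int i = 0; i < categoricalFeatureIndices.Length; i++)
            featureGains[categoricalFeatureIndices[i]] = shiftedGainPerFeature[i] - gainResidual;
        // Summing the per-feature entries recovers the split's total gain (up to
        // floating point error), which is what the assert below checks.
        double sumGain = categoricalFeatureIndices.Sum(f => featureGains[f]);
        Console.WriteLine($"sumGain = {sumGain}, totalSplitGain = {totalSplitGain}");
    }
}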

double min = Math.Min(sumGain, splitInfo.Gain);
double max = Math.Max(sumGain, splitInfo.Gain);

Contracts.Assert(min / max >= 0.99999);
Contributor

@TomFinley TomFinley Jun 8, 2018


Contracts.Assert(min / max >= 0.99999);

Due to imprecisions in floating point calculations, I am not certain this will always be true. This might be something you'd put in a test of one particular problem, but the idea of putting it into the actual code that runs for all cases makes me nervous. I also don't like having #if DEBUG unless there's a really great reason for it, since every block like that is a potential source of disparity between the two.

You probably found this useful during development, but I might prefer if we got rid of this. #Resolved

Member Author

@codemzs codemzs Jun 16, 2018


You are right; as the number of nodes increases, the precision will decrease.


In reply to: 194039690
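
As a possible alternative to the hard assert in shipped code, a unit test could compare the two gains with a relative tolerance, along the lines suggested above. The following is only a sketch under the assumption that a tolerance around 1e-5 (the counterpart of the 0.99999 ratio) is acceptable; the helper name is hypothetical.

using System;

static class GainToleranceSketch
{
    // Relative comparison: true when a and b differ by at most relativeTolerance * max(|a|, |b|).
    static bool ApproximatelyEqual(double a, double b, double relativeTolerance = 1e-5)
    {
        double max = Math.Max(Math.Abs(a), Math.Abs(b));
        return max == 0 || Math.Abs(a - b) / max <= relativeTolerance;
    }

    static void Main()
    {
        double sumGain = 2.4999991, splitGain = 2.5;
        Console.WriteLine(ApproximatelyEqual(sumGain, splitGain)); // True
    }
}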

gtCount * Math.Log(gtCount));

currentShiftedGain += EntropyCoefficient * entropyGain;
}
Contributor

@TomFinley TomFinley Jun 8, 2018


Hmmm. :( Are there alternatives to copy-pasta code? #Resolved

CategoricalSplitFeatureRanges = new int[NumNodes][];
foreach (var index in categoricalNodeIndices)
{
Contracts.Assert(CategoricalSplit[index]);
Contracts.Assert(index >= 0 && index < NumNodes);

CategoricalSplitFeatures[index] = reader.ReadIntArray();
CategoricalFeatureGain[index] = reader.ReadDoubleArray();
Contributor

@TomFinley TomFinley Jun 8, 2018


You've changed the format but didn't change the version number of the FastTree models, or account for the fact that existing models obviously won't have this information. Correspondingly, you've broken backwards compatibility. #Resolved
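
One common way to handle such a format change is to bump the model version and gate the new reads on it, so that models written before the change simply leave the new arrays null. The following is a rough standalone sketch of that pattern, not ML.NET's actual serialization API; VerCategoricalGainAdded and ReadCategoricalGain are hypothetical names.

using System;
using System.IO;

static class VersionedReadSketch
{
    const int VerCategoricalGainAdded = 2; // hypothetical version that introduced the gain array

    static double[] ReadCategoricalGain(BinaryReader reader, int modelVersion)
    {
        if (modelVersion < VerCategoricalGainAdded)
            return null; // backwards compatibility: older models carry no gain information
        int length = reader.ReadInt32();
        var gains = new double[length];
        for (int i = 0; i < length; i++)
            gains[i] = reader.ReadDouble();
        return gains;
    }

    static void Main()
    {
        var ms = new MemoryStream();
        var writer = new BinaryWriter(ms);
        writer.Write(3); writer.Write(0.5); writer.Write(0.25); writer.Write(0.25);
        ms.Position = 0;
        double[] fromNewModel = ReadCategoricalGain(new BinaryReader(ms), modelVersion: 2);
        double[] fromOldModel = ReadCategoricalGain(new BinaryReader(Stream.Null), modelVersion: 1);
        Console.WriteLine($"new model: {fromNewModel.Length} gains; old model gain array is null: {fromOldModel == null}");
    }
}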

@TomFinley
Contributor

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

So, note this code here, on _splitGain, _gainPValue, or _previousLeafValue. Note that there is no error if this is null. We do not insist that _splitGain or any other purely informational structures be specified. You must do the same for these categorical splits, allow them to be left null and unspecified.


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
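
Following the _splitGain pattern quoted above, the analogous rule for the new array would be: null or empty is allowed, but if the array is present its length must match the node count. Here is a standalone illustration of that rule; the names and the exception type are illustrative, and the real code would go through checker and Utils.Size.

using System;

static class OptionalArrayCheckSketch
{
    static int Size<T>(T[] a) => a == null ? 0 : a.Length; // stand-in for Utils.Size

    static void Check(double[][] categoricalFeatureGain, int numMaxNodes)
    {
        // Purely informational structure: absence is fine, a mismatched length is not.
        if (!(Size(categoricalFeatureGain) == 0 || categoricalFeatureGain.Length == numMaxNodes))
            throw new FormatException("bad categorical feature gain length");
    }

    static void Main()
    {
        Check(null, numMaxNodes: 7);            // absent: allowed
        Check(new double[7][], numMaxNodes: 7); // present with matching length: allowed
        Console.WriteLine("checks passed");
    }
}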

//gain of split at that node. The reason we chose to do this way was to speed things up and
//reduce memory footprint during training time. A better solution would be to keep track
//of each split point's contribution during training but that is an expensive approach.
double value = pair.Value < 0 ? -1 * Math.Sqrt(Math.Abs(pair.Value)) : Math.Sqrt(pair.Value);
Contributor

@TomFinley TomFinley Jun 8, 2018


pair.Value < 0

I'm not sure I get this, so maybe you can help me out. I think you're encoding some information in the sign. Could we explain this? I see the comment above, but as an explanation it's just not doing it for me. #Resolved
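
For what it's worth, the questioned line appears to compute a signed square root, sign(x) * sqrt(|x|): the magnitude is compressed while the sign survives, which is presumably how the "this gain came out negative" information is carried through the transform. A tiny sketch of that reading (not the PR's code):

using System;

static class SignedSqrtSketch
{
    // Sign-preserving square root: -sqrt(|x|) for negative x, sqrt(x) otherwise.
    static double SignedSqrt(double value) =>
        value < 0 ? -Math.Sqrt(Math.Abs(value)) : Math.Sqrt(value);

    static void Main()
    {
        Console.WriteLine(SignedSqrt(4.0));  //  2
        Console.WriteLine(SignedSqrt(-4.0)); // -2: the sign information is preserved
    }
}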

Contributor

@TomFinley TomFinley left a comment


🕐

@codemzs
Member Author

codemzs commented Jun 16, 2018

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

That would be a much bigger change and may not be in the scope of this PR. Currently, at train time we always produce a boolean array for categorical splits that indicates whether a node has categorical splits. See line 449. Allowing the categorical splits boolean array to be null would also require a model format change.


In reply to: 395749419


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@TomFinley
Contributor

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

Hi @codemzs, thank you for your reply. CategoricalFeatureGain must be optional, i.e., allowed to be either present or absent. I'm not sure why you think it's beyond the scope of this change, since you are introducing this structure in this PR, so I have no idea why we wouldn't want to make sure we get that structure right here.


In reply to: 397770145


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@codemzs
Member Author

codemzs commented Jun 19, 2018

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

Hi @tfinley@gmail.com, I'm sorry, I thought you were talking about CategoricalSplitFeatures. Yes, we can make CategoricalFeatureGain optional.


In reply to: 398154411


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@codemzs
Member Author

codemzs commented Jun 19, 2018

I'm going to get to this PR after the 0.3 release because @TomFinley has asked me to mathematically verify some of the work here. #Resolved

@codemzs codemzs closed this Jun 26, 2018
@codemzs codemzs deleted the fasttree branch June 26, 2018 22:14
…to fasttree

# Conflicts:
#	src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs
@codemzs codemzs restored the fasttree branch July 6, 2018 21:00
@codemzs codemzs reopened this Jul 6, 2018
@codemzs
Member Author

codemzs commented Jul 9, 2018

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

Hi Tom, we only check for CategoricalFeatureGain's presence if CategoricalSplit is true for any of the nodes. I believe in that case CategoricalFeatureGain should not be null.


In reply to: 398528564


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@markusweimer
Member

@codemzs, what is the plan for this PR? It seems to have a lot of conflicts with master now. Do you want to update or abandon?

@Zruty0
Contributor

Zruty0 commented Oct 5, 2018

I suggest we close the PR. Feel free to reopen when you get back to this work.

@Zruty0 Zruty0 closed this Oct 5, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022