Include all categorical split points in feature gain map and clean up regression tree predictor for categorical splits. #320
Conversation
… regression tree predictor for categorical splits.
double gainResidual = (gainShift * trust + usePenalty) / splitInfo.CategoricalFeatureIndices.Length;
for (int i = 0; i < splitInfo.CategoricalFeatureIndices.Length; i++)
{
    //REVIEW mzs: For categorical split points some gain values can be negative because when gain
> mzs

remove #Resolved
    : FeatureReusePenalty * Math.Log(FeatureUseCount[firstFlockFeature] + 1);

//During feature split evaluation step the accumalated gain is subtracted from gainShift and multiplied with trust and
//penalty. The below equation seperates out this quantity from each feature gain so that the cummalative
> cummalative

cumulative #Resolved
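For context on the hunk above: `gainShift * trust + usePenalty` is subtracted once per split during evaluation, so dividing it by the number of categorical features gives each feature an equal share to remove from its own gain. A hedged standalone sketch of that arithmetic (the per-feature shifted gains and the class/method names are hypothetical, not ML.NET APIs):

```csharp
// Minimal sketch (not the ML.NET code): spread the shared per-split adjustment
// evenly over the categorical features of one split so the per-feature gains
// still sum to the overall split gain.
internal static class CategoricalGainSplitter
{
    public static double[] Distribute(
        double[] perFeatureShiftedGains, double gainShift, double trust, double usePenalty)
    {
        int n = perFeatureShiftedGains.Length;

        // The quantity subtracted once for the whole split, divided evenly per feature.
        double gainResidual = (gainShift * trust + usePenalty) / n;

        var gains = new double[n];
        for (int i = 0; i < n; i++)
            gains[i] = perFeatureShiftedGains[i] * trust - gainResidual;
        return gains;
    }
}
```

With this division, the per-feature gains sum to `(totalShiftedGain - gainShift) * trust - usePenalty`, which is the overall split gain that the assertion in the next hunk compares against, up to floating point error.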
double min = Math.Min(sumGain, splitInfo.Gain);
double max = Math.Max(sumGain, splitInfo.Gain);

Contracts.Assert(min / max >= 0.99999);
> Contracts.Assert(min / max >= 0.99999);

Due to imprecision in floating point calculations, I am not certain this will always be true. This might be something you'd put in a test of one particular problem, but the idea of putting it into the actual code that runs for all cases makes me nervous. I also don't like having #if DEBUG unless there's a really great reason for it, since every block like that is a potential source of disparity between the debug and release builds.
You probably found this useful during development, but I might prefer if we got rid of this. #Resolved
You are right; as the nodes increase, the precision will decrease.

In reply to: 194039690
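One way to keep the check without asserting in production code is a relative-tolerance comparison inside a unit test. A minimal sketch, assuming an xunit-style `Assert` and a hypothetical helper name:

```csharp
using System;

// Hypothetical test-side helper (not part of the PR): compare the summed
// per-feature gains against the split gain with a relative tolerance, rather
// than asserting min / max >= 0.99999 in code that runs for every split,
// where accumulated rounding error grows with tree size.
internal static class GainChecks
{
    public static bool ApproximatelyEqual(double a, double b, double relativeTolerance = 1e-5)
    {
        if (a == b)
            return true;
        double scale = Math.Max(Math.Abs(a), Math.Abs(b));
        return Math.Abs(a - b) <= relativeTolerance * scale;
    }
}

// Usage in a test (hypothetical): Assert.True(GainChecks.ApproximatelyEqual(sumGain, splitInfo.Gain));
```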
gtCount * Math.Log(gtCount));

currentShiftedGain += EntropyCoefficient * entropyGain;
}
Hmmm. :( Are there alternatives to copy-pasta code? #Resolved
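For illustration, one alternative to duplicating the entropy-adjustment block is a shared helper called from each split-evaluation path. A hedged sketch, assuming the entropy term has the usual n*log(n) - nL*log(nL) - nR*log(nR) form suggested by the `gtCount * Math.Log(gtCount)` fragment above (class and method names are hypothetical, not the PR's code):

```csharp
using System;

// Sketch of a shared helper so the entropy regularization is applied in one
// place instead of being copy-pasted into each evaluation path.
internal static class GainRegularization
{
    public static double AddEntropyTerm(
        double shiftedGain, double entropyCoefficient, int ltCount, int gtCount)
    {
        // No regularization requested, or a degenerate split: leave the gain unchanged.
        if (entropyCoefficient <= 0 || ltCount <= 0 || gtCount <= 0)
            return shiftedGain;

        int totalCount = ltCount + gtCount;
        double entropyGain = totalCount * Math.Log(totalCount)
            - ltCount * Math.Log(ltCount)
            - gtCount * Math.Log(gtCount);

        return shiftedGain + entropyCoefficient * entropyGain;
    }
}
```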
CategoricalSplitFeatureRanges = new int[NumNodes][];
foreach (var index in categoricalNodeIndices)
{
    Contracts.Assert(CategoricalSplit[index]);
    Contracts.Assert(index >= 0 && index < NumNodes);

    CategoricalSplitFeatures[index] = reader.ReadIntArray();
    CategoricalFeatureGain[index] = reader.ReadDoubleArray();
You've changed the format but didn't change the version number of the FastTree models, or account for the fact that existing models obviously won't have this information. Correspondingly, you've broken backwards compatibility. #Resolved
So, note this code here. Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
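The usual way to handle the compatibility issue raised above is to bump the model version and gate the new fields on it. A hedged sketch of that pattern (the version constant, class, and field layout are hypothetical, not the actual FastTree format):

```csharp
using System.IO;

// Sketch of version-gated deserialization: newly added per-node gain arrays
// are only read when the serialized model version says they exist, so models
// written by older code still load with the new fields left null.
internal sealed class CategoricalGainSection
{
    private const int VersionWithCategoricalFeatureGain = 2; // hypothetical version constant

    public double[][] CategoricalFeatureGain;

    public void Load(BinaryReader reader, int modelVersion, int numNodes, int[] categoricalNodeIndices)
    {
        CategoricalFeatureGain = new double[numNodes][];

        // Older models never wrote this block; skip it instead of failing to parse.
        if (modelVersion < VersionWithCategoricalFeatureGain)
            return;

        foreach (var index in categoricalNodeIndices)
        {
            int length = reader.ReadInt32();
            var gains = new double[length];
            for (int i = 0; i < length; i++)
                gains[i] = reader.ReadDouble();
            CategoricalFeatureGain[index] = gains;
        }
    }
}
```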
//gain of split at that node. The reason we chose to do this way was to speed things up and
//reduce memory footprint during training time. A better solution would be to keep track
//of each split point's contribution during training but that is an expensive approach.
double value = pair.Value < 0 ? -1 * Math.Sqrt(Math.Abs(pair.Value)) : Math.Sqrt(pair.Value);
> pair.Value < 0

I'm not sure I get this, so maybe you can help me out. I think you're encoding some information in the sign. Could we explain this? I see the comment above, but as an explanation it's just not doing it for me. #Resolved
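For readers of the thread: the quoted line applies a sign-preserving square root, compressing the magnitude of the accumulated gain while keeping its sign. A minimal standalone illustration (class and method names are hypothetical):

```csharp
using System;

// Sign-preserving square root: the magnitude is compressed with Math.Sqrt while
// the sign (which can be negative for categorical split points, per the code
// comment above) is kept.
internal static class GainMapping
{
    public static double SignedSqrt(double value)
        => value < 0 ? -Math.Sqrt(Math.Abs(value)) : Math.Sqrt(value);
}

// e.g. GainMapping.SignedSqrt(-9.0) == -3.0 and GainMapping.SignedSqrt(16.0) == 4.0
```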
That would be a much bigger change and may not be in the scope of this PR. Currently at train time we always produce a boolean array for categorical splits that indicates if a node has categorical splits. See line 449. Allowing the categorical splits boolean array to be null would also require a model format change.

In reply to: 395749419

Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
Hi @codemzs, thank you for your reply.

In reply to: 397770145

Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
Hi @tfinley@gmail.com, I'm sorry, I thought you were talking about CategoricalSplitFeatures. Yes, we can make CategoricalFeatureGain optional.

In reply to: 398154411

Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
I'm going to get to this PR after the 0.3 release because @TomFinley has asked me to mathematically verify some of the stuff here. #Resolved
…to fasttree

# Conflicts:
# src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs
Hi Tom, we are only checking for CategoricalFeatureGain's presence if CategoricalSplit is true for any of the nodes. I believe in that case CategoricalFeatureGain should not be null.

In reply to: 398528564

Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
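As a sketch of what "making CategoricalFeatureGain optional" could look like on the wire, a presence flag ahead of the array lets a reader distinguish an absent array from an empty one. This is a hypothetical pattern, not the ML.NET serialization code:

```csharp
using System.IO;

// One way to make a per-node gain array optional in a binary format: write a
// presence flag, then the payload, so "absent" and "present but empty" differ.
internal static class OptionalArrayFormat
{
    public static void Write(BinaryWriter writer, double[] values)
    {
        writer.Write(values != null);
        if (values == null)
            return;

        writer.Write(values.Length);
        foreach (var v in values)
            writer.Write(v);
    }

    public static double[] Read(BinaryReader reader)
    {
        if (!reader.ReadBoolean())
            return null;

        int length = reader.ReadInt32();
        var values = new double[length];
        for (int i = 0; i < length; i++)
            values[i] = reader.ReadDouble();
        return values;
    }
}
```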
@codemzs, what is the plan for this PR? It seems to have a lot of conflicts with

I suggest we close the PR. Feel free to reopen when you get back to this work.
fixes #319