
Include all categorical split points in feature gain map and clean up regression tree predictor for categorical splits. #320

Closed
wants to merge 6 commits

Conversation

Member

@codemzs codemzs commented Jun 6, 2018

fixes #319

… regression tree predictor for categorical splits.
@codemzs codemzs requested a review from TomFinley June 6, 2018 20:09
double gainResidual = (gainShift * trust + usePenalty) / splitInfo.CategoricalFeatureIndices.Length;
for (int i = 0; i < splitInfo.CategoricalFeatureIndices.Length; i++)
{
//REVIEW mzs: For categorical split points some gain values can be negative because when gain
Member Author

@codemzs codemzs Jun 6, 2018


mzs

remove #Resolved

@codemzs codemzs changed the title Include all categorical split points in feature gain map and clean up… Include all categorical split points in feature gain map and clean up regression tree categorical predictor Jun 6, 2018
@codemzs codemzs changed the title Include all categorical split points in feature gain map and clean up regression tree categorical predictor Include all categorical split points in feature gain map and clean up regression tree predictor for categorical splits. Jun 6, 2018
: FeatureReusePenalty * Math.Log(FeatureUseCount[firstFlockFeature] + 1);

//During feature split evaluation step the accumalated gain is subtracted from gainShift and multiplied with trust and
//penalty. The below equation seperates out this quantity from each feature gain so that the cummalative
Contributor

@TomFinley TomFinley Jun 8, 2018


cummalative

cumulative #Resolved
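
The comment above describes the bookkeeping: the quantity (gainShift * trust + usePenalty) is subtracted once per split during evaluation, so when the split's gain is spread over its categorical features, an equal share of that quantity is removed from each feature's entry. Below is a minimal standalone sketch of that idea, with illustrative numbers and hypothetical names such as shiftedGainPerFeature and totalSplitGain; it is not the PR's actual code.

using System;
using System.Linq;

static class GainDistributionSketch
{
    static void Main()
    {
        // Illustrative values only.
        double gainShift = 0.7, trust = 1.0, usePenalty = 0.1;
        double totalSplitGain = 2.5;                     // plays the role of splitInfo.Gain
        int[] categoricalFeatureIndices = { 3, 5, 9 };   // features participating in the categorical split
        // Per-feature share of the once-per-split quantity.
        double gainResidual = (gainShift * trust + usePenalty) / categoricalFeatureIndices.Length;
        // Hypothetical per-feature shifted gains accumulated during split evaluation.
        double[] shiftedGainPerFeature = { 1.2, 1.0, 1.1 };
        var featureGains = new double[16];
        for (int i = 0; i < categoricalFeatureIndices.Length; i++)
            featureGains[categoricalFeatureIndices[i]] = shiftedGainPerFeature[i] - gainResidual;
        // Summing the per-feature entries recovers the split's total gain (up to
        // floating point error), which is what the assert below checks.
        double sumGain = categoricalFeatureIndices.Sum(f => featureGains[f]);
        Console.WriteLine($"sumGain = {sumGain}, totalSplitGain = {totalSplitGain}");
    }
}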

double min = Math.Min(sumGain, splitInfo.Gain);
double max = Math.Max(sumGain, splitInfo.Gain);

Contracts.Assert(min / max >= 0.99999);
Contributor

@TomFinley TomFinley Jun 8, 2018


Contracts.Assert(min / max >= 0.99999);

Due to imprecisions in floating point calculations, I am not certain this will always be true. This might be something you'd put in a test of one particular problem, but the idea of putting it into the actual code that runs for all cases makes me nervous. I also don't like having #if DEBUG unless there's a really great reason for it, since every block like that is a potential source of disparity between the two.

You probably found this useful during development, but I might prefer if we got rid of this. #Resolved

Member Author

@codemzs codemzs Jun 16, 2018


You are right; as the number of nodes increases, the precision will decrease.


In reply to: 194039690
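
As a possible alternative to the hard assert in shipped code, a unit test could compare the two gains with a relative tolerance, along the lines suggested above. The following is only a sketch under the assumption that a tolerance around 1e-5 (the counterpart of the 0.99999 ratio) is acceptable; the helper name is hypothetical.

using System;

static class GainToleranceSketch
{
    // Relative comparison: true when a and b differ by at most relativeTolerance * max(|a|, |b|).
    static bool ApproximatelyEqual(double a, double b, double relativeTolerance = 1e-5)
    {
        double max = Math.Max(Math.Abs(a), Math.Abs(b));
        return max == 0 || Math.Abs(a - b) / max <= relativeTolerance;
    }

    static void Main()
    {
        double sumGain = 2.4999991, splitGain = 2.5;
        Console.WriteLine(ApproximatelyEqual(sumGain, splitGain)); // True
    }
}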

gtCount * Math.Log(gtCount));

currentShiftedGain += EntropyCoefficient * entropyGain;
}
Contributor

@TomFinley TomFinley Jun 8, 2018


Hmmm. :( Are there alternatives to copy-pasta code? #Resolved

CategoricalSplitFeatureRanges = new int[NumNodes][];
foreach (var index in categoricalNodeIndices)
{
Contracts.Assert(CategoricalSplit[index]);
Contracts.Assert(index >= 0 && index < NumNodes);

CategoricalSplitFeatures[index] = reader.ReadIntArray();
CategoricalFeatureGain[index] = reader.ReadDoubleArray();
Contributor

@TomFinley TomFinley Jun 8, 2018


You've changed the format but didn't change the version number of the FastTree models, or account for the fact that existing models obviously won't have this information. Correspondingly, you've broken backwards compatibility. #Resolved
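
One common way to handle such a format change is to bump the model version and gate the new reads on it, so that models written before the change simply leave the new arrays null. The following is a rough standalone sketch of that pattern, not ML.NET's actual serialization API; VerCategoricalGainAdded and ReadCategoricalGain are hypothetical names.

using System;
using System.IO;

static class VersionedReadSketch
{
    const int VerCategoricalGainAdded = 2; // hypothetical version that introduced the gain array

    static double[] ReadCategoricalGain(BinaryReader reader, int modelVersion)
    {
        if (modelVersion < VerCategoricalGainAdded)
            return null; // backwards compatibility: older models carry no gain information
        int length = reader.ReadInt32();
        var gains = new double[length];
        for (int i = 0; i < length; i++)
            gains[i] = reader.ReadDouble();
        return gains;
    }

    static void Main()
    {
        var ms = new MemoryStream();
        var writer = new BinaryWriter(ms);
        writer.Write(3); writer.Write(0.5); writer.Write(0.25); writer.Write(0.25);
        ms.Position = 0;
        double[] fromNewModel = ReadCategoricalGain(new BinaryReader(ms), modelVersion: 2);
        double[] fromOldModel = ReadCategoricalGain(new BinaryReader(Stream.Null), modelVersion: 1);
        Console.WriteLine($"new model: {fromNewModel.Length} gains; old model gain array is null: {fromOldModel == null}");
    }
}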

@TomFinley
Contributor

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

So, note this code here, on _splitGain, _gainPValue, or _previousLeafValue. Note that there is no error if this is null. We do not insist that _splitGain or any other purely informational structures be specified. You must do the same for these categorical splits, allow them to be left null and unspecified.


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.
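
Following the _splitGain pattern quoted above, the analogous rule for the new array would be: null or empty is allowed, but if the array is present its length must match the node count. Here is a standalone illustration of that rule; the names and the exception type are illustrative, and the real code would go through checker and Utils.Size.

using System;

static class OptionalArrayCheckSketch
{
    static int Size<T>(T[] a) => a == null ? 0 : a.Length; // stand-in for Utils.Size

    static void Check(double[][] categoricalFeatureGain, int numMaxNodes)
    {
        // Purely informational structure: absence is fine, a mismatched length is not.
        if (!(Size(categoricalFeatureGain) == 0 || categoricalFeatureGain.Length == numMaxNodes))
            throw new FormatException("bad categorical feature gain length");
    }

    static void Main()
    {
        Check(null, numMaxNodes: 7);            // absent: allowed
        Check(new double[7][], numMaxNodes: 7); // present with matching length: allowed
        Console.WriteLine("checks passed");
    }
}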

//gain of split at that node. The reason we chose to do this way was to speed things up and
//reduce memory footprint during training time. A better solution would be to keep track
//of each split point's contribution during training but that is an expensive approach.
double value = pair.Value < 0 ? -1 * Math.Sqrt(Math.Abs(pair.Value)) : Math.Sqrt(pair.Value);
Contributor

@TomFinley TomFinley Jun 8, 2018


pair.Value < 0

I'm not sure I get this, so maybe you can help me out. I think you're encoding some information in the sign. Could we explain this? I see the comment above, but as an explanation it's just not doing it for me. #Resolved
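
For what it's worth, the questioned line appears to compute a signed square root, sign(x) * sqrt(|x|): the magnitude is compressed while the sign survives, which is presumably how the "this gain came out negative" information is carried through the transform. A tiny sketch of that reading (not the PR's code):

using System;

static class SignedSqrtSketch
{
    // Sign-preserving square root: -sqrt(|x|) for negative x, sqrt(x) otherwise.
    static double SignedSqrt(double value) =>
        value < 0 ? -Math.Sqrt(Math.Abs(value)) : Math.Sqrt(value);

    static void Main()
    {
        Console.WriteLine(SignedSqrt(4.0));  //  2
        Console.WriteLine(SignedSqrt(-4.0)); // -2: the sign information is preserved
    }
}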

Contributor

@TomFinley TomFinley left a comment


🕐

@codemzs
Member Author

codemzs commented Jun 16, 2018

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

That would be a much bigger change and may not be in the scope of this PR. Currently, at train time we always produce a boolean array for categorical splits that indicates whether a node has categorical splits. See line 449. Allowing the categorical splits boolean array to be null would also require a model format change.


In reply to: 395749419


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@TomFinley
Contributor

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

Hi @codemzs, thank you for your reply. CategoricalFeatureGain must be optional, i.e., allowed to be either present or absent. I'm not sure why you think it's beyond the scope of this change, since you are introducing this structure in this PR, so I have no idea why we wouldn't want to make sure we get that structure right here.


In reply to: 397770145


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@codemzs
Member Author

codemzs commented Jun 19, 2018

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

Hi @tfinley@gmail.com, I'm sorry, I thought you were talking about CategoricalSplitFeatures. Yes, we can make CategoricalFeatureGain optional.


In reply to: 398154411


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@codemzs
Member Author

codemzs commented Jun 19, 2018

I'm going to get to this PR after the 0.3 release because @TomFinley has asked me to mathematically verify some of the work here. #Resolved

@codemzs codemzs closed this Jun 26, 2018
@codemzs codemzs deleted the fasttree branch June 26, 2018 22:14
…to fasttree

# Conflicts:
#	src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs
@codemzs codemzs restored the fasttree branch July 6, 2018 21:00
@codemzs codemzs reopened this Jul 6, 2018
@codemzs
Member Author

codemzs commented Jul 9, 2018

        checker(Utils.Size(_splitGain) == 0 || _splitGain.Length == numMaxNodes, "bad splitgain length");

Hi Tom, we only check for CategoricalFeatureGain's presence if CategoricalSplit is true for any of the nodes. I believe in that case CategoricalFeatureGain should not be null.


In reply to: 398528564


Refers to: src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs:512 in 205ef0d.

@markusweimer
Member

@codemzs, what is the plan for this PR? It seems to have a lot of conflicts with master now. Do you want to update or abandon?

@Zruty0
Contributor

Zruty0 commented Oct 5, 2018

I suggest we close the PR. Feel free to reopen when you get back to this work.

@Zruty0 Zruty0 closed this Oct 5, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022