What does min.node.size actually do? #138

Closed
tcovert opened this issue Nov 13, 2017 · 2 comments

@tcovert

tcovert commented Nov 13, 2017

Should I expect that if I set min.node.size = 10, every leaf of every tree in a causal_forest has at least 10 samples? Or am I misunderstanding what min.node.size is supposed to mean? I have found that setting a positive value for min.node.size does not result in leaves with at least that many samples.

Here is an MWE inspired by the examples in the documentation:

library(grf)
library(purrr)

# Return the sample indices stored in each leaf of a single tree
leafsamples <- function(t) {
  t %>%
    pluck("nodes") %>%
    keep("is_leaf") %>%
    map("samples")
}

n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, min.node.size = 4, seed = 1)

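leaves <- leafsamples(get_tree(tau.forest, 1))

# Number of leaves with at least 4 samples, and total number of leaves
sum(lengths(leaves) >= 4)
length(leaves)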

On my machine, only 53 of the 159 leaves in the first tree of this forest have 4 or more samples.

@aliciaMAR

My understanding was that min.node.size should constrain the minimum number of treated (resp. control) units in the terminal nodes of a given tree. But indeed, as tcovert observed, causal_forest with min.node.size = 10L returns trees whose terminal nodes sometimes contain only 1 sample!

@jtibshirani (Member)

Currently, we don't prevent splits that could result in nodes smaller than min.node.size. Instead, the algorithm simply stops splitting a node once its size is less than or equal to min.node.size, so a split of a larger node can still produce children below that size. Our core splitting implementation is based on the ranger package (which is in turn based on Breiman and Cutler's randomForest package). Both packages make this approximation around min.node.size.

I agree that this behavior is quite misleading, and I've filed #143 to track the issue. For now, I'll add documentation to explain why there is a discrepancy in node sizes.
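
To make the stopping rule above concrete, here is a rough base-R sketch. It is not grf's or ranger's code: the function grow_leaves and its random split rule are made up for illustration. The only thing it is meant to mimic is that a node stops splitting once its size is at or below min.node.size, while nothing checks the size of the children a split produces.

```r
# Toy recursive partitioner (hypothetical; NOT grf's or ranger's implementation).
# It only mimics the stopping rule: stop when the node size is <= min.node.size.
grow_leaves <- function(idx, x, min.node.size) {
  # Stop splitting once the node is already at or below min.node.size.
  if (length(idx) <= min.node.size) {
    return(list(idx))
  }
  vals <- sort(unique(x[idx]))
  if (length(vals) < 2) {
    return(list(idx))  # cannot split a node with a single distinct value
  }
  # Assumed split rule: a random cut point among the observed values.
  # Nothing here checks how many samples land in each child.
  candidates <- head(vals, -1)
  cut <- candidates[sample.int(length(candidates), 1)]
  left <- idx[x[idx] <= cut]
  right <- idx[x[idx] > cut]
  c(grow_leaves(left, x, min.node.size),
    grow_leaves(right, x, min.node.size))
}

set.seed(1)
x <- rnorm(200)
leaves <- grow_leaves(seq_along(x), x, min.node.size = 10)
leaf_sizes <- lengths(leaves)

# Distribution of leaf sizes: none exceeds min.node.size, but leaves
# can be smaller than it, down to a single sample.
table(leaf_sizes)
```

Under this rule min.node.size behaves like an upper bound on how large a leaf can be rather than a lower bound, which matches the small leaves reported above.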
