
One Feature per Tree #4134

Closed
ruedika opened this issue Mar 29, 2021 · 10 comments · Fixed by #4324
@ruedika

ruedika commented Mar 29, 2021

I'm trying to build a model using only 1 feature per tree. Neither colsample_bytree nor interaction_constraints works as expected: colsample_bytree does not use the last feature in the data when set to low values, and interaction_constraints appears not to be implemented for Python?

Code:

import numpy as np
import pandas as pd
import lightgbm as lgbm
from lightgbm import LGBMClassifier
from scipy.stats import multivariate_normal

np.random.seed(123)
n=100000
rho=0.8
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(-1-df.x1-2*np.sin(-3+df.x2)-0.5*df.x3))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'

vidx=df.sample(frac=0.2).index
tidx=df.drop(vidx).index

md=5

all three features used (as expected)

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=0.9,num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().split_feature.value_counts()

last feature not used (bug?)

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=0.1,num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().split_feature.value_counts()

all three features used in every tree (interaction_constraints not implemented?)

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,num_leaves=2**md)
gbm.set_params(
{'interaction_contraints':[[i] for i in range(len(num)+len(cat))]})
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().groupby('tree_index')['split_feature'].nunique().value_counts()

LGBM version 3.1.1

@btrotta
Collaborator

btrotta commented Mar 30, 2021

Regarding the interaction constraints, there's a typo in your code (missing first s in constraints). Also, you need to pass keyword arguments to set_params, not a dict. The following works for me:

gbm.set_params(interaction_constraints=[[i] for i in range(len(num)+len(cat))])

The issue with colsample_bytree sounds like it may be a bug.

@btrotta btrotta added the bug label Mar 30, 2021
@ruedika
Author

ruedika commented Mar 30, 2021

Nice! Sorry I made you check my typos, and many thanks!!

@shiyu1994
Collaborator

When colsample_bytree=0.1, the column sampler always samples features 0 and 2, and feature 1 is never sampled. I think this is due to a bug in the Sample function in include/LightGBM/utils/random.h, though it is a corner case, since there are only 3 features in the example above. I'm still investigating this to see if I can fix it.

@shiyu1994
Collaborator

This is due to LightGBM's internal random generator, which is a linear congruential generator (https://en.wikipedia.org/wiki/Linear_congruential_generator). For efficiency, we use a form where the modulus m is a power of 2. Though efficient, such generators have a limitation: the lowest bit of the state always alternates between 0 and 1, so the randomly generated integers alternate between odd and even numbers.

With that in mind, if we follow the logic in the Sample method (with parameters N=3 and K=2), we'll find that the code always samples 0 and 2.

I think this can be counted as a limitation of pseudo-random generators, and this is a very extreme case: usually, when there are only 3 features, we don't need feature sampling. Switching to a better pseudo-random generator may solve this problem, but could also make sampling slower. WDYT @btrotta
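The low-bit behavior is easy to demonstrate. The sketch below is not LightGBM's actual generator; it is a generic power-of-two-modulus LCG using the MSVC-style constants (an assumption, chosen only for illustration), whose lowest state bit strictly alternates:

```python
# Minimal sketch of a linear congruential generator with a power-of-two
# modulus. The constants are MSVC-style and purely illustrative.
def lcg(seed, a=214013, c=2531011, m=2**31):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(seed=123)
low_bits = [next(gen) & 1 for _ in range(8)]
print(low_bits)  # the lowest bit alternates 0, 1, 0, 1, ...
```

Because a and c are odd, the parity of the state flips on every step, so any sampling logic that depends on low-order bits inherits this strictly periodic pattern.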

@ruedika
Author

ruedika commented Apr 6, 2021

Thanks a lot, but I do not think that this is the issue. I expanded the example a little to include more features. Setting the number of features k to different values always results in the last feature remaining unused whenever colsample_bytree is less than or equal to 1/k:

Code Example:

np.random.seed(123)
n=100000 #number of samples
k=10 #number of features
rho=0.8 #correlation between covariates
num=['x'+str(i) for i in range(1,k+1)]
df=pd.DataFrame(multivariate_normal(mean=[0]*k,cov=rho*np.ones(shape=[k,k])+np.diag(k*[1-rho])).rvs(size=n),columns=num)
cat=[]
coef=np.random.rand(len(cat+num))
coef=1.4*coef/np.sqrt(np.sum(coef**2))
df['prob_true']=1/(1+np.exp(-4+np.matmul(df[cat+num].values,coef.reshape(-1,1))))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'

vidx=df.sample(frac=0.2).index
tidx=df.drop(vidx).index

md=5

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=1/(k+1),num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
features_used=gbm._Booster.trees_to_dataframe().split_feature.value_counts()
features_unused=[v for v in cat+num if v not in features_used.index]
print(features_unused)

@btrotta
Collaborator

btrotta commented Apr 18, 2021

@shiyu1994 Reading the Wikipedia page, it seems that the cycling behavior also occurs for the higher-order bits (although the cycle is longer), so I think that means we would see non-random behavior even with larger sets of features.

@shiyu1994
Collaborator

@btrotta Yes, so I think keeping the current random number generator is OK, because such generators would have some limitations anyway.
But regarding the latest post from @ruedika, that really is a bug in our internal Sample function when sampling only one element from a range, see

} else {
  std::set<int> sample_set;
  for (int r = N - K; r < N; ++r) {
    int v = NextInt(0, r);
    if (!sample_set.insert(v).second) {
      sample_set.insert(r);
    }
  }

When K=1, NextInt samples a number from the range [0, N-1), and since the loop executes only once, there is no further opportunity for N-1 to be inserted into the set; that is only possible when K > 1.
So we should fix this by adding a special case for K=1 in the code of Sample.
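A quick Python sketch of the same loop makes the problem visible. Here randrange(0, r) stands in for NextInt(0, r), assuming the half-open [0, r) range described above:

```python
import random

# Python sketch of the sampling loop quoted above. With K=1 the loop body
# runs exactly once with r = N-1, and randrange(0, N-1) can never return
# N-1, so the last index is unreachable.
def sample(N, K, rng):
    chosen = set()
    for r in range(N - K, N):
        v = rng.randrange(0, r)      # samples from [0, r), excluding r
        if v in chosen:              # collision: admit r instead
            chosen.add(r)
        else:
            chosen.add(v)
    return chosen

rng = random.Random(0)
seen = set()
for _ in range(1000):
    seen |= sample(N=5, K=1, rng=rng)
print(sorted(seen))  # index 4 never appears
```

Across 1000 draws from 5 candidates, only indices 0 through 3 are ever returned, matching the "last feature unused" symptom reported above.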
@ruedika Thanks for reporting this issue!

@JoshuaC3

@ruedika see some of my references here: Feature Cycling #4066 and Borrow EBM Ideas #3905 for more information.

Out of interest, what is your motivation to get single feature trees? I wonder if it is similar to mine/EBMs?

@ruedika
Author

ruedika commented May 10, 2021

Hi,
Sorry for getting back so late on this. My quick fix for the above was just to add a redundant column to the dataset...
@JoshuaC3 thanks for the info! I'm working with GAMs (generalized additive models), and learning mostly greedily (training on the best feature, which I guess is what LightGBM does) gives good results. So using interaction_constraints, as mentioned above, works nicely enough. I've (maybe) run into some problems using EBMs:
interpretml/interpret#232
The GAM built using LightGBM with interaction_constraints does not have these issues.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023