One Feature per Tree #4134
Regarding the interaction constraints, there's a typo in your code (missing the first 's' in interaction_constraints).
Nice. Sorry I made you check on my typos, and many thanks!
This is due to the internal LightGBM random generator, which is a linear congruential generator (https://en.wikipedia.org/wiki/Linear_congruential_generator). For efficiency, we use a very simple form of it. With that in mind, if we follow the logic of the sampling code, I think this can be counted as a limitation of pseudo-random generators. And this is a very extreme case, because usually when there are only 3 features, we don't need feature sampling. Changing to a better pseudo-random generator may solve this problem, but can also incur slower sampling speed. WDYT @btrotta
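To make the point above concrete, here is a minimal sketch of why the low-order bits of a linear congruential generator cycle with very short periods (and why higher-order bits cycle too, just with longer periods). The constants a=214013, c=2531011 (the classic MSVC rand() parameters) are an assumption for illustration, not necessarily the exact values LightGBM uses.

```python
def lcg_bits(seed, n, bit, a=214013, c=2531011, m=2**32):
    """Return the value of the given bit for the first n states of an LCG
    x_{k+1} = (a * x_k + c) mod m. For m a power of two, bit k of the state
    repeats with period at most 2**(k+1)."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append((x >> bit) & 1)
    return out

# Lowest bit: period 2 (it simply alternates), so any sampling decision
# driven by it is fully deterministic after two draws.
low = lcg_bits(seed=123, n=16, bit=0)
print(low)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Bit 1: period 4 -- longer, but still far from random.
b1 = lcg_bits(seed=123, n=16, bit=1)
print(b1 == b1[:4] * 4)  # True
```

This is why sampling a very small set of features through modular reduction of such a generator can systematically skip values.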
Thanks a lot, but I do not think that this is the issue. I expanded the example a little to include more features. Setting the number of features k to different values always results in the last feature remaining unused when colsample_bytree is set to a value smaller than or equal to 1/k.
Code example:
np.random.seed(123)
vidx = df.sample(frac=0.2).index
md = 5
gbm = LGBMClassifier(max_depth=md, n_estimators=1000, colsample_bytree=1/(k+1), num_leaves=2**md)
@shiyu1994 Reading the wikipedia page, it seems that the cycling behavior also occurs for higher-order bits (although the cycle is longer), so I think that means we would see non-random behavior even in larger sets of features. |
@btrotta Yes, so I think keeping the current random number generator is OK, because such generators would have some limitations anyway.
LightGBM/include/LightGBM/utils/random.h Lines 85 to 92 in 0a847ef
When K=1, NextInt samples a number from the range [0, N - 1), and since the iteration only executes once, there's no further opportunity for N-1 to be inserted into the set. Only when K > 1 is that possible. So we should fix this by adding a special case for K=1 in the code of Sample. @ruedika Thanks for reporting this issue!
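The off-by-one described above can be sketched in Python. This is an illustrative reconstruction modeled on Floyd's sampling algorithm, not the exact LightGBM source: with an exclusive upper bound of i, index N-1 can only ever enter the result through the "already chosen" branch, which never fires when K=1.

```python
import random

def buggy_sample(n, k, rng):
    """Floyd-style sampling with an off-by-one: randrange excludes i,
    so when k == 1 the last index n-1 can never be drawn (assumes k < n)."""
    chosen = set()
    for i in range(n - k, n):
        t = rng.randrange(0, i)          # exclusive of i: the bug
        chosen.add(i if t in chosen else t)
    return chosen

def fixed_sample(n, k, rng):
    """Correct Floyd's algorithm: the upper bound is inclusive of i."""
    chosen = set()
    for i in range(n - k, n):
        t = rng.randrange(0, i + 1)      # inclusive of i
        chosen.add(i if t in chosen else t)
    return chosen

rng = random.Random(0)
seen_buggy, seen_fixed = set(), set()
for _ in range(1000):
    seen_buggy |= buggy_sample(3, 1, rng)
    seen_fixed |= fixed_sample(3, 1, rng)
print(sorted(seen_buggy))  # index 2 never appears
print(sorted(seen_fixed))  # all three indices appear
```

The thread proposes special-casing K=1; making the upper bound inclusive, as in fixed_sample above, is an equivalent way to cover all K at once.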
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
I'm trying to build a model using only 1 feature per tree. Using colsample_bytree or interaction_constraints does not work as expected: colsample_bytree does not use the last feature in the data when set to low values, and interaction_constraints appears not to be implemented for Python?
Code:
import numpy as np
import pandas as pd
import lightgbm as lgbm
from lightgbm import LGBMClassifier
from scipy.stats import multivariate_normal
np.random.seed(123)
n=100000
rho=0.8
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(-1-df.x1-2*np.sin(-3+df.x2)-0.5*df.x3))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'
vidx=df.sample(frac=0.2).index
tidx=df.drop(vidx).index
md=5
# all three features used (as expected)
gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=0.9,num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().split_feature.value_counts()
# last feature not used (bug?)
gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=0.1,num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().split_feature.value_counts()
# all three features used in every tree (interaction_constraints not implemented?)
gbm = LGBMClassifier(max_depth=md,n_estimators=1000,num_leaves=2**md)
gbm.set_params(**{'interaction_contraints':[[i] for i in range(len(num)+len(cat))]})
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().groupby('tree_index')['split_feature'].nunique().value_counts()
LGBM version 3.1.1