
One Feature per Tree #4134

Closed
ruedika opened this issue Mar 29, 2021 · 10 comments · Fixed by #4324
@ruedika

ruedika commented Mar 29, 2021

I'm trying to build a model using only 1 feature per tree. Neither colsample_bytree nor interaction_constraints works as expected: colsample_bytree does not use the last feature in the data when set to low values, and interaction_constraints appears not to be implemented for Python?

Code:

import numpy as np
import pandas as pd
import lightgbm as lgbm
from lightgbm import LGBMClassifier
from scipy.stats import multivariate_normal

np.random.seed(123)
n=100000
rho=0.8
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(-1-df.x1-2*np.sin(-3+df.x2)-0.5*df.x3))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'

vidx=df.sample(frac=0.2).index
tidx=df.drop(vidx).index

md=5

all three features used (as expected)

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=0.9,num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().split_feature.value_counts()

last feature not used (bug?)

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=0.1,num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().split_feature.value_counts()

all three features used in every tree (interaction_constraints not implemented?)

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,num_leaves=2**md)
gbm.set_params(
{'interaction_contraints':[[i] for i in range(len(num)+len(cat))]})
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
gbm._Booster.trees_to_dataframe().groupby('tree_index')['split_feature'].nunique().value_counts()

LGBM version 3.1.1

@btrotta
Collaborator

btrotta commented Mar 30, 2021

Regarding the interaction constraints, there's a typo in your code (missing first s in constraints). Also, you need to pass keyword arguments to set_params, not a dict. The following works for me:

gbm.set_params(interaction_constraints=[[i] for i in range(len(num)+len(cat))])

The issue with colsample_bytree sounds like it may be a bug.

@btrotta btrotta added the bug label Mar 30, 2021
@ruedika
Author

ruedika commented Mar 30, 2021

Nice! Sorry I made you check my typos, and many thanks!!

@shiyu1994
Collaborator

When colsample_bytree=0.1, the column sampler always samples features 0 and 2, and feature 1 is never sampled. I think this is due to a bug in the Sample function in include/LightGBM/utils/random.h, though it is a corner case, since there are only 3 features in the example above. I'm still investigating this to see if I can fix it.

@shiyu1994
Collaborator

This is due to LightGBM's internal random generator, which is a linear congruential generator (https://en.wikipedia.org/wiki/Linear_congruential_generator). For efficiency, we use a form where the modulus m is a power of 2. Though efficient, such generators have a limitation: the lowest bit of the state always alternates between 0 and 1, so the randomly generated integers alternate between odd and even numbers.

With that in mind, if we follow the logic in the Sample method (with parameters N=3 and K=2), we'll find that the code always samples 0 and 2.

I think this can be counted as a limitation of pseudo-random generators, and this is a very extreme case: usually, when there are only 3 features, we don't need feature sampling. Switching to a better pseudo-random generator may solve this problem, but could also make sampling slower. WDYT @btrotta
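The low-bit behavior is easy to demonstrate. The sketch below is not LightGBM's actual generator; it is a generic power-of-two-modulus LCG using the MSVC-style constants (an assumption, chosen only for illustration), whose lowest state bit strictly alternates:

```python
# Minimal sketch of a linear congruential generator with a power-of-two
# modulus. The constants are MSVC-style and purely illustrative.
def lcg(seed, a=214013, c=2531011, m=2**31):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(seed=123)
low_bits = [next(gen) & 1 for _ in range(8)]
print(low_bits)  # the lowest bit alternates 0, 1, 0, 1, ...
```

Because a and c are odd, the parity of the state flips on every step, so any sampling logic that depends on low-order bits inherits this strictly periodic pattern.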

@ruedika
Author

ruedika commented Apr 6, 2021

Thanks a lot, but I do not think that this is the issue. I expanded the example a little to include more features. Setting the number of features k to different values always results in the last feature remaining unused whenever colsample_bytree is less than or equal to 1/k:

Code Example:

np.random.seed(123)
n=100000 #number of samples
k=10 #number of features
rho=0.8 #correlation between covariates
num=['x'+str(i) for i in range(1,k+1)]
df=pd.DataFrame(multivariate_normal(mean=[0]*k,cov=rho*np.ones(shape=[k,k])+np.diag(k*[1-rho])).rvs(size=n),columns=num)
cat=[]
coef=np.random.rand(len(cat+num))
coef=1.4*coef/np.sqrt(np.sum(coef**2))
df['prob_true']=1/(1+np.exp(-4+np.matmul(df[cat+num].values,coef.reshape(-1,1))))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'

vidx=df.sample(frac=0.2).index
tidx=df.drop(vidx).index

md=5

gbm = LGBMClassifier(max_depth=md,n_estimators=1000,colsample_bytree=1/(k+1),num_leaves=2**md)
gbm.fit(df.loc[tidx,cat+num],df.loc[tidx,target],eval_set=[(df.loc[vidx,cat+num],df.loc[vidx,target])],early_stopping_rounds=10,verbose=0)
features_used=gbm._Booster.trees_to_dataframe().split_feature.value_counts()
features_unused=[v for v in cat+num if v not in features_used.index]
print(features_unused)

@btrotta
Collaborator

btrotta commented Apr 18, 2021

@shiyu1994 Reading the Wikipedia page, it seems that the cycling behavior also occurs for the higher-order bits (although the cycle is longer), so I think that means we would see non-random behavior even with larger sets of features.

@shiyu1994
Collaborator

@btrotta Yes, so I think keeping the current random number generator is OK, because such generators would have some limitations anyway.
But regarding the latest post from @ruedika, that really is a bug in our internal Sample function when sampling only one element from a range, see

} else {
  std::set<int> sample_set;
  for (int r = N - K; r < N; ++r) {
    int v = NextInt(0, r);
    if (!sample_set.insert(v).second) {
      sample_set.insert(r);
    }
  }

When K=1, NextInt samples a number from the range [0, N-1), and since the loop executes only once, there is no further opportunity for N-1 to be inserted into the set; that is only possible when K > 1.
So we should fix this by adding a special case for K=1 in the code of Sample.
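A quick Python sketch of the same loop makes the problem visible. Here randrange(0, r) stands in for NextInt(0, r), assuming the half-open [0, r) range described above:

```python
import random

# Python sketch of the sampling loop quoted above. With K=1 the loop body
# runs exactly once with r = N-1, and randrange(0, N-1) can never return
# N-1, so the last index is unreachable.
def sample(N, K, rng):
    chosen = set()
    for r in range(N - K, N):
        v = rng.randrange(0, r)      # samples from [0, r), excluding r
        if v in chosen:              # collision: admit r instead
            chosen.add(r)
        else:
            chosen.add(v)
    return chosen

rng = random.Random(0)
seen = set()
for _ in range(1000):
    seen |= sample(N=5, K=1, rng=rng)
print(sorted(seen))  # index 4 never appears
```

Across 1000 draws from 5 candidates, only indices 0 through 3 are ever returned, matching the "last feature unused" symptom reported above.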
@ruedika Thanks for reporting this issue!

@JoshuaC3

@ruedika see some of my references here: Feature Cycling #4066 and Borrow EBM Ideas #3905 for more information.

Out of interest, what is your motivation to get single feature trees? I wonder if it is similar to mine/EBMs?

@ruedika
Author

ruedika commented May 10, 2021

Hi,
Sorry for getting back so late on this. My quick fix for the above was just to add a redundant column to the dataset...
@JoshuaC3 thanks for the info! I'm working with GAMs (generalized additive models), and learning mostly greedily (training on the best feature, which I guess is what LightGBM does) gives good results. So using interaction_constraints, as mentioned above, works nicely enough. I've (maybe) run into some problems using EBMs:
interpretml/interpret#232
The GAM built using LightGBM with interaction_constraints does not have these issues.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023