Update doc and code to support sklearn class interface
kingychiu committed Jul 9, 2023
1 parent e661a27 commit 5d52027
Showing 10 changed files with 460 additions and 57 deletions.
97 changes: 75 additions & 22 deletions README.md
@@ -42,18 +42,11 @@ This method was originally proposed/implemented by:

## Features
1. Compute null importances with only one function call.
2. Support tree models with sklearn's `feature_importances_` attribute, such as
- `RandomForestClassifier`, `RandomForestRegressor`,
- `XGBClassifier`. `XGBRegressor`,
- `LGBMClassifier`, `LGBMRegressor`,
- `CatBoostClassifier`, `CatBoostRegressor` etc.
3. Support linear models with sklearn's `coef_` attribute, such as
- `Lasso`
- `LinearSVC`
4. Support `sklearn`'s `MultiOutputClassifier` or `MultiOutputRegressor` interface.
4. Support data in `pandas.DataFrame` and `numpy.ndarray`
5. Highly customizable with both the exposed `compute` and `generic_compute` functions.
6. Proven effectiveness in Kaggle competitions and in [`Our Benchmarks Results`](https://target-permutation-importances.readthedocs.io/en/latest/benchmarks/).
2. Support models that provide information about feature importance (e.g. `coef_`, `feature_importances_`), such as `RandomForestClassifier`, `RandomForestRegressor`, `XGBClassifier`, `XGBRegressor`, `LGBMClassifier`, `LGBMRegressor`, `CatBoostClassifier`, `CatBoostRegressor`, `Lasso`, `LinearSVC`, etc.
3. Support `sklearn`'s `MultiOutputClassifier` or `MultiOutputRegressor` interface.
4. Support data in `pandas.DataFrame` and `numpy.ndarray`.
5. Highly customizable with both the exposed `compute` and `generic_compute` functions.
6. Proven effectiveness in Kaggle competitions and in [`Our Benchmarks Results`](https://target-permutation-importances.readthedocs.io/en/latest/benchmarks/).

---

@@ -70,7 +63,7 @@ Below are the benchmark results of running null-importances with feature selection


| model | n_dataset | n_better | better % |
|------------------------|-----------|----------|----------|
| ---------------------- | --------- | -------- | -------- |
| RandomForestClassifier | 10 | 10 | 100.0 |
| RandomForestRegressor | 12 | 8 | 66.67 |
| XGBClassifier | 10 | 7 | 70.0 |
@@ -107,7 +100,7 @@ beartype = "^0.14.1"
```
---

## Get Started
## Get Started (Functional APIs)

### Tree Models with `feature_importances_` Attribute

@@ -274,11 +267,10 @@ Running 2 actual runs and 10 random runs

You can find more detailed examples in the "Feature Selection Examples" section.

---

## Customization
### Customization

### Changing model or parameters
#### Changing model or parameters

You can pick your own model by changing
`model_cls`, `model_cls_params`, and `model_fit_params`; for example, using `LGBMClassifier` (see the sketch after the note below).
@@ -307,10 +299,11 @@ Note: Tree models are greedy. Usually it is a good idea to introduce some randomness.
It forces the model to explore the importances of different features. In other words, setting these
parameters prevents a feature from being under-represented in the importance calculation simply because another highly correlated feature is present.
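
A minimal sketch with `LGBMClassifier` (it assumes the `tpi.compute` functional API from the Get Started section above; the `colsample_bytree` value is illustrative only):

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer

import target_permutation_importances as tpi

data = load_breast_cancer()
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# colsample_bytree < 1.0 makes each tree consider a random subset of
# features, so correlated features all get a chance to be explored.
result_df = tpi.compute(
    model_cls=LGBMClassifier,  # The constructor/class of the model.
    model_cls_params={"colsample_bytree": 0.5, "n_jobs": -1},
    model_fit_params={},  # Parameters passed to model.fit.
    X=Xpd,
    y=data.target,
    num_actual_runs=2,
    num_random_runs=10,
)
print(result_df.sort_values("importance", ascending=False).head())
```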

### Changing null importances calculation
#### Changing null importances calculation

You can pick your own calculation method by changing `permutation_importance_calculator`.
There are 2 provided calculations:
There are 3 provided calculations:

- `tpi.compute_permutation_importance_by_subtraction`
- `tpi.compute_permutation_importance_by_division`
- `tpi.compute_permutation_importance_by_wasserstein_distance`
@@ -319,10 +312,69 @@ You can also implement your own calculation function and pass it in. The function
must follow the `PermutationImportanceCalculatorType` specification; you can find it in the
[API Reference](https://target-permutation-importances.readthedocs.io/en/latest/reference/).
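
For instance, here is a minimal sketch that swaps in the division-based calculator via the functional API (assuming the same `tpi.compute` arguments as in the sketch above; the other provided calculators, or your own function, are passed the same way):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

import target_permutation_importances as tpi

data = load_breast_cancer()
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Score features by dividing actual importances by null importances
# instead of the default subtraction-based score.
result_df = tpi.compute(
    model_cls=RandomForestClassifier,
    model_cls_params={"n_jobs": -1},
    model_fit_params={},
    X=Xpd,
    y=data.target,
    num_actual_runs=2,
    num_random_runs=10,
    permutation_importance_calculator=tpi.compute_permutation_importance_by_division,
)
print(result_df.sort_values("importance", ascending=False).head())
```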

### Advance Customization
#### Advanced Customization

This package exposes `tpi.generic_compute` to allow advanced customization.
Read the following for details:

- [generic_compute API reference](https://target-permutation-importances.readthedocs.io/en/latest/reference/#target_permutation_importances.functional.generic_compute)
- [`target_permutation_importances.functional.py`](https://github.com/kingychiu/target-permutation-importances/blob/main/target_permutation_importances/functional.py).

---

## Get Started (scikit-learn APIs)


`TargetPermutationImportances` follows the scikit-learn interface and supports scikit-learn feature selection methods such as `SelectFromModel`:

```python
# Import the package
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Models
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Compute permutation importances with default settings
ranker = tpi.TargetPermutationImportances(
    model_cls=RandomForestClassifier,  # The constructor/class of the model.
    # The parameters to pass to the model constructor. Update this based on your needs.
    model_cls_params={
        "n_jobs": -1,
    },
)
ranker.fit(
    X=Xpd,  # pd.DataFrame, np.ndarray
    y=data.target,  # pd.Series, np.ndarray
    num_actual_runs=2,
    num_random_runs=10,
    shuffle_feature_order=False,
    # Options: {compute_permutation_importance_by_subtraction,
    #           compute_permutation_importance_by_division,
    #           compute_permutation_importance_by_wasserstein_distance}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
    # And other fit parameters for the model.
    n_jobs=-1,
)
# Get the feature importances as a pandas dataframe
result_df = ranker.feature_importances_df_
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())


# Select features with sklearn feature selectors
selector = SelectFromModel(
    estimator=ranker, prefit=True, threshold=result_df["importance"].max()
).fit(Xpd, data.target)
selected_x = selector.transform(Xpd)
print(selected_x.shape)
```
Fork the above code on [Kaggle](https://www.kaggle.com/code/kingychiu/target-permutation-importances-basic-usage/notebook).

---

96 changes: 74 additions & 22 deletions docs/index.md
@@ -42,18 +42,11 @@ This method was originally proposed/implemented by:

## Features
1. Compute null importances with only one function call.
2. Support tree models with sklearn's `feature_importances_` attribute, such as
- `RandomForestClassifier`, `RandomForestRegressor`,
- `XGBClassifier`. `XGBRegressor`,
- `LGBMClassifier`, `LGBMRegressor`,
- `CatBoostClassifier`, `CatBoostRegressor` etc.
3. Support linear models with sklearn's `coef_` attribute, such as
- `Lasso`
- `LinearSVC`
4. Support `sklearn`'s `MultiOutputClassifier` or `MultiOutputRegressor` interface.
4. Support data in `pandas.DataFrame` and `numpy.ndarray`
5. Highly customizable with both the exposed `compute` and `generic_compute` functions.
6. Proven effectiveness in Kaggle competitions and in [`Our Benchmarks Results`](https://target-permutation-importances.readthedocs.io/en/latest/benchmarks/).
2. Support models that provide information about feature importance (e.g. `coef_`, `feature_importances_`), such as `RandomForestClassifier`, `RandomForestRegressor`, `XGBClassifier`, `XGBRegressor`, `LGBMClassifier`, `LGBMRegressor`, `CatBoostClassifier`, `CatBoostRegressor`, `Lasso`, `LinearSVC`, etc.
3. Support `sklearn`'s `MultiOutputClassifier` or `MultiOutputRegressor` interface.
4. Support data in `pandas.DataFrame` and `numpy.ndarray`.
5. Highly customizable with both the exposed `compute` and `generic_compute` functions.
6. Proven effectiveness in Kaggle competitions and in [`Our Benchmarks Results`](https://target-permutation-importances.readthedocs.io/en/latest/benchmarks/).

---

@@ -70,7 +63,7 @@ Below are the benchmark results of running null-importances with feature selection


| model | n_dataset | n_better | better % |
|------------------------|-----------|----------|----------|
| ---------------------- | --------- | -------- | -------- |
| RandomForestClassifier | 10 | 10 | 100.0 |
| RandomForestRegressor | 12 | 8 | 66.67 |
| XGBClassifier | 10 | 7 | 70.0 |
@@ -107,7 +100,7 @@ beartype = "^0.14.1"
```
---

## Get Started
## Get Started (Functional APIs)

### Tree Models with `feature_importances_` Attribute

@@ -274,11 +267,10 @@ Running 2 actual runs and 10 random runs

You can find more detailed examples in the "Feature Selection Examples" section.

---

## Customization
### Customization

### Changing model or parameters
#### Changing model or parameters

You can pick your own model by changing
`model_cls`, `model_cls_params`, and `model_fit_params`; for example, using `LGBMClassifier` (see the sketch after the note below).
@@ -307,10 +299,11 @@ Note: Tree models are greedy. Usually it is a good idea to introduce some randomness.
It forces the model to explore the importances of different features. In other words, setting these
parameters prevents a feature from being under-represented in the importance calculation simply because another highly correlated feature is present.
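
A minimal sketch with `LGBMClassifier` (it assumes the `tpi.compute` functional API from the Get Started section above; the `colsample_bytree` value is illustrative only):

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer

import target_permutation_importances as tpi

data = load_breast_cancer()
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# colsample_bytree < 1.0 makes each tree consider a random subset of
# features, so correlated features all get a chance to be explored.
result_df = tpi.compute(
    model_cls=LGBMClassifier,  # The constructor/class of the model.
    model_cls_params={"colsample_bytree": 0.5, "n_jobs": -1},
    model_fit_params={},  # Parameters passed to model.fit.
    X=Xpd,
    y=data.target,
    num_actual_runs=2,
    num_random_runs=10,
)
print(result_df.sort_values("importance", ascending=False).head())
```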

### Changing null importances calculation
#### Changing null importances calculation

You can pick your own calculation method by changing `permutation_importance_calculator`.
There are 2 provided calculations:
There are 3 provided calculations:

- `tpi.compute_permutation_importance_by_subtraction`
- `tpi.compute_permutation_importance_by_division`
- `tpi.compute_permutation_importance_by_wasserstein_distance`
@@ -319,10 +312,69 @@ You can also implement your own calculation function and pass it in. The function
must follow the `PermutationImportanceCalculatorType` specification; you can find it in the
[API Reference](https://target-permutation-importances.readthedocs.io/en/latest/reference/).
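
For instance, here is a minimal sketch that swaps in the division-based calculator via the functional API (assuming the same `tpi.compute` arguments as in the sketch above; the other provided calculators, or your own function, are passed the same way):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

import target_permutation_importances as tpi

data = load_breast_cancer()
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Score features by dividing actual importances by null importances
# instead of the default subtraction-based score.
result_df = tpi.compute(
    model_cls=RandomForestClassifier,
    model_cls_params={"n_jobs": -1},
    model_fit_params={},
    X=Xpd,
    y=data.target,
    num_actual_runs=2,
    num_random_runs=10,
    permutation_importance_calculator=tpi.compute_permutation_importance_by_division,
)
print(result_df.sort_values("importance", ascending=False).head())
```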

### Advance Customization
#### Advanced Customization

This package exposes `tpi.generic_compute` to allow advanced customization.
Read the following for details:

- [generic_compute API reference](https://target-permutation-importances.readthedocs.io/en/latest/reference/#target_permutation_importances.functional.generic_compute)
- [`target_permutation_importances.functional.py`](https://github.com/kingychiu/target-permutation-importances/blob/main/target_permutation_importances/functional.py).

---

## Get Started (scikit-learn APIs)


`TargetPermutationImportances` follows the scikit-learn interface and supports scikit-learn feature selection methods such as `SelectFromModel`:

```python
# Import the package
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Models
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Compute permutation importances with default settings
ranker = tpi.TargetPermutationImportances(
    model_cls=RandomForestClassifier,  # The constructor/class of the model.
    # The parameters to pass to the model constructor. Update this based on your needs.
    model_cls_params={
        "n_jobs": -1,
    },
)
ranker.fit(
    X=Xpd,  # pd.DataFrame, np.ndarray
    y=data.target,  # pd.Series, np.ndarray
    num_actual_runs=2,
    num_random_runs=10,
    shuffle_feature_order=False,
    # Options: {compute_permutation_importance_by_subtraction,
    #           compute_permutation_importance_by_division,
    #           compute_permutation_importance_by_wasserstein_distance}
    # Or use your own function to calculate.
    permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
    # And other fit parameters for the model.
    n_jobs=-1,
)
# Get the feature importances as a pandas dataframe
result_df = ranker.feature_importances_df_
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())


# Select features with sklearn feature selectors
selector = SelectFromModel(
    estimator=ranker, prefit=True, threshold=result_df["importance"].max()
).fit(Xpd, data.target)
selected_x = selector.transform(Xpd)
print(selected_x.shape)
```

---

14 changes: 13 additions & 1 deletion docs/reference.md
@@ -1,5 +1,17 @@
# API Reference

## Class APIs
::: target_permutation_importances.class_wrapper
handler: python
options:
show_signature_annotations: true
members:
- TargetPermutationImportances
show_root_heading: false
show_root_toc_entry: false
show_source: true
heading_level: 3

## Functional APIs
::: target_permutation_importances.functional
handler: python
@@ -27,7 +39,7 @@
- YBuilderType
- ModelBuilderType
- ModelFitterType
- ModelImportanceCalculatorType
- ModelImportanceGetter
- PermutationImportanceCalculatorType
show_root_heading: false
show_root_toc_entry: false
10 changes: 10 additions & 0 deletions mkdocs.yml
@@ -35,6 +35,16 @@ extra_javascript:
extra_css:
- https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.16.7/katex.min.css

markdown_extensions:
- tables
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences

plugins:
- search
- autorefs
4 changes: 4 additions & 0 deletions target_permutation_importances/__init__.py
@@ -5,3 +5,7 @@
compute_permutation_importance_by_wasserstein_distance,
generic_compute,
)

from target_permutation_importances.sklearn_wrapper import (
TargetPermutationImportances,
) # noqa
3 changes: 0 additions & 3 deletions target_permutation_importances/class.py

This file was deleted.

