Commit v2023.01.03

jolespin committed Jan 11, 2023
1 parent 37a3ac1 commit e49eac4

Showing 7 changed files with 367 additions and 148 deletions.
Empty file removed: Icon
32 changes: 26 additions & 6 deletions README.md
@@ -12,13 +12,17 @@ Reimplementation for `Clairvoyance` from [Espinoza & Dupont et al. 2021](https:/
#### Details:
`import clairvoyance as cy`

`__version__ = "2022.01.01"`
`__version__ = "2022.01.03"`

#### Installation

```
# Stable:
pip install clairvoyance_feature_selection
conda install -c jolespin clairvoyance
# Developmental:
pip install git+https://github.com/jolespin/clairvoyance
```

#### Citation
@@ -30,7 +34,7 @@ Espinoza JL, Dupont CL, O’Rourke A, Beyhan S, Morales P, Spoering A, et al. (2


##### Feature selection based on classification tasks
Here's a basic classification using a `LogisticRegression` model and a grid search over different `C` and `penalty` parameters. We add 996 noise variables within the same range of values as the original Iris features, then normalize all of the features so their scales are standardized. In this case, we are optimizing for `accuracy`. Since we are using a `LogisticRegression`, we don't have to worry about features ending up with zero weight, so we set `remove_zero_weighted_features=False`. This allows us to plot a clean RCI curve with error bars.

```python
import clairvoyance as cy
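# The diff collapses the data-preparation lines here. A minimal sketch of
# what they presumably contain, based on the description above (996 noise
# variables within the range of the Iris features, then standardized);
# the seed, solver, and grid values below are assumptions:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)
rng = np.random.default_rng(0)
noise = rng.uniform(X.values.min(), X.values.max(), size=(X.shape[0], 996))
X = pd.concat([X, pd.DataFrame(noise, index=X.index).add_prefix("noise_")], axis=1)
X_normalized = (X - X.mean()) / X.std()  # standardize each feature

estimator = LogisticRegression(solver="liblinear", max_iter=1000)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}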
# @@ -75,6 +79,8 @@ (diff collapsed here)
clf = cy.ClairvoyanceClassification(
    estimator=estimator,
    param_grid=param_grid,
    verbose=1,
    remove_zero_weighted_features=False,
)
clf.fit(X_normalized, y)  # optionally: sort_hyperparameters_by=["C", "penalty"], ascending=[True, False]
history = clf.recursive_feature_inclusion(early_stopping=10)
# @@ -90,7 +96,7 @@ (diff collapsed here)
clf.plot_weights(weight_type="cross_validation")
```
![](images/2.png)

There are still a few noise variables, though with much lower weights, suggesting our classifier is modeling noise. We can add an additional penalty where the change in score must exceed a threshold for a new feature to be added during the recursive feature inclusion algorithm. We keep `remove_zero_weighted_features=False` for this example.

```python
history = clf.recursive_feature_inclusion(early_stopping=10, minimum_improvement_in_score=0.05)
```

@@ -118,6 +124,7 @@

```python
clf_binary = cy.ClairvoyanceClassification(
    n_draws=10,
    estimator=estimator,
    param_grid=param_grid,
    remove_zero_weighted_features=False,
    verbose=1,
)

# @@ -138,7 +145,9 @@ (diff collapsed here)
clf_binary.plot_weights(weight_type="cross_validation")
```
![](images/5.png)

##### Feature selection based on regression tasks
Here's a basic regression using a `DecisionTreeRegressor` model and a grid search over different `min_samples_leaf` and `min_samples_split` parameters. We add 87 noise variables and normalize all of the features so their scales are standardized. In this case, we are optimizing for `neg_root_mean_squared_error`, and we use a validation set of ~16% of the data during our recursive feature inclusion. Decision trees come with the issue of zero-weighted features, which are uninformative and misleading for RCI. To get around this, we implement a recursive feature removal that keeps only non-zero-weighted features, enabled via `remove_zero_weighted_features=True`. This also ensures that there are no redundant feature sets (not an issue when `remove_zero_weighted_features=False`, because features are added recursively).

Note: when we use `remove_zero_weighted_features=True`, we get a scatter plot instead of a line plot with error bars, because multiple feature sets (each with its own performance distribution on the CV set) may contain the same number of features.

```python
from sklearn.datasets import load_boston  # note: load_boston was deprecated and removed in scikit-learn 1.2; this example needs an older version
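# The diff collapses the setup lines here. A minimal sketch of what they
# presumably contain, based on the description above (87 noise variables,
# standardized features, ~16% validation split); the seed and the split
# call are assumptions:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target)
rng = np.random.default_rng(0)
noise = rng.uniform(X.values.min(), X.values.max(), size=(X.shape[0], 87))
X = pd.concat([X, pd.DataFrame(noise, index=X.index).add_prefix("noise_")], axis=1)
X_normalized = (X - X.mean()) / X.std()
X_training, X_validation, y_training, y_validation = train_test_split(
    X_normalized, y, test_size=0.1618, random_state=0,  # ~16% validation
)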
# @@ -164,7 +173,15 @@ (diff collapsed here)
estimator = DecisionTreeRegressor(random_state=0)
param_grid = {"min_samples_leaf":[1,2,3,5,8],"min_samples_split":{ 0.1618, 0.382, 0.5, 0.618}}

# Fit model
reg = cy.ClairvoyanceRegression(
    name="Boston",
    n_jobs=-1,
    n_draws=10,
    estimator=estimator,
    param_grid=param_grid,
    verbose=1,
    remove_zero_weighted_features=True,
)
reg.fit(X_training, y_training)
history = reg.recursive_feature_inclusion(early_stopping=10, X=X_validation, y=y_validation)
history.head()
# @@ -178,7 +195,7 @@ (diff collapsed here)
reg.plot_weights(weight_type="cross_validation")
```
![](images/7.png)

Some noise features still make it through with `DecisionTreeRegressor` models. Instead of adding a penalty, let's reuse the weights fitted with the `DecisionTreeRegressor` but use an ensemble `RandomForestRegressor` for the actual feature inclusion algorithm.

```python
from sklearn.ensemble import RandomForestRegressor
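# The collapsed lines presumably rerun the feature-inclusion step with the
# ensemble model; the `estimator` keyword below is an assumption, not
# confirmed clairvoyance API:
# history = reg.recursive_feature_inclusion(
#     early_stopping=10,
#     X=X_validation,
#     y=y_validation,
#     estimator=RandomForestRegressor(random_state=0),
# )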
# @@ -189,6 +206,8 @@ (diff collapsed here)
reg.plot_weights(weight_type="cross_validation")
```
![](images/8.png)

That's much better...

##### Recursive feature selection based on classification tasks
Here we run `Clairvoyance` recursively, identifying several feature sets that work with different hyperparameters so we end up with a range of feature sets to select from. We iterate through all of the hyperparameter configurations, recursively feed in the data using different percentiles of the weights, and use different score thresholds from the random draws. The recursive usage is similar to the legacy implementation used in [Espinoza & Dupont et al. 2021](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008857) (which is still provided as an executable).

```python
# @@ -217,6 +236,7 @@ (diff collapsed here)
rci = cy.ClairvoyanceRecursive(
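    # (Collapsed in the diff) the earlier constructor arguments, e.g. the
    # estimator and param_grid from the classification example above,
    # presumably appear here.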
    percentiles=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.925, 0.95, 0.975, 0.99],
    minimum_scores=[-np.inf, 0.382, 0.5],
    verbose=0,
    remove_zero_weighted_features=False,
)
rci.fit(X_normalized, y, sort_hyperparameters_by=["C", "penalty"], ascending=[True, True])
rci.plot_recursive_feature_selection()
```
4 changes: 2 additions & 2 deletions clairvoyance/__init__.py
@@ -26,7 +26,7 @@
# =======
# Version
# =======
__version__= "2023.01.01"
__version__= "2023.01.03"
__author__ = "Josh L. Espinoza"
__email__ = "jespinoz@jcvi.org, jol.espinoz@gmail.com"
__url__ = "https://github.com/jolespin/clairvoyance"
@@ -40,7 +40,7 @@
"format_cross_validation",
"get_feature_importance_attribute",
"recursive_feature_inclusion",
"plot_scores","plot_weights_bar",
"plot_scores_line","plot_weights_bar",
"plot_weights_box",
"plot_recursive_feature_selection",
}
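
Note that `plot_scores` → `plot_scores_line` appears to be a breaking rename for anyone importing the old name from the package. A quick, hypothetical sanity check (assumes this version of the package is installed):

```python
# plot_scores was renamed to plot_scores_line in this commit
from clairvoyance import plot_scores_line
```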