Ss test mapie #41

Open
wants to merge 79 commits into
base: dev
31791cb
bring in the ci for rf in fs_algo_train_eval.py
Jan 17, 2025
dcc6c5a
bring in ci to fs_proc_algo.py
Jan 20, 2025
1805ec0
bring in ci to fs_pred_algo.py
Jan 20, 2025
93ede73
apply only one n_estimators (grid selection bug)
Jan 20, 2025
2d3dbfa
fix a syntax error in fs_algo_train_eval
Jan 20, 2025
1154eef
clean fs_algo_train_eval.py
Jan 21, 2025
50e790c
added unit test for std_Xtrain_path function
Jan 21, 2025
ffbbc62
Revert "Merge remote-tracking branch 'upstream/dev' into ss_test_fci_…
Jan 22, 2025
bdb7d32
added unit test for fci function
Jan 22, 2025
cb0efe1
Update fs_pred_algo.py
glitt13 Jan 22, 2025
5eb00e4
brought back list of values in n_estimators in xssa_algo_config.yaml
Jan 22, 2025
c310236
Merge branch 'ss_test_fci_dev3' of https://github.com/ssorou1/formula…
Jan 22, 2025
737c36a
Incorporate Bagging into mlp in fs_algo_train_eval
Jan 28, 2025
88661ad
rf n_estimators=400
Jan 28, 2025
ea914f4
Incorporate Bagging into rf in fs_algo_train_eval
Jan 28, 2025
ad24727
incorporated mapie for rf model
Jan 31, 2025
983a662
update the MAPIE to use the same fit as rf
Jan 31, 2025
710038a
incorporated mapie for mlp model
Jan 31, 2025
d24e19f
deleted unused pred_rf variable inside train_algos function
Jan 31, 2025
08deb3c
add number of bootstrap runs as a parameter to the yaml file
Feb 6, 2025
6476710
Implemented multiple confidence intervals (90, 95 & 99%) for rf Baggi…
Feb 6, 2025
e94b32d
Implemented multiple confidence intervals (90, 95 & 99%) for mlp Bagg…
Feb 6, 2025
746b172
Update rf_Bagging_ci function to dynamically read the number of boots…
Feb 6, 2025
6fdb8d6
Update mlp_Bagging_ci function to dynamically read the number of boot…
Feb 6, 2025
52f088f
develop a separate function for MAPIE and update fs_algo_train_eval a…
Feb 6, 2025
048ca30
Rename ci for rf model for clarification
Feb 6, 2025
5c9e14b
Update fs_pred to consider forestci only for rf model. Update saving …
Feb 7, 2025
275564f
added pkg/proc.attr.hydfab/.RData to the gitignore
Feb 7, 2025
75a5140
add confidence level to the yaml file
Feb 7, 2025
3bc23ea
update Bagging_ci files to calculate the confidence interval from use…
Feb 7, 2025
ba8858d
delete .RData
Feb 7, 2025
3e65cd6
add MAPIE alpha to the yaml file
Feb 10, 2025
cfc0c24
boolean variable for MAPIE
Feb 10, 2025
39174fc
implement MAPIE dynamically into fs_proc and fs_train_eval from infor…
Feb 10, 2025
91cb971
clean up fs_algo_train_eval from regarding MAPIE
Feb 10, 2025
afaa087
add forestci boolean to yaml file
Feb 11, 2025
aecf707
change forestci as an independent function outside of train_algos. Ad…
Feb 11, 2025
bf34bbf
added flag for Bagging enable/disable in yaml file
Feb 11, 2025
7077843
fixed a bug in calculate_Bagging_ci
Feb 11, 2025
7e4ee03
merged Bagging_ci functions into one and bring it out of train_algos(…
Feb 12, 2025
6aaa3ee
clean up fs_algo_train_eval from individual Bagging_ci(), and forestc…
Feb 12, 2025
4de7e19
replace model_name with algo_str in Bagging_ci
Feb 12, 2025
15bbc04
replace model_cfg with algo_cfg for consistency
Feb 12, 2025
5046c84
replace n_models with n_algos
Feb 12, 2025
1d8bba6
replace model with algo
Feb 12, 2025
adf4746
delete Bagging_ci flag from yaml and build the logic around n_algos f…
Feb 12, 2025
3df0553
delete MAPIE flag from yaml and build the logic around MAPIE_alpha fo…
Feb 12, 2025
7e5ce60
create a separate section in yaml for Bagging
Feb 12, 2025
9edb092
update fs_proc and fs_algo_train_eval to treat uncertainty as a separ…
Feb 13, 2025
7b616a3
delete unused mapie boolean arg
Feb 13, 2025
ddab11a
added some information about OPTIONAL/REQUIRED to yaml file
Feb 13, 2025
9ab4ccf
replace Bagging with bagging (lower case)
Feb 13, 2025
8404b69
fixed a bug in the range of mapie_alpha
Feb 14, 2025
52e73ce
temporarily change the COMID retrieval till update for pynhd package
Feb 14, 2025
373986c
temporarily change the COMID retrieval till update for pynhd pack…
Feb 14, 2025
685023e
Apply the changes in fs_prog_algo.py into fs_proc_algo_viz.py
Feb 14, 2025
99d8784
implement random sampling and recycle hyper-parameters into uncertain…
Feb 14, 2025
9825210
revert back the COMID retrieval section as pynhd is updated to >=0.19
Feb 14, 2025
8129667
add additional comment for MAPIE_alpha
Feb 14, 2025
9263a49
resolve merge conflicts and apply fixes by Guy
Feb 18, 2025
2390b90
simplify the calculate_bagging_ci function
Feb 18, 2025
4ac5ce7
update calculate_mapie() to write mapie to self.algs_dict
Feb 18, 2025
ef57c39
clean up calculate_bagging_ci()
Feb 18, 2025
b3cbc62
update train_algos_grid_search to calculate forestci with grid search
Feb 18, 2025
3336544
Merge remote-tracking branch 'upstream/dev' into ss_test_mapie
Feb 19, 2025
e263848
add save_Xtrain_to_csv function to the model and update the code and …
Feb 19, 2025
36d6a62
update the save_algos() to output the X_train shape to the joblib file
Feb 20, 2025
2563199
Write forestci parameters in terms of numpy arrays instead of dataframe
Feb 20, 2025
d3c6624
updated fs_pred to read the X_train dimension from joblib file rather…
Feb 20, 2025
1b3def2
clean up unnecessary Xtrain related functions in fs_algo_train_eval a…
Feb 20, 2025
3142576
clean up save_algos function
Feb 20, 2025
5913aed
fix the forestci with grid search while keeping the forestci calculat…
Feb 20, 2025
982ec5b
implement calculate_bagging_ci to work with grid search
Feb 21, 2025
6fee686
update the keys and indices of tuples inside Bagging ci
Feb 21, 2025
80620a0
update the output format for mapie
Feb 21, 2025
6b3c5b8
bring confidence_levels out of Bagging_uncertainty in the config file…
Feb 21, 2025
c6f4554
implement forestci with user-defined confidence intervals
Feb 21, 2025
b17010b
update the code to save uncertainty dictionary
Feb 24, 2025
5b35735
Generate n_algos unique random states using self.rs as the seed, ensu…
Feb 25, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -99,6 +99,7 @@ coverage.xml
#######
*.Rproj
*.Rhistory
pkg/proc.attr.hydfab/.RData

# Front-End #
#############
251 changes: 204 additions & 47 deletions pkg/fs_algo/fs_algo/fs_algo_train_eval.py
@@ -3,7 +3,7 @@
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV,learning_curve
import numpy as np
import pandas as pd
@@ -31,6 +31,11 @@
import urllib
import zipfile
import forestci as fci
from sklearn.utils import resample
from mapie.regression import MapieRegressor
from scipy.stats import norm
import random
import scipy.stats as st

# %% BASIN ATTRIBUTES (PREDICTORS) & RESPONSE VARIABLES (e.g. METRICS)
class AttrConfigAndVars:
@@ -558,21 +563,6 @@ def std_pred_path(dir_out: str | os.PathLike, algo: str, metric: str, dataset_id
path_pred_rslt = Path(dir_preds_ds)/Path(basename_pred_alg_ds_metr)
return path_pred_rslt

def std_Xtrain_path(dir_out_alg_ds:str | os.PathLike, dataset_id: str
) -> pathlib.PosixPath:
"""Standardize the algorithm save path
:param dir_out_alg_ds: Directory where algorithm's output stored.
:type dir_out_alg_ds: str | os.PathLike
:param metric: The metric or hydrologic signature identifier of interest
:type metric: str
:return: full save path for joblib object
:rtype: str
"""
Path(dir_out_alg_ds).mkdir(exist_ok=True,parents=True)
basename_alg_ds = f'Xtrain__{dataset_id}'
path_Xtrain = Path(dir_out_alg_ds) / Path(basename_alg_ds + '.csv')
return path_Xtrain

def std_eval_metrs_path(dir_out_viz_base: str|os.PathLike,
ds:str, metr:str
) -> pathlib.PosixPath:
@@ -781,7 +771,11 @@ def __init__(self, df: pd.DataFrame, attrs: Iterable[str], algo_config: dict,
dir_out_alg_ds: str | os.PathLike, dataset_id: str,
metr: str, test_size: float = 0.3,rs: int = 32,
test_ids = None,test_id_col:str = 'comid',
verbose: bool = False):
verbose: bool = False,
forestci: bool = False,
confidence_levels: int = 95,
mapie_alpha : float = 0.05,
bagging_ci_params: dict = None):
"""The algorithm training and evaluation class.

:param df: The combined response variable and predictor variables DataFrame.
@@ -810,6 +804,12 @@ def __init__(self, df: pd.DataFrame, attrs: Iterable[str], algo_config: dict,
:type test_id_col: str
:param verbose: Should print, defaults to False.
:type verbose: bool, optional
:param forestci: Whether to compute forestci uncertainty for the random forest, defaults to False.
:type forestci: bool, optional
:param confidence_levels: Confidence level(s), in percent, for ci calculation, defaults to 95.
:type confidence_levels: int, optional
:param mapie_alpha: alpha for MAPIE, defaults to 0.05.
:type mapie_alpha: float, optional
:param bagging_ci_params: Configuration dictionary for bagging-based uncertainty estimation.
:type bagging_ci_params: dict or None, optional
"""
# class args
self.df = df
@@ -823,6 +823,10 @@ def __init__(self, df: pd.DataFrame, attrs: Iterable[str], algo_config: dict,
self.rs = rs
self.dataset_id = dataset_id
self.verbose = verbose
self.forestci = forestci
self.confidence_levels = confidence_levels
self.mapie_alpha = mapie_alpha
self.bagging_ci_params = bagging_ci_params

# train/test split
self.X_train = pd.DataFrame()
@@ -963,7 +967,7 @@ def select_algs_grid_search(self):
# e.g. {'activation':'relu'} becomes {'activation':['relu']}
self.algo_config_grid = self.convert_to_list(self.algo_config_grid)

def calculate_rf_uncertainty(self, forest, X_train, X_test):
def calculate_forestci_uncertainty(self, forest, X_train, X_test):
Collaborator

After the code is in good order, the next task will be adding/adapting unit tests for all the new functions that have undergone changes.

Collaborator Author

Sure.

Collaborator

Heads up that this branch needs an update to the unit test test_calculate_rf_uncertainty with your latest changes.

"""
Calculate uncertainty using forestci for a Random Forest model.

@@ -976,18 +980,125 @@ def calculate_rf_uncertainty(self, forest, X_train, X_test):
:return: Confidence intervals for each prediction.
:rtype: ndarray
"""
ci = fci.random_forest_error(
# ci = fci.random_forest_error(
# forest=forest,
# X_train_shape=X_train.shape,
# X_test=X_test,
# inbag=None,
# calibrate=True,
# memory_constrained=False,
# memory_limit=None,
# y_output=0 # Change this if multi-output
# )
# return ci

# Compute standard deviation of prediction errors

confidence_levels = self.confidence_levels
ci_std = np.sqrt(fci.random_forest_error(
forest=forest,
X_train_shape=X_train.shape,
X_test=X_test,
inbag=None,
calibrate=True,
memory_constrained=False,
memory_limit=None,
y_output=0 # Change this if multi-output
)
return ci
y_output=0
))

# Compute confidence intervals for each level
ci_dict = {}
for alpha in np.atleast_1d(confidence_levels): # Accept a single level (e.g. the default 95) or a list of levels
z_value = st.norm.ppf(1 - (1 - alpha/100) / 2) # Get z-score for two-tailed CI
ci_lower = -z_value * ci_std
ci_upper = z_value * ci_std
ci_dict[f'ci_{int(alpha)}'] = {
"lower_bound": ci_lower,
"upper_bound": ci_upper
}

return ci_dict
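
The interval construction in `calculate_forestci_uncertainty` can be sketched in isolation. This is a minimal sketch, assuming the variance estimates are already available (the numbers below are made up, not forestci output) and using `statistics.NormalDist` from the standard library in place of `scipy.stats.norm` so that it runs without extra dependencies. The helper name `normal_ci_offsets` is hypothetical. As in the diff, the returned values are offsets centred on zero; absolute bounds are obtained by adding them to the point predictions.

```python
import math
from statistics import NormalDist


def normal_ci_offsets(variances, confidence_levels):
    """Return symmetric normal-theory offsets for each confidence level (in percent)."""
    std = [math.sqrt(v) for v in variances]
    ci = {}
    for level in confidence_levels:
        # Two-tailed z-score, e.g. about 1.96 for a 95% level
        z = NormalDist().inv_cdf(1 - (1 - level / 100) / 2)
        ci[f"ci_{int(level)}"] = {
            "lower_bound": [-z * s for s in std],
            "upper_bound": [z * s for s in std],
        }
    return ci


# Made-up variances for two test points (illustrative only)
offsets = normal_ci_offsets(variances=[4.0, 0.25], confidence_levels=[90, 95])
y_pred = [10.0, 3.0]
# Absolute 95% bounds: add the offsets to the point predictions
lower_95 = [p + o for p, o in zip(y_pred, offsets["ci_95"]["lower_bound"])]
upper_95 = [p + o for p, o in zip(y_pred, offsets["ci_95"]["upper_bound"])]
```

Storing offsets rather than absolute bounds keeps the saved values independent of any particular prediction, but a consumer has to remember to add the prediction back.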

def calculate_bagging_ci(self, algo_str,best_algo):
"""
Generalized function to calculate Bagging confidence intervals for any model.
"""
# algo_cfg = self.algo_config[algo_str]
algo_cfg = self.algo_config.get(algo_str, self.algo_config_grid.get(algo_str))
if algo_cfg is None:
raise KeyError(f"Algorithm {algo_str} not found in configurations.")

n_algos = self.bagging_ci_params['n_algos']
predictions = []
# Extract the model if it's inside a pipeline
if isinstance(best_algo, Pipeline):
# Try to find the model step
algo_step = None
for name, step in best_algo.named_steps.items():
if isinstance(step, (RandomForestRegressor, MLPRegressor)): # Extend if more models are used
algo_step = step
break
if algo_step is None:
raise ValueError(f"Could not find '{algo_str}' in the pipeline steps.")
else:
algo_step = best_algo # Direct model

base_algo = algo_step # Now we have the extracted model

# for _ in range(n_algos):
# X_train_resampled, y_train_resampled = resample(self.X_train, self.y_train)

# # Create a new model with the same parameters but a different random_state
# new_random_state = random.randint(1, 100)
# algo_tmp = type(base_algo)(**{**base_algo.get_params(), "random_state": new_random_state})

# algo_tmp.fit(X_train_resampled, y_train_resampled)
# predictions.append(algo_tmp.predict(self.X_test))

# Generate `n_algos` unique random states using self.rs as the seed
random.seed(self.rs)
random_states = [random.randint(1, 10000) for _ in range(n_algos)]

for rand_state in random_states:
# Resample data with a fixed random state for reproducibility
X_train_resampled, y_train_resampled = resample(
self.X_train, self.y_train, random_state=rand_state
)

# Create a new model with the same parameters but a different random_state
algo_tmp = type(base_algo)(**{**base_algo.get_params(), "random_state": rand_state})

algo_tmp.fit(X_train_resampled, y_train_resampled)
predictions.append(algo_tmp.predict(self.X_test))

predictions = np.array(predictions)
mean_pred = predictions.mean(axis=0)
std_pred = predictions.std(axis=0)
confidence_levels = self.confidence_levels #self.bagging_ci_params.get('confidence_levels')
confidence_intervals = {}

for cl in np.atleast_1d(confidence_levels): # Accept a single level or a list of levels
lower_bound, upper_bound = np.percentile(predictions, [(100 - cl) / 2, 100 - (100 - cl) / 2], axis=0)
confidence_intervals[f"confidence_level_{cl}"] = {
"lower_bound": lower_bound,
"upper_bound": upper_bound
}

if 'Uncertainty' not in self.algs_dict[algo_str]:
self.algs_dict[algo_str]['Uncertainty'] = {}

self.algs_dict[algo_str]['Uncertainty']['bagging_mean_pred'] = mean_pred
self.algs_dict[algo_str]['Uncertainty']['bagging_std_pred'] = std_pred
self.algs_dict[algo_str]['Uncertainty']['bagging_confidence_intervals'] = confidence_intervals
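
The percentile step and the seeding scheme in `calculate_bagging_ci` can be illustrated without scikit-learn. The sketch below is a stand-in under stated assumptions: the "model" is simply the mean of a bootstrap resample, `percentile` is a hypothetical helper that applies the same linear-interpolation rule as the default of `numpy.percentile`, and `random.sample` replaces the repeated `random.randint` calls so that the derived seeds cannot repeat.

```python
import random
from statistics import mean


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def bootstrap_interval(y_train, n_algos, confidence_level, seed):
    """Percentile interval from n_algos bootstrap 'models' (here: resample means)."""
    # Derive n_algos distinct seeds from one master seed for reproducibility
    seeds = random.Random(seed).sample(range(1, 10001), n_algos)
    preds = []
    for s in seeds:
        rng = random.Random(s)
        resampled = [rng.choice(y_train) for _ in y_train]  # sample with replacement
        preds.append(mean(resampled))  # stand-in for fit + predict
    tail = (100 - confidence_level) / 2
    return percentile(preds, tail), percentile(preds, 100 - tail)


# Made-up training targets (illustrative only)
y_train = [1.2, 0.7, 3.4, 2.2, 1.9, 2.8, 0.4, 1.5]
lower, upper = bootstrap_interval(y_train, n_algos=50, confidence_level=95, seed=32)
```

Calling `bootstrap_interval` twice with the same `seed` returns the same bounds, which is the property the last commit in this PR is after.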

def calculate_mapie(self):
"""Generalized function to calculate prediction uncertainty using MAPIE."""
for algo_str, algo_data in self.algs_dict.items():
algo = algo_data['algo']
mapie = MapieRegressor(algo, cv="prefit", agg_function="median")
mapie.fit(self.X_train, self.y_train)
self.algs_dict[algo_str]['mapie'] = mapie
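
For readers unfamiliar with what a prefit conformal wrapper computes, here is a conceptual sketch of split conformal prediction with absolute residuals. It is not MAPIE's API or implementation; the function name and the data are hypothetical. It assumes the residuals come from calibration data that the model did not see during training, which is the usual condition for the coverage guarantee. `calculate_mapie` passes `self.X_train` and `self.y_train` to `mapie.fit`, so that assumption may be worth checking.

```python
import math


def split_conformal_halfwidth(y_cal, y_cal_pred, alpha):
    """Half-width of a (1 - alpha) interval from held-out absolute residuals."""
    scores = sorted(abs(y - p) for y, p in zip(y_cal, y_cal_pred))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha))
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


# Hypothetical calibration targets and predictions from an already-fitted model
y_cal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y_cal_pred = [0.0] * 9
width = split_conformal_halfwidth(y_cal, y_cal_pred, alpha=0.5)

# The same fixed half-width is applied around every new prediction
y_new_pred = 4.2
interval = (y_new_pred - width, y_new_pred + width)
```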

def train_algos(self):
"""Train algorithms based on what has been defined in the algo config file

@@ -1010,20 +1121,16 @@ def train_algos(self):
)
pipe_rf = make_pipeline(rf)
pipe_rf.fit(self.X_train, self.y_train)

# --- Calculate confidence intervals ---
ci = self.calculate_rf_uncertainty(rf, self.X_train, self.X_test)

# --- Compare predictions with confidence intervals ---
self.algs_dict['rf'] = {'algo': rf,
'pipeline': pipe_rf,
'type': 'random forest regressor',
'metric': self.metric,
'ci': ci}
'Uncertainty': {}
}

if 'mlp' in self.algo_config: # MULTI-LAYER PERCEPTRON


if self.verbose:
print(f" Performing Multilayer Perceptron Training")
mlpcfg = self.algo_config['mlp']
@@ -1038,10 +1145,13 @@ def train_algos(self):
max_iter=mlpcfg.get('max_iter', 200))
pipe_mlp = make_pipeline(StandardScaler(),mlp)
pipe_mlp.fit(self.X_train, self.y_train)

self.algs_dict['mlp'] = {'algo': mlp,
'pipeline': pipe_mlp,
'type': 'multi-layer perceptron regressor',
'metric': self.metric}
'metric': self.metric,
'Uncertainty': {}
}

def train_algos_grid_search(self):
"""Train algorithms using GridSearchCV based on the algo config file.
@@ -1068,16 +1178,13 @@ def train_algos_grid_search(self):

grid_rf.fit(self.X_train, self.y_train)

# calculate rf confidence intervals from the best rf estimator
ci = self.calculate_rf_uncertainty(grid_rf.best_estimator_.named_steps['randomforestregressor'],
self.X_train, self.X_test)

self.algs_dict['rf'] = {'algo': grid_rf.best_estimator_.named_steps['randomforestregressor'],
'pipeline': grid_rf.best_estimator_,
'gridsearchcv': grid_rf,
'type': 'random forest regressor',
'metric': self.metric,
'ci': ci}
'Uncertainty': {}
}

if 'mlp' in self.algo_config_grid: # MULTI-LAYER PERCEPTRON
if self.verbose:
@@ -1098,7 +1205,9 @@ def train_algos_grid_search(self):
self.algs_dict['mlp'] = {'algo': grid_mlp.best_estimator_,
'pipeline': grid_mlp,
'type': 'multi-layer perceptron regressor',
'metric': self.metric}
'metric': self.metric,
'Uncertainty': {}
}

def predict_algos(self) -> dict:
""" Make predictions with trained algorithms
@@ -1118,9 +1227,27 @@ def predict_algos(self) -> dict:
print(f" Generating predictions for {type_algo} algorithm.")

y_pred = pipe.predict(self.X_test)
self.preds_dict[k] = {'y_pred': y_pred,
'type': v['type'],
'metric': v['metric']}
if 'mapie' in v:
y_test_pred, y_test_pis = v['mapie'].predict(self.X_test, alpha=self.mapie_alpha)

# Rename rows
row_labels = ['lower_limit', 'upper_limit']

# Rename columns based on mapie_alpha values
col_labels = [f'alpha_{alpha:.2f}' for alpha in np.atleast_1d(self.mapie_alpha)] # Works for a single alpha or a list

# Convert to DataFrame
y_pis_list = [pd.DataFrame(y_test_pis[i], index=row_labels, columns=col_labels) for i in range(y_test_pis.shape[0])]

self.preds_dict[k] = {'y_pred': y_pred,
'y_pis': y_pis_list,
'type': v['type'],
'metric': v['metric']}
else:
self.preds_dict[k] = {'y_pred': y_pred,
'type': v['type'],
'metric': v['metric']}

return self.preds_dict
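
The relabelling of the interval array in `predict_algos` can be sketched with plain nested lists. This assumes the intervals arrive with shape (n_samples, 2, n_alpha), with index 0 on the middle axis holding lower limits and index 1 holding upper limits; the helper name `label_intervals` and the numbers are hypothetical, and dictionaries stand in for the per-sample DataFrames built in the diff.

```python
def label_intervals(y_pis, alphas):
    """Attach 'lower_limit'/'upper_limit' and 'alpha_X.XX' labels to each sample."""
    labeled = []
    for sample in y_pis:
        lower, upper = sample  # each holds one value per alpha
        labeled.append({
            f"alpha_{a:.2f}": {"lower_limit": lo, "upper_limit": up}
            for a, lo, up in zip(alphas, lower, upper)
        })
    return labeled


# Two samples, two alpha values (made-up numbers)
y_pis = [
    [[1.0, 1.5], [3.0, 2.5]],
    [[4.0, 4.4], [6.0, 5.6]],
]
labeled = label_intervals(y_pis, alphas=[0.05, 0.32])
```

Indexing by label, for example `labeled[0]["alpha_0.05"]["upper_limit"]`, avoids relying on positional conventions when the predictions are read back later.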

def evaluate_algos(self) -> dict:
@@ -1161,16 +1288,14 @@ def save_algos(self):
# write trained algorithm
# joblib.dump(self.algs_dict[algo]['pipeline'], path_algo)

# --- Modified part: Combine rf model and ci into a single dictionary ---
pipeline_with_ci = {
'pipe': self.algs_dict[algo]['pipeline'], # The trained model
'confidence_intervals': self.algs_dict[algo].get('ci',None) # The ci object if it exists
# Save pipeline and metadata in a dictionary
pipeline_data = {
'pipeline': self.algs_dict[algo]['pipeline'], # The trained model pipeline
'X_train_shape': self.X_train.shape, # Store the shape of X_train
'Uncertainty': self.algs_dict[algo]['Uncertainty']
}

# print(self.algs_dict[algo].get('ci'))

# Save the combined pipeline (model + ci) using joblib
joblib.dump(pipeline_with_ci, path_algo)

joblib.dump(pipeline_data, path_algo) # Save pipeline + X_train shape

self.algs_dict[algo]['file_pipe'] = str(path_algo.name)
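
Because this hunk renames the saved key from `'pipe'` to `'pipeline'` and adds `'X_train_shape'` and `'Uncertainty'`, code that reads these files may want to tolerate both layouts. The sketch below is one possible reader, with assumptions: `pickle` is used as a stand-in for `joblib` so the example runs on the standard library alone, `load_pipeline_data` is a hypothetical helper, and a placeholder string takes the place of a fitted pipeline.

```python
import pickle
import tempfile
from pathlib import Path


def load_pipeline_data(path):
    """Load a saved artifact and normalise the old and new key layouts."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return {
        # New layout uses 'pipeline'; the previous layout used 'pipe'
        "pipeline": data.get("pipeline", data.get("pipe")),
        # Absent in files written before this change
        "X_train_shape": data.get("X_train_shape"),
        "Uncertainty": data.get("Uncertainty", {}),
    }


with tempfile.TemporaryDirectory() as tmp:
    path_algo = Path(tmp) / "algo_example.pkl"
    new_layout = {
        "pipeline": "fitted-pipeline-placeholder",
        "X_train_shape": (120, 8),
        "Uncertainty": {},
    }
    with open(path_algo, "wb") as f:
        pickle.dump(new_layout, f)
    loaded = load_pipeline_data(path_algo)

    # A file in the previous layout still loads, with defaults for the new keys
    old_path = Path(tmp) / "algo_old.pkl"
    with open(old_path, "wb") as f:
        pickle.dump({"pipe": "fitted-pipeline-placeholder", "confidence_intervals": None}, f)
    loaded_old = load_pipeline_data(old_path)
```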

@@ -1206,6 +1331,38 @@ def train_eval(self):
if self.algo_config: # Just run a single simulation for these algos
self.train_algos()

# Calculate forestci uncertainty if enabled
# Determine the best Random Forest model
if self.grid_search_algs and 'gridsearchcv' in self.algs_dict['rf']:
best_rf_algo = self.algs_dict['rf']['gridsearchcv'].best_estimator_.named_steps['randomforestregressor']
else:
best_rf_algo = self.algs_dict['rf']['algo']
# Compute forestci uncertainty with the best RF model
self.algs_dict['rf']['Uncertainty']['forestci'] = self.calculate_forestci_uncertainty(
best_rf_algo, np.array(self.X_train), np.array(self.X_test)
)

# Calculate Bagging uncertainty if enabled
# for algo_str in self.algo_config.keys():
# if self.bagging_ci_params.get('n_algos', None):
# self.calculate_bagging_ci(algo_str)
if (self.bagging_ci_params or {}).get('n_algos', None): # bagging_ci_params defaults to None
for algo_dict in [self.algo_config, self.algo_config_grid]: # Iterate over both configurations
for algo_str in algo_dict.keys(): # algo_str is the correct algorithm name
# Select the best model if Grid Search was performed
best_algo = (
self.algs_dict[algo_str]['gridsearchcv'].best_estimator_
if self.grid_search_algs and 'gridsearchcv' in self.algs_dict[algo_str]
else self.algs_dict[algo_str]['algo']
)

# Compute Bagging CI (pass correct algorithm name + best model)
self.calculate_bagging_ci(algo_str, best_algo)

# --- Calculate prediction intervals using MAPIE if enabled ---
if getattr(self, 'mapie_alpha', None):
self.calculate_mapie()

# Make predictions (aka validation)
self.predict_algos()
