Updated Catboost reranking with train, val, test #148
)

X_train = train_df.drop(["correct", "question"], axis=1)
X_test = test_df.drop(["correct", "question"], axis=1)
X_test = val_df.drop(["correct", "question"], axis=1)
Please update the names to reflect their actual meaning. It should be "X_validation" or "X_val", both here and throughout the "main" section.
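A minimal sketch of the naming the comment asks for, assuming the three splits are pandas DataFrames that share the `correct` and `question` columns. The toy data, the `feature_a` column, and the `split_features_target` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd


def make_split(n):
    """Build a toy split; "feature_a" is a made-up feature column."""
    return pd.DataFrame(
        {
            "question": [f"q{i}" for i in range(n)],
            "feature_a": [float(i) for i in range(n)],
            "correct": [i % 2 for i in range(n)],
        }
    )


train_df, val_df, test_df = make_split(6), make_split(3), make_split(3)

# Columns that must not be fed to the model as features.
NON_FEATURE_COLS = ["correct", "question"]


def split_features_target(df):
    """Return (X, y): features without label/question text, and the label."""
    return df.drop(NON_FEATURE_COLS, axis=1), df["correct"]


# Each variable name states which split it holds, so nothing is overwritten.
X_train, y_train = split_features_target(train_df)
X_val, y_val = split_features_target(val_df)
X_test, y_test = split_features_target(test_df)
```

Routing every split through one helper keeps the dropped columns identical across train, validation, and test, and makes an accidental reassignment of `X_test` harder to miss.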
@@ -192,7 +189,6 @@ def find_weight(target):
text_features=text_features,
feature_names=list(X_test),
embedding_features=emb_features,
weight=find_weight(y_test),
)

# hyper-params tuning
This is a rather unusual search space for a grid search. Please update it to follow common practice.
For example, something like this should be fine: https://www.projectpro.io/recipes/find-optimal-parameters-for-catboost-using-gridsearchcv-for-classification
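As a rough illustration of a more conventional search space, the sketch below defines a small grid over commonly tuned CatBoost parameters and enumerates the combinations it implies. The specific values are assumptions chosen for illustration, not tuned recommendations, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Hypothetical search space over commonly tuned CatBoost parameters.
# The values are illustrative only.
param_grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 8],
    "l2_leaf_reg": [1, 3, 5],
    "iterations": [500, 1000],
}


def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
print(len(candidates))  # 2 * 3 * 3 * 2 = 36 combinations
```

A dictionary of this shape is, as far as I know, what CatBoost's own `grid_search` method and scikit-learn's `GridSearchCV` expect as their parameter grid. Counting the combinations first helps keep the cost of the search in view.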
test = test.drop("tfidf_vector", axis=1)
else:  # with tfidf
embedding_features.append("tfidf_vector")
train = train.drop("tfidf_vector", axis=1)
Again, if you simply drop it at the preparation stage, it will not be necessary to drop it here. As written, the logic is difficult to follow.
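One way to read this suggestion, sketched under the assumption that the data is a pandas DataFrame with an optional `tfidf_vector` column: decide once, in the preparation step, whether the column is kept. The `prepare_features` function and its `use_tfidf` flag are hypothetical names.

```python
import pandas as pd


def prepare_features(df, use_tfidf):
    """Hypothetical preparation step: drop "tfidf_vector" once if it is unused."""
    if not use_tfidf and "tfidf_vector" in df.columns:
        df = df.drop("tfidf_vector", axis=1)
    return df


# Toy data standing in for the real dataset.
raw = pd.DataFrame(
    {
        "question": ["q0", "q1"],
        "tfidf_vector": [[0.1, 0.2], [0.3, 0.4]],
        "correct": [1, 0],
    }
)

train = prepare_features(raw, use_tfidf=False)

# Downstream code derives the embedding feature list from the columns that
# actually exist, so it needs no extra if/else or drop calls.
embedding_features = [c for c in ["tfidf_vector"] if c in train.columns]
```

With the decision made in one place, the training code can treat the frame it receives as final.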
We use these params for the grid search
I added updated Catboost training & reranking. Training now correctly uses only train & val and does not touch test. This updated dataset is pushed to HuggingFace (T5-Large, T5-XL). I've also fixed the scaling issue (`fit_transform` & `transform`). Furthermore, training no longer uses the TDA TFIDF vectors.

As usual, the jupyter notebook can be converted to a `py` script to train and rerank. Please check inside the notebook for configuration parameters.

Alternatively, one could train via the native script, then load the trained model into the notebook above to rerank.
Load the model saved by the training run above, place its path in the notebook, and rerank.
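The scaling fix mentioned in the description (`fit_transform` on train, `transform` elsewhere) can be sketched as follows. This assumes scikit-learn's `StandardScaler`; the PR does not say which scaler is used, and the random arrays are placeholders for the real feature matrices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrices; the real ones come from the dataset splits.
rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_val = rng.normal(5.0, 2.0, size=(20, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

scaler = StandardScaler()

# Learn the mean and standard deviation from the training split only.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics for validation and test, so no information
# from those splits leaks into preprocessing.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```

Calling `fit_transform` again on validation or test data would rescale each split with its own statistics, which makes the splits inconsistent with what the model saw during training.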