Updated Catboost reranking with train, val, test #148

Closed
wants to merge 1 commit into from

Collaborator

@highly0 highly0 commented Apr 12, 2024

I've updated the CatBoost training and reranking. Training now uses only the train and val splits and does not touch test. The updated dataset is pushed to HuggingFace (T5-Large, T5-XL). I've also fixed the scaling issue (fit_transform & transform). In addition, training no longer uses the TDA TFIDF vectors.
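For readers unfamiliar with the fit_transform / transform distinction, here is a minimal sketch. It assumes the scaling issue was about fitting the scaler statistics on train only and reusing them for val and test; the PR does not spell this out. `StandardScalerSketch` and the toy arrays are hypothetical stand-ins, not the notebook's code.

```python
import numpy as np


class StandardScalerSketch:
    """Hypothetical minimal scaler mirroring the fit_transform / transform pattern."""

    def fit_transform(self, X):
        # Learn mean and std from the data passed in (train only), then scale it.
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0.0] = 1.0  # avoid division by zero on constant columns
        return (X - self.mean_) / self.std_

    def transform(self, X):
        # Reuse the statistics learned on train; nothing is re-fitted here.
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_


# Toy data (assumed for illustration only).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val = np.array([[2.0, 40.0]])
X_test = np.array([[4.0, 0.0]])

scaler = StandardScalerSketch()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_val_scaled = scaler.transform(X_val)          # apply train statistics
X_test_scaled = scaler.transform(X_test)        # apply train statistics
```

Calling fit_transform on val or test instead would leak their statistics into preprocessing, which is the usual failure mode this pattern avoids.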

As usual, the Jupyter notebook can be converted to a Python script to train and rerank. Please check the notebook for the configuration parameters.

jupyter nbconvert --to script catboost_features.ipynb
python3 catboost_features.py

Alternatively, you can train with the native script and then load the trained model into the notebook above to rerank:

python3 train_catboost_regressor.py run_name

Load the model saved by the training run above, set its path in the notebook, and rerank.

)

X_train = train_df.drop(["correct", "question"], axis=1)
X_test = test_df.drop(["correct", "question"], axis=1)
X_test = val_df.drop(["correct", "question"], axis=1)
Collaborator

Please update the names to reflect their actual meaning. It should be "X_validation" or "X_val", both here and throughout the "main" section.
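A minimal sketch of the naming being asked for, assuming the three splits are pandas DataFrames with the `correct` and `question` columns shown in the diff. The toy frames and the `score` and `non_feature_cols` names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/val/test splits.
toy = {"question": ["q1", "q2"], "correct": [1, 0], "score": [0.9, 0.1]}
train_df = pd.DataFrame(toy)
val_df = pd.DataFrame(toy)
test_df = pd.DataFrame(toy)

non_feature_cols = ["correct", "question"]

# One variable per split, named after the split it actually holds.
X_train = train_df.drop(non_feature_cols, axis=1)
X_val = val_df.drop(non_feature_cols, axis=1)
X_test = test_df.drop(non_feature_cols, axis=1)

y_train = train_df["correct"]
y_val = val_df["correct"]
y_test = test_df["correct"]
```

With distinct names, the validation features no longer overwrite the test features, and later uses such as `feature_names=list(X_val)` state which split they refer to.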

@@ -192,7 +189,6 @@ def find_weight(target):
text_features=text_features,
feature_names=list(X_test),
embedding_features=emb_features,
weight=find_weight(y_test),
)

# hyper-params tuning
Collaborator

This is a rather unusual search space for a grid search. Please update it according to best practices.
For example, something like this should be fine: https://www.projectpro.io/recipes/find-optimal-parameters-for-catboost-using-gridsearchcv-for-classification

test = test.drop("tfidf_vector", axis=1)
else: # with tfidf
embedding_features.append("tfidf_vector")
train = train.drop("tfidf_vector", axis=1)
Collaborator

Again, if you simply drop it at the preparation stage, there will be no need to drop it here. As written, the logic is difficult to follow.
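One way this could look, as a sketch only: a single preparation helper decides whether `tfidf_vector` stays, so the training code never has to drop it again. `prepare_split`, `use_tfidf`, and the `question_emb` column are hypothetical names, not taken from the PR.

```python
import pandas as pd


def prepare_split(df, use_tfidf=False):
    """Hypothetical helper: keep tfidf_vector only when use_tfidf is set."""
    if use_tfidf:
        return df.copy()
    # errors="ignore" makes the call safe if the column was already removed.
    return df.drop(columns=["tfidf_vector"], errors="ignore")


# Toy frame (assumed for illustration only).
raw = pd.DataFrame(
    {
        "question_emb": [[0.1, 0.2], [0.3, 0.4]],
        "tfidf_vector": [[1.0, 0.0], [0.0, 1.0]],
        "correct": [1, 0],
    }
)

use_tfidf = False
train = prepare_split(raw, use_tfidf=use_tfidf)

# The feature list is derived from the same flag, in one place.
embedding_features = ["question_emb"]
if use_tfidf:
    embedding_features.append("tfidf_vector")
```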

Collaborator

Yegor5 commented Apr 15, 2024

We use these params for grid search:

params = { "learning_rate": list(np.linspace(0.03, 0.3, 5)), "depth": [4, 6, 8, 10], "iterations": [2000, 3000, 4000] }
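To get a feel for the size of this search space, the grid can be expanded explicitly. This is a sketch that only enumerates the combinations; it does not train anything, and the `grid` and `n_candidates` names are introduced here for illustration.

```python
from itertools import product

import numpy as np

params = {
    "learning_rate": list(np.linspace(0.03, 0.3, 5)),
    "depth": [4, 6, 8, 10],
    "iterations": [2000, 3000, 4000],
}

# Cartesian product of all value lists: one dict per candidate configuration.
grid = [dict(zip(params, values)) for values in product(*params.values())]
n_candidates = len(grid)  # 5 learning rates * 4 depths * 3 iteration counts
```

Each candidate is a full model fit, and cross-validation multiplies that by the number of folds, so the grid size is worth checking before launching the search.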

@highly0 highly0 closed this Apr 19, 2024