Updated Catboost reranking with train, val, test #148

highly0 · 2024-04-12T10:00:46Z

I've added the updated CatBoost training and reranking. Training now correctly uses only the train and val splits and does not touch test. The updated dataset is pushed to HuggingFace (T5-Large, T5-XL). I've also fixed the scaling issue (fit_transform & transform). Furthermore, training no longer uses the TDA TFIDF vectors.
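For reference, the fit_transform/transform distinction mentioned above can be illustrated with a minimal sketch. It assumes the usual convention that scaling statistics are learned on the train split only and then reused for val and test. `ColumnScaler` and the toy rows are hypothetical stand-ins written in plain Python, not the code from this PR or a real library class.

```python
from statistics import fmean, pstdev


class ColumnScaler:
    """Hypothetical stand-in for a standard scaler: per-column mean and std."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [fmean(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        # Apply the stored statistics; nothing is re-estimated here.
        return [
            [(value - mean) / std
             for value, mean, std in zip(row, self.means, self.stds)]
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


# Toy feature rows (assumed for illustration only).
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
val_rows = [[2.0, 40.0]]
test_rows = [[4.0, 20.0]]

scaler = ColumnScaler()
train_scaled = scaler.fit_transform(train_rows)  # statistics learned on train only
val_scaled = scaler.transform(val_rows)          # reuses the train statistics
test_scaled = scaler.transform(test_rows)        # reuses the train statistics
```

Calling `fit_transform` on val or test instead would re-estimate the mean and standard deviation from those splits, which leaks their statistics into the features.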

As usual, the Jupyter notebook can be converted to a Python script for training and reranking. Please check the configuration parameters inside the notebook.

jupyter nbconvert --to script catboost_features.ipynb
python3 catboost_features.py

Alternatively, train with the standalone script, then load the trained model into the notebook above to rerank:

python3 train_catboost_regressor.py run_name

Load the model saved by the training run above: put its path in the notebook and rerank.

experiments/subgraphs_reranking/graph_features/graph_features_preparation.py

mdsalnikov · 2024-04-12T10:30:51Z

experiments/subgraphs_reranking/graph_features/train_catboost_regressor.py

    )

    X_train = train_df.drop(["correct", "question"], axis=1)
-    X_test = test_df.drop(["correct", "question"], axis=1)
+    X_test = val_df.drop(["correct", "question"], axis=1)


Please update the names to reflect their actual meaning. It should be "X_validation" or "X_val", both here and throughout the "main" section.

mdsalnikov · 2024-04-12T10:34:01Z

experiments/subgraphs_reranking/graph_features/train_catboost_regressor.py

@@ -192,7 +189,6 @@ def find_weight(target):
        text_features=text_features,
        feature_names=list(X_test),
        embedding_features=emb_features,
-        weight=find_weight(y_test),
    )

    # hyper-params tuning


This is a somewhat unusual search space for grid search. Please update it to follow best practices.
Something like this example should be fine: https://www.projectpro.io/recipes/find-optimal-parameters-for-catboost-using-gridsearchcv-for-classification

mdsalnikov · 2024-04-12T10:36:27Z

experiments/subgraphs_reranking/graph_features/train_catboost_regressor.py

-        test = test.drop("tfidf_vector", axis=1)
-    else:  # with tfidf
-        embedding_features.append("tfidf_vector")
+    train = train.drop("tfidf_vector", axis=1)


Again, if you simply drop it at the preparation stage, there is no need to drop it here. As written, this is difficult to follow.

Yegor5 · 2024-04-15T09:42:59Z

We use these params for grid search:

params = {
    "learning_rate": list(np.linspace(0.03, 0.3, 5)),
    "depth": [4, 6, 8, 10],
    "iterations": [2000, 3000, 4000],
}
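To get a feel for the size of this search space, here is a rough sketch that expands the same grid in plain Python. The `linspace` helper is a hypothetical stand-in for `np.linspace`, included only to keep the snippet dependency-free. How the grid is actually passed to the search routine in the training script is not shown here.

```python
from itertools import product


def linspace(start, stop, num):
    """Evenly spaced values including both endpoints (stand-in for np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


params = {
    "learning_rate": linspace(0.03, 0.3, 5),
    "depth": [4, 6, 8, 10],
    "iterations": [2000, 3000, 4000],
}

# Expand the grid into one keyword dictionary per candidate configuration.
grid = [dict(zip(params, values)) for values in product(*params.values())]
n_candidates = len(grid)

print(n_candidates)
print([round(lr, 4) for lr in params["learning_rate"]])
```

The grid has 5 × 4 × 3 = 60 candidates, so with k-fold cross-validation the number of model fits is 60 times the number of folds, each at 2000 to 4000 iterations.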

added updated catboost with correct train,val, test

2f42ee1

highly0 requested review from mdsalnikov and DmiitriyJarosh April 12, 2024 10:00

mdsalnikov requested changes Apr 12, 2024

highly0 closed this Apr 19, 2024
