
feat(l2g): better l2g training, evaluation, and integration #576

Merged 36 commits into dev on Jun 24, 2024
Conversation

@ireneisdoomed (Contributor) commented Apr 17, 2024

✨ Context

Rewrite of L2G training and prediction to use scikit-learn instead of pyspark.ml.

The main points that made me do this were:

  • Evaluation of the model was really slow. Even though I was persisting the L2G feature matrix and the model, every interaction with the model triggered a retraining. This led to runtimes of ~2h if we wanted to run the pipeline end to end.
  • Spark is not necessary for training: the model is simple. Switching to scikit-learn made training and prediction possible in seconds.
  • scikit-learn brings many niceties: broader community support, more control, native integration with W&B (the panel is more informative now), and a model that is easier to share and can be uploaded to the Hugging Face Hub.

Training end to end is now 18 mins - job
Predicting is also 18 mins - job

This reassures me that annotating the features accounts for practically the entire runtime of the pipeline, which is consistent with my tests.
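As an illustration of the switch, a minimal scikit-learn training loop on a toy feature matrix might look like the sketch below. This is not the gentropy API; the column names and values are illustrative, not the real feature schema.

```python
# Minimal sketch (not the gentropy API): training an L2G-style classifier
# with scikit-learn instead of pyspark.ml. Column names and values are
# illustrative, not the real feature schema.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: one row per (locus, gene) pair.
fm = pd.DataFrame(
    {
        "distanceToTss": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.95, 0.05],
        "eqtlColocalisation": [1.0, 0.0, 1.0, 0.0, 0.5, 0.0, 0.8, 0.1],
        "goldStandardSet": [1, 0, 1, 0, 1, 0, 1, 0],  # binary label
    }
)

X = fm.drop(columns=["goldStandardSet"])
y = fm["goldStandardSet"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
```

Training a model this size completes in well under a second, which is what makes feature annotation the dominant cost.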

🛠 What does this PR implement

I've deprecated almost the entire previous implementation, so it's easier to describe what this PR includes. The L2G workflow, however, is very similar to the previous one; this is a change in the implementation that optimises the process.

  • LocusToGeneStep.run_train creates an L2GModel based on a GradientBoostingClassifier. The model is trained, exported to the Hugging Face Hub, and saved locally.
  • LocusToGeneStep.run_predict downloads the model from the Hub by default and extracts scores.
  • The W&B and Hugging Face secrets are now accessed from the Secret Manager, so this part is fully integrated into the pipeline.
    • Added the copy_to_gcs util to move the local L2G model file to a GCS destination.
  • The only particularity in LocusToGeneTrainer is the addition of hyperparameter_tuning. This function uses Weights & Biases Sweeps to sweep over all parameter combinations in a grid, train a model for each, evaluate the results, and upload them as a group. For example:
    (Screenshot of the W&B sweep results; interestingly, the metrics are the same across runs.) This is not part of the pipeline, only useful during development.
  • L2GFeatureMatrix has 2 new attributes: fixed_cols, to complement the columns that are not features, and mode. Both are connected: in train mode the fixed columns are studyLocusId, geneId, and the target column (goldStandardSet); in predict mode, we don't have the latter.
  • LocusToGeneModel gains 3 attributes: hyperparameters, training_data (to store the input feature matrix and log it to W&B), and label_encoder. features_list is no longer necessary.
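A grid sweep of the kind described above can be sketched in a self-contained way. The PR runs it via Weights & Biases Sweeps; GridSearchCV is used here as a locally runnable analogue on synthetic data, so none of the names below are the gentropy API.

```python
# Locally runnable analogue of a hyperparameter grid sweep. The PR runs
# this via Weights & Biases Sweeps; GridSearchCV is used here so the
# sketch is self-contained, with synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Every combination in the grid is trained and evaluated, mirroring a
# W&B grid sweep over GradientBoostingClassifier hyperparameters.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)  # best combination found on the grid
```

The W&B version adds what GridSearchCV lacks: each run is logged and grouped in the dashboard, which is how the sweep results in the screenshot were produced.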

🙈 Missing

  • Bug in the feature matrix logging: the feature labels are not shown.
  • Cross-validation, for a more accurate evaluation of the model.
  • Tests for the training functions.
  • Once the model is public, we could upload the predictions dataset to the HF Hub to link the dataset with the model.
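The missing cross-validation step could look something like this sketch: k-fold CV with scikit-learn on synthetic data, not the real gentropy feature matrix.

```python
# Sketch of the missing cross-validation step: k-fold CV with
# scikit-learn on synthetic data (not the gentropy feature matrix).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=4, random_state=1)
clf = GradientBoostingClassifier(random_state=1)

# 5-fold cross-validated F1 is a less optimistic estimate than a single split.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(scores.mean())
```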

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@github-actions github-actions bot removed the airflow label Jun 17, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jun 17, 2024
@ireneisdoomed ireneisdoomed changed the title perf(l2g): change l2g implementation to scikit-learn feat(l2g): better l2g training, evaluation, and integration Jun 18, 2024
@ireneisdoomed ireneisdoomed marked this pull request as ready for review June 18, 2024 10:47
Files with review comments:
  • pyproject.toml
  • src/gentropy/common/utils.py
  • src/gentropy/config.py
  • src/gentropy/dataset/l2g_feature_matrix.py
  • src/gentropy/dataset/l2g_prediction.py
  • src/gentropy/l2g.py
  • tests/gentropy/method/test_locus_to_gene.py
@ireneisdoomed ireneisdoomed merged commit 0d9160f into dev Jun 24, 2024
4 checks passed
@ireneisdoomed ireneisdoomed deleted the il-3263 branch June 24, 2024 16:21
project-defiant pushed a commit that referenced this pull request Jul 12, 2024
* chore: checkpoint

* chore: checkpoint

* chore: deprecate spark evaluator

* chore: checkpoint

* chore: resolve conflicts with dev

* chore: resolve conflicts with dev

* chore(model): add parameters class property

* feat: add module to export model to hub

* refactor: make model agnostic of features list

* chore: add wandb to gitignore

* feat: download model from hub

* chore(model): adapt predict method

* feat(trainer): add hyperparameter tuning

* chore: deprecate trainer tests

* refactor: modularise step

* feat: download model from hub by default

* fix: convert omegaconfig defaults to python objects

* fix: write serialised model to disk and then upload to gcs

* fix(matrix): drop goldStandardSet when in predict mode

* chore: pass token to access private model

* chore: pass token to access private model

* fix: pass right schema

* chore: pre-commit auto fixes [...]

* chore: fix mypy issues

* build: remove xgboost

* chore: merge

* chore: pre-commit auto fixes [...]

* chore: address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels: Dataset, documentation, Feature, Method, size-XL, Step
2 participants