
Support model uploads #176

Merged: 1 commit merged into crowdcent:master on Sep 20, 2024

Conversation

jmrichardson
Contributor

Reference #173

Not sure of the best way to create a pytest for this, since it requires API credentials.
Did not update the main readme.

Local tests work for me without issue.

@jrosenfeld13
Collaborator

This looks great, nice addition and thanks! I have one slight concern: many errors may crop up around dependency issues. Model uploads only work with models trained on a very specific set of requirements and runtimes (https://github.com/numerai/numerai-predict/blob/master/requirements.txt), and we don't want to limit numerblox to those versions of many libraries.

Since we view numerblox as a tool for power users, those errors aren't a good reason to exclude this great PR from the library, but we may want to add documentation and possibly assertions for version requirements that match those needed by numerai-predict (https://github.com/numerai/numerai-predict/tree/master). It also looks like there may be good ways to test locally to ensure pkl compatibility with model uploads, and there may be interesting ways to integrate that.
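
For illustration, a local check could be as small as the sketch below (hedged: `check_pickle_roundtrip` is a hypothetical helper, not part of numerai-predict or this PR); it round-trips the model through pickle and smoke-tests `predict` on a dummy frame:

```python
# Hypothetical local pkl-compatibility smoke test (illustrative sketch,
# not the numerai-predict harness): serialize the model, reload it, and
# confirm it still predicts on a frame shaped like the live data.
import pickle

import numpy as np
import pandas as pd

def check_pickle_roundtrip(model, n_features: int = 5) -> bool:
    restored = pickle.loads(pickle.dumps(model))
    dummy = pd.DataFrame(
        np.random.rand(10, n_features),
        columns=[f"feature_{i}" for i in range(n_features)],
    )
    preds = restored.predict(dummy)
    return len(preds) == len(dummy)
```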

@jmrichardson
Contributor Author

Hi @jrosenfeld13

Glad to support this great project. I will be traveling for the next week and won't be able to update until I return. Feel free to make any changes and merge; otherwise, I will update the documentation when I return.

With regard to assertions, please let me know what you were thinking. To your point, I didn't want to make it too restrictive, as the requirements could change over time.

On a separate note, it would be great to have a walk-forward method that enables model retraining over time/eras. This would allow ensembles to be trained, or metrics to be updated, for adaptive models or existing ensembles. Is this something on your roadmap? If not, I am happy to work on that feature.

@jrosenfeld13
Collaborator

> With regard to assertions, please let me know what you were thinking. To your point, I didn't want to make it too restrictive, as the requirements could change over time.

I'm not sure I have a particular solution in mind. Given that numerai-predict is quite restrictive, it seems like making ModelUpload at least as restrictive makes sense; otherwise it won't work anyway. The assertions/version checks can always reference the ground truth here too: https://github.com/numerai/numerai-predict/blob/master/requirements.txt. Maybe @CarloLepelaars would have some good ideas on how to implement something here.
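
Purely as a sketch of the shape such a check could take (the pins below are placeholders, not the real requirements; the authoritative list stays in numerai-predict's requirements.txt, and `assert_numerai_predict_compat` is a hypothetical name):

```python
# Sketch of a version assertion before upload (placeholder pins only;
# the authoritative list is numerai-predict's requirements.txt).
from importlib.metadata import version

REQUIRED_VERSIONS = {
    "numpy": "1.26.4",   # placeholder, not the real pin
    "pandas": "2.2.1",   # placeholder, not the real pin
}

def assert_numerai_predict_compat() -> None:
    for package, pinned in REQUIRED_VERSIONS.items():
        installed = version(package)
        if installed != pinned:
            raise AssertionError(
                f"{package}=={installed} does not match the "
                f"numerai-predict pin {pinned}; the uploaded pkl may fail."
            )
```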

> On a separate note, it would be great to have a walk-forward method that enables model retraining over time/eras. This would allow ensembles to be trained, or metrics to be updated, for adaptive models or existing ensembles. Is this something on your roadmap? If not, I am happy to work on that feature.

CrossValEstimator might be what you are looking for: https://crowdcent.github.io/numerblox/meta/#crossvalestimator. You can pass in whatever sklearn-compatible CV splitter you'd like to define the time/eras that will be retrained on. And of course, if you don't like the behavior of CrossValEstimator, you can always roll your own with more custom features to your liking.
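
As a rough sketch, usage could look like this (keyword names assumed from the linked docs page; check the docs for the exact signature, and `X_train`/`y_train`/`X_test` are placeholders for era-indexed data):

```python
# Rough sketch of CrossValEstimator with a time-based splitter (argument
# names assumed from the linked docs; verify against the actual API).
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

from numerblox.meta import CrossValEstimator

cv_model = CrossValEstimator(
    estimator=Ridge(),
    cv=TimeSeriesSplit(n_splits=5),  # any sklearn-compatible splitter
)
cv_model.fit(X_train, y_train)           # fits one estimator per split
fold_preds = cv_model.transform(X_test)  # per-fold predictions
```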

@jmrichardson
Contributor Author

jmrichardson commented Sep 17, 2024

> I'm not sure I have a particular solution in mind. Given that numerai-predict is quite restrictive, it seems like making ModelUpload at least as restrictive makes sense; otherwise it won't work anyway. The assertions/version checks can always reference the ground truth here too: https://github.com/numerai/numerai-predict/blob/master/requirements.txt. Maybe @CarloLepelaars would have some good ideas on how to implement something here.

Sounds good. However, there would need to be inspections of the pipeline to make sure the package requirements are met, which could get tricky.
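
To make that concrete, the inspection might look something like this hypothetical helper (not part of the PR): it maps each pipeline step to the distribution that provides it, so installed versions can be compared against the pins:

```python
# Hypothetical pipeline inspection: map each step of an sklearn Pipeline
# to the distribution that provides it and record the installed version.
from importlib.metadata import PackageNotFoundError, version

from sklearn.pipeline import Pipeline

def pipeline_package_versions(pipeline: Pipeline) -> dict:
    versions = {}
    for _, step in pipeline.steps:
        top_module = type(step).__module__.split(".")[0]
        # Module and distribution names usually match; sklearn is the
        # common exception (distribution name "scikit-learn").
        dist = "scikit-learn" if top_module == "sklearn" else top_module
        try:
            versions[dist] = version(dist)
        except PackageNotFoundError:
            versions[dist] = "unknown"
    return versions
```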

> CrossValEstimator might be what you are looking for: https://crowdcent.github.io/numerblox/meta/#crossvalestimator. You can pass in whatever sklearn-compatible CV splitter you'd like to define the time/eras that will be retrained on. And of course, if you don't like the behavior of CrossValEstimator, you can always roll your own with more custom features to your liking.

Unless I misunderstand its purpose, I don't think CrossValEstimator is quite what I am suggesting. I did a poor job of describing the workflow, so let me try again:

The idea is to simulate what would have happened if models were continuously retrained with new data, rather than using the metrics from a single model over the entire test set. For example, if we train a model on the train dataset and score it over the test set, the model will tend to lose performance as regimes shift over time. However, if we retrain the model for each new era (or some number of eras) with new data from the test set, the model adapts over time and the predictions reflect that adaptability.

In my case, I want to train my models up to the latest era in the dataset to make sure they are up to date with the latest information. However, I don't want to use training cross-validation scores to understand their true performance, because the validation data spans a large period of time and would also shift the training window prior to the gap/embargo between train and validation. Instead, I want to retrain all my models (most are ensembles of models) prior to each new era in the test set (let's say the last 52 eras). At the end of this iterative process, all the models are fully trained, and you also have 52 eras of OOF predictions from up-to-date models, which you can use either to select the best one or to create an ensemble for the next live era.
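
Roughly, the loop I have in mind looks like the sketch below (plain pandas, not an existing numerblox API; `walk_forward_oof` is a hypothetical name, and it omits the gap/embargo a real run would apply):

```python
# Hedged sketch of the walk-forward retraining loop described above:
# refit a fresh model before each era in the holdout window and collect
# the out-of-fold predictions. Omits the gap/embargo for brevity.
import pandas as pd

def walk_forward_oof(model_factory, df: pd.DataFrame, feature_cols,
                     target_col: str, n_eras: int = 52) -> pd.Series:
    eras = sorted(df["era"].unique())
    oof = []
    for era in eras[-n_eras:]:
        train = df[df["era"] < era]   # everything before this era
        test = df[df["era"] == era]   # the era we predict on
        model = model_factory()       # fresh model per step
        model.fit(train[feature_cols], train[target_col])
        oof.append(pd.Series(model.predict(test[feature_cols]),
                             index=test.index, name="prediction"))
    return pd.concat(oof)
```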

Hope that makes sense. Obviously this is compute-intensive, but I think it could provide better insight than a static evaluation set, which doesn't allow for adaptive models that are retrained frequently.

@CarloLepelaars
Collaborator

Very cool idea @jmrichardson! Could you open a separate GitHub issue for continuous retraining? We would like to do more with retraining and partial_fit models, but have not taken the time to build it yet. Always happy to discuss and look at how we can approach this.

I'll merge in the model uploads. Thank you! We will make one big numerblox release at the end of next week, as v4 data gets deprecated for Numerai Classic.

@CarloLepelaars merged commit c416095 into crowdcent:master on Sep 20, 2024
4 checks passed