-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Trainer controller framework #45
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Few suggestions @seshapad !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more comments..
examples/training-control-configs/ctldef_epoch_threshold_v0.3.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! just 2 minor nits.
@alex-jw-brooks PTAL |
2d02686
to
555d90c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @seshapad, this is super cool, thanks for putting so much thought into it! I've added some comments, but as far as some general thoughts go:
-
It would be best to move away from generic exception catching leading to
continue/pass
silently - this happens in a few places, but in such situations it would probably be better to have throw a well-defined error as soon as possible, otherwise I think there’s risk of early stopping etc not working properly, and not logging due to continue/pass in generic exceptions. This is probably the biggest change needed IMO, the framework itself looks great -
It would be great to get some unit tests here, e.g., by converting some of the examples into end to end tests on a tiny model and validate that the behavior is actually being triggered correctly
-
I’ll leave this up to you, but splitting the PR up into a more atomic smaller PRs might make it both easier to test and review quickly. For example, one PR could contain the base framework / validation / callback etc, while other PR(s) could contain the early stopping implementation, the cache metric handler, etc. But this is up to you, if you'd prefer to keep it all together, that is fine also
Sorry for the delayed review. Please reach out if you need anything or have any questions! Great work!
tuning/trainercontroller/callback.py
Outdated
for k, v in controller['control-operations'].items(): | ||
setattr(control, k, v) | ||
except Exception as e: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be best to avoid generic exception catching or passing I think, because then in a rule is completely broken, it will likely throw every time silently, and the early stopping will just fail to work.
Any thoughts on just letting it throw the error? If eval throws, that means the rule or the metric computation is likely incorrect, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not critical provided small suggestions.
Comments from review: [x] Fold the result marshalling for controller-metrics to the base class |
…ass inheritance Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
…ema etc Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
@alex-jw-brooks @Ssukriti Once again, many thanks for reviewing the PR in detail. We have addressed the issues raised during linting, formatting and testing. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you very much! 🚀
* feat: Extended gitignore to include backup files and folders Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Addressed review comments related to exceptions and abstract class inheritance Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Design changes to trainer controller including validations, schema etc Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: trainer controller revamped Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: trainer controller revamped Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: Documentation and some test case bug fixes Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Formatting issues to make build succeed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Add rule validation to make eval safe again Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed default package typing from requirements.txt Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Added test cases, data, some exception handling Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Addressed the action filter bug and added a test case for it Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: bugs in operation validate Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * adr: Architecture document for trainer-controller Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Trainer controller examples directory renamed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Trainer controller examples directory renamed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Prefix regex corrected Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * adr: Details on key collisions added Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * adr: Details on key collisions added Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase issues related to aim callback addressed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase issues related to aim callback addressed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: brackets missing comma Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Addressed lint comments Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Added lint disable directives Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Reformatted files from black Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Resolved cyclic package dependencies Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>
Signed-off-by: Padmanabha Venkatagiri Seshadri seshapad@in.ibm.com
Co-authored-by: Dushyant Behl dushyantbehl@hotmail.com
Co-authored-by: Ashok Pon Kumar ashokponkumar@gmail.com
Motivation
evaluate
events, and handles only compares instantaneous metric value to a threshold.User Design Details:
We have implemented a trainer callback (see here) which accepts a
training control definition
file (in YAML format) which facilitates the definition of:The trainer controller configuration is structured as shown below. There are list of metric definitions under
controller-metrics
, a list of operations and their actions underoperations
and a list of controllers, each of which define the rules, triggers and control operations.The
controller-metrics
andoperations
are optional. We provide a set of built-incontroller-metrics
andoperations
which could be referred to without actually defining them as. For example, the below configuration defines acontroller-metric
calledloss
which refers to a built-inLoss
controller-metric class with custom arguments (in this case, no arguments), but does not define anyoperations
. It only refers to a built-in operation.For defining custom handler classes, we have an interface defined as an abstract class as shown below, with two abstract methods, namely:
validate()
to define the validation conditions, andcompute()
to compute the metric.These classes can be user-defined. To add a new metric class, simply implement the above structure and register it with the trainer controller framework using the
register_metric_handlers()
method. To use the metric handler class, add the class name, arguments to the above configuration file.Similarly, there is an operator abstract class
Operation
which could be inherited and custom operations could be defined as illustrated below:Every action defined in the custom operation should be represented as a function with
should_
prefixed in the function name. The controller will automatically pickup these functions and invoke them if they are referred to in the configuration. Custom operations could be registered usingregister_operation_handlers()
method.rule
is python expression which could express a condition to evaluate on a metric variable. For example, in the above configuration,loss
is the variable, and the rule is applying a threshold on it.operations
lists the operation-actions to be performed when the rule evaluates to True. The convention followed to refer to an operation is<operation-name>.<action-name>
. In this example, the<operation-class-name>
is referring to built-in operationhfcontrols
and one of its corresponding actionaction-name
i.eshould_training_stop
.