feat: Trainer controller framework #45

seshapad · 2024-02-15T06:05:22Z

Signed-off-by: Padmanabha Venkatagiri Seshadri seshapad@in.ibm.com
Co-authored-by: Dushyant Behl dushyantbehl@hotmail.com
Co-authored-by: Ashok Pon Kumar ashokponkumar@gmail.com

Motivation

Training runs often need to stop as soon as some stopping criterion is satisfied (e.g., validation loss reaching a certain target, loss increasing with each epoch, or loss values increasing over the last 100 steps).
HF provides an EarlyStoppingCallback, but it checks for stopping only on evaluate events and only compares an instantaneous metric value against a threshold.
Therefore, there is a need for a mechanism that captures user-defined custom stopping criteria, which could involve multiple metrics.
In addition to user-defined stopping criteria, there could be other types of control operations with respect to training (for instance, whether the trainer should perform saving, logging or evaluation operations, or whether resources should be scaled dynamically so that training runs faster). Therefore, there is a general need to capture all these use cases in a single framework. This PR attempts to provide such a framework.

User Design Details:

We have implemented a trainer callback (see here) that accepts a training control definition file (in YAML format), which defines:

Rules to control the training loop
Trigger points at which the above rules are evaluated
Control operations and actions to be performed if a rule evaluates to true

The trainer controller configuration is structured as shown below. There is a list of metric definitions under controller-metrics, a list of operations and their actions under operations, and a list of controllers, each of which defines the rules, triggers and control operations.

controller-metrics:
  <controller-metric-name>:
    <controller-handler-class>:
      <arg1>: <value>
      ...
operations:
  <operation-name>:
    <operation-handler-class>:
      <arg1>: <value>
      ...
controllers:
  - name: <controller-name>
    triggers:
      - <event-1>
      ...
    rule: <rule-string>
    operations:
      - <operation-action-1>
      ...

The controller-metrics and operations sections are optional. We provide a set of built-in controller-metrics and operations that can be referred to without explicitly defining them. For example, the configuration below defines a controller-metric called loss, which refers to the built-in Loss controller-metric class with custom arguments (in this case, no arguments), but does not define any operations. It only refers to a built-in operation.

controller-metrics:
  loss:
    Loss:
controllers:
  - name: loss-controller
    triggers:
      - on_log
    rule: loss < 1.0
    operations:
      - hfcontrols.should_training_stop
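
To make the structure concrete, here is a minimal Python sketch that writes the example above as the dictionary a YAML parser would typically produce, followed by a small structural check. The helper `check_controllers` and the `REQUIRED_CONTROLLER_KEYS` set are illustrative assumptions based on the schema described here, not part of this PR's API.

```python
# Dictionary equivalent of the example configuration above (an empty YAML
# value such as `Loss:` is normally parsed as None).
config = {
    "controller-metrics": {"loss": {"Loss": None}},
    "controllers": [
        {
            "name": "loss-controller",
            "triggers": ["on_log"],
            "rule": "loss < 1.0",
            "operations": ["hfcontrols.should_training_stop"],
        }
    ],
}

# Hypothetical helper: the keys follow the schema described in this PR,
# but the function itself is only an illustration.
REQUIRED_CONTROLLER_KEYS = {"name", "triggers", "rule", "operations"}


def check_controllers(cfg: dict) -> list:
    """Return a list of human-readable problems found in the controllers."""
    problems = []
    for index, controller in enumerate(cfg.get("controllers", [])):
        missing = REQUIRED_CONTROLLER_KEYS - controller.keys()
        if missing:
            problems.append(f"controller #{index} is missing {sorted(missing)}")
    return problems


print(check_controllers(config))  # [] because every required key is present
```

Note that controller-metrics and operations are deliberately not checked, since both sections are optional.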

To define custom handler classes, we provide an interface in the form of an abstract class, shown below, with two abstract methods: validate() to define the validation conditions, and compute() to compute the metric.

import abc
from typing import Any


class MetricHandler(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def validate(self) -> bool:
        # Define the validation conditions for this handler.
        pass

    @abc.abstractmethod
    def compute(self, event_name: str, **kwargs) -> Any:
        # Compute and return the metric for the given event.
        pass

These classes can be user-defined. To add a new metric class, implement the above interface and register it with the trainer controller framework using the register_metric_handlers() method. To use the metric handler class, add the class name and its arguments to the configuration file shown above.
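
As an illustration of the interface, the sketch below implements one of the criteria mentioned in the motivation (loss increasing over the last N logged values). The class `LossWindowIncreasing`, its `window_size` argument, and the assumption that logged values arrive as a `logs` keyword argument are all hypothetical; the interface is repeated only so the example runs on its own.

```python
import abc
from collections import deque
from typing import Any


class MetricHandler(metaclass=abc.ABCMeta):
    """Interface from the description above, repeated so the sketch runs alone."""

    @abc.abstractmethod
    def validate(self) -> bool:
        pass

    @abc.abstractmethod
    def compute(self, event_name: str, **kwargs) -> Any:
        pass


class LossWindowIncreasing(MetricHandler):
    """Hypothetical metric: True if loss rose strictly over the last N logs."""

    def __init__(self, window_size: int = 3):
        self.window_size = window_size
        self._losses = deque(maxlen=window_size)

    def validate(self) -> bool:
        # A window needs at least two points to show a trend.
        return self.window_size >= 2

    def compute(self, event_name: str, **kwargs) -> Any:
        # Assumption: the caller passes the logged values as `logs`.
        logs = kwargs.get("logs") or {}
        if "loss" in logs:
            self._losses.append(logs["loss"])
        if len(self._losses) < self.window_size:
            return False
        values = list(self._losses)
        return all(a < b for a, b in zip(values, values[1:]))


handler = LossWindowIncreasing(window_size=3)
assert handler.validate()
for loss in (0.9, 1.0, 1.1):
    result = handler.compute("on_log", logs={"loss": loss})
print(result)  # True: 0.9 < 1.0 < 1.1
```

Registration would go through register_metric_handlers(); its exact signature is not shown in this description, so the sketch does not call it.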

Similarly, there is an abstract class Operation which can be inherited to define custom operations, as illustrated below:

class CustomOperation(Operation):
    # Each action is a method whose name starts with "should_".
    def should_perform_action_xyz(self, *args, **kwargs):
        pass

Every action defined in the custom operation should be represented as a function whose name is prefixed with should_. The controller will automatically pick up these functions and invoke them if they are referred to in the configuration. Custom operations can be registered using the register_operation_handlers() method.
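
The should_ naming convention lends itself to discovery by introspection. The following is a minimal sketch of how that could work, not the framework's actual implementation: the `Operation` class here is a simplified stand-in, and the `actions()` method is a hypothetical name.

```python
import inspect


class Operation:
    """Stand-in for the framework's Operation base class (assumed, simplified)."""

    def actions(self) -> dict:
        """Map each `should_*` method name to its bound method."""
        return {
            name: method
            for name, method in inspect.getmembers(self, predicate=inspect.ismethod)
            if name.startswith("should_")
        }


class CustomOperation(Operation):
    def __init__(self):
        self.performed = []

    def should_perform_action_xyz(self, **kwargs):
        self.performed.append(("xyz", kwargs))

    def helper(self):
        # Not prefixed with `should_`, so it is not treated as an action.
        pass


operation = CustomOperation()
available = operation.actions()
print(sorted(available))  # ['should_perform_action_xyz']

# Invoke an action by the name a configuration would refer to.
available["should_perform_action_xyz"](event_name="on_log")
print(operation.performed)  # [('xyz', {'event_name': 'on_log'})]
```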

rule is a Python expression that expresses a condition to evaluate on a metric variable. For example, in the above configuration, loss is the variable, and the rule applies a threshold to it.

operations lists the operation-actions to be performed when the rule evaluates to True. The convention for referring to an operation-action is <operation-name>.<action-name>. In this example, <operation-name> refers to the built-in operation hfcontrols, and <action-name> to one of its actions, should_training_stop.
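
Putting the rule and the <operation-name>.<action-name> convention together, a controller's behaviour on a trigger can be sketched as follows. This is a simplified illustration under stated assumptions: `StubControl`, `HFControls` and `run_controller` are hypothetical stand-ins, and the plain eval() call is used only for brevity.

```python
class StubControl:
    """Stand-in for the trainer's control object (assumed flag name)."""

    def __init__(self):
        self.should_training_stop = False


class HFControls:
    """Hypothetical stand-in for the built-in `hfcontrols` operation."""

    def should_training_stop(self, control):
        control.should_training_stop = True


def run_controller(controller, metrics, operations, control):
    """Evaluate the rule on the metric values; run the actions if it holds."""
    # eval() with builtins removed keeps this sketch short; it is not a sandbox.
    fired = bool(eval(controller["rule"], {"__builtins__": {}}, dict(metrics)))
    if fired:
        for reference in controller["operations"]:
            operation_name, action_name = reference.split(".", 1)
            getattr(operations[operation_name], action_name)(control)
    return fired


controller = {
    "name": "loss-controller",
    "triggers": ["on_log"],
    "rule": "loss < 1.0",
    "operations": ["hfcontrols.should_training_stop"],
}
operations = {"hfcontrols": HFControls()}

control = StubControl()
print(run_controller(controller, {"loss": 1.7}, operations, control))  # False
print(control.should_training_stop)  # False
print(run_controller(controller, {"loss": 0.4}, operations, control))  # True
print(control.should_training_stop)  # True
```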

ashokponkumar

Nice work. Few suggestions @seshapad !

tuning/config/configs.py

tuning/policydrivencontroller/__init__.py

tuning/policydrivencontroller/pdt_callback.py

tuning/policydrivencontroller/controllermetrics/metrics.py

tuning/policydrivencontroller/pdt_callback.py

ashokponkumar

Few more comments..

examples/training-control-configs/Readme.md

examples/training-control-configs/ctldef_epoch_threshold_v0.3.yaml

examples/training-control-configs/ctldef_evaluate_v0.3.yaml

examples/training-control-configs/ctldef_step_v0.3.yaml

tuning/sft_trainer.py

tuning/trainingcontroller/controllermetrics/metrics.py

examples/trainer-controller-configs/Readme.md

tuning/trainercontroller/controllermetrics/metrics.py

ashokponkumar

Very nice! just 2 minor nits.

tuning/sft_trainer.py

tuning/trainercontroller/callback.py

ashokponkumar · 2024-02-23T14:25:37Z

@alex-jw-brooks PTAL

alex-jw-brooks

Hi @seshapad, this is super cool, thanks for putting so much thought into it! I've added some comments, but as far as some general thoughts go:

It would be best to move away from generic exception catching that leads to a silent continue/pass - this happens in a few places, but in such situations it would probably be better to throw a well-defined error as soon as possible; otherwise I think there’s a risk of early stopping etc. not working properly, with nothing logged because of the continue/pass in the generic exception handlers. This is probably the biggest change needed IMO, the framework itself looks great
It would be great to get some unit tests here, e.g., by converting some of the examples into end-to-end tests on a tiny model and validating that the behavior is actually being triggered correctly
I’ll leave this up to you, but splitting the PR up into more atomic, smaller PRs might make it both easier to test and review quickly. For example, one PR could contain the base framework / validation / callback etc, while other PR(s) could contain the early stopping implementation, the cache metric handler, etc. But this is up to you, if you'd prefer to keep it all together, that is fine also

Sorry for the delayed review. Please reach out if you need anything or have any questions! Great work!

tuning/trainercontroller/callback.py

tuning/trainercontroller/controllermetrics/metricshandler.py

tuning/trainercontroller/callback.py

alex-jw-brooks · 2024-03-07T01:28:33Z

tuning/trainercontroller/callback.py

+                    for k, v in controller['control-operations'].items():
+                        setattr(control, k, v)
+            except Exception as e:
+                pass


It would be best to avoid generic exception catching or passing, I think, because then if a rule is completely broken, it will likely throw every time silently, and the early stopping will just fail to work.

Any thoughts on just letting it throw the error? If eval throws, that means the rule or the metric computation is likely incorrect, right?
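
One way to act on this suggestion, sketched under assumptions: catch only the specific exceptions a bad rule can raise and re-raise them as a well-defined error, rather than using `except Exception: pass`. The names `RuleEvaluationError` and `evaluate_rule` are hypothetical and do not come from this PR.

```python
class RuleEvaluationError(RuntimeError):
    """Hypothetical error raised when a controller rule cannot be evaluated."""


def evaluate_rule(rule: str, metrics: dict) -> bool:
    """Evaluate a rule and surface failures instead of swallowing them."""
    try:
        return bool(eval(rule, {"__builtins__": {}}, dict(metrics)))
    except (NameError, SyntaxError, TypeError) as err:
        # Re-raise with context so a broken rule stops training loudly
        # rather than silently disabling early stopping.
        raise RuleEvaluationError(
            f"rule {rule!r} failed with metrics {sorted(metrics)}: {err}"
        ) from err


print(evaluate_rule("loss < 1.0", {"loss": 0.5}))  # True

try:
    evaluate_rule("los < 1.0", {"loss": 0.5})  # typo in the metric name
except RuleEvaluationError as err:
    print(f"training would stop here: {err}")
```

With this shape, a misspelled metric name fails on the first trigger instead of never firing.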

tuning/trainercontroller/callback.py

examples/trainer-controller-configs/Readme.md

tuning/trainercontroller/controllermetrics/steploss.py

examples/trainer-controller-configs/Readme.md

tuning/trainercontroller/controllermetrics/steploss.py

kmehant

Not critical; provided small suggestions.

seshapad · 2024-03-12T05:54:14Z

Comments from review:

[x] Fold the result marshalling for controller-metrics to the base class
[x] Handling the scenarios when eval() does not have the variables required to evaluate the rule - more discussion required
[x] eval() needs to be replaced by safer version
[x] Add event_name as argument to compute() function in metrichandler
[x] Split the initialization of controller: initialization should happen in init, validation should happen in on_init_end()
[x] Error out and stop training if the config file is incorrect or missing. On the other hand, if the config file is present but empty, continue with the training, but the callback will do nothing.
[x] Use IntervalStrategy enum for steps and epoch strings.
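
For the "safer eval()" item, one common approach is to validate the rule's syntax tree against an allow-list before evaluating it. The sketch below illustrates that idea with Python's standard ast module; it is not necessarily how this PR implements rule validation, and `validate_rule` and `ALLOWED_NODES` are hypothetical names.

```python
import ast

# Node types permitted in a rule; everything else (calls, attribute access,
# subscripts, lambdas, ...) is rejected before evaluation.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.USub, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Compare, ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)


def validate_rule(rule: str, known_metrics: set) -> None:
    """Raise ValueError if the rule uses disallowed syntax or unknown names."""
    try:
        tree = ast.parse(rule, mode="eval")
    except SyntaxError as err:
        raise ValueError(f"rule {rule!r} is not a valid expression") from err
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"{type(node).__name__} is not allowed in {rule!r}")
        if isinstance(node, ast.Name) and node.id not in known_metrics:
            raise ValueError(f"unknown metric {node.id!r} in {rule!r}")


validate_rule("loss < 1.0 and epoch >= 2", {"loss", "epoch"})  # passes

for bad_rule in ("__import__('os').system('ls')", "accuracy > 0.9"):
    try:
        validate_rule(bad_rule, {"loss", "epoch"})
    except ValueError as err:
        print(err)
```

Checking names against the known metrics also covers the case where eval() would otherwise lack a variable required by the rule.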

requirements.txt

tuning/trainercontroller/callback.py

tuning/trainercontroller/operations/operation.py


seshapad · 2024-04-09T05:55:08Z

Hey @seshapad - thanks for addressing the requested changes and adding the ADR! For the pylint errors:
1. Do you get this on `main`? I am a bit surprised to see it here, since this PR doesn't really touch the peft config stuff, but it probably comes from the local name and importing `peft_config` directly from `tuning.config`. If it is on `main`, I'll fix this one in a separate PR - I'll also try running the format check stuff here
For 2 & 3, we can disable them using a comment like this, since it seems like these are both needed.

I think everything else looks good! Great work 🥳

@alex-jw-brooks @Ssukriti Once again, many thanks for reviewing the PR in detail. We have addressed the issues raised during linting, formatting and testing, and have run make fmt, make lint and make test to verify the fixes. The changes have also been pushed to the PR.
If you are satisfied with the PR, we request you to take the process forward by triggering the workflow checks, approval and merge.

alex-jw-brooks

LGTM! Thank you very much! 🚀

* feat: Extended gitignore to include backup files and folders Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri 
<seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Policy driven training control Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Addressed review comments related to exceptions and abstract class inheritance Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Design changes to trainer controller including validations, schema etc Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: trainer controller revamped Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: trainer controller revamped Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com> * feat: Documentation and some test case bug fixes Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Formatting issues to make build succeed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Add rule validation to make eval safe again Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed default package typing from requirements.txt Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Added test cases, data, some exception handling Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Addressed the action filter bug and added a test case for it Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: bugs in operation validate Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * 
adr: Architecture document for trainer-controller Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Trainer controller examples directory renamed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Trainer controller examples directory renamed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Prefix regex corrected Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * adr: Details on key collisions added Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * adr: Details on key collisions added Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase issues related to aim callback addressed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase issues related to aim callback addressed Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: brackets missing comma Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Addressed lint comments Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Added lint disable directives Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Reformatted files from black Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Resolved cyclic package dependencies Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

seshapad marked this pull request as draft February 15, 2024 06:06

seshapad force-pushed the ptc branch from ada3aca to 42a07fb February 15, 2024 10:21

ashokponkumar requested changes Feb 16, 2024

seshapad force-pushed the ptc branch from 81b7946 to 10dd4ec February 21, 2024 04:31

ashokponkumar requested changes Feb 21, 2024

seshapad force-pushed the ptc branch from ce3655c to 89daeaf February 21, 2024 15:56

ashokponkumar requested changes Feb 21, 2024

ashokponkumar approved these changes Feb 23, 2024

tuning/sft_trainer.py (outdated)

tuning/trainercontroller/callback.py (outdated)

tuning/trainercontroller/callback.py (outdated)

seshapad changed the title ~~WIP: Policy driven training control~~ feat: Policy driven training control Feb 23, 2024

seshapad marked this pull request as ready for review February 23, 2024 12:25

seshapad force-pushed the ptc branch 2 times, most recently from 2d02686 to 555d90c March 5, 2024 08:02

alex-jw-brooks requested changes Mar 7, 2024

seshapad requested review from anhuong and Ssukriti as code owners March 11, 2024 16:57

Akash-Nayak reviewed Mar 11, 2024

examples/trainer-controller-configs/Readme.md (outdated)

Akash-Nayak reviewed Mar 11, 2024

examples/trainer-controller-configs/Readme.md (outdated)

kmehant reviewed Mar 12, 2024

tuning/trainercontroller/controllermetrics/steploss.py (outdated)

kmehant reviewed Mar 12, 2024

tuning/trainercontroller/controllermetrics/steploss.py (outdated)

kmehant reviewed Mar 12, 2024

seshapad force-pushed the ptc branch from 7edf675 to 903bd86 March 17, 2024 12:01

kmehant reviewed Mar 19, 2024

requirements.txt (outdated)

kmehant reviewed Mar 19, 2024

tuning/trainercontroller/callback.py (outdated)

seshapad force-pushed the ptc branch from 2c16369 to a4dde7f March 19, 2024 13:11

seshapad changed the title ~~feat: Policy driven training control~~ feat: Trainer controller framework Mar 19, 2024

kmehant reviewed Mar 20, 2024

tuning/trainercontroller/operations/operation.py (outdated)

kmehant reviewed Mar 20, 2024

tuning/trainercontroller/operations/operation.py (outdated)

kmehant reviewed Mar 20, 2024

tuning/trainercontroller/operations/operation.py (outdated)

seshapad and others added 21 commits April 9, 2024 00:44

feat: Addressed review comments related to exceptions and abstract class inheritance

8adc64c

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

feat: Design changes to trainer controller including validations, schema etc

60d3d9f

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

feat: trainer controller revamped

9ceec50

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

feat: trainer controller revamped

22e3d7a

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

feat: Documentation and some test case bug fixes

53666d7

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Formatting issues to make build succeed

1c9d09f

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Add rule validation to make eval safe again

de1de92

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Removed default package typing from requirements.txt

79ad268

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

feat: Added test cases, data, some exception handling

4120a08

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Addressed the action filter bug and added a test case for it

854a62e

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: bugs in operation validate

0daeeda

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

adr: Architecture document for trainer-controller

86a810a

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Trainer controller examples directory renamed

0bba53e

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Trainer controller examples directory renamed

22a414b

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Prefix regex corrected

4e0ae74

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

adr: Details on key collisions added

4ec9592

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

adr: Details on key collisions added

5cd124a

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: rebase issues related to aim callback addressed

1be25a9

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: rebase issues related to aim callback addressed

a746b27

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: brackets missing comma

a8daf68

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Addressed lint comments

cb14f42

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

seshapad force-pushed the ptc branch from f4f4509 to cb14f42 April 9, 2024 04:44

seshapad added 3 commits April 9, 2024 01:17

fix: Added lint disable directives

2e75f58

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Reformatted files from black

fa72f73

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

fix: Resolved cyclic package dependencies

f3673cb

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

alex-jw-brooks approved these changes Apr 9, 2024

View reviewed changes

alex-jw-brooks merged commit 7b7effd into foundation-model-stack:main Apr 9, 2024
3 checks passed

HarikrishnanBalagopal mentioned this pull request Jun 25, 2024

docs: instructions for using the trainer controller framework #214

Merged

2 tasks
