Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Trainer controller framework #45

Merged
merged 47 commits into from
Apr 9, 2024

Conversation

seshapad
Copy link
Contributor

@seshapad seshapad commented Feb 15, 2024

Signed-off-by: Padmanabha Venkatagiri Seshadri seshapad@in.ibm.com
Co-authored-by: Dushyant Behl dushyantbehl@hotmail.com
Co-authored-by: Ashok Pon Kumar ashokponkumar@gmail.com

Motivation

  • There is a need for stopping an ongoing training if some stopping criteria is satisfied (E.g loss validation reaching a certain target, loss increasing with epoch, loss values for last 100 steps increasing etc).
  • There is a EarlyStoppingCallback in HF, but the granularity of stopping is only on evaluate events, and handles only compares instantaneous metric value to a threshold.
  • Therefore, there is a need for a mechanism to capture the user-defined custom stopping criteria which could involve multiple metrics.
  • In addition to user-defined stopping criteria, there could other types of control operations with respect to training (for instance, should the trainer perform saving, logging or evaluation operations or not, should we scale resources dynamically so that training could run faster and so on). Therefore, there is a need for general need to capture all these use-cases in a single framework. This PR attempts to provide such a framework.

User Design Details:

We have implemented a trainer callback (see here) which accepts a training control definition file (in YAML format) which facilitates the definition of:

  1. Rules to control training loop
  2. Trigger points that evaluate the above rules
  3. Control operation and action that needs to be performed if rule is evaluated to true.

The trainer controller configuration is structured as shown below. There are list of metric definitions under controller-metrics, a list of operations and their actions under operations and a list of controllers, each of which define the rules, triggers and control operations.

controller-metrics:
  <controller-name>:
    <controller-handler-class>:
      <arg1>: <value>
      ...
operations:
  <operation-name>:
    <operation-handler-class>:
      <arg1>: <value>
      ...
controllers:
  - name: <controller-name>
    triggers:
      - <event-1>
      ...
    rule: <rule-string>
    operations:
      - <operation-action-1>
      ...

The controller-metrics and operations are optional. We provide a set of built-in controller-metrics and operations which could be referred to without actually defining them as. For example, the below configuration defines a controller-metric called loss which refers to a built-in Loss controller-metric class with custom arguments (in this case, no arguments), but does not define any operations. It only refers to a built-in operation.

controller-metrics:
  loss:
    Loss:
controllers:
  - name: loss-controller
    triggers:
      - on_log
    rule: loss < 1.0
    operations:
      - hfcontrols.should_training_stop

For defining custom handler classes, we have an interface defined as an abstract class as shown below, with two abstract methods, namely: validate() to define the validation conditions, and compute() to compute the metric.

class MetricHandler(metaclass=abc.ABCMeta):
    @abc.abstractmethod 
    def validate(self) -> bool:
        pass

    @abc.abstractmethod 
    def compute(self, event_name: str, **kwargs) -> Any:
        pass

These classes can be user-defined. To add a new metric class, simply implement the above structure and register it with the trainer controller framework using the register_metric_handlers() method. To use the metric handler class, add the class name, arguments to the above configuration file.

Similarly, there is an operator abstract class Operation which could be inherited and custom operations could be defined as illustrated below:

class CustomOperation(Operation):
    def should_perform_action_xyz(args):
        pass

Every action defined in the custom operation should be represented as a function with should_ prefixed in the function name. The controller will automatically pickup these functions and invoke them if they are referred to in the configuration. Custom operations could be registered using register_operation_handlers() method.

rule is python expression which could express a condition to evaluate on a metric variable. For example, in the above configuration, loss is the variable, and the rule is applying a threshold on it.

operations lists the operation-actions to be performed when the rule evaluates to True. The convention followed to refer to an operation is <operation-name>.<action-name>. In this example, the <operation-class-name> is referring to built-in operation hfcontrols and one of its corresponding action action-name i.e should_training_stop.

Copy link
Collaborator

@ashokponkumar ashokponkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Few suggestions @seshapad !

tuning/config/configs.py Outdated Show resolved Hide resolved
tuning/config/configs.py Outdated Show resolved Hide resolved
tuning/config/configs.py Outdated Show resolved Hide resolved
tuning/config/configs.py Outdated Show resolved Hide resolved
tuning/policydrivencontroller/__init__.py Outdated Show resolved Hide resolved
tuning/policydrivencontroller/pdt_callback.py Outdated Show resolved Hide resolved
tuning/policydrivencontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
tuning/policydrivencontroller/pdt_callback.py Outdated Show resolved Hide resolved
tuning/policydrivencontroller/pdt_callback.py Outdated Show resolved Hide resolved
tuning/policydrivencontroller/pdt_callback.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ashokponkumar ashokponkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments..

examples/training-control-configs/Readme.md Outdated Show resolved Hide resolved
examples/training-control-configs/ctldef_step_v0.3.yaml Outdated Show resolved Hide resolved
tuning/sft_trainer.py Outdated Show resolved Hide resolved
tuning/trainingcontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
examples/trainer-controller-configs/Readme.md Outdated Show resolved Hide resolved
examples/trainer-controller-configs/Readme.md Outdated Show resolved Hide resolved
examples/trainer-controller-configs/Readme.md Outdated Show resolved Hide resolved
examples/trainer-controller-configs/Readme.md Outdated Show resolved Hide resolved
examples/trainer-controller-configs/Readme.md Outdated Show resolved Hide resolved
tuning/trainercontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
tuning/trainercontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
tuning/trainercontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
tuning/trainercontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
tuning/trainercontroller/controllermetrics/metrics.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ashokponkumar ashokponkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! just 2 minor nits.

tuning/sft_trainer.py Outdated Show resolved Hide resolved
tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
@seshapad seshapad changed the title WIP: Policy driven training control feat: Policy driven training control Feb 23, 2024
@seshapad seshapad marked this pull request as ready for review February 23, 2024 12:25
@ashokponkumar
Copy link
Collaborator

@alex-jw-brooks PTAL

Copy link
Collaborator

@alex-jw-brooks alex-jw-brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @seshapad, this is super cool, thanks for putting so much thought into it! I've added some comments, but as far as some general thoughts go:

  • It would be best to move away from generic exception catching leading to continue/pass silently - this happens in a few places, but in such situations it would probably be better to have throw a well-defined error as soon as possible, otherwise I think there’s risk of early stopping etc not working properly, and not logging due to continue/pass in generic exceptions. This is probably the biggest change needed IMO, the framework itself looks great

  • It would be great to get some unit tests here, e.g., by converting some of the examples into end to end tests on a tiny model and validate that the behavior is actually being triggered correctly

  • I’ll leave this up to you, but splitting the PR up into a more atomic smaller PRs might make it both easier to test and review quickly. For example, one PR could contain the base framework / validation / callback etc, while other PR(s) could contain the early stopping implementation, the cache metric handler, etc. But this is up to you, if you'd prefer to keep it all together, that is fine also

Sorry for the delayed review. Please reach out if you need anything or have any questions! Great work!

tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
for k, v in controller['control-operations'].items():
setattr(control, k, v)
except Exception as e:
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to avoid generic exception catching or passing I think, because then in a rule is completely broken, it will likely throw every time silently, and the early stopping will just fail to work.

Any thoughts on just letting it throw the error? If eval throws, that means the rule or the metric computation is likely incorrect, right?

tuning/trainercontroller/callback.py Outdated Show resolved Hide resolved
examples/trainer-controller-configs/Readme.md Outdated Show resolved Hide resolved
tuning/trainercontroller/controllermetrics/steploss.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@kmehant kmehant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not critical provided small suggestions.

@seshapad
Copy link
Contributor Author

seshapad commented Mar 12, 2024

Comments from review:

[x] Fold the result marshalling for controller-metrics to the base class
[x] Handling the scenarios when eval() does not have the variables required to evaluate the rule - more discussion required
[x] eval() needs to be replaced by safer version
[x] Add event_name as argument to compute() function in metrichandler
[x] Split the initialization of controller: initialization should happen in init, validation should happen in on_init_end()
[x] Error out and stop training, if config file is incorrect or missing. On the other hand if config file is present, but empty, continue with the training...but the callback will be do nothing.
[x] Use IntervalStrategy enum for steps and epoch strings.

requirements.txt Outdated Show resolved Hide resolved
@seshapad seshapad changed the title feat: Policy driven training control feat: Trainer controller framework Mar 19, 2024
seshapad and others added 21 commits April 9, 2024 00:44
…ass inheritance

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
…ema etc

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
@seshapad
Copy link
Contributor Author

seshapad commented Apr 9, 2024

Hey @seshapad - thanks for addressing the requested changes and adding the ADR! For the pylint errors:

1. Do you get this on `main`? I am a bit surprised to see it here, since this PR doesn't really touch the peft config stuff, but it probably comes from the local name and importing `peft_config` directly from `tuning.config`. If it is on `main`, I'll fix this one in a separate PR - I'll also try running the format check stuff here

For 2 & 3, we can disable them using a comment like this, since it seems like these are both needed.

I think everything else looks good! Great work 🥳

@alex-jw-brooks @Ssukriti Once again, many thanks for reviewing the PR in detail. We have addressed the issues raised during linting, formatting and testing. The make fmt, make lint and make test have been run to verify it. The changes have also been pushed to the PR.
If you are satisfied with the PR, We request you to take the process forward in triggering the workflow checks, approval and merge.

Copy link
Collaborator

@alex-jw-brooks alex-jw-brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you very much! 🚀

@alex-jw-brooks alex-jw-brooks merged commit 7b7effd into foundation-model-stack:main Apr 9, 2024
3 checks passed
jbusche pushed a commit to jbusche/fms-hf-tuning that referenced this pull request Apr 9, 2024
* feat: Extended gitignore to include backup files and folders

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Policy driven training control

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Addressed review comments related to exceptions and abstract class inheritance

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Design changes to trainer controller including validations, schema etc

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: trainer controller revamped

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

* feat: trainer controller revamped

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>

* feat: Documentation and some test case bug fixes

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Formatting issues to make build succeed

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Add rule validation to make eval safe again

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Removed default package typing from requirements.txt

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Added test cases, data, some exception handling

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Addressed the action filter bug and added a test case for it

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: bugs in operation validate

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* adr: Architecture document for trainer-controller

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Trainer controller examples directory renamed

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Trainer controller examples directory renamed

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Prefix regex corrected

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* adr: Details on key collisions added

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* adr: Details on key collisions added

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: rebase issues related to aim callback addressed

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: rebase issues related to aim callback addressed

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: brackets missing comma

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Addressed lint comments

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Added lint disable directives

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Reformatted files from black

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Resolved cyclic package dependencies

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

---------

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants