
support early stop feature #692

Closed · hougangliu opened this issue Jul 19, 2019 · 17 comments

@hougangliu commented Jul 19, 2019

Early stopping should consider not only the goal metric, but also epoch, step, and so on.

@gaocegege

/0.7.0

@gaocegege

/kind feature

@gaocegege

/priority p1

@andreyvelich

Do we have any updates on supporting Early Stopping in Katib?
I think we should first try to implement the Median Stopping Rule, as @richardsliu noted here: #936 (comment).

According to the Google Vizier paper (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf), the algorithm should analyse metrics from running Trials and stop them.
In our case, the Trial controller should process Trial metrics during the training process and send them to the Early Stopping service.

Right now, the Metrics Collector parses metrics only once the training process is finished (https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1alpha3/file-metricscollector/main.go#L94).
So we need a way to send metrics from running Trials to the Early Stopping service. For reference, the Median Stopping Rule itself is simple; a minimal sketch follows.
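
A minimal sketch of the Median Stopping Rule (illustrative only, not existing Katib code; the function and variable names are made up). A running Trial is stopped at step s if its best metric so far is worse than the median of the running averages that other trials reached by the same step:

from statistics import median


def should_stop(running_history, completed_histories, step):
    """Median Stopping Rule: stop the running trial at `step` if its best
    metric so far is below the median of the other trials' running averages
    up to the same step (assumes higher metric values are better)."""
    running_avgs = []
    for history in completed_histories:
        prefix = history[: step + 1]
        if prefix:
            running_avgs.append(sum(prefix) / len(prefix))
    if not running_avgs:
        return False  # nothing to compare against yet
    best_so_far = max(running_history[: step + 1])
    return best_so_far < median(running_avgs)


# Example: a weak trial at step 2 versus two stronger completed trials.
print(should_stop([0.20, 0.25, 0.30],
                  [[0.5, 0.6, 0.7], [0.4, 0.5, 0.6]], step=2))  # True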

Any thoughts, @hougangliu @gaocegege @johnugeorge?

@johnugeorge

There was some discussion about this before, but I couldn't find the issue. @gaocegege

@andreyvelich

I have an idea for how we can implement Early Stopping with the current Katib functionality.

Instead of creating an independent service for Early Stopping, we could mutate another container into the Trial Pod, as we do with the metrics collector, and stop the main training process when the Trial needs to be early stopped.

Here is an example for the median stopping rule (a rough sketch of the sidecar logic follows the list):

  1. Katib trains and reports metrics for X steps.

  2. After X steps, the Suggestion generates the median rule from the reported metrics.

  3. The Katib controller creates the Trial job and injects an Early Stopping container into the training pod with the appropriate parameters (e.g. stop_rule = accuracy >= 0.7). We can use the same injection webhook as now.

  4. We add additional commands to the training container to handle killing the Linux process.
    For example:

python3 mnist.py --num-layers=2 1> /var/log/katib/metrics.log \
  && echo completed > /var/log/katib/$$$$.pid \
  || if test -f '/var/log/katib/early-stopping'; then
       echo 'Training Container was Early Stopped';
     else
       echo 'Training Container failed';
       exit 1;
     fi

Once the python3 execution fails, the container runs:

if test -f '/var/log/katib/early-stopping'; then
    echo 'Training Container was Early Stopped';
else
    echo 'Training Container failed';
    exit 1;
fi
  5. The Early Stopping container tails the metrics log file, or the TF event metrics log for the TFEvent metrics collector.

  6. When Early Stopping reads metrics that are < 0.7, it gets the PID of the corresponding training process. I believe this is possible, since the metrics collector gets it: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1alpha3/common/pns.go#L4.

  7. After that, the Early Stopping container:

  • Stops the training Linux process: pkill -STOP -s <PID>.
  • Creates a file under /var/log/katib/early-stopping. The Training, Metrics Collector, and Early Stopping containers share the same EmptyDir.
  • Kills the training process: pkill -KILL -s <PID>.

The main Training Job will be completed if the /var/log/katib/early-stopping file exists, or failed (exit 1 in the else branch) if the file doesn't exist.

  8. Then the metrics collector checks whether the training container was early stopped (i.e. whether the /var/log/katib/early-stopping file exists).

  9. In the observation log, I propose adding another field, is_early_stopped, to indicate whether the Trial was early stopped. We can create an additional table for it.

  10. If the training container was early stopped, the metrics collector sends the metrics to the DB with is_early_stopped = 1. I think it is useful to report metrics even if the Trial was early stopped.

  11. The is_early_stopped parameter can later be used by the Katib controller to change the Trial status and not report those Trials to the Suggestion.

With this approach we don't break normal Kubernetes Job execution: if the training process fails because of a code error, the Training Job fails as well.
I tested it; it is possible to stop and kill the training process from an injected container (the metrics collector).
I expect this approach to work not only for Python training containers, since we are working with Linux processes.
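
A rough sketch of the proposed sidecar logic, as referenced above (illustrative only: the file names and the 0.7 threshold mirror the example, while the log format, regex, and polling loop are assumptions, not existing Katib code):

import os
import re
import signal
import time

METRICS_LOG = "/var/log/katib/metrics.log"
EARLY_STOP_MARKER = "/var/log/katib/early-stopping"
STOP_THRESHOLD = 0.7  # from the stop_rule example: accuracy >= 0.7


def latest_accuracy(path):
    """Return the most recent accuracy value found in the metrics log, or None."""
    value = None
    with open(path) as f:
        for line in f:
            match = re.search(r"accuracy=([0-9.]+)", line)
            if match:
                value = float(match.group(1))
    return value


def early_stop(training_pid):
    """Pause the training process, write the marker file shared via the
    EmptyDir, then kill the process (single-PID stand-in for the
    pkill -STOP / pkill -KILL commands above)."""
    os.kill(training_pid, signal.SIGSTOP)
    open(EARLY_STOP_MARKER, "w").close()
    os.kill(training_pid, signal.SIGKILL)


def watch(training_pid, poll_seconds=30):
    """Poll the metrics log and trigger early stopping when the rule fires."""
    while True:
        accuracy = latest_accuracy(METRICS_LOG)
        if accuracy is not None and accuracy < STOP_THRESHOLD:
            early_stop(training_pid)
            return
        time.sleep(poll_seconds)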

What do you think, @gaocegege @johnugeorge?

/cc @jlewi @richardsliu

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label        Probability
area/katib   0.99

@gaocegege

It will be hard to implement some complex logic this way, I think. There are many early stopping algorithms.

@andreyvelich

@gaocegege Are there any early stopping algorithms for which analysing logs from the training container would not be enough?

@johnugeorge commented May 10, 2020

Failing a Job might be wrong, as it conveys the wrong meaning to the user.

From what you described, why don't we follow the same control flow we have now? The Katib controller in each iteration decides whether the early stopping condition is met and marks the experiment successful with a message saying that the early stopping condition was met.

@andreyvelich commented May 10, 2020

Failing a Job might be wrong, as it conveys the wrong meaning to the user.

In the case of early stopping, the Job will be succeeded after we kill the process.

From what you described, why don't we follow the same control flow we have now? The Katib controller in each iteration decides whether the early stopping condition is met and marks the experiment successful with a message saying that the early stopping condition was met.

How can the Katib controller get information about each iteration? Currently, the metrics collector parses logs only once the Training Job is completed, and the Trial controller watches only for Training Job changes: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/trial/trial_controller.go#L97.

@andreyvelich

After the Katib meeting discussion, we have a few thoughts about this issue:

  1. Create a Katib Python SDK for Early Stopping.
    Users can push the appropriate parameters directly from the Trial training container pod to the Early Stopping service.
    After that, the service can terminate the pod.
    In that approach, users need to manually change the training container code and build a Docker image for it (a sketch of what such an SDK could look like follows the list).

  2. Try to realise the suggestion proposed here: support early stop feature #692 (comment).
    In that case, users need to print all the information that the early stopping service should analyse (epochs, accuracy, etc.).
    Instead of creating another sidecar container, we can add this logic to the metrics collector container, since early stopping also parses logs.

  3. Investigate early stopping in other AutoML projects.
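
For illustration, a minimal sketch of what the SDK from option 1 could look like from inside the user's training code. The EarlyStoppingClient class, its methods, the HTTP endpoints, and the service address are all hypothetical, not an existing Katib API:

import requests  # assumes the Early Stopping service exposes a simple HTTP API


class EarlyStoppingClient:
    """Hypothetical client that pushes metrics from the training container
    to the Early Stopping service and asks whether to stop."""

    def __init__(self, trial_name, service="katib-early-stopping:8080"):
        self.trial_name = trial_name
        self.url = "http://" + service

    def report(self, metric_name, value, step):
        # Push an intermediate metric value for this Trial.
        requests.post(self.url + "/report", json={
            "trial": self.trial_name,
            "metric": metric_name,
            "value": value,
            "step": step,
        })

    def should_stop(self):
        # Ask the service whether this Trial should be early stopped.
        resp = requests.get(self.url + "/should-stop",
                            params={"trial": self.trial_name})
        return resp.json()["should_stop"]


# Usage inside the user's training loop:
# client = EarlyStoppingClient(trial_name="my-trial-abc123")
# for epoch in range(num_epochs):
#     accuracy = train_one_epoch()
#     client.report("accuracy", accuracy, step=epoch)
#     if client.should_stop():
#         break  # terminate training early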

/cc @gaocegege @johnugeorge

/priority p0

@c-bata commented Jun 30, 2020

Investigate early stopping in other AutoML projects.

In Optuna, the user reports intermediate values via the Python API, then stops training the model if the pruner is triggered.

import optuna


def objective(trial):
    for epoch in range(10):
        # 1. Train a model
        ...

        # 2. Evaluate loss.
        ...

        # 3. Report an intermediate metric value at each epoch.
        trial.report(accuracy, step=epoch)

        # 4. Ask SuccessiveHalvingPruner() to tell whether this trial should stop or continue.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    ...
    return accuracy  # final metric score

if __name__ == '__main__':
    study = optuna.create_study(
        sampler=optuna.samplers.TPESampler(),  # corresponds to suggestion service in Katib.
        pruner=optuna.pruners.SuccessiveHalvingPruner(),
    )
    study.optimize(objective, n_trials=100)

https://github.com/optuna/optuna/blob/master/examples/pytorch_simple.py

@andreyvelich

In Optuna, the user reports intermediate values via the Python API, then stops training the model if the pruner is triggered.

Thank you for this example! As I can see, they also have some sort of SDK for it.

Do you know how they stop the training and mark Trials as Pruned?
To stop, do they just raise the optuna.exceptions.TrialPruned() exception in the objective function?

@c-bata commented Jun 30, 2020

Do you know how they stop the training and mark Trials as Pruned?
To stop, do they just raise the optuna.exceptions.TrialPruned() exception in the objective function?

Yes. They just raise the TrialPruned exception inside the objective function; Optuna then catches it and marks the trial PRUNED at the following lines.

        # Create a new trial.
        trial = ...

        try:
            # Call an objective function
            result = func(trial)
        except exceptions.TrialPruned as e:
            # Mark the trial `PRUNED`
            ...
            self._storage.set_trial_state(trial_id, TrialState.PRUNED)
            ...
        except Exception as e:
            # Mark the trial `FAILED`
            ...
            self._storage.set_trial_state(trial_id, TrialState.FAIL)

https://github.com/optuna/optuna/blob/6a1674666a5d8d778b39b902627ff40e98e14c48/optuna/study.py#L685-L700

@andreyvelich

I was testing this approach: #692 (comment).
For the file metrics collector and StdOut, it was working, and we can implement it.

Here: https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1beta1/file-metricscollector/main.go#L67-L83, we can analyse the required information for Early Stopping and kill the corresponding running training process.

It works for a simple batch Job and for a PyTorchJob.

I believe Ray also uses an SDK for early stopping: https://docs.ray.io/en/ray-0.4.0/hyperband.html#median-stopping-rule (a rough sketch follows).
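
A rough sketch of how the Median Stopping Rule is used in Ray Tune via its scheduler API (written against a later, Ray 1.x-style API, since the linked docs are for Ray 0.4.0; exact names may differ between versions):

from ray import tune
from ray.tune.schedulers import MedianStoppingRule


def trainable(config):
    accuracy = 0.0
    for step in range(10):
        accuracy += config["lr"]  # stand-in for a real training step
        tune.report(accuracy=accuracy)  # intermediate metric, like trial.report in Optuna


tune.run(
    trainable,
    config={"lr": tune.uniform(0.001, 0.1)},
    num_samples=20,
    scheduler=MedianStoppingRule(metric="accuracy", mode="max"),
)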

Do you have any other ideas or thoughts about Early Stopping in Katib, @gaocegege @johnugeorge @c-bata @sperlingxx?

@andreyvelich

Let's continue the discussion in #1330.
