katib metrics collector solution #685

Closed
5 of 7 tasks
hougangliu opened this issue Jul 12, 2019 · 24 comments

@hougangliu (Member Author) commented Jul 18, 2019:

Overview

The key ideas of this solution are:

  1. Inject a collector container as a sidecar into the Job/TFJob/PyTorchJob Pod. The collector container collects metrics from the worker container by parsing a file or querying HTTP (depending on the metrics source) and persists them into a persistent store, such as katib-db via katib-manager, or the metadata server;
  2. Introduce a TrialMetrics CRD to track the metrics of a trial in a Kubernetes-native way;
  3. Introduce a Katib-Metrics-Manager server to poll the metrics of running Trials from the metrics store and update the related TrialMetrics CRs. The Trial controller updates the trial's status based on the corresponding TrialMetrics CR.

The architecture is shown below:

[architecture diagram]

Data Structure

type ExperimentSpec struct {
	...
	MetricsCollectorSpec *MetricsCollectorSpec `json:"metricsCollectorSpec,omitempty"`
	...
}

type MetricsCollectorSpec struct {
	Source    *SourceSpec    `json:"source,omitempty"`
	Collector *CollectorSpec `json:"collector,omitempty"`
}

type SourceSpec struct {
	// The model training code may expose metrics over HTTP, e.g. an endpoint in Prometheus metric format.
	HttpGet *v1.HTTPGetAction `json:"httpGet,omitempty"`
	// During training, metrics may be persisted into a local file by the training code, e.g. a tfevent file.
	FileSystemPath *FileSystemPath `json:"fileSystemPath,omitempty"`

	Filter *FilterSpec `json:"filter,omitempty"`
}

type FilterSpec struct {
	// When the metrics output matches one of the formats specified here,
	// the metrics collector collects it and reports it to the metrics server.
	MetricsFormat []string `json:"metricsFormat,omitempty"`
}

type FileSystemKind string

const (
	DirectoryKind FileSystemKind = "directory"
	FileKind      FileSystemKind = "file"
)

type FileSystemPath struct {
	Path string         `json:"path,omitempty"`
	Kind FileSystemKind `json:"kind,omitempty"`
}

type CollectorKind string

const (
	StdOutCollector           CollectorKind = "stdOutCollector"
	FileCollector             CollectorKind = "fileCollector"
	TfEventCollector          CollectorKind = "tfEventCollector"
	PrometheusMetricCollector CollectorKind = "prometheusMetricCollector"
	CustomCollector           CollectorKind = "customCollector"
	NoneCollector             CollectorKind = "noneCollector"
)

type CollectorSpec struct {
	Kind            CollectorKind `json:"kind"`
	CustomCollector *v1.Container `json:"customCollector,omitempty"`
}

type TrialMetrics struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec   TrialMetricsSpec   `json:"spec,omitempty"`
	Status TrialMetricsStatus `json:"status,omitempty"`
}

type TrialMetricsSpec struct {
	MetricNames []string `json:"metricNames,omitempty"`
}

type TrialMetricsStatus struct {
	StartTime         *metav1.Time        `json:"startTime,omitempty"`
	CompletionTime    *metav1.Time        `json:"completionTime,omitempty"`
	LastReconcileTime *metav1.Time        `json:"lastReconcileTime,omitempty"`
	Observation       *common.Observation `json:"observation,omitempty"`
	Epoch             *int32              `json:"epoch,omitempty"`
	Step              *int32              `json:"step,omitempty"`
}
  1. For the collector kinds StdOutCollector, FileCollector, TfEventCollector, and PrometheusMetricCollector (low priority), Katib will maintain the collector implementations and build the related images. StdOutCollector and FileCollector will be similar to the current default metrics collector, and TfEventCollector stays the same as the current tfevent metrics collector. If the MetricsCollectorSpec field in the Experiment is nil, we treat it as StdOutCollector by default.
  2. Katib will install a katib-config ConfigMap that maps each collector kind (excluding customCollector) to an image. The experiment controller will inject the collector sidecar container into the TrialTemplate (it may be better for the trial controller to inject the collector sidecar container into the worker job).
  3. Share the metrics file/directory between the worker container and the collector container via an emptyDir volume.
  4. NoneCollector is used when the training code persists its metrics into the persistent store directly (maybe we can provide a Python SDK to persist metrics into katib-db, just like the metadata server will provide?).
  5. To support early stopping in Katib, users are required to output epoch and step values along with the metric values, so we need to refactor the observation_logs table as below; if epoch and step are missing, early stopping will not be supported for the related experiment. By default, metrics collectors will try to parse output like {"metric": "<choose_metric_name>", "value": <int_or_float>, "epoch": <int>, "step": <int>}, where "epoch" and "step" are optional if the training does not need early stopping (a sketch of such output follows this list).
    (trial_name VARCHAR(255) NOT NULL,
     id INT AUTO_INCREMENT PRIMARY KEY,
     time DATETIME(6),
     metric_name VARCHAR(255) NOT NULL,
     value TEXT NOT NULL,
     epoch INT,
     step INT,
     FOREIGN KEY (trial_name) REFERENCES trials(name) ON DELETE CASCADE)
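
As an illustration of the default output format mentioned in item 5, here is a minimal sketch (not part of the proposal itself) of a training program emitting metrics that the StdOutCollector could parse; the metric names and values are hypothetical:

package main

import (
	"encoding/json"
	"fmt"
)

// metricLine mirrors the default format described above:
// {"metric": "<name>", "value": <number>, "epoch": <int>, "step": <int>}.
type metricLine struct {
	Metric string  `json:"metric"`
	Value  float64 `json:"value"`
	Epoch  int     `json:"epoch,omitempty"`
	Step   int     `json:"step,omitempty"`
}

func reportMetric(name string, value float64, epoch, step int) {
	b, err := json.Marshal(metricLine{Metric: name, Value: value, Epoch: epoch, Step: step})
	if err != nil {
		return
	}
	fmt.Println(string(b)) // one JSON object per line on stdout
}

func main() {
	for epoch := 1; epoch <= 3; epoch++ {
		// ... one training epoch would run here ...
		reportMetric("loss", 1.0/float64(epoch), epoch, epoch*100)
		reportMetric("accuracy", 0.80+0.05*float64(epoch), epoch, epoch*100)
	}
}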

Example

  1. When the training code prints metrics to stdout, the experiment should look like below (if metricsCollectorSpec is nil, we treat it as this case by default):
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    collector:
      kind: stdOutCollector
  ...
  2. When the training code prints metrics into a specified file in the Katib default format, the experiment should look like below:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: file
        path: "/var/log/train.log"
    collector:
      kind: fileCollector
  ...
  3. When the training code prints metrics into a specified file without the Katib default format, the experiment should look like below (the user needs to implement a custom collector):
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: file
        path: "/var/log/train.log"
    collector:
      kind: customCollector
      customCollector:
        image: xx-repo/xx-user/user-custom-collector-image
        ...
  ...
  4. When the TensorFlow training code persists metrics as tfevent files:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: directory
        path: "/var/log/train/"
    collector:
      kind: tfEventCollector
  5. When the training code exposes its metrics over HTTP in Prometheus metric format:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      httpGet:
        path: /metrics
        port: 8080
    collector:
      kind: prometheusMetricCollector
  ...
  6. When the training code persists its metrics into the persistent store directly:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    collector:
      kind: noneCollector
  ...
  7. When the training code prints metrics into a specified file in a user-specified format, the experiment should look like below:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: file
        path: "/var/log/train.log"
      filter:
        metricsFormat:
        - "{{metricName}}:{{metricValue}}"
        - "{{metricName}} is {{metricValue}}"
        - "{{metricName}} = {{metricValue}}"
    collector:
      kind: fileCollector
  ...
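
The filter.metricsFormat entries in the last example suggest a simple template-to-regex translation. Below is a minimal sketch, purely an assumption about how a file collector might interpret such templates; formatToRegexp and the group patterns are illustrative, not an agreed API:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// formatToRegexp converts a filter template such as "{{metricName}} = {{metricValue}}"
// into a regexp with named capture groups. MustCompile panics on an invalid
// template, which is acceptable for a sketch.
func formatToRegexp(format string) *regexp.Regexp {
	escaped := regexp.QuoteMeta(format)
	escaped = strings.Replace(escaped, regexp.QuoteMeta("{{metricName}}"), `(?P<name>[\w-]+)`, 1)
	escaped = strings.Replace(escaped, regexp.QuoteMeta("{{metricValue}}"), `(?P<value>[-+]?\d*\.?\d+)`, 1)
	return regexp.MustCompile(escaped)
}

func main() {
	re := formatToRegexp("{{metricName}} = {{metricValue}}")
	m := re.FindStringSubmatch("accuracy = 0.93")
	fmt.Println(m[1], m[2]) // accuracy 0.93
}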

Implementation Detail

Considering that when a trial's worker is a PytorchJob/TfJob, there may be multiple Pods that run the model training task and output metrics, duplicate metrics may be generated for a trial. To avoid metrics duplication, the metricsCollector sidecar container collects metrics and reports them to katib-manager as follows:
1). For PytorchJob, only the metricsCollector sidecar container in the Master Pod takes effect;
2). For TfJob, if a Master exists, only the metricsCollector sidecar container in the Master Pod takes effect; otherwise only the metricsCollector sidecar container in the Worker Pod with index 0 takes effect.

Ideally the MetricsCollector sidecar container would be injected only into the corresponding Pod, as above, by a Pod-level MutatingWebhook. However, a Pod-level webhook degrades cluster performance because Pod operations are frequent, so we will instead inject the MetricsCollector sidecar container at the trial's job level by a TfJob/PytorchJob MutatingWebhook:

  1. For PytorchJob, inject the MetricsCollector sidecar container into the Master template;
  2. For TfJob, inject the MetricsCollector sidecar container into the Master template if a Master exists; otherwise inject the MetricsCollector sidecar container into the Worker template, and at runtime the MetricsCollector will do nothing unless the TF_CONFIG environment variable contains {'task': {'type': 'worker', 'index': 0}} (see the sketch below).
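
For reference, a minimal sketch (an assumption, not a settled implementation) of the TF_CONFIG check described in item 2 above; the exact activation rule would be fixed during implementation:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// tfConfig models the part of the TF_CONFIG environment variable we care about,
// e.g. {"task": {"type": "worker", "index": 0}}.
type tfConfig struct {
	Task struct {
		Type  string `json:"type"`
		Index int    `json:"index"`
	} `json:"task"`
}

// shouldCollect reports whether this replica's metricsCollector should be active:
// the master/chief replica, or worker 0 when there is no master.
func shouldCollect() bool {
	raw := os.Getenv("TF_CONFIG")
	if raw == "" {
		return true // single-worker Job: always collect
	}
	var cfg tfConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return true
	}
	if cfg.Task.Type == "master" || cfg.Task.Type == "chief" {
		return true
	}
	return cfg.Task.Type == "worker" && cfg.Task.Index == 0
}

func main() {
	fmt.Println("metrics collection active:", shouldCollect())
}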

Future

The metrics collector described above is useful not only in Katib (persisting metric values into the Katib data backend) but also in many other contexts, for example early stopping of training, training monitoring, training process visualization, etc.
So we can make the metrics collector more general. For example, TfJob/PytorchJob could also add a MetricsCollectorSpec field, as above, to their spec. To increase the portability of the metrics collector, we need to think more about its interface or API so that the following are configurable (a rough interface sketch follows the list):

  1. metrics source: what the metrics collector parses to get metrics; it may be a file, a directory, or an http/tcp endpoint
  2. metrics format set: the formats in which metric key/value pairs may appear in the metrics source
  3. metrics persistence server URL: the server where the metric key/value pairs will be persisted
  4. metrics expose endpoint: an http/tcp endpoint through which a third party can access the metrics
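
A rough Go sketch of what such a general interface could look like; this is only an assumption meant to make the configurable points above concrete, not an agreed API (the type and method names are hypothetical):

package collector

// Metric is a single observed metric value, optionally tagged with epoch/step
// for early stopping.
type Metric struct {
	Name  string
	Value string
	Epoch *int32
	Step  *int32
}

// Source abstracts the metrics source: a file, a directory, or an http/tcp endpoint.
type Source interface {
	// Read returns the raw metric lines currently available from the source.
	Read() ([]string, error)
}

// Parser turns raw lines into metrics according to the configured format set.
type Parser interface {
	Parse(lines []string) ([]Metric, error)
}

// Sink abstracts the persistence server (katib-manager, metadata server, ...).
type Sink interface {
	Persist(trialName string, metrics []Metric) error
}

// Exposer optionally serves the collected metrics on an http/tcp endpoint so
// third parties can read them.
type Exposer interface {
	Serve(addr string, metrics <-chan Metric) error
}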

@gaocegege (Member):

LGTM, generally. Let's discuss it today

@wuchunghsuan (Contributor):

For injecting the metrics collector sidecar, @gaocegege and I have another implementation:
The MutatingWebhook of the metrics collector can decide which Pod to inject the sidecar into based on labels, for example a label like kubeflow.org/replica-role: master. The labels are set by the job operators (e.g. TFJob/PyTorchJob). The sidecar is injected only if the webhook recognizes this label (a rough sketch follows below).

In this new design,

  1. It is more general, since the webhook treats every job the same and injects the sidecar based on labels.
  2. A slight modification in the job operators is needed: the operators will set the labels.
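
A minimal sketch of the label check this design implies; the label key/value follow the comment above and are not a settled convention:

package webhook

// replicaRoleLabel is the label the job operators would set on the replica
// pods, as suggested above.
const replicaRoleLabel = "kubeflow.org/replica-role"

// needsSidecar reports whether a pod with the given labels should get the
// metrics collector sidecar injected by the mutating webhook.
func needsSidecar(podLabels map[string]string) bool {
	return podLabels[replicaRoleLabel] == "master"
}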

@johnugeorge (Member):

Can you explain which component is referred to as the "MutatingWebHook of metrics collector"?

@gaocegege (Member):

@johnugeorge I think we should say Mutating Webhook Server of Katib

@gaocegege (Member):

Summary: in the worst case, we use a Pod-level webhook to inject the sidecar into the Pod. Since Kubernetes 1.15, webhooks support objectSelector, so we would no longer have the performance problem.

@hougangliu (Member Author) commented Jul 30, 2019:

In the Istio solution, if a namespace has the "istio-injection=enabled" label, all Pods in the namespace are injected by a Pod-level webhook, and the performance impact is small.

So I think we can handle it like this (a rough sketch follows the list):

  1. If a namespace has the "katib-metrics-collector-injection=enabled" label, we inject the sidecar container with the Pod-level webhook.
  2. If a namespace does not have the "katib-metrics-collector-injection=enabled" label, we inject the sidecar container with the job/tfjob/pytorchjob-level webhook.
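
A minimal sketch of this two-level decision, assuming the label name above; the types and helper below are illustrative only:

package webhook

// injectionLabel is the namespace label proposed above.
const injectionLabel = "katib-metrics-collector-injection"

// InjectionLevel says which webhook injects the metrics collector sidecar.
type InjectionLevel string

const (
	PodLevel InjectionLevel = "pod" // pod-level webhook injects the sidecar
	JobLevel InjectionLevel = "job" // job/tfjob/pytorchjob-level webhook injects it
)

// injectionLevelFor chooses the injection level based on the namespace labels.
func injectionLevelFor(namespaceLabels map[string]string) InjectionLevel {
	if namespaceLabels[injectionLabel] == "enabled" {
		return PodLevel
	}
	return JobLevel
}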

@gaocegege (Member):

Yeah, I think so. In the current version, it is the best way I think.

@hougangliu (Member Author) commented Jul 30, 2019:

Then @wuchunghsuan, please update the sidecar injection solution in the design doc to cover the two levels above. Thanks.

@wuchunghsuan (Contributor):

OK, I got it.

@hougangliu (Member Author):

Per a talk on Slack, in this solution we just focus on the metricsCollector:

  1. The metricsCollector will collect and persist metrics into the persistence layer (for now, katib-manager).
  2. When a worker completes, the trial controller will query the DB to fill the final metric key/values back into the trial status.

So periodic metrics will be discussed in the early-stopping topic, and TrialMetrics is also dropped here.

@gaocegege (Member):

Pod-level webhook for sidecar

Can you take this task, @wuchunghsuan?

@gaocegege (Member):

We can talk to the apiserver, but there are some problems:

  • When a sidecar is injected, the training Job cannot finish
  • How to control the retry behaviour
    • In-container
    • Using the K8s mechanism

@jlewi (Contributor) commented Aug 26, 2019:

Hi folks, what are you planning on delivering in 0.7?

@hougangliu (Member Author):

Hi folks, what are you planning on delivering in 0.7?

@jlewi I opened #728 to track what we plan to deliver in 0.7.

@hougangliu (Member Author) commented Sep 11, 2019:

In this solution, we add a metricsCollector sidecar container into the worker Pod. The metricsCollector sidecar container needs to collect the metrics, and it also needs to know when the worker container has finished so that it can exit too; otherwise the Pod will stay Running because the metricsCollector sidecar container is still running.
So we need to notify the metricsCollector sidecar container when to exit. We could implement this with one of the solutions below:

  1. Shared process namespace: I like this solution, and Argo also supports this option to watch other containers in the same Pod, but shared process namespace only graduated to beta in Kubernetes 1.12, and we should also consider that Katib may run on Kubernetes versions older than 1.12 (a sketch of this option appears at the end of this comment).
  2. Kubernetes API to get/watch the Pod: call the Kubernetes client API to watch the worker Pod and read the other containers' status from the Pod status's containerStatuses field.
    2.1: If the metricsCollector itself watches/gets the Pod, the worker Pod needs a role that allows Pod watch/get. Since the experiment can be created in any namespace, we must create a serviceAccount with that role in the corresponding namespace (katib-controller would need to getOrCreateServiceAccount in the namespace and set it as the serviceaccount of the Pod while injecting the metrics sidecar container; if the original Pod already has a serviceaccount with a non-default service-account-token, this case is hard to handle since only one service-account-token can be applied in the container).
    2.2: We need a long-running service (maybe katib-controller or a new service) with the watch/get Pod role. It exposes a REST/gRPC API that accepts arguments (metrics-collector-client-side, podNamespacedName). When a worker Pod starts, the metrics-collector container calls the REST/gRPC API with arguments like ($podIP:8000, kubeflow/random-experiment-xztf8nph-master-0), then katib-controller starts a goroutine to watch the kubeflow/random-experiment-xztf8nph-master-0 Pod. Once it finds that any container other than the metricsCollector has terminated, it sends a message to $podIP:8000 (the metricsCollector listens on this port), and when the metricsCollector gets this message, it exits with 0. For this solution, the long-running service (maybe katib-controller) may have many goroutines watching Pods, which consumes a lot of resources.
  3. Wrap the commands of the containers in the Pod except the metricsCollector container: this solution is used by the Tekton project. While injecting the metricsCollector sidecar container, we also wrap the commands of the other containers in the Pod; the wrap function is very simple, as below, and the metricsCollector waits for the signal generated by the wrap function. We can also easily redirect the stdout of the worker container to a file, so the Stdout and File metricsCollectors can share the same logic.
run the original command; exitcode=$?
send an end message to 127.0.0.1:8000 or generate a shared file
exit $exitcode

@richardsliu @johnugeorge @gaocegege IMO, maybe we have to implement solution 3, but I need your suggestions, too.
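
For completeness, a minimal sketch of option 1 referenced above: with shareProcessNamespace enabled on the Pod, the metrics collector can see the worker's processes under /proc and exit once they are gone. The command-line marker used for matching is hypothetical:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// workerRunning scans /proc (shared across containers when
// shareProcessNamespace is true) for a process whose command line contains
// the given marker, e.g. the training entrypoint.
func workerRunning(marker string) bool {
	cmdlines, _ := filepath.Glob("/proc/[0-9]*/cmdline")
	for _, p := range cmdlines {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		if strings.Contains(strings.ReplaceAll(string(data), "\x00", " "), marker) {
			return true
		}
	}
	return false
}

func main() {
	for workerRunning("python train.py") {
		time.Sleep(5 * time.Second)
	}
	fmt.Println("worker process gone, metrics collector exiting")
}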

@gaocegege (Member):

Thanks for the research!

we should also consider that katib runs on k8s version older than 1.12

I am not sure about it.

Personally, I prefer the first option.

@hougangliu (Member Author):

Per a talk with the team, we decided to choose the shared process namespace solution to solve this problem.

@johnugeorge (Member):

Option 1 seems to be the best option, as it has no other dependency and is specifically meant for this purpose.

@jlewi (Contributor) commented Nov 7, 2019:

@hougangliu @johnugeorge any update on this?

@hougangliu (Member Author):

@hougangliu @johnugeorge any update on this?

@jlewi For now, the metrics collector works well via Pod-level metricsCollector sidecar container injection (also included in the 0.7.0 release).
Considering the potential performance impact, we plan to also support injecting the metricsCollector container at the Job/TfJob/PytorchJob level in the next release (to be discussed in the Katib weekly meeting).

@gaocegege (Member):

I think we can safely remove the 0.7 label and add 1.0, since the high-priority features are implemented now.

@jlewi (Contributor) commented Nov 23, 2019:

@gaocegege and @hougangliu should we file more fine grained issues for the remaining work and then close this issue?

@hougangliu (Member Author):

@jlewi #929 and #928 track the two remaining issues.
