katib metrics collector solution #685

Closed
5 of 7 tasks
hougangliu opened this issue Jul 12, 2019 · 24 comments

@hougangliu (Member Author) commented Jul 18, 2019:

Overview

The key ideas of this solution are:

  1. Inject a collector container as a sidecar into the Job/TFJob/PyTorchJob Pod. The collector container collects metrics from the worker container by parsing a file or querying HTTP (depending on the metrics source) and persists them into a persistent store, such as katib-db via katib-manager, or the metadata server;
  2. Introduce a TrialMetrics CRD to track the metrics of a trial in a Kubernetes-native way;
  3. Introduce a Katib-Metrics-Manager server to poll the metrics of running Trials from the metrics store and update the related TrialMetrics CRs. The Trial controller updates the trial's status based on the corresponding TrialMetrics CR.

The architecture is shown below:

[architecture diagram]

Data Structure

type ExperimentSpec struct {
	...
	MetricsCollectorSpec *MetricsCollectorSpec `json:"metricsCollectorSpec,omitempty"`
	...
}

type MetricsCollectorSpec struct {
	Source    *SourceSpec    `json:"source,omitempty"`
	Collector *CollectorSpec `json:"collector,omitempty"`
}

type SourceSpec struct {
	// The model training code may expose metrics over HTTP, e.g. an endpoint in Prometheus metric format.
	HttpGet *v1.HTTPGetAction `json:"httpGet,omitempty"`
	// During training, metrics may be persisted into a local file by the training code, e.g. a tfevent file.
	FileSystemPath *FileSystemPath `json:"fileSystemPath,omitempty"`

	Filter *FilterSpec `json:"filter,omitempty"`
}

type FilterSpec struct {
	// When the metrics output matches one of the formats specified here,
	// the metrics collector collects it and reports it to the metrics server.
	MetricsFormat []string `json:"metricsFormat,omitempty"`
}

type FileSystemKind string

const (
	DirectoryKind FileSystemKind = "directory"
	FileKind      FileSystemKind = "file"
)

type FileSystemPath struct {
	Path string         `json:"path,omitempty"`
	Kind FileSystemKind `json:"kind,omitempty"`
}

type CollectorKind string

const (
	StdOutCollector           CollectorKind = "stdOutCollector"
	FileCollector             CollectorKind = "fileCollector"
	TfEventCollector          CollectorKind = "tfEventCollector"
	PrometheusMetricCollector CollectorKind = "prometheusMetricCollector"
	CustomCollector           CollectorKind = "customCollector"
	NoneCollector             CollectorKind = "noneCollector"
)

type CollectorSpec struct {
	Kind            CollectorKind `json:"kind"`
	CustomCollector *v1.Container `json:"customCollector,omitempty"`
}

type TrialMetrics struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec   TrialMetricsSpec   `json:"spec,omitempty"`
	Status TrialMetricsStatus `json:"status,omitempty"`
}

type TrialMetricsSpec struct {
	MetricNames []string `json:"metricNames,omitempty"`
}

type TrialMetricsStatus struct {
	StartTime         *metav1.Time        `json:"startTime,omitempty"`
	CompletionTime    *metav1.Time        `json:"completionTime,omitempty"`
	LastReconcileTime *metav1.Time        `json:"lastReconcileTime,omitempty"`
	Observation       *common.Observation `json:"observation,omitempty"`
	Epoch             *int32              `json:"epoch,omitempty"`
	Step              *int32              `json:"step,omitempty"`
}
  1. For the collector kinds StdOutCollector, FileCollector, TfEventCollector, and PrometheusMetricCollector (low priority), Katib will maintain the collector implementations and build the related images. StdOutCollector and FileCollector will be similar to the current default metrics collector, and TfEventCollector stays the same as the current tfevent metrics collector. If the MetricsCollectorSpec field in the Experiment is nil, we treat it as StdOutCollector by default.
  2. Katib will install a katib-config ConfigMap that maps each collector kind (excluding customCollector) to an image. The experiment controller will inject the collector sidecar container into the TrialTemplate (it may be better for the trial controller to inject the collector sidecar container into the worker job).
  3. Share the metrics file/directory between the worker container and the collector container via an emptyDir volume.
  4. NoneCollector is used when the training code persists its metrics into the persistent store directly (maybe we can provide a Python SDK to persist metrics into katib-db, just like the metadata server will provide?).
  5. To support early stopping in Katib, users are required to output epoch and step values along with the metric values, so we need to refactor the observation_logs table as below; if epoch and step are missing, early stopping will not be supported for the related experiment. By default, metrics collectors will try to parse output like {"metric": "<choose_metric_name>", "value": <int_or_float>, "epoch": <int>, "step": <int>}, where "epoch" and "step" are optional if the training does not need early stopping (a sketch of such output follows this list).
    (trial_name VARCHAR(255) NOT NULL,
     id INT AUTO_INCREMENT PRIMARY KEY,
     time DATETIME(6),
     metric_name VARCHAR(255) NOT NULL,
     value TEXT NOT NULL,
     epoch INT,
     step INT,
     FOREIGN KEY (trial_name) REFERENCES trials(name) ON DELETE CASCADE)
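
As an illustration of the default output format mentioned in item 5, here is a minimal sketch (not part of the proposal itself) of a training program emitting metrics that the StdOutCollector could parse; the metric names and values are hypothetical:

package main

import (
	"encoding/json"
	"fmt"
)

// metricLine mirrors the default format described above:
// {"metric": "<name>", "value": <number>, "epoch": <int>, "step": <int>}.
type metricLine struct {
	Metric string  `json:"metric"`
	Value  float64 `json:"value"`
	Epoch  int     `json:"epoch,omitempty"`
	Step   int     `json:"step,omitempty"`
}

func reportMetric(name string, value float64, epoch, step int) {
	b, err := json.Marshal(metricLine{Metric: name, Value: value, Epoch: epoch, Step: step})
	if err != nil {
		return
	}
	fmt.Println(string(b)) // one JSON object per line on stdout
}

func main() {
	for epoch := 1; epoch <= 3; epoch++ {
		// ... one training epoch would run here ...
		reportMetric("loss", 1.0/float64(epoch), epoch, epoch*100)
		reportMetric("accuracy", 0.80+0.05*float64(epoch), epoch, epoch*100)
	}
}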

Example

  1. When the training code prints metrics to stdout, the experiment should look like below (if metricsCollectorSpec is nil, we treat it as this case by default):
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    collector:
      kind: stdOutCollector
  ...
  2. When the training code prints metrics into a specified file in the Katib default format, the experiment should look like below:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: file
        path: "/var/log/train.log"
    collector:
      kind: fileCollector
  ...
  3. When the training code prints metrics into a specified file without the Katib default format, the experiment should look like below (the user needs to implement a custom collector):
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: file
        path: "/var/log/train.log"
    collector:
      kind: customCollector
      customCollector:
        image: xx-repo/xx-user/user-custom-collector-image
        ...
  ...
  4. When the TensorFlow training code persists metrics as tfevent files:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: directory
        path: "/var/log/train/"
    collector:
      kind: tfEventCollector
  5. When the training code exposes its metrics over HTTP in Prometheus metric format:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      httpGet:
        path: /metrics
        port: 8080
    collector:
      kind: prometheusMetricCollector
  ...
  6. When the training code persists its metrics into the persistent store directly:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    collector:
      kind: noneCollector
  ...
  7. When the training code prints metrics into a specified file in a user-specified format, the experiment should look like below:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  ...
  metricsCollectorSpec:
    source:
      fileSystemPath:
        kind: file
        path: "/var/log/train.log"
      filter:
        metricsFormat:
        - "{{metricName}}:{{metricValue}}"
        - "{{metricName}} is {{metricValue}}"
        - "{{metricName}} = {{metricValue}}"
    collector:
      kind: fileCollector
  ...
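
The filter.metricsFormat entries in the last example suggest a simple template-to-regex translation. Below is a minimal sketch, purely an assumption about how a file collector might interpret such templates; formatToRegexp and the group patterns are illustrative, not an agreed API:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// formatToRegexp converts a filter template such as "{{metricName}} = {{metricValue}}"
// into a regexp with named capture groups. MustCompile panics on an invalid
// template, which is acceptable for a sketch.
func formatToRegexp(format string) *regexp.Regexp {
	escaped := regexp.QuoteMeta(format)
	escaped = strings.Replace(escaped, regexp.QuoteMeta("{{metricName}}"), `(?P<name>[\w-]+)`, 1)
	escaped = strings.Replace(escaped, regexp.QuoteMeta("{{metricValue}}"), `(?P<value>[-+]?\d*\.?\d+)`, 1)
	return regexp.MustCompile(escaped)
}

func main() {
	re := formatToRegexp("{{metricName}} = {{metricValue}}")
	m := re.FindStringSubmatch("accuracy = 0.93")
	fmt.Println(m[1], m[2]) // accuracy 0.93
}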

Implementation Detail

Considering that when a trial's worker is a PytorchJob/TfJob, there may be multiple Pods that run the model training task and output metrics, duplicate metrics may be generated for a trial. To avoid metrics duplication, the metricsCollector sidecar container collects metrics and reports them to katib-manager as follows:
1). For PytorchJob, only the metricsCollector sidecar container in the Master Pod takes effect;
2). For TfJob, if a Master exists, only the metricsCollector sidecar container in the Master Pod takes effect; otherwise only the metricsCollector sidecar container in the Worker Pod with index 0 takes effect.

Ideally the MetricsCollector sidecar container would be injected only into the corresponding Pod, as above, by a Pod-level MutatingWebhook. However, a Pod-level webhook degrades cluster performance because Pod operations are frequent, so we will instead inject the MetricsCollector sidecar container at the trial's job level by a TfJob/PytorchJob MutatingWebhook:

  1. For PytorchJob, inject the MetricsCollector sidecar container into the Master template;
  2. For TfJob, inject the MetricsCollector sidecar container into the Master template if a Master exists; otherwise inject the MetricsCollector sidecar container into the Worker template, and at runtime the MetricsCollector will do nothing unless the TF_CONFIG environment variable contains {'task': {'type': 'worker', 'index': 0}} (see the sketch below).
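
For reference, a minimal sketch (an assumption, not a settled implementation) of the TF_CONFIG check described in item 2 above; the exact activation rule would be fixed during implementation:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// tfConfig models the part of the TF_CONFIG environment variable we care about,
// e.g. {"task": {"type": "worker", "index": 0}}.
type tfConfig struct {
	Task struct {
		Type  string `json:"type"`
		Index int    `json:"index"`
	} `json:"task"`
}

// shouldCollect reports whether this replica's metricsCollector should be active:
// the master/chief replica, or worker 0 when there is no master.
func shouldCollect() bool {
	raw := os.Getenv("TF_CONFIG")
	if raw == "" {
		return true // single-worker Job: always collect
	}
	var cfg tfConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return true
	}
	if cfg.Task.Type == "master" || cfg.Task.Type == "chief" {
		return true
	}
	return cfg.Task.Type == "worker" && cfg.Task.Index == 0
}

func main() {
	fmt.Println("metrics collection active:", shouldCollect())
}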

Future

The metrics collector described above is useful not only in Katib (persisting metric values into the Katib data backend) but also in many other contexts, for example early stopping of training, training monitoring, training process visualization, etc.
So we can make the metrics collector more general. For example, TfJob/PytorchJob could also add a MetricsCollectorSpec field, as above, to their spec. To increase the portability of the metrics collector, we need to think more about its interface or API so that the following are configurable (a rough interface sketch follows the list):

  1. metrics source: what the metrics collector parses to get metrics; it may be a file, a directory, or an http/tcp endpoint
  2. metrics format set: the formats in which metric key/value pairs may appear in the metrics source
  3. metrics persistence server URL: the server where the metric key/value pairs will be persisted
  4. metrics expose endpoint: an http/tcp endpoint through which a third party can access the metrics
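
A rough Go sketch of what such a general interface could look like; this is only an assumption meant to make the configurable points above concrete, not an agreed API (the type and method names are hypothetical):

package collector

// Metric is a single observed metric value, optionally tagged with epoch/step
// for early stopping.
type Metric struct {
	Name  string
	Value string
	Epoch *int32
	Step  *int32
}

// Source abstracts the metrics source: a file, a directory, or an http/tcp endpoint.
type Source interface {
	// Read returns the raw metric lines currently available from the source.
	Read() ([]string, error)
}

// Parser turns raw lines into metrics according to the configured format set.
type Parser interface {
	Parse(lines []string) ([]Metric, error)
}

// Sink abstracts the persistence server (katib-manager, metadata server, ...).
type Sink interface {
	Persist(trialName string, metrics []Metric) error
}

// Exposer optionally serves the collected metrics on an http/tcp endpoint so
// third parties can read them.
type Exposer interface {
	Serve(addr string, metrics <-chan Metric) error
}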

@gaocegege (Member):

LGTM, generally. Let's discuss it today

@wuchunghsuan (Contributor):

For injecting the metrics collector sidecar, @gaocegege and I have another implementation:
The MutatingWebhook of the metrics collector can decide which Pod to inject the sidecar into based on labels, for example a label like kubeflow.org/replica-role: master. The labels are set by the job operators (e.g. TFJob/PyTorchJob). The sidecar is injected only if the webhook recognizes this label (a rough sketch follows below).

In this new design,

  1. It is more general, since the webhook treats every job the same and injects the sidecar based on labels.
  2. A slight modification in the job operators is needed: the operators will set the labels.
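
A minimal sketch of the label check this design implies; the label key/value follow the comment above and are not a settled convention:

package webhook

// replicaRoleLabel is the label the job operators would set on the replica
// pods, as suggested above.
const replicaRoleLabel = "kubeflow.org/replica-role"

// needsSidecar reports whether a pod with the given labels should get the
// metrics collector sidecar injected by the mutating webhook.
func needsSidecar(podLabels map[string]string) bool {
	return podLabels[replicaRoleLabel] == "master"
}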

@johnugeorge (Member):

Can you explain which component is referred to as the "MutatingWebHook of metrics collector"?

@gaocegege (Member):

@johnugeorge I think we should say Mutating Webhook Server of Katib

@gaocegege (Member):

Summary: in the worst case, we use a Pod-level webhook to inject the sidecar into the Pod. Since Kubernetes 1.15, webhooks support objectSelector, so we would no longer have the performance problem.

@hougangliu (Member Author) commented Jul 30, 2019:

In the Istio solution, if a namespace has the "istio-injection=enabled" label, all Pods in the namespace are injected by a Pod-level webhook, and the performance impact is small.

So I think we can handle it like this (a rough sketch follows the list):

  1. If a namespace has the "katib-metrics-collector-injection=enabled" label, we inject the sidecar container with the Pod-level webhook.
  2. If a namespace does not have the "katib-metrics-collector-injection=enabled" label, we inject the sidecar container with the job/tfjob/pytorchjob-level webhook.
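
A minimal sketch of this two-level decision, assuming the label name above; the types and helper below are illustrative only:

package webhook

// injectionLabel is the namespace label proposed above.
const injectionLabel = "katib-metrics-collector-injection"

// InjectionLevel says which webhook injects the metrics collector sidecar.
type InjectionLevel string

const (
	PodLevel InjectionLevel = "pod" // pod-level webhook injects the sidecar
	JobLevel InjectionLevel = "job" // job/tfjob/pytorchjob-level webhook injects it
)

// injectionLevelFor chooses the injection level based on the namespace labels.
func injectionLevelFor(namespaceLabels map[string]string) InjectionLevel {
	if namespaceLabels[injectionLabel] == "enabled" {
		return PodLevel
	}
	return JobLevel
}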

@gaocegege (Member):

Yeah, I think so. In the current version, it is the best way I think.

@hougangliu (Member Author) commented Jul 30, 2019:

Then @wuchunghsuan, please update the sidecar injection solution in the design doc to cover the two levels above. Thanks.

@wuchunghsuan (Contributor):

OK, I got it.

@hougangliu (Member Author):

Per a talk on Slack, in this solution we just focus on the metricsCollector:

  1. The metricsCollector will collect and persist metrics into the persistence layer (for now, katib-manager).
  2. When a worker completes, the trial controller will query the DB to fill the final metric key/values back into the trial status.

So periodic metrics will be discussed in the early-stopping topic, and TrialMetrics is also dropped here.

@gaocegege (Member):

Pod-level webhook for sidecar

Can you take this task, @wuchunghsuan?

@gaocegege (Member):

We can talk to the apiserver, but there are some problems:

  • When a sidecar is injected, the training Job cannot finish
  • How to control the retry behaviour
    • In-container
    • Using the K8s mechanism

@jlewi (Contributor) commented Aug 26, 2019:

Hi folks, what are you planning on delivering in 0.7?

@hougangliu (Member Author):

Hi folks, what are you planning on delivering in 0.7?

@jlewi I opened #728 to track what we plan to deliver in 0.7.

@hougangliu (Member Author) commented Sep 11, 2019:

In this solution, we add a metricsCollector sidecar container into the worker Pod. The metricsCollector sidecar container needs to collect the metrics, and it also needs to know when the worker container has finished so that it can exit too; otherwise the Pod will stay Running because the metricsCollector sidecar container is still running.
So we need to notify the metricsCollector sidecar container when to exit. We could implement this with one of the solutions below:

  1. Shared process namespace: I like this solution, and Argo also supports this option to watch other containers in the same Pod, but shared process namespace only graduated to beta in Kubernetes 1.12, and we should also consider that Katib may run on Kubernetes versions older than 1.12 (a sketch of this option appears at the end of this comment).
  2. Kubernetes API to get/watch the Pod: call the Kubernetes client API to watch the worker Pod and read the other containers' status from the Pod status's containerStatuses field.
    2.1: If the metricsCollector itself watches/gets the Pod, the worker Pod needs a role that allows Pod watch/get. Since the experiment can be created in any namespace, we must create a serviceAccount with that role in the corresponding namespace (katib-controller would need to getOrCreateServiceAccount in the namespace and set it as the serviceaccount of the Pod while injecting the metrics sidecar container; if the original Pod already has a serviceaccount with a non-default service-account-token, this case is hard to handle since only one service-account-token can be applied in the container).
    2.2: We need a long-running service (maybe katib-controller or a new service) with the watch/get Pod role. It exposes a REST/gRPC API that accepts arguments (metrics-collector-client-side, podNamespacedName). When a worker Pod starts, the metrics-collector container calls the REST/gRPC API with arguments like ($podIP:8000, kubeflow/random-experiment-xztf8nph-master-0), then katib-controller starts a goroutine to watch the kubeflow/random-experiment-xztf8nph-master-0 Pod. Once it finds that any container other than the metricsCollector has terminated, it sends a message to $podIP:8000 (the metricsCollector listens on this port), and when the metricsCollector gets this message, it exits with 0. For this solution, the long-running service (maybe katib-controller) may have many goroutines watching Pods, which consumes a lot of resources.
  3. Wrap the commands of the containers in the Pod except the metricsCollector container: this solution is used by the Tekton project. While injecting the metricsCollector sidecar container, we also wrap the commands of the other containers in the Pod; the wrap function is very simple, as below, and the metricsCollector waits for the signal generated by the wrap function. We can also easily redirect the stdout of the worker container to a file, so the Stdout and File metricsCollectors can share the same logic.
run the original command; exitcode=$?
send an end message to 127.0.0.1:8000 or generate a shared file
exit $exitcode

@richardsliu @johnugeorge @gaocegege IMO, maybe we have to implement solution 3, but I need your suggestions, too.
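
For completeness, a minimal sketch of option 1 referenced above: with shareProcessNamespace enabled on the Pod, the metrics collector can see the worker's processes under /proc and exit once they are gone. The command-line marker used for matching is hypothetical:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// workerRunning scans /proc (shared across containers when
// shareProcessNamespace is true) for a process whose command line contains
// the given marker, e.g. the training entrypoint.
func workerRunning(marker string) bool {
	cmdlines, _ := filepath.Glob("/proc/[0-9]*/cmdline")
	for _, p := range cmdlines {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		if strings.Contains(strings.ReplaceAll(string(data), "\x00", " "), marker) {
			return true
		}
	}
	return false
}

func main() {
	for workerRunning("python train.py") {
		time.Sleep(5 * time.Second)
	}
	fmt.Println("worker process gone, metrics collector exiting")
}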

@gaocegege (Member):

Thanks for the research!

we should also consider that katib runs on k8s version older than 1.12

I am not sure about it.

Personally, I prefer the first option.

@hougangliu (Member Author):

Per a talk with the team, we decided to choose the shared process namespace solution to solve this problem.

@johnugeorge (Member):

Option 1 seems to be the best option, as it has no other dependency and is specifically meant for this purpose.

@jlewi (Contributor) commented Nov 7, 2019:

@hougangliu @johnugeorge any update on this?

@hougangliu (Member Author):

@hougangliu @johnugeorge any update on this?

@jlewi For now, the metrics collector works well via Pod-level metricsCollector sidecar container injection (also included in the 0.7.0 release).
Considering the potential performance impact, we plan to also support injecting the metricsCollector container at the Job/TfJob/PytorchJob level in the next release (to be discussed in the Katib weekly meeting).

@gaocegege (Member):

I think we can safely remove the 0.7 label and add 1.0, since the high-priority features are implemented now.

@jlewi (Contributor) commented Nov 23, 2019:

@gaocegege and @hougangliu should we file more fine grained issues for the remaining work and then close this issue?

@hougangliu (Member Author):

@jlewi #929 and #928 track the two remaining issues.
