
[pipeline-ui] Retrieve pod logs from argo archive #2081

Merged

Conversation

@eterna2 (Contributor) commented Sep 10, 2019

A possible workaround for #1803

Currently, I persist pod logs to an S3 bucket with Argo's archiveLogs config.

This PR allows the UI to fall back to retrieving GCed pod logs from the Argo archive artifactory, which can be either a MinIO store or an S3 bucket.

This feature can be enabled by setting the following env variables:

  /** Is Argo log archive enabled? */
  ARGO_ARCHIVE_LOGS = "true",
  /** Use minio or s3 client to retrieve archives. */
  ARGO_ARCHIVE_ARTIFACTORY = 'minio',
  /** Bucket to retrieve logs from */
  ARGO_ARCHIVE_BUCKETNAME = 'mlpipeline',
  /** Prefix to logs. */
  ARGO_ARCHIVE_PREFIX = 'logs',
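
For illustration, a hedged sketch of how the server side might read these variables (the _as_bool helper appears later in a diff excerpt below; the defaults here are assumptions):

  const {
    ARGO_ARCHIVE_LOGS = 'false',
    ARGO_ARCHIVE_ARTIFACTORY = 'minio',
    ARGO_ARCHIVE_BUCKETNAME = 'mlpipeline',
    ARGO_ARCHIVE_PREFIX = 'logs',
  } = process.env;

  /** Interpret common truthy strings from the environment. */
  function _as_bool(value: string): boolean {
    return ['true', '1'].includes(value.trim().toLowerCase());
  }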

Update, 20 Sep 2019:

Investigated the feasibility of getting the pod log archive location from the Argo workflow status:

  • Conclusion: yes, but if the pipeline did not complete successfully, the status is not updated with the output information (even if the logs are actually archived)
Changes
  • Updated @types/minio to match the minio package (useSSL replaces the old insecure)

  • Created a few additional helpers modules:

    • aws-helper provides utils to query and handle AWS instance profile session credentials (so kube2iam or EC2 instance profiles can be used instead of providing an access key and secret to the minio client)
    • workflow-helper provides utils to retrieve pod log archive info from the Argo workflow status
    • minio-helper provides utils to retrieve objects from minio or s3 backends, including the capability to use AWS EC2 credentials to access s3
  • Replaced the existing handlers for s3 and minio objects with utils from minio-helper

  • Updated the get-pod-logs handler to try retrieving the logs in the following order:

    • get the logs from the pod with the k8s API
    • get the archive location from the workflow status, and retrieve the logs from the archive
    • if the workflow status does not have the required info, fall back to the values provided (by the user) in the environment variables and try to retrieve the logs
  • I have built the image and tested it on my own cluster in AWS. However, I only tested the pod logs; I have not really tested the ui-metadata yet.

TODO

  • probably should use the schema from third-party/argo
  • some unit tests?


@k8s-ci-robot (Contributor) commented:

Hi @eterna2. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Review thread on the following diff excerpt:

/** helper function to retrieve pod logs from argo artifactory. */
const getPodLogsFromArtifactory = _as_bool(ARGO_ARCHIVE_LOGS) ? k8sHelper.getPodLogsFromArtifactoryHelper(
ARGO_ARCHIVE_ARTIFACTORY==='minio' ? minioClient : s3Client,
ARGO_ARCHIVE_BUCKETNAME,
Contributor:

Can the actual log location URL be taken from the Workflow status object?

Contributor Author:

Good idea.
Just a question on which client to use.

The workflow status does provide the output as well as the artifactory configs.

Do you think it is better to infer which pre-created client to use, or to retrieve the secrets and create a new client each time?

In the latter case, should we do it for artifact retrieval as well, i.e. skip the minio configuration outright?

Contributor:

> Just a question on which client to use.

AFAIK, in the Frontend the s3Client is actually an instance of minioClient. There is some weird logic where s3 reads the data uncompressed while minio expects tar.gz.

Contributor:

Generally, I feel that the UI handling of artifacts could use some improvement. If you see some ways to streamline and improve that part, I hope @jingzhang36 and the other people working on UX will be glad to see that.

Contributor:

@eterna2 Do you think you can take this on later: #2172?

Contributor Author:

Sure.

@eterna2 (Contributor Author) commented Sep 12, 2019

My other issue is IAM roles for minio-js. Unlike minio-go, minio-js does not support IAM roles. I made a PR to add this capability to minio-js, but they are asking for a relatively big rewrite (which I don't think I have time for) instead of the monkey patch I provided.

I am wondering if you are open to me adding a short routine here, outside of minio-js, to support this (retrieving and refreshing temporary AWS IAM credentials).

Because of this limitation, our current deployment relies on minio-gateway as a proxy to our s3 buckets. The crux of the issue is that in some places in kf we use s3://, while for the UI specifically we have to use minio://.

This is quite confusing for our data scientists.

@Ark-kun (Contributor) commented Sep 17, 2019

/ok-to-test

@Ark-kun (Contributor) commented Sep 17, 2019

/retest


@eterna2 (Contributor Author) commented Sep 19, 2019

@Ark-kun
Made a big change to the code base.
Basically, I added a few new features:

Pod logs

The workflow status provides the archive information only if the job completes without errors, so we might still need the user-provided env variables as an optional fallback.

So the logic now is:

  • query k8s for the logs
  • if that fails, query for the workflow, try to extract the log location, create the required client, and get the logs
  • else, try to retrieve the logs based on the info provided in the env variables
  • if all else fails, return an error message
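
A simplified sketch of this fallback chain; the three fetcher arguments are hypothetical stand-ins for the PR's actual helpers:

  type LogFetcher = (podName: string) => Promise<string>;

  /** Try each log source in order; if every source fails, an error surfaces. */
  async function getPodLogsWithFallback(
    podName: string,
    fromK8s: LogFetcher,       // 1. live pod via the k8s API
    fromWorkflow: LogFetcher,  // 2. archive location parsed from the workflow status
    fromEnv: LogFetcher,       // 3. user-provided ARGO_ARCHIVE_* configuration
  ): Promise<string> {
    for (const fetch of [fromK8s, fromWorkflow, fromEnv]) {
      try {
        return await fetch(podName);
      } catch {
        // fall through to the next source
      }
    }
    throw new Error(`unable to retrieve logs for pod ${podName} from any source`);
  }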

AWS instance profile

A wrapper over minio-js to mimic the behavior of minio-go, so that pipeline-ui can retrieve objects from s3 without an access key/secret and instead use the EC2 instance profile (or kube2iam credentials).

This means the IAM role of the pod/EC2 instance running pipeline-ui can be used to interact with s3 instead of an access key/secret.
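
For context, a minimal sketch of how such temporary credentials can be fetched from the EC2 instance metadata service (IMDSv1-style; an illustration, not the PR's actual aws-helper):

  import http from 'http';

  interface TempCredentials {
    AccessKeyId: string;
    SecretAccessKey: string;
    Token: string;
    Expiration: string;
  }

  /** GET a metadata URL and resolve with the response body. */
  function get(url: string): Promise<string> {
    return new Promise((resolve, reject) => {
      http
        .get(url, res => {
          let body = '';
          res.on('data', chunk => (body += chunk));
          res.on('end', () => resolve(body));
        })
        .on('error', reject);
    });
  }

  /** Fetch temporary credentials for the instance's attached IAM role. */
  async function getInstanceProfileCredentials(): Promise<TempCredentials> {
    const base = 'http://169.254.169.254/latest/meta-data/iam/security-credentials/';
    const role = await get(base);              // the attached role name
    return JSON.parse(await get(base + role)); // key, secret, session token, expiry
  }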

K8s role/service account

Updated the manifest for the UI role so it can get secrets as well as workflows.

Review thread on the following diff excerpt:

});
case 'minio':
try {
res.send(await getTarObjectAsString({bucket, key, client: minioClient}));
Contributor:

In the next PR, can you please make a function that auto-detects whether the data is gzipped and unpacks if needed?

Contributor Author:

I am considering switching from node-tar to tar-stream + maybe-gzip.

But that probably needs more testing to ensure it works.
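
For reference, a rough sketch of such auto-detection (an assumed helper, not code from this PR): gzip streams always begin with the magic bytes 0x1f 0x8b, so the unpack step can branch on them.

  import { gunzipSync } from 'zlib';

  /** Gunzip the buffer only if it carries the gzip magic bytes. */
  function maybeGunzip(buf: Buffer): Buffer {
    const isGzip = buf.length > 2 && buf[0] === 0x1f && buf[1] === 0x8b;
    return isGzip ? gunzipSync(buf) : buf;
  }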

@Ark-kun (Contributor) commented Sep 19, 2019

I think this is a good PR that also improves the Frontend's handling of data access.
/lgtm
/cc @Bobgy @jingzhang36 - Can you please review the Frontend code?

@Bobgy (Contributor) commented Sep 20, 2019

@eterna2 Thanks for the contribution! This is a great PR.

I briefly went through the change, and there are no obvious problems. I think you can start on some testing.

Our integration test infra doesn't run tests on AWS, so you cannot rely on that. I recommend unit tests for the node server that cover the end-to-end flow of your use case. Please keep in mind we can only rely on your tests to make sure we don't break your feature, so prefer tests that cover the whole flow rather than implementation details.

I need some time to gather enough context about minio, argo workflow and the exact approach you took. I will add more comments when I have time for a more detailed review.

@Ark-kun (Contributor) commented Sep 24, 2019

@Bobgy Let's try to prioritize this PR a bit since there are some UX improvements that might depend on some of the refactorings introduced here.

@Bobgy (Contributor) left a comment:

I'm still concerned with the zero test coverage. @Ark-kun what's your opinion? Do you think follow-up tests in a separate PR, or no test coverage for now, is okay?

I'd prefer having at least tests that cover the happy path for new features.

Resolved review threads (outdated):

  • frontend/server/aws-helper.ts
  • frontend/server/minio-helper.ts
  • frontend/server/server.ts
@eterna2 (Contributor Author) commented Sep 24, 2019

@Bobgy

I need some time to write the tests. I don't see anything specific currently in this repo though. I do see some tests further down the line, but those are more end-to-end tests.

How do you want me to include the tests? Would adding a tests/ folder and updating package.json and the build script work?

Standard jest tests that mock some of the AWS and K8s requests to test the various fallbacks.
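
For example, a minimal jest sketch (the module paths and helper names here are hypothetical) that mocks the k8s request so the handler exercises the archive fallback:

  import { getPodLogs } from './k8s-helper';
  import { handleGetPodLogs } from './server';

  jest.mock('./k8s-helper');

  test('falls back to the archive when the pod is gone', async () => {
    // Simulate a GCed pod: the k8s API call rejects.
    (getPodLogs as jest.Mock).mockRejectedValue(new Error('pod not found'));
    const logs = await handleGetPodLogs('some-pod-name');
    expect(logs).toContain('archived log content');
  });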

@Ark-kun
I might need some time for the UX artifacts fix, because I am on vacation and not allowed to touch my computer until I am back next Monday.

@neuromage (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bobgy, neuromage

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@IronPan (Member) commented Oct 15, 2019

@eterna2 I have a general question: how do the Argo pod logs get persisted in the artifactory in the first place? Where did you specify env vars such as ARGO_ARCHIVE_LOGS? Did you set them in the workflow controller?

@IronPan (Member) commented Oct 15, 2019

Reading closer, it seems the log persistence is set via Argo's output artifact repository API:
https://github.com/kubeflow/pipelines/pull/2081/files#diff-35aaf752af0c32d077dee5fffb0eadb4R156

Does our DSL expose this for now? @Ark-kun

@eterna2 (Contributor Author) commented Oct 15, 2019

It is not done by kfp. It is currently set by the cluster admin (i.e. via the Argo manifest). Argo allows you to configure a default artifactory, and there is an archive-logs flag you can set:

https://github.com/argoproj/argo/blob/master/ARTIFACT_REPO.md#configure-the-default-artifact-repository

When it is unable to retrieve the logs from the pod directly, my PR handles this as follows:

  • look at the corresponding Argo workflow CRD status: parse the workflow status and check for the archiveLogs flag, which is set by the Argo artifactory config in the manifest or inside the workflow template itself (not supported by the current kfp SDK); see the sketch after this list

  • because the status will not be updated properly if the workflow throws an error, the API falls back to the manual config set inside the frontend server itself and makes some simple assumptions
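
As referenced in the first bullet, a hedged sketch of extracting the archived-log location from a workflow status (field names follow the Argo CRD; the main.log naming is an assumption based on Argo's default archiving behavior):

  interface S3ArtifactLocation { bucket?: string; key: string; }
  interface Artifact { name: string; s3?: S3ArtifactLocation; }

  /** Find the archived pod-log artifact for a node in the workflow status. */
  function findLogArtifact(status: any, podName: string): S3ArtifactLocation | undefined {
    const artifacts: Artifact[] = status?.nodes?.[podName]?.outputs?.artifacts ?? [];
    // Archived pod logs show up as an output artifact pointing at main.log.
    return artifacts.find(a => a.s3 && a.s3.key.endsWith('main.log'))?.s3;
  }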

@IronPan (Member) commented Oct 17, 2019

The Argo artifact configmap is an implementation detail that we do not intend to expose, either to cluster users or admins.

@Ark-kun thoughts?

@Ark-kun (Contributor) commented Oct 17, 2019

I think it's important for cluster administrators to control the artifact storage location.
If we do not want to expose the implementation detail, we should probably provide an official way of configuring the storage.

P.S. I consider archiving the logs to be beneficial. We should enable it by default. Otherwise the logs are lost in non-GCP environments.

@IronPan (Member) commented Nov 7, 2019

/hold cancel

@mattiasarro commented Dec 9, 2019

Can anyone give a bit more detail on how to enable this? On a master-branch installation of Kubeflow, if I set the env vars below on ml-pipeline-ui:

  ARGO_ARCHIVE_LOGS = "true"
  ARGO_ARCHIVE_ARTIFACTORY = 'minio'
  ARGO_ARCHIVE_BUCKETNAME = 'mlpipeline'
  ARGO_ARCHIVE_PREFIX = 'artifacts' # also tried 'logs'

...and set data.config.artifactRepository.archiveLogs: true in the workflow-controller-configmap, I do get logs in Minio at mlpipeline/artifacts/*/*/main.log, but ml-pipeline-ui does not fetch logs from Minio once a node has been removed.

Is the above setup supposed to work? And is there a way of enabling this for e.g. KF 0.7?

@eterna2 (Contributor Author) commented Dec 9, 2019

@mattiasarro

Hmmm, it works on my cluster though, but I am on AWS EKS with my own custom manifest rather than kf's manifest.

Can you check what version of pipelines is deployed? And what is the returned error message?

You might need to update the pipeline-ui service account to have access to k8s secrets and the Argo workflow CRD.

Under the hood, it will first try the k8s API to get the pod logs, then fall back to the Argo workflow CRD status (which tells you where the artifacts are stored; it will load the secrets and try to retrieve the logs). If that fails, it will just try to read the logs based on the env vars you set.

I have set it to return a different error message depending on where it failed.

@eterna2 (Contributor Author) commented Dec 9, 2019

I just checked: KF 0.7 uses 0.1.31.

This PR is only merged in for versions > 0.1.34.

You can try using 0.1.34 or 0.1.35. You should also update the pipeline-ui service account to have access to k8s secrets and the Argo workflow CRD.

You can look at the manifest changes in this PR. This has not been updated in the main kubeflow manifest repo.

@mattiasarro commented:

Thanks @eterna2! 0.1.34 doesn't have the change, but 0.1.35 works. I used your kubeflow-aws as a basis and created a non-AWS-specific version (since I'm using Minio, not S3): https://github.com/mattiasarro/kubeflow-0-7-argo-minio-logs

@Ark-kun (Contributor) commented Apr 24, 2020

@eterna2 Is this feature supposed to work for non-AWS installations (e.g. GCP) as well? There are some reports that it might not be the case.

@eterna2 (Contributor Author) commented Apr 24, 2020

Yes, it should work with both minio and s3.

But there are a few configurations users need to set; by default it is off.

What it does under the hood is just get the workflow status and retrieve the log artifacts listed in that status.

There are a few ways it can fail. When the workflow errors out, the status will sometimes be incomplete (no artifact info); in that case, it will not work.

And if a secret is required to retrieve the artifacts, the UI service account must have permission to retrieve those secrets.

I don't think I have updated the main kubeflow manifest for this feature.

@eterna2 (Contributor Author) commented Apr 24, 2020

Is there a good place for us to document some of this configuration?
