
[backend] Metadata/Executions not written in 1.7.0-rc1 (New visualisations not working as a result) #6138

Closed
vaskozl opened this issue Jul 26, 2021 · 30 comments



vaskozl commented Jul 26, 2021

/kind bug

I upgraded Kubeflow from 1.4.0 to 1.7.0-rc1 with the platform-agnostic manifests.

While I now see correct visualizations of statistics from runs that happened before upgrading to 1.7.0-rc1, new runs only display the markdown details.

The TFX pipelines I submit are exactly the same. On the new runs the ML Metadata tab of the components prints:

"Corresponding ML Metadata not found."

Furthermore, I don't see any new executions on the Executions page, despite running many pipelines since upgrading.

I don't see anything special in the logs of the TFX pods except:

WARNING:absl:metadata_connection_config is not provided by IR.

But that was present before upgrading to 1.7.0-rc1.

The only errors I see in the metadata-grpc-deployment pod are:

name: "sp-lstm-rh6xt"
Internal: mysql_query failed: errno: 1062, error: Duplicate entry '48-sp-lstm-rh6xt' for key 'type_id'
	Cannot create node for type_id: 48 name: "sp-lstm-rh6xt"

Which I also think is normal?

Basically, I don't think executions and artifacts are getting written to the DB in 1.7.0-rc1 for some reason. I'm not sure how to debug this. As far as I can see, this is what causes the visualizations not to show up.

Metadata in the TFX pipelines is configured via the get_default_kubeflow_metadata_config function from tfx.orchestration.kubeflow.
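
For reference, a minimal sketch of how such a pipeline is typically compiled with that helper, assuming the standard KubeflowDagRunner API (the pipeline object is built elsewhere; nothing here is specific to this cluster):

from tfx.orchestration import pipeline as tfx_pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner


def compile_pipeline(pipeline: tfx_pipeline.Pipeline) -> None:
    """Compiles a TFX pipeline for KFP with the default metadata config."""
    # get_default_kubeflow_metadata_config() points components at the in-cluster
    # metadata gRPC service, so executions and artifacts should land in MLMD.
    metadata_config = kubeflow_dag_runner.get_default_kubeflow_metadata_config()
    runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config)
    kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(pipeline)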

Environment:

Kubeflow version: 1.4.0 -> 1.7.0-rc1
kfctl version: Not used. Using tfx.orchestration.kubeflow to submit pipelines.
Kubernetes platform: Upstream kubeadm, k8s v1.20.5
OS (e.g. from /etc/os-release): CentOS 8

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

vaskozl changed the title from "[backend] <Bug Name>" to "[backend] Metadata/Executions not written in 1.7.0-rc1 (New visualisations not working as a result)" on Jul 26, 2021

vaskozl commented Jul 26, 2021

I've also just tested this with 1.7.0-rc2; the issue is the same.

I've tried kfp 1.4.0 and 1.6.6.


vaskozl commented Jul 26, 2021

Downgrading the metadata images seems to have solved this for me:

images:
  - name: gcr.io/ml-pipeline/metadata-envoy
    newTag: 1.4.0
  - name: gcr.io/tfx-oss-public/ml_metadata_store_server
    newTag: 0.25.1
  - name: gcr.io/ml-pipeline/metadata-writer
    newTag: 1.4.0

I believe the metadata-writer container version is the most important one here, though I have no context.


vaskozl commented Jul 28, 2021

Actually, reverting doesn't solve the problem completely; I still find that the Statistics and Evaluator artifacts don't display.

In the markdown they have zeroed-out IDs:

id: 0
span: None
type_id: 0


vaskozl commented Jul 28, 2021

There seems to be a problem with ml-metadata 1.0.0, which is required by TFX 1.0.0 and matches the gRPC server in 1.7.1.

Downgrading the metadata-grpc-server doesn't solve anything.

I've gone back down to TFX 0.30, which uses ml-metadata 0.30.0.

Visualisations work when I downgrade gcr.io/tfx-oss-public/ml_metadata_store_server to 0.25.1 as well.

@zijianjoy
Collaborator

Hello @vaskozl, what TFX version are you using with KFP 1.4.0?


Bobgy commented Aug 6, 2021

/cc @jiyongjung0
do you have any context on this?

@zijianjoy
Collaborator

Possibly related: tensorflow/tensorflow#50045


vaskozl commented Aug 6, 2021

When on 1.4.0, everything from TFX 0.27 to 1.0.0 wrote metadata (visualizations didn't work, but they do in the 1.7.0 release). I do recall having to bump the grpcio version in requirements up and down while hopping up to TFX 1.0.0.

TFX 1.0.0 -> Kubeflow 1.7.0 (with metadata): no metadata written
TFX 1.0.0 -> Kubeflow 1.4.0: partial metadata written, it seemed
TFX 0.30.0 -> Kubeflow 1.4.0: all as expected

As I said I'm now using 1.7.0 with just the ml_metadata_store_server downgraded below 1.0.0.

zijianjoy self-assigned this Aug 9, 2021
@jiyongjung0
Contributor

I found a possible cause. It is related to changes in the way TFX has stored its contexts since 1.0.0 (which in turn relates to the changes in the execution stack using TFX IR).

In TFX 0.X, the contexts were

  • type:pipeline, value: "my-pipeline"
  • type:run, value: "my-pipeline.my-pipeline-xaew1" (some hash is appended in the second part.)
  • type:component_run, value: "my-pipeline.my-pipeline-xaew1.CsvExampleGen"
    Related code

However, in TFX 1.0 the contexts became

  • type:pipeline, value: "my-pipeline"
  • type:pipeline_run, value: "my-pipeline-xaew1"
  • type:node, value: "my-pipeline.CsvExampleGen"

Related code

So it seems like Kubeflow Pipelines cannot find the contexts (and artifacts) properly.
I think that we should change the MLMD access code accordingly, like here.
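
To make the difference concrete, here is a rough read-only inspection sketch with the ml-metadata Python client (the service host and port below are assumptions, not taken from this deployment):

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the metadata gRPC service (host/port are illustrative).
config = metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080)
store = metadata_store.MetadataStore(config)

# TFX 0.X wrote run contexts of type "run"; TFX 1.0+ writes "pipeline_run".
for type_name in ('run', 'pipeline_run'):
    contexts = store.get_contexts_by_type(type_name)
    print(type_name, [c.name for c in contexts])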

CC. @zhitaoli , @1025KB , @Bobgy

@zijianjoy
Collaborator

Thank you @jiyongjung0 for finding the difference between TFX 0.X and 1.0+!

For backward compatibility, what should the KFP frontend do to detect whether a TFX pipeline is TFX 0.X or TFX 1.0+?

@jiyongjung0
Contributor

Unfortunately, it seems that there is no direct clue when finding executions. (Artifacts have a tfx_version property, but there is no such information in Context / Execution.)

I think that we can try to find the 1.0 context first, and fall back to the 0.X context if it is not found.
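
A rough sketch of that fallback with the ml-metadata Python client (the real fix would live in the TypeScript frontend; `store` is a MetadataStore as in the earlier sketch, and the context names mirror the examples above):

def find_run_context(store, pipeline_name, run_id):
    """Finds the run context, trying the TFX 1.0+ naming first."""
    # TFX 1.0+: context type "pipeline_run", name "<run id>".
    context = store.get_context_by_type_and_name('pipeline_run', run_id)
    if context is not None:
        return context
    # Fallback to TFX 0.X: context type "run", name "<pipeline>.<run id>".
    return store.get_context_by_type_and_name(
        'run', '{}.{}'.format(pipeline_name, run_id))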


Bobgy commented Aug 13, 2021

Makes sense to me to find the 1.0 context first and then fall back to 0.X.

@zijianjoy
Collaborator

Sounds good, agree with the fallback strategy here.

@zijianjoy
Collaborator

Hello @jiyongjung0,

Currently we use this code logic to identify the execution for a specific node using the node ID (which is the corresponding pod ID). However, with the latest TFX integration, we are unable to find this connection from the Execution; see the following example of a StatisticsGen execution:

(screenshot: StatisticsGen execution with empty properties)

How do I fix the TFX integration so that I get the properties correctly, like the following (from an old deployment)?

(screenshot: execution with populated properties, from an old deployment)
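
For reference, a rough sketch of that lookup with the ml-metadata Python client (the frontend does the equivalent in TypeScript; scanning every execution is for illustration only, and `store` is a MetadataStore as in the earlier sketch):

def find_execution_by_pod_name(store, pod_name):
    """Returns the execution whose kfp_pod_name property matches the pod name."""
    for execution in store.get_executions():
        # The property may be recorded either as a custom or a regular property.
        for props in (execution.custom_properties, execution.properties):
            value = props.get('kfp_pod_name')
            if value is not None and value.string_value == pod_name:
                return execution
    return None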

Bobgy mentioned this issue Aug 16, 2021
@jiyongjung0
Contributor

Hi, this is a bug on the TFX side, introduced in tensorflow/tfx@24fc5d1. It seems like we don't record pod names in TFX 1.0.0. I'll try to fix this ASAP and will update the release plan.

copybara-service bot pushed a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391192276
copybara-service bot pushed a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221
@jiyongjung0
Contributor

I'm trying to include the above fix in TFX 1.2.0, which is expected to be released tomorrow. I'll update you when the release is finalized.

dhruvesh09 added a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
…4157)

* Fixes missing kfp_pod_name execution property in Kubeflow Pipelines.

Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221

Co-authored-by: jiyongjung <jiyongjung@google.com>
dhruvesh09 added a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
* Update RELEASE.md

* Update version.py

* Update version.py

* Update version.py

* Update dependencies.py

* Update RELEASE.md

* Fixes missing kfp_pod_name execution property in Kubeflow Pipelines.

Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221

* Update RELEASE.md

Co-authored-by: jiyongjung <jiyongjung@google.com>
@jiyongjung0
Contributor

TFX 1.2.0 was released today and this should be fixed. Thank you again for reporting this!


zijianjoy commented Aug 18, 2021

Thanks a lot, @jiyongjung0, for the quick fix and release of TFX 1.2.0!

The following is what I found:

  1. I need to make changes to the notebook example for TFX v1.2.0 (pip installation and image). I sent a PR for this: feat(sample): Use TFX 1.2.0 for Taxi tips prediction sample. Partial #6138 #6381. I'd appreciate your review in advance!

  2. I am able to see the HTML visualization for StatisticsGen, SchemaGen, etc. (yay!), but I am not able to see the visualization of the Transform step for artifacts like pre_transform_stats. This is because KFP tries to visit Split-eval and Split-train in the code files = tf.io.gfile.listdir('${specificUri}'): https://github.com/kubeflow/pipelines/blob/master/frontend/src/lib/OutputArtifactLoader.ts#L304, but I don't have those files; see the screenshot below:
    (screenshot: pre_transform_stats artifact view)

  3. The Evaluator step fails with the following logs; how does TFX utilize the KFP visualization feature?

  File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 779, in process
    result.extend(self._batch_reducible_process(unbatched_element))
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 928, in _batch_reducible_process
    input_specs = get_input_specs(model, signature_name, required) or {}
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 472, in get_input_specs
    signature_name, model.signatures))
ValueError: tft_layer not found in model signatures: _SignatureMap({'serving_default': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E62D0>, 'transform_features': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E3410>}) [while running 'ExtractEvaluateAndWriteResults/ExtractAndEvaluate/ExtractTransformedFeatures/Predict']
time="2021-08-18T16:09:09.989Z" level=error msg="cannot save artifact /mlpipeline-ui-metadata.json" argo=true error="stat /mlpipeline-ui-metadata.json: no such file or directory"
Error: exit status 1


jiyongjung0 commented Aug 19, 2021

@zijianjoy

Thank you so much!!

For 2 (Transform output): it seems like an inconsistency in the TFX implementation. The artifact from Transform was added recently and I have never tried it before. I'll talk with other TFX folks and get back to you.

For 3 (Evaluator): we need to change the preprocessing_function_names in the evaluator config in 1.2.0, because the example was changed in 1.2.0. Please see https://github.com/tensorflow/tfx/blob/34bdbc8c0f7c2d0da36559c9cb7afd603e44a5e3/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_native_keras.py#L126
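
A minimal sketch of that config change, assuming the TFMA API shipped with TFX 1.2.0 (the label key and slicing spec below are illustrative; the linked taxi example is the authoritative version):

import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(
            signature_name='serving_default',
            label_key='tips_xf',  # illustrative label key
            # Tell TFMA which exported function to use to transform raw
            # examples before evaluation.
            preprocessing_function_names=['transform_features']),
    ],
    slicing_specs=[tfma.SlicingSpec()],
)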

Bobgy added a commit that referenced this issue Aug 19, 2021
…6138 (#6381)

* fix(sample): Use TFX1.2.0 for Taxi tips prediction sample

* also update python

* Update parameterized_tfx_oss.py

* Update taxi_pipeline_notebook.ipynb

* Update parameterized_tfx_oss.py

Co-authored-by: Yuan (Bob) Gong <4957653+Bobgy@users.noreply.github.com>
google-oss-robot pushed a commit that referenced this issue Aug 19, 2021

zijianjoy commented Aug 19, 2021

Documenting the conversation:

TFX now moves the information from the Execution to a Context of type node; see the code: https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/metadata.py#L428-L435.

KFP will consider pulling this context for TFX.
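
A hypothetical illustration of pulling that node context for an execution with the ml-metadata Python client (the frontend would do the equivalent in TypeScript; `store` is a MetadataStore as in the earlier sketch):

def get_node_context(store, execution_id):
    """Returns the context of type "node" associated with an execution, if any."""
    node_type = store.get_context_type('node')
    for context in store.get_contexts_by_execution(execution_id):
        if context.type_id == node_type.id:
            return context
    return None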

casassg pushed a commit to casassg/tfx that referenced this issue Aug 25, 2021
Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221
@ConverJens
Contributor

@zijianjoy Was the TFX > 1.0.0 fix included in the KFP 1.7.0 release?


vaskozl commented Aug 26, 2021

@ConverJens I can confirm it has.

TFX 1.2.0 and Pipelines 1.7.0 work perfectly with no patches.

@ConverJens
Contributor

@vaskozl Great news, thank you for the information!

@zijianjoy
Collaborator

Yes @ConverJens, it is, as confirmed by @vaskozl.

BTW, @Bobgy is working on a compatibility matrix for KFP and TFX (and more), which will show the working version combinations in the future.

@ConverJens
Contributor

@zijianjoy Great! A compatibility matrix would be awesome!

@zijianjoy
Collaborator

Hello @ConverJens, you can check out the compatibility matrix at https://www.kubeflow.org/docs/components/pipelines/installation/compatibility-matrix/ now.


stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label Mar 2, 2022

wyljpn commented Sep 30, 2022

Hi there.
I am using KFP v1.5.1 with TFX v0.28.0.
I faced an issue loading ML Metadata on the Runs page, although in the compatibility matrix this combination is "Fully Compatible".
That was because of a contextName inconsistency between what the frontend wants to find and what is stored in MySQL.
The frontend wanted to find something like "my_pipeline.my-pipeline-xaew1" (some hash is appended in the second part), but what is stored in MySQL is "my-pipeline.my-pipeline-xaew1", so an error happened.
See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L66
After changing it from .join('_') to .join('-'), it works:

  const pipelineName = argoWorkflowName
    .split('-')
    .slice(0, -1)
    .join('-');

And it also had an issue with getKfpPod using KFP_POD_NAME, because kfpRun wasn't written into the DB.
See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L146
As a workaround, using POD_NAME made it work:

export enum KfpExecutionProperties {
  KFP_POD_NAME = 'kfp_pod_name',
  POD_NAME = 'pod_name',
}
...
  getKfpPod(execution: Execution): string | number | undefined {
    return (
      getResourceProperty(execution, KfpExecutionProperties.POD_NAME, true) ||
      getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME) ||
      getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME, true) ||
      undefined
    );
  },


stale bot removed the lifecycle/stale label Sep 30, 2022
@rimolive
Member

Closing this issue as there has been no activity since 2022.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue as there has been no activity since 2022.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
