
[backend] Metadata/Executions not written in 1.7.0-rc1 (New visualisations not working as a result) #6138

Closed
vaskozl opened this issue Jul 26, 2021 · 30 comments



vaskozl commented Jul 26, 2021

/kind bug

I upgraded Kubeflow from 1.4.0 to 1.7.0-rc1 with the platform-agnostic manifests.

While I now see correct visualizations of statistics from runs that happened before upgrading to 1.7.0-rc1, new runs only display the markdown details.

The TFX pipelines I submit are exactly the same. On the new runs the ML Metadata tab of the components prints:

"Corresponding ML Metadata not found."

Furthermore, I don't see any new executions on the Executions page, despite running many pipelines since upgrading.

I don't see anything special in the logs of the TFX pods except:

WARNING:absl:metadata_connection_config is not provided by IR.

But that was present before upgrading to 1.7.0-rc1.

The only errors I see in the metadata-grpc-deployment pod are:

name: "sp-lstm-rh6xt"
Internal: mysql_query failed: errno: 1062, error: Duplicate entry '48-sp-lstm-rh6xt' for key 'type_id'
	Cannot create node for type_id: 48 name: "sp-lstm-rh6xt"

Which I also think is normal?

Basically, I don't think executions and artifacts are getting written to the DB in 1.7.0-rc1 for some reason. I'm not sure how to debug this. As far as I can see, this is what causes the visualizations not to show up.

Metadata in the TFX pipelines is configured via the get_default_kubeflow_metadata_config function from tfx.orchestration.kubeflow.
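
For reference, a minimal sketch of how such a pipeline is typically compiled with that helper, assuming the standard KubeflowDagRunner API (the pipeline object is built elsewhere; nothing here is specific to this cluster):

from tfx.orchestration import pipeline as tfx_pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner


def compile_pipeline(pipeline: tfx_pipeline.Pipeline) -> None:
    """Compiles a TFX pipeline for KFP with the default metadata config."""
    # get_default_kubeflow_metadata_config() points components at the in-cluster
    # metadata gRPC service, so executions and artifacts should land in MLMD.
    metadata_config = kubeflow_dag_runner.get_default_kubeflow_metadata_config()
    runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config)
    kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(pipeline)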

Environment:

Kubeflow version: 1.4.0 -> 1.7.0-rc1
kfctl version: Not used. Using tfx.orchestration.kubeflow to submit pipelines.
Kubernetes platform: Upstream kubeadm, k8s v1.20.5
OS (e.g. from /etc/os-release): CentOS 8

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

vaskozl changed the title from "[backend] <Bug Name>" to "[backend] Metadata/Executions not written in 1.7.0-rc1 (New visualisations not working as a result)" on Jul 26, 2021

vaskozl commented Jul 26, 2021

I've also just tested this with 1.7.0-rc2; the issue is the same.

I've tried kfp 1.4.0 and 1.6.6.


vaskozl commented Jul 26, 2021

Downgrading the metadata images seems to have solved this for me:

images:
  - name: gcr.io/ml-pipeline/metadata-envoy
    newTag: 1.4.0
  - name: gcr.io/tfx-oss-public/ml_metadata_store_server
    newTag: 0.25.1
  - name: gcr.io/ml-pipeline/metadata-writer
    newTag: 1.4.0

I believe the metadata-writer container version is the most important one here, though I have no context.


vaskozl commented Jul 28, 2021

Actually, reverting doesn't solve the problem completely; I still find that the Statistics and Evaluator artifacts don't display.

In the markdown they have zeroed-out IDs:

id: 0
span: None
type_id: 0


vaskozl commented Jul 28, 2021

There seems to be a problem with ml-metadata 1.0.0, which is required by TFX 1.0.0 and matches the gRPC server in 1.7.1.

Downgrading the metadata-grpc-server doesn't solve anything.

I've gone back down to TFX 0.30, which uses ml-metadata 0.30.0.

Visualisations work when I downgrade gcr.io/tfx-oss-public/ml_metadata_store_server to 0.25.1 as well.

@zijianjoy
Collaborator

Hello @vaskozl, what TFX version are you using with KFP 1.4.0?


Bobgy commented Aug 6, 2021

/cc @jiyongjung0
do you have any context on this?

@zijianjoy
Collaborator

Possibly related: tensorflow/tensorflow#50045


vaskozl commented Aug 6, 2021

When on 1.4.0, everything from TFX 0.27 to 1.0.0 wrote metadata (visualizations didn't work, but they do in the 1.7.0 release). I do recall having to bump the grpcio version in requirements up and down while hopping up to TFX 1.0.0.

TFX 1.0.0 -> Kubeflow 1.7.0 (with metadata): no metadata written
TFX 1.0.0 -> Kubeflow 1.4.0: partial metadata written, it seemed
TFX 0.30.0 -> Kubeflow 1.4.0: all as expected

As I said I'm now using 1.7.0 with just the ml_metadata_store_server downgraded below 1.0.0.

zijianjoy self-assigned this Aug 9, 2021
@jiyongjung0
Contributor

I found a possible cause. It is related to changes in the way TFX has stored its contexts since 1.0.0 (which in turn relates to the changes in the execution stack using TFX IR).

In TFX 0.X, the contexts were

  • type:pipeline, value: "my-pipeline"
  • type:run, value: "my-pipeline.my-pipeline-xaew1" (some hash is appended in the second part.)
  • type:component_run, value: "my-pipeline.my-pipeline-xaew1.CsvExampleGen"
    Related code

However, in TFX 1.0 the contexts became

  • type:pipeline, value: "my-pipeline"
  • type:pipeline_run, value: "my-pipeline-xaew1"
  • type:node, value: "my-pipeline.CsvExampleGen"

Related code

So it seems like Kubeflow Pipelines cannot find the contexts (and artifacts) properly.
I think that we should change the MLMD access code accordingly, like here.
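
To make the difference concrete, here is a rough read-only inspection sketch with the ml-metadata Python client (the service host and port below are assumptions, not taken from this deployment):

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the metadata gRPC service (host/port are illustrative).
config = metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080)
store = metadata_store.MetadataStore(config)

# TFX 0.X wrote run contexts of type "run"; TFX 1.0+ writes "pipeline_run".
for type_name in ('run', 'pipeline_run'):
    contexts = store.get_contexts_by_type(type_name)
    print(type_name, [c.name for c in contexts])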

CC. @zhitaoli , @1025KB , @Bobgy

@zijianjoy
Collaborator

Thank you @jiyongjung0 for finding the difference between TFX 0.X and 1.0+!

For backward compatibility, what should the KFP frontend do to detect whether a TFX pipeline is TFX 0.X or TFX 1.0+?

@jiyongjung0
Contributor

Unfortunately, it seems that there is no direct clue when finding executions. (Artifacts have a tfx_version property, but there is no such information in Context / Execution.)

I think that we can try to find the 1.0 context first, and fall back to the 0.X context if it is not found.
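
A rough sketch of that fallback with the ml-metadata Python client (the real fix would live in the TypeScript frontend; `store` is a MetadataStore as in the earlier sketch, and the context names mirror the examples above):

def find_run_context(store, pipeline_name, run_id):
    """Finds the run context, trying the TFX 1.0+ naming first."""
    # TFX 1.0+: context type "pipeline_run", name "<run id>".
    context = store.get_context_by_type_and_name('pipeline_run', run_id)
    if context is not None:
        return context
    # Fallback to TFX 0.X: context type "run", name "<pipeline>.<run id>".
    return store.get_context_by_type_and_name(
        'run', '{}.{}'.format(pipeline_name, run_id))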


Bobgy commented Aug 13, 2021

Makes sense to me to find the 1.0 context first and then fall back to 0.X.

@zijianjoy
Collaborator

Sounds good, agree with the fallback strategy here.

@zijianjoy
Collaborator

Hello @jiyongjung0,

Currently we use this code logic to identify the execution for a specific node using the node ID (which is the corresponding pod ID). However, with the latest TFX integration, we are unable to find this connection from the Execution; see the following example of a StatisticsGen execution:

(screenshot: StatisticsGen execution with empty properties)

How do I fix the TFX integration so that I get the properties correctly, like the following (from an old deployment)?

(screenshot: execution with populated properties, from an old deployment)
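
For reference, a rough sketch of that lookup with the ml-metadata Python client (the frontend does the equivalent in TypeScript; scanning every execution is for illustration only, and `store` is a MetadataStore as in the earlier sketch):

def find_execution_by_pod_name(store, pod_name):
    """Returns the execution whose kfp_pod_name property matches the pod name."""
    for execution in store.get_executions():
        # The property may be recorded either as a custom or a regular property.
        for props in (execution.custom_properties, execution.properties):
            value = props.get('kfp_pod_name')
            if value is not None and value.string_value == pod_name:
                return execution
    return None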

Bobgy mentioned this issue Aug 16, 2021
@jiyongjung0
Contributor

Hi, this is a bug on the TFX side, introduced in tensorflow/tfx@24fc5d1. It seems like we don't record pod names in TFX 1.0.0. I'll try to fix this ASAP and will update the release plan.

copybara-service bot pushed a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391192276
copybara-service bot pushed a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221
@jiyongjung0
Contributor

I'm trying to include the above fix in TFX 1.2.0, which is expected to be released tomorrow. I'll update you when the release is finalized.

dhruvesh09 added a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
…4157)

* Fixes missing kfp_pod_name execution property in Kubeflow Pipelines.

Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221

Co-authored-by: jiyongjung <jiyongjung@google.com>
dhruvesh09 added a commit to tensorflow/tfx that referenced this issue Aug 17, 2021
* Update RELEASE.md

* Update version.py

* Update version.py

* Update version.py

* Update dependencies.py

* Update RELEASE.md

* Fixes missing kfp_pod_name execution property in Kubeflow Pipelines.

Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221

* Update RELEASE.md

Co-authored-by: jiyongjung <jiyongjung@google.com>
@jiyongjung0
Contributor

TFX 1.2.0 was released today and this should be fixed. Thank you again for reporting this!


zijianjoy commented Aug 18, 2021

Thanks a lot, @jiyongjung0, for the quick fix and release of TFX 1.2.0!

The following is what I found:

  1. I need to make changes to the notebook example for TFX v1.2.0 (pip installation and image). I sent a PR for this: feat(sample): Use TFX 1.2.0 for Taxi tips prediction sample. Partial #6138 #6381. I'd appreciate your review in advance!

  2. I am able to see the HTML visualization for StatisticsGen, SchemaGen, etc. (yay!), but I am not able to see the visualization of the Transform step for artifacts like pre_transform_stats. This is because KFP tries to visit Split-eval and Split-train in the code files = tf.io.gfile.listdir('${specificUri}'): https://github.com/kubeflow/pipelines/blob/master/frontend/src/lib/OutputArtifactLoader.ts#L304, but I don't have those files; see the screenshot below:
    (screenshot: pre_transform_stats artifact view)

  3. The Evaluator step fails with the following logs; how does TFX utilize the KFP visualization feature?

  File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 779, in process
    result.extend(self._batch_reducible_process(unbatched_element))
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 928, in _batch_reducible_process
    input_specs = get_input_specs(model, signature_name, required) or {}
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 472, in get_input_specs
    signature_name, model.signatures))
ValueError: tft_layer not found in model signatures: _SignatureMap({'serving_default': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E62D0>, 'transform_features': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E3410>}) [while running 'ExtractEvaluateAndWriteResults/ExtractAndEvaluate/ExtractTransformedFeatures/Predict']
time="2021-08-18T16:09:09.989Z" level=error msg="cannot save artifact /mlpipeline-ui-metadata.json" argo=true error="stat /mlpipeline-ui-metadata.json: no such file or directory"
Error: exit status 1


jiyongjung0 commented Aug 19, 2021

@zijianjoy

Thank you so much!!

For 2 (Transform output): it seems like an inconsistency in the TFX implementation. The artifact from Transform was added recently and I have never tried it before. I'll talk with other TFX folks and get back to you.

For 3 (Evaluator): we need to change the preprocessing_function_names in the evaluator config in 1.2.0, because the example was changed in 1.2.0. Please see https://github.com/tensorflow/tfx/blob/34bdbc8c0f7c2d0da36559c9cb7afd603e44a5e3/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_native_keras.py#L126
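
A minimal sketch of that config change, assuming the TFMA API shipped with TFX 1.2.0 (the label key and slicing spec below are illustrative; the linked taxi example is the authoritative version):

import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(
            signature_name='serving_default',
            label_key='tips_xf',  # illustrative label key
            # Tell TFMA which exported function to use to transform raw
            # examples before evaluation.
            preprocessing_function_names=['transform_features']),
    ],
    slicing_specs=[tfma.SlicingSpec()],
)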

Bobgy added a commit that referenced this issue Aug 19, 2021
…6138 (#6381)

* fix(sample): Use TFX1.2.0 for Taxi tips prediction sample

* also update python

* Update parameterized_tfx_oss.py

* Update taxi_pipeline_notebook.ipynb

* Update parameterized_tfx_oss.py

Co-authored-by: Yuan (Bob) Gong <4957653+Bobgy@users.noreply.github.com>
google-oss-robot pushed a commit that referenced this issue Aug 19, 2021

zijianjoy commented Aug 19, 2021

Documenting the conversation:

TFX now moves the information from the Execution to a Context of type node; see the code: https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/metadata.py#L428-L435.

KFP will consider pulling this context for TFX.
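
A hypothetical illustration of pulling that node context for an execution with the ml-metadata Python client (the frontend would do the equivalent in TypeScript; `store` is a MetadataStore as in the earlier sketch):

def get_node_context(store, execution_id):
    """Returns the context of type "node" associated with an execution, if any."""
    node_type = store.get_context_type('node')
    for context in store.get_contexts_by_execution(execution_id):
        if context.type_id == node_type.id:
            return context
    return None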

casassg pushed a commit to casassg/tfx that referenced this issue Aug 25, 2021
Kubeflow Dag Runner should record additional execution property to store pod names.
See also kubeflow/pipelines#6138.

PiperOrigin-RevId: 391220221
@ConverJens
Contributor

@zijianjoy Was the TFX > 1.0.0 fix included in the KFP 1.7.0 release?


vaskozl commented Aug 26, 2021

@ConverJens I can confirm it has.

TFX 1.2.0 and Pipelines 1.7.0 work perfectly with no patches.

@ConverJens
Contributor

@vaskozl Great news, thank you for the information!

@zijianjoy
Collaborator

Yes @ConverJens, it is, as confirmed by @vaskozl.

BTW, @Bobgy is working on a compatibility matrix for KFP and TFX (and more), which will show the working version combinations in the future.

@ConverJens
Contributor

@zijianjoy Great! A compatibility matrix would be awesome!

@zijianjoy
Collaborator

Hello @ConverJens, you can check out the compatibility matrix at https://www.kubeflow.org/docs/components/pipelines/installation/compatibility-matrix/ now.


stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label Mar 2, 2022

wyljpn commented Sep 30, 2022

Hi there.
I am using KFP v1.5.1 with TFX v0.28.0.
I faced an issue loading ML Metadata on the Runs page, although in the compatibility matrix this combination is "Fully Compatible".
That was because of a contextName inconsistency between what the frontend wants to find and what is stored in MySQL.
The frontend wanted to find something like "my_pipeline.my-pipeline-xaew1" (some hash is appended in the second part), but what is stored in MySQL is "my-pipeline.my-pipeline-xaew1", so an error happened.
See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L66
After changing it from .join('_') to .join('-'), it works:

  const pipelineName = argoWorkflowName
    .split('-')
    .slice(0, -1)
    .join('-');

And it also had an issue with getKfpPod using KFP_POD_NAME, because kfpRun wasn't written into the DB.
See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L146
As a workaround, using POD_NAME made it work:

export enum KfpExecutionProperties {
  KFP_POD_NAME = 'kfp_pod_name',
  POD_NAME = 'pod_name',
}
...
  getKfpPod(execution: Execution): string | number | undefined {
    return (
      getResourceProperty(execution, KfpExecutionProperties.POD_NAME, true) ||
      getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME) ||
      getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME, true) ||
      undefined
    );
  },


stale bot removed the lifecycle/stale label Sep 30, 2022
@rimolive
Member

Closing this issue as there has been no activity since 2022.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue as there has been no activity since 2022.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
