Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: upload_graph option does not work with Dataflow Java Runner v2 #32159

Closed
2 of 17 tasks
damccorm opened this issue Aug 12, 2024 · 12 comments · Fixed by #32165
Closed
2 of 17 tasks

[Bug]: upload_graph option does not work with Dataflow Java Runner v2 #32159

damccorm opened this issue Aug 12, 2024 · 12 comments · Fixed by #32165

Comments

@damccorm
Copy link
Contributor

What happened?

Today, if you try to run a runner v2 job with the upload_graph option, the job fails with the following exception:

java.lang.NullPointerException: Cannot invoke "java.util.List.clear()" because the return value of "com.google.api.services.dataflow.model.Job.getSteps()" is null
	at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:1404)
	at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:203)
	at org.apache.beam.sdk.Pipeline.run(Pipeline.java:325)
	at org.apache.beam.sdk.Pipeline.run(Pipeline.java:310)
	...

I reproed this against our test suites:

Change I made - 70650bd
Result - https://github.com/apache/beam/actions/runs/10358654870

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@damccorm
Copy link
Contributor Author

I think #30604 likely caused this.

https://github.com/kennknowles/beam/blob/0a0186de5552dddd21a41fc851c805bf6b3b6714/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslator.java#L431 is the only runner v1/v2 logic on this path AFAIK, it only reproes on v2, and I confirmed it does work on 2.55.1 (before this change)

@damccorm
Copy link
Contributor Author

Also reproduced on 2.56.0 (example failures) - another datapoint towards #30604 being the thing that exposed this

FYI @kennknowles - looks like we missed an assumption. I think the easy fix may be to just do a null check on

before calling clear(), but I need to test it.

@damccorm damccorm self-assigned this Aug 12, 2024
@kennknowles
Copy link
Member

kennknowles commented Aug 13, 2024

Yea, the most internally-consistent thing to do is to ignore or disable upload_graph if V2 is used. That'll be more targeted than a null check and won't mask other errors.

Just to state it for posterity, though you probably already know:

  • The intent of upload_graph is that we upload the V1 post-translation Dataflow graph in case it is too big to be inlined in the CreateJob request due to various service-side limitations.
  • With V2 we always upload the portable proto and it is translated on the service side.

When I say they may be mutually exclusive, I think there is a chance that some piece of code somewhere looks in the same variable for the uploaded file, since the mechanism looks the same if you aren't type-conscious.

@liferoad
Copy link
Collaborator

Agree with @kennknowles. We can just error it out when upload_graph is used with V2.

@damccorm
Copy link
Contributor Author

I don't think we should error out, but we can warn. It doesn't make sense to me to fail the job.

We also need to not automatically opt users in for runner v2

@liferoad
Copy link
Collaborator

You can warn but that option perhaps should be removed in case some problematic code paths can be triggered when the option is there.

@damccorm
Copy link
Contributor Author

Mostly, I don't want to fail the job because it harms our ability to auto-upgrade users to runner v2. Removing the option seems reasonable.

@damccorm damccorm mentioned this issue Aug 13, 2024
3 tasks
@kennknowles
Copy link
Member

kennknowles commented Aug 13, 2024

The automatic opt in happens on the service side so I don't think it will trigger this.

But I also agree that making it do nothing is probably better than failing it, because if people have command lines assembled (either manually or automatically) and they add user_runner_v2 we would like it to work if possible. So if they have some old v1 command line that has upload_graph on it, I think it is benign to ignore it, since there is no change in the semantics of the job.

@kennknowles
Copy link
Member

Didn't someone add automatic upload_graph logic when the graph is too big, so we don't need to ever set it explicitly any more? It was always an extraneous knob that could/should be either automated or made the v1 default.

@github-actions github-actions bot added this to the 2.59.0 Release milestone Aug 13, 2024
@damccorm
Copy link
Contributor Author

The automatic opt in happens on the service side so I don't think it will trigger this.

Actually, this is on the SDK side (

) - fixed by my change to only do it for v1.

Didn't someone add automatic upload_graph logic when the graph is too big, so we don't need to ever set it explicitly any more? It was always an extraneous knob that could/should be either automated or made the v1 default.

Yes, but I assume folks still have it set manually

@kennknowles
Copy link
Member

kennknowles commented Aug 13, 2024

The automatic opt in happens on the service side so I don't think it will trigger this.

Actually, this is on the SDK side (

) - fixed by my change to only do it for v1.

I mean that automatic opt in to v2 is on the service side. For jobs that do not specify v1 or v2, we submit the necessary information for both and the service decides which is which. Jobs like that won't trigger the problem. This bug doesn't impact the involuntary migration to v2. It only impact jobs that explicitly choose both v2 and upload_graph prior to submission.

@damccorm
Copy link
Contributor Author

Oh got it, that's true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants