Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow.getVersion() could cause NonDeterminsticError with multithreading + timer #1430

Closed
longquanzheng opened this issue Sep 14, 2022 · 4 comments · Fixed by #1807
Closed

Comments

@longquanzheng
Copy link

longquanzheng commented Sep 14, 2022

Hey Temporal team,
We found that Workflow.getVersion() is letting workflow thread to yield, which causes NonDeterminsticError because the thread execution ordering is changed — this has made it very hard for us to use this getVersion().
Imagine we have two threads:

ThreadA:

{
    executeActivity(A)
}

ThreadB

{
   Workflow.sleep()
}

And thread A is invoked first.
And we got a JSON history for replay test. The history looks like this:

WF start
WF task scheduled
WF task started
WF task completed
activity task scheduled
timer scheduled

Then we are changing thread A to :

{
    if( Workflow.getVersion() >= X ){
        executeActivity(B)
    }
    executeActivity(A)
}

Then the replay test gets broken.
Because the thread A will yield at Workflow.getVersion() and thread B will then try to schedule before the activity A.
In Java SDK, it’s enforcing the eventId to be exactly matched with the history. So the replay test would run into NDE error.

Some thoughts to improve:
I think GoSDK is only forcing within a workflow task completed batch commands. This may be more reasonable because I don’t think the order really matter within a workflow task results.
Or if we could do special case in Workflow.getVersion() to let it not yield the current thread…

To reproduce

public void anyWorklfowMethod(){
         Async.procedure( ()->{ 
             Workflow.sleep(1s)
         }   
         )
         activityStub.hello()      
}

Got a history for replay
Then change the code to:

public void anyWorklfowMethod(){
         Async.procedure( ()->{ 
                 Workflow.sleep(1s)
             }   
         )
         Workflow.getVersion("changeId, Workflow.DefaultVersion, 1) 
         activityStub.hello()      
}

And this should break the replay test

@longquanzheng longquanzheng changed the title Workflow.getVersion() should yield workflow thread Workflow.getVersion() should NOT yield workflow thread Sep 14, 2022
@Spikhalskiy
Copy link
Contributor

Spikhalskiy commented Sep 14, 2022

Which version of java-sdk are we talking about?
Can you provide original histories and related pieces of code or a reproduction?

@longquanzheng longquanzheng changed the title Workflow.getVersion() should NOT yield workflow thread Workflow.getVersion() could cause NonDeterminsticError Sep 14, 2022
@longquanzheng
Copy link
Author

Which version of java-sdk are we talking about? Can you provide original histories and related pieces of code or a reproduction?

These are the versions we are using:

io.temporal:temporal-opentracing=1.13.0
io.temporal:temporal-sdk=1.13.0
io.temporal:temporal-serviceclient=1.13.0
io.temporal:temporal-test-server=1.13.0
io.temporal:temporal-testing=1.13.0

@Spikhalskiy
Copy link
Contributor

Confirm. State machine bug.
It's happening here:

It needs a very careful and probably very ugly fix to keep it backwards compatible.
Only replay shouldn't yield and ONLY if the actual version from history is default, which means that the original execution didn't have this call.

If we just fix it "right" and make getVersion not yielding, we will break existing histories generated by yielding version.

@longquanzheng
Copy link
Author

longquanzheng commented Sep 15, 2022

Confirm. State machine bug. It's happening here:

It needs a very careful and probably very ugly fix to keep it backwards compatible. Only replay shouldn't yield and ONLY if the actual version from history is default, which means that the original execution didn't have this call.

If we just fix it "right" and make getVersion not yielding, we will break existing histories generated by yielding version.

Just an idea. We could add a new field in workflow task started/completed to store some metadata of the sdk for the first workflow task of the workflow. Use the meta data (eg the sdk version) to decide the new behavior .

This new field can be generic to solve many future problems as well

@longquanzheng longquanzheng changed the title Workflow.getVersion() could cause NonDeterminsticError Workflow.getVersion() could cause NonDeterminsticError with multithreading + timer Dec 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants