Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure the narayana-jta extension fully shuts down the recovery manager #35174

Merged
merged 1 commit into from
Aug 4, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@
import com.arjuna.ats.arjuna.common.arjPropertyManager;
import com.arjuna.ats.arjuna.coordinator.TransactionReaper;
import com.arjuna.ats.arjuna.coordinator.TxControl;
import com.arjuna.ats.arjuna.recovery.RecoveryManager;
import com.arjuna.ats.internal.arjuna.objectstore.jdbc.JDBCStore;
import com.arjuna.ats.jta.common.JTAEnvironmentBean;
import com.arjuna.ats.jta.common.jtaPropertyManager;
Expand Down Expand Up @@ -158,7 +157,14 @@ public void startRecoveryService(final TransactionManagerConfiguration transacti
public void handleShutdown(ShutdownContext context, TransactionManagerConfiguration transactions) {
context.addLastShutdownTask(() -> {
if (transactions.enableRecovery) {
RecoveryManager.manager().terminate(true);
try {
QuarkusRecoveryService.getInstance().stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this change for the reproducer described on the issue #34576 and this solves the issue. Stopping the QuarkusRecoveryService is definitely what we need here instead of the termination of the recovery manager only. +1
We probably haven't spotted it before because when starting the application anew we don't notice this problem, this is definitely something we should take into consideration for next tests.

} catch (Exception e) {
// the recovery manager throws IllegalStateException if it has already been shutdown
log.warn("The recovery manager has already been shutdown", e);
Comment on lines +162 to +164
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not catch exception if we expect something more specific.

Suggested change
} catch (Exception e) {
// the recovery manager throws IllegalStateException if it has already been shutdown
log.warn("The recovery manager has already been shutdown", e);
} catch (IllegalStateException e) {
// the recovery manager throws IllegalStateException if it has already been shut down
log.warn("The recovery manager has already been shut down", e);
} catch (Exception e) {
throw new IllegalStateException("Unexpected exception shutting down the recovery manager", e);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yrodiere The signature of the method that throws the IllegalStateException is public void stop() throws Exception [1] so that's what I need to catch .

Note that QuarkusRecoveryService extends RecoveryManagerService and RecoveryManagerService is a narayana integration class. In my response to Guillaume I was just pointing out what the actual exception thrown is, I gave links to the code in my response to him.

[1] https://github.com/jbosstm/narayana/blob/6.0.1.Final/ArjunaJTS/integration/src/main/java/com/arjuna/ats/jbossatx/jta/RecoveryManagerService.java#L69

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I get what you mean.

But you only know that IllegalStateException is about the recovery manager already being shut down.

Any other exception would be unexpected, so even if you have to catch them, you probably shouldn't log a warning saying the recovery manager has already been shut down... ?

} finally {
QuarkusRecoveryService.getInstance().destroy();
}
}
TransactionReaper.terminate(false);
});
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -51,4 +51,11 @@ public void create() {
}
xaResources.clear();
}

@Override
public void destroy() {
super.destroy();
isCreated = false;
recoveryManagerService = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good for the purposes of this specific issue, but shouldn't this QuarkusRecoveryService be mindful about concurrency?
I see neither the initialization/destruction being safeguarded against racy invocations, nor the overriden instance methods - are these methods protected externally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhfeng will you respond to Sanne's point about lack of protection against concurrent access, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think the real working codes are in RecoveryManager.java. Both of initialize/terminate methods are protected by synchronized. So it should be safe for concurrency access.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, RecoveryManagerService calls RecoveryManager. So it's possible in theory to have destroy being executed concurrently to e.g. create, which could result in the recovery manager being in a "stopped" status while the service is in a "running" status (isCreated == true). Adding synchronized to lifecycle methods would prevent that.

That being said, the superclass RecoveryManagerService seems to have the same problem, so I suppose if nobody ever reported a problem in other projects... it may be safe in practice.

}
Comment on lines +55 to +60
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a non-regression test for this?

You can find an example of a hot reload test in Hibernate ORM tests:

In your case you'd probably just have a resource that runs an transaction and check that the resource still works after changing it (and automatically triggering a hot reload)

Copy link
Contributor Author

@mmusgrov mmusgrov Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that I will have time today, but do note that the clock is ticking on this fix. If we can merge it now then I or Amos can come back to it later.

I have a similar response to you next review comment as well.

These extra comments are good but they were present in the original code, I have just supplied the urgent bug fix to the existing code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the comment about concurrency in particular is not urgent at all.

As to whether we should add non-regression tests when fixing bugs, well, I'll let whoever wants the fix urgently decide.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second thoughts, regression tests ought to be non-negotiable so I will prioritise that ahead of other stuff today, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the example you gave. what is it that triggers the hot reload?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do I need to set quarkus.devservices.enabled to true in the application.properties that I use with the test?

Copy link
Member

@yrodiere yrodiere Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, since you don't want to test devservices here.

You'll probably want to set the JDBC URL though. But it looks like you use a test profile for that in other tests, not application.properties... for some reason (I don't know why):

public class JdbcObjectStoreTestProfile implements QuarkusTestProfile {
@Override
public Map<String, String> getConfigOverrides() {
HashMap<String, String> props = new HashMap<>();
props.put("quarkus.transaction-manager.object-store.type", "jdbc");
props.put("quarkus.transaction-manager.object-store.create-table", "true");
props.put("quarkus.transaction-manager.object-store.datasource", "test");
props.put("quarkus.transaction-manager.enable-recovery", "true");
props.put("quarkus.datasource.test.jdbc.url", "jdbc:h2:mem:default");
return props;
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've figured it out, I'll let you know if I hit any issues with it ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yrodiere @Sanne @gsmet apart from not having a regression test, I believe that I have resolved the main review comments. To make early progress, how about the following proposal:

This bug fix gets merged today in its current state, I've fixed most of the comments. Then I will follow up with a new issue(s) to:

  • harden the QuarkusRecoveryService along the lines Sanne suggested;
  • add the reload test.

The follow up work will be done by either Amos, Marco or myself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmusgrov The PR build is failing, so I think we'll need to fix that before we merge, at least.
The failure seems unrelated though. I'll try to rebase the PR...

@gsmet I'll let you approve the PR if you're fine with @mmusgrov 's proposal.

}
Loading