-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure the narayana-jta extension fully shuts down the recovery manager #35174
Conversation
} catch (Exception e) { | ||
throw new RuntimeException(e); | ||
} | ||
QuarkusRecoveryService.getInstance().destroy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be part of a finally clause?
I'm a bit worried of the state we will end up in if something wrong happens when we stop the service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on both counts.
The Narayana RecoveryManager#terminate call [1] checks the state [2] and throws IllegalStateException if it has previously been shutdown so it's really a sanity check telling the caller not to call it twice.
[1] https://github.com/jbosstm/narayana/blob/6.0.1.Final/ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/recovery/RecoveryManager.java#L206
[2] https://github.com/jbosstm/narayana/blob/6.0.1.Final/ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/recovery/RecoveryManager.java#L489
public void destroy() { | ||
super.destroy(); | ||
isCreated = false; | ||
recoveryManagerService = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good for the purposes of this specific issue, but shouldn't this QuarkusRecoveryService
be mindful about concurrency?
I see neither the initialization/destruction being safeguarded against racy invocations, nor the overriden instance methods - are these methods protected externally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhfeng will you respond to Sanne's point about lack of protection against concurrent access, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think the real working codes are in RecoveryManager.java. Both of initialize/terminate methods are protected by synchronized
. So it should be safe for concurrency access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, RecoveryManagerService
calls RecoveryManager
. So it's possible in theory to have destroy
being executed concurrently to e.g. create
, which could result in the recovery manager being in a "stopped" status while the service is in a "running" status (isCreated == true
). Adding synchronized
to lifecycle methods would prevent that.
That being said, the superclass RecoveryManagerService
seems to have the same problem, so I suppose if nobody ever reported a problem in other projects... it may be safe in practice.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mmusgrov - LGTM!
} catch (Exception e) { | ||
// the recovery manager throws IllegalStateException if it has already been shutdown | ||
log.warn("The recovery manager has already been shutdown", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not catch exception if we expect something more specific.
} catch (Exception e) { | |
// the recovery manager throws IllegalStateException if it has already been shutdown | |
log.warn("The recovery manager has already been shutdown", e); | |
} catch (IllegalStateException e) { | |
// the recovery manager throws IllegalStateException if it has already been shut down | |
log.warn("The recovery manager has already been shut down", e); | |
} catch (Exception e) { | |
throw new IllegalStateException("Unexpected exception shutting down the recovery manager", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yrodiere The signature of the method that throws the IllegalStateException is public void stop() throws Exception
[1] so that's what I need to catch .
Note that QuarkusRecoveryService extends RecoveryManagerService
and RecoveryManagerService is a narayana integration class. In my response to Guillaume I was just pointing out what the actual exception thrown is, I gave links to the code in my response to him.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I get what you mean.
But you only know that IllegalStateException
is about the recovery manager already being shut down.
Any other exception would be unexpected, so even if you have to catch them, you probably shouldn't log a warning saying the recovery manager has already been shut down... ?
@Override | ||
public void destroy() { | ||
super.destroy(); | ||
isCreated = false; | ||
recoveryManagerService = null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a non-regression test for this?
You can find an example of a hot reload test in Hibernate ORM tests:
- Resource:
Line 13 in ebee0a4
public class OrmXmlHotReloadTestResource { - Test:
Line 13 in 01bee69
public class OrmXmlHotReloadExplicitFileTestCase {
In your case you'd probably just have a resource that runs an transaction and check that the resource still works after changing it (and automatically triggering a hot reload)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that I will have time today, but do note that the clock is ticking on this fix. If we can merge it now then I or Amos can come back to it later.
I have a similar response to you next review comment as well.
These extra comments are good but they were present in the original code, I have just supplied the urgent bug fix to the existing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the comment about concurrency in particular is not urgent at all.
As to whether we should add non-regression tests when fixing bugs, well, I'll let whoever wants the fix urgently decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually on second thoughts, regression tests ought to be non-negotiable so I will prioritise that ahead of other stuff today, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the example you gave. what is it that triggers the hot reload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And do I need to set quarkus.devservices.enabled
to true in the application.properties that I use with the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, since you don't want to test devservices here.
You'll probably want to set the JDBC URL though. But it looks like you use a test profile for that in other tests, not application.properties
... for some reason (I don't know why):
Lines 8 to 21 in fd02a43
public class JdbcObjectStoreTestProfile implements QuarkusTestProfile { | |
@Override | |
public Map<String, String> getConfigOverrides() { | |
HashMap<String, String> props = new HashMap<>(); | |
props.put("quarkus.transaction-manager.object-store.type", "jdbc"); | |
props.put("quarkus.transaction-manager.object-store.create-table", "true"); | |
props.put("quarkus.transaction-manager.object-store.datasource", "test"); | |
props.put("quarkus.transaction-manager.enable-recovery", "true"); | |
props.put("quarkus.datasource.test.jdbc.url", "jdbc:h2:mem:default"); | |
return props; | |
} | |
} |
Line 11 in 6fc8350
@TestProfile(JdbcObjectStoreTestProfile.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've figured it out, I'll let you know if I hit any issues with it ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yrodiere @Sanne @gsmet apart from not having a regression test, I believe that I have resolved the main review comments. To make early progress, how about the following proposal:
This bug fix gets merged today in its current state, I've fixed most of the comments. Then I will follow up with a new issue(s) to:
- harden the
QuarkusRecoveryService
along the lines Sanne suggested; - add the reload test.
The follow up work will be done by either Amos, Marco or myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public void destroy() { | ||
super.destroy(); | ||
isCreated = false; | ||
recoveryManagerService = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, RecoveryManagerService
calls RecoveryManager
. So it's possible in theory to have destroy
being executed concurrently to e.g. create
, which could result in the recovery manager being in a "stopped" status while the service is in a "running" status (isCreated == true
). Adding synchronized
to lifecycle methods would prevent that.
That being said, the superclass RecoveryManagerService
seems to have the same problem, so I suppose if nobody ever reported a problem in other projects... it may be safe in practice.
This comment has been minimized.
This comment has been minimized.
@@ -158,7 +157,14 @@ public void startRecoveryService(final TransactionManagerConfiguration transacti | |||
public void handleShutdown(ShutdownContext context, TransactionManagerConfiguration transactions) { | |||
context.addLastShutdownTask(() -> { | |||
if (transactions.enableRecovery) { | |||
RecoveryManager.manager().terminate(true); | |||
try { | |||
QuarkusRecoveryService.getInstance().stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this change for the reproducer described on the issue #34576 and this solves the issue. Stopping the QuarkusRecoveryService is definitely what we need here instead of the termination of the recovery manager only. +1
We probably haven't spotted it before because when starting the application anew we don't notice this problem, this is definitely something we should take into consideration for next tests.
This comment has been minimized.
This comment has been minimized.
The CI failure is not related, I'll restart CI once the issue is fixed in main. |
Rebased to get a CI fix. |
Failing Jobs - Building d99f0b5
Full information is available in the Build summary check run. Failures⚙️ Gradle Tests - JDK 11 Windows #- Failing: integration-tests/gradle
📦 integration-tests/gradle✖
⚙️ JVM Tests - JDK 20 #- Failing: integration-tests/virtual-threads/kafka-virtual-threads
📦 integration-tests/virtual-threads/kafka-virtual-threads✖
|
@yrodiere I tried unsuccessfully most of yesterday to create a hot reload test. I've started a zulipt topic to get some more eyes on my problem. |
I merged it. Please continue the work on the test in a separate PR. |
@gsmet I think it needs to backport to |
@zhfeng yes, that's what the |
This MR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [flow-bin](https://github.com/flowtype/flow-bin) ([changelog](https://github.com/facebook/flow/blob/master/Changelog.md)) | devDependencies | minor | [`^0.214.0` -> `^0.215.0`](https://renovatebot.com/diffs/npm/flow-bin/0.214.0/0.215.1) | | [org.liquibase:liquibase-maven-plugin](http://www.liquibase.org/liquibase-maven-plugin) ([source](https://github.com/liquibase/liquibase)) | build | patch | `4.23.0` -> `4.23.1` | | [io.quarkus:quarkus-maven-plugin](https://github.com/quarkusio/quarkus) | build | minor | `3.2.3.Final` -> `3.3.0` | | [io.quarkus:quarkus-universe-bom](https://github.com/quarkusio/quarkus-platform) | import | minor | `3.2.3.Final` -> `3.3.0` | | [org.apache.maven.plugins:maven-enforcer-plugin](https://maven.apache.org/enforcer/) | build | minor | `3.3.0` -> `3.4.0` | --- ### Release Notes <details> <summary>flowtype/flow-bin</summary> ### [`v0.215.1`](flow/flow-bin@a92ce80...cbb038f) [Compare Source](flow/flow-bin@a92ce80...cbb038f) ### [`v0.215.0`](flow/flow-bin@ca11e28...a92ce80) [Compare Source](flow/flow-bin@ca11e28...a92ce80) </details> <details> <summary>liquibase/liquibase</summary> ### [`v4.23.1`](https://github.com/liquibase/liquibase/blob/HEAD/changelog.txt#Liquibase-4231-is-a-patch-release) [Compare Source](liquibase/liquibase@v4.23.0...v4.23.1) </details> <details> <summary>quarkusio/quarkus</summary> ### [`v3.3.0`](https://github.com/quarkusio/quarkus/releases/tag/3.3.0) [Compare Source](quarkusio/quarkus@3.2.4.Final...3.3.0) ##### Complete changelog - [#​35350](quarkusio/quarkus#35350) - Fix package type system property clearing - [#​35348](quarkusio/quarkus#35348) - quarkus-maven-plugin runs native building even if the profile is commented out - [#​35343](quarkusio/quarkus#35343) - ArC: fix StackOverflowError in AutoAddScopeBuildItem - [#​35319](quarkusio/quarkus#35319) - Register arrays of Hibernate ORM's JDBC basic types for reflection - [#​35315](quarkusio/quarkus#35315) - Fix Datasource timing issues with Liquibase / Flyway and OpenTelemetry - [#​35314](quarkusio/quarkus#35314) - Regression in 3.3.0.CR1: Synthetic bean instance for io.opentelemetry.api.OpenTelemetry not initialized yet - [#​35312](quarkusio/quarkus#35312) - Updates Infinispan to 14.0.13.Final - [#​35308](quarkusio/quarkus#35308) - Lock jib execution to avoid OverlappingFileLockException in parallel builds - [#​35305](quarkusio/quarkus#35305) - Fix the titles of the tables in RESTEasy Reactive doc - [#​35302](quarkusio/quarkus#35302) - Docs: Mention wilcard support in resteasy reactive XML serialisation exclude classes configuration - [#​35301](quarkusio/quarkus#35301) - Fix potential NPE in quarkus-csrf-reactive when no MediaType is found - [#​35299](quarkusio/quarkus#35299) - Output build graph using `quarkus.builder.graph-output` property - [#​35285](quarkusio/quarkus#35285) - NullPointerException during http post request when quarkus-csrf-reactive extension is added to a project - [#​35283](quarkusio/quarkus#35283) - Upgrade proto-google-common-protos to 2.23.0 - [#​35282](quarkusio/quarkus#35282) - Avoid keeping references to BytecodeRecorderImpl - [#​35276](quarkusio/quarkus#35276) - Reinstate DevModeTestUtil to avoid breaking other projects that depend on it - [#​35273](quarkusio/quarkus#35273) - Fix small typo in comment - [#​35263](quarkusio/quarkus#35263) - Stop the recovery service while ArC is still around - [#​35245](quarkusio/quarkus#35245) - Add missing info to init Jobs - [#​35244](quarkusio/quarkus#35244) - Init Jobs are missing ServiceAccount and Image Pull Secrets - [#​35240](quarkusio/quarkus#35240) - Update SmallRye Health to 4.0.4 - [#​34071](quarkusio/quarkus#34071) - 3.1.1 Final - java.lang.IllegalArgumentException: Class java.util.UUID\[] is instantiated reflectively but was never registered - [#​32800](quarkusio/quarkus#32800) - Duplicated checks in health check response - [#​11903](quarkusio/quarkus#11903) - Gradle multimodule project + quarkus-container-image-jib: OverlappingFileLockException ### [`v3.2.4.Final`](https://github.com/quarkusio/quarkus/releases/tag/3.2.4.Final) [Compare Source](quarkusio/quarkus@3.2.3.Final...3.2.4.Final) ##### Complete changelog - [#​35300](quarkusio/quarkus#35300) - Fix `jandex-gradle-plugin-version` in CDI Reference - [#​35296](quarkusio/quarkus#35296) - Upgrade H2 to 2.2.220 - [#​35258](quarkusio/quarkus#35258) - CDI Reference 1.1 has incomplete information for gradle - [#​35255](quarkusio/quarkus#35255) - Quartz: QuarkusMSSQLDelegate should extends MSSQLDelegate - [#​35211](quarkusio/quarkus#35211) - Document Maven config options that may be relevant when running tests - [#​35203](quarkusio/quarkus#35203) - Pass Maven user settings when initializing artifact resolver - [#​35193](quarkusio/quarkus#35193) - OpenTelemetry service name should have higher priority than app name when no resource attributes are set - [#​35189](quarkusio/quarkus#35189) - Quarkus CLI fixes - [#​35188](quarkusio/quarkus#35188) - SmallRyeGraphQLOverWebSocketHandler: use order value > Integer.MIN_VALUE - [#​35181](quarkusio/quarkus#35181) - REST Data with Panache should not produce links when hal is disabled - [#​35174](quarkusio/quarkus#35174) - Ensure the narayana-jta extension fully shuts down the recovery manager - [#​35172](quarkusio/quarkus#35172) - Kafka Streams: restore the feature name at Quarkus startup - [#​35171](quarkusio/quarkus#35171) - kafka-streams: feature not listed on startup - [#​35165](quarkusio/quarkus#35165) - Propagate all user methods in REST Data with Panache - [#​35160](quarkusio/quarkus#35160) - Properly use internal links to point to other guides - [#​35140](quarkusio/quarkus#35140) - ArC: fix deadlock when calling guarded methods on the same thread - [#​35136](quarkusio/quarkus#35136) - Deadlock while calling write-locked method from read-locked method - [#​34908](quarkusio/quarkus#34908) - `@RouteFilter` stopped working with WebSocket requests Quarkus 3.2.0.Final - [#​34875](quarkusio/quarkus#34875) - Quarkus build does not work since 3.2.0 with teamcity/plexus launcher - [#​34713](quarkusio/quarkus#34713) - Option to track build configuration for changes between builds - [#​34576](quarkusio/quarkus#34576) - Live reload stopped working on 3.2 when using XA transactions - [#​34514](quarkusio/quarkus#34514) - Support `@WithUnnamedKey` in documentation - [#​34065](quarkusio/quarkus#34065) - Add support for project Java version update based on extensions - [#​33317](quarkusio/quarkus#33317) - OpenTelemetry SDK autoconfiguration ignores OTEL service name in favor of Quarkus app name - [#​15461](quarkusio/quarkus#15461) - Quarkus tests fails mTLS authentication against internal Maven repository </details> <details> <summary>quarkusio/quarkus-platform</summary> ### [`v3.3.0`](quarkusio/quarkus-platform@3.2.4.Final...3.3.0) [Compare Source](quarkusio/quarkus-platform@3.2.4.Final...3.3.0) ### [`v3.2.4.Final`](quarkusio/quarkus-platform@3.2.3.Final...3.2.4.Final) [Compare Source](quarkusio/quarkus-platform@3.2.3.Final...3.2.4.Final) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever MR is behind base branch, or you tick the rebase/retry checkbox. 👻 **Immortal**: This MR will be recreated if closed unmerged. Get [config help](https://github.com/renovatebot/renovate/discussions) if that's undesired. --- - [x] <!-- rebase-check -->If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNC4yNC4wIiwidXBkYXRlZEluVmVyIjoiMzQuMjQuMCJ9-->
…s-googlecloud-jsonlogging!17) This MR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [io.quarkus:quarkus-extension-processor](https://github.com/quarkusio/quarkus) | | minor | `3.2.3.Final` -> `3.3.0` | | [io.quarkus:quarkus-extension-maven-plugin](https://github.com/quarkusio/quarkus) | build | minor | `3.2.3.Final` -> `3.3.0` | | [io.quarkus:quarkus-maven-plugin](https://github.com/quarkusio/quarkus) | build | minor | `3.2.3.Final` -> `3.3.0` | | [io.quarkus:quarkus-bom](https://github.com/quarkusio/quarkus) | import | minor | `3.2.3.Final` -> `3.3.0` | --- ### Release Notes <details> <summary>quarkusio/quarkus</summary> ### [`v3.3.0`](quarkusio/quarkus@3.2.4.Final...3.3.0) [Compare Source](quarkusio/quarkus@3.2.4.Final...3.3.0) ### [`v3.2.4.Final`](https://github.com/quarkusio/quarkus/releases/tag/3.2.4.Final) [Compare Source](quarkusio/quarkus@3.2.3.Final...3.2.4.Final) ##### Complete changelog - [#​35300](quarkusio/quarkus#35300) - Fix `jandex-gradle-plugin-version` in CDI Reference - [#​35296](quarkusio/quarkus#35296) - Upgrade H2 to 2.2.220 - [#​35258](quarkusio/quarkus#35258) - CDI Reference 1.1 has incomplete information for gradle - [#​35255](quarkusio/quarkus#35255) - Quartz: QuarkusMSSQLDelegate should extends MSSQLDelegate - [#​35211](quarkusio/quarkus#35211) - Document Maven config options that may be relevant when running tests - [#​35203](quarkusio/quarkus#35203) - Pass Maven user settings when initializing artifact resolver - [#​35193](quarkusio/quarkus#35193) - OpenTelemetry service name should have higher priority than app name when no resource attributes are set - [#​35189](quarkusio/quarkus#35189) - Quarkus CLI fixes - [#​35188](quarkusio/quarkus#35188) - SmallRyeGraphQLOverWebSocketHandler: use order value > Integer.MIN_VALUE - [#​35181](quarkusio/quarkus#35181) - REST Data with Panache should not produce links when hal is disabled - [#​35174](quarkusio/quarkus#35174) - Ensure the narayana-jta extension fully shuts down the recovery manager - [#​35172](quarkusio/quarkus#35172) - Kafka Streams: restore the feature name at Quarkus startup - [#​35171](quarkusio/quarkus#35171) - kafka-streams: feature not listed on startup - [#​35165](quarkusio/quarkus#35165) - Propagate all user methods in REST Data with Panache - [#​35160](quarkusio/quarkus#35160) - Properly use internal links to point to other guides - [#​35140](quarkusio/quarkus#35140) - ArC: fix deadlock when calling guarded methods on the same thread - [#​35136](quarkusio/quarkus#35136) - Deadlock while calling write-locked method from read-locked method - [#​34908](quarkusio/quarkus#34908) - `@RouteFilter` stopped working with WebSocket requests Quarkus 3.2.0.Final - [#​34875](quarkusio/quarkus#34875) - Quarkus build does not work since 3.2.0 with teamcity/plexus launcher - [#​34713](quarkusio/quarkus#34713) - Option to track build configuration for changes between builds - [#​34576](quarkusio/quarkus#34576) - Live reload stopped working on 3.2 when using XA transactions - [#​34514](quarkusio/quarkus#34514) - Support `@WithUnnamedKey` in documentation - [#​34065](quarkusio/quarkus#34065) - Add support for project Java version update based on extensions - [#​33317](quarkusio/quarkus#33317) - OpenTelemetry SDK autoconfiguration ignores OTEL service name in favor of Quarkus app name - [#​15461](quarkusio/quarkus#15461) - Quarkus tests fails mTLS authentication against internal Maven repository </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever MR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this MR and you won't be reminded about these updates again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNC4yNC4wIiwidXBkYXRlZEluVmVyIjoiMzQuMjQuMCJ9-->
Fixes #34576