Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove async go-routine for dispatchAction as unnecessary (and missing returns) #573

Merged
merged 5 commits into from
Feb 20, 2025

Conversation

peterbroadhurst
Copy link
Contributor

Fixed #572

This code evolved over time (and multiple codebases) from some code that needed to wait for an async thread to process a piece of work, which might timeout.

This lead to dispatching the work to a background go-routine, and having a foreground context timeout. Using a go channel between the two. However, that code had missing cases that means it had potential for failing to notify the channel - particularly I spotted this line where if pending == nil we'd fail to put anything on the channel:

This PR proposes just simplifying it all to a synchronous function, as I cannot find any path on which there is actually an asynchronous processing function any more.

… edge cases)

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst
Copy link
Contributor Author

Related issue found in the test, we're holding a lock through a long retry loop:

2025-02-19T22:05:39.3191785Z goroutine 10047 [select]:
2025-02-19T22:05:39.3192381Z github.com/kaleido-io/paladin/toolkit/pkg/retry.(*Retry).WaitDelay(0x14ff410?, {0x14ff410, 0xc005b1a9c0}, 0x9)
2025-02-19T22:05:39.3192806Z 	/home/runner/work/paladin/paladin/toolkit/go/pkg/retry/retry.go:90 +0x1c6
2025-02-19T22:05:39.3193410Z github.com/kaleido-io/paladin/toolkit/pkg/retry.(*Retry).Do(0xc00152c080, {0x14ff410, 0xc005b1a9c0}, 0xc008cefd40)
2025-02-19T22:05:39.3193830Z 	/home/runner/work/paladin/paladin/toolkit/go/pkg/retry/retry.go:73 +0x14b
2025-02-19T22:05:39.3194496Z github.com/kaleido-io/paladin/core/internal/publictxmgr.(*pubTxManager).poll(0xc0010a2280, {0x14ff410, 0xc005b1a9c0})
2025-02-19T22:05:39.3195166Z 	/home/runner/work/paladin/paladin/core/go/internal/publictxmgr/transaction_manager_loop.go:130 +0x1126
2025-02-19T22:05:39.3195712Z github.com/kaleido-io/paladin/core/internal/publictxmgr.(*pubTxManager).engineLoop(0xc0010a2280)
2025-02-19T22:05:39.3196334Z 	/home/runner/work/paladin/paladin/core/go/internal/publictxmgr/transaction_manager_loop.go:62 +0x19b
2025-02-19T22:05:39.3196898Z created by github.com/kaleido-io/paladin/core/internal/publictxmgr.(*pubTxManager).Start in goroutine 10042
2025-02-19T22:05:39.3197539Z 	/home/runner/work/paladin/paladin/core/go/internal/publictxmgr/transaction_manager.go:193 +0x1c5

err := ble.retry.Do(ctx, func(attempt int) (retry bool, err error) {
// (raw SQL as couldn't convince gORM to build this)
const dbQueryBase = `SELECT DISTINCT t."from" FROM "public_txns" AS t ` +
`LEFT JOIN "public_completions" AS c ON t."pub_txn_id" = c."pub_txn_id" ` +
`WHERE c."pub_txn_id" IS NULL AND "suspended" IS FALSE`
const dbQueryNothingInFlight = dbQueryBase + ` LIMIT ?`
if len(inFlightSigningAddresses) == 0 {
return true, ble.p.DB().Raw(dbQueryNothingInFlight, spaces).Scan(&additionalNonInFlightSigners).Error
}
const dbQueryInFlight = dbQueryBase + ` AND t."from" NOT IN (?) LIMIT ?`
return true, ble.p.DB().Raw(dbQueryInFlight, inFlightSigningAddresses, spaces).Scan(&additionalNonInFlightSigners).Error
})

Which is causing a race condition with a test:

2025-02-19T22:05:39.2986850Z goroutine 10042 [sync.Mutex.Lock]:
2025-02-19T22:05:39.2987401Z sync.runtime_SemacquireMutex(0xc004c04120?, 0x30?, 0x1586ac2?)
2025-02-19T22:05:39.2988144Z 	/opt/hostedtoolcache/go/1.22.12/x64/src/runtime/sema.go:77 +0x25
2025-02-19T22:05:39.2988741Z sync.(*Mutex).lockSlow(0xc0010a2330)
2025-02-19T22:05:39.2989087Z 	/opt/hostedtoolcache/go/1.22.12/x64/src/sync/mutex.go:171 +0x15d
2025-02-19T22:05:39.2989229Z sync.(*Mutex).Lock(...)
2025-02-19T22:05:39.2989508Z 	/opt/hostedtoolcache/go/1.22.12/x64/src/sync/mutex.go:90
2025-02-19T22:05:39.2990722Z github.com/kaleido-io/paladin/core/internal/publictxmgr.(*pubTxManager).dispatchAction(0xc0010a2280, {0x14ff448, 0xc003f3ed70}, {0x36, 0x46, 0x78, 0xb6, 0x57, 0x69, 0x4f, ...}, ...)
2025-02-19T22:05:39.2991401Z 	/home/runner/work/paladin/paladin/core/go/internal/publictxmgr/dispatch_action.go:45 +0xbc
2025-02-19T22:05:39.2992189Z github.com/kaleido-io/paladin/core/internal/publictxmgr.TestDispatchCompletedActionForNonInflightIgnored(0xc0004421a0)
2025-02-19T22:05:39.2992801Z 	/home/runner/work/paladin/paladin/core/go/internal/publictxmgr/dispatch_action_test.go:37 +0x9f
2025-02-19T22:05:39.2992980Z testing.tRunner(0xc0004421a0, 0x13c4d98)
2025-02-19T22:05:39.2993355Z 	/opt/hostedtoolcache/go/1.22.12/x64/src/testing/testing.go:1689 +0xfb
2025-02-19T22:05:39.2993538Z created by testing.(*T).Run in goroutine 1
2025-02-19T22:05:39.2993928Z 	/opt/hostedtoolcache/go/1.22.12/x64/src/testing/testing.go:1742 +0x390

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst
Copy link
Contributor Author

Addressed the above by splitting the poll loop into two locked sections, with an unlocked section in the middle while we call the DB.

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Copy link
Contributor

@dwertent dwertent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -31,7 +31,7 @@ ext {
include "mocks/**/*.go"
}

targetCoverage = 94.0
targetCoverage = 93.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update coverage (no drop in PR)

it's not clear why this coverage drop is needed if there is no coverage drop caused by this PR

Copy link
Contributor

@Chengxuan Chengxuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the coverage drop needed?

I initially thought it's because the branch is not synced with main, but it seems main still has 94.0 as the coverage threshold

@Chengxuan
Copy link
Contributor

the changes themselves look good to me, thanks for simplifying the logic and avoid locking when it's unnecessary.

@peterbroadhurst
Copy link
Contributor Author

is the coverage drop needed?

I don't think I have a coverage drop in the code. I took a file from <100% coverage, to 100% coverage in this change. I only changed one file.

However, I did find that I was receiving 93.9 coverage when running the build.

@Chengxuan
Copy link
Contributor

So reducing the test coverage threshold to 93.5 was an intentional change.

However, I did find that I was receiving 93.9 coverage when running the build.

I think that's probably caused by you've reduced the lines of code in this PR 💪 , can we update the threshold to 93.9% other than 0.4% lower?

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@Chengxuan Chengxuan enabled auto-merge February 20, 2025 17:25
@Chengxuan Chengxuan merged commit 7fac037 into main Feb 20, 2025
5 of 6 checks passed
@Chengxuan Chengxuan deleted the fix-572 branch February 20, 2025 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Intermittent issues in TestDemoNotarizedCoinSelection on my branch indicating bug in publictxmgr
3 participants