Race condition found in Horizon Ingestion filtering Integration tests #4987

Closed
urvisavla opened this issue Jul 27, 2023 · 3 comments · Fixed by #4994

@urvisavla (Contributor)

What did you see?

Horizon integration tests are failing due to a race condition found in the ingestion filtering tests.

Here is the error logged by Horizon CI while running TestIngestionFilteringAlwaysDefaultingToTrue/ingestion_filtering_flag_set_to_false.

2023-07-27T22:20:41.1340892Z ==================
2023-07-27T22:20:41.1341444Z WARNING: DATA RACE
2023-07-27T22:20:41.1342076Z Write at 0x00c0000b09a8 by goroutine 32960:
2023-07-27T22:20:41.1342497Z   runtime.racewrite()
2023-07-27T22:20:41.1342861Z       <autogenerated>:1 +0x24
2023-07-27T22:20:41.1343475Z   github.com/stellar/go/services/horizon/internal/ingest.(*system).Shutdown()
2023-07-27T22:20:41.1344240Z       /home/runner/work/go/go/services/horizon/internal/ingest/main.go:847 +0x150
2023-07-27T22:20:41.1344823Z   github.com/stellar/go/services/horizon/internal.(*App).waitForDone()
2023-07-27T22:20:41.1345522Z       /home/runner/work/go/go/services/horizon/internal/app.go:164 +0x164
2023-07-27T22:20:41.1346107Z   github.com/stellar/go/services/horizon/internal.(*App).Serve.func4()
2023-07-27T22:20:41.1346797Z       /home/runner/work/go/go/services/horizon/internal/app.go:133 +0x39
2023-07-27T22:20:41.1347108Z 
2023-07-27T22:20:41.1347435Z Previous read at 0x00c0000b09a8 by goroutine 32951:
2023-07-27T22:20:41.1348469Z   runtime.raceread()
2023-07-27T22:20:41.1348983Z       <autogenerated>:1 +0x24
2023-07-27T22:20:41.1350065Z   github.com/stellar/go/services/horizon/internal/ingest.(*system).runStateMachine()
2023-07-27T22:20:41.1350864Z       /home/runner/work/go/go/services/horizon/internal/ingest/main.go:641 +0x74
2023-07-27T22:20:41.1351457Z   github.com/stellar/go/services/horizon/internal/ingest.(*system).Run()
2023-07-27T22:20:41.1352184Z       /home/runner/work/go/go/services/horizon/internal/ingest/main.go:546 +0x49
2023-07-27T22:20:41.1352783Z   github.com/stellar/go/services/horizon/internal.(*App).Serve.func1()
2023-07-27T22:20:41.1353480Z       /home/runner/work/go/go/services/horizon/internal/app.go:105 +0x5e
2023-07-27T22:20:41.1353814Z 
2023-07-27T22:20:41.1354145Z Goroutine 32960 (running) created at:
2023-07-27T22:20:41.1355011Z   github.com/stellar/go/services/horizon/internal.(*App).Serve()
2023-07-27T22:20:41.1355646Z       /home/runner/work/go/go/services/horizon/internal/app.go:132 +0x81d
2023-07-27T22:20:41.1356363Z   github.com/stellar/go/services/horizon/internal/test/integration.(*Test).StartHorizon.func4()
2023-07-27T22:20:41.1359297Z       /home/runner/work/go/go/services/horizon/internal/test/integration/integration.go:420 +0x55
2023-07-27T22:20:41.1361030Z 
2023-07-27T22:20:41.1361414Z Goroutine 32951 (running) created at:
2023-07-27T22:20:41.1361868Z   github.com/stellar/go/services/horizon/internal.(*App).Serve()
2023-07-27T22:20:41.1362595Z       /home/runner/work/go/go/services/horizon/internal/app.go:104 +0x4d1
2023-07-27T22:20:41.1363640Z   github.com/stellar/go/services/horizon/internal/test/integration.(*Test).StartHorizon.func4()
2023-07-27T22:20:41.1364532Z       /home/runner/work/go/go/services/horizon/internal/test/integration/integration.go:420 +0x55
2023-07-27T22:20:41.1365015Z ==================
@urvisavla (Contributor, Author)

Here is what I believe is happening:

The race condition occurs because the wg sync.WaitGroup in the ingestion system is accessed concurrently by two different goroutines.

In the runStateMachine function, s.wg.Add(1) is called to register a task with the WaitGroup.
In the Shutdown function, s.wg.Wait() is called to wait for all tasks in the WaitGroup to complete.

While Wait() and Done() are safe to call concurrently, an Add() that starts from a zero counter must happen before Wait(). The reason this issue hasn't been seen earlier is that wg.Add() is executed only once, before entering the state machine loop. Once the state machine starts, the only access to the WaitGroup is through Done() and Wait(), both of which are safe.

In other words, the write to the WaitGroup happens only once, at the very start of the ingestion system, and in all practical scenarios Shutdown is not called before that has executed.
In this test, however, we start Horizon and immediately shut it down, so Add() and Wait() run concurrently against the WaitGroup and cause the race condition.
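
Here is a minimal, self-contained Go sketch of the same pattern (illustrative names only, not Horizon's actual code): one goroutine registers with the WaitGroup via Add(1) while another calls Wait(). Run with go run -race; depending on scheduling, the race detector can report a DATA RACE pairing the Wait in shutdown() with the Add in run(), mirroring the report above.

```go
package main

import (
	"sync"
	"time"
)

// system mirrors the shape of the ingestion system for illustration.
type system struct {
	wg sync.WaitGroup
}

// run plays the role of runStateMachine: it registers itself with the
// WaitGroup and then does its work.
func (s *system) run() {
	s.wg.Add(1) // the "previous read" side of the race report
	defer s.wg.Done()
	time.Sleep(10 * time.Millisecond) // simulated state machine loop
}

// shutdown plays the role of Shutdown: it waits for running work to finish.
func (s *system) shutdown() {
	s.wg.Wait() // the "write" side of the race report
}

func main() {
	s := &system{}
	go s.run()      // analogous to App.Serve starting ingestion
	go s.shutdown() // analogous to the test shutting Horizon down immediately
	time.Sleep(100 * time.Millisecond)
}
```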

@sreuland (Contributor)

@urvisavla, @aditya1702, based on the investigation here, it sounds like it boils down to removing the shutdown calls in that new test?

281390c

urvisavla moved this from Backlog to Current Sprint in Platform Scrum on Jul 28, 2023
@aditya1702 (Contributor) commented Jul 28, 2023

@urvisavla Does adding test.WaitForHorizon() fix this, since we would then wait for Horizon to come up before shutting down the test?

Also, I don't understand why this test is different from the other ones already present (TestMaxPathFindingRequests, TestMaxAssetsForPathRequests). They follow the same pattern of starting Horizon, waiting for it, and then asserting some variables.
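
For illustration, here is a sketch building on the earlier one (the ready channel and function names are hypothetical, not the real integration-test API) of why waiting for startup to complete, the analogue of test.WaitForHorizon(), removes the race: the channel establishes a happens-before edge, so Add() is guaranteed to complete before Wait() starts.

```go
package main

import "sync"

type system struct {
	wg    sync.WaitGroup
	ready chan struct{} // hypothetical readiness signal
}

func (s *system) run() {
	s.wg.Add(1) // now ordered before Wait via the ready channel
	defer s.wg.Done()
	close(s.ready) // signal that startup has finished
	// ... state machine work ...
}

func (s *system) shutdown() {
	<-s.ready // analogue of test.WaitForHorizon(): block until startup completed
	s.wg.Wait()
}

func main() {
	s := &system{ready: make(chan struct{})}
	go s.run()
	s.shutdown()
}
```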
