Hour reaper fail, database size grow too large #3728

ghost · 2021-06-29T07:28:10Z

horizon config

stellar-horizon serve  --db-url postgres://horizon:password@localhost/horizon --captive-core-config-append-path /data2/xlm/stellar-captive-core-stub.toml --network-passphrase "Public Global Stellar Network ; September 2015" --ingest=true --per-hour-rate-limit 999999999 --history-retention-count 30000 --stellar-core-binary-path /usr/bin/stellar-core  --history-archive-urls https://history.stellar.org/prd/core-live/core_live_001

CATCHUP_RECENT=30000

time="2021-06-29T07:23:31.337+08:00" level=info msg="reaper: clearing" new_elder=36068890 pid=19878
time="2021-06-29T07:23:41.343+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T08:23:42.337+08:00" level=info msg="reaper: clearing" new_elder=36069553 pid=19878
time="2021-06-29T08:23:52.347+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T09:23:53.348+08:00" level=info msg="reaper: clearing" new_elder=36070216 pid=19878
time="2021-06-29T09:24:03.575+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T10:24:04.337+08:00" level=info msg="reaper: clearing" new_elder=36070878 pid=19878
time="2021-06-29T10:24:14.423+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T11:24:15.337+08:00" level=info msg="reaper: clearing" new_elder=36071543 pid=19878
time="2021-06-29T11:24:25.457+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T12:24:26.337+08:00" level=info msg="reaper: clearing" new_elder=36072206 pid=19878
time="2021-06-29T12:24:36.535+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T13:24:37.337+08:00" level=info msg="reaper: clearing" new_elder=36072870 pid=19878
time="2021-06-29T13:24:47.555+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878
time="2021-06-29T14:24:48.336+08:00" level=info msg="reaper: clearing" new_elder=36073532 pid=19878
time="2021-06-29T14:24:58.340+08:00" level=error msg="reaper failed: Error clearing history_operations: canceling statement due to user request" pid=19878


ghost · 2021-06-29T07:52:21Z

I ran stellar-horizon db reap manually. It succeeds, but it is too slow (about 34 minutes):

INFO[2021-06-29T15:09:44.356+08:00] reaper: clearing                              new_elder=36074024 pid=27711
INFO[2021-06-29T15:43:26.376+08:00] reaper succeeded                              new_elder=36074024 pid=27711

leevlad · 2021-06-30T14:30:49Z

I am also seeing this exact behavior on all of my Stellar Horizon deployments.

stellar-horizon[3330]: time="2021-06-30T06:49:06.535Z" level=info msg="reaper: clearing" new_elder=36041350 pid=3330
stellar-horizon[3330]: time="2021-06-30T06:49:16.537Z" level=error msg="reaper failed: Error clearing history_effects: canceling statement due to user request" pid=3330
stellar-horizon[3330]: time="2021-06-30T07:49:17.536Z" level=info msg="reaper: clearing" new_elder=36042008 pid=3330
stellar-horizon[3330]: time="2021-06-30T07:49:27.543Z" level=error msg="reaper failed: Error clearing history_effects: canceling statement due to user request" pid=3330
stellar-horizon[3330]: time="2021-06-30T08:49:27.547Z" level=info msg="reaper: clearing" new_elder=36042669 pid=3330
stellar-horizon[3330]: time="2021-06-30T08:49:37.536Z" level=error msg="reaper failed: Error clearing history_effects: canceling statement due to user request" pid=3330

In my logs, as well as in the logs from the user above, every reaper run fails 10 seconds after it starts.

I looked around a bit, and I believe this bug was introduced in v2.3.0 here:
2348575

Hard-coding a 10-second timeout means the automatic reaper times out on every run. The only exception might be deployments on very powerful machines, where a reap can finish within 10 seconds. Even that is fragile: a single missed reaper tick leaves a larger data set for the next tick, which makes that tick less likely to succeed, and the failures eventually cascade into a 100% reaper failure rate. This effectively disables the HISTORY_RETENTION_COUNT configuration and causes the Horizon DB size to grow indefinitely.

Perhaps it would be better to not use a shared context for all tickers here:

go/services/horizon/internal/app.go

Lines 384 to 402 in dc5baa1

    
    func (a *App) Tick(ctx context.Context) error {
    	var wg sync.WaitGroup
    	log.Debug("ticking app")
    	// update ledger state, operation fee state, and stellar-core info in parallel
    	wg.Add(3)
    	go func() { a.UpdateLedgerState(ctx); wg.Done() }()
    	go func() { a.UpdateFeeStatsState(ctx); wg.Done() }()
    	go func() { a.UpdateStellarCoreInfo(ctx); wg.Done() }()
    	wg.Wait()
    	wg.Add(2)
    	go func() { a.reaper.Tick(ctx); wg.Done() }()
    	go func() { a.submitter.Tick(ctx); wg.Done() }()
    	wg.Wait()
    	log.Debug("finished ticking app")
    	return ctx.Err()
    }

Instead, the reaper ticker could use a separate context with a timeout longer than 10 seconds, ideally one that can be configured via a CLI flag or environment variable.
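To illustrate the idea, here is a minimal, self-contained sketch of giving the reaper its own deadline. It is not Horizon's code: runReap, slowReap, and compare are hypothetical names, the durations are scaled down to milliseconds, and a real fix would read the reaper timeout from configuration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// reapFunc stands in for the reaper's delete work (hypothetical).
type reapFunc func(ctx context.Context) error

// runReap derives a dedicated context for the reaper so its deadline is
// independent of the short-lived context used by the application tick.
func runReap(parent context.Context, timeout time.Duration, reap reapFunc) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return reap(ctx)
}

// slowReap simulates a delete that needs `work` to finish and aborts
// as soon as its context is cancelled.
func slowReap(work time.Duration) reapFunc {
	return func(ctx context.Context) error {
		select {
		case <-time.After(work):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// compare runs the same simulated reap twice: once under the shared tick
// context and once under a separate context with its own timeout.
// All arguments are in milliseconds.
func compare(tickMs, reapMs, workMs int) (shared, separate error) {
	ms := func(n int) time.Duration { return time.Duration(n) * time.Millisecond }
	reap := slowReap(ms(workMs))

	tickCtx, cancel := context.WithTimeout(context.Background(), ms(tickMs))
	defer cancel()
	shared = reap(tickCtx)

	separate = runReap(context.Background(), ms(reapMs), reap)
	return shared, separate
}

func main() {
	// Tick deadline 10 ms, reaper deadline 500 ms, simulated work 50 ms.
	shared, separate := compare(10, 500, 50)
	fmt.Println("shared context timed out:", errors.Is(shared, context.DeadlineExceeded))
	fmt.Println("separate context error:", separate)
}
```

With the shared context the simulated reap is cancelled by the tick deadline, which matches the "canceling statement due to user request" errors in the logs above. With its own, longer deadline the same work completes.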

ghost · 2021-07-01T01:15:53Z

Sorry, I did not find the existing issue for this before opening mine. This issue should be closed.

ghost added the bug label Jun 29, 2021

ghost changed the title ~~Hour reaper fail,database grow too large~~ Hour reaper fail, database size grow too large Jun 29, 2021

leevlad mentioned this issue Jun 30, 2021

Horizon doesn't enforce retention policy (HISTORY_RETENTION_COUNT environment variable) #3711

Closed

ghost closed this as completed Jul 1, 2021

leevlad mentioned this issue Jul 9, 2021

Stellar Horizon: select failed: sql: Scan error on column index 21, name \"ledger_close_time\": unsupported Scan #3751

Closed

bartekn mentioned this issue Jul 23, 2021

services/horizon: Move reap service outside global tick #3777

Merged

7 tasks

This issue was closed.
