This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Bigtable Upload Service Loops on Same Block Repeatedly #33831

Closed
McSim85 opened this issue Oct 24, 2023 · 3 comments · Fixed by #33861

Comments

@McSim85

McSim85 commented Oct 24, 2023

We have the same issue on two warehouse nodes.
[screenshot of bigtable upload service logs]
This started on 1.14.17 (around 19 September) and still happens; we are currently on 1.16.17.

I will add more details shortly.

version was running

```
solana -V
solana-cli 1.16.17 (src:667bc163; feat:4033350765, client:SolanaLabs)
```

bounds of the ledger

```
solana@sol-warehouse-jfk-a01:~$ solana-ledger-tool -l  ./ledger/ bounds
[2023-10-24T11:27:20.845573653Z INFO  solana_ledger_tool] solana-ledger-tool 1.16.17 (src:667bc163; feat:4033350765, client:SolanaLabs)
[2023-10-24T11:27:20.845664206Z INFO  solana_ledger::blockstore] Maximum open file descriptors: 1048576
[2023-10-24T11:27:20.845672926Z INFO  solana_ledger::blockstore] Opening database at "/home/solana/ledger/rocksdb"
[2023-10-24T11:27:20.848092279Z INFO  solana_ledger::blockstore_db] Opening Rocks with secondary (read only) access at: "/home/solana/ledger/rocksdb/solana-secondary"
[2023-10-24T11:27:20.848102869Z INFO  solana_ledger::blockstore_db] This secondary access could temporarily degrade other accesses, such as by solana-validator
[2023-10-24T11:28:19.255389546Z INFO  solana_ledger::blockstore_db] Rocks's automatic compactions are disabled due to Secondary access
[2023-10-24T11:28:19.255506870Z INFO  solana_ledger::blockstore] "/home/solana/ledger/rocksdb" open took 58.4s
Ledger has data for 1240156 slots 224411444 to 225683930
  with 1219130 rooted slots from 224411444 to 225683898
  and 32 slots past the last root

[2023-10-24T11:28:21.276443965Z INFO  solana_ledger_tool] ledger tool took 60.4s
```

bounds of bigtable

all data from genesis

when you saw this

It mostly happens after an upgrade/restart.

relevant issue

> Yeah, I think that could happen, depending on node settings.
>
> I'm going to close this for now, but please re-open if you see it again.
> I'll also ponder whether bigtable-upload should support fragmented ledgers.

Originally posted by @CriesofCarrots in #27732 (comment)

@McSim85
Author

McSim85 commented Oct 24, 2023

Right now, the server is in the loop, so it's an excellent chance to troubleshoot.

@CriesofCarrots
Contributor

Sure, if you can tell me what the missing ranges in your ledger are, I can take a look at the uploader logic. It should probably emit an error and halt rather than loop continually.

@steviez
Contributor

steviez commented Oct 25, 2023

@CriesofCarrots - In regards to looping on the same slot, I think we can attribute that to this comment

```rust
/// Uploads a range of blocks from a Blockstore to bigtable LedgerStorage
/// Returns the Slot of the last block checked. If no blocks in the range `[starting_slot,
/// ending_slot]` are found in Blockstore, this value is equal to `ending_slot`.
```

and this line here:

```rust
return Ok(last_blockstore_slot);
```

From the logs posted above in the issue description, bigtable_upload::upload_confirmed_blocks() is getting called with start_slot=225163166, end_slot=225163358.

```rust
let result = runtime.block_on(bigtable_upload::upload_confirmed_blocks(
    blockstore.clone(),
    bigtable_ledger_storage.clone(),
    start_slot,
    end_slot,
    config.clone(),
    exit.clone(),
));
match result {
    Ok(last_slot_uploaded) => start_slot = last_slot_uploaded,
```

The logs then indicate that the blockstore only has the slot 225163166 available within [225163166, 225163358]:

```
Found 1 slots in the range (225163166, 225163166)
```

And that the slot has already been uploaded:

```
No blocks between 225163166 and 225163358 need to be uploaded to bigtable
```

So there is no work to do, and we hit this early return with `last_blockstore_slot = 225163166`:

```rust
if blocks_to_upload.is_empty() {
    info!(
        "No blocks between {} and {} need to be uploaded to bigtable",
        starting_slot, ending_slot
    );
    return Ok(last_blockstore_slot);
```

Looking at the snippet from BigTableUploadService above, start_slot will remain at 225163166:

```rust
Ok(last_slot_uploaded) => start_slot = last_slot_uploaded,
```

And thus we're stuck, since this slot sits at the edge of a gap and our node will never repair/replay the slots immediately after 225163166. We can seemingly avoid getting stuck by returning `ending_slot`, as we already do in the other early-return case:

```rust
if blockstore_slots.is_empty() {
    warn!("Ledger has no slots from {starting_slot} to {ending_slot:?}");
    return Ok(ending_slot);
}
```
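The stuck-loop behavior and the proposed fix can be illustrated with a small standalone simulation (hypothetical function and variable names; this is a sketch of the control flow, not the actual solana-ledger code): returning `last_blockstore_slot` from the early return leaves `start_slot` pinned at the slot before the gap forever, while returning `ending_slot` lets the service advance past it.

```rust
/// Sketch of one call to the uploader over [starting_slot, ending_slot] when the
/// blockstore only contains `available` slots and all of them are already uploaded
/// (the `blocks_to_upload.is_empty()` case). `fixed` selects the proposed behavior.
fn upload_range(starting_slot: u64, ending_slot: u64, available: &[u64], fixed: bool) -> u64 {
    let in_range: Vec<u64> = available
        .iter()
        .copied()
        .filter(|s| (starting_slot..=ending_slot).contains(s))
        .collect();
    // Last slot actually present in the blockstore within the range.
    let last_blockstore_slot = in_range.last().copied().unwrap_or(ending_slot);
    if fixed {
        ending_slot // proposed: skip past the gap
    } else {
        last_blockstore_slot // current: start_slot stays at the gap edge
    }
}

fn main() {
    // Gap after slot 225163166: the node never repairs (225163166, 225163358].
    let available = [225163166u64];
    let (start, end) = (225163166u64, 225163358u64);

    // Current behavior: the returned slot equals start, so start_slot never advances.
    assert_eq!(upload_range(start, end, &available, false), 225163166);
    // Proposed behavior: the service jumps to ending_slot and makes progress.
    assert_eq!(upload_range(start, end, &available, true), 225163358);
}
```

The key design point is that the return value feeds directly back into `start_slot` in BigTableUploadService, so any return value less than or equal to the incoming `start_slot` produces an infinite loop whenever the intervening slots can never appear in the blockstore.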

PS: @McSim85 - When posting logs in the future, please paste the text between triple backticks instead of pasting an image; it makes it easier for us to copy/paste/search the text.
