Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci(sync): increase the height of blocks for some full sync jobs #5391

Merged
merged 5 commits into from
Oct 21, 2022

Conversation

arya2
Copy link
Contributor

@arya2 arya2 commented Oct 12, 2022

Motivation

This is the standard fix for when a full sync job runs too long (logs-checkpoint needs to sync ~45k blocks as of the latest checkpoint update).

Solution

  • Replace full sync job for 1790k blocks with one for 1800k blocks
  • Add full sync job for 1820k blocks

See fixing-ci-sync-timeouts

Review

Anyone can review.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
  • How do you know it works? Does it have tests?

@arya2 arya2 added C-bug Category: This is a bug A-devops Area: Pipelines, CI/CD and Dockerfiles I-integration-fail Continuous integration fails, including build and test failures C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG labels Oct 12, 2022
@arya2 arya2 requested a review from a team as a code owner October 12, 2022 16:03
@arya2 arya2 requested review from teor2345 and removed request for a team October 12, 2022 16:03
@github-actions github-actions bot added the C-feature Category: New features label Oct 12, 2022
@arya2
Copy link
Contributor Author

arya2 commented Oct 12, 2022

The checkpoint step took 4-5 hours to get to 1810k blocks (timed out after ~1817k). Adding just one extra step this time.

@arya2 arya2 force-pushed the add-step-to-ci-docker branch from 33aa305 to 3623582 Compare October 12, 2022 16:28
@gustavovalverde gustavovalverde changed the title fix(ci): fix full sync failure by adding a job for 1810k blocks ci(sync): fix full sync failure by adding a job for 1810k blocks Oct 12, 2022
@arya2 arya2 requested a review from a team October 12, 2022 16:49
@arya2
Copy link
Contributor Author

arya2 commented Oct 12, 2022

ssh failure in logs-checkpoint:

Timeout, server 34.123.232.71 not responding.

..

ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

https://github.com/ZcashFoundation/zebra/actions/runs/3236318547/jobs/5302131447#step:6:777

@teor2345
Copy link
Contributor

ssh failure in logs-checkpoint: https://github.com/ZcashFoundation/zebra/actions/runs/3236318547/jobs/5302131447#step:6:777

There are lots of different ways ssh can fail, can you please quote the actual failure log?

Timeout, server 34.123.232.71 not responding.

Copy link
Contributor

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also saw these sync warnings in the logs:

2022-10-11T03:59:58.369043Z WARN {net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 30 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=31.016% current_height=Height(572486) network_upgrade=Sapling time_since_last_state_block=30m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false

https://github.com/ZcashFoundation/zebra/actions/runs/3223410431/jobs/5295854595#step:6:2844

Every time they happen, the sync takes half an hour longer waiting for a single block. So maybe we could drop the timeout back down to 10 minutes first?

(We might still need this PR.)

@teor2345
Copy link
Contributor

Every time they happen, the sync takes half an hour longer waiting for a single block. So maybe we could drop the timeout back down to 10 minutes first?

Let's see how #5397 goes, then try this one?

There is also a timeout in the Rust test we might need to fix.

@arya2
Copy link
Contributor Author

arya2 commented Oct 13, 2022

I also saw these sync warnings in the logs:

2022-10-11T03:59:58.369043Z WARN {net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 30 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=31.016% current_height=Height(572486) network_upgrade=Sapling time_since_last_state_block=30m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false

https://github.com/ZcashFoundation/zebra/actions/runs/3223410431/jobs/5295854595#step:6:2844

Every time they happen, the sync takes half an hour longer waiting for a single block. So maybe we could drop the timeout back down to 10 minutes first?

We should definitely fix that first.

(We might still need this PR.)

It's doubtful it'll make it from 1790k to ~1834k, though we might be able to replace logs-1790k (~2h 40m) with logs-1800k (seems to be another ~2h 20m) instead of adding a new step.

@gustavovalverde
Copy link
Member

@arya2 do we still need this PR?

@gustavovalverde
Copy link
Member

Seems like we are still having timeouts in the main branch, but I'll left @teor2345 the last call here even though is a standard change, as I'm not sure if there are other works in progress to fix it another way

@arya2
Copy link
Contributor Author

arya2 commented Oct 18, 2022

do we still need this PR?

I believe we do, but I'll just quickly check the logs for the timeouts and perhaps replace logs-1790k with logs-1800k.

@teor2345
Copy link
Contributor

Seems like we are still having timeouts in the main branch, but I'll left @teor2345 the last call here even though is a standard change, as I'm not sure if there are other works in progress to fix it another way

We're done with the fixes due to PR #5257, it should all be standard fixes from now on.

@arya2 arya2 requested a review from teor2345 October 19, 2022 16:58
@arya2 arya2 changed the title ci(sync): fix full sync failure by adding a job for 1810k blocks ci(sync): fix full sync failure by replacing a job for 1790k blocks with one for 1800k blocks and adding a job for 1820k blocks Oct 19, 2022
Copy link
Member

@gustavovalverde gustavovalverde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gustavovalverde gustavovalverde changed the title ci(sync): fix full sync failure by replacing a job for 1790k blocks with one for 1800k blocks and adding a job for 1820k blocks ci(sync): increase the height of blocks for the full sync test Oct 19, 2022
@gustavovalverde gustavovalverde changed the title ci(sync): increase the height of blocks for the full sync test ci(sync): increase the height of blocks for some full sync jobs Oct 19, 2022
@gustavovalverde gustavovalverde removed the C-feature Category: New features label Oct 19, 2022
Copy link
Contributor

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the regexes might be wrong here?

Co-authored-by: teor <teor@riseup.net>
@github-actions github-actions bot added the C-feature Category: New features label Oct 19, 2022
Co-authored-by: teor <teor@riseup.net>
@arya2 arya2 requested a review from teor2345 October 19, 2022 21:32
@mpguerra
Copy link
Contributor

mpguerra commented Oct 20, 2022

is there a github issue that this is related to that I can link in?

@arya2
Copy link
Contributor Author

arya2 commented Oct 20, 2022

is there a github issue that this is related to that I can link in?

No, this was ad-hoc and trivial.

Copy link
Contributor

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@teor2345
Copy link
Contributor

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Oct 21, 2022

update

✅ Branch has been successfully updated

mergify bot added a commit that referenced this pull request Oct 21, 2022
@mergify mergify bot merged commit 2cadb73 into main Oct 21, 2022
@mergify mergify bot deleted the add-step-to-ci-docker branch October 21, 2022 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles C-bug Category: This is a bug C-feature Category: New features C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG I-integration-fail Continuous integration fails, including build and test failures
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants