Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There should be no way of creating branch without specifying LSN, otherwise there is a race #2063

Closed
yeputons opened this issue Jul 8, 2022 · 5 comments · May be fixed by #3140
Closed

There should be no way of creating branch without specifying LSN, otherwise there is a race #2063

yeputons opened this issue Jul 8, 2022 · 5 comments · May be fixed by #3140
Labels
a/reliability Area: relates to reliability of the service a/test/flaky Area: related to flaky tests c/storage Component: storage t/bug Issue Type: Bug

Comments

@yeputons
Copy link
Contributor

yeputons commented Jul 8, 2022

Follow-up of the last failures in #1667 (comment), see also #2062.

Looking at our neon_local timeline branch command, it works entirely with pageserver:

neon/neon_local/src/main.rs

Lines 690 to 710 in 39d86ed

let start_lsn = branch_match
.value_of("ancestor-start-lsn")
.map(Lsn::from_str)
.transpose()
.context("Failed to parse ancestor start Lsn from the request")?;
let timeline = pageserver
.timeline_create(tenant_id, None, start_lsn, Some(ancestor_timeline_id))?
.ok_or_else(|| anyhow!("Failed to create new timeline for tenant {}", tenant_id))?;
let new_timeline_id = timeline.timeline_id;
let last_record_lsn = timeline
.local
.expect("no local timeline info")
.last_record_lsn;
env.register_branch_mapping(new_branch_name.to_string(), tenant_id, new_timeline_id)?;
println!(
"Created timeline '{}' at Lsn {} for tenant: {}. Ancestor timeline: '{}'",
timeline.timeline_id, last_record_lsn, tenant_id, ancestor_branch_name,
);

Moreover, the test_tenant_relocation test runs it with --ancestor-branch-name parameter only, no --ancestor-start-lsn. So the pageserver decides what LSN to create the branch at. Hence, if the pageserver is not fully caught up with safekeepers, the branch starts a little earlier than it should, truncating parts of the last transaction (effectively erasing it from the second timeline, I guess).

For example, it may result in test_tenant_relocation failure even before there are relocations: it creates two branches, and the second branch is sometimes created slightly before the first is fully on the pageserver, resulting in divergence: https://app.circleci.com/pipelines/github/neondatabase/neon/7828/workflows/bce7b9dc-e390-4f42-8beb-1a2e81bf2b9d/jobs/80367

assert (500500,) == (1001000,)
  At index 0 diff: 500500 != 1001000
  Full diff:
  - (1001000,)
  + (500500,)
test_runner/batch_others/test_tenant_relocation.py:259: in test_tenant_relocation
    timeline_id_second, current_lsn_second = populate_branch(pg_second, create_table=False, expected_sum=1001000)
test_runner/batch_others/test_tenant_relocation.py:133: in populate_branch
    assert cur.fetchone() == (expected_sum, )
E   assert (500500,) == (1001000,)
E     At index 0 diff: 500500 != 1001000
E     Full diff:
E     - (1001000,)
E     + (500500,)

If so, I argue it's not a flaky test; it's a design bug. We should have precise semantics of what "branching off a named branch" is (probably in terms of commit_lsn/flush_lsn/whatever on Safekeepers), and there should be no way to violate it by misusing the API. For example, there should be no way to call the pageserver branch creation API without specifying the exact parent timeline and the exact LSN in it.

I suspect there may be other similar bugs lurking around; the pageserver is probably the source of truth in multiple places (see, e.g., #809). I suspect it would be impossible for our end users to debug unless we somehow expose both safekeepers/pageserver's LSNs, and it would be very hard if we do.

@yeputons yeputons added t/bug Issue Type: Bug a/reliability Area: relates to reliability of the service c/storage Component: storage labels Jul 8, 2022
@yeputons yeputons changed the title There should be no way of creating There should be no way of creating branch without specifying LSN, otherwise there is a race Jul 8, 2022
@LizardWizzard
Copy link
Contributor

This problem was mentioned before, see #1371 proper solution requires rather big tweaks in console. E g when compute for the branch is running and you want to create a new branch you need to go to the compute, pick the current lsn here, and pass it to pageserver. I think this needs to be part of a console branching support epic

cc @ololobus

@ololobus
Copy link
Member

when compute for the branch is running and you want to create a new branch you need to go to the compute, pick the current lsn here

That is racy anyway if compute is under load. Not sure how the API and UI should look like to the end user. Maybe we should always ask for the timestamp to branch? After that we will re-map it to LSN using the pageserver API and finally create a branch

Thanks for the reminder

@hlinnaka
Copy link
Contributor

For example, it may result in test_tenant_relocation failure even before there are relocations: it creates two branches, and the second branch is sometimes created slightly before the first is fully on the pageserver, resulting in divergence: https://app.circleci.com/pipelines/github/neondatabase/neon/7828/workflows/bce7b9dc-e390-4f42-8beb-1a2e81bf2b9d/jobs/80367

This just happened again here: https://github.com/neondatabase/neon/runs/7413000700

hlinnaka added a commit that referenced this issue Jul 20, 2022
If the WAL arrives at the pageserver slowly, it's possible that the
branch is created before all the data on the parent branch have
arrived. That results in a failure:

    test_runner/batch_others/test_tenant_relocation.py:259: in test_tenant_relocation
        timeline_id_second, current_lsn_second = populate_branch(pg_second, create_table=False, expected_sum=1001000)
    test_runner/batch_others/test_tenant_relocation.py:133: in populate_branch
        assert cur.fetchone() == (expected_sum, )
    E   assert (500500,) == (1001000,)
    E     At index 0 diff: 500500 != 1001000
    E     Full diff:
    E     - (1001000,)
    E     + (500500,)

To fix, specify the LSN to branch at, so that the pageserver will wait
for it arrive.

See #2063
hlinnaka added a commit that referenced this issue Jul 20, 2022
If the WAL arrives at the pageserver slowly, it's possible that the
branch is created before all the data on the parent branch have
arrived. That results in a failure:

    test_runner/batch_others/test_tenant_relocation.py:259: in test_tenant_relocation
        timeline_id_second, current_lsn_second = populate_branch(pg_second, create_table=False, expected_sum=1001000)
    test_runner/batch_others/test_tenant_relocation.py:133: in populate_branch
        assert cur.fetchone() == (expected_sum, )
    E   assert (500500,) == (1001000,)
    E     At index 0 diff: 500500 != 1001000
    E     Full diff:
    E     - (1001000,)
    E     + (500500,)

To fix, specify the LSN to branch at, so that the pageserver will wait
for it arrive.

See #2063
@LizardWizzard LizardWizzard added the a/test/flaky Area: related to flaky tests label Aug 8, 2022
@yeputons
Copy link
Contributor Author

yeputons commented Dec 17, 2022

Current status:

I think only these things need addressing:

yeputons added a commit that referenced this issue Dec 17, 2022
This commit cleans up our internal APIs and leaves fewer places where one can forgot
to explicitly specify the LSN to branch from.

* `neon_local branch` still works without `--ancestor-start-lsn` and the user may bump
  into a race condition when branching right after SQL query is completed. We should
  query safekeepers for the most recent `commit_lsn` instead, just like in cloud.
* `TimelineCreateRequest` HTTP API for Pageserver now takes requires both `ancestor_timeline_id`
  and `ancestor_start_lsn` to be simultaneously present or absent. It was allowed to omit start LSN
  before.
yeputons added a commit that referenced this issue Dec 28, 2022
This commit cleans up our internal APIs and leaves fewer places where one can forgot
to explicitly specify the LSN to branch from.

* `neon_local branch` still works without `--ancestor-start-lsn` and the user may bump
  into a race condition when branching right after SQL query is completed. We should
  query safekeepers for the most recent `commit_lsn` instead, just like in cloud.
* `TimelineCreateRequest` HTTP API for Pageserver now takes requires both `ancestor_timeline_id`
  and `ancestor_start_lsn` to be simultaneously present or absent. It was allowed to omit start LSN
  before.
yeputons added a commit that referenced this issue Dec 28, 2022
See #2063 (comment)

Also improve control_plane::safekeeper::SafekeeperNode's HTTP API.

Previously it was possible for a user to do some SQL queries, branch,
and see old data in the branch because pageserver did not get the most
recent WAL from safekeepers yet.

Now we wait until Pageserver catches up with the most recent data
on Safekeepers. An alternative is to `SELECT pg_current_wal_insert_lsn()`,
but that requires running branch's compute node in addition to safekeepers.
Which in turn invites concurrent writes and worse semantics for branching.
@yeputons
Copy link
Contributor Author

yeputons commented Dec 30, 2022

All PRs from the checklist above are created. As this is more of a tech debt issue, closing it now, it requires no further tracking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/reliability Area: relates to reliability of the service a/test/flaky Area: related to flaky tests c/storage Component: storage t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants