[DocDB] Flaky test YbAdminSnapshotScheduleTest.PgsqlAddColumnCompactWithPackedRow #24310

Open
1 task done
Huqicheng opened this issue Oct 7, 2024 · 0 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue

Comments

@Huqicheng
Contributor

Huqicheng commented Oct 7, 2024

Jira Link: DB-13199

Description

YbAdminSnapshotScheduleTest.PgsqlAddColumnCompactWithPackedRow is flaky because it expects snapshot creation to have finished before it runs ALTER TABLE ... ADD COLUMN, but there is a race condition between the two operations.

Issue Type

kind/failing-test

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@Huqicheng Huqicheng added the area/docdb YugabyteDB core features label Oct 7, 2024
@Huqicheng Huqicheng self-assigned this Oct 7, 2024
@yugabyte-ci yugabyte-ci added kind/failing-test Tests and testing infra priority/medium Medium priority issue kind/enhancement This is an enhancement of an existing feature labels Oct 7, 2024
@rthallamko3 rthallamko3 assigned yusong-yan and unassigned Huqicheng Dec 11, 2024
spolitov added a commit that referenced this issue Dec 26, 2024
…t schedule is enabled

Summary:
When a snapshot schedule is active, we prevent records from being compacted until there is a snapshot that contains them.
Due to a race condition, we could get into a situation where a tablet replica considers a snapshot as taken while it does not have it.

The TServer receives the last schedule snapshot time via the heartbeat mechanism.
There are 2 places that could lead to the described bug:
1) The master sends the last snapshot time for a schedule whose snapshot is not yet complete.
2) The snapshot is taken on the leader and one follower, while another follower has not completed the snapshot yet.
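
The following toy model illustrates why a remotely reported snapshot time is unsafe for a replica that has not finished the snapshot itself. It is only a sketch: `Replica`, `can_compact`, and the integer timestamps are hypothetical names and values chosen for illustration, not YugabyteDB code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Hypothetical tablet replica that tracks the snapshots it actually holds."""
    name: str
    snapshot_times: List[int] = field(default_factory=list)

    def local_last_snapshot_time(self) -> int:
        # Latest snapshot this replica has completed; 0 means "none yet".
        return max(self.snapshot_times, default=0)


def can_compact(record_time: int, last_snapshot_time: int) -> bool:
    # Assumed retention rule: a record may be compacted only once a
    # snapshot taken at or after its time exists.
    return record_time <= last_snapshot_time


leader = Replica("leader", [50, 100])
lagging_follower = Replica("follower-2", [50])  # snapshot at time 100 not taken yet

reported_by_master = 100  # time propagated via heartbeat (assumed value)
record_time = 80

# Trusting the reported time lets the lagging follower drop a record that
# none of its own snapshots contains.
unsafe = can_compact(record_time, reported_by_master)

# Using the locally known time keeps the record until the snapshot exists.
safe = can_compact(record_time, lagging_follower.local_last_snapshot_time())

print(unsafe, safe)  # True False
```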

Fixed by maintaining the last snapshot time locally and storing this information in the filesystem.

When a snapshot is created for a schedule, we create a file in the snapshots folder named last_snapshot.[SCHEDULE_ID].[SNAPSHOT_TIME]
and remove the previous file for the same schedule.
During tablet bootstrap, we list all files in the snapshots folder, recover the latest snapshot times, and remove outdated files.
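
The marker-file scheme described above could be sketched roughly as follows. This is not the code from the commit: the function names are hypothetical, and the sketch assumes that the snapshot time is encoded as a decimal integer and that the time follows the last dot in the file name.

```python
import os
import tempfile

PREFIX = "last_snapshot."


def parse_marker(filename):
    """Return (schedule_id, snapshot_time), or None if this is not a marker file."""
    if not filename.startswith(PREFIX):
        return None
    schedule_id, sep, time_text = filename[len(PREFIX):].rpartition(".")
    if not sep or not schedule_id or not time_text.isdigit():
        return None
    return schedule_id, int(time_text)


def record_snapshot(snapshots_dir, schedule_id, snapshot_time):
    """Create the marker for a new snapshot, then drop older markers of the schedule."""
    new_name = f"{PREFIX}{schedule_id}.{snapshot_time}"
    # Create the new marker first so a crash never leaves the schedule without one.
    open(os.path.join(snapshots_dir, new_name), "w").close()
    for name in os.listdir(snapshots_dir):
        parsed = parse_marker(name)
        if parsed and parsed[0] == schedule_id and parsed[1] < snapshot_time:
            os.remove(os.path.join(snapshots_dir, name))


def recover_last_snapshot_times(snapshots_dir):
    """Bootstrap step: rebuild schedule_id -> latest time and delete outdated markers."""
    markers = []
    latest = {}
    for name in os.listdir(snapshots_dir):
        parsed = parse_marker(name)
        if parsed is None:
            continue  # unrelated file in the snapshots folder
        schedule_id, snapshot_time = parsed
        markers.append((name, schedule_id, snapshot_time))
        latest[schedule_id] = max(latest.get(schedule_id, 0), snapshot_time)
    for name, schedule_id, snapshot_time in markers:
        if snapshot_time < latest[schedule_id]:
            os.remove(os.path.join(snapshots_dir, name))
    return latest


with tempfile.TemporaryDirectory() as snapshots_dir:
    record_snapshot(snapshots_dir, "schedule-a", 100)
    record_snapshot(snapshots_dir, "schedule-a", 200)
    print(recover_last_snapshot_times(snapshots_dir))  # {'schedule-a': 200}
```

Because each replica writes its marker only after its own snapshot completes, the recovered time cannot run ahead of the snapshots that replica actually holds.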
Jira: DB-13199

Test Plan: ./yb_build.sh release -n 400 --gtest_filter YbAdminSnapshotScheduleTest.PgsqlAddColumnCompactWithPackedRow -- -p 8

Reviewers: timur

Reviewed By: timur

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D40893

3 participants