Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storagenode: unable to sync from trimmed source replica to a new replica #478

Closed
ijsong opened this issue Jun 13, 2023 · 0 comments · Fixed by #470 or #464
Closed

storagenode: unable to sync from trimmed source replica to a new replica #478

ijsong opened this issue Jun 13, 2023 · 0 comments · Fixed by #470 or #464
Assignees
Labels
bug Something isn't working

Comments

@ijsong
Copy link
Member

ijsong commented Jun 13, 2023

Current Behavior

Log stream replica whose log entries are trimmed cannot call SyncInit to the new replica joined into the log stream.

Expected Behavior

The new log stream replica should accept SyncInit to receive the commit context from the destination replica. It will make the log stream's state SEALED, which can transform writable.

Steps To Reproduce

  1. Add a new log stream.
  2. Append a few log entries.
  3. Trim all log entries.
  4. Change one replica with a new and clean replica.
  5. Tries to unseal the log stream, then you can see it fails.

The log stream could not unseal because it failed to synchronize from the source replica to the destination one joined into the log stream. Although there is no log entry to copy because of the Trim operation, the CommitContext must be copied. However, the destination replica rejects SyncInit since it makes a wrong decision due to no log entry to copy.

Environment

  • Varlog version: v0.13.0
@ijsong ijsong added the bug Something isn't working label Jun 13, 2023
@ijsong ijsong self-assigned this Jun 13, 2023
ijsong added a commit that referenced this issue Jun 13, 2023
…tination

Storage nodes can be trimmed and synchronized. However, there is some bug in that a new destination
replica joined into the log stream rejects SyncInit RPC sent from the trimmed source replica. Those
replicas are all empty and have no log entries; however, the source replica has a commit context
indicating the last committed LLSN. In this situation, the destination replica must accept SyncInit
to receive the commit context from the source replica, but it does not.

This PR fixes the above issue. To solve the problem, it changes the condition that the destination
replica decides whether they are already synchronized.

```go
    // Previous code: https://github.com/kakao/varlog/blob/5269481c0e80c2eebf8214116a2d1544a26cb443/internal/storagenode/logstream/sync.go#L297-L302
    //
    // NOTE: When the replica has all log entries, it returns its range of logs and non-error results.
    // In this case, this replica remains executorStateSealing.
    // Breaking change: previously it returns ErrExist when the replica has all log entries to replicate.
    if dstLastCommittedLLSN == srcRange.LastLLSN && !invalid {
        return snpb.SyncRange{}, status.Errorf(codes.AlreadyExists, "already synchronized")
    }
```

Since both replicas have no log entries, the condition `dstLastCommittedLLSN == srcRange.LastLLSN`
is not enough. This PR changed the condition to be `dstLastCommittedLLSN == srcLastCommittedLLSN &&
dstLastCommittedLLSN == srcRange.LastLLSN`. Since the `srcLastCommittedLLSN` is valid regardless of
log entries in the source replica, the destination replica will accept the SyncInit.

Resolve #478
ijsong added a commit that referenced this issue Jun 17, 2023
…tination

Storage nodes can be trimmed and synchronized. However, there is some bug in that a new destination
replica joined into the log stream rejects SyncInit RPC sent from the trimmed source replica. Those
replicas are all empty and have no log entries; however, the source replica has a commit context
indicating the last committed LLSN. In this situation, the destination replica must accept SyncInit
to receive the commit context from the source replica, but it does not.

This PR fixes the above issue. To solve the problem, it changes the condition that the destination
replica decides whether they are already synchronized.

```go
    // Previous code: https://github.com/kakao/varlog/blob/5269481c0e80c2eebf8214116a2d1544a26cb443/internal/storagenode/logstream/sync.go#L297-L302
    //
    // NOTE: When the replica has all log entries, it returns its range of logs and non-error results.
    // In this case, this replica remains executorStateSealing.
    // Breaking change: previously it returns ErrExist when the replica has all log entries to replicate.
    if dstLastCommittedLLSN == srcRange.LastLLSN && !invalid {
        return snpb.SyncRange{}, status.Errorf(codes.AlreadyExists, "already synchronized")
    }
```

Since both replicas have no log entries, the condition `dstLastCommittedLLSN == srcRange.LastLLSN`
is not enough. This PR changed the condition to be `dstLastCommittedLLSN == srcLastCommittedLLSN &&
dstLastCommittedLLSN == srcRange.LastLLSN`. Since the `srcLastCommittedLLSN` is valid regardless of
log entries in the source replica, the destination replica will accept the SyncInit.

Resolve #478
ijsong added a commit that referenced this issue Jun 19, 2023
…tination

Storage nodes can be trimmed and synchronized. However, there is some bug in that a new destination
replica joined into the log stream rejects SyncInit RPC sent from the trimmed source replica. Those
replicas are all empty and have no log entries; however, the source replica has a commit context
indicating the last committed LLSN. In this situation, the destination replica must accept SyncInit
to receive the commit context from the source replica, but it does not.

This PR fixes the above issue. To solve the problem, it changes the condition that the destination
replica decides whether they are already synchronized.

```go
    // Previous code: https://github.com/kakao/varlog/blob/5269481c0e80c2eebf8214116a2d1544a26cb443/internal/storagenode/logstream/sync.go#L297-L302
    //
    // NOTE: When the replica has all log entries, it returns its range of logs and non-error results.
    // In this case, this replica remains executorStateSealing.
    // Breaking change: previously it returns ErrExist when the replica has all log entries to replicate.
    if dstLastCommittedLLSN == srcRange.LastLLSN && !invalid {
        return snpb.SyncRange{}, status.Errorf(codes.AlreadyExists, "already synchronized")
    }
```

Since both replicas have no log entries, the condition `dstLastCommittedLLSN == srcRange.LastLLSN`
is not enough. This PR changed the condition to be `dstLastCommittedLLSN == srcLastCommittedLLSN &&
dstLastCommittedLLSN == srcRange.LastLLSN`. Since the `srcLastCommittedLLSN` is valid regardless of
log entries in the source replica, the destination replica will accept the SyncInit.

Resolve #478
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
1 participant