
All replicas are removed #1572

Closed
acelyc111 opened this issue Jul 31, 2023 · 1 comment
Labels
type/bug This issue reports a bug.

Comments


acelyc111 commented Jul 31, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    The slogs are corrupted on more than 3 replica servers in a standard Pegasus cluster.
    (If the slog is corrupted in a single-node cluster, it causes the same issue.)

The slog corruption was caused by an earlier crash of the replica server, which had hit a large number of EAGAIN errors from io_submit(); that is a separate issue we will handle (io_submit() has been replaced by pwrite() since 2.2).

  2. What did you expect to see?
    The replicas are kept in their normal directories, so we can recover the table manually.

  3. What did you see instead?
    All replicas of some partitions are moved to "<app_id>.<partition_id>.pegasus.<timestamp>.err", so the cluster is not able to recover automatically; we have to find and move the replicas back one by one (see the restore sketch after this list).

E2023-07-29 03:17:32.797 (1690571852797613997 1eb83) replica.default0.0000eb5600010001: mutation_log.cpp:2048:read_next_log_block(): invalid data header magic: 0x0
D2023-07-29 03:17:32.797 (1690571852797631915 1eb83) replica.default0.0000eb5600010001: mutation_log_replay.cpp:41:replay(): finish to replay mutation log (/data3/sa_cluster/skv_offline/replica/slog/log.139.18306729623) [err: ERR_INVALID_DATA: failed to read log block]
E2023-07-29 03:17:32.797 (1690571852797648461 1eb83) replica.default0.0000eb5600010001: mutation_log_replay.cpp:189:replay(): replay mutation log failed: ERR_INVALID_DATA
E2023-07-29 03:17:32.798 (1690571852798053578 1eb83) replica.default0.0000eb5600010001: replica_stub.cpp:505:initialize(): replay shared log failed, err = ERR_INVALID_DATA, time_used = 618 ms, clear all logs ...
E2023-07-29 03:17:32.803 (1690571852803234241 1eb83) replica.default0.0000eb5600010001: pegasus_server_impl.cpp:2777:flush_all_family_columns(): [11.203@x.x.x.x:8171] flush failed, error = Shutdown in progress:
D2023-07-29 03:17:32.804 (1690571852804425084 1eb83) replica.default0.0000eb5600010001: pegasus_server_impl.cpp:1638:stop(): 11.203@x.x.x.x:8171: close app succeed, clear_state = false
D2023-07-29 03:17:32.804 (1690571852804709002 1eb83) replica.default0.0000eb5600010001: replica.cpp:419:close(): 11.203@x.x.x.x:8171: replica closed, time_used = 6ms
W2023-07-29 03:17:32.804 (1690571852804765679 1eb83) replica.default0.0000eb5600010001: replica_stub.cpp:524:initialize(): init_replica: {replica_dir_op} succeed to move directory '/data1/sa_cluster/skv_offline/replica/reps/11.203.pegasus' to '/data1/sa_cluster/skv_offline/replica/reps/11.203.pegasus.1690571852804715.err'
  4. What version of Pegasus are you using?
    2.0
    (2.4 has the same issue)
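
For reference, a minimal C++17 sketch of how the ".err" directories could be moved back in bulk. This is not part of Pegasus; the suffix pattern is inferred from the replica_stub log line above, and the program only strips the trailing ".<timestamp>.err" suffix:

// restore_err_dirs.cpp -- sketch only, NOT part of Pegasus.
// Assumes the layout seen in the log above: a replica directory such as
// "11.203.pegasus" was renamed to "11.203.pegasus.1690571852804715.err".
#include <filesystem>
#include <iostream>
#include <regex>
#include <string>

namespace fs = std::filesystem;

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <reps_dir>\n";
        return 1;
    }
    // Matches "<app_id>.<partition_id>.pegasus.<timestamp>.err".
    const std::regex err_dir(R"((\d+\.\d+\.pegasus)\.\d+\.err)");
    for (const auto &entry : fs::directory_iterator(argv[1])) {
        const std::string name = entry.path().filename().string();
        std::smatch m;
        if (!entry.is_directory() || !std::regex_match(name, m, err_dir)) {
            continue;
        }
        const fs::path target = entry.path().parent_path() / m[1].str();
        if (fs::exists(target)) {
            std::cerr << "skip " << name << ": " << target << " already exists\n";
            continue;
        }
        std::cout << "restoring " << name << " -> " << target << "\n";
        fs::rename(entry.path(), target);
    }
    return 0;
}

Stopping the replica server and verifying the directory contents before renaming would still be up to the operator.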
acelyc111 added the type/bug label on Jul 31, 2023
acelyc111 (Member, Author) commented:

The slog is no longer written since 2.4, but it is still replayed in 2.4; if the slog is corrupted for some reason, the issue can still be reproduced.

acelyc111 added a commit that referenced this issue on Jul 31, 2023 (#1572):

If logs are emitted at FATAL level, the process should exit.
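
The intent can be sketched as a fatal-level log that terminates the process instead of letting initialization continue. LOG_FATAL below is a hypothetical macro for illustration, not the actual Pegasus/rDSN logging API:

// Sketch only: a fatal-level log should not return.
#include <cstdio>
#include <cstdlib>

#define LOG_FATAL(msg)                                           \
    do {                                                         \
        std::fprintf(stderr, "FATAL: %s\n", (msg));              \
        std::abort(); /* do not continue after a fatal error */  \
    } while (0)

// e.g. on slog replay failure, instead of clearing logs and moving
// replica directories aside:
//   LOG_FATAL("replay shared log failed: ERR_INVALID_DATA");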
acelyc111 added a commit that referenced this issue on Jul 31, 2023 (#1572):

Add an option that makes it possible to exit the process and leave the corrupted slog and replicas to be handled by the administrator when opening the slog fails.
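
A rough sketch of what such an option could look like; the flag name, struct, and function below are assumptions for illustration, not the actual patch:

// Sketch only: a hypothetical switch controlling what happens when the
// shared log (slog) fails to open or replay.
#include <cstdio>
#include <cstdlib>

struct replica_stub_options {
    // When true, exit the process on slog failure so the administrator
    // can inspect the corrupted slog and replicas in place; when false,
    // keep the old behavior of clearing logs and renaming replica
    // directories to "*.err" (the behavior reported in this issue).
    bool abort_on_shared_log_failure = true;
};

void on_shared_log_open_failed(const replica_stub_options &opts) {
    if (opts.abort_on_shared_log_failure) {
        std::fprintf(stderr, "open/replay shared log failed, aborting for manual recovery\n");
        std::abort();
    }
    // Otherwise fall through to the old behavior: clear all logs and
    // move replica directories aside.
}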