
All replicas are removed #1572

Closed
acelyc111 opened this issue Jul 31, 2023 · 1 comment
Labels
type/bug This issue reports a bug.

Comments


acelyc111 commented Jul 31, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    The slogs are corrupted on more than 3 replica servers in a standard Pegasus cluster.
    (If the slog is corrupted in a single-node cluster, it causes the same issue.)

The slog corruption was caused by an earlier crash of the replica server, which had hit a large number of EAGAIN errors from io_submit(); that is a separate issue we will handle (io_submit() has been replaced by pwrite() since 2.2).

  2. What did you expect to see?
    The replicas are kept in their normal directories, so we can recover the table manually.

  3. What did you see instead?
    All replicas of some partitions are moved to "<app_id>.<partition_id>.pegasus.<timestamp>.err", so the cluster is not able to recover automatically; we have to find and move the replicas back one by one (see the restore sketch after this list).

E2023-07-29 03:17:32.797 (1690571852797613997 1eb83) replica.default0.0000eb5600010001: mutation_log.cpp:2048:read_next_log_block(): invalid data header magic: 0x0
D2023-07-29 03:17:32.797 (1690571852797631915 1eb83) replica.default0.0000eb5600010001: mutation_log_replay.cpp:41:replay(): finish to replay mutation log (/data3/sa_cluster/skv_offline/replica/slog/log.139.18306729623) [err: ERR_INVALID_DATA: failed to read log block]
E2023-07-29 03:17:32.797 (1690571852797648461 1eb83) replica.default0.0000eb5600010001: mutation_log_replay.cpp:189:replay(): replay mutation log failed: ERR_INVALID_DATA
E2023-07-29 03:17:32.798 (1690571852798053578 1eb83) replica.default0.0000eb5600010001: replica_stub.cpp:505:initialize(): replay shared log failed, err = ERR_INVALID_DATA, time_used = 618 ms, clear all logs ...
E2023-07-29 03:17:32.803 (1690571852803234241 1eb83) replica.default0.0000eb5600010001: pegasus_server_impl.cpp:2777:flush_all_family_columns(): [11.203@x.x.x.x:8171] flush failed, error = Shutdown in progress:
D2023-07-29 03:17:32.804 (1690571852804425084 1eb83) replica.default0.0000eb5600010001: pegasus_server_impl.cpp:1638:stop(): 11.203@x.x.x.x:8171: close app succeed, clear_state = false
D2023-07-29 03:17:32.804 (1690571852804709002 1eb83) replica.default0.0000eb5600010001: replica.cpp:419:close(): 11.203@x.x.x.x:8171: replica closed, time_used = 6ms
W2023-07-29 03:17:32.804 (1690571852804765679 1eb83) replica.default0.0000eb5600010001: replica_stub.cpp:524:initialize(): init_replica: {replica_dir_op} succeed to move directory '/data1/sa_cluster/skv_offline/replica/reps/11.203.pegasus' to '/data1/sa_cluster/skv_offline/replica/reps/11.203.pegasus.1690571852804715.err'
  4. What version of Pegasus are you using?
    2.0
    (2.4 has the same issue)
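
For reference, a minimal C++17 sketch of how the ".err" directories could be moved back in bulk. This is not part of Pegasus; the suffix pattern is inferred from the replica_stub log line above, and the program only strips the trailing ".<timestamp>.err" suffix:

// restore_err_dirs.cpp -- sketch only, NOT part of Pegasus.
// Assumes the layout seen in the log above: a replica directory such as
// "11.203.pegasus" was renamed to "11.203.pegasus.1690571852804715.err".
#include <filesystem>
#include <iostream>
#include <regex>
#include <string>

namespace fs = std::filesystem;

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <reps_dir>\n";
        return 1;
    }
    // Matches "<app_id>.<partition_id>.pegasus.<timestamp>.err".
    const std::regex err_dir(R"((\d+\.\d+\.pegasus)\.\d+\.err)");
    for (const auto &entry : fs::directory_iterator(argv[1])) {
        const std::string name = entry.path().filename().string();
        std::smatch m;
        if (!entry.is_directory() || !std::regex_match(name, m, err_dir)) {
            continue;
        }
        const fs::path target = entry.path().parent_path() / m[1].str();
        if (fs::exists(target)) {
            std::cerr << "skip " << name << ": " << target << " already exists\n";
            continue;
        }
        std::cout << "restoring " << name << " -> " << target << "\n";
        fs::rename(entry.path(), target);
    }
    return 0;
}

Stopping the replica server and verifying the directory contents before renaming would still be up to the operator.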
acelyc111 added the type/bug label on Jul 31, 2023
acelyc111 (Member, Author) commented:

The slog is no longer written since 2.4, but it is still replayed in 2.4; if the slog is corrupted for some reason, the issue can still be reproduced.

acelyc111 added a commit that referenced this issue on Jul 31, 2023 (#1572):

If logs are emitted at FATAL level, the process should exit.
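
The intent can be sketched as a fatal-level log that terminates the process instead of letting initialization continue. LOG_FATAL below is a hypothetical macro for illustration, not the actual Pegasus/rDSN logging API:

// Sketch only: a fatal-level log should not return.
#include <cstdio>
#include <cstdlib>

#define LOG_FATAL(msg)                                           \
    do {                                                         \
        std::fprintf(stderr, "FATAL: %s\n", (msg));              \
        std::abort(); /* do not continue after a fatal error */  \
    } while (0)

// e.g. on slog replay failure, instead of clearing logs and moving
// replica directories aside:
//   LOG_FATAL("replay shared log failed: ERR_INVALID_DATA");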
acelyc111 added a commit that referenced this issue on Jul 31, 2023 (#1572):

Add an option that makes it possible to exit the process and leave the corrupted slog and replicas to be handled by the administrator when opening the slog fails.
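
A rough sketch of what such an option could look like; the flag name, struct, and function below are assumptions for illustration, not the actual patch:

// Sketch only: a hypothetical switch controlling what happens when the
// shared log (slog) fails to open or replay.
#include <cstdio>
#include <cstdlib>

struct replica_stub_options {
    // When true, exit the process on slog failure so the administrator
    // can inspect the corrupted slog and replicas in place; when false,
    // keep the old behavior of clearing logs and renaming replica
    // directories to "*.err" (the behavior reported in this issue).
    bool abort_on_shared_log_failure = true;
};

void on_shared_log_open_failed(const replica_stub_options &opts) {
    if (opts.abort_on_shared_log_failure) {
        std::fprintf(stderr, "open/replay shared log failed, aborting for manual recovery\n");
        std::abort();
    }
    // Otherwise fall through to the old behavior: clear all logs and
    // move replica directories aside.
}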