Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.io.IOException in snapshot operation #857

Closed
mareden opened this issue Jun 27, 2022 · 10 comments
Closed

java.io.IOException in snapshot operation #857

mareden opened this issue Jun 27, 2022 · 10 comments

Comments

@mareden
Copy link

mareden commented Jun 27, 2022

Describe the bug

We're using SOFAJRaft in one of our project. Nowadays, sometimes it's found that IOException is raised when snapshot is generated/saved:

2022-05-31 12:12:14,891 JRaft-Closure-Executor-14 WARN  [Utils:341] Unable to move atomically, falling back to non-atomic move, error: data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta.tmp -> data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta.
2022-05-31 12:12:14,891 JRaft-Closure-Executor-14 INFO  [Utils:345] The target file data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta was already existing.
2022-05-31 12:12:14,892 JRaft-Closure-Executor-14 WARN  [Utils:352] Unable to move data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta.tmp to data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta. Attempting to delete data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta.tmp and abandoning.
2022-05-31 12:12:14,892 JRaft-Closure-Executor-14 ERROR [LocalSnapshotStorage:212] Fail to sync writer data/raft/data_snapshot/snapshot_0530/temp.
2022-05-31 12:12:14,892 JRaft-Closure-Executor-14 INFO  [LocalSnapshotStorage:167] Deleting snapshot data/raft/data_snapshot/snapshot_0530/temp.
2022-05-31 12:12:14,896 JRaft-Closure-Executor-14 ERROR [SnapshotExecutorImpl:398] Fail to close writer
java.io.IOException
        at com.alipay.sofa.jraft.storage.snapshot.local.LocalSnapshotStorage.close(LocalSnapshotStorage.java:251)
        at com.alipay.sofa.jraft.storage.snapshot.local.LocalSnapshotWriter.close(LocalSnapshotWriter.java:98)
        at com.alipay.sofa.jraft.storage.snapshot.local.LocalSnapshotWriter.close(LocalSnapshotWriter.java:93)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:396)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:145)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:141)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2022-05-31 12:12:14,898 JRaft-FSMCaller-Disruptor-0 ERROR [StateMachineAdapter:72] Encountered an error=Status[EIO<1014>: Fail to save snapshot.] on StateMachine com.server.handlers.StateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save snapshot.]]
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:691)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:414)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:145)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:141)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2022-05-31 12:12:14,899 JRaft-FSMCaller-Disruptor-0 WARN  [NodeImpl:2471] Node <G_snapshot_0530_P_1/192.168.109.52:13001> got error: {}.
Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save snapshot.]]
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:691)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:414)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:145)
        at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:141)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Previously, a 'question' was submitted in gibhub(see: #820), but till now it seems there is no further following-up. So I'm submitting this bug to track this issue.

Expected behavior

It's expected no exception is raised.

Actual behavior

Exception as mentioned before is seen.

Steps to reproduce

  1. introduced code for snapshot saving/loading inside state machine class
  2. run SOFAJRaft in it's way
  3. most of the cases, the code runs well, but in a few cases the exception mentioned above is seen.
    Note that:
  4. the exception seems from SOFAJRaft internal code, not from onSnapshotSave/onSnapshotLoad, so it seems sofajraft snapshot logic encounters some issue. Previously, I've tried to upgrade SOFAJRaft to its latest version, which introduced more logging messages (as suggtested How to know the possible reason of snapshot shot saving file operation failure? #820 (comment)), but the same issue happened again.
  5. recently @killme2008 mentioned that he expects more details log messages, but unfortunately the pasted log messages are all what I can get (How to know the possible reason of snapshot shot saving file operation failure? #820 (comment))

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

  • SOFAJRaft version:1.3.10.bugfix_2
  • JVM version (e.g. java -version):1.8.0_181-b13
  • OS version (e.g. uname -a):Linux 173604 3.10.0-957.el7.x86_64
  • Maven version:
  • IDE version:

Reference: related discussion thread: #820

@killme2008
Copy link
Contributor

killme2008 commented Jun 28, 2022

Are there any other processes will operates (move, rename etc) on jraft snapshot files/folders ?

It looks like temp folder /data/raft/data_snapshot/snapshot_0530/temp/ for snapshot is deleted unexpectedly.

It looks like the snapshot meta file is already exists when try to moving the temp snapshot meta file to the target:

The target file data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta was already existing.

Then try to replace the file, but it fails too

Unable to move data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta.tmp to data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta. Attempting to delete data/raft/data_snapshot/snapshot_0530/temp/__raft_snapshot_meta.tmp and abandoning.

I note that your kernel version is so old, what's the file system are u using?

@killme2008
Copy link
Contributor

Please provide your onSnapshotSave implementation.

@killme2008
Copy link
Contributor

I added more logs in #858, you can use this branch if you can.

@mareden
Copy link
Author

mareden commented Jun 30, 2022

@killme2008 thank you a lot for digging into the issue.

Are there any other processes will operates (move, rename etc) on jraft snapshot files/folders ?
No, there's no other processes manipulating the files in parallel

It looks like the snapshot meta file is already exists when try to moving the temp snapshot meta file to the target
Thanks for this information. This sounds reasonable. I've checked our code, and it's found that the snapshot interval is set to 60s. A longer snapshot interval will be set and we'll see if the same issue exist or not.

......what's the file system are u using?
It's ext4

@mareden
Copy link
Author

mareden commented Jun 30, 2022

Please provide your onSnapshotSave implementation.
I don't know if I'm allowed to paste project code here. I tried to paste part of code of 'onSnapshotSave' here:

try {
    ……
    Path dataFilePath = dataFile.toPath().toAbsolutePath();
    Path dataSnapshotPath = snapshotPath.resolve(dataFileName);
    try {
        Files.createSymbolicLink(dataSnapshotPath, dataFilePath);
        writer.addFile(dataFileName);
    } catch (IOException e) {
        LOGGER.error(e.getLocalizedMessage(), e);
    }
    ……
    done.run(Status.OK());
} catch (Throwable t) {
    LOGGER.error(t.getLocalizedMessage(), t);
}

The idea is: create symbolic link of original data file as 'snapshot' data, thus the snapshot generation step is fast enough, and snapshot file copying will work well still as SOFAJRaft will visite original file through the symbolic link file.

@mareden
Copy link
Author

mareden commented Jun 30, 2022

I added more logs in #858, you can use this branch if you can.
I don't know how to get jar files from your new branch. Is there any instruction on jar file packaging? Thanks.
@killme2008

@killme2008
Copy link
Contributor

I added more logs in #858, you can use this branch if you can.
I don't know how to get jar files from your new branch. Is there any instruction on jar file packaging? Thanks.
@killme2008

You can package the jar by mvn -Dmaven.test.skip=true package

@yuyang0423
Copy link
Contributor

"in a few cases the exception mentioned above is seen"
===>> Can you describe the detailed steps of those failed cases ??

@mareden
Copy link
Author

mareden commented Jul 18, 2022

"in a few cases the exception mentioned above is seen"
===>> Can you describe the detailed steps of those failed cases ??

@yuyang0423, this issue is a sporadic issue, and there is no determined steps to reproduce the issue. I'm expecting I could've been able to provide detailed step to reproduce the issue, but unfortunately there are no steps to ensure the issue is always reproduced as expected. Till now, what we've monitored are:

  1. the issue was monitored for few times when a short snapshot interval is set
  2. the issue was monitored for few times when there're multiple files putting into a snapshot

@yuyang0423
Copy link
Contributor

"in a few cases the exception mentioned above is seen"
===>> Can you describe the detailed steps of those failed cases ??

@yuyang0423, this issue is a sporadic issue, and there is no determined steps to reproduce the issue. I'm expecting I could've been able to provide detailed step to reproduce the issue, but unfortunately there are no steps to ensure the issue is always reproduced as expected. Till now, what we've monitored are:

  1. the issue was monitored for few times when a short snapshot interval is set
  2. the issue was monitored for few times when there're multiple files putting into a snapshot

If you have your own onSnapshotSave code, highly recommend to make sure tmp directory and meta is sync to disk.
probably same to this pr #745

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants