Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] CorruptedFileIT.testCorruptFileThenSnapshotAndRestore failure #30577

Closed
romseygeek opened this issue May 14, 2018 · 10 comments · Fixed by #30778
Closed

[CI] CorruptedFileIT.testCorruptFileThenSnapshotAndRestore failure #30577

romseygeek opened this issue May 14, 2018 · 10 comments · Fixed by #30778
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0-beta1

Comments

@romseygeek
Copy link
Contributor

romseygeek commented May 14, 2018

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=sles/2435/console

After a file is corrupted, snapshot should throw an error, but it seems to succeed instead. Doesn't reproduce locally though.

consoleText.txt

@romseygeek romseygeek added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0 labels May 14, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@tlrx
Copy link
Member

tlrx commented May 17, 2018

This test creates an index with 0 replica, merges disabled and a very high value for the flush translog size setting. Then it corrupts a file in a random primary shard (in this failure, it is _2.si), registers a repository and creates a snapshot.

When creating the snapshot, the test expects that the shard snapshot process loads the store metadata and then fails because one of the Lucene files is corrupted, failing the shard snapshot and marking the snapshot as PARTIAL.

In this test failure the snapshot completed as SUCCESS. I looked closely at it and I can't figure why it didn't fail when loading the store metadata. Of course it does not reproduce locally. The test correclty corrupted the file:

[INFO ][test                     ] Corrupting file --  flipping at position 68 from 0 to 1 file: _2.si
[INFO ][test                     ] Checksum before: [2483589937] after: [1016601414] checksum value after corruption: 2483589937] file: _2.si length: 387

I suspect a test bug, or maybe that the file was not part of the snapshot but it should have been. Merges are disabled and flushes are manually executed before corrupting the file.

I pushed 7915b5f to add more debug information, hopefully this error will appear again and we'll be able to grab the shard files and the snapshotted files.

@tlrx tlrx self-assigned this May 17, 2018
tlrx added a commit that referenced this issue May 17, 2018
This test failed but the cause is not obvious. This commit adds more
debug logging traces so that if it reproduces we could gather more
information.

Related #30577
@ywelsch
Copy link
Contributor

ywelsch commented May 17, 2018

@dnhatn With the changes that were made to CombinedDeletionPolicy, do you perhaps know if we can expect all unreferenced Lucene files to be cleaned up after a flush?

@dnhatn
Copy link
Member

dnhatn commented May 18, 2018

@ywelsch I have looked at the test.
If the latest value of the global checkpoint has not fsynced before a flush, we will have two commits; then if we corrupt some files which are referenced only by the previous commit, the snapshot of the latest commit will be SUCCESS instead of PARTIAL.

I will follow-up this test with @tlrx.

@ywelsch
Copy link
Contributor

ywelsch commented May 18, 2018

awesome!

@tlrx
Copy link
Member

tlrx commented May 18, 2018

Good catch @ywelsch and @dnhatn. But what could make a file referenced by a commit point and not by the next one, when merges are disabled and the shard manually flushed?

@dnhatn
Copy link
Member

dnhatn commented May 18, 2018

@tlrx You're right. My assumption is not correct in this case.

dnhatn added a commit that referenced this issue May 19, 2018
dnhatn added a commit that referenced this issue May 19, 2018
@dnhatn
Copy link
Member

dnhatn commented May 19, 2018

@tlrx I will be taking care of this.

@dnhatn dnhatn self-assigned this May 19, 2018
@dnhatn
Copy link
Member

dnhatn commented May 20, 2018

@tlrx and @ywelsch I've reproduced and have an explanation for this. This is possibly caused by LUCENE-8253

  1. indexRandom(true, builders) not only add docs but also index/delete bogus docs.
[test][4] Index [fEPBemMBIyPePwGlvrH9]
 [test][4] Index [bogus_doc_ीनठफड़जऊa44]
[test][4] Index [oUPBemMBIyPePwGlvrH-]
[test][4] Delete [bogus_doc_ीनठफड़जऊa44]
  1. With a special seed (A9D224BA3576B58B:E8899F47DB14C3CA), indexRandom might create a fully deleted segment in which all documents are deleted.
Segment _2(7.4.0):c1, files [_2.si, _2.cfe, _2.cfs], maxDocs=[1], numDocs=[0], delDocs=[1], seq#=[6] 

Previously, a fully deleted segment is kept around until the next commit, however since LUCENE-8253 it is dropped immediately. The problem is that its files, which are not referenced by any commit point, are not released but retained in IndexFileDeleter's lastFiles.

IFD's last files:  [_0.si, _3.si, _1.cfe, _1.si, _3_1.liv, _2.si, 
_0.cfs, _2.cfs, _3.cfe, _1.cfs, _0.cfe, _2.cfe, _3.cfs, _2_1.liv]
  1. We then corrupt _2.cfe which belongs to the dropped segment.
Corrupting file --  flipping at position 197 from 0 to 1 file: _2.cfe
  1. Taking snapshot should be ok because the last commit does not includes the corrupted file.
Loading store metadata using SnapshotIndexCommit{segments_3},
files: [_1.cfs, _0.cfe, _0.si, _3.si, _1.cfe, _1.si, _3_1.liv, _0.cfs, _3.cfs, _3.cfe, segments_3]

I opened LUCENE-8324. If @s1monw agrees to fix this in Lucene, we are good; otherwise we need to update the test.

@tlrx
Copy link
Member

tlrx commented May 22, 2018

Thanks @dnhatn. I suspected something like that but I did not look deeply enough.

dnhatn added a commit that referenced this issue May 22, 2018
The new snapshot includes LUCENE-8324 which fixes missing checkpoint
after a fully deletes segment is dropped on flush. This snapshot should
resolves failed tests in the CorruptedFileIT suite.

Closes #30741
Closes #30577
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this issue May 23, 2018
This test failed but the cause is not obvious. This commit adds more
debug logging traces so that if it reproduces we could gather more
information.

Related elastic#30577
dnhatn added a commit that referenced this issue May 29, 2018
The new snapshot includes LUCENE-8324 which fixes missing checkpoint
after a fully deletes segment is dropped on flush. This snapshot should
resolves failed tests in the CorruptedFileIT suite.

Closes #30741
Closes #30577
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0-beta1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants