Refactor beats lockfile to use timeout, retry #34194

Merged
fearful-symmetry merged 15 commits into elastic:main on Jan 31, 2023

fearful-symmetry
Contributor

What does this PR do?

This PR substantially refactors the lockfile to remove the "PID check" system and instead retry the underlying lock operation. This is mainly to deal with an (apparently somewhat common) edge case in which a beat shuts down improperly, leaving the old lockfile around, and the container environment restarts with a new PID namespace, allowing a collision between the PID written in the lockfile and another running process.
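
To illustrate why a PID stored in a lockfile is ambiguous, here is a minimal POSIX-only sketch. It is not the check the old Beats code performed; `pidLooksAlive` and `lockfileLooksOwned` are hypothetical names used only for this illustration. A liveness probe can tell that *some* process has the recorded PID, but not whether it is the beat that wrote the lockfile, which is exactly what goes wrong after a restart in a fresh PID namespace.

```go
package main

import (
	"errors"
	"syscall"
)

// pidLooksAlive reports whether some process currently has the given PID.
// Signal 0 runs the existence and permission checks without delivering a signal.
// Hypothetical helper for this sketch; POSIX only.
func pidLooksAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but we are not allowed to signal it.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// lockfileLooksOwned shows the flawed decision: any live process that
// happens to have recordedPID makes a stale lockfile look owned.
func lockfileLooksOwned(recordedPID int) bool {
	return pidLooksAlive(recordedPID)
}
```

For example, a stale lockfile that recorded PID 1 would look owned in practically any container, because PID 1 always exists there.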

This change puts the lockfile logic back in the hands of the OS; we try to obtain a lock, and if we can't, we retry a set number of times. In a case where a beat has shut down improperly and the lockfile remains, instead of looking up a PID, we rely on the OS to release the underlying lock for the dead process, which most OSes will generally do after a set amount of time.
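
A minimal sketch of the retry approach described above, under stated assumptions: the lock attempt is passed in as a function so the loop stays independent of any particular locking library, and the names `acquireWithRetry` and `errAlreadyLocked` are hypothetical rather than taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errAlreadyLocked is a hypothetical sentinel error for this sketch.
var errAlreadyLocked = errors.New("lock still held after all retries")

// acquireWithRetry calls tryLock up to retryCount times and sleeps for
// retrySleep between failed attempts. tryLock returns true once the OS
// has granted the lock.
func acquireWithRetry(tryLock func() (bool, error), retryCount int, retrySleep time.Duration) error {
	for i := 0; i < retryCount; i++ {
		gotLock, err := tryLock()
		if err != nil {
			return fmt.Errorf("unable to try lock: %w", err)
		}
		if gotLock {
			return nil
		}
		// Do not sleep after the final attempt.
		if i < retryCount-1 {
			time.Sleep(retrySleep)
		}
	}
	return errAlreadyLocked
}
```

The retry window is what gives the OS time to release a lock still attributed to a dead process; how long that takes in practice depends on the platform.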

I still need to test this by hand on Windows and Darwin.

Why is it important?

This is meant to deal with a few edge cases in how PID handling works.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added the bug and Team:Elastic-Agent (Label for the Agent team) labels Jan 5, 2023
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner January 5, 2023 23:25
@fearful-symmetry fearful-symmetry self-assigned this Jan 5, 2023
@fearful-symmetry fearful-symmetry requested review from faec and leehinman and removed request for a team January 5, 2023 23:25
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot added and then removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Jan 5, 2023
@mergify
Contributor

mergify bot commented Jan 5, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Collaborator

elasticmachine commented Jan 6, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-01-30T21:26:44.826+0000

  • Duration: 65 min 7 sec

Test stats 🧪

Test Results
Failed 0
Passed 25311
Skipped 1962
Total 27273

💚 Flaky test report

Tests succeeded.

if openErr != nil {
err = lock.handleFailedCreate()
for i := 0; i < lock.retryCount; i++ {
gotLock, err := lock.fileLock.TryLock()
Contributor

I don't think TryLock uses the os.O_EXCL flag. That means the file could exist already, and I think that would lead to a race condition in the Unlock function with Unlock & Remove.
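
For readers unfamiliar with the flag mentioned here, this is a minimal sketch of what `os.O_EXCL` changes when combined with `os.O_CREATE`: the open fails if the path already exists instead of silently reusing the file. The helper names are hypothetical and are not part of the PR or of any locking library.

```go
package main

import (
	"errors"
	"os"
)

// createExclusive creates path and fails if it already exists.
func createExclusive(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_RDWR, 0o600)
}

// alreadyExists reports whether err means the path was already present.
func alreadyExists(err error) bool {
	return errors.Is(err, os.ErrExist)
}

// closeAndRemove closes the handle and deletes the file (cleanup helper).
func closeAndRemove(f *os.File) error {
	if err := f.Close(); err != nil {
		return err
	}
	return os.Remove(f.Name())
}
```

Without `O_EXCL`, a second opener gets a handle to the same existing file, which is why the lock and the file's existence can drift apart.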

Contributor Author

Could you elaborate? I assume you mean another beat swooping in while one beat is trying to lock or remove the file?

Contributor

In the unlock code, it first unlocks, then does a remove. In between those lines of code another beat could create a new lock, but the file would be removed. This results in the new beat having an error if it goes to unlock because the lock file doesn't exist.

panic: unable to unlock data path file testing.lock: remove testing.lock: no such file or directory

Contributor Author

hmm, that's an interesting edge case. Gonna see if I can think of a non-awkward way to protect against that.

Contributor Author

Alright, made a change to remove the file first before we remove the lock. Going to see how the Windows CI reacts to that, but I imagine we'll want some manual testing, since I don't understand the Windows lockfile logic too well.
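
A minimal POSIX-only sketch of the "remove first, then unlock" ordering mentioned above. It uses `syscall.Flock` directly rather than the lock library used in the PR, and `heldLock`, `lockPath`, and `release` are hypothetical names. The ordering means the path is deleted while the lock is still held, so a file another process has just created and locked is not removed from under it; this is not a claim that every interleaving is safe.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// heldLock is a hypothetical type for this sketch, not the type used in the PR.
type heldLock struct {
	file *os.File
}

// lockPath opens path (creating it if needed) and takes a non-blocking
// exclusive flock on it. POSIX only: syscall.Flock is unavailable on Windows.
func lockPath(path string) (*heldLock, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, fmt.Errorf("unable to open %s: %w", path, err)
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("unable to lock %s: %w", path, err)
	}
	return &heldLock{file: f}, nil
}

// release removes the file while the lock is still held, then drops the
// lock by closing the descriptor.
func (l *heldLock) release() error {
	if err := os.Remove(l.file.Name()); err != nil {
		return fmt.Errorf("unable to remove %s: %w", l.file.Name(), err)
	}
	return l.file.Close()
}
```

On POSIX systems, removing a file that is still open is allowed, which is what makes this ordering possible there.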

Member

Can we put some or all of the detail from the PR description directly in the description of the Lock function? For example adding this would help the next developer to understand how this works.

In a case where a beat has shut down improperly and the lockfile remains, instead of looking up a PID, we rely on the OS to release the underlying lock for the dead process, which most OSes will generally do after a set amount of time.

It may also be worth noting that putting the PID into the lock file failed. To some degree we have had several iterations on this code because the original code did not explain itself at all, so let's try to avoid creating that problem again.

@cmacknz cmacknz added the backport-v8.6.0 (Automated backport with mergify) label Jan 9, 2023
@cmacknz
Member

cmacknz commented Jan 9, 2023

Needs a changelog entry 📓

@anmironov

Hi Team, could you please clarify whether this will ship in a new release or be backported to 8.6.0? At the moment the issue (Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid....) persists in 8.6.0.

@cmacknz
Member

cmacknz commented Jan 11, 2023

Hi Team, could you please clarify whether this will ship in a new release or be backported to 8.6.0?

Yes, the plan is to put this in 8.6.1 as well as 8.7.0.

@fearful-symmetry
Contributor Author

Alright, tested manually on Linux, seems fine.

Contributor

@leehinman leehinman left a comment

unfortunately on Windows we get this panic on unlock

panic: unable to remove data path file testing.lock: remove testing.lock: The process cannot access the file because it is being used by another process.

@mergify
Contributor

mergify bot commented Jan 23, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lockfile-with-timeout upstream/lockfile-with-timeout
git merge upstream/main
git push upstream lockfile-with-timeout

@fearful-symmetry
Contributor Author

Made a few changes based on a discussion with @leehinman, tested on Linux/Darwin

Contributor

@leehinman leehinman left a comment

tried it on Windows, doesn't work. Lock followed by Unlock gives: panic: unable to remove data path file testing.lock: remove testing.lock: The process cannot access the file because it is being used by another process.

@leehinman
Contributor

tried it on Windows, doesn't work. Lock followed by Unlock gives: panic: unable to remove data path file testing.lock: remove testing.lock: The process cannot access the file because it is being used by another process.

Never mind, it worked on Windows. I screwed up the build and got the wrong version of the PR.

libbeat/cmd/instance/locks/lock.go (outdated)
lock.logger.Debugf("%s shut down without removing previous lockfile and is currently in a zombie state, continuing", lock.beatName)
return lock.recoverLockfile()
// now unlock on windows.
if runtime.GOOS == "windows" {
Contributor

could we split this into 3 files? lock.go, unlock_posix.go, unlock_windows.go. Then we can have the Unlock function in the OS specific file, and use build tags to only compile the "right" one. The big benefit is that the doc strings can be specific to the OS since we are switching behavior based on that and it will be easier to understand later on.

Contributor Author

Ah, good idea.
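
A rough sketch of the two orderings the suggested split would separate. This is an assumption-laden illustration, not the merged code: the function names are hypothetical, and the remove and unlock steps are passed in as functions so both variants fit in one file. In the layout proposed above, each variant would instead be a function named `Unlock` in its own file (`unlock_posix.go` guarded by `//go:build !windows`, `unlock_windows.go` guarded by `//go:build windows`), so only one is compiled per platform.

```go
package main

import "fmt"

// unlockPosixOrder removes the lockfile while the lock is still held,
// then releases the lock. Hypothetical name for this sketch.
func unlockPosixOrder(remove, unlock func() error) error {
	if err := remove(); err != nil {
		return fmt.Errorf("unable to remove lockfile: %w", err)
	}
	if err := unlock(); err != nil {
		return fmt.Errorf("unable to unlock: %w", err)
	}
	return nil
}

// unlockWindowsOrder releases the lock first. The Windows error reported
// in this thread shows that removing the file while it is still locked fails.
// Hypothetical name for this sketch.
func unlockWindowsOrder(remove, unlock func() error) error {
	if err := unlock(); err != nil {
		return fmt.Errorf("unable to unlock: %w", err)
	}
	if err := remove(); err != nil {
		return fmt.Errorf("unable to remove lockfile: %w", err)
	}
	return nil
}
```

Keeping the two orderings in separate files lets each doc comment explain its own platform's constraint, which is the benefit named in the review comment.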

libbeat/cmd/instance/locks/lock.go (outdated)
}

// now unlock on windows.
if runtime.GOOS == "windows" {
Contributor

can we get rid of the runtime check? The build tags should mean this is the only "Unlock" implementation under Windows.

Contributor Author

Oh! Forgot to delete that...

Contributor

@leehinman leehinman left a comment

LGTM

@fearful-symmetry fearful-symmetry merged commit 21b6128 into elastic:main Jan 31, 2023
mergify bot pushed a commit that referenced this pull request Jan 31, 2023
* move lockfile logic to a retries

* clean up

* add changelog, update docs

* change unlock operation, remove file first

* fix tests

* fix lock on windows

* remove debug line

* add docs

* split out files

* remove old OS checks

* fix error

* format

(cherry picked from commit 21b6128)

# Conflicts:
#	libbeat/cmd/instance/locks/lock.go
@FranAguiar

Hi!!
When will this fix be released? Thanks

@cmacknz
Member

cmacknz commented Feb 6, 2023

This will be in 8.6.2, which is coming soon.

cmacknz added a commit that referenced this pull request Feb 7, 2023
…34435)

* Refactor beats lockfile to use timeout, retry (#34194)

* move lockfile logic to a retries

* clean up

* add changelog, update docs

* change unlock operation, remove file first

* fix tests

* fix lock on windows

* remove debug line

* add docs

* split out files

* remove old OS checks

* fix error

* format

(cherry picked from commit 21b6128)

# Conflicts:
#	libbeat/cmd/instance/locks/lock.go

* fix cherry pick

---------

Co-authored-by: Alex K <8418476+fearful-symmetry@users.noreply.github.com>
Co-authored-by: Alex Kristiansen <alex.kristiansen@elastic.co>
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@ariahi18

Hello!!
Any updates on the new 8.6.2 release, please?
Thank you.

@cmacknz
Member

cmacknz commented Feb 25, 2023

8.6.2 has been released. Unfortunately, this may not have completely fixed the problem in some circumstances.

@cmacknz
Member

cmacknz commented Feb 27, 2023

The report of this still happening on 8.6.2 was a false alarm related to misconfiguration.

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* move lockfile logic to a retries

* clean up

* add changelog, update docs

* change unlock operation, remove file first

* fix tests

* fix lock on windows

* remove debug line

* add docs

* split out files

* remove old OS checks

* fix error

* format

Labels: backport-v8.6.0 (Automated backport with mergify), bug, Team:Elastic-Agent (Label for the Agent team)