
Fix timing hole releasing safe point VM access #978

Merged
merged 1 commit into eclipse-openj9:master on Jan 18, 2018

Conversation

gacholio
Contributor

When releasing safe point VM access, the vmThreadListMutex must be
released after resetting the VM safePointState in order to prevent
newly-created threads from incorrectly setting the halted-at-safe-point
bit.

The vmThreadListMutex is held for the entirety of the safe point, and it
is also required while creating a new thread, so new threads will not be
created during the safe point. Releasing the vmThreadListMutex early
allows the thread creation code to proceed, where it examines the
safePointState and incorrectly sets the halt bit, which will never be
cleared, causing VM access acquisition to block forever.

Signed-off-by: Graham Chapman <graham_chapman@ca.ibm.com>

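For illustration, the corrected release ordering might look like the sketch below. This is a minimal sketch rather than the actual OpenJ9 change: the field names follow the description above, omrthread_monitor_exit is the standard OMR monitor helper, and SAFE_POINT_NONE is a hypothetical constant standing in for the reset state.

static void
releaseSafePointVMAccess(J9JavaVM *vm)
{
	/* Reset safePointState while vmThreadListMutex is still held. */
	vm->safePointState = SAFE_POINT_NONE; /* hypothetical constant */

	/* Only now release the thread list mutex. Thread creation acquires
	 * vmThreadListMutex and then examines safePointState; releasing the
	 * mutex before the reset lets a new thread observe a stale safe point
	 * state and set a halted-at-safe-point bit that is never cleared,
	 * blocking VM access acquisition forever.
	 */
	omrthread_monitor_exit(vm->vmThreadListMutex);
}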
@DanHeidinga
Member

LGTM

Lock order is correct. As GAC pointed out on Slack, the comment at the top of the file describes lock order:

/* Note the required ordering of monitors related to VM access:
 *
 * vm->threadListMutex
 *	before
 * Any number of vmThread->publicFlagsMutex (there is one per thread)
 *	before
 * vm->exclusiveAccessMutex
 *
 * It is not necessary to always acquire all of the monitors, but whenever more than
 * one monitor is acquired, the above ordering must be maintained.
 */
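For illustration, acquiring all three monitors in the documented order might look like the following sketch. This is an assumption-laden example, not code from the PR: targetThread is a hypothetical J9VMThread, the OMR monitor helpers are used directly, and the thread list mutex is written as vmThreadListMutex to match the PR description.

omrthread_monitor_enter(vm->vmThreadListMutex);          /* first */
omrthread_monitor_enter(targetThread->publicFlagsMutex); /* second */
omrthread_monitor_enter(vm->exclusiveAccessMutex);       /* third */
/* ... work requiring all three monitors ... */
omrthread_monitor_exit(vm->exclusiveAccessMutex);
omrthread_monitor_exit(targetThread->publicFlagsMutex);
omrthread_monitor_exit(vm->vmThreadListMutex);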

@DanHeidinga
Member

Jenkins test sanity plinux

@DanHeidinga
Member

DanHeidinga commented Jan 18, 2018

It looks like the PR build passed:

'Build finished. 1013 tests run, 351 skipped, 0 failed.'

But I'll give it one more chance to actually become a checkmark...

Jenkins test sanity plinux

Edit: Looks like this is the #968 issue cropping up again.

@pshipton
Member

pshipton commented Jan 18, 2018

It looks like the PR build passed

The thing to note is that there should be ~21152 tests run in the PR build. The previous test run aborted early and didn't complete. It's unfortunate that message shows up; I've opened #981.

@DanHeidinga DanHeidinga merged commit 3c8288e into eclipse-openj9:master Jan 18, 2018