bug: process start time mismatch #1393
Comments
SELinux is set to Enforcing, but I'm not seeing any EPERM or anything related in the audit logs.
I doubt it is an SELinux issue.
@rhatdan ya, me too. Was just adding it as a data point.
It looks like the CentOS kernel is missing this kernel patch: torvalds/linux@266b7a0. Basically when a multithreaded process does an
I'll submit my changes to make the error messages returned from runc more explicit.
My worry with this patch is that it converts our silent PID reuse detection into an error condition, which means that if you have a lot of process churn and containers being started and stopped, you'll start getting errors in cases where runC would be able to take care of it. Maybe we should just log to stderr rather than actually bail in this case?
@cyphar it's always been an error condition, hence why it was reported. How can we take care of it? The only alternative is to remove the check and just perform the actions.
@crosbymichael But it's not an error condition (unless things are broken as we've seen in this issue). We check the start time to avoid PID recycling attacks: if a container's init dies and a new process starts with the same PID, the next runC command should still update the state correctly, which requires us to make sure that the start time is the same. But if we make a different start time an error in To be honest, I'm not really sure how we can deal with kernels where the start time is wrong, but since it's a kernel bug that was fixed in 2013. Maybe we could come up with another way of disambiguating processes but the kernel actually makes a guarantee that
I agree with @cyphar that this is not an error condition when we detect that the PID has been recycled. We will just mark the container state as stopped and continue in that case. The error here comes from the fact that the kernel is buggy. Either detecting that the system is running a buggy kernel so we can bail early, or finding another way to disambiguate processes, would be nice, but so far I don't have any good idea how to achieve either of them in runc.
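For anyone following along, here is a rough sketch of the kind of start-time check being discussed. It is not runc's actual code: the function names `parseStartTime` and `sameProcess` are made up for illustration. It assumes only the documented layout of `/proc/<pid>/stat`, where field 22 (`starttime`) is the process start time in clock ticks since boot and field 2 (`comm`) is wrapped in parentheses and may itself contain spaces.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStartTime (hypothetical helper) extracts field 22 (starttime) from
// the contents of /proc/<pid>/stat.
func parseStartTime(stat string) (uint64, error) {
	// comm (field 2) may contain spaces or parentheses, so split the line
	// after the last ')' rather than on whitespace from the start.
	i := strings.LastIndex(stat, ")")
	if i < 0 {
		return 0, fmt.Errorf("malformed stat line: no closing parenthesis")
	}
	fields := strings.Fields(stat[i+1:])
	// fields[0] is field 3 (state), so field 22 is fields[19].
	if len(fields) < 20 {
		return 0, fmt.Errorf("malformed stat line: only %d fields after comm", len(fields))
	}
	return strconv.ParseUint(fields[19], 10, 64)
}

// sameProcess (hypothetical helper) reports whether the start time recorded
// when the container was created still matches the current stat contents.
// A mismatch is what the thread describes as PID reuse (or a buggy kernel).
func sameProcess(recorded uint64, stat string) (bool, error) {
	current, err := parseStartTime(stat)
	if err != nil {
		return false, err
	}
	return current == recorded, nil
}

func main() {
	// Linux only: read our own stat file as a demonstration.
	data, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("could not read /proc/self/stat:", err)
		return
	}
	start, err := parseStartTime(string(data))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	ok, _ := sameProcess(start, string(data))
	fmt.Println("start time:", start, "same process:", ok)
}
```

Under this sketch, a `false` result from `sameProcess` would be the point where the caller decides between marking the container as stopped and returning an error, which is exactly the disagreement in this thread.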
@dqminh The problem is that when we mark the container as stopped, you cannot do certain operations, and that is what fails. You cannot exec a stopped container, etc.
People should upgrade their kernels IMO -- this is a buggy kernel issue that was fixed several years ago.
My proposal in #1224 is another possible way of doing it, but it has the downside of meaning you have to futz with the host's mount table. And it will break rootless. IMO the "right way" is the way we're doing it now -- unfortunately there was a kernel bug in the past with this technique, but it was fixed a while ago.
What version of the CentOS kernel were you testing? It appears that the kernel fix referred to in #1393 (comment) is in the 3.10.0-514.16.1 version released as part of CentOS 7.3.1611.
@evantorrie At the time when I wrote that comment I was checking the latest sources from CentOS. I talked to some Red Hat folks and they patched their kernel, so it should work now (AFAIK). Closing.
This was reported in a Docker issue, and I was able to have another user reproduce the error.
moby/moby#29794
I added some code at the exact place that is causing this, and confirmed that on a CentOS system the start time that runc records does not match the one returned by additional stat calls.
The error that we are seeing with my patch is:
exec failed: process start time and recorded start time do not match
How is this possible? Why centos? Why!
uname -a
Linux ip-10-0-0-86 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
I've been trying to debug this but have no clue what is going on. Does anyone else have any ideas how this mismatch can be possible? It looks like the same type of error that was fixed in #1136, but I have confirmed we are running a build that contains that commit.