IllegalStateException in SingularityMesosScheduler.statusUpdate #599
Comments
The patch I wrote fixed the issue, and my Singularity isn't stuck in an endless start-fail-abort loop now.
Thanks for the PR, @chriskite! Would you mind giving a little more detail about the steps that led up to Singularity getting into its death spiral? (e.g. what were you intending to run on Singularity, were you manually adding/removing data from ZK, were you running multiple frameworks in the Mesos cluster, etc.) Just trying to make sure there isn't more task data cleanup we should do in your PR.
Sure, my environment is basically like this: Mesos with Marathon, Singularity, and my own framework, Mescal. Singularity was running flawlessly for about a week. Then this morning it died with the error I posted. Once it aborts, systemd restarts it, but it dies again in the same way. I was also wondering what other task cleanup might be appropriate, but I'm not familiar enough with the internals to know how a task could be "lost" like this. Could it be an issue with ZK synchronization?
Singularity should have logged the task that it received an update for right before it hit that exception: `LOG.debug("Task {} is now {} ({}) at {} ", taskId, status.getState(), status.getMessage(), timestamp);` Could you verify that the task ID it's blowing up on is affiliated with Singularity? (This is a long shot, but I just wanted to rule it out early on.)
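For context, here is a rough sketch of where a log line like that sits in a Mesos `statusUpdate` callback. This is not Singularity's actual code; the class is simplified and the surrounding logic is only indicated in comments:

```java
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExampleScheduler {
  private static final Logger LOG = LoggerFactory.getLogger(ExampleScheduler.class);

  public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
    final long timestamp = System.currentTimeMillis();
    final String taskId = status.getTaskId().getValue();

    // The line quoted above: grep for the task id next to "is now" to see
    // which task and state the scheduler was handling when it died.
    LOG.debug("Task {} is now {} ({}) at {}", taskId, status.getState(), status.getMessage(), timestamp);

    // The IllegalStateException in this issue is thrown just after this
    // point, when the task's history lookup unexpectedly comes back empty.
  }
}
```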
Yep, I looked in the log and it is a Singularity task. It is the same task every time, and the state is TASK_RUNNING. |
My next worry is that the history for that task is being purged prematurely from ZK. How far back do your Singularity logs go? Could you grep for a line that looks like
Looks like I've only got logs from after this problem started happening. There's no instance of "Purged" in the log. |
I had commented on the PR but missed this conversation, so bringing it here. This is very odd: if it is indeed a scheduler bug, we have not run into this case across millions of tasks over the years. Would you be willing to share whatever logs you do have so we can look and see if there's anything interesting there? I think we should go ahead and handle this state, but we will probably need to think a bit more about how exactly we want to handle it (continue with processing as you did in your patch vs. simply exiting the update) and add tests to verify.
Looked through the code more today. This looks impossible without direct data manipulation:

1. The active task and the historical task (the one that was missing) are added to ZK in a single transaction.

There is no way I can see or imagine in which a historical task can be missing while an active task for it is present.
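For illustration, the single-transaction write pattern looks roughly like this with Curator. The paths and data below are hypothetical, not Singularity's actual ZK layout:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class AtomicTaskWrite {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Parent paths must already exist for a transactional create; set them up for the demo.
    client.createContainers("/tasks/active");
    client.createContainers("/tasks/history");

    String taskId = "example-task-1";
    byte[] data = "task-data".getBytes();

    // Both znodes are committed as one atomic ZooKeeper multi-op: either
    // the active entry and the history entry both exist, or neither does,
    // so this code path cannot leave an active task without history.
    client.inTransaction()
        .create().forPath("/tasks/active/" + taskId, data)
        .and()
        .create().forPath("/tasks/history/" + taskId, data)
        .and()
        .commit();

    client.close();
  }
}
```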
I am sure there was no human manipulation of data in ZK. I'm the only one with access to it, and I didn't touch it. The only software we use with access to ZK is Mesos, Marathon, Singularity, Exhibitor, and ZK itself. It is possible that there is a bug somewhere in the stack which corrupted Singularity's state, if not in Singularity itself. Either way, since this issue involves an unchecked Option type, I do think it makes sense to handle this edge case. Maybe the best thing to do would be to log the unexpected state corruption and remove all traces of the task in question from Singularity and ZK.
Seems like a strong argument -- either the parameter is optional, or it is not. If it is, the code should handle the absent case. If it is not, it should not be wrapped in an `Optional`.
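A minimal sketch of that pattern, using Guava's `Optional` and hypothetical `TaskHistoryStore`/`TaskHistory` names rather than Singularity's actual API:

```java
import com.google.common.base.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StatusUpdateGuard {
  private static final Logger LOG = LoggerFactory.getLogger(StatusUpdateGuard.class);

  interface TaskHistory {}

  interface TaskHistoryStore {
    Optional<TaskHistory> getTaskHistory(String taskId);
  }

  private final TaskHistoryStore taskHistoryStore;

  StatusUpdateGuard(TaskHistoryStore taskHistoryStore) {
    this.taskHistoryStore = taskHistoryStore;
  }

  void handleStatusUpdate(String taskId) {
    Optional<TaskHistory> maybeHistory = taskHistoryStore.getTaskHistory(taskId);

    // Handle the absent case explicitly: log the unexpected state and exit
    // the update (or clean up the orphaned data) instead of calling get()
    // and letting an IllegalStateException take down the scheduler.
    if (!maybeHistory.isPresent()) {
      LOG.warn("No history found for supposedly active task {}; skipping update", taskId);
      return;
    }

    TaskHistory history = maybeHistory.get();
    // ... continue processing the status update using the known history ...
  }
}
```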
This has been fixed in https://github.com/HubSpot/Singularity/releases/tag/Singularity-0.4.3 |
:mindblown: |