IllegalStateException in SingularityMesosScheduler.statusUpdate #599

Closed
chriskite opened this issue Jul 8, 2015 · 15 comments
@chriskite

Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! java.lang.IllegalStateException: Optional.get() cannot be called on an absent value
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! at com.google.common.base.Absent.get(Absent.java:47) ~[singularity.jar:0.4.2-SNAPSHOT]
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! at com.hubspot.singularity.mesos.SingularityMesosScheduler.statusUpdate(SingularityMesosScheduler.java:297) ~[singularity.jar:0.4.2-SNAPSHOT]
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! at com.hubspot.singularity.mesos.SingularityMesosSchedulerDelegator.statusUpdate(SingularityMesosSchedulerDelegator.java:218) ~[singularity.jar:0.4.2-SNAPSHOT]
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ERROR [2015-07-08 16:43:16,621] com.hubspot.singularity.SingularityAbort: Singularity on ip-10-42-10-87.ec2.internal is aborting due to UNRECOVERABLE_ERROR
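
For context, a minimal sketch (not the Singularity source) of the Guava Optional behavior behind this stack trace: calling get() on an absent value throws exactly this IllegalStateException, and the defensive alternative is to check presence first.

```java
import com.google.common.base.Optional;

public class AbsentGetDemo {
    public static void main(String[] args) {
        Optional<String> taskHistory = Optional.absent();

        try {
            // Guava throws IllegalStateException (not NoSuchElementException) here,
            // with the same message seen in the log above.
            taskHistory.get();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // "Optional.get() cannot be called on an absent value"
        }

        // Defensive alternative: check isPresent() (or use or()/orNull()) before unwrapping.
        if (taskHistory.isPresent()) {
            System.out.println(taskHistory.get());
        }
    }
}
```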
@chriskite (Author)

The patch I wrote fixed the issue, and my Singularity isn't stuck in an endless start-fail-abort loop now.

@tpetr (Contributor) commented Jul 8, 2015

Thanks for the PR, @chriskite! Would you mind giving a little more detail about the steps that led up to Singularity getting into its death spiral? (e.g. what were you intending to run on Singularity, were you manually adding/removing data from ZK, were you running multiple frameworks in the Mesos cluster, etc.) Just trying to make sure there isn't more task data cleanup we should do in your PR.

@chriskite (Author)

Sure, my environment is basically like this:

Mesos with Marathon, Singularity, and my own framework Mescal.
3 node ZK cluster
Not running MySQL for Singularity
Singularity running via Docker on CoreOS with systemd
About 15 scheduled tasks in Singularity
No data manually touched in ZK

Singularity was running flawlessly for about a week. Then this morning it died with the error I posted. Once it aborts, systemd restarts it, but it dies again in the same way.

I was also wondering what other task cleanup might be appropriate, but I'm not familiar enough with the internals to know how a task could be "lost" like this. Could it be an issue with ZK synchronization?

@tpetr (Contributor) commented Jul 8, 2015

Singularity should have logged the task that it received an update for right before it hit that exception:

LOG.debug("Task {} is now {} ({}) at {} ", taskId, status.getState(), status.getMessage(), timestamp);

Could you verify that the task ID it's blowing up on is affiliated with Singularity? (this is a long shot, but just wanted to rule it out early on)

@chriskite (Author)

Yep, I looked in the log and it is a Singularity task. It is the same task every time, and the state is TASK_RUNNING.

@tpetr (Contributor) commented Jul 8, 2015

My next worry is that the history for that task is being purged prematurely from ZK. How far back do your Singularity logs go? Could you grep for a line that looks like (task id) Purged?

@chriskite (Author)

Looks like I've only got logs from after this problem started happening. There's no instance of "Purged" in the log.

@wsorenson (Contributor)

I had commented on the PR but missed this conversation, so bringing it here.

This is very odd; if it is indeed a scheduler bug, we have not run into this case across millions of tasks over the years.

Would you be willing to share whatever logs you do have so we can look and see if there's anything interesting there?

I think we should go ahead and handle this state, but we will probably need to think a bit more about how exactly we want to handle it (continue processing as you did in your patch vs. simply exiting the update) and add tests to verify.
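
A rough sketch of the two options being weighed; the class, method, and field names here are hypothetical stand-ins, not Singularity's actual internals:

```java
import com.google.common.base.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of guarding a status update against a missing task history.
public class StatusUpdateGuardSketch {
  private static final Logger LOG = LoggerFactory.getLogger(StatusUpdateGuardSketch.class);

  static class TaskHistory {}

  void handleStatusUpdate(String taskId, String newState, Optional<TaskHistory> maybeHistory) {
    if (!maybeHistory.isPresent()) {
      // Option A: log the unexpected state and bail out of this update.
      // (Option B, closer to the original patch, would instead keep processing
      //  the update without the historical record.)
      LOG.warn("No history found for task {} (state {}), skipping status update", taskId, newState);
      return;
    }
    TaskHistory history = maybeHistory.get(); // safe: presence checked above
    LOG.debug("Task {} is now {}", taskId, newState);
  }
}
```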

@wsorenson (Contributor)

Looked through the code more today.

This looks impossible without direct data manipulation:

1. The active task and the historical task (the record that was missing) are added to ZK in a single transaction.
2. If an active task exists, the historical task is not deleted.
3. In the case above, an active task must exist to reach the code block which threw the exception.

I can't see or imagine any way for a historical task to be missing while its active task is present.
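
To make point 1 concrete, here is a minimal sketch of an atomic active-plus-history write using the raw ZooKeeper multi() API; the paths and payload are hypothetical and do not reflect Singularity's actual storage layout:

```java
import java.util.Arrays;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical illustration of point 1: both records are created in one multi()
// transaction, so neither can be committed without the other.
public class AtomicTaskWriteSketch {
  static void persistTask(ZooKeeper zk, String taskId, byte[] taskData)
      throws KeeperException, InterruptedException {
    zk.multi(Arrays.asList(
        Op.create("/tasks/active/" + taskId, taskData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
        Op.create("/tasks/history/" + taskId, taskData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    ));
  }
}
```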

@chriskite (Author)

I am sure there was no human manipulation of data in ZK. I'm the only one with access to it, and I didn't touch it. The only software we use that has access to ZK is Mesos, Marathon, Singularity, Exhibitor, and ZK itself. It is possible that there is a bug somewhere in that stack which corrupted Singularity's state, if not in Singularity itself.

Either way, since this issue involves an unchecked Optional, I do think it makes sense to handle this edge case. Maybe the best thing to do would be to log the unexpected state corruption and remove all traces of the task in question from Singularity and ZK.

@stevenschlansker (Contributor)

Seems like a strong argument: either the parameter is optional or it is not. If it is, the code should handle the absent case; if it is not, it should not be wrapped in an Optional<T> :)
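
In API terms, the two internally consistent designs look roughly like this (a generic sketch, not Singularity's actual interfaces):

```java
import com.google.common.base.Optional;

class TaskHistory {}

// Generic sketch of the two consistent designs described above.
interface TaskHistoryLookup {
  // Design 1: absence is an expected outcome, so callers are forced to handle it.
  Optional<TaskHistory> findHistory(String taskId);

  // Design 2: absence is a programming error, so the value is not wrapped at all;
  // the lookup either returns a real value or fails fast.
  TaskHistory getHistoryOrThrow(String taskId);
}
```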

@wsorenson (Contributor)

#631

@tpetr (Contributor) commented Aug 18, 2015

@wsorenson (Contributor)

So, we addressed this in #631; however, it finally happened at HubSpot and I had enough of the task logs to figure out the root cause. I've addressed it in #869.

Ultimately, it's a workaround for what we believe is a bug in ZooKeeper:

https://issues.apache.org/jira/browse/ZOOKEEPER-2362

@stevenschlansker (Contributor)

:mindblown:
