IllegalStateException in SingularityMesosScheduler.statusUpdate #599

Closed
chriskite opened this issue Jul 8, 2015 · 15 comments
@chriskite

Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! java.lang.IllegalStateException: Optional.get() cannot be called on an absent value
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! at com.google.common.base.Absent.get(Absent.java:47) ~[singularity.jar:0.4.2-SNAPSHOT]
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! at com.hubspot.singularity.mesos.SingularityMesosScheduler.statusUpdate(SingularityMesosScheduler.java:297) ~[singularity.jar:0.4.2-SNAPSHOT]
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ! at com.hubspot.singularity.mesos.SingularityMesosSchedulerDelegator.statusUpdate(SingularityMesosSchedulerDelegator.java:218) ~[singularity.jar:0.4.2-SNAPSHOT]
Jul 08 11:43:16 ip-10-42-10-87.ec2.internal bash[24028]: ERROR [2015-07-08 16:43:16,621] com.hubspot.singularity.SingularityAbort: Singularity on ip-10-42-10-87.ec2.internal is aborting due to UNRECOVERABLE_ERROR
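
For context, a minimal sketch (not the Singularity source) of the Guava Optional behavior behind this stack trace: calling get() on an absent value throws exactly this IllegalStateException, and the defensive alternative is to check presence first.

```java
import com.google.common.base.Optional;

public class AbsentGetDemo {
    public static void main(String[] args) {
        Optional<String> taskHistory = Optional.absent();

        try {
            // Guava throws IllegalStateException (not NoSuchElementException) here,
            // with the same message seen in the log above.
            taskHistory.get();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // "Optional.get() cannot be called on an absent value"
        }

        // Defensive alternative: check isPresent() (or use or()/orNull()) before unwrapping.
        if (taskHistory.isPresent()) {
            System.out.println(taskHistory.get());
        }
    }
}
```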
@chriskite (Author)

The patch I wrote fixed the issue, and my Singularity isn't stuck in an endless start-fail-abort loop now.

@tpetr (Contributor) commented Jul 8, 2015

Thanks for the PR, @chriskite! Would you mind giving a little more detail about the steps that led up to Singularity getting into its death spiral? (e.g. what were you intending to run on Singularity, were you manually adding/removing data from ZK, were you running multiple frameworks in the Mesos cluster, etc.) Just trying to make sure there isn't more task data cleanup we should do in your PR.

@chriskite (Author)

Sure, my environment is basically like this:

Mesos with Marathon, Singularity, and my own framework Mescal.
3 node ZK cluster
Not running MySQL for Singularity
Singularity running via Docker on CoreOS with systemd
About 15 scheduled tasks in Singularity
No data manually touched in ZK

Singularity was running flawlessly for about a week. Then this morning it died with the error I posted. Once it aborts, systemd restarts it, but it dies again in the same way.

I was also wondering what other task cleanup might be appropriate, but I'm not familiar enough with the internals to know how a task could be "lost" like this. Could it be an issue with ZK synchronization?

@tpetr (Contributor) commented Jul 8, 2015

Singularity should have logged the task that it received an update for right before it hit that exception:

LOG.debug("Task {} is now {} ({}) at {} ", taskId, status.getState(), status.getMessage(), timestamp);

Could you verify that the task ID it's blowing up on is affiliated with Singularity? (this is a long shot, but just wanted to rule it out early on)

@chriskite (Author)

Yep, I looked in the log and it is a Singularity task. It is the same task every time, and the state is TASK_RUNNING.

@tpetr (Contributor) commented Jul 8, 2015

My next worry is that the history for that task is being purged prematurely from ZK. How far back do your Singularity logs go? Could you grep for a line that looks like (task id) Purged?

@chriskite (Author)

Looks like I've only got logs from after this problem started happening. There's no instance of "Purged" in the log.

@wsorenson (Contributor)

I had commented on the PR but missed this conversation, so bringing it here.

This is very odd; if it is indeed a scheduler bug, we have not run into this case across millions of tasks over the years.

Would you be willing to share whatever logs you do have so we can look and see if there's anything interesting there?

I think we should go ahead and handle this state, but we will probably need to think a bit more about how exactly we want to handle it (continue processing as you did in your patch vs. simply exiting the update) and add tests to verify.
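
A rough sketch of the two options being weighed; the class, method, and field names here are hypothetical stand-ins, not Singularity's actual internals:

```java
import com.google.common.base.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of guarding a status update against a missing task history.
public class StatusUpdateGuardSketch {
  private static final Logger LOG = LoggerFactory.getLogger(StatusUpdateGuardSketch.class);

  static class TaskHistory {}

  void handleStatusUpdate(String taskId, String newState, Optional<TaskHistory> maybeHistory) {
    if (!maybeHistory.isPresent()) {
      // Option A: log the unexpected state and bail out of this update.
      // (Option B, closer to the original patch, would instead keep processing
      //  the update without the historical record.)
      LOG.warn("No history found for task {} (state {}), skipping status update", taskId, newState);
      return;
    }
    TaskHistory history = maybeHistory.get(); // safe: presence checked above
    LOG.debug("Task {} is now {}", taskId, newState);
  }
}
```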

@wsorenson (Contributor)

Looked through the code more today.

This looks impossible without direct data manipulation:

1. The active task and the historical task (the record that was missing) are added to ZK in a single transaction.
2. If an active task exists, the historical task is not deleted.
3. In the case above, an active task must exist to reach the code block which threw the exception.

I can't see or imagine any way for a historical task to be missing while its active task is present.
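
To make point 1 concrete, here is a minimal sketch of an atomic active-plus-history write using the raw ZooKeeper multi() API; the paths and payload are hypothetical and do not reflect Singularity's actual storage layout:

```java
import java.util.Arrays;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical illustration of point 1: both records are created in one multi()
// transaction, so neither can be committed without the other.
public class AtomicTaskWriteSketch {
  static void persistTask(ZooKeeper zk, String taskId, byte[] taskData)
      throws KeeperException, InterruptedException {
    zk.multi(Arrays.asList(
        Op.create("/tasks/active/" + taskId, taskData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
        Op.create("/tasks/history/" + taskId, taskData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    ));
  }
}
```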

@chriskite (Author)

I am sure there was no human manipulation of data in ZK. I'm the only one with access to it, and I didn't touch it. The only software we use that has access to ZK is Mesos, Marathon, Singularity, Exhibitor, and ZK itself. It is possible that there is a bug somewhere in that stack which corrupted Singularity's state, if not in Singularity itself.

Either way, since this issue involves an unchecked Optional, I do think it makes sense to handle this edge case. Maybe the best thing to do would be to log the unexpected state corruption and remove all traces of the task in question from Singularity and ZK.

@stevenschlansker (Contributor)

Seems like a strong argument: either the parameter is optional or it is not. If it is, the code should handle the absent case; if it is not, it should not be wrapped in an Optional<T> :)
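
In API terms, the two internally consistent designs look roughly like this (a generic sketch, not Singularity's actual interfaces):

```java
import com.google.common.base.Optional;

class TaskHistory {}

// Generic sketch of the two consistent designs described above.
interface TaskHistoryLookup {
  // Design 1: absence is an expected outcome, so callers are forced to handle it.
  Optional<TaskHistory> findHistory(String taskId);

  // Design 2: absence is a programming error, so the value is not wrapped at all;
  // the lookup either returns a real value or fails fast.
  TaskHistory getHistoryOrThrow(String taskId);
}
```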

@wsorenson (Contributor)

#631

@tpetr (Contributor) commented Aug 18, 2015

@wsorenson (Contributor)

So, we addressed this in #631; however, it finally happened at HubSpot and I had enough of the task logs to figure out the root cause. I've addressed it in #869.

Ultimately, it's a workaround for what we believe is a bug in ZooKeeper:

https://issues.apache.org/jira/browse/ZOOKEEPER-2362

@stevenschlansker (Contributor)

:mindblown:
