Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup hosts marked as missing on startup #2240

Merged
merged 13 commits into from
Nov 18, 2021
Merged

Conversation

ashdza
Copy link
Contributor

@ashdza ashdza commented Nov 9, 2021

This PR modifies the SingularityAgentReconciliationPoller to delete hosts that are marked as missing on startup. Previously, only hosts marked as dead were deleted by the poller, causing inactive hosts marked as missing on startup to still appear in Singularity.

Hosts that are dead or missing on startup are deleted from both the AgentManager and InactiveAgentManager lists to cleanup inactive hosts in the Singularity UI.

https://git.hubteam.com/HubSpot/PaaS-Run/issues/1525

@ssalinas
Copy link
Member

ssalinas commented Nov 9, 2021

PR looks like it will fix those. There is one addition we can make to the cleanup as well. A little context first:

  • AgentManager -> DEAD/MISSING... contains the active listing of mesos agents (i.e. those used and considered in scheduling/offers/resources/etc and that show up in the top sections of the SingularityUI agents page)
  • InactiveAgentManager - maintains a list of agent hostnames/IPs that are marked as 'should not come back'. These are different from the list above. While AgentsManager is initially populated by data from the mesos master as we get new offers (or on startup), InactiveAgentsManager's list is populated by the user (or our automation in this case). It is targetd at catching cases where we have a 'flappy' host. For example, a known bad host may be marked as DEAD by mesos, only to come back -> ACTIVE -> die again -> repeat, causing failed tasks and churn. If we add that hostname to the inactive agents list, it is automatically marked as DECOMMISSIONED when we see it again (rather than ACTIVE)

So, I think we ca, in addition, clean up anything in InactiveAgentManager's list that is 1) being deleted by the checks you already have above or 2) older than some amount of time (likely longer than the AgentManager threshold, more like a few days). This will serve to also clean up the large Inactive Agents list you see as the very bottom table in the Singularity UI agents page

checkDeadSlaves();
checkInactiveSlaves();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally missed this in earlier PRs. Can you do a slaves -> agents rename as you go through this PR please. Really thought I had caught most of those already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May do more of this in a separate PR; upon renaming tests, the order that the tests run changes and causes some more test failures likely due to interference between tests

Copy link
Contributor

@WH77 WH77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me, assuming a host that's missing on startup and cleaned up can be safely reintroduced later on.

Comment on lines 59 to 61
inactiveAgentManager.cleanInactiveAgentsList(
System.currentTimeMillis() -
TimeUnit.HOURS.toMillis(configuration.getCleanInactiveHostListEveryHours())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean up anything in InactiveAgentManager's list that is 1) being deleted by the checks you already have above or 2) older than some amount of time (likely longer than the AgentManager threshold, more like a few days)

@ssalinas should this be addressing point 2 or are they different mechanisms?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this looks like it should be doing the 'anything old than X hours' part already

@ashdza ashdza marked this pull request as ready for review November 16, 2021 16:29
@pschoenfelder
Copy link
Contributor

🚢

@ashdza ashdza merged commit ab893e0 into master Nov 18, 2021
@ashdza ashdza deleted the cleanup-inactive-hosts branch November 18, 2021 20:45
@ssalinas ssalinas added this to the 1.5.0 milestone May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants