
The zoo is under new management (zk cleanup) #1922

Merged
merged 41 commits into master from general_zk_cleanup on Apr 25, 2019
Conversation

@ssalinas (Member) commented Apr 22, 2019:

Be much nicer to ZooKeeper by avoiding possibly large payloads, specifically:

  • Namespace pending tasks by request id so that the list children call for this node does not get too large when there are thousands of pending tasks (see the sketch after this list)
    • Avoids the list + filter call, which required fetching all pending task ids to find the ones for a particular request
  • Remove usage of the /tasks/active node entirely, since we have this data elsewhere
  • Namespace the last active task status nodes by request id and use this data to power the active tasks list instead
  • Clean up nodes from old features that are no longer used
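
For illustration, a hedged sketch of the path layout change (the root constant, class, and method names here are assumptions for the example, not taken from the Singularity source):

import org.apache.curator.utils.ZKPaths;

public class PendingTaskPaths {
  private static final String PENDING_ROOT = "/tasks/scheduled"; // assumed root node

  // Before: every pending task id was a direct child of a single node, so one
  // getChildren() call returned every pending task id in the cluster.
  static String flatPath(String pendingTaskId) {
    return ZKPaths.makePath(PENDING_ROOT, pendingTaskId);
  }

  // After: pending tasks are namespaced by request id, so listing the pending
  // tasks for one request reads only that request's children.
  static String namespacedPath(String requestId, String pendingTaskId) {
    return ZKPaths.makePath(ZKPaths.makePath(PENDING_ROOT, requestId), pendingTaskId);
  }
}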

Still TODO on this PR:

  • Automatic cleanup of request groups when a request is deleted
  • Revisit caching to determine if any additional data can be leader cached
  • More aggressive pruning of inactive slave data

@ssalinas (Member Author):
We weren't using the correct code paths to hit the leader cache for slave + rack data, causing lots of unneeded ZK calls for those. Fixed in this PR now.
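
For context, a minimal sketch of the cache-first pattern these call sites should be using (it mirrors the getRack diff below; the fallback helper names are assumptions):

public Optional<SingularityRack> getRack(String rackId) {
  if (leaderCache.active()) {
    return leaderCache.getRack(rackId); // served from the leader's in-memory cache
  }
  return getData(getRackPath(rackId)); // assumed direct ZooKeeper read as the fallback
}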

@ssalinas changed the title from "The zoo is under new management" to "The zoo is under new management (zk cleanup)" on Apr 22, 2019
@ssalinas (Member Author):

/fixes #1823

@ssalinas added the hs_qa label on Apr 23, 2019
 if (leaderCache.active()) {
-  return leaderCache.getRack(rackName);
+  return leaderCache.getRack(rackId);
@baconmania (Contributor):

We're using both rackName (in SingularityLeaderCache) and rackId. We should pick one for consistency

@ssalinas (Member Author):

consolidating on rackId 👍

List<String> currentActive = curatorFramework.getChildren().forPath(ACTIVE_STATUSES_ROOT);
for (String taskIdString : currentActive) {
  SingularityTaskId taskId = SingularityTaskId.valueOf(taskIdString);
  String oldPath = ZKPaths.makePath(ACTIVE_STATUSES_ROOT, taskIdString);
@baconmania (Contributor):

We're running this code based on whether ACTIVE_TASKS_ROOT exists in ZK, but we're never actually reading from that path during this migration. Just double-checking, is that intended?

@ssalinas (Member Author):

We remove it as the last step of the migration, which is why it's my initial check. There isn't actually any data in those nodes; they are only an index we call getChildren on. /tasks/statuses is where the actual data lives and is what I am namespacing. However, I'm realizing that if this died halfway through, some of the children on the next run would be request ids instead of task ids. Going to add a fix so this can also run on a half-migrated dataset (see the sketch below).
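
A hedged sketch of what tolerating a half-migrated tree could look like (the exception and helper names are assumed from the Singularity codebase, not verbatim from this PR):

List<String> children = curatorFramework.getChildren().forPath(ACTIVE_STATUSES_ROOT);
for (String child : children) {
  SingularityTaskId taskId;
  try {
    taskId = SingularityTaskId.valueOf(child);
  } catch (InvalidSingularityTaskIdException e) {
    // Child is already a request-id namespace from a previous partial run; skip it.
    continue;
  }
  String oldPath = ZKPaths.makePath(ACTIVE_STATUSES_ROOT, child);
  String newPath = ZKPaths.makePath(ZKPaths.makePath(ACTIVE_STATUSES_ROOT, taskId.getRequestId()), child);
  // ...move the status data from oldPath to newPath, then delete oldPath
}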

@@ -153,6 +153,9 @@ public void subscribed(Subscribed subscribed) {
       heartbeatIntervalSeconds = Optional.of(advertisedHeartbeatIntervalSeconds);
     }

+    // Should be called before activation of leader cache or cache could be left empty
+    startup.checkMigrations();
@baconmania (Contributor):

Do we expect that this wouldn't have run at startup?

@ssalinas (Member Author):

It only runs when we take leadership, not as a Managed class. Moved it here since we need to migrate before filling the leader cache, because the leader cache is using the newer taskManager at this point.
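
For illustration, the ordering being described, as a sketch (the cache activation call is an assumption, not shown in this diff):

public void subscribed(Subscribed subscribed) {
  // ...existing subscription handling...
  startup.checkMigrations();                    // migrate the ZK layout first
  leaderCacheCoordinator.activateLeaderCache(); // then fill the cache from the new layout
}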

}

saveResult(Optional.of(response.getStatusCode()), responseBody, Optional.<String> absent(), Optional.<Throwable>absent());
public void setHealthcehckUri(String healthcheckUri) {
@baconmania (Contributor):

nit: spelling

@ssalinas (Member Author):

👍 fixed

@@ -35,7 +35,7 @@
     <basepom.release.profiles>oss-release</basepom.release.profiles>
     <dep.classmate.version>1.3.1</dep.classmate.version>
     <dep.commons-lang3.version>3.7</dep.commons-lang3.version>
-    <dep.curator.version>2.12.0</dep.curator.version>
+    <dep.curator.version>4.2.0</dep.curator.version>
@baconmania (Contributor):

Sweet glory

@ssalinas (Member Author):

🎉

@@ -203,20 +222,33 @@ private String getHistoryUpdatePath(SingularityMachineStateHistoryUpdate history
     return ZKPaths.makePath(getHistoryPath(historyUpdate.getObjectId()), historyChildPath);
   }

   public void clearOldHistory(String machineId) {
     List<SingularityMachineStateHistoryUpdate> histories = getHistory(machineId);
     histories.sort(Comparator.comparingLong(SingularityMachineStateHistoryUpdate::getTimestamp));
@baconmania (Contributor) commented Apr 25, 2019:

I think we want to reverse this sort, or the skip() below will keep the oldest n entries instead of the newest.

@ssalinas (Member Author):

Reversing the sort. Though I'm realizing we don't really expose this data anywhere... maybe we don't even need it.
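
A minimal sketch of the fix (the skip()-based pruning and the limit name are assumed from the review comment, not shown in the diff):

histories.sort(Comparator.comparingLong(SingularityMachineStateHistoryUpdate::getTimestamp).reversed());
// Newest-first now, so skipping the first maxHistoryEntries keeps the newest
// updates and deletes only the older ones:
histories.stream()
    .skip(maxHistoryEntries)
    .forEach(update -> delete(getHistoryUpdatePath(update)));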

@baconmania (Contributor):

🚢

@ssalinas merged commit aeab026 into master on Apr 25, 2019
@ssalinas deleted the general_zk_cleanup branch on April 25, 2019 at 20:19
@ssalinas added this to the 0.23.0 milestone on Jun 7, 2019