
feat(server): cache cluster state and reserved scheduled resources #386

Merged
Ladicle merged 6 commits into main from cache-cls on Feb 14, 2025

Conversation

@Ladicle commented Feb 12, 2025

No description provided.

@Ladicle requested a review from kkaneda, February 12, 2025 08:40
@github-actions bot added the enhancement label Feb 12, 2025

@kkaneda left a comment

nice!

@@ -133,14 +134,16 @@ func run(ctx context.Context, c *config.Config) error {
usageSetter = sender.NoopUsageSetter{}
}

sched := scheduler.New(st, logger.WithName("scheduler"))
cache := cache.NewStore(st)
Contributor

Should we populate the cache from the DB at startup? If so, how about adding a TODO comment?
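
A hedged illustration of the warm-up idea above, not code from this PR: it assumes hypothetical ListClusters and AddOrUpdateCluster methods on the store and cache, and elides imports.

// Sketch only: warm the in-memory cache from the DB at startup so assumed
// resources survive a server restart. Method names here are assumptions.
func populateCache(ctx context.Context, st *store.S, c *cache.Store) error {
	clusters, err := st.ListClusters(ctx)
	if err != nil {
		return fmt.Errorf("list clusters: %s", err)
	}
	for _, cl := range clusters {
		c.AddOrUpdateCluster(cl)
	}
	return nil
}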

@@ -107,6 +107,12 @@ func (s *S) CreateJob(
if err != nil {
return nil, status.Errorf(codes.Internal, "schedule: %s", err)
}
if err := s.cache.AddAssumedPod(userInfo.TenantID, sresult.ClusterID, &v1.GpuPod{
AllocatedCount: 1,
NamespacedName: fmt.Sprintf("%s/%s", sresult.Namespace, jobID),
Contributor

I think we're adding a job name, not a pod name, here? Given that we don't know the pod name in advance, we need some more complex conversion.
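
One possible reading of that "more complex conversion", purely illustrative and not code from this PR: if the assumed key stays "namespace/jobID" (as in the diff above) while the eventual pod's generated name only starts with the job ID, the cache could reconcile the two with a prefix match. The helper name and the naming assumption are hypothetical.

// Sketch only: match an assumed key ("namespace/jobID") against a real pod,
// assuming the pod name is derived from (i.e. prefixed by) the job ID.
func matchesAssumedKey(assumedKey, podNamespace, podName string) bool {
	ns, jobID, ok := strings.Cut(assumedKey, "/")
	if !ok {
		return false
	}
	return ns == podNamespace && strings.HasPrefix(podName, jobID)
}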

@kkaneda commented Feb 12, 2025

Also, we don't need to worry too much now, but in-memory caching would be difficult to handle when there is more than one replica of job-manager-server.

@Ladicle commented Feb 13, 2025

Resolved the conflict and added the changes in 11547a4.

@kkaneda left a comment

LGTM!

ProvisionableResources []*v1.ProvisionableResource

GPUPods []*v1.GpuPod
// AssumedGPUPodsByKey is a map from key to the assumed GPU pods on the node.
Contributor

Can you clarify what the key is here? A pod name prefix?

if ok {
defer c.mu.RUnlock()
cl, ok := cls[clusterID]
return cl.Clone(), ok, nil
Contributor

Just curious, do we really need Clone()? I thought Cluster is immutable.

Contributor Author

Cluster is not immutable: AssumedGPUPodsByKey is updated when a scheduled pod is added.
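
For context, a sketch of why Clone() matters here, assuming a Cluster struct shaped like the fields quoted above and a map[string]*v1.GpuPod value type (the real type and implementation may differ): the mutable map must be copied so the returned snapshot cannot race with later AddAssumedPod updates.

// Sketch only; not the Clone implementation from this PR.
func (c *Cluster) Clone() *Cluster {
	clone := *c // fields treated as immutable can be shared via a shallow copy
	// Deep-copy the mutable map so the caller's snapshot is isolated from
	// subsequent AddAssumedPod updates made under the cache lock.
	clone.AssumedGPUPodsByKey = make(map[string]*v1.GpuPod, len(c.AssumedGPUPodsByKey))
	for k, v := range c.AssumedGPUPodsByKey {
		clone.AssumedGPUPodsByKey[k] = v
	}
	return &clone
}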

Contributor

ah got it!

@Ladicle merged commit 07b845b into main Feb 14, 2025
2 checks passed
@Ladicle deleted the cache-cls branch February 14, 2025 00:18