feat(server): cache cluster state and reserved scheduled resources #386
Conversation
nice!
server/cmd/run.go
Outdated
@@ -133,14 +134,16 @@ func run(ctx context.Context, c *config.Config) error {
	usageSetter = sender.NoopUsageSetter{}
}

sched := scheduler.New(st, logger.WithName("scheduler"))
cache := cache.NewStore(st)
Should we populate the cache from the DB at startup? If so, how about adding a TODO comment?
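For illustration, a minimal sketch of what such a startup warm-up could look like; the lister interface and the AddOrUpdateCluster method below are assumptions made up for the sketch, not the actual store/cache API in this PR:

```go
import (
	"context"
	"fmt"
)

// Minimal stand-ins for the real types; all names here are assumptions,
// not the actual store/cache API.
type ClusterRecord struct {
	TenantID  string
	ClusterID string
}

type clusterLister interface {
	ListClusters(ctx context.Context) ([]*ClusterRecord, error)
}

type clusterCache interface {
	AddOrUpdateCluster(tenantID string, cl *ClusterRecord)
}

// populateCache warms the in-memory cache from persisted state once at
// startup, before the server starts handling requests.
func populateCache(ctx context.Context, st clusterLister, c clusterCache) error {
	clusters, err := st.ListClusters(ctx)
	if err != nil {
		return fmt.Errorf("list clusters: %s", err)
	}
	for _, cl := range clusters {
		c.AddOrUpdateCluster(cl.TenantID, cl)
	}
	return nil
}
```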
@@ -107,6 +107,12 @@ func (s *S) CreateJob(
if err != nil {
	return nil, status.Errorf(codes.Internal, "schedule: %s", err)
}
if err := s.cache.AddAssumedPod(userInfo.TenantID, sresult.ClusterID, &v1.GpuPod{
	AllocatedCount: 1,
	NamespacedName: fmt.Sprintf("%s/%s", sresult.Namespace, jobID),
I think we're adding a job name, not a pod name, here? Given that we don't know the pod name in advance, we need some more complex conversion.
Also, we don't need to worry too much now, but in-memory caching would be difficult to handle when there is more than one replica of the server.
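As a rough sketch of one possible conversion (the helper names and the "job-name" label are assumptions, not something this PR defines): key the assumed entry by namespace/jobName at schedule time, then recover the job name from the observed pod's labels instead of parsing its generated name:

```go
// assumedKey is the cache key written at schedule time; only the job name
// is known then, so the key is namespace/jobName (as in the diff above).
func assumedKey(namespace, jobName string) string {
	return namespace + "/" + jobName
}

// keyForObservedPod maps a real pod back to its assumed-entry key. The
// "job-name" label is a hypothetical convention, not defined by this PR;
// matching by pod-name prefix would be an alternative, but it is more
// fragile because pod names carry generated suffixes.
func keyForObservedPod(namespace string, podLabels map[string]string) (string, bool) {
	jobName, ok := podLabels["job-name"]
	if !ok {
		return "", false
	}
	return assumedKey(namespace, jobName), true
}
```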
Co-authored-by: Kenji Kaneda <kenji.kaneda@gmail.com>
resolve conflict & add changes at 11547a4
LGTM!
ProvisionableResources []*v1.ProvisionableResource

GPUPods []*v1.GpuPod
// AssumedGPUPodsByKey is a map from key to the assumed GPU pods on the node.
Can you clarify what's the key here? Pod name prefix?
if ok {
defer c.mu.RUnlock()
cl, ok := cls[clusterID]
return cl.Clone(), ok, nil
Just curious, do we really need Clone()? I thought Cluster is immutable.
Cluster is not immutable here: AssumedGPUPodsByKey is updated when a scheduled pod is added.
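A simplified sketch of why the copy matters, with assumed field and method names (the real cache also keys clusters by tenant ID, omitted here): the cached Cluster carries a mutable map that later AddAssumedPod calls update under the lock, so GetCluster returns a deep copy to keep callers from racing with those updates:

```go
import "sync"

// Simplified stand-ins; the field and method names are assumptions.
type GpuPod struct {
	NamespacedName string
	AllocatedCount int32
}

type Cluster struct {
	// AssumedGPUPodsByKey is mutated whenever a newly scheduled pod is
	// assumed, so a Cluster held in the cache is not immutable.
	AssumedGPUPodsByKey map[string]*GpuPod
}

// Clone returns a deep copy so callers can read a snapshot without holding
// the cache lock or observing concurrent AddAssumedPod updates.
func (c *Cluster) Clone() *Cluster {
	pods := make(map[string]*GpuPod, len(c.AssumedGPUPodsByKey))
	for k, p := range c.AssumedGPUPodsByKey {
		cp := *p
		pods[k] = &cp
	}
	return &Cluster{AssumedGPUPodsByKey: pods}
}

type Store struct {
	mu       sync.RWMutex
	clusters map[string]*Cluster // keyed by cluster ID
}

// GetCluster returns a copy of the cached cluster under a read lock.
func (s *Store) GetCluster(clusterID string) (*Cluster, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cl, ok := s.clusters[clusterID]
	if !ok {
		return nil, false
	}
	return cl.Clone(), true
}
```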
ah got it!