
feat(server): cache cluster state and reserved scheduled resources #386

Merged
Ladicle merged 6 commits into main from cache-cls on Feb 14, 2025

Conversation

@Ladicle commented Feb 12, 2025

No description provided.

@Ladicle requested a review from kkaneda, February 12, 2025 08:40
@github-actions bot added the enhancement label Feb 12, 2025

@kkaneda left a comment

nice!

@@ -133,14 +134,16 @@ func run(ctx context.Context, c *config.Config) error {
usageSetter = sender.NoopUsageSetter{}
}

sched := scheduler.New(st, logger.WithName("scheduler"))
cache := cache.NewStore(st)
Contributor

Should we populate the cache from the DB at startup? If so, how about adding a TODO comment?
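
A hedged illustration of the warm-up idea above, not code from this PR: it assumes hypothetical ListClusters and AddOrUpdateCluster methods on the store and cache, and elides imports.

// Sketch only: warm the in-memory cache from the DB at startup so assumed
// resources survive a server restart. Method names here are assumptions.
func populateCache(ctx context.Context, st *store.S, c *cache.Store) error {
	clusters, err := st.ListClusters(ctx)
	if err != nil {
		return fmt.Errorf("list clusters: %s", err)
	}
	for _, cl := range clusters {
		c.AddOrUpdateCluster(cl)
	}
	return nil
}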

@@ -107,6 +107,12 @@ func (s *S) CreateJob(
if err != nil {
return nil, status.Errorf(codes.Internal, "schedule: %s", err)
}
if err := s.cache.AddAssumedPod(userInfo.TenantID, sresult.ClusterID, &v1.GpuPod{
AllocatedCount: 1,
NamespacedName: fmt.Sprintf("%s/%s", sresult.Namespace, jobID),
Contributor

I think we're adding a job name, not a pod name, here? Given that we don't know the pod name in advance, we need some more complex conversion.
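
One possible reading of that "more complex conversion", purely illustrative and not code from this PR: if the assumed key stays "namespace/jobID" (as in the diff above) while the eventual pod's generated name only starts with the job ID, the cache could reconcile the two with a prefix match. The helper name and the naming assumption are hypothetical.

// Sketch only: match an assumed key ("namespace/jobID") against a real pod,
// assuming the pod name is derived from (i.e. prefixed by) the job ID.
func matchesAssumedKey(assumedKey, podNamespace, podName string) bool {
	ns, jobID, ok := strings.Cut(assumedKey, "/")
	if !ok {
		return false
	}
	return ns == podNamespace && strings.HasPrefix(podName, jobID)
}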

@kkaneda commented Feb 12, 2025

Also, we don't need to worry too much now, but in-memory caching would be difficult to handle when there is more than one replica of job-manager-server.

@Ladicle commented Feb 13, 2025

Resolved the conflict and added the changes in 11547a4.

@kkaneda left a comment

LGTM!

ProvisionableResources []*v1.ProvisionableResource

GPUPods []*v1.GpuPod
// AssumedGPUPodsByKey is a map from key to the assumed GPU pods on the node.
Contributor

Can you clarify what the key is here? A pod name prefix?

if ok {
defer c.mu.RUnlock()
cl, ok := cls[clusterID]
return cl.Clone(), ok, nil
Contributor

Just curious, do we really need Clone()? I thought Cluster is immutable.

Contributor Author

Cluster is not immutable: AssumedGPUPodsByKey is updated when a scheduled pod is added.
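
For context, a sketch of why Clone() matters here, assuming a Cluster struct shaped like the fields quoted above and a map[string]*v1.GpuPod value type (the real type and implementation may differ): the mutable map must be copied so the returned snapshot cannot race with later AddAssumedPod updates.

// Sketch only; not the Clone implementation from this PR.
func (c *Cluster) Clone() *Cluster {
	clone := *c // fields treated as immutable can be shared via a shallow copy
	// Deep-copy the mutable map so the caller's snapshot is isolated from
	// subsequent AddAssumedPod updates made under the cache lock.
	clone.AssumedGPUPodsByKey = make(map[string]*v1.GpuPod, len(c.AssumedGPUPodsByKey))
	for k, v := range c.AssumedGPUPodsByKey {
		clone.AssumedGPUPodsByKey[k] = v
	}
	return &clone
}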

Contributor

ah got it!

@Ladicle merged commit 07b845b into main Feb 14, 2025
2 checks passed
@Ladicle deleted the cache-cls branch February 14, 2025 00:18