-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PX Resource Gateway #1621
base: release-24.2.0
Are you sure you want to change the base?
PX Resource Gateway #1621
Conversation
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still reviewing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continuing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add unit tests to all of these files. You also need e2e tests: create small grpc servers and have a client send requests to tests the full paths.
proto/pxresourcegateway.proto
Outdated
}; | ||
} | ||
|
||
// KeepAlive sends a heartbeat to the semaphore service to keep the lock alive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a timeout. Keep alive for "how long"? for example 5 secs? etc.
return &emptypb.Empty{}, nil | ||
} | ||
|
||
func (s *server) KeepAlive(ctx context.Context, req *pb.KeepAliveRequest) (*emptypb.Empty, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is confusing on the side effects and expectations. I added a comment to the .proto
file on this rpc. Maybe add more comments on how this function works and expectations as a comment on the .proto file
pkg/px-resource-gateway/semaphore.go
Outdated
// check if the node is at the front of the queue | ||
nextResouceId, priority := s.priorityQ.Front() | ||
if nextResouceId == "" { | ||
panic("Queue is empty") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do something else other than panic here? I would suggest handling this situation
pkg/px-resource-gateway/semaphore.go
Outdated
@@ -0,0 +1,310 @@ | |||
package server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain the algorithm as comment header in this file to help maintainers (and me as the reviewer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm skipping this file until the algorithm is included.
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add UTs to the resource gateway installation and the actual server logic too before you merge the PR.
@@ -109,6 +109,8 @@ const ( | |||
AnnotationPVCControllerSecurePort = pxAnnotationPrefix + "/pvc-controller-secure-port" | |||
// AnnotationAutopilotCPU annotation for overriding the default CPU for Autopilot | |||
AnnotationAutopilotCPU = pxAnnotationPrefix + "/autopilot-cpu" | |||
// AnnotationPxResourceGatewayCPU annotation for overriding the default CPU for PxResourceGateway | |||
AnnotationPxResourceGatewayCPU = pxAnnotationPrefix + "/px-resource-gateway-cpu" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this if we have resources
section in the cluster spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this annotation? I see a couple of valid comments which are resolved w/o a comment. If you had an offline conversation can you mention them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I resolved the comments because I had addressed them locally. I have pushed the changes now.
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
This PR is stale because it has been in review for 3 days with no activity. |
// Setup a signal handler | ||
signal_handler := util.NewSigIntManager(func() { | ||
pxResourceGatewayServer.Stop() | ||
os.Remove(PxResourceGatewayServerSocket) | ||
os.Exit(0) | ||
}) | ||
err = signal_handler.Start() | ||
if err != nil { | ||
fmt.Printf("Unable to start signal handler: %v", err) | ||
os.Exit(1) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also setup memory and cpu profiling, and then you can add a signal handler for SIGUSR1 to dump the heap and stack profile -
operator/cmd/operator/operator.go
Line 148 in 303cc37
log.Infof("pprof profiling is enabled, creating profiling server at port %v", defaultPprofPort) |
@@ -109,6 +109,8 @@ const ( | |||
AnnotationPVCControllerSecurePort = pxAnnotationPrefix + "/pvc-controller-secure-port" | |||
// AnnotationAutopilotCPU annotation for overriding the default CPU for Autopilot | |||
AnnotationAutopilotCPU = pxAnnotationPrefix + "/autopilot-cpu" | |||
// AnnotationPxResourceGatewayCPU annotation for overriding the default CPU for PxResourceGateway | |||
AnnotationPxResourceGatewayCPU = pxAnnotationPrefix + "/px-resource-gateway-cpu" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this annotation? I see a couple of valid comments which are resolved w/o a comment. If you had an offline conversation can you mention them here.
pkg/px-resource-gateway/semaphore.go
Outdated
|
||
var ( | ||
NLocks = 10 | ||
ConfigMapUpdateInterval = time.Second * 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO 1 sec is still too aggressive for a single client to invoke a k8s api. In comparison, our default operator reconcilor will wake up every 30s and then make a bunch of k8s apis if required.
If the goal is to just persist the locks so that if this pod crashes we can recover then isn't 30s good enough?
pkg/px-resource-gateway/semaphore.go
Outdated
data[clientId] = fmt.Sprintf("%d", lockId) | ||
} | ||
|
||
if !isUpdateRequired(data, s.configMap.Data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the goal to re-initialize the s.locks from s.configMap.Data when this process starts?
Right now, I don't see this config map's contents being used anywhere
pkg/px-resource-gateway/semaphore.go
Outdated
if _, ok := removedNode[clientId]; ok { // already removed | ||
continue | ||
} | ||
s.priorityQ.Remove(clientId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am missing something here, why not remove a dead node unconditionally from the priorityQueue if it was part of s.locks ?
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
Signed-off-by: Divyam Goel <dgoel@purestorage.com>
What this PR does / why we need it:
Which issue(s) this PR fixes (optional)
Closes #
Special notes for your reviewer: