
Target Allocator implementation (Part 1 - OTEL Operator Enhancements) #351

Merged
merged 33 commits into open-telemetry:main on Aug 13, 2021

Conversation

@rsvarma95 (Member) commented Jul 21, 2021:

Description

This PR addresses issue open-telemetry/prometheus-interoperability-spec#60. It is Part 1 of a three-part update that adds enhancements to the OTel Operator to support deployment and reconciliation of a target allocator resource. This update builds on StatefulSet support: the target allocator resource allows multiple collector instances to split the Prometheus targets evenly among them.
Part 2 will contain the internal target allocator logic, which will live as a separate image used by the OTel Operator.
Part 3 will add the logic to modify the ConfigMap so that the collectors can find their scrape targets using http_sd_config (see the sketch below).
The Design document for Part 1 can be found here: Target Allocator Design Document
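To make Part 3 concrete, here is a minimal sketch of the kind of scrape configuration each collector could end up with; the Service name, port, and URL path are hypothetical placeholders, not the final endpoint, which is defined by the Part 2 allocator image:

```yaml
# Hypothetical Prometheus scrape config generated in Part 3:
# each collector discovers its share of targets over HTTP SD.
scrape_configs:
  - job_name: otel-collector            # assumed job name
    http_sd_configs:
      - url: http://my-targetallocator-service:80/jobs/otel-collector/targets   # assumed endpoint
```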

Type of change

New feature & Enhancement

Testing

Implemented unit tests and KUTTL end-to-end tests to verify that deployment, reconciliation, and the target allocator's internal logic work as expected.
A more detailed implementation/testing document for Part 1 of the update is attached below: Implementation Details

@alexperez52

@rsvarma95 requested review from a team and tigrannajaryan on July 21, 2021 13:54
@jpkrohling self-requested a review on July 21, 2021 14:30
@alolita (Member) commented Jul 21, 2021:

Please review @dashpole @Aneurysm9, ty!

@alolita (Member) commented Jul 26, 2021:

@jpkrohling @Aneurysm9, please review. Looking forward to any comments you may have. Ty!

@jpkrohling (Member):

I'm planning on reviewing this by the end of this week.

@jpkrohling (Member):

I left a few comments in the design doc. Let me know if you still want me to review this, or whether we should clarify the points there first.

@rsvarma95 (Member, Author):

Hi @jpkrohling, we have tried to address most of your concerns in the document. Please let us know if you feel more clarification is required.

@alolita (Member) commented Jul 29, 2021:

Hi @jpkrohling, thanks for your detailed review and comments on the design doc. The v3 design is what we have implemented currently. If you're good with the design, please review the code; otherwise, we're happy to walk you through any grey areas or questions you may have.

@jpkrohling (Member):

I'm mostly happy with the design; there's just one point to clarify or adapt, related to how changes are handled by the allocator. I think the ConfigMap might not be necessary, nor a restart of the allocator process.

@@ -70,15 +74,22 @@ func New(opts ...Option) Config {
o.collectorImage = fmt.Sprintf("otel/opentelemetry-collector:%s", o.version.OpenTelemetryCollector)
}

if len(o.targetAllocatorImage) == 0 {
// will be replaced with target allocator image once approved and available
o.targetAllocatorImage = "raul9595/otel-loadbalancer:0.0.1"
Member:

This needs to be updated to an appropriate image.

Contributor:

Make sure to fetch the default TA image version from the versions.txt file, similar to how we fetch the latest version of the otel-collector image. This helps us easily manage version upgrades of the TA.

- apiGroups:
    - rbac.authorization.k8s.io
  resources:
    - rolebindings
Member:

Why do you need to create role bindings?

Member (Author):

The main reason to create a rolebinding is to allow the target allocator pod to view the collector pods in the same namespace. I am basically automating the command below. If I don't create an appropriate rolebinding, the pod will not be able to query for the collector pods. The logic for this part is in Part 2.

kubectl create clusterrolebinding default-view --clusterrole=view --serviceaccount={namespace}:default

Reference link
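For illustration, a namespaced equivalent of that command, scoped to just viewing pods; the namespace and names are hypothetical, and the service account is assumed to be default, as in the command above:

```yaml
# Hypothetical namespaced Role/RoleBinding equivalent of the
# clusterrolebinding command above, limited to viewing pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: collector-pod-viewer    # assumed name
  namespace: observability      # assumed namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: collector-pod-viewer
  namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: collector-pod-viewer
subjects:
  - kind: ServiceAccount
    name: default               # the TA pod's service account, per the command above
    namespace: observability
```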

Member:

The operator should definitely not be able to create cluster roles/bindings by default, as it would require the operator to have cluster-admin permissions. There should be a separate role/binding YAML file that users can opt in to use, giving the operator the extra permissions. At runtime, as part of the "auto detect" package, the operator can then detect if it has been given this extra permission and self-adjust, enabling/disabling this feature.

Member (Author):

OK, this makes sense. I would remove that part then and instead add a check just to ensure the appropriate permissions are granted. This would remove the role/rolebinding files.

Member (Author):

Instead of putting it in auto-detect (we do not have access to the request namespace while initializing auto-detect), would it be fine to create a helper method in targetallocator/reconcile that uses the k8s client to check the service account for the appropriate permissions and returns an error if they are not available?

Another option: when rest.InClusterConfig() is called inside the logic (part of PR 2), we can just return an error.
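For illustration, the kind of probe such a helper could issue via the k8s client, expressed as the API object it would create; the namespace and service account are hypothetical:

```yaml
# Hypothetical SubjectAccessReview: "can the TA's service account
# list pods in its namespace?" The API server fills in status.allowed;
# the helper would return an error when it is false.
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:observability:default   # assumed TA service account
  resourceAttributes:
    namespace: observability                          # assumed namespace
    verb: list
    resource: pods
```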

Member:

Got it. The operator then should indeed provision the role/service account/binding. Are you sure it requires cluster roles, though?

Member (Author):

Actually, roles and role bindings are enough, but for the operator to create a role with the appropriate permissions, the operator needs to have those permissions itself. Thus, the operator needs a cluster role defined, since the collector namespaces are dynamic.

One solution is to just let the user create the role and rolebindings, as defined in the kuttl tests (latest update).

Resolved (outdated) review threads:
- config/rbac/role.yaml
- controllers/targetallocator_controller.go (3 threads)
- pkg/targetallocator/reconcile/role.go
- pkg/targetallocator/reconcile/role_test.go
- pkg/targetallocator/reconcile/rolebinding.go
volumes:
- configMap:
    items:
      - key: targetallocator.yaml
Member:

Could you please add an assertion with the contents of the configmap?

Member (Author):

One reason I have not included the ConfigMap contents is that some of the content is created dynamically (from the OTel collector namespace). This is still not supported in KUTTL, and I am not sure how to proceed with this.

@rsvarma95 (Member, Author) commented Aug 3, 2021:

Also, the contents of the ConfigMap are tested in the Go unit tests. Would it be sufficient to just check that it is created (as is done in the latest update)?
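For reference, an existence-only kuttl assert just names the object; kuttl compares only the fields listed, so omitting data sidesteps the dynamically generated content. The ConfigMap name here is a hypothetical one derived from the CR:

```yaml
# Hypothetical kuttl assert: passes once a ConfigMap with this name
# exists, without comparing its (partly dynamic) data.
apiVersion: v1
kind: ConfigMap
metadata:
  name: stateful-targetallocator   # assumed name
```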

@@ -0,0 +1,8 @@
apiVersion: apps/v1
kind: Deployment
Member:

Shouldn't it be a statefulset?

Member (Author):

The OpenTelemetryCollector's target allocator pod is spawned when the CR mode is StatefulSet, but the target allocator's own kind is a Deployment. This KUTTL test covers the target allocator Deployment alone.
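To make the mode relationship concrete, a sketch of what a combined assert could check, with one StatefulSet for the collector and one Deployment for the allocator; both names are hypothetical:

```yaml
# Hypothetical kuttl assert for an OpenTelemetryCollector CR in
# statefulset mode: the collector itself is a StatefulSet...
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-collector         # assumed collector name
---
# ...while the target allocator it spawns is a plain Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-targetallocator   # assumed allocator name
```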

Member:

Could you add extra assert files, outlining what's expected in terms of deployments/statefulsets/configmaps?

Member (Author):

Have added more assert cases. Please let me know if these look fine.

@rsvarma95 requested a review from jpkrohling on August 9, 2021 11:27
@jpkrohling previously approved these changes Aug 9, 2021
@jpkrohling (Member):

The e2e tests probably failed due to intermittent issues with cert-manager; they have been restarted.

@VineethReddy02 previously approved these changes Aug 10, 2021
@VineethReddy02 (Contributor) left a comment:

LGTM!

@Raul9595 before merging this PR make sure to update the default TA load balancer image. :)


Resolved (outdated) review threads:
- pkg/collector/reconcile/configmap.go
- pkg/collector/reconcile/service.go (2 threads)
- pkg/collector/reconcile/suite_test.go
mergify bot dismissed stale reviews from jpkrohling and VineethReddy02 on August 11, 2021 03:22

Pull request has been modified.

Resolved (outdated) review thread: internal/config/main.go
@Aneurysm9 previously approved these changes Aug 11, 2021
mergify bot dismissed Aneurysm9's stale review on August 11, 2021 18:54

Pull request has been modified.

@jpkrohling previously approved these changes Aug 12, 2021
@jpkrohling (Member) left a comment:

LGTM. There are a couple of minor changes that would be needed, but they can be addressed in a follow-up PR. If you prefer to do that, please create an issue tracking those changes and I'll merge this.

I'll also give a few hours for other folks to review (@VineethReddy02?), in case they are interested. If I don't hear any objections, I'll merge this today.

Resolved (outdated) review threads:
- api/v1alpha1/opentelemetrycollector_types.go
- pkg/collector/reconcile/configmap.go
@@ -55,6 +116,54 @@ func TestExpectedDeployments(t *testing.T) {
assert.Equal(t, int32(2), *actual.Spec.Replicas)
})

t.Run("should not update target allocator deployment when the config is updated", func(t *testing.T) {
Member:

Could you please also change, say, the number of replicas? This is to demonstrate that you thought about the case where users are changing the deployment manually and you are OK with that.

mergify bot dismissed jpkrohling's stale review on August 12, 2021 12:36

Pull request has been modified.

@rsvarma95 requested a review from jpkrohling on August 12, 2021 12:45
@jpkrohling enabled auto-merge (squash) on August 13, 2021 15:08
@jpkrohling (Member):

I'm really excited about this change! Thank you for your persistence and patience, @Raul9595!

This PR is set to auto-merge on green. I restarted the failing tests and will check it again later today.

@jpkrohling merged commit f7b7b93 into open-telemetry:main on Aug 13, 2021
shree007 pushed a commit to shree007/opentelemetry-operator that referenced this pull request Dec 12, 2021
…open-telemetry#351)

Co-authored-by: Alexis Perez <perzmen@amazon.com>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…open-telemetry#351)

Co-authored-by: Alexis Perez <perzmen@amazon.com>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>