Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[extension/cgroupruntime]: Initial implementation #35472

Merged
merged 25 commits into from
Nov 26, 2024

Conversation

rogercoll
Copy link
Contributor

@rogercoll rogercoll commented Sep 27, 2024

Description:

This PR adds the initial implementation of a new component to dynamically set the values of GOMEMLIMIT and GOMAXPROCS used by the Go runtime. Those values are normally manually aligned with the cgroup resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these environment variables in K8s deployments (e.g Helm templates) in addition to have fine-grained values (e.g. 90% of the resource memory limits).

Link to tracking Issue: #30289

Testing: Unit testing for the component has been added (config and extension start/stop). But ideally, an integration test that actually asserts the runtime modifications should be added as well. The extension relies on "github.com/KimMachineGun/automemlimit/memlimit" and "go.uber.org/automaxprocs/maxprocs" packages for the runtime modifications, but they don't provide a way to mock the "cgroups" file system which is the one they read to get the resource quota limits.

Any suggestion on how to perform this integration tests in the contrib repository? One possibility is to use the https://github.com/containerd/cgroups package to set the quota, but this requires privileged permissions (also in the GHA)

Documentation:

@rogercoll
Copy link
Contributor Author

Manually tested with a modified version of the OpenTelemetry collector that exposes the internal runtime metrics: https://pkg.go.dev/runtime/metrics

Deployed with very limited cgroup resources:

        resources:
          requests:
            cpu: "50m"      # Minimal CPU request
            memory: "64Mi"  # Minimal memory request
          limits:
            cpu: "100m"     # Minimal CPU limit
            memory: "128Mi" # Minimal memory limit

Resulting runtime metrics:

# HELP go_gc_gomemlimit_bytes Go runtime memory limit configured by the user, otherwise math.MaxInt64. This value is set by the GOMEMLIMIT environment variable, and the runtime/debug.SetMemoryLimit function. Sourced from /gc/gomemlimit:bytes
# TYPE go_gc_gomemlimit_bytes gauge
go_gc_gomemlimit_bytes 1.20795955e+08

...
# HELP go_sched_gomaxprocs_threads The current runtime.GOMAXPROCS setting, or the number of operating system threads that can execute user-level Go code simultaneously. Sourced from /sched/gomaxprocs:threads
# TYPE go_sched_gomaxprocs_threads gauge
go_sched_gomaxprocs_threads 1

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 12, 2024
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Oct 27, 2024
@mx-psi mx-psi reopened this Oct 30, 2024
@mx-psi mx-psi removed the Stale label Oct 30, 2024
@rogercoll rogercoll changed the title [PoC] Cgroup runtime extension [extension/cgroupruntime]: Initial implementation Nov 5, 2024
@rogercoll
Copy link
Contributor Author

@mx-psi would you still be able to sponsor this new component?

@mx-psi
Copy link
Member

mx-psi commented Nov 6, 2024

@rogercoll Yes! Happy to review this whenever you let me know it is ready

@rogercoll rogercoll marked this pull request as ready for review November 6, 2024 14:57
@rogercoll rogercoll requested a review from a team as a code owner November 6, 2024 14:57
@rogercoll rogercoll requested a review from songy23 November 6, 2024 14:57
@rogercoll
Copy link
Contributor Author

@mx-psi Great! Moved the PR to ready for review, either for this PR or for a follow-up I think it would be great to have a way to create and run the corresponding integration tests. Thanks!

@songy23 songy23 requested a review from mx-psi November 6, 2024 15:14
@songy23 songy23 removed their request for review November 11, 2024 15:39
@rogercoll
Copy link
Contributor Author

Extension "info" logs:

2024-11-20T09:59:53.206Z        info    extensions/extensions.go:39     Starting extensions...
2024-11-20T09:59:53.206Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "cgroupruntime"}
2024-11-20T09:59:53.257Z        debug   cgroupruntimeextension@v0.112.0/factory.go:46   maxprocs: Updating GOMAXPROCS=[1]: using minimum allowed GOMAXPROCS     {"kind": "extension", "name": "cgroupruntime"}
2024-11-20T09:59:53.257Z        info    cgroupruntimeextension@v0.112.0/extension.go:50 GOMAXPROCS has been set {"kind": "extension", "name": "cgroupruntime", "GOMAXPROCS": 1}
2024-11-20T09:59:53.258Z        info    cgroupruntimeextension@v0.112.0/extension.go:61 GOMEMLIMIT has been set {"kind": "extension", "name": "cgroupruntime", "GOMEMLIMIT": 120795955}
2024-11-20T09:59:53.258Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "cgroupruntime"}


...
2024-11-20T10:00:27.425Z        info    extensions/extensions.go:66     Stopping extensions...
2024-11-20T10:00:27.425Z        debug   cgroupruntimeextension@v0.112.0/factory.go:46   maxprocs: Resetting GOMAXPROCS to [20] {"kind": "extension", "name": "cgroupruntime"}
2024-11-20T10:00:27.425Z        info    service@v0.112.0/service.go:280 Shutdown complete.

@rogercoll rogercoll force-pushed the add_cgroupruntime_extension branch from 8496b92 to 1f0c26f Compare November 22, 2024 09:21
extension/cgroupruntimeextension/README.md Outdated Show resolved Hide resolved
extension/cgroupruntimeextension/README.md Outdated Show resolved Hide resolved
extension/cgroupruntimeextension/config.go Outdated Show resolved Hide resolved
@mx-psi
Copy link
Member

mx-psi commented Nov 26, 2024

I think this can be merged without the integration tests, but some sort of tests should be a requirement for inclusion in the contrib distro. @rogercoll Could you file a separate issue for this so we can track it?

@rogercoll
Copy link
Contributor Author

rogercoll commented Nov 26, 2024

I think this can be merged without the integration tests, but some sort of tests should be a requirement for inclusion in the contrib distro. @rogercoll Could you file a separate issue for this so we can track it?

Makes sense to me, tracking issue #36545

@mx-psi mx-psi merged commit 28f23aa into open-telemetry:main Nov 26, 2024
158 checks passed
@github-actions github-actions bot added this to the next release milestone Nov 26, 2024
@rogercoll rogercoll deleted the add_cgroupruntime_extension branch November 26, 2024 11:45
shivanthzen pushed a commit to shivanthzen/opentelemetry-collector-contrib that referenced this pull request Dec 5, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the initial implementation of a new component to
dynamically set the values of `GOMEMLIMIT` and `GOMAXPROCS` used by the
Go runtime. Those values are normally manually aligned with the cgroup
resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these
environment variables in K8s deployments (e.g Helm
[templates](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169))
in addition to have fine-grained values (e.g. 90% of the resource memory
limits).

**Link to tracking Issue:** <Issue number if applicable>
open-telemetry#30289

**Testing:** <Describe what testing was performed and which tests were
added.> Unit testing for the component has been added (config and
extension start/stop). But ideally, an integration test that actually
asserts the runtime modifications should be added as well. The extension
relies on "github.com/KimMachineGun/automemlimit/memlimit" and
"go.uber.org/automaxprocs/maxprocs" packages for the runtime
modifications, but they don't provide a way to mock the "cgroups" file
system which is the one they read to get the resource quota limits.

- Automemlimit package tests expect to run in a cgroup environment:
https://github.com/KimMachineGun/automemlimit/blob/main/memlimit/cgroups_test.go#L18
- Automaxprocs does not expose the cpu quota retrieval
https://github.com/uber-go/automaxprocs/blob/master/maxprocs/maxprocs.go#L41

Any suggestion on how to perform this integration tests in the contrib
repository? One possibility is to use the
https://github.com/containerd/cgroups package to set the quota, but this
requires privileged permissions (also in the GHA)


**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this pull request Dec 6, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the initial implementation of a new component to
dynamically set the values of `GOMEMLIMIT` and `GOMAXPROCS` used by the
Go runtime. Those values are normally manually aligned with the cgroup
resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these
environment variables in K8s deployments (e.g Helm
[templates](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169))
in addition to have fine-grained values (e.g. 90% of the resource memory
limits).

**Link to tracking Issue:** <Issue number if applicable>
open-telemetry#30289

**Testing:** <Describe what testing was performed and which tests were
added.> Unit testing for the component has been added (config and
extension start/stop). But ideally, an integration test that actually
asserts the runtime modifications should be added as well. The extension
relies on "github.com/KimMachineGun/automemlimit/memlimit" and
"go.uber.org/automaxprocs/maxprocs" packages for the runtime
modifications, but they don't provide a way to mock the "cgroups" file
system which is the one they read to get the resource quota limits.

- Automemlimit package tests expect to run in a cgroup environment:
https://github.com/KimMachineGun/automemlimit/blob/main/memlimit/cgroups_test.go#L18
- Automaxprocs does not expose the cpu quota retrieval
https://github.com/uber-go/automaxprocs/blob/master/maxprocs/maxprocs.go#L41

Any suggestion on how to perform this integration tests in the contrib
repository? One possibility is to use the
https://github.com/containerd/cgroups package to set the quota, but this
requires privileged permissions (also in the GHA)


**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the initial implementation of a new component to
dynamically set the values of `GOMEMLIMIT` and `GOMAXPROCS` used by the
Go runtime. Those values are normally manually aligned with the cgroup
resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these
environment variables in K8s deployments (e.g Helm
[templates](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169))
in addition to have fine-grained values (e.g. 90% of the resource memory
limits).

**Link to tracking Issue:** <Issue number if applicable>
open-telemetry#30289

**Testing:** <Describe what testing was performed and which tests were
added.> Unit testing for the component has been added (config and
extension start/stop). But ideally, an integration test that actually
asserts the runtime modifications should be added as well. The extension
relies on "github.com/KimMachineGun/automemlimit/memlimit" and
"go.uber.org/automaxprocs/maxprocs" packages for the runtime
modifications, but they don't provide a way to mock the "cgroups" file
system which is the one they read to get the resource quota limits.

- Automemlimit package tests expect to run in a cgroup environment:
https://github.com/KimMachineGun/automemlimit/blob/main/memlimit/cgroups_test.go#L18
- Automaxprocs does not expose the cpu quota retrieval
https://github.com/uber-go/automaxprocs/blob/master/maxprocs/maxprocs.go#L41

Any suggestion on how to perform this integration tests in the contrib
repository? One possibility is to use the
https://github.com/containerd/cgroups package to set the quota, but this
requires privileged permissions (also in the GHA)


**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants