Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neuron observability #121

Merged
merged 30 commits into from
Mar 26, 2024
Merged

Neuron observability #121

merged 30 commits into from
Mar 26, 2024

Conversation

freschri
Copy link
Contributor

@freschri freschri commented Oct 10, 2023

Single New EKS Cluster Open Source Observability Accelerator for Neuron-based clusters
depends on aws-observability/aws-observability-accelerator#32
@elamaran11 please check if we want to keep/remove cdk.json from gitignore (we need to be able to add entries that are validated in compilation otherwise github compilation fails on push)

Copy link
Contributor

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freschri This is AWSome work. We need some strategic and tactical changes as per the comments below to align this well. Love the neuron pattern.

@elamaran11
Copy link
Contributor

@freschri Any updates to this PR?

@freschri
Copy link
Contributor Author

@freschri Any updates to this PR?

I addressed all points apart from the addons. One of them depends on another PR in Blueprints (Neuron Device Plugin Addon). My suggestion was on another channel to merge this PR as is, if no other concern, and raise 2 issues to port the addons to Blueprints. That can be done as an exercise by someone with limited experience on Blueprints/Patterns/Observability. Please let me know

@elamaran11
Copy link
Contributor

@freschri Any updates to this PR?

I addressed all points apart from the addons. One of them depends on another PR in Blueprints (Neuron Device Plugin Addon). My suggestion was on another channel to merge this PR as is, if no other concern, and raise 2 issues to port the addons to Blueprints. That can be done as an exercise by someone with limited experience on Blueprints/Patterns/Observability. Please let me know

Thats a good opportunity mentor, if you have some one in your region. If i have someone, i will let you know. But this is not a burning one, please take it slow.

@ariveroi ariveroi mentioned this pull request Feb 16, 2024
@ariveroi
Copy link

New pr against inf1 branch with new quickstart version: #153

If we merge #153 PR to inf1 branch we will be able to merge this PR

Copy link
Contributor

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freschri @ariveroi Some tactical feedback, overall approach is great.

lib/common/resources/otel-collector-config-new.yml Outdated Show resolved Hide resolved
lib/common/resources/neuron/pytorch-inference-resnet50.yml Outdated Show resolved Hide resolved
lib/common/resources/neuron/pytorch-inference-resnet50.yml Outdated Show resolved Hide resolved
.gitignore Outdated Show resolved Hide resolved
Copy link
Contributor

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freschri Couple of minor ones. Once you fix these, i can run e2e.

cdk.json Outdated Show resolved Hide resolved
@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-opensource-observability deploy

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-opensource-observability deploy

Copy link
Contributor

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the CDK version.

        "aws-cdk-lib": "2.132.0",
        "aws-cdk": "2.132.0"

package.json Outdated Show resolved Hide resolved
@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-inferentia-opensource-observability deploy

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-opensource-observability deploy

Copy link
Contributor

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more feedback

cdk.json Outdated Show resolved Hide resolved
freschri and others added 2 commits March 14, 2024 18:05
removed neuron reference in kustom as requested
# Conflicts:
#	lib/single-new-eks-fargate-opensource-observability-pattern/index.ts
#	package.json
@freschri
Copy link
Contributor Author

@elamaran11 updated blueprints version by merging main into here. please trigger e2e thanks

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-inferentia-opensource-observability deploy

@freschri
Copy link
Contributor Author

@elamaran11 e2e passes. shall we merge?

@elamaran11
Copy link
Contributor

@elamaran11 e2e passes. shall we merge?

Nope lets wait, I might to crash e2e and do few other checks.

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-inferentia-opensource-observability destroy

@elamaran11
Copy link
Contributor

The E2E Failure is related to subnet ENI which is very interim. It worked on retry,

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-opensource-observability deploy

@elamaran11
Copy link
Contributor

Running E2E on OSS Pattern.

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-opensource-observability destroy

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-inferentia-opensource-observability deploy

@elamaran11
Copy link
Contributor

/do-e2e-test single-new-eks-inferentia-opensource-observability destroy

Copy link
Contributor

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elamaran11 elamaran11 merged commit 0ba7afb into main Mar 26, 2024
3 checks passed
@elamaran11 elamaran11 deleted the inferentia-inf1-observability branch March 26, 2024 00:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Single New EKS Cluster Open Source Observability Accelerator for Neuron-based clusters
3 participants