-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Neuron Device Plugin Addon #777
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngjeong46 great work, I added a few tactical comments.
With respect to the usage of the manifests, can you check if there is any approach to test whether there is a new version of Neuron and how can we incorporate it for continuous maintenance, including patch releases, version upgrades, etc.
import { ClusterAddOn, ClusterInfo } from "../../spi"; | ||
import { loadYaml, readYamlDocument } from "../../utils/yaml-utils"; | ||
|
||
export class NeuronPluginAddOn implements ClusterAddOn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No options to deploy? Not even namespace? It is fine if ns should be kube-system, just want to check if anything is reasonable to expose for configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On this one, it is straight forward. There may be some optional scheduler which I just saw, that I will do with options in a fast follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind creating an issue and assigning to you or Riccardo if you want to do a fast follow-up for options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do it in this PR actually, testing it right now.
@shapirov103 @youngjeong46 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youngjeong46 Nice work, thankyou for doing this. I have some comments on the PR. Also want mkdocs
and Addon index update is missing for the new addon.
import { ClusterAddOn, ClusterInfo } from "../../spi"; | ||
import { loadYaml, readYamlDocument } from "../../utils/yaml-utils"; | ||
|
||
export class NeuronPluginAddOn implements ClusterAddOn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind creating an issue and assigning to you or Riccardo if you want to do a fast follow-up for options
@freschri I'm linking it to the following two links (as they are raw files that can be imported straight as part of the addon). For now, unless the Neuron team announces otherwise, they would update the manifests through this location on GitHub. |
@youngjeong46 @freschri Are you folks planning to make progress to this PR? |
Neuron Device Plugin Addon Continuation
@shapirov103 It looks good to me, also just tested locally and can verify it works. |
/do-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
end to end tests failed. A maintainer can provide more details.
Error:
@elamaran11 - the issue seems to be specific to adot, has nothing to do with neuron. |
@shapirov103 I ran the e2e even yesterday on my local, didnt see any issues. I think the PR might not have pulled the latest changes from main we had for OTEL config. I recommend to rerun the E2E after merging from main. Also recommend a local run. Plus we are using ADOT addon consistently in CDK accelerator with no issues. |
@elamaran11 This is with the main merged. |
@youngjeong46 I submitted a fix for it. The defect is very intermittent, race condition, but it is also non-determinstic. the same blueprint may fail one time, will succeed another time. See #931 |
/do-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
end to end tests passed
Issue #, if available:
Description of changes: This PR adds Neuron Device Plugin to be able to leverage Inferentia and Trainium nodes on EKS Blueprints. This includes:
Other related changes include revisions to helper functions to load yaml files and additional unit tests.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.