Unable to start elastic PyTorchJob example #2050

Closed
tenzen-y opened this issue Apr 10, 2024 · 5 comments · Fixed by #2064

@tenzen-y
Member

tenzen-y commented Apr 10, 2024

Since the base image was changed, it now seems that the 'etcd' library is missing. I can no longer run my elastic PyTorch jobs that use 'rdzvBackend: etcd'. It works on the previous version, so I'm wondering whether others are experiencing this problem too.

Originally posted by @mathias9395 in #2024 (comment)

@tenzen-y
Member Author

@champon1020 Could you investigate why the new image isn't working for elastic PyTorchJob?
From a quick check, neither the latest nor the older images include etcdctl.

@champon1020
Contributor

OK, I'll investigate it.

@champon1020
Contributor

/assign

@tenzen-y
Member Author

/assign

Thank you!

@tenzen-y changed the title from "Unable to start slastic PyTorchJob example" to "Unable to start elastic PyTorchJob example" on Apr 10, 2024
@champon1020
Contributor

Sorry for the delay in updating progress.

I found that the Python etcd client library is not included in the new image nvcr.io/nvidia/pytorch:24.01-py3, so the PyTorchJob fails while initializing distributed training.
I'll create a pull request that modifies the Dockerfile to install python-etcd. (cc: @tenzen-y)
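
For anyone who wants to confirm the same thing in their own image, a minimal sketch of the check is below. It assumes the python-etcd package exposes a top-level module named `etcd`; the helper name `missing_modules` is hypothetical and not part of the training operator or PyTorch.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: python-etcd installs a top-level module called "etcd".
    missing = missing_modules(["etcd"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("etcd client library is importable")
```

Running this inside the container (for example with `python` in the image used by the job) should show whether the etcd client is available before the job reaches rendezvous.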

2 participants