
[question] libtpu.so already in used by another process. #3214

Closed
MukundVarmaT opened this issue Nov 15, 2021 · 6 comments
Comments

@MukundVarmaT
Hi,
I was recently trying to set up a training experiment with torch and TPUs (v3-8). While the code works as expected, I get a repeated warning message, "libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.", which is really annoying as it appears multiple times after each epoch. I only notice this when training on multiple cores, not on a single core.
It would be really helpful if you could suggest a way to suppress or resolve these warnings. Thanks!

@JackCaoG
Collaborator

Hi @MukundVarmaT. This problem should be solved with the latest tpu-vm-pt-1.10 image (with pt/xla 1.10 preinstalled).

@MukundVarmaT
Author

MukundVarmaT commented Nov 15, 2021

Hi @JackCaoG, I am using the latest tpu-vm-pt-1.10 image but still get these warnings.

@JackCaoG
Collaborator

Sorry @MukundVarmaT, I thought we had fixed the issue with the new image, but it seems we didn't. We will work on a fix for the default image.

In the meantime, can you run

sudo pip3 install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/wheels/libtpu-nightly/libtpu_nightly-0.1.dev20211015-py3-none-any.whl

to bypass this error?

@MukundVarmaT
Author

MukundVarmaT commented Nov 16, 2021

@JackCaoG Yep, that solves the problem.
Hi, just a follow-up (possibly unrelated) question: after the libtpu installation, when running on multiple cores, all xm.xla_device() calls return "xla:0" except for one, which returns "xla:1". Is this expected? Shouldn't they range from "xla:0" to "xla:7"? PS: before the libtpu installation, it used to print different device ids.

@JackCaoG
Collaborator

It is expected. For one of the processes, xla:0 is a CPU device, so that process uses xla:1.

process 1: xla:0 (CPU), xla:1 (TPU)
process 2: xla:0
process 3: xla:0
process 4: xla:0
process 5: xla:0
process 6: xla:0
process 7: xla:0
process 8: xla:0
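A minimal sketch of how to see this per process (assuming torch_xla 1.10-style APIs; xmp.spawn, xm.xla_device, and xm.get_ordinal are standard torch_xla calls, and the 8-core v3-8 setup from this thread):

# Print the per-process XLA device string and the global ordinal on a v3-8.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # xm.xla_device() returns the device assigned to this process (e.g. "xla:0" or "xla:1");
    # xm.get_ordinal() is the global ordinal (0..7), which is what per-core logic should key on.
    device = xm.xla_device()
    print(f"ordinal {xm.get_ordinal()} -> {device}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method="fork")

The takeaway from the listing above is that the device string is local to each process, so using xm.get_ordinal() rather than the device name is the safer way to tell the cores apart.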

@ronghanghu
Collaborator

I once encountered a frequent gRPC error on tpu-vm-pt-1.10. After upgrading to libtpu_nightly-0.1.dev20211015-py3-none-any.whl, the error seems to be gone for me.
