
Testing in dp mode uses only one of the GPUs #1213

Closed
Ir1d opened this issue Mar 23, 2020 · 13 comments · Fixed by #1260
Labels
bug Something isn't working help wanted Open to be worked on priority: 0 High priority task
Comments

@Ir1d
Contributor

Ir1d commented Mar 23, 2020

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Run a test without training


Code sample

Modified from the conference-seed repo

trainer = Trainer(
    gpus="-1",
    distributed_backend='dp',
)
trainer.test(model)

Expected behavior

Environment

  • pl version: 0.6.0
  • PyTorch Version (e.g., 1.0): 1.2
  • OS (e.g., Linux): Ubuntu
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@Ir1d Ir1d added bug Something isn't working help wanted Open to be worked on labels Mar 23, 2020
@williamFalcon
Contributor

ummm yeah, that's a bug. it should run via dp. @Ir1d want to submit a PR?
@PyTorchLightning/core-contributors any one else experience this?

@Ir1d
Contributor Author

Ir1d commented Mar 23, 2020

@williamFalcon I tried wrapping the model in LightningDataParallel like the training loop does, but it tells me that the LightningDataParallel object doesn't have a test_step. How do I debug this?

@williamFalcon
Contributor

you wouldn’t wrap it yourself ever haha.
the trainer does the wrapping for you.

the trainer needs to be modified to run the test on the correct method when done this way
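The dispatch problem described here can be sketched without any framework code. In this minimal sketch (all names are hypothetical, not Lightning's actual API), `Wrapper` plays the role of `LightningDataParallel`: it holds the real model in `.module` but defines no `test_step` of its own, which is why calling `test_step` on the wrapper fails; the trainer-side fix is to unwrap before dispatching:

```python
class Model:
    """Stand-in for a LightningModule that defines test_step."""
    def test_step(self, batch):
        return {"loss": sum(batch)}

class Wrapper:
    """Stand-in for LightningDataParallel: wraps a module, has no test_step."""
    def __init__(self, module):
        self.module = module

def run_test_step(model, batch):
    # Unwrap before dispatching, mirroring what the trainer must do
    # when the model was wrapped for dp mode.
    target = model.module if isinstance(model, Wrapper) else model
    return target.test_step(batch)

print(run_test_step(Wrapper(Model()), [1, 2, 3]))  # {'loss': 6}
```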

@Ir1d
Contributor Author

Ir1d commented Mar 23, 2020

I was trying to wrap it in evaluate in pytorch_lightning/trainer/evaluation_loop.py. Do you have any idea where to wrap this function?

@Ir1d
Contributor Author

Ir1d commented Mar 23, 2020

Anyway, we've found one possible workaround here:
After defining a torch model, and before passing it into the PL model, wrap it with nn.DataParallel.
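A minimal sketch of that workaround on a toy model (the backbone here is purely illustrative and assumes only `torch` is installed). On a machine with no visible GPUs, `nn.DataParallel` falls back to calling the wrapped module directly, so the snippet also runs on CPU:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real torch model; purely illustrative.
backbone = nn.Linear(8, 2)

# The workaround: wrap the raw torch model in nn.DataParallel *before*
# storing it inside the LightningModule, rather than relying on the
# trainer to wrap it at test time.
parallel_backbone = nn.DataParallel(backbone)

x = torch.randn(4, 8)
out = parallel_backbone(x)
print(tuple(out.shape))  # (4, 2)
```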

@williamFalcon
Contributor

evaluate is private... you're not meant to call it directly.

call .test()

@williamFalcon
Contributor

lightning does the wrapping by itself...

the fact that this doesn't work is a bug.

model = MyLightningModule.load_from_checkpoint(...)
trainer = Trainer(
    gpus="-1",
    distributed_backend='dp',
)
trainer.test(model)

The bug needs to be addressed correctly.

It's weird because we have tests for this... double check that this is really not working for you.

@Borda
Member

Borda commented Mar 23, 2020

evaluate is private... you're not meant to call it directly.
call .test()

so let's rename it to start with _ so that its name makes clear it is private

@Ir1d
Contributor Author

Ir1d commented Mar 23, 2020

I was calling .test and it's not working

@Borda
Member

Borda commented Mar 27, 2020

@neggert could you have a look at this multi-GPU issue?

@Borda Borda reopened this Mar 30, 2020
@Borda Borda added the priority: 0 High priority task label Apr 8, 2020
@Borda Borda added this to the 0.7.3 milestone Apr 8, 2020
@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 24, 2020
@Borda Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 May 12, 2020
@Borda Borda modified the milestones: 0.7.7, 0.8.0 May 26, 2020
@edenlightning
Contributor

@neggert ping :)

@Borda Borda modified the milestones: 0.8.0, 0.8.x Jun 17, 2020
@williamFalcon
Contributor

looking at this with next sprint

@williamFalcon
Contributor

williamFalcon commented Jul 10, 2020

fixed! (0.8.5)
