
Fix configs to properly use pytorch-lightning==1.6 with GPU #234

Merged: 11 commits into development from fix/sa/configs on Apr 20, 2022

Conversation

@samet-akcay (Contributor) commented Apr 14, 2022

Description

Changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • My code follows the pre-commit style and check guidelines of this project.
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes

@samet-akcay changed the title from "Fix configs" to "Fix configs to properly use pytorch-lightning==1.6" on Apr 14, 2022
@samet-akcay changed the title from "Fix configs to properly use pytorch-lightning==1.6" to "Fix configs to properly use pytorch-lightning==1.6 with GPU" on Apr 14, 2022
@ashwinvaidya17 (Collaborator) left a comment:


Thanks! One minor comment, but the rest looks good.

fast_dev_run: false
-gpus: 1
+gpus: null # Set automatically
Collaborator:


Does it mean that pytorch-lightning will distribute the job across all available GPUs? In that case, we might have to look at the performance.

@samet-akcay (Contributor, Author) commented Apr 19, 2022:


I'm not sure what the best approach is here: setting it to 1, or assigning it automatically?
Any thoughts @ashwinvaidya17, @djdameln?

Collaborator:


I would prefer using only a single GPU by default. I feel users should use distributed training only if they know what they are doing. If we distribute the training, then we will have to look at the learning rate, and experiments might not be reproducible across different numbers of GPUs.

Contributor:


What if no GPUs are available? Will it break or just switch to CPU training?

Contributor (Author):


> What if no GPUs are available? Will it break or just switch to CPU training?

It switches to CPU.

> I would prefer using only a single GPU by default. I feel users should use distributed training only if they know what they are doing. If we distribute the training, then we will have to look at the learning rate, and experiments might not be reproducible across different numbers of GPUs.

As far as I understand, this doesn't enable distributed training, because strategy is still null. It automatically uses a single GPU.

Collaborator:


In that case I am fine with how it is now.
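
For context, here is a minimal sketch of the single-device behavior discussed above, against pytorch-lightning 1.6's Trainer API. The resolve_gpus helper is hypothetical, not anomalib's actual code:

import torch
from pytorch_lightning import Trainer

def resolve_gpus(configured):
    """Resolve a `gpus` config value: keep an explicit setting; otherwise
    use 1 GPU when one is available, and fall back to CPU (None) when not."""
    if configured is not None:
        return configured
    return 1 if torch.cuda.is_available() else None

# strategy=None (the default) means no distributed strategy, so even when
# gpus resolves to 1 this remains single-device training, not DDP.
trainer = Trainer(gpus=resolve_gpus(None), strategy=None)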

@aj2563 commented Apr 19, 2022:

Since this PR is not merged yet, I'm adding a comment here rather than creating an issue. I checked out the fix/sa/configs branch on a fresh install of anomalib and tried to run the following command:

python tools/train.py --model cflow

And got the following error:

Traceback (most recent call last):
  File "tools/train.py", line 28, in <module>
    from anomalib.models import get_model
  File "/data/home/epi/AnomalibTest/lib/python3.8/site-packages/anomalib/models/__init__.py", line 24, in <module>
    from anomalib.models.components import AnomalyModule
  File "/data/home/epi/AnomalibTest/lib/python3.8/site-packages/anomalib/models/components/__init__.py", line 17, in <module>
    from .base import AnomalyModule, DynamicBufferModule
  File "/data/home/epi/AnomalibTest/lib/python3.8/site-packages/anomalib/models/components/base/__init__.py", line 17, in <module>
    from .anomaly_module import AnomalyModule
  File "/data/home/epi/AnomalibTest/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 24, in <module>
    from torchmetrics import F1, MetricCollection
ImportError: cannot import name 'F1' from 'torchmetrics' (/data/home/epi/AnomalibTest/lib/python3.8/site-packages/torchmetrics/__init__.py)

The error is coming from <path>/anomalib/anomalib/models/components/base/anomaly_module.py.
However, if you use from torchmetrics.classification.f_beta import F1Score, it does not throw the error above (with F1 replaced by F1Score elsewhere in the file as well).
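
For illustration, a minimal sketch of that import change, assuming torchmetrics 0.8 (where the old F1 alias is gone); the num_classes value is only an example:

# Old import, which breaks on torchmetrics 0.8:
# from torchmetrics import F1, MetricCollection

# Replacement that works on torchmetrics 0.8:
from torchmetrics import MetricCollection
from torchmetrics.classification.f_beta import F1Score

metrics = MetricCollection([F1Score(num_classes=1)])  # illustrative usage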

Package list (installed using the pip install -e . command):

pytorch-lightning       1.6.1 
torch                   1.11.0   
torchmetrics            0.8.0    
torchtext               0.12.0   
torchvision             0.12.0 

Out of the box it is failing for me; do you see the same issue on your side?

@ashwinvaidya17 (Collaborator) commented:

@aj2563 This is a known issue and is addressed in a different open PR. For now, you can use pip install torchmetrics==0.6.0 to make it work. Here is a Colab link which uses this branch for training: https://colab.research.google.com/drive/18NciqtQwlrUIiuwWn8ld_Tp_5K6kgRPH?usp=sharing The first 7 cells are relevant in your case.
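
As a quick, illustrative check that the pinned workaround is active (version numbers taken from this thread):

import torchmetrics

# The 0.6 series still exports the old F1 alias, so both imports succeed.
assert torchmetrics.__version__.startswith("0.6"), torchmetrics.__version__
from torchmetrics import F1, MetricCollection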

@samet-akcay merged commit 01370ef into development on Apr 20, 2022
@samet-akcay deleted the fix/sa/configs branch on April 20, 2022 at 21:20
Successfully merging this pull request may close these issues:

GPU available but not used