Add performance profiling tool to enable programmatic/interactive/automatic PT-XLA profile capturing #3498

Merged 5 commits on Apr 14, 2022

Conversation

thisisalbertliang
Contributor

Resolves #3471

Pull Request Summary

This PR adds a performance profiling utility script (capture_profile.py) that captures PT-XLA profiles interactively and/or automatically from the command line. The goal is to give PT-XLA users a simple way to reuse torch_xla.debug.profiler.trace when tracing their training jobs.
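
For context, here is a minimal sketch of the kind of call such a script wraps. It assumes that torch_xla.debug.profiler.trace accepts a service address, a log directory, and a duration_ms keyword, which matches the flags used in this PR. The helper names capture_once and default_trace_fn, and the trace_fn parameter, are hypothetical and are not taken from capture_profile.py; the trace function is injected so the sketch runs without torch_xla installed.

```python
from typing import Callable


def capture_once(trace_fn: Callable[..., None], service_addr: str,
                 logdir: str, duration_ms: int) -> None:
    """Capture a single profile by delegating to the given trace function."""
    # Assumption: trace_fn has the shape trace(service_addr, logdir, duration_ms=...).
    trace_fn(service_addr, logdir, duration_ms=duration_ms)


def default_trace_fn() -> Callable[..., None]:
    """Return torch_xla's trace function (imported lazily; requires torch_xla)."""
    import torch_xla.debug.profiler as xp
    return xp.trace
```

With torch_xla installed and a profiler server listening, a call might look like capture_once(default_trace_fn(), "localhost:9001", "gs://path/to/logdir", 20000).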

How to Run

Instructions for running capture_profile.py are documented in the script itself.

Here are some example run commands:

python3 capture_profile.py --service_addr "localhost:9001" --logdir "gs://path/to/logdir" --duration_ms 20000 --interactive loop
python3 capture_profile.py --service_addr "10.0.0.2:9001" --logdir "gs://path/to/logdir" --duration_ms 30000 --automatic 100 60
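
The flags in these commands could be parsed roughly as follows. This is a sketch under stated assumptions, not the script's actual parser: which flags are required, the metavar names, and the absence of defaults are all guesses. The PR text does not say what the two integers after --automatic mean, so the sketch only stores them without interpreting them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the example commands."""
    parser = argparse.ArgumentParser(
        description="Sketch of the capture_profile.py command-line flags")
    parser.add_argument("--service_addr", required=True,
                        help="profiler server address, e.g. localhost:9001")
    parser.add_argument("--logdir", required=True,
                        help="directory (local or gs://) to write profiles to")
    parser.add_argument("--duration_ms", type=int, required=True,
                        help="length of each captured trace in milliseconds")
    # Assumption: --interactive takes a single mode string such as "loop".
    parser.add_argument("--interactive", metavar="MODE")
    # Assumption: --automatic takes two integers; their meaning is not
    # stated in the PR description, so they are kept as a plain list.
    parser.add_argument("--automatic", nargs=2, type=int, metavar=("A", "B"))
    return parser
```

Because the summary says "interactive and/or automatic", the sketch does not make the two modes mutually exclusive.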

For Googlers

See b/226973507 for more details.

@thisisalbertliang thisisalbertliang added the enhancement New feature or request label Apr 13, 2022
@thisisalbertliang thisisalbertliang self-assigned this Apr 13, 2022
@@ -338,6 +338,7 @@ With PyTorch/XLA we provide a set of performance profiling tooling and auto-metr
* [Official tutorial](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)
* [Colab notebook](https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/pytorch-xla-profiling-colab.ipynb)
* [Sample MNIST training script with profiling](https://github.com/pytorch/xla/blob/master/test/test_profile_mp_mnist.py)
* [Utility script for capturing performance profiles](https://github.com/pytorch/xla/blob/master/scripts/capture_profile.py)
Collaborator

@miladm miladm Apr 13, 2022

Please add a reference on how to load the profiled files on tensorboard.

Collaborator

I suggest we port some of the comments in this README to PyTorch/XLA. I wonder if the README is the best place. Alternatively, we can consider another documentation guide on "Profiling" in PyTorch/XLA.

@JackCaoG wdyt?

Contributor Author

@thisisalbertliang thisisalbertliang Apr 13, 2022

@miladm

I am not sure where to add the tensorboard instructions. It seems like the Official tutorial and Colab notebook links in the README already teach how to use tensorboard anyway?

For now, I have added instructions on how to run tensorboard in the module docstring for capture_profile.py. Let me know if you think there is a better location for these instructions.

"""A utility script for capturing PyTorch/XLA performance profiles interactively and/or automatically

Example run commands:
    $ python3 capture_profile.py --service_addr "localhost:9001" --logdir "gs://path/to/logdir" --duration_ms 20000 --interactive loop
    $ python3 capture_profile.py --service_addr "10.0.0.2:9001" --logdir "gs://path/to/logdir" --duration_ms 30000 --automatic 100 60

Once you have captured & saved the performance profiles, you can view them using Tensorboard.

Example commands to launch the Tensorboard server:
    $ (vm) tensorboard --logdir "gs://path/to/logdir" --port 8001
    $ tensorboard --logdir "/local/path/to/logdir" --port 8001

After that, visit http://localhost:8001/#profile on your machine to view the performance profile in Tensorboard.
"""
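
If TensorBoard is launched from Python rather than typed into a shell, building the command as an argument list keeps the log directory and the port as separate arguments and avoids shell-quoting mistakes. The sketch below is illustrative only: the helper name tensorboard_command is hypothetical, and port 8001 simply mirrors the docstring above.

```python
from typing import List


def tensorboard_command(logdir: str, port: int = 8001) -> List[str]:
    """Return the argv list for launching TensorBoard on the given logdir."""
    # Each flag and each value is its own list element, so no quoting is needed.
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]
```

The resulting list can be passed directly to subprocess.run, for example subprocess.run(tensorboard_command("/local/path/to/logdir"), check=True), assuming the tensorboard executable is on PATH.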

Collaborator

@miladm miladm left a comment

Thanks @thisisalbertliang

LGTM after tests pass. Need linter update.

@thisisalbertliang
Contributor Author

thisisalbertliang commented Apr 13, 2022

@miladm

I just reformatted capture_profile.py using yapf. The linter test is passing now.

Let me know if we should merge this PR now.

@miladm
Collaborator

miladm commented Apr 14, 2022

Great. Thanks @thisisalbertliang.
All tests need to pass before merging, even though your changes are in Python.
I think rerunning the build or pushing an empty commit triggers the process.

@JackCaoG
Collaborator

@miladm found this PR while writing the 1.12 release notes. We should add this to the Troubleshooting doc.

@thisisalbertliang
Contributor Author

@miladm found this PR while writing the 1.12 release notes. We should add this to the Troubleshooting doc.

Thanks @JackCaoG. Just made a quick PR to add capture_profile.py to TROUBLESHOOTING.md. Let me know if this is what you wanted.

PR link: #3670

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Develop a profiler tool that enables "programmable" and "interactive" profile capture
3 participants