Acquire access to GPU cluster for testing #93

matthewfeickert · 2018-03-01T00:30:16Z

Access to GPU clusters are needed for performing benchmarks with GPU acceleration. While access to Amir Farbin's personal GPU cluster is available it would also be good to have something with wider support. In the 2018-02-28 CERN IML meeting Maxime Reis advertised that CERN's TechLab has GPU clusters available with support. We can follow up on this and see if we can use it for testing.

matthewfeickert · 2018-03-01T01:00:43Z

We should be able to request specific GPU architectures for the benchmarking through the CERN TechLab TWiki. So once we have basic GPU functionality and tests then we can book a week and do testing.

matthewfeickert · 2018-03-01T10:35:14Z

Maxime Reis has followed up with me with regards to how much time we can get for benchmarking:

Most of the nodes with GPUs are shared, and for benchmarking this obviously won't do. Exclusive access can be arranged for short periods of time, and I'd say a day to a week should be manageable. More than that, we'll have to discuss, and it also depends on which GPU you'd like to benchmark.

So hopefully we can do some testing on other GPU machines and then do a full benchmarking run on the TechLab cluster.

kratsg · 2018-04-16T04:53:42Z

@ivukotic might be able to help give us access to some GPU clusters?

matthewfeickert · 2018-04-24T02:27:16Z

From the ATLAS Machine Learning Forum mailing list:

IBM has provided a small GPU cluster to CERN OpenLab for ML studies by the different experiments. They are planning to host a training workshop (one full day between May 28 and June 8, excluding June 7) to help people understand the cluster and how to use it. ATLAS is not the main customer here, but we can have a number of slots for ATLAS people.

One of the big benefits of IBM hardware is their NVLink, which provides much higher bandwidth between CPU/GPU and more critically GPU/GPU. Intel has recently improved CPU/GPU bandwidth, but not touched GPU/GPU. As such, IBM seems keen to demonstrate the potential of increased GPU/GPU bandwidth, which would require large-scale networks/etc which exploit multiple GPUs at once.

If you think you might have now, or will have soon an ML application with large enough network which will gain from efficient multi GPU training, then this training workshop is probably of interest to you.

I will write up an application and submit us.

kratsg · 2018-04-24T02:35:14Z

Talk to Ilija, the Pacific Research Cluster has a whole bunch of GPUs that we can use for continuous integration as well.

On Mon, Apr 23, 2018 at 21:27 Matthew Feickert ***@***.***> wrote: From the ATLAS Machine Learning Forum mailing list: IBM has provided a small GPU cluster to CERN OpenLab for ML studies by the different experiments. They are planning to host a training workshop (one full day between May 28 and June 8, excluding June 7) to help people understand the cluster and how to use it. ATLAS is not the main customer here, but we can have a number of slots for ATLAS people. One of the big benefits of IBM hardware is their NVLink, which provides much higher bandwidth between CPU/GPU and more critically GPU/GPU. Intel has recently improved CPU/GPU bandwidth, but not touched GPU/GPU. As such, IBM seems keen to demonstrate the potential of increased GPU/GPU bandwidth, which would require large-scale networks/etc which exploit multiple GPUs at once. If you think you might have now, or will have soon an ML application with large enough network which will gain from efficient multi GPU training, then this training workshop is probably of interest to you. I will write up an application and submit us. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/diana-hep/pyhf/issues/93#issuecomment-383782988>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAuei7MTbjT0F8V0UafgUcdPuwhpssoaks5tro2EgaJpZM4SXkyx> .

-- Giordon Stark

matthewfeickert · 2018-09-28T01:17:04Z

I have confirmed with the SMU HPC Admins that I can use M2's (SMU's Tier3) GPUs for testing and development. So we'll have access to up to 36 nodes with NVIDIA GPUs. 👍

matthewfeickert · 2018-10-30T00:05:54Z

At the moment the environment at SMU that the HPC admins were able to setup is only fully supporting an optimized TensorFlow GPU. So I'll start there and then move to PyTorch.

matthewfeickert · 2019-09-04T20:33:22Z

I'm getting access to NCSA's Hardware-Accelerated Learning (HAL) cluster, which should be a perfect environment to do hardware acceleration studies at scale (and probably make the BlueWaters team happier the having me mess around there). Thanks to @msneubauer for setting this in motion.

matthewfeickert · 2020-01-20T06:35:55Z

2020 update: There are two GPU enabled machines that I can use for testing at the moment:

My laptop (NVIDIA GeForce GTX 1650 Max-Q 4GB)
The Neubauer Group firmware and deep learning machine (@markusatkinson is the effective sys admin for this) (NVIDIA GeForce RTX 2080 Ti 11GB — memory can be expanded)

For dev work I will be using the GPUs on my laptop, but I will use our dedicated machine for all benchmarks.

kratsg · 2020-01-24T22:06:45Z

Can we talk with the UChicago folks (./cc @fizisist, @LincolnBryant, @robrwg, @ivukotic) as well for perhaps access to some machines for CI purposes? Or will the Neubauer group allow the DL machine to be used for that?

ivukotic · 2020-01-24T22:09:48Z

Hi Giordon, That’s easy: https://www.atlas-ml.org/ Loggin using your institution, I will approve your account. Once you get mail with approval, you can create a private JuputerLab instance with a GPU attached to it. Cheers, Ilija

matthewfeickert · 2020-01-24T23:00:48Z

Or will the Neubauer group allow the DL machine to be used for that?

I think that the DL machine we have is a great candidate for dedicated benchmarking studies, but I'm not sure if we can guarantee that the GPUs we have in there can be reserved for CI. The primary purpose of this machine is firmware development and testing with FPGAs and then deep learning studies with the GPUs, which gets first priority.

you can create a private JuputerLab instance with a GPU attached to it.

@ivukotic So do I understand you correctly that we can have that GPU indefinitely for hardware acceleration tests with our CI? If so, that's fantastic. I just wan't aware that this was an option.

ivukotic · 2020-01-24T23:07:30Z

Hi Matthew, You can’t get it indefinitely. But you can do reasonable scale studies. Cheers, Ilija

matthewfeickert · 2020-01-24T23:12:23Z

You can’t get it indefinitely. But you can do reasonable scale studies.

Right, okay this make more sense. :) @kratsg's question was about CI, but this still is good as it will give multiple sites to do hardware acceleration tests. Since it doesn't say on the public view of the ATLAS ML Platform but can you give us information on the GPUs that you have available so that we can include that in the studies?

ivukotic · 2020-01-24T23:14:26Z

There are 24 x 2080Ti , 4 x V100 and 2x k20c .

fizisist · 2020-01-25T00:47:48Z

Just don’t use all the resources ;-)

matthewfeickert · 2020-07-11T07:37:59Z

Closing as this has been solved given local machines that the pyhf dev team has access to (in addition to the ATLAS ML Platform).

matthewfeickert self-assigned this Mar 1, 2018

matthewfeickert added the research experimental stuff label May 13, 2018

matthewfeickert mentioned this issue Oct 30, 2018

Benchmark GPU backends #348

Open

2 tasks

matthewfeickert mentioned this issue Sep 10, 2019

pyhf 2019 into 2020 Roadmap #561

Open

41 tasks

matthewfeickert closed this as completed Jul 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Acquire access to GPU cluster for testing #93

Acquire access to GPU cluster for testing #93

matthewfeickert commented Mar 1, 2018 •

edited

Loading

matthewfeickert commented Mar 1, 2018

matthewfeickert commented Mar 1, 2018 •

edited

Loading

kratsg commented Apr 16, 2018

matthewfeickert commented Apr 24, 2018

kratsg commented Apr 24, 2018 via email

matthewfeickert commented Sep 28, 2018

matthewfeickert commented Oct 30, 2018

matthewfeickert commented Sep 4, 2019

matthewfeickert commented Jan 20, 2020 •

edited

Loading

kratsg commented Jan 24, 2020

ivukotic commented Jan 24, 2020 via email •

edited by matthewfeickert

Loading

matthewfeickert commented Jan 24, 2020 •

edited

Loading

ivukotic commented Jan 24, 2020 via email •

edited by matthewfeickert

Loading

matthewfeickert commented Jan 24, 2020 •

edited

Loading

ivukotic commented Jan 24, 2020 via email •

edited by matthewfeickert

Loading

fizisist commented Jan 25, 2020 via email •

edited by matthewfeickert

Loading

matthewfeickert commented Jul 11, 2020

Acquire access to GPU cluster for testing #93

Acquire access to GPU cluster for testing #93

Comments

matthewfeickert commented Mar 1, 2018 • edited Loading

matthewfeickert commented Mar 1, 2018

matthewfeickert commented Mar 1, 2018 • edited Loading

kratsg commented Apr 16, 2018

matthewfeickert commented Apr 24, 2018

kratsg commented Apr 24, 2018 via email

matthewfeickert commented Sep 28, 2018

matthewfeickert commented Oct 30, 2018

matthewfeickert commented Sep 4, 2019

matthewfeickert commented Jan 20, 2020 • edited Loading

kratsg commented Jan 24, 2020

ivukotic commented Jan 24, 2020 via email • edited by matthewfeickert Loading

matthewfeickert commented Jan 24, 2020 • edited Loading

ivukotic commented Jan 24, 2020 via email • edited by matthewfeickert Loading

matthewfeickert commented Jan 24, 2020 • edited Loading

ivukotic commented Jan 24, 2020 via email • edited by matthewfeickert Loading

fizisist commented Jan 25, 2020 via email • edited by matthewfeickert Loading

matthewfeickert commented Jul 11, 2020

matthewfeickert commented Mar 1, 2018 •

edited

Loading

matthewfeickert commented Mar 1, 2018 •

edited

Loading

matthewfeickert commented Jan 20, 2020 •

edited

Loading

ivukotic commented Jan 24, 2020 via email •

edited by matthewfeickert

Loading

matthewfeickert commented Jan 24, 2020 •

edited

Loading

ivukotic commented Jan 24, 2020 via email •

edited by matthewfeickert

Loading

matthewfeickert commented Jan 24, 2020 •

edited

Loading

ivukotic commented Jan 24, 2020 via email •

edited by matthewfeickert

Loading

fizisist commented Jan 25, 2020 via email •

edited by matthewfeickert

Loading