Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split splines across multiple GPUs on a given node #1101

Merged
merged 40 commits into from
Dec 7, 2018

Conversation

atillack
Copy link

@atillack atillack commented Oct 9, 2018

Here is my work-in-progress version of the split splines code for GPU with promising performance numbers:

Run GPUs Walkers/GPU Original Code New Code w/ Split Splines Split Spline Cost
NiO S32 on SummitDev 2 128 129.0 148.0 1.15
-- 4 128 129.0 236.5 1.83
NiO S32 on Summit 3 128 91.3 148.1 1.62
-- 3 250 145.1 223.2 1.54
-- 6 128 92.5 223.2 2.41
-- 6 250 147.3 403.7 2.74
NiO S64 on Summit 3 45 230.5 334.7 1.45
-- 6 45 230.5 463.8 2.01

Currently, only the spline type used for our NiO example is fully implemented. I will of course implement the missing bits once we finalized the code here.

In order to use the code and split the spline data memory across multiple GPUs the following needs to be done:

  • GPU MPS needs to be used,
  • the GPUs need to be visible in each MPI rank,
  • the options gpu="yes" and gpusharing="yes" need to be set in the <determinantset type="einspline"> section in the <wavefunction> definition block

I did test the original code based on an older development tree on both SummitDev and Summit with 2,4 and 3,6 GPUs, respectively (this is the data above). After updating to the current development tree I could only run on two GPUs on SummitDev to confirm performance levels are still the same - with four GPUs I get an MPI error (coll:ibm:module: datatype_prepare_recvv overflowed integer range triggered by comm->gatherv_in_place from gatherv in the new spline adaptor code). I get the exact same error with vanilla 3.5.0 on SummitDev. Since I can't run on Summit right now I can't verify if this is a regression in 3.5.0 (3.4.0 did not have these issues on Summit) or just the particular MPI version on SummitDev.

To finish:

  • Output error and exit when split splines and UVM are used together for now
  • PhaseFactors.cu is already finished but needs testing of the QMC_Complex code path
  • Implement double complex evaluation functions
  • Implement float and double wave function evaluation functions
  • Fill in the missing split splines memory allocations/break-ups in einspline/multi_bspline_create_cuda.cu
  • Implement the MPI host distribution to follow the split splines
  • Add documentation in manual and test case

@ghost ghost assigned atillack Oct 9, 2018
@ghost ghost added the in progress label Oct 9, 2018
@qmc-robot
Copy link

Can one of the maintainers verify this patch?

1 similar comment
@qmc-robot
Copy link

Can one of the maintainers verify this patch?

@prckent prckent changed the title Split splines across multiple GPUs on a given node [WIP] Split splines across multiple GPUs on a given node Oct 9, 2018
@prckent
Copy link
Contributor

prckent commented Oct 9, 2018

OK to test

@atillack
Copy link
Author

atillack commented Oct 9, 2018

Rhea (gpu) should compile and run fine now.

@ye-luo
Copy link
Contributor

ye-luo commented Oct 9, 2018

In the v3.5, I changed the way of the spline tables collection using in-place gather in order to require no extra scratch memory. MPI is notorious for using 32 bit integer range and you might have an inferior MPI implementation. We need to find a workaround hopefully this is not a bug.
Could you go to src/spline/einspline_util.hpp in the function gatherv could you printout
buffer->coefs_size and ncol on the crashed run?
There is an internal control switching between two algorithms. Try reduce (1<<20) and see if that works around the issue.

@ye-luo
Copy link
Contributor

ye-luo commented Oct 9, 2018

I'm trying to understand the algorithm. But directly reading the code may not be the best way.
So throw some questions:

  1. Since you are dispatching spline evaluation to remote devices. The result vector is allocated as unified memory and relys on CUDA to migrate the data back to the device for the determinant calculation. Am I right?
  2. When you list 6 GPUs, how many MPI ranks do you have on the node? 6 or 1?
  3. Assuming the previous answer is 6. The spline table on the host is not distributed among the MPI ranks within the group. Only the table on GPU is distributed when the data is copied to GPU. Am I right?
  4. How do you share the GPU memory address across different MPIs? or CUDA handles it for you?
  5. How much faster is this comparing to the UVM I implemented? One MPI with one GPU, the part of the spline which doesn't fit stays on the host.

@atillack
Copy link
Author

atillack commented Oct 9, 2018

@ye-luo:

To your earlier question:

On SummitDev with 4 MPI ranks: buffer->coefs_size=446,772,240 ; ncol = 816 (Working on reducing 1<<20 right now).

Algorithm questions:

  1. Yes.
  2. 6 MPI ranks. Each rank sees all six GPUs.
  3. Yes, you are correct.
  4. Cuda IPC memory handles (cudaIpcGetMemHandle and cudaIpcOpenMemHandle)
  5. Good question. Let me see if I can do a fair comparison by triggering your UVM code to put half the spline table on GPU and the other on the host.

@atillack
Copy link
Author

atillack commented Oct 9, 2018

@ye-luo Also, I left your UVM code in place and it should still work even with the split splines...

@ye-luo
Copy link
Contributor

ye-luo commented Oct 9, 2018

@atillack Great work.
I also need some effort on the spline builder to make the distribution on the host and it should connect to the GPU part seamlessly. I think eventually it will be we first split the table over MPI and then split between device and host using UVM if the table doesn't fit the device.

@atillack
Copy link
Author

atillack commented Oct 9, 2018

@ye-luo Agreed.

@atillack
Copy link
Author

atillack commented Oct 10, 2018

@ye-luo You are spot-on with your comment on the spline table MPI problem. I went to 1<<16 (to have a lower bound) and that runs through fine with 4 MPI ranks on SummitDev.

@atillack
Copy link
Author

@ye-luo 1<<19 of course works too... (446,772,240 / 816 > (1<<19))

@atillack
Copy link
Author

@ye-luo To your question of how much faster this implementation is compared to your UVM code: I set the available spline memory to 1738 MB (half of the spline table) so half ends up on the GPU and the other half on the CPU for the 128 atom NiO system.

When I do this with split splines using two MPI ranks (this implementation) on SummitDev the DMC block (5 block of 5 steps) takes 149 seconds (within typical fluctuation of the number above). Your UVM code also using two MPI ranks each using the same amount of memory for the spline table on the GPU (and the rest on the host) takes a total of 364 seconds. These runs were on the same node run after another.

…. 2 MPI ranks on the same GPU, rank 0 initializes half the memory, rank 1 the other half)
@atillack
Copy link
Author

Almost finished.

@atillack
Copy link
Author

@prckent I am not sure how to properly set up a test case. In theory, only gpusharing="yes" needs to be set in the section of a test case with enough orbitals (like NiO) but Cuda MPS also needs to be present which is not within the test framework's control (I think).

@ye-luo
Copy link
Contributor

ye-luo commented Nov 30, 2018

Is there a way in the code to test if MPS is enabled or not? If user turns on the flag but the MPS is not ready, the code can turn off this feature or abort the code.
You can grab an existing diamondC_2x1x1 test and enable sharing as a test.

@prckent
Copy link
Contributor

prckent commented Nov 30, 2018

Keep it simple. Lets worry about MPS settings when we have proven that they are needed. For oxygen, where we run the nightlies, I don't think we need be concerned. Later PRs can improve the situation as needed.

@atillack
Copy link
Author

atillack commented Dec 2, 2018

@prckent @ye-luo I think Ye's idea is a good one and it was simple enough to implement (see next commit). If Cuda MPS is off, a warning is generated and the split splines functionality disabled. This way we can add gpusharing="yes" to a test case and nothing breaks if there is no CUDA MPS.

Andreas Tillack added 8 commits December 2, 2018 16:35
… Summit). This allows a warning in case CUDA MPS is not available and the user tries to use split splines.
Conflicts:
	src/einspline/multi_bspline_cuda_d_impl.h
	src/einspline/multi_bspline_cuda_s_impl.h
	src/einspline/multi_bspline_cuda_z_impl.h
… as this is safer and is also necessary on Summit.
…ed once per node as contention may skew results.
@prckent
Copy link
Contributor

prckent commented Dec 6, 2018

Currently checking this on oxygen. If it works, I think we should merge. The tests and docs can be updated later. Enough tooth pulling via this PR.

@ghost ghost assigned prckent Dec 6, 2018
@prckent prckent changed the title [WIP] Split splines across multiple GPUs on a given node Split splines across multiple GPUs on a given node Dec 6, 2018
Copy link
Contributor

@prckent prckent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will wait for comments on Friday before merging.

@prckent
Copy link
Contributor

prckent commented Dec 7, 2018

Note: Unable to get this to work with MPS on oxygen. Config instructions at https://docs.nvidia.com/deploy/mps/index.html generate a segv. Probably I don't have the correct/magical combination of cuda aware openmpi and environment settings. Without MPS running the code is appropriately benign.

Can someone confirm that this is working? I currently don't have access to a fully updated multiple GPU machine.

@atillack
Copy link
Author

atillack commented Dec 7, 2018

@prckent I can run on SummitDev and Summit. Documentation is updated. Can you send me the output which segfaults?

@prckent
Copy link
Contributor

prckent commented Dec 7, 2018

The LaTeX is slightly broken. I will push a fix and improvements, then merge.

MPS investigations on oxygen will not occur for a while. Having Summit work is the key thing due to upcoming INCITE usage. (Hence #1054 is an important problem.)

@prckent prckent merged commit a22d633 into QMCPACK:develop Dec 7, 2018
@ghost ghost removed the in progress label Dec 7, 2018
@atillack atillack deleted the split_splines_gpu branch December 12, 2018 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants