
MPI/TCP/FS for NCCL-init #632

Merged · 20 commits · Jun 24, 2024

Conversation

gordicaleksa
Contributor

Instead of mixing NCCL & Open MPI during training, let's transition to using only NCCL. To the best of my knowledge there are no downsides here: they're equivalent, and speed-wise I couldn't observe any difference.

By doing this we scope down our MPI dependence to multi_gpu_config_init.

Context: why are we trying to reduce dependence on Open MPI?
The slurm-wlm package (the easiest, and thus likely the most common, way folks set up their Slurm clusters) dropped PMIx support, which means we can't use Slurm in a multi-node setup together with Open MPI.

Regarding multi_gpu_config_init, we have a few options as far as I can tell (a sketch of how they fit together follows below):

  • Keep using MPI (but as mentioned, Slurm users won't be able to run llm.c in a multi-node setup)
  • Transition to logic where we synchronize via the file system (similar to NCCL only multi-gpu multi-node training without MPI #426)
  • TCP sockets (nodes need each other's IP addresses to communicate either way, so we can use that)

So, after this change:

  • Slurm users can toggle a switch, drop the Open MPI dependency, and run in a multi-node setup
  • mpirun users can use llm.c as-is and it will just work
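
Concretely, all three options do the same job: get the ncclUniqueId created by process 0 onto every other process before NCCL is initialized. A sketch of the dispatch (the via_mpi/via_fs helper names are illustrative; get_nccl_id_via_tcp appears in the actual diff):

ncclUniqueId get_nccl_id(MultiGpuConfig* result, const char* init_method,
                         const char* server_ip, const char* fs_path) {
    if (strcmp(init_method, "mpi") == 0) {
        return get_nccl_id_via_mpi(result);            // MPI_Bcast from rank 0
    } else if (strcmp(init_method, "tcp") == 0) {
        return get_nccl_id_via_tcp(result, server_ip); // rank 0 acts as a TCP server
    } else if (strcmp(init_method, "fs") == 0) {
        return get_nccl_id_via_fs(result, fs_path);    // rank 0 writes a file on a shared fs
    }
    fprintf(stderr, "Unknown NCCL init method: %s\n", init_method);
    exit(EXIT_FAILURE);
}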

@gordicaleksa
Contributor Author

I tried to use NCCL instead of MPI but realized there are some hard dependencies:

  • cudaSetDevice has to be called before ncclCommInitRank, so we can't use NCCL to reduce values inside multi_gpu_get_local_device_idx.
  • Similarly, I wanted to replace MPI_Bcast with NCCL, but since the broadcast needs to happen before ncclCommInitRank, it can't be done.

So, as mentioned above, we're left with MPI, filesystem sync, or TCP. I'll test the latter two tomorrow.
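
For context, the standard NCCL bootstrap (one process per GPU) shows why both of those calls must come before ncclCommInitRank; a sketch using llm.c's cudaCheck/ncclCheck conventions:

// rank 0 creates the id; every rank must have it *before* NCCL is initialized,
// which is exactly the job MPI / TCP / the filesystem does here
ncclUniqueId nccl_id;
if (process_rank == 0) { ncclCheck(ncclGetUniqueId(&nccl_id)); }
MPI_Bcast(&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD);
// the device must also be selected before the communicator is created
cudaCheck(cudaSetDevice(local_device_idx));
ncclComm_t nccl_comm;
ncclCheck(ncclCommInitRank(&nccl_comm, num_processes, nccl_id, process_rank));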

@gordicaleksa changed the title from "Use only NCCL for training (WIP)" to "MPI/TCP/FS for NCCL-init" on Jun 24, 2024
@gordicaleksa
Contributor Author

gordicaleksa commented Jun 24, 2024

The PR is tested and ready @karpathy - comments/feedback welcome!

I personally find the TCP setup the most useful since I don't have a shared filesystem.

The only variable one needs to set is the server's IP address, so it's fairly simple to run.
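
For context, the TCP method boils down to the server (rank 0) creating the id and sending its raw bytes to each client over a socket. A simplified sketch of the client side (error handling and retries omitted; SERVER_PORT is a stand-in for whatever port the code agrees on):

ncclUniqueId nccl_id;
int sock = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in serv_addr = {0};
serv_addr.sin_family = AF_INET;
serv_addr.sin_port = htons(SERVER_PORT);              // stand-in port constant
inet_pton(AF_INET, server_ip, &serv_addr.sin_addr);   // the one IP users must set
connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
recv(sock, &nccl_id, sizeof(nccl_id), MSG_WAITALL);   // server sends the raw id bytes
close(sock);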

@gordicaleksa
Contributor Author

NOTE: the sockets code is currently not cross-platform; that's why the Windows test is failing.

Linux & Mac are supported.

So we either tweak it a bit to make it cross-platform, OR we drop Windows support, OR we remove sockets altogether.
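
For reference, the usual portability shim looks something like this (a sketch; on Windows, WSAStartup must also be called before any socket use):

#ifdef _WIN32
#include <winsock2.h>        // Windows sockets; link against ws2_32
#include <ws2tcpip.h>
#define CLOSE_SOCKET closesocket
#else
#include <sys/socket.h>      // POSIX sockets (Linux & Mac)
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#define CLOSE_SOCKET close
#endif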

printf("Failed to send nccl_id");
exit(EXIT_FAILURE);
}
close(client_sockets[i]);
Contributor

closeCheck?
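
(That is, an error-checked wrapper in the style of llm.c's fopenCheck/fcloseCheck; a hypothetical sketch:)

// hypothetical closeCheck, mirroring llm.c's other *Check helpers
void close_check(int fd, const char *file, int line) {
    if (close(fd) != 0) {
        fprintf(stderr, "Error: failed to close fd %d at %s:%d\n", fd, file, line);
        exit(EXIT_FAILURE);
    }
}
#define closeCheck(fd) close_check(fd, __FILE__, __LINE__)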

// Step 6) accept connections from clients
printf("Waiting for clients to connect...\n");
while (num_clients < MAX_CLIENTS) {
    if ((new_socket = accept(server_socket, (struct sockaddr *)&address, (socklen_t*)&addrlen)) < 0) {
Contributor

why is addrlen of type int and not socklen_t?

gordicaleksa (Contributor, Author) commented Jun 24, 2024

I don't have a satisfying answer other than it's a pattern I've observed people use everywhere, see:

  1. https://gist.github.com/SkrewEverything/2c535e83a3a7b8e5b7aa490009a87fbb
  2. https://stackoverflow.com/questions/61106749/how-to-read-data-from-socket-correctly
  3. your favorite LLM

It is cast to socklen_t before calling accept.
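
(For the record, declaring it as socklen_t up front avoids the cast entirely:)

struct sockaddr_in address;
socklen_t addrlen = sizeof(address);  // socklen_t from the start, no cast needed
int new_socket = accept(server_socket, (struct sockaddr *)&address, &addrlen);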

train_gpt2.cu Outdated
int gpus_per_node = 8; // this should be set by the slurm environment
char nccl_init_method[256] = "mpi"; // "tcp" or "fs" or "mpi"
char server_ip[256] = "-1"; // used if init_method set to "tcp" -> set to your server ip address
char fs_path[256] = "/tmp"; // used if init_method set to "fs" -> set to a shared filesystem path
Contributor

wouldn't /tmp usually be a path local to the current node?

Owner

Kind of agree. In a previous PR it was /dfs or so; I'm worried that a person might get the wrong impression of this flag if they see its default as /tmp.

Contributor Author

I'll change it to "" (empty), as it doesn't matter: the init method is set to "mpi" by default, and this var is used only when the init method is set to "fs".
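
For context, the "fs" method amounts to process 0 writing the id to a file under fs_path that the other processes poll for. A minimal sketch, assuming fs_path is on a genuinely shared filesystem (names and error handling simplified; a robust version would write to a temp file and rename() it so readers never see a partial write):

ncclUniqueId get_nccl_id_via_fs(int process_rank, const char* fs_path) {
    ncclUniqueId nccl_id;
    char fname[512];
    snprintf(fname, sizeof(fname), "%s/nccl_id.bin", fs_path);
    if (process_rank == 0) {
        ncclGetUniqueId(&nccl_id);
        FILE* f = fopen(fname, "wb");
        fwrite(&nccl_id, sizeof(nccl_id), 1, f);
        fclose(f);
    } else {
        FILE* f;
        while ((f = fopen(fname, "rb")) == NULL) { usleep(100 * 1000); } // wait for rank 0
        fread(&nccl_id, sizeof(nccl_id), 1, f);
        fclose(f);
    }
    return nccl_id;
}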

llmc/zero.cuh Outdated
@@ -193,6 +212,126 @@ ncclUniqueId get_nccl_id_via_tcp(MultiGpuConfig* result, const char* server_ip)
return nccl_id;
}

#ifdef _WIN32
Owner

Any way to put this stuff into unistd.h, which is our Windows-specific code? I think it would need a tiny refactor so that the needed imports are minimal, e.g. maybe not passing in a MultiGpuConfig but its members more directly.

Contributor Author

I think we can do that in a follow-up PR, if you're OK with that? E.g. @rosslwheeler has a Windows setup ready and he said he can pick it up.

@karpathy merged commit 69b50ad into karpathy:master on Jun 24, 2024
11 checks passed