
MPI/TCP/FS for NCCL-init #632

Merged · 20 commits · Jun 24, 2024

Conversation

gordicaleksa
Contributor

Instead of mixing NCCL & Open MPI during training, let's transition to using only NCCL. To the best of my knowledge there are no downsides here: they're equivalent, and speed-wise I couldn't observe any difference.

By doing this we scope down our MPI dependence to multi_gpu_config_init.

Context: why are we trying to reduce dependence on Open MPI?
The slurm-wlm package (the easiest, and thus likely the most common, way folks set up their Slurm clusters) dropped PMIx support, which means we can't use Slurm in a multi-node setup together with Open MPI.

Regarding multi_gpu_config_init, we have a few options as far as I can tell (a sketch of how they fit together follows below):

  • Keep using MPI (but as mentioned, Slurm users won't be able to run llm.c in a multi-node setup)
  • Transition to logic where we synchronize via the file system (similar to NCCL only multi-gpu multi-node training without MPI #426)
  • TCP sockets (nodes need each other's IP addresses to communicate either way, so we can use that)

So, after this change:

  • Slurm users can toggle a switch, drop the Open MPI dependency, and run in a multi-node setup
  • mpirun users can use llm.c as-is and it will just work
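
Concretely, all three options do the same job: get the ncclUniqueId created by process 0 onto every other process before NCCL is initialized. A sketch of the dispatch (the via_mpi/via_fs helper names are illustrative; get_nccl_id_via_tcp appears in the actual diff):

ncclUniqueId get_nccl_id(MultiGpuConfig* result, const char* init_method,
                         const char* server_ip, const char* fs_path) {
    if (strcmp(init_method, "mpi") == 0) {
        return get_nccl_id_via_mpi(result);            // MPI_Bcast from rank 0
    } else if (strcmp(init_method, "tcp") == 0) {
        return get_nccl_id_via_tcp(result, server_ip); // rank 0 acts as a TCP server
    } else if (strcmp(init_method, "fs") == 0) {
        return get_nccl_id_via_fs(result, fs_path);    // rank 0 writes a file on a shared fs
    }
    fprintf(stderr, "Unknown NCCL init method: %s\n", init_method);
    exit(EXIT_FAILURE);
}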

@gordicaleksa
Contributor Author

I tried to use NCCL instead of MPI but realized there are some hard dependencies:

  • cudaSetDevice has to be called before ncclCommInitRank, so we can't use NCCL to reduce values inside multi_gpu_get_local_device_idx.
  • Similarly, I wanted to replace MPI_Bcast with NCCL, but since the broadcast needs to happen before ncclCommInitRank, it can't be done.

So, as mentioned above, we're left with MPI, filesystem sync, or TCP. I'll test the latter two tomorrow.
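
For context, the standard NCCL bootstrap (one process per GPU) shows why both of those calls must come before ncclCommInitRank; a sketch using llm.c's cudaCheck/ncclCheck conventions:

// rank 0 creates the id; every rank must have it *before* NCCL is initialized,
// which is exactly the job MPI / TCP / the filesystem does here
ncclUniqueId nccl_id;
if (process_rank == 0) { ncclCheck(ncclGetUniqueId(&nccl_id)); }
MPI_Bcast(&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD);
// the device must also be selected before the communicator is created
cudaCheck(cudaSetDevice(local_device_idx));
ncclComm_t nccl_comm;
ncclCheck(ncclCommInitRank(&nccl_comm, num_processes, nccl_id, process_rank));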

@gordicaleksa changed the title from "Use only NCCL for training (WIP)" to "MPI/TCP/FS for NCCL-init" on Jun 24, 2024
@gordicaleksa
Contributor Author

gordicaleksa commented Jun 24, 2024

The PR is tested and ready @karpathy - comments/feedback welcome!

I personally find the TCP setup the most useful since I don't have a shared filesystem.

The only variable one needs to set is the server's IP address, so it's fairly simple to run.
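
For context, the TCP method boils down to the server (rank 0) creating the id and sending its raw bytes to each client over a socket. A simplified sketch of the client side (error handling and retries omitted; SERVER_PORT is a stand-in for whatever port the code agrees on):

ncclUniqueId nccl_id;
int sock = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in serv_addr = {0};
serv_addr.sin_family = AF_INET;
serv_addr.sin_port = htons(SERVER_PORT);              // stand-in port constant
inet_pton(AF_INET, server_ip, &serv_addr.sin_addr);   // the one IP users must set
connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
recv(sock, &nccl_id, sizeof(nccl_id), MSG_WAITALL);   // server sends the raw id bytes
close(sock);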

@gordicaleksa
Contributor Author

NOTE: the sockets code is currently not cross-platform; that's why the Windows test is failing.

Linux & Mac are supported.

So we either tweak it a bit to make it cross-platform, OR we drop Windows support, OR we remove sockets altogether.
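
For reference, the usual portability shim looks something like this (a sketch; on Windows, WSAStartup must also be called before any socket use):

#ifdef _WIN32
#include <winsock2.h>        // Windows sockets; link against ws2_32
#include <ws2tcpip.h>
#define CLOSE_SOCKET closesocket
#else
#include <sys/socket.h>      // POSIX sockets (Linux & Mac)
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#define CLOSE_SOCKET close
#endif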

printf("Failed to send nccl_id");
exit(EXIT_FAILURE);
}
close(client_sockets[i]);
Contributor

closeCheck?
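
(That is, an error-checked wrapper in the style of llm.c's fopenCheck/fcloseCheck; a hypothetical sketch:)

// hypothetical closeCheck, mirroring llm.c's other *Check helpers
void close_check(int fd, const char *file, int line) {
    if (close(fd) != 0) {
        fprintf(stderr, "Error: failed to close fd %d at %s:%d\n", fd, file, line);
        exit(EXIT_FAILURE);
    }
}
#define closeCheck(fd) close_check(fd, __FILE__, __LINE__)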

// Step 6) accept connections from clients
printf("Waiting for clients to connect...\n");
while (num_clients < MAX_CLIENTS) {
    if ((new_socket = accept(server_socket, (struct sockaddr *)&address, (socklen_t*)&addrlen)) < 0) {
Contributor

why is addrlen of type int and not socklen_t?

gordicaleksa (Contributor, Author) commented Jun 24, 2024

I don't have a satisfying answer other than it's a pattern I've observed people use everywhere, see:

  1. https://gist.github.com/SkrewEverything/2c535e83a3a7b8e5b7aa490009a87fbb
  2. https://stackoverflow.com/questions/61106749/how-to-read-data-from-socket-correctly
  3. your favorite LLM

It is cast to socklen_t before calling accept.
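
(For the record, declaring it as socklen_t up front avoids the cast entirely:)

struct sockaddr_in address;
socklen_t addrlen = sizeof(address);  // socklen_t from the start, no cast needed
int new_socket = accept(server_socket, (struct sockaddr *)&address, &addrlen);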

train_gpt2.cu Outdated
int gpus_per_node = 8; // this should be set by the slurm environment
char nccl_init_method[256] = "mpi"; // "tcp" or "fs" or "mpi"
char server_ip[256] = "-1"; // used if init_method set to "tcp" -> set to your server ip address
char fs_path[256] = "/tmp"; // used if init_method set to "fs" -> set to a shared filesystem path
Contributor

wouldn't /tmp usually be a path local to the current node?

Owner

Kind of agree. In a previous PR it was /dfs or so; I'm worried that a person might get the wrong impression of this flag if they see its default as /tmp.

Contributor Author

I'll change it to "" (empty), as it doesn't matter: the init method is set to "mpi" by default, and this var is used only when the init method is set to "fs".
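
For context, the "fs" method amounts to process 0 writing the id to a file under fs_path that the other processes poll for. A minimal sketch, assuming fs_path is on a genuinely shared filesystem (names and error handling simplified; a robust version would write to a temp file and rename() it so readers never see a partial write):

ncclUniqueId get_nccl_id_via_fs(int process_rank, const char* fs_path) {
    ncclUniqueId nccl_id;
    char fname[512];
    snprintf(fname, sizeof(fname), "%s/nccl_id.bin", fs_path);
    if (process_rank == 0) {
        ncclGetUniqueId(&nccl_id);
        FILE* f = fopen(fname, "wb");
        fwrite(&nccl_id, sizeof(nccl_id), 1, f);
        fclose(f);
    } else {
        FILE* f;
        while ((f = fopen(fname, "rb")) == NULL) { usleep(100 * 1000); } // wait for rank 0
        fread(&nccl_id, sizeof(nccl_id), 1, f);
        fclose(f);
    }
    return nccl_id;
}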

llmc/zero.cuh Outdated
@@ -193,6 +212,126 @@ ncclUniqueId get_nccl_id_via_tcp(MultiGpuConfig* result, const char* server_ip)
return nccl_id;
}

#ifdef _WIN32
Owner

Any way to put this stuff into unistd.h, which is our Windows-specific code? I think it would need a tiny refactor so that the needed imports are minimal, e.g. maybe not passing in a MultiGpuConfig but its members more directly.

Contributor Author

I think we can do that in a follow-up PR, if you're OK with that? E.g. @rosslwheeler has a Windows setup ready and he said he can pick it up.

@karpathy merged commit 69b50ad into karpathy:master on Jun 24, 2024
11 checks passed