Is javacpp pytorch distributed training available? #1585
Comments
I also found ProcessGroup, ProcessGroupGloo, NCCLPreMulSumSupplement, RecvWork, SendWork, ReduceOp, CustomClassHolder, AsyncWork, GlooStore, _SupplementBase, and Store.
I also found the public enum BackendType.
@HGuillemet Hi, right now only the ProcessGroupGloo class exists. I cannot find class ProcessGroupMPI : public ProcessGroup, class ProcessGroupNCCL : public ProcessGroup, or ProcessGroupUCC (i.e. the MPI, NCCL, and UCC backends), nor FileStore : public Store, MPIStore, or NcclStore. Please add them, @saudet. See https://github.com/gpgpu-sim/pytorch-gpgpu-sim/blob/0459e409e2fccbfc4eb908fe8138e1bf5deb4bed/torch/lib/c10d/ProcessGroupMPI.hpp#L64. Also, is public class ProcessGroupGloo extends DistributedBackend correct? In C++ it is class ProcessGroupGloo : public ProcessGroup.
@HGuillemet @saudet Please add them in the javacpp-pytorch 2.6 release; with that version we need full support for distributed PyTorch training. Thanks.
@sbrunk I think Storch needs to support distributed PyTorch training. Would you do more development on Storch's distributed code?
javacpp also has GradBucket, Reducer, ReduceOp, and other @Namespace("c10d") classes. With more work on the MPI, NCCL, and UCC backends, javacpp pytorch could really run distributed training.
We need to make javacpp's Work.java actually work for DDP models; solving this will take more debugging.
What about the DistributedDataParallel class? Does it need to be implemented in javacpp?
I tried to use the javacpp DDP classes, but I could not write complete code and could not get it to run.
Hi,
I see that C++ libtorch now supports distributed training, though perhaps only via MPI:
https://github.com/pytorch/examples/blob/main/cpp/distributed/dist-mnist.cpp
https://github.com/pytorch/examples/blob/main/cpp/distributed/README.md
Checking javacpp, I only found a few related files in package org.bytedeco.pytorch: DistributedBackend, DistributedBackendOptional, DistributedBackendOptions, DistributedSampler, and Work.
So does javacpp-pytorch support distributed training? Could you show us a distributed code demo? LLMs need distributed training.
The Work class does have content, but have you tried to use Work? I want to understand the javacpp pytorch distributed logic. Thanks.