
Is JavaCPP PyTorch distributed training available? #1585

Open
mullerhai opened this issue Feb 27, 2025 · 9 comments

Comments

@mullerhai

Hi,
I see that C++ libtorch now supports distributed training, though apparently only via MPI:
https://github.com/pytorch/examples/blob/main/cpp/distributed/dist-mnist.cpp
https://github.com/pytorch/examples/blob/main/cpp/distributed/README.md

I checked, and JavaCPP only has a handful of related classes in package org.bytedeco.pytorch: DistributedBackend, DistributedBackendOptional, DistributedBackendOptions, DistributedSampler, and Work.
So does javacpp-pytorch support distributed training? Could you show us some distributed-training demo code? LLMs need distributed training.

However, the Work class carries this comment:

// Please do not use Work API, it is going away, to be
// replaced by ivalue::Future.
// Python binding for this class might change, please do not assume
// this will be bound using pybind.

Have you tried using Work? I would like to understand the javacpp-pytorch distributed logic. Thanks!
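For context on the "distributed logic" I am asking about: the core collective these classes wrap is allreduce. A plain-Java sketch (no libtorch; the class and method names here are hypothetical, just to illustrate the semantics) of what an allreduce(SUM) computes — every rank ends up holding the element-wise sum of all ranks' buffers:

```java
import java.util.Arrays;

/** Illustrative only: what a c10d allreduce(SUM) computes across ranks. */
public class AllreduceSketch {
    /** perRank[r] is rank r's buffer; returns each rank's buffer after the collective. */
    public static double[][] allreduceSum(double[][] perRank) {
        int n = perRank[0].length;
        double[] sum = new double[n];
        for (double[] buf : perRank)
            for (int i = 0; i < n; i++) sum[i] += buf[i];
        // after the collective, every rank holds an identical copy of the sum
        double[][] out = new double[perRank.length][];
        for (int r = 0; r < perRank.length; r++) out[r] = sum.clone();
        return out;
    }

    public static void main(String[] args) {
        double[][] ranks = { {1, 2}, {3, 4}, {5, 6} };   // 3 ranks, 2 elements each
        double[][] result = allreduceSum(ranks);
        System.out.println(Arrays.toString(result[0]));  // every rank sees [9.0, 12.0]
    }
}
```

A real ProcessGroup implements this with network communication (Gloo, NCCL, MPI, ...) instead of a shared array, but the contract is the same.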

@mullerhai

I also found ProcessGroup, ProcessGroupGloo, NCCLPreMulSumSupplement, RecvWork, SendWork, ReduceOp, CustomClassHolder, AsyncWork, GlooStore, _SupplementBase, and Store.

@mullerhai

I also found this enum:

    public enum BackendType {
        UNDEFINED((byte) 0),
        GLOO((byte) 1),
        NCCL((byte) 2),
        UCC((byte) 3),
        MPI((byte) 4),
        CUSTOM((byte) 5);
    }

Could it really invoke any of these backends?

@mullerhai

mullerhai commented Feb 27, 2025

@HGuillemet Hi, there is currently only a ProcessGroupGloo class. I cannot find class ProcessGroupMPI : public ProcessGroup, class ProcessGroupNCCL : public ProcessGroup, or ProcessGroupUCC for the NCCL, MPI, and UCC backends, nor FileStore : public Store, MPIStore, or NcclStore. Please add them. @saudet

https://github.com/gpgpu-sim/pytorch-gpgpu-sim/blob/0459e409e2fccbfc4eb908fe8138e1bf5deb4bed/torch/lib/c10d/ProcessGroupMPI.hpp#L64
https://github.com/gpgpu-sim/pytorch-gpgpu-sim/blob/0459e409e2fccbfc4eb908fe8138e1bf5deb4bed/torch/lib/c10d/ProcessGroupNCCL.cpp

Also, is public class ProcessGroupGloo extends DistributedBackend correct? In C++ it is class ProcessGroupGloo : public ProcessGroup.
AlgorithmEntry and AlgorithmKey are also missing.

@mullerhai

@HGuillemet @saudet Please add them in the javacpp-pytorch 2.6 release; with that version we need full support for distributed PyTorch training. Thanks!

@mullerhai

@sbrunk I think storch also needs to support distributed PyTorch training; would you develop distributed code for storch?

@mullerhai

JavaCPP also has GradBucket, Reducer, ReduceOp, and some @Namespace("c10d") classes; with more work on MPI, NCCL, and UCC, javacpp-pytorch could really run distributed training.
By the way, could the JavaCPP project create a Discord group for chat, @HGuillemet @saudet? Chatting would help us work better and faster.

@mullerhai

We need to make JavaCPP's Work.java really work for DDP models; solving this will require more debugging.
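The C++ comment quoted above says the Work API is being replaced by ivalue::Future, and that is essentially the pattern at play: a collective returns a handle immediately, and wait() blocks until it completes. A plain-Java sketch of that pattern (hypothetical names, CompletableFuture as a stand-in; nothing here touches libtorch):

```java
import java.util.concurrent.CompletableFuture;

/** Illustrative only: the Work/Future handle pattern used by c10d collectives. */
public class WorkHandleSketch {
    /** Launch an "allreduce" asynchronously; the returned future is the Work handle. */
    public static CompletableFuture<double[]> asyncAllreduce(double[] buf) {
        return CompletableFuture.supplyAsync(() -> {
            double s = 0;
            for (double v : buf) s += v;          // stand-in for the real collective
            double[] out = new double[buf.length];
            java.util.Arrays.fill(out, s);        // every slot gets the reduced value
            return out;
        });
    }

    public static void main(String[] args) {
        // join() plays the role of Work::wait(): block until the collective finishes
        double[] reduced = asyncAllreduce(new double[]{1, 2, 3}).join();
        System.out.println(reduced[0]);  // 6.0
    }
}
```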

@mullerhai

What about the DistributedDataParallel class? Does that need to be implemented in JavaCPP too?
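The core step DistributedDataParallel performs, as I understand it, is: after backward, allreduce-sum the gradients across ranks and divide by the world size, so every replica applies the same averaged update. A plain-Java sketch of just that arithmetic (hypothetical helper, no libtorch):

```java
/** Illustrative only: the gradient-averaging step at the heart of DDP. */
public class DdpGradAverage {
    /** perRankGrads[r] holds rank r's local gradients; returns the averaged gradient. */
    public static double[] averageGradients(double[][] perRankGrads) {
        int world = perRankGrads.length;
        int n = perRankGrads[0].length;
        double[] avg = new double[n];
        for (double[] g : perRankGrads)       // allreduce(SUM) stand-in
            for (int i = 0; i < n; i++) avg[i] += g[i];
        for (int i = 0; i < n; i++) avg[i] /= world;
        return avg;                            // identical on every rank after the collective
    }

    public static void main(String[] args) {
        double[] avg = averageGradients(new double[][]{{0.25, 0.5}, {0.75, 1.0}});
        System.out.println(avg[0] + " " + avg[1]);  // 0.5 0.75
    }
}
```

The real DDP also buckets gradients (the GradBucket/Reducer classes mentioned above) and overlaps communication with backward, but the end result is this average.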

@mullerhai

I tried to use the JavaCPP DDP classes, but I could not complete the code, and it does not run:

  // note: my first draft declared `options` twice; merged into one here
  val glooStore = new GlooStore()
  val options = new DistributedBackendOptions()
  options.timeout(new SecondsFloat())
  options.store(glooStore)

  // rank must be < world size, so use rank 0 for a single-process world
  val processGroup = new ProcessGroupGloo(glooStore, 0, 1, options)
  val backend = new DistributedBackend("gloo", options, processGroup) // constructor signature unclear
  val distributedBackend = DistributedBackend.withBackend(backend)
  val rank = processGroup.rank()
  val worldSize = processGroup.size()

  // presumably allreduce needs a tensor argument; unclear from the bindings
  val work = processGroup.allreduce()
  work.wait() // beware: on a Java object this may resolve to Object.wait()
