Petals communication protocol
This page is meant as a brief summary of client-server interactions in Petals during inference.
From the model's point of view (e.g., LLaMA in petals.models.llama), activations are sent through the RemoteSequential class. This class passes its inputs to RemoteSequentialAutogradFunction, which handles the forward and backward passes through the servers.
For the forward pass, the key method is sequential_forward (https://github.com/bigscience-workshop/petals/blob/main/src/petals/client/sequential_autograd.py#L26), which sends model inputs (a torch.Tensor) through a sequence of layers.
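The idea of sending inputs through a sequence of layers can be sketched without any Petals code. The snippet below is an illustration under simplifying assumptions, not the actual implementation: `Span`, `run_blocks`, and `sequential_forward_sketch` are hypothetical names, plain lists of floats stand in for torch.Tensor, and local functions stand in for remote servers. The real method does more, for example retrying after server failures.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Activations = List[float]  # stand-in for a torch.Tensor of hidden states


@dataclass
class Span:
    """Hypothetical: a contiguous range of blocks [start, end) served by one peer."""
    start: int
    end: int
    run_blocks: Callable[[Activations], Activations]  # stand-in for a remote forward call


def sequential_forward_sketch(inputs: Activations, spans: Sequence[Span]) -> Activations:
    """Pass activations through each span in order; each output feeds the next span."""
    hidden = inputs
    expected_start = spans[0].start if spans else 0
    for span in spans:
        # Spans must cover the block range without gaps or overlaps.
        if span.start != expected_start:
            raise ValueError(f"gap or overlap before block {span.start}")
        hidden = span.run_blocks(hidden)
        expected_start = span.end
    return hidden


# Two fake "servers": the first adds 1 to every element, the second doubles it.
spans = [
    Span(0, 2, lambda xs: [x + 1.0 for x in xs]),
    Span(2, 4, lambda xs: [x * 2.0 for x in xs]),
]
outputs = sequential_forward_sketch([1.0, 2.0], spans)
```

The client only needs the final `hidden` value; intermediate activations matter again during the backward pass, which this sketch leaves out.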
These functions make remote forward/backward calls to the server, serializing and deserializing tensors with hivemind.compression.serialize_torch_tensor and hivemind.compression.deserialize_torch_tensor (https://github.com/learning-at-home/hivemind/blob/master/hivemind/compression/serialization.py#L30-L47).
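To make the role of serialization concrete, here is a minimal round-trip sketch that uses only the Python standard library. It is not hivemind's wire format or API: `serialize_floats` and `deserialize_floats` are hypothetical helpers, and the layout (a shape header followed by little-endian float32 values) is an assumption chosen for illustration. The point is only that a tensor must travel as raw bytes plus enough metadata to rebuild it on the other side.

```python
import math
import struct
from typing import List, Tuple


def serialize_floats(values: List[float], shape: Tuple[int, ...]) -> bytes:
    """Pack the shape, then the data as little-endian float32 (illustrative format only)."""
    if math.prod(shape) != len(values):
        raise ValueError("shape does not match the number of values")
    header = struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body


def deserialize_floats(payload: bytes) -> Tuple[List[float], Tuple[int, ...]]:
    """Inverse of serialize_floats: read the shape header, then the float32 data."""
    (ndim,) = struct.unpack_from("<I", payload, 0)
    shape = struct.unpack_from(f"<{ndim}I", payload, 4)
    offset = 4 + 4 * ndim
    count = (len(payload) - offset) // 4
    values = list(struct.unpack_from(f"<{count}f", payload, offset))
    return values, shape


payload = serialize_floats([1.0, 2.5, -3.0, 0.5], (2, 2))
restored, restored_shape = deserialize_floats(payload)
```

The hivemind functions additionally support compression, which this sketch does not attempt to model.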
All intermediate messages between the server and the client use the ExpertRequest and ExpertResponse Protobuf classes. Their schemas are defined in https://github.com/learning-at-home/hivemind/blob/master/hivemind/proto/runtime.proto#L12-L21
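As a rough picture of the request/response pairing, the sketch below mirrors the two message types with plain Python dataclasses. Everything in it is an assumption made for illustration: the class names `TensorMessage`, `RequestSketch`, and `ResponseSketch` are hypothetical, and the field names are only loosely modelled on the linked schema, so runtime.proto remains the authoritative definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TensorMessage:
    """Hypothetical stand-in for a serialized tensor: raw bytes plus shape and dtype."""
    buffer: bytes
    size: List[int]
    dtype: str


@dataclass
class RequestSketch:
    """Client -> server: which blocks to run (uid), the input tensors, opaque metadata."""
    uid: str
    tensors: List[TensorMessage] = field(default_factory=list)
    metadata: bytes = b""


@dataclass
class ResponseSketch:
    """Server -> client: the output tensors and opaque metadata."""
    tensors: List[TensorMessage] = field(default_factory=list)
    metadata: bytes = b""


def handle(request: RequestSketch) -> ResponseSketch:
    """Toy server handler that echoes the tensors back unchanged."""
    return ResponseSketch(tensors=list(request.tensors), metadata=request.metadata)


request = RequestSketch(
    uid="block.0",  # hypothetical identifier
    tensors=[TensorMessage(buffer=b"\x00\x00\x80?", size=[1], dtype="float32")],
)
response = handle(request)
```

A real handler would deserialize the tensors, run them through the requested layers, and serialize the results into the response.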