A lightweight implementation of Soft Mixture of Experts. Unlike other soft-moe implementations that are coupled to ViT, this code is not tied to any domain-specific method and depends on no framework other than PyTorch. It works on any vectorized sequence, which makes it easy to port and embed.
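For example, the layer can be dropped into any sequence model. A minimal usage sketch, assuming the module is exported as `SoftMoE` with the constructor arguments `(d_model, num_experts, num_slots)` used throughout this README (the actual class and module names may differ):

```python
import torch

# Hypothetical import; the actual module/class name in this repo may differ.
from soft_moe import SoftMoE

moe = SoftMoE(d_model=256, num_experts=8, num_slots=2)
x = torch.randn(4, 128, 256)  # (batch, seq_len, d_model): any vectorized sequence
out = moe(x)                  # same shape as the input: (4, 128, 256)
```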
In Sparse MoE, tokens enter the model as a sequence. Assume the input is a sequence of length `n`, where each token is a `d`-dimensional vector. This scheme has several well-known problems:
- The layer does not model relationships between tokens; it is just a drop-in replacement for the FFN;
- It suffers from a load-balancing problem: training easily converges to a state where a small fraction of the experts handle the vast majority of inputs while the rest sit idle;
- Token dropping: on tokens it has not seen, performance may collapse (a minimal hard-routing sketch follows this list);
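To make the failure mode concrete, here is a minimal top-1 (hard) routing sketch. It is purely illustrative and not part of this repo: every token is hard-assigned to exactly one expert, so nothing prevents a few experts from receiving most of the tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1SparseMoE(nn.Module):
    """Illustrative hard-routing MoE (not this repo's implementation)."""
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
             for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (batch, n, d)
        logits = self.router(x)                 # (batch, n, num_experts)
        probs = F.softmax(logits, dim=-1)
        top1 = probs.argmax(dim=-1)             # hard assignment: one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i                    # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask]) * probs[mask][..., i:i+1]
        return out
```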
Following the design of Soft-MoE, a model has `num_experts` experts, each of which is an MLP:
self.experts = nn.ModuleList([
    # one independent Mlp per expert (Mlp is this repo's helper module)
    Mlp(d_model, d_model, d_model,
        hidden_activation=F.relu, output_activation=F.relu,
        layer_norm=True, out_layer_norm=True, use_residual=False)
    for _ in range(num_experts)
])
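For context, `Mlp` is this repo's own helper module. The sketch below is purely hypothetical, inferred only from the call signature above; the real class may differ in its details:

```python
import torch.nn as nn
import torch.nn.functional as F

class Mlp(nn.Module):
    """Hypothetical sketch matching the call Mlp(in_dim, hidden_dim, out_dim, ...)."""
    def __init__(self, in_dim, hidden_dim, out_dim,
                 hidden_activation=F.relu, output_activation=F.relu,
                 layer_norm=False, out_layer_norm=False, use_residual=False):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        self.hidden_activation = hidden_activation
        self.output_activation = output_activation
        self.norm1 = nn.LayerNorm(hidden_dim) if layer_norm else nn.Identity()
        self.norm2 = nn.LayerNorm(out_dim) if out_layer_norm else nn.Identity()
        self.use_residual = use_residual

    def forward(self, x):
        h = self.norm1(self.hidden_activation(self.fc1(x)))
        y = self.norm2(self.output_activation(self.fc2(h)))
        return x + y if self.use_residual else y
```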
The original router is replaced by a parameter:
# phi: one learnable d_model-dimensional query per (expert, slot) pair
self.phi = nn.Parameter(torch.randn(d_model, num_experts, num_slots))
For an input `x` of shape `(batch, n, d_model)`, the routing weights are computed as:
# compute weights, which are used both in dispatch and combine
weights = torch.einsum("b n d , d e s -> b n e s", x, self.phi)
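As a quick sanity check on shapes (toy sizes assumed purely for illustration):

```python
import torch

b, n, d, e, s = 2, 16, 64, 4, 2             # assumed toy sizes
x = torch.randn(b, n, d)
phi = torch.randn(d, e, s)

weights = torch.einsum("b n d , d e s -> b n e s", x, phi)
print(weights.shape)                        # torch.Size([2, 16, 4, 2])
```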
This weight tensor is then normalized along two different axes: along the sequence length `n` to obtain the dispatch weights, and along the flattened expert-slot axis `e * s` to obtain the combine weights.
# dispatch tokens to experts: softmax over the sequence dimension,
# so each expert slot receives a convex combination of all n tokens
dispatch_weights = F.softmax(weights, dim=1)
experts_inputs = torch.einsum("b n e s, b n d -> b e s d", dispatch_weights, x)
We then run each expert on its own slots and stack the outputs into a single slot sequence of length `e * s`:
# run expert i on its own slot inputs: (batch, num_slots, d_model) each
expert_outputs = torch.stack([self.experts[i](experts_inputs[:, i]) for i in range(self.num_experts)])
# (e, b, s, d) -> (b, e*s, d): flatten all expert slots into one sequence
expert_outputs = einops.rearrange(expert_outputs, "e b s d -> b (e s) d")
Next, we aggregate the slot outputs back into a sequence of length `n`; each output token is a convex combination of all `e * s` slot outputs:
# combine expert outputs: softmax over all e*s slots
combine_weights = einops.rearrange(weights, "b n e s -> b n (e s)")
combine_weights = F.softmax(combine_weights, dim=-1)
out = torch.einsum("b n z, b z d -> b n d", combine_weights, expert_outputs)
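Putting it all together, here is a minimal, self-contained sketch assembled from the snippets above. The class structure is an assumption for illustration, and a plain `nn.Sequential` stands in for the repo's `Mlp` helper; the actual source may differ:

```python
import einops
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoE(nn.Module):
    """Minimal Soft-MoE layer assembled from the snippets above (illustrative)."""
    def __init__(self, d_model, num_experts, num_slots):
        super().__init__()
        self.num_experts = num_experts
        # nn.Sequential stands in for this repo's Mlp helper
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model), nn.ReLU())
            for _ in range(num_experts)
        ])
        self.phi = nn.Parameter(torch.randn(d_model, num_experts, num_slots))

    def forward(self, x):  # x: (batch, n, d_model)
        # routing weights, shared by dispatch and combine
        weights = torch.einsum("b n d , d e s -> b n e s", x, self.phi)

        # dispatch: each slot gets a convex combination of all tokens
        dispatch_weights = F.softmax(weights, dim=1)
        experts_inputs = torch.einsum("b n e s, b n d -> b e s d", dispatch_weights, x)

        # each expert processes its own slots
        expert_outputs = torch.stack(
            [self.experts[i](experts_inputs[:, i]) for i in range(self.num_experts)])
        expert_outputs = einops.rearrange(expert_outputs, "e b s d -> b (e s) d")

        # combine: each output token is a convex combination of all slot outputs
        combine_weights = einops.rearrange(weights, "b n e s -> b n (e s)")
        combine_weights = F.softmax(combine_weights, dim=-1)
        return torch.einsum("b n z, b z d -> b n d", combine_weights, expert_outputs)

moe = SoftMoE(d_model=64, num_experts=4, num_slots=2)
print(moe(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```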