Large-scale 4D-parallelism pre-training of Mixture-of-Experts models for 🤗 Transformers *(still work in progress)*
Topics: transformers, moe, data-parallelism, distributed-optimizers, model-parallelism, megatron, mixture-of-experts, pipeline-parallelism, huggingface-transformers, megatron-lm, tensor-parallelism, large-scale-language-modeling, 3d-parallelism, zero-1, sequence-parallelism
Updated Dec 14, 2023 - Python