This work was published at IEEE ISIT 2022, where we proposed a novel algorithm, G-CADA, that improves the time and communication efficiency of distributed learning systems through grouping and adaptive selection 😄


Adaptive Worker Grouping for Communication-Efficient and Straggler-Tolerant Distributed SGD

Overview

  • This repository showcases my first published paper titled Adaptive Worker Grouping for Communication-Efficient and Straggler-Tolerant Distributed SGD.

  • The paper was presented at the IEEE International Symposium on Information Theory (ISIT) in 2022.

  • About the paper

    • We propose a novel algorithm, G-CADA, designed to enhance both time and communication efficiencies in distributed learning systems.
    • Building upon the widely recognized CADA algorithm, which aggregates gradients from workers lazily to improve communication efficiency, G-CADA introduces adaptive worker grouping. Specifically:
      • Workers are partitioned into groups dynamically at each iteration, and each group is assigned the same dataset shard.
      • By utilizing additional storage at the workers, the server interacts with only the fastest worker in each group, significantly improving robustness against stragglers.
    • Numerical simulations demonstrate substantial gains in both time and communication efficiency.

Key Concepts of G-CADA

  • Group formation:

    • A total of $M$ workers are divided into $G$ groups, with each group containing an equal number of workers.
    • The dataset is split into $G$ shards, and all workers within a group access the same shard.
  • Age parameter:

    • The server maintains an age parameter $\tau_g^k$ for each group $g$, counting the number of rounds since the group last communicated with the server, as of iteration $k$.
  • Group selection:

    • At each iteration $k$, each group checks a predefined selection condition (the exact inequality is given in the paper). If the condition is violated, indicating that the group’s information has changed significantly, the group communicates with the server to update its parameters.

  • Age update:

    • Groups that communicate with the server reset their age parameter.
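The concepts above can be sketched in a toy simulation. This is a minimal illustration only: all variable names are hypothetical, the dataset shards are synthetic quadratics rather than MNIST, and the innovation test below is a simplified placeholder for the paper's actual selection condition.

```python
import numpy as np

rng = np.random.default_rng(0)

M, G = 8, 4      # M workers partitioned into G equal groups
K = 100          # iterations
d = 5            # model dimension
D_max = 10       # maximum allowed age before a forced update

# Each group g holds one dataset shard; here each shard defines a
# synthetic quadratic objective 0.5 * ||A[g] @ theta - b[g]||^2.
A = [rng.standard_normal((20, d)) for _ in range(G)]
b = [rng.standard_normal(20) for _ in range(G)]

def shard_grad(g, theta):
    """Gradient of the quadratic loss on shard g."""
    return A[g].T @ (A[g] @ theta - b[g])

theta = np.zeros(d)
eta = 0.005
age = np.zeros(G, dtype=int)  # tau_g^k: rounds since group g last uploaded
stale_grad = [shard_grad(g, theta) for g in range(G)]  # last uploaded gradients

for k in range(K):
    for g in range(G):
        fresh = shard_grad(g, theta)
        # Simplified innovation test (placeholder for the paper's condition):
        # upload if the gradient changed enough or the age hit the cap.
        if np.linalg.norm(fresh - stale_grad[g]) > 1e-3 or age[g] >= D_max:
            stale_grad[g] = fresh  # the fastest worker in group g uploads
            age[g] = 0             # age update: reset on communication
        else:
            age[g] += 1            # group stays silent; age grows
    # Server step with the (possibly stale) group gradients.
    theta -= eta * np.sum(stale_grad, axis=0)
```

Note that because every worker in a group stores the same shard, the server only needs the first (fastest) response per group, which is what gives the straggler tolerance described above.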

Algorithm Illustration

An example visualization of the algorithm:

Pseudocode

The pseudocode below provides a step-by-step representation of the G-CADA algorithm:

[Pseudocode figure from the paper]

Results

  • Numerical examples: G-CADA demonstrates clear advantages over benchmark algorithms in terms of time efficiency, communication cost, and computational cost.

For more detailed information, please refer to our full paper: IEEE Xplore or arXiv.

Code Description

  • The implementation in linear_regression_ps.py simulates a linear regression model using the MNIST dataset and a quadratic error loss function.
  • Benchmark comparisons:
    • G-CADA is compared against state-of-the-art algorithms, including distributed SGD, CADA, and distributed Adam.
    • Results consistently show that G-CADA outperforms these methods in both time and communication efficiency.
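As a rough sketch of the kind of objective `linear_regression_ps.py` optimizes, the snippet below fits a linear model under a quadratic (least-squares) loss with plain gradient descent. Synthetic data stands in for MNIST, and all names are illustrative rather than taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))  # features (e.g. flattened images)
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(100)  # noisy targets

w = np.zeros(10)
for _ in range(200):                  # plain gradient descent baseline
    grad = X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= 0.1 * grad

mse = np.mean((X @ w - y) ** 2)
```

In the distributed setting of the paper, this gradient computation would be split across the $G$ dataset shards, with G-CADA deciding per iteration which groups upload fresh gradients.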
