This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Integrate Horovod training API as part of MXNet native distributed training API #17531

Merged: 25 commits into apache:master from dev/unified_api_hvd on Apr 14, 2020


apeforest
Contributor

Description

Integrate Horovod training API as part of MXNet native distributed training API by making Horovod one special type of KVStore as proposed in #16795
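To make the idea of "Horovod as one special type of KVStore" concrete, here is a minimal single-process sketch. It is not the code from this PR: `KVStoreBase`, `Horovod`, and `create` below are hypothetical stand-ins, and the allreduce is simulated with plain Python lists rather than real Horovod collectives. It only shows the shape of the design, in which a backend registers under a name and exposes `broadcast` and `pushpull`.

```python
# Hypothetical sketch: these classes are illustrative stand-ins,
# not MXNet's or Horovod's real implementation.
from typing import Dict, List, Type


class KVStoreBase:
    """Minimal interface every backend in this sketch implements."""

    _registry: Dict[str, Type["KVStoreBase"]] = {}

    @classmethod
    def register(cls, klass):
        # Store the backend under its lowercased class name.
        cls._registry[klass.__name__.lower()] = klass
        return klass

    def broadcast(self, key, value, out):
        raise NotImplementedError

    def pushpull(self, key, value, out):
        raise NotImplementedError


@KVStoreBase.register
class Horovod(KVStoreBase):
    """Stand-in backend that simulates collectives inside one process."""

    def broadcast(self, key, value: List[float], out: List[List[float]]):
        # Copy the root's value into every worker's buffer.
        for buf in out:
            buf[:] = value

    def pushpull(self, key, value: List[List[float]], out: List[List[float]]):
        # Sum the per-worker values element-wise (a simulated allreduce)
        # and write the result into every worker's buffer.
        total = [sum(col) for col in zip(*value)]
        for buf in out:
            buf[:] = total


def create(name: str) -> KVStoreBase:
    """Look up a registered backend by name and instantiate it."""
    try:
        return KVStoreBase._registry[name.lower()]()
    except KeyError:
        raise ValueError(f"unknown kvstore type: {name!r}") from None


kv = create("horovod")
workers = [[0.0, 0.0], [0.0, 0.0]]
kv.broadcast("weight", [1.0, 2.0], out=workers)
kv.pushpull("grad", [[0.5, 1.0], [1.5, 3.0]], out=workers)
```

Because callers only hold a `KVStoreBase`, swapping the backend would mean changing the string passed to `create`, which is the appeal of the unified API described above.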

Part of MXNet 2.0 project tracked by #17111

Checklist

Essentials

Still need to debug the core dump in the integration tests.

@eric-haibin-lin
Member

@ChaokunChang FYI

Member

@eric-haibin-lin eric-haibin-lin left a comment

Could you also include the Horovod dependency in CI?

Member

@eric-haibin-lin eric-haibin-lin left a comment

Shall we also update the initializer / broadcast function in trainer.py? That way users can directly use gluon.Trainer(params, kv='horovod').
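A rough sketch of what this suggestion implies, assuming a trainer that selects its store by name: broadcast the initial parameters once at construction, then sum gradients through the store on each step. `ToyTrainer`, `LocalSumStore`, and `STORES` are hypothetical names for illustration only; this is not the `trainer.py` change itself, and the store simulates all workers in one process.

```python
# Hypothetical sketch: ToyTrainer and LocalSumStore are illustrative,
# not MXNet APIs. All "workers" live in one process.
from typing import List


class LocalSumStore:
    """Stand-in store that simulates broadcast and a sum-allreduce."""

    def broadcast(self, key, value, out):
        for buf in out:
            buf[:] = value

    def pushpull(self, key, value, out):
        total = [sum(col) for col in zip(*value)]
        for buf in out:
            buf[:] = total


# Assumed name-to-backend mapping; 'horovod' points at the stand-in here.
STORES = {"horovod": LocalSumStore}


class ToyTrainer:
    def __init__(self, replicas: List[List[float]], lr: float, kv: str = "horovod"):
        self.replicas = replicas  # one parameter list per simulated worker
        self.lr = lr
        self.kv = STORES[kv]()
        # Initializer step: every replica starts from worker 0's parameters.
        self.kv.broadcast("params", list(replicas[0]), out=replicas)

    def step(self, grads: List[List[float]]):
        n = len(self.replicas)
        summed = [[0.0] * len(g) for g in grads]
        # Sum gradients across workers, then apply an averaged SGD update.
        self.kv.pushpull("grads", grads, out=summed)
        for params, g in zip(self.replicas, summed):
            for i, gi in enumerate(g):
                params[i] -= self.lr * gi / n


replicas = [[1.0, 2.0], [9.0, 9.0]]
trainer = ToyTrainer(replicas, lr=0.5, kv="horovod")
trainer.step([[1.0, 0.0], [3.0, 2.0]])
```

With the broadcast folded into the constructor, user code would not need a separate call to synchronize initial parameters, which is the convenience the comment is asking for.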

@apeforest apeforest force-pushed the dev/unified_api_hvd branch 2 times, most recently from fa6ea1a to a0f5848 Compare March 15, 2020 05:34
@apeforest apeforest force-pushed the dev/unified_api_hvd branch 2 times, most recently from 8ac5a82 to fc22ff1 Compare March 17, 2020 18:35
@apeforest apeforest changed the title [WIP] Integrate Horovod training API as part of MXNet native distributed training API Integrate Horovod training API as part of MXNet native distributed training API Mar 18, 2020
@apeforest apeforest force-pushed the dev/unified_api_hvd branch 2 times, most recently from 040370c to 08d28a3 Compare March 19, 2020 08:12
@eric-haibin-lin eric-haibin-lin merged commit e796ae9 into apache:master Apr 14, 2020
@apeforest apeforest deleted the dev/unified_api_hvd branch April 14, 2020 21:07
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
…aining API (apache#17531)

* implement pushpull for horovod

* add local_rank function

* add tests

* Remove in-place broadcast API

* Add kvstore horovod example

* Fix the list to singleton conversion

* Add horovod test to CI

* Remove test horovod from unit test

* Add docstring

* Add horovod in test

* sync with master

* Fix horovod dependency in CI

* Fix merge conflict with byteps

* Update __init__.py

* Resolve conflict

* Remove openib warning message

* Add log message in test

* Remove tmp file

* Fix lint

Co-authored-by: Haibin Lin <linhaibin.eric@gmail.com>
2 participants