Added ElasticSampler and PyTorch Elastic ImageNet example #2297

tgaddair · 2020-09-19T17:23:41Z

Fixes #2252.

Signed-off-by: Travis Addair <taddair@uber.com>

github-actions · 2020-09-22T16:41:50Z

Unit Test Results

0 files 0 suites 0s ⏱️
0 tests 0 ✔️ 0 💤 0 ✖️

results for commit 672862b

Signed-off-by: Travis Addair <taddair@uber.com>

github-actions · 2020-09-22T16:46:25Z

Unit Test Results

0 files 0 suites 0s ⏱️
0 tests 0 ✔️ 0 💤 0 ✖️

results for commit 61582f6

Signed-off-by: Travis Addair <taddair@uber.com>

github-actions · 2020-09-22T16:55:04Z

Unit Test Results

0 files 0 suites 0s ⏱️
0 tests 0 ✔️ 0 💤 0 ✖️

results for commit 313d604

Signed-off-by: Travis Addair <taddair@uber.com>

github-actions · 2020-09-22T16:58:57Z

Unit Test Results

0 files 0 suites 0s ⏱️
0 tests 0 ✔️ 0 💤 0 ✖️

results for commit 4d82fda

abditag2 · 2020-09-22T18:51:18Z

examples/elastic/pytorch_imagenet_resnet50_elastic.py

+# Training settings
+parser = argparse.ArgumentParser(description='Elastic PyTorch ImageNet Example',
+                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+parser.add_argument('--train-dir', default=os.path.expanduser('~/imagenet/train'),


As a general improvement for the future: I think it would be nice to have a script to download and prepare the ImageNet data for the examples. From personal experience of trying to run examples from other repos, it is sometimes painful to get the data into the layout the example expects.

Yes, I agree. We should at least provide instructions for doing so. Finding and downloading ImageNet is a pain.

abditag2 · 2020-09-22T20:58:08Z

horovod/torch/elastic/sampler.py

+        self.rank = rank()
+
+        # Exclude any samples we have already processed this epoch
+        self.remaining_indices = [idx for idx in range(len(self.dataset))


NIT: It might be more efficient to compute the set of remaining indices minus the set of processed indices instead of iterating over the entire list.

In this case we need to preserve the order of remaining_indices so that it is deterministic across workers, which we could not guarantee using a set.
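
The trade-off in this thread can be sketched in plain Python. This is only an illustration, not the code in horovod/torch/elastic/sampler.py; the helper name `remaining_in_order` and the sample values are hypothetical. It keeps the list comprehension so that the result stays in dataset order, and uses a set only for the membership test.

```python
def remaining_in_order(num_samples, processed_indices):
    """Return the unprocessed indices in ascending dataset order.

    Hypothetical helper for illustration; not part of Horovod.
    """
    # A set gives constant-time membership checks on average.
    processed = set(processed_indices)
    # The comprehension walks range() in order, so the output order is fixed.
    return [idx for idx in range(num_samples) if idx not in processed]


remaining = remaining_in_order(10, [3, 0, 7])
print(remaining)  # [1, 2, 4, 5, 6, 8, 9]
```

A plain set difference would contain the same elements, but Python does not guarantee the iteration order of a set, so the result would have to be sorted before any seeded shuffle.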

abditag2 · 2020-09-22T21:01:45Z

Generally, this looks good to me. So, ElasticSampler works only with datasets that are entirely on the disk and not with packages like Petastorm where we are streaming the data in. Is that correct?

tgaddair · 2020-09-22T21:13:02Z

It works with any dataset that can be randomly indexed. Parquet is not particularly well suited to this approach because of its row-group storage format, but it could work in theory. In practice, though, we will need to do something else for Petastorm.
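
To make "randomly indexed" concrete, here is a minimal sketch under stated assumptions. `SquaresDataset` and `shard_remaining` are hypothetical names, and the sharding is a simplification that may differ from what ElasticSampler actually does (for example in padding or shuffling details). It shows why integer indexing matters: after the worker count changes, each worker can jump straight to its share of the samples not yet processed.

```python
import random


class SquaresDataset:
    """Toy dataset that supports len() and integer indexing."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return idx * idx


def shard_remaining(dataset, processed, rank, num_replicas, seed=0):
    """Hypothetical helper: split the unprocessed indices across workers."""
    remaining = [i for i in range(len(dataset)) if i not in processed]
    # The same seed and the same input order give every worker the same permutation.
    random.Random(seed).shuffle(remaining)
    # A strided slice gives each rank a disjoint subset.
    return remaining[rank::num_replicas]


dataset = SquaresDataset(8)
processed = {0, 1, 2, 3}  # pretend these were consumed before a resize
shards = [shard_remaining(dataset, processed, r, num_replicas=2) for r in range(2)]
samples = [[dataset[i] for i in shard] for shard in shards]
```

A streaming reader cannot do the `dataset[i]` lookup cheaply, which is consistent with the remark that Petastorm needs a different approach.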

github-actions · 2020-09-22T22:00:19Z

Unit Test Results

457 files +7   457 suites +7   4h 29m 15s ⏱️ -9m 16s
618 tests +1   573 ✔️ ±0   44 💤 ±0   1 ✖️ +1
8 803 runs -464   7 579 ✔️ -387   1 223 💤 -78   1 ✖️ +1

results for commit 318cc26 ± comparison against base commit 41b8152

tgaddair added 16 commits September 22, 2020 09:27

Initial commit of elastic sampler

641ab82

Signed-off-by: Travis Addair <taddair@uber.com>

Added backwards compatibility

40edde4

Signed-off-by: Travis Addair <taddair@uber.com>

Added unit tests

faacd7d

Signed-off-by: Travis Addair <taddair@uber.com>

Added more unit tests

434d227

Signed-off-by: Travis Addair <taddair@uber.com>

Removed debug code

7e0aa03

Signed-off-by: Travis Addair <taddair@uber.com>

Initial commit of ImageNet example

055a25f

Signed-off-by: Travis Addair <taddair@uber.com>

Call end_epoch

e9b6b67

Signed-off-by: Travis Addair <taddair@uber.com>

Undo test code

764df40

Signed-off-by: Travis Addair <taddair@uber.com>

Renamed record_batch

b9f0d3c

Signed-off-by: Travis Addair <taddair@uber.com>

Fixed docs

9cd4235

Signed-off-by: Travis Addair <taddair@uber.com>

Added docs for ElasticSampler

e3721ef

Signed-off-by: Travis Addair <taddair@uber.com>

Expose elastic API

6e2ef45

Signed-off-by: Travis Addair <taddair@uber.com>

Restrict public API

305743d

Signed-off-by: Travis Addair <taddair@uber.com>

Commit at beginning

de05789

Signed-off-by: Travis Addair <taddair@uber.com>

Fixed circular imports

21f70f8

Signed-off-by: Travis Addair <taddair@uber.com>

Fixed resume_from_epoch

672862b

Signed-off-by: Travis Addair <taddair@uber.com>

tgaddair force-pushed the elastic-sampler branch from a032e10 to 672862b September 22, 2020 16:28

Don't adjust learning rate

61582f6

Signed-off-by: Travis Addair <taddair@uber.com>

Added explicit host updates check

313d604

Signed-off-by: Travis Addair <taddair@uber.com>

Check division by zero

4d82fda

Signed-off-by: Travis Addair <taddair@uber.com>

Assign resume_from_epoch

318cc26

Signed-off-by: Travis Addair <taddair@uber.com>

tgaddair marked this pull request as ready for review September 22, 2020 17:01

tgaddair requested a review from abditag2 September 22, 2020 17:05

abditag2 reviewed Sep 22, 2020

abditag2 approved these changes Sep 22, 2020

tgaddair merged commit 32e5fdb into master Sep 22, 2020

tgaddair deleted the elastic-sampler branch September 22, 2020 22:01

tgaddair mentioned this pull request Sep 22, 2020

Add some demo of elastic horovod on real dataset. #2252

Closed
