This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Add AMP + Update Benchmarking Script #1405

Merged · 64 commits merged into dmlc:master on Nov 6, 2020

Conversation

sxjscience (Member) commented Oct 26, 2020

Description

Add AMP support + update benchmarking script.

  • Add a testing utility for fp16 backbones
  • Update tests and the benchmarking script
  • Update SQuAD finetuning to use AMP (see the first sketch after this list)
  • Use boolean masking (see the second sketch after this list)
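
For context, here is a minimal sketch of the usual MXNet AMP training step. This is the generic mxnet.contrib.amp flow, not the exact code added to run_squad.py in this PR; the module path and details may differ on the MXNet 2.x line that GluonNLP master targets.

    import mxnet as mx
    from mxnet import autograd, gluon
    from mxnet.contrib import amp

    amp.init()  # patch MXNet for mixed precision; call before building the network

    ctx = mx.gpu(0)  # AMP targets GPU contexts (requires a CUDA build of MXNet)
    net = gluon.nn.Dense(2)
    net.initialize(ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})
    amp.init_trainer(trainer)  # enable dynamic loss scaling on this trainer

    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
    data = mx.nd.random.uniform(shape=(8, 16), ctx=ctx)
    label = mx.nd.array([0, 1] * 4, ctx=ctx)

    with autograd.record():
        loss = loss_fn(net(data), label)
        # scale the loss so fp16 gradients do not underflow, then backprop
        with amp.scale_loss(loss, trainer) as scaled_loss:
            autograd.backward(scaled_loss)
    trainer.step(batch_size=8)

And an illustrative sketch of boolean attention masking, with plain NumPy standing in for the actual ops in attention_cell.py: positions where the boolean mask is False are pushed to a large negative score before the softmax, so they receive zero attention weight.

    import numpy as np

    def masked_softmax(scores, mask, axis=-1):
        """scores: float array of attention logits; mask: boolean array broadcastable to scores."""
        neg = -1e4  # sufficiently negative, and still representable in fp16
        masked_scores = np.where(mask, scores, neg)
        masked_scores -= masked_scores.max(axis=axis, keepdims=True)  # numerical stability
        weights = np.exp(masked_scores) * mask  # force exactly zero on masked positions
        return weights / np.maximum(weights.sum(axis=axis, keepdims=True), 1e-20)

    scores = np.random.randn(2, 4).astype(np.float32)
    mask = np.array([[True, True, False, False],
                     [True, True, True, False]])
    print(masked_softmax(scores, mask))  # masked positions get weight 0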

Some issues:

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @dmlc/gluon-nlp-team


codecov bot commented Oct 26, 2020

Codecov Report

Merging #1405 into master will increase coverage by 0.42%.
The diff coverage is 87.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1405      +/-   ##
==========================================
+ Coverage   85.13%   85.55%   +0.42%     
==========================================
  Files          53       53              
  Lines        6928     6987      +59     
==========================================
+ Hits         5898     5978      +80     
+ Misses       1030     1009      -21     
Impacted Files Coverage Δ
src/gluonnlp/models/transformer.py 98.94% <ø> (ø)
src/gluonnlp/optimizer.py 82.22% <ø> (-0.20%) ⬇️
src/gluonnlp/models/transformer_xl.py 80.80% <14.28%> (-1.65%) ⬇️
src/gluonnlp/utils/testing.py 94.16% <94.11%> (-0.04%) ⬇️
src/gluonnlp/attention_cell.py 80.39% <100.00%> (ø)
src/gluonnlp/data/sampler.py 96.57% <100.00%> (+0.02%) ⬆️
src/gluonnlp/models/bart.py 93.75% <100.00%> (+7.50%) ⬆️
src/gluonnlp/models/gpt2.py 98.26% <100.00%> (+0.01%) ⬆️
src/gluonnlp/utils/misc.py 58.58% <0.00%> (-0.62%) ⬇️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1726dd2...6b2e1ea.

sxjscience changed the title from "[WIP]Add AMP + Use Boolean Mask + Update Benchmarking Script" to "[WIP]Add AMP + Update Benchmarking Script" on Oct 26, 2020
Commits pushed to this pull request (in order):

  • Update testing.py (×15)
  • Update attention_cell.py
  • Update testing.py (×3)
  • Update test_models_bert.py
  • Update run_batch_squad.sh
  • Update generate_commands.py
  • Update run_batch_squad.sh (×3)
  • Add region
  • Update generate_commands.py
  • Update run_squad.template
  • Try to use clip 1.0
  • update
  • Update README.md
  • Update attention_cell.py
  • Update benchmark_gluonnlp.py
  • Update attention_cell.py
  • Update testing.py
  • Update run_squad.py
  • Update attention_cell.py (×3)
  • update
  • Update attention_cell.py
  • update
  • Update numbers + log + weight
  • update (×2)
  • Update testing.py
sxjscience marked this pull request as ready for review on November 5, 2020 16:21
sxjscience requested a review from a team as a code owner on November 5, 2020 16:21
sxjscience changed the title from "[WIP]Add AMP + Update Benchmarking Script" to "Add AMP + Update Benchmarking Script" on Nov 5, 2020
barry-jin (Contributor) left a comment:

LGTM

Comment on lines 188 to 189:

### Run with AWS Batch
We can quickly run the squad finetuning via the [AWS Batch support](../../tools/batch).

Member: Is the batch support refactored so that it can be bootstrapped in any AWS account, including the VPC setup, security groups, etc.?

Member: If it doesn't work, we should avoid advertising this as a user-facing feature.

sxjscience (Member, Author): I think it's there so that the experiments can be run quickly if you have Batch access. Otherwise, we will end up writing everything inside the batch folder.

Member: It's also OK to just keep the script somewhere local or in a gist.

sxjscience (Member, Author): I've also considered this. The problem is that it would increase the communication time, and it can add some value for users once we later have the CloudFormation script.

Member: If the feature doesn't work for a user as is, it will cause confusion and friction.

Member: Keeping it as part of the infra code is also OK.

sxjscience (Member, Author): Maybe let's still keep it in the batch folder.

Comment on lines 1 to 10:

#!/bin/bash
# Sync the S3 results of each AWS Batch job listed in a log file to a local directory.
# Usage: <this script> LOG_PATH [SAVE_DIR_NAME]

set -ex

LOG_PATH=$1
SAVE_DIR_NAME=${2:-squad_2.0}

# Each line of LOG_PATH is expected to contain "<job_name> <job_id>".
while read -r job_name job_id; do
    aws s3 sync s3://gluon-nlp-dev/batch/${job_id}/temp ${SAVE_DIR_NAME}/${job_name}
done < ${LOG_PATH}

szha (Member), Nov 6, 2020: I don't think we should check in all the convenience scripts in the scripts folder, because this folder is user-facing.

sxjscience (Member, Author): For me, I feel it's fine, since it helps reproduce the experiments for our own purposes, and we can also later add it to the scheduled check.

sxjscience (Member, Author): Should be resolved now.

szha previously requested changes on Nov 6, 2020

szha (Member) left a comment:

Let's remove the non-user-facing scripts from the user-facing scripts folder.

szha dismissed their stale review on November 6, 2020 00:37:

Concerns addressed. Great addition and a step closer to full prod support.

szha merged commit dd45270 into dmlc:master on Nov 6, 2020
Labels: none · Projects: none · 5 participants