
[Numpy] Numpy version of GluonNLP #1225

Merged
merged 24 commits into dmlc:numpy on Jun 10, 2020

Conversation

@sxjscience (Member) commented May 7, 2020

Description

This is the new version of GluonNLP that uses the DeepNumpy interface of MXNet. Currently, we have the following functionality:

  • Fine-tune BERT, ALBERT, and ELECTRA on SQuAD 1.1/2.0. We use simplified preprocessing logic with the help of encode_with_offsets.

The new way to load a pretrained model:

from gluonnlp.models.albert import get_pretrained_albert, AlbertModel

# Download the config, tokenizer, and pretrained parameter file
cfg, tokenizer, model_path, _ = get_pretrained_albert()
# Build the model from its configuration and load the weights
model = AlbertModel.from_cfg(cfg)
model.load_parameters(model_path)
  • Train a Transformer on WMT2014 en-de.
  • nlp_data and nlp_preprocess utilities for downloading and preparing NLP data. There is no need to run Perl scripts to prepare machine translation datasets; you can directly use:
# Clean and tokenize a parallel corpus
nlp_preprocess clean_tok_para_corpus ...
# Learn subwords with different algorithms
nlp_preprocess learn_subword ...
# Apply the learned subword model
nlp_preprocess apply_subword ...
  • New attention cell
    We support multi-head attention with NKT, NTK, and TNK layouts.
    We also support multi-head attention with relative positional encoding; the supported methods include Transformer-XL, Shaw, and T5 (see the sketch below).
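To make the relative-position variants concrete, here is a minimal numpy sketch of Shaw-style relative attention scores; the layout and names are illustrative, not the actual GluonNLP attention-cell API:

```python
import numpy as np

T, H, C = 6, 2, 8                     # seq length, heads, channels per head
q = np.random.rand(H, T, C)           # queries
k = np.random.rand(H, T, C)           # keys
rel = np.random.rand(2 * T - 1, C)    # one embedding per relative offset i - j

# Attention score = content term (q_i . k_j) + positional term (q_i . r_{i-j})
content = np.einsum('htc,hsc->hts', q, k)
offsets = np.arange(T)[:, None] - np.arange(T)[None, :] + T - 1
position = np.einsum('htc,tsc->hts', q, rel[offsets])
scores = (content + position) / np.sqrt(C)
```

Transformer-XL parameterizes the positional term differently, and T5 replaces it with learned per-offset scalar biases.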
  • New tokenizer class
    Includes whitespace, spacy, jieba, SentencePiece, YTTM, HuggingFaceBPE, HuggingFaceByteBPE, HuggingFaceWordPiece.

Each tokenizer supports the following basic functionality:

# Encode to a list of string tokens
tokens = tokenizer.encode('Hello World', str)
# Encode to a list of int token ids
token_ids = tokenizer.encode('Hello World', int)
# Decode the encoded token ids back to a string
out = tokenizer.decode(token_ids)
# Encode to tokens + character offsets
tokens, offsets = tokenizer.encode_with_offsets('Hello World', int)
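For illustration, the offsets are character spans into the original string, which is what the simplified SQuAD preprocessing relies on; the values below are hypothetical and tokenizer-dependent:

```python
tokens, offsets = tokenizer.encode_with_offsets('Hello World', str)
# tokens  -> ['Hello', 'World']     (hypothetical whitespace tokenization)
# offsets -> [(0, 5), (6, 11)]      (start/end character positions)
```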
  • New vocab class
  • New configuration system based on yacs + new Registry class.
    For example, the configuration of the ALBERT-base model looks like this (a usage sketch follows the listing):
INITIALIZER:
  bias:
  - zeros
  embed:
  - truncnorm
  - 0
  - 0.02
  weight:
  - truncnorm
  - 0
  - 0.02
MODEL:
  activation: gelu(tanh)
  attention_dropout_prob: 0.0
  dtype: float32
  embed_size: 128
  hidden_dropout_prob: 0.0
  hidden_size: 3072
  layer_norm_eps: 1.0e-12
  max_length: 512
  num_groups: 1
  num_heads: 12
  num_layers: 12
  num_token_types: 2
  pos_embed_type: learned
  units: 768
  vocab_size: 30000
VERSION: 1
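A minimal sketch of loading and overriding such a config; get_cfg/from_cfg are the entry points described in this PR, while the specific override below is just an example:

```python
from gluonnlp.models.albert import AlbertModel

cfg = AlbertModel.get_cfg()            # default ALBERT configuration (a yacs CfgNode)
cfg.defrost()                          # allow modification
cfg.MODEL.hidden_dropout_prob = 0.1    # override an individual field
cfg.freeze()
model = AlbertModel.from_cfg(cfg)
```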

To try out some examples, go to scripts/machine_translation and scripts/question_answering.

We will soon add back-translation, stochastic beam search, ELECTRA pretraining, and other functionality.

cc @dmlc/gluon-nlp-team

Credit also goes to @zheyuye, @hymzoque, @XieBinghui, and @gongel.

@szha (Member) commented May 7, 2020

Let's divide the review work:

Everyone should review the API.

@@ -0,0 +1,157 @@
import argparse
import textwrap
from multiprocessing import Pool
Contributor:

Should set the serialization format based on our minimum supported version. See https://bugs.python.org/issue28053

Member Author:

I'll do it after this PR.

    context_vec = F.npx.batch_dot(attn_weights,
                                  F.np.swapaxes(value, 1, 2)).transpose((0, 2, 1, 3))
    context_vec = F.npx.reshape(context_vec, (-2, -2, -1))
elif layout == 'TNK':
Member Author:

@MoisesHer @ptrendx According to my understanding, contrib.interleaved_matmul_selfatt_qk + contrib.interleaved_matmul_selfatt_valatt, or contrib.interleaved_matmul_encdec_qk + contrib.interleaved_matmul_encdec_valatt, could be implemented as np.einsum('ibnc,jbnc->bnij') and np.einsum('bnij,jbnc->ibnc').
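A quick numpy sketch of that einsum formulation, assuming the (seq, batch, heads, channels) layout implied by the subscripts (shapes are illustrative):

```python
import numpy as np

i, j, b, n, c = 4, 5, 2, 3, 8    # query len, key len, batch, heads, channels
q = np.random.rand(i, b, n, c)
k = np.random.rand(j, b, n, c)
v = np.random.rand(j, b, n, c)

scores = np.einsum('ibnc,jbnc->bnij', q, k)                    # query-key scores
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over keys
context = np.einsum('bnij,jbnc->ibnc', attn, v)                # weighted values
```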

@szha (Member) left a comment:

nice job! we will need to set up CI for this. I reviewed the markdowns in this pass.

@szha (Member) left a comment:

review still WIP

  • license headers for API files, including src/gluonnlp/cli/__init__.py (I have no idea how to add a review comment to an empty file)

package_dir={"": "src"},
zip_safe=True,
include_package_data=True,
install_requires=requirements,
Member:

consider adding tests_require

Comment on lines 3 to 4
from . import initializer
from . import initializer as init
Member:

stick to one for simplicity.

    return F.np.concatenate([sin_emb, cos_emb], axis=-1)
else:
    return F.np.concatenate(
        [sin_emb, cos_emb, F.np.expand_dims(F.np.zeros_like(positions).astype(self._dtype),
Member:

padding op?

Member Author:

Should be similar



@use_np
class SinusoidalPositionalEmbedding(HybridBlock):
Member:

description

return self.pooler(outputs)

@staticmethod
def get_cfg(key=None):
Member:

we should try to replace get_cfg and from_cfg with something automatic for blocks. @leezu suggestions?

from json import JSONDecodeError


class Registry:
Member:

how do we merge this with existing registries in mxnet?

Member Author:

This is a reusable registry class and there is no plan to merge this into MXNet.

Member:

why do we require yet another registry? what's missing in the existing ones?

Member Author:

Previously, we needed to create a registry.py file and wrap the MXNet registry with something like def register(class_=None, **kwargs). Now, we can just call MODEL_REGISTRY = Registry('model') or TOKENIZER_REGISTRY = Registry('tokenizer') to support multiple use-cases.
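A minimal sketch of the pattern this enables; the actual gluonnlp Registry implementation may differ in details:

```python
class Registry:
    """A reusable name -> object registry."""

    def __init__(self, name):
        self._name = name
        self._obj_map = {}

    def register(self, key):
        """Return a decorator that registers an object under `key`."""
        def deco(obj):
            self._obj_map[key] = obj
            return obj
        return deco

    def get(self, key):
        return self._obj_map[key]


MODEL_REGISTRY = Registry('model')
TOKENIZER_REGISTRY = Registry('tokenizer')

@TOKENIZER_REGISTRY.register('spm')
class SentencepieceTokenizer:
    ...
```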

- [SuperGLUE](./general_benchmarks/README.md#superglue-benchmark)
- [SentEval](./general_benchmarks/README.md#senteval-benchmark)

## Contribution Guide
Member:

we have moved to CONTRIBUTING.md

Member Author:

I'll revise it later



@TOKENIZER_REGISTRY.register('spm')
class SentencepieceTokenizer(BaseTokenizerWithVocab):
Member:

Can we break it down into multiple files, each for 1 tokenizer?

Member Author:

Let me do it in a separate PR.

@leezu (Contributor) commented May 13, 2020

We can see the CI output at https://github.com/sxjscience/gluon-nlp/actions

@szha (Member) commented May 13, 2020

Thanks for the examples. Still, it raises the question of why the registry in mxnet couldn't benefit from similar usability. If it could, I think it makes sense not to have multiple registry solutions.

It doesn't have to happen as part of the PR but let's nail down a plan first.

@szha (Member) commented May 13, 2020

@leezu could we also have the code coverage information? I'd like to get a good understanding of the current test coverage.

@sxjscience (Member Author):
@szha Having a separate Registry class lets us drop the unnecessary MXNet dependency from tokenizers.py. Also, the new Registry class is documented and designed to be reusable.

@leezu (Contributor) commented May 13, 2020

codecov is enabled via ac16f2d. The main problem is that the tests are currently killed on Linux, apparently due to OOM. This may be a bug in mxnet's memory management, as memory usage appears to grow over time. As a workaround, we can use a separate Python process for each test file: leezu@cc4e647

@leezu (Contributor) left a comment:

>>> nlp.models.albert
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'gluonnlp' has no attribute 'models'

modules are not exposed correctly


| Seed = 100 | Seed = 1234 | Seed = 12345 | Mean±std |
| ---------- | ----------- | ------------ | ---------- |
| 26.61 | - | - | - |
Member:

it seems quite low....

if v.grad_req != 'null':
    v.grad_req = 'add'
model.collect_params().zero_grad()
model_averager = AverageSGDTracker(model.collect_params())
Member:

Are you using suffix-averaged SGD, i.e., only averaging over the last K iterations/epochs, where K is a user-defined hyper-parameter?
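For context, a hedged numpy sketch of suffix averaging, where only the final K iterates contribute to the averaged parameters (names are illustrative, not the AverageSGDTracker API):

```python
import numpy as np

def sgd_with_suffix_average(w, grad_fn, lr, total_steps, K):
    """Plain SGD that returns the running mean of the last K iterates."""
    avg, n_avg = np.zeros_like(w), 0
    for step in range(total_steps):
        w = w - lr * grad_fn(w)
        if step >= total_steps - K:      # only the suffix contributes
            n_avg += 1
            avg += (w - avg) / n_avg     # incremental running mean
    return avg
```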

sacrebleu -t wmt13 -l ${SRC}-${TGT} --echo src > ${SAVE_PATH}/dev.raw.${SRC}
sacrebleu -t wmt13 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/dev.raw.${TGT}
sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo src > ${SAVE_PATH}/test.raw.${SRC}
sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
Member:

There are two versions of the test set. One is filtered and only has around 2700+ examples, while the other has 3000+ examples. Have you checked which one the above command fetches?

Member Author:

It's the 2700+ version.

Member:

I think the current baseline uses the 3000+ version, but it has been a while and I am not sure.

self.output_filename = output_filename

# This puts one article per line
def merge(self):
Member:

I recently noticed that this wiki cleaner does not remove some useless documents containing only HTML tags such as `colspan="2" style="background-color:`.

Member:

And these tags appear in the wiki texts for all languages.
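One possible filter, sketched with an illustrative regex (the actual fix in the cleaner may differ):

```python
import re

# Residual wiki-table attribute markup, e.g. colspan="2" style="background-color:
TABLE_MARKUP = re.compile(r'\b(?:colspan|rowspan|style)\s*=\s*"')

def drop_markup_docs(docs):
    """Drop documents that contain leftover table markup."""
    return [d for d in docs if not TABLE_MARKUP.search(d)]
```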

Member Author:

I see. We should improve this.

Member Author:

@szhengac How about solving it in another PR?

Commits:
  • Merge conversion toolkits
  • update unittests by fixing the version
  • update datasets
  • add scripts
  • Delete __init__.py
  • add src
  • update
  • Update setup.py
  • Update setup.py
  • update all tests
  • revise test cases
  • Update unittests.yml
  • Update initializer.py
  • Create preprocessing.py
  • Update __init__.py
  • Update attention_cell.py
  • Update prepare_wmt.py
  • move ubuntu + windows to TODO
"""
def __init__(self, model_path: Optional[str] = None,
vocab: Optional[Union[str, Vocab]] = None,
nbest: int = 0, alpha: float = 0.0, do_lower=False,
Member Author:

@haven-jeon I fixed the default value here to cope with BPE-dropout.

@eric-haibin-lin (Member) left a comment:

I have not looked at all the conversion scripts in the last few commits in detail, but I have no concerns so far about the API design, provided the previous comments are addressed.

@szha merged commit 01122db into dmlc:numpy on Jun 10, 2020