
[Numpy] Numpy version of GluonNLP #1225

Merged
merged 24 commits into dmlc:numpy on Jun 10, 2020

Conversation

@sxjscience (Member) commented May 7, 2020

Description

This is the new version of GluonNLP that uses the DeepNumpy interface of MXNet. Currently, we have the following functionality:

  • Fine-tune BERT, ALBERT, and ELECTRA on SQuAD 1.1/2.0. We use simplified preprocessing logic with the help of encode_with_offsets.

The new way to load a pretrained model:

from gluonnlp.models.albert import get_pretrained_albert, AlbertModel

# Download the config, tokenizer, and pretrained parameter file
cfg, tokenizer, model_path, _ = get_pretrained_albert()
# Build the model from its configuration and load the weights
model = AlbertModel.from_cfg(cfg)
model.load_parameters(model_path)
  • Train a Transformer on WMT2014 en-de.
  • nlp_data and nlp_preprocess utilities for downloading and preparing NLP data. There is no need to run Perl scripts to prepare machine translation datasets; you can directly use:
# Clean and tokenize a parallel corpus
nlp_preprocess clean_tok_para_corpus ...
# Learn subwords with different algorithms
nlp_preprocess learn_subword ...
# Apply the learned subword model
nlp_preprocess apply_subword ...
  • New attention cell
    We support multi-head attention with NKT, NTK, and TNK layouts.
    We also support multi-head attention with relative positional encoding; the supported methods include Transformer-XL, Shaw, and T5 (see the sketch below).
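To make the relative-position variants concrete, here is a minimal numpy sketch of Shaw-style relative attention scores; the layout and names are illustrative, not the actual GluonNLP attention-cell API:

```python
import numpy as np

T, H, C = 6, 2, 8                     # seq length, heads, channels per head
q = np.random.rand(H, T, C)           # queries
k = np.random.rand(H, T, C)           # keys
rel = np.random.rand(2 * T - 1, C)    # one embedding per relative offset i - j

# Attention score = content term (q_i . k_j) + positional term (q_i . r_{i-j})
content = np.einsum('htc,hsc->hts', q, k)
offsets = np.arange(T)[:, None] - np.arange(T)[None, :] + T - 1
position = np.einsum('htc,tsc->hts', q, rel[offsets])
scores = (content + position) / np.sqrt(C)
```

Transformer-XL parameterizes the positional term differently, and T5 replaces it with learned per-offset scalar biases.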
  • New tokenizer class
    Includes whitespace, spacy, jieba, SentencePiece, YTTM, HuggingFaceBPE, HuggingFaceByteBPE, HuggingFaceWordPiece.

Each tokenizer supports the following basic functionality:

# Encode to a list of string tokens
tokens = tokenizer.encode('Hello World', str)
# Encode to a list of int token ids
token_ids = tokenizer.encode('Hello World', int)
# Decode the encoded token ids back to a string
out = tokenizer.decode(token_ids)
# Encode to tokens + character offsets
tokens, offsets = tokenizer.encode_with_offsets('Hello World', int)
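For illustration, the offsets are character spans into the original string, which is what the simplified SQuAD preprocessing relies on; the values below are hypothetical and tokenizer-dependent:

```python
tokens, offsets = tokenizer.encode_with_offsets('Hello World', str)
# tokens  -> ['Hello', 'World']     (hypothetical whitespace tokenization)
# offsets -> [(0, 5), (6, 11)]      (start/end character positions)
```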
  • New vocab class
  • New configuration system based on yacs + new Registry class.
    For example, the configuration of the ALBERT-base model looks like this (a usage sketch follows the listing):
INITIALIZER:
  bias:
  - zeros
  embed:
  - truncnorm
  - 0
  - 0.02
  weight:
  - truncnorm
  - 0
  - 0.02
MODEL:
  activation: gelu(tanh)
  attention_dropout_prob: 0.0
  dtype: float32
  embed_size: 128
  hidden_dropout_prob: 0.0
  hidden_size: 3072
  layer_norm_eps: 1.0e-12
  max_length: 512
  num_groups: 1
  num_heads: 12
  num_layers: 12
  num_token_types: 2
  pos_embed_type: learned
  units: 768
  vocab_size: 30000
VERSION: 1
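A minimal sketch of loading and overriding such a config; get_cfg/from_cfg are the entry points described in this PR, while the specific override below is just an example:

```python
from gluonnlp.models.albert import AlbertModel

cfg = AlbertModel.get_cfg()            # default ALBERT configuration (a yacs CfgNode)
cfg.defrost()                          # allow modification
cfg.MODEL.hidden_dropout_prob = 0.1    # override an individual field
cfg.freeze()
model = AlbertModel.from_cfg(cfg)
```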

To try out some examples, go to scripts/machine_translation and scripts/question_answering.

We will soon add back-translation, stochastic beam search, ELECTRA pretraining, and other functionality.

cc @dmlc/gluon-nlp-team

Credit also goes to @zheyuye, @hymzoque, @XieBinghui, and @gongel.

@szha (Member) commented May 7, 2020

Let's divide the review work:

Everyone should review the API.

@@ -0,0 +1,157 @@
import argparse
import textwrap
from multiprocessing import Pool
Contributor:

Should set the serialization format based on our minimum supported version. See https://bugs.python.org/issue28053

Member Author:

I'll do it after this PR.

    context_vec = F.npx.batch_dot(attn_weights,
                                  F.np.swapaxes(value, 1, 2)).transpose((0, 2, 1, 3))
    context_vec = F.npx.reshape(context_vec, (-2, -2, -1))
elif layout == 'TNK':
Member Author:

@MoisesHer @ptrendx According to my understanding, contrib.interleaved_matmul_selfatt_qk + contrib.interleaved_matmul_selfatt_valatt, or contrib.interleaved_matmul_encdec_qk + contrib.interleaved_matmul_encdec_valatt, could be implemented as np.einsum('ibnc,jbnc->bnij') and np.einsum('bnij,jbnc->ibnc').
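A quick numpy sketch of that einsum formulation, assuming the (seq, batch, heads, channels) layout implied by the subscripts (shapes are illustrative):

```python
import numpy as np

i, j, b, n, c = 4, 5, 2, 3, 8    # query len, key len, batch, heads, channels
q = np.random.rand(i, b, n, c)
k = np.random.rand(j, b, n, c)
v = np.random.rand(j, b, n, c)

scores = np.einsum('ibnc,jbnc->bnij', q, k)                    # query-key scores
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over keys
context = np.einsum('bnij,jbnc->ibnc', attn, v)                # weighted values
```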

@szha (Member) left a comment:

nice job! we will need to set up CI for this. I reviewed the markdowns in this pass.

@szha (Member) left a comment:

review still WIP

  • license headers for API files, including src/gluonnlp/cli/__init__.py (I have no idea how to add a review comment to an empty file)

package_dir={"": "src"},
zip_safe=True,
include_package_data=True,
install_requires=requirements,
Member:

consider adding tests_require

Comment on lines 3 to 4
from . import initializer
from . import initializer as init
Member:

stick to one for simplicity.

    return F.np.concatenate([sin_emb, cos_emb], axis=-1)
else:
    return F.np.concatenate(
        [sin_emb, cos_emb, F.np.expand_dims(F.np.zeros_like(positions).astype(self._dtype),
Member:

padding op?

Member Author:

Should be similar



@use_np
class SinusoidalPositionalEmbedding(HybridBlock):
Member:

description

return self.pooler(outputs)

@staticmethod
def get_cfg(key=None):
Member:

we should try to replace get_cfg and from_cfg with something automatic for blocks. @leezu suggestions?

from json import JSONDecodeError


class Registry:
Member:

how do we merge this with existing registries in mxnet?

Member Author:

This is a reusable registry class and there is no plan to merge this into MXNet.

Member:

why do we require yet another registry? what's missing in the existing ones?

Member Author:

Previously, we needed to create a registry.py file and wrap the MXNet registry with something like def register(class_=None, **kwargs). Now, we can just call MODEL_REGISTRY = Registry('model') or TOKENIZER_REGISTRY = Registry('tokenizer') to support multiple use-cases.
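A minimal sketch of the pattern this enables; the actual gluonnlp Registry implementation may differ in details:

```python
class Registry:
    """A reusable name -> object registry."""

    def __init__(self, name):
        self._name = name
        self._obj_map = {}

    def register(self, key):
        """Return a decorator that registers an object under `key`."""
        def deco(obj):
            self._obj_map[key] = obj
            return obj
        return deco

    def get(self, key):
        return self._obj_map[key]


MODEL_REGISTRY = Registry('model')
TOKENIZER_REGISTRY = Registry('tokenizer')

@TOKENIZER_REGISTRY.register('spm')
class SentencepieceTokenizer:
    ...
```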

- [SuperGLUE](./general_benchmarks/README.md#superglue-benchmark)
- [SentEval](./general_benchmarks/README.md#senteval-benchmark)

## Contribution Guide
Member:

we have moved to CONTRIBUTING.md

Member Author:

I'll revise it later



@TOKENIZER_REGISTRY.register('spm')
class SentencepieceTokenizer(BaseTokenizerWithVocab):
Member:

Can we break it down into multiple files, each for 1 tokenizer?

Member Author:

Let me do it in a separate PR.

@leezu (Contributor) commented May 13, 2020

We can see the CI output at https://github.com/sxjscience/gluon-nlp/actions

@szha (Member) commented May 13, 2020

Thanks for the examples. Still, it raises the question of why the registry in mxnet couldn't benefit from similar usability. If it could, I think it makes sense not to have multiple registry solutions.

It doesn't have to happen as part of the PR but let's nail down a plan first.

@szha (Member) commented May 13, 2020

@leezu could we also have the code coverage information? I'd like to get a good understanding of the current test coverage.

@sxjscience (Member Author):
@szha Having a separate Registry class lets us drop the unnecessary MXNet dependency from tokenizers.py. Also, the new Registry class is documented and designed to be reusable.

@leezu (Contributor) commented May 13, 2020

codecov is enabled via ac16f2d. The main problem is that the tests are currently killed on Linux, apparently due to OOM. This may be a bug in mxnet's memory management, as memory usage appears to grow over time. As a workaround, we can use a separate Python process for each test file: leezu@cc4e647

@leezu (Contributor) left a comment:

>>> nlp.models.albert
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'gluonnlp' has no attribute 'models'

modules are not exposed correctly


| Seed = 100 | Seed = 1234 | Seed = 12345 | Mean±std |
| ---------- | ----------- | ------------ | ---------- |
| 26.61 | - | - | - |
Member:

it seems quite low....

if v.grad_req != 'null':
    v.grad_req = 'add'
model.collect_params().zero_grad()
model_averager = AverageSGDTracker(model.collect_params())
Member:

Are you using suffix-averaged SGD, i.e., only averaging over the last K iterations/epochs, where K is a user-defined hyper-parameter?
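For context, a hedged numpy sketch of suffix averaging, where only the final K iterates contribute to the averaged parameters (names are illustrative, not the AverageSGDTracker API):

```python
import numpy as np

def sgd_with_suffix_average(w, grad_fn, lr, total_steps, K):
    """Plain SGD that returns the running mean of the last K iterates."""
    avg, n_avg = np.zeros_like(w), 0
    for step in range(total_steps):
        w = w - lr * grad_fn(w)
        if step >= total_steps - K:      # only the suffix contributes
            n_avg += 1
            avg += (w - avg) / n_avg     # incremental running mean
    return avg
```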

sacrebleu -t wmt13 -l ${SRC}-${TGT} --echo src > ${SAVE_PATH}/dev.raw.${SRC}
sacrebleu -t wmt13 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/dev.raw.${TGT}
sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo src > ${SAVE_PATH}/test.raw.${SRC}
sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
Member:

There are two versions of the test set. One is filtered and only has around 2700+ examples, while the other has 3000+ examples. Have you checked which one the above command fetches?

Member Author:

It's the 2700+ version.

Member:

I think the current baseline uses the 3000+ version, but it has been a while and I am not sure.

self.output_filename = output_filename

# This puts one article per line
def merge(self):
Member:

I recently noticed that this wiki cleaner does not remove some useless documents containing only HTML tags such as `colspan="2" style="background-color:`.

Member:

And these tags appear in the wiki texts for all languages.
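One possible filter, sketched with an illustrative regex (the actual fix in the cleaner may differ):

```python
import re

# Residual wiki-table attribute markup, e.g. colspan="2" style="background-color:
TABLE_MARKUP = re.compile(r'\b(?:colspan|rowspan|style)\s*=\s*"')

def drop_markup_docs(docs):
    """Drop documents that contain leftover table markup."""
    return [d for d in docs if not TABLE_MARKUP.search(d)]
```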

Member Author:

I see. We should improve this.

Member Author:

@szhengac How about solving it in another PR?

Commits:
  • Merge conversion toolkits
  • update unittests by fixing the version
  • update datasets
  • add scripts
  • Delete __init__.py
  • add src
  • update
  • Update setup.py
  • Update setup.py
  • update all tests
  • revise test cases
  • Update unittests.yml
  • Update initializer.py
  • Create preprocessing.py
  • Update __init__.py
  • Update attention_cell.py
  • Update prepare_wmt.py
  • move ubuntu + windows to TODO
"""
def __init__(self, model_path: Optional[str] = None,
vocab: Optional[Union[str, Vocab]] = None,
nbest: int = 0, alpha: float = 0.0, do_lower=False,
Member Author:

@haven-jeon I fixed the default value here to cope with BPE-dropout.

@eric-haibin-lin (Member) left a comment:

I have not looked at all the conversion scripts in the last few commits in detail, but I have no concerns so far about the API design, provided the previous comments are addressed.

@szha merged commit 01122db into dmlc:numpy on Jun 10, 2020