Let's divide the review work:
Everyone should review the API.
@@ -0,0 +1,157 @@
import argparse
import textwrap
from multiprocessing import Pool
Should set the serialization format based on our minimum supported version. See https://bugs.python.org/issue28053
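The linked issue concerns the pickle protocol that `multiprocessing` uses for serialization. A minimal sketch of pinning the protocol to the minimum supported Python version (the `(3, 6)` floor is an assumption for illustration; protocol 4 exists since Python 3.4, protocol 5 since 3.8):

```python
import pickle

# Assumption: the project's minimum supported interpreter is Python 3.6.
MIN_SUPPORTED = (3, 6)
# Protocol 5 requires Python >= 3.8; otherwise fall back to protocol 4.
PROTOCOL = 5 if MIN_SUPPORTED >= (3, 8) else 4

data = {'tokens': ['hello', 'world']}
blob = pickle.dumps(data, protocol=PROTOCOL)
assert pickle.loads(blob) == data
```

The same idea applies to workers spawned by `Pool`: choose one explicit protocol instead of relying on the interpreter default, so that all supported versions agree.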
I'll do it after this PR.
context_vec = F.npx.batch_dot(attn_weights,
                              F.np.swapaxes(value, 1, 2)).transpose((0, 2, 1, 3))
context_vec = F.npx.reshape(context_vec, (-2, -2, -1))
elif layout == 'TNK':
@MoisesHer @ptrendx According to my understanding, contrib.interleaved_matmul_selfatt_qk + contrib.interleaved_matmul_selfatt_valatt (or contrib.interleaved_matmul_encdec_qk + contrib.interleaved_matmul_encdec_valatt) could be implemented as einsum('ibnc,jbnc->bnij') and np.einsum('bnij,jbnc->ibnc').
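A plain-NumPy sketch of the two einsum calls above, assuming the time-major layout implied by the subscripts (i = query length, j = key length, b = batch, n = heads, c = head dim; shapes and the softmax step are my additions):

```python
import numpy as np

np.random.seed(0)
i, j, b, n, c = 4, 5, 2, 3, 8   # q_len, k_len, batch, heads, head_dim
q = np.random.rand(i, b, n, c)
k = np.random.rand(j, b, n, c)
v = np.random.rand(j, b, n, c)

# qk step -> attention scores of shape (batch, heads, q_len, k_len)
scores = np.einsum('ibnc,jbnc->bnij', q, k)
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)   # softmax over the key axis

# valatt step -> context vectors back in time-major (i, b, n, c) layout
ctx = np.einsum('bnij,jbnc->ibnc', attn, v)
assert scores.shape == (b, n, i, j)
assert ctx.shape == (i, b, n, c)
```

This mirrors what the quoted `batch_dot`/`swapaxes`/`transpose` sequence computes, just expressed in one subscript string per step.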
Nice job! We will need to set up CI for this. I reviewed the markdowns in this pass.
Review still WIP.
- license headers for API files, including src/gluonnlp/cli/__init__.py (I have no idea how to add a review comment to an empty file)
package_dir={"": "src"},
zip_safe=True,
include_package_data=True,
install_requires=requirements,
consider adding tests_require
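A minimal sketch of what that suggestion would look like next to the quoted keywords (the `requirements` placeholder and the `pytest` package name are assumptions, not the project's actual test dependencies):

```python
from setuptools import setup

requirements = []  # placeholder; the real list is built elsewhere in setup.py

setup(
    package_dir={"": "src"},
    zip_safe=True,
    include_package_data=True,
    install_requires=requirements,
    tests_require=['pytest'],  # test-only dependencies, per the suggestion
)
```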
src/gluonnlp/__init__.py
from . import initializer
from . import initializer as init
stick to one for simplicity.
return F.np.concatenate([sin_emb, cos_emb], axis=-1)
else:
    return F.np.concatenate(
        [sin_emb, cos_emb, F.np.expand_dims(F.np.zeros_like(positions).astype(self._dtype),
padding op?
Should be similar
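For context, a NumPy-only sketch of the sin/cos construction under discussion. The function name and frequency schedule are my assumptions (following the usual Transformer formulation); only the zero-padding for an odd number of units mirrors the quoted snippet:

```python
import numpy as np

def sinusoidal_embedding(positions, units, dtype=np.float32):
    """Hypothetical sketch: sin/cos positional embedding with odd-units padding."""
    half = units // 2
    # log-spaced inverse frequencies (assumed schedule, base 10000)
    inv_freq = 1.0 / (10000 ** (np.arange(half) / max(half - 1, 1)))
    angles = positions[:, None] * inv_freq[None, :]
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    if units % 2:  # odd number of units: pad one zero column, as in the snippet
        emb = np.concatenate([emb, np.zeros((len(positions), 1))], axis=-1)
    return emb.astype(dtype)

out = sinusoidal_embedding(np.arange(5), 7)
assert out.shape == (5, 7)
assert np.allclose(out[:, -1], 0.0)  # the padded column is all zeros
```

Whether a dedicated padding op would be faster than the `concatenate` with a zeros column is the question raised above.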
@use_np
class SinusoidalPositionalEmbedding(HybridBlock):
Please add a description (docstring).
    return self.pooler(outputs)

@staticmethod
def get_cfg(key=None):
We should try to replace get_cfg and from_cfg with something automatic for blocks. @leezu suggestions?
from json import JSONDecodeError


class Registry:
how do we merge this with existing registries in mxnet?
This is a reusable registry class and there is no plan to merge this into MXNet.
why do we require yet another registry? what's missing in the existing ones?
Previously, we needed to create a registry.py file and reuse the MXNet registry (gluon-nlp/src/gluonnlp/data/registry.py, line 32 in 5dc6b9c: `def register(class_=None, **kwargs):`). Now we can simply write `MODEL_REGISTRY = Registry('model')`, `TOKENIZER_REGISTRY = Registry('tokenizer')` to support multiple use-cases.
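To make the usage concrete, here is a minimal sketch of what such a reusable registry class could look like (this is my illustration of the pattern, not the actual GluonNLP implementation):

```python
class Registry:
    """Hypothetical sketch of a standalone name -> class registry."""

    def __init__(self, name):
        self._name = name
        self._obj_map = {}

    def register(self, key):
        # decorator that binds a class to `key` in this registry
        def deco(cls):
            if key in self._obj_map:
                raise KeyError(f'{key!r} already registered in {self._name}')
            self._obj_map[key] = cls
            return cls
        return deco

    def get(self, key):
        return self._obj_map[key]

MODEL_REGISTRY = Registry('model')

@MODEL_REGISTRY.register('bert')
class BertModel:
    pass

assert MODEL_REGISTRY.get('bert') is BertModel
```

Because the class has no MXNet dependency, each namespace (`model`, `tokenizer`, ...) gets its own instance instead of sharing one global registry.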
- [SuperGLUE](./general_benchmarks/README.md#superglue-benchmark)
- [SentEval](./general_benchmarks/README.md#senteval-benchmark)

## Contribution Guide
We have moved to CONTRIBUTING.md.
I'll revise it later
@TOKENIZER_REGISTRY.register('spm')
class SentencepieceTokenizer(BaseTokenizerWithVocab):
Can we break it down into multiple files, each for 1 tokenizer?
Let me do it in a separate PR.
We can see the CI output at https://github.com/sxjscience/gluon-nlp/actions

Thanks for the examples. Still, it raises the question of why the registry in mxnet couldn't benefit from similar usability. If it could, I think it makes sense not to have multiple solutions for registries. It doesn't have to happen as part of this PR, but let's nail down a plan first.

@leezu could we also have the code coverage information? I'd like to get a good understanding of the current test coverage.

@szha Having a separate Registry class helps us drop the unnecessary dependency on MXNet for

codecov is enabled via ac16f2d. The main problem is that the tests are currently killed due to (?) OOM on Linux. This may be a bug in the memory management of mxnet, as the memory usage appears to grow over time. We can use a separate python process for each test file as a workaround: leezu@cc4e647
>>> nlp.models.albert
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'gluonnlp' has no attribute 'models'
modules are not exposed correctly
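The fix is to import the submodules in the package's `__init__.py` (e.g. `from . import models`), because a package only exposes a submodule as an attribute once that submodule has been imported. A small self-contained demonstration of the mechanism, simulated with bare module objects:

```python
import types

# A package with an empty __init__ exposes no submodule attributes:
pkg = types.ModuleType('pkg')
assert not hasattr(pkg, 'models')   # this is the AttributeError above

# Importing a submodule binds it on the parent package; simulated here
# by assigning the module object directly:
pkg.models = types.ModuleType('pkg.models')
assert hasattr(pkg, 'models')
```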
This reverts commit 6625af9. pytest-dev/pytest#1120
| Seed = 100 | Seed = 1234 | Seed = 12345 | Mean±std |
| ---------- | ----------- | ------------ | -------- |
| 26.61      | -           | -            | -        |
It seems quite low.
if v.grad_req != 'null':
    v.grad_req = 'add'
model.collect_params().zero_grad()
model_averager = AverageSGDTracker(model.collect_params())
Are you using suffix-averaged SGD, i.e., only averaging over the last K iterations/epochs, where K is a user-defined hyper-parameter?
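For reference, a tiny NumPy sketch of the running-average bookkeeping such a tracker typically does (this is my generic illustration; the actual AverageSGDTracker semantics, including whether it averages only a suffix of iterations, may differ):

```python
import numpy as np

class AverageTracker:
    """Hypothetical sketch: running mean of parameter values over steps."""

    def __init__(self, params):
        self.n = 0
        self.avg = {k: np.zeros_like(v) for k, v in params.items()}

    def step(self, params):
        # incremental update of the running mean: avg += (x - avg) / n
        self.n += 1
        for k, v in params.items():
            self.avg[k] += (v - self.avg[k]) / self.n

params = {'w': np.array([1.0])}
tr = AverageTracker(params)
tr.step({'w': np.array([1.0])})
tr.step({'w': np.array([3.0])})
assert tr.avg['w'][0] == 2.0   # mean of [1, 3]
```

Suffix averaging would simply start calling `step` only after the first iterations have passed.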
sacrebleu -t wmt13 -l ${SRC}-${TGT} --echo src > ${SAVE_PATH}/dev.raw.${SRC}
sacrebleu -t wmt13 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/dev.raw.${TGT}
sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo src > ${SAVE_PATH}/test.raw.${SRC}
sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
There are two versions of the test set. One is filtered and only has around 2700+ examples, while the other has 3000+ examples. Have you checked which one the above command gets?
It's the 2700+ version.
I think the current baseline uses the 3000+ version, but it has been a while and I am not sure about it.
self.output_filename = output_filename

# This puts one article per line
def merge(self):
I recently noticed that this wiki cleaner does not remove some useless documents containing only HTML tag fragments such as `colspan="2" style="background-color:`.
Also, these tags appear in the wiki texts for all languages.
I see. We should improve this.
@szhengac How about solving it in another PR?
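One cheap way to drop such documents is a line filter for attribute-only markup residue. A minimal sketch (the regex and function name are my assumptions, matching only the `key="value"` fragment pattern quoted above; a real cleaner would need broader rules):

```python
import re

# Matches lines that consist solely of attribute fragments like
# colspan="2" style="background-color:#eee" left over from wiki tables.
TAG_ONLY = re.compile(r'^\s*(?:\w+="[^"]*"\s*)+$')

def is_markup_residue(line):
    return bool(TAG_ONLY.match(line))

assert is_markup_residue('colspan="2" style="background-color:#eee"')
assert not is_markup_residue('Normal wiki sentence.')
```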
Commits:
- Merge conversion toolkits
- update unittests by fixing the version
- update datasets
- add scripts
- Delete __init__.py
- add src
- update
- Update setup.py
- Update setup.py
- update all tests
- revise test cases
- Update unittests.yml
- Update initializer.py
- Create preprocessing.py
- Update __init__.py
- Update attention_cell.py
- Update prepare_wmt.py
- move ubuntu + windows to TODO
"""
def __init__(self, model_path: Optional[str] = None,
             vocab: Optional[Union[str, Vocab]] = None,
             nbest: int = 0, alpha: float = 0.0, do_lower=False,
@haven-jeon I fixed the default value here to cope with BPE-dropout.
I have not looked into all the conversion scripts in detail in the last few commits, but I have no concerns so far about the API design, provided the previous comments are addressed.
Description

This is the new version of GluonNLP that uses the DeepNumpy interface of MXNet. Currently, we have the following functionality:

- Tokenizers support `encode_with_offsets`.
- The new way to load the pretrained model.
- `nlp_data` and `nlp_preprocess` utilities for downloading and preparing NLP data. There is no need to run perl scripts for preparing machine translation datasets; you can use these utilities directly.
- Multihead attention with NKT, NTK, and TNK layouts. We also support multihead attention with relative positional encoding; the methods include TransformerXL, Shaw, and T5.
- Tokenizers including whitespace, spacy, jieba, SentencePiece, YTTM, HuggingFaceBPE, HuggingFaceByteBPE, and HuggingFaceWordPiece, supporting the following basic functionalities.
- For example, the structure of the ALBERT-base model will look like this:

To try out some examples, go to `scripts/machine_translation` and `scripts/question_answering`. We will soon add back-translation, stochastic beam search, pretraining electra, and other functionalities.
cc @dmlc/gluon-nlp-team
Credit also goes to @zheyuye , @hymzoque, @XieBinghui, and @gongel