Add embedding related methods in numpy version #1263
Conversation
Codecov Report
@@            Coverage Diff             @@
##            numpy    #1263      +/-   ##
==========================================
+ Coverage   82.44%   82.67%   +0.23%
==========================================
  Files          38       41       +3
  Lines        5450     5702     +252
==========================================
+ Hits         4493     4714     +221
- Misses        957      988      +31
src/gluonnlp/embedding/evaluation.py
Outdated
x = x.reshape(-1, dim)
y = y.reshape(-1, dim)
x = mx.nd.L2Normalization(x, eps=eps).asnumpy()
y = mx.nd.L2Normalization(y, eps=eps).asnumpy()
How about moving this function to op.py and then reusing it here?
gluon-nlp/src/gluonnlp/attention_cell.py
Lines 303 to 318 in a646c34
def l2_normalize(F, data, axis=-1, eps=1E-6):
    """Normalize the data by L2 normalization.

    Parameters
    ----------
    F : mx.sym or mx.nd
    data : symbol or ndarray
    axis : int, default -1
    eps : float, default 1E-6

    Returns
    -------
    ret : mx.sym or mx.nd
    """
    ret = data / (F.np.linalg.norm(data, axis=axis, keepdims=True) + eps)
    return ret
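For reference, here is a pure-NumPy sketch of the suggested reuse. The real helper in op.py takes an extra F argument for symbol/ndarray dispatch, and cosine_similarity below is only illustrative, not the PR's actual evaluation code.

import numpy as np

def l2_normalize(data, axis=-1, eps=1E-6):
    # NumPy-only counterpart of the op.py helper shown above
    return data / (np.linalg.norm(data, axis=axis, keepdims=True) + eps)

def cosine_similarity(x, y, eps=1E-6):
    # Row-wise cosine similarity between two (N, dim) arrays, replacing the
    # mx.nd.L2Normalization calls from the outdated diff above
    x = l2_normalize(x.reshape(-1, x.shape[-1]), eps=eps)
    y = l2_normalize(y.reshape(-1, y.shape[-1]), eps=eps)
    return (x * y).sum(axis=-1)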
src/gluonnlp/embedding/embed_loader.py
    return None
def load_embeddings(vocab, pretrained_name_or_dir='glove.6B.50d', unknown='<unk>', |
How about using separate classes for handling different embedding vectors? You may just implement some basic functions. For example, implement the model.load() functionality to load the parameters, and a model.similarity(a, b, method=None) function to evaluate the similarity. That way it will be more structured and closer to how we will evaluate it in the paper.
class KeyedVector:
    def __init__(self):
        ...
    @classmethod
    def load(cls, path):
        ...
    def similarity(self, a, b):
        ...

class FastText:
    def __init__(self):
        ...
    @classmethod
    def load(cls, path):
        ...
If we just load the embedding vectors for a small set of vocabulary, we can also add some flags to the load function, e.g., def load(cls, path, vocab=None, num_tokens=None):, where vocab is an optional vocabulary object that helps reduce the number of tokens to load. For example, the data augmentation algorithm in TinyBERT only loads the first 100000 words for fast lookup: https://github.com/huawei-noah/Pretrained-Language-Model/blob/e670706c041246b975a3646bc6a27c48786f6c15/TinyBERT/data_augmentation.py#L75
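A rough sketch of what such flags could do; the name load_vectors and the plain-text 'token v1 v2 ...' file format are assumptions for illustration only, not the proposed API.

def load_vectors(path, vocab=None, num_tokens=None):
    # Read 'token v1 v2 ...' lines, optionally restricted to a given vocab
    # or to the first num_tokens entries (as in the TinyBERT example above).
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if num_tokens is not None and i >= num_tokens:
                break
            token, *values = line.rstrip().split(' ')
            if vocab is not None and token not in vocab:
                continue
            vectors[token] = [float(v) for v in values]
    return vectors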
I think we can set up a base class EmbeddingModel(HybridBlock) to serve as the base class of embedding models, and we can attach some evaluation functions to it. For just loading an embedding matrix, we can simply keep the current load_embeddings for users to manually set_data, or have a class WordEmbedding(EmbeddingModel) and move the functionality of load_embeddings into that class. Complex embedding models like FastText or a character-level CNN can be further implemented based on EmbeddingModel. These embedding models may be implemented in models/.
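One possible shape of that hierarchy, purely as a sketch; the class and method names below are not a final API.

from mxnet.gluon import HybridBlock, nn

class EmbeddingModel(HybridBlock):
    # Hypothetical base class; shared evaluation helpers could be attached here.
    def hybrid_forward(self, F, token_ids):
        raise NotImplementedError

class WordEmbedding(EmbeddingModel):
    # Plain lookup-table embedding whose weight could be filled via load_embeddings + set_data.
    def __init__(self, vocab_size, embed_size, **kwargs):
        super().__init__(**kwargs)
        self.embed = nn.Embedding(vocab_size, embed_size)

    def hybrid_forward(self, F, token_ids):
        return self.embed(token_ids)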
Is it possible to get the embedding of words in raw text if it's a HybridBlock? We may want to calculate the embedding for raw text or out-of-vocabulary words, which is the purpose of FastText.
For getting the embeddings of unknown words, there are the following situations:
1. Words in the vocabulary but not in the embedding file. The default method is to sample from a normal distribution, and users can choose a different initialization. In this case, we get an embedding matrix for a given vocabulary.
2. Words not in the vocabulary. I think generally we just use the embedding of <unk>. To further keep the information from these words, in practice we may use some character-level NN. For example, we may use a character-level CNN to compute the embedding of a word, whose parameters are learnable during training (see the sketch below). That's why I think we should create a base class such as EmbeddingModel.
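A minimal sketch of the character-level CNN idea, with made-up shapes and hyper-parameters:

from mxnet.gluon import HybridBlock, nn

class CharCNNEmbedding(HybridBlock):
    # Maps (batch, max_word_len) character ids to (batch, word_embed_size) word vectors.
    def __init__(self, num_chars, char_embed_size=16, word_embed_size=300, kernel=3, **kwargs):
        super().__init__(**kwargs)
        self.char_embed = nn.Embedding(num_chars, char_embed_size)
        self.conv = nn.Conv1D(word_embed_size, kernel_size=kernel, padding=kernel // 2)
        self.pool = nn.GlobalMaxPool1D()

    def hybrid_forward(self, F, char_ids):
        x = self.char_embed(char_ids)            # (batch, word_len, char_embed)
        x = F.transpose(x, axes=(0, 2, 1))       # Conv1D expects (batch, channels, length)
        x = self.pool(F.relu(self.conv(x)))      # (batch, word_embed, 1)
        return F.reshape(x, shape=(0, -1))       # (batch, word_embed)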
The advantage of fasttext is that there is no need to care about OOV words. Thus, you may still need to offer this functionality.
@acphile This is the advantage of using subwords. Basically, there will be no or fewer OOV words if you are using a subword representation. For example, GPT-2/GPT-3 chose to use byte-based BPE encoding because there will never be OOV words. Also, you may check Section 2.1 of https://arxiv.org/pdf/1911.03688.pdf to see how different models may adopt different strategies for dealing with the OOV problem.
I understand that, and in my context,
@acphile The problem is that it will be inefficient to have the tokenizer output all the ngram combinations. Instead, you ask the tokenizer to output a list of tokens, and each token will be converted to an embedding.
@acphile Is it possible to also refer to the implementation in gensim: https://radimrehurek.com/gensim/models/fasttext.html#module-gensim.models.fasttext?
For each token, gensim still outputs all ngrams to compute the corresponding embeddings: https://github.com/RaRe-Technologies/gensim/blob/c0e0169565116854993b22efef29e3c402ec6c69/gensim/models/fasttext_inner.pyx#L672. And they use hash buckets for converting ngrams to indexes: https://github.com/RaRe-Technologies/gensim/blob/c0e0169565116854993b22efef29e3c402ec6c69/gensim/models/fasttext.py#L1289. I think maybe we can make a supplement for hash lookup in vocab.
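A toy illustration of that n-gram/hash-bucket lookup; Python's built-in hash stands in for fastText's FNV-1a hash here, so the indices will not match a real model.

import numpy as np

def oov_vector(word, bucket_vectors, minn=3, maxn=6):
    # Hash the character n-grams of '<word>' into buckets and average their vectors.
    token = '<' + word + '>'
    ngrams = [token[i:i + n]
              for n in range(minn, maxn + 1)
              for i in range(len(token) - n + 1)]
    idx = [hash(g) % len(bucket_vectors) for g in ngrams]
    return bucket_vectors[idx].mean(axis=0)

buckets = np.random.rand(100000, 50).astype('float32')   # stand-in bucket table
print(oov_vector('gluonnlp', buckets).shape)              # (50,)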
I think it’s better to do it internally and the user may not need to care about it. Basically, we need a way to map raw text tokens to embedding vectors.
from .evaluation import CosineSimilarity, HyperbolicCosineSimilarity
from ..data import Vocab

class StaticEmbedding:
For the AutoML use-cases of data augmentation, we may need more efficient similarity implementations (e.g., https://github.com/facebookresearch/faiss). So associating a similarity function with an embedding class may not be a good way forward. For this PR we can focus on adding load_embeddings and an equivalent for the fasttext package, and it may not be necessary to add embed_container.py.
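For context, a sketch of the kind of lookup meant here, using faiss on an arbitrary random matrix (sizes are made up):

import faiss
import numpy as np

emb = np.random.rand(100000, 50).astype('float32')   # stand-in embedding matrix
faiss.normalize_L2(emb)                              # so inner product equals cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)
scores, neighbors = index.search(emb[:5], 10)        # top-10 neighbours of the first 5 rows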
src/gluonnlp/embedding/evaluation.py
Outdated
import numpy as np
from ..op import l2_normalize

__all__ = ['CosineSimilarity', 'ThreeCosMul', 'ThreeCosAdd', 'HyperbolicCosineSimilarity']
We can add these in a scripts/ folder as part of the evaluation scripts for the similarity and analogy datasets. We don't need to add them to the main API as part of this PR. (The scripts/ for evaluation are not required for AutoML, and you don't need to add them if you're not interested in trying it.)
Can you add some tests in https://github.com/dmlc/gluon-nlp/tree/numpy/tests?
tests/test_embedding.py
Outdated
with pytest.raises(ValueError):
    get_fasttext_model('wiki.multi.ar')
nit: newline.
LGTM
raise ValueError('Cannot recognize {} for the bin file'.format(source))
file_name, file_hash = C.FAST_TEXT_BIN_SHA1[source]
file_path = _get_file_path('fasttext', file_name, file_hash)
return fasttext.load_model(file_path)
I just noticed one issue. Can we support multiprocessing with FastText embedding?
Can you give a concrete example of multiprocessing with FastText embedding? I'm not very clear about it. Thanks!
Basically, for example:
fasttext_model = load_fasttext(...)
with multiprocessing.Pool(4) as pool:
    out = pool.map(..., fasttext_model.encode(...))
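For what it's worth, one common pattern is to load the model once per worker instead of pickling it; a sketch using the fasttext package ('wiki.en.bin' is just a placeholder path, and whether the underlying fasttext.cc object behaves well across processes is exactly the open question here):

import multiprocessing
import fasttext

_model = None

def _init(model_path):
    # Load the .bin model once in each worker; the C++ model object is not picklable.
    global _model
    _model = fasttext.load_model(model_path)

def embed(token):
    return _model.get_word_vector(token)

if __name__ == '__main__':
    tokens = ['hello', 'world']
    with multiprocessing.Pool(4, initializer=_init, initargs=('wiki.en.bin',)) as pool:
        vectors = pool.map(embed, tokens)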
Do you mean that we need a function which supports multiprocessing internally using fasttext.cc's APIs, like your example, or are you worried about whether fasttext.cc's APIs can work in the multiprocessing setting?
After offline discussion with Leo, this looks good, so let me merge it in.
Description
Create embedding related methods in 'gluonnlp.embedding' (a short usage sketch follows the list):
- embed_loader.list_sources: Get valid token embedding names and their pre-trained file names.
- embed_loader.load_embeddings: Load a pretrained embedding file to build an embedding matrix for a given Vocab.
- evaluation.CosineSimilarity: a function to compute the cosine similarity.
- evaluation.HyperbolicCosineSimilarity: a function to compute the cosine similarity in the hyperbolic space.
- evaluation.ThreeCosAdd: a class for the 3CosAdd analogy.
- evaluation.ThreeCosMul: a class for the 3CosMul analogy.
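An illustrative usage sketch; the Vocab construction and the exact import locations are assumptions, and the return type follows this PR's implementation:

from gluonnlp.data import Vocab
from gluonnlp.embedding import list_sources, load_embeddings

print(list_sources('glove'))                  # available GloVe files

vocab = Vocab(['hello', 'world'])             # assumed Vocab construction
matrix = load_embeddings(vocab, 'glove.6B.50d', unknown='<unk>')
print(matrix.shape)                           # (len(vocab), 50)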
About evaluation
Currently the implementations of embedding.evaluation are not very satisfactory. Suggestions are welcome.
Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team