spacybert: Bert inference for spaCy

spacybert is a spaCy v2.0 extension and pipeline component that attaches BERT sentence and document embeddings to Doc, Span and Token objects. The BERT backend is provided by the Hugging Face transformers library.

Installation

spacybert requires spacy v2.0.0 or higher.

Usage

Getting BERT embeddings for a single-language dataset

import spacy
from spacybert import BertInference
nlp = spacy.load('en')

Then either use BertInference as part of a pipeline,

bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=False)
nlp.add_pipe(bert, last=True)

Or use it on its own, without adding it to the pipeline:

bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=True)

The difference is that when set_extension=True, bert_repr is registered as a property extension on the Doc, Span and Token spaCy objects, so the embedding is computed when doc._.bert_repr is accessed. When set_extension=False, bert_repr is registered as an attribute extension with a default value of None, which is filled in when the component runs as part of the pipeline.
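The two behaviours can be illustrated with a plain-Python analogy. This is only a sketch and does not use spaCy or spacybert: `LazyDoc`, `EagerDoc`, `fake_embed` and `component` are hypothetical names invented here, and `fake_embed` merely stands in for a BERT forward pass.

```python
def fake_embed(text):
    # Hypothetical stand-in for a BERT forward pass: one number per word.
    return [float(len(word)) for word in text.split()]


class LazyDoc:
    """Analogy for a property extension: the value is computed on access."""

    def __init__(self, text):
        self.text = text

    @property
    def bert_repr(self):
        return fake_embed(self.text)


class EagerDoc:
    """Analogy for an attribute extension: None until a component fills it."""

    def __init__(self, text):
        self.text = text
        self.bert_repr = None


def component(doc):
    # Plays the role of the pipeline component: fills the attribute in place.
    doc.bert_repr = fake_embed(doc.text)
    return doc


lazy = LazyDoc("This is a test")
eager = EagerDoc("This is a test")
before = eager.bert_repr            # still the default value
after = component(eager).bert_repr  # filled once the component has run
```

In this analogy the lazy variant needs no pipeline step, while the eager variant stays at its default until the component has processed the document.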

Get the Bert representation / embedding.

doc = nlp("This is a test")
print(doc._.bert_repr)  # <-- torch.Tensor

Getting BERT embeddings for a multi-language dataset

import spacy
from spacy_langdetect import LanguageDetector
from spacybert import MultiLangBertInference

nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
bert = MultiLangBertInference(
    from_pretrained={
        'en': 'path/to/en_pretrained_bert_weights_dir',
        'nl': 'path/to/nl_pretrained_bert_weights_dir'
    },
    set_extension=False)
nlp.add_pipe(bert, after='language_detector')

texts = [
    "This is a test",  # English
    "Dit is een test"  # Dutch
]
for doc in nlp.pipe(texts):
    print(doc._.bert_repr)  # <-- torch.Tensor

When language_detector detects a language for which no pre-trained weights are specified, doc._.bert_repr is None by default.
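The fallback behaviour can be pictured as a dictionary lookup keyed by language code. The following is a minimal sketch in plain Python, not spacybert's implementation: `make_dispatcher` and the lambda "models" are hypothetical stand-ins for the per-language BERT models.

```python
def make_dispatcher(embedders, default=None):
    """Return a function that picks an embedder by two-letter language code."""

    def embed(text, lang):
        embedder = embedders.get(lang)
        if embedder is None:
            # No weights were configured for this language.
            return default
        return embedder(text)

    return embed


# Hypothetical stand-ins for per-language BERT models.
embedders = {
    "en": lambda text: ("en-model", len(text.split())),
    "nl": lambda text: ("nl-model", len(text.split())),
}
embed = make_dispatcher(embedders)

results = [
    embed("This is a test", "en"),
    embed("Dit is een test", "nl"),
    embed("Ceci est un test", "fr"),  # no French weights were configured
]
```

Code that consumes the embeddings should therefore check for None before using doc._.bert_repr on mixed-language input.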

Available attributes

The extension sets attributes on the Doc, Span and Token objects. You can change the attribute name with the attr_name setting when initializing the extension.

| name | type | description |
| --- | --- | --- |
| Doc._.bert_repr | torch.Tensor | Document BERT embedding |
| Span._.bert_repr | torch.Tensor | Span BERT embedding |
| Token._.bert_repr | torch.Tensor | Token BERT embedding |

Settings

On initialization of BertInference, you can define the following:

| name | type | default | description |
| --- | --- | --- | --- |
| from_pretrained | str | None | Path to a BERT model directory, or the name of Hugging Face transformers pre-trained BERT weights, e.g., bert-base-uncased |
| attr_name | str | 'bert_repr' | Name of the BERT embedding attribute to set on the ._ property |
| max_seq_len | int | 512 | Maximum sequence length for input to BERT |
| pooling_strategy | str | 'REDUCE_MEAN' | Strategy for generating a single sentence embedding from multiple word embeddings. See below for the available pooling strategies. |
| set_extension | bool | True | If True, 'bert_repr' is set as a property extension on the Doc, Span and Token spaCy objects. If False, 'bert_repr' is set as an attribute extension with a default value (None), which is filled in when the component runs in a pipeline. Set it to False if you want to use this extension in a spaCy pipeline. |
| force_extension | bool | True | If True, overwrite an existing extension attribute of the same name when the extension is initialized again |

On initialization of MultiLangBertInference, you can define the following:

| name | type | default | description |
| --- | --- | --- | --- |
| from_pretrained | Dict[LANG_ISO_639_1, str] | None | Mapping from two-letter language codes to the path to a model directory or the name of Hugging Face transformers pre-trained BERT weights |
| attr_name | str | 'bert_repr' | Same as in BertInference |
| max_seq_len | int | 512 | Same as in BertInference |
| pooling_strategy | str | 'REDUCE_MEAN' | Same as in BertInference |
| set_extension | bool | True | Same as in BertInference |
| force_extension | bool | True | Same as in BertInference |

Pooling strategies

| strategy | description |
| --- | --- |
| REDUCE_MEAN | Element-wise average of the word embeddings |
| REDUCE_MAX | Element-wise maximum of the word embeddings |
| REDUCE_MEAN_MAX | Apply both 'REDUCE_MEAN' and 'REDUCE_MAX' and concatenate the results. If the original word embedding has shape (768,), the output has shape (1536,) |
| CLS_TOKEN, FIRST_TOKEN | Take the embedding of only the first [CLS] token |
| SEP_TOKEN, LAST_TOKEN | Take the embedding of only the last [SEP] token |
| None | No reduction is applied and a matrix of embeddings per word in the sentence is returned |
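To make the strategies concrete, here is a minimal pure-Python sketch that applies each one to a small list of per-token vectors. It is not the library's code: `pool` is a hypothetical helper, plain lists stand in for torch tensors, and the sketch assumes that the mean and maximum are taken over every token row, including [CLS] and [SEP], which the text above does not specify.

```python
def pool(token_vectors, strategy="REDUCE_MEAN"):
    """Reduce a list of per-token vectors according to a pooling strategy."""
    if strategy is None:
        # No reduction: return the full matrix of per-token embeddings.
        return token_vectors
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if strategy == "REDUCE_MEAN":
        return [sum(col) / len(col) for col in columns]
    if strategy == "REDUCE_MAX":
        return [max(col) for col in columns]
    if strategy == "REDUCE_MEAN_MAX":
        # Concatenate mean and max, doubling the output dimension.
        return pool(token_vectors, "REDUCE_MEAN") + pool(token_vectors, "REDUCE_MAX")
    if strategy in ("CLS_TOKEN", "FIRST_TOKEN"):
        return token_vectors[0]
    if strategy in ("SEP_TOKEN", "LAST_TOKEN"):
        return token_vectors[-1]
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


# Three tokens ([CLS], one word piece, [SEP]) with 2-dimensional embeddings.
vectors = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
mean_max = pool(vectors, "REDUCE_MEAN_MAX")
```

With 2-dimensional inputs, `mean_max` has 4 entries, mirroring the (768,) to (1536,) doubling described in the table.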

Roadmap

This extension is still experimental. Possible future updates include:

  • Getting document representations from state-of-the-art NLP models other than Google's BERT.
  • Method for computing similarity between Doc, Span and Token objects using the bert_repr tensor.
  • Getting representation from multiple / other layers in the models.
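Until a similarity method exists, one possible approach is to compute cosine similarity on pooled embeddings yourself. The sketch below is an assumption about how that could be done, not part of spacybert: `cosine_similarity` is a hypothetical helper that works on plain lists of floats, so a pooled one-dimensional bert_repr tensor would first need converting (for example with `.tolist()`).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Cosine similarity ignores vector length, so it compares the direction of two embeddings rather than their magnitude.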