This repository is a codebase for probing and visualizing multilingual language models, specifically Multilingual BERT, based on the ACL'20 paper Finding Universal Grammatical Relations in Multilingual BERT . It draws heavily from the structural-probes codebase of Hewitt and Manning (2019). All code is under the Apache license.
-
Clone the repository.
git clone https://github.com/ethanachi/multilingual-probing-visualization cd multilingual-probing-visualization
-
[Optional] Construct a virtual environment for this project. Only
python3
is supported.conda create --name probe-viz conda activate probe-viz
-
Install the required packages. This mainly means
pytorch
,scipy
,numpy
,sklearn
, etc. Look at pytorch.org for the PyTorch installation that suits you and install it; it won't be installed viarequirements.txt
. Everything in the repository will use a GPU if available, but if none is available, it will detect so and just use the CPU, so use the pytorch install of your choice.conda install --file requirements.txt pip install pytorch-pretrained-bert
A significant portion of our paper relies on t-SNE visualizations generated from head-dependency pairs.
We provide a demo script that can be used to easily produce such visualizations, using a pretrained set of
probe parameters trained on either English or a concatenation of 10 languages.
Visualizations for English→French transfer and a joint multilingual space are currently available here, although this may move in the near future.
-
Download data:
bash downloadExamples.sh
This downloads pretrained probe parameters into examples/data
, as well as example data for English and French into the examples/{en, fr}
folders, using the name convention described earlier.
To test on other languages, download the dev split conllu
files into similarly-named directories.
-
Process data:
bash scripts/process_demo.sh examples/
This will write raw .txt
files and BERT hidden state data into the examples/
folder.
-
Generate tSNE visualizations:
python3 run_demo.py examples/
This will write an output directory with visualizations to disk---check the output logs.
-
Run the server and navigate to
localhost:8000
:Visualize: cd examples/results/2020-5-9-19-0-29-622752/tsne-2020-5-9-19-1-14-581865 python3 -m http.server
Experiments run with this repository are specified via yaml
files that completely describe the experiment (except the random seed.)
In this section, we go over each top-level key of the experiment config.
observation_fieldnames
: the fields (columns) of the conll-formatted corpus files to be used. Must be in the same order as the columns of the corpus. The final two fields must belangs
andembeddings
. Each field will be accessable as an attribute of eachObservation
class (e.g.,observation.sentence
contains the sequence of tokens comprising the sentence.)corpus
: The location of the train, dev, and test conll-formatted multilingual corpora files. Each oftrain_path
,dev_path
,test_path
will be taken as relative to theroot
field.embeddings
: The location of the train, dev, and test pre-computed multilingual embedding files (ignored if not applicable. Each oftrain_path
,dev_path
,test_path
will be taken as relative to theroot
field. -type
is ignored.keys
: A list of languages to be used for each split (train, dev, and test).batch_size
: The number of observations to put into each batch for training the probe. 20 or so should be fine.
dataset:
observation_fieldnames:
- index
- sentence
- lemma_sentence
- upos_sentence
- xpos_sentence
- morph
- head_indices
- governance_relations
- secondary_relations
- extra_info
- langs
- embeddings
corpus:
root: /u/scr/ethanchi/langs
train_path: train.conllu
dev_path: dev.conllu
test_path: test.conllu
embeddings:
type: token
root: /u/scr/ethanchi/hdf5
train_path: train-multilingual.hdf5
dev_path: dev-multilingual.hdf5
test_path: test-multilingual.hdf5
keys:
train: ['fr']
dev: ['en']
test: ['en']
batch_size: 20
hidden_dim
: The dimensionality of the representations to be probed. The probe parameters constructed will be of shape (hidden_dim, maximum_rank)model_type
: One ofELMo-disk
,BERT-disk
,ELMo-decay
,ELMo-random-projection
as of now. Used to help determine whichDataset
class should be constructed, as well as which model will construct the representations for the probe. TheDecay0
andProj0
baselines in the paper are fromELMo-decay
andELMo-random-projection
, respectively. In the future, will be used to specify other PyTorch models.use_disk
: Set toTrue
to assume that pre-computed embeddings should be stored with eachObservation
; Set toFalse
to use the words in some downstream model (this is not supported yet...)model_layer
: The index of the hidden layer to be used by the probe. For example,ELMo
models can use layers0,1,2
; BERT-base models have layers0
through11
; BERT-large0
through23
.tokenizer
: If a model will be used to construct representations on the fly (as opposed to using embeddings saved to disk) then a tokenizer will be needed. Thetype
string will specify the kind of tokenizer used. Thevocab_path
is the absolute path to a vocabulary file to be used by the tokenizer.
model:
hidden_dim: 768 # BERT hidden dim
model_type: BERT-disk
use_disk: True
model_layer: 6 # BERT-multilingual: (0,...,11)
multilingual: True
task_signature
: Specifies the function signature of the task. Supportsword_pair
for parse distance tasks,word
for single-word tasks, andword_label
for classification tasks (e.g. semantic roles). Our paper uses only theword_pair
setting.task_name
: A unique name for each task supported by the repository. Right now, this includesparse-depth
,parse-distance
, andsemantic-roles
.maximum_rank
: Specifies the dimensionality of the space to be projected into, ifpsd_parameters=True
. The projection matrix is of shape (hidden_dim, maximum_rank). The rank of the subspace is upper-bounded by this value. Ifpsd_parameters=False
, then this is ignored.psd_parameters
: though not reported in the paper, theparse_distance
andparse_depth
tasks can be accomplished with a non-PSD matrix inside the quadratic form. All experiments for the paper were run withpsd_parameters=True
, but settingpsd_parameters=False
will simply construct a square parameter matrix. See the docstring ofprobe.TwoWordNonPSDProbe
andprobe.OneWordNonPSDProbe
for more info.diagonal
: Ignored.prams_path
: The path, relative toargs['reporting']['root']
, to which to save the probe parameters.epochs
: The maximum number of epochs to which to train the probe. (Regardless, early stopping is performed on the development loss.)loss
: A string to specify the loss class. Right now, onlyL1
is available. The class withinloss.py
will be specified by a combination of this and the task name, since for example distances and depths have different special requirements for their loss functions.
probe:
task_signature: word_pair # word, word_pair
task_name: parse-distance
maximum_rank: 32
psd_parameters: True
diagonal: False
params_path: predictor.params
probe_training:
epochs: 30
loss: L1
root
: The path to the directory in which a new subdirectory should be constructed for the results of this experiment.observation_paths
: The paths, relative toroot
, to which to write the observations formatted for quick reporting later on.prediction_paths
: The paths, relative toroot
, to which to write the predictions of the model.reporting_methods
: A list of strings specifying the methods to use to report and visualize results from the experiment. Forparse-distance
, the valid methods are:spearmanr
uuas
image_examples
write_data
(writes data to disk in an easier-to-read format)adj_acc
(reports UUAS for prenominal and postnominal adjectives)tsne
(generates a t-SNE visualization, see the next section)pca
(generates a PCA visualization)unproj_tsne
(generates a t-SNE visualization, but using PCA for dimensionality reduction rather than the structural probe)visualize_tsne
(copies supporting HTML files to disk for easy visualization) When reportinguuas
, sometikz-dependency
examples are written to disk as well. Note thatimage_examples
will be ignored for the test set.
reporting:
root: example/results
observation_paths:
train_path: train.observations
dev_path: dev.observations
test_path: test.observations
prediction_paths:
train_path: train.predictions
dev_path: dev.predictions
test_path: test.predictions
reporting_methods:
- spearmanr
#- image_examples
- uuas
Generally speaking, the following steps are necessary to run arbitrary experiments:
-
For each language that you'd like to investigate:
-
Have a
conllu
file for the train, dev, and test splits of your dataset. These should each go in a folder named with the language code (e.g.path_to_conllus/en/train.conllu
). -
Convert each
conllu
file to plain text by running:python3 scripts/convert_conll_to_raw.py path_to_conllus/en/train.conllu path_to_conllus/en/train.txt
Repeat this for each split (train, dev, test) as appropriate.
-
Write contextual word representations to disk for each of the train, dev, and test split in
hdf5
format. The key to eachhdf5
dataset object should be{lang}-{index}
, where{lang}
is the language code of the sentence's language, and{index}
is the index of the sentence in its specificconllu
file. That is, your dataset file should look a bit like{'en-0': <np.ndarray(size=(1,SEQLEN1,FEATURE_COUNT))>, 'en-1':<np.ndarray(size=(1,SEQLEN1,FEATURE_COUNT))>...}
, etc. Note here thatSEQLEN
for each sentence must be the number of tokens in the sentence as specified by theconllx
file. To do this for Multilingual BERT, run the following script:python3 scripts/convert_raw_to_bert.py path_to_conllus/en/train.txt path_to_hdf5/train_multilingual.hdf5 multilingual lang
where lang
is the language code (e.g. en
). Note that all splits for a specific language should share the same hdf5
embeddings file.
-
Edit a
config
file fromexample/config
to match the paths to your data, as well as the hidden dimension and labels for the columns in theconllx
file. For more information, please consult the experiment config section of this README. -
Run an experiment with
python3 probing/run_experiment.py
.
Here are the steps to replicate the results for our ACL'20 paper:
- Download the train/dev/test splits for the following datasets:
- UD_Arabic-PADT
- UD_Chinese-GSD
- UD_Czech-PDT
- UD_English-EWT
- UD_Finnish-TDT
- UD_French-GSD
- UD_German-GSD
- UD_Indonesian-GSD
- UD_Latvian-LVTB
- UD_Persian-Seraji
- UD_Spanish-AnCora
-
move the datasets to folders labeled with language codes in
DATAPATH
, i.e.DATAPATH/en/{train, dev, test}.conllu DATAPATH/fr/{train, dev, test}.conllu ...
-
remove any sentences from the train sets larger than 512 tokens (the maximum sentence length for Multilingual BERT), that is:
ar
: annahar.20021130.0085:p18u1fi
: j016.2fr
: fr-ud-train_06464
-
Convert the conllx files to sentence-per-line whitespace-tokenized files, using
scripts/convert_conll_to_raw.py
. -
Download the random baseline:
bash download_random_baseline.sh
. This will download a.tar
file with the parameters for mBertRandom, a baseline with randomly-initialized parameters. Change the path inscripts/convert_raw_to_bert.py
to match your download path. -
Use
scripts/convert_raw_to_bert.py
to take the sentence-per-line whitespace-tokenized files and write BERT vectors to disk in hdf5 format. -
Replace the data paths (and choose a results path) in the yaml configs in
acl2020/*/*
with the paths that point to your conllx and .hdf5 files as constructed in the above steps. These 270 experiment files specify the configuration of all the experiments that end up in the paper.
If you use this repository, please cite:
@inproceedings{chi2020finding,
title={Finding Universal Grammatical Relations in Multilingual BERT},
author={Chi, Ethan A and Hewitt, John and Manning, Christopher D},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}