Accessing higher dimensions for unsupervised translation

Code for the paper Accessing higher dimensions for unsupervised translation, a simple and direct way to achieve unsupervised word translation that could conceivably be done in the 90s.

Requirements

A short dependency list with comments in requirements.txt: tokenizers, numpy, cython, pandas for data processing and wandb for experiment management. fasttext is required for comparison experiments but not for the core method / demo usage. The exact requirements are in environment.yml

Demo usage

Run python setup.py build_ext --inplace in fast/ to build fast/cooc_count.pyx
Download and process the raw data using Europarl English to Hungarian as the example
- Download the compressed text data wget https://www.statmt.org/europarl/v7/hu-en.tgz
- Extract plain text data: tar xzf hu-en.tgz to get europarl-v7.hu-en.{hu, en}
- shuffle the parallel data, for example
```
shuf -o europarl-v7.hu-en.en < europarl-v7.hu-en.en
shuf -o europarl-v7.hu-en.hu < europarl-v7.hu-en.hu
```
Run with default parameters with python test_coocmap.py
- The default demo setting uses 20MB of data and achieve an accuracy of > 50% in under 5min on 8 core CPUs x 2.4GHz
- If not already, the evaluation data will be downloaded by data.py to ~/.cache/cooc/muse_dicts (requires wget) for language pairs covered by MUSE evaluation data
- The default data paths and evaluation dictionary corresponds to this Europarl example, modify the default parameters accordingly to use other data source or language pairs
In my testing, this demo got 54% accuracy. The induced dictionaries are stored in wandb/latest-run/files/dump/*.csv among other log and run information.

Code files

match.py contains the main methods vecmap and coocmap as well as methods for clip / drop and initialization. The main self-learn loop is coocmapt

test_coocmap.py runs the demo code, parameters are inline

embeddings.py is adopted from vecmap, with additions for comparison with other association methods

data.py contains most scripts to download and process data. Most data is saved in ~/.cache/cooc/ by default

fast/ Cython code to count co-occurences faster than python.

experiments/ contains wandb sweeps configs for experiments and their corresponding scripts

License

The majority of coocmap is licensed under CC-BY-NC, however portions of the project are available under separate license terms: https://github.com/artetxem/vecmap/blob/master/embeddings.py is licensed under the GNU General Public License

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github		.github
experiments		experiments
fast		fast
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
baselines.py		baselines.py
data.py		data.py
embeddings.py		embeddings.py
environment.yml		environment.yml
evaluation.py		evaluation.py
match.py		match.py
requirements.txt		requirements.txt
test_coocmap.py		test_coocmap.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Accessing higher dimensions for unsupervised translation

Requirements

Demo usage

Code files

License

About

Releases

Packages

Languages

License

facebookresearch/coocmap

Folders and files

Latest commit

History

Repository files navigation

Accessing higher dimensions for unsupervised translation

Requirements

Demo usage

Code files

License

About

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages