pmi-embeddings

State-of-the-art count-based word vectors for low-resource languages with a special focus on historical languages, especially Akkadian and Sumerian.

  1. src/make_embeddings.py creates PMI-based word vectors from a source corpus.
  2. src/explore_embeddings.py runs simple queries on the embeddings. Requires a vector file and a dictionary (included in corpora/akkadian.zip).
  3. src/hypertune.py searches hyperparameter combinations by brute force to find the best settings for a given data set (requires a gold standard).
  4. corpora/extract_corpus.py extracts sense-disambiguated corpora from Korp-Oracc VRT files.
  5. corpora/akkadian.zip contains a zipped test corpus and a dictionary of the Akkadian language (use these to generate new embeddings and to explore them).
  6. eval/gold.tsv is an initial version of the Akkadian gold standard.

What are word embeddings and why are they useful?

Word embeddings represent words as real-valued vectors in a multi-dimensional vector space. Because the vectors encode contextual similarity, they can be used to find words that are highly interchangeable with each other. Thus, in addition to analogy and similarity tasks, word embeddings can be exploited in almost any NLP application, including sentiment analysis, spam detection, automatic chat moderation, document classification, and machine translation. This repository contains basic tools for lexicographic analysis, namely for exploring the vocabularies of historical languages on their own terms.
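
As a toy illustration (not code from this repository), the similarity between two word vectors is typically measured with the cosine of the angle between them. The vectors below are made up and two-dimensional; real embeddings produced by this tool have hundreds of dimensions:

    import numpy as np

    def cosine(u, v):
        # Cosine similarity: near 1.0 for vectors pointing the same way,
        # lower (or negative) for dissimilar directions.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Illustrative placeholder vectors only.
    king   = np.array([0.90, 0.80])
    queen  = np.array([0.85, 0.90])
    barley = np.array([-0.70, 0.20])

    print(cosine(king, queen))   # high  -> similar contexts
    print(cosine(king, barley))  # lower -> dissimilar contexts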

Jupyter tutorials

For those who prefer Jupyter Notebooks, src/jupyter_embeddings.ipynb shows how to build your own word embeddings with just a few lines of code, and src/jupyter_explore_embeddings.ipynb shows how to make queries against the embeddings.

For setting up a Jupyter environment, please read this guide by Niek Veldhuis. Note that you only need the packages listed below, most of which come preinstalled with Conda/Jupyter.

(NOTE: after a recent Gensim update, explore_embeddings.py no longer works. The word vectors themselves are fine and can be used with any other script.)

Requirements

Python 3.6 or newer with numpy, scipy, and scikit-learn. The evaluation scripts also require gensim.

Features

make_embeddings.py is a fast and efficient way to build PMI-based word embeddings from small text corpora (a few million words). It combines findings from several recent research papers; a schematic sketch of the core pipeline follows the list:

  • Dirichlet Smoothing (Turney & Pantel 2010; Jungmaier et al. 2020)
  • Context Similarity Weighting (Sahala & Linden 2020)
  • Shifted PMI (Levy et al. 2015) with different variants
  • Dynamic Context Windows (Sahlgren 2006; Mikolov et al. 2013; Pennington et al. 2014)
  • Subsampling (Mikolov et al. 2013)
  • Context Distribution Smoothing (Levy et al. 2015)
  • Eigenvalue Weighting (Caron 2001; Levy et al. 2015)
  • Dirty and clean stopwords (Mikolov et al. 2013; Levy et al. 2015)
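
The sketch below illustrates, under simplifying assumptions, how several of these components fit together: shifted positive PMI with context distribution smoothing (Levy et al. 2015), followed by truncated SVD with eigenvalue weighting (Caron 2001). This is schematic Python, not the repository's actual implementation, and all names and defaults are illustrative:

    import numpy as np
    from scipy.sparse.linalg import svds

    def shifted_ppmi(C, cds=0.75, shift=1.0):
        # C is a (V x V) word-context co-occurrence count matrix
        # (dense here for clarity; a sparse matrix is used in practice).
        total = C.sum()
        p_w = C.sum(axis=1) / total            # word marginal probabilities
        p_c = C.sum(axis=0) ** cds             # context distribution smoothing
        p_c = p_c / p_c.sum()
        p_wc = C / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / np.outer(p_w, p_c)) - np.log(shift)
        pmi[~np.isfinite(pmi)] = 0.0           # cells that never co-occur
        return np.maximum(pmi, 0.0)            # clip at zero: positive PMI

    def embed(M, dim=300, eig=0.5):
        # Truncated SVD; eig < 1 dampens the singular values (Caron 2001).
        U, S, _ = svds(M, k=dim)               # requires dim < min(M.shape)
        return U * (S ** eig)

    # Usage sketch: vectors = embed(shifted_ppmi(C), dim=100)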

Input format

Lemmatized, UTF-8 encoded text, one word per line. Use the symbol '#' to set window span constraints (i.e. text or paragraph boundaries), '_' to indicate lacunae (breaks in the cuneiform text), and '<stop>' to indicate stop words.
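
A short hypothetical input fragment (the Akkadian lemmas are placeholders for illustration):

    šarru
    <stop>
    bītu
    _
    ilu
    #
    šarru
    epēšu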

Output format

Word2vec-compatible word vectors in plain text format.
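
In the word2vec text format, the first line gives the vocabulary size and the vector dimensionality, and each following line gives a word and its vector. A made-up two-word, four-dimensional example:

    2 4
    šarru 0.021 -0.443 0.108 0.314
    bītu -0.115 0.027 -0.391 0.246

Because the format is word2vec compatible, the vectors can be loaded with gensim, for example:

    from gensim.models import KeyedVectors

    # "vectors.txt" is a placeholder for your output file name.
    kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)
    print(kv.most_similar("šarru"))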

Parameters and usage

Run the script from the command line: python3 make_embeddings.py corpusfile vectorfile [parameters]. See this document for detailed information about the parameters and references.
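
For example, with default parameters and a corpus file extracted from corpora/akkadian.zip (the file names here are placeholders):

    python3 make_embeddings.py akkadian_corpus.txt vectors.txt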

Runtime performance

On an Intel i5-2400 @ 3.10GHz, with a 1M-word corpus and a window size of 3, building basic embeddings takes ca. 35 seconds and CSW embeddings ca. 50 seconds. On a 2.1GHz Xeon Gold 6230 the runtimes are ca. 6 and 10 seconds, respectively. Although this is quite fast, testing hundreds or thousands of parameter combinations with hypertune.py may take a while.

Presentations and publications

This repository contains the up-to-date version of the scripts used in Sahala 2019: PMI+SVD and Semantic Domains in Akkadian Texts (a poster presented at the HELSLANG summer conference).

Version history

  • 2023-07-03 -- make_embeddings.py fixed shift type 1 formula.
  • 2021-01-26 -- make_embeddings.py added pmi-variants.
  • 2021-01-23 -- make_embeddings.py no longer saves null-vectors for words that occur in completely broken contexts (Gensim doesn't like them).
  • 2021-01-01 -- This is now the main tool for PMI-based word embeddings. Pmizer and Pmizer2 are no longer developed.

TODO:

  • Add parsing directly from Oracc using Niek's script
  • PMI-delta
  • TF-IDF filtering
  • Visualization
