Sphere embedding (s-code) is a variation of Euclidean embedding of co-occurrence data (code).
ai-ku/scode
SCODE Copyright (c) 2012-2014 Deniz Yuret

Algorithm:

Let X and Y be two categorical variables with finite cardinalities |X|
and |Y|. We observe a set of pairs (xi, yi) drawn IID from the joint
distribution of X and Y. The basic idea behind CODE and related methods
is to represent (embed) each value of X and each value of Y as a point
in a common low dimensional Euclidean space such that values that
frequently co-occur lie close to each other. S-CODE further restricts
these points to lie on the unit sphere. Multi variable CODE generalizes
this idea to more than two variables. Please see the following papers
for a description.

* Amir Globerson, Gal Chechik, Fernando Pereira and Naftali Tishby.
  Euclidean Embedding of Co-Occurrence Data. Journal of Machine Learning
  Research 8 (Oct), p.2265-2295, 2007. See Eq. 2, the MM likelihood
  which scode maximizes, and Sec. 6.2 for the multivariable extension.

* Yariv Maron, Michael Lamar, Elie Bienenstock. Sphere Embedding: An
  application to unsupervised POS tagging. (NIPS 2010) See Eq. 10, 11,
  12 for the update rule implemented by scode.

* Mehmet Ali Yatbaz, Enis Sert, and Deniz Yuret. 2012. Learning
  Syntactic Categories Using Paradigmatic Representations of Word
  Context. EMNLP 2012. Achieves state-of-the-art (80% accuracy) results
  in unsupervised part-of-speech induction for English.

Usage: scode [OPTIONS] < file
file should have columns of arbitrary tokens
  -r RESTART: number of restarts (default 1)
  -i NITER: number of iterations over data (default UINT32_MAX)
  -t THRESHOLD: quit if logL increase for iter <= this (default .001)
  -d NDIM: number of dimensions (default 25)
  -z Z: partition function approximation (default 0.166)
  -p PHI0: learning rate parameter (default 50.0)
  -u ETA0: learning rate parameter (default 0.2)
  -s SEED: random seed (default 0)
  -c calculate real Z (default false)
  -w The first line of the input is weights (default false)
  -v verbose messages (default false)

The input format is:
  token1 <tab> token2 <tab> ... token_n <newline>

The output format is:
  index:token <tab> count <tab> v1 <tab> v2 <tab> ... v_n <newline>
where index is the column (0, 1, 2, ...) in which the token was
observed, count is the number of times it was observed, and
[v1,v2,...] is its vector embedding (NDIM values).

To compile scode you need the gsl library. Otherwise everything is
standard C, so just typing make should give you an executable.

Other executables:

scode-online [options] < file: A complete rewrite of scode to avoid
memory problems with very large datasets. scode-online performs vector
updates on the fly and does not store the training set in memory. The
output format is similar to that of scode. The gsl dependencies in the
code have been removed; everything is standard C.

Interface differences with scode:

* Options like -r, -i and -t do not make sense for a single pass over
  streamed input and have been removed. scode-online stops and prints
  its output when the input runs out. To simulate -r, run scode-online
  multiple times with different seeds. To simulate -i, feed the
  training file to stdin multiple times (see the script "ncat" below).
  The counts will differ from the scode output in this case because
  scode-online does not know you are repeating the same data.

* Option -c to calculate the real Z has been removed; use scode-logl
  for that.

* Option -w to give different weights to different columns has been
  removed; all weights are assumed to be 1.

* A column in a training file (except the first) can be empty, in which
  case it is skipped (empty fields were denoted with /XX/ in the
  original scode). To support empty columns the training file is split
  strictly at single tab characters (after the last character of the
  line, '\n', has been removed). Multiple tabs denote multiple empty
  fields; spaces act as regular data.

ncat 10 file | scode-online > model: ncat writes the file to stdout n
times. Useful when you want scode-online to go through the training set
multiple times.
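The strict tab-splitting rule and the effect of ncat can be illustrated with a short Python sketch. This is not part of scode: the function names split_fields and repeat_lines are hypothetical, and the real tools may differ in detail.

```python
def split_fields(line):
    """Split one training line strictly at single tab characters."""
    # Remove only the trailing newline; spaces are ordinary data.
    if line.endswith("\n"):
        line = line[:-1]
    # str.split("\t") keeps empty strings, so two adjacent tabs
    # produce one empty field between them.
    return line.split("\t")


def repeat_lines(lines, n):
    """Yield the whole sequence of lines n times, like `ncat n file`."""
    # Materialized here only to keep the sketch short; a real tool
    # would reread the file instead of holding it in memory.
    lines = list(lines)
    for _ in range(n):
        yield from lines


print(split_fields("the\t\tbig dog\n"))  # ['the', '', 'big dog']
print(len(list(repeat_lines(["a\tb\n", "c\td\n"], 10))))  # 20
```

In the first call the second column is empty and the space in "big dog" stays inside the third field, which matches the rule that only tabs separate columns.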
scode-logl model < file: Computes the average log-likelihood of the data in file given the model output by scode or scode-online. Most of the running time is spent calculating the real Z, which takes about 450 seconds for two columns with a 50K vocabulary each.
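The following Python sketch shows why computing the real Z dominates the cost: it sums over every (x, y) combination in the two vocabularies. It is an illustration under stated assumptions, not the scode-logl source. It assumes the MM model has the form p(x,y) = p(x) p(y) exp(-|phi(x) - psi(y)|^2) / Z, that the marginals p(x) and p(y) are estimated from the counts stored in the model file, and that there are exactly two columns. The function names parse_model_line and average_logl are hypothetical.

```python
import math


def parse_model_line(line):
    """Parse 'index:token<TAB>count<TAB>v1<TAB>...' into its parts."""
    fields = line.rstrip("\n").split("\t")
    column, token = fields[0].split(":", 1)  # the token may contain ':'
    count = float(fields[1])
    vector = [float(v) for v in fields[2:]]
    return int(column), token, count, vector


def average_logl(model_lines, pairs):
    """Average log-likelihood of (x, y) pairs under the assumed MM model."""
    table = {0: {}, 1: {}}  # two-column case only
    for line in model_lines:
        column, token, count, vector = parse_model_line(line)
        table[column][token] = (count, vector)
    # Assumption: marginals are relative frequencies of the model counts.
    total = {c: sum(cnt for cnt, _ in table[c].values()) for c in table}

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Real Z: one term per (x, y) combination, O(|X| * |Y| * NDIM).
    z = 0.0
    for cx, vx in table[0].values():
        for cy, vy in table[1].values():
            z += (cx / total[0]) * (cy / total[1]) * math.exp(-sqdist(vx, vy))

    logl = 0.0
    for x, y in pairs:
        cx, vx = table[0][x]
        cy, vy = table[1][y]
        logl += (math.log(cx / total[0]) + math.log(cy / total[1])
                 - sqdist(vx, vy) - math.log(z))
    return logl / len(pairs)
```

With two columns of 50K tokens each, the double loop visits 2.5 billion combinations, which is consistent with Z being the expensive part of the computation.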