This package contains the scripts and Python tools for computing the semantic interpretability of topics via: (1) the word intrusion task; and (2) PMI/NPMI/LCP-based observed coherence.
- 2016-10-31: updated ComputeObservedCoherence to compute the mean coherence over multiple top-N words; e.g. the option "-t 5 10 15 20" computes coherence for the top-5/10/15/20 words and then takes the mean of the 4 values. Our latest study found that using multiple top-N words improves performance (see "The Sensitivity of Topic Coherence Evaluation to Topic Cardinality" in Other Related Papers).
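For illustration, here is a minimal Python sketch (not the package's implementation) of what this averaging amounts to: pairwise NPMI coherence is computed over the top-5/10/15/20 words of a topic and the four scores are averaged. The prob and joint_prob lookups are hypothetical placeholders; in the package the probabilities are derived from the counts sampled by ComputeWordCount.py over the reference corpus.

    import math
    from itertools import combinations

    def npmi(p_i, p_j, p_ij):
        # Normalised PMI: log(p_ij / (p_i * p_j)) / -log(p_ij); assumes p_ij > 0.
        return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

    def mean_coherence(top_words, prob, joint_prob, cardinalities=(5, 10, 15, 20)):
        # prob: word -> probability; joint_prob: frozenset of a word pair -> probability.
        per_n = []
        for n in cardinalities:
            pairs = list(combinations(top_words[:n], 2))
            per_n.append(sum(npmi(prob[a], prob[b], joint_prob[frozenset((a, b))])
                             for a, b in pairs) / len(pairs))
        return sum(per_n) / len(per_n)  # mean over the top-N settings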
- ComputeObservedCoherence.py: computes the observed coherence of topics (pairwise PMI/NPMI/LCP).
- ComputeWordCount.py: samples the word and word pair occurrences based on a reference corpus.
- ComputeWordIntrusion.py: computes the model precision of the word intrusion task (see the sketch after this file listing).
- data: contains the input files (topics and intruder words).
- GenSVMInput.py: generates the feature file for SVM.
- ref_corpus: contains the reference corpus.
- results: contains the computed results for the topics.
- run-oc.sh: the main script for computing the observed coherence.
- run-wi.sh: the main script for running the word intrusion task.
- SplitSVM: splits the feature file generated by GenSVMInput.py to do 10-fold cross validation.
- svm_rank: contains the SVM program and input feature files.
- wordcount: contains the word counts sampled by ComputeWordCount.py.
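For reference, here is a minimal sketch (not the package's code) of the quantity reported for the word intrusion task, assuming "model precision" follows the standard definition: the proportion of topics for which the automatically selected intruder matches the true intruder. In the package the intruder predictions come from the SVM ranker in svm_rank rather than from a function like this.

    def model_precision(predicted_intruders, true_intruders):
        # Fraction of topics whose predicted intruder matches the true intruder.
        assert len(predicted_intruders) == len(true_intruders)
        matches = sum(p == t for p, t in zip(predicted_intruders, true_intruders))
        return matches / float(len(true_intruders))

    # e.g. model_precision(["keyboard", "apple"], ["keyboard", "banana"]) == 0.5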
Pairwise PMI/NPMI/LCP observed coherence:
- Generate the topic file and put it in data/
- Set up the parameters in run-oc.sh
- Execute run-oc.sh
Word intrusion:
- Generate the topic file (with intruder words) and the intruder word file and put them in data/
- Set up the parameters in run-wi.sh
- Execute run-wi.sh
- Topic file: one line per topic, listing its top-N words.
- Topic file with intruder word: one line per topic, including the intruder word.
- Intruder word file: one line per intruder word (each line corresponds to the topic on the same line number).
Examples are given in data/.
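As a quick hypothetical illustration of the three formats (the words are made up, with "keyboard" as the intruder; the real examples in data/ are authoritative):

    Topic file:                    apple orange banana grape melon mango
    Topic file with intruder word: apple orange keyboard banana grape melon mango
    Intruder word file:            keyboard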
Parallel processing for sampling the word counts can be achieved by splitting the reference corpus into multiple partitions. The format of the reference corpus is one line per document, with the words tokenised (separated by white space). Best results are achieved by lemmatising the reference corpus (and the document collection on which the topic model is run). An example reference corpus is given in the package.
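A minimal sketch of the splitting step (the paths and file names here are placeholders, not the package's conventions): the corpus is partitioned round-robin so that ComputeWordCount.py can be run on each partition in parallel and the resulting counts combined.

    import os

    def split_corpus(corpus_path, out_dir, num_parts):
        # Distribute the documents (one per line) round-robin over num_parts files.
        os.makedirs(out_dir, exist_ok=True)
        parts = [open(os.path.join(out_dir, "part-%02d" % i), "w")
                 for i in range(num_parts)]
        with open(corpus_path) as corpus:
            for doc_id, doc in enumerate(corpus):
                parts[doc_id % num_parts].write(doc)
        for part in parts:
            part.close()

    # e.g. split_corpus("ref_corpus/corpus.txt", "ref_corpus/split", 8)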
- Debug OFF (in ComputeObservedCoherence.py/ComputeWordIntrusion.py): one score per line; each score corresponds to the topic on the same line.
- Debug ON (in ComputeObservedCoherence.py/ComputeWordIntrusion.py): scores, topics and intruder words (word intrusion task only) are displayed.
The sampling of word counts works for multi-word topics (i.e. topics containing phrases/collocations). Use the underscore symbol ("_") to join the words of a phrase/collocation, e.g. Topic 1: hello_world this_is_a_collocation apple orange banana durian
- MIT license - http://opensource.org/licenses/MIT.
- Jey Han Lau, David Newman and Timothy Baldwin (2014). Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden, pp. 530–539.
- David Newman, Jey Han Lau, Karl Grieser and Timothy Baldwin (2010). Automatic Evaluation of Topic Coherence. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles, USA, pp. 100–108.
- Jey Han Lau, Timothy Baldwin and David Newman (2013). On Collocations and Topic Models. ACM Transactions on Speech and Language Processing 10(3), pp. 10:1–10:14.
- Jey Han Lau and Timothy Baldwin (2016). The Sensitivity of Topic Coherence Evaluation to Topic Cardinality. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, USA, pp. 483–487.