A Chinese word segmentation (tokenization) toolkit for Python
GitHub: https://github.com/samurais/chop
Pypi: https://pypi.python.org/pypi/chop
Python 3
The code is compatible with Python 3.
Fully automated installation:

    easy_install chop

or

    pip install chop
    pip3 install chop
Interface

    from chop.hmm import Tokenizer as HMMTokenizer
    from chop.mmseg import Tokenizer as MMSEGTokenizer

    sentence = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作。"

    def main():
        HT = HMMTokenizer()
        MT = MMSEGTokenizer()
        print('HMM Tokenizer:', ' '.join(HT.cut(sentence)))
        print('MMSEG Tokenizer:', ' '.join(MT.cut(sentence)))

    if __name__ == '__main__':
        main()
- The code is easy to read, which makes it convenient for studying the algorithms.
- chop.[mmseg|hmm].Tokenizer object

    t = chop.mmseg.Tokenizer([dict_path="path to a custom dictionary"])

- t#cut(sentence[, punctuation=True])

Parameters:
    sentence: the Chinese sentence to segment.
    punctuation: when True (the default), punctuation marks are included in the output.
Returns:
    a generator that yields tokens.
Test a known bad case:

    ./scripts/test-badcase.sh "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
- MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm http://technology.chtsai.org/mmseg/
Other references: http://blog.csdn.net/nciaebupt/article/details/8114460 http://www.codes51.com/itwd/1802849.html
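The simplest variant of the maximum matching idea that MMSEG builds on — greedy forward maximum matching against a dictionary — can be sketched in a few lines of plain Python. This is an illustrative sketch only; the function name and toy dictionary below are not part of chop's actual API or data:

```python
# A minimal forward-maximum-matching sketch (illustrative, not chop's implementation):
# at each position, take the longest dictionary word that matches; otherwise emit
# a single character and move on.

def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily segment `sentence` using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(sentence):
        match = sentence[i]  # fall back to a single character
        for size in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

# Toy dictionary for illustration only.
demo_dict = {"交换机", "安装", "工作", "技术性", "器件"}
print(forward_max_match("交换机安装工作", demo_dict))  # ['交换机', '安装', '工作']
```

MMSEG proper goes further: when several candidate chunks overlap, it disambiguates with rules such as preferring the chunk combination with the greatest total length and the smallest variance in word length.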
- HMM & Viterbi:
Dict: https://github.com/Samurais/jieba/blob/master/jieba/dict.txt
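In an HMM segmenter, each character is tagged with one of the hidden states B/M/E/S (begin, middle, end of a word, or single-character word), and Viterbi decoding finds the most likely tag sequence. A generic, self-contained sketch of the decoder follows; the probability tables are made-up toy numbers, not chop's trained model:

```python
import math

# Toy Viterbi decoder over B/M/E/S character tags.
# All probabilities below are illustrative, not chop's trained parameters.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best log-probability, best state path) for the observation sequence."""
    # Initialize with start * emission probabilities (in log space; 1e-12
    # stands in for "impossible" to avoid log(0)).
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
          for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Best previous state for arriving at state s at time t.
            prob, prev = max(
                (V[t - 1][p] + math.log(trans_p[p].get(s, 1e-12))
                 + math.log(emit_p[s].get(obs[t], 1e-12)), p)
                for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return V[-1][best], path[best]

# Toy model: segmenting "中国" should tag it as one two-character word (B, E).
states = ['B', 'M', 'E', 'S']
start_p = {'B': 0.6, 'M': 0.05, 'E': 0.05, 'S': 0.3}
trans_p = {'B': {'M': 0.3, 'E': 0.7},
           'M': {'M': 0.3, 'E': 0.7},
           'E': {'B': 0.5, 'S': 0.5},
           'S': {'B': 0.5, 'S': 0.5}}
emit_p = {'B': {'中': 0.6}, 'M': {}, 'E': {'国': 0.6}, 'S': {'中': 0.1, '国': 0.1}}

logp, tags = viterbi("中国", states, start_p, trans_p, emit_p)
print(tags)  # ['B', 'E']
```

The state path maps back to a segmentation directly: a word boundary is emitted after every E or S tag.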
Development setup:

    virtualenv --no-site-packages -p /usr/local/bin/python3.6 ~/venv-py3

Run the test suite with debug logging enabled:

    CHOP_LOG_LVL=DEBUG ./scripts/test.sh