Getting Started

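The latest version of thulac4j is 1.3.0. It supports two segmentation modes:

SegOnly mode: word segmentation only, without part-of-speech tagging;
SegPos mode: word segmentation with part-of-speech tagging.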

// SegOnly mode
String sentence = "滔滔的流水，向着波士顿湾无声逝去";
SegOnly seg = new SegOnly("models/cws_model.bin", "models/cws_dat.bin");
System.out.println(seg.segment(sentence));
// [滔滔, 的, 流水, ，, 向着, 波士顿湾, 无声, 逝去]

// SegPos mode
SegPos pos = new SegPos("models/model_c_model.bin", "models/model_c_dat.bin");
System.out.println(pos.segment(sentence));
// [滔滔/a, 的/u, 流水/n, ，/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]
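
The SegPos output above prints each token as `word/tag`. As a rough illustration of working with that format, the sketch below assumes the tokens are available as plain `word/tag` strings; this page does not show how thulac4j represents them internally, and the class and method names (`TaggedTokenDemo`, `splitToken`, `wordsWithTagPrefix`) are hypothetical helpers, not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedTokenDemo {
    /** Splits a "word/tag" string at the last slash; a token without a slash gets an empty tag. */
    public static String[] splitToken(String token) {
        int idx = token.lastIndexOf('/');
        if (idx < 0) {
            return new String[] {token, ""};
        }
        return new String[] {token.substring(0, idx), token.substring(idx + 1)};
    }

    /** Keeps the words whose tag starts with the given prefix, e.g. "n" for n and ns. */
    public static List<String> wordsWithTagPrefix(List<String> tokens, String prefix) {
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            String[] parts = splitToken(token);
            if (parts[1].startsWith(prefix)) {
                words.add(parts[0]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumption: tokens copied by hand from the SegPos output shown above.
        List<String> tokens = Arrays.asList(
                "滔滔/a", "的/u", "流水/n", "，/w", "向着/p", "波士顿湾/ns", "无声/v", "逝去/v");
        System.out.println(wordsWithTagPrefix(tokens, "n"));
        // [流水, 波士顿湾]
    }
}
```

Splitting at the last slash rather than the first keeps a word intact even if the word itself contains a `/`.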

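SegOnly segments faster but is less accurate than SegPos. SegPos is more accurate, but it uses more memory and segments more slowly (see the performance test). Segmentation also requires the trained model data, which can be downloaded from http://thulac.thunlp.org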

thulac4j also supports a custom user dictionary:

seg.setUserWordsPath("<user-words-path>");

The custom dictionary lists one word per line, in the following format:

中国人
thulac4j
中文分词
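
A dictionary file in this one-word-per-line format can be produced with the standard library alone. The sketch below is a minimal example that writes the three words above to a temporary file and reads them back. UTF-8 encoding is an assumption (this page does not state which encoding thulac4j expects), and `UserWordsFileDemo` and `roundTrip` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class UserWordsFileDemo {
    /** Writes one word per line to a temporary file, then reads the lines back. */
    public static List<String> roundTrip(List<String> words) {
        try {
            Path file = Files.createTempFile("user-words", ".txt");
            // Assumption: UTF-8; adjust if the library expects another encoding.
            Files.write(file, words, StandardCharsets.UTF_8);
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.delete(file);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("中国人", "thulac4j", "中文分词")));
        // [中国人, thulac4j, 中文分词]
    }
}
```

In real use the file would be kept rather than deleted, and its path passed to `setUserWordsPath`.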

Traditional-to-Simplified Chinese conversion is supported:

Simplifier simplifier = new Simplifier();
String s = simplifier.t2s("世界商機大發現");
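
Conceptually, this kind of conversion can be seen as a character-by-character lookup. The sketch below illustrates the idea with a tiny hand-written table covering only the example string; it is not thulac4j's implementation, and `T2sSketch` and its three-entry table are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class T2sSketch {
    // Assumption: a toy mapping for illustration; a real table covers thousands of characters.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('機', '机');
        TABLE.put('發', '发');
        TABLE.put('現', '现');
    }

    /** Replaces each character found in the table; all other characters pass through unchanged. */
    public static String t2s(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(TABLE.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(t2s("世界商機大發現"));
        // 世界商机大发现
    }
}
```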

Stop-word filtering:

// `segmented` is a segmentation result obtained earlier, e.g. from seg.segment(sentence)
StopFilter stopFilter = new StopFilter();
stopFilter.filter(segmented);
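
To show what stop-word filtering does to a segmentation result, here is a minimal sketch in plain Java. It does not reproduce `StopFilter`: the stop-word set (`的` and `，`) is a hypothetical example, and `StopWordFilterDemo` and `removeStopWords` are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilterDemo {
    /** Returns a new list containing the words that are not in the stop-word set. */
    public static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Words copied from the SegOnly output shown earlier on this page.
        List<String> segmented = Arrays.asList(
                "滔滔", "的", "流水", "，", "向着", "波士顿湾", "无声", "逝去");
        // Assumption: a two-entry stop list chosen only for this example.
        Set<String> stopWords = new HashSet<>(Arrays.asList("的", "，"));
        System.out.println(removeStopWords(segmented, stopWords));
        // [滔滔, 流水, 向着, 波士顿湾, 无声, 逝去]
    }
}
```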
