Getting Started

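thulac4j supports two segmentation modes:

SegOnly mode: word segmentation only, without part-of-speech tagging;
SegPos mode: word segmentation together with part-of-speech tagging.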

// SegOnly mode
String sentence = "滔滔的流水，向着波士顿湾无声逝去";
SegOnly seg = new SegOnly("seg_only.bin");
System.out.println(seg.segment(sentence));
// [滔滔, 的, 流水, ，, 向着, 波士顿湾, 无声, 逝去]

// SegPos mode
SegPos pos = new SegPos("seg_pos.bin");
System.out.println(pos.segment(sentence));
// [滔滔/a, 的/u, 流水/n, ，/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]
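The SegPos output above prints each item as `word/tag`. As a minimal sketch of how such printed items could be split back into words and tags, assuming each item is available as a plain `word/tag` string (the source does not show the return type of `segment`, and `TaggedTokenDemo` and `TaggedWord` are hypothetical helpers, not part of thulac4j):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TaggedTokenDemo {
    /** A word with its part-of-speech tag (hypothetical helper type). */
    static final class TaggedWord {
        final String word;
        final String tag;

        TaggedWord(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Splits "word/tag" at the last slash, so a word containing '/' stays intact. */
    static TaggedWord parse(String token) {
        int slash = token.lastIndexOf('/');
        if (slash <= 0) {
            // No slash, or nothing before it: treat the whole token as an untagged word.
            return new TaggedWord(token, "");
        }
        return new TaggedWord(token.substring(0, slash), token.substring(slash + 1));
    }

    /** Returns the words whose tag equals the requested tag, in input order. */
    static List<String> wordsWithTag(List<String> tokens, String tag) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            TaggedWord tagged = parse(token);
            if (tagged.tag.equals(tag)) {
                result.add(tagged.word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "滔滔/a", "的/u", "流水/n", "，/w", "向着/p", "波士顿湾/ns", "无声/v", "逝去/v");
        System.out.println(wordsWithTag(tokens, "v")); // [无声, 逝去]
    }
}
```

Splitting at the last slash rather than the first is a deliberate choice here: it keeps a token whose word part itself contains a slash from being cut in the wrong place.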

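SegOnly segments faster but is less accurate than SegPos; SegPos is more accurate but uses more memory and segments more slowly (see the performance tests). Segmentation also requires the trained model data files seg_only.bin and seg_pos.bin. These can be downloaded (extract the archive after downloading), or generated, after cloning the source, by importing THULAC's trained model data: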

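// Build seg_only.bin from THULAC's segmentation-only model files
ThulacModel segOnlyModel = new ThulacModel("cws_model.bin", "cws_dat.bin", "cws_label.txt");
segOnlyModel.serialize("seg_only.bin");

// Build seg_pos.bin from THULAC's segmentation + POS-tagging model files
ThulacModel segPosModel = new ThulacModel("model_c_model.bin", "model_c_dat.bin", "model_c_label.txt");
segPosModel.serialize("seg_pos.bin");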
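Before running the conversion, it can help to confirm that all six THULAC source files are actually present. The following is a rough sketch using only the JDK; `ModelFileCheck` is a hypothetical helper, and the assumption that all files sit in one directory is mine, not the source's:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ModelFileCheck {
    /** Returns the names that are not readable files under dir, in the order given. */
    static List<String> missing(Path dir, String... names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (!Files.isReadable(dir.resolve(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the THULAC model files live in one directory (default: current directory).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Missing inputs for seg_only.bin: "
                + missing(dir, "cws_model.bin", "cws_dat.bin", "cws_label.txt"));
        System.out.println("Missing inputs for seg_pos.bin: "
                + missing(dir, "model_c_model.bin", "model_c_dat.bin", "model_c_label.txt"));
    }
}
```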

In addition, thulac4j supports a user-defined dictionary:

seg.setUserWordsPath("<user-words-path>");

The words in the user-defined dictionary are separated by line breaks, one word per line, in the following format:

中国人
thulac4j
中文分词
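A dictionary file in this one-word-per-line format could be produced programmatically. The sketch below uses only the JDK; `UserWordsFile` is a hypothetical helper, and UTF-8 is an assumed encoding, since the source does not say which encoding thulac4j expects:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UserWordsFile {
    /** Writes one word per line, trimming whitespace and skipping blank entries. */
    static Path write(Path target, List<String> words) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String word : words) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        // Assumption: UTF-8; the source does not state the expected encoding.
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("user-words", ".txt");
        write(file, Arrays.asList("中国人", "thulac4j", "中文分词"));
        System.out.println("Wrote user dictionary to " + file);
    }
}
```

The resulting path is what would be passed to `seg.setUserWordsPath(...)` as shown earlier.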

Traditional-to-Simplified Chinese conversion is supported:

Simplifier simplifier = new Simplifier();
String s = simplifier.t2s("世界商機大發現");
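To illustrate the idea behind such a conversion, here is a toy sketch of a character-by-character lookup. It is not thulac4j's `Simplifier` implementation: `TinyT2S` is hypothetical, and its three-entry table only covers the example sentence, whereas a real converter needs a full mapping table.

```java
import java.util.HashMap;
import java.util.Map;

class TinyT2S {
    // Hypothetical, deliberately tiny table: only the characters needed for the example.
    private static final Map<Character, Character> TABLE = new HashMap<>();

    static {
        TABLE.put('機', '机');
        TABLE.put('發', '发');
        TABLE.put('現', '现');
    }

    /** Replaces each character found in the table; all other characters pass through. */
    static String t2s(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = TABLE.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(t2s("世界商機大發現")); // 世界商机大发现
    }
}
```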

Stop-word filtering:

StopFilter stopFilter = new StopFilter();
stopFilter.filter(segmented);
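Conceptually, stop-word filtering removes every token that appears in a stop-word set. The sketch below shows that idea in plain Java. It is not thulac4j's `StopFilter`: `SimpleStopFilter` and its two-entry stop-word set are hypothetical, and this version returns a new list, whereas the library call above may well modify its argument in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SimpleStopFilter {
    private final Set<String> stopWords;

    SimpleStopFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    /** Returns a new list with every stop word removed; the input list is left unchanged. */
    List<String> filter(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop-word set chosen for this example only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("的", "，"));
        List<String> segmented = Arrays.asList(
                "滔滔", "的", "流水", "，", "向着", "波士顿湾", "无声", "逝去");
        System.out.println(new SimpleStopFilter(stopWords).filter(segmented));
        // [滔滔, 流水, 向着, 波士顿湾, 无声, 逝去]
    }
}
```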
