khmercut.cpp

A portable Khmer word boundary detection library using CRFSuite.

Build

git clone --recursive https://github.com/seanghay/khmercut.cpp.git
cd khmercut.cpp
mkdir build
cd build
cmake ..
make -j
./khmercut

Usage

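#include <iostream>
#include <string>

#include "crfsuite.hpp"
#include "crfsuite_api.hpp"
#include "khmercut.h"

int main(int argc, const char *argv[])
{
	// Load the CRF model; the path is relative to the build directory.
	CRFSuite::Tagger tagger;
	if (!tagger.open("../crf_ner_10000.crfsuite"))
	{
		std::cerr << "Failed to open the model file" << std::endl;
		return 1;
	}

	// Split the UTF-8 encoded Khmer text into word tokens.
	std::string text = "ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់";
	CRFSuite::StringList tokens = khmercut::tokenize(tagger, text);

	// Print each token in quotes so whitespace tokens stay visible.
	for (const auto &token : tokens)
	{
		std::cout << "\"" << token << "\"" << std::endl;
	}

	return 0;
}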

Result

"ឃាត់ខ្លួន"
"ជនសង្ស័យ"
"០៤"
"នាក់"
" "
"ករណី"
"លួច"
"ខ្សែភ្លើង"
" "
"នៅ"
"ស្រុក"
"ព្រៃនប់"
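
As the result above shows, spaces in the input come back as separate " " tokens. The following is a minimal sketch of how the token list could be post-processed, for example to drop whitespace tokens or to join the words with a visible separator. The helpers drop_whitespace_tokens and join_tokens are hypothetical names and are not part of khmercut; the sketch assumes that CRFSuite::StringList is a std::vector<std::string>.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of khmercut): returns a copy of the token
// list without empty tokens or tokens made only of ASCII whitespace.
std::vector<std::string> drop_whitespace_tokens(const std::vector<std::string> &tokens)
{
	std::vector<std::string> words;
	for (const auto &token : tokens)
	{
		if (token.find_first_not_of(" \t\r\n") != std::string::npos)
		{
			words.push_back(token);
		}
	}
	return words;
}

// Hypothetical helper (not part of khmercut): concatenates the tokens,
// placing the separator between each adjacent pair.
std::string join_tokens(const std::vector<std::string> &tokens, const std::string &separator)
{
	std::string joined;
	for (std::size_t i = 0; i < tokens.size(); ++i)
	{
		if (i > 0)
		{
			joined += separator;
		}
		joined += tokens[i];
	}
	return joined;
}
```

With the tokens from the usage example, join_tokens(drop_whitespace_tokens(tokens), "|") would give a single string with the words separated by "|". Passing the UTF-8 bytes of a zero-width space ("\xE2\x80\x8B") as the separator instead is one way to mark break opportunities for text layout.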

License

Apache-2.0

Reference

VietHoang1512/khmer-nltk
The model file crf_ner_10000.crfsuite was extracted from khmer-nltk using the script below:

from khmernltk.word_tokenize import model_path, load_model
import shutil

model = load_model(model_path)
shutil.copy(model.modelfile.name, "crf_ner_10000.crfsuite")
