boudinfl/kea

kea

kea is a simple rule-based tokenizer for French. The tokenization process is decomposed into two steps:

  1. A rule-based pass splits the text, using punctuation as an indication of token boundaries.

  2. A large-coverage lexicon is used to merge over-tokenized units (e.g. fixed contractions such as aujourd'hui are treated as one token).
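The two steps above can be illustrated with a minimal sketch. This is not kea's actual implementation: the function names, the regular expression, the tiny `LEXICON` set, and the `max_len` parameter are all assumptions made for illustration, and kea's real rules and lexicon are likely more elaborate.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real lexicon would be far larger.
LEXICON = {"aujourd'hui", "quelqu'un"}


def split_on_punctuation(text):
    """Step 1: split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def merge_with_lexicon(tokens, lexicon, max_len=3):
    """Step 2: rejoin adjacent tokens whose concatenation is a lexicon entry."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so the largest known unit wins.
        for n in range(max_len, 1, -1):
            window = tokens[i:i + n]
            if len(window) == n and "".join(window) in lexicon:
                merged.append("".join(window))
                i += n
                break
        else:
            # No lexicon entry starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def tokenize(text, lexicon=LEXICON):
    """Run both steps: punctuation-based split, then lexicon-based merge."""
    return merge_with_lexicon(split_on_punctuation(text), lexicon)


print(tokenize("Il arrive aujourd'hui."))
```

In this sketch the first step over-splits aujourd'hui into three pieces (aujourd, ', hui), and the second step restores it because the joined form appears in the lexicon. One simplification to be aware of: the merge ignores the original whitespace, so pieces that were separated by spaces in the input would also be rejoined.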

A typical usage of this module is:

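import kea

# Sentence to tokenize
sentence = "Le Kea est le seul perroquet alpin au monde."

# Create a tokenizer instance and split the sentence into tokens
keatokenizer = kea.tokenizer()
tokens = keatokenizer.tokenize(sentence)
print(tokens)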

['Le', 'Kea', 'est', 'le', 'seul', 'perroquet', 'alpin', 'au', 'monde', '.']
