uralicNLP.string_processing
Mika Hämäläinen edited this page Aug 11, 2020 · 7 revisions
The uralicNLP.string_processing module provides the following functions:
iso_to_name returns the English name of the language for a given ISO code:
from uralicNLP import string_processing
string_processing.iso_to_name("kpv")
>> Komi-Zyrian
char_split splits a word into characters more reliably than Python's own " ".join(word). It tries to keep diacritics attached to the character they belong to instead of separating them. Consider the following example:
from uralicNLP import string_processing
s = 'h̭ɛ̮ŋkkɐᴅ'
" ".join(s)
>> h ̭ ɛ ̮ ŋ k k ɐ ᴅ
string_processing.char_split(s)
>> ['h̭', 'ɛ̮', 'ŋ', 'k', 'k', 'ɐ', 'ᴅ']
In short, it takes a string and returns a list of its characters, each with its diacritics still attached.
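The grouping idea can be illustrated with the standard library alone. The following is a minimal sketch, not uralicNLP's actual implementation: the function name split_keep_diacritics is hypothetical, and it assumes that every character in a Unicode "Mark" category belongs to the base character before it.

```python
import unicodedata


def split_keep_diacritics(text):
    """Split text into characters, attaching combining marks to their base.

    Hypothetical helper for illustration; not part of uralicNLP.
    """
    chunks = []
    for ch in text:
        # Unicode general categories starting with "M" (Mn, Mc, Me) are marks
        # that modify the preceding character.
        if chunks and unicodedata.category(ch).startswith("M"):
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks


# Same word as in the example above, written with escapes so that the
# combining marks (U+032D, U+032E) are visible in the source.
word = "h\u032d\u025b\u032e\u014bkk\u0250\u1d05"
print(split_keep_diacritics(word))
```

A mark at the very start of the string has no base character, so this sketch keeps it as its own item.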
filter_arabic returns the parts of a text that are written in Arabic. The parameters are:
- text: the text to process
- keep_vowels=True: whether to keep the vowel diacritics; set it to False to remove them
- combine_by="": the string used to join the Arabic text fragments; it could be set to a space
from uralicNLP import string_processing
a = "تحميل PDF"
string_processing.filter_arabic(a)
>> تحميل
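To make the three parameters concrete, here is a rough sketch of how such a filter might be written with regular expressions. It is an assumption-laden illustration, not the library's code: the name filter_arabic_sketch is hypothetical, "Arabic" is approximated by the basic Unicode Arabic block (U+0600 to U+06FF), and "vowel diacritics" by the marks U+064B to U+0652.

```python
import re

# Assumption: Arabic text is any run of characters from the basic Arabic block.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+")
# Assumption: the vowel diacritics are the marks U+064B..U+0652.
ARABIC_VOWEL_MARKS = re.compile(r"[\u064B-\u0652]")


def filter_arabic_sketch(text, keep_vowels=True, combine_by=""):
    """Return the Arabic-script fragments of text joined by combine_by.

    Hypothetical stand-in for illustration; not uralicNLP's implementation.
    """
    fragments = ARABIC_RUN.findall(text)
    if not keep_vowels:
        # Strip the vowel marks from each fragment.
        fragments = [ARABIC_VOWEL_MARKS.sub("", f) for f in fragments]
    # Drop fragments that became empty after stripping.
    return combine_by.join(f for f in fragments if f)


print(filter_arabic_sketch("\u062a\u062d\u0645\u064a\u0644 PDF"))
```

With combine_by=" ", Arabic fragments separated by non-Arabic text come back as space-separated words instead of being concatenated.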
UralicNLP is an open-source Python library by Mika Hämäläinen