Urdu Stemmer

This is a Python-based Urdu stemmer. Given a list of words, it tries to find their stems using a limited list of affixes defined in the program.

stemmer.py: This file contains the logic and implementation of the stemmer. It uses regular expressions to find prefixes at the start of a word and suffixes at the end of the word.

The affixes currently present are:

urduPrefixes = ['بے', 'بد', 'لا', 'ے', 'نا', 'با', 'کم', 'ان', 'اہل', 'کم']
urduSuffixes = ['دار', 'وں', 'یاں', 'یں', 'ات', 'گوار', 'ور', 'پسند']

To find a prefix, it uses this regular expression:

checkPrefix = re.search(rf'\A{prefix}', urduWord)

To find a suffix, it uses this regular expression:

checkSuffix = re.search(rf"{suffix}\Z", urduWord)
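
The two checks above can be combined into a small stemming routine. The following is only a minimal sketch and not necessarily how stemmer.py does it: the function name strip_affixes is hypothetical, and it assumes that at most one prefix and one suffix are removed, taking the first match in list order. re.escape is added as a precaution so that an affix is always matched literally.

```python
import re

# Affix lists copied from the section above.
urduPrefixes = ['بے', 'بد', 'لا', 'ے', 'نا', 'با', 'کم', 'ان', 'اہل', 'کم']
urduSuffixes = ['دار', 'وں', 'یاں', 'یں', 'ات', 'گوار', 'ور', 'پسند']


def strip_affixes(urduWord):
    """Hypothetical helper: remove at most one prefix and one suffix.

    Assumption: the first affix that matches (in list order) wins.
    """
    for prefix in urduPrefixes:
        # \A anchors the match at the very start of the word.
        checkPrefix = re.search(rf'\A{re.escape(prefix)}', urduWord)
        if checkPrefix:
            urduWord = urduWord[checkPrefix.end():]
            break

    for suffix in urduSuffixes:
        # \Z anchors the match at the very end of the word.
        checkSuffix = re.search(rf'{re.escape(suffix)}\Z', urduWord)
        if checkSuffix:
            urduWord = urduWord[:checkSuffix.start()]
            break

    return urduWord
```

Note that this sketch does not guard against over-stemming: a word that merely happens to begin or end with one of the listed affixes is stripped as well.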

urdu-affixes.txt: This file contains the input words for stemmer.py. It contains two columns, which are read from right to left, the way Urdu text is read.

The words in the rightmost column are the input to the program. The stemmer reads them and finds their stems.
The words in the leftmost column are the actual stems of the words on the right. They were written manually to measure the accuracy of the program, i.e. how many stems the program found correctly.
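
The accuracy check described above could be sketched as follows. This is an illustration under stated assumptions, not the code in stemmer.py: the names parse_pairs and accuracy are hypothetical, the two columns are assumed to be separated by whitespace, and the input word is assumed to come first in each stored line (which is what a right-to-left display shows as the rightmost column).

```python
def parse_pairs(lines):
    """Hypothetical helper: turn lines into (input word, expected stem) pairs.

    Assumptions: columns are whitespace-separated and the input word is
    stored first on each line; lines without exactly two columns are skipped.
    """
    pairs = []
    for line in lines:
        columns = line.split()
        if len(columns) != 2:
            continue
        word, expected_stem = columns
        pairs.append((word, expected_stem))
    return pairs


def accuracy(pairs, stem_function):
    """Return the fraction of words whose computed stem equals the expected stem."""
    if not pairs:
        return 0.0
    correct = sum(1 for word, expected in pairs if stem_function(word) == expected)
    return correct / len(pairs)
```

To use it, read the lines of urdu-affixes.txt with open('urdu-affixes.txt', encoding='utf-8') and pass the stemming function as stem_function. The result is a value between 0 and 1.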
