This gem is intended to contain tools for Arabic Natural Language Processing. As of version 0.1, this toolkit gem allows you to:
- Clean a text using a stop list. This stop list was generated using the tf-idf scores of words from over 900 articles. The selected words were then checked and validated by hand, which resulted in a stop list of over 270 words.
- Stem a word or a text. The stemming algorithm used is the ISRI Arabic stemmer, described in the research paper "Arabic Stemming without a root dictionary". This root-extraction stemmer is similar to the Khoja stemmer but does not use a root dictionary, which can be laborious to maintain. Also, when the root cannot be found, the ISRI stemmer returns a normalized form rather than the original unmodified word. Overall, the ISRI stemmer has been shown to perform as well as, if not better than, the Khoja stemmer.
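To make the stop-list idea concrete, here is a minimal sketch of stop-word removal in plain Ruby. It is not the gem's implementation: the helper name `remove_stop_words` and the three-word `SAMPLE_STOP_WORDS` list are assumptions made up for illustration, and tokens are split on whitespace only.

```ruby
require 'set'

# Hypothetical three-word stop list, for illustration only. The gem's own
# list (over 270 words) is not reproduced here.
SAMPLE_STOP_WORDS = Set.new(%w[في من على]).freeze

# Drops every whitespace-separated token that appears in the stop list and
# joins the remaining tokens back together with single spaces.
def remove_stop_words(text, stop_words = SAMPLE_STOP_WORDS)
  text.split.reject { |token| stop_words.include?(token) }.join(' ')
end

puts remove_stop_words('ذهب الولد في الصباح')
```

A `Set` is used so that each membership check stays cheap even when the list grows to a few hundred words.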
Add this line to your application's Gemfile:
gem 'nlp_arabic'
And then execute:
$ bundle
Or install it yourself as:
$ gem install nlp_arabic
Once installed, you can use it like this:
NlpArabic.clean(text) will return the text without the stop words.
NlpArabic.stem(word) will return the word stemmed.
NlpArabic.stem_text(text) will stem an entire text.
NlpArabic.clean_and_stem(text) will do both.
NlpArabic.wash_and_stem(text) will stem the text removing stop words and delimiters from it.
NlpArabic.tokenize_text(text) will break the text into an array of words and delimiters.
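The calls listed above could be combined roughly as follows. This is a sketch rather than official usage: the helper `run_documented_calls` is a hypothetical name introduced here, it only uses the method names documented above, and it makes no claim about the exact values the gem returns. If the gem is not installed, the script prints a hint instead of failing.

```ruby
# Hypothetical helper: calls four of the documented entry points on any
# object that responds to them and collects the results in a hash.
def run_documented_calls(nlp, text)
  {
    tokens: nlp.tokenize_text(text),
    cleaned: nlp.clean(text),
    stemmed: nlp.stem_text(text),
    cleaned_and_stemmed: nlp.clean_and_stem(text)
  }
end

begin
  require 'nlp_arabic'
  sample = 'ذهب الولد في الصباح'
  run_documented_calls(NlpArabic, sample).each do |name, result|
    puts "#{name}: #{result.inspect}"
  end
rescue LoadError
  warn 'nlp_arabic is not installed; run `gem install nlp_arabic` first'
end
```

Passing the module in as an argument keeps the helper easy to try with a stand-in object when the gem is unavailable.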
Each step of the ISRI algorithm is coded in a separate function so you should be able to find the helper function you may be looking for just by browsing the code.
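As an illustration of what a single small step might look like, the sketch below removes Arabic diacritic marks, a common normalization step in Arabic text processing. The function name `strip_diacritics` is hypothetical and the code is not taken from the gem; whether the gem performs this step in exactly this way is an assumption.

```ruby
# Arabic tanwin, short-vowel, shadda and sukun marks occupy U+064B..U+0652.
ARABIC_DIACRITICS = /[\u064B-\u0652]/

# Hypothetical helper (not taken from the gem): returns the word with those
# diacritic marks removed and all other characters left untouched.
def strip_diacritics(word)
  word.gsub(ARABIC_DIACRITICS, '')
end

# The escaped string below is the three letters kaf, ta, ba, each followed
# by a fatha mark (U+064E).
puts strip_diacritics("\u0643\u064E\u062A\u064E\u0628\u064E")
```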
After checking out the repo, run bin/console for an interactive prompt that will allow you to experiment. For now the gem doesn't use any dependencies, so you don't need to run bin/setup.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release to create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
You are more than welcome to contribute to this project :) Please try to respect the Ruby style guidelines described here. The default encoding used is UTF-8.
- Fork it ( https://github.com/othmanela/nlp_arabic/fork )
- Create your feature branch ( git checkout -b my-new-feature )
- Write unit tests and make sure all of them (including the old ones) pass
- Commit your changes ( git commit -am 'Add some feature' )
- Push to the branch ( git push origin my-new-feature )
- Create a new Pull Request