Go-Sastrawi


Go-Sastrawi is a Go package for stemming words in the Indonesian language. It is based on Sastrawi for PHP by Andy Librian.

Stemming

According to Wikipedia, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. For example:

  • menahan => tahan
  • pewarna => warna

Usage Examples

The most basic usage relies on the default dictionary provided by Sastrawi:

package main

import (
	"fmt"
	"github.com/RadhiFadlillah/go-sastrawi"
)

func main() {
	// Original sentence
	sentence := "Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya. Baca berita selengkapnya di http://www.kompas.com."

	// Reduce each inflected word to its root form
	dictionary := sastrawi.DefaultDictionary()
	stemmer := sastrawi.NewStemmer(dictionary)
	for _, word := range sastrawi.Tokenize(sentence) {
		fmt.Printf("%s => %s\n", word, stemmer.Stem(word))
	}
}

Besides using the default dictionary, you can also create your own root word dictionary:

package main

import (
	"fmt"
	"github.com/RadhiFadlillah/go-sastrawi"
)

func main() {
	// Create new dictionary
	dictionary := sastrawi.NewDictionary("lapar")
	dictionary.Print("")

	// Add new words to dictionary
	dictionary.Add("ingin", "makan", "gizi", "enak", "lezat")
	dictionary.Print("")

	// Remove some words from dictionary
	dictionary.Remove("enak", "lezat")
	dictionary.Print("")

	// Use your new dictionary for stemming
	sentence := "Aku kelaparan dan menginginkan makanan yang bergizi."
	stemmer := sastrawi.NewStemmer(dictionary)
	for _, word := range sastrawi.Tokenize(sentence) {
		fmt.Printf("%s => %s\n", word, stemmer.Stem(word))
	}
}
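
A custom dictionary often comes from a word list rather than from literals in the source. The following is a minimal sketch using only the standard library, assuming a hypothetical file format of one root word per line with `#` marking comments. The helper names `readRootWords` and `rootWordsFromString` are illustrative and not part of Go-Sastrawi.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readRootWords reads one root word per line from r.
// Blank lines and lines starting with '#' are skipped;
// every word is trimmed and lowercased.
// The line-based format is an assumption made for this sketch.
func readRootWords(r io.Reader) ([]string, error) {
	var words []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := strings.ToLower(strings.TrimSpace(scanner.Text()))
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		words = append(words, line)
	}
	return words, scanner.Err()
}

// rootWordsFromString is a convenience wrapper for in-memory text.
func rootWordsFromString(text string) ([]string, error) {
	return readRootWords(strings.NewReader(text))
}

func main() {
	// Inline sample standing in for a file opened with os.Open.
	sample := "# custom root words\nlapar\ningin\n\nMakan\n"

	words, err := rootWordsFromString(sample)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println(words)
}
```

The resulting slice could then presumably be handed to the package with `sastrawi.NewDictionary(words...)`, assuming the constructor is variadic as the calls in the example above suggest.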

Sastrawi also provides a list of stop words that can be used to remove common words in the Indonesian language. This list is an ordinary Dictionary, so you can add or remove stop words depending on your purpose:

package main

import (
	"fmt"
	"github.com/RadhiFadlillah/go-sastrawi"
)

func main() {
	stopwords := sastrawi.DefaultStopword()
	dictionary := sastrawi.DefaultDictionary()
	stemmer := sastrawi.NewStemmer(dictionary)
	sentence := "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan"

	for _, word := range sastrawi.Tokenize(sentence) {
		if stopwords.Contains(word) {
			continue
		}

		fmt.Printf("%s => %s\n", word, stemmer.Stem(word))
	}
}
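
The example above only reads the stop word list. To illustrate how adding and removing entries changes the result, here is a standard-library sketch. `wordSet` is a hypothetical stand-in for the package's Dictionary type, the three stop words are a hand-picked sample rather than Sastrawi's actual list, and whitespace splitting is a simplification of `sastrawi.Tokenize`.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet is a hypothetical stand-in for the package's Dictionary type.
type wordSet map[string]struct{}

// newWordSet builds a set from the given words.
func newWordSet(words ...string) wordSet {
	s := make(wordSet)
	s.Add(words...)
	return s
}

// Add inserts words into the set.
func (s wordSet) Add(words ...string) {
	for _, w := range words {
		s[w] = struct{}{}
	}
}

// Remove deletes words from the set; missing words are ignored.
func (s wordSet) Remove(words ...string) {
	for _, w := range words {
		delete(s, w)
	}
}

// Contains reports whether word is in the set.
func (s wordSet) Contains(word string) bool {
	_, ok := s[word]
	return ok
}

// filterStopwords lowercases the sentence, splits it on whitespace
// and drops every token found in stopwords.
func filterStopwords(sentence string, stopwords wordSet) []string {
	var kept []string
	for _, token := range strings.Fields(strings.ToLower(sentence)) {
		if !stopwords.Contains(token) {
			kept = append(kept, token)
		}
	}
	return kept
}

func main() {
	// Hand-picked sample, not the package's default stop word list.
	stopwords := newWordSet("yang", "sedang", "dalam")
	sentence := "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan"
	fmt.Println(filterStopwords(sentence, stopwords))

	// Treat "indonesia" as noise for this corpus, but keep "dalam".
	stopwords.Add("indonesia")
	stopwords.Remove("dalam")
	fmt.Println(filterStopwords(sentence, stopwords))
}
```

With the real package, the same `Add` and `Remove` calls shown earlier for a custom dictionary should apply to the value returned by `sastrawi.DefaultStopword()`.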

Resources

Algorithm

  1. Nazief and Adriani Algorithm
  2. Asian, J. 2007. Effective Techniques for Indonesian Text Retrieval. PhD thesis, School of Computer Science and Information Technology, RMIT University, Australia. (PDF and Amazon)
  3. Arifin, A.Z., I.P.A.K. Mahendra and H.T. Ciptaningtyas. 2009. Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language, Proceeding of International Conference on Information & Communication Technology and Systems (ICTS). (PDF)
  4. A. D. Tahitoe, D. Purwitasari. 2010. Implementasi Modifikasi Enhanced Confix Stripping Stemmer Untuk Bahasa Indonesia dengan Metode Corpus Based Stemming, Institut Teknologi Sepuluh Nopember (ITS) – Surabaya, 60111, Indonesia. (PDF)
  5. Additional stemming rules from Sastrawi's contributors.

Root Words Dictionary

The stemming process in this package depends heavily on the root word dictionary. Sastrawi uses the root word dictionary from kateglo.com with some changes.
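
To see why the dictionary matters so much, consider a deliberately simplified sketch. It is not the Nazief and Adriani algorithm and not this package's implementation: the hypothetical `toyStem` function tries only three hand-picked affix rules and accepts a candidate only when it appears in the root word set, so a missing root leaves the word unstemmed.

```go
package main

import (
	"fmt"
	"strings"
)

// rootSet is a hypothetical stand-in for a root word dictionary.
type rootSet map[string]bool

// toyStem tries a few hand-picked affix rules and returns the first
// candidate found in roots. If nothing matches, the word is returned
// unchanged. Real Indonesian stemming needs many more rules.
func toyStem(word string, roots rootSet) string {
	if roots[word] {
		return word
	}

	var candidates []string
	// Suffix -an, e.g. "makanan" -> "makan".
	if strings.HasSuffix(word, "an") {
		candidates = append(candidates, strings.TrimSuffix(word, "an"))
	}
	// Prefix pe-, e.g. "pewarna" -> "warna".
	if strings.HasPrefix(word, "pe") {
		candidates = append(candidates, word[2:])
	}
	// Prefix men- with an elided "t", e.g. "menahan" -> "tahan".
	if strings.HasPrefix(word, "men") {
		candidates = append(candidates, "t"+word[3:])
	}

	for _, c := range candidates {
		if roots[c] {
			return c
		}
	}
	return word
}

func main() {
	full := rootSet{"tahan": true, "warna": true}
	partial := rootSet{"warna": true} // "tahan" is missing

	for _, w := range []string{"menahan", "pewarna"} {
		fmt.Printf("%s => %s (full), %s (partial)\n",
			w, toyStem(w, full), toyStem(w, partial))
	}
}
```

With the partial set, "menahan" comes back unchanged because no candidate can be confirmed as a root, which is the kind of behavior to expect when a root word is absent from the dictionary.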

License

Like Sastrawi for PHP, Go-Sastrawi is distributed under the MIT license. The root word dictionary is distributed by Kateglo under the CC-BY-NC-SA 3.0 license.

Sastrawi in Other Languages