An enhanced stemmer algorithm for Geez text: a long match approach
This study presents the development and enhancement of a stemming algorithm for Geez text. The stemmer follows a longest-match principle. It takes a Geez corpus as input, expands abbreviated words, removes punctuation marks, special characters, and numbers (normalization), removes stop words, identifies cases in which an apparent affix is not a real affix (exceptions), handles irregular words, and finally strips the affixes. Both the input corpus and the resulting corpus are in Geez, so no transliteration is needed. The stemmer is implemented with a user interface, which makes it easy to use for non-expert users and learners. The prototype was tested on three varied datasets of 2000 words and evaluated by manual error counting. It achieved an average accuracy of 87.22%, with over-stemming and under-stemming error rates of 8.31% and 4.35%, respectively. The overall accuracy is encouraging and shows that stemming can be performed with low error rates in morphologically rich languages such as Geez. The researchers also found that infixes affect the stems of Geez words, and that the stemmer can be used in developing morphological analyzers, parsers, spell checkers, thesauri, word frequency counters, and similar tools.
Keywords: Information Retrieval (IR), Stemming, Morphology, Natural Language Processing, Suffixes, Prefixes, Algorithm
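The processing stages listed in the abstract (normalization, stop-word removal, exception and irregular-word handling, longest-match affix stripping) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every word list, affix list, and function name below is a hypothetical placeholder, and the entries are toy Latin-script strings rather than real Geez morphology. The minimum stem length and the prefix-before-suffix order are likewise assumptions made for the example.

```python
# Hypothetical toy resources: Latin placeholders, NOT real Geez data.
ABBREVIATIONS = {"dr.": "doctor"}   # short form -> expanded form
STOP_WORDS = {"the", "and"}
EXCEPTIONS = {"being"}              # apparent affix is really part of the stem
IRREGULAR = {"went": "go"}          # irregular form -> stem
PREFIXES = ["un", "re"]
SUFFIXES = ["ing", "ings", "s"]
MIN_STEM_LEN = 2                    # assumed guard against over-stemming


def normalize(text):
    """Expand abbreviations, then drop punctuation, digits, and symbols."""
    tokens = []
    for raw in text.lower().split():
        raw = ABBREVIATIONS.get(raw, raw)
        # str.isalpha() is Unicode-aware, so letters of any script are kept.
        cleaned = "".join(ch for ch in raw if ch.isalpha())
        if cleaned:
            tokens.append(cleaned)
    return tokens


def strip_longest(word, affixes, is_prefix):
    """Remove the longest matching affix that leaves a long enough stem."""
    for affix in sorted(affixes, key=len, reverse=True):
        matches = word.startswith(affix) if is_prefix else word.endswith(affix)
        if matches and len(word) - len(affix) >= MIN_STEM_LEN:
            return word[len(affix):] if is_prefix else word[:-len(affix)]
    return word


def stem(word):
    """Stem one normalized word."""
    if word in EXCEPTIONS:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    word = strip_longest(word, PREFIXES, is_prefix=True)
    return strip_longest(word, SUFFIXES, is_prefix=False)


def stem_corpus(text):
    """Run the whole pipeline on a raw text string."""
    return [stem(tok) for tok in normalize(text) if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(stem_corpus("Dr. Abc went unwindings, and the 42 being!"))
```

Trying the affixes in order of decreasing length is what makes this a longest-match stripper: "unwindings" loses "ings" rather than only "s". Because the code operates directly on Unicode strings, the same structure would accept Ethiopic-script input without transliteration once the placeholder lists are replaced with real Geez resources.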