A word-class tagset for Setswana


Bertus van Rooy
Rigardt Pretorius

Abstract

This paper presents a general tagset for use in an
automatic word-class tagger, operating largely at the level of word classes
rather than purely morphological information. In view of the importance of
reusability, guidelines and standards for tagsets are identified, concentrating
on the standards proposed by the Expert Advisory Group on Language Engineering
Standards (EAGLES) within the framework of the European Union language
technology initiatives. Certain criteria for both tagsets and tag labels are
identified. Thereafter, problems and solutions for tokenisation in Setswana are
discussed, with emphasis on the challenge presented by the disjunctive
orthography and the agglutinative character of Bantu languages. The bulk of the
article is then devoted to the development of a tagset for the various
part-of-speech categories of Setswana, as a test for the extent to which the
EAGLES standards can be adopted and adjusted to make them suitable for an
agglutinating language. The conclusion is that this is indeed possible to a
large extent, with minor elaborations necessary, in particular as far as the
disjunctively written prefixes of verbs are concerned.

Southern African Linguistics and Applied Language Studies 2003, 21(4): 203–222

Journal Identifiers


eISSN: 1727-9461
print ISSN: 1607-3614