
Topic modeling of phonetic Latin-spelled Arabic for the relative analysis of genre-dependent and dialect-dependent variation


Ali Sakr
Mark Hasegawa-Johnson

Abstract

We demonstrate a data collection and analysis system that can be used to analyze the relative contributions of genre-dependent and dialect-dependent variation in the lexicon of speech-like Arabic text. We use Latent Dirichlet Allocation (LDA), a generative probabilistic modeling method, to analyze an online chat corpus written in phonetic Latin-spelled Arabic. The corpus exhibits different word choices and word relations depending on dialect, and can therefore aid in producing written forms of Arabic dialects despite the large differences between Standard Written Arabic and the many Arabic dialects.

Keywords: Topic Modeling, phonetic Latin-Spelled Arabic, LDA, Arabic online chat corpus analysis


Journal Identifiers


eISSN: 1111-0015