The application of statistical methods in the development of a Cyrillic-Latin converter for the Tatar language

A. V. Danilov, L. L. Salekhova, N. Anyameluhor


This article describes the development of a software product that converts Tatar text written in Cyrillic script into Latin script. It considers the aspects of Cyrillic-to-Latin conversion that are specific to the Tatar language. The authors study the statistical methods that the converter requires and analyze the speed and accuracy of the conversion algorithms. An algorithm was created and software modules were developed that convert messages written in the Tatar Cyrillic alphabet into the Tatar Latin alphabet. Verbal and algorithmic models of the conversion were constructed on the basis of normative documents and scientific works on the use of Latin script in the Tatar language. During development, it emerged that the conversion of a Tatar word depends on its origin: native Tatar words are converted according to the phonetic principle (каләм - qäläm), whereas borrowed words are converted according to the rules of transliteration. The main problem of the study is therefore determining the origin of a word. To solve it, the authors propose several algorithms. The work considers and develops software tools based on the statistical processing of linguistic data: combined bigram analysis, naive Bayesian classification, and direct search. Each of these algorithms is used to determine the etymology of a word, which in turn determines which rules of conversion from Cyrillic to Latin are applied. The result of the research is a software product capable of converting Tatar text from Cyrillic to Latin script. In the future, the authors plan to improve the software product and use it in educational activities.
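
To illustrate the word-origin step named above, the following minimal Python sketch classifies a word as native or borrowed with a naive Bayes model over character bigrams. It is not the authors' implementation: the class `BigramNaiveBayes`, the toy word lists in `TOY_WORDS`, the boundary markers `^` and `$`, and the add-one (Laplace) smoothing are all assumptions made for illustration, and a real converter would need far larger labelled word lists.

```python
import math
from collections import Counter


def bigrams(word):
    """Return the character bigrams of a word, with ^ and $ as boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramNaiveBayes:
    """Hypothetical naive Bayes classifier over character bigrams."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha   # smoothing constant (assumption: add-one smoothing)
        self.counts = {}     # label -> Counter of bigram frequencies
        self.totals = {}     # label -> total number of bigrams seen
        self.priors = {}     # label -> share of training words with that label
        self.vocab = set()   # all bigrams seen in training

    def fit(self, labelled):
        """Train on a dict that maps each label to a list of words."""
        n_words = sum(len(words) for words in labelled.values())
        for label, words in labelled.items():
            counter = Counter(bg for w in words for bg in bigrams(w))
            self.counts[label] = counter
            self.totals[label] = sum(counter.values())
            self.priors[label] = len(words) / n_words
            self.vocab.update(counter)
        return self

    def log_score(self, word, label):
        """Log prior plus the sum of smoothed log bigram likelihoods."""
        v = len(self.vocab) + 1  # one extra bucket for unseen bigrams
        score = math.log(self.priors[label])
        for bg in bigrams(word):
            numerator = self.counts[label][bg] + self.alpha
            denominator = self.totals[label] + self.alpha * v
            score += math.log(numerator / denominator)
        return score

    def predict(self, word):
        """Return the label with the highest log score."""
        return max(self.counts, key=lambda label: self.log_score(word, label))


# Toy training data, chosen only for illustration.
TOY_WORDS = {
    "native": ["су", "таш", "кул", "күл", "агач", "кош", "өй", "бүре", "урман", "кояш"],
    "borrowed": ["трактор", "телефон", "автобус", "фабрика", "станция", "компьютер"],
}

model = BigramNaiveBayes().fit(TOY_WORDS)
print(model.predict("кояш"), model.predict("трактор"))
```

In a converter of the kind described, the predicted label would select the rule set: the phonetic rules for a word judged native, and the transliteration rules for a word judged borrowed.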

