Standard Yorùbá Context Dependent Tone Identification Using Multi-Class Support Vector Machine (MSVM)

SOSIMI, AA.; ADEGBOLA, T; FAKINLEDE, OA.

Most state-of-the-art large vocabulary continuous speech recognition systems employ context dependent (CD) phone units; however, CD phone units are not efficient in capturing the long-term spectral dependencies of tone in most tone languages. Standard Yorùbá (SY) is a language composed of syllables with tones and requires a different method for acoustic modeling. In this paper, a context dependent tone acoustic model was developed. The tone unit is assumed to be the syllable. The average magnitude difference function (AMDF) was used to derive the utterance-wide F0 contour, followed by automatic syllabification and tri-syllable forced alignment with the speech phonetization alignment and syllabification (SPPAS) tool. For classification of the CD tone, the slope and intercept of the F0 values were extracted from each segmented unit. A supervised clustering scheme was used to partition the CD tri-tones by category, and the features were normalized using cluster statistics to derive the acoustic feature vectors. A multi-class support vector machine (MSVM) was used for tri-tone training. The experimental results show that the word recognition accuracy obtained from the MSVM tri-tone system based on dynamic programming tone embedded features was comparable with that obtained from phone features. The best parameter tuning was obtained for 10-fold cross validation, with an overall accuracy of 97.5678%. In terms of word error rate (WER), the MSVM CD tri-tone system outperforms the hidden Markov model tri-phone system with WER of 44.47%.

DOI: https://dx.doi.org/10.4314/jasem.v23i5.20

Copyright: Copyright © 2019 Sosimi et al. This is an open access article distributed under the Creative Commons Attribution License (CCL), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Dates: Received: 21 December 2019; Revised: 20 May 2019; Accepted: 25 May 2019

In recent times, Automatic Speech Recognition (ASR) has been of special interest to researchers. Its application domain has expanded from simple digit recognition systems to portable cross-language spontaneous dialogue systems, a development due mainly to improvements in computational power and in modeling approaches for representing the speech signal. While significant progress has been made in phone language ASR, a large number of issues remain unsolved, particularly for under-resourced languages, where annotated speech resources are limited (Eme and Uba, 2016). Tone languages make up a large proportion of the spoken languages of the world, and yet lexical tone is an understudied feature. This is attributed to unsettled questions about how to build the vocabulary, what should constitute the sub-word units, and how structures over these units are parameterized, modeled and trained. In languages such as SY, tone forms an integral element of the syllable and serves an essential function in distinguishing the meaning of syllables with the same phonological configuration. Tonal languages have distinctive tones, and the number of tones differs across languages. For example, SY, Thai, Cantonese and Hausa have three, five, nine and two lexical tones respectively. Tone languages also differ in how tones are realized: in some Asian languages, tones are identified by their shape (contour of the fundamental frequency) and pitch range (or register), while in some African languages, such as Standard Yorùbá, tones are distinguished by their relative pitch levels (Akinlabi and Liberman, 2001). As a result, a tone model built for one language cannot be applied universally to speech pattern classification. Classical ASR systems are based on context dependent tri-phone acoustic modeling and commonly use phone features, such as Mel-frequency cepstral coefficients (MFCCs), as input features.
This model and representation work well for phone recognition, but do not carry information about tone. Another challenge is the segmentation of sentences of a tonal language into words. In the SY writing and speaking system, the basic unit is the syllable and not the word. Consequently, this paper presents the design and implementation of a multi-class support vector machine for the recognition of SY context dependent tone, to provide arguments for the use of context dependent tone segments in SY ASR. In languages such as SY, tones are associated with syllables (Yang and Zhang, 2018). SY has seven possible syllable structures: consonant-vowel, consonant-vowel nasal, digraph-vowel nasal, digraph-vowel, vowel, vowel nasal and syllabic nasal. SY has three lexical tones: high, low and mid. In recent times, several models have been proposed for tone language ASR. These techniques fall into two main classes: (i) rule-based and (ii) data-driven approaches. Implementing the rule-based scheme requires eliciting rule-sets from knowledgeable experts. The drawbacks of this scheme are the generation, organization and representation of the interdependency of the rule-sets, as well as the unavailability of domain experts. These setbacks inspired the use of data-driven techniques for ASR (Kumalalo et al., 2010). The most commonly used generative models for tonal language ASR follow (i) embedded (Chen et al., 2014) and (ii) explicit (Kristine, 2017; Li et al., 2016) approaches. In the embedded scheme, tone recognition is based on multi-stream HMM decoding, while in the explicit scheme, syllables within an utterance are first identified via HMM forced alignment and tone recognition is then performed on each segmented syllable using a Gaussian mixture model (GMM). Compared with embedded tone modeling, the explicit tone modeling approach is capable of exploiting the supra-segmental nature of tones.
There are two major approaches to explicit tone modeling: sequence based tone modeling and segment based tone modeling (Chen et al., 2014). Because human articulation is sequential and the output of pitch related feature extraction is frame based, modeling tones with a sequential model is logical. Examples of sequence models include the hidden Markov model (HMM) and hidden conditional random fields (HCRF). A major weakness of sequence based models is that it is challenging for them to use segment based information from contextual tones. Hence, considerable effort is required to utilize the pitch information of CD syllables. Discriminative training models such as the Gaussian mixture model (GMM), support vector machines, neural networks and deep networks are alternative approaches to sequence based models. Lately, MSVM has been applied successfully to many different speech recognition applications, such as speaker verification, emotion classification and text classification (N. Yang et al., 2017). Aida-zade et al. (2016) implemented a speech recognition system using SVM. In that work, SVM was used to make decisions at frame level, and a Token Passing algorithm was used to obtain the chain of recognized words. Tombaloğlu and Erdem (2017) developed an SVM based recognizer in which MFCC features of Turkish speech were extracted and an SVM based classifier was explored alongside a new text comparison algorithm. The text comparison algorithm uses phoneme sequences to measure word similarity. Frihia and Bahi (2017) presented a combination of hidden Markov models (HMMs) and support vector machines (SVMs) to segment and label Arabic speech waveforms into phoneme units. The HMMs generate the sequence of phonemes and their boundaries; the SVM refines the boundaries and modifies the labels. The segmented and labelled units were used as the training sets. The system was evaluated based on word error rate (WER).
The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms the one built upon the generalized learning segmentation in terms of WER by about 0.05% on noisy data. The MSVM approach to context dependent tone recognition is particularly suitable for the current study. First, the CD tone recognition problem involves the conversion of a frame based pitch-related observation sequence into a fixed dimensional vector. Second, the number of CD tri-tones is limited, which reduces model confusability when compared to CD tri-phones, which require many hours of segmented and labelled speech units. Third, free software and tools for modeling and implementing MSVM are available. Hence, the objective of this paper is to develop a tri-tone acoustic model and explore the use of sub-segmental features for SY CD tone identification.

MATERIALS AND METHOD
The Standard Yorùbá context dependent tone identification problem is composed mainly of two steps: (1) MSVM model formulation and (2) implementation (training and testing).
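As a point of reference, the soft-margin SVM with an RBF kernel is commonly written as below. This is the textbook binary form, given under the assumption that the formulation referred to in this section follows it. The symbols $N$, $\phi$, $\sigma$ and the dual coefficients $\alpha_i$ are standard notation rather than taken from the source, and the multi-class decomposition (for example one-versus-one or one-versus-rest) is not specified here.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i \\
\text{subject to}\quad & y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,N \\[4pt]
f(x) &= \operatorname{sign}\!\left(\sum_{i=1}^{N}\alpha_i\, y_i\, K(x_i, x) + b\right), \qquad
K(x, x') = \exp\!\left(-\frac{\lVert x - x'\rVert^{2}}{2\sigma^{2}}\right)
\end{aligned}
```

Only training points with $\alpha_i > 0$ (the support vectors) contribute to $f(x)$, which is why the penalty $C$ and the kernel width $\sigma$ are the parameters usually tuned by cross validation.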
The bi-objective formulation is presented in Eqn. 1. The inequalities in Eqn. 2, 3, 4 and 5 are constraints. In the model, $w$ represents the hyperplane, $C$ is the penalty parameter for error on the training sample, $\xi_i$ is the slack variable, $y_i$ is the class label, $b$ is the un-regularized bias term and $f(x)$ is the decision function; $w$, $b$ and $\xi$ are the decision variables. The linear parameterization coefficients $\alpha$ are non-zero for the support vectors that determine the hyperplane. Optimizing Eqn. 6 requires the selection of a suitable kernel. In this paper, the radial basis function (RBF) kernel is used. The tunable parameters, such as $C$ and the kernel parameters, were selected through a combination of resampling techniques and a separate validation set.

Training and Testing: To train the MSVM, grapheme-to-phoneme conversion was performed based on a description of the SY orthography, followed by the syllabification algorithm and automatic phonemic alignment of the audio using speech phonetization alignment and syllabification (SPPAS), treating each speaker's utterances independently. Tone modeling additionally requires tone annotations to be generated from the SY syllable tier. The algorithm for producing the equivalent tone transcriptions is described in this section. It was implemented based on a description of the SY orthography, with each vowel bearing one of three distinct tones, high (H), low (L) or mid (M), and eighteen consonants (0). The pseudo code of the algorithm is presented below. MATLAB R2013a was used to implement the algorithm. This is followed by extracting and selecting the best feature subset for predicting the class label. The F0 contour was refined using least squares (LS), cubic spline (CS) and dynamic programming (DP) (Chen and Jang, 2008) to derive the Least Square, Cubic Spline and Dynamic Programming Embedded tone features respectively. The MATLAB POLYFIT function was used to obtain the slope and intercept of F0 over each segment.
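The slope and intercept extraction can be illustrated with a first-degree least-squares fit, which is what a degree-1 polynomial fit computes. The sketch below is in Python rather than the MATLAB used in the study, and it assumes the abscissa is the frame index within the segment (the source does not state the time axis). The function name `slope_intercept` is hypothetical.

```python
def slope_intercept(f0_segment):
    """Fit f0 ~ slope * t + intercept by least squares, with t = 0, 1, ..., n-1.

    Assumption: t is the frame index inside one syllable segment.
    """
    n = len(f0_segment)
    if n < 2:
        raise ValueError("need at least two F0 frames to fit a line")
    t_mean = (n - 1) / 2.0
    f_mean = sum(f0_segment) / n
    # Sums of squares and cross-products around the means
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (f - f_mean) for t, f in enumerate(f0_segment))
    slope = sxy / sxx
    intercept = f_mean - slope * t_mean
    return slope, intercept


# A rising contour in Hz, sampled at four frames
slope, intercept = slope_intercept([100.0, 110.0, 120.0, 130.0])
```

Each segment is thus reduced to two numbers regardless of its duration, which is how a frame based contour becomes a fixed dimensional vector.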

Pseudo code for the Generation of Tone Transcript
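A minimal sketch of tone transcript generation from SY orthography is given here, in Python rather than the MATLAB used in the study. It assumes the usual orthographic convention that an acute accent marks a high tone, a grave accent marks a low tone and an unmarked vowel is mid. It also assumes that every vowel letter is one tone-bearing unit and that only a tone-marked n or m is a syllabic nasal. These simplifications, and the name `tone_transcript`, are illustrative and are not taken from the authors' algorithm.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent -> high tone (H)
GRAVE = "\u0300"  # combining grave accent -> low tone (L)
VOWELS = set("aeiou")  # under NFD, dotted vowels become e/o plus a dot below


def tone_transcript(text):
    """Return one tone label (H, L or M) per tone-bearing letter in text."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    tones = []
    i = 0
    while i < len(decomposed):
        base = decomposed[i]
        # Collect the combining marks attached to this base letter
        j = i + 1
        marks = set()
        while j < len(decomposed) and unicodedata.combining(decomposed[j]):
            marks.add(decomposed[j])
            j += 1
        marked = ACUTE in marks or GRAVE in marks
        # Assumption: a tone-marked n or m is a syllabic nasal;
        # unmarked n or m is treated as a consonant and skipped.
        if base in VOWELS or (base in "nm" and marked):
            if ACUTE in marks:
                tones.append("H")
            elif GRAVE in marks:
                tones.append("L")
            else:
                tones.append("M")
        i = j
    return tones


labels = tone_transcript("Yorùbá")
```

Mid-tone syllabic nasals, which are usually unmarked, are not recovered by this sketch and would need the syllable tier from the alignment step.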
For a wider diversity of CD tri-tones and of the contextual influence of tones, the tri-syllable unit is shifted by one syllable and the refined features are normalized using the following normalization schemes:
F0 normalization by the minimum and maximum F0 of each cluster (Norm_F0_Min_Max).
F0 normalization by the mean F0 of each cluster (Norm_F0_Mean).
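The two per-cluster normalization schemes can be sketched as follows. The exact formulas are not given in the text, so this assumes min-max normalization rescales each value to the range [0, 1] and mean normalization divides each value by the cluster mean (subtracting the mean is the other plausible reading). The function names are hypothetical.

```python
def norm_min_max(values):
    """Rescale the values of one cluster to [0, 1] using its min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate cluster: every value is identical
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def norm_mean(values):
    """Divide the values of one cluster by the cluster mean (assumed form)."""
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("cluster mean is zero; ratio normalization undefined")
    return [v / mean for v in values]


cluster = [100.0, 150.0, 200.0]
scaled = norm_min_max(cluster)   # values rescaled within the cluster
ratios = norm_mean(cluster)      # values relative to the cluster mean
```

Both schemes are applied within a cluster, so the statistics are computed separately for each tri-tone category rather than over the whole corpus.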
In this study, a hybrid normalization scheme, presented in Eqn. 13, is also explored (where $k$ represents the slope or intercept vector). This results in four-dimensional feature vectors, namely absolute slope, absolute intercept, normalized slope and normalized intercept, where $w_i$ is the weight representing the contribution of each feature.

$$F_{0,\mathrm{hyb}}^{k} = w_1 F_{0,1}^{k} + w_2 F_{0,2}^{k} + w_3 F_{0,3}^{k} \qquad (13)$$

$$\text{subject to } \sum_{i=1}^{3} w_i = 1$$

The space of SY CD tri-tones is determined using the expression below.

$$\text{Dummy}_{\text{start}} + \eta^{r} + \text{Dummy}_{\text{end}}$$

where $\eta$ is the number of tones, which in the case of SY is 3, and $r$ is the number of items to be chosen, which for tri-tones is 3. SY therefore has 27 distinct CD tri-tone clusters plus two (2) dummy clusters, giving a total of 29. Having created the feature vectors, each tri-tone context is clustered based on its respective signature, as presented in Figure 2. For an unbiased estimation of the training algorithm, a combination of hold-out sampling and the whole data set is explored. Repeated $k$-fold cross validation is performed on the training set $D$. The value of $k$ is varied over $n$ runs (i.e. $k$ = 2, 4, 6, 8, 10, 15, 20, 50 and $n$ = 10, 20, 50, 100, 150, 200, 500, 1000) in order to determine the values of $\mu_k$, $\alpha^2$ and $\gamma$ that generate the best accuracy, where $\mu_k$ is the mean value taken over all possible $k$-fold cross validations over $D$, $\alpha^2$ is the conditional prediction error and $\gamma$ is the mean accuracy of the classifier trained on $D'$, taken over all training sets $D'$ of size $\frac{k-1}{k}|D|$. The optimal model is then learnt with respect to $k$, $n$, $\mu_{\mathrm{opt}}$, $\alpha_{\mathrm{opt}}$ and $\gamma_{\mathrm{opt}}$, the values that best describe $D$, where $\mu_{\mathrm{opt}}$, $\alpha_{\mathrm{opt}}$ and $\gamma_{\mathrm{opt}}$ are the mean taken over all possible $k$, the conditional prediction error and the mean accuracy respectively. The hold-out test set is used to estimate the accuracy of the model using the formula presented in Eqn. 15.
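The size of the tri-tone space described above can be checked by direct enumeration. The sketch below also shows a three-tone window shifted one syllable at a time. The dummy labels `<start>` and `<end>` and both function names are hypothetical, since the text does not say how the two dummy clusters are named or assigned.

```python
from itertools import product

TONES = ("H", "L", "M")  # high, low, mid


def tritone_clusters():
    """List the 3**3 tri-tone labels plus two dummy clusters."""
    clusters = ["".join(p) for p in product(TONES, repeat=3)]
    return ["<start>"] + clusters + ["<end>"]


def tritone_windows(tone_sequence):
    """Slide a three-tone window over a tone sequence, one syllable per step."""
    return [
        "".join(tone_sequence[i:i + 3])
        for i in range(len(tone_sequence) - 2)
    ]


inventory = tritone_clusters()                     # 27 tri-tones + 2 dummies
contexts = tritone_windows(["M", "L", "H", "M"])   # overlapping tri-tone units
```

Because the window moves by a single syllable, consecutive units overlap in two tones, which is what gives each tone both a left and a right context.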
Computer implementation of the Multiclass SVM Learning Algorithm was done using MATLAB.
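The repeated k-fold procedure described above can be sketched as an index generator, again in Python rather than MATLAB. It assumes each run reshuffles the data before splitting, so that different runs give different partitions. The classifier itself is left out, and the names `repeated_kfold` and `mean_accuracy` are hypothetical.

```python
import random


def repeated_kfold(n_samples, k, n_runs, seed=0):
    """Yield (run, fold, train_idx, test_idx) for n_runs repetitions of k-fold CV."""
    rng = random.Random(seed)
    for run in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # assumption: a fresh shuffle per run
        folds = [indices[f::k] for f in range(k)]
        for f in range(k):
            test_idx = folds[f]
            # Training set is every fold except the held-out one
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield run, f, train_idx, test_idx


def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies collected over all runs."""
    return sum(fold_accuracies) / len(fold_accuracies)


splits = list(repeated_kfold(n_samples=20, k=10, n_runs=3))
```

With 20 samples and k = 10, every training set holds 18 samples, matching the (k - 1)/k fraction of the data used for training in each fold.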

RESULTS AND DISCUSSION
In order to capture information between the voiced and unvoiced speech segments, pitch contour refinement schemes were implemented; a sample result of refining the broken pitch segment is illustrated in Figure 4. As shown in Figure 4, the baseline system using the DP refinement has the highest accuracy, 37.32%. The Least Square Embedded Tone (LSET) baseline system recorded an accuracy of 34.62%, and the CS refinement resulted in an accuracy of 31.52%. The highest accuracy over all the schemes was obtained at an Insertion Log Probability (ILP) between -18 and -15. The percentage of correctly recognized words increases as ILP increases, while the CD tri-phone model recorded a best accuracy of 44.57%. For the Multiclass SVM DP baseline, the best classification accuracy was recorded at an experimental setup of 200 runs and 10-fold cross-validation, as shown in Table 1. At this experimental setup, an accuracy of 87.9252% was recorded, as illustrated in Figures 4 and 5.
Having learnt the optimal model parameters (i.e. $k$, $n$, $\mu_{\mathrm{opt}}$, $\alpha_{\mathrm{opt}}$ and $\gamma_{\mathrm{opt}}$) that best describe the training set, the model was evaluated with the test data. At the optimal parameter settings, a tri-tone accuracy of 97.5678% was obtained; the confusion matrix and classification spectrum are presented in Figure 6. These results show that the tone acoustic model and pitch features are effective in tone classification. On comparing the performance of MSVM and HMM for CD tone classification, the MSVM yielded the best classification accuracies.

Conclusions: Standard Yorùbá context dependent tone identification using a multi-class support vector machine (MSVM) has been presented in this paper. The results led to three major conclusions. The SY CD tone recognition problem can be implemented with MSVM and HMM. The accuracy rates achieved using the MSVM were found to be higher than those of the HMM on the validation data sets. However, the performance of MSVM in modelling the time sequential nature of continuous utterances has not been reported. In addition, its ability to handle dialectal variations, which are essential characteristics of the SY language, is yet to be determined.