Meta Learning Approach to Phone Duration Modeling

One of the essential prerequisites for achieving naturalness in synthesized speech is the automatic prediction of phone duration, owing to the high importance of segmental duration in speech perception. In this paper we present a new phone duration prediction model for the Serbian language based on a meta learning approach. Using data obtained from the analysis of a large speech database, we employed a feature set of 21 parameters describing phones and their contexts. These include attributes related to segmental identity and manner of articulation (for consonants), attributes related to phonological context, such as the segment types and voicing values of neighboring phones, the presence or absence of lexical stress, morphological attributes, such as part of speech, and prosodic attributes, such as phonological word length, the position of the segment in the syllable, the position of the syllable in the word, the position of the word in the phrase, phrase break level, etc. The phone duration model obtained using the meta learning algorithm outperformed the best individual model by approximately 2.0% and 1.7% in terms of the relative reduction of the root-mean-squared error and the mean absolute error, respectively.


INTRODUCTION
The temporal segmental organization of spoken language is the result of many factors, which are mutually dependent and intricately interconnected [1]. These involve various physiological, phonological, morphological, syntactic and prosodic factors, and understanding their impact on speech is essential both for understanding the process of speech production and for the development of high-quality synthesized speech [2]. A text-to-speech (TTS) system therefore requires a specialized module for segmental duration modeling, which has to take into consideration all the relevant factors.
Two main types of models for predicting segmental duration have been used in TTS systems: rule-based models and corpus-based models.
The oldest, and still rather popular, rule-based model for predicting duration is the one developed by Dennis Klatt [1]. One of the main shortcomings of Klatt's model is that it may lead to over-generalization, especially in the presence of exceptional cases. On the other hand, rule-based models are convenient because they do not require large speech corpora. This was particularly important at the time when they dominated speech synthesis, since the computational resources needed for generating and analyzing large speech corpora were not as readily available as they are today.
With the advancement of computer technology, corpus-based statistical models have become more prevalent. These models require a large corpus of spoken language, because the modeling is done by applying a machine learning algorithm to the corpus. Various machine learning approaches have been applied to phone duration modeling, such as artificial neural networks [3,4], decision trees [5][6][7][8][9][10], Bayesian models [11], and instance-based algorithms [12].
In this paper we present phone duration modeling for the Serbian language using a meta learning algorithm. The modeling was carried out using five different types of individual models as well as the proposed model. The performance of these models was evaluated by objective measures: the root-mean-squared error (RMSE), mean absolute error (MAE) and correlation coefficient (CC).
This paper is organized as follows. The Introduction gives an overview of approaches to duration modeling, focusing on the significance of phone duration modeling in speech synthesis. Section 2 describes the speech database used for extracting the set of relevant features and modeling phone duration, and gives a detailed description of the feature set relevant for the Serbian language. The phone duration modeling process using the meta learning algorithm is described in Section 3. The experimental results are presented and discussed in Section 4. Section 5 contains the concluding remarks and proposes further lines of research.

FEATURE SET FOR THE SERBIAN LANGUAGE
In order to predict the duration of a speech segment in a given context, a TTS system also requires a module that automatically generates the appropriate feature vector for each phoneme in the speech database. This module precedes the module for predicting the duration of speech segments in the process of speech synthesis.
Most of the factors that influence the duration of segments are universal, and they therefore affect the durational features of segments cross-linguistically. However, some factors may be more marked in some languages than in others, so it is important to select the language-specific factors when developing a model of phone duration in a speech synthesizer for a particular language. We therefore selected the factors which have been researched in the literature across different languages [1,[6][7][8][9], as well as those found in previous studies concerning the effect of various factors on the duration of phonemes in the Serbian language [13,14]. These factors have been extracted from a database of spoken Serbian, recorded for the purpose of developing the speech synthesizer for Serbian [15]. This corpus contains approximately 2,000 sentences (16,000 words) of read texts taken from the daily press, as is typical for such purposes. The texts were read by a female professional radio announcer, a native speaker of Serbian who speaks the Ekavian standard dialect, and recorded in a soundproof studio at an 88.2 kHz sampling rate. The recorded material was annotated for phonetic and prosodic features. Temporal alignment at the phonetic level was done using the AlfaNumASR speech recognition system [16], while the correction of phone labels was done manually by means of the AlfaNum TTSLabel software [16]. Prosodic annotation included labels for the four types of lexical stress (long-falling, long-rising, short-falling and short-rising), with additional marking of post-tonic long syllables (post-accentual length), as well as marking of focused elements and phrase break levels. Prosodic annotation was carried out manually using the AlfaNum TTSLabel software [16].
Each phoneme in the speech database is assigned the appropriate feature vector, which contains the information on the segment itself and the context in which it occurs.
The remainder of this section lists the relevant factors and their potential values in the Serbian language, classified according to the domain of their impact.
• Nature of the segment (segment identity): Serbian has 30 phonemes, 5 vowels and 25 consonants. However, the labeling also had to include two different realizations of the semi-phone schwa /ə/, which is the vocalic element of the consonant /r/: the first type of /ə/ belongs to the phoneme /r/ at syllable margins, and the second occurs when it is part of syllabic /r/ [17]. Stops and affricates are labeled as pairs of semi-phonemes, segmented into sequences of occlusion and burst, and of occlusion and friction, respectively. The total number of different consonant labels is therefore 36, and a total of 43 different segment values are accounted for.
• Phrase break level and position: the break levels were determined on the basis of different perceptually detected relevant discontinuities in the speech chain. In the case of longer intervals of silence (major break), a distinction can be made between the initial, medial and final position of a word in the prosodic unit.
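As an illustration, such a feature vector can be sketched as a simple record. The attribute names and values below are hypothetical; they merely exemplify the kinds of information described above and are not the exact attribute set used in this work.

```python
# hypothetical feature vector for one phone; names and values are illustrative
phone_features = {
    "segment_identity": "a",             # one of the 43 segment values
    "segment_type": "vowel",
    "manner_of_articulation": None,      # consonants only
    "prev_segment_type": "consonant",
    "next_segment_type": "consonant",
    "prev_voicing": True,
    "next_voicing": False,
    "lexical_stress": "short-falling",   # one of the four stress types, or none
    "part_of_speech": "noun",
    "word_length_syllables": 3,
    "position_in_syllable": 1,
    "syllable_position_in_word": 2,
    "word_position_in_phrase": "medial",
    "phrase_break_level": 0,
}
```

In a corpus-based system, one such record would be generated automatically for every phone in the database and fed to the duration model.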

PHONE DURATION MODELING USING STACKING ALGORITHM
The basic idea of using meta algorithms is to make more reliable decisions by combining the outputs of different models [19]. There is no generally accepted, preferable way of combining multiple models, so many different variations can be applied. These algorithms often improve predictive performance over a single model, exploiting the observation that different algorithms perform differently in different situations. In the case of phone duration modeling, different individual phone duration models will produce different errors, and a meta learning algorithm which combines the predictions of the individual models in an appropriate way could compensate for some of these errors. A meta learning technique can therefore contribute to an increase in the overall phone duration prediction accuracy [19].
Stacked generalization, or stacking, is a meta learning algorithm invented by David Wolpert [19]. It presents a possible way of combining multiple models of different types. Stacking introduces the concept of a metalearner: a learning algorithm which tries to discover how best to combine the outputs of the base learners, or level-0 learners.
The general structure of the stacking algorithm is shown in Fig. 1. As can be seen in the figure, the input to the level-1 model consists of the predictions of the level-0 models. A level-1 instance has as many attributes as there are level-0 models, and the attribute values are the predictions of the base learners on the corresponding level-0 instance. Cross-validation is usually applied to every level-0 learner, ensuring that the level-1 learner is trained on predictions covering the full set of training data.
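The scheme above can be sketched in code. The following minimal example is our own illustration (not the WEKA implementation used in this paper): it builds the level-1 training set from out-of-fold predictions of two toy base learners, a linear regression and a one-split regression stump, and then fits a linear meta-model on those predictions.

```python
import numpy as np

def fit_linreg(X, y):
    # least-squares linear regression with an intercept term
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linreg(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

def fit_stump(X, y):
    # one-split regression stump on the first attribute (a toy stand-in for a tree)
    t = np.median(X[:, 0])
    return t, y[X[:, 0] <= t].mean(), y[X[:, 0] > t].mean()

def predict_stump(m, X):
    t, left, right = m
    return np.where(X[:, 0] <= t, left, right)

def fit_stacking(X, y, k=5, seed=0):
    base = [(fit_linreg, predict_linreg), (fit_stump, predict_stump)]
    n = len(X)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    Z = np.zeros((n, len(base)))          # level-1 attributes: one per base model
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        for j, (fit, predict) in enumerate(base):
            # out-of-fold predictions become level-1 training data
            Z[fold, j] = predict(fit(X[train], y[train]), X[fold])
    meta = fit_linreg(Z, y)               # level-1 (meta) learner
    models = [fit(X, y) for fit, _ in base]  # refit base models on all data
    def predict_stacked(Xnew):
        Znew = np.column_stack([p(m, Xnew) for m, (_, p) in zip(models, base)])
        return predict_linreg(meta, Znew)
    return predict_stacked
```

The essential point is that the meta-model is trained only on predictions made for instances each base model has not seen, which is what the cross-validation step guarantees.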
Numerous machine learning methods can be applied for training the level-1 model. Because most of the work is done by the level-0 models, the level-1 learner can be a simple algorithm such as linear regression or a model tree. The machine learning techniques used here as level-0 or level-1 learners, namely linear regression, model trees, CART (Classification and Regression Trees) and REP (Reduced Error Pruning) trees, are briefly described in the following paragraphs.
Linear regression [19] is one of the methods used for the prediction of numerical values. This very simple method is also the oldest method of regression analysis; it has been extensively studied for many years and is widely used in practical applications [20]. The basic idea of this algorithm is that the dependent variable (the predicted value) is represented as a linear combination of the attributes on which it depends, each weighted by an appropriate weighting factor. The dependent variable can therefore be written as:

x = w_0 + w_1 a_1 + w_2 a_2 + … + w_k a_k,

where x is the predicted value (the phone duration in the case of duration modeling), a_1, a_2, …, a_k are the factors affecting the duration of phones, and w_0, w_1, …, w_k are the weighting factors. The weighting factors are determined from the training data in the speech database.
The predicted value of the duration of the first phone in the database can be written as:

x^(1) = w_0 + w_1 a_1^(1) + w_2 a_2^(1) + … + w_k a_k^(1).   (1)

In determining the coefficients w_j (there are k + 1 of them), the method of least squares is applied: the sum of the squared differences between the actual and the predicted values over all the training data must be minimized.
If there are n phones in the database, where the i-th phone is denoted with the superscript (i), the sum of the squared differences can be represented as:

Σ_{i=1}^{n} ( x^(i) − w_0 − Σ_{j=1}^{k} w_j a_j^(i) )²,   (2)

where the difference in parentheses is the difference between the actual and the predicted value of the duration of the i-th phone in the database. The sum of squares is minimized by an appropriate choice of the coefficients w_j.
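As a small numeric sketch of this minimization (with made-up factor values and durations), the coefficients w_j can be obtained with an off-the-shelf least-squares solver; at the minimum, the residual vector is orthogonal to every column of the design matrix, which is exactly the least-squares condition.

```python
import numpy as np

# hypothetical factor values a_1, a_2 for four phones and their durations x (ms)
A = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([80.0, 95.0, 70.0, 110.0])

# prepend a column of ones so that w_0 is estimated along with w_1, ..., w_k
D = np.column_stack([np.ones(len(A)), A])
w, _, _, _ = np.linalg.lstsq(D, x, rcond=None)  # minimizes sum of squared errors
pred = D @ w                                    # predicted durations
```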
The CART method [21] is today probably the most frequently applied method for modeling the duration of speech segments in synthesized speech. It involves the use of a regression tree for predicting the duration of a given speech segment, which is represented in the database by a corresponding feature vector. The tree is formed in several steps: first, the formation of the question set and the selection of the best question for splitting in a given node; second, the selection of a stopping criterion for a node, i.e. the declaration of a given node as a terminal node (leaf); and third, the prediction of a value in a given node.
The criterion for splitting is usually the mean squared error. If Y(X) is the actual duration of a training instance X, then the overall prediction error for a node t can be defined as:

E(t) = Σ_{X∈t} ( Y(X) − d(X) )²,   (4)

where d(X) is the predicted value of Y. The next step is the selection of the best question, which is equivalent to finding the best split of the instances of the node. We look for the question with the largest squared-error reduction, i.e. the question q* that maximizes the reduction:

q* = argmax_q [ E(t) − E(l) − E(r) ],   (5)

where l and r are the two child nodes of t produced by question q. We define the expected squared error V(t) of a node t as the overall regression error divided by the total number of instances |t| in the node:

V(t) = E(t) / |t|.   (6)

One can notice that V(t) is in fact the variance estimate of the duration if d(X) is taken to be the average duration of the instances in the node. With V(t), we can define the weighted squared error V̄(t) of a node t as:

V̄(t) = ( |t| / N ) V(t),   (7)

where N is the total number of training instances. Finally, the splitting criterion can be rewritten as:

ΔV̄(q, t) = V̄(t) − V̄(l) − V̄(r).   (8)

The regression tree is formed by splitting each node until either of the following conditions is met for a node t:
1. the greatest variance reduction of the best question falls below a pre-set threshold α, i.e. ΔV̄(q*, t) < α;
2. the number of instances falling in the node is below a threshold β.
When a node cannot be split further, it is declared a terminal node. The tree building algorithm stops when all nodes are terminal.
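A minimal sketch of the splitting step for a single numeric attribute (our own illustration of the squared-error-reduction criterion, not the CART implementation itself) is:

```python
import numpy as np

def node_error(y):
    # E(t): sum of squared differences from the node prediction, here the mean
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_split(x, y):
    # scan candidate thresholds on one numeric attribute and return the
    # question q* with the largest squared-error reduction E(t) - E(l) - E(r)
    best_q, best_gain = None, 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        gain = node_error(y) - node_error(left) - node_error(right)
        if gain > best_gain:
            best_q, best_gain = t, gain
    return best_q, best_gain
```

In a full tree builder, this search would run over every attribute, the winning question would split the node, and the recursion would stop once the gain falls below α or the node holds fewer than β instances.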
A regression tree is a special case of a model tree. The only difference between them is that in a model tree each leaf contains a linear regression model based on some of the attribute values instead of a constant value; this linear regression model predicts the value for the instances that reach the leaf.
In addition to regression and model trees, another algorithm based on decision trees can be used in the process of duration modeling: the REP (Reduced Error Pruning) Tree algorithm [22], developed in order to obtain an optimal tree, i.e. to achieve a minimum error with a minimum tree size. In this algorithm, separate data sets are used for growing and for pruning the tree, and the ratio between the sizes of these two sets is one of the parameters of the algorithm.

EXPERIMENTAL RESULTS AND DISCUSSION
In this paper, duration models have been developed with the LR (linear regression), M5P (model tree), M5PR (regression tree) and REPTree (with and without pruning) algorithms, as well as the STACK (stacking) algorithm, of WEKA [23]. These algorithms were used for training the models on a large speech corpus containing 98,214 phones, including 59,671 consonants and 38,543 vowels. Fig. 2 shows the SAMPA (Speech Assessment Methods Phonetic Alphabet) symbols of the phonemes and phonemic segments from the analyzed Serbian speech database and the numbers of their occurrences.
The prediction performance of each model was evaluated on unseen (new) data, which were not used in the training phase. The procedure involved splitting the whole database into two subsets: the training set, comprising 80% of the database, and the test set, comprising the remaining 20%. The evaluation of the duration models was performed by means of objective measures, including the root-mean-squared error (RMSE), correlation coefficient (CC) and mean absolute error (MAE) between the predicted and actual durations of phones. The RMSE, MAE and CC of the duration models developed using the LR, M5P, M5PR and REPTree (with and without pruning) algorithms, as well as the meta learning STACK algorithm, for the full phoneme set are given in Tab. 1. LR, M5P, M5PR and REPTree with and without pruning were used as the level-0 models in the stacking algorithm, and M5P was chosen as the metalearner.
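The three objective measures can be implemented directly from their definitions; the following is a generic sketch (not tied to WEKA's implementation):

```python
import numpy as np

def rmse(actual, predicted):
    # root-mean-squared error between actual and predicted durations
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    # mean absolute error
    return float(np.mean(np.abs(actual - predicted)))

def cc(actual, predicted):
    # Pearson correlation coefficient
    return float(np.corrcoef(actual, predicted)[0, 1])
```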
Based on the results presented in Tab. 1, it can be seen that the STACK M5P model performs better, in terms of RMSE, MAE and CC, than the individual models used as its level-0 models in the training phase. Among the individual models, M5P has the best prediction performance [24], which is why it was chosen as the level-1 model. One can also notice that linear regression yields the model with the worst prediction performance.
To further improve the performance of the STACK M5P model, the outliers of the speech database were removed, resulting in a new range of phone durations containing 96.27% of the data of the full segment set. This range was obtained from the distribution of durations by removing the instances of phones with extremely small or extremely large durations near the boundary values of the full duration range, i.e. around 2 ms and 290 ms (Fig. 3). After removing the outliers, the STACK M5P model outperforms the best individual model, M5P, by approximately 2.0% and 1.7% in terms of the relative reduction of RMSE and MAE, respectively.
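The relative error reduction quoted here is the difference between the base model's error and the stacked model's error, divided by the base model's error. For instance, a drop from an RMSE of 10.0 ms to 9.8 ms (illustrative numbers, not the values in Tab. 1) is a 2.0% relative reduction:

```python
def relative_reduction(base_error, stacked_error):
    # fraction by which the stacked model reduces the base model's error
    return (base_error - stacked_error) / base_error
```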

CONCLUSION
In this paper we presented a new phone duration model based on a meta learning algorithm for the full phoneme set of the Serbian language. The corpus used for the duration modeling procedure contained a total of 98,214 phones. In order to improve model performance, outliers were removed from the analysis. The model obtained for the Serbian language was subjected to objective evaluation, and the quantitative measures in terms of RMSE, MAE and CC showed that it outperforms the individual models. We can therefore conclude that the use of the meta learning algorithm contributes to an increase in phone duration prediction accuracy.
Future research should include subjective evaluation of our duration model once it is implemented into the speech synthesizer for the Serbian language [15]. The goal of such a study would be to evaluate the quality of synthesized speech and determine whether and to what extent the proposed model contributes to the naturalness, intelligibility and comprehensibility of synthesized speech.