Two databases have so far been tested with this technique: a male British English RP speaker consisting of 460 TIMIT phonetically balanced sentences (about 14,000 units), and a female American news reader (f2b) from the Boston University FM Radio corpus [8] (about 37,000 units).
Testing the quality of speech synthesis is difficult. Initially we tried to score a model under a given set of parameters by synthesizing a set of 50 sentences, scored on a scale of 1-5 (excellent to incomprehensible). However, the results were not consistent except when quality differed widely. We therefore replaced the absolute score with a relative one, as it proved much easier and more reliable to judge whether an example was better than, equal to or worse than another than to rate its quality on an absolute scale.
In these tests we generated 20 sentences for each of a small set of models obtained by varying some parameter (e.g. cluster size). The 20 sentences consisted of 10 ``natural target'' sentences (where the segments, duration and F0 were derived directly from naturally spoken examples) and 10 examples of text-to-speech. None of the sentences in the test set appeared in the databases used to build the cluster models. Each set of 20 was played against each other set (in random order) and a judgement of better, worse or equal was recorded for each pair of examples. A sample set was said to ``win'' against another if more of its examples were judged better. A league table recording the number of ``wins'' for each sample set gave an ordering on the sets.
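For concreteness, the league-table scoring can be expressed as a short program. The following is a minimal sketch in Python; the data layout and the sample-set names are illustrative, not taken from the actual tests.

\begin{verbatim}
def league_table(judgments):
    """judgments maps a pair of sample sets (a, b) to counts of
    sentences judged better for a, better for b, or equal.
    A set "wins" a pairing if more of its examples were judged
    better; ties add no wins.  Returns total wins per set."""
    wins = {}
    for (a, b), (a_better, b_better, _equal) in judgments.items():
        wins.setdefault(a, 0)
        wins.setdefault(b, 0)
        if a_better > b_better:
            wins[a] += 1
        elif b_better > a_better:
            wins[b] += 1
    return dict(sorted(wins.items(), key=lambda kv: -kv[1]))

# Illustrative: three parameter settings, 20 sentences per pairing.
judgments = {
    ("size_5", "size_10"): (4, 12, 4),
    ("size_5", "size_20"): (7, 9, 4),
    ("size_10", "size_20"): (11, 5, 4),
}
print(league_table(judgments))  # size_10 tops the table with 2 wins
\end{verbatim}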
In the following tests we varied the cluster size, the F0 weight in the acoustic cost, and the amount of pruning applied to the final clusters. These full tests were carried out only on the male 460-sentence database.
For the cluster size test we fixed the other parameters at what we judged to be mid-range values. The following table gives the number of ``wins'' of each sample set over the others.
When the cluster size is too restrictive, quality decreases; it is at its best at around 10 units and degrades again as the cluster size grows.
The importance of F0 in the acoustic measure was tested by varying its weighting relative to the other parameters in the acoustic vector.
This optimal value is lower than we expected, but we believe this is because our listening test did not compare against an original or actual desired F0; thus no penalty was given to a ``wrong'' but acceptable F0 contour in a synthesized example.
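As an illustration of what such a weighting means, the sketch below treats the acoustic cost as a weighted Euclidean distance over the acoustic vector, with F0 as one of its dimensions. The Euclidean form, the 12-coefficient cepstral layout and the weight values are assumptions made here for illustration only.

\begin{verbatim}
import numpy as np

def acoustic_distance(u, v, weights):
    """Weighted Euclidean distance between two acoustic vectors.
    Raising the F0 weight makes the measure more sensitive to
    pitch differences relative to the spectral dimensions."""
    d = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(weights) * d * d)))

# Illustrative vectors: 12 cepstral coefficients followed by F0 (Hz).
u = np.concatenate([np.random.randn(12), [118.0]])
v = np.concatenate([np.random.randn(12), [131.0]])
for f0_weight in (0.0, 0.1, 0.5, 1.0):
    w = np.append(np.ones(12), f0_weight)
    print(f0_weight, acoustic_distance(u, v, w))
\end{verbatim}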
The final test examined the effect of pruning the clusters. Clusters of size 15 and 10 were tested, and pruning consisted of discarding a number of units from each cluster. In both cases discarding 1 or 2 units made no perceptible difference in quality (though the units actually selected did differ when 2 were discarded). In the size-10 case, further pruning began to degrade quality; in the size-15 case, quality degraded only after more than 3 units were discarded. Overall the best quality came from the size-10 clusters, and pruning 2 units allows the database size to be reduced without affecting quality. Pruning was also tested on the f2b database with its much larger inventory; the best overall results there were found by pruning 3 or 4 units from clusters of size 20.
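One plausible reading of the pruning step is sketched below: discard the units lying farthest from the cluster centroid under the acoustic measure, on the assumption that such outliers are the least representative of the cluster. The exact criterion is not specified here, so the centroid-distance rule is an assumption for illustration.

\begin{verbatim}
import numpy as np

def prune_cluster(units, n_discard):
    """Discard the n_discard units farthest from the cluster
    centroid (assumed criterion), keeping the most central ones.
    units: array of shape (n_units, n_features)."""
    units = np.asarray(units, dtype=float)
    centroid = units.mean(axis=0)
    dists = np.linalg.norm(units - centroid, axis=1)
    keep = np.argsort(dists)[: len(units) - n_discard]
    return units[keep]

# E.g. a size-10 cluster pruned by 2, as in the best setting above.
cluster = np.random.randn(10, 13)
pruned = prune_cluster(cluster, 2)   # 8 units remain
\end{verbatim}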
In these experiments no signal modification was applied after selection, even though we believe such processing (e.g. PSOLA) is necessary. We do not expect all prosodic forms to exist in the database, and it is better to introduce a small amount of signal modification in return for fixing obvious discontinuities. However, it is important for the selection algorithm to be sensitive to the prosodic variation required by the targets, so that the selected units need only minimal modification. Ideally the selection scoring should take into account the cost of signal modification, as sketched below, and we intend to run similar tests on selections modified by signal processing.
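One way the selection score could fold in a modification cost is sketched below. The log-ratio penalty form, the weights and the function names are assumptions, not the method used in these experiments.

\begin{verbatim}
import math

def selection_cost(target_cost, join_cost, f0_ratio, dur_ratio,
                   w_join=1.0, w_mod=0.5):
    """Combine the usual target and join costs with a penalty for
    the signal modification (e.g. PSOLA) a candidate would need.
    f0_ratio, dur_ratio: required F0/duration scaling factors;
    a ratio of 1.0 means no modification.  The log-ratio penalty
    is symmetric for stretching vs. compressing.  Weight values
    are illustrative."""
    mod_cost = abs(math.log(f0_ratio)) + abs(math.log(dur_ratio))
    return target_cost + w_join * join_cost + w_mod * mod_cost

# A candidate needing a 10% F0 raise and 20% lengthening:
print(selection_cost(target_cost=0.8, join_cost=0.3,
                     f0_ratio=1.10, dur_ratio=1.20))
\end{verbatim}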