We took the two sets of sentences (221 and 146), recorded KAL delivering them, built basic unit selection voices using the standard FestVox build process, and hand-corrected the phonetic labels of each. Three test voices were built: one with just the 221 sentences, one with just the 146, and one with the combined set of 367. We spent some time tuning the voices to find good parameters, such as the cluster size and the features used for selection, though for the best quality we know we would need to correct the labels further.
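The tuning amounted to a small sweep over build parameters. The following is a minimal sketch of the shape of such a sweep, not our actual procedure: the candidate values and the `build_and_score` helper are illustrative stand-ins, not FestVox commands.

```python
import itertools
import random

random.seed(0)

# Illustrative candidate values; the real parameters live in the
# FestVox voice build and were tuned by listening.
cluster_sizes = [10, 20, 50]
feature_sets = ["phonetic", "phonetic+prosodic"]

def build_and_score(sentence_set, cluster_size, features):
    """Stand-in for rebuilding a unit selection voice with these
    parameters and judging its quality on held-out sentences.
    Here it just returns a placeholder score."""
    return random.random()

best = max(
    ((build_and_score("txt_367", size, feats), size, feats)
     for size, feats in itertools.product(cluster_sizes, feature_sets)),
    key=lambda t: t[0],
)
print(f"best score={best[0]:.3f} cluster size={best[1]} features={best[2]}")
```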
A set of 20 sentences had previously been created for testing purposes; one of these (the first sentence of Alice) was contained in the 34,796-utterance training set, though not in the selected set, while the rest of the test set was independent.
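The overlap check itself is easy to script; below is a minimal sketch, assuming one sentence per line in each file (the file names and the normalization are our assumptions, not the actual corpus files):

```python
# Minimal sketch of the train/test overlap check.
# File names are assumptions, not the actual corpus files.

def load_sentences(path):
    """Load one sentence per line, normalized for comparison."""
    with open(path, encoding="utf-8") as f:
        return {" ".join(line.split()).lower() for line in f if line.strip()}

train = load_sentences("train_34796.txt")    # full 34,796-utterance pool
selected = load_sentences("txt_367.txt")     # the 221 + 146 selected sentences
test = load_sentences("test_20.txt")         # the 20 test sentences

for sentence in sorted(test):
    if sentence in train or sentence in selected:
        print(f"overlap (train={sentence in train}, "
              f"selected={sentence in selected}): {sentence}")
```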
Judging speech synthesis quality is not easy, even (perhaps especially) if you listen to a lot of it. In general it is fairly easy to reliably determine what is much better, but in close cases, where multiple factors may affect the quality of the speech (e.g. joins, prosodic smoothness and segmental quality), such judgments become less clear, and subjects may differ when questioned.
We synthesized the twenty sentences, and one of the authors listened to randomly ordered examples from each comparison. Opinions were collected in terms of A being better than B, B being better than A, or the two being of equal quality.
| comparison         | A | B  | A=B | better  |
|--------------------|---|----|-----|---------|
| txt_221 vs txt_146 | 4 | 8  | 8   | txt_146 |
| txt_221 vs txt_367 | 6 | 10 | 4   | txt_367 |
| txt_146 vs txt_367 | 3 | 12 | 3   | txt_367 |
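The bookkeeping behind such a table is straightforward; the sketch below shows one way the randomized A/B presentation and tallying could be implemented (the `judge` callback stands in for the human listener and is an assumption):

```python
import random
from collections import Counter

def run_ab_test(sentences, voice_a, voice_b, judge):
    """Present each pair of synthesized sentences in random order and
    tally preferences as A, B, or A=B.  judge(sentence, first, second)
    stands in for the listener, returning 'first', 'second' or 'equal'."""
    tally = Counter()
    for sentence in sentences:
        pair = [(voice_a, "A"), (voice_b, "B")]
        random.shuffle(pair)  # hide which voice is which
        (first_voice, first_label), (second_voice, second_label) = pair
        verdict = judge(sentence, first_voice, second_voice)
        if verdict == "equal":
            tally["A=B"] += 1
        else:
            tally[first_label if verdict == "first" else second_label] += 1
    return tally

# e.g. run_ab_test(test_sentences, "txt_221", "txt_146", judge)
# would yield Counter({"B": 8, "A=B": 8, "A": 4}) for the first row above.
```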
It is clear that the largest set is best, but it is interesting that it is not so clear which of the 221 and 146 sets is better, even though one is 50% larger than the other. The 146-sentence voice is often smoother, but typically has less variation in prosody.
Relative quality helps in deciding directions, but it does not determine whether the resulting voice is good enough for real applications. A second test was therefore done on these voices with respect to five further test sets.
A five-point score was used: 5 being indistinguishable or nearly indistinguishable from recorded speech; 4, having errors but understandable; 3, understandable with difficulty; 2, bad but with some parts discernible; and 1, nearly incomprehensible or worse. We had four people listen to these examples.
| test set | listener 1 | listener 2 | listener 3 | listener 4 | mean | rank |
|----------|------------|------------|------------|------------|------|------|
| alice    | 4.4        | 4.15       | 3.3        | 3.95       | 3.95 | 1    |
| timit    | 3.75       | 3.75       | 2.95       | 2.85       | 3.32 | 4    |
| comm     | 3.7        | 3.9        | 2.4        | 3.25       | 3.31 | 5    |
| festvox  | 3.75       | 4.05       | 3.25       | 3.4        | 3.61 | 3    |
| story    | 4.0        | 4.3        | 3.05       | 3.95       | 3.82 | 2    |
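The mean and rank columns follow directly from the per-listener scores; a minimal sketch using the numbers from the table (note the table's mean column appears to truncate borderline cases, e.g. 3.325 to 3.32):

```python
# Per-listener mean opinion scores, copied from the table above.
scores = {
    "alice":   [4.4, 4.15, 3.3, 3.95],
    "timit":   [3.75, 3.75, 2.95, 2.85],
    "comm":    [3.7, 3.9, 2.4, 3.25],
    "festvox": [3.75, 4.05, 3.25, 3.4],
    "story":   [4.0, 4.3, 3.05, 3.95],
}

means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Highest mean gets rank 1; this reproduces the ranking in the table.
for rank, name in enumerate(sorted(means, key=means.get, reverse=True), 1):
    print(f"{name:8s} mean={means[name]:.3f} rank={rank}")
```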
It is clear to anyone listening to the voices that they are most appropriate for reading stories, which is expected, as that is the domain the voices were selected for. The TIMIT sentences are, of course, more complex, as they were deliberately chosen to have good phonetic coverage. The Communicator data is in a different style, and its quality is not as good, even though it is understandable; it does not have the fluency and appropriateness that the limited domain voice has, which confirms our hypothesis that domain voices can always sound better than general purpose voices, because they are more appropriate to the task at hand.