The first important observation is that this system does not solve the general synthesis problem. We must make that clear, because too often a single high-quality example is played, giving the impression that anything can be synthesized at that quality. What we do conclude is that these techniques allow reliable, high-quality synthetic voices to be developed quickly, provided they are targeted at a limited domain.
The advantage that these techniques bring, namely that the synthesis implicitly models the quality of the recorded database, is in the long run also a disadvantage. As more general synthesis is required, with varying prosody, varying emphasis and focus, and larger vocabularies, the amount of data that needs to be recorded will become too large. At some point we need to model prosodic and spectral phenomena explicitly so that we can get the same quality of synthesis without having to record such large databases.
We see this technique as offering a more general solution for systems that currently use recorded prompts. It offers the quality of recorded prompts together with the generality of simple synthesis, so phrases other than those in the recordings can be generated. We do not currently recommend this system for truly general synthesis, such as reading email or news stories, but there are still many speech applications that fall within the scope of this technique.
Full documentation, with scripts, code and explicit walkthroughs of these techniques with examples, is available at http://festvox.org