One of the important stages in the process of turning unmarked text into speech is the assignment of appropriate phrase break boundaries. Phrase break boundaries are important to later modules including accent assignment, duration control and pause insertion.
A number of different algorithms have been proposed for this task, ranging from the simple to the complex. These algorithms require different information, such as part of speech tags, syntax and even semantic understanding of the text. These requirements come at differing costs, and it is important to trade off the difficulty of obtaining particular input features against the accuracy of the model.
The simplest models are deterministic rules. A model that simply inserts phrase breaks after punctuation is rarely wrong in its assignments, but it massively under-predicts, allowing overly long phrases when the text contains no punctuation. More complex rule-driven models such as [1] involve much more detailed rules and require the input text to be parsed. Statistically based models, on the other hand, offer the advantage of automatic training, which makes moving to a new domain or language much easier. Simple direct CART models using features such as punctuation, part of speech, accent positions etc. can produce reasonable results [5]. Other, more complex stochastic methods that optimise assignment over whole utterances (e.g. [8]) have also been developed.
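As an illustration, the punctuation-only deterministic baseline mentioned above can be sketched in a few lines. The tokenisation and the ``B'' break marker below are our own illustrative assumptions, not part of any of the cited systems:

```python
# Sketch of a punctuation-only deterministic break model.
# The tokenisation and the "B" break marker are illustrative assumptions.

PUNCT = {",", ".", ";", ":", "?", "!"}

def punctuation_breaks(tokens):
    """Insert a phrase break ("B") after every punctuation token."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok in PUNCT:
            out.append("B")
    return out

# Rarely wrong where it fires, but a long unpunctuated
# sentence receives no break at all.
print(punctuation_breaks(["I", "wanted", "to", "go", ",", "but", "could", "not", "."]))
```

The sketch makes the failure mode concrete: precision is high at punctuation marks, but any sentence without internal punctuation is left as a single overlong phrase.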
An important restriction that is sometimes ignored in these algorithms is that the inputs to the phrase break assignment algorithm must be available at phrase break assignment time, and must themselves be predictable from raw text. For example, some algorithms require accent assignment information, but we believe accent assignment can only take place after prosodic boundaries have been identified. A second example is requiring a syntactic parse of the input without providing a syntactic parser to produce it. Thus we have ensured both that our phrase break assignment algorithm is properly placed within a full text-to-speech system and that the prediction of any required inputs is included in our tests.
A second requirement for our algorithm arose from our observation that many phrase break assignment algorithms estimate the probability of a break at a given point based only on local information. However, what locally appears to be a reasonable position for a break may in fact be less suitable than the position after the next word. That is, assignment should not be optimised locally but globally, over the whole utterance. For example, in the sentence
I wanted to go for a drive in the country.
a good place for a break may locally appear to be between ``drive'' and ``in'', based on part of speech information. However, in the sentence
I wanted to go to a drive in.
such a position is unsuitable. Another example is a uniform list of nouns. Breaks between nouns are unusual, but given a long list of nouns (e.g. the numbers 1 to 10), it becomes reasonable to insert phrase breaks.
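The contrast between local and global decisions can be made concrete with a toy dynamic-programming search. The probabilities, the maximum phrase length and the penalty below are invented for illustration and are not the model described in this paper; the point is only that maximising a sentence-level score can insert breaks that no local threshold would:

```python
import math

def global_breaks(break_probs, max_len=3, penalty=-5.0):
    """Viterbi-style search over break/no-break decisions (toy model).

    break_probs[i] is an invented local probability of a break at
    juncture i.  States track the distance since the last break;
    exceeding max_len words without a break incurs a log-score
    penalty, so the globally best sequence may place a break where
    no local decision would.
    """
    states = {0: (0.0, [])}  # distance since last break -> (score, decisions)
    for p in break_probs:
        new = {}
        for d, (score, seq) in states.items():
            # Option 1: place a break here (distance resets to 0).
            s = score + math.log(p)
            if 0 not in new or s > new[0][0]:
                new[0] = (s, seq + [1])
            # Option 2: no break; penalise overlong phrases.
            s = score + math.log(1.0 - p)
            if d + 1 > max_len:
                s += penalty
            d2 = min(d + 1, max_len + 1)
            if d2 not in new or s > new[d2][0]:
                new[d2] = (s, seq + [0])
        states = new
    return max(states.values(), key=lambda v: v[0])[1]

# Every juncture is locally unlikely to be a break (p = 0.3), so a
# greedy 0.5 threshold inserts none; the global search still places
# one break to avoid an overlong phrase.
decisions = global_breaks([0.3] * 6)
print(decisions, "breaks:", sum(decisions))
```

This mirrors the noun-list example above: each individual juncture argues against a break, yet the best analysis of the whole utterance contains one.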
Thus we wish our model to have reasonable input requirements, to use predicted values for its inputs as part of testing, and to consider global optimisation of phrase break assignment over the whole utterance.