Each subject was presented with 8 utterances and was asked to write down what he or she heard. Each subject heard 2 utterances (one "time" and one "words") from each of the 4 voice types. The order of the voice types was changed for each subject (4! = 24 orders in all).
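The count of 24 orders follows from the number of permutations of 4 voice types. The sketch below enumerates them and assigns orders to subjects in rotation. The labels `voice_1` to `voice_4` and the round-robin assignment are illustrative assumptions, since this section does not state how orders were allocated to subjects.

```python
from itertools import permutations

# Placeholder labels: the four voice types are not named in this section.
VOICE_TYPES = ["voice_1", "voice_2", "voice_3", "voice_4"]

# All 4! = 24 possible presentation orders of the voice types.
ORDERS = list(permutations(VOICE_TYPES))


def order_for_subject(subject_number: int) -> tuple:
    """Return a voice-type order for a subject (0-based), cycling through all orders.

    Round-robin assignment is an assumption made for illustration only.
    """
    return ORDERS[subject_number % len(ORDERS)]


if __name__ == "__main__":
    print(len(ORDERS))
    print(order_for_subject(0))
```

Cycling through the full set keeps the orders balanced once the number of subjects reaches a multiple of 24.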
The system ran on a Linux machine using an inexpensive LineJack telephony card. A custom-written program was used to run the experiment.
The natural utterances were spoken by a graduate student who had experience in delivering prompts for telephone systems and could therefore consistently deliver different utterances in the same style. The synthetic utterances were produced by the Festival Speech Synthesis System [1], using the us1_mbrola voice, considered by many to be a high-quality diphone voice. Although we are aware of other synthetic voices that may be more understandable and/or more natural, at this stage we were not concerned with measuring the quality of different synthesized voices, so we selected a typical example of easily achievable synthesis quality that is also publicly available.
The experiment was set up so that, when a call was initiated, the system requested a two-key index that selected the order in which the voices would be presented. The experimenter normally entered this index for the subject, so as to avoid lengthy explanations of the experimental setup. Once the order was selected, a single key press played the first utterance, and each subsequent key press played the next one. In this test we did not allow the user to hear an utterance more than once. This procedure kept to a minimum the possibility of a subject being distracted from hearing an item by such sources as questions, abrupt noises, and other background noise. Users had no problem learning this method; we eliminated only one of 536 responses due to distraction. Even when the user had a telephone with the keypad in the handset, the system response time after a key press was long enough that the subjects did not miss any part of the utterances.
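The call flow described above can be sketched as a small simulation: a two-key index selects the voice order, and each later key press plays the next utterance exactly once. This is not the original program. The index encoding (two digits, "01" to "24"), the voice labels, and the grouping of the "time" and "words" utterances by voice type are all assumptions made for illustration.

```python
from itertools import permutations

VOICE_TYPES = ["voice_1", "voice_2", "voice_3", "voice_4"]  # placeholder labels
ORDERS = list(permutations(VOICE_TYPES))  # the 24 possible voice orders


def decode_order_index(keys: str) -> tuple:
    """Map a two-key index ("01".."24", an assumed encoding) to a voice order."""
    if len(keys) != 2 or not (keys.isascii() and keys.isdigit()):
        raise ValueError("expected exactly two digit keys")
    index = int(keys)
    if not 1 <= index <= len(ORDERS):
        raise ValueError("index out of range")
    return ORDERS[index - 1]


def build_playlist(order: tuple) -> list:
    """Two utterances per voice type: one "time" and one "words" (assumed grouping)."""
    return [(voice, kind) for voice in order for kind in ("time", "words")]


def run_session(index_keys: str, key_presses: int) -> list:
    """Return the utterances played, in order; each is played once, never repeated."""
    playlist = build_playlist(decode_order_index(index_keys))
    # Key presses beyond the end of the playlist play nothing further.
    return playlist[:key_presses]
```

For example, `run_session("01", 8)` returns the full list of 8 utterances for the first order, and a session with fewer key presses returns only the utterances heard so far.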
The subjects were given a sheet on which they filled in their initials, age, and whether they had hearing problems, and on which they signed a release statement. On the bottom half of the sheet were the 8 response lines, preceded by the instructions. The subjects were asked to read the instructions, and the experimenter also read them aloud (there would be eight utterances; the subject should write something down each time and then press any key other than the pound sign to go on). All text on the sheet was set in a font of at least 16 points.