To obtain a more realistic assessment of how these models treat unknown words, we processed the first section of the WSJ Penn Treebank [9]. This consists of a total of 39923 words of news text. Using our standard OALD lexicon, we find that 1775 words (4.6%) are not in the lexicon, 943 of which are unique. These unknown words are distributed as follows:
Type | Occurs | % |
names | 1360 | 76.6 |
unknown | 351 | 19.8 |
American spelling | 57 | 3.2 |
typos | 7 | 0.4 |
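The counting behind these figures can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `oov_stats`, the toy data, and the case-insensitive lookup are all assumptions, since the text does not say how tokenization or letter case were handled.

```python
from collections import Counter


def oov_stats(tokens, lexicon):
    """Count tokens, and distinct word types, that are missing from the lexicon."""
    # Assumption: lookup ignores letter case; the source does not specify this.
    known = {word.lower() for word in lexicon}
    missing = Counter(t.lower() for t in tokens if t.lower() not in known)
    n_oov = sum(missing.values())
    return {
        "tokens": len(tokens),
        "oov_tokens": n_oov,          # every occurrence counts
        "oov_types": len(missing),    # unique unknown words
        "oov_rate": n_oov / len(tokens) if tokens else 0.0,
    }


# Toy data for illustration only; not taken from the Treebank or from OALD.
toy_lexicon = {"the", "president", "said", "that"}
toy_tokens = ["The", "president", "Reagan", "said", "that", "Reagan", "Chrysler"]
stats = oov_stats(toy_tokens, toy_lexicon)
print(stats["oov_tokens"], stats["oov_types"])
```

Keeping token and type counts separate mirrors the distinction drawn above between the 1775 unknown occurrences and the 943 unique unknown words.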
We listened to each of the 1775 words as pronounced by a number of the models discussed above and made a yes/no decision about acceptability. Note that a number of these words have multiple acceptable pronunciations; if any of those was predicted, the pronunciation was deemed acceptable. For example, the pronunciations of ``Reagan'' as /r ey g ah n/ and as /r iy g ah n/ were both considered acceptable.
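The "any acceptable pronunciation counts" rule can be approximated in code as below. Note that the judgments reported here were made by listening, so this is only a sketch of an automated stand-in: the `ACCEPTABLE` table, the helper names, and the whitespace-separated phone strings are assumptions introduced for illustration.

```python
def phones(pronunciation):
    """Split a whitespace-separated phone string into a tuple of phones."""
    return tuple(pronunciation.split())


# Hypothetical reference table: each word maps to all pronunciations judged acceptable.
ACCEPTABLE = {
    "Reagan": {phones("r ey g ah n"), phones("r iy g ah n")},
}


def is_acceptable(word, predicted, acceptable=ACCEPTABLE):
    """True if the predicted phones match any acceptable pronunciation of the word."""
    return phones(predicted) in acceptable.get(word, set())


def acceptance_rate(predictions, acceptable=ACCEPTABLE):
    """Fraction of (word, predicted phone string) pairs that are acceptable."""
    if not predictions:
        return 0.0
    hits = sum(is_acceptable(word, pred, acceptable) for word, pred in predictions)
    return hits / len(predictions)


rate = acceptance_rate([("Reagan", "r iy g ah n"), ("Reagan", "r eh g ah n")])
print(rate)
```

A word that has no entry in the table is treated as unacceptable here, which is a conservative choice for the sketch rather than something the text prescribes.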
The best results, shown above for OALD, were obtained by building the deepest possible trees. When those models were applied to these unknown words, however, the results showed that although the models were not over-trained with respect to the unseen test set extracted from the lexicon itself, they were over-trained with respect to these unknown words. The following table shows the results of varying the stop value for CART building.
Stop | Lexicon test set | Unknown test set | Size |
1 | 74.56% | 62.14% | 39500 |
4 | 65.17% | 67.66% | 17948 |
5 | 63.15% | 70.65% | 14968 |
6 | 61.65% | 67.49% | 12782 |
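The trade-off in this table can be made explicit with a small sketch that picks the stop value separately for each test set. The figures are copied from the table; the tuple layout and variable names are assumptions made for illustration.

```python
# (stop value, lexicon test set %, unknown test set %, size), copied from the table.
RESULTS = [
    (1, 74.56, 62.14, 39500),
    (4, 65.17, 67.66, 17948),
    (5, 63.15, 70.65, 14968),
    (6, 61.65, 67.49, 12782),
]

# Choosing by held-out lexicon entries favours the deepest trees ...
best_on_lexicon = max(RESULTS, key=lambda row: row[1])
# ... while choosing by genuinely unknown words favours a larger stop value.
best_on_unknown = max(RESULTS, key=lambda row: row[2])

print("best stop on lexicon test set:", best_on_lexicon[0])
print("best stop on unknown test set:", best_on_unknown[0])
```

On these figures, a stop value of 5 gives up about 11.4 points on the lexicon test set in exchange for about 8.5 points on the unknown words, with a model less than 40% of the size.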
Looking at the words that are still pronounced wrongly, we find that some mistakes remain recognizable (e.g. Chrysler as /k r ih s l ah er/), but many are unacceptable and unrecognizable, showing there is still work to be done. Further analysis of these words shows:
Type | Occurs | % |
names | 413 | 79 |
unknown | 94 | 18 |
American spelling | 7 | 0 |
typos | 2 | 0 |
Further analysis of the types of names that are still not pronounced acceptably shows a larger proportion of non-Anglo-Saxon origin than among those that are correctly pronounced. As many of the languages these names originate from have a more standardized pronunciation than English (e.g. Polish, Italian, and Japanese in its romanized form), knowing the origin of an unknown word may allow more specific rules to be applied, but we have not yet investigated this area.