Two forms of unit selection will be discussed here, not because we feel they are the best but simply because they are the ones actually implemented by us and hence can be distributed. These should still be considered research systems. Unless you are specifically interested in, or have the expertise for, developing new selection techniques, it is not recommended that you try these. If you need a working voice within a month and can't afford to miss that deadline, then the diphone option is safe, well tried and stable.

Although it is too early to make definitive statements about what size and type of data is best for unit selection, we do have some rough guides. A TIMIT-like database of 460 phonetically balanced sentences (around 14,000 phones) is not an unreasonable first choice. If the text has not been specifically selected for phonetic coverage, a larger database is probably required; for example, the Boston University Radio News Corpus speaker has been used relatively successfully. Of course, all this depends on what use you wish to make of the synthesizer. If it is to be used in more restrictive environments (as is often the case), tailoring the database to the task is a very good idea. If you are going to be reading a lot of telephone numbers, having a significant number of examples of read numbers will make synthesis of numbers sound much better.

Like diphone databases, the more cleanly and carefully the speech is recorded, the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful: at least phone coverage, if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or, more importantly, will sound best when it is reading news stories).
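As a rough illustration of the difference between phone coverage and diphone coverage, the following minimal Python sketch counts both for a database given as lists of phone labels. It is not part of Festival; the function name `coverage` and the input format are assumptions made for this example. It also assumes that every ordered pair of phones is a possible diphone, which overstates the target, since many pairs never occur in a given language.

```python
from itertools import product


def coverage(utterances, phone_set):
    """Measure phone and diphone coverage of a labelled database.

    utterances: list of phone-label sequences, one per utterance (assumed format).
    phone_set:  set of all phones in the phone set being used.
    Returns (phone_coverage, diphone_coverage, missing_phones).
    """
    seen_phones = set()
    seen_diphones = set()
    for phones in utterances:
        seen_phones.update(phones)
        # Adjacent phone pairs within one utterance count as diphones.
        seen_diphones.update(zip(phones, phones[1:]))

    # Assumption: every ordered phone pair is a valid diphone target.
    all_diphones = set(product(phone_set, repeat=2))

    phone_cov = len(seen_phones & phone_set) / len(phone_set)
    diphone_cov = len(seen_diphones & all_diphones) / len(all_diphones)
    missing = sorted(phone_set - seen_phones)
    return phone_cov, diphone_cov, missing
```

A database can easily reach full phone coverage while covering only a fraction of the diphones, which is why phone coverage is the more realistic minimum for a general database.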

In order to make access well defined, you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for: segments, syllables, words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labelled, but in most cases that's impractical. There are ways to automatically obtain most of these labels, but you should be aware of the inherent errors in the labelling system you use (including labelling systems that involve human labellers). Note that when a unit selection method fundamentally relies on segment boundaries, its quality is ultimately going to be determined by the quality of the segmental labels in the database.

Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis.

This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival. This is still very much an on-going research question, and we are often adding new techniques as well as improving existing ones, so the techniques described here are not as mature as those described in the previous diphone chapter.

This paper describes a new method for synthesizing speech by concatenating sub-word units from a database of labelled speech. A large unit inventory is created by automatically clustering units of the same phone class based on their phonetic and prosodic context. The appropriate cluster is then selected for a target unit offering a small set of candidate units. An optimal path is found through the candidate units based on their distance from the cluster center and an acoustically based join cost. Details of the method and justification are presented. The results of experiments using two different databases are given, optimising various parameters within the system. Also a comparison with other existing selection based synthesis techniques is given showing the advantages this method has over existing ones. The method is implemented within a full text-to-speech system offering efficient natural sounding speech synthesis.
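The search for an optimal path through the candidate units can be illustrated with a small dynamic programming (Viterbi-style) sketch in Python. This is not the Festival implementation. The function `select_units`, the tuple layout of a unit, and the two cost functions are all hypothetical and chosen only for this example: the target cost stands in for the distance of a unit from its cluster center, and the join cost is a simple difference between boundary values rather than a real acoustic measure.

```python
def select_units(candidates, target_cost, join_cost):
    """Find the cheapest sequence of units, one per target position.

    candidates:  list (one entry per target) of non-empty lists of candidate units.
    target_cost: function(position, unit) -> float.
    join_cost:   function(previous_unit, unit) -> float.
    Returns (best_path, total_cost).
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every target needs at least one candidate")

    # Each history entry holds (cheapest cost ending in this unit, backpointer).
    prev = [(target_cost(0, u), None) for u in candidates[0]]
    history = [prev]
    for i in range(1, len(candidates)):
        cur = []
        for u in candidates[i]:
            # Cheapest way to reach u from any candidate of the previous target.
            best_k, best_c = min(
                ((k, prev[k][0] + join_cost(candidates[i - 1][k], u))
                 for k in range(len(prev))),
                key=lambda kc: kc[1],
            )
            cur.append((best_c + target_cost(i, u), best_k))
        history.append(cur)
        prev = cur

    # Trace back from the cheapest final unit.
    j = min(range(len(prev)), key=lambda k: prev[k][0])
    total = prev[j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = history[i][j][1]
    path.reverse()
    return path, total


# Hypothetical unit layout: (name, distance_from_cluster_center, boundary_value).
def cluster_distance(position, unit):
    return unit[1]


def boundary_mismatch(prev_unit, unit):
    return abs(prev_unit[2] - unit[2])


demo_candidates = [
    [("a1", 0.0, 1.0), ("a2", 0.5, 5.0)],
    [("b1", 0.2, 5.0), ("b2", 0.1, 1.5)],
]
demo_path, demo_cost = select_units(demo_candidates, cluster_distance, boundary_mismatch)
```

Because the clustering step leaves only a small set of candidates per target, the search cost, which grows with the product of adjacent candidate counts, stays manageable.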

As touched on above, the choice of an inventory of units can be viewed as a line from a small inventory of phones, to diphones, to triphones, to arbitrary units. The direction you come from, though, influences the selection of the units from the database. CHATR lies firmly at the "arbitrary units" end of the spectrum. Although it can exclude bad units from its inventory, it very much takes an "everything minus some" view of the world. Microsoft's Whistler, on the other hand, starts off with a general database but selects typical units from it. Thus its inventory is substantially smaller than the full general database the units are extracted from. At the other end of the spectrum we have the fixed, pre-specified inventory, like diphone synthesis, as has been described in the previous chapter.

