The first task faced by any TTS system is the conversion of input text into a linguistic representation, usually called text-to-phonetic or grapheme-to-phoneme conversion. The difficulty of the conversion is highly language dependent and involves many problems. In some languages, such as Finnish, the conversion is quite simple because the written text corresponds almost directly to its pronunciation. For English and most other languages the conversion is much more complicated: a very large set of rules and their exceptions is needed to produce correct pronunciation and prosody for synthesized speech. Some languages also have special features, which are discussed more closely at the end of this chapter. The conversion can be divided into three main phases: text preprocessing, creation of linguistic data for correct pronunciation, and analysis of prosodic features for correct intonation, stress, and duration.
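As a rough illustration of the three phases named above, the following Python sketch chains a toy preprocessor, a pronunciation step, and a prosody step. Everything in it is an assumption made for illustration: the tables `ABBREVIATIONS`, `DIGITS`, and `LEXICON`, the function names, and the phoneme symbols are hypothetical, the one-symbol-per-letter fallback only approximates a language with near-phonemic spelling, and choosing a contour from the final punctuation mark is a deliberately crude stand-in for real prosodic analysis.

```python
# Toy tables; every entry here is a hypothetical placeholder.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {"two": ["t", "uw"], "one": ["w", "ah", "n"]}


def preprocess(text):
    """Phase 1: lowercase, expand abbreviations, spell out digit strings."""
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[raw])
            continue
        word = raw.strip(".,!?;:")
        if word.isdecimal():
            # Read numbers digit by digit; real systems handle full numerals.
            tokens.extend(DIGITS[int(d)] for d in word)
        elif word:
            tokens.append(word)
    return tokens


def to_phonemes(word):
    """Phase 2: exception lexicon first, else one symbol per letter."""
    return LEXICON.get(word, list(word))


def sentence_contour(text):
    """Phase 3 (very crude): pick a contour label from final punctuation."""
    return "rising" if text.rstrip().endswith("?") else "falling"


def convert(text):
    """Run all three phases and return (phoneme lists, contour label)."""
    words = preprocess(text)
    return [to_phonemes(w) for w in words], sentence_contour(text)
```

Calling `convert("Dr. Smith has 2 cats.")` expands the abbreviation and the digit before any pronunciation lookup takes place, which is the main point of keeping preprocessing as a separate phase.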
Synthetic speech can be compared and evaluated with respect to intelligibility, naturalness, and suitability for the intended application (Klatt 1987, Mariniak 1993). In some applications, for example reading machines for the blind, intelligibility at a high speech rate is usually more important than naturalness. On the other hand, prosodic features and naturalness are essential in multimedia applications or electronic mail readers. The evaluation can also be made at several levels, such as the phoneme, word, or sentence level, depending on what kind of information is needed.
SoftVoice, Inc. - Text-to-Speech Synthesis
With segmental evaluation methods, only the intelligibility of a single segment or phoneme is tested. A very commonly used method for testing the intelligibility of synthetic speech is the use of so-called rhyme tests and nonsense words. Rhyme tests have several advantages (Jekosh 1993). The number of stimuli is reduced and the test procedure is not time consuming. Naive listeners can participate without training, and reliable results can be obtained with relatively small subject groups, usually 10 to 20 listeners. Learning effects can also be discarded or measured. These features make rhyme tests easy and economical to perform. The obtained measure of intelligibility is simply the number of correctly identified words divided by the total number of words, and diagnostic information can be given by confusion matrices. Confusion matrices show how different phonemes are misidentified and help to localize the problem points for development. However, rhyme tests also have some disadvantages. With monosyllabic words only single consonants are tested; the vocabulary is fixed and public, so system designers may tune their systems for the test; and listeners might remember the correct answers when participating in the test more than once. To avoid these problems, Jekosh (1992) has presented the CLID test, described later in this chapter. Rhyme tests are available for many languages and are designed for each language individually. The most famous segmental tests are the Diagnostic and Modified Rhyme Tests described below. Some developers and vendors, such as Bellcore and AT&T, have also developed word lists of their own for diagnostic evaluation (Delogu et al. 1995).
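The scoring described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any particular published test: the function names are hypothetical, each trial is assumed to be a (presented word, listener response) pair, and the confusion matrix is built from the first letter of each word on the assumption that the test words differ only in a single initial consonant written with one letter.

```python
from collections import defaultdict


def intelligibility(trials):
    """Fraction of trials in which the response equals the presented word."""
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")
    correct = sum(1 for presented, response in trials if presented == response)
    return correct / len(trials)


def initial_consonant_confusions(trials):
    """Nested counts: matrix[presented_consonant][perceived_consonant].

    Assumes every word starts with a single-letter consonant.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for presented, response in trials:
        matrix[presented[0]][response[0]] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {row: dict(cols) for row, cols in matrix.items()}
```

For the four trials `[("bat", "bat"), ("pat", "bat"), ("cat", "cat"), ("mat", "mat")]` the score is 3 correct out of 4, and the row for "p" in the confusion matrix records one response heard as "b", which is the kind of diagnostic detail a single percentage hides.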
Talking Web Pages and the Speech Synthesis API - SitePoint
Speech quality is a multi-dimensional term and its evaluation involves several problems (Jekosh 1993, Mariniak 1993). The evaluation methods are usually designed to test speech quality in general, but most of them are also suitable for synthetic speech. It is very difficult, almost impossible, to say which test method provides the correct data. In a text-to-speech system not only the acoustic characteristics are important; text preprocessing and linguistic realization also determine the final speech quality. Separate methods usually test different properties, so for good results more than one method should be used. Finally, there remains the question of how to assess the test methods themselves.
SoftVoice even has extensive singing support!
During speech, the programmer can elect to receive a variety of different messages back from the synthesizer.
The Cluster Identification Test (CLID) was developed under the ESPRIT project SAM (Jekosh 1992, 1993). The test is based on a statistical approach. The test vocabulary is not predefined; it is generated separately for each test sequence. The test procedure consists of three main parts: a word generator, a phoneme-to-grapheme converter, and an automatic scoring module. The word generator produces the test material in phonetic representation. The user can determine the number of words to be generated, the syllable structure (e.g., CCVC, VC,...), and the frequency of occurrence of initial, medial, and final clusters separately. Syllable structures can also be generated in accordance with their statistical distribution. For example, the structure CCVC occurs more often than CCCVCCC. The words used are usually nonsense words. Since most synthesizers do not accept phoneme strings, the string has to be converted into a graphemic representation. Finally, the error rates are obtained automatically from the computer. Initial, medial, and final clusters are scored individually. Confusion matrices for investigating mix-ups between certain phonemes are also easy to generate from the data. In the CLID test an open-response answer sheet is used, and the listener can use either a phonemic or a graphemic transcription. The sound pressure level (SPL) used can also be chosen individually (Kraft et al. 1995).
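The word generator and the per-cluster scoring can be sketched as follows. This is only an illustration under stated assumptions, not the SAM implementation: the phoneme inventory, the weights in `STRUCTURE_WEIGHTS`, and all function names are hypothetical, one letter stands for one phoneme (so the phoneme-to-grapheme step is skipped), and only monosyllables are handled, which means initial and final clusters are scored while medial clusters are left out.

```python
import random

# Hypothetical inventory and weights; a real test would use language statistics.
CONSONANTS = list("ptkbdgmnlrsf")
VOWELS = list("aeiou")
STRUCTURE_WEIGHTS = {"CV": 5, "CVC": 4, "CCVC": 2, "CCCVCCC": 1}


def generate_word(rng, weights=STRUCTURE_WEIGHTS):
    """Draw a syllable structure by weight, then fill each C/V slot at random."""
    structures = list(weights)
    structure = rng.choices(
        structures, weights=[weights[s] for s in structures], k=1
    )[0]
    symbols = [rng.choice(CONSONANTS if slot == "C" else VOWELS)
               for slot in structure]
    return structure, "".join(symbols)


def split_clusters(word):
    """Split a monosyllable into (initial cluster, vowel part, final cluster)."""
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not vowel_positions:
        return word, "", ""
    first, last = vowel_positions[0], vowel_positions[-1]
    return word[:first], word[first:last + 1], word[last + 1:]


def score_response(target, response):
    """Score initial and final clusters separately (True means correct)."""
    t_init, _, t_final = split_clusters(target)
    r_init, _, r_final = split_clusters(response)
    return {"initial": t_init == r_init, "final": t_final == r_final}


def cluster_error_rates(pairs):
    """Error rate per cluster position over (target, response) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one response is required")
    errors = {"initial": 0, "final": 0}
    for target, response in pairs:
        for position, ok in score_response(target, response).items():
            errors[position] += 0 if ok else 1
    return {position: n / len(pairs) for position, n in errors.items()}
```

Passing a seeded generator such as `random.Random(0)` to `generate_word` makes a test sequence reproducible while still giving each sequence its own vocabulary. A response of "stap" to the target "strap" counts as an error in the initial cluster only, since the final cluster "p" was heard correctly.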