Synthesis has traditionally depended on very labor-intensive optimization work. Until recently, analysis by synthesis had been explored mainly through manual comparison of hand-tuned spectral slices with a reference spectrum. The work of Holmes and Pearce (1990) is a good example of how to speed up this process: with the help of a synthesis model, spectra are automatically matched against analyzed speech. Automatic techniques such as this will probably also play an important role in making speaker-dependent adjustments. One advantage of these methods is that the optimization is done in the same framework that will be used in production, so the synthesizer constraints are already imposed in the initial state.
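
The automatic matching described above can be illustrated with a toy sketch. It is not the Holmes and Pearce procedure: the single-resonance "synthesis model", the grid search, and all names (`resonance_spectrum`, `fit_resonance`, the candidate ranges) are assumptions chosen only to show the idea of adjusting model parameters until the synthetic spectrum is as close as possible to an analyzed one.

```python
import math

def resonance_spectrum(freq_hz, bandwidth_hz, grid_hz):
    """Log-magnitude (dB) of one resonance peak sampled on grid_hz.

    A Lorentzian-shaped peak is used as a stand-in for a real synthesizer model.
    """
    spectrum = []
    for f in grid_hz:
        power = 1.0 / ((f - freq_hz) ** 2 + (bandwidth_hz / 2.0) ** 2)
        spectrum.append(10.0 * math.log10(power))
    return spectrum

def spectral_distance(a_db, b_db):
    """Mean squared difference between two log spectra."""
    return sum((x - y) ** 2 for x, y in zip(a_db, b_db)) / len(a_db)

def fit_resonance(target_db, grid_hz, freq_candidates, bw_candidates):
    """Exhaustive search for the model parameters closest to the target."""
    best = None
    for freq in freq_candidates:
        for bw in bw_candidates:
            model_db = resonance_spectrum(freq, bw, grid_hz)
            dist = spectral_distance(model_db, target_db)
            if best is None or dist < best[0]:
                best = (dist, freq, bw)
    return best

# Hypothetical "analyzed speech" spectrum: here generated by the model itself.
grid = [100.0 * k for k in range(1, 41)]           # 100 Hz .. 4000 Hz
target = resonance_spectrum(700.0, 90.0, grid)

dist, freq, bw = fit_resonance(
    target, grid,
    freq_candidates=range(300, 1001, 50),
    bw_candidates=range(50, 151, 20),
)
print(freq, bw, dist)
```

Because the search runs through the synthesizer model itself, every candidate is by construction something the synthesizer can produce, which is the advantage the paragraph points out. A practical system would replace the grid search with a proper optimizer and fit many parameters per frame.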

One of the major problems in concatenative synthesis is selecting the best units and describing how to combine them. Two factors in particular cause trouble: distortion caused by spectral discontinuity at the connecting points, and distortion caused by the limited size of the unit set. Systems using elements of different lengths, depending on the target phoneme and its function, have been explored by several research groups. In a paper by Olive (1990), a new method for concatenating "acoustic inventory elements" of different sizes is described. The system developed at ATR is also based on nonuniform units (Sagisaka et al., 1992).
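
One common way to reason about the first factor, spectral discontinuity at the joins, is to assign each possible join a cost and pick the unit sequence with the lowest total. The sketch below is only an illustration under stated assumptions: the dictionary layout of a unit, the squared-distance `join_cost`, and the dynamic-programming search are hypothetical and are not taken from the systems cited above.

```python
def join_cost(left_unit, right_unit):
    """Squared distance between the last frame of one unit and the first of the next."""
    return sum((a - b) ** 2 for a, b in zip(left_unit["end"], right_unit["start"]))

def select_units(candidates):
    """Pick one unit per position so that the summed join cost is minimal.

    candidates: list (one entry per target position) of lists of unit dicts.
    Returns (total_cost, list_of_chosen_unit_names).
    """
    # best[i][j] = (cheapest cost of a path ending in candidate j at position i,
    #               index of the predecessor on that path)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            options = [
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            ]
            row.append(min(options))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j]["name"])
        j = best[i][j][1]
    path.reverse()
    return total, path

# Hypothetical two-dimensional "spectral" frames at unit boundaries.
candidates = [
    [{"name": "ka_1", "start": [0.0, 0.0], "end": [1.0, 2.0]},
     {"name": "ka_2", "start": [0.0, 0.0], "end": [3.0, 1.0]}],
    [{"name": "at_1", "start": [3.0, 1.5], "end": [0.0, 0.0]},
     {"name": "at_2", "start": [1.0, 2.0], "end": [5.0, 5.0]}],
    [{"name": "ta_1", "start": [0.0, 0.5], "end": [0.0, 0.0]},
     {"name": "ta_2", "start": [5.0, 4.0], "end": [0.0, 0.0]}],
]

total, path = select_units(candidates)
print(total, path)
```

Note that the locally best first join (`ka_1` to `at_2`, cost 0) is not part of the cheapest overall sequence, which is why the search considers whole paths rather than one join at a time. The second factor, the limited unit set, would show up here as positions with few or poorly matching candidates.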

Today's text-to-speech technology is much improved over that of even a few years ago. The older systems, which produced the robotic sounds that people tend to associate with computer voices, used the parametric, or formant, synthesis method to simulate the acoustic properties of speech.

The data for individual voices, including regional accents, are provided in separate files called "voices". The text-to-speech engine can work with any of the voices interchangeably.

Synthesis systems based on coding have as long a history as the vocoder. The underlying philosophy is that natural speech is analyzed and stored in such a way that it can be assembled into new utterances. Synthesizers such as the systems from AT&T Bell Labs (Olive, 1977, 1990; Olive and Liberman, 1985), Nippon Telegraph and Telephone (NTT) (Hakoda et al., 1990; Nakajima and Hamada, 1988), and ATR Interpreting Telephone Research Laboratories (ATR) (Sagisaka, 1988; Sagisaka et al., 1992) are based on the source-filter technique, in which the filter is represented in terms of linear predictive coding (LPC) or equivalent parameters. This filter is excited by a source model that can be of the same kind as the one used in terminal analog systems. The source must be able to handle all types of sounds: voiced and unvoiced vowels and consonants.
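
The source-filter arrangement described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the cited systems: the function names, the 8 kHz sampling rate, the single-resonance coefficients, and the use of an impulse train and uniform noise as stand-ins for voiced and unvoiced sources are all assumptions.

```python
import math
import random

def impulse_train(n_samples, period):
    """Idealized voiced source: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Idealized unvoiced source: uniform noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, lpc_coeffs, gain=1.0):
    """LPC synthesis filter: y[n] = gain * e[n] - sum_k a_k * y[n - k].

    lpc_coeffs holds [a_1, ..., a_p] under this sign convention.
    """
    out = []
    for n, e in enumerate(excitation):
        y = gain * e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out

# Hypothetical filter: one resonance near 500 Hz at an assumed 8 kHz sampling rate.
sample_rate = 8000
radius, theta = 0.9, 2.0 * math.pi * 500.0 / sample_rate
coeffs = [-2.0 * radius * math.cos(theta), radius ** 2]

voiced = all_pole_filter(impulse_train(80, period=80), coeffs)   # 100 Hz pitch
unvoiced = all_pole_filter(white_noise(80), coeffs, gain=0.1)
print(voiced[:4], unvoiced[:4])
```

Switching between the two excitations while keeping the same filter routine is what lets one structure cover both voiced and unvoiced sounds. Real systems update the coefficients frame by frame and typically use a more refined source than a bare impulse train.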

Text-to-speech is the automated synthesis of speech from text. The heart of the system is the text-to-speech engine, a sophisticated piece of software that: