A Mandarin Text-to-Speech System

This paper presents the implementation of a high-performance Mandarin TTS system. The system is composed of four main parts: text analysis (TA), prosodic information generation (PIG), a waveform table of 411 base-syllables (WT), and PSOLA-based waveform synthesis (PSOLA). In TA, a statistical-model-based method is first employed to automatically tag the input text, yielding the word sequence and the associated part-of-speech (POS) sequence. A lexicon containing about 80,000 words is used in the tagging process. The corresponding base-syllable sequence is then found and used to look up WT, forming the basic waveform sequence of the base-syllables. Some linguistic features used in PIG are also extracted in TA. In PIG, a four-layer recurrent neural network (RNN) is employed to generate prosodic information, including the pitch contour, energy level, initial duration, and final duration of syllables, as well as the inter-syllable pause duration. Lastly, in PSOLA, the basic waveform sequence is modified using the prosodic information to generate the output synthetic speech. The whole system has been implemented in software on a PC/AT 486 with a 16-bit Sound Blaster add-on card. The two versions, with sampling rates of 10 kHz and 20 kHz, require only 3.2 Mbyte and 5.5 Mbyte of memory, respectively. The system can synthesize speech in real time for any input Chinese text. Informal listening tests by many native Chinese speakers living in Taiwan have confirmed that the synthetic speech sounds very fluent and natural.
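
The data flow among the four parts can be illustrated with a minimal Python sketch. It is not the implementation described here: every name, the two-word lexicon, the two-sample waveforms, and the constant prosody values are hypothetical placeholders. Greedy longest-match lookup stands in for the statistical tagger, constant values stand in for the RNN output, and a simple gain-and-silence step stands in for PSOLA. Only the order of the stages and the kinds of data passed between them follow the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SAMPLE_RATE = 10_000  # Hz; the lower of the two sampling rates mentioned


@dataclass
class Prosody:
    """Per-syllable prosodic targets (field layout is an assumption)."""
    pitch_contour: List[float]
    energy: float
    initial_duration: float  # seconds
    final_duration: float    # seconds
    pause_after: float       # seconds of silence after the syllable


# Hypothetical toy lexicon: word -> (POS tag, base-syllables).
LEXICON: Dict[str, Tuple[str, List[str]]] = {
    "你好": ("V", ["ni", "hao"]),
    "世界": ("N", ["shi", "jie"]),
}

# Hypothetical waveform table: base-syllable -> stored samples.
WAVEFORM_TABLE: Dict[str, List[float]] = {
    "ni": [0.1, 0.2], "hao": [0.3, 0.4],
    "shi": [0.5, 0.6], "jie": [0.7, 0.8],
}


def text_analysis(text: str) -> Tuple[List[str], List[str], List[str]]:
    """TA stand-in: greedy longest-match lookup, not a statistical tagger."""
    words, pos, syllables = [], [], []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                tag, syls = LEXICON[text[i:j]]
                words.append(text[i:j])
                pos.append(tag)
                syllables.extend(syls)
                i = j
                break
        else:
            i += 1  # skip characters the toy lexicon does not cover
    return words, pos, syllables


def prosody_generation(syllables: List[str]) -> List[Prosody]:
    """PIG stand-in: constant values where the RNN output would go."""
    return [Prosody([1.0], 1.0, 0.05, 0.15, 0.0002) for _ in syllables]


def synthesize(syllables: List[str], prosody: List[Prosody]) -> List[float]:
    """Synthesis stand-in: applies gain and appends silence (not PSOLA)."""
    out: List[float] = []
    for syl, p in zip(syllables, prosody):
        out.extend(s * p.energy for s in WAVEFORM_TABLE[syl])
        out.extend([0.0] * round(p.pause_after * SAMPLE_RATE))
    return out


def tts(text: str) -> List[float]:
    """Run TA, then PIG, then table lookup and waveform modification."""
    _, _, syllables = text_analysis(text)
    return synthesize(syllables, prosody_generation(syllables))
```

Calling `tts("你好世界")` walks the text through all four stages and returns a flat list of samples. In a real system the POS sequence and other linguistic features from TA would feed the prosody stage, and the final stage would also modify pitch and duration.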