Make computer talk and record it

3/7/2023

Dial up a bank or airline these days, and chances are your call will be answered by a prerecorded voice rather than a live human being. By stringing together several canned phrases, such systems do an adequate job of bringing a banking or ticket-booking transaction to a successful conclusion. Though the cobbled-together speech sounds stilted, these systems are sufficient for handling limited transactions, where the subject matter is known in advance. But since they can't stray from their prerecorded phrases, their capabilities are limited.

Synthetic-speech researchers at IBM have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural. For example, we've developed systems that can "read" a breaking news story or a bunch of e-mail messages aloud over the phone.

Like the current phrase-splicing systems, our newest ones, called Supervoices, are also based on recordings of a human speaker, and they can respond in real time. The difference is that they can utter anything at all, including natural-sounding words the original speaker never said. What are the immediate uses of this technology? They include delivery of up-to-the-minute news, reading machines for people with disabilities, automotive voice controls and retrieving e-mail over the phone, or any system where the vocabulary is large, the content changes frequently or unpredictably, and a visual display isn't practical. In the future Supervoices could enhance video and computer games, handheld devices and even motion-picture production. IBM released the latest generation of the technology for commercial use in late 2002.

Scientists have attempted to simulate human speech since the late 1700s, when Wolfgang von Kempelen built a "Speaking Machine" that used an elaborate series of bellows, reeds, whistles and resonant chambers to produce rudimentary words. By the 1970s digital computing enabled the first generation of modern text-to-speech systems to reach fairly wide use. Makers of these systems attempted to model the entire speech production process directly, using a relatively small number of parameters. The result was intelligible, though somewhat robotic-sounding, speech. The advent of faster computers and inexpensive data storage in the late 1990s made today's most advanced synthetic speech possible. It is based on the premise that speech is composed of a finite number of linguistic building blocks called phonemes and that these can be arranged in new sequences to create any word. Therefore, a set of recordings of a speaker uttering all these building blocks can serve as a kind of typesetter's case for assembling speech.

Supervoices use this building-block model. While most of us think of language in terms of letters or words, the software treats it as a series of phonemes. English contains about 40 unique phonemes. For example, the word "please" is composed of four: P, L, EE and Z. A Supervoice contains a collection of recorded samples of each phoneme. When it comes time to speak, the software grabs the appropriate samples needed to piece together new words.

Speech synthesis starts with a human voice, so our team typically auditions dozens of speakers to find the right one for a given task. We usually look for someone with an agreeable voice and good, clear pronunciation that is free of any significant regional accent. At times, however, we may need other characteristics for a specialized application, such as synthesizing English with a foreign inflection or creating a robot voice for a movie. The speaker who lands the part sits in a sound booth and reads several thousand sentences, which take more than a week to record. The sentences are chosen for their diverse phonetic content, to ensure that we capture many examples of all the English phonemes in many different contexts. The result is a collection of several thousand voice files. Software then converts the written text of each sentence from a series of words into a series of phonemes. It also notes features of interest about each phoneme, such as which phonemes preceded and followed it, or whether it is the first or last one in a sentence.
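The phoneme building-block model described in this post can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not IBM's implementation: the lexicon `LEXICON`, the classes `Unit` and `Sample`, the functions `annotate`, `build_inventory`, `select` and `plan`, and the simple feature-counting score are all hypothetical names and choices made for illustration. The sketch converts text into phonemes, notes each phoneme's neighbors and sentence position, and then picks, for each phoneme of a new word, the recorded sample whose context matches best.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical toy lexicon. A real system would use a full pronunciation
# dictionary covering roughly 40 English phonemes.
LEXICON = {
    "please": ["P", "L", "EE", "Z"],
    "peas": ["P", "EE", "Z"],
    "lee": ["L", "EE"],
}


@dataclass(frozen=True)
class Unit:
    """A phoneme plus the context features the article mentions."""
    phoneme: str
    prev: Optional[str]   # phoneme that preceded it, if any
    next: Optional[str]   # phoneme that followed it, if any
    is_first: bool        # first phoneme in the sentence
    is_last: bool         # last phoneme in the sentence


@dataclass(frozen=True)
class Sample:
    """One recorded phoneme: its context and a made-up sample identifier."""
    unit: Unit
    sample_id: str


def annotate(sentence: str) -> List[Unit]:
    """Convert a sentence from words to phonemes and note each one's context."""
    phonemes = [p for word in sentence.lower().split() for p in LEXICON[word]]
    last = len(phonemes) - 1
    return [
        Unit(
            phoneme=p,
            prev=phonemes[i - 1] if i > 0 else None,
            next=phonemes[i + 1] if i < last else None,
            is_first=(i == 0),
            is_last=(i == last),
        )
        for i, p in enumerate(phonemes)
    ]


def build_inventory(recorded_sentences: List[str]) -> List[Sample]:
    """Stand-in for the recording session: one sample per recorded phoneme."""
    inventory = []
    for sentence in recorded_sentences:
        for i, unit in enumerate(annotate(sentence)):
            inventory.append(Sample(unit, f"{sentence}#{i}"))
    return inventory


def select(target: Unit, inventory: List[Sample]) -> Sample:
    """Pick the sample of the same phoneme whose context matches best.

    The score simply counts matching context features (an assumption made
    for this sketch, not a published cost function).
    """
    candidates = [s for s in inventory if s.unit.phoneme == target.phoneme]
    if not candidates:
        raise KeyError(f"no recorded sample for phoneme {target.phoneme!r}")

    def score(sample: Sample) -> int:
        u = sample.unit
        return sum([
            u.prev == target.prev,
            u.next == target.next,
            u.is_first == target.is_first,
            u.is_last == target.is_last,
        ])

    return max(candidates, key=score)


def plan(text: str, inventory: List[Sample]) -> List[str]:
    """Return the sample identifiers to splice together for new text."""
    return [select(unit, inventory).sample_id for unit in annotate(text)]


if __name__ == "__main__":
    # The "speaker" recorded only "peas" and "lee", never "please".
    inventory = build_inventory(["peas", "lee"])
    print(plan("please", inventory))
```

In this toy run the word "please" is assembled from samples of "peas" and "lee", even though the speaker never recorded it. For the EE sound, the sketch prefers the sample from "peas" because that EE was also followed by Z. A production system would additionally have to splice and smooth the actual audio, which this sketch does not attempt.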
