Speech

Good Places to Start

Readings Online

Related Web Sites

Related Pages

More Readings

Did you say:

"How to recognize speech"
or,
"How to wreck a nice beach"
or
"How to Wreck a Nice Beach You Sing Calm Incense"

...Simple inquiries about bank balance, movie schedules, and phone call transfers can already be handled by telephone-speech recognizers.
Voice activated data entry is particularly useful in medical or darkroom applications, where hands and eyes are unavailable, or in hands-busy or eyes-busy command and control applications. Speech could be used to provide more accessibility for the handicapped (wheelchairs, robotic aids, etc.) and to create high-tech amenities (intelligent houses, cars, etc.).
- Alex Waibel and Kai-Fu Lee, from Readings in Speech Recognition

The 1990s saw the first commercialization of spoken language understanding systems. Computers can now understand and react to humans speaking in a natural manner in ordinary languages within a limited domain. Basic and applied research in signal processing, computational linguistics and artificial intelligence have been combined to open up new possibilities in human-computer interfaces.


Good Places to Start

Common sense boosts speech software. By Eric Smalley. Technology Research News (March 23 / 30, 2005). "Speech recognition software matches strings of phonemes -- the sounds that make up words -- to words in a vocabulary database. The software finds close matches and presents the best one. The software does not understand word meaning, however. This makes it difficult to distinguish among words that sound the same or similar. The Open Mind Common Sense Project database contains more than 700,000 facts that MIT Media Lab researchers have been collecting from the public since the fall of 2000. These are based on common sense like the knowledge that a dog is a type of pet rather than the knowledge that a dog is a type of mammal. The researchers used the phrase database to reorder the close matches returned by speech recognition software. ... 'One surprising thing about testing interfaces like this is that sometimes, even if they don't get the absolutely correct answer, users like them a lot better,' said [Henry] Lieberman. 'This is because they make plausible mistakes, for example 'tennis clay court' for 'tennis player', rather than completely arbitrary mistakes that a statistical recognizer might make, for example 'tennis slayer',' he said. "

  • Also noted in the article is the related technical paper: "How to Wreck a Nice Beach You Sing Calm Incense," Intelligent User Interfaces Conference (IUI 2005), San Diego, January 9-12, 2005.
    • If you'd like to learn more about "wreck a nice beach," the classic acoustic ambiguity, see this article below.
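
For the flavor of the reranking idea, here is a minimal sketch (in Python): acoustically close hypotheses are reordered by how plausible their adjacent word pairs look in a phrase database. The phrase counts and scoring are invented stand-ins for the Open Mind data and the MIT system.

    # Rerank recognizer hypotheses by common-sense plausibility of word pairs.
    def plausibility(hypothesis, phrase_counts):
        words = hypothesis.lower().split()
        return sum(phrase_counts.get(pair, 0) for pair in zip(words, words[1:]))

    def rerank(hypotheses, phrase_counts):
        # Stable sort: ties keep the recognizer's original (acoustic) order.
        return sorted(hypotheses, key=lambda h: plausibility(h, phrase_counts),
                      reverse=True)

    # Toy phrase data standing in for the Open Mind Common Sense database.
    phrase_counts = {("tennis", "player"): 42, ("clay", "court"): 17}
    candidates = ["tennis slayer", "tennis player", "ten is player"]
    print(rerank(candidates, phrase_counts))
    # -> ['tennis player', 'tennis slayer', 'ten is player']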

Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Laboratory.

  • About SLS: "User: Yes, I would like the weather forecast for London, England, please. JUPITER: In London in England Wednesday, partly cloudy skies with periods of sunshine. High 82 and low 63. Is there something else? ... SLS researchers make this kind of dialogue look easy by empowering the computer to perform five main functions in real time: speech recognition -- converting the user's speech to a text sentence of distinct words; language understanding -- breaking down the recognized sentence grammatically, and systematically representing its meaning; information retrieval -- obtaining targeted data, based on that meaning representation, from the appropriate online source; language generation -- building a text sentence that presents the retrieved data in the user's preferred language; and speech synthesis -- converting that text sentence into computer-generated speech. Throughout the conversation, the computer also remembers previous exchanges." (A toy sketch of this five-function loop appears after this list.)
  • Core Technology Development: "To support its research on spoken language systems for human/computer interaction, the SLS group has developed its own suite of core speech technologies. These technologies include: * speech recognition (SUMMIT) * natural language understanding (TINA) * dialogue modeling * language generation (GENESIS) * speech synthesis (ENVOICE)."
    • Also see: Talking with Your Computer. By Victor Zue. Scientific American (August 1999). "When you ask Galaxy a question, a server called Summit matches your spoken words to a stored library of phonemes - the irreducible units of sound that make up words in all languages. Then Summit generates a ranked list of candidate sentences - the machine's guess at what you actually said."
  • Applications
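
To make the division of labor concrete, here is a toy sketch (in Python) of that five-function loop; every component below is a hypothetical stub, not the real SUMMIT, TINA, GENESIS, or ENVOICE technology.

    # A toy sketch of the five-stage dialogue loop; all components are invented
    # stand-ins for illustration only.
    def recognize(audio):            # speech recognition: audio -> text (stub)
        return "what is the weather in london"

    def understand(text):            # language understanding: text -> meaning frame
        return {"intent": "weather", "city": text.rsplit(" ", 1)[-1]}

    def retrieve(frame):             # information retrieval: frame -> data
        fake_db = {"london": "partly cloudy, high 82, low 63"}
        return fake_db.get(frame["city"], "no data")

    def generate(frame, data):       # language generation: data -> reply text
        return "In %s: %s. Is there something else?" % (frame["city"].title(), data)

    def synthesize(text):            # speech synthesis: text -> audio (stub: print)
        print("[speaking]", text)

    history = []                     # the system also remembers previous exchanges
    frame = understand(recognize(b"...pcm samples..."))
    history.append(frame)
    synthesize(generate(frame, retrieve(frame)))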

Conversations control computers. By Eric Smalley. Technology Research News (January 12/19, 2005). "Because information from spoken conversations is fleeting, people tend to record schedules and assignments as they discuss them. Entering notes into a computer, however, can be tedious -- especially when the act interrupts a conversation. Researchers from the Georgia Institute of Technology are aiming to decrease day-to-day data entry and to augment users' memories with a method that allows handheld computers to harvest keywords from conversations and make use of relevant information without interrupting the personal interactions. ... The researchers' system protects privacy by only using speech from the user's side of the conversation, said [Kent] Lyons."

Making Computers Talk - Say good-bye to stilted electronic chatter: new synthetic-speech systems sound authentically human, and they can respond in real time. By Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003). "Scientists have attempted to simulate human speech since the late 1700s, when Wolfgang von Kempelen built a 'Speaking Machine' that used an elaborate series of bellows, reeds, whistles and resonant chambers to produce rudimentary words."

  • One of the many resources referenced in the article is IBM's Interactive U.S. English Demo: "This demonstration of our work in unconstrained text-to-speech research allows users to submit text to be synthesized into speech."

Ernestine, Meet Julie - Natural language speech recognition is markedly improving voice-activated self-service. By Karen Bannan. CFO Magazine (January 1, 2005). "A new technology, called natural language speech recognition, is markedly improving voice-activated self-service. Powered by artificial intelligence, these speech-recognition systems are altering consumer perceptions about phone self-service, as calls for help no longer elicit calls for help. That, in turn, is spurring renewed corporate interest in the concept of phone self-service. In 2004, sales of voice self-service systems topped $1.2 billion. 'We've seen voice systems move from emerging technology to applied technology over the last few years,' says Steve Cramoysan, principal analyst at Stamford, Connecticut-based research firm Gartner. 'It's still fairly immature. But it's proven and moving toward the mainstream.'"

The Futurist - The Intelligent Internet. The Promise of Smart Computers and E-Commerce. By William E. Halal. Government Computer News Daily News (June 23, 2004). "Scientific advances are making it possible for people to talk to smart computers, while more enterprises are exploiting the commercial potential of the Internet. ... [F]orecasts conducted under the TechCast Project at George Washington University indicate that 20 commercial aspects of Internet use should reach 30% 'take-off' adoption levels during the second half of this decade to rejuvenate the economy. Meanwhile, the project's technology scanning finds that advances in speech recognition, artificial intelligence, powerful computers, virtual environments, and flat wall monitors are producing a 'conversational' human-machine interface. These powerful trends will drive the next generation of information technology into the mainstream by about 2010. ... The following are a few of the advances in speech recognition, artificial intelligence, powerful chips, virtual environments, and flat-screen wall monitors that are likely to produce this intelligent interface. ... IBM has a Super Human Speech Recognition Program to greatly improve accuracy, and in the next decade Microsoft's program is expected to reduce the error rate of speech recognition, matching human capabilities. ... MIT is planning to demonstrate their Project Oxygen, which features a voice-machine interface. ... Amtrak, Wells Fargo, Land's End, and many other organizations are replacing keypad-menu call centers with speech-recognition systems because they improve customer service and recover investment in a year or two. ... General Motors OnStar driver assistance system relies primarily on voice commands, with live staff for backup; the number of subscribers has grown from 200,000 to 2 million and is expected to increase by 1 million per year. The Lexus DVD Navigation System responds to over 100 commands and guides the driver with voice and visual directions."

From Your Lips to Your Printer. By James Fallows. The Atlantic (December 2000). "First, the computer captures the sound waves the speaker generates, tries to filter them from coughs, hmmmms, and meaningless background noise, and looks for the best match with the phonemes available. (A phoneme is the basic unit of the spoken word.)"
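
As a rough illustration of that best-match step, here is a small sketch (in Python) that compares a heard phoneme string against a tiny lexicon by edit distance; the ARPAbet-style entries and the noisy input are invented for the example.

    # Find the lexicon word whose phoneme sequence best matches what was heard,
    # using classic dynamic-programming (Levenshtein) edit distance.
    def edit_distance(a, b):
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        return d[len(a)][len(b)]

    lexicon = {
        "speech": ["S", "P", "IY", "CH"],
        "beach":  ["B", "IY", "CH"],
    }
    heard = ["B", "IY", "CH"]        # noisy phoneme hypothesis from the recognizer
    print(min(lexicon, key=lambda w: edit_distance(heard, lexicon[w])))  # -> beach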

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. By Daniel Jurafsky and James H. Martin. Prentice-Hall, 2000. Both the Preface and Chapter 1 are available online as are the resources for all of the chapters.

The comp.speech FAQs. Topics covered include: general information, signal processing, speech coding and compression, natural language processing, speech synthesis, and speech recognition.

  • Don't miss "the list of all the hyperlinks from the comp.speech FAQ. This is probably the biggest list of speech technology links available. The links are provided to WWW references, ftp sites, and newsgroups. Cross-references to the comp.speech WWW pages are also provided."

The online version of Hal's Legacy: 2001's Computer as Dream and Reality. Edited by David G. Stork.

Speech Recognition Using Neural Networks. By John-Paul Hosom, Ron Cole, and Mark Fanty at the Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology. "There are four basic steps to performing recognition. ... First, we digitize the speech that we want to recognize; for telephone speech the sampling rate is 8000 samples per second. Second, we compute features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). ... Third, a neural network (also called an ANN, multi-layer perceptron, or MLP) is used to classify a set of these features into phonetic-based categories at each frame. Fourth, a Viterbi search is used to match the neural-network output scores to the target words (the words that are assumed to be in the input speech), in order to determine the word that was most likely uttered." This tutorial also includes several diagrams that clarify many of the concepts. (A toy sketch of the classification and search steps appears after the bullet below.)

  • Also see their response to the question "What is Automatic Speech Recognition?" in which you'll be introduced to the Hidden Markov Model ... and then scroll down the page to learn about their current projects.
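
Here is that toy sketch (in Python) of steps three and four: hard-coded per-frame scores stand in for a trained network's outputs, and the single left-to-right word model /h ay/ ("hi") and all probabilities are invented.

    # Viterbi search over per-frame phoneme scores for a one-word model.
    import math

    states = ["h", "ay"]
    # frame_scores[t][s]: stand-in for the network's P(state s | frame t)
    frame_scores = [{"h": 0.8, "ay": 0.2},
                    {"h": 0.7, "ay": 0.3},
                    {"h": 0.2, "ay": 0.8},
                    {"h": 0.1, "ay": 0.9}]
    trans = {("h", "h"): 0.6, ("h", "ay"): 0.4, ("ay", "ay"): 0.9}  # left-to-right

    def viterbi(frames):
        best = {"h": (math.log(frames[0]["h"]), ["h"])}   # model must start in "h"
        for obs in frames[1:]:
            new = {}
            for s in states:
                cands = [(lp + math.log(trans[(p, s)]) + math.log(obs[s]), path + [s])
                         for p, (lp, path) in best.items() if (p, s) in trans]
                if cands:
                    new[s] = max(cands)
            best = new
        return max(best.values())

    score, path = viterbi(frame_scores)
    print(path)   # -> ['h', 'h', 'ay', 'ay']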

Experts Use AI to Help GIs Learn Arabic. By Eric Mankin. USC News (June 21, 2004). "To teach soldiers basic Arabic quickly, USC computer scientists are developing a system that merges artificial intelligence with computer game techniques. The Rapid Tactical Language Training System, created by the USC Viterbi School of Engineering's Center for Research in Technology for Education (CARTE) and partners, tests soldier students with videogame missions in animated virtual environments where, to pass, the students must successfully phrase questions and understand answers in Arabic." Read the story and then watch the video!

Readings Online

Speech in Education. By Phillip Britt. Speech Technology Magazine (June / July 2005). "Speech-enabled applications and hardware are increasingly finding their way into the classroom and into the offices of educators at all levels of education, but educational applications still represent a small, though growing, segment of the speech technology market, according to industry analysts."

A Short Introduction to Text-to-Speech Synthesis. By Thierry Dutoit. The Circuit Theory and Signal Processing Lab of the Faculté Polytechnique de Mons. "I try to give here a short but comprehensive introduction to state-of-the-art Text-To-Speech (TTS) synthesis by highlighting its Digital Signal Processing (DSP) and Natural Language Processing (NLP) components. As a matter of fact, since very few people associate a good knowledge of DSP with a comprehensive insight into NLP, synthesis mostly remains unclear, even for people working in either research area."
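
To show the split Dutoit describes, here is a deliberately crude sketch (in Python): an NLP front end that maps letters to phonemes, and a DSP back end that renders each phoneme as a short sine tone and writes a WAV file. The letter-to-sound table and tone frequencies are made up; real systems are far more sophisticated on both sides.

    # Toy TTS: NLP front end (text -> phonemes), DSP back end (phonemes -> waveform).
    import math, struct, wave

    L2P = {"a": "AA", "e": "EH", "i": "IY", "o": "OW", "u": "UW"}  # toy letter-to-sound
    TONES = {"AA": 220.0, "EH": 330.0, "IY": 440.0, "OW": 550.0, "UW": 660.0}

    def nlp_front_end(text):
        # Real front ends also normalize numbers, abbreviations, dates, etc.
        return [L2P[c] for c in text.lower() if c in L2P]

    def dsp_back_end(phonemes, path="toy_tts.wav", rate=8000):
        samples = []
        for ph in phonemes:                       # 100 ms sine tone per phoneme
            samples += [int(12000 * math.sin(2 * math.pi * TONES[ph] * n / rate))
                        for n in range(rate // 10)]
        with wave.open(path, "wb") as w:
            w.setnchannels(1); w.setsampwidth(2); w.setframerate(rate)
            w.writeframes(struct.pack("<%dh" % len(samples), *samples))

    dsp_back_end(nlp_front_end("audio"))          # writes a very robotic toy_tts.wav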

Capitalize on Customer Conversations with Speech Analytics. By Donna Fluss. Speech Technology Magazine (September / October 2005). "For years, speech analytics have been used worldwide by security organizations to help government agencies identify potential risks and threats. In the past two years, contact centers have begun to use speech analytics applications to capture and structure customer communications. The applications analyze the structured data to identify customer trends and insights for the purpose of improving service quality, customer satisfaction, and generating new revenue. There are three major analysis techniques and outputs from speech analytics: Keyword or Key Phrase Identification ... Emotion Detection ... Talk Analysis."
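
The first of those techniques is easy to picture once speech has been turned into text; here is a small sketch (in Python) that tallies key phrases across call transcripts, with the phrase list and transcripts invented for the example.

    # Keyword / key-phrase identification over (already-recognized) transcripts.
    KEY_PHRASES = ["cancel my account", "supervisor", "not working"]

    def spot(transcript, phrases=KEY_PHRASES):
        text = transcript.lower()
        return {p: text.count(p) for p in phrases if p in text}

    calls = ["I want to cancel my account, the service is not working.",
             "Could I speak to a supervisor? It is still not working."]
    for i, call in enumerate(calls):
        print(i, spot(call))
    # 0 {'cancel my account': 1, 'not working': 1}
    # 1 {'supervisor': 1, 'not working': 1}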

Automatic Speech Recognition, Spring 2003. Staff Instructors: Dr. James Glass and Professor Victor Zue. Available from MIT OpenCourseWare. "6.345 is a course in the department's 'Bioelectrical Engineering' concentration. This course offers a full set of lecture slides with accompanying speech samples, as well as homework assignments and other materials used in the course. 6.345 introduces students to the rapidly developing field of automatic speech recognition. Its content is divided into three parts. Part I deals with background material in the acoustic theory of speech production, acoustic-phonetics, and signal representation. Part II describes algorithmic aspects of speech recognition systems including pattern classification, search algorithms, stochastic modelling, and language modelling techniques. Part III compares and contrasts the various approaches to speech recognition, and describes advanced techniques used for acoustic-phonetic modelling, robust speech recognition, speaker adaptation, processing paralinguistic information, speech understanding, and multimodal processing."

IBM gets smart about Artificial Intelligence. By Pamela Kramer. IBM Think Research (June 2001). "Computer vision is important to speech recognition, too. Visual cues help computers decipher speech sounds that are obscured by environmental noise. Chalapathy Neti, manager of IBM's audiovisual speech technologies (AVST) group at Watson, often cites HAL's lip-reading ability in 2001 in promoting the group's work."

  • Also see: You just don't understand! IBM's Superhuman Speech initiative clears conversational confusion. By Sam Howard-Spink. IBM Think Research (September 2002). "It's a vision of the future that's been promised for decades — humans and machines interacting with each other by voice."

Men all ears as health technology gets hearing. The Northern Daily Leader & tamworth.yourguide (June 16, 2004). "A revolutionary hearing aid was just one of a number of new technological exhibits on show at the Men's Health Expo in Tamworth yesterday to coincide with Men's Health Week. The hearing aid allows the person wearing it to focus on a specific conversation more clearly while drowning out any other noises in the room. It has been designed to select the best speech over noise using parallel processing through a new concept called syncro. ... Spokesman James Battersby for Oticon, which manufactures the hearing aid, said ... 'Its design has been created by using artificial intelligence and allows the wearer to cancel out up to four different noises simultaneously.'"

The Power of Speech. By Lawrence Rabiner, Center for Advanced Information Processing, Rutgers University. Science (September 12, 2003; Volume 301, Number 5639: 1494-1495). "In the multimedia world of future communications, speech will play an increasingly important role. From speaker verification to automatic speech recognition and the understanding of key phrases by computers, the spoken word will replace keyboards and pointing devices like the mouse. In his Perspective, Rabiner discusses recent advances and remaining challenges in the processing of speech by communication devices. The key challenge is to make the user interface for 21st-century services and devices as easy to learn and use as a telephone is today for voice conversations."

Computers That Speak Your Language. By Wade Roush. Technology Review (June 2003). Be sure to see the illustration in the article: Inside a Conversational Computer.

Linguistic Knowledge and Empirical Methods in Speech Recognition. By Andreas Stolcke. (1997). AI Magazine 18 (4): 25-32.

Is There a Future for Speech in Vehicles? By Kenneth White, Harvey Ruback and Roberto Sicconi. Speech Technology Magazine (November / December 2004). "Today, speech recognition technology is becoming an important component in how people are using and interacting with their cars. ... Many people associate speech in cars with science fiction movies and television shows where the cars act like R2D2 robots on wheels. In today’s world the main reason for using speech is less Hollywood and more pragmatic. In fact, it usually boils down to safety. ... The car represents a very challenging environment for voice technologies. The challenges range from creating optimal operation in an unpredictable and noisy environment to dealing with very limited system resources, such as memory/CPU."

Related Web Sites

"The Institute for Signal and Information Processing (ISIP) [at Mississippi State University] has been established to launch a multidisciplinary program to develop next generation information processing techniques. Research at ISIP is centered on intelligent information processing, perhaps the most important technology of the next century. ISIP draws upon a wide range of research experience in areas such as signal processing, communications, natural language, database query, intelligent systems, and discrete controls. Its present vision is to develop systems capable of intelligent interactions with users by the integration of a multiplicity of interface technologies including speech, natural language, database query, and imaging."

The Centre for Speech Technology Research at the University of Edinburgh [CSTR]: "Founded in 1984, CSTR is concerned with research in all areas of speech technology including speech recognition, speech synthesis, speech signal processing, information access, multimodal interfaces and dialogue systems. We have many collaborations with the wider community of researchers in language, cognition and machine learning for which Edinburgh is renowned." Be sure to see their collection of current research projects.

HP SpeechBot - audio search using speech recognition. From Hewlett-Packard.

  • How Does SpeechBot Work? "After one of these radio programs goes to air, HP uses its speech recognition software to create a time-aligned 'transcript' of the program and build an index of the words spoken during the program. When you use SpeechBot, it searches through the shows we have indexed, trying to match your words with those in the index. SpeechBot then displays the matches for your search in order of likely relevance." (A bare-bones sketch of this index-and-match approach follows this list.)
  • About SpeechBot FAQs
  • Technical Whitepaper
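
Here is that bare-bones sketch (in Python): a time-aligned index maps each recognized word to when it was spoken, and queries return shows ranked by how many query-word hits they contain. The shows and timings are fabricated; this is not HP's implementation.

    # SpeechBot-style search: inverted index of (show, time) postings per word.
    from collections import defaultdict

    index = defaultdict(list)        # word -> [(show_id, seconds_into_show), ...]

    def add_show(show_id, timed_words):
        for word, t in timed_words:
            index[word.lower()].append((show_id, t))

    def search(query):
        hits = defaultdict(list)
        for word in query.lower().split():
            for show_id, t in index.get(word, []):
                hits[show_id].append((word, t))
        return sorted(hits.items(), key=lambda kv: -len(kv[1]))   # most hits first

    add_show("show42", [("speech", 12.0), ("recognition", 12.6), ("news", 80.2)])
    add_show("show43", [("weather", 5.1), ("news", 6.0)])
    print(search("speech recognition"))   # show42 ranked first, with word timings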

The Meeting Recorder Project at ICSI [The International Computer Science Institute]. "Despite recent advances in speech recognition technology, successful recognition is limited to co-operative speakers using close-talking microphones. There are, however, many other situations in which speech recognition would be useful - for instance to provide transcripts of meetings or other archive audio. Speech researchers at ICSI, UW, SRI, and IBM are very interested in new application domains of this kind, and we have begun to work with recorded meeting data." - from the Introduction

ModelTalker. From the Speech Research Laboratory, duPont Hospital for Children and University of Delaware. Not only can you pick a voice for the demo, but you can also pick an emotion!

Quantifying Room Acoustic Quality Using Artificial Neural Networks Project. Salford Acoustics Audio and Video at the University of Salford. "This project was concerned with spaces where good acoustics are required for speech. Such spaces include shopping malls and railway stations where announcements need to be intelligible, and theatres where the quality of sound plays a crucial role in the enjoyment of a performance. The project researched a novel measurement technique intended to increase understanding of acoustics by enabling in-use, non-invasive evaluation of room acoustics to be made. ... The measurement system proposed derives the acoustic quality from a speech signal as received by a microphone in a room. Neural networks learn how to extract the determining characteristics from the speech signals that lead to the objective parameters. In this way, the neural networks predict the reverberation time, early decay time, STI (Speech Transmission Index) and RASTI (RApid Speech Transmission Index). In addition to enabling occupied measurements, the development of the neural network sensing system is of academic interest, as it is forming an artificial intelligence system to mimic the behaviour of human perception."

Speech at CMU Web Page. An extensive collection of speech resources from Carnegie Mellon University with links to many exciting projects (both at CMU and around the world).

Talking Heads. "This website provides an overview of the rapidly growing international effort to create talking heads (physiological / computational / cognitive models of audio-visual speech), the historical antecedents of this effort, and related work. Links are provided (where possible) to the sites of many researchers and commercial entities working in this diverse and exciting area." "This site is also designed as a working outline for a book presently being written by Philip Rubin and Eric Vatikiotis-Bateson," who maintain the site. Here's a peek at what awaits you:

  • Simulacra - The Early History of Talking Machines, which begins with the statement: "The earliest speaking machines were perceived as the heretical works of magicians and thus as attempts to defy god."
  • Speech Synthesis: "A revolution occurred in speech technology when the digital computer permitted the simulation of electronic circuitry, the conversion of analog signals to digital form, and the creation of analog signals from digital information (in this case, sound in the form of speech). The advent of desktop computing in the 1980s and 1990s brought affordable speech synthesis and recognition within the reach of the average computer user."
  • Other topics include: Vocal Tracts, Articulators, Speech Production, McGurk, Speechreading, Facial Animation and Avatars.

Dennis Klatt's History of Speech Synthesis. "Audio clips of synthetic speech illustrating the history of the art and technology of synthetically produced human speech."

Related Pages in AI Topics

More Readings

Aaron, Andy and Ellen Eide, John F. Pitrelli. June 2005: Conversational Computers. Scientific American (subscription req'd). "Call a large company these days, and you will probably start by having a conversation with a computer. Until recently, such automated telephone speech systems could string together only prerecorded phrases. ... Computer-generated speech has improved during the past decade, becoming significantly more intelligible and easier to listen to. But researchers now face a more formidable challenge: making synthesized speech closer to that of real humans--by giving it the ability to modulate tone and expression, for example--so that it can better communicate meaning. This elusive goal requires a deep understanding of the components of speech and of the subtle effects of a person's volume, pitch, timing and emphasis. That is the aim of our research group at IBM and those of other U.S. companies, such as AT&T, Nuance, Cepstral and ScanSoft, as well as investigators at institutions including Carnegie Mellon University, the University of California at Los Angeles, the Massachusetts Institute of Technology and the Oregon Graduate Institute."

Erman, Lee D. and Frederick Hayes-Roth, Victor R. Lesser, D. Raj Reddy. 1980. The Hearsay-II Speech-Understanding System: Integrating Knowledge to Resolve Uncertainty. ACM Computing Surveys 12(2): 213 - 253. "The Hearsay-II speech-understanding system ... recognizes connected speech in a 1000-word vocabulary with correct interpretations for 90 percent of test sentences. Its basic methodology involves the application of symbolic reasoning as an aid to signal processing. A marriage of general artificial intelligence techniques with special acoustic and linguistic knowledge was needed to accomplish satisfactory speech-understanding performance."

Nii, Penny H. 1986. Blackboard Systems: The Blackboard Model of Problem Solving and the Evolution of Blackboard Architectures. AI Magazine 7 (2): 38-64. "The first blackboard system was the HEARSAY-II speech understanding system (Erman et al., 1980) that evolved between 1971 and 1976. Subsequently, many systems have been built that have similar system organization and run-time behavior. The objectives of this article are (1) to define what is meant by 'blackboard systems' and (2) to show the richness and diversity of blackboard system designs."
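
The following sketch (in Python) shows the blackboard model in miniature: independent knowledge sources watch a shared blackboard and post higher-level hypotheses when their preconditions match, repeating until nothing more fires. The two toy knowledge sources are invented and vastly simpler than HEARSAY-II's.

    # Minimal blackboard: knowledge sources fire opportunistically until quiescent.
    blackboard = {"phonemes": ["h", "ay"], "words": [], "phrases": []}

    def phoneme_to_word(bb):         # KS 1: phoneme hypotheses -> word hypotheses
        if bb["phonemes"] == ["h", "ay"] and "hi" not in bb["words"]:
            bb["words"].append("hi")
            return True
        return False

    def word_to_phrase(bb):          # KS 2: word hypotheses -> phrase hypotheses
        if "hi" in bb["words"] and not bb["phrases"]:
            bb["phrases"].append("greeting")
            return True
        return False

    knowledge_sources = [phoneme_to_word, word_to_phrase]
    while any(ks(blackboard) for ks in knowledge_sources):
        pass                         # loop again whenever any KS changed the board
    print(blackboard)                # phonemes -> word 'hi' -> phrase 'greeting'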