ALDA Speech Recognition Panel – Part 1

Presentation: Automatic Speech Recognition Systems as a Conversational Aid for People Who Are Deaf or Hard of Hearing

October 1999 Association of Late-Deafened Adults Conference

Reported by Jim House, TDI

Presenters (2 sessions):

Dr. Carl Jensema – Institute on Disability
Dr. Ross Stuckless – NTID/RIT
Dr. Judy Harkins – Gallaudet Technology Assessment Program
Dr. Anita Haravon – Lexington School/Center for the Deaf

In 1876, Alexander Graham Bell announced his new invention, the telephone. Ironically, it grew out of what was probably the earliest attempt to make speech visible as an aid for his mother and wife, who both had hearing loss.

In 1950, Bell Labs created the first real speech recognition machine, which matched spoken audio patterns against patterns stored in the system. It was speaker dependent and needed extensive training for a vocabulary of just 10 words. The president of Bell Labs did not see much of a future in automatic speech recognition (ASR) and gave the project very little support.
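To give a feel for that kind of template matching, here is a minimal sketch in Python: the features of an incoming sound are compared against stored reference patterns, and the closest match wins. The two-number "voiceprints" and the digit vocabulary below are made-up illustrations, not the actual Bell Labs design.

    import math

    def distance(a, b):
        # Euclidean distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def recognize(spoken, templates):
        # Return the vocabulary word whose stored pattern is nearest.
        return min(templates, key=lambda word: distance(spoken, templates[word]))

    # Ten stored patterns, one per digit, as toy two-number "voiceprints".
    templates = {str(d): [d * 0.1, 1.0 - d * 0.1] for d in range(10)}
    print(recognize([0.32, 0.71], templates))  # prints "3", the nearest template

Because every new word needs its own stored pattern, and every new speaker needs new patterns, this approach explains why early systems were speaker dependent and limited to tiny vocabularies.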

In the ’50s and ’60s, researchers found that ASR was a much tougher goal than they had originally expected. Speech recognition requires many calculations in a very short time, so further development depended on the availability of faster computers. These calculations draw on formulas from mathematics, audiology, and computer science. Researchers decided to focus on developing a system that could recognize the discrete speech of one person who paused between words, using a small vocabulary of 50 words or fewer. During this period, IBM and Carnegie Mellon University in Pittsburgh, PA did much of the basic ASR research.

During the early ’70s, Threshold Technology, Inc. developed the first real ASR product, the VIP-100 System. It had little practical application, but it nevertheless drew the interest of the Advanced Research Projects Agency (ARPA) of the US Department of Defense. From 1971 through 1976, ARPA funded Speech Understanding Research (SUR) projects at three contractors: Carnegie Mellon University; Bolt, Beranek & Newman; and System Development Corporation. Each contractor was to build an ASR system that could recognize continuous speech from multiple speakers with a 1,000-word vocabulary. Only one of the contractors met these specifications: Carnegie Mellon University’s “Harpy” recognized 1,011 words with 95% accuracy.

ARPA continued to support SUR projects during the ’80s as personal computers became available. Carnegie Mellon University went on to develop what is now the Dragon Dictate speech recognition system, one of the first ASR systems to use hidden Markov modeling, a technique now used by almost all ASR systems. IBM was also very active in ASR and did important work on statistical modeling techniques. ASR research began to focus on larger-vocabulary systems and on telephone interactive voice menus that use a small vocabulary while remaining speaker independent. The best systems could recognize discrete speech from one speaker with 90% accuracy, after weeks of training, on known subject material without background noise. Several more companies began to develop their own ASR products, such as Dragon Systems, Inc.; IBM; ITT Defense Communications; Kurzweil AI; Mimic; Speech Systems, Inc.; Vocollect; Voice Connexion; Voice Control Systems; and Voice Processing Corporation.
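To give a feel for hidden Markov modeling, here is a minimal sketch of the Viterbi algorithm, the standard way to find the most likely sequence of hidden states (such as phonemes) behind a series of acoustic observations. The tiny two-state model and its probabilities are invented for illustration; a real recognizer uses thousands of states trained on recorded speech.

    def viterbi(observations, states, start_p, trans_p, emit_p):
        # best[t][s] = probability of the best path ending in state s at time t;
        # back[t][s] remembers which state that best path came from.
        best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            best.append({})
            back.append({})
            for s in states:
                # Pick the predecessor state that maximizes the path probability.
                prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
                best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
                back[t][s] = prev
        # Trace back from the most probable final state.
        last = max(states, key=lambda s: best[-1][s])
        path = [last]
        for t in range(len(observations) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path

    # Toy example: two phoneme-like states emitting coarse acoustic labels.
    states = ("vowel", "consonant")
    start_p = {"vowel": 0.5, "consonant": 0.5}
    trans_p = {"vowel": {"vowel": 0.6, "consonant": 0.4},
               "consonant": {"vowel": 0.7, "consonant": 0.3}}
    emit_p = {"vowel": {"loud": 0.8, "quiet": 0.2},
              "consonant": {"loud": 0.3, "quiet": 0.7}}
    print(viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p))
    # prints ['consonant', 'vowel', 'vowel']

Unlike simple template matching, this statistical approach weighs how likely each sound is and how likely sounds are to follow one another, which is what made larger vocabularies and speaker independence possible.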

In the last decade, personal computers reached the point where speech recognition could be done quickly, beginning with the introduction of the 486 processor in 1989. From there, speech recognition power increased dramatically and prices of ASR systems plummeted. Large vocabularies became the norm, while continuous speech recognition and artificial neural networks were introduced in commercially available systems. Standards for computer application programming interfaces (APIs) began to emerge and many applications of ASR appeared. More and more technology companies are entering the ASR field, especially those in the computer and telephone industries. Computing power has grown dramatically since the 486 processor: the Pentium Pro chip runs at 200 MHz, the Pentium II at 300 MHz, and the Pentium III in use today at more than 600 MHz. Today’s chips are about 12 times as fast as the 486 processor of ten years ago. Within a year, we can expect to see computers on the market that run at 1,000 MHz.

ASR systems on the market today are relatively inexpensive and easy to use. It helps if the speaker is reasonably computer literate and wears a headset. The speaker also has to prepare in advance by adding specialized words to the vocabulary and by spending approximately 30 minutes training the system to recognize his or her voice patterns before first use. Once trained, the speaker can speak naturally and continuously but must watch out for false starts like “umm” and “ahh,” monitor the output on the screen, and fix errors as they occur.

ASR will become a common feature on personal computers as computing power continues to grow. Major speech recognition systems in use today include:

Dragon Systems’ NaturallySpeaking

IBM’s ViaVoice

Lernout & Hauspie’s (L&H) Voice Xpress

Philips Dictation Systems’ FreeSpeech

Continued in Part 2.