Voice Recognition Captioning
Last year I wrote about my experience using voice recognition technology to teach a computer class to people with hearing loss. It was a surprisingly positive experience for all concerned, and I mentioned at the time that I foresaw numerous applications for the technology.
I have just finished captioning an ALDA meeting using the same technology. What’s different from the computer class is that in this case I was not the speaker. Rather than saying what I wanted to, my job was to repeat what the speakers said into the voice recognition software. This, in itself, is quite different from deciding what to say and just saying it.
An additional difference from speaking my own words is that I had to ensure that my repetition of the speaker’s words wasn’t distracting to the speaker or to others in the audience.
The basic tools in either situation are a reasonably fast laptop computer equipped with voice recognition software and a microphone, an LCD projector, and a screen. I talk into the microphone, which feeds audio to the voice recognition software via the sound card. The voice recognition software converts the audio to text, which is output to the LCD projector. The projector puts the text up on the screen.
For the computer class I used a headset with microphone. This was appropriate equipment for that situation, because those students with enough hearing to benefit from hearing my voice were able to do so.
For the ALDA meeting, however, I wanted to muffle my voice as much as possible. To do so I used a stenomask from Talk Technologies (http://www.talk-tech.net/pages/sylencer.html). The stenomask fits over the mouth and seals against the face to muffle the voice. It’s not 100% effective, so the audience can still hear something, but it’s far less distracting than speaking at normal volume into a conventional microphone.
To train ViaVoice (the voice recognition software I’m using) for the computer class, I spent about an hour reading the stories that ViaVoice provides for software training. After that training, ViaVoice performed with about 95% accuracy, provided I was careful to speak clearly and distinctly. If I got lazy, the accuracy declined fast.
To use ViaVoice with the stenomask, I had to train a whole new model. Just as the software must be trained for each person who uses it, it must also be trained for each new hardware configuration. Changing a sound card or microphone, or even the background noise, can necessitate a new voice model.
So I trained ViaVoice with the stenomask for an hour using the provided stories, after which the accuracy was probably only 75%. I was disappointed at the performance, but not surprised, because the stenomask really requires an acclimation period. Ensuring a tight seal (to reduce escaping sound) requires that the mask be held firmly against the face. This makes it hard to move the lips in a natural and consistent manner, which certainly degrades the software performance.
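As an aside, for anyone who wants to track accuracy figures like these more precisely than "probably 75%," a common measure is word error rate: the word-level edit distance between what was actually said and what the recognizer produced. This little Python sketch is my own illustration of the idea, not anything ViaVoice provides:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance between what was said
    (reference) and what the recognizer produced (hypothesis), divided
    by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance, counting whole words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # word deleted
                          d[i][j - 1] + 1,          # word inserted
                          d[i - 1][j - 1] + cost)   # word substituted
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four: 25% error, i.e. roughly "75% accuracy"
# would mean one word in four coming out wrong.
print(word_error_rate("the quick brown fox", "the quick crown fox"))  # 0.25
```

Running a transcript of a training session through something like this would make it easy to see whether additional training is actually helping.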
The other problem is the restricted air movement that the mask causes. One result is that breathing is different, and that takes some getting used to. More closely related to the accuracy issue is the fact that the sealed stenomask prohibits normal exhalation as a person speaks. The pressure builds up and makes vocalization difficult, which affects how sounds are produced.
I continued training the stenomask voice model for another few hours, but was unable to significantly improve the accuracy.
Hmmmm . . . what to do?
After a while I realized what the problem was. When I first started using the stenomask, I was not at all used to it, and my speech was not at all natural. With additional training, I became more comfortable with the equipment, and I was able to speak more naturally. But that speech was very different from the speech with which the model had originally been trained! The problem was that the original training was not representative of my later speech, and no reasonable amount of additional training could overcome the original corrupted training.
So I started over with the provided training stories, and was able to get about 90% accuracy after the first hour. I attribute the slightly degraded performance (compared to using the microphone) to the fact that I’m still not entirely comfortable with the stenomask, so I don’t speak consistently.
So how did the meeting go?
Very well, actually! The system exceeded my expectations. I was pretty much able to keep up with the speakers, and the accuracy was high, as long as I was careful to speak clearly. But as before, the first hint of lazy speech was brutally punished.
Oh, by the way, the reason I’m doing this is because our ALDA group just lost the funding that paid for CART services. We’re looking for new funding, of course, but these are difficult times. It may be that I’ll be providing voice recognition captioning for quite some time.
And why am I telling you all this? It’s not just because I like to whine ;-} It’s because voice recognition is a very real option for organizations that can’t afford traditional captioning. If your organization can find a willing volunteer and can borrow an LCD projector, it’s very doable at very reasonable cost.
And the quality? I’d say it was as good as some traditional CART reporters I’ve seen. It’s nowhere near the quality of the best CART reporters – yet. But I saved the text and audio files from today’s meeting and I’ll use them to continue training the system. Between that and more practice time for me, I wouldn’t be surprised to be rivaling the best CART reporters in a matter of months.
I’ll be happy to do what I can to help anyone who wants to pursue this. Just email me – firstname.lastname@example.org