The main goal of Human-Computer Interaction (HCI) is to enable machines to communicate better with humans, and in human society the most natural form of communication is speech. Speech-enabled devices, especially smartphones, are now common in daily life. Renowned organizations use automated voice messages to answer common queries, so customers are no longer required to press interface buttons. We can use speech recognition systems at home, in the office, and for personal use. The technology gained widespread attention when Apple released iOS 5 with Siri for the iPhone 4S in 2011.
A variety of software applications for speech recognition is available in the market, and such applications are now considered integral parts of operating systems. Gone are the days when typing was considered a tedious job; it is all about voice commands now: just speak and it is done, thanks to speech recognition technologies. Research advances in this field have opened new horizons for applications in other areas of life. Computing devices can now easily run speech recognition software, which not only improves interfaces but also helps visually impaired people, who can use these devices for a limited set of operations.
Once such users learn the system, they can send messages, make calls, or even send emails [1]. Voice commands help impressively in office activities, including voice dialing, setting reminders, and typing messages and emails, which markedly improves office productivity [2]. There can be endless applications in our day-to-day life [3]. Some car manufacturers have introduced voice commands for many in-car tasks: drivers can navigate, call, or send messages using voice instructions. An introductory, inspirational video is available at [4].
How Voice Is Converted to Text
Voice-to-text conversion is not a simple task; computing devices have to carry out several complex steps. When a human speaks, the voice creates vibrations (waves of pressure) in the air. These are referred to as analog signals. The analog signals are converted to digital signals by an Analog-to-Digital Converter (ADC). The ADC samples the sound, taking amplitude measurements at regular, closely spaced intervals of time; this process is called digitizing.
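As a rough illustration, the NumPy sketch below simulates how an ADC samples and quantizes a pressure wave. The 440 Hz sine wave, the 16 kHz sample rate, and the 16-bit quantization are illustrative assumptions, not a real recording pipeline:

```python
import numpy as np

# A minimal sketch of digitizing: sample a continuous signal at regular
# intervals (the sample rate) and quantize each measurement to an integer.
# The 440 Hz sine wave here stands in for the analog pressure wave.
sample_rate = 16000          # measurements per second, typical for speech
duration = 1.0               # seconds of audio
t = np.arange(0, duration, 1.0 / sample_rate)   # sampling instants
analog = np.sin(2 * np.pi * 440 * t)            # simulated pressure wave

# Quantize to 16-bit integers, as a real ADC would.
digital = np.round(analog * 32767).astype(np.int16)
print(digital[:10])
```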
Using the Fast Fourier Transform (FFT), the digital data is then converted into a spectrogram, which is fragmented into short, overlapping chunks (frames) for analysis by an acoustic model. A list of known words and their sound features is stored in a dictionary called a phonetic dictionary. Simply put, a phoneme is the smallest unit of sound that differentiates one word from another in a given language.
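To make the FFT step concrete, here is a minimal NumPy sketch that turns a digitized signal into a spectrogram. The frame and hop sizes (400 and 160 samples, i.e., 25 ms and 10 ms at 16 kHz) are common illustrative choices, not fixed requirements:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Split a digitized signal into overlapping frames and apply the FFT
    to each, yielding a time-frequency representation (a spectrogram)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    window = np.hanning(frame_len)              # taper to reduce leakage
    return np.array([np.abs(np.fft.rfft(f * window)) for f in frames])

# Example: spectrogram of one second of random noise at 16 kHz.
spec = spectrogram(np.random.randn(16000))
print(spec.shape)   # (number of frames, number of frequency bins)
```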
For the English language, the most commonly agreed-upon number of phonemes is about 40. Other phoneme features include phonemic stress and phonemic tone; which features are analyzed depends entirely on the requirements of the recognition system. Now, in a simple speech recognition task, a chunk from the signal is compared to the entries in the phonetic dictionary. At the most basic level, speech is recognized in this manner.
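As a toy illustration of that comparison step, the sketch below matches a recognized phoneme chunk against a tiny dictionary. The phoneme entries are simplified placeholders, not a real phonetic dictionary such as CMUdict:

```python
# Each word maps to a phoneme sequence (entries here are simplified
# examples); a recognized phoneme chunk is compared against every entry.
phonetic_dict = {
    "see":    ["S", "IY"],
    "flower": ["F", "L", "AW", "ER"],
    "speech": ["S", "P", "IY", "CH"],
}

def lookup(phonemes):
    """Return all words whose stored pronunciation matches the chunk."""
    return [w for w, p in phonetic_dict.items() if p == phonemes]

print(lookup(["S", "IY"]))   # ['see']
```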
Main Issues in the Speech Analysis and Recognition Process
It is difficult to differentiate between noise and words. When someone speaks, the words and the noise both propagate to the listener’s ears as signals, and it is the human brain that separates them: if someone talks to you at a noisy dance party, you can still understand the words. Humans also have different voices, and this variation is difficult to handle. You may have noticed that some people talk slowly compared to others, which results in different signal samplings for the same word spoken by slow and fast speakers.
Moreover, there are issues of similarity in the pronunciation of some words. For example, “flower” and “flour”, or “sea” and “see”, are homophones. Again, it is our brain that differentiates between two such words by understanding the context of the word in a sentence. Other issues include the machine’s understanding of accents, dialects, grammar, and semantics.
Statistical Analysis and Language Modeling for Speech Recognition
It is difficult to analyze speech because of its variations: humans may use different sentence structures while talking about the same concept. As the number of speakers increases, the system has to recognize more voices, the variability grows, and the possibility of mistakes also increases. For a simple voice detection program that listens to your voice and then types, a simple pattern recognition system can work well. However, we can also use the rules of a language to further improve performance; this is referred to as language modeling.
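Here is a minimal bigram language model sketch in Python. The toy corpus and the `pick` helper are illustrative assumptions, but they show how counted word pairs can prefer one homophone over another:

```python
from collections import defaultdict

# Count word pairs (bigrams) in a small corpus and use the counts to score
# which candidate word is more likely to follow a given context word.
corpus = "i see the sea and i see the flower".split()
bigrams = defaultdict(int)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[(prev, word)] += 1

def pick(prev, candidates):
    """Choose the homophone that the language model finds most likely."""
    return max(candidates, key=lambda w: bigrams[(prev, w)])

# Acoustics alone cannot separate "sea" from "see"; context can.
print(pick("the", ["sea", "see"]))   # 'sea'
```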
Two models that give the best results are the Hidden Markov Model (HMM) and neural networks. Both involve complex mathematical functions, but they can use the information known to the system to figure out hidden information. For each speech segment, an HMM provides a best guess based on the features extracted from the spectrum and on pieces of phonemes. The same process continues for the next speech segment, and the resulting chain of states is referred to as a Markov chain. For details about HMMs in speech recognition, please follow the link at [5].
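As a hedged sketch of the “best guess” idea, the toy Viterbi decoder below finds the most probable hidden phoneme sequence in a two-state HMM. All states, probabilities, and observations are made-up illustrative values, not a trained model:

```python
import numpy as np

states = ["S", "IY"]                       # hidden phonemes
trans = np.array([[0.6, 0.4],              # P(next state | current state)
                  [0.3, 0.7]])
emit = np.array([[0.8, 0.2],               # P(observation | state)
                 [0.1, 0.9]])
start = np.array([0.5, 0.5])               # initial state probabilities

def viterbi(obs):
    """Return the most likely hidden state sequence for the observations."""
    prob = start * emit[:, obs[0]]                 # initial step
    back = []
    for o in obs[1:]:
        scores = prob[:, None] * trans             # score all transitions
        back.append(scores.argmax(axis=0))         # best predecessor
        prob = scores.max(axis=0) * emit[:, o]
    path = [int(prob.argmax())]
    for ptr in reversed(back):                     # trace back
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([0, 1, 1]))   # ['S', 'IY', 'IY']
```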
How Systems Exploit Deep Learning in Speech Recognition
Dynamic programming can help with credit assignment and reduce problem depth in reinforcement learning, and its algorithms are essential for systems that combine the concepts of neural networks with graphical models such as HMMs. The purpose of learning is to find the weights that enable a neural network (NN) to behave as intended. Such behavior can require chains of complex computational stages, each of which transforms the aggregate activation of the network; the stages themselves depend on the problem and on how the neurons are connected in the network.
Deep learning accurately assigns credit across many such stages. Many algorithms learn hierarchies of increasingly abstract data representations, and continual learning adds previously learned concepts to the current stage. Cyclic recurrent neural networks (RNNs) are the deepest of all neural networks: they can create and process memories of arbitrary sequences of input patterns. Graves et al. [6] evaluated the performance of deep RNNs that combine multiple levels of representation with flexible use of long-range context. In their results, a deep Long Short-Term Memory (LSTM) network achieved a 17.7% phoneme error rate on the TIMIT phoneme recognition benchmark.
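To make the recurrence concrete, here is a minimal single-step LSTM cell in plain NumPy. The sizes and random weights are illustrative assumptions (a real recognizer learns trained weights over spectrogram features), but the gating shows how the network carries a memory across input frames:

```python
import numpy as np

def lstm_step(x, h, c, W, b):
    """One LSTM time step: gate the new input against the stored cell
    state c, which acts as the network's memory of earlier frames."""
    z = W @ np.concatenate([x, h]) + b             # all four gates at once
    i, f, o, g = np.split(z, 4)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # forget old, write new
    h = sigmoid(o) * np.tanh(c)                    # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 13, 8                     # e.g., 13 acoustic features per frame
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for frame in rng.normal(size=(5, n_in)):   # run over 5 input frames
    h, c = lstm_step(frame, h, c, W, b)
print(h.round(3))
```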
What to Use: Speech Recognition Software Products
- Apple’s Siri: [7] can help you get things done through voice. It supports a variety of commands and is tuned in to the world, working with Wikipedia, Yelp, Rotten Tomatoes, Shazam, and other online services.
- SiriKit: [8] enables your iOS apps to work with Siri.
- Google Now: [9] is an intelligent personal assistant that provides voice search and voice commands. It also provides APIs for developers.
- SpeechRecognition 3.4.6: [10] is a Python library for implementing speech recognition. It supports several engines and APIs for developers; a minimal usage sketch appears after this list.
- Dragon: [11] is Nuance’s speech recognition software for dictation and voice commands.
- SpeechPad: [12] is a voice recognition application for converting speech to text. It can also convert any audio file to text.
- SpeechTexter: [13] lets you type with your voice.
- Speechnotes: [14] is an online speech-to-text notepad.
- TalkTyper: [15] is a web app that allows free speech-to-text dictation in a browser.
- Tazti: [16] lets you voice-command your PC and games; you can also create your own commands.
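For the SpeechRecognition library listed above [10], a minimal usage sketch might look like the following. It assumes the package is installed (pip install SpeechRecognition) along with PyAudio for microphone access, and it uses the library’s Google Web Speech API backend:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:          # requires PyAudio
    print("Say something...")
    audio = recognizer.listen(source)    # capture one utterance

try:
    # Send the audio to the Google Web Speech API and print the transcript.
    print("You said:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("API request failed:", e)
```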
Key Takeaways
- If speech recognition is required for only a single speaker, the system can learn easily and give comparatively accurate results.
- For multiple speakers/users, the system has to learn much more, and it cannot give good results at the beginning.
- For multiple speakers/users, the analysis of voice features plays a vital role in recognition.
- Speech recognition systems can markedly improve the Human-Computer Interaction experience.
- Advances in speech recognition technologies can be utilized in many fields, for example, health, education, commerce, finance, and agriculture.
Recommended Reading
- [1] http://www.afb.org/info/programs-and-services/professional-development/technology/five-tips-for-teaching-speech-recognition-to-people-with-a-visual-or-physical-impairment/1235
- [2] http://whatsnext.nuance.com/office-productivity/using-speech-recognition-on-mobile-devices-for-office-productivity/
- [3] http://www.slate.com/articles/technology/technology/2014/04/the_end_of_typing_speech_recognition_technology_is_getting_better_and_better.html
- [4] https://www.youtube.com/watch?v=3vuWirlt7Rw
- [5] http://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf
- [6] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep recurrent neural networks.” In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645-6649. IEEE, 2013.
- [7] http://www.apple.com/ios/siri/
- [8] https://developer.apple.com/sirikit/
- [9] https://www.google.com/search/about/learn-more/now/
- [10] https://pypi.python.org/pypi/SpeechRecognition/
- [11] http://www.nuance.com/dragon/index.htm
- [12] https://speechpad.pw/
- [13] https://www.speechtexter.com/
- [14] https://speechnotes.co/
- [15] https://talktyper.com/
- [16] https://www.tazti.com/