Monday, January 26, 2015

Can you hear me?

For more than five decades, automatic speech recognition (ASR) has been an area of active research. The problem has been studied thoroughly since the advent of digital computing and signal processing, and growing awareness of the advantages of conversational systems has pushed it further. Speech recognition has wide-ranging applications, including voice-controlled appliances, fully featured speech-to-text software, automation of operator-assisted services, and voice recognition aids for the handicapped.

There are four main approaches to speech recognition: the acoustic-phonetic approach, the pattern recognition approach, the artificial intelligence approach, and the neural network approach. Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM) are also widely adopted in ASR.
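To make the GMM part a little more concrete, here is a toy sketch (my own illustration, assuming Python with NumPy and scikit-learn, and random vectors standing in for real MFCC feature frames): one mixture model is trained per word class and an utterance is scored by total log-likelihood. A real recognizer would combine such acoustic scores with an HMM over time.

```python
# Toy illustration, not a full recognizer: one Gaussian Mixture Model per word
# class, trained on acoustic feature frames (here random stand-ins for MFCCs),
# with a new utterance classified by total log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in "MFCC-like" frames: 13-dimensional vectors for two word classes.
frames_yes = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
frames_no = rng.normal(loc=2.0, scale=1.0, size=(200, 13))

# Fit one diagonal-covariance GMM per class.
models = {
    "yes": GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(frames_yes),
    "no": GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(frames_no),
}

def classify(frames):
    """Return the class whose GMM assigns the highest total log-likelihood."""
    scores = {label: gmm.score_samples(frames).sum() for label, gmm in models.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(loc=0.0, scale=1.0, size=(50, 13))))  # likely "yes"
```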
Speech recognition has great potential to become an important mode of interaction between humans and computers in the near future. A successful speech recognition system has to capture not only the features present in the input pattern at a single point in time, but also how those features change over time (Berthold, M.R.; Benyettou). In the speech recognition domain, the first model, used by Waibel, was based on a multilayer perceptron with time delays, the Time Delay Neural Network (TDNN). This model, however, suffered from long training times, and adjusting its parameters for new applications became a laborious task.
RBF networks do not require this kind of special adjustment, and compared with the Time Delay Neural Network their training time is shorter. Their drawback, however, is the lack of shift invariance in time [Berthold, M.R.].
The neural network approach to speech recognition can be divided into two main categories: conventional neural networks and recurrent neural networks. The main rival to the multilayer perceptron is the RBF network, which is becoming increasingly popular and finds diverse applications. RBF networks drew their inspiration from traditional statistical pattern classification techniques.
The unique feature of an RBF network is the processing performed in its hidden layer. The patterns in the input space form clusters, and once the centers of these clusters are known, the distance of any pattern from a cluster center can be measured. This distance measure is then made non-linear, so that it yields a value close to 1 when a pattern lies near a cluster center. RBF networks are statistical feed-forward networks and serious rivals to the MLP. Their learning mechanisms are not biologically plausible, which is why they have not been taken up by some researchers who insist on biological analogies.1
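Since the cluster-center idea is the heart of the RBF network, here is a small sketch of what it can look like in code (again my own illustration, assuming NumPy and scikit-learn): k-means supplies the centers, a Gaussian turns each distance into a value close to 1 near a center, and a simple least-squares fit gives the output-layer weights.

```python
# Minimal RBF-network sketch: k-means centers, Gaussian hidden layer,
# linear output layer fitted by least squares.
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, width):
    # Squared distance of every pattern to every cluster center ...
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # ... made non-linear: close to 1 near a center, near 0 far away.
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, Y, n_centers=10, width=1.0):
    centers = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X).cluster_centers_
    Phi = rbf_features(X, centers, width)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # output-layer weights
    return centers, W

def predict_rbf(X, centers, W, width=1.0):
    return rbf_features(X, centers, width) @ W

# Tiny usage example with random data standing in for acoustic feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 13))
Y = (X[:, 0] > 0).astype(float).reshape(-1, 1)    # dummy two-class target
centers, W = train_rbf(X, Y)
print(predict_rbf(X[:5], centers, W).round(2))
```

Because only the output layer is actually fitted, training is fast, which matches the point above about shorter training times compared to the TDNN.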
All of this is not new and perhaps started with Alexander Graham Bell (March 3, 1847 – August 2, 1922). He was an eminent Scottish-born scientist, inventor, engineer and innovator who is credited with inventing the first practical telephone in 1876.

With advances in chip technology as well as software magic, we are getting closer to more accurate voice recognition. It is mind-blowing, however, how many variables go into the interpretation of human input. Many science fiction movies portray the interaction with a machine as rather effortless; "The Machine" or "Transcendence" are some of the latest movies of interest. Also, with ever-evolving smartphone technology we almost take talking to our little machines for granted. However, if you are sitting behind the scenes and trying to write programs that allow a somewhat coherent interaction, it is a completely different story.

For me it is fascinating to realize that what we use to interact with each other on a daily basis, without thinking about it, is such an intricate task. I was looking at some of the white papers, studies, and research from universities stateside and abroad, and they were perplexing.
In any case, back to where I'm sitting: reading text out loud (like the weather status) seems to be rather easy, and so is capturing voice and translating it into text, accuracy aside. Programmatically writing random sentences in a grammatically correct style is not very hard either; do the sentences make sense? Hardly ever. The hard part is to respond to a human voice the right way in a conversation.
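Just to show how little code the "easy" pieces take, here is an illustrative sketch with two common Python packages, speech_recognition and pyttsx3 (purely an example, not necessarily the stack I'm sitting behind): one function reads a sentence aloud, the other captures an utterance and hands it to an online recognizer.

```python
# Illustrative only: speech-to-text and text-to-speech with two common
# Python packages (speech_recognition, pyttsx3).
import speech_recognition as sr
import pyttsx3

def speak(text):
    """Read a sentence out loud, e.g. a weather status."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def listen():
    """Capture one utterance from the microphone and return it as text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)   # online recognizer
    except sr.UnknownValueError:
        return ""                                   # could not understand

if __name__ == "__main__":
    speak("It is sunny and minus five degrees outside.")
    print("You said:", listen())
```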

Taking that into account, the challenge is to come up with a correct answer that corresponds to the question or request. Giving commands to invoke a task, such as playing a movie or reading a book aloud, seems trivial, comparatively speaking of course.
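For the "trivial" command part, a plain lookup from keywords to actions goes a long way. Here is a rough sketch, with hypothetical handler names, of the kind of dispatch I mean:

```python
# Rough sketch of keyword-to-action dispatch for spoken commands.
# The handler functions are hypothetical placeholders.
def play_movie():
    print("Starting the movie player...")

def read_book():
    print("Opening the audiobook reader...")

COMMANDS = {
    ("play", "movie"): play_movie,
    ("read", "book"): read_book,
}

def dispatch(utterance):
    words = set(utterance.lower().split())
    for keywords, action in COMMANDS.items():
        if set(keywords) <= words:   # all keywords present in the utterance
            action()
            return True
    return False                     # no command matched

dispatch("could you play that movie for me")
```

Anything that does not match a command can then fall through to the conversational side, which is where it gets interesting.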

So I turned to the material provided by various companies that offer English language courses to find out what the most common phrases in a conversation are. It turns out there are hundreds of them, from formal English to slang expressions, which was a good starting point. Next was figuring out how to search effectively for the corresponding phrase given the speech-to-text result and, above all, how to hold what was said in memory to chain a conversation using look-ahead expression validation.
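Here is the rough shape of what I'm experimenting with, greatly simplified (the phrase table and canned replies below are made up): fuzzy-match the speech-to-text result against a table of common phrases and keep the last few exchanges in memory so the next reply can build on the previous turn.

```python
# Simplified sketch: fuzzy phrase lookup plus a short conversation memory.
# The phrase table and canned replies are made-up examples.
import difflib
from collections import deque

PHRASES = {
    "how are you": "I'm fine, thanks. How are you?",
    "what is the weather like": "Let me check the weather for you.",
    "see you later": "Goodbye, talk to you soon.",
}

history = deque(maxlen=5)          # remember the last few exchanges

def respond(heard_text):
    key = heard_text.lower().strip()
    # Find the closest known phrase to what the recognizer produced.
    match = difflib.get_close_matches(key, PHRASES.keys(), n=1, cutoff=0.6)
    reply = PHRASES[match[0]] if match else "Sorry, I didn't catch that."
    history.append((heard_text, reply))
    return reply

print(respond("how are yu"))       # sloppy recognition still finds the phrase
print(list(history))
```

The look-ahead expression validation is the part this sketch leaves out, and that is the piece still on the workbench.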

Stay tuned – I’m working on it …





1 R.L.K. Venkateswarlu, R. Raviteja and R. Rajeev, 2012
