Friday, January 26, 2007

Recognition via Synthesis and Replication

Many recognition systems work by applying transformations to input data to map a feature space to a more "friendly" feature space. The transformations are generally destructive, non-reversible filters (e.g. edge detection or color removal in video, frequency-selective filtering in audio). This implies a destruction of information in the signal path: by filtering the data we reduce the variance of our state space to focus on "interesting" features. After filtering the data, the result is a more "familiar" feature space on which to perform reductive analysis.

In order to recognize speech we may eliminate all uninteresting frequencies by taking local maxima above a certain threshold. By looking at the relative amplitudes of the interesting frequencies, we can identify a set of conditions that categorize each of the sounds into phonemes (record yourself saying ooh and ahh and look at an FFT and you'll get an idea of what a computer might want to look for). The conditions for categorizing an input stream are often trained against a known data set.
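To make that concrete, here is a minimal sketch of the "keep only the interesting frequencies" step: take the FFT of a short audio frame and keep the local maxima above a threshold. The frame length, sample rate, and threshold are placeholder assumptions, and a real system would feed the resulting peaks to a classifier trained on labeled frames.

```python
import numpy as np

def spectral_peaks(frame, sample_rate=8000, threshold=0.1):
    # Destructive filter: window the frame, take its magnitude spectrum,
    # and throw away everything that is not a prominent local maximum.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    peak = spectrum.max()
    if peak > 0:
        spectrum = spectrum / peak          # relative amplitudes
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return [
        (freqs[i], spectrum[i])
        for i in range(1, len(spectrum) - 1)
        if spectrum[i] > threshold
        and spectrum[i] > spectrum[i - 1]
        and spectrum[i] > spectrum[i + 1]
    ]
# "ooh" and "ahh" produce peaks at different formant frequencies, which is
# what the categorizing conditions would be trained to distinguish.
```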

Today at the HKN Student Project Expo, two 6.111 projects showed how an FPGA can be used to identify a person's hands (a juggling simulation) and colored patches and gloves (full-body DDR). These systems recognize where a color is by filtering that color out of the video data and identifying its center of mass. Such filter-and-reduce recognition systems (the name alludes to Google's MapReduce) will be an important part of artificially intelligent systems. Real-world solutions, while sharing this common programming model, still require a significant amount of energy to program and train, resulting in highly domain-specific implementations.
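A rough sketch of the color-tracking idea, assuming an H x W x RGB frame with 0-255 values; the thresholds are illustrative, not the values those projects used:

```python
import numpy as np

def locate_color(frame, min_red=150, max_other=80):
    # Filter: keep only pixels that are "red enough" (the destructive step).
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    mask = (r > min_red) & (g < max_other) & (b < max_other)
    # Reduce: collapse the surviving pixels to a single center of mass.
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                 # the color is not in this frame
    return xs.mean(), ys.mean()     # (x, y) of the glove or patch
```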

Perhaps it would be useful to produce a general set of filters and reductive analyses. The system, guided by a programmer, would assemble the filters and reductions to produce an output from an input stream: where are the hands in this image stream, what is the word in this audio stream? I realized that training such a system would be simpler, but there is still the problem of calibrating a parameter space for the filters: which frequencies are interesting, what size of red blob in a filtered image qualifies as a hand? How do we handle arbitrary training sets? One way to picture such a system is sketched below.
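This is only a shape, not a design: a pipeline of parameterized stages that the programmer composes, where the open problem above is calibrating the parameters automatically. The stage names in the comment (color_filter, center_of_mass) are hypothetical.

```python
def pipeline(stages):
    # stages: list of (filter_or_reduction, parameter dict) pairs.
    def run(stream):
        for stage, params in stages:
            stream = stage(stream, **params)
        return stream
    return run

# e.g. hand_finder = pipeline([(color_filter, {"min_red": 150}),
#                              (center_of_mass, {"min_blob_size": 40})])
# The calibration question is how min_red and min_blob_size get chosen
# from an arbitrary training set rather than by hand.
```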

This problem is closely related to speech synthesis: what parameters for tone generators and filters are necessary to model a voice? It struck me that perhaps the reason we recognize and understand speech is that we can mimic the words and the intentions expressed. We recognize and understand when someone says "I am happy" because we can both produce the expression and sympathize with the emotion. Recognition is probably lower-hanging fruit than understanding.

The K-lines activated in person A when person B says "happy" are intrinsically connected to the K-lines person A uses to control his own voice. A computer could perform recognition by modulating the controls of a synthesizer to minimize the difference between the heard tone and the synthesized tone. This could apply to a video recognition system as well: attempt to render a scene matching what it sees. To find a hand, it adjusts the parameters of a rendered hand model until the error between the rendered image and the seen image is small enough that we are confident we have located the hand.
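A toy sketch of recognition-by-synthesis, assuming a trivially simple "synthesizer" (a pitch knob and a brightness knob controlling a few harmonics) and a brute-force search over its controls; a real system would use a richer model and a smarter search:

```python
import numpy as np

def synth(pitch, brightness, n, sample_rate=8000):
    # Hypothetical synthesizer: four harmonics whose strengths decay with "brightness".
    t = np.arange(n) / sample_rate
    return sum(brightness ** k * np.sin(2 * np.pi * pitch * (k + 1) * t)
               for k in range(4))

def spectrum(signal):
    return np.abs(np.fft.rfft(signal * np.hanning(len(signal))))

def recognize(heard):
    # Modulate the controls to minimize the difference between the heard
    # tone and the synthesized tone; the winning controls are the "recognition".
    target = spectrum(heard)
    best, best_err = None, np.inf
    for pitch in range(80, 400, 10):
        for brightness in (0.2, 0.5, 0.8):
            err = np.sum((spectrum(synth(pitch, brightness, len(heard))) - target) ** 2)
            if err < best_err:
                best, best_err = (pitch, brightness), err
    return best
```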

Since I'm a synth junkie, I think of a computer controlling the knobs on a synth to mimic a sound. By examining the gestures used to produce the sound, the synthesizer would recognize the word "hello" by realizing that it is saying "hello" when it attempts to mimic the input. By translating an input space into a control space we can produce a set of meaningful symbols on which to perform analysis. Useful control parameters are orthogonal/independent: if we modulate each one in isolation we can find a per-parameter minimum that is a component of the global minimum. It is also nice if each parameter's error curve has no local minima, which simplifies the minimization.
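That "tune one knob at a time" idea is essentially coordinate-wise minimization, sketched below under the niceness assumptions just stated (independent knobs, one minimum per sweep); the error function and knob names are whatever the synthesizer and mismatch measure happen to be.

```python
def tune_knobs(error, knobs, sweeps, passes=3):
    # error:  function mapping a dict of knob settings to a scalar mismatch
    # knobs:  initial settings, e.g. {"pitch": 200, "brightness": 0.5}
    # sweeps: candidate values per knob, e.g. {"pitch": range(80, 400, 10), ...}
    knobs = dict(knobs)
    for _ in range(passes):
        for name, candidates in sweeps.items():
            # Sweep this knob alone, holding the others fixed.
            knobs[name] = min(candidates, key=lambda v: error({**knobs, name: v}))
    return knobs
```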

We don't always get that kind of niceness... but perhaps we can take advantage of the continuity of trajectories in the physical world: since the input changes smoothly, only incremental modulation of the controls should be required to maintain an accurate replicating synthesis.