I haven't had a brain dump in a while, and I should probably write more since this blog gets a ton of visitors for some reason. I'm excited about a lot of technology that I've read up on lately and will touch on some before diving into object tracking.
I've read a lot of really interesting papers about transmitting information using orbital angular momentum (OAM) of light and a 2.4 Tbps wireless transmission that recently set the record. OAM is a separate dimension onto which we can encode information, providing a multiplicative effect in the amount of available bandwidth for transmitting information. This will be a massively disruptive technology and you should leave this site right now to learn about how to produce curl in your Poynting vectors using helical antennae.
I've also heard about using field effects to produce gate-controllable junctions in arbitrary singly doped semiconductors. This can lead to new types of solar devices using cheaper materials like copper oxide. One could also imagine a photodetector array which uses the effect to sweep through color bands. This method is physically similar to the method of producing a band gap in bilayer graphene. Controllable electron transport and volume manufacturing of graphene devices are both very active areas of research, and any electrical engineer who wants to be ahead of the curve ought to study up on relativistic quantum electrodynamics in monolayers.
On the FPGA side of things, I'm very excited by the Xilinx Virtex-7, which went with 3-D integration to provide what Xilinx calls "Stacked Silicon Interconnect" on a separate layer. This shows that FPGAs continue to be a proving ground for leading semiconductor process technology.
I expect that we will see optical sensors 3-D integrated on top of a processing layer, so that we will have computer vision baked onto optical sensors. This will allow us to process visual data with much higher frame rates, lower latency, and lower power. I predict that image filters, motion detection, optical flow, edge detection, generation of pyramid representations, and integral images will all be implemented on silicon that is 3-D integrated with a sensor. This sort of sensor will enable low-power object tracking and gesture control in basically any embedded application.
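To make one of those primitives concrete, here is a minimal software sketch of an integral image (summed-area table), the kind of operation I'd expect to see pushed into the sensor stack. The function names `integral_image` and `box_sum` are my own for illustration, and the image is assumed to be a plain list of rows of grayscale values.

```python
def integral_image(img):
    """Build a summed-area table with one extra zero row and column.

    sat[y][x] holds the sum of all pixels in rows [0, y) and columns [0, x).
    """
    height = len(img)
    width = len(img[0]) if height else 0
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0  # running sum of the current row
        for x in range(width):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right)."""
    return (sat[bottom][right] - sat[top][right]
            - sat[bottom][left] + sat[top][left])
```

Once the table is built, any rectangular sum costs four lookups regardless of the box size, which is why it is such a natural fit for fixed-function hardware.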
Object tracking and recognition are topics I have been following for a long time. I wrote about my ideas for synthesis-based recognition techniques six years ago when I was still an undergrad. PrimeSense's launch of OpenNI inspired me to pursue the human-computer interaction side of this work at Zigfu, where we recently hit the milestone of 100 commercial customers in 29 countries using our Unity3D SDK to produce interactive experiences. Zigfu is primarily focused on human-computer interaction (HCI) and not computer vision (CV); even with perfectly accurate sensors and computer vision to track human movement, there is a wide-open problem of designing a motion-controlled UI that enables a human to comfortably and accurately select one element out of a list of 1000 elements. I like to compare this to touch screens, where the measurable qualities of capacitive touch screens are like sensors and computer vision, while the magical qualities of the iOS/Cocoa UI and Android UI are how people experience this technology.
Still, I've been keeping an eye on visual object tracking for a long time and also want to do a brain dump of my thoughts. Visual object tracking such as in Zdenek Kalal's Predator algorithm is very compelling:
Zdenek has also supported an open source community called OpenTLD (Tracking Learning Detection) which has produced C++ ports of his algorithm from the MATLAB original (see me playing with it in the video below).
Another great reference is this paper on skeleton tracking with surface analysis, with a really cool video to go along:
Some time back on April 15th, I wrote a proposal for ad-hoc skeleton analysis to the TLD group that went something like this.
Ever since watching the Predator video, I've been thinking about how to extend the algorithm to use 3-D voxel and point cloud structures and not just track, but determine orientation and perform ad-hoc discovery of skeleton joint structures. I call this algorithm "Terminator." Terminator tracks and recognizes objects by generating a rigged 3-D model of the object.
Instead of producing 2-D wavelets from the tracked object as in Predator, Terminator generates a 3-D model of the object complete with inferred skeleton structures. A voxel model can be created by compositing the 2-D images generated by a tracker system (as in Predator). Voxel acquisition can also be assisted with a depth sensor. Recognition and determination of the orientation of rigid bodies can be performed using a sparse voxel representation. One way to accelerate recognition may be to use principal component analysis to align the input data with the models being recognized. Another way to perform recognition may be to project the voxel representation into sets of 2-D wavelets that serve as 2-D image recognizers. Brute-force 3-D convolution of sparse voxels may also work, but makes more sense for detecting features in non-sparse voxels like MRI data.
A skeleton model with joint weights for each voxel can be inferred by using optical flow data to detect state transitions in the voxel model, such as when a mouth opens or a hand gesture changes. Historical data can be used to verify and improve the skeleton system and to infer range-of-motion constraints.
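As a rough illustration of the principal component analysis step, the sketch below estimates the dominant axis of a sparse voxel set using power iteration on its 3x3 covariance matrix. This is only my guess at how the alignment might start: the names `centroid`, `covariance`, and `principal_axis` are hypothetical, voxels are assumed to be (x, y, z) tuples, and a full alignment would also need the second and third axes. It sticks to the standard library; a real implementation would more likely use a linear algebra package.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def covariance(points):
    """3x3 covariance matrix of the points about their centroid."""
    c = centroid(points)
    n = len(points)
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - c[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def principal_axis(points, iterations=100):
    """Approximate the dominant eigenvector of the covariance matrix.

    Uses plain power iteration. Caveat: if the starting vector happens to be
    orthogonal to the true axis, this will not converge to it.
    """
    cov = covariance(points)
    s = 1.0 / math.sqrt(3.0)
    v = [s, s, s]  # arbitrary unit-length starting guess
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate input, e.g. all points identical
        v = [x / norm for x in w]
    return v
```

Rotating both the input voxels and each stored model so their principal axes coincide would reduce the orientation search before any finer matching is attempted.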
Just had to share this idea in case I never really get down to it.
Since the Kinect provides superior tracking on the 3-D depth data, we are able to train multiple Predator-like algorithms on the RGB data with feedback from a depth-data tracker. We can also use the depth data for user segmentation and extract only the interesting RGB pixels from the registered depth data.
Anyway, this is long enough already, so I'll leave you with a video I made running the Predator algorithm's C++ port in multiple processes:
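Here is a minimal sketch of that extraction step. It stands in a simple depth band for real user segmentation (the Kinect middleware would normally supply a per-pixel user label instead), and it assumes the depth map is already registered to the RGB frame, measured in millimeters, with 0 meaning no reading. The function name `segment_by_depth` and its parameters are made up for this example.

```python
def segment_by_depth(rgb, depth, near_mm, far_mm, background=(0, 0, 0)):
    """Keep RGB pixels whose registered depth lies in [near_mm, far_mm].

    rgb   -- list of rows of (r, g, b) tuples
    depth -- list of rows of depth values in millimeters, same shape as rgb
    Pixels outside the band (including 0 = no reading) become `background`.
    """
    return [
        [pixel if near_mm <= d <= far_mm else background
         for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]
```

The masked frame could then be handed to each RGB tracker, so it only ever learns from pixels that belong to the user.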