Sunday, October 28, 2012

Fear and Loathing at Zigfu: My YCombinator Experience

Applications for YCombinator's winter 2013 cycle are due October 30. Let me tell you about my experience in YCombinator with Zigfu.

Zigfu is a platform for making motion controlled applications using the Kinect in Unity3D and HTML. We incorporated as Motion Arcade Inc. in May 2011 and went through the YCombinator program in Summer 2011. During my interview with YCombinator, I pitched a company that would develop an eco-system of motion controlled applications, and monetize by selling applications, dev-kits and a motion OS with an app store for smart TVs. I showed off one of my Kinect hacks and had PG dance the hokey-pokey. We got in.

Today, I actually make reasonable revenue from selling the Zigfu Dev Kit. We are a little better than ramen profitable with over one-hundred customers using our Unity3D software for interactive installations all around the world. Over 100,000 Kinects have been hacked with our developer package, and thousands of developers are using our platform actively in commercial and university projects. But, we are not one of YC's high-flying success stories. I'm the only person still employed full-time at Zigfu supporting developers and working on customer projects. The platform is still gaining new features every time I do another project, but this was not the business I intended to start. We never raised money after demo day, not for lack of trying. It is safe to say that the most awesome thing that I got out of demo day was a cigarette bummed off Ashton Kutcher. Sweet.

Here's our story.

Fear and Loathing at Zigfu:

I want to start by sharing some wisdom gained from this whole experience. If you're working in a rapidly evolving technology field, you might be too early to scale your solution, your solution might be wrong, and you'll always carry some uncertainty and doubt about the potential for success, which makes you question the value of any particular schlep. We called this "Fear and Loathing" at Zigfu.

A lot of investors asked me "why this?" Why pursue this company or business over any other potential way to spend my time? This is a common interviewing technique akin to "why do you want this job?" or "what in your experience makes you qualified to do this job?" Investors want to hear why you, specifically, are working on the problem you are solving. A lot of startups can canonically answer it with a rehearsed anecdotal response like:

"We experienced problem X, and think that there's room in the market for the AirBNB/Dropbox of X" for a B2C company.

Or "While getting paid by company Y, we ran into problem Z, and think that there's room in the market to be the Salesforce of Z for all people in companies like Y." for a B2B company.

For Zigfu: "I was developing motion controlled apps for interactive engagements and needed a set of UI components for motion control, and I think there's room in the market to be the iOS/Android of Kinect." This often led to "but why not Apple or Google or Microsoft?"

If there's a flicker of doubt, investors can smell the lack of confidence. If you are not a true believer, why should they be? But this question brings up the basic existential crisis that faces most geniuses: how do you address the opportunity cost of focusing on one thing over anything else you could be pursuing? It is important and difficult to remain in control of your doubt despite that nagging voice telling you that you could be doing something totally better, like building an FPGA OS, or creating radios using light-wave vortices, or solving the Riemann Hypothesis.

Focus is hard, and requires discipline and feedback mechanisms. Establish good practices for setting and meeting milestones, and manage uncertainty and doubt early in your startup. When you have a large set of tasks to accomplish and fear and loathing dominates your actions, you end up choosing what not to do instead of what to do. Our friend Cody says "there's no room for post-modernism in startups." Get over your existential crisis and get to work.

The Problem:

Our startup was founded on the premise that human motion sensing is going to be a commodity. Today, to perform motion tracking you need to buy sensors for a hundred dollars or so, paired with a processor for another few hundred dollars. The future has high-pixel-count sensors integrated (probably vertically) with a computer vision processor providing tracking information with high accuracy, low power, and at the same cost as existing camera sensors. Zigfu aimed to solve the big challenge for bringing such integrated sensors and embedded computer vision systems to market: natural user interface (NUI) design and the creation of a motion controlled application eco-system.

If you are looking for an idea for a startup, you can append "...that doesn't suck" to a lot of existing product descriptions to define good problems, and it helps to have a product description you can summarize in a few words. Zigfu is making a voice and gesture controlled user experience that doesn't suck. Even with arbitrarily accurate tracking information about the positions of every joint in the human body, there is still a major challenge in making a comfortable and intuitive gesture-controlled list (for example, a list of 1000 videos) where you can quickly select exactly the one you want to watch, without false positives and without misses. The human factors engineering problem in natural user interface design is like building the Cocoa framework that powers the iOS UI, except instead of touch screens, we're working with Kinect and other hand tracking sensors, and instead of phones, we were thinking about 10-foot experiences like televisions. We imagined advertisements for our product showing the slow-motion destruction of remote controls with implements like a chainsaw, sledgehammer, or nail gun, with pieces of circuit boards and shards of plastic flying everywhere and Ode to Joy playing in the background.
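To make the selection problem concrete, here's a minimal sketch of one common NUI trick, dwell-based selection: an item only fires after the cursor hovers over it long enough, trading speed for fewer false positives. This is a generic illustration in Python with names I made up, not Zigfu's actual Unity3D/HTML components.

```python
class DwellSelector:
    """Select a list item only after the hand cursor hovers over it
    for `dwell_time` seconds, reducing accidental activations."""

    def __init__(self, dwell_time=1.0):
        self.dwell_time = dwell_time
        self.current_item = None
        self.hover_start = None

    def update(self, item, timestamp):
        """Feed the item under the cursor once per frame.
        Returns the item when its dwell completes, else None."""
        if item != self.current_item:
            # Cursor moved to a new item (or off the list): restart the timer.
            self.current_item = item
            self.hover_start = timestamp
            return None
        if item is not None and timestamp - self.hover_start >= self.dwell_time:
            self.hover_start = float('inf')  # fire only once per hover
            return item
        return None
```

In practice the dwell threshold is a tuning knob: too short and you get false positives while scanning the list, too long and selection feels sluggish; real systems usually add visual feedback (a filling ring) during the dwell.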

Good ideas are non-obvious and misunderstood; otherwise you might be too late or under-capitalized to win the market. The NUI problem is easily misunderstood because it might seem like part of the evolution of the sensors. We should not be deriving the desired user experience from the limitations of current sensor technology or the availability of software algorithms. Instead, we should work on the user experience to derive the necessary characteristics of the sensors and algorithms. It is easy to differentiate motion control companies based on measurable qualities of a sensor or computer vision stack: more joints tracked, higher tracking accuracy, higher frame rate, longer range, lower cost, etc., but these systems are judged by the user experience they provide. Now that people have experienced Kinect, they might say: why not just make a cursor with hover-to-click? Clever engineers will suggest virtual touch screens, virtual mouse cursor systems with push-to-click, and augmented reality controls (overlaying the UI on the camera image). We've tried all of them too, but slapping motion input onto existing user interfaces designed for different controllers is a recipe for a crappy user experience.

As our starting point, Zigfu helps simplify gesture controlled application development with Kinect. We do this by making it easy for developers to install and use Kinect with HTML and Unity3D, and by providing high-level motion-controlled user-interface components like buttons, menus and lists. Our stack is built on top of existing sensors and computer vision middleware and has been easy to port between multiple languages, sensors, and tracking algorithms. Investors who understood this message would say we were rate-limiting our growth by our dependency on the sensor market. Indeed, the diversity of sensor and computer vision products available on the market has been growing slower than we expected.

Surely Apple should have a voice and motion-controlled TV set on the market by now.

Starting Up:

I started Motion Arcade Inc. because I was supporting Kinect hackers all over the world with a package for Kinect in Unity3D. I had just wound down my first startup "Tinker Heavy Industries" making iPad apps for kids (try our ragdoll elephant) and was finding contracting gigs to make money hacking the Kinect, and I wanted to start hiring a team to support the motion gaming eco-system with tools and an app-store. I had been working with Shlomo Zippel, who was an applications engineer at PrimeSense at the time, on the Unity3D package for Kinect hackers. For background, PrimeSense is an Israeli company that makes the Kinect's sensor and they are also responsible for fueling the hacker community with the release of OpenNI/NITE in December 2010. OpenNI provides a framework for natural interaction software development and NITE is a set of free commercial algorithms for skeleton tracking.

In order to do a startup, you will need co-founders. It's important to find complementary co-founders if you're the lead instigator and solo founder seeking co-conspirators. You will understand this better when your company is making consumer-facing products without a designer, or trying to sell OEM licenses without someone with that sort of sales experience. I had spent about 6 months seeking co-founders for Motion Arcade, and getting into YCombinator definitely helped me pull a team together. I had met my friend Ted Blackman through mutual friends from MIT, where we both studied, and we applied to YCombinator together. Ted and I are kindred spirits: we're both the kind who can rapidly become an expert in any science, technology, engineering or mathematical field. I spent only one day hacking with Ted, but it was clear that he was an exceptional genius. YCombinator was concerned that I was essentially a single founder pulling in Ted without much experience working together. One indicator YCombinator uses for the likelihood of a team's success is that the co-founders have known each other and worked together for a while and would work together on anything: if the team's loyalty is to the group and not the particular idea they are pursuing, they will work together through multiple pivots. Of course, this isn't an absolute indicator, and Dropbox is a notable exception of a solo founder recruiting a great team.

After getting accepted into YC, we started out with the goal of making an Xbox game. We used our YCombinator-ness to get Microsoft to send us an Xbox Development Kit; this was no simple feat. Our experience working with Microsoft probably served as a prototypical example for how they would later interface with the startups in the Techstars/Microsoft Kinect accelerator. We figured Microsoft controlled the path to market for any motion controlled consumer product, and that if we wanted to spawn our own Kinect app store we would have to play inside the Xbox eco-system first. We needed quality content, and we may as well eat our own dog food if we're making a platform: make a game and sell it before trying to make our own distribution channel. Valve didn't just start with Steam; they had Half-Life first. I hired some of the artists I had worked with on Tinker Heavy Industries to produce 3-D content, and we set out to make a game called Sushi Warrior.

A month after starting YCombinator with an extra $150,000 from SV Angel and Start Fund and $50K more from Romulus Capital, we recruited Shlomo Zippel out of PrimeSense to join us. Shlomo convinced his friend and coworker from PrimeSense, Roee Shenberg, to move here from Israel to join the founding team at Zigfu. Shlomo and Roee are both amazing hackers and work together extremely well. Israeli hackers are trained better than most MIT students, and it was a privilege to hack with them.

With this team and their experience, we shifted focus away from full-body games to the more difficult problems of making hand-gesture controlled UI and apps. We had a ton of users downloading our software for hacking Kinect, and our community group and subscriber list were growing rapidly; we wanted to make some applications that people would use regularly. In about 3 months we cranked out demos of a gesture-controlled YouTube and Facebook and an app loader/portal. This was what I was showing investors at the Microsoft VC summit in October 2011:

After Demo Day and Burning Man, in September 2011, I switched into pitch-mode telling investors how we had a growing community of Kinect hackers using our software and we were building the Zig TV motion OS, looking for funding to make an OEM-license-able product for smart TVs.


In retrospect, I wasted an incredible amount of valuable time showing these demos to investors trying to raise a seed round. I met with Paul Graham shortly after demo day in September 2011 to discuss the process of raising money, and we talked about investor leads and interest. I told him my goal was to raise $2.5M in a month or two and he said pretty directly: "no way that's happening." I thought we had a strong demo day launch, with a lot of leads to follow and many articles labeling us one of the startups to watch in the YC summer 2011 class. I guess it's easy to stand out in a YC class when what you are showing is not a web or mobile app.

Unfortunately, because our demo was compelling and our technology was cool, I wasted an incredible amount of time piquing interest without closing. I started meeting with VC investors and was enthusiastic early on when Andreessen Horowitz committed to join our seed round contingent on us finding a lead investor. Other investors suggested they would also like to join a round and wanted to track us for a month or so to see our traction. This herd mentality is a common strategy among investors. The best investors will reject you quickly if they are not interested. The last investor I pursued was Brad Feld from Foundry Group, early in 2012. Brad is a great guy, and Foundry is a great team: they won't say things like "we'll invest only if someone else will." Brad is kind to entrepreneurs, notably so for saying no in 60 seconds and not wasting your time. They also invest in natural-user-interface companies, notably Oblong, the team responsible for the "Minority Report" UI. I show off Zigfu a lot, and I demand a dollar every time someone says "Minority Report" to me.

Shlomo and I flew down to Vegas for CES mainly to meet Brad. Because he didn't say no in 60 seconds, we were pretty excited; it would take about a month before we got to a no. The TV OEM market we were aiming for wasn't attractive to Foundry, and ultimately Brad just got the feeling that while he was excited by the team and tech, he was trending negative on Zigfu's business prospects. Brad said something like "This sounds like something that will sell to Google or Microsoft for $20-50M and I'm just not interested in that." I'm really grateful for having had the opportunity to get his feedback.

So after several months of chasing down VCs and collecting weak commitments to join a round only if someone else would lead, I'd had enough. I realized that I should have been working on getting influential private angel investors on board before talking to venture capital firms, but mostly I was over the whole fundraising thing entirely. I met with Paul Graham sometime in the middle of this failed fundraising process and he said to me "well maybe you just suck, Amir" and encouraged me to "just make money." Shortly after that interaction he published this essay about patterns in the least successful companies he's funded. I'm sure he was writing this to me, at least partially.

Just Make Money:

Actually, I should not have spent any time seeking out funding at all. We still had 6 months of runway after demo-day when I set out to fund-raise and there was no shortage of Kinect hacking jobs available to make quick revenue while building our platform. We were turning away revenue because it was distracting us from building our platform and fundraising, but we could have better spent our time building products and getting a stronger revenue strategy together before raising capital. Revenue is just like funding but it costs you less to get it, and building products isn't a total waste of time even if they don't succeed at massive growth. Investors care about your vision for the company and product, but for crazy ideas that don't fit in the common startup bins, you need big potential revenue numbers and evidence of traction.

It wasn't until March 2012 that we started selling a Kinect software-development-kit product for Unity3D with a buy button up on the internet. In retrospect, we could have been selling this for nearly a year by the time we finally got around to it. We were basically out of funding, scraping by each month doing some amount of contract Kinect hacking. By April it was clear that the dev kit revenue and contracting revenue were not going to sustain us as a team, and we didn't have any runway left to build new products. Ted was the first to leave at the end of April. Then in July, Roee and Shlomo took on contracting gigs separate from Zigfu, and built TheFundersClub, which recently announced $6M in funding. Shlomo is developing TheFundersClub now, and Roee has moved back to Israel to pursue natural language processing.

I continue to operate Zigfu, updating the platform to support new features and supporting customers. Licensing revenues are reasonable: I cover all of my personal expenses from our mostly passive online sales revenue. I'm supplementing that income with contracting revenue doing Kinect hacking for interactive installations, kiosks, and digital signage engagements. These developments have led to OEM licensing agreements with a few large companies who use our browser plugin for Kinect controlled digital signage. I plan to release additional products from some of our half-completed projects and to support new sensors and tracking algorithms as they are released. Eventually, I plan to hire someone in a marketing and business development role as soon as I've saved up enough of our revenues; I am looking for the right person.

Gained Wisdom:

Shlomo and I recently reflected on things we would have done differently and think this is worth sharing as advice. First, don't try to raise money if you don't have revenue or at least a credible story about where your paycheck will come from after the funding runs out. OEM licensing is not a good story for a seed-stage startup without OEM sales experience, though I'm getting better at it. We developed a reasonable digital signage and interactive installation niche for Zigfu, and it's totally awesome supporting a platform used by tens-of-thousands of developers. But the promise of the gesture control market is still out-of-reach for many consumer applications. The bill-of-materials needs to get to $1 for a commodity CMOS array with skeleton-tracking logic integrated in the sensor.

We should have been focused on the developer tools market earlier since it was accessible to us, and then we could show some kind of traction to investors. Don't over-think it. As technologists, we wanted to build an important and innovative platform like a motion controlled TV operating system, but you can make money quicker and start growing revenue by building something simple and targeted, like a gesture controlled slideshow or gesture interactive digital signage.

We should have been more frugal with the funding we raised. I told PG and the YC partners that I thought that the $150K convertible debt that comes when you enter YC might have broken their scrappy startup model. Believing that the most leverage I could gain from that funding by demo day was to ramp up and hire people to make stuff, I spent it on getting more people on my team without proper focus on a product. But during the phase where you are dependent on funding and have no revenue, if you have more people on your payroll, you will run out of money sooner.

An early stage startup really shouldn't spend money unless it's clear how that investment will produce more money. It helps if you and your co-founders have saved up enough money before you started so that you don't need a salary, and we all agreed that we would have significantly less anxiety about funding if we had longer "personal runways." If you can sustain on savings, then you set timelines and milestones without fear of running out of money to pay rent, so you can use the funding you raise strategically instead of needing to spend your funding on personal expenses.

I really wish we had thrown more parties. The Kinect hacking community is building cool toys: one of our first customers turns DJs into Robots on massive displays and our friends at Ethno Tekh make awesome body-controlled musical instruments. We should invite our friends to play with these toys. It gives more social context to our products and it is very motivating to see people appreciating our work. Brad Feld told me that a startup isn't a waste of time if you learn something and make friends and I certainly learned a lot from doing Zigfu and made a lot of brilliant friends through YCombinator. If you are irrationally confident in your ability to achieve greatness, I encourage you to apply.

Friday, October 05, 2012

Zigfu HCI, and Terminator: Ad-hoc Skeleton Tracking

I haven't had a brain dump in a while, and I should probably write more since this blog gets a ton of visitors for some reason. I'm excited about a lot of technology that I've read up on lately and will touch on some before diving into object tracking.

I've read a lot of really interesting papers about transmitting information using the orbital angular momentum (OAM) of light, including a 2.4 Tbps wireless transmission that recently set the record. OAM is a separate dimension onto which we can encode information, providing a multiplicative effect on the amount of available bandwidth for transmitting information. This will be a massively disruptive technology, and you should leave this site right now to learn how to produce curl in your Poynting vectors using helical antennae.

I've also heard about using field effects to produce gate-controllable junctions in arbitrary singly doped semiconductors. This can lead to new types of solar devices using cheaper materials like copper oxide. One could also imagine a photodetector array that uses the effect to sweep through color bands. This method is physically similar to the method of producing a band gap in bilayer graphene. Controllable electron transport and volume manufacturing of graphene devices are both very active areas of research, and any electrical engineer who wants to be ahead of the curve ought to study up on relativistic quantum electrodynamics in monolayers.

On the FPGA side of things, I'm very excited by the Xilinx Virtex-7, which went with 3-D integration to provide what they are calling "Stacked Silicon Interconnect" on a separate layer. This shows that FPGAs continue to be a proving ground for leading semiconductor process technology.

I expect that we will see optical sensors 3-D integrated on top of a processing layer so that we will have computer vision baked onto optical sensors. This will allow us to process visual data with much higher framerates, lower-latency and lower-power. I predict that image filters, motion detection, optical flow, edge detection, generation of pyramid representations and integral images will all be implemented on silicon that is 3-D integrated with a sensor. This sort of sensor will enable low-power object tracking and gesture control in basically any embedded application.
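For a flavor of the kind of primitive that could be baked into such a sensor layer, here's a sketch of an integral image (summed-area table) in plain Python. This is textbook computer vision, not any particular vendor's silicon: once the table is built, the sum over any rectangle is four lookups, which is why it is such a natural fit for fixed-function hardware next to the sensor.

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x].
    `img` is a list of rows of numbers."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of this row so far, plus everything above.
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """O(1) sum of pixels in the inclusive rectangle (x0,y0)-(x1,y1)."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]  # add back the doubly subtracted corner
    return total
```

Haar-like features (the workhorse of classic face detection) are just differences of such rectangle sums, so an on-sensor summed-area table would make whole families of detectors nearly free at the host.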

Object tracking and recognition are topics I have been following for a long time. I wrote about my ideas for synthesis-based recognition techniques six years ago when I was still an undergrad. PrimeSense's launch of OpenNI inspired me to pursue the human-computer interaction side of this work at Zigfu, where we recently hit the milestone of one hundred commercial customers in 29 countries using our Unity3D SDK to produce interactive experiences. Zigfu is primarily focused on human-computer interaction (HCI) and not computer vision (CV); even with perfectly accurate sensors and computer vision to track human movement, there is a wide-open problem of designing a motion-controlled UI that enables a human to comfortably and accurately select one element out of a list of 1000 elements. I like to compare this to touch screens, where the measurable qualities of capacitive touch screens are like the sensors and computer vision, while the magical qualities of the iOS/Cocoa UI and Android UI are how people experience the technology.

Still, I've been keeping an eye on visual object tracking for a long time and also want to do a brain dump of my thoughts. Visual object tracking such as in Zdenek Kalal's Predator algorithm is very compelling:
Zdenek has also supported an open source community called OpenTLD (Tracking Learning Detection), which has produced C++ ports of his algorithm from the MATLAB original (see me playing with it in the video below).

Another great reference is this paper on skeleton tracking with surface analysis, with a really cool video to go along with it. Some time back, on April 15th, I wrote a proposal for ad-hoc skeleton analysis to the TLD group that went something like this.

Ever since watching the Predator video, I've been thinking about how to extend the algorithm to use 3-D voxel and point cloud structures and not just track, but determine orientation and perform ad-hoc discovery of skeleton joint structures. I call this algorithm "Terminator." Terminator tracks and recognizes objects by generating a rigged 3-D model of the object. Instead of producing 2-D wavelets from the tracked object as in Predator, Terminator generates a 3-D model of the object complete with inferred skeleton structures. A voxel model can be created by compositing the 2-D images generated by a tracker system (as in Predator). Voxel acquisition can also be assisted with a depth sensor. Recognition and determination of the orientation of rigid bodies can be performed using a sparse voxel representation. One way to accelerate recognition may be to use principal component analysis to align the input data with the models being recognized. Another way to perform recognition may be to create sets of 2-D wavelets by projecting the voxel representation, producing sets of 2-D image recognizers. Brute force 3-D convolution of sparse voxels may also work, but makes more sense for detecting features in non-sparse voxels like MRI data. A skeleton model with joint weights for each voxel can be inferred by using optical flow data to detect state transitions in the voxel model, such as when a mouth opens or a hand gesture changes. Historic data can be used to verify and improve the skeleton system and to infer range-of-motion constraints. Just had to share this idea in case I never really get down to it.
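To illustrate the PCA alignment step, here's a pure-Python sketch that finds a point cloud's dominant principal axis by power-iterating its 3x3 covariance matrix. A real implementation would recover all three axes (and resolve sign ambiguities) before aligning two voxel models; the function name and structure are mine, not from any actual Terminator code.

```python
def dominant_axis(points, iters=100):
    """Return (centroid, unit_vector) where the vector is the
    direction of greatest variance of a 3-D point cloud, found by
    power iteration on the covariance matrix."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered cloud.
    cov = [[sum(q[i] * q[j] for q in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        # Repeated multiplication by cov converges to the
        # eigenvector with the largest eigenvalue.
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # degenerate cloud (all points identical)
        v = [x / norm for x in w]
    return centroid, v
```

Aligning two clouds to their respective principal frames reduces recognition to a comparison in a shared orientation, which is the point of the acceleration idea above.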

Since the Kinect provides superior tracking on the 3-D depth data, we are able to train multiple predator-like algorithms on the RGB data with feedback from a depth-data tracker. We can also use the depth data for user segmentation and extract only the interesting RGB pixels from the registered depth data.
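The user-segmentation idea can be sketched in a few lines: given a depth map registered to the RGB image, keep only the pixels whose depth falls in the band where the user is standing and hand just those to the RGB tracker. This toy version uses nested Python lists and illustrative thresholds; a real pipeline would operate on camera frames and on the sensor's per-user segmentation labels.

```python
def segment_user(rgb, depth, near=0.5, far=2.5):
    """Zero out RGB pixels whose registered depth (in meters) falls
    outside the (near, far) band.
    rgb: H x W list of (r, g, b) tuples; depth: H x W list of floats.
    The near/far thresholds are illustrative, not calibrated values."""
    h, w = len(depth), len(depth[0])
    mask = [[near < depth[y][x] < far for x in range(w)] for y in range(h)]
    out = [[rgb[y][x] if mask[y][x] else (0, 0, 0) for x in range(w)]
           for y in range(h)]
    return out, mask
```

Feeding a tracker only the masked pixels both cuts its search space and keeps background clutter from polluting its appearance model.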

Anyway, this is long enough already, so I'll leave you with a video I made running the predator algorithm's C++ port in multiple processes: