Large Scale Deep Learning at Google

[This blog has been dormant a long time, since I post most of this kind of content on Google+ these days. I’ll still cross-post the occasional longer piece here.]

There’s an important paper at ICML this week, showing results from a Google X project which scaled up deep learning to 16,000 cores. Just by throwing more computation at the problem, things moved substantially beyond the prior state of the art.

Building High-level Features Using Large Scale Unsupervised Learning
Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng

Learned feature detectors


I think this is a really important paper, so let me give some background. Since starting at Google, I’ve seen five projects internally that blew me away with their obvious potential to truly change the world. A few of those are now public: (1) Self-driving cars, (2) Project Glass, (3) Knowledge Graph. Number four is this paper. It might grab fewer headlines than the first three, but in the long term this is by far the most important.

What this paper demonstrates (or at least convinced me) is that raw computation is now the key limiting factor for machine learning. That is huge. For the last twenty years or more, it was not really the case. The field was dominated by SVMs and Boosting. Progress didn’t really have much to do with Moore’s Law. If machines got a million times faster, it wasn’t really clear that we had any good way to use the extra computation. There certainly wasn’t a viable path to animal-level perceptual abilities. Now I would like to stick my neck out and say that I think that position has changed. I think we now have a research program that has a meaningful chance of arriving at learning abilities comparable to biological systems.
That doesn’t mean that if someone gifted us a datacenter from 2050 we could solve machine learning immediately. There is a lot of algorithmic progress still to be made [1]. Unlike SVMs, the training of these systems still owes a lot to black magic. There are saturation issues that I think nobody has really figured out yet, to name one of a hundred problems [2]. But, the way seems navigable. I’ve been optimistic about this research ever since I saw Geoff Hinton’s talk on RBMs back in 2007, but it was a cautious optimism back then. Now that Google has shown you can scale the methods up by orders of magnitude and get corresponding performance improvements, my level of confidence has gone up several notches.

Returning to the present, here are a few cool aspects of the current paper:

1) Without supervision, the model learns complex, generalizable features (see the human face and cat face detectors below). To say that again, there is no labelled training data. Nobody told the model to detect faces. The face feature simply emerges naturally as a compact way for the network to reconstruct its inputs. We’ve seen that before for low level features like edges and edge junctions, but to see it for high level concepts is quite a result.

2) “Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation.”

This is important too. It’s been known for a while that most current approaches used in computer vision don’t really learn any meaningful invariance to transformations which are not explicitly hand-designed into the features. For example, see this paper from the DiCarlo lab: Comparing State-of-the-Art Visual Features on Invariant Object Recognition Tasks

3) “Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.”

It works! Well – it still fails 85% of the time (on a very very hard test set), but it’s big progress. These techniques apply to everything from speech recognition to language modeling. Exciting times.
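As a quick sanity check on the figures in that quote, here is a minimal sketch of the arithmetic. The only inputs are the two numbers quoted from the paper (15.8% accuracy, 70% relative improvement); the function name is mine and purely illustrative.

```python
def implied_baseline(new_accuracy: float, relative_improvement: float) -> float:
    """Return the previous accuracy implied by a relative improvement.

    If new = old * (1 + r), then old = new / (1 + r).
    """
    return new_accuracy / (1.0 + relative_improvement)


new_accuracy = 0.158         # quoted from the paper
relative_improvement = 0.70  # quoted from the paper

baseline = implied_baseline(new_accuracy, relative_improvement)
error_rate = 1.0 - new_accuracy

print(f"implied previous accuracy: {baseline:.1%}")  # 9.3%
print(f"remaining error rate: {error_rate:.1%}")     # 84.2%
```

So the quoted numbers imply a previous best of roughly 9.3%, and the model still gets about 84% of test images wrong, which is the "fails 85% of the time" above, rounded.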

——————————-
Notes:

[1]: I saw a talk by Geoff Hinton just yesterday which contained a big advance which he called “dropout”. No paper available yet, but check his website in the next few days. Lots happening right now.

[2]: Or the embarrassing fact that models that achieve close to record performance on MNIST totally fail on 1 – MNIST (i.e. just invert the colours and the model fails to learn). Another example is the structural parameters (how many layers, how wide) which are still picked more or less arbitrarily. The brain is not an amorphous porridge. In some places the structure is important, and the details will take years for us to figure out.
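To make the "1 – MNIST" transformation in note [2] concrete, here is a minimal sketch. It assumes grayscale images stored as nested lists with pixel values scaled to [0, 1]; the tiny 3x3 "digit" is made up for illustration and is not real MNIST data.

```python
def invert(image):
    """Return 1 - image for a grayscale image with pixel values in [0, 1]."""
    return [[1.0 - pixel for pixel in row] for row in image]


# Hypothetical 3x3 image: a bright vertical stroke on a dark background.
digit = [
    [0.0, 0.25, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.75, 0.0],
]

inverted = invert(digit)  # dark stroke on a bright background
for row in inverted:
    print(row)
```

The content of the image is unchanged (same stroke, same shape), only the polarity flips, which is why it is embarrassing that training behaves so differently on the two versions.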

The Universal Robotic Gripper

I just saw a video of a device that consists of nothing more than a rubber balloon, some coffee grounds and a pump. I’m pretty sure it’s going to change robotics forever. Have a look:


It’s a wonderful design. It’s cheap to make. You don’t need to position it precisely. You need only minimal knowledge of the object you’re picking up. Robotic grasping has always been too hard to be really practical in the wild. Now a whole class of objects just got relatively easy.

Clearly, the design has its limitations. It’s not going to allow for turning the pages of a book, making a cheese sandwich, tying a daisy chain, etc. But for relatively straightforward manipulation of rigid objects, it’s a beautiful solution. This one little idea could help start a whole industry.

The research was a collaboration between Chicago, Cornell and iRobot, with funding from DARPA. It made the cover of PNAS this month. The research page is here.

Fun with Robots

It’s no secret that I’m a huge fan of Willow Garage. So as they get ready to ship their first PR2 robots, here’s a gratuitous video of the pre-release testing:


This second video is a nice overview of what Willow Garage and their open source robotics program are all about:


Google Goggles Goes Live

To my surprise, Google Goggles actually launched last night, not 12 hours after I posted about it yesterday. I’ve just spent a while playing around with it on my Android handset. Search times are, as expected, well over one second, in the anticipated 5-10 second range. Good to see that even Google can’t break the laws of physics. The app shows a pretty-but-pointless image analysis animation to make the wait seem shorter, almost exactly like my tongue-in-cheek suggestion from yesterday.

The engine covers all the easy verticals (books, DVDs, logos, landmarks, products, text, etc). The recognition quality is very good, though the landing pages are often a bit useless. It will take a bit of living with it to see how much use it is as a tool rather than a tech demo.
The major worry is that it may end up being too broad-but-shallow. For example, they do wine recognition, but the landing pages are generic. Perhaps visual wine recognition would be better built into Snoot or some other dedicated iPhone wine app. Or Google could take the route Bing recently took with recipes, and build rich landing pages for each vertical. Because of the nature of current visual search technology, Goggles is essentially a number of different vertical searches glued together, so this is more feasible than it would be for web search.

Certainly an interesting week for visual search!

Google Visual Search

The tubes are a-buzz with some news that Google are working on a mobile visual search system. No surprise there really, but it is interesting to see the current state of their prototype. The info comes from a CNBC documentary called Inside Google. A clip is up on YouTube, and the relevant section starts three minutes in.

The main thing that surprised me is that the interviewer mentioned response times of less than a second. I find that somewhat unlikely, and I think it’s probably a case of a journalist misunderstanding some fine (but crucial) distinctions.
In terms of the actual visual search engine, there’s no problem. Our own visual search engine can serve searches in a couple of hundred milliseconds, even for millions of images. The issue is transmitting the user image to the server over the mobile phone network. Cell networks are still pretty slow, and have high latency even to transmit 1 byte. Even after aggressive compression of the image and after playing some other fancy tricks, we see typical user waiting times between 4 and 5 seconds over a typical 3G connection. On a 2G connection, the system is basically unusably slow. I strongly suspect the Google app will perform similarly. In many cases it’s still much faster than pecking away on an on-screen keyboard, though because the user is waiting, it can feel longer. It would almost be worth giving the user a pointless maths puzzle to solve, tell them it’s powering the recognition, and they would probably be happier!
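A back-of-envelope sketch of why the network, not the search engine, dominates the wait. Only the server time ("a couple of hundred milliseconds") comes from the figures above. The image size, uplink speed and round-trip latency are hypothetical values picked to be plausible for a 3G connection, not measurements from our system.

```python
def upload_time_s(image_bytes: int, uplink_bits_per_s: float) -> float:
    """Time to push the compressed image up the link, ignoring protocol overhead."""
    return image_bytes * 8 / uplink_bits_per_s


def total_wait_s(image_bytes: int, uplink_bits_per_s: float,
                 round_trip_s: float, server_s: float) -> float:
    """User-visible wait: one network round trip + upload + server-side search."""
    return round_trip_s + upload_time_s(image_bytes, uplink_bits_per_s) + server_s


image_bytes = 50_000         # ASSUMPTION: ~50 KB compressed photo
uplink_bits_per_s = 100_000  # ASSUMPTION: ~100 kbit/s effective 3G uplink
round_trip_s = 0.5           # ASSUMPTION: cell network round-trip latency
server_s = 0.2               # from the post: a couple of hundred milliseconds

total = total_wait_s(image_bytes, uplink_bits_per_s, round_trip_s, server_s)
network_share = (total - server_s) / total

print(f"total wait: {total:.1f} s")
print(f"share spent on the network: {network_share:.0%}")
```

With these made-up inputs the search itself is only a few percent of the wait, so making the server faster barely helps. Shrinking the upload or hiding the latency is where the time goes.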

In any case, the Google demo is an interesting, if not unexpected, development in the visual search space. While we’re waiting to see the final version of Google Visual Search, Android users can try out our PlinkArt app, which does visual search for famous paintings. It’s live right now!


Autonomous Helicopters!

I’ve let this blog go very quiet while I was working on finishing my thesis (done now!). However, today my brother got a helicopter pilot’s license, so I thought I would mark the occasion by posting some videos showing how his fancy skill might soon be redundant :). Here are some cool results from Nick Roy’s group at MIT:

It’s a pretty cool system. Robots that do the full autonomous shebang, from SLAM to path planning to obstacle avoidance, are still quite rare. To do it all on a helicopter is just showing off.

A Thousand Kilometers of Appearance-Only SLAM

I’m off to RSS 2009 in Seattle next week to present a new paper on FAB-MAP, our appearance-based navigation system. For the last year I’ve been hard at work on pushing the scale of the system. Our initial approach from 2007 could handle trajectories about 1km long. This year, we’re presenting a new system that we demonstrate doing real-time place recognition over a 1,000km trajectory. In terms of accuracy, the 1,000km result seems to be on the edge of what we can do; however, at around the 100km scale, performance is really rather good. Some video results below.


One of the hardest things to get right was simply gathering the 1,000km dataset. The physical world is unforgiving! Everything breaks. I’ll have a few posts about the trials of building the data collection system over the next few days.

Amazon Buys SnapTell

So the visual search story of the day is that Amazon has acquired SnapTell. This is a really natural fit – SnapTell have solid technology, and Amazon are one of the best use cases. Not too surprised to hear the deal has been done – SnapTell has been conspicuously quiet for several months, and word was that they either had to exit or secure another funding round before the end of the year. So congratulations are in order to everyone at SnapTell on securing what seems like an ideal exit.

The big question now is how this changes the playing field for other companies in the visual search space. I would assume Amazon will move SnapTell’s focus away from their enhanced print advertising service and concentrate on image recognition for books, CDs, DVDs, etc. (Up to now, Amazon has been doing this with human-powered image recognition, which was nuts.) While this makes perfect sense for Amazon, it’s going to mean more rather than fewer opportunities for companies still focused on the general visual search market.

So I guess this is an ideal point to mention the open secret that I’m currently co-founding Plink, a new visual search engine similar in capability to SnapTell. While our demo shows some familiar use cases, we’re working on taking the technology in some entirely new directions. Visual search is very young, and there’s a whole lot still to do! Anyone interested in visual search, feel free to contact me.

Autonomous Marathon!

Congratulations to everyone at Willow Garage for reaching Milestone 2 in the development of the PR2 robot. 26.2 miles of autonomous indoor navigation, including opening eight doors and plugging in to nine power sockets. We’ve been watching the video in the lab with serious robot envy. Very cool!


Dinosaurs and Tail Risk

Writing in this morning’s FT, Nassim Nicholas Taleb proposes Ten principles for a Black Swan-proof world:

1. What is fragile should break early while it is still small. Nothing should ever become too big to fail. Evolution in economic life helps those with the maximum amount of hidden risks — and hence the most fragile — become the biggest.

Then we will see an economic life closer to our biological environment: smaller companies, richer ecology, no leverage.

A sensible plan, but unfortunately Mr. Taleb’s faith in biology is misplaced.

Why the Dinosaurs got so Large

19th-century palaeontologist Edward Drinker Cope noticed that animal lineages tend to get bigger over evolutionary time, starting out small and leaving ever bigger descendants. This process came to be known as Cope’s rule.

Getting bigger has evolutionary advantages, explains David Hone, an expert on Cope’s rule at the Institute of Vertebrate Paleontology and Paleoanthropology in Beijing, China. “You are harder to predate and it is easier for you to fight off competitors for food or for mates.” But eventually it catches up with you. “We also know that big animals are generally more vulnerable to extinction,” he says. Larger animals eat more and breed more slowly than smaller ones, so their problems are greater when times are tough and food is scarce. “Many of the very large mammals, such as Paraceratherium, had a short tenure in the fossil record, while smaller species often tend to be more persistent,” says mammal palaeobiologist Christine Janis of Brown University in Providence, Rhode Island. So on one hand natural selection encourages animals to grow larger, but on the other it eventually punishes them for doing so. This equilibrium between opposing forces has prevented most land animals from exceeding about 10 tonnes.

Dinosaurs had skewed incentives and took on too much tail risk! If even evolution falls into this trap, God help the bank regulators…