Highlights of Robotics: Science and Systems 2012

I spent last week at RSS 2012 in Sydney. Here are a few of the papers that caught my attention. This year I went to more talks on manipulation, but I still find myself picking a SLAM paper as my favourite :)

Robust Estimators for SLAM

For me, the most interesting work at the conference was a pair of related papers, one from Ed Olson and another from Niko Sünderhauf.

Figure 1 from Olson and Agarwal 2012

Large Scale Deep Learning at Google

[This blog has been dormant a long time, since I post most of this kind of content on Google+ these days. I’ll still cross-post the occasional longer piece here.]

There’s an important paper at ICML this week, showing results from a Google X project which scaled up deep learning to 16,000 cores. Just by throwing more computation at the problem, things moved substantially beyond the prior state of the art.

Building High-level Features Using Large Scale Unsupervised Learning
Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng

Learned feature detectors

I think this is a really important paper, so let me give some background. Since starting at Google, I’ve seen five projects internally that blew me away with their obvious potential to truly change the world. A few of those are now public: (1) Self-driving cars, (2) Project Glass, (3) Knowledge Graph. Number four is this paper. It might grab fewer headlines than the first three, but in the long term this is by far the most important.

What this paper demonstrates (or at least, what it convinced me of) is that raw computation is now the key limiting factor for machine learning. That is huge. For the last twenty years or more, it was not really the case. The field was dominated by SVMs and Boosting. Progress didn’t really have much to do with Moore’s Law. If machines got a million times faster, it wasn’t really clear that we had any good way to use the extra computation. There certainly wasn’t a viable path to animal-level perceptual abilities. Now I would like to stick my neck out and say that I think that position has changed. I think we now have a research program that has a meaningful chance of arriving at learning abilities comparable to biological systems.
That doesn’t mean that if someone gifted us a datacenter from 2050 we could solve machine learning immediately. There is a lot of algorithmic progress still to be made [1]. Unlike SVMs, the training of these systems still owes a lot to black magic. There are saturation issues that I think nobody has really figured out yet, to name one of a hundred problems [2]. But, the way seems navigable. I’ve been optimistic about this research ever since I saw Geoff Hinton’s talk on RBMs back in 2007, but it was a cautious optimism back then. Now that Google has shown you can scale the methods up by orders of magnitude and get corresponding performance improvements, my level of confidence has gone up several notches.

Returning to the present, here are a few cool aspects of the current paper:

1) Without supervision, the model learns complex, generalizable features (see the human face and cat face detectors below). To say that again, there is no labelled training data. Nobody told the model to detect faces. The face feature simply emerges naturally as a compact way for the network to reconstruct its inputs. We’ve seen that before for low level features like edges and edge junctions, but to see it for high level concepts is quite a result.

2) “Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation.”

This is important too. It’s been known for a while that most current approaches used in computer vision don’t really learn any meaningful invariance to transformations which are not explicitly hand-designed into the features. e.g. See this paper from the DiCarlo lab: Comparing State-of-the-Art Visual Features on Invariant Object Recognition Tasks

3) “Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.”

It works! Well – it still fails 85% of the time (on a very very hard test set), but it’s big progress. These techniques apply to everything from speech recognition to language modeling. Exciting times.
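As a quick sanity check on those numbers, here is the arithmetic behind the quote. It assumes "relative improvement" means (new - old) / old, which is my reading of the phrase; the variable names are just for illustration.

```python
new_accuracy = 0.158         # 15.8% accuracy, as quoted above
relative_improvement = 0.70  # "70% relative improvement", as quoted above

# Assumption: relative improvement = (new - old) / old, so old = new / (1 + r).
previous_accuracy = new_accuracy / (1.0 + relative_improvement)
error_rate = 1.0 - new_accuracy

print(f"implied previous state of the art: {previous_accuracy:.1%}")  # 9.3%
print(f"error rate of the new model: {error_rate:.1%}")               # 84.2%
```

So under that reading the previous best was a little over 9%, and the new model is still wrong about 84% of the time.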


[1]: I saw a talk by Geoff Hinton just yesterday which contained a big advance which he called “dropout”. No paper available yet, but check his website in the next few days. Lots happening right now.

[2]: Or the embarrassing fact that models that achieve close to record performance on MNIST totally fail on 1 – MNIST (i.e. just invert the colours and the model fails to learn). Another example is the structural parameters (how many layers, how wide) which are still picked more or less arbitrarily. The brain is not an amorphous porridge; in some places the structure is important, and the details will take years for us to figure out.

The Universal Robotic Gripper

I just saw a video of a device that consists of nothing more than a rubber balloon, some coffee grounds and a pump. I’m pretty sure it’s going to change robotics forever. Have a look:

[Embedded YouTube video]

It’s a wonderful design. It’s cheap to make. You don’t need to position it precisely. You need only minimal knowledge of the object you’re picking up. Robotic grasping has always been too hard to be really practical in the wild. Now a whole class of objects just got relatively easy.

Clearly, the design has its limitations. It’s not going to allow for turning the pages of a book, making a cheese sandwich, tying a daisy chain, etc. But for relatively straightforward manipulation of rigid objects, it’s a beautiful solution. This one little idea could help start a whole industry.

The research was a collaboration between Chicago, Cornell and iRobot, with funding from DARPA. It made the cover of PNAS this month. The research page is here.

A Thousand Kilometers of Appearance-Only SLAM

I’m off to RSS 2009 in Seattle next week to present a new paper on FAB-MAP, our appearance-based navigation system. For the last year I’ve been hard at work on pushing the scale of the system. Our initial approach from 2007 could handle trajectories about 1km long. This year, we’re presenting a new system that we demonstrate doing real-time place recognition over a 1,000km trajectory. In terms of accuracy, the 1,000km result seems to be on the edge of what we can do; however, at around the 100km scale, performance is really rather good. Some video results below.

[Embedded YouTube video]

One of the hardest things to get right was simply gathering the 1,000km dataset. The physical world is unforgiving! Everything breaks. I’ll have a few posts about the trials of building the data collection system over the next few days.

An Insider’s Guide to BigDog

In common with half of YouTube, I was mesmerized by the BigDog videos from Boston Dynamics earlier in the year, though I couldn’t say much about how the robot worked. For everyone hungry for some more technical details, check out the talk by Marc Raibert at Carnegie Mellon’s Field Robotics 25 event. There’s some interesting discussion of the design of the system, where it’s headed, and more great video.

There are a bunch of other worthwhile talks from the event. I particularly enjoyed Hugh Durrant-Whyte’s description of building a fully automated container terminal “without a graduate student in 1000km”.

OpenGL Invades the Real World

Augmented reality systems are beginning to look pretty good these days. The videos below show some recent results from an ISMAR paper by Georg Klein. The graphics shown are inserted directly into the live video stream, so that you can play with them as you wave the camera around. To do this, the system needs to know where the camera is, so that it can render the graphics with the right size and position. Figuring out the camera motion by tracking features in the video turns out to be not that easy, and people have been working on it for years. As you can see below, the current crop of solutions are pretty solid, and run at framerate too. More details on Georg’s website.

[Embedded YouTube video]

[Embedded YouTube video]
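To make the "right size and position" point concrete, here is a minimal sketch of how a known camera pose turns a 3D point on a virtual object into a pixel position. It uses a plain pinhole camera model with no lens distortion. This is not Georg's code, and the function name, focal length and image size are hypothetical values chosen for illustration.

```python
import numpy as np

def project_point(point_world, rotation, translation, focal_px, principal_point):
    """Project a 3D world point into pixel coordinates with a pinhole camera model.

    rotation (3x3) and translation (3,) map world coordinates into the camera frame.
    Returns None if the point is behind the camera.
    """
    point_cam = rotation @ point_world + translation
    if point_cam[2] <= 0:
        return None  # behind the camera, nothing to draw
    u = focal_px * point_cam[0] / point_cam[2] + principal_point[0]
    v = focal_px * point_cam[1] / point_cam[2] + principal_point[1]
    return np.array([u, v])

# Hypothetical camera: 640x480 image, 500-pixel focal length.
focal_px = 500.0
principal_point = (320.0, 240.0)
rotation = np.eye(3)  # camera axes aligned with the world axes

# A corner of a virtual object, 10 cm to the right of the world origin.
corner = np.array([0.1, 0.0, 0.0])

# The same corner seen with the world origin 1 m and then 2 m in front of the camera.
near = project_point(corner, rotation, np.array([0.0, 0.0, 1.0]), focal_px, principal_point)
far = project_point(corner, rotation, np.array([0.0, 0.0, 2.0]), focal_px, principal_point)
print(near, far)
```

Doubling the distance halves the corner's offset from the image centre (50 pixels versus 25), which is why a wrong camera estimate immediately makes the graphics look the wrong size or appear to slide around.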

Back in 2005, Andy Davison’s original augmented reality system got me excited enough that I decided to do a PhD. The robustness of these systems has improved a lot since then, to the point where they’re a fairly short step from making good AR games possible. In fact, there are a few other cool computer-vision based game demos floating around the lab at the moment. It’s easy to see this starting a new gaming niche. Basic vision-based games have been around for a while, but the new systems really are a shift in gear.

There are still some problems to be ironed out – current systems don’t deal with occlusion at all, for example. You can see some other issues in the video involving moving objects and repetitive texture. Still, it looks like they’re beginning to work well enough to start migrating out of the lab. First applications will definitely be of the camera-and-screen variety. Head-mounted display style systems are still some way off; the reason being that decent displays just don’t seem to exist right now.

(For people who wonder what this has to do with robotics – the methods used for tracking the environment here are basically identical to those used for robot navigation over larger scales.)

Citation: “Parallel Tracking and Mapping for Small AR Workspaces”, Georg Klein and David Murray, ISMAR 2007.

Deep Learning

After working in robotics for a while, it becomes apparent that despite all the recent progress, the underlying machine learning tools we have at our disposal are still quite primitive. Our standard stock of techniques like Support Vector Machines and boosting methods are both more than ten years old, and while you can do some neat things with them, in practice they are limited in the kind of things they can learn efficiently. There’s been lots of progress since the techniques were first published, particularly through careful design of features, but to get beyond the current plateau it feels like we’re going to need something really new.

For a glimmer of what “something new” might look like, I highly recommend this wonderful Google Tech Talk by Geoff Hinton: “The Next Generation of Neural Networks”, where he discusses restricted Boltzmann machines. There are some stunning results, and an entertaining history of learning algorithms, during which he amusingly dismisses SVMs as “a very clever type of Perceptron”. There’s a more technical version of the talk in this NIPS tutorial, along with a workshop on the topic. Clearly the approach scales beyond toy problems – they have an entry sitting high on the Netflix Prize leaderboard.

These results with deep architectures are very exciting. Neural network research has effectively been abandoned by most of the machine learning community for years, partly because SVMs work so well, and partly because there was no good way to train multi-layer networks. SVMs were very pleasant to work with – there was no parameter tuning or black magic involved, you just throw data at them and press start. However, it seems clear that to make real progress we’re going to have to return to multi-layer learning architectures at some point. It’s good to see progress in that direction.

Hat tip: Greg Linden

More from ISRR

ISRR finished today. It’s been a good conference, low on detailed technical content, but high on interaction and good for an overview of parts of robotics I rarely get to see.

One of the highlights of the last two days was a demo from Japanese robotics legend Shigeo Hirose, who put on a show with his ACM R5 swimming snake robot in the hotel’s pool. Like many Japanese robots, it’s remote controlled rather than autonomous, but it’s a marvellous piece of mechanical design. Also on show was a hybrid roller-walker robot and some videos of a massive seven-ton climbing robot for highway construction.

[Embedded YouTube video]

Another very interesting talk with some neat visual results was given by Shree Nayar, on understanding illumination in photographs. If you take a picture of a scene, the light that reaches the camera can be thought of as having two components – direct and global. The “direct light” leaves the light source and arrives at the camera via a single reflection off the object. The “global light” takes more complicated paths, for example via multiple reflections, subsurface scatter, volumetric scatter, etc. What Nayar showed was that by controlling the illumination, it’s possible to separate the direct and global components of the lighting. Actually, this turns out to be almost embarrassingly simple to do – and it produces some very interesting results. Some shown below, and many more here. It’s striking how much the direct-only photographs look like renderings from simple computer graphics systems like OpenGL. Most of the reason early computer graphics looked unrealistic was due to the difficulty of modelling the global illumination component. The full paper is here.

[Images: the scene, its direct component, and its global component]
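For a feel of how simple the separation can be, here is a minimal sketch. It assumes the scene is lit by shifted high-frequency patterns (for example checkerboards) with half the projector pixels switched on, and that the global light varies smoothly enough that every pixel receives half of it under any such pattern. That assumption, the function name and the synthetic data are mine, so treat this as an illustration rather than the method from the paper.

```python
import numpy as np

def separate_direct_global(images):
    """Split a stack of pattern-lit images into direct and global components.

    images: array of shape (num_patterns, height, width), each captured under a
    shifted high-frequency pattern that lights half of the scene.
    """
    brightest = images.max(axis=0)  # pixel directly lit: direct + global / 2
    darkest = images.min(axis=0)    # pixel not directly lit: global / 2
    direct = brightest - darkest
    global_ = 2.0 * darkest
    return direct, global_

# Synthetic check: build images from known components, then recover them.
rng = np.random.default_rng(0)
true_direct = rng.uniform(0.0, 1.0, size=(4, 4))
true_global = rng.uniform(0.0, 1.0, size=(4, 4))

# Two complementary checkerboard masks, so every pixel is lit in exactly one image.
rows, cols = np.indices((4, 4))
mask = (rows + cols) % 2 == 0
images = np.stack([
    np.where(mask, true_direct, 0.0) + 0.5 * true_global,
    np.where(~mask, true_direct, 0.0) + 0.5 * true_global,
])

direct, global_ = separate_direct_global(images)
print(np.allclose(direct, true_direct), np.allclose(global_, true_global))
```

With real photographs you would capture more than two shifted patterns, since the projector and camera pixels never line up perfectly, but the per-pixel max and min are still all that is needed.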

Lots of other great technical talks too, but obviously I’m biased towards posting about the ones with pretty pictures!

Citation: “Visual Chatter in the Real World”, S. Nayar et al., ISRR 2007

ISRR Highlights – Day 1

I’m currently in Hiroshima, Japan at ISRR. It’s been a good conference so far, with lots of high quality talks. I’m also enjoying the wonderful Japanese food (though fish for breakfast is a little strange).

One of the most interesting talks from Day 1 was about designing a skin-like touch sensor. The design is ingeniously simple, consisting of a layer of urethane foam with some embedded LEDs and photodiodes. The light from the LED scatters into the foam and is detected by the photodiode. When the foam is deformed by pressure, the amount of light reaching the photodiode changes. By arranging an array of these sensing sites under a large sheet of foam, you get a skin-like large-area pressure sensor. The design is simple, cheap, and appears to be quite effective.

Principle of the Sensor
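The read-out side of a sensor like this can be sketched in a few lines. The version below compares each photodiode reading with its resting value and flags sites where the light level has changed noticeably. The function name, the 5% threshold and the numbers are all hypothetical, and a real sensor would need a proper calibration from light change to pressure.

```python
import numpy as np

def pressure_map(readings, baseline, threshold=0.05):
    """Turn photodiode readings into a crude contact map.

    readings, baseline: arrays of shape (rows, cols), one value per sensing site.
    Returns the relative change in light at each site and a boolean contact mask.
    """
    relative_change = np.abs(readings - baseline) / baseline
    contact = relative_change > threshold
    return relative_change, contact

# Hypothetical 3x3 patch of sensing sites with an arbitrary resting light level.
baseline = np.full((3, 3), 100.0)
readings = baseline.copy()
readings[1, 1] = 130.0  # foam pressed over the centre site changes the detected light

change, contact = pressure_map(readings, baseline)
print(change[1, 1], contact.sum())
```

Taking the absolute change means the sketch does not care whether squeezing the foam makes the photodiode see more light or less.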

Having a decent touch sensor like this is important. People rely on their sense of touch much more than they realize – one of the presenters demonstrated this by showing some videos of people trying to perform simple mechanical tasks with anaesthetised sensory neurons (they weren’t doing well). Walking robots weren’t getting very far until people realized the importance of having pressure sensors in the soles of the feet.

The authors were able to show some impressive new abilities with a humanoid robot using their sensor. Unfortunately I can’t find their videos online, but the figure below shows a few frames of the robot picking up a 30 kg load. Using its touch sensor, the robot can steady itself against the table, which helps with stability.

Touching the washing

I get the impression that the sensor is limited by the thickness of the foam – too thick to use on fingers for example. It’s also a long way from matching the abilities of human skin, which has much higher resolution and sensitivity to other stimuli like heat, etc. Still, it’s a neat technology!

Update: Here’s another image of the robot using its touch sensor to help with a roll-and-rise manoeuvre. There’s a video over at BotJunkie.

Citation: “Whole body haptics for augmented humanoid task capabilities”, Yasuo Kuniyoshi, Yoshiyuki Ohmura, and Akihiko Nagakubo, International Symposium on Robotics Research 2007.