Computer Vision in the Elastic Compute Cloud

In a datacenter somewhere on the other side of the planet, a rack-mounted computer is busy hunting for patterns in photographs of Oxford.  It is doing this for 10 cents an hour, with more RAM and more horsepower than I can muster on my local machine. This delightful arrangement is made possible by Amazon’s Elastic Compute Cloud.

For the decreasing number of people who haven’t heard of EC2, it’s a pretty simple idea. Via a simple command line interface you can “create” a server running in Amazon’s datacenter. You pick a hardware configuration and OS image, send the request and voilà – about 30 seconds later you get back a response with the IP address of the machine, to which you now have root access and sole use.  You can customize the software environment to your heart’s content and then save the disk image for future use. Of course, now that you can create one instance you can create twenty. Cluster computing on tap.

This is an absolutely fantastic resource for research. I’ve been using it for about six months now, and have very little bad to say about it. Computer vision has an endless appetite for computation. Most groups, including our own, have their own computing cluster but demand for CPU cycles typically spikes around paper deadlines, so having the ability to instantly double or triple the size of your cluster is very nice indeed.

Amazon also have some hi-spec machines available. I recently ran into trouble where I needed about 10GB of RAM for a large learning job. Our cluster is 32-bit, so 4GB RAM is the limit. What might have been a serious headache was solved with a few hours and $10 on Amazon EC2.

The one limitation I’ve found is that disk access on EC2 is a shared resource, so bandwidth to disk tends to be about 10MB/s, as opposed to say 70MB/s on a local SATA hard drive. Disk bandwidth tends to be a major factor in running time for very big out-of-core learning jobs. Happily, Amazon very recently released a new service called Elastic Block Store which offers dedicated disks, though the pricing is a little hard to figure out.
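To get a feel for why that bandwidth gap matters, here is a rough back-of-the-envelope sketch in Python. The 10MB/s and 70MB/s figures are the ones quoted above; the 100GB dataset and the ten passes over it are made-up numbers purely for illustration, and the function name is hypothetical.

```python
def streaming_hours(dataset_gb: float, bandwidth_mb_per_s: float, passes: int = 1) -> float:
    """Hours spent just reading a dataset from disk `passes` times.

    Ignores seek time, caching and compute, so it is a lower bound
    on the running time of a disk-bound out-of-core job.
    """
    total_mb = dataset_gb * 1024 * passes
    return total_mb / bandwidth_mb_per_s / 3600


if __name__ == "__main__":
    # Hypothetical job: 100 GB of training data, streamed 10 times.
    for label, bandwidth in [("EC2 shared disk", 10.0), ("local SATA disk", 70.0)]:
        hours = streaming_hours(100, bandwidth, passes=10)
        print(f"{label}: {hours:.1f} hours of pure disk reading")
```

Whatever the dataset size, the ratio between the two cases is simply 70/10, so a job that is truly disk-bound would take about seven times longer on the shared disk.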

I should mention that for UK academics there is a free service called National Grid, though personally I’d rather work with Amazon.

Frankly, the possibilities opened up by EC2 just blow my mind. Every coder in a garage now potentially has access to Google-level computation. For tech startups this is a dream. More traditional companies are playing too. People have been talking about this idea for a long time, but it’s finally here, and it rocks!

Update: Amazon are keen to help their scientific users. Great!

An Insider’s Guide to BigDog

In common with half of YouTube, I was mesmerized by the BigDog videos from Boston Dynamics earlier in the year, though I couldn’t say much about how the robot worked. For everyone hungry for some more technical details, check out the talk by Marc Raibert at Carnegie Mellon’s Field Robotics 25 event. There’s some interesting discussion of the design of the system, where it’s headed, and more great video.

There are a bunch of other worthwhile talks from the event. I particularly enjoyed Hugh Durrant-Whyte’s description of building a fully automated container terminal “without a graduate student in 1000km”.

A Simple Thing Done Perfectly

I’ve been blown away by Dropbox. It’s such a simple thing – online storage easily shared between different computers. The concept is simple, but there are so many ways to do it wrong.  With Dropbox, the execution is pretty near perfect.

Amazon AWS has made scaling so easy that great little tools like this are suddenly popping up everywhere.

Hat tip: Andy Davison

SnapTell Explorer – First Impressions

I finally got a chance to try out SnapTell Explorer, and I have to say that I’m impressed. Almost all of the books and CDs I had lying around were correctly recognised, despite being pretty obscure titles. With 2.5 million objects in their index, SnapTell can recognise just about any book, CD, DVD or game. Once the title is recognised, you get back a result page like this with a brief review and the option to buy it on Amazon, or search Google, Yahoo or Wikipedia. For music, there is a link to iTunes.

I spent a while “teasing”  the search engine with badly taken photos, and the recognition is very robust. It has no problems with blur, rotation, oblique views, background clutter or partial occlusion of the object. Below is a real recognition example:

[Image: Match]

I did find the service pretty slow, despite having a WiFi connection. Searches took about five seconds. I had a similar experience with kooaba earlier. There are published visual search algorithms that would let these services be as responsive as Google, so I do wonder what’s going on here. It’s possible the speed issue is somewhere else in the process, or possibly they’re using brute-force descriptor comparison to ensure a high recognition rate. For a compelling experience, they desperately need to be faster.
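For anyone wondering what "brute-force descriptor comparison" would look like, here is a toy Python sketch. To be clear, this is my guess at the general shape of such a search, not anything SnapTell has published: the descriptors are tiny made-up vectors, and the function names and the distance threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Descriptor = Sequence[float]


def nearest_distance(query: Descriptor, candidates: List[Descriptor]) -> float:
    """Euclidean distance from `query` to its closest candidate descriptor."""
    return min((math.dist(query, c) for c in candidates), default=math.inf)


def brute_force_search(
    query_descs: List[Descriptor],
    database: Dict[str, List[Descriptor]],
    max_dist: float = 0.5,  # illustrative threshold, not a tuned value
) -> str:
    """Return the title whose descriptors match the most query descriptors.

    Every query descriptor is compared against every stored descriptor,
    so the cost grows linearly with the size of the whole database.
    """
    votes = {
        title: sum(1 for q in query_descs if nearest_distance(q, descs) <= max_dist)
        for title, descs in database.items()
    }
    return max(votes, key=votes.get)


if __name__ == "__main__":
    database = {
        "book_a": [(0.0, 0.0), (1.0, 1.0)],
        "book_b": [(5.0, 5.0), (6.0, 6.0)],
    }
    print(brute_force_search([(0.1, 0.0), (1.0, 1.1)], database))
```

The linear scan is what makes this approach accurate but slow: with millions of covers, each carrying many descriptors, a single query touches every one of them.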

While the recognition rate was generally excellent, I did manage to generate a few incorrect matches. One failure mode is where multiple titles have similar cover designs (think “X for Dummies”) – a picture of one title randomly returns one of the others. I saw a similar problem with a CD mismatching to another title because both featured the same record company logo. Another failure mode that might be more surprising to people who haven’t worked with these systems was mismatching on font. A few book searches returned completely unrelated titles which happened to use the same font on the cover. This happened particularly when the query image had a very plain cover, so there was no other texture to disambiguate it. The reason this can happen is that the search method relies on local shape information around sets of interest points, rather than attempting to recognise the book title as a whole by OCR.

My overall impression, however, is that this technology is very much ready for prime time. It’s easy to see visual search becoming the easiest and fastest way to get information about anything around you.

If you haven’t got an iPhone, you can try it by sending a picture to fun@snaptell.com.

SnapTell Explorer – Mobile Visual Search Heats Up

Well well. Hot on the heels of kooaba, competitor SnapTell just released an iPhone client for their visual search engine. A little sleuthing reveals that the index contains 2.5 million items – apparently most books, DVDs, CDs and game covers. If the recognition rate is as high as it should be, that’s a pretty impressive achievement. In principle the service was already available via email/mms. In practice, an iPhone client changes the experience completely. Image search becomes the fastest way to get information about anything around you.

I really think this technology is going to take off big-time in the near future. The marketing intelligentsia are aware of this too. There is an adoption challenge, but SnapTell in particular are already running an excellent high profile education/promotion campaign with print magazines. They’re not messing about either: The campaign is running in Rolling Stone, GQ, Men’s Health, ESPN, Wired and Martha Stewart Weddings. In short, publications that reach a substantial chunk of the reading public. Whether the message will carry over that the technology is good for more than signing up for free deodorant samples is something I’m a little skeptical about, but in the short term it’s a ready revenue stream for the startup, and a serious quantity of collateral publicity.

Usage report later when I can get hold of an iPhone.

Update: I just got to try it, and it’s really rather good. First impressions here.

“But I’m Not Lost!” – Adoption Challenges for Visual Search

I’m still rather excited about yesterday’s kooaba launch. I’ve been thinking about how long this technology will take to break into the mainstream, and it strikes me that getting people to adopt it is going to take some work.

When people first started using the internet, the idea of search engines didn’t need much promotion. People were very clearly lost, and needed some tool to find the interesting content. Adopting search engines was reactive, rather than active.

Visual search is not like that. If kooaba or others do succeed in building a tool that lets you snap a picture of any object or scene and get information, well, people may completely ignore it. They’re not lost – visual search is a useful extra, not a basic necessity. The technology may never reach usage levels seen by search engines. That said, it’s clearly very useful, and I can see it getting mass adoption. It’ll just need education and promotion. Shazam is a great example of a non-essential search engine that’s very useful and massively popular.

So, promotion, and lots of it. What’s the best way? Well, most of the different mobile visual search startups are currently running trial campaigns involving competitions and magazine ads (for example this SnapTell campaign). Revenue for the startups, plus free public education on how to use visual search. Not a bad deal, easy to see why all the companies are doing it. The only problem is that it may get the public thinking that visual search is only about cheap promotions, not useful for anything real. That would be terrible for long-term usage. I rather prefer kooaba’s demo based on movie posters – it reinforces a real use case, plus it’s got some potential for revenues too.

A Visual Search Engine for iPhone

Today kooaba released their iPhone client. It’s a visual search engine – you take a picture of something, and get search results. The YouTube clip below shows it in action.  Since this is the kind of thing I work on all day long, I’ve got a strong professional interest. Haven’t had a chance to actually try it yet, but I’ll post an update once I can nab a friend with an iPhone this afternoon to give it a test run.

[Embedded YouTube video]

At the moment it only recognises movie posters. Basically its current form is more of a technology demo than something really useful. Plans are to expand to recognise other things like books, DVDs, etc. I think there’s huge potential for this stuff. Snap a movie poster, see the trailer or get the soundtrack. Snap a book cover, see the reviews on Amazon. Snap an ad in a magazine, buy the product. Snap a restaurant, get reviews. Most of the real world becomes clickable. Everything is a link.

The technology is very scalable – The internals use an inverted index just like normal text search engines. In my own research I’m working with hundreds of thousands of images right now. It’s probably going to be possible to index a sizeable fraction of all the objects in the world –  literally take a picture of anything and get search results. The technology is certainly fast enough, though how the recognition rate will hold up with such large databases is currently unknown.
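Here is a minimal Python sketch of the inverted index idea. It assumes the usual "visual words" setup, where each local image feature has already been quantised to an integer ID, so an image looks like a bag of words to the index. That quantisation step is my assumption rather than anything kooaba has confirmed, and the class and method names are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class VisualWordIndex:
    """Toy inverted index: visual word ID -> set of images containing it."""

    def __init__(self) -> None:
        self._postings: Dict[int, Set[str]] = defaultdict(set)

    def add(self, image_id: str, words: Iterable[int]) -> None:
        """Index an image under each of its visual words."""
        for word in words:
            self._postings[word].add(image_id)

    def query(self, words: Iterable[int], top_k: int = 3) -> List[Tuple[str, int]]:
        """Rank indexed images by how many distinct query words they share."""
        votes: Counter = Counter()
        for word in set(words):
            # Only images that contain this word are ever touched.
            for image_id in self._postings.get(word, ()):
                votes[image_id] += 1
        return votes.most_common(top_k)


if __name__ == "__main__":
    index = VisualWordIndex()
    index.add("poster_a", [1, 2, 3, 4])
    index.add("poster_b", [3, 4, 5, 6])
    print(index.query([1, 2, 3, 9]))
```

The point of the structure is that a query only visits the posting lists for the words it actually contains, so the work done depends on how common those words are rather than on the total number of indexed images. A real system would add term weighting and some geometric verification on top of the raw vote counts.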

My only question is – where’s the buzz, and why has it taken them so long?

Update: I gave the app a spin today on a friend’s iPhone, and it basically works as advertised. It was rather slow though – maybe 5 seconds per search. I’m not sure if this was a network issue (though the iPhone had a WiFi connection), or maybe kooaba got more traffic today than they were expecting. The core algorithm is fast – easily less than 0.2 seconds (and even faster with the latest GPU-based feature detection).  I am sure the speed issue will be fixed soon. Recognition seemed fine, my friend’s first choice of movie was located no problem. A little internet sleuthing shows that they currently have 5363 movie posters in their database. Recognition shouldn’t be an issue until the database gets much larger.

Mobile Manipulation Made Easy

GetRobo has an interesting interview with Brian Gerkey of Willow Garage. Willow Garage are a strange outfit – a not-for-profit company developing open source robotic hardware and software, closely linked to Stanford. They’re funded privately by a dot com millionaire. They started with several projects including autonomous cars and autonomous boats, but now concentrate on building a new robot called PR2.

The key thing PR2 is designed to support is mobile manipulation. Basically research robots right now come in two varieties – sensors on wheels, that move about but can’t interact with anything, and fixed robotic arms, that manipulate objects but are rooted in place. A few research groups have built mobile manipulation systems where the robot can both move about and interact with objects, but the barrier to entry here is really high. There’s a massive amount of infrastructure you need to get a decent mobile manipulation platform going – navigation, obstacle avoidance, grasping, cross-calibration, etc. As a result, there are very very few researchers in this area. This is a terrible shame, because there are all sorts of interesting possibilities opened up by having a robot that can both move and interact. Willow Garage’s PR2 is designed to fill the gap – an off-the-shelf system that provides basic mobile manipulation capabilities.

Brian: We have a set of demos that we hope that the robot can do out of the box. So things like basic navigation around the environment so that it doesn’t run into things and basic motion planning with the arms, basic identifying which is looking at an object and picking it out from sitting on the table and picking it up and moving it somewhere. So the idea is that it should have some basic mobile manipulation capabilities so that the researcher who’s interested in object recognition doesn’t have to touch the arm part in order to make the object recognizer better. The arm part is not to say that it can be improved but good enough.

If they can pull this off it’ll be great for robotics research. All the pieces don’t have to be perfect, just enough so that say a computer vision group could start exploring interactive visual learning without having to worry too much about arm kinematics, or a manipulation group could experiment on a mobile platform without having to write a SLAM engine.

Another interesting part of the interview was the discussion of software standards. Brian is one of the lead authors of Player/Stage, the most popular robot OS. Player is popular, but very far from universal – there are nearly as many robot OSes as there are robot research groups (e.g. CARMEN, Orca, MRPT, MOOS, Orocos, CLARAty, MS Robotics Studio, etc, etc). It seems PR2 will have yet another OS, for which there are no apologies:

I think it’s probably still too early in robotics to come up with a standard. I don’t think we have enough deployed systems that do real work to have a meaningful standard. Most of the complex robots we have are in research labs. A research lab is the first place we throw away a standard. They’re building the next thing. So in robotics labs, a standard will be of not much use. They are much more useful when you get to the commercialization side to build interoperable piece. And at that point we may want to talk about standards and I think it’s still a little early. Right now I’m much more interested in getting a large user community and large developer community. I’m less interested in whether it’s blessed as a standard by a standard’s body.

Anyone working in robotics will recognise the truth of this. Very much a sensible attitude for the moment.

Big Data to the Rescue?

Peter Norvig of Google likes to say that for machine learning, you should “worry about the data before you worry about the algorithm”.

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data.

It’s a rallying cry taken up by many, and there’s a lot of truth to it.  Peter’s talk here has some nice examples (beginning at 4:30). The maxim about more data holds over several orders of magnitude. For some examples of the power of big-data-simple-algorithm for computer vision, check out the work of Alyosha Efros’ group at CMU.  This is all pretty convincing evidence that scale helps. The data tide lifts all boats.

What I find more interesting, though, is the fact that we already seem to have reached the limits of where data scale alone can take us. For example, as discussed in the talk, Google’s statistical machine translation system incorporates a language model consisting of length 7 N-grams trained from a 10^12 word dataset. This is an astonishingly large amount of data. To put that in perspective, a human will hear less than 10^9 words in an entire lifetime. It’s pretty clear that there must be huge gains to be made on the algorithmic side of the equation, and indeed some graphs in the talk show that, for machine translation at least, the performance gain from adding more data has already started to level off. The news from the frontiers of the Netflix Prize is the same – the top teams report that the Netflix dataset is so big that adding more data from sources like IMDB makes no difference at all! (Though this is more an indictment of ontologies than big data.)
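For readers who haven’t met N-gram language models, here is a toy Python sketch of the basic counting idea, plus the arithmetic behind the lifetime comparison. It is a plain maximum-likelihood estimate with no smoothing or back-off, so it is nowhere near what a production translation system would use, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterator, List, Sequence, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def train_counts(tokens: List[str], n: int) -> Tuple[Counter, Counter]:
    """Count n-grams, and how often each (n-1)-word context starts one."""
    ngram_counts = Counter(ngrams(tokens, n))
    context_counts: Counter = Counter()
    for gram, count in ngram_counts.items():
        context_counts[gram[:-1]] += count
    return ngram_counts, context_counts


def conditional_prob(
    word: str,
    context: Sequence[str],
    ngram_counts: Counter,
    context_counts: Counter,
) -> float:
    """Maximum-likelihood estimate of P(word | context); 0.0 if context unseen."""
    ctx = tuple(context)
    if context_counts[ctx] == 0:
        return 0.0
    return ngram_counts[ctx + (word,)] / context_counts[ctx]


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    # Bigrams (n=2) keep the toy example readable; the system described
    # above uses n=7.
    grams, contexts = train_counts(tokens, 2)
    print(conditional_prob("cat", ["the"], grams, contexts))

    # Scale comparison using the figures quoted above.
    corpus_words = 10**12
    lifetime_words = 10**9  # upper bound on words heard in a lifetime
    print(corpus_words // lifetime_words, "lifetimes of listening, at least")
```

With longer N-grams most contexts are never seen even in a huge corpus, which is one reason such models need so much data in the first place.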

So, the future, like the past, will be about the algorithms. The sudden explosion of available data has given us a significant bump in performance, but has already begun to reach its limits. There’s still lots of easy progress to be made as the ability to handle massive data spreads beyond mega-players like Google to more average research groups, but fundamentally we know where the limits of the approach lie. The hard problems won’t be solved just by lots of data and nearest neighbour search. For researchers this is great news – still lots of fun to be had!