A Simple Thing Done Perfectly

I’ve been blown away by Dropbox. It’s such a simple thing – online storage easily shared between different computers. The concept is simple, but there are so many ways to do it wrong.  With Dropbox, the execution is pretty near perfect.

Amazon AWS has made scaling so easy that great little tools like this are suddenly popping up everywhere.

Hat tip: Andy Davison

Snaptell Explorer – First Impressions

I finally got a chance to try out SnapTell Explorer, and I have to say that I’m impressed. Almost all of the books and CDs I had lying around were correctly recognised, despite being pretty obscure titles. With 2.5 million objects in their index, SnapTell can recognise just about any book, CD, DVD or game. Once the title is recognised, you get back a result page like this with a brief review and the option to buy it on Amazon, or search Google, Yahoo or Wikipedia. For music, there is a link to iTunes.

I spent a while “teasing”  the search engine with badly taken photos, and the recognition is very robust. It has no problems with blur, rotation, oblique views, background clutter or partial occlusion of the object. Below is a real recognition example:

Match

I did find the service pretty slow, despite having a WiFi connection. Searches took about five seconds.  I had a similar experience with kooaba earlier.  There are published visual search algorithms that would let these services be as responsive as Google, so I do wonder what’s going on here. It’s possible the speed issue is somewhere else in the process, or possibly they’re using brute-force descriptor comparison to ensure high recognition rate. For a compelling experience, they desperately need to be faster.
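
To see why brute force wouldn’t cut it at this scale, here’s a rough back-of-envelope calculation of the cost of exhaustive descriptor comparison (all the per-image numbers are my own guesses for illustration, not SnapTell’s figures):

```python
# Rough cost of brute-force local descriptor matching at catalogue scale.
# All per-image figures below are assumptions for illustration only.
objects         = 2_500_000   # items reported in the SnapTell index
descs_per_item  = 500         # assumed local descriptors per catalogue image
descs_per_query = 300         # assumed descriptors in one phone photo
dims            = 128         # e.g. SIFT descriptor length

multiply_adds = descs_per_query * objects * descs_per_item * dims
print(f"{multiply_adds:.1e} multiply-adds per query")  # ~4.8e13 - hopeless

# An inverted-index / vocabulary-tree scheme instead quantises each query
# descriptor to a visual word and only touches that word's short posting list,
# which is how published systems answer queries in a fraction of a second.
```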

While the recognition rate was generally excellent, I did manage to generate a few incorrect matches. One failure mode is where multiple titles have similar cover designs (think “X for Dummies”) – a picture of one title randomly returns one of the others. I saw a similar problem with a CD mismatching to another title because both featured the same record company logo. Another failure mode that might be more surprising to people who haven’t worked with these systems was mismatching on font. A few book searches returned completely unrelated titles which happened to use the same font on the cover. This happened particularly when the query image had a very plain cover, so there was no other texture to disambiguate it. The reason this can happen is that the search method relies on local shape information around sets of interest points, rather than attempting to recognise the book title as a whole by OCR.
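
To make the font failure mode concrete, here is a minimal sketch of the kind of local-feature matching these engines are built on, using OpenCV’s ORB detector as a stand-in (SnapTell haven’t published which features they use). Letters in the same typeface produce near-identical local patches, so two unrelated plain covers can accumulate plenty of matches:

```python
import cv2

# Placeholder image paths - substitute any photo of a cover and a catalogue image.
query = cv2.imread("query_photo.jpg", cv2.IMREAD_GRAYSCALE)
cover = cv2.imread("catalogue_cover.jpg", cv2.IMREAD_GRAYSCALE)

# Detect interest points and compute local descriptors around each one.
orb = cv2.ORB_create(nfeatures=1000)
kp_q, des_q = orb.detectAndCompute(query, None)
kp_c, des_c = orb.detectAndCompute(cover, None)

# Match descriptors and keep only distinctive matches (Lowe's ratio test).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
pairs = matcher.knnMatch(des_q, des_c, k=2)
good = [m for m, n in (p for p in pairs if len(p) == 2)
        if m.distance < 0.75 * n.distance]

# Many geometrically consistent matches => "same object". Note that nothing here
# reads the title as text; identical fonts or logos on different covers generate
# matching local patches just as a genuinely identical cover would.
print(f"{len(good)} distinctive local matches")
```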

My overall impression, however, is that this technology is very much ready for prime time. It’s easy to see visual search becoming the easiest and fastest way to get information about anything around you.

If you haven’t got an iPhone, you can try it by sending a picture to [email protected].

SnapTell Explorer – Mobile Visual Search Heats Up

Well well. Hot on the heels of kooaba, competitor SnapTell just released an iPhone client for their visual search engine. A little sleuthing reveals that the index contains 2.5 million items – apparently most books, DVDs, CDs and game covers. If the recognition rate is as high as it should be, that’s a pretty impressive achievement. In principle the service was already available via email/MMS. In practice, an iPhone client changes the experience completely. Image search becomes the fastest way to get information about anything around you.

I really think this technology is going to take off big-time in the near future. The marketing intelligentsia are aware of this too. There is an adoption challenge, but SnapTell in particular are already running an excellent high-profile education/promotion campaign with print magazines. They’re not messing about either: the campaign is running in Rolling Stone, GQ, Men’s Health, ESPN, Wired and Martha Stewart Weddings. In short, publications that reach a substantial chunk of the reading public. I’m a little skeptical about whether the message will carry over that the technology is good for more than signing up for free deodorant samples, but in the short term it’s a ready revenue stream for the startup, and a serious quantity of collateral publicity.

Usage report later when I can get hold of an iPhone.

Update: I just got to try it, and it’s really rather good. First impressions here.

“But I’m Not Lost!” – Adoption Challenges for Visual Search

I’m still rather excited about yesterday’s kooaba launch. I’ve been thinking about how long this technology will take to break into the mainstream, and it strikes me that getting people to adopt it is going to take some work.

When people first started using the internet, the idea of search engines didn’t need much promotion. People were very clearly lost, and needed some tool to find the interesting content. Adopting search engines was reactive, rather than active.

Visual search is not like that. If kooaba or others do succeed in building a tool that lets you snap a picture of any object or scene and get information, well, people may completely ignore it. They’re not lost – visual search is a useful extra, not a basic necessity. The technology may never reach the usage levels seen by search engines. That said, it’s clearly very useful, and I can see it getting mass adoption. It’ll just need education and promotion. Shazam is a great example of a non-essential search engine that’s very useful and massively popular.

So, promotion, and lots of it. What’s the best way? Well, most of the mobile visual search startups are currently running trial campaigns involving competitions and magazine ads (for example this SnapTell campaign). Revenue for the startups, plus free public education on how to use visual search – not a bad deal, and easy to see why all the companies are doing it. The only problem is that it may get the public thinking that visual search is only about cheap promotions, not useful for anything real. That would be terrible for long-term usage. I rather prefer kooaba’s demo based on movie posters – it reinforces a real use case, plus it’s got some potential for revenue too.

A Visual Search Engine for iPhone

Today kooaba released their iPhone client. It’s a visual search engine – you take a picture of something, and get search results. The YouTube clip below shows it in action.  Since this is the kind of thing I work on all day long, I’ve got a strong professional interest. Haven’t had a chance to actually try it yet, but I’ll post an update once I can nab a friend with an iPhone this afternoon to give it a test run.

[Embedded YouTube video: kooaba iPhone client demo]

At the moment it only recognises movie posters. Basically, in its current form it’s more of a technology demo than something really useful. Plans are to expand to recognise other things like books, DVDs, etc. I think there’s huge potential for this stuff. Snap a movie poster, see the trailer or get the soundtrack. Snap a book cover, see the reviews on Amazon. Snap an ad in a magazine, buy the product. Snap a restaurant, get reviews. Most of the real world becomes clickable. Everything is a link.

The technology is very scalable – the internals use an inverted index, just like normal text search engines. In my own research I’m working with hundreds of thousands of images right now. It’s probably going to be possible to index a sizeable fraction of all the objects in the world – literally take a picture of anything and get search results. The technology is certainly fast enough, though how the recognition rate will hold up with such large databases is currently unknown.
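
For anyone wondering what an inverted index looks like in this context: each database image is reduced to a bag of quantised descriptor IDs (“visual words”), and the index maps each word ID to the images containing it, exactly as a text engine maps terms to documents. A toy sketch of the general scheme (kooaba’s actual internals aren’t public):

```python
from collections import defaultdict

# Toy inverted index over "visual words" (quantised local descriptor IDs).
index = defaultdict(set)          # word ID -> set of image IDs

def add_image(image_id, visual_words):
    for w in set(visual_words):
        index[w].add(image_id)

def query(visual_words):
    # Simple vote counting: images sharing many words with the query rank highest.
    # Real systems add tf-idf weighting and a geometric verification step.
    votes = defaultdict(int)
    for w in set(visual_words):
        for image_id in index[w]:
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# Usage with made-up word IDs:
add_image("poster_a", [3, 17, 17, 42, 99])
add_image("poster_b", [5, 17, 63, 99, 101])
print(query([3, 42, 99, 200]))    # poster_a ranked first with 3 votes
```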

My only question is – where’s the buzz, and why has it taken them so long?

Update: I gave the app a spin today on a friend’s iPhone, and it basically works as advertised. It was rather slow though – maybe 5 seconds per search. I’m not sure if this was a network issue (though the iPhone had a WiFi connection), or maybe kooaba got more traffic today than they were expecting. The core algorithm is fast – easily less than 0.2 seconds (and even faster with the latest GPU-based feature detection).  I am sure the speed issue will be fixed soon. Recognition seemed fine, my friend’s first choice of movie was located no problem. A little internet sleuthing shows that they currently have 5363 movie posters in their database. Recognition shouldn’t be an issue until the database gets much larger.

Mobile Manipulation Made Easy

GetRobo has an interesting interview with Brian Gerkey of Willow Garage. Willow Garage are a strange outfit – a not-for-profit company developing open source robotic hardware and software, closely linked to Stanford. They’re funded privately by a dot com millionaire. They started with several projects including autonomous cars and autonomous boats, but now concentrate on building a new robot called PR2.

The key thing PR2 is designed to support is mobile manipulation. Basically, research robots right now come in two varieties – sensors on wheels, which move about but can’t interact with anything, and fixed robotic arms, which manipulate objects but are rooted in place. A few research groups have built mobile manipulation systems where the robot can both move about and interact with objects, but the barrier to entry here is really high. There’s a massive amount of infrastructure you need to get a decent mobile manipulation platform going – navigation, obstacle avoidance, grasping, cross-calibration, etc. As a result, there are very, very few researchers in this area. This is a terrible shame, because there are all sorts of interesting possibilities opened up by having a robot that can both move and interact. Willow Garage’s PR2 is designed to fill the gap – an off-the-shelf system that provides basic mobile manipulation capabilities.

Brian: We have a set of demos that we hope that the robot can do out of the box. So things like basic navigation around the environment so that it doesn’t run into things and basic motion planning with the arms, basic identifying which is looking at an object and picking it out from sitting on the table and picking it up and moving it somewhere. So the idea is that it should have some basic mobile manipulation capabilities so that the researcher who’s interested in object recognition doesn’t have to touch the arm part in order to make the object recognizer better. The arm part is not to say that it can be improved but good enough.

If they can pull this off it’ll be great for robotics research. The pieces don’t all have to be perfect, just good enough that, say, a computer vision group could start exploring interactive visual learning without worrying too much about arm kinematics, or a manipulation group could experiment on a mobile platform without having to write a SLAM engine.

Another interesting part of the interview was the discussion of software standards. Brian is one of the lead authors of Player/Stage, the most popular robot OS. Player is popular, but very far from universal – there are nearly as many robot OSes as there are robot research groups (e.g. CARMEN, Orca, MRPT, MOOS, Orocos, CLARAty, MS Robotics Studio, etc, etc). It seems PR2 will have yet another OS, for which there are no apologies:

I think it’s probably still too early in robotics to come up with a standard. I don’t think we have enough deployed systems that do real work to have a meaningful standard. Most of the complex robots we have are in research labs. A research lab is the first place we throw away a standard. They’re building the next thing. So in robotics labs, a standard will be of not much use. They are much more useful when you get to the commercialization side to build interoperable piece. And at that point we may want to talk about standards and I think it’s still a little early. Right now I’m much more interested in getting a large user community and large developer community. I’m less interested in whether it’s blessed as a standard by a standard’s body.

Anyone working in robotics will recognise the truth of this. Very much a sensible attitude for the moment.

Big Data to the Rescue?

Peter Norvig of Google likes to say that for machine learning, you should “worry about the data before you worry about the algorithm”.

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data.

It’s a rallying cry taken up by many, and there’s a lot of truth to it.  Peter’s talk here has some nice examples (beginning at 4:30). The maxim about more data holds over several orders of magnitude. For some examples of the power of big-data-simple-algorithm for computer vision, check out the work of Alyosha Efros’ group at CMU.  This is all pretty convincing evidence that scale helps. The data tide lifts all boats.

What I find more interesting, though, is the fact that we already seem to have reached the limits of where data scale alone can take us. For example, as discussed in the talk, Google’s statistical machine translation system incorporates a language model consisting of length 7 N-grams trained from a 10^12 word dataset. This is an astonishingly large amount of data. To put that in perspective, a human will hear less than 10^9 words in an entire lifetime. It’s pretty clear that there must be huge gains to be made on the algorithmic side of the equation, and indeed some graphs in the talk show that, for machine translation at least, the performance gain from adding more data has already started to level off. The news from the frontiers of the Netflix Prize is the same – the top teams report that the Netflix dataset is so big that adding more data from sources like IMDB makes no difference at all! (Though this is more an indictment of ontologies than big data.)
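
For readers who haven’t met N-gram language models: the model is just conditional counts of the next word given the previous N-1 words. A toy sketch with trigrams and an eleven-word corpus, rather than 7-grams and 10^12 words, to show how little machinery is involved:

```python
from collections import Counter, defaultdict

# Toy N-gram language model: count how often each word follows each context.
N = 3
corpus = "the cat sat on the mat the cat ate the fish".split()

context_counts = defaultdict(Counter)
for i in range(len(corpus) - N + 1):
    context = tuple(corpus[i:i + N - 1])      # previous N-1 words
    context_counts[context][corpus[i + N - 1]] += 1

def next_word_probs(context):
    counts = context_counts[tuple(context)]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs(["the", "cat"]))        # {'sat': 0.5, 'ate': 0.5}
# Scaling this to 7-grams over 10^12 words is an engineering feat, but the
# statistical machinery stays this simple - which is the point made above.
```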

So, the future, like the past, will be about the algorithms. The sudden explosion of available data has given us a significant bump in performance, but has already begun to reach its limits. There’s still lots of easy progress to be made as the ability to handle massive data spreads beyond mega-players like Google to more average research groups, but fundamentally we know where the limits of the approach lie. The hard problems won’t be solved just by lots of data and nearest neighbour search. For researchers this is great news – still lots of fun to be had!

Google Street View – Soon in 3D?

Some Google Street View cars were spotted in Italy this morning. Anyone who works in robotics will immediately notice the SICK laser scanners. It looks like we can expect 3D city data from Google sometime soon. Very interesting!

Street View car spotted in Rome

More pictures of the car here, here and here.

The cars have two side-facing vertical scanners, and another forward-facing horizontal scanner. Presumably they will do scan matching with the horizontal laser, and use that to align the data from the side-facing lasers to get some 3D point clouds. Typical output will look like this (video shows data collected from a similar system built by one of my labmates.)
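
Here’s a toy sketch of how that assembly could work, assuming scan matching on the horizontal laser provides a 2D vehicle pose (x, y, heading) for each side-facing scan; the details of Google’s actual pipeline are of course unknown:

```python
import numpy as np

def vertical_scan_to_points(ranges, angles):
    # A side-facing vertical scan in the sensor frame: points lie in the
    # sideways (y) / up (z) plane at the vehicle's current position.
    y = ranges * np.cos(angles)
    z = ranges * np.sin(angles)
    return np.stack([np.zeros_like(y), y, z], axis=1)

def to_world(points, pose):
    # Place a scan into the world frame using the 2D pose from scan matching.
    x, y, heading = pose
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])            # rotation about the vertical axis
    return points @ R.T + np.array([x, y, 0.0])

# Fake data: the car drives straight down the x-axis past a wall 5 m to its side.
angles = np.linspace(0.1, 1.4, 50)             # elevation angles of one scan
ranges = 5.0 / np.cos(angles)                  # ray lengths hitting a vertical wall
poses = [(d, 0.0, 0.0) for d in np.arange(0.0, 10.0, 0.5)]

cloud = np.vstack([to_world(vertical_scan_to_points(ranges, angles), p)
                   for p in poses])
print(cloud.shape)                             # (1000, 3): the wall swept out in 3D
```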

The other sensors on the pole seem to have been changed too. Gone are the Ladybug2 omnidirectional cameras used on the American and Australian vehicles, replaced by what looks like a custom camera array. This photo also shows a third sensor, which I can’t identify.

So, what is Google doing with 3D laser data? The obvious application is 3D reconstruction for Google Earth. Their current efforts to do this involve user-generated 3D models from Sketchup. They have quite a lot of contributed models, but there is only so far you can get with an approach like that. With an automated solution, they could go for blanket 3D coverage. For an idea of what the final output might look like, have a look at the work of Frueh and Zakhor at Berkeley. They combined aerial and ground-based laser with photo data to create full 3D city models. I am not sure Google will go to quite this length, but it certainly looks like they’ve made a start on collecting the street-level data. Valleywag claims Google are hiring 300 drivers for their European data gathering efforts, so they will soon be swimming in laser data.

Frueh and Zakhor 3D city model

 

Google aren’t alone in their 3D mapping efforts. Startup Earthmine has been working on this for a while, using a stereo-vision based approach (check out their slick video demonstrating the system). I also recently built a street-view car myself, to gather data for my PhD research. One way or another, it looks like online maps are headed to a new level in the near future.

Update:  Loads more sightings of these cars, all over the world. San Francisco, Oxford, all over Spain. Looks like this is a full-scale data gathering effort, rather than a small test project.

Clever Feet

Check out this great TED talk by UC Berkeley biologist Robert Full. His subject is feet – or rather, all the clever ways animals have evolved to turn leg power into forward motion.
It’s a short, fun talk, and rather nicely makes the point that the secret to success for many of nature’s creations resides not in sensing or intelligence, but in good mechanical design. The nice thing about this is that nature’s mechanical innovations are much easier to duplicate than her neurological ones. The talk ends with examples of robotic applications, such as Boston Dynamics’ cockroach-inspired RHex and Stanford’s gecko-inspired climbing robots.

Hat tip: Milan