“But I’m Not Lost!” – Adoption Challenges for Visual Search

I’m still rather excited about yesterday’s kooaba launch. I’ve been thinking about how long this technology will take to break into the mainstream, and it strikes me that getting people to adopt it is going to take some work.

When people first started using the internet, the idea of search engines didn’t need much promotion. People were very clearly lost, and needed some tool to find the interesting content. Adopting search engines was reactive, rather than active.

Visual search is not like that. If kooaba or others do succeed in building a tool that lets you snap a picture of any object or scene and get information, well, people may completely ignore it. They're not lost – visual search is a useful extra, not a basic necessity. The technology may never reach the usage levels of text search engines. That said, it's clearly very useful, and I can see it getting mass adoption – it'll just need education and promotion. Shazam is a great example of a non-essential search engine that's very useful and massively popular.

So, promotion, and lots of it. What's the best way? Well, most of the mobile visual search startups are currently running trial campaigns involving competitions and magazine ads (for example this SnapTell campaign). Revenue for the startups, plus free public education on how to use visual search – not a bad deal, and it's easy to see why all the companies are doing it. The only problem is that it may get the public thinking that visual search is only about cheap promotions, not useful for anything real. That would be terrible for long-term usage. I rather prefer kooaba's demo based on movie posters – it reinforces a real use case, plus it's got some potential for revenue too.

A Visual Search Engine for iPhone

Today kooaba released their iPhone client. It’s a visual search engine – you take a picture of something, and get search results. The YouTube clip below shows it in action.  Since this is the kind of thing I work on all day long, I’ve got a strong professional interest. Haven’t had a chance to actually try it yet, but I’ll post an update once I can nab a friend with an iPhone this afternoon to give it a test run.

[Embedded YouTube video showing the kooaba iPhone client in action]

At the moment it only recognises movie posters, so in its current form it's more of a technology demo than something really useful. The plan is to expand it to recognise other things like books, DVDs, etc. I think there's huge potential for this stuff. Snap a movie poster, see the trailer or get the soundtrack. Snap a book cover, see the reviews on Amazon. Snap an ad in a magazine, buy the product. Snap a restaurant, get reviews. Most of the real world becomes clickable. Everything is a link.

The technology is very scalable – the internals use an inverted index, just like normal text search engines. In my own research I'm working with hundreds of thousands of images right now. It's probably going to be possible to index a sizeable fraction of all the objects in the world – literally take a picture of anything and get search results. The technology is certainly fast enough, though how well the recognition rate will hold up with such large databases is currently unknown.
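To make the inverted-index comparison concrete, here's a minimal sketch in Python of how a visual-word index might work, assuming local descriptors have already been quantized into integer "visual word" IDs (the standard bag-of-visual-words setup). The class names and scoring are purely illustrative, not kooaba's actual implementation:

```python
# Toy visual-word inverted index: images are represented as bags of
# quantized descriptor IDs ("visual words"), and queries are scored by
# counting shared words, just like a text search engine's postings lists.
from collections import defaultdict, Counter

class VisualIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # visual word id -> image ids
        self.image_lengths = Counter()      # words per indexed image

    def add_image(self, image_id, visual_words):
        for w in set(visual_words):         # one posting per distinct word
            self.postings[w].append(image_id)
        self.image_lengths[image_id] = len(visual_words)

    def query(self, visual_words, top_k=5):
        votes = Counter()
        for w in set(visual_words):
            for image_id in self.postings.get(w, []):
                votes[image_id] += 1        # simple shared-word voting
        # normalise by image length so word-rich images don't dominate
        scored = {i: v / self.image_lengths[i] for i, v in votes.items()}
        return sorted(scored.items(), key=lambda x: -x[1])[:top_k]

# Usage: index a couple of "posters" (lists of visual word ids), then query.
index = VisualIndex()
index.add_image("poster_dark_knight", [3, 17, 17, 42, 99])
index.add_image("poster_wall_e", [5, 17, 63, 80])
print(index.query([3, 42, 99, 100]))
```

Real systems add tf-idf weighting and a geometric verification step on the top candidates, but the core lookup really is this simple, which is why it scales so well.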

My only question is – where’s the buzz, and why has it taken them so long?

Update: I gave the app a spin today on a friend's iPhone, and it basically works as advertised. It was rather slow though – maybe 5 seconds per search. I'm not sure if this was a network issue (though the iPhone had a WiFi connection), or maybe kooaba got more traffic today than they were expecting. The core algorithm is fast – easily less than 0.2 seconds (and even faster with the latest GPU-based feature detection) – so I'm sure the speed issue will be fixed soon. Recognition seemed fine: my friend's first choice of movie was found without a problem. A little internet sleuthing shows that they currently have 5363 movie posters in their database. Recognition shouldn't be an issue until the database gets much larger.

Mobile Manipulation Made Easy

GetRobo has an interesting interview with Brian Gerkey of Willow Garage. Willow Garage are a strange outfit – a not-for-profit company developing open source robotic hardware and software, closely linked to Stanford. They’re funded privately by a dot com millionaire. They started with several projects including autonomous cars and autonomous boats, but now concentrate on building a new robot called PR2.

The key thing the PR2 is designed to support is mobile manipulation. Basically, research robots right now come in two varieties – sensors on wheels, which move about but can't interact with anything, and fixed robotic arms, which manipulate objects but are rooted in place. A few research groups have built mobile manipulation systems where the robot can both move about and interact with objects, but the barrier to entry here is really high. There's a massive amount of infrastructure you need to get a decent mobile manipulation platform going – navigation, obstacle avoidance, grasping, cross-calibration, etc. As a result, there are very, very few researchers in this area. This is a terrible shame, because there are all sorts of interesting possibilities opened up by having a robot that can both move and interact. Willow Garage's PR2 is designed to fill the gap – an off-the-shelf system that provides basic mobile manipulation capabilities.

Brian: We have a set of demos that we hope the robot can do out of the box. So things like basic navigation around the environment so that it doesn't run into things, basic motion planning with the arms, and basic object identification – looking at an object sitting on the table, picking it up and moving it somewhere. So the idea is that it should have some basic mobile manipulation capabilities, so that a researcher who's interested in object recognition doesn't have to touch the arm part in order to make the object recognizer better. That's not to say the arm part can't be improved, but it's good enough.

If they can pull this off it'll be great for robotics research. The pieces don't all have to be perfect, just good enough that, say, a computer vision group could start exploring interactive visual learning without worrying too much about arm kinematics, or a manipulation group could experiment on a mobile platform without having to write a SLAM engine.

Another interesting part of the interview was the discussion of software standards. Brian is one of the lead authors of Player/Stage, the most popular robot OS. Player is popular, but very far from universal – there are nearly as many robot OSes as there are robot research groups (e.g. CARMEN, Orca, MRPT, MOOS, Orocos, CLARAty, MS Robotics Studio, etc, etc). It seems PR2 will have yet another OS, for which there are no apologies:

I think it's probably still too early in robotics to come up with a standard. I don't think we have enough deployed systems that do real work to have a meaningful standard. Most of the complex robots we have are in research labs, and a research lab is the first place to throw away a standard – they're building the next thing. So in robotics labs a standard won't be of much use. Standards are much more useful when you get to the commercialization side and need to build interoperable pieces, and at that point we may want to talk about standards, but I think it's still a little early. Right now I'm much more interested in getting a large user community and a large developer community. I'm less interested in whether it's blessed as a standard by a standards body.

Anyone working in robotics will recognise the truth of this. Very much a sensible attitude for the moment.

Big Data to the Rescue?

Peter Norvig of Google likes to say that for machine learning, you should “worry about the data before you worry about the algorithm”.

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data.

It's a rallying cry taken up by many, and there's a lot of truth to it. Peter's talk here has some nice examples (beginning at 4:30), and the maxim about more data holds over several orders of magnitude. For some examples of the power of big data plus simple algorithms in computer vision, check out the work of Alyosha Efros' group at CMU. This is all pretty convincing evidence that scale helps. The data tide lifts all boats.
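To see the kind of experiment behind the claim, here's a small sketch: train a deliberately simple model (k-nearest neighbours) on progressively larger slices of a dataset and watch how accuracy moves with data size. The dataset here is synthetic and the sizes are purely illustrative:

```python
# Learning-curve sketch: how a simple classifier improves as the training
# set grows. Synthetic data, illustrative only.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5_000, random_state=0)

for n in [100, 1_000, 10_000, 45_000]:        # growing training set sizes
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train[:n], y_train[:n])
    print(n, clf.score(X_test, y_test))       # accuracy vs. data size
```

The "more data beats cleverer algorithms" argument is essentially that curves like this keep climbing for simple methods long after the clever methods have plateaued on small data.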

What I find more interesting, though, is the fact that we already seem to have reached the limits of where data scale alone can take us. For example, as discussed in the talk, Google's statistical machine translation system incorporates a language model consisting of 7-grams trained on a 10^12-word dataset. This is an astonishingly large amount of data – to put it in perspective, a human will hear less than 10^9 words in an entire lifetime. It's pretty clear that there must be huge gains still to be made on the algorithmic side of the equation, and indeed some graphs in the talk show that, for machine translation at least, the performance gain from adding more data has already started to level off. The news from the frontiers of the Netflix Prize is the same – the top teams report that the Netflix dataset is so big that adding more data from sources like IMDB makes no difference at all! (Though this is more an indictment of ontologies than of big data.)
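Since the N-gram language model does a lot of work in that argument, here's a minimal, unsmoothed sketch of what such a count-based model looks like. The production systems add smoothing, backoff and distributed storage, and the toy corpus below is obviously illustrative:

```python
# Toy count-based N-gram language model: count (context, word) pairs and
# estimate P(word | context) by relative frequency. No smoothing.
from collections import defaultdict, Counter

def train_ngram_model(tokens, n=3):
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[context][word] += 1
    return counts

def probability(counts, context, word):
    seen = counts[tuple(context)]
    total = sum(seen.values())
    return seen[word] / total if total else 0.0   # unsmoothed estimate

corpus = "the cat sat on the mat the cat ate the fish".split()
model = train_ngram_model(corpus, n=3)
print(probability(model, ["the", "cat"], "sat"))  # 0.5 in this toy corpus
```

The interesting point is that nothing in this scheme gets smarter as the corpus grows – it just gets better coverage – which is exactly why the gains eventually level off.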

So, the future, like the past, will be about the algorithms. The sudden explosion of available data has given us a significant bump in performance, but has already begun to reach its limits. There’s still lots of easy progress to be made as the ability to handle massive data spreads beyond mega-players like Google to more average research groups, but fundamentally we know where the limits of the approach lie. The hard problems won’t be solved just by lots of data and nearest neighbour search. For researchers this is great news – still lots of fun to be had!