I finally got a chance to try out SnapTell Explorer, and I have to say that I’m impressed. Almost all of the books and CDs I had lying around were correctly recognised, despite being pretty obscure titles. With 2.5 million objects in their index, SnapTell can recognise just about any book, CD, DVD or game. Once the title is recognised, you get back a result page like this with a brief review and the option to buy it on Amazon, or to search Google, Yahoo or Wikipedia. For music, there is a link to iTunes.
I spent a while “teasing” the search engine with badly taken photos, and the recognition is very robust. It has no problems with blur, rotation, oblique views, background clutter or partial occlusion of the object. Below is a real recognition example:
I did find the service pretty slow, despite having a WiFi connection. Searches took about five seconds. I had a similar experience with kooaba earlier. There are published visual search algorithms that would let these services be as responsive as Google, so I do wonder what’s going on here. It’s possible the speed issue is somewhere else in the process, or possibly they’re using brute-force descriptor comparison to ensure high recognition rate. For a compelling experience, they desperately need to be faster.
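To make the latency trade-off concrete, here is a toy sketch (synthetic data, nothing from SnapTell’s actual system) of the difference between brute-force descriptor comparison and the kind of inverted-index lookup those published algorithms use: quantising descriptors to “visual words” means each query is compared against only a small fraction of the database.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database: 10,000 32-D descriptors; the query is a noisy copy of entry 42.
db = rng.standard_normal((10_000, 32)).astype(np.float32)
query = db[42] + 0.01 * rng.standard_normal(32).astype(np.float32)

# Brute force: one distance computation per database descriptor (O(N) per query).
best_brute = int(np.argmin(np.linalg.norm(db - query, axis=1)))

# Inverted index: quantise every descriptor to its nearest "visual word",
# then compare the query only against descriptors that share its word.
vocab = db[rng.choice(len(db), 64, replace=False)]  # crude 64-word codebook

def nearest_word(v):
    return int(np.argmin(np.linalg.norm(vocab - v, axis=1)))

words = np.array([nearest_word(v) for v in db])
candidates = np.flatnonzero(words == nearest_word(query))  # roughly N/64 of them
best_indexed = int(candidates[np.argmin(
    np.linalg.norm(db[candidates] - query, axis=1))])
```

Both searches find descriptor 42, but the indexed one touches only a small candidate set; real systems layer hierarchical vocabularies and inverted files on this idea to stay fast at millions of objects.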
While the recognition rate was generally excellent, I did manage to generate a few incorrect matches. One failure mode is where multiple titles share a similar cover design (think “X for Dummies”) – a picture of one title randomly returns one of the others. I saw a similar problem where a CD was matched to another title because both featured the same record company logo. Another failure mode, one that might be more surprising to people who haven’t worked with these systems, was mismatching on font. A few book searches returned completely unrelated titles which happened to use the same font on the cover. This happened particularly when the query image had a very plain cover, so there was no other texture to disambiguate it. This can happen because the search method relies on local shape information around sets of interest points, rather than attempting to recognise the book title as a whole by OCR.
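To see why shared lettering can dominate a match, here is a toy sketch (synthetic 8-D “patch descriptors”, not real image features): two unrelated covers printed in the same font contribute near-identical local descriptors, so a query with a plain cover splits its votes between them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two unrelated titles printed in the same font: they share letterform
# patches and differ only in a little cover-specific texture.
font_patches = rng.standard_normal((20, 8))
cover_a = np.vstack([font_patches + 0.05 * rng.standard_normal((20, 8)),
                     rng.standard_normal((5, 8))])   # title A's own texture
cover_b = np.vstack([font_patches + 0.05 * rng.standard_normal((20, 8)),
                     rng.standard_normal((5, 8))])   # title B's own texture

# Query: a plain cover of title A - mostly lettering, little other texture.
query = font_patches + 0.05 * rng.standard_normal((20, 8))

# Each query patch votes for whichever cover contains its nearest patch.
def vote(query_patches, covers):
    tally = [0] * len(covers)
    for patch in query_patches:
        dists = [np.linalg.norm(c - patch, axis=1).min() for c in covers]
        tally[int(np.argmin(dists))] += 1
    return tally

tally = vote(query, [cover_a, cover_b])  # votes split between the two titles
```

Because the shared glyph patches are equally close to both covers, the vote is essentially a coin flip per patch, which is exactly the randomly-returns-one-of-the-others behaviour described above.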
My overall impression, however, is that this technology is very much ready for prime time. It’s easy to see visual search becoming the easiest and fastest way to get information about anything around you.
If you haven’t got an iPhone, you can try it by sending a picture to [email protected].
Not sure where you got the 5.5 million number. It’s currently 2.5 million and increasing.
We do a lot of things to improve accuracy. The iPhone has one of the relatively better cameras for a phone; many of them are a lot worse, and the goal was to make it work with virtually all of them. Blur is one of the really bad things that can happen. We have to achieve very high precision at 1 – most researchers look at precision at 3 or 5, but on a mobile phone there isn’t enough screen space to display multiple results. At the accuracies we are operating at, a 1% gain in accuracy is a big win (a 1% difference translates to 10 queries out of 1000) and is usually paid for in terms of latency.
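The precision-at-k distinction is easy to state in code. A quick illustration with made-up rankings (nothing here reflects SnapTell’s actual results): a query counts as correct if the true title appears in the top k, and with one screen slot only k = 1 matters.

```python
# Precision at k: a query counts as correct if the right title appears
# among the top-k ranked results.
def precision_at_k(results, truth, k):
    hits = sum(1 for ranked, t in zip(results, truth) if t in ranked[:k])
    return hits / len(truth)

# Hypothetical ranked results for four queries; ground truth alongside.
results = [["a", "x"], ["y", "b"], ["c", "z"], ["w", "v"]]
truth = ["a", "b", "c", "d"]

print(precision_at_k(results, truth, 1))  # 0.5  - only "a" and "c" rank first
print(precision_at_k(results, truth, 3))  # 0.75 - "b" is recovered at rank 2
```

The gap between the two numbers is exactly what a phone screen cannot hide: a correct answer at rank 2 looks fine at precision at 3 but is a failure at precision at 1.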
Latency: most of the latency occurs outside our server, and it’s not clear why it’s happening.
In terms of recognition, the Dummies problem you mention is a difficult issue. A substantial portion of many of these covers is identical, and in the case of the Dummies series there are about 2000 covers. The problem is complicated by the fact that there are variations where the differences are even smaller – e.g. a Blu-ray disc vs a regular DVD, where the difference is just a small piece of text. The system doesn’t know whether the query showed the portion of the cover that was common or the other portion. We are working on this and hope to resolve it in the future, at least for the Dummies case.
The font problem – some of this will hopefully go away in a future
Thanks, I corrected the index size figure.
Your precision concerns sound familiar – I work on this technology for place recognition in mobile robotics, and one wrong match can irrecoverably corrupt the robot’s map, so we’re also very focused on getting near-100% precision (though we’re less concerned about recall).
The font problem, I think, isn’t too severe – one decent solution is to detect text patches and use a less local descriptor for those image regions.
The “Dummies” issue is certainly harder. I face a related problem in place recognition, in that many places share repeating features such as brick walls. (See here). Our solution is to learn a generative model of the image data, so that we know which parts of the image are distinctive and which are common. We also learn correlations between sets of features. I’ve found it to give excellent results, although the largest index I’ve worked with so far is about 100k images.
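The generative model does more than this, but the underlying intuition – features that appear in many images carry little information about which image you are looking at – can be sketched with simple inverse-document-frequency weights. All names and counts below are made up for illustration.

```python
import math

n_images = 100_000  # size of a hypothetical place-recognition index

# How many indexed images contain each visual word. Repeating features
# like brick texture appear almost everywhere; a neon sign is rare.
doc_freq = {"brick": 95_000, "window": 40_000, "neon_sign": 120}

def idf(word):
    # Inverse document frequency: near zero for ubiquitous features,
    # large for distinctive ones.
    return math.log(n_images / doc_freq[word])

weights = {w: round(idf(w), 3) for w in doc_freq}
```

With weights like these, twenty brick-wall matches contribute less to a score than a single neon-sign match, which is the behaviour you want when many places share the same texture; learning correlations between features goes a step further than these independent per-word weights.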
The hardest case – distinguishing a Blu-ray disc from a regular DVD – is really tough; I think it’s probably beyond the abilities of current methods.
Thanks for the pointers to your work.