Monday 22 June 2009

A new landmark in computer vision

Science fiction books and movies have long imagined that computers will someday be able to see and interpret the world. At Google, we think computer vision has tremendous potential benefits for consumers, which is why we're dedicated to research in this area. And today, a Google team is presenting a paper on landmark recognition (think: Statue of Liberty, Eiffel Tower) at the Computer Vision and Pattern Recognition (CVPR) conference in Miami, Florida. In the paper, we present a new technology that enables computers to quickly and efficiently identify images of more than 50,000 landmarks from all over the world with 80% accuracy.

To be clear up front, this is a research paper, not a new Google product, but we still think it's cool. For our demonstration, we begin with an unnamed, untagged picture of a landmark, enter its web address into the recognition engine, and poof — the computer identifies and names it: "Recognized Landmark: Acropolis, Athens, Greece." Thanks, computer.

How did we do it? It wasn't easy. For starters, where do you find a good list of thousands of landmarks? Even if you have that list, where do you get the pictures to develop visual representations of the locations? And how do you pull that source material together in a coherent model that actually works, is fast, and can process an enormous corpus of data? Think about all the different photographs of the Golden Gate Bridge you've seen — the different perspectives, lighting conditions and image qualities. Recognizing a landmark can be difficult for a human, let alone a computer.

Our research builds on the vast number of images on the web, the ability to search those images, and advances in object recognition and clustering techniques. First, we generated a list of landmarks relying on two sources: 40 million GPS-tagged photos (from Picasa and Panoramio) and online tour guide webpages. Next, we found candidate images for each landmark using these sources and Google Image Search, which we then "pruned" using efficient image matching and unsupervised clustering techniques. Finally, we developed a highly efficient indexing system for fast image recognition. The following image provides a visual representation of the resulting clustered recognition model:

[Image: the clustered recognition model, showing grouped views of the Acropolis]
In the above image, related views of the Acropolis are "clustered" together, allowing for a more efficient image matching system.
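
For readers who want a concrete feel for that pruning step, here is a minimal sketch in Python, assuming OpenCV with SIFT support and a recent scikit-learn are available. It matches candidate photos of a single landmark by local image features, then groups mutually matching photos with unsupervised clustering; photos that match nothing land in singleton clusters and can be discarded. The feature type, thresholds, and function names are illustrative assumptions, not the system described in the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def local_features(path, max_size=640):
    """Detect SIFT keypoints on a downscaled grayscale image and return descriptors."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return None
    scale = max_size / max(img.shape)
    if scale < 1.0:
        img = cv2.resize(img, None, fx=scale, fy=scale)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc


def match_count(desc_a, desc_b, ratio=0.75):
    """Count descriptor matches between two photos that pass Lowe's ratio test."""
    if desc_a is None or desc_b is None or len(desc_a) < 2 or len(desc_b) < 2:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = 0
    for pair in matcher.knnMatch(desc_a, desc_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good += 1
    return good


def cluster_candidates(paths, min_matches=20):
    """Group candidate photos of one landmark into visual clusters.

    Photos that don't match anything end up as singletons, which can be pruned.
    """
    descs = [local_features(p) for p in paths]
    n = len(paths)
    # Binary "distance": 0 if two photos share enough feature matches, 1 otherwise.
    dist = np.ones((n, n))
    np.fill_diagonal(dist, 0.0)
    for i in range(n):
        for j in range(i + 1, n):
            if match_count(descs[i], descs[j]) >= min_matches:
                dist[i, j] = dist[j, i] = 0.0
    # Single-linkage clustering over the precomputed distances groups photos that
    # are connected through chains of matching views (different angles, lighting).
    labels = AgglomerativeClustering(
        n_clusters=None, metric="precomputed",
        linkage="single", distance_threshold=0.5,
    ).fit_predict(dist)
    return labels
```

Pairwise matching like this grows quadratically with the number of photos, which is exactly why the highly efficient indexing system mentioned above matters: at the scale of tens of millions of images, a query has to be compared against only a small set of likely candidates rather than the whole corpus.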

While we've gone a long way towards unlocking the information stored in text on the web, there's still much work to be done unlocking the information stored in pixels. This research demonstrates the feasibility of efficient computer vision techniques based on large, noisy datasets. We expect the insights we've gained will lay a useful foundation for future research in computer vision.

If you're interested in learning more about this research, check out the paper.