Friday, 31 October 2008

A picture of a thousand words?

(Note: Click on the first result in each of the search results pages linked to throughout the post to see this feature in action.)

A scanner is a wonderful tool. Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text itself. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world.

In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform Optical Character Recognition (OCR) on any scanned documents that we find stored in Adobe's PDF format. OCR technology lets us convert a picture (of a thousand words) into a thousand words -- words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world's information accessible and useful.

While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however -- it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.

To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.
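
To make the circle question concrete, here is a toy sketch in Python. It is an illustration only and does not describe how the system in this post works: the function names, the placeholder character, and the neighbor-counting rule are all assumptions made up for the example. It guesses whether a round shape is a zero or the letter 'O' by looking at the characters on either side, and gives up when there is no context.

```python
AMBIGUOUS = "\u25cb"  # hypothetical placeholder for "some round shape" in recognizer output


def resolve_round_glyph(before: str, after: str) -> str:
    """Guess '0' or 'O' for a round shape from its immediate neighbors.

    Assumed heuristic: digits nearby suggest a zero, letters nearby
    suggest the letter 'O'. Returns '?' when the context is inconclusive.
    """
    neighbors = [c for c in (before, after) if c and not c.isspace()]
    digits = sum(c.isdigit() for c in neighbors)
    letters = sum(c.isalpha() for c in neighbors)
    if digits > letters:
        return "0"
    if letters > digits:
        return "O"
    return "?"  # could be a plain circle, or the ring from a coffee cup


def resolve_line(text: str) -> str:
    """Replace every ambiguous round shape in a line, left to right."""
    chars = list(text)
    for i, c in enumerate(chars):
        if c == AMBIGUOUS:
            before = chars[i - 1] if i > 0 else ""
            after = chars[i + 1] if i + 1 < len(chars) else ""
            chars[i] = resolve_round_glyph(before, after)
    return "".join(chars)


if __name__ == "__main__":
    for sample in ("2\u25cb\u25cb8", "\u25cbctober", " \u25cb "):
        print(repr(sample), "->", repr(resolve_line(sample)))
```

Even this tiny rule shows why the process is error-prone: a shape with a digit on one side and a letter on the other, or with nothing around it at all, stays unresolved.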

To see our new system at work, click on these search queries. Note the document excerpt in the search results, along with the full text presented after the 'View as HTML' link:

[repairing aluminum wiring]
[spin lock performance]
[Mumps and Severe Neutropenia]
[Steady success in a volatile world]