Monday, May 19, 2008

So why can't Google provide HTML for 40% of PDFs?

A Google search for opensource filetype:pdf returns the standard 10 results on the front page, but only 6 of them offer a "View as HTML" link. Is it just me, or has the missing link become more common recently? And what property do the other 4 PDFs share that causes this behaviour?

If anyone has any clues or ideas I would love to hear them.

1 comment:

Chris Bogart said...

Some PDFs have their text stored as an image, so the text isn't available for searching. I don't know what proportion of PDFs are like that, but someone concerned about copyright or plagiarism might produce them that way.
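
One rough way to test this idea on a downloaded file is to check whether the PDF mentions any fonts at all. The sketch below is only a heuristic and rests on assumptions that are not from the post: that a text-bearing PDF names a /Font resource somewhere in its uncompressed bytes, and that a scanned page shows up as an image object (/Subtype /Image). The function name looks_image_only is made up for illustration. It will miss PDFs that pack their dictionaries into compressed object streams, and it will report scans that carry a hidden OCR text layer as having text.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic guess: True if the raw bytes mention images but no fonts.

    Assumption: text drawn on a page needs a /Font resource, while a
    scanned page is stored as an image XObject (/Subtype /Image).
    This inspects raw bytes only and does not parse the PDF.
    """
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", pdf_bytes) is not None
    return has_image and not has_font


# Hypothetical usage on a file saved from one of the search results:
# with open("result.pdf", "rb") as f:
#     print(looks_image_only(f.read()))
```

Running this over the 10 results would at least show whether the 4 without a "View as HTML" link are the ones with no fonts.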