Google will index our PDF Course Catalog: FAIL

May 07 2008

We use Google to power our university search, and we assumed that our important PDF documents were being indexed. That held until yesterday, when my illustrious colleague alerted the web team that most of our PDF Course Catalogs weren’t, in fact, being noticed. More like ignored. In fact, Google was only aware of ONE of our catalogs.

Continuing from his email:

All the other (catalogs), while linked to from the Academics pages and therefore scannable by Google, don’t seem to be indexed by Google like the illustrious 2003-2004 catalog.

I found this on Google webmaster forums from 2006:

I suspect that Google simply won’t index documents of that size.  Traditionally, their recommended limit on HTML documents has been 100K, although they’ve certainly relaxed that in recent years to more than twice that size.  But I don’t think you can expect multi-megabyte files to be indexed.

So I checked our file sizes:

2002-2003 catalog: 7.907 MB
2003-2004 catalog: 2.387 MB
2004-2005 catalog: 4.361 MB
2005-2006 catalog: 12.535 MB
2006-2007 catalog: 4.634 MB

Looks like if a PDF file is larger than ~3 MB, it won’t be indexed by Google for searches.
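
To see how much those five data points actually pin down, here is a minimal Python sketch. It only uses the sizes listed above and the observation that the 2003-2004 catalog was the one Google knew about; the function and variable names are mine, and five files is a small sample, so treat the result as a rough bracket rather than a rule.

```python
# Catalog sizes in MB, copied from the list above.
catalog_sizes_mb = {
    "2002-2003": 7.907,
    "2003-2004": 2.387,
    "2004-2005": 4.361,
    "2005-2006": 12.535,
    "2006-2007": 4.634,
}

# Per the email, only the 2003-2004 catalog showed up in Google.
indexed = {"2003-2004"}


def cutoff_bounds(sizes, indexed_names):
    """Return (largest indexed size, smallest non-indexed size) in MB."""
    largest_indexed = max(s for n, s in sizes.items() if n in indexed_names)
    smallest_missing = min(s for n, s in sizes.items() if n not in indexed_names)
    return largest_indexed, smallest_missing


low, high = cutoff_bounds(catalog_sizes_mb, indexed)
print(f"Any size cutoff would fall between {low} MB and {high} MB")
```

So the "~3 MB" figure is a guess somewhere inside that bracket. The data can't say where in the range a cutoff sits, or whether size is the real cause at all.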

So that’s something of a revelation. If you’ve got important content and you need Google to know about it, reconsider depending on large PDF files. It’s worth someone’s time to convert or provide that content in web-text form.

In fact, according to this source:

Large web pages are far less likely to be relevant to your query than smaller pages. For the sake of efficiency, Google searches only the first 101 kilobytes (approximately 17,000 words) of a web page and the first 120 kilobytes of a pdf file. Assuming 15 words per line and 50 lines per page, Google searches the first 22 pages of a web page and the first 26 pages of a pdf file. If a page is larger, Google will list the page as being 101 kilobytes or 120 kilobytes for a pdf file. This means that Google’s results won’t reference any part of a web page beyond its first 101 kilobytes or any part of a pdf file beyond the first 120 kilobytes.
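
The page counts in that quote can be reproduced with a little arithmetic. The sketch below is my own reconstruction, not anything from the source: it assumes a PDF holds the same number of words per kilobyte as a web page (the quote never says so, but that seems to be how it gets from 120 kilobytes to 26 pages) and that the counts are rounded down to whole pages.

```python
import math

# Figures taken from the quoted passage.
WORDS_PER_LINE = 15
LINES_PER_PAGE = 50
WEB_LIMIT_KB = 101
PDF_LIMIT_KB = 120
WEB_LIMIT_WORDS = 17_000  # "approximately 17,000 words"

words_per_page = WORDS_PER_LINE * LINES_PER_PAGE  # 750 words
# Assumption: PDFs have the same word density per KB as web pages.
words_per_kb = WEB_LIMIT_WORDS / WEB_LIMIT_KB


def full_pages(limit_kb):
    """Whole pages of text that fit within limit_kb kilobytes."""
    return math.floor(limit_kb * words_per_kb / words_per_page)


print(full_pages(WEB_LIMIT_KB))  # web page limit
print(full_pages(PDF_LIMIT_KB))  # PDF limit
```

Under those assumptions the numbers come out to 22 and 26 pages, matching the quote. Either way, a 120-kilobyte limit on searched text is a very different claim from "files over ~3 MB aren’t indexed at all."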

So now I don’t know what to believe.  Anybody have additional insight on this?

3 responses so far

  1. There are ways to optimize a PDF for the web. Here are two links pulled from my Delicious account. Depending on how large the original file is, you might not be able to optimize it down enough, but it’s definitely worth getting what you can.

    What you don’t know about optimizing PDFs can hurt you
    How to Optimize Your PDFs to Increase Search Traffic: 10 Steps

  2. At the college I work at, I’ve been optimizing PDFs for accessibility and usability, as well as for providing a more search-friendly file. I’m interested in hearing more about what you learn about Google and indexing PDFs.

  3. [...] few weeks ago, I came across an interesting post about Google and PDF files.  It seems that Google has very tight rules about how it indexes PDF documents, and you have to be [...]