We use Google to power our university search, and we assumed that our important PDF documents were being indexed. That held until yesterday, when my illustrious colleague alerted the web team that most of our PDF course catalogs weren’t, in fact, being noticed. More like ignored. Google was only aware of ONE of our catalogs.
Continuing from his email:
All the other (catalogs), while linked to from the Academics pages and therefore scannable by Google, don’t seem to be indexed by Google like the illustrious 2003-2004 catalog.
I found this on Google webmaster forums from 2006:
“I suspect that Google simply won’t index documents of that size. Traditionally, their recommended limit on HTML documents has been 100K, although they’ve certainly relaxed that in recent years to more than twice that size. But I don’t think you can expect multi-megabyte files to be indexed.”
So I checked our file sizes:
2002-2003 catalog: 7.907 MB
2003-2004 catalog: 2.387 MB
2004-2005 catalog: 4.361 MB
2005-2006 catalog: 12.535 MB
2006-2007 catalog: 4.634 MB
Looks like if a PDF file is larger than ~3 MB, it won’t be indexed by Google for searches.
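For anyone who wants to run the same check on their own files, here is a minimal sketch in Python. The sizes are the ones listed above; the 3 MB cutoff is only the rough guess from the email, not a documented Google limit, and the function name is made up for illustration.

```python
# Catalog sizes in MB, copied from the list above.
CATALOG_SIZES_MB = {
    "2002-2003": 7.907,
    "2003-2004": 2.387,
    "2004-2005": 4.361,
    "2005-2006": 12.535,
    "2006-2007": 4.634,
}

# Assumption: ~3 MB is the guessed cutoff from the email, not an official figure.
GUESSED_LIMIT_MB = 3.0


def over_limit(sizes_mb, limit_mb):
    """Return the names of files whose size exceeds limit_mb, in listed order."""
    return [name for name, size in sizes_mb.items() if size > limit_mb]


if __name__ == "__main__":
    too_big = over_limit(CATALOG_SIZES_MB, GUESSED_LIMIT_MB)
    for name, size in CATALOG_SIZES_MB.items():
        status = "over guessed limit" if name in too_big else "under guessed limit"
        print(f"{name} catalog: {size:.3f} MB ({status})")
```

Run against our numbers, the only catalog under the guessed limit is 2003-2004, which is also the only one Google knew about.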
So that’s something of a revelation. If you’ve got important content and you need Google to know about it, reconsider depending on large PDF files. It’s worth someone’s time to convert that content to web-text form, or at least provide it that way alongside the PDF.
In fact, according to this source:
Large web pages are far less likely to be relevant to your query than smaller pages. For the sake of efficiency, Google searches only the first 101 kilobytes (approximately 17,000 words) of a web page and the first 120 kilobytes of a pdf file. Assuming 15 words per line and 50 lines per page, Google searches the first 22 pages of a web page and the first 26 pages of a pdf file. If a page is larger, Google will list the page as being 101 kilobytes or 120 kilobytes for a pdf file. This means that Google’s results won’t reference any part of a web page beyond its first 101 kilobytes or any part of a pdf file beyond the first 120 kilobytes.
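The page counts in that quote can be reproduced with a bit of arithmetic. This is only a sketch of how the source seems to have arrived at its numbers: it assumes PDF text has the same words-per-kilobyte density as the web page figure (the quote doesn't say so), and the names below are made up for illustration.

```python
# Figures taken from the quoted source.
WORDS_PER_LINE = 15
LINES_PER_PAGE = 50
WORDS_IN_101_KB = 17_000  # "101 kilobytes (approximately 17,000 words)"

WORDS_PER_PAGE = WORDS_PER_LINE * LINES_PER_PAGE  # 750 words per page


def pages_for_kb(kilobytes):
    """Estimate how many pages fit in the given number of kilobytes.

    Assumption: word density is the same as the web page figure
    (17,000 words per 101 KB), which the source does not state for PDFs.
    """
    words = WORDS_IN_101_KB * kilobytes / 101
    return words / WORDS_PER_PAGE


if __name__ == "__main__":
    # Rounding down gives the whole-page counts the source quotes.
    print("web page:", int(pages_for_kb(101)), "pages")
    print("pdf file:", int(pages_for_kb(120)), "pages")
```

Rounded down, that gives 22 pages for a web page and 26 for a PDF, matching the quote. Either way, it is a tiny fraction of a multi-megabyte catalog.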
So now I don’t know what to believe. Anybody have additional insight on this?























