Can Google robots read pdf content?

Posted on February 16, 2012
Filed Under: Digital, Website content, Website design

When it comes to SEO, a question clients often ask is whether Google robots can read PDF content, or whether PDFs should be avoided in favour of HTML files. Here’s a rundown of the facts.

1. Google can index PDF files

Google can access and index the majority of PDF files, with the exception of encrypted or password-protected files, which are inaccessible to the search engines.

2. Google cannot index images from PDF files

Google cannot index images directly from PDF files unless a separate HTML page is set up for them. Even on standard web pages, Google cannot index images accurately unless they carry a relevant, descriptive ‘ALT tag’ – and since images contained in PDF documents have no equivalent, they remain inaccessible to the search engines.
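By way of contrast, an image on a standard HTML page can be made readable to the search engines with a descriptive alt attribute. A minimal sketch (the filename and alt text here are hypothetical):

```html
<!-- The alt text gives search engines a textual description of the image -->
<img src="annual-report-chart.png" alt="Bar chart of annual revenue by quarter">
```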

3. Can Google index links in PDF files?

Links in PDF files are treated similarly to standard HTML links. The one exception is that links placed in PDF documents cannot carry the ‘nofollow’ attribute, but they can still pass PageRank and other authority-related ranking signals. Links placed in a PDF file may even be followed by the search engines after the file itself has been crawled and indexed.

4. You can stop/remove a PDF file from the SERPs

You can stop a PDF file from being indexed, and you can remove an already indexed PDF document from the results pages. The solution to both is to serve the following directive in the HTTP header where the file is being served: X-Robots-Tag: noindex

This tells the search engines not to index the PDF file, so it will not appear in the results pages. If the file has already been indexed, the same header can still be used and the PDF will drop out of the listings over time.
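As an illustration, on an Apache server the header could be attached to every PDF in a directory via an .htaccess file. A sketch, assuming the mod_headers module is enabled:

```apache
# Send "X-Robots-Tag: noindex" with every PDF served from this directory
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Other web servers offer equivalent ways to add response headers; the key point is that the directive travels in the HTTP header, not in the file itself.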

5. Is it classed as duplicate content if copies exist of pages in both HTML and PDF formats?

Google has always suggested serving only one version of a page to the search engines to avoid such issues. If the same content is to be used on multiple pages in HTML format, the solution is to use the ‘rel=canonical’ tag to nominate one version for indexation. The same applies to duplication between HTML and PDF versions of content: the canonical version can be specified in the HTML coding of the web page, or in the HTTP header served with the PDF file. A preferred version for indexation can also be indicated by including its URL in the sitemap.
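To illustrate, the canonical version can be declared in the head of the HTML page (the URL here is hypothetical):

```html
<!-- In the HTML version of the page, nominate the preferred URL -->
<link rel="canonical" href="https://www.example.com/guide.html">
```

For the PDF file, the equivalent can be sent as an HTTP response header: `Link: <https://www.example.com/guide.html>; rel="canonical"`.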

6. PDF files can rank highly in the search engine results pages

The optimisation process for a PDF file is almost a scaled-down version of the optimisation process for an entire website. Excellent content in a PDF file with a diverse and authoritative link portfolio has as much chance of ranking highly as a well-optimised web page (as illustrated by the first image).

7. Can you influence how PDF files are presented in the SERPs?

Google uses two factors to determine how the title of a PDF document is presented in the SERPs: the metadata within the PDF file and the anchor text of the links pointing to the file. These serve as strong indicators to the search engines of how they should title a PDF file when it is listed.

While optimising PDF files might not be the highest priority in everyone’s SEO strategy, it’s worth knowing how content in PDF files is treated by the search engines. The PDF format is great for presenting long documents in a readable way and should be considered a valuable resource for companies that produce guides, instructions or any other type of extended content.
