Canonicalization can be a confusing area for webmasters, so let’s take a look at what it is, and ways to avoid it causing problems.

What Is Canonicalization?

Canonicalization is the process by which URLs are standardized. For example, www.acme.com and www.acme.com/ are treated as the same page, even though the syntax of the URL is different.

Why Is Canonicalization An Issue For SEO?

Problems can occur when the search engine doesn’t normalize URLs properly.

For example, a search engine might see http://www.acme.com and http://acme.com as different pages. In this instance, the search engine treats the two host names as separate sites, even though they serve the same content.

Why Is This a Problem?

If a search engine sees a page as being published at many separate URLs, it may rank your pages lower than it would otherwise, or not rank them at all.

Canonicalization issues can split link juice between pages if people link to variants of the URL. Not only does this affect rank (less PageRank = lower rank), but it can also affect crawl depth (if PageRank is spent on duplicate content it is not being spent getting other unique content indexed).

To appreciate what a dramatic effect canonicalization issues can have on search traffic, look at the following example. In it, proper canonicalization increased traffic for that keyword by 300%.

             Link Equity   Google Ranking Position   % of Search Traffic   Daily Traffic Volume   Traffic Increase
split 1      60%           8                         3%                    50                     -
split 2      40%           15, filtered = 0          0%                    0                      -
canonical    100%          2                         12%                   200                    300%
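
The 300% figure follows from the daily traffic column. As a rough sketch of the arithmetic (the function and variable names here are illustrative, not from any tool):

```python
def percent_increase(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


# Daily visits taken from the table above.
split_daily_visits = 50 + 0      # split 1 plus split 2 (filtered out of the results)
canonical_daily_visits = 200     # the single canonical URL

print(percent_increase(split_daily_visits, canonical_daily_visits))  # 300.0
```

The two split URLs together bring in 50 visits a day, while the single canonical URL brings in 200, which is three times as much again on top of the original 50.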

What Conditions Can Cause This Problem?

There are various conditions, but the following are amongst the most common:

  • Different host names e.g. www.acme.com vs acme.com
  • Redirects pointing to different URLs e.g. a 302 used inappropriately
  • Forwarding multiple URLs to the same content, and/or publishing the same content on multiple domains
  • Improperly configured dynamic URLs e.g. any URL rewriting based on changing conditions
  • Two index pages appearing in the same location e.g. Index.htm vs Index.html
  • Different protocols e.g. https://www vs http://www
  • Multiple slashes in the filepath e.g. www.acme.com/ vs www.acme.com//
  • Scripts that generate alternate URLs for the same content e.g. some blogging and forum software, or ecommerce software that adds tracking URLs
  • Port numbers in the domain name e.g. acme.com:4430. These can sometimes be seen in virtual hosting environments.
  • Capitalization e.g. www.acme.com/Index.html vs www.acme.com/index.html
  • URLs “built” from the path you take to reach a page e.g. tracking software may incorporate the click path in the URL for statistical purposes.
  • Trailing question marks, with or without parameters e.g. www.acme.com/? or www.acme.com/?source=cnn (a common tagging strategy amongst ad buys)
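
To make the list above concrete, here is a minimal Python sketch that folds several of these variants into one form. It is only an illustration of the idea, not how any search engine works. The policy it encodes is an assumption: prefer the www host over http, treat index.htm and index.html as the directory itself, and treat "source" as a tracking-only parameter. It also expects URLs that include a scheme.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumptions for illustration only: the preferred scheme is http, the
# preferred host is the www form, and "source" is a tracking-only parameter.
PREFERRED_SCHEME = "http"
TRACKING_PARAMS = {"source"}
INDEX_PAGES = {"index.htm", "index.html"}


def canonicalize(url: str) -> str:
    """Map common URL variants onto one preferred form (hypothetical policy)."""
    parts = urlsplit(url)

    # Host names are case-insensitive; .hostname also drops any port number.
    host = (parts.hostname or "").lower()
    if not host.startswith("www."):
        host = "www." + host

    # Collapse repeated slashes and make sure the path is never empty.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"

    # Treat index.htm and index.html (any capitalization) as the directory.
    head, _, last = path.rpartition("/")
    if last.lower() in INDEX_PAGES:
        path = head + "/"

    # Drop tracking parameters; an empty query also removes a bare "?".
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    )

    return urlunsplit((PREFERRED_SCHEME, host, path, query, ""))
```

With this policy, http://acme.com, http://www.acme.com/? and https://WWW.Acme.com:4430//Index.html?source=cnn all come out as http://www.acme.com/. The rest of the path is left case-sensitive on purpose, since many servers treat differently capitalized paths as different files.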

How Can I Tell If Canonicalization Issues Are Affecting My Site?

Besides working through the checklist above manually, you can also use Google’s cache date.

Previously, you would have been able to use Google’s supplemental index marker, although Google have recently done away with this feature.

The supplemental index is a secondary index, separate from Google’s main index. It is a graveyard, of sorts, containing outdated pages, pages with low trust scores, duplicate content, and other erroneous pages. As duplicate pages often reside in the supplemental index, appearing there can be an indicator that you have canonicalization issues, all else being equal.

Before Google removed the supplemental index label, many SEOs noticed that supplemental pages had an old cache date and that cache date is a good proxy for trust. If your page is not indexed frequently, and you think it should be, chances are the page is residing in the supplemental index.

Michael Gray at Wolf-Howl outlines a method to easily check for this data. In summary, you add a date and a unique field to each page, wait a couple of months, then search on that unique term.

How Can I Avoid Canonicalization Issues?

Good Site Planning

Using good site planning and architecture, from the start, can save you a lot of problems later on. Pick a convention for linking, and stick with it.

Maintain Consistent Linking Conventions

It’s an important point, so I’ll repeat it ;) Always link to www.acme.com, rather than sometimes linking to acme.com/index.htm, and sometimes linking to www.acme.com.

301 Redirect Non-www to www, Or Vice Versa

You can force resolution to one URL only. To do this, you create a 301 redirect.

Here’s a typical 301 redirect script:

# Send requests for the non-www host to the www host with a permanent (301) redirect
RewriteEngine On

RewriteCond %{HTTP_HOST} ^seobook\.com [NC]
RewriteRule ^(.*)$ http://www.seobook.com/$1 [L,R=301]
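
If it helps to see the logic of that rule spelled out, the following Python sketch mimics its decision. It is a hypothetical stand-in for illustration only, since Apache does this work itself: the function name redirect_for is made up, and it assumes the rule lives in a top-level .htaccess file, where the leading slash is stripped from the path before matching.

```python
import re

# Mirrors the RewriteCond pattern; re.IGNORECASE plays the role of [NC].
HOST_PATTERN = re.compile(r"^seobook\.com", re.IGNORECASE)


def redirect_for(host: str, path: str):
    """Return (status, location) if the rule would redirect, else None."""
    if HOST_PATTERN.match(host):
        # $1 is the requested path without its leading slash.
        return 301, "http://www.seobook.com/" + path.lstrip("/")
    # The www host does not match the condition, so it is served as-is.
    return None


print(redirect_for("seobook.com", "/blog/"))
print(redirect_for("www.seobook.com", "/blog/"))
```

Only the non-www host is redirected, and the requested path is carried over to the www host, so every page keeps a single address.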

For a more detailed analysis on how to use redirects, see .htaccess, 301 Redirects & SEO.

Use The Website Health Check Tool

This tool, and accompanying video, shows you how to spot a number of site architecture problems, including canonicalization issues.

Download the tool, check the www vs non-www option box, and hit the Analyze button.

If you have a large site you may not be able to surface all the canonicalization issues using the default tool settings. You may need to use the date-based filter options to get a deep view of recently indexed pages. Many canonicalization issues occur sitewide, so looking deeply at new pages should help you detect problems.

Another free, but far more time-consuming, option is to use the date-based filters on Google’s advanced search page.

Workaround For Https://

Sometimes Google will index both the http:// and the https:// versions of a site.

One way around this is to tell the bots not to index the https:// version.

Tony Spencer outlines two ways to do this in .htaccess, 301 Redirects & SEO. One is to cloak the robots.txt file; the other is to create a conditional PHP script.

Use Absolute, As Opposed To Relative Links

An absolute link specifies the exact location of a file on a webserver. For example, http://www.acme.com/filename.html

A relative link is, as the name suggests, relative to a page’s location on the server.

A relative link looks like this:

“/directory/filename.htm”

There are various issues to consider, not related to canonicalization, when deciding which format to use. These issues include page download speed, server access times, and design conventions. The point to remember is to remain consistent. Absolute links tend to make doing so easier, as there is only ever one URL format for a file, regardless of context.
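
A short Python sketch shows why relative links make consistency harder: the same relative link resolves to a different URL depending on which variant of the host served the page. The helper name resolve is illustrative; urljoin is the standard library function that does the work.

```python
from urllib.parse import urljoin


def resolve(page_url: str, link: str) -> str:
    """Resolve a link the way a browser or crawler would from page_url."""
    return urljoin(page_url, link)


relative_link = "/directory/filename.htm"

# The same relative link, found on the www and non-www copies of a page.
print(resolve("http://www.acme.com/page.htm", relative_link))
print(resolve("http://acme.com/page.htm", relative_link))

# An absolute link resolves to the same URL wherever it appears.
print(resolve("http://acme.com/page.htm", "http://www.acme.com/filename.html"))
```

The first two lines print URLs on different hosts, so a crawler that arrives on the non-www variant keeps discovering non-www URLs. The absolute link always points to the www host.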

Don’t Link To Multiple Versions Of The Page

In some cases, you may intend to have duplicate content on your site.

For example, some software, such as blog and forum software, aggregates posts into archives. Always link to the original version of the post, as opposed to the archive or any other location, e.g. www.acme.com/todays-post.htm, not www.acme.com/archive/december/todays-post.htm.

If your software program links to a duplicate version of the content (like an individual post from a forum thread) consider adding rel=nofollow to those links.

Use 301s, not 302s On Internal Affiliate Redirects

A 301 redirect is a permanent redirect, which indicates a page has been moved permanently. 301s typically pass PageRank, and do not cause canonicalization issues.

A 302 redirect is a temporary redirect. If you use 302s the wrong page may rank. Google’s Matt Cutts claims they are trying to fix the problem:

we’ve changed our heuristics to make showing the source url for 302 redirects much more rare. We are moving to a framework for handling redirects in which we will almost always show the destination url. Yahoo handles 302 redirects by usually showing the destination url, and we are in the middle of transitioning to a similar set of heuristics. Note that Yahoo reserves the right to have exceptions on redirect handling, and Google does too. Based on our analysis, we will show the source url for a 302 redirect less than half a percent of the time (basically, when we have strong reason to think the source url is correct)

However, if you use 302s on affiliate links, the affiliate page may rank in the search results, as shown in the SnapNames search below. This, in turn, would credit the affiliate with a commission any time someone buys through that link in the search results, effectively cutting the margins of the end merchant.

Specify preferred urls in Google Webmaster Tools

Google Webmaster Tools provides an area where you can specify which version of the URL (e.g. http://www.acme.com or http://acme.com) Google should use.

Note: It is important not to use the remove URL tool to try and fix these domain issues. Doing so may result in your entire domain, as opposed to one page, being removed from the index.
