How do canonicals impact indexing?
Answer
A canonical tag (aka "rel canonical") tells search engines that a specific URL is the master copy of a page. It is set in the head section of the page, as shown below.
<link rel="canonical" href="https://www.sajari.com" />
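To see which canonical a page declares, you can read the tag out of the page's HTML. The following is a minimal sketch using only Python's standard library; the names `CanonicalParser` and `find_canonical` are illustrative and not part of any Sajari tool. It returns the first canonical link found anywhere in the document and does not verify that the tag sits inside the head section.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, in any letter case.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


page = '<html><head><link rel="canonical" href="https://www.sajari.com" /></head></html>'
print(find_canonical(page))
```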
Canonicals are used for a variety of reasons, such as choosing a preferred domain, preferring http or https, and consolidating ranking "juice" for a given piece of content. Good canonicals can also help improve SEO. For more information, read how Google handles canonical tags and why the SEO community considers them important.
Canonicals are central to how Sajari works, and incorrect canonicals are one of the most common reasons a crawl fails to index content correctly. A canonical is a very strong signal: we generally won't index a URL whose canonical points elsewhere, and will instead try to index the canonical URL. The biggest mistakes we see with canonicals are:
Redirect loops: The canonical points to a different URL, which redirects back to the original, and so on.
Unresolvable: The URL in the canonical tag is either not a URL, does not exist, or cannot be resolved.
Self-referential: Sometimes developers and CMSs set each page's canonical to the page itself, defeating the point of canonicals.
All the same: Every page on a site has the exact same canonical URL (often the root domain or homepage).
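The mistakes listed above can be checked offline once you have collected each page's canonical. The sketch below is one possible approach, not how Sajari's crawler or debug tool works. It assumes a hypothetical `pages` mapping (page URL to declared canonical, or `None`) and a hypothetical `redirects` mapping (URL to redirect target) that you have gathered yourself. The "unresolvable" check is purely syntactic: it flags anything that is not an absolute http(s) URL and cannot tell whether the URL actually exists.

```python
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def is_absolute_http_url(value: str) -> bool:
    """True if value parses as an absolute http(s) URL with a host."""
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop trailing slashes and fragments."""
    parts = urlsplit(url)
    base = f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
    return f"{base}?{parts.query}" if parts.query else base


def audit_canonicals(
    pages: Dict[str, Optional[str]],
    redirects: Optional[Dict[str, str]] = None,
) -> Dict[str, List[str]]:
    """Map each page URL to the list of canonical issues found for it."""
    hops = {normalize(k): normalize(v) for k, v in (redirects or {}).items()}
    issues: Dict[str, List[str]] = {url: [] for url in pages}
    valid: List[str] = []

    for url, canonical in pages.items():
        if canonical is None:
            continue  # no canonical tag on this page
        if not is_absolute_http_url(canonical):
            issues[url].append("unresolvable")
            continue
        page, target = normalize(url), normalize(canonical)
        valid.append(target)
        if target == page:
            issues[url].append("self-referential")
        elif hops.get(target) == page:
            # The canonical target redirects straight back to this page.
            issues[url].append("redirect loop")

    # Every page declares a valid canonical and they are all identical.
    if len(pages) > 1 and len(valid) == len(pages) and len(set(valid)) == 1:
        for url in pages:
            issues[url].append("all the same")
    return issues


# Hypothetical data for illustration only.
pages = {
    "https://example.com/": "https://example.com",
    "https://example.com/a": "https://example.com/b",
    "https://example.com/c": "not a url",
    "https://example.com/d": None,
}
redirects = {"https://example.com/b": "https://example.com/a"}
for url, found in audit_canonicals(pages, redirects).items():
    print(url, found)
```

Only single-hop loops are detected here; following longer redirect chains would need the real HTTP responses.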
You can check whether your pages have any of these issues with our content debug tool. You should either a) fix the issues or b) remove canonical tags from your pages altogether. Removing all canonicals is much better than setting them incorrectly.