In Google Search Console (GSC), the coverage report, accessible in the left-hand column under "Coverage", lets you take stock of a website's indexing status. It is a gold mine of information on how well your pages are crawled and indexed.
If you do not have a GSC account yet, you will first need to create one, and validate your property.
The coverage report contains 4 sections: Error, Valid with warnings, Valid and Excluded.
We recommend that you analyze the green "Valid" section first.
In this article, we'll explain the gray "Excluded" section, which you should analyze last.
This section lists the URLs that Google did not index because it assumes the exclusion was a deliberate choice on your part. Unlike the "Error" section, these are URLs that you did not ask Google to index via a sitemap, which is why Google does not treat them as errors.
Click on the gray "Excluded" section:
Let's start with the URLs excluded for technical reasons:
Technical issues
Blocked due to unauthorized request (401): an authorization requirement (401 response) prevents Googlebot from accessing this page. If you want Googlebot to crawl this page, either remove the authorization requirement or allow Googlebot to access it.
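To double-check a URL flagged here, you can request it without any credentials and look at the status code it returns. Below is a minimal sketch assuming Python with the requests library installed; the URL is a placeholder to replace with a page from your own report.

```python
import requests

# Placeholder URL used for illustration; replace with the page flagged in GSC.
URL = "https://www.example.com/private-page"

# Fetch the page without any credentials, the same way Googlebot would.
response = requests.get(URL, timeout=10, allow_redirects=False)

if response.status_code == 401:
    print("The page requires authorization (401): Googlebot cannot crawl it.")
else:
    print(f"The page returned {response.status_code}: the 401 block may already be gone.")
```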
Not Found (404): this page returned a 404 error when requested. Google discovered this URL without an explicit request or a sitemap; it may have found it via a link from another site, or the page may have been deleted. Googlebot will probably keep trying to access this URL for some time; there is no way to tell it to forget a URL permanently, but it will crawl it less and less often. 404 responses are not a problem if they are intentional, just don't link to them. If the page has moved, set up a 301 redirect to the new location.
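If you want to separate genuine 404s from pages that have already been redirected, a quick status-code check can help. Here is a short sketch, assuming Python with the requests library and placeholder URLs to replace with the ones listed in the report.

```python
import requests

# Placeholder URLs for illustration; replace with the URLs listed under "Not Found (404)".
URLS = [
    "https://www.example.com/old-page",
    "https://www.example.com/deleted-page",
]

for url in URLS:
    # allow_redirects=False lets us see the raw status code (404, 301, 200, ...).
    response = requests.get(url, timeout=10, allow_redirects=False)
    if response.status_code == 404:
        print(f"{url} -> 404 (fine if intentional, otherwise add a 301 redirect)")
    elif response.status_code in (301, 302, 308):
        print(f"{url} -> {response.status_code} redirect to {response.headers.get('Location')}")
    else:
        print(f"{url} -> {response.status_code}")
```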
Anomaly while crawling: an unspecified anomaly occurred while crawling this URL, usually caused by a 4xx- or 5xx-level response code. Try analyzing the page with the URL Inspection tool to check for problems preventing it from being crawled, then follow up with the technical team.
Soft 404: the page request returns what appears to be a "soft 404" response. In other words, the page tells the user it cannot be found, but without returning the corresponding 404 response code. We recommend that you either return a real 404 code for "not found" pages to prevent indexing and remove them from your internal linking, or add content to the page so that Google no longer treats it as a soft 404.
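A quick way to spot a soft 404 is to fetch the page and compare what it says with the status code it returns. The sketch below assumes Python with requests, a placeholder URL and a deliberately crude "not found" text heuristic; adapt the wording check to your own error template.

```python
import requests

# Placeholder "not found" URL for illustration.
URL = "https://www.example.com/product-no-longer-available"

response = requests.get(URL, timeout=10)

# Crude heuristic: the page displays an error message to the user...
looks_like_error = "not found" in response.text.lower()

# ...but still answers with HTTP 200 instead of 404.
if response.status_code == 200 and looks_like_error:
    print("Probable soft 404: the page says it is missing but returns 200.")
else:
    print(f"Status code {response.status_code}: no obvious soft 404 pattern detected.")
```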
Issues related to duplicates or canonical tags
Other page with correct canonical tag: this page is a duplicate of a page that Google recognizes as canonical, and it correctly points to that canonical page. In theory, there is nothing to do on Google's side, but we recommend checking why these 2 pages exist and are both visible to Google, so you can make the appropriate corrections.
Duplicate page without any canonical tag selected by user: this page has duplicates, none of which is marked as canonical. Google considers that this page is not the canonical one; you should designate the canonical version explicitly. Inspecting this URL shows the canonical URL selected by Google.
Duplicate page, Google did not choose the same canonical URL as the user: this page is marked as canonical, but Google considers that another URL makes a better canonical and has indexed that one instead. We recommend that you investigate the origin of the duplication (perhaps a 301 redirect is preferable to keeping the 2 pages), then set the canonical tags accordingly. This page was discovered without an explicit crawl request. Inspecting this URL shows the canonical URL selected by Google.
If you get this message on 2 different pages, it means they are too similar and Google does not see the point of keeping both. Say you run a shoe store with a "red shoes" page and a "black shoes" page: if the two pages contain little or no content, or content so similar that only the title changes, ask yourself whether these pages should really exist, and if so, improve their content.
Duplicate page, the URL sent has not been selected as canonical URL: the URL is part of a set of duplicate URLs without an explicitly designated canonical page. You explicitly asked for this URL to be indexed, but because it is a duplicate and Google considers another URL a better canonical, Google indexed that one instead of the URL you submitted. The difference with "Google did not choose the same canonical URL as the user" is that here you explicitly requested indexing. Inspecting this URL shows the canonical URL selected by Google.
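For the duplicate and canonical cases above, it helps to verify which canonical tag a page actually declares before comparing it with the canonical Google selected. Here is a minimal sketch assuming Python with the requests and BeautifulSoup (bs4) libraries; both URLs are placeholders for your own duplicate page and intended canonical.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URLs for illustration; replace with the duplicate URL from the report
# and the canonical you intend it to point to.
DUPLICATE_URL = "https://www.example.com/black-shoes?sort=price"
EXPECTED_CANONICAL = "https://www.example.com/black-shoes"

html = requests.get(DUPLICATE_URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Look for the <link rel="canonical" href="..."> tag in the page's HTML.
canonical_tag = soup.find("link", rel="canonical")
declared = canonical_tag.get("href") if canonical_tag else None

if declared is None:
    print("No canonical tag found: Google will pick a canonical on its own.")
elif declared == EXPECTED_CANONICAL:
    print(f"Canonical correctly declared: {declared}")
else:
    print(f"Canonical mismatch: the page declares {declared}")
```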
Page with redirect: the URL is a redirect and therefore has not been added to the index. There is nothing to do in this case, except to check that the list is correct.
Page deleted due to legal claim: the page has been removed from the index due to a legal claim.
Issues related to indexing management
Blocked by a "noindex" tag: when Google tried to index the page, it found a "noindex" directive and therefore did not index it. If you don't want the page indexed, you have done the right thing. If you do want it indexed, remove the "noindex" directive.
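To confirm whether a "noindex" directive is still in place, remember that it can live either in an X-Robots-Tag HTTP header or in a robots meta tag. The sketch below assumes Python with requests and BeautifulSoup, and a placeholder URL to replace with the page flagged in the report.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration; replace with the URL flagged as blocked by "noindex".
URL = "https://www.example.com/landing-page"

response = requests.get(URL, timeout=10)

# The directive can be sent in an HTTP header...
header_directive = response.headers.get("X-Robots-Tag", "")

# ...or declared in a <meta name="robots"> tag in the HTML.
soup = BeautifulSoup(response.text, "html.parser")
meta_tag = soup.find("meta", attrs={"name": "robots"})
meta_directive = meta_tag.get("content", "") if meta_tag else ""

if "noindex" in header_directive.lower() or "noindex" in meta_directive.lower():
    print("A noindex directive is present: Google will not index this page.")
else:
    print("No noindex directive found: the block may have been removed since the last crawl.")
```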
Blocked by page removal tool: the page is currently blocked by a URL removal request. If you are a verified site owner, you can use the URL removal tool to see who made this request. Removal requests are only valid for 90 days; after this period, Googlebot can crawl and index your page again, even if you don't send another indexing request. If you don't want the page to be indexed, use a "noindex" directive, put the page behind authentication, or delete it.
Blocked by robots.txt file: a robots.txt rule is preventing Googlebot from accessing this page. You can verify this with the robots.txt testing tool. Note that this does not mean the page will never be indexed by other means: if Google can find other information about this page without loading it, the page could still be indexed (although this is rare). To make sure a page is not indexed by Google, remove the robots.txt block and use a "noindex" directive instead.
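You can also reproduce this robots.txt check outside of Search Console. Here is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt and page URLs are placeholders for your own site.

```python
from urllib.robotparser import RobotFileParser

# Placeholder URLs for illustration; replace with your own domain and blocked page.
ROBOTS_URL = "https://www.example.com/robots.txt"
PAGE_URL = "https://www.example.com/category/filter?color=red"

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # downloads and parses the robots.txt file

if parser.can_fetch("Googlebot", PAGE_URL):
    print("Googlebot is allowed to crawl this URL: the robots.txt block may be gone.")
else:
    print("Googlebot is blocked by robots.txt for this URL.")
```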
Crawled, currently not indexed: the page has been crawled by Google but not indexed. It may be indexed in the future; there is no need to resubmit this URL for crawling.
This happens quite often with paginated pages beyond the first one, because the engine does not see the point of indexing them in addition to the first.
It can also concern a large number of very similar or low-quality pages that Google does not consider worth indexing. In that case, think about whether it would be better to deindex them voluntarily, unless you plan to work on them in the near future.
Discovered, currently not indexed: the page was found by Google but not yet crawled. Usually, this means Google tried to crawl the URL but the site was overloaded, so the crawl was postponed. That is why the last crawl date is empty in the report.
This happens quite often with paginated pages beyond the first one, because the engine does not see the point of crawling them in addition to the first.
It's also a good idea to look at site depth: when you have many deep pages, it is hard for the robot to crawl your website thoroughly, so it may decide to set aside an "uninteresting" part of the site. This issue should be corrected as soon as possible, because it can hurt the website's overall crawlability and cause Google to ignore other pages that may be crucial for your SEO.
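To get a rough idea of page depth, you can run a small breadth-first crawl from the homepage and count how many clicks each internal page sits from it. The sketch below assumes Python with requests and BeautifulSoup, a placeholder start URL and an arbitrary page limit; a dedicated crawler will give you a more reliable picture on a large site.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Placeholder start URL and rough page limit for illustration.
START_URL = "https://www.example.com/"
MAX_PAGES = 200

domain = urlparse(START_URL).netloc
depths = {START_URL: 0}          # URL -> number of clicks from the homepage
queue = deque([START_URL])

while queue and len(depths) < MAX_PAGES:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for tag in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, tag["href"]).split("#")[0]
        # Stay on the same domain and ignore pages already seen.
        if urlparse(link).netloc == domain and link not in depths:
            depths[link] = depths[url] + 1
            queue.append(link)

# Print the deepest pages found: good candidates for better internal linking.
deepest = sorted(depths.items(), key=lambda item: item[1], reverse=True)[:10]
for page, depth in deepest:
    print(f"depth {depth}: {page}")
```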
You know everything about the Search Console excluded URLs report!
Don't hesitate to contact us, we can help you audit your index coverage!