A robots.txt file tells crawlers such as Googlebot which pages or files you don't want them to crawl. The robots.txt is therefore not used to deindex pages but to prevent them from being crawled:
==> if a page has never been indexed, blocking its crawl will keep it from being indexed. However, if the page is already indexed, or if another website links to it, the robots.txt will not get it deindexed. To prevent a page from appearing on Google, you must use a noindex tag / directive (shown just below), or even protect it with a password.
The robots.txt file's main goal is therefore to manage the robots' crawl time by keeping them away from low-added-value pages that must nevertheless exist for the user journey (shopping cart, etc.).
PS: the robots.txt file is one of the first files search engines request when they visit a site.
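As a reminder, a noindex directive does not go in the robots.txt: it is placed either in the page's HTML or in an HTTP response header, for example:

<meta name="robots" content="noindex">

or, at the HTTP level:

X-Robots-Tag: noindex

For the noindex to be taken into account, the page must remain crawlable, i.e. it must not be blocked in the robots.txt.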
Format and usage rules
The robots.txt is a text file that must be placed at the server / site’s root, for example: https://smartkeyword.io/robots.txt.
It cannot be placed in a subdirectory (for example, http://example.com/pages/robots.txt), but it can be placed on a subdomain (for example, http://website.example.com/robots.txt), in which case it applies only to that subdomain.
The robots.txt file’s name must be lowercase (no Robots.txt or ROBOTS.TXT).
The website can only contain one robots.txt file.
If the file is absent, the server returns a 404 error and crawlers consider that no content is prohibited (you can check this quickly with the sketch below).
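To see how a site responds, here is a minimal sketch in Python; the domain is a placeholder to replace with the site you are looking at.

# Minimal sketch: check whether a site serves a robots.txt file.
# "example.com" is a placeholder domain.
from urllib.request import urlopen
from urllib.error import HTTPError

url = "https://example.com/robots.txt"
try:
    with urlopen(url) as response:
        print(response.status)                                    # 200: the file exists
        print(response.read().decode("utf-8", errors="replace"))  # its contents
except HTTPError as e:
    print(e.code)  # 404: no robots.txt, so crawlers treat everything as allowed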
File contents
Let's start from the following example:
"User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: sitemap link"
The User-agent: * line means that the directives that follow it apply to all robots. Every group of rules must start with a User-agent line.
The Disallow instruction tells robots not to crawl a page or an entire directory of the website.
The Allow instruction lets you make exceptions to a Disallow rule, for example:
Disallow: /wp-admin/ = we ask bots not to crawl /wp-admin/.
Allow: /wp-admin/admin-ajax.php = we make an exception so that bots can crawl admin-ajax.php, even though it sits in the /wp-admin/ directory we just blocked.
Sitemap: this instruction lets search engines know the address of the website's sitemap.xml file, if there is one.
You can get a feel for how a crawler applies these rules with the sketch below.
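Here is a minimal sketch using Python's standard urllib.robotparser module; the rules and URLs are placeholders based on the example above. Note that this built-in parser applies the rules in file order, whereas Google follows the most specific rule, so for Allow/Disallow conflicts Google's own tester (described below) remains the reference.

# Minimal sketch: parse robots.txt rules and ask whether a URL may be crawled.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A URL inside the blocked directory:
print(rp.can_fetch("*", "https://example.com/wp-admin/settings.php"))  # False
# A URL that is not mentioned at all is allowed by default:
print(rp.can_fetch("*", "https://example.com/blog/"))                  # True
# Caveat: for the Allow exception, this parser follows the rules in file
# order and may still answer False, while Google allows admin-ajax.php
# because it applies the most specific rule.
print(rp.can_fetch("*", "https://example.com/wp-admin/admin-ajax.php"))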
How to create a robots.txt file?
To create a robots.txt file, you can use almost any text editor that saves standard text files in ASCII or UTF-8 format. Do not use word-processing software: it often saves files in a proprietary format and may add unexpected characters (curly quotes, for example) that can confuse crawlers.
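If you generate the file with a script rather than an editor, the same principle applies: write plain UTF-8 text. A minimal sketch (the rules are just an example):

# Minimal sketch: write robots.txt as a plain UTF-8 text file,
# with Unix line endings and no "smart" quotes.
rules = (
    "User-agent: *\n"
    "Disallow: /wp-admin/\n"
    "Allow: /wp-admin/admin-ajax.php\n"
)

with open("robots.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write(rules)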
How to test a robots.txt file?
To test your robots.txt file, you can use Google Search Console.
Once logged into the Console, click on "Access the old version" at the bottom left.
Next, open the "Crawl" menu, then click on the robots.txt Tester tool.
Scroll through the code of your robots.txt file to locate the syntax warnings and logic errors that are flagged. The number of syntax warnings and logic errors is displayed immediately below the editor.
Then, to test whether a given URL is blocked, directly in Search Console:
Enter the URL of a page on your website in the text box at the bottom of the page. Then, in the drop-down list to the right of the text box, select the user-agent you want to simulate.
Click on the TEST button to test the access.
Check whether the TEST button now reads ACCEPTED or BLOCKED to see whether or not crawlers can crawl this URL.
Edit the file in the tool and test the access again if necessary.
Copy your changes into the robots.txt file actually hosted on your website: the tool does not modify the file on the website, it only tests the copy hosted in the tool. Once the live file is updated, you can double-check it with the sketch below.
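A minimal sketch of that final check, assuming the edited copy is saved locally as robots.txt and using a placeholder site URL:

# Minimal sketch: fetch the live robots.txt and compare it with the local
# copy you edited and tested. The URL and file name are placeholders.
from urllib.request import urlopen

live = urlopen("https://example.com/robots.txt").read().decode("utf-8")

with open("robots.txt", encoding="utf-8") as f:
    local = f.read()

print("identical" if live.strip() == local.strip() else "different")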
Audit a robots.txt file
Do you want to analyze a website’s robots.txt? Here are the questions to ask yourself, and the corrections to be made if necessary.
The website does not have a robots.txt
It is quite possible that a website does not have one. To check, simply add "/robots.txt" to the end of the homepage URL in your browser. Also check the subdomains.
If you don't have a robots.txt:
Do you need one? Check whether you have low-added-value pages that would justify it, for example the shopping cart or the internal search engine's result pages.
If you do, create the file following the guidelines above; a sample file is sketched below.
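For instance, a site that wants to keep crawlers away from its shopping cart and internal search result pages could use a file along these lines (the paths and sitemap URL are hypothetical and must be adapted to your own site):
"User-agent: *
Disallow: /cart/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml"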
The website has a robots.txt
Open the file, and simply check the blocked pages:
If pages are blocked when they should not be: remove them
If there are missing pages that must be blocked: add them
If the blocked pages are exactly the ones you need: that's fine, there is nothing to do
Now you know everything about robots.txt and how to analyze it!
Check out similar articles:
The Google Search Console coverage report