Most people now rely on search engines to learn about almost any topic. A robots.txt file plays a key role here: it tells search engine crawlers which parts of a website they may crawl.
Like any tool, robots.txt can be misconfigured. Here are 8 common robots.txt issues and how to fix them:
1. Robots.txt Not In The Root Directory
Search robots can only discover the file if it sits in the root folder of your site. A robots.txt file placed in a subfolder is not picked up.
To avoid this issue, move the file to the root directory so that it is reachable at /robots.txt.
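As a rough illustration of where the file has to live, the sketch below derives the root robots.txt address from any page URL on the same host. The function name robots_txt_url and the www.example.com addresses are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the host of page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A file uploaded to a subfolder is not where crawlers look:
print(robots_txt_url("https://www.example.com/media/robots.txt"))
# https://www.example.com/robots.txt
```

If your file currently sits at any address other than the one this returns, it needs to be moved.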
2. Inappropriate usage of wildcards
Robots.txt supports only two wildcards:
- * matches zero or more instances of any valid character
- $ marks the end of a URL
To avoid this issue, keep wildcard usage to a minimum, because a poorly placed wildcard could block your entire site.
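To see how the two wildcards behave, here is a minimal sketch that translates a rule path into a regular expression. It is an approximation for experimenting, not the matcher any search engine actually uses, and the helper names rule_to_regex and rule_matches are invented for this example.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    # Rules match from the start of the path; "$" also pins the end.
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/*.pdf$", "/docs/file.pdf"))      # True
print(rule_matches("/*.pdf$", "/docs/file.pdf?x=1"))  # False
print(rule_matches("/*", "/any/page/at/all"))         # True
```

The last line shows the risk described above: a Disallow rule of /* matches every path on the site.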
3. Noindex In Robots.txt
Google stopped obeying noindex rules in robots.txt on September 1, 2019, so avoid relying on them. Pages covered only by such a rule can still be indexed.
To fix this, switch to a supported alternative. One example is the robots meta tag, which you add to the head of a webpage to keep Google from indexing it.
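After moving from a robots.txt rule to the robots meta tag, it can help to confirm that the tag is really present in the page. The sketch below uses only the Python standard library; the class name RobotsMetaFinder and the function has_noindex are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))  # True
```

Keep in mind that a crawler has to be able to fetch the page to read this tag, so the page itself must not be disallowed in robots.txt.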
4. Blocked scripts and stylesheets
It can seem logical to block crawler access to external JavaScript files and cascading style sheets (CSS). However, remember that Googlebot needs access to CSS and JS files to “see” your HTML and PHP pages correctly.
To fix this, remove the line in your robots.txt file that blocks access to those files.
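One way to find the offending line is to test a few of your stylesheet and script URLs against the rules. The sketch below uses Python's standard urllib.robotparser, which does plain prefix matching and does not interpret the * and $ wildcards, so treat it as a first check only. The function name blocked_assets, the sample rules, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_urls, user_agent: str = "Googlebot"):
    """Return the asset URLs that the given rules forbid for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]


SAMPLE_RULES = """\
User-agent: *
Disallow: /assets/
Disallow: /admin/
"""

ASSETS = [
    "https://www.example.com/assets/site.css",
    "https://www.example.com/assets/app.js",
    "https://www.example.com/static/logo.js",
]

for url in blocked_assets(SAMPLE_RULES, ASSETS):
    print("blocked:", url)
```

With these sample rules, the two URLs under /assets/ are reported, which points to the Disallow: /assets/ line as the one to remove or narrow.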
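5. Missing XML Sitemap URL
You can include the URL of your XML sitemap in the robots.txt file so that crawlers can find it easily.
Omitting the sitemap is not strictly an error, as it does not negatively affect the core functionality and appearance of the website. Still, if you have a sitemap, adding its URL to robots.txt is a simple way to fix this.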
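To check whether a robots.txt file already declares a sitemap, the standard library parser can list the Sitemap lines it finds (the site_maps method is available from Python 3.8). The function name declared_sitemaps and the sample content are assumptions made for this sketch.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str):
    """Return the sitemap URLs declared in the robots.txt text, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


with_sitemap = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
"""

print(declared_sitemaps(with_sitemap))
# ['https://www.example.com/sitemap.xml']
print(declared_sitemaps("User-agent: *\nDisallow: /admin/\n"))
# []
```

An empty list means the file has no Sitemap line yet.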
6. Accessibility to development sites
Blocking crawlers from your live website is not a good idea, but neither is allowing them to crawl and index pages that are still under development.
If you see a site-wide disallow rule (User-agent: * followed by Disallow: /) when you shouldn’t, or don’t see one when you should, make the required changes to your robots.txt file and check that your website’s search appearance updates accordingly.
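A quick way to confirm whether a robots.txt file shuts crawlers out of the whole site, which is what you would expect on a development site but not on a live one, is to ask whether the home page may be fetched. This is a simplified sketch: the function name site_is_blocked and the example.com address are placeholders.

```python
from urllib.robotparser import RobotFileParser


def site_is_blocked(robots_txt: str,
                    site_root: str = "https://www.example.com/",
                    user_agent: str = "Googlebot") -> bool:
    """Return True when the rules forbid user_agent from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, site_root)


development_rules = "User-agent: *\nDisallow: /\n"
live_rules = "User-agent: *\nDisallow: /staging/\n"

print(site_is_blocked(development_rules))  # True
print(site_is_blocked(live_rules))         # False
```

Running a check like this at launch helps catch a development-time block that was carried over to the live site.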
7. Usage of absolute URLs
Using relative paths in the robots.txt file is the recommended approach for indicating which parts of a site should not be accessed by crawlers.
When you use an absolute URL in a rule, there’s no guarantee that crawlers will interpret it as intended or that the disallow/allow rule will be followed. To fix this, rewrite such rules as relative paths.
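Converting absolute URLs in Allow and Disallow rules to relative paths can be sketched as follows. The function name relativize_rules and the sample file are invented for illustration. Sitemap lines are deliberately left untouched, because a sitemap location is given as a full URL.

```python
from urllib.parse import urlsplit


def relativize_rules(robots_txt: str) -> str:
    """Rewrite Allow/Disallow values that are absolute URLs as relative paths."""
    fixed = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        value = value.strip()
        is_rule = field.strip().lower() in ("allow", "disallow")
        if sep and is_rule and "://" in value:
            parts = urlsplit(value)
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            line = f"{field.strip()}: {path}"
        fixed.append(line)
    return "\n".join(fixed)


original = """\
User-agent: *
Disallow: https://www.example.com/private/
Sitemap: https://www.example.com/sitemap.xml"""

print(relativize_rules(original))
```

For the sample above, the Disallow line becomes Disallow: /private/ while the other two lines stay as they were.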
8. Deprecated & Unsupported Elements
Bing still supports crawl-delay, but Google doesn’t, even though webmasters often specify it. You used to be able to set crawl settings in Google Search Console, but this was removed towards the end of 2023.
Noindex in robots.txt was likewise never a widely supported or standardised practice. The preferred method for noindex is to use on-page robots tags or X-Robots-Tag measures at the page level.
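To spot these elements in an existing file, a small scan like the one below can list the lines that use them. It is only a sketch: the function name unsupported_for_google is hypothetical, and the set of flagged names covers just the two directives discussed here.

```python
def unsupported_for_google(robots_txt: str):
    """Return (line number, directive) pairs for directives Google ignores."""
    flagged = {"noindex", "crawl-delay"}  # only the two discussed in this section
    findings = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        content = line.split("#", 1)[0]  # drop trailing comments
        field, sep, _ = content.partition(":")
        name = field.strip().lower()
        if sep and name in flagged:
            findings.append((number, name))
    return findings


sample = """\
User-agent: *
Crawl-delay: 10
Noindex: /drafts/
Disallow: /admin/  # keep this one
"""

print(unsupported_for_google(sample))
# [(2, 'crawl-delay'), (3, 'noindex')]
```

A crawl-delay line may still be worth keeping for Bing, while a noindex line is better replaced with a page-level measure.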