January 24, 2025

What Is a Robots.txt File?

A robots.txt file is a set of instructions that tells search engines which pages to crawl and which pages to avoid, guiding crawler access but not necessarily keeping pages out of Google’s index.

A robots.txt file looks like this:

robots.txt example

Robots.txt files may seem complicated, but the syntax (computer language) is straightforward.

Before getting into the details of robots.txt, let’s clarify how it differs from other terms that sound similar.

Robots.txt files, meta robots tags, and x-robots tags all guide search engines in handling site content, but they differ in their level of control, where they are placed, and what they control.

Consider these specifics:

  • Robots.txt: This file is placed in the website’s root directory and provides site-wide instructions to search engine crawlers on which areas of the site they should and shouldn’t crawl
  • Meta robots tags: These tags are snippets of code in the <head> section of individual webpages and provide page-specific instructions to search engines on whether to index (include in search results) and follow (crawl the links on) each page
  • X-robots tags: These code snippets are used primarily for non-HTML files, such as PDFs and images, and are implemented in the file’s HTTP header

Further reading: Meta Robots Tag & X-Robots-Tag Explained

Why Is Robots.txt Important for SEO?

A robots.txt file is important for SEO because it helps manage web crawler activity, preventing crawlers from overloading your website and from crawling pages not meant for public access.

Below are a few reasons to use a robots.txt file:

1. Optimize Crawl Budget

Blocking unnecessary pages with robots.txt allows Google’s web crawler to spend more crawl budget (the number of pages Google will crawl on your site within a certain timeframe) on pages that matter.

Crawl budget can vary based on your site’s size, health, and number of backlinks.

If your website has more pages than its crawl budget, important pages may fail to get indexed.

Unindexed pages won’t rank, which means you’ve wasted time creating pages that users never see in search results.

2. Block Duplicate and Non-Public Pages

Not all pages are meant for inclusion in the search engine results pages (SERPs), and a robots.txt file lets you block these private pages from crawlers.

Consider staging sites, internal search results pages, duplicate pages, or login pages. Some content management systems handle these internal pages automatically.

WordPress, for example, disallows the login page “/wp-admin/” for all crawlers.
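WordPress’s default (virtual) robots.txt typically looks something like the sketch below; the Allow line re-permits admin-ajax.php because front-end features rely on it. This is a hedged example of typical defaults, which can vary by WordPress version and plugins:

# typical WordPress default (may vary by version and plugins)
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php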

3. Hide Resources

Robots.txt lets you exclude resources like PDFs, videos, and images from crawling if you want to keep them private or have Google focus on more important content.
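For instance, a sketch of rules that keep crawlers away from a downloads folder and a private image directory might look like this (the paths are illustrative, not from any specific site):

# illustrative paths only
User-agent: *
Disallow: /downloads/
Disallow: /images/private/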

How Does a Robots.txt File Work?

A robots.txt file tells search engine bots which URLs to crawl and (more importantly) which URLs to avoid crawling.

When search engine bots crawl webpages, they discover and follow links. This process takes them from one site to another across various pages.

If a bot finds a robots.txt file, it reads that file before crawling any pages.

The syntax is straightforward. You assign rules by identifying the user-agent (the search engine bot) and specifying directives (the rules).

You can use an asterisk (*) to assign directives to all user-agents at once.

For example, the following instruction allows all bots except DuckDuckGo to crawl your site:

all bots except DuckDuckGo instructed to crawl the site
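In plain text, that instruction might look like the sketch below. DuckDuckBot is DuckDuckGo’s crawler; the rules here are illustrative rather than a copy of the screenshot:

# illustrative sketch of the rule shown above
User-agent: DuckDuckBot
Disallow: /

User-agent: *
Allow: /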

Semrush bots crawl the web to gather insights for our website optimization tools, such as Site Audit, Backlink Audit, and On Page SEO Checker.

Semrush bots respect the rules in your robots.txt file, meaning they won’t crawl your site if you block them.

However, blocking Semrush bots limits the use of certain Semrush tools.

For example, if you block the SiteAuditBot from crawling your site, you can’t audit your site with the Site Audit tool, which helps analyze and fix technical issues on your website.

Errors section in Site Audit tool

If you block SemrushBot-SI from crawling your site, you can’t use the On Page SEO Checker tool effectively.

As a result, you lose the opportunity to generate optimization ideas that could improve your webpages’ rankings.

On Page SEO Checker tool

How to Find a Robots.txt File

Your robots.txt file is hosted on your server, just like the other files on your website.

You can view any website’s robots.txt file by typing the site’s homepage URL into your browser and adding “/robots.txt” at the end.

For example: “https://semrush.com/robots.txt.”

https://semrush.com/robots.txt example

Examples of Robots.txt Files

Here are some real-world robots.txt examples from popular websites.

YouTube

YouTube’s robots.txt file tells crawlers not to access user comments, video feeds, login/signup pages, and age verification pages.

The rules in YouTube’s robots.txt file discourage indexing of user-specific or dynamic content that doesn’t help search results and may raise privacy concerns.

YouTube robots.txt file with highlights showing disallowed URL paths

G2

G2’s robots.txt file tells crawlers not to access sections with user-generated content, like survey responses, comments, and contributor profiles.

The rules in G2’s robots.txt file help protect user privacy by restricting access to potentially sensitive personal information. The rules also prevent attempts to manipulate search results.

G2 robots.txt file with highlights showing disallowed URL paths

Nike

Nike’s robots.txt file uses the disallow directive to block crawlers from accessing user-generated directories, like “/checkout/” and “*/member/inbox.”

The rules in Nike’s robots.txt file prevent sensitive user data from appearing in search results and reduce opportunities to manipulate SEO rankings.

Nike robots.txt file with highlights showing disallowed URL paths
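Based on the paths quoted above, the relevant lines would look something like this (a partial sketch rather than a verbatim excerpt; the user-agent grouping is assumed):

# partial sketch based on the quoted paths, not the full file
User-agent: *
Disallow: /checkout/
Disallow: */member/inbox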

Search Engine Land

Search Engine Land’s robots.txt file uses the disallow directive to discourage indexing of “/tag/” directory pages, which often have low SEO value and can cause duplicate content issues.

The rules in Search Engine Land’s robots.txt file encourage search engines to focus on higher-quality content and optimize the site’s crawl budget, which is especially important for large websites like Search Engine Land.

Search Engine Land robots.txt file with highlights showing disallowed url path

Forbes

Forbes’s robots.txt file instructs Google not to crawl the “/test/” directory, which likely contains testing or staging environments.

The rules in Forbes’s robots.txt file prevent unfinished or sensitive content from being indexed, assuming it isn’t linked to from elsewhere.

Forbes robots.txt file with highlights showing disallowed URL path

Explaining Robots.txt Syntax

A robots.txt file consists of one or more directive blocks, with each block specifying a user-agent (a search engine bot) and providing “allow” or “disallow” instructions.

A simple block may look like this:

User-agent: Googlebot
Disallow: /not-for-google
User-agent: DuckDuckBot
Disallow: /not-for-duckduckgo
Sitemap: https://www.yourwebsite.com/sitemap.xml

The User-Agent Directive

The first line of each directive block specifies the user-agent, which identifies the crawler the rules apply to.

For example, use these lines to prevent Googlebot from crawling your WordPress admin page:

User-agent: Googlebot
Disallow: /wp-admin/

When multiple directive blocks exist, a bot follows the most specific block that applies to it.

Imagine you have three sets of directives: one for *, one for Googlebot, and one for Googlebot-Image.

If the Googlebot-News user agent crawls your site, it will follow the Googlebot directives.

However, the Googlebot-Image user agent will follow the more specific Googlebot-Image directives.
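A sketch of that scenario (the disallowed paths are illustrative):

# illustrative paths
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-Image
Disallow: /not-for-google-images/

Here, Googlebot-News would obey the Googlebot block, while Googlebot-Image would obey its own block.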

The Disallow Robots.txt Directive

The disallow directive lists parts of the site a crawler shouldn’t access.

An empty disallow line means no restrictions exist.

For example, the rule below allows all crawlers access to your entire site:

User-agent: *
Allow: /

To block all crawlers from your entire site, use the block below:

User-agent: *
Disallow: /

The Allow Directive

The allow directive permits search engines to crawl a subdirectory or specific page, even within an otherwise disallowed directory.

For example, use the rule below to prevent Googlebot from accessing all blog posts except one:

User-agent: Googlebot
Disallow: /blog
Allow: /blog/example-post

The Sitemap Directive

The sitemap directive tells search engines (specifically Bing, Yandex, and Google) where to find your XML sitemap, a file that lists all the pages you want search engines to index.

The image below shows what the sitemap directive looks like:

The Sitemap directive example
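In plain text, the directive is a single line that can sit anywhere in the file, usually at the very top or bottom (the URL below is a placeholder):

# placeholder URL
Sitemap: https://www.yourwebsite.com/sitemap.xml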

Including a sitemap directive in your robots.txt file is a quick way to share your sitemap.

However, you should also submit your XML sitemap directly to search engines via their webmaster tools to speed up crawling.

The Crawl-Delay Directive

The crawl-delay directive tells crawlers how many seconds to wait between requests, which helps avoid server overload.

Google no longer supports the crawl-delay directive. To set crawl rates for Googlebot, use Google Search Console.

Bing and Yandex do support the crawl-delay directive.

For example, use the rule below to set a 10-second delay after each crawl action:

User-agent: *
Crawl-delay: 10

Further reading: 15 Crawlability Problems & How to Fix Them

The Noindex Directive

A robots.txt file tells search engines what to crawl and what not to crawl, but it can’t reliably keep a URL out of search results, even if you use a noindex directive.

If you use noindex in robots.txt, the page can still appear in search results without visible content.

Example of page indexed and blocked by robots.txt in search results

Google never officially supported the noindex directive in robots.txt and confirmed as much in September 2019.

To reliably exclude a page from search results, use a meta robots noindex tag instead.

How to Create a Robots.txt File

Use a robots.txt generator tool to quickly create a robots.txt file.

Follow these steps to create a robots.txt file from scratch:

1. Create a File and Name It Robots.txt

Open a .txt document in a text editor or web browser.

Name the document “robots.txt.”

Now you can start typing directives.

2. Add Directives to the Robots.txt File

A robots.txt file contains one or more groups of directives, and each group consists of multiple lines of instructions.

Each group begins with a user-agent and specifies:

  • Who the group applies to (the user-agent)
  • Which directories (pages) or files the agent should access
  • Which directories (pages) or files the agent shouldn’t access
  • A sitemap (optional) to tell search engines which pages and files you consider important

Crawlers ignore lines that don’t match these directives.

Imagine you don’t want Google to crawl your “/clients/” directory because it’s for internal use only.

The first group in your file would look like this block:

User-agent: Googlebot
Disallow: /clients/

You can add more instructions for Google after that, like the one below:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

Then press enter twice to start a new group of directives.

Now imagine you want to prevent all search engines from accessing the “/archive/” and “/support/” directories.

A block preventing access to those directories would look like this:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google
User-agent: *
Disallow: /archive/
Disallow: /support/

Once you’re finished, add your sitemap:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google
User-agent: *
Disallow: /archive/
Disallow: /support/
Sitemap: https://www.yourwebsite.com/sitemap.xml

Save the file as “robots.txt.”

3. Upload the Robots.txt File

After saving your robots.txt file, upload it to your website so search engines can find it.

The process of uploading your robots.txt file depends on your hosting environment.

Search online or contact your hosting provider for details.

For example, search “upload robots.txt file to [your hosting provider]” for platform-specific instructions.

Below are some links explaining how to upload robots.txt files to popular platforms:

After uploading, confirm that the file is accessible and that Google can read it.

4. Test Your Robots.txt File

First, verify that anyone can view your robots.txt file by opening a private browser window and entering your robots.txt URL.

For example: “https://semrush.com/robots.txt.”

Example of https://semrush.com/robots.txt search

If you see your robots.txt content, test the markup.

Google provides two testing options:

  1. The robots.txt report in Search Console
  2. Google’s open-source robots.txt library (advanced)

Use the robots.txt report in Search Console if you’re not an advanced user.

Open the robots.txt report.

If you haven’t connected your website to Search Console, add a property and verify site ownership first.

If you already have verified properties, select one from the drop-down after opening the robots.txt report.

The tool reports syntax warnings and errors.

Display of warnings and errors

Edit errors or warnings directly on the page and retest as you go.

Changes made within the robots.txt report aren’t saved to your site’s live robots.txt file, so copy and paste corrected code into your actual robots.txt file.

Semrush’s Site Audit tool can also check for robots.txt issues.

Set up a project and run an audit.

When the tool is ready, navigate to the “Issues” tab and search for “robots.txt.”

Search for "robots.txt" in Site Audit

Click “Robots.txt file has format errors” if it appears.

"Robots.txt file has format errors"

View the list of invalid lines.

List of invalid lines

Click “Why and how to fix it” for specific instructions.

"Why and how to fix it" section

Check your robots.txt file regularly. Even small errors can affect your site’s indexability.

Robots.txt Best Practices

Use a New Line for Each Directive

Place each directive on its own line to ensure search engines can read them and follow the instructions.

Incorrect example:

User-agent: * Disallow: /admin/
Disallow: /directory/

Correct example:

User-agent: *
Disallow: /admin/
Disallow: /directory/

Use Each User-Agent Only Once

List each user-agent only once to keep the file organized and reduce the risk of human error.

Confusing example:

User-agent: Googlebot
Disallow: /example-page
User-agent: Googlebot
Disallow: /example-page-2

Clean example:

User-agent: Googlebot
Disallow: /example-page
Disallow: /example-page-2

Writing all directives under the same user-agent is cleaner and helps you stay organized.

Use Wildcards to Clarify Directions

Use wildcards (*) to apply directives broadly.

To prevent search engines from accessing URLs with parameters, you could technically list them out one by one.

However, you can simplify your directions with a wildcard.

Inefficient example:

User-agent: *
Disallow: /shoes/vans?
Disallow: /shoes/nike?
Disallow: /shoes/adidas?

Efficient example:

User-agent: *
Disallow: /shoes/*?

The above example blocks all search engine bots from crawling all URLs under the “/shoes/” subfolder that contain a question mark.

Use ‘$’ to Indicate the End of a URL

The “$” symbol indicates the end of a URL.

To block search engines from crawling all files of a certain type, “$” helps you avoid listing each file individually.

Inefficient:

User-agent: *
Disallow: /photo-a.jpg
Disallow: /photo-b.jpg
Disallow: /photo-c.jpg

Efficient:

User-agent: *
Disallow: /*.jpg$

Use “$” carefully because mistakes can lead to accidental unblocking.

Use ‘#’ to Add Comments

Add comments by starting a line with “#”. Crawlers ignore anything that begins with a hash.

For example:

User-agent: *
#Landing Pages
Disallow: /landing/
Disallow: /lp/
#Files
Disallow: /files/
Disallow: /private-files/
#Websites
Allow: /website/*
Disallow: /website/search/*

Developers sometimes add humorous comments using hashes, since most users never see the file.

For example, YouTube’s robots.txt file reads: “Created in the distant future (the year 2000) after the robotic uprising of the mid 90’s which wiped out all humans.”

YouTube's robots.txt file example

And Nike’s robots.txt reads “just crawl it” (a nod to its “just do it” tagline) and features the brand’s logo.

Nike's robots.txt example

Use Separate Robots.txt Files for Different Subdomains

Robots.txt files only control crawling on the subdomain where they reside, which means you may need multiple files.

If your site is “domain.com” and your blog is “blog.domain.com,” create a robots.txt file for both the domain’s root directory and the blog’s root directory.
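For example, the two files might look like this sketch, each served from its own root (the disallowed paths are illustrative placeholders):

# https://domain.com/robots.txt (illustrative)
User-agent: *
Disallow: /archive/

# https://blog.domain.com/robots.txt (illustrative)
User-agent: *
Disallow: /drafts/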

5 Robots.txt Mistakes to Avoid

When creating your robots.txt file, watch out for the following common mistakes:

1. Not Including Robots.txt in the Root Directory

Your robots.txt file must be located in your site’s root directory so search engine crawlers can find it easily.

For example, if your site’s homepage is “www.example.com,” place the file at “www.example.com/robots.txt.”

If you put it in a subdirectory, like “www.example.com/contact/robots.txt,” search engines may not find it and may assume you haven’t set any crawling instructions.

2. Using Noindex Instructions in Robots.txt

Don’t use noindex instructions in robots.txt. Google doesn’t support the noindex rule in the robots.txt file.

Instead, use meta robots tags (e.g., <meta name="robots" content="noindex">) on individual pages to control indexing.

noindex meta robots tag in page source code

3. Blocking JavaScript and CSS

Avoid blocking access to JavaScript and CSS files via robots.txt unless necessary (e.g., restricting access to sensitive data).

Blocking crawling of JavaScript and CSS files makes it difficult for search engines to understand your site’s structure and content, which can harm your rankings.
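For instance, rules like the following (with illustrative paths) would keep crawlers from rendering your pages properly and are best avoided unless you have a specific reason:

# what to avoid (illustrative paths)
User-agent: *
Disallow: /assets/js/
Disallow: /*.css$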

Further reading: JavaScript SEO: How to Optimize JS for Search Engines

4. Not Blocking Access to Your Unfinished Site or Pages

Block search engines from crawling unfinished versions of your site to keep them from being found before you’re ready (and also use a meta robots noindex tag on each unfinished page).

Search engines crawling and indexing an in-development page can lead to a poor user experience and potential duplicate content issues.

Use robots.txt to keep unfinished content private until you’re ready to launch.
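On a staging site or subdomain, that can be as simple as reusing the block-everything rule shown earlier (remember to remove it at launch):

User-agent: *
Disallow: /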

5. Using Absolute URLs

Use relative URLs (paths) in your robots.txt file to make it easier to manage and maintain.

Absolute URLs are unnecessary and can cause errors if your domain changes.

❌ Example with absolute URLs (not recommended):

User-agent: *
Disallow: https://www.example.com/private-directory/
Disallow: https://www.example.com/temp/
Allow: https://www.example.com/important-directory/

✅ Example with relative URLs (recommended):

User-agent: *
Disallow: /private-directory/
Disallow: /temp/
Allow: /important-directory/

Keep Your Robots.txt File Error-Free

Now that you understand how robots.txt files work, make sure yours is optimized. Even small mistakes can affect how your site is crawled, indexed, and displayed in search results.

Semrush’s Site Audit tool makes it easy to analyze your robots.txt file for errors and provides actionable recommendations to fix any issues.