Sunday, August 17, 2008

When we set up a Blogger blog, we give up control of files and folders, and let Blogger maintain the structure of the blog.

When we publish a Blogger blog (either to BlogSpot, or to a Google Custom Domain), all that we do is post, and maintain the blog template. Even with the files and folders controlled by Blogger, and normally hidden from view, there are ways to examine the contents of some files.

Those who would be better off not examining file content do so anyway, become confused, and stress themselves needlessly. Every week, we read anxious queries like
Help! My blog has been hacked!!
or
My robots.txt file is blocking my blog from being indexed!

Blogger maintains "robots.txt", in each blog, on our behalf. You can make two Settings changes that affect the file, but mostly its content is controlled by Blogger code.

Occasionally, Blogger makes changes to our blogs in general, and changes the content of "robots.txt" to support the changes made. Recently, changes to "feedReaderJson" necessitated a change to "robots.txt".
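
Any reader can examine the current file, by appending "/robots.txt" to the blog address. Here's a minimal sketch (Python 3, standard library only) that fetches and prints the file; the address used is this blog's, so substitute your own.

# A minimal sketch: fetch and print a blog's "robots.txt".
from urllib.request import urlopen

ROBOTS_URL = "http://blogging.nitecruzr.net/robots.txt"

with urlopen(ROBOTS_URL) as response:
    print(response.read().decode("utf-8"))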

Here's the "robots.txt" file for this blog, "blogging.nitecruzr.net", as of 2016/01.
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://blogging.nitecruzr.net/sitemap.xml

There are 3 main entries in a standard "robots.txt" file, and we see all 3 here. For an explanation of some terms, and for a demonstration illustrating the results, please refer to Google Webmaster Tools - "Analyze robots.txt". You may also be enlightened by reading The Web Robots Pages.
  • This allows access to all components of the blog ("Disallow: (null)"), to the spider "Mediapartners-Google", which is the spider that crawls pages to determine AdSense content. This entry overrides the following entry, for the specified spider.
    User-agent: Mediapartners-Google
    Disallow:
  • This disallows crawling of all URLs starting with "/search" (i.e., label searches), and allows all other blog URLs.
    User-agent: *
    Disallow: /search
    Allow: /
  • This defines the URL of the sitemap.
    Sitemap: http://blogging.nitecruzr.net/sitemap.xml
    The sitemap is separate from the blog feed, and automatically provided.
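
The sitemap is machine readable XML. If you want to verify that it exists, and see what it references, here's a minimal sketch (Python 3, standard library only). Note that, for larger blogs, the file may be a sitemap index, whose entries point at further sitemap pages rather than directly at posts; the sketch simply prints every location entry it finds.

# A minimal sketch: fetch the sitemap named in "robots.txt", and print
# every <loc> entry found. Substitute your own blog's sitemap URL.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://blogging.nitecruzr.net/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

for loc in tree.findall(".//sm:loc", NS):
    print(loc.text)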

You can use Search Console, for an analysis of your "robots.txt" file.

Here's my analysis of "robots.txt" for this blog, run using the wizard. You'll want to view this in full screen mode, zoomed in as far as possible.

Note the results of testing 3 hypothetical blog URLs, against both "Googlebot" and "Mediapartners-Google", shown at the very bottom.
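
If you want to repeat that test outside Search Console, here's a minimal sketch (Python 3, standard library only). It reads this blog's "robots.txt", then reports whether a few illustrative URLs (made up for the example, not necessarily real posts) would be allowed or blocked, for both "Googlebot" and "Mediapartners-Google".

# A minimal sketch: parse the live "robots.txt", then test some hypothetical
# blog URLs against it, for two different spiders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("http://blogging.nitecruzr.net/robots.txt")
parser.read()

test_urls = [
    "http://blogging.nitecruzr.net/",
    "http://blogging.nitecruzr.net/2008/08/some-post.html",
    "http://blogging.nitecruzr.net/search/label/robots.txt",
]

for agent in ("Googlebot", "Mediapartners-Google"):
    for url in test_urls:
        verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
        print(agent, verdict, url)

With the rules shown above, "Googlebot" is blocked from the "/search" URL and allowed everywhere else, while "Mediapartners-Google" is allowed everywhere, since its own entry overrides the generic one.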


I've now examined a dozen or so similar files, for blogs published both to native BlogSpot and to custom domains, and excepting the URL of the sitemap, all files have been identical in content. My conclusion is that this is a normal file, and unless we start seeing a flood of complaints about indexing problems, I see no reason to suspect a problem.

So, the next time someone comes to you moaning
My robots.txt file is blocking my blog from being indexed!
you can assure them
No, your "robots.txt" file is normal.
Then, introduce them to Google Webmaster Tools, and its many diagnostic reports.
