Baidu Robots.txt

A Robots.txt file tells search engine robots which parts of your website they may not crawl, and the Baidu search engine follows the instructions in Robots.txt.

Why Use Robots.txt for the Chinese Search Engine Baidu?

Submitting your site to Baidu through the website submission form asks the Baidu search engine to crawl and index your website. To exclude specific content (or web pages) from being crawled by Baiduspider, Baidu’s search engine robot (also called a spider or user agent), use Robots.txt.

  • Using Robots.txt is optional.
  • Include a Robots.txt file only if your site has content that you do not want Baidu to index.
  • If you want Baidu to access your entire website’s content, do not include a Robots.txt file.
  • Place the Robots.txt file in your website’s root directory. Before crawling your website’s pages, Baiduspider first checks the root directory of your site’s domain for a plain text file called “robots.txt”.
  • Robots.txt can improve your site’s Baidu SEO traffic and ranking when it is done right.
  • Robots.txt blocks your web pages’ content from being crawled or indexed by Baidu, but Baiduspider may still index the URLs if they can be found on other web pages on the web. So the “blocked” web pages’ URLs may appear in Baidu’s organic search results pages.
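
Because Baiduspider looks for the file at the root of the domain, the location of Robots.txt can be derived from any page URL on the site. The snippet below is a minimal sketch of that lookup in Python; the function name robots_txt_url and the example.com address are illustrative assumptions, not part of Baidu's tooling.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of the page's host."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment,
    # because robots.txt is only looked up in the root directory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only for illustration.
print(robots_txt_url("http://www.example.com/shop/item.html?id=7"))
# prints: http://www.example.com/robots.txt
```

A file placed anywhere else, such as /shop/robots.txt, would not be found by this lookup.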

Using Robots.txt for Baiduspider

Baiduspider follows two basic directives in Robots.txt files:

  • User-agent: the robot that the rules below it apply to
  • Disallow: the URL path you want to block

To block your entire site from Baidu:

User-agent: Baiduspider
Disallow: /

To block your entire site from all search engine spiders except Baiduspider:

User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
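
Rules like these can be checked locally before the file is uploaded. The sketch below uses the robots.txt parser from Python's standard library, which approximates how a rule-following crawler reads the file but is not Baidu's own implementation; the helper can_fetch, the bot name SomeOtherBot and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The "allow only Baiduspider" rules shown above.
RULES = """\
User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(agent: str, url: str, rules: str = RULES) -> bool:
    """Return True if the given user agent may fetch the URL under the rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.example.com/page.html"  # hypothetical URL
print(can_fetch("Baiduspider", page))   # True: the empty Disallow blocks nothing
print(can_fetch("SomeOtherBot", page))  # False: falls under "User-agent: *"
```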

To block a directory of your site and all the files in it, from Baiduspider:

User-agent: Baiduspider
Disallow: /cgi-bin/

To block a directory of your site, except for some of the URLs in it, from Baidu:

User-agent: Baiduspider
Allow: /cgi-bin/tmp-1
Allow: /cgi-bin/tmp-2
Disallow: /cgi-bin/
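
The effect of combining Allow and Disallow can be tested the same way. This is a sketch using Python's standard-library parser, which applies the first matching rule in file order; that agrees with the ordering above (Allow lines first), but how Baiduspider itself resolves conflicting rules is an assumption here. The helper name is_allowed and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The directory rules shown above, with two allowed exceptions.
RULES = """\
User-agent: Baiduspider
Allow: /cgi-bin/tmp-1
Allow: /cgi-bin/tmp-2
Disallow: /cgi-bin/
"""


def is_allowed(path: str, agent: str = "Baiduspider") -> bool:
    """Check a path on a hypothetical site against RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(is_allowed("/cgi-bin/tmp-1"))      # True: matches an Allow line first
print(is_allowed("/cgi-bin/script.cgi")) # False: only Disallow: /cgi-bin/ matches
print(is_allowed("/index.html"))         # True: no rule matches
```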

To block a web page from Baidu:

User-agent: Baiduspider
Disallow: /my-page.html

Baiduspider supports the wildcard symbols “*” and “$” for matching URLs:

  • “*” matches zero or more arbitrary characters.
  • “$” matches the end of the URL.

To block access to all dynamic URLs (i.e. all URLs that contain “?”) by Baiduspider:

User-agent: Baiduspider
Disallow: /*?*
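
To see how “*” and “$” behave, the matching can be approximated by translating a rule into a regular expression. The code below is a rough sketch under that assumption, not Baidu's actual matcher (Python's built-in robots.txt parser does not handle these wildcards, so the translation is done by hand); the function names and the /*.png$ pattern are illustrative.

```python
import re


def rule_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" so that
    # every "*" in the rule matches zero or more arbitrary characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A trailing "$" in the rule pins the match to the end of the URL.
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    # re.match starts at the beginning of the path, like a prefix rule.
    return rule_to_regex(pattern).match(path) is not None


print(rule_matches("/*?*", "/list.php?page=2"))      # True
print(rule_matches("/*?*", "/about.html"))           # False
print(rule_matches("/*.png$", "/img/logo.png"))      # True
print(rule_matches("/*.png$", "/img/logo.png?v=2"))  # False
```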

To block Baiduspider from certain file types while allowing others (each pattern starts with “/*” so that it matches the file extension anywhere on the site):

User-agent: Baiduspider
Allow: /*.gif$
Allow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.bmp$

Other Baiduspider Bots

While Baiduspider is responsible for crawling web search content, Baidu uses different search engine spiders/bots to crawl different types of content:

  • Baiduspider-image crawls images
  • Baiduspider-mobile crawls mobile search content
  • Baiduspider-video crawls videos
  • Baiduspider-news crawls news content
  • Baiduspider-favo crawls bookmarks
  • Baiduspider-sfkr crawls Baidu PPC/ads
  • Baiduspider-cpro crawls Baidu’s contextual advertising network
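
When reading server logs, it can help to tell these bots apart by the token in the User-Agent header. The following is a minimal sketch that assumes each bot includes its name from the list above in its User-Agent string; the function name and the sample strings are illustrative, and a User-Agent value alone does not prove that a request really comes from Baidu.

```python
from typing import Optional

# Bot tokens from the list above, mapped to what each one crawls.
BAIDU_BOTS = {
    "Baiduspider": "web search content",
    "Baiduspider-image": "images",
    "Baiduspider-mobile": "mobile search content",
    "Baiduspider-video": "videos",
    "Baiduspider-news": "news content",
    "Baiduspider-favo": "bookmarks",
    "Baiduspider-sfkr": "Baidu PPC/ads",
    "Baiduspider-cpro": "contextual advertising network",
}


def identify_baidu_bot(user_agent: str) -> Optional[str]:
    """Return the matching Baidu bot token, or None if there is none."""
    ua = user_agent.lower()
    # Try longer tokens first so that "Baiduspider-image" is not
    # reported as the plain "Baiduspider" web crawler.
    for token in sorted(BAIDU_BOTS, key=len, reverse=True):
        if token.lower() in ua:
            return token
    return None


# Sample User-Agent strings, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
for ua in samples:
    print(identify_baidu_bot(ua))
```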

Robots.txt Examples on Large Chinese Websites

  • Baidu.com blocks Baiduspider from accessing some of the site’s directories: http://www.baidu.com/robots.txt
  • Taobao.com, at the time of writing, blocks Baiduspider through the Robots.txt file in its root directory: http://www.taobao.com/robots.txt
  • Alibaba China, at the time of writing, blocks certain bots/spiders from the entire site: http://china.alibaba.com/robots.txt
