sitemap.xml and robots.txt

What are sitemap.xml and robots.txt

What are these two things? In simple terms, they are two near-essential files for every website: documentation that describes the site to search-engine crawlers.

After we’ve worked hard to complete a website, we usually hope that many users will visit it—that is, we want traffic. Where does traffic come from? A major source is search engines.

So how does a search engine know this website exists? Or, digging deeper, how does a search engine provide a list of websites based on keywords?

As we all know, search engines crawl web pages using bot programs. These bots follow links on various websites and save the content of those pages.

If your website is linked from another site (for example, through a link exchange), the crawler will eventually follow that link to your pages, and your content ends up in the search engine's index. That saves you some SEO effort.
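Just to make the mechanism concrete, here is a toy sketch of one step of that crawl loop, using only Python's standard library (the start URL is a placeholder; a real crawler also needs a queue, deduplication, and politeness delays):

from urllib.request import urlopen
from html.parser import HTMLParser

# Collect the href of every <a> tag on a page.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Fetch one page, keep its content, and discover links to visit next.
html = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # the crawler repeats this for each discovered link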

But a newly built website can hardly count on others linking to it. So you may need to submit your site to a search engine like Google yourself, telling it: “Hey, I have a website here, please index it.”

Soon, another problem appears: how does the search engine’s crawler know which pages on your site may be crawled and which may not? For example, you may not want the crawler to touch user profile pages, settings pages, or long-expired links.
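Jumping ahead a little: rules like these go into a file called robots.txt, which we will look at below. A minimal sketch might look like this (the paths are invented for illustration; a # starts a comment in robots.txt):

User-agent: *              # applies to every crawler
Disallow: /user/           # user profile pages
Disallow: /settings/       # settings pages
Disallow: /old-campaign/   # an expired section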

There’s also the scenario where your site is very large. Bilibili, for instance, has sections for Dance, Food, and Anime, each containing an enormous number of videos. You don’t want the crawler wandering back and forth aimlessly; you want to hand it a content outline that tells it how the site is organized, improving crawl efficiency.

Furthermore, you might want to mark which pages are updated every hour, so the crawler is welcome back hourly, and which pages haven’t changed in years, so a single visit is enough and the server isn’t burdened. There are also “orphan pages”: pages that no link leads to, as if hidden, which a crawler normally never discovers at all. In summary, you may need to provide a documentation file that introduces the crawler to the specifics of each page on your site.
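That file is the sitemap. As a minimal sketch with made-up URLs: the standard <changefreq> and <lastmod> tags carry the update hints described above, and listing an orphan page is the only way a crawler will ever find it:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/ranking</loc>
    <changefreq>hourly</changefreq>  <!-- updated hourly; revisit often -->
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2020-01-01</lastmod>
    <changefreq>yearly</changefreq>  <!-- barely changes; one visit is enough -->
  </url>
  <url>
    <loc>https://example.com/hidden/landing</loc>  <!-- an orphan page no link points to -->
  </url>
</urlset>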

Check out Bilibili’s configuration

robots.txt and sitemap.xml are exactly the documentation files mentioned above. Let’s take Bilibili as an example and open them to see what’s described inside.

First, look at robots.txt. The first part looks like this:

User-agent: *
Disallow: /medialist/detail/
Disallow: /index.html

Interpretation: no matter who the crawler is, these two paths must not be crawled. The first, /medialist/detail/, is already inaccessible, so it’s unclear why it is explicitly banned. The second probably belongs to an old site structure whose URLs contained index.html; since that form is no longer used, crawlers are told to skip it. All mainstream search engines are welcome to crawl the rest.

There is one exception, near the bottom, for Facebook’s and Twitter’s crawlers (shown below): Bilibili allows them to crawl only https://www.bilibili.com/tbhx, a promotional page for a game, likely aimed at overseas marketing.

User-agent: facebookexternalhit
Allow: /tbhx/hero

User-agent: Facebot
Allow: /tbhx/hero

User-agent: Twitterbot
Allow: /tbhx/hero

Finally there is:

User-agent: *
Disallow: /

This means: any miscellaneous crawler not explicitly named earlier in the file is not allowed to crawl anything.

This is a whitelist strategy. It targets beginners and unscrupulous small operators who download a ready-made crawler from GitHub and hammer the site, adding meaningless load to Bilibili’s servers. A crawler is free to ignore this file, but the rule still stops anyone who lacks the ability to modify the open-source code they are running.
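For the well-behaved side of that agreement, Python’s standard library ships urllib.robotparser, which checks a URL against these rules. A minimal sketch (the second bot name is a placeholder):

from urllib import robotparser

# Download and parse the site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.bilibili.com/robots.txt")
rp.read()

# A polite crawler asks before fetching each URL; a rude one skips this step.
print(rp.can_fetch("Googlebot", "https://www.bilibili.com/"))
print(rp.can_fetch("MyHomemadeBot", "https://www.bilibili.com/"))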

Now let’s look at the sitemap. Open sitemap.xml in a browser: it is a typical sitemap of sitemaps, that is, a sitemap index. I’ve noticed that some of the listed sitemaps lead to 404s, so perhaps only the core entries are still maintained by the developers. That would be normal, but it’s only my guess.
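A sitemap index doesn’t list pages directly; it points to further sitemap files, roughly like this (the file names below are invented, not Bilibili’s actual ones):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap/ranking.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap/dance.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap/read-detail.xml</loc>
  </sitemap>
</sitemapindex>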

Still, you can see that the index deliberately guides crawlers to “ranking,” which may be a way to funnel search traffic to the most popular content on the entire site.

Other entries appear to be meant specifically for Baidu’s crawler, telling it which sections exist so it can crawl more efficiently.

Likewise, the read/detail.xml sitemap presumably helps search engines index Bilibili’s text-and-image articles. All of this pays off: when we search for an article, we are often directed to Bilibili, which is yet another way the site channels traffic to itself.

Finally, it’s worth adding that these are just guidelines—a gentleman’s agreement—not absolute rules.

You might forbid everything, but some crawlers still come anyway. You might allow everything, but if Google hasn’t indexed your domain, the crawler still won’t visit.