Googlebot: How Google's web crawler works

With a market share of almost 87 percent, Google is the most popular search engine worldwide. In Switzerland, almost 91 percent of all users use Google as their search engine. This makes Google the undisputed market leader. For website owners, this is a great reason to look into SEO and Googlebot. In the following article, I will explain what Googlebot is, how it works, and why it is so important.

What is Googlebot?

Googlebot is a web crawler whose job is to crawl the World Wide Web. The web crawler, also known as a "spider," moves through the web much like a user surfing with the Google Chrome browser. It follows one link after another, giving Google a complete picture of a website.

Googlebot ensures that the websites it finds, along with their content, are included in the Google index and can therefore appear in search results.

How Googlebot works

At its core, Googlebot works quite simply. It follows a link that is defined in the HTML code with an <a> tag that includes an href attribute. Note that Googlebot does not follow links that use other formats.
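To make the link rule concrete, here is a minimal sketch in Python that collects only links defined by an <a> tag with an href attribute. It illustrates the principle and is not Google's implementation. The names LinkExtractor and extract_links and the example.com URL are my own assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags only (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anything that is not an <a> tag is ignored, e.g. a <span> with onclick.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # An <a> tag without an href attribute is not a followable link.
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all followable links found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A page that "links" to another page only through a JavaScript click handler would produce an empty list here, which is why such navigation should also be offered as a regular <a href> link.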

The process always starts with websites that are already known to Googlebot. These are checked regularly to see if anything has changed. If there is a new link in the "old content," the web crawler follows it.

The intervals at which Googlebot crawls a website vary and depend on several factors. One important factor is how often a website is updated.

When Googlebot crawls URLs that rarely display new information, it increases the crawling interval. If websites are regularly updated with new content and news, Googlebot crawls the site at shorter intervals. News websites such as newspapers and blogs benefit from more frequent crawls and a higher crawl budget, not least because of Google News and the news sitemap. But more on that in another post.
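Google does not publish how it calculates crawling intervals. The following is therefore only a toy model of the behavior described above: the interval shrinks when a page has changed and grows when it has not. The function name, the factors 0.5 and 2.0, and the limits of 1 and 720 hours are all assumptions chosen for illustration.

```python
def next_crawl_interval(current_hours, content_changed, min_hours=1.0, max_hours=720.0):
    """Toy model: halve the interval after a change, double it otherwise.

    All numbers are illustrative assumptions, not values used by Google.
    """
    factor = 0.5 if content_changed else 2.0
    # Keep the result between the lower and upper bound.
    return max(min_hours, min(max_hours, current_hours * factor))
```

In this model, a page that changes on every visit quickly approaches the minimum interval, while a static page drifts toward the maximum.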

After successful crawling, Google stores all data in the cache to prevent excessive crawling. Another Googlebot that also wants to crawl the page first accesses the cache. This conserves resources and does not burden the server hosting the website.

Please note that Googlebot can only crawl the first 15 MB of an HTML file. Each resource referenced in the HTML code, such as CSS and JavaScript, is retrieved separately, and each retrieval is subject to the same file size restriction. After the first 15 MB of the file, Googlebot stops crawling and only considers the first 15 MB of the file for indexing.

Source: Google developer documentation
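The effect of the size limit can be sketched in a few lines of Python. This only illustrates the cut-off described above. The names MAX_BYTES, indexable_part, and is_truncated are hypothetical, and counting 1 MB as 1024 * 1024 bytes is my assumption.

```python
# 15 MB limit as described above; 1 MB = 1024 * 1024 bytes is an assumption.
MAX_BYTES = 15 * 1024 * 1024


def indexable_part(body: bytes, limit: int = MAX_BYTES) -> bytes:
    """Return the part of a file that would still be considered for indexing."""
    return body[:limit]


def is_truncated(body: bytes, limit: int = MAX_BYTES) -> bool:
    """True if the file is larger than the limit, so its end would be ignored."""
    return len(body) > limit
```

For a normal HTML page this limit is rarely reached. It mostly matters when large amounts of inline data, such as embedded images or scripts, push important content toward the end of the file.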

The different types of Googlebots

There are different types of Googlebots. One version, for example, crawls normal websites that are accessed from a computer, while the smartphone bot only crawls the mobile version of a website and evaluates its content. The latter has been preferred for some time now (keyword: mobile first).

Other Google crawlers include the following:

Googlebot Image: Used to crawl image bytes for Google Images and for products that depend on images.
Googlebot News: Uses Googlebot to crawl news articles, but respects its existing Googlebot News user agent token.
Googlebot Video: Used to crawl video bytes for Google videos and for products that depend on videos.
Google Inspection Tool: The crawler used by search testing tools such as the Rich Results Test and the URL inspection in Search Console. Apart from the user agent and user agent token, it imitates Googlebot.
GoogleOther: General crawler that can be used by various product teams to retrieve publicly accessible content from websites. It can be used, for example, for one-time crawling for internal research and development.
Google StoreBot: Among other things, Google StoreBot crawls pages with product details, shopping cart pages, and payment pages.

You can see which Googlebot crawled your website by looking at the server log data, which also plays an important role in search engine optimization. We will publish a blog article on this topic later.
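As a rough sketch of such a log analysis, the following Python snippet reads the user agent from a log line and matches it against the user agent tokens of the crawlers listed above. It assumes the common "combined" log format, in which the user agent is the last quoted field. The token spellings are given as I understand them from Google's crawler documentation and should be verified there. The function name classify_log_line is hypothetical.

```python
import re

# User agent tokens (assumed spellings; verify against Google's documentation).
# More specific tokens come first, because "Googlebot" is contained in several of them.
CRAWLER_TOKENS = [
    "Googlebot-Image",
    "Googlebot-News",
    "Googlebot-Video",
    "Google-InspectionTool",
    "GoogleOther",
    "Storebot-Google",
    "Googlebot",
]

# In the combined log format, the user agent is the last quoted field of the line.
USER_AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')


def classify_log_line(line):
    """Return the matching crawler token for a log line, or None."""
    match = USER_AGENT_PATTERN.search(line)
    if not match:
        return None
    user_agent = match.group(1)
    for token in CRAWLER_TOKENS:
        if token in user_agent:
            return token
    return None
```

Keep in mind that a user agent string can be faked by anyone, so a match in the log is an indication and not proof that the request came from Google.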

How to control Googlebot

Google offers website owners several options for controlling Googlebot. This allows you to determine which content is crawled and indexed, and which content is not and therefore does not appear in search results.

Many reputable search engines follow the instructions I have listed below.

How to control crawling

Nofollow – The nofollow link attribute or meta robots tag indicates that a web crawler should not follow a link. Currently, however, this is only considered a suggestion and can therefore be ignored by search engine crawlers.
Robots.txt – With this small file, which is located in the root directory of your website, you can control what is crawled.
Password protection – If you want to ensure that your website is not crawled by search engines at all, the safest method is to set up password protection using htpasswd.
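To show how robots.txt rules are evaluated, here is a small sketch using Python's standard urllib.robotparser module. The rules, the /internal/ path, and the example.com URLs are made up for illustration. Note that Python's parser is not identical to Google's own robots.txt handling, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Example rules (assumption): Googlebot may crawl everything except /internal/,
# all other crawlers are blocked completely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /internal/

User-agent: *
Disallow: /
"""


def can_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, can_crawl("Googlebot", "https://example.com/blog/post") allows the request, while the same URL is refused for any other crawler name. A robots.txt file only controls crawling. A blocked URL can still end up in the index if other pages link to it.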

How to control indexing

Noindex – By using the meta robots tag, you instruct search engines not to index your page.
Password protection – Search engines do not index content that is protected by accounts or passwords. By putting content behind a login or password protection, you can keep it out of the index.
Delete content – A surefire way to prevent a search engine from indexing your content is to delete it.
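If you want to check whether a page actually carries a noindex instruction, a simple test could look like the following sketch. It only inspects meta tags named "robots" or "googlebot" in the HTML and does not look at HTTP headers. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html):
    """True if the HTML contains a meta robots tag with the noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

A crawler can only see this tag if it is allowed to crawl the page. If the same URL is blocked in robots.txt, the noindex instruction is never read.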

The Googlebot IP address

Google has now published a list of the IP address ranges used for crawling and accessing websites. This allows you to verify that a request really comes from Googlebot by checking its IP address against this list.

If you want to block Googlebot or ensure that only Googlebot crawls your website, you or a server administrator can block or whitelist the Google IP addresses accordingly.
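Checking an IP address against a list of ranges can be sketched with Python's standard ipaddress module. The ranges below are placeholders from the address blocks reserved for documentation and are not Google's. In practice you would load the current ranges from the list that Google publishes. The names PLACEHOLDER_RANGES and is_in_ranges are my own.

```python
import ipaddress

# Placeholder ranges from reserved documentation blocks (assumption, NOT Google's ranges).
# Replace them with the ranges from Google's published list.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def is_in_ranges(ip, ranges=PLACEHOLDER_RANGES):
    """Return True if the IP address falls into one of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never contained in an IPv6 network and vice versa,
    # so mixed lists can be checked in a single pass.
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges)
```

Because Google's ranges can change, such a list should be refreshed regularly instead of being hard-coded into a firewall rule.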

How are Googlebot and SEO related?

SEO, or search engine optimization, aims to optimize websites so that users can find them more easily via search engines. The basic requirement is that the website in question is listed in the index of Google or other search engines such as Bing and Yahoo. For companies' online marketing, it is therefore essential to understand how Googlebot works, among other things.

Conclusion

If you want your company's website to rank as high as possible in search results, the content and the website itself must be designed to be crawler-friendly. A clear structure and a server that can handle the many crawling requests are essential here. Since Googlebot also thrives on links, the focus of any search engine optimization should be on good internal linking.