
Sitemap, robots.txt and indexing - what Google really needs from your site

Once your new website goes live, the first thing that happens is: nothing. Google doesn't know you yet. Before your pages show up in the search results, three things have to work together cleanly: Google has to find your site, be allowed to read it, and decide to include it. That's exactly what sitemaps, robots.txt and indexing are for. We'll explain - without the jargon - what these three things do, and which of them you actually need.

Crawling, indexing, ranking - the difference

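These three terms are often confused, but they're three separate steps: crawling means Google fetches and reads a page, indexing means Google stores that page in its search index, and ranking means Google decides where an indexed page appears for a given search.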

An important point to understand: a page can be crawled but not indexed. That's normal and not an error. Google doesn't index everything - only what it considers useful.

What is a sitemap?

A sitemap (usually sitemap.xml) is a simple list of all the important URLs on your site. With it, you tell Google: "Here are my pages, please take a look at them." It's not a command but a recommendation - a kind of table of contents.
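To make the "table of contents" idea concrete, here is a minimal sketch of how a sitemap.xml could be generated. The function name `build_sitemap` and the example.com URLs are placeholders for illustration, not part of any particular CMS or tool; most site generators and CMS plugins produce this file for you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap.xml document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes special characters such as "&" in the URL.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages of a small site.
    print(build_sitemap([
        "https://example.com/",
        "https://example.com/services",
        "https://example.com/contact",
    ]))
```

The file contains nothing more than one `<url>` entry with a `<loc>` per page, which is why it works as a recommendation rather than a command.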

For a small website with five to ten well-linked pages, Google will find the content even without a sitemap, because it follows the internal links. A sitemap becomes genuinely valuable when:

With a large catalogue, the sitemap makes a real difference. We run seven of our own brands in production - one of them is a product portal with around 177,000 products. Without clean, properly split sitemaps, Google would never discover a large share of them. For a simple one-pager, on the other hand, a sitemap is nice to have but not decisive.
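What "properly split" can look like, as a rough sketch: the sitemaps.org protocol allows at most 50,000 URLs per sitemap file, so a large catalogue is divided into several files that are listed in a sitemap index. The file naming scheme (`sitemap-products-N.xml`), the example.com URLs and the helper names are assumptions for illustration, not a description of our production setup.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap_index(base_url, file_count):
    """Return a sitemap index (as a string) pointing at the split files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, file_count + 1):
        entry = ET.SubElement(index, "sitemap")
        # Hypothetical naming scheme for the individual sitemap files.
        ET.SubElement(entry, "loc").text = f"{base_url}/sitemap-products-{n}.xml"
    return ET.tostring(index, encoding="unicode")


# Placeholder product URLs, sized like the catalogue mentioned above.
urls = [f"https://example.com/products/{i}" for i in range(177_000)]
parts = list(chunk(urls))
print(len(parts))       # 4 files: three full ones and one with the remainder
print(len(parts[-1]))   # 27000
```

Each slice would then be written out as its own sitemap file, and only the index needs to be submitted or referenced.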

What does robots.txt do?

robots.txt is a small text file in the root directory of your domain (reachable at yourdomain.com/robots.txt). It tells crawlers such as Googlebot which areas they are allowed to visit and which they aren't. Typical entries block login areas, shopping carts or internal tools - pages, in other words, that have no business showing up in search.

A common misconception is that robots.txt prevents indexing. It doesn't: it only controls crawling. A page blocked via robots.txt can still end up in the index if other pages link to it - but then it appears without a description. If you really want to keep a page out of search, that's handled with a noindex meta tag, not with robots.txt. The page must stay crawlable for this to work, because Google can only obey a noindex tag it is allowed to read.

The robots.txt file usually also contains a Sitemap: line pointing to your sitemap - that way search engines find it automatically.
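As an illustration, the following sketch feeds a typical small robots.txt to Python's standard-library parser and asks which URLs a crawler may fetch. The blocked paths and the example.com domain are made up for the example. Note that Python's parser is a simplified reading of the rules and does not reproduce every detail of how Google interprets them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block private areas, point to the sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login/
Disallow: /cart/
Disallow: /internal/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Public pages stay crawlable, the blocked areas do not.
print(parser.can_fetch("Googlebot", "https://example.com/products/"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/cart/"))      # False

# The Sitemap: line is picked up as well (requires Python 3.8+).
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```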

The most common and most expensive mistake

The classic one: during development, the site is hidden from Google with noindex or a complete robots.txt block - which is correct. When it goes live, everyone forgets to remove that block again. The result: the finished site is online, looks great, but Google ignores it completely. This can go unnoticed for weeks.

That's why every serious launch includes a check: does the robots.txt accidentally contain Disallow: /? Is there a stray noindex somewhere? We actively check this on every project before we call a site "live."
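Such a check is easy to automate. The sketch below is one possible version, not our actual tooling: the names `launch_problems` and `RobotsMetaFinder` are invented for this example, and it works on the robots.txt and homepage HTML passed in as strings, so fetching them from the live site is left out.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            self.directives.append((attributes.get("content") or "").lower())


def launch_problems(robots_txt, homepage_html, homepage_url):
    """Return a list of leftover staging blocks found in the given content."""
    problems = []

    # Check 1: does robots.txt (e.g. "Disallow: /") block the homepage?
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch("Googlebot", homepage_url):
        problems.append("robots.txt blocks Googlebot from the homepage")

    # Check 2: is there a stray noindex meta tag in the HTML?
    finder = RobotsMetaFinder()
    finder.feed(homepage_html)
    if any("noindex" in directive for directive in finder.directives):
        problems.append("homepage carries a noindex meta tag")

    return problems


if __name__ == "__main__":
    # Hypothetical leftovers from a staging setup.
    staging_robots = "User-agent: *\nDisallow: /\n"
    staging_html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    for problem in launch_problems(staging_robots, staging_html, "https://example.com/"):
        print(problem)
```

This only looks at the homepage and at meta tags. A noindex can also be sent as an X-Robots-Tag HTTP header, which this sketch does not inspect, so it complements a manual check rather than replacing it.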

Your practical checklist

Do you need an agency for this?

Honestly: for a small, clearly structured site, you can do all of this yourself in one to two hours - Search Console guides you through it well. You don't need to pay anyone for it. Things only get complex with large site structures, multilingual sites, online shops, or when Search Console already reports pages as "Discovered - currently not indexed." In that case the problem usually runs deeper - in the site architecture, the internal linking or the content quality.

With the fixed-price projects we build, a correctly generated sitemap, a clean robots.txt and the Search Console connection are part of the delivery from the start - not an extra that has to be expensively retrofitted later. A technically correct setup is the basic prerequisite for any later SEO work to take effect at all.

