Sitemap, robots.txt and indexing - what Google really needs from your site

Once your new website goes live, the first thing that happens is: nothing. Google doesn't know you yet. Before your pages show up in the search results, three things have to work together cleanly: Google has to find your site, be allowed to read it, and decide to include it. That's exactly what sitemaps, robots.txt and indexing are for. We'll explain - without the jargon - what these three things do, and which of them you actually need.

Crawling, indexing, ranking - the difference

These three terms are often confused, but they're three separate steps:

Crawling: Googlebot visits your site and reads the content - like an automated visitor.
Indexing: Google decides to add the page to its huge catalogue (the index). Only indexed pages can appear in search at all.
Ranking: When someone runs a search, Google decides at which position your indexed page appears.

An important point to understand: a page can be crawled but not indexed. That's normal and not an error. Google doesn't index everything - only what it considers useful.

What is a sitemap?

A sitemap (usually sitemap.xml) is a simple list of all the important URLs on your site. With it, you tell Google: "Here are my pages, please take a look at them." It's not a command but a recommendation - a kind of table of contents.
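
To make that concrete, here is a minimal sketch of what such a file contains, generated with Python's standard library. The page URLs are made-up examples, and most website systems will generate the sitemap for you, so treat this as an illustration rather than something you have to write yourself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical example pages (assumption, not from a real site)
pages = [
    "https://yourdomain.com/",
    "https://yourdomain.com/services",
    "https://yourdomain.com/contact",
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Each page becomes one url entry with its full address in loc. That is all a basic sitemap needs.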

For a small website with five to ten well-linked pages, Google will find the content even without a sitemap, because it follows the internal links. A sitemap becomes genuinely valuable when:

your site has many subpages that aren't linked everywhere,
you publish new content regularly (blog, products, locations),
your site is freshly online and still has few backlinks.

With a large catalogue, the sitemap makes a real difference. We run seven of our own brands in production - one of them is a product portal with around 177,000 products. Without clean, properly split sitemaps, Google would never discover a large share of them. For a simple one-pager, on the other hand, a sitemap is nice to have but not decisive.
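
What "properly split" means in practice: the sitemap protocol allows at most 50,000 URLs (and 50 MB uncompressed) per sitemap file, so a large catalogue is spread over several files that are tied together by a sitemap index. The following is a rough sketch under assumptions, not our production setup: the product URLs and file names are invented for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def split_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a URL list into chunks that each fit into one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index (XML string) pointing to the individual files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical catalogue with 177,000 product pages (URLs are assumptions)
product_urls = [f"https://yourdomain.com/products/{n}" for n in range(177_000)]
chunks = split_urls(product_urls)

# Hypothetical file names for the individual sitemap files
file_urls = [
    f"https://yourdomain.com/sitemap-products-{i}.xml"
    for i in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(file_urls)

print(len(chunks), [len(chunk) for chunk in chunks])
# -> 4 [50000, 50000, 50000, 27000]
```

Only the index file then needs to be submitted. It leads search engines to the individual files.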

What does robots.txt do?

The robots.txt is a small text file in the root directory of your domain (reachable at yourdomain.com/robots.txt). It tells crawlers such as Googlebot which areas they may visit and which they may not. Typical entries block login areas, shopping carts or internal tools - in other words, pages that have no business showing up in search.

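A common misconception is that robots.txt prevents indexing. It doesn't; it only controls crawling. A page blocked via robots.txt can still end up in the index if other pages link to it - but then it appears without a description. If you really want to keep a page out of search, that's handled with a noindex meta tag, not with robots.txt. The two don't combine, either: Google has to be able to crawl a page to see its noindex tag, so a page you want removed from search must not be blocked in robots.txt at the same time.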

The robots.txt usually also contains a reference to your sitemap, written as a Sitemap: line with the full URL - that way search engines find it automatically.
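
As an illustration, the sketch below holds a typical small robots.txt as text and uses Python's built-in urllib.robotparser to check which paths a crawler may visit. The blocked directories and the domain are assumptions chosen for the example. Python's parser also evaluates rules a little differently from Google in edge cases (first matching rule instead of the most specific one), which makes no difference for a simple file like this.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (paths and domain are assumptions)
ROBOTS_TXT = """\
User-agent: *
Disallow: /login/
Disallow: /cart/
Disallow: /internal/

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a crawler identifying as Googlebot may fetch each path
for path in ["/", "/services", "/cart/checkout", "/login/"]:
    allowed = parser.can_fetch("Googlebot", "https://yourdomain.com" + path)
    print(path, "allowed" if allowed else "blocked")

# Sitemap references found in the file (requires Python 3.8 or newer)
print(parser.site_maps())
```

The normal pages stay open, the cart and login areas are blocked, and the sitemap reference is picked up from the last line.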

The most common and most expensive mistake

The classic one: during development, the site is hidden from Google with noindex or a complete robots.txt block - which is the right thing to do at that stage. When the site goes live, it's easy to forget to remove that block again. The result: the finished site is online, looks great, but Google ignores it completely. This can go unnoticed for weeks.

That's why every serious launch includes a check: does the robots.txt accidentally contain Disallow: /? Is there a stray noindex somewhere? We actively check this on every project before we call a site "live."
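
A check like this can be automated. The following is a simplified sketch, not our actual tooling: it takes the robots.txt text and the HTML of a page and reports the two classic blockers. The function names and the sample content are invented for illustration, and in practice you would download both from the live domain first.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def robots_blocks_site(robots_txt, agent="Googlebot"):
    """True if this robots.txt content blocks the homepage for the crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


class RobotsMetaFinder(HTMLParser):
    """Collects the content of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            self.directives.append((attributes.get("content") or "").lower())


def has_noindex(html):
    """True if the page carries a noindex directive in a robots meta tag."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return any("noindex" in directive for directive in finder.directives)


# Hypothetical staging and live content (assumptions for the example)
staging_robots = "User-agent: *\nDisallow: /\n"
live_robots = "User-agent: *\nDisallow: /cart/\n"
staging_html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
live_html = "<html><head><title>Home</title></head></html>"

print(robots_blocks_site(staging_robots), robots_blocks_site(live_robots))
# -> True False
print(has_noindex(staging_html), has_noindex(live_html))
# -> True False
```

One limitation of this sketch: noindex can also be sent as an X-Robots-Tag HTTP header, which a check of the HTML alone will not catch.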

Your practical checklist

Set up Google Search Console: Google's free tool is a must. This is where you see which pages are indexed and where things are stuck.
Create and submit a sitemap: In Search Console under "Sitemaps," enter the full URL (e.g. https://yourdomain.com/sitemap.xml), not just the file name.
Check robots.txt: Open it and make sure no important areas are blocked.
Monitor indexing status: The search site:yourdomain.com gives you a rough idea of what Google knows about you. The page indexing report in Search Console is the more reliable source.
Be patient: Indexing takes time. Days to weeks are normal, especially for new domains without links.

Do you need an agency for this?

Honestly: for a small, clearly structured site, you can do all of this yourself in one to two hours - Search Console guides you through it well. You don't need to pay anyone for it. Things only get complex with large site structures, multilingual sites, online shops, or when Google has already marked pages as "Discovered - currently not indexed." In that case the problem usually runs deeper - in the site architecture, the internal linking or the content quality.

With the fixed-price projects we build, a correctly generated sitemap, a clean robots.txt and the Search Console connection are part of the delivery from the start - not an extra that has to be expensively retrofitted later. A technically correct setup is the basic prerequisite for any later SEO work to take effect at all.