Crawling: How the Web Gets Read

Lesson 2 · taught by Pip

The web has no master list of every page. Nobody handed the search engine a phone book of the internet. So how does it find pages to read in the first place?

It follows links. That's the whole trick, and it's beautifully simple.

A crawler — sometimes called a spider or a bot — is just a program that reads a page, notices every link on it, and adds those linked pages to a list of places to visit next. Then it visits the next page, finds more links, adds those too. On and on, forever.

Picture exploring a city with no map, but every doorway has signs pointing to other doorways. You walk through one, jot down where its signs point, then follow them, jotting down more as you go. Keep that up long enough and you'll have walked the whole city without ever owning a map. That's a crawler.

Starting points and the endless loop

A crawler begins with a handful of known pages — popular sites the engine already trusts. From those, it follows links outward, and because the web is so densely connected, a few starting points eventually lead almost everywhere.

This is why links matter so much. A page that nothing links to is like a house with no road leading to it: the crawler may never find it. Being linked from pages the engine already visits is the main way a new page gets discovered.
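Here is a minimal sketch of that link-following loop, just to make it concrete. It assumes a tiny made-up web stored in a dictionary (the page names, TOY_WEB, and the crawl function are all invented for illustration), so nothing is actually fetched from the internet. A real crawler is far more involved, but the core idea is the same.

```python
from collections import deque

# A toy "web": each address maps to the links found on that page.
# Every name here is made up for illustration.
TOY_WEB = {
    "home": ["news", "shop"],
    "news": ["home", "story"],
    "shop": ["checkout"],
    "story": [],
    "checkout": [],
    "island": ["home"],  # links out, but nothing links to it
}

def crawl(seeds, web):
    """Follow links outward from the seed pages; return pages in visit order."""
    to_visit = deque(seeds)  # the list of places to visit next
    seen = set(seeds)        # pages already queued, so none is queued twice
    visited = []
    while to_visit:
        page = to_visit.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

found = crawl(["home"], TOY_WEB)
print(found)              # ['home', 'news', 'shop', 'story', 'checkout']
print("island" in found)  # False
```

Starting from the single seed "home", the loop reaches every page that some visited page links to. The "island" page is never found, because no road leads to it. This toy loop stops only because the toy web is finite and never changes.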

The loop never truly finishes. The web keeps changing — new pages appear, old ones update, some vanish. So crawlers run continuously, re-visiting pages to catch what's new. A news site might get re-read every few minutes; a quiet page that never changes might be re-read once a month. The engine learns each page's rhythm and matches it.
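One simple way to picture "learning a page's rhythm" is sketched below. The lesson doesn't say how real engines schedule re-visits, so the rule here (halve the wait when a page changed, double it when it didn't) and the limits MIN_MINUTES and MAX_MINUTES are assumptions chosen only to illustrate the idea.

```python
MIN_MINUTES = 5             # assumed floor: never re-read more often than this
MAX_MINUTES = 30 * 24 * 60  # assumed ceiling: roughly one month

def next_interval(current_minutes, page_changed):
    """Shorten the wait if the page changed since the last visit, lengthen it if not."""
    if page_changed:
        proposed = current_minutes / 2
    else:
        proposed = current_minutes * 2
    # Keep the result between the floor and the ceiling.
    return max(MIN_MINUTES, min(MAX_MINUTES, proposed))

print(next_interval(60, page_changed=True))   # 30.0
print(next_interval(60, page_changed=False))  # 120
```

Under this rule, a busy news page keeps getting its wait cut until it sits at the floor, while a page that never changes drifts up toward the monthly ceiling.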

Being a polite guest

A crawler could, in theory, hammer a website with thousands of requests a second and knock it over. Good crawlers don't. They pace themselves, reading a site gently so they don't slow it down for real visitors.
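Pacing can be sketched as a small timer that remembers when each site was last visited. The class name PolitenessTimer and the gap between requests are assumptions for illustration; real crawlers choose their own pacing rules. The current time is passed in as a plain number so the example is easy to follow.

```python
from urllib.parse import urlparse

class PolitenessTimer:
    """Tracks when each site was last fetched and says how long to wait."""

    def __init__(self, min_gap_seconds=1.0):  # the gap is an assumed value
        self.min_gap = min_gap_seconds
        self.last_fetch = {}  # site host name -> time of the most recent fetch

    def seconds_to_wait(self, url, now):
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0  # never visited this site, so no need to wait
        return max(0.0, self.min_gap - (now - last))

    def record_fetch(self, url, now):
        self.last_fetch[urlparse(url).netloc] = now

timer = PolitenessTimer(min_gap_seconds=2.0)
timer.record_fetch("https://example.com/a", now=100.0)
print(timer.seconds_to_wait("https://example.com/b", now=100.5))  # 1.5
print(timer.seconds_to_wait("https://example.org/", now=100.5))   # 0.0
```

The wait applies per site, not per page: a second page on the same site has to wait its turn, while a different site can be read straight away.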

There's also a quiet agreement called robots.txt: a small text file a website can post saying, in effect, "please don't read these parts." It's like a sign on a door: staff only. Well-behaved crawlers read that sign and respect it. It isn't a lock, only a request, but the major engines honor it. That's how a site asks crawlers to stay out of, say, its checkout pages or internal tools. One caveat: the file stops the reading, not the listing, so a blocked page's address can still turn up in search results if other pages link to it.
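To see how a crawler might check that sign before entering, here is a small sketch using the robots.txt reader in Python's standard library (urllib.robotparser). The rules in ROBOTS_TXT, the crawler name "LessonBot", and the example.com addresses are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt file, written out as text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /internal/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, agent="LessonBot"):
    """Return True if the rules allow this crawler to read the address."""
    return parser.can_fetch(agent, url)

print(may_fetch("https://example.com/products/lamp"))  # True
print(may_fetch("https://example.com/checkout/cart"))  # False
```

Nothing in this code forces the crawler to obey. A polite crawler simply asks the question first and skips any address that comes back False.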

So a crawler is part explorer, part polite guest: it wanders eagerly but knocks before entering and leaves when asked. 🔦

What the crawler actually grabs

When a crawler reads a page, it doesn't just glance — it pulls down the page's full text and structure: the headings, the paragraphs, the links, the image descriptions. All of that gets handed to the next stage to be filed away. The crawler's only job is fetching; it doesn't judge whether a page is good. Judgment comes later.

One thing worth knowing: a crawler reads the page's underlying text, not the picture you see. So content hidden inside an image, with no text describing it, can be invisible to the crawler — it simply isn't there to read.
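The sketch below shows roughly what "reading the underlying text" looks like, using the HTML reader in Python's standard library (html.parser). The PageReader class and the sample page are invented for illustration, and a real crawler collects much more than this.

```python
from html.parser import HTMLParser

class PageReader(HTMLParser):
    """Collects what a crawler can read: text, links, and image descriptions."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self.image_descriptions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("alt"):
            # Only an image with a text description leaves anything to read.
            self.image_descriptions.append(attrs["alt"])

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

# A made-up page: one heading, one link, two images.
SAMPLE_PAGE = """
<h1>Lamps</h1>
<p>Our best <a href="/lamps/brass">brass lamp</a>.</p>
<img src="sale-banner.png">
<img src="brass.png" alt="A brass desk lamp">
"""

reader = PageReader()
reader.feed(SAMPLE_PAGE)
reader.close()

print(reader.links)               # ['/lamps/brass']
print(reader.image_descriptions)  # ['A brass desk lamp']
```

The sale banner has no description, so whatever words are painted inside it never show up in anything the reader collected. The second image is readable only through its description.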

Your turn

Visit almost any website and add /robots.txt directly after its main address (the homepage address, not a deeper page). You'll often see a plain list of rules telling crawlers where they may and may not go. You're reading the exact note the website left for the spiders.
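If you'd rather let a program work out the address, here is a tiny sketch using Python's standard urllib.parse module. The helper name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urljoin

def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    # A path starting with "/" replaces everything after the site's main address.
    return urljoin(page_url, "/robots.txt")

print(robots_url("https://example.com/shop/lamps?color=brass"))
# prints https://example.com/robots.txt
```

Whichever page you start from, the note always lives at the same spot: right at the front door of the site.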

Next: what happens to everything the crawler read — "Indexing: Filing the Whole Web."
