Indexing: Filing the Whole Web

Lesson 3 · taught by Pip

The crawler has read millions of pages. Now there's a mountain of text and no fast way to search it. If the engine had to re-read every page each time you typed, you'd grow old waiting. So before you ever search, it builds an index — and the way it's built is the cleverest idea in this whole course.

Flipping the book inside-out

Think about the index at the back of a textbook. The book itself runs page by page. But the index runs the other way: it lists words, and beside each word, the page numbers where that word appears. Want to find "volcano"? You don't read the book cover to cover — you flip to "volcano" in the index and it tells you: pages 12, 88, 203.

A search engine builds exactly that, for the entire web. For every word, it keeps a list of every page that contains it. Search for "lighthouse" and the engine doesn't scan the web — it flips to "lighthouse" in its index and instantly has every page that mentions it.

Because the index is organized by word, a lookup stays fast even as the web grows: the engine jumps straight to one word's list instead of re-reading every page. This inside-out arrangement has a name, an inverted index, because it turns the normal page-by-page order into word-by-word order. That one flip is what makes search feel like magic.
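Here is a tiny sketch of that flip in Python. The three pages in PAGES are made up for illustration, and the names build_index and lookup are just labels chosen here. A real engine's index is far larger and more carefully engineered, but the shape is the same: word first, pages second.

```python
from collections import defaultdict

# Hypothetical mini "web": page id -> page text (made up for illustration).
PAGES = {
    1: "the lighthouse keeper lit the lamp",
    2: "a volcano erupted near the coast",
    3: "visit the old lighthouse on the coast",
}

def build_index(pages):
    """Flip page -> words into word -> sorted list of page ids."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

INDEX = build_index(PAGES)

def lookup(word):
    """A single dictionary lookup; no page is re-read."""
    return INDEX.get(word.lower(), [])

print(lookup("lighthouse"))  # [1, 3]
print(lookup("volcano"))     # [2]
```

The slow work (reading every page) happens once, in build_index. Each search afterwards is just one lookup.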

More than just a word list

A plain list of pages-per-word would work, but the index quietly records more, because that extra detail makes results far better later.

For each word on each page, the engine often notes where it appeared. Was "lighthouse" in the page's title? In a big heading? Buried in the footer? A word in the title usually means the page is really about that thing; the same word in the footer means much less. Recording the position lets the engine weigh it.
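As a rough sketch, here is how "where the word appeared" could be turned into a number. The weights in FIELD_WEIGHT are assumptions picked for illustration, not values any real engine publishes, and the two pages are invented.

```python
# Assumed weights, purely illustrative: a title hit counts most, a footer hit least.
FIELD_WEIGHT = {"title": 3.0, "heading": 2.0, "body": 1.0, "footer": 0.2}

# Two hypothetical pages, already split into fields.
PAGE_A = {
    "title": "lighthouse history",
    "heading": "famous towers",
    "body": "the lighthouse stood for a century",
    "footer": "contact us",
}
PAGE_B = {
    "title": "seaside hotel",
    "heading": "rooms and rates",
    "body": "book a room today",
    "footer": "near the lighthouse car park",
}

def field_hits(page, word):
    """List the fields of a page in which the word appears."""
    word = word.lower()
    return [field for field, text in page.items() if word in text.lower().split()]

def field_score(page, word):
    """Add up the assumed weight of every field that contains the word."""
    return sum(FIELD_WEIGHT[field] for field in field_hits(page, word))

print(field_hits(PAGE_A, "lighthouse"))  # ['title', 'body']
print(field_hits(PAGE_B, "lighthouse"))  # ['footer']
print(field_score(PAGE_A, "lighthouse") > field_score(PAGE_B, "lighthouse"))  # True
```

Both pages contain the word, but the page with "lighthouse" in its title scores higher than the one that only mentions it in the footer.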

It also notes how often a word shows up, and which words sit near each other. A page where "lighthouse" and "keeper" appear side by side is probably about exactly what it sounds like — and the index remembers that closeness.
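One common way to capture both "how often" and "how close" is to store the position of every word on the page. The sketch below assumes that approach. The function names and the sample sentence are invented for illustration.

```python
from collections import defaultdict

def positions_by_word(text):
    """Map each word to the list of positions (0, 1, 2, ...) where it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(text.lower().split()):
        positions[word].append(i)
    return dict(positions)

def frequency(positions, word):
    """How many times the word shows up on the page."""
    return len(positions.get(word, []))

def side_by_side(positions, first, second):
    """True if `second` appears immediately after `first` somewhere on the page."""
    following = set(positions.get(second, []))
    return any(p + 1 in following for p in positions.get(first, []))

TEXT = "the lighthouse keeper climbed the lighthouse stairs"
POS = positions_by_word(TEXT)

print(POS["lighthouse"])                          # [1, 5]
print(frequency(POS, "lighthouse"))               # 2
print(side_by_side(POS, "lighthouse", "keeper"))  # True
```

With positions stored, frequency is just the length of the list, and closeness is a comparison between two lists.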

Think of a librarian who doesn't just note that a book mentions "lighthouse," but jots whether it's the title, a chapter heading, or one passing line on page 200. Same word, very different meaning — and the notes capture the difference. 🔦

Keeping the notes honest

The index isn't carved in stone. As crawlers re-read pages, the index updates: new pages get added, the entries for changed pages get rewritten, and pages that vanished get removed. It's a living set of notes, constantly tidied so it stays close to the real web.
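A minimal sketch of that tidying, assuming the index also remembers which words it filed for each page so it can cleanly un-file them later. The class name LiveIndex and its methods are hypothetical.

```python
class LiveIndex:
    """A toy index that can add, rewrite, and remove pages."""

    def __init__(self):
        self.postings = {}    # word -> set of page ids
        self.page_words = {}  # page id -> set of words currently filed for it

    def remove(self, page_id):
        """Forget a page that vanished; does nothing if it was never filed."""
        for word in self.page_words.pop(page_id, set()):
            ids = self.postings[word]
            ids.discard(page_id)
            if not ids:
                del self.postings[word]  # no pages left for this word

    def upsert(self, page_id, text):
        """Add a new page, or rewrite the entries of a changed one."""
        self.remove(page_id)  # clear the old notes first
        words = set(text.lower().split())
        self.page_words[page_id] = words
        for word in words:
            self.postings.setdefault(word, set()).add(page_id)

    def lookup(self, word):
        return sorted(self.postings.get(word.lower(), set()))

idx = LiveIndex()
idx.upsert(1, "old lighthouse")
idx.upsert(2, "lighthouse keeper")
print(idx.lookup("lighthouse"))  # [1, 2]

idx.upsert(1, "old windmill")    # page 1 changed
print(idx.lookup("lighthouse"))  # [2]

idx.remove(2)                    # page 2 vanished
print(idx.lookup("lighthouse"))  # []
```

Rewriting a page is treated as "remove the old notes, then file the new ones," which keeps stale words from lingering in the index.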

Not everything makes the cut, either. The engine may skip pages it judges to be empty, duplicated, or spammy — there's no point filing junk. So the index is both enormous and curated: vast, but not a dumping ground.
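Here is a deliberately simple sketch of two of those checks: near-empty pages and exact duplicates. Judging spam is much harder and is not attempted here. The function name should_index and the min_words threshold are assumptions made for illustration.

```python
import hashlib

def should_index(text, seen_fingerprints, min_words=5):
    """Return True if the page looks worth filing.

    Skips pages with very little text and pages whose text exactly
    matches (ignoring case and spacing) one that was already filed.
    """
    words = text.lower().split()
    if len(words) < min_words:
        return False  # too empty to be useful
    fingerprint = hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return False  # same text already filed
    seen_fingerprints.add(fingerprint)
    return True

seen = set()
print(should_index("the lighthouse keeper lit the lamp", seen))   # True
print(should_index("The lighthouse  keeper lit the lamp", seen))  # False (duplicate)
print(should_index("hello", seen))                                # False (too short)
```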

Why this sets up ranking

Notice what the index gives us. Type a word, and in an instant the engine has every page that contains it — often thousands or millions. That's wonderful and also a problem: nobody wants a million results. They want the best ten.

So the index answers "which pages match?" Lightning-fast. But it leaves the harder question wide open: of all these matches, which ones go on top? That question is ranking, and it's where we head next.

Your turn

Search for a fairly specific phrase and, if your search engine shows an "About X results" count, note it. That number is only a rough estimate of how many pages the index matched for your words. Now add one more word to narrow the search and see whether the estimate drops. When it does, you're watching the index filter from millions of matches down toward a handful.
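Why would adding a word shrink the count? One simple way to picture it, assuming every word in the search must appear on the page, is that the engine keeps only the pages found in every word's list. The tiny index below is invented for illustration.

```python
# Hypothetical index: word -> set of page ids that contain it.
INDEX = {
    "lighthouse": {1, 3, 4, 7},
    "keeper": {1, 4, 9},
    "maine": {4, 12},
}

def matches(index, query):
    """Pages that contain every word in the query (set intersection)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

print(len(matches(INDEX, "lighthouse")))               # 4
print(len(matches(INDEX, "lighthouse keeper")))        # 2
print(len(matches(INDEX, "lighthouse keeper maine")))  # 1
```

Each extra word can only keep the list the same size or make it shorter, which is the narrowing you see in the exercise.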

Next: the job everyone argues about — "Ranking: Choosing the Best Answer."
