
How we crawl: Crawlee, Readability, and segment-level reindexing

Dmitrii Kuzmenkov

Software Engineer, IndexFox.ai

October 8, 2025 (updated March 22, 2026)

Indexing is where most search widgets quietly fail. They crawl your site once, embed everything, then drift out of date until a customer files a support ticket. We've shipped four iterations of the crawl pipeline. This is what stuck.

Discovery: Crawlee, not curl + regex

The discovery layer is Crawlee, the Apify framework. We tried writing our own crawler. Three times. Each time we ended up reimplementing politeness, retries, robots.txt handling, HTTP/2 connection reuse, and the fifty other things a real crawler needs. Crawlee handles all of that and exposes hooks for what we actually care about (content extraction, request routing, fingerprinting). The rest is our domain logic.

One war story: HTTP/2 GOAWAY frames from large CDNs caused the crawler to silently retry the same URL forever. We documented the fix in our internal runbook and eventually filed it upstream. If you're rolling your own, you will hit this.
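The post does not describe the actual fix, so the following is only a generic guard for a home-grown crawler, not what Crawlee does internally: cap the number of attempts per URL so a connection that keeps getting torn down cannot loop forever. The names `fetch_with_retry_cap`, `RetryBudgetExceeded`, `MAX_ATTEMPTS`, and the `fetch` callable are all hypothetical.

```python
from collections import Counter
from typing import Callable

MAX_ATTEMPTS = 3  # assumed limit, tune per site


class RetryBudgetExceeded(Exception):
    """Raised when a URL has used up its retry budget."""


def fetch_with_retry_cap(
    url: str,
    fetch: Callable[[str], bytes],
    attempts: Counter,
    max_attempts: int = MAX_ATTEMPTS,
) -> bytes:
    """Call fetch(url), retrying on ConnectionError up to max_attempts total tries.

    `attempts` is shared across calls, so the cap also holds when the same
    URL is re-enqueued by another part of the crawler.
    """
    while attempts[url] < max_attempts:
        attempts[url] += 1
        try:
            return fetch(url)
        except ConnectionError:
            continue  # connection was torn down (e.g. after a GOAWAY); try again
    raise RetryBudgetExceeded(f"{url}: gave up after {attempts[url]} attempts")
```

Keeping the counter outside the function is the point of the sketch: a retry loop that resets its own counter on every re-enqueue is how "forever" happens.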

Extraction: Readability, then surgery

We use Mozilla's Readability library — the same engine that powers Firefox's reader mode — as the first pass. It nukes navigation, ads, footers, and cookie banners with surprising accuracy. Then we run our own post-processing:

Heading hierarchy normalization (H1/H2/H3 must form a tree, not a list).
List and table preservation — tables are first-class because product pages live or die by spec tables.
Code-block detection and language hinting for documentation sites.
Anchor preservation so we can deep-link search results to #section-id.
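To make the first item concrete, here is a minimal sketch of heading normalization, assuming headings arrive as a flat list of `(level, text)` pairs in document order. `HeadingNode` and `build_heading_tree` are illustrative names, and clamping a skipped level to one below its parent is one possible policy, not necessarily the one used in production.

```python
from dataclasses import dataclass, field


@dataclass
class HeadingNode:
    level: int  # normalized depth: 1 for top-level headings, 2 below that, ...
    text: str
    children: list["HeadingNode"] = field(default_factory=list)


def build_heading_tree(headings: list[tuple[int, str]]) -> HeadingNode:
    """Turn a flat [(level, text), ...] list into a tree under a synthetic root.

    A heading that skips levels (H1 followed directly by H3) is attached to
    its nearest higher-ranked predecessor and given depth parent + 1, so
    every node has a real parent.
    """
    root = HeadingNode(level=0, text="")
    # Stack of (original_level, node): the path from the root to the latest heading.
    stack: list[tuple[int, HeadingNode]] = [(0, root)]
    for level, text in headings:
        level = max(1, level)  # guard against malformed input
        # Pop until the top of the stack outranks this heading.
        while stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1]
        node = HeadingNode(level=parent.level + 1, text=text)
        parent.children.append(node)
        stack.append((level, node))
    return root
```

For example, `[(1, "API"), (3, "Rate limits"), (2, "Auth")]` yields "Rate limits" and "Auth" as siblings under "API", both at depth 2.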

The output is not HTML. It's a structured tree of segments: title, description, paragraphs, list items, code, table rows. Each segment carries its position, its anchor, and its parent heading. That structure is what we embed.
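One way to picture a segment is as a small immutable record. The field names below are assumptions chosen to mirror the description above (the real schema is not shown in this post), and `deep_link` is a hypothetical helper showing why the anchor is worth keeping.

```python
from dataclasses import dataclass
from typing import Literal, Optional

SegmentKind = Literal["title", "description", "paragraph", "list_item", "code", "table_row"]


@dataclass(frozen=True)
class Segment:
    kind: SegmentKind
    text: str
    position: int                  # order of the segment within the page
    anchor: Optional[str]          # nearest id usable as a #fragment, if any
    parent_heading: Optional[str]  # text of the heading this segment sits under


def deep_link(page_url: str, segment: Segment) -> str:
    """Return the page URL, with the segment's anchor as a fragment when it has one."""
    return f"{page_url}#{segment.anchor}" if segment.anchor else page_url
```

A paragraph segment with anchor `rate-limits` on `https://example.com/docs/api` would link to `https://example.com/docs/api#rate-limits`; a segment without an anchor falls back to the page URL.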

Segment-level reindexing

Here's the trick that saves us money. We don't reindex pages. We reindex segments.

Every segment gets a stable content hash. When we re-crawl, the URL stays the same but most segments hash-match the previous crawl. Only the segments whose hash changed get re-embedded. On a typical documentation site recrawl, that's a single-digit percentage of segments, not 100%.

page_hash    = sha256(canonical_html)
segment_hash = sha256(segment_text + segment_position + parent_heading)
embed only if segment_hash not in store
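The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the production implementation: the in-memory `set` stands in for whatever store holds known hashes, the tuple layout `(text, position, parent_heading)` is an assumption, and the `\x1f` separator is an addition of this sketch.

```python
import hashlib


def segment_hash(text: str, position: int, parent_heading: str) -> str:
    """SHA-256 over the three fields from the pseudocode, joined with a separator."""
    # The separator is an assumption of this sketch: plain concatenation
    # would let ("ab", "c") and ("a", "bc") produce the same input.
    payload = "\x1f".join([text, str(position), parent_heading])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def segments_to_embed(
    segments: list[tuple[str, int, str]], store: set[str]
) -> list[tuple[str, int, str]]:
    """Return only the segments whose hash is not already in the store."""
    return [s for s in segments if segment_hash(*s) not in store]
```

Given a store built from the previous crawl, passing the new crawl's segments returns just the ones that need embedding. One consequence of hashing the position: if it is an absolute index on the page, inserting a paragraph near the top changes the hash of everything below it. The post does not say how position is defined, so treat that as a design question to settle rather than a known property of the pipeline.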

For a customer with 50,000 pages and weekly recrawls, this is the difference between a $200 monthly embedding bill and a $4 one.
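As a back-of-the-envelope check, the two bills line up with the "single-digit percentage" figure quoted earlier. The only inputs are the two dollar amounts from the text; the linear-cost assumption is an addition of this sketch.

```python
full_reindex_bill = 200.0   # USD per month, from the text
segment_level_bill = 4.0    # USD per month, from the text

# Assumption: the bill scales linearly with the number of segments embedded.
implied_reembed_fraction = segment_level_bill / full_reindex_bill
savings_factor = full_reindex_bill / segment_level_bill

print(f"implied re-embedded fraction: {implied_reembed_fraction:.0%}")  # 2%
print(f"savings factor: {savings_factor:.0f}x")                         # 50x
```

Under that assumption, roughly 2% of segments change between recrawls for this customer.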

Why structure matters for retrieval, not just for cost

Storing segments instead of full pages gives you something else: granular results. When a user searches for "rate limit", we don't return the URL of a 6,000-word API reference page. We return the specific paragraph, with a deep link to its anchor, and we know which H2 it lives under. The widget renders that hierarchy in the result card.

Google has a name for the retrieval side of this: passage ranking — identifying the specific section of a longer page that answers a query, rather than ranking the page as a single unit. We've been doing the indexing equivalent for two years because reranking smaller, semantically clean units beats reranking 6,000-word blobs, every time.

What we still get wrong

Single-page applications. If your site renders the content client-side and doesn't ship server-rendered HTML or a clean DOM that a headless browser can scrape, the crawler is fighting your framework. Our recommendation, repeated to every customer: render the content the search engine should see. The same advice Google has been giving since 2015 still applies, and it applies to our crawler too.

What's next

We're testing incremental crawls triggered by sitemap lastmod and customer-side publish webhooks (the same hook most teams already use to purge their CDN cache via API). Most customers don't need a full weekly crawl — they need "tell me when something actually changed, then crawl only that." The data so far suggests we can cut crawl traffic by ~80% on documentation sites without measurable freshness regression. Will write it up when it ships.
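The sitemap half of that idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the shipping design: `urls_changed_since` is a hypothetical helper, `last_crawl` must be timezone-aware, naive `lastmod` values are treated as UTC, and URLs with a missing or unparseable `lastmod` are recrawled to be safe.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_changed_since(sitemap_xml: str, last_crawl: datetime) -> list[str]:
    """Return <loc> values whose <lastmod> is newer than last_crawl.

    URLs without a parseable <lastmod> are included, since there is no
    evidence they are unchanged.
    """
    changed = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        try:
            modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if modified.tzinfo is None:
                modified = modified.replace(tzinfo=timezone.utc)  # assume UTC
        except ValueError:
            modified = None  # missing or malformed lastmod
        if loc and (modified is None or modified > last_crawl):
            changed.append(loc)
    return changed
```

The result is the list of URLs worth enqueueing for the next crawl. How far to trust `lastmod` is a separate question, since some sites stamp every URL with the build time.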