I Wanted a Free Screaming Frog. I Built a Crawl Engine.

I was looking at alternatives to a paid Screaming Frog license and got what I thought was a pretty bright idea: build my own. How hard could it be?

The answer, predictably, is: harder than I expected, more interesting than I anticipated, and ultimately worth it. The result is an open-source crawl engine I'm releasing today - a TypeScript monorepo with a crawl core, an MCP server that exposes it as Claude tools, and a Chrome pipe module that handles JavaScript-rendered pages. It's production-hardened, MIT-licensed, and built to be extended.

This is the story of how it got designed the way it did.

The F1 Model

Early on I started thinking about the architecture the way Formula 1 teams think about their cars - not as one thing, but as three distinct layers: power unit, chassis, and livery. Each does a different job. Each can evolve independently. None of them should know too much about the others.

The power unit is the crawl engine core. It fetches pages, follows links, respects robots.txt, manages concurrency, deduplicates URLs, and hands you page snapshots. It has no opinion about what you're crawling or why. It doesn't know if you're auditing a site for SEO gaps, training a dataset, or mapping a competitor. That ignorance is intentional.
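To make "deduplicates URLs" concrete, here is a minimal sketch of a crawl frontier. It is not the engine's actual code: normalizeUrl and Frontier are hypothetical names, and the normalization rules (drop the fragment, sort query parameters, HTTP(S) only) are assumptions about what a reasonable core might do.

```typescript
// Hypothetical helper: canonicalize a URL so trivially different spellings
// of the same address collapse to one key. Returns null for anything that
// is not a parseable http(s) URL.
function normalizeUrl(raw: string, base?: string): string | null {
  let url: URL;
  try {
    url = new URL(raw, base); // also lowercases the scheme and host
  } catch {
    return null;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  url.hash = "";            // fragments are never sent to the server
  url.searchParams.sort();  // stable query-parameter order
  return url.toString();
}

// Hypothetical frontier: a FIFO queue guarded by a "seen" set.
class Frontier {
  private seen = new Set<string>();
  private queue: string[] = [];

  // Returns true if the URL was new and has been queued.
  add(raw: string, base?: string): boolean {
    const key = normalizeUrl(raw, base);
    if (key === null || this.seen.has(key)) return false;
    this.seen.add(key);
    this.queue.push(key);
    return true;
  }

  // Next URL to fetch, or undefined when the queue is empty.
  next(): string | undefined {
    return this.queue.shift();
  }
}
```

Nothing in this sketch knows why the crawl is happening, which is the point: the frontier only answers "have I seen this address before?"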

The chassis is the MCP server - the structural layer that mounts on top of the engine and makes it driveable. It defines the tool contracts: what a crawl call looks like, what comes back, what can fail. It's not branding, and it's not the engine - it's the scaffolding that connects them. Swap it out and you'd have a REST API or a CLI instead. The power unit wouldn't notice.
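A rough illustration of what a tool contract can look like in code. This is a sketch under assumptions, not the repo's schema: the field names (startUrl, maxPages), the default of 100, and the error codes are all invented for the example. Only the tool name crawl_site comes from the project.

```typescript
// Hypothetical input for a crawl_site call. Field names are illustrative.
interface CrawlSiteInput {
  startUrl: string;
  maxPages: number;
}

// What comes back, and what can fail: a discriminated union keeps the
// failure path as explicit as the success path.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: { code: "INVALID_INPUT" | "FETCH_FAILED"; message: string } };

// Validate untrusted tool arguments before they reach the engine.
function parseCrawlSiteInput(raw: unknown): ToolResult<CrawlSiteInput> {
  const fail = (message: string): ToolResult<CrawlSiteInput> => ({
    ok: false,
    error: { code: "INVALID_INPUT", message },
  });

  if (typeof raw !== "object" || raw === null) return fail("input must be an object");

  // maxPages falls back to an assumed default of 100 when omitted.
  const { startUrl, maxPages = 100 } = raw as { startUrl?: unknown; maxPages?: unknown };

  if (typeof startUrl !== "string") return fail("startUrl must be a string");
  try {
    new URL(startUrl); // throws on anything that is not an absolute URL
  } catch {
    return fail("startUrl must be an absolute URL");
  }
  if (typeof maxPages !== "number" || !Number.isInteger(maxPages) || maxPages < 1) {
    return fail("maxPages must be a positive integer");
  }
  return { ok: true, data: { startUrl, maxPages } };
}
```

Because the contract lives in the chassis, a REST or CLI front end could reuse the same parser and hand the engine an identical, already-validated input.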

The livery is everything application-specific: domain heuristics, rate-limit decisions for particular hosts, dork recipe templates, extractor choices. These are the team colors. They say something about how you use the car, not about how the car works. The rule I enforced: if a behavior is application policy, it lives in the livery. If it's a crawl primitive, it belongs in the power unit.
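One way to picture the boundary is as a policy object that the livery hands to the engine. The sketch below is an assumption about how that could be wired, not the repo's interface: CrawlPolicy, planFetch, and seoAuditPolicy are hypothetical names, and the delay values are made up.

```typescript
// Hypothetical seam between layers: the engine asks, the livery answers.
interface CrawlPolicy {
  delayMsFor(host: string): number;
  shouldFollow(url: URL): boolean;
}

// Power-unit default: no opinions at all.
const defaultPolicy: CrawlPolicy = {
  delayMsFor: () => 0,
  shouldFollow: () => true,
};

// Power-unit side: plan a fetch using only what the policy reports.
// Returns null when the policy declines the URL.
function planFetch(
  rawUrl: string,
  policy: CrawlPolicy = defaultPolicy,
): { url: string; delayMs: number } | null {
  const url = new URL(rawUrl);
  if (!policy.shouldFollow(url)) return null;
  return { url: url.toString(), delayMs: policy.delayMsFor(url.hostname) };
}

// Livery side: application-specific choices, kept out of the core.
const seoAuditPolicy: CrawlPolicy = {
  // "A special delay for search engines" is a policy decision, so it sits here.
  delayMsFor: (host) =>
    host === "google.com" || host.endsWith(".google.com") ? 5000 : 250,
  shouldFollow: (url) => !url.pathname.startsWith("/admin"),
};
```

With a seam like this, adding a per-host rule means editing a policy object. The planFetch function never needs to change.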

This sounds obvious in retrospect. In practice it took real discipline. The temptation is to keep adding convenience to the engine - "let's add a special delay for search engines," "let's flag JS-gated domains at the core level." Every one of those is a livery decision trying to sneak into the power unit. I had to rip several out before release.

The Chrome Pipe

The most unconventional decision in the codebase is how JavaScript-rendered pages are handled. A lot of crawlers reach for Puppeteer or Playwright and spin up a headless browser. That works, but it adds infrastructure weight and its own fingerprinting surface.

This engine takes a different approach. The MCP server exposes a tool called fetch_page_via_chrome_pipe that delegates rendering to Claude in Chrome - the browser extension that lets Claude see and interact with your actual Chrome session. Claude navigates to the URL, captures the rendered HTML, and pipes it back. The crawl engine receives a full DOM snapshot, the same as it would from any other backend.

What you get: real browser rendering, your existing session cookies, and no headless browser to maintain. What this means for the architecture: the rendering backend is just another swappable module. The engine doesn't care whether a page arrived via fetch() or a Chrome pipe.
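Here is a small sketch of what "just another swappable module" could look like, including a fallback when one backend is unavailable. The names (PageSnapshot, RenderBackend, fetchWithFallback) and the two stub backends are hypothetical stand-ins. They do no real network or browser work and are not the project's API.

```typescript
// Hypothetical snapshot shape: the engine sees the same fields no matter
// which backend produced the page.
interface PageSnapshot {
  url: string;
  html: string;
  via: string; // name of the backend that produced the snapshot
}

// Hypothetical backend seam: anything that can turn a URL into a snapshot.
interface RenderBackend {
  name: string;
  fetch(url: string): Promise<PageSnapshot>;
}

// Try backends in order, falling through to the next when one throws.
async function fetchWithFallback(
  url: string,
  backends: RenderBackend[],
): Promise<PageSnapshot> {
  const errors: string[] = [];
  for (const backend of backends) {
    try {
      return await backend.fetch(url);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${backend.name}: ${reason}`);
    }
  }
  throw new Error(`all backends failed for ${url}: ${errors.join("; ")}`);
}

// Stub standing in for a Chrome pipe whose browser is not connected.
const disconnectedPipe: RenderBackend = {
  name: "chrome-pipe",
  fetch: async () => {
    throw new Error("Chrome is not connected");
  },
};

// Stub standing in for a plain HTTP backend; returns canned HTML.
const staticBackend: RenderBackend = {
  name: "static",
  fetch: async (url) => ({ url, html: "<html><body>stub</body></html>", via: "static" }),
};
```

Calling fetchWithFallback(url, [disconnectedPipe, staticBackend]) resolves with the static snapshot. The caller only ever sees a PageSnapshot, whichever backend produced it.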

As far as I know, this pattern is new territory. Using Claude in Chrome as a first-class rendering backend - rather than a UI automation tool - opens up workflows that headless browsers struggle to replicate, particularly for sites that detect and block automated clients.

Security by Default

One thing I didn't want to compromise on: SSRF protection out of the box. It's easy to ship a crawler that, if pointed at a malicious URL, starts fetching http://169.254.169.254/ (the AWS instance metadata endpoint) or some other private network address. Most open-source crawlers leave this as an exercise for the operator.

Here it's built in at the transport layer using a custom undici connector. Private IP ranges - loopback, link-local, RFC 1918 - are blocked before a connection is opened, not after. The default policy is BLOCK_PRIVATE. You can opt out, but you have to do it explicitly.
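The address check at the heart of that policy can be sketched as a pure function. This is an illustration under assumptions, not the engine's connector: it covers IPv4 only, leaves out IPv6 and DNS resolution, and does not show the undici wiring. The ALLOW_ALL name for the opt-out is invented here; only BLOCK_PRIVATE comes from the project. In a real connector the check has to run against the resolved IP rather than the hostname, otherwise a DNS record pointing at a private address slips through.

```typescript
// BLOCK_PRIVATE is the project's default; "ALLOW_ALL" is a hypothetical
// name for the explicit opt-out.
type SsrfPolicy = "BLOCK_PRIVATE" | "ALLOW_ALL";

// Convert a dotted-quad IPv4 string to an integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

// IPv4 ranges refused under BLOCK_PRIVATE, as [base address, prefix length].
const BLOCKED_V4: ReadonlyArray<readonly [string, number]> = [
  ["0.0.0.0", 8],      // "this network"
  ["10.0.0.0", 8],     // RFC 1918
  ["127.0.0.0", 8],    // loopback
  ["169.254.0.0", 16], // link-local, includes the cloud metadata address
  ["172.16.0.0", 12],  // RFC 1918
  ["192.168.0.0", 16], // RFC 1918
];

// True when addr falls inside base/bits. Uses division instead of bit
// shifts to avoid signed 32-bit surprises.
function inCidr(addr: number, base: string, bits: number): boolean {
  const baseInt = ipv4ToInt(base);
  if (baseInt === null) return false;
  const blockSize = 2 ** (32 - bits);
  return Math.floor(addr / blockSize) === Math.floor(baseInt / blockSize);
}

// Decide whether a resolved IPv4 address may be connected to.
function isAllowedAddress(ip: string, policy: SsrfPolicy = "BLOCK_PRIVATE"): boolean {
  if (policy === "ALLOW_ALL") return true;
  const addr = ipv4ToInt(ip);
  if (addr === null) return false; // fail closed on anything unparseable
  return !BLOCKED_V4.some(([base, bits]) => inCidr(addr, base, bits));
}
```

Failing closed on unparseable input matches the block-by-default stance: an address the check cannot understand is treated as unsafe rather than waved through.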

What Surprised Me

The MCP protocol turned out to be a better interface layer than I expected. I went in thinking of it as a way to expose tools to Claude. What I found is that the tool-call model forces you to define clean contracts: what goes in, what comes out, what can fail. That discipline made the whole module layer easier to reason about.

The hardest part wasn't the crawl logic. Queuing, deduplication, rate limiting, robots.txt parsing - all of that was relatively straightforward to get right. The hard part was the boundary enforcement. It's genuinely difficult to look at something like "should this domain heuristic live in core?" and make the right call consistently. I got it wrong several times.

The second hardest part was the Chrome pipe. Getting Claude in Chrome and the MCP server to coordinate correctly - the timing, the HTML capture format, the fallback behavior when Chrome isn't connected - took more iteration than I expected. It's stable now, but it earned that stability.

What's in the Repo

The public snapshot includes the engine core; the MCP server with a set of working tools (crawl_site, fetch_page, fetch_api, extract_links, summarize_manifest); the Chrome pipe module; extractors for SEO data, links, headings, images, and schema.org markup; and a dev container config so you can be up and running in a few minutes.

It's a TypeScript monorepo using npm workspaces. The test suite has full coverage of the core. There's a Dork Builder utility in the MCP module if you want to build structured search queries from templates. The architecture doc is in the README.

I built this because I needed it. I'm releasing it because the patterns - MCP-native crawling, Chrome-as-pipe rendering, SSRF protection by default - seem worth sharing. If you build something with it, I'd genuinely like to know.

github.com/timothy-nishimura/crawl