Mar 24, 2026

I built an open-source web extraction tool. Here's why.

For the past few years I've been building web scraping infrastructure. Custom scrapers, proxy rotation, bot-protection bypass. I've helped clients generate over 50M in volume. Every single project ran into the same wall: getting clean, structured content from the web is absurdly hard.

Anti-bot systems change every week. Half the web needs a headless browser to render. And once you finally get the HTML, you still have to figure out which part of the page is actually content and which part is cookie banners and navigation.

I got tired of solving this from scratch every time. So I built webclaw.

What it does

webclaw takes a URL and gives you back clean content. Markdown, JSON, plain text, or a format optimized for LLMs. From the command line:

webclaw https://example.com

You get markdown, page metadata, structured data from JSON-LD, and links. No parsing HTML yourself, no stripping boilerplate, no dealing with consent overlays.

It also works as an MCP server, so you can plug it straight into Claude Code, Claude Desktop, or any MCP-compatible AI tool and let your agent browse the web.
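For Claude Desktop, that means adding an entry to `claude_desktop_config.json` under `mcpServers`. The `command` and `args` below are illustrative, not confirmed; check the webclaw docs for the exact invocation:

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw",
      "args": ["mcp"]
    }
  }
}
```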

Why this and not the ten other scraping tools

It's Rust. The extraction engine is open source and compiles to native code. Sub-second extraction for most pages.

Anti-bot bypass at the network level. Cloudflare, Akamai, DataDome, PerimeterX. 93% of the top 100 most-protected sites resolve without spinning up a browser.

It understands page structure. Not a text dump. The engine scores content blocks, strips noise (nav, ads, cookie banners, footers), resolves URLs, and outputs markdown that actually makes sense when you feed it to an LLM.
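The scoring idea can be sketched in a few lines of Rust. This is a toy heuristic, not the actual engine (the real scorer presumably weighs more signals), but link density alone already separates article body from navigation chrome:

```rust
/// Score a content block: favor long text, penalize link-heavy regions.
/// A readability-style heuristic: `text` is the block's full text,
/// `link_text` is the portion of it that sits inside anchor tags.
fn score_block(text: &str, link_text: &str) -> f64 {
    let text_len = text.chars().count() as f64;
    if text_len == 0.0 {
        return 0.0;
    }
    let link_density = link_text.chars().count() as f64 / text_len;
    // Long, link-sparse blocks (article body) score high; short,
    // link-dense blocks (nav bars, footers) score near zero.
    text_len * (1.0 - link_density).max(0.0)
}
```

A block that is 100% links (a nav bar) scores zero regardless of length, which is exactly the behavior you want when deciding what to keep.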

Machine-readable data extraction. Point it at a product page and get price, sizes, stock availability. Pulled from JSON-LD, not scraped from rendered text.
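To illustrate why JSON-LD beats scraping rendered text: the structured payload sits verbatim inside `<script type="application/ld+json">` tags, so locating it is string work, not layout guessing. A minimal std-only Rust sketch (a production extractor would use a real HTML parser and handle attribute case and quoting variations):

```rust
/// Pull the raw payloads of <script type="application/ld+json"> blocks
/// out of an HTML string. Simplified sketch: assumes the MIME type
/// appears lowercase, as it does on most real pages.
fn extract_json_ld(html: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(hit) = html[pos..].find("application/ld+json") {
        let attr = pos + hit;
        // End of the opening <script ...> tag, then the matching close tag.
        let Some(tag_end) = html[attr..].find('>') else { break };
        let body_start = attr + tag_end + 1;
        let Some(close) = html[body_start..].find("</script>") else { break };
        out.push(html[body_start..body_start + close].trim().to_string());
        pos = body_start + close;
    }
    out
}
```

The returned strings are plain JSON, ready to hand to a parser, with price, availability, and the rest already keyed by schema.org vocabulary.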

Full toolkit. Scrape one page, crawl a whole site, discover sitemaps, batch process URLs, diff content changes, extract brand identity. All from the CLI or through MCP.

The stack

Core engine in Rust with zero network dependencies (runs in WASM). CLI binary and MCP server ship as standalone executables. No runtime, no Docker, no dependencies to install.

Why open source

I've spent years dealing with scraping tools that are either too basic or locked behind expensive APIs. The extraction layer shouldn't be a black box. You should know exactly what's happening to the pages you fetch.

webclaw is fully open source. Install it, run it, modify it. A managed API with additional features (hosted anti-bot bypass, JS rendering, AI extraction) is coming in the next few weeks for those who want to skip the self-hosting.

Get it

Website: webclaw.io
GitHub: github.com/0xMassi/webclaw
Docs: webclaw.io/docs

Install and start extracting in 30 seconds. If you're building anything that touches web content, give it a shot. Reply to this email and tell me what you think.

Massi