Agent-native HTML cleanup

Clean HTML before your agent burns tokens.

Refinery MCP lets Claude, Cursor, and other agent workflows turn noisy web pages, passed as URLs or raw HTML, into clean, LLM-ready text plus a word_count. Less page chrome. More useful context.

Not a crawler. The cleanup step after Firecrawl, Crawl4AI, Playwright, or your own fetcher.

clean_html

<nav>Home Pricing Login</nav>
<script>track()</script>
<article>
  <h1>Refinery test</h1>
  <p>Clean this before embedding.</p>
</article>

{
  "text": "Refinery test Clean this before embedding.",
  "word_count": 5,
  "processing_time_ms": 44.96
}
76% smaller payload
31 estimated tokens saved
3 MCP tools
[Image: Agent pipeline: fetch page, Refinery MCP cleanup, clean text for RAG and embeddings]
[Image: Before and after: bloated HTML versus clean LLM-ready text with token savings]

Agents scrape. Refinery trims the junk.

Web pages are full of markup the model does not need. Refinery keeps the content path explicit: fetch the page however you want, clean it, then send compact text into RAG, embeddings, or summarization.
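
To make the "clean it" step concrete, here is a minimal sketch of HTML-to-text cleanup using only Python's standard library. It is not Refinery's implementation: the `SKIP_TAGS` set, the `TextExtractor` class, and the `refine` function are hypothetical names chosen for illustration, and the real tool's rules for what counts as page chrome may differ.

```python
from html.parser import HTMLParser

# Assumption: a hypothetical list of tags treated as page chrome.
# Refinery's actual rules are not documented here and may differ.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def refine(html: str) -> dict:
    """Return whitespace-normalized text and a whitespace-based word count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return {"text": text, "word_count": len(text.split())}
```

Calling `refine(page_html)` on markup you already fetched returns a dict with `text` and `word_count` keys. The word count here is a plain whitespace split, which may not match how Refinery counts words.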

01 Fetch

Use browser automation, Firecrawl, Crawl4AI, Playwright, or a plain HTTP client.

02 Refine

Strip scripts, styles, nav, layout junk, tracking fragments, and repeated chrome.

03 Measure

Return clean text, word count, timing, and quick savings estimates for token budgets.

04 Use

Chunk, embed, summarize, answer questions, or pass the content to your agent.
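
For the last step, a rough sketch of chunking cleaned text before embedding. This runs after Refinery rather than inside it. The `chunk_words` function and its `max_words` and `overlap` defaults are assumptions for illustration, so tune them to your embedding model's limits.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-bounded chunks that share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help retrieval at the cost of a little duplication.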

Three simple tools.

Small surface area on purpose. Agents should know exactly when to call this.

clean_url

Send a URL to the Refinery Apify Actor and get clean text plus metadata.

clean_html

Pass HTML your crawler already fetched and get clean text plus estimated savings.

estimate_savings

Compare raw versus clean text locally without spending Apify credits.
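
A local raw-versus-clean comparison could look roughly like the sketch below. The function names, the returned field names, and the 4-characters-per-token heuristic are all assumptions made for illustration. The formula Refinery's estimate_savings actually uses is not stated here.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; assumes roughly 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def estimate_savings(raw: str, clean: str) -> dict:
    """Compare raw markup with cleaned text by size and estimated tokens."""
    raw_bytes = len(raw.encode("utf-8"))
    clean_bytes = len(clean.encode("utf-8"))
    if raw_bytes == 0:
        percent_smaller = 0.0
    else:
        percent_smaller = round(100 * (1 - clean_bytes / raw_bytes), 1)
    return {
        "raw_bytes": raw_bytes,
        "clean_bytes": clean_bytes,
        "percent_smaller": percent_smaller,
        # Never report negative savings if the "clean" text is larger.
        "estimated_tokens_saved": max(0, estimate_tokens(raw) - estimate_tokens(clean)),
    }
```

Because this is pure string arithmetic, it needs no network access and no Apify credits. For exact numbers, swap the heuristic for your model's real tokenizer.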

Drop it into Cursor or Claude.

Build locally, pass an Apify token, and your agent gets a cleanup tool it can use before stuffing web pages into context.

git clone https://github.com/LareLabs/refinery-mcp
cd refinery-mcp
npm install
npm run build

export APIFY_TOKEN=apify_api_xxx
npm run smoke