Agent-native HTML cleanup

Clean HTML before your agent burns tokens.

Refinery MCP lets Claude, Cursor, and other agent workflows turn noisy web pages (by URL) or raw HTML into clean, LLM-ready text plus a word_count. Less page chrome. More useful context.

Install via npm | Try the Apify Actor | View source

npm: @larelabs/refinery-mcp | Apify: $0.002/page | Registry: MCP official listing

Not a crawler. The cleanup step after Firecrawl, Crawl4AI, Playwright, or your own fetcher.

clean_html

<nav>Home Pricing Login</nav>
<script>track()</script>
<article>
  <h1>Refinery test</h1>
  <p>Clean this before embedding.</p>
</article>

{
  "text": "Refinery test Clean this before embedding.",
  "word_count": 5,
  "processing_time_ms": 44.96
}

76% smaller payload

31 estimated tokens saved

3 MCP tools

Agent pipeline: fetch page, Refinery MCP cleanup, clean text for RAG and embeddings

Before and after: bloated HTML versus clean LLM-ready text with token savings

Agents scrape. Refinery trims the junk.

Web pages are full of markup the model does not need. Refinery keeps the content path explicit: fetch the page however you want, clean it, then send compact text into RAG, embeddings, or summarization.

Fetch

Use browser automation, Firecrawl, Crawl4AI, Playwright, or a plain HTTP client.

Refine

Strip scripts, styles, nav, layout junk, tracking fragments, and repeated chrome.

Measure

Return clean text, word count, timing, and quick savings estimates for token budgets.

Use

Chunk, embed, summarize, answer questions, or pass the content to your agent.
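
To make the Refine step concrete, here is a minimal sketch of the general technique using only Python's standard library. It is not Refinery's implementation: the `SKIP_TAGS` list, the `TextExtractor` class, and the `clean_html_sketch` function are hypothetical names chosen for illustration, and the word count is a plain whitespace split, which may differ from how the real tool counts.

```python
from html.parser import HTMLParser

# Assumed list of tags whose whole contents are dropped (illustrative only).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Join the fragments and collapse runs of whitespace to single spaces.
        return " ".join(" ".join(self._parts).split())


def clean_html_sketch(html: str) -> dict:
    """Hypothetical helper: return cleaned text and a whitespace word count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = parser.text()
    return {"text": text, "word_count": len(text.split())}


if __name__ == "__main__":
    sample = (
        "<nav>Docs Blog</nav><style>p{color:red}</style>"
        "<main><h1>Release notes</h1><p>Version 2 ships today.</p></main>"
        "<footer>Terms</footer>"
    )
    print(clean_html_sketch(sample))
```

A depth counter rather than a boolean flag keeps the sketch correct when skipped elements are nested, for example a `<nav>` inside a `<header>`.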

Three simple tools.

Small surface area on purpose. Agents should know exactly when to call this.

`clean_url`

Send a URL to the Refinery Apify Actor and get clean text plus metadata.

`clean_html`

Pass HTML your crawler already fetched and get clean text plus estimated savings.

`estimate_savings`

Compare raw versus clean text locally without spending Apify credits.
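
A local raw-versus-clean comparison can be approximated in a few lines. The sketch below is an assumption about how such an estimate might work, not the formula `estimate_savings` actually uses: the function name `estimate_savings_sketch`, the returned field names, and the rough heuristic of about 4 characters per token are all illustrative.

```python
def estimate_savings_sketch(
    raw_html: str, clean_text: str, chars_per_token: float = 4.0
) -> dict:
    """Hypothetical estimate of payload reduction and tokens saved.

    Uses a crude characters-per-token heuristic (assumed, not measured);
    real tokenizers will give different counts.
    """
    raw_chars = len(raw_html)
    clean_chars = len(clean_text)

    # Percentage of the raw payload removed by cleaning; 0 for empty input.
    if raw_chars == 0:
        reduction_pct = 0.0
    else:
        reduction_pct = round(100 * (1 - clean_chars / raw_chars), 1)

    raw_tokens = round(raw_chars / chars_per_token)
    clean_tokens = round(clean_chars / chars_per_token)

    return {
        "raw_chars": raw_chars,
        "clean_chars": clean_chars,
        "reduction_pct": reduction_pct,
        "estimated_tokens_saved": max(0, raw_tokens - clean_tokens),
    }


if __name__ == "__main__":
    raw = "<div class='wrap'><script>track()</script><p>Hello world</p></div>"
    print(estimate_savings_sketch(raw, "Hello world"))
```

Because nothing here touches the network, an estimate like this is cheap to run on every page before deciding whether a paid cleanup call is worth it.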

Drop it into Cursor or Claude.

One npm package. Set APIFY_TOKEN and your agent gets a cleanup tool to run before it stuffs web pages into context.

# Cursor / Claude Desktop
{
  "mcpServers": {
    "refinery": {
      "command": "npx",
      "args": ["-y", "@larelabs/refinery-mcp"],
      "env": {
        "APIFY_TOKEN": "apify_api_xxx",
        "REFINERY_ACTOR_ID": "larelabs/refinery-html-to-llm-cleaner"
      }
    }
  }
}

# Quick smoke test (no Apify credits)
npx -y @larelabs/refinery-mcp
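
If you already have other MCP servers configured, the JSON entry above needs to be merged into the existing `mcpServers` object rather than pasted over it. The following is a small sketch of that merge, assuming the config has already been loaded into a dict; `add_refinery_server` is a hypothetical helper, and where the config file lives depends on your client, so no path is assumed here.

```python
import copy
import json


def add_refinery_server(config: dict, apify_token: str) -> dict:
    """Hypothetical helper: return a copy of config with the refinery entry added.

    Existing servers are preserved; an existing "refinery" entry is replaced.
    """
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["refinery"] = {
        "command": "npx",
        "args": ["-y", "@larelabs/refinery-mcp"],
        "env": {
            "APIFY_TOKEN": apify_token,
            "REFINERY_ACTOR_ID": "larelabs/refinery-html-to-llm-cleaner",
        },
    }
    return merged


if __name__ == "__main__":
    # "other" stands in for whatever servers you already have configured.
    existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
    print(json.dumps(add_refinery_server(existing, "apify_api_xxx"), indent=2))
```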