Clean HTML before your agent burns tokens.
Refinery MCP lets Claude, Cursor, and other agent workflows turn
URLs or raw HTML from noisy pages into clean, LLM-ready text plus a word_count.
Less page chrome. More useful context.
<nav>Home Pricing Login</nav>
<script>track()</script>
<article>
  <h1>Refinery test</h1>
  <p>Clean this before embedding.</p>
</article>
{
  "text": "Refinery test Clean this before embedding.",
  "word_count": 5,
  "processing_time_ms": 44.96
}
Agents scrape. Refinery trims the junk.
Web pages are full of markup the model does not need. Refinery keeps the content path explicit: fetch the page however you want, clean it, then send compact text into RAG, embeddings, or summarization.
Fetch
Use browser automation, Firecrawl, Crawl4AI, Playwright, or a plain HTTP client.
Refine
Strip scripts, styles, nav, layout junk, tracking fragments, and repeated chrome.
Measure
Return clean text, word count, timing, and quick savings estimates for token budgets.
Use
Chunk, embed, summarize, answer questions, or pass the content to your agent.
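To make the Refine and Measure steps concrete, here is a minimal sketch of what "strip the chrome, then count words" can look like. It is an illustration only, not Refinery's implementation: the `cleanHtml` function, the `CleanResult` shape, and the list of dropped elements are all hypothetical, and regex stripping is a deliberately naive stand-in for a real HTML parser.

```typescript
// Illustrative sketch only. Names and behavior here are assumptions,
// not Refinery's actual code or API.
interface CleanResult {
  text: string;
  wordCount: number;
}

// Elements whose entire contents are discarded (hypothetical list).
const DROPPED_ELEMENTS = ["script", "style", "nav", "header", "footer", "aside"];

function cleanHtml(html: string): CleanResult {
  let working = html;
  for (const tag of DROPPED_ELEMENTS) {
    // Remove the element together with everything inside it.
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    working = working.replace(pattern, " ");
  }
  const text = working
    .replace(/<[^>]+>/g, " ") // drop remaining tags but keep their text
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
  // Count whitespace-separated words; an empty string has zero words.
  const wordCount = text === "" ? 0 : text.split(" ").length;
  return { text, wordCount };
}

// Usage: the nav and script contents disappear, the paragraph text stays.
const demoPage = "<nav>Home</nav><script>track()</script><p>Hello clean world</p>";
const demoResult = cleanHtml(demoPage);
console.log(demoResult.text, demoResult.wordCount);
```

A production cleaner would use a proper parser and handle entities, comments, and malformed markup. The sketch only shows the shape of the step: markup in, compact text and a word count out.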
Three simple tools.
Small surface area on purpose. Agents should know exactly when to call this.
clean_url
Send a URL to the Refinery Apify Actor and get clean text plus metadata.
clean_html
Pass HTML your crawler already fetched and get clean text plus estimated savings.
estimate_savings
Compare raw versus clean text locally without spending Apify credits.
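As a rough picture of what a local raw-versus-clean comparison involves, the sketch below estimates token counts from character length. Everything in it is an assumption for illustration: the `estimateSavings` signature, the `SavingsEstimate` fields, and the four-characters-per-token heuristic are hypothetical and may differ from what the actual tool returns.

```typescript
// Illustrative sketch only. The field names and the heuristic are
// assumptions, not the output format of the real estimate_savings tool.
interface SavingsEstimate {
  rawTokens: number;
  cleanTokens: number;
  savedTokens: number;
  savedPercent: number;
}

// Common rule of thumb for English text: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;

function approxTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function estimateSavings(raw: string, clean: string): SavingsEstimate {
  const rawTokens = approxTokens(raw);
  const cleanTokens = approxTokens(clean);
  // Clamp at zero in case the "clean" text is somehow longer than the raw input.
  const savedTokens = Math.max(rawTokens - cleanTokens, 0);
  // Percentage rounded to one decimal place; avoid dividing by zero.
  const savedPercent =
    rawTokens === 0 ? 0 : Math.round((savedTokens / rawTokens) * 1000) / 10;
  return { rawTokens, cleanTokens, savedTokens, savedPercent };
}
```

Because this runs entirely on strings already in memory, it needs no network call, which is the property that lets a comparison like this avoid spending credits. For exact budgets, swap the heuristic for the tokenizer of the model you are targeting.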
Drop it into Cursor or Claude.
Build locally, pass an Apify token, and your agent gets a cleanup tool it can use before stuffing web pages into context.
git clone https://github.com/LareLabs/refinery-mcp
cd refinery-mcp
npm install
npm run build
export APIFY_TOKEN=apify_api_xxx
npm run smoke