A browser for agents

Today I'm announcing an early preview of Rotunda. It gives your agents and programs access to a local browser tailored for programmatic control. Everything it does looks just like you're doing it, except it's an agent behind your keyboard instead of you.

  • Can interact with any site: echo text or HTML, click buttons and links, take a screenshot, etc.
  • Doesn't leak any indicators that it's being controlled programmatically: down to using a probabilistic model for how real people move their mouse and type on their keyboard
  • Has a high-level control CLI, perfect for bash tool calls without the expense of adding a new MCP server
  • Can spoof different fingerprints to make it look like it's running on a similar but unique machine

Here's a screen recording of Rotunda logging into my Amazon account step by step and parsing my recent orders (driven by just telling Codex about the available CLI and using 1Password to auth).

Check it out on its website or GitHub.


Automation was one of the dreams that really got me into computer science. Freshman year of college I wanted to join a small energy seminar that opened for registration at some odd hour of the morning. It had capped enrollment and an unclear waitlist. Since sleep was at such a premium, I wrote a Puppeteer script to log into my SUNet account and click the registration link the second the page went live. By the time I woke up at 11am, it had worked and I was in the class. A few days later I got a polite email from the registrar's office asking how on earth I'd managed to register that fast. "I'm fast on my keyboard," I answered.

I assume the statute of limitations has run out on my diploma, so yes: it was automation all the way down. That was a soft introduction to the cat-and-mouse game of automating other people's websites, and the first time I realized how much of the internet only exists in browser form. There's some machine-readable surface behind it, but it's just not accessible to us, the users.

The web has been made into a beautiful layer to interact with almost anything. If there's something that can be done remotely, it can almost certainly be done online. From your bank to your flights to your maps to your payroll, it's frankly unbelievable how much of a universal language websites have become.

Yet it's never been harder to automate. Because bad actors have so abused the open nature of the internet, there's now a billion-dollar industry responsible for bot detection. These vendors cast a wide net, and for ease of classification they really don't care whether you're running a crawl farm or doing a single task on your local computer. A bot is a bot and a bot is bad.

Landscape today

So how does your Claude Cowork, Codex, or OpenClaw talk to the open web? Usually one of these options:

  1. use screen control to talk to your existing Chrome/Safari and click buttons visually. This uses some combination of the macOS accessibility APIs (text) and computer use (vision) to figure out what to click on. It works much better than it did a few years ago, but it's still error-prone and very expensive. Plus these models can only rely on what the browser shows in its current viewport: if some button is all the way at the bottom of the page, they have to issue scroll-down requests until they get there. And that's only if the model intuits that the button might even be down there. Until they see it, they won't know.

  2. get some code involved. Playwright has become the industry-standard way to drive new web control workflows. It provides a high-level API on top of CDP, the protocol that powers Chrome's programmatic debugging, and it ports across Chromium, Firefox, and WebKit (Safari's engine). Because it plugs into browser internals, it has the benefit of being able to inspect everything on the screen as raw text and reason about buttons as logical objects instead of pixels. But both Playwright and Chromium leak flags that scream they're being controlled programmatically. The navigator.webdriver global is the most obvious, but there are dozens of others (see the sketch after this list). Plus if you run it headless, it'll leak other information about software rendering when drawing to the canvas.

  3. when you get blocked by a captcha, or a fingerprint check opaquely fails to log you in, you switch to a stealth browser. Most are defunct, but some promise relatively active development. Alas, spoofing your browser is an almost impossible game: among the statistical soup that is the billions of daily web sessions, it's easy enough to spot that you're anomalous if you're pretending to be a desktop Mac but running without a GPU on Linux. Some vendors go down to monitoring your mouse and typing patterns.1
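
To make option 2's leak concrete, here's a minimal sketch (a stock Chromium driven through Playwright's Python API, nothing Rotunda-specific) that asks the page what an anti-bot script would see:

# Sketch of the "webdriver global" leak from option 2 above. This drives a
# stock Chromium via Playwright's Python API (not Rotunda) and evaluates the
# flag exactly the way a remote site could.
# Setup (if needed): pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # Any site can read this; on a default automated Chromium launch it
    # typically reports True, which is one of the many tells.
    print(page.evaluate("() => navigator.webdriver"))
    browser.close()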

Rotunda tries to take the good of all three approaches and leave the bad. A well-behaved agent wants a browser that looks like an extension of you: sitting on your home IP, taking its time and making a few mistakes while filling out forms, reporting its GPU truthfully, and leaking no JavaScript identifiers for browser control. We just want it to have access to the same browser experience that we do when we're using the web.

That's for the good bots. Serving the bad bots (ticket scalpers, shoe resellers, etc.) is of no interest to me.

Rotunda combines a few different elements: fingerprint fibbing, isolated JavaScript control, and realistic actions. My goal is for these to all be implementation details to you.

On stealth

Rotunda is a heavily patched version of Firefox. This is an unusual choice in the browser automation community, since almost all alternatives are based on Chromium because of its great scripting support via CDP.

This decision was based on the great early work by Camoufox. They proved out that Firefox has more safeguards by default, so it's harder to fingerprint, and that CDP simply leaks too much state. Juggler is Firefox's alternative automation API to CDP, developed before Google formalized the spec. It covers the same effective surface area as CDP but operates in a fully isolated JavaScript context, so it won't leak any state to the remote site no matter what variables you access. I probably could have hacked some patches onto CDP to make that work as well, but Juggler ends up being far less of an uphill battle.

Stealth research informed the implementation, but Rotunda is by design not a stealth browser.

Stealth browsers must forever play the cat-and-mouse game with deep-pocketed fingerprinting vendors like hCaptcha, Fingerprint.js, and Cloudflare, just to name some that are commercially available. Most big sites will also implement their own.2 These vendors are naturally at an advantage: it only takes one attribute out of a thousand looking anomalous to flag you, while you have to patch all thousand to look consistent. It's a case where verification is much easier than the solution.

Browsers are better off not lying. But they're pretty safe fibbing. The things that are hardest to mock are low-level: GPU rendering and audio drivers, mostly. But these are consistent across every unit of the same hardware. Everyone with an M4 MacBook will have the same signature; same with a Windows NUC. Sites aren't going to block you because you're using the same hardware as someone else. Higher entropy is found in the things that you control: browser extensions, font selection, monitor screen size, etc. We choose to mock those safe attributes because there's no real way to assess their validity from inside the browser sandbox.

So what we say is real: just permuted. If we advertise that you have a font, we'll properly render with that font. But each profile can expose a different subset of genuinely installed fonts, and that subset provides the additional entropy.
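
To make the permutation idea concrete, here's a toy sketch of my own (not Rotunda's actual profile code; the font list and helper names are made up for illustration): derive a stable, per-profile subset of fonts that really are installed, so anything advertised can also be rendered.

# Toy sketch of "real but permuted" fingerprint attributes: each profile
# advertises a deterministic subset of fonts that are actually installed,
# so every claim is backed by a real render path. Illustrative only.
import hashlib
import random

INSTALLED_FONTS = [
    "Helvetica Neue", "Menlo", "Georgia", "Avenir", "Courier New",
    "Palatino", "Futura", "Gill Sans", "Optima", "Baskerville",
]

def fonts_for_profile(profile_id: str, keep: int = 7) -> list[str]:
    # Seed the shuffle from the profile id so the same profile reports the
    # same subset across sessions: consistency matters more than novelty.
    seed = int.from_bytes(hashlib.sha256(profile_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(INSTALLED_FONTS, keep))

print(fonts_for_profile("agent-demo"))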

This is hugely helped by running locally: Rotunda inherits your solid IP reputation and your device's full GPU capabilities. Cloud-hosted browsers like Steel.dev, Kernel, and Browser-Use all work by spinning up Linux VMs (either headless or in X11 sessions) that must lie about their underlying hosts to avoid getting flagged. They usually provide captcha solving as part of their "stealth" offerings, but solving enough of those can still get your account flagged. It's better to minimize the number of captchas you see in the first place.

On desktop

Desktop software is having a moment.

We're seeing more agentic work done locally than I would have expected. The major harnesses like Claude Code, Codex, and OpenClaw all live on the desktop. Their equivalent cloud platforms haven't caught on in the same way. Rotunda plays nicely with all of them by running locally and giving them access to a simple API surface:

uvx rotunda agent new-context agent-demo
uvx rotunda agent navigate 3 https://pierce.dev
uvx rotunda agent describe 3

I opted for a native bash CLI over an MCP server for a couple of reasons.

First, every MCP you load eats a chunk of your context window before the agent has done anything. Tool descriptions and JSON schemas add up fast, and in practice most agents only use two or three tools per task. A CLI defers that cost: the model only sees the --help text when it's unsure and asks. Most of the CLI names are obvious enough that it can intuit what they'll do without even prompting.

Second, models have been trained on orders of magnitude more bash than MCP. They already know how to pipe a command into jq, or chain it with grep, or wrap it in a loop. MCP is its own little walled garden that doesn't really compose with anything you already have.

I only use MCP these days for products that (for some reason) have an MCP server but not an official API. I incidentally think this is largely political: MCP was so in vogue in 2025 that it gave engineering groups the willpower and the optics to look AI-first.3

Right now for anything you'd actually want to live on a developer's machine, a CLI mostly lets you do more for less.

Key logging myself for fun and profit

If you actually watch a session that's being controlled by a script, it's immediately obvious what's going on. The mouse will barely move, then jump to the link that it wants, or every input field will be filled instantly with a full paragraph. It's fast, but it's wildly unrealistic for a human operator. If antibot providers aren't already modeling that signal, they will be soon.

The only foolproof way to escape these checks is to emulate human usage of websites. That brings me to last Monday when I wrote a Swift app to record my keyboard and mouse movements. I let it run for a few days to record all the idiosyncrasies of how I move my mouse and make mistakes as I type.

Yes I basically built a key logger.4

But I got a big data stream with all we need:

  • the full text that I type in an input box
  • (src, dst) points of where my mouse is trying to go

That's all it takes when you're trying to fill in a form on a remote site. All we need is to predict the intermediate actions to get to completion. At the moment I do this via a trained autoregressive RNN, which is fast at inference in C++ and very parameter-efficient. The RNN is also character-aware, so it can learn that different characters have different likelihoods of mistakes: if I mistype "P" it's more likely to come out as "O" than "A".
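
In Rotunda the path comes out of the learned model, but as a rough illustration of what "predict the intermediate actions" means for the mouse, here's a common stand-in (my own sketch, not the RNN): a curved, jittered path between src and dst with eased timing instead of a single instant jump. The curve shape and noise levels are made up for illustration.

# Illustrative stand-in for humanized mouse movement (not Rotunda's model):
# interpolate a curved, jittered path between src and dst with eased timing,
# instead of teleporting the cursor in one step.
import math
import random

def mouse_path(src, dst, steps=40):
    (x0, y0), (x1, y1) = src, dst
    # A control point off the straight line gives the path a human-ish arc.
    cx = (x0 + x1) / 2 + random.uniform(-80, 80)
    cy = (y0 + y1) / 2 + random.uniform(-80, 80)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Ease in/out: slow near the endpoints, faster in the middle.
        te = (1 - math.cos(math.pi * t)) / 2
        x = (1 - te) ** 2 * x0 + 2 * (1 - te) * te * cx + te ** 2 * x1
        y = (1 - te) ** 2 * y0 + 2 * (1 - te) * te * cy + te ** 2 * y1
        # Small per-sample jitter so points don't sit on a perfect curve.
        points.append((x + random.gauss(0, 1.5), y + random.gauss(0, 1.5)))
    return points

print(mouse_path((100, 200), (640, 480))[:3])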

But we don't want to leave your field inputs entirely up to an autoregressive RNN. There's a chance it'll output the wrong values even when given the initial prompt. So during inference we add a structured decoding pipeline to ensure it does eventually type exactly what you want and that any mistakes are corrected.
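
Here's a simplified sketch of what that guarantee looks like (my own illustration, with a tiny hard-coded adjacency map standing in for the RNN's learned mistake distribution): keystrokes may include adjacent-key typos, but each typo is immediately backspaced, so the replayed buffer always converges to the target text.

# Simplified sketch of structure-constrained typing (not the production RNN):
# occasionally emit an adjacent-key typo, immediately correct it, and by
# construction the final buffer always equals the target string.
import random

ADJACENT = {"p": "ol", "o": "pik", "a": "qsz", "e": "wrd", "t": "ryg"}  # tiny illustrative map

def keystrokes(target: str, mistake_rate: float = 0.05):
    events = []
    for ch in target:
        neighbors = ADJACENT.get(ch.lower())
        if neighbors and random.random() < mistake_rate:
            wrong = random.choice(neighbors)
            events += [("key", wrong), ("backspace", None)]  # typo, then correction
        events.append(("key", ch))
    return events

def replay(events):
    buf = []
    for kind, ch in events:
        if kind == "backspace":
            buf.pop()
        else:
            buf.append(ch)
    return "".join(buf)

evs = keystrokes("Pierce")
assert replay(evs) == "Pierce"  # the structured decoder guarantees convergence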

On ethics

Rotunda is the kind of tool whose duality keeps me up at night a little. The same browser that lets you scrape your own Amazon account also, in principle, lets someone scrape something they shouldn't.

The structural shape of Rotunda makes it pretty bad at most of the things abusive bots are after. Take ticket scalping: the entire economy is about being first to the API the millisecond the queue opens. Most of these operations run on a 10 Gbps connection banging endpoints at thousands of requests per second. A browser politely moving its mouse from button to button doesn't help them much.

Most actual abuse on the open web depends on volume, speed, or both. Rotunda is designed to be the opposite of both. The use cases that look like a person happen to be the ones that don't scale into adversarial harm. The cost per action is too high, and by definition it puts you into competition with actual people instead of high-frequency traders.

And if some category of abuse does emerge that I haven't thought of, the fingerprinting industry will route around it the same way it routes around every prior tool. Fresh accounts will get rate-limited and IPs without history will get flagged. Volume stands out by definition, and it's that kind of high-volume load that gives bots a bad name.

I philosophically disagree with sites that try to block automation that lets regular people get some time back in their day; if an intern can manually do something for you, a computer should be capable of doing the same.

Going forward

There's a lot to do here:

  • Windows support
  • Better mouse/keyboard simulations
  • Better CI probing of fingerprint consistency

Immediately on my plate is improving the accuracy of the mouse/keyboard simulation model. I expect that to improve naturally as I keep my key logger enabled. 😅 Ideally I'd also like to crowd-source data capture to give more variation to mouse and keyboard patterns, but I have to give more thought to doing that in a privacy-preserving way.

Also, while we currently have perfect scores on all fingerprint benchmarks, I need to test on more heterogeneous machines to make sure those scores stay strong. This should also be automated via CI.

There's also the question of the CLI's output. Right now we mostly pass raw outputs to agents so they can figure out what to do... but we could be smarter about rendering Markdown-formatted or structured content, perhaps even driven by a mini-LLM tailored to that purpose. For some of these feature questions I'm waiting to see how the community ends up using it and will plan the roadmap from there.

So give it a try! And hit me up on X if you do end up using it and let me know what you think. It's free and licensed under the Mozilla Public License. A star on GitHub would also help justify the nights and weekends.

Til next time! Have fun automating.

Footnotes

  1. These device fingerprints are used both to identify bots and to track you across the web. Even if you disable cookies, there are enough unique signals in your browser that it's almost trivial to identify you across websites. I have a lot of thoughts on fingerprinting more broadly and how it's toxic to a free society, but I'll cover those some other time. ↩

  2. At the limit, browsers have an advantage because they control the sandbox that remote sites must operate in. If you invested those same billions into stealth evasion, I'm sure you could make something pretty bulletproof. But ML is easier to deploy with bigger datasets, which the vendors have and clients do not; plus billions have not been invested in stealth browsing. So the cat is pretty firmly winning this race at the moment. ↩

  3. Enterprises have always been squirrelly about opening up an API to risk data exfiltration and easier vendor migration. But the competitive tides were just strong enough for MCP to take hold. ↩

  4. It's very funny to look back over what you typed for a day, without the context of what you were responding to. ↩
