Computer use and browser agents in production
Computer-use and browser agents make for viral demos. Production deployments at scale have a different shape: narrow scoping, heavy guardrails, careful UX. This article covers the patterns that work, the failures we keep seeing, and the honest economics.
The demos are mesmerizing. An AI navigates a complex booking flow, switches between applications, fills out government forms, completes multi-hour tasks autonomously. In 2024–2025, Anthropic's Computer Use, OpenAI's Operator, Google's Project Mariner, and a wave of startups showed agents that operate computers like humans.
In 2026, these tools are real. They work. Some companies are deploying them in production successfully. But the deployments don't look like the demos. They're narrower, more constrained, and surrounded by guardrails. The patterns that distinguish production from demo are what this article is about.
We covered the basics in an intermediate-level article. This one goes deeper: the production patterns, the failures we keep seeing, the economics, and how to ship a computer-use system that's actually useful at scale.
The production reality
A few patterns we observe in actual production deployments:
Pattern 1: Narrow tasks dominate. Production deployments succeed for specific, well-defined tasks. Not "do anything." Not "operate any website." Specific workflows on specific sites.
Pattern 2: Heavy scoping. Tasks are scoped tightly. The agent is allowed only certain actions on certain sites. Anything outside the scope triggers stops, not improvisation.
Pattern 3: Recorded workflows over autonomous exploration. Many production deployments use recorded workflows (steps defined once, replayed with adjustments) rather than fully autonomous agents. More reliable, easier to maintain.
Pattern 4: Human-in-the-loop for consequence. Any action with significant consequences (financial, legal, customer-facing) goes through human review.
Pattern 5: Aggressive monitoring. Every action logged. Anomaly detection. Kill switches. Operations team watching dashboards.
Pattern 6: Cost discipline. The economics matter. Many "AI does it all" theoretical deployments don't pencil out against human or RPA alternatives.
Pattern 7: Specialized vs general. Production deployments tend to use specialized models (or specialized configurations) rather than general computer-use models for everything.
These patterns map to what production AI typically looks like — narrower scope and stronger guardrails than the marketing suggests.
Where computer-use agents shine in production
Specific categories of task where computer-use is the right tool:
1. Data extraction from sites without APIs
Many enterprise tools, government portals, and small B2B services don't have APIs. Or have APIs with significant gaps. Computer-use agents can extract data by operating the UI.
Examples in production:
- Pulling invoices from 50+ vendor portals.
- Extracting case data from court system websites.
- Scraping competitor pricing pages.
- Aggregating data from compliance reporting portals.
When the alternative is a human doing tedious clicking, computer-use is a clear win.
2. Form filling at scale
Submitting the same kind of form to many different sites. Each site is slightly different; an API would be ideal but doesn't exist.
Examples:
- Government applications (each agency has its own portal).
- Compliance filings.
- Customer onboarding into vendor systems.
- Account setup for SaaS tools.
3. UI testing and quality assurance
Computer-use agents make good QA testers. They can navigate apps, attempt user flows, report issues.
Examples:
- End-to-end testing of web apps.
- Visual regression testing.
- Accessibility audits.
- User-flow validation across multiple devices.
This is RPA-adjacent but with AI's flexibility to handle UI changes.
4. Cross-application workflows
Tasks that span multiple applications without a single integration point.
Examples:
- Pulling data from a CRM, formatting it, uploading to an analytics tool.
- Taking customer support tickets, creating tasks in a project tool, updating status in a CRM.
- Aggregating reports from multiple internal tools.
When you can't or won't integrate the apps directly, an agent provides a flexible bridge.
5. Repetitive multi-step processes
Tasks the same human keeps doing.
Examples:
- Onboarding new customers through a 30-step process.
- Reconciling data between two systems weekly.
- Generating periodic reports that require pulling from multiple sources.
If a process is well-defined, repeats often, and is currently done by humans clicking — it's a candidate.
Where they still fail
The other side: tasks computer-use agents are not yet ready for in production.
1. Tasks requiring judgment
"Find me a good vendor." Agents can navigate vendor sites; they can't make the judgment about which is good for your specific needs.
2. Tasks with novel UI patterns
A new site, never seen before. Agents struggle to discover unusual UI conventions. They work better on common patterns (forms, lists, navigation menus) than on bespoke designs.
3. Tasks with strong anti-bot measures
Many sites actively detect and block automation. Agents can sometimes get past these measures with effort, but it's a perpetual cat-and-mouse game and often not worth it.
4. High-stakes individual actions
Sending a payment, signing a legal document, posting publicly on behalf of someone. The blast radius of a wrong action is large; human review is essential.
5. Tasks requiring real-world context
The agent only sees what's on screen. It doesn't know your relationship with that customer, your team's recent context, the political situation. Tasks that depend on that off-screen context fail.
6. Open-ended exploration
"Find the best deal" or "research this person thoroughly" — tasks without clear completion criteria. Agents either spin or stop too early.
The architecture
A production computer-use system has these layers:
┌─────────────────────────────────────┐
│ Orchestration                       │ Schedules, retries, escalations
├─────────────────────────────────────┤
│ Task definition + scope             │ What the agent does and doesn't do
├─────────────────────────────────────┤
│ Agent runtime (Computer Use SDK)    │ Anthropic / OpenAI / Browserbase
├─────────────────────────────────────┤
│ Browser / desktop environment       │ Isolated, sandboxed
├─────────────────────────────────────┤
│ Authentication and session          │ Credentials, cookies, MFA handling
├─────────────────────────────────────┤
│ Result handling                     │ Capture, validate, store
├─────────────────────────────────────┤
│ Monitoring + alerts                 │ Real-time observability
└─────────────────────────────────────┘
We'll go through each.
Task definition
The single most important step. Define what the agent does, narrowly.
A good task definition includes:
Trigger. What initiates the task? (Schedule, event, manual.)
Inputs. What data does the agent have? (Specific record, structured form data.)
Scope. Which sites, which actions, which paths through the UI.
Success criteria. What does completion look like?
Stop conditions. What ends the task early?
Output. What data does the agent return?
Error semantics. How are failures categorized and reported?
A poorly defined task: "Submit our weekly compliance report."
A well-defined task:
Task: Submit weekly compliance report to portal X.
Trigger: Cron, every Monday at 9 AM.
Inputs:
- Report data file (CSV) from /reports/weekly.csv
- Submitter info from environment variables (name, ID).
- Credentials from secrets manager.
Scope:
- Site: https://portal.example.gov/submit (and subpaths)
- Allowed actions: navigate, click, type, upload, submit, screenshot.
- Forbidden: visit external sites, change account settings, navigate away from submission flow.
Success criteria:
- Receive confirmation page with submission ID.
- Capture submission ID.
Stop conditions:
- Confirmation received: success.
- CAPTCHA: escalate to human.
- Login failure: escalate to human.
- Form validation error: report and stop.
- Timeout 5 minutes: report and stop.
Output:
- Submission ID.
- Screenshot of confirmation page.
- Timestamp.
Errors:
- Validation: log, notify owner, do not retry.
- Auth: log, notify ops, do not retry.
- Network: retry once, then escalate.
This level of specificity is what production looks like. "Submit the report" is what demos look like.
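One way to keep a definition like this honest is to store it as data and refuse to schedule anything that leaves a field blank. The sketch below is illustrative only: `TaskDefinition`, its field names, and the `problems()` check are assumptions made for this example, not part of any computer-use SDK.

```python
from dataclasses import dataclass, field


@dataclass
class TaskDefinition:
    """One narrowly scoped task. Field names are illustrative, not from any SDK."""
    name: str
    trigger_cron: str                      # 5-field cron expression
    allowed_url_prefixes: tuple[str, ...]
    allowed_actions: frozenset[str]
    timeout_seconds: int
    required_outputs: tuple[str, ...]
    stop_conditions: dict[str, str] = field(default_factory=dict)
    error_policy: dict[str, str] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """List what is missing; an empty list means the definition is complete."""
        checks = {
            "no URL scope": not self.allowed_url_prefixes,
            "no allowed actions": not self.allowed_actions,
            "no time limit": self.timeout_seconds <= 0,
            "no success output": not self.required_outputs,
            "no stop conditions": not self.stop_conditions,
            "no error policy": not self.error_policy,
        }
        return [issue for issue, failed in checks.items() if failed]


WEEKLY_REPORT = TaskDefinition(
    name="Submit weekly compliance report to portal X",
    trigger_cron="0 9 * * 1",              # Mondays at 09:00
    allowed_url_prefixes=("https://portal.example.gov/submit",),
    allowed_actions=frozenset(
        {"navigate", "click", "type", "upload", "submit", "screenshot"}
    ),
    timeout_seconds=5 * 60,
    required_outputs=("submission_id", "confirmation_screenshot", "timestamp"),
    stop_conditions={
        "confirmation": "success",
        "captcha": "escalate_to_human",
        "login_failure": "escalate_to_human",
        "form_validation_error": "report_and_stop",
        "timeout": "report_and_stop",
    },
    error_policy={
        "validation": "log, notify owner, do not retry",
        "auth": "log, notify ops, do not retry",
        "network": "retry once, then escalate",
    },
)

# The "poorly defined" version: a name and a schedule, nothing else.
VAGUE = TaskDefinition(
    name="Submit our weekly compliance report",
    trigger_cron="0 9 * * 1",
    allowed_url_prefixes=(),
    allowed_actions=frozenset(),
    timeout_seconds=0,
    required_outputs=(),
)
```

Here `WEEKLY_REPORT.problems()` returns an empty list, while `VAGUE.problems()` lists every missing piece. A check like this can run in CI before a task is ever allowed onto the schedule.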
Scope enforcement
The scope isn't just a description; it's enforced at runtime.
URL allowlist. The agent can only navigate to URLs matching a defined pattern. Outside the allowlist, navigation is blocked.
Action filtering. Only certain action types are allowed. Wholesale "operate the computer" gives way to specific allowed actions.
Element filtering. Some pages have elements the agent should never interact with (settings, logout, dangerous buttons). These can be filtered out of the perception layer.
Time limits. Tasks have hard maximums. If not done in N minutes, abort.
Step limits. Tasks have a maximum number of actions. This is the same safeguard used for any agent loop: a run that hasn't finished within N steps is aborted.
Implementation varies by platform — Anthropic Computer Use, OpenAI Operator, Browserbase all have different mechanisms. The principle is universal: enforce scope at the runtime, not just describe it in the prompt.
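As a rough illustration of runtime enforcement, the sketch below checks every proposed action before it reaches the browser. It is platform-neutral and hypothetical: `ScopeGuard`, `ScopeViolation`, and the action names are assumptions for this example, and a real deployment would hook the check into whatever interception point its runtime offers.

```python
import posixpath
import time
from urllib.parse import urlsplit


class ScopeViolation(Exception):
    """Raised when a proposed action falls outside the task's scope."""


class ScopeGuard:
    def __init__(self, host, path_prefix, actions, max_steps, max_seconds,
                 clock=time.monotonic):
        self.host = host
        self.path_prefix = posixpath.normpath(path_prefix)
        self.actions = frozenset(actions)
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self._steps = 0

    def _url_in_scope(self, url):
        parts = urlsplit(url)
        # Normalize so "/submit/../settings" cannot pass a prefix check.
        path = posixpath.normpath(parts.path)
        return (
            parts.scheme == "https"
            and parts.hostname == self.host
            and (path == self.path_prefix
                 or path.startswith(self.path_prefix + "/"))
        )

    def check(self, action, url=None):
        """Call before executing each action; raises ScopeViolation to block it."""
        self._steps += 1
        if self._steps > self.max_steps:
            raise ScopeViolation("step limit exceeded")
        if self._clock() - self._started > self.max_seconds:
            raise ScopeViolation("time limit exceeded")
        if action not in self.actions:
            raise ScopeViolation(f"action not allowed: {action}")
        if url is not None and not self._url_in_scope(url):
            raise ScopeViolation(f"URL out of scope: {url}")
```

The URL check compares the parsed hostname and normalized path rather than doing a string prefix match on the whole URL, since a plain prefix match would accept a look-alike host such as `portal.example.gov.evil.test`.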
Authentication
A perpetual challenge. Production deployments need to authenticate the agent's session.
Pre-authenticated sessions. A human logs in once; the session cookies/tokens are captured; the agent operates within that session. When the session expires, a human logs in again or a refresh token renews it.
Service accounts. Dedicated accounts for the agent (where the site supports them). Scoped permissions, audit logging.
Credential injection. The agent receives credentials at runtime, uses them to log in, then discards them. Secure storage and handling required.
MFA handling. A real challenge. Options:
- Use TOTP secrets the agent can compute.
- Route MFA to a human for approval.
- Use accounts/sites that allow API tokens instead of MFA.
OAuth. For modern sites, OAuth flows work well — the agent gets a token from a flow approved once by the human.
The pattern: agents should never have humanlike access to your accounts. They should have scoped, auditable, terminable credentials.
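For the TOTP option in the MFA list above, the code itself is small. The following is a minimal sketch of the standard time-based one-time password algorithm (RFC 6238, HMAC-SHA1) using only the standard library. The function name and parameters are choices made for this example, and it assumes the base32 secret is fetched from a secrets manager at runtime rather than stored with the task.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at_time=None, step=30, digits=6):
    """Compute a TOTP code (RFC 6238, HMAC-SHA1) from a base32 secret."""
    cleaned = secret_b32.replace(" ", "").upper()
    cleaned += "=" * (-len(cleaned) % 8)          # restore stripped padding
    key = base64.b32decode(cleaned)

    now = time.time() if at_time is None else at_time
    counter = int(now // step)                    # number of elapsed time steps

    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                    # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)
```

One design consequence worth weighing: once the agent holds both the password and the TOTP secret, the second factor no longer adds independent protection for that account, which is one more reason to prefer a dedicated, scoped service account where the site allows it.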
Result validation
When the agent claims success, validate.
Capture artifacts. Screenshots, downloaded files, output data. Don't trust the agent's report; check the evidence.
Verify success conditions. Did the form actually submit? Is there a confirmation? Was the data correct?
Cross-check. If you can verify success through a different channel (an API, an email confirmation, a database check), do it.
Anomaly detection. Was this run unusually long, unusually short, unusually costly? Investigate outliers.
The pattern: assume the agent might be wrong. Have verification independent of the agent's self-report.
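A sketch of what "independent of the agent's self-report" can look like in code follows. Everything specific here is assumed for illustration: the `RunArtifacts` fields, the `SUB-` ID format, the `/confirmation` URL fragment, and the optional `exists_in_portal` lookup (which stands in for an API call, email check, or database query).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical ID format; replace with whatever the real portal issues.
SUBMISSION_ID = re.compile(r"\bSUB-\d{6}\b")


@dataclass
class RunArtifacts:
    agent_reported_success: bool   # recorded, but deliberately not trusted below
    final_url: str
    page_text: str                 # text extracted from the final page
    screenshot: bytes              # raw image bytes saved by the runtime


def validate(run: RunArtifacts,
             exists_in_portal: Optional[Callable[[str], bool]] = None) -> list:
    """Return reasons to distrust the run; an empty list means it checks out."""
    reasons = []
    if not run.screenshot:
        reasons.append("no confirmation screenshot captured")
    if "/confirmation" not in run.final_url:
        reasons.append("run did not end on a confirmation page")
    match = SUBMISSION_ID.search(run.page_text)
    if match is None:
        reasons.append("no submission ID on the final page")
    elif exists_in_portal is not None and not exists_in_portal(match.group()):
        reasons.append("submission ID not found via independent channel")
    return reasons
```

Note that `agent_reported_success` never influences the verdict. Comparing it against the validator's result afterwards is still useful: runs where the agent claimed success but validation failed are the ones most worth investigating.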
Error handling
Computer-use tasks fail in many ways. Categorize and handle each:
Network errors. Site down, timeout. Retry with backoff.
Auth failures. Login failed, session expired. Refresh credentials or escalate.
UI changes. Site changed; expected element not found. Stop, alert maintenance.
Validation errors. Form input rejected. Log, notify, do not retry blindly.
Anti-bot detection. CAPTCHAs, blocks. Escalate; potentially blacklist the site.
Agent confusion. Agent stuck, looping, going off-script. Kill, log, investigate.
Quota / rate limit. Site rate-limited the agent. Backoff and retry, or schedule for later.
Each category has different response semantics. Bad pattern: "agent failed, retry." Good pattern: "agent failed in category X, follow recipe X."
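The "recipe per category" idea can be written down as a lookup table, which keeps retry behavior reviewable in one place. This is a sketch under assumptions: the category names follow the list above, but the retry counts, delays, and action labels are illustrative values, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    NETWORK = "network"
    AUTH = "auth"
    UI_CHANGE = "ui_change"
    VALIDATION = "validation"
    ANTI_BOT = "anti_bot"
    AGENT_CONFUSION = "agent_confusion"
    RATE_LIMIT = "rate_limit"


@dataclass(frozen=True)
class Recipe:
    max_retries: int     # automatic retries before giving up
    base_delay_s: int    # first retry delay; doubles on each further attempt
    then: str            # what happens once retries are exhausted


# Retry counts and delays are illustrative assumptions.
RECIPES = {
    Failure.NETWORK: Recipe(3, 30, "escalate"),
    Failure.AUTH: Recipe(1, 0, "escalate"),   # one retry after a credential refresh
    Failure.UI_CHANGE: Recipe(0, 0, "alert_maintenance"),
    Failure.VALIDATION: Recipe(0, 0, "notify_owner"),
    Failure.ANTI_BOT: Recipe(0, 0, "escalate"),
    Failure.AGENT_CONFUSION: Recipe(0, 0, "kill_and_investigate"),
    Failure.RATE_LIMIT: Recipe(2, 300, "reschedule"),
}


def next_step(failure: Failure, failed_attempts: int) -> tuple:
    """Return (action, delay_seconds) given how many attempts have failed (>= 1)."""
    recipe = RECIPES[failure]
    if failed_attempts <= recipe.max_retries:
        return "retry", recipe.base_delay_s * 2 ** (failed_attempts - 1)
    return recipe.then, 0
```

With this table, a first network failure yields a retry after 30 seconds, while a first validation error goes straight to notifying the owner with no retry at all.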
Monitoring
Every action logged, every run tracked, every anomaly surfaced.
Per-run logs:
- Start/end timestamps.
- All actions taken.
- All screenshots.
- Outcome (success/failure/escalation).
- Cost.
- Performance metrics.
Operations dashboard: the operations team can see active runs, recent failures, and queue depth.
Aggregate metrics:
- Success rate per task type.
- Latency distribution.
- Cost per run.
- Anomaly rate.
Alerts:
- Success rate drops below threshold.
- Cost per run spikes.
- Specific failure types increase.
- Site UI may have changed (multiple recent failures on same step).
This monitoring is what catches issues before they become incidents.
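The alert conditions listed above reduce to a few aggregate checks over recent runs. The sketch below assumes a simple `Run` record and made-up thresholds (90% success, a 2x cost spike over a €2 baseline, three failures on the same step); real values would come from each task's own history.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    ok: bool
    cost_eur: float
    failed_step: Optional[str] = None   # name of the step where the run failed


def alerts(runs, min_success_rate=0.9, baseline_cost_eur=2.0,
           cost_spike_factor=2.0, same_step_failures=3):
    """Return human-readable alerts for a window of recent runs."""
    found = []
    if not runs:
        return found

    success_rate = sum(r.ok for r in runs) / len(runs)
    if success_rate < min_success_rate:
        found.append(f"success rate {success_rate:.0%} below threshold")

    mean_cost = sum(r.cost_eur for r in runs) / len(runs)
    if mean_cost > baseline_cost_eur * cost_spike_factor:
        found.append(f"mean cost per run spiked to EUR {mean_cost:.2f}")

    # Repeated failures on one step suggest the site's UI changed there.
    by_step = Counter(r.failed_step for r in runs if not r.ok and r.failed_step)
    for step, count in by_step.items():
        if count >= same_step_failures:
            found.append(f"{count} failures on step '{step}': site UI may have changed")
    return found
```

Grouping failures by step is what separates "the site is flaky today" from "the upload button moved", and the two call for different responses.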
The economics
The blunt question: is computer use cheaper than the alternative?
Costs:
- Per-run cost: typically €0.50-€5 depending on task complexity (vision model calls are expensive).
- Infrastructure: managed runtime (Browserbase, similar) or self-hosted.
- Maintenance: tasks break when sites change. Some ongoing work.
Alternatives:
- Human at €30/hour: a 10-minute task is €5. A 1-minute task is €0.50.
- RPA tools: lower per-run cost but require structured automation.
- Direct API integration: much cheaper per-call, but requires the API to exist.
- Outsourced offshore: €5-10/hour, similar math to in-house humans.
The economics favor computer-use when:
- The site has no API.
- The task is long enough that automation pays back fixed costs.
- Volume is high enough that human time accumulates.
- Site is relatively stable (low maintenance burden).
The economics don't favor computer-use when:
- An API exists (just use it).
- The task is short and infrequent.
- The site changes constantly.
- The task has too many edge cases (high maintenance).
A useful exercise: estimate cost per task in computer-use vs human. Multiply by volume. Compare.
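That exercise fits in a few lines. The sketch below computes the weekly volume at which automation starts to pay for its fixed costs; the function names are invented for this example, and the inputs used in the note that follows (€30/hour, €2 per run, €50/week of maintenance) are illustrative figures drawn from elsewhere in this article.

```python
import math


def human_cost_eur(task_minutes, hourly_rate_eur):
    """Cost of one manual execution of the task."""
    return task_minutes / 60 * hourly_rate_eur


def break_even_runs_per_week(task_minutes, hourly_rate_eur,
                             agent_cost_per_run_eur, fixed_cost_per_week_eur):
    """Weekly volume above which automation is cheaper than manual work.

    Returns math.inf when each agent run costs at least as much as the
    human doing it, since no volume can then recover the fixed costs.
    """
    saving_per_run = (human_cost_eur(task_minutes, hourly_rate_eur)
                      - agent_cost_per_run_eur)
    if saving_per_run <= 0:
        return math.inf
    return fixed_cost_per_week_eur / saving_per_run
```

For a 10-minute task at €30/hour (€5 manual), €2 per agent run, and €50/week of maintenance, break-even is 50 / 3, or roughly 17 runs per week. For a 1-minute task the manual cost is €0.50, below the per-run cost, so the function returns infinity: automation never pays back at any volume.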
Production patterns that work
A few patterns from successful deployments:
Pattern 1: The "recorded recipe" approach
For high-volume, narrow tasks: record the workflow once with explicit steps, then have the agent replay it on each input with minor adjustments.
This is closer to traditional RPA but with AI's flexibility to handle minor variations (e.g., a button moved slightly, an extra confirmation dialog).
In practice this is far more reliable than fully autonomous operation.
Pattern 2: The "extract and submit" split
Many workflows have two phases:
- Extract data from somewhere.
- Submit data somewhere.
Splitting these into separate agent runs (or recipes) is cleaner. Each phase has clearer success criteria. Failures in one don't compound into the other.
Pattern 3: The "human checkpoint" pattern
The agent does prep work autonomously, then surfaces a "ready to act" state for human approval. Human reviews, approves, agent executes.
Used for: payments, public posts, sensitive submissions. The agent saves time on prep; the human catches errors.
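A minimal sketch of the checkpoint follows, assuming a hypothetical `PendingAction` record that the agent fills in and a reviewer approves. The one non-obvious detail is that approval is bound to a fingerprint of the payload, so whatever gets executed is exactly what the human saw.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Optional


def _digest(payload: dict) -> str:
    """Stable fingerprint of the exact payload shown to the reviewer."""
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


@dataclass
class PendingAction:
    kind: str                            # e.g. "payment", "public_post"
    payload: dict                        # everything the agent prepared
    approved_by: Optional[str] = None
    approved_digest: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        """Record who approved, and fingerprint what they approved."""
        self.approved_by = reviewer
        self.approved_digest = _digest(self.payload)

    def execute(self, perform: Callable[[dict], object]) -> object:
        """Run the action only if the approved payload is unchanged."""
        if self.approved_digest is None:
            raise PermissionError("not approved by a human")
        if _digest(self.payload) != self.approved_digest:
            raise PermissionError("payload changed after approval")
        return perform(self.payload)
```

In use, the agent creates the `PendingAction`, the review UI calls `approve()`, and only then does the orchestrator call `execute()` with the function that performs the real click or submission.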
Pattern 4: The "specialist agent" pattern
Rather than one general agent, have specialized agents for specific tasks. Each is tuned, tested, and maintained for its specific workflow.
A general "operate any website" agent is hard to maintain. A "submit our weekly compliance report" agent is straightforward.
Pattern 5: The "fallback to RPA" pattern
For tasks where AI flexibility isn't actually needed (the site is stable, the workflow is fixed), fall back to traditional RPA (Playwright scripts, Selenium). Cheaper, faster, more reliable for those cases.
Use computer-use specifically when AI's flexibility adds value.
Pattern 6: The "batched run" pattern
Don't run agents on-demand for high-volume tasks. Batch the work; run agents in parallel on a schedule.
E.g., instead of "user submits a request; agent runs immediately," queue requests, run agents on a 15-minute batch. Smooths the load, simplifies the architecture.
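A batch runner along these lines can be quite small. In this sketch, `run_agent` is a placeholder for whatever function launches one agent run, and the batch size and parallelism are arbitrary example values; a scheduler (cron or similar) would call `run_batch` every 15 minutes.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue


def drain(queue: Queue, limit: int) -> list:
    """Take up to `limit` waiting requests off the queue without blocking."""
    batch = []
    while len(batch) < limit:
        try:
            batch.append(queue.get_nowait())
        except Empty:
            break
    return batch


def run_batch(queue: Queue, run_agent, max_parallel=4, batch_size=50) -> list:
    """Run one scheduled batch; returns (request, result_or_exception) pairs."""
    batch = drain(queue, batch_size)

    def safe(request):
        try:
            return request, run_agent(request)
        except Exception as exc:      # one failed run must not sink the batch
            return request, exc

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(safe, batch))
```

Capping `max_parallel` also acts as a crude rate limit toward the target site, which matters for the anti-bot and account-ban failure modes.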
What can go wrong
A short list of failures we've seen:
The site changed. The agent worked perfectly for 6 months. Site redesign breaks everything. Without monitoring, you find out from angry users.
Anti-bot detection caught up. The site implemented bot detection. Agent runs fail more and more often. Eventually the account is banned.
Cost spiral on a stuck task. Agent loops on a confusing page. Each cycle is a vision-model call. €100 in an hour.
Wrong action taken. Agent clicked the wrong button. Cancelled an order instead of confirming. Or sent a message to the wrong person.
Stuck on MFA. Agent can't get past MFA. Production runs back up. Queue grows.
Account banned. Site detected unusual activity, suspended the account. All similar tasks broken until account is restored.
Credential leak. Agent accidentally exposed credentials in a log or screenshot. Security incident.
Privacy issue. Agent inadvertently captured PII in screenshots that were logged.
Most of these are preventable with the patterns above. But each has happened in real deployments. Build defenses accordingly.
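The cost-spiral case in particular has a cheap defense: a per-run budget that also notices when the agent keeps seeing the same screen. The sketch below is an assumption-laden illustration. `RunBudget` and `RunAborted` are names invented here, and hashing raw screenshot bytes only catches pixel-identical screens; pages with clocks or animations would need a perceptual hash or a DOM comparison instead.

```python
import hashlib
from collections import Counter


class RunAborted(Exception):
    """Raised when a run should be killed rather than allowed to continue."""


class RunBudget:
    def __init__(self, max_cost_eur: float, max_same_screen: int):
        self.max_cost_eur = max_cost_eur
        self.max_same_screen = max_same_screen
        self.spent_eur = 0.0
        self._seen = Counter()

    def record_step(self, screenshot: bytes, step_cost_eur: float) -> None:
        """Call after every model call; raises RunAborted when a limit is hit."""
        self.spent_eur += step_cost_eur
        if self.spent_eur > self.max_cost_eur:
            raise RunAborted(f"cost budget exceeded: EUR {self.spent_eur:.2f}")

        # Identical screens over and over usually mean the agent is looping.
        key = hashlib.sha256(screenshot).hexdigest()
        self._seen[key] += 1
        if self._seen[key] > self.max_same_screen:
            raise RunAborted("stuck: same screen seen too many times")
```

With a budget of a few euros per run, a looping agent is stopped after a handful of wasted calls instead of an hour of them.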
A realistic ROI example
Production deployment, real numbers (anonymized):
Task: Submit weekly regulatory reports to 12 different state agency portals.
Without automation: 1 person × 6 hours/week × €30/hour = €180/week. That is 30 minutes per portal on average.
With computer-use automation:
- Per-portal run cost: ~€2 (vision model + infrastructure).
- 12 portals × €2 = €24/week.
- Engineer maintenance: ~2 hours/month × €100/hour = €200/month ≈ €50/week.
- Total: ~€74/week.
Savings: €106/week ≈ €5,500/year. Plus the human time is freed for higher-value work.
Reliability: ~92% success rate on autonomous runs. Other 8% escalate to human. Human resolves in 10 minutes typically.
Outcomes:
- Net positive economics.
- Faster completion (parallel processing).
- Audit trail (every action logged).
- Worth maintaining.
This is what production success looks like. Not magical "AI does it all" — measured, monitored, with clear ROI.
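The arithmetic above leaves out the human time spent on escalated runs. As a quick check, the sketch below redoes the calculation with that cost included, using only figures stated in the example (8% escalation, 10 minutes each, €30/hour) and keeping the article's rounding of maintenance to €50/week.

```python
HOURLY_RATE_EUR = 30
PORTALS = 12

manual_per_week = 6 * HOURLY_RATE_EUR              # 6 hours of manual work
agent_runs_per_week = PORTALS * 2                  # EUR 2 per portal run
maintenance_per_week = 50                          # EUR 200/month, rounded

# 8% of 12 runs is about one escalation per week, 10 minutes each.
escalations_per_week = PORTALS * 0.08
escalation_cost_per_week = escalations_per_week * (10 / 60) * HOURLY_RATE_EUR

automated_per_week = (agent_runs_per_week
                      + maintenance_per_week
                      + escalation_cost_per_week)
savings_per_week = manual_per_week - automated_per_week
savings_per_year = savings_per_week * 52

print(f"automated: EUR {automated_per_week:.2f}/week")
print(f"savings:   EUR {savings_per_week:.2f}/week, EUR {savings_per_year:.0f}/year")
```

Escalations add about €5/week, bringing the automated total to roughly €79/week and the savings to roughly €101/week, or about €5,260/year. The conclusion holds; the margin is slightly thinner than the headline figure.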
A deployment checklist
If you're deploying a computer-use system to production:
- [ ] Task is narrow and well-defined.
- [ ] Scope enforced at runtime, not just described.
- [ ] Step / time / cost budgets in place.
- [ ] Authentication strategy with secure credentials.
- [ ] Anti-bot considerations (use legitimate accounts; respect rate limits).
- [ ] Error categorization and handling.
- [ ] Result validation independent of agent self-report.
- [ ] Monitoring and alerting.
- [ ] Kill switches.
- [ ] Human-in-loop for consequential actions.
- [ ] Audit logging.
- [ ] Privacy/PII handling.
- [ ] Cost economics make sense vs alternatives.
- [ ] Maintenance plan for when sites change.
Each is non-trivial. Skipping any creates a risk.
The takeaway
Computer-use and browser agents in production look different from the viral demos. Narrow tasks. Strong guardrails. Heavy monitoring. Human checkpoints. Realistic economics.
For the right tasks, they're genuinely useful. Data extraction from API-less sites, form filling at scale, cross-application workflows, repetitive UI work. Real production deployments save real time and money.
For the wrong tasks — open-ended judgment, novel UIs, high-stakes individual actions — they're not yet ready. Don't try to make them.
The skill is in matching technology to task. Done well, computer-use agents are a useful tool in the production AI stack. Done poorly, they're an expensive way to introduce new failure modes.
Pick narrow tasks. Build the guardrails. Monitor relentlessly. Maintain consistently. That's how computer-use agents earn their place in production systems.