
Browser agents and computer use: what they can actually do today

Browser agents and computer-use AI promise to operate your computer the way you do. The reality in 2026 is more useful and more limited than the demos suggest. This is a grounded guide to what works, what doesn't, and where to apply them.

What you should be able to do

Evaluate the implementation pattern, failure modes, and guardrails before building.

May 15, 2026

In 2024 and 2025, "computer use" became one of the most-hyped AI capabilities. Anthropic's Computer Use, OpenAI's Operator, Google's Project Mariner, and a wave of startups all promised the same thing: an AI that operates your computer the way you do — clicking buttons, filling forms, navigating the web, completing tasks.

In 2026, the demos still look impressive. The real-world usage tells a different story. Browser and computer-use agents work — for a specific range of tasks. They fail badly for others. The gap between "demo works" and "production reliable" is wider here than for almost any other AI capability.

This article cuts through the hype with a grounded look at what these agents can actually do today, where they break, and how to deploy them sensibly.

What browser and computer-use agents are

A browser agent operates a web browser autonomously. It sees the page (either rendered visually or as DOM/HTML), decides what to do, takes an action (click, type, scroll, navigate), observes the result, and decides the next action. It repeats this loop until the task is done or it gives up.
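The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `Action`, `run_agent`, and the `observe`, `decide`, and `act` callables are hypothetical names, and the step limit of 20 is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """One decision from the model (hypothetical shape, for illustration only)."""
    kind: str          # e.g. "click", "type", "scroll", "navigate", "done", "give_up"
    target: str = ""   # element or URL the action applies to
    text: str = ""     # text to type, if any


def run_agent(
    observe: Callable[[], str],
    decide: Callable[[str, str], Action],
    act: Callable[[Action], None],
    task: str,
    max_steps: int = 20,
) -> str:
    """Observe, decide, act; stop when done, when the model gives up, or at the step limit."""
    for _ in range(max_steps):
        observation = observe()             # screenshot or DOM text, supplied by the caller
        action = decide(task, observation)  # the LLM call, supplied by the caller
        if action.kind == "done":
            return "done"
        if action.kind == "give_up":
            return "gave_up"
        act(action)                         # perform the click, keystroke, and so on
    return "step_limit_reached"
```

The explicit `max_steps` matters: without it, an agent that never reports "done" would loop indefinitely.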

A computer-use agent does the same, but for the whole desktop — not just a browser. It can operate any application: spreadsheets, email clients, design tools, IDEs, anything.

Both have the same core capability: closing the loop between an LLM's decisions and real-world software actions. The difference is scope.

The major implementations in 2026:

Anthropic Computer Use: Claude operating a desktop or browser. Most mature for desktop tasks.
OpenAI Operator / Agent SDK: focused on browser tasks with a managed runtime.
Google Project Mariner / Gemini browser agents: browser-focused, deeply integrated with Chrome.
Browserbase, Skyvern, and others: independent browser agent platforms.
Manus, Cursor's Computer Use: newer entrants.
Open-source frameworks: browser-use, Playwright + LangChain, etc.

The capabilities and reliability vary, but the patterns are similar.

What works in 2026

Some categories of task are reliably solvable by current browser/computer-use agents:

1. Short, well-defined web tasks

"Go to this site, find this information, copy it into a doc." The agent navigates a known URL, finds a known element, and extracts a known piece of data. These are 5-30 second tasks, and they work reliably (95%+) on stable sites.

Examples that work:

"Look up the current price of this product on this site."
"Get the latest blog post titles from this URL."
"Fill out this contact form with this information."

2. Repeated tasks on the same site

If you do the same task on the same site repeatedly, an agent can be tuned for that workflow. The agent's actions can be recorded once, generalised slightly, and replayed reliably.

Examples:

"For each of these 50 leads, look them up on LinkedIn and copy their job title to my CRM."
"Submit this form to each of these 20 government portals."
"Download invoices from each of these vendor portals into a folder."

These are the "RPA replaced by AI" use cases, where RPA is robotic process automation. Agents do them reasonably well, especially with explicit guardrails.

3. Reading and summarising

"Visit these 10 URLs and produce a summary of what they're saying about X." Agents are good at navigating, extracting text, and producing summaries. This is essentially Deep Research with a different framing.

4. Form-filling from structured data

If you have data in one format and need to enter it into a web form, an agent can do it. The structured input keeps the task well-defined.

5. Triggered notifications and monitoring

"Check this page every hour and tell me if X changes." Agents work well for this because the task is repetitive and narrow.

6. Cross-tab and cross-app workflows for known patterns

"Take the data from this Google Sheet, format it for this CRM, and upload it." If the workflow is well-defined and the apps are stable, the agent can execute reliably.

What still breaks in 2026

The hype demos show agents handling complex, multi-step, novel tasks. In production, these are the failure modes:

1. Long tasks

A task requiring 50+ actions is much less reliable than one requiring 5. Errors compound: each step has some probability of failure, so the chance that a long chain completes without a single error falls quickly. At 90% success per step, a 50-step task succeeds only 0.9^50 ≈ 0.5% of the time.

The implication: keep tasks short. A 20-step task is at the upper edge of reliable. A 100-step task is unreliable today.
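The compounding arithmetic is easy to check for your own numbers. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; `chain_success` and `required_per_step` are illustrative helper names.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach a target end-to-end success rate."""
    return target ** (1.0 / steps)


# Print end-to-end success for a few chain lengths at 90% per step.
for steps in (5, 20, 50, 100):
    print(f"{steps:>3} steps at 90% per step: {chain_success(0.90, steps):.1%}")

# Per-step reliability a 20-step task needs to succeed 95% of the time overall.
print(f"needed per step: {required_per_step(0.95, 20):.2%}")
```

Read in the other direction, the second function shows why short tasks matter: the per-step reliability needed for a dependable 20-step task is already well above 99%.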

2. Tasks requiring judgement

"Find a good restaurant for dinner" requires preferences, evaluation, comparison. Agents can navigate to a restaurant site and book — but they cannot reliably make the underlying judgement calls. They'll pick the first restaurant that matches the literal criteria, missing the implicit preferences.

The implication: use agents for execution after the human has decided. Don't use them for the decision itself.

3. Tasks requiring authentication or sensitive operations

Agents struggle with multi-factor authentication, CAPTCHAs, and other security challenges. They also have no business handling financial transactions or sensitive data without strict controls.

The implication: pre-authenticate the agent's session, scope it tightly, and avoid high-stakes actions.

4. Tasks on hostile or unstable sites

Sites that change frequently, have aggressive anti-bot measures, or deliberately make it hard to navigate break agents. Some examples:

Airline booking sites with complex multi-step flows and frequent design changes.
E-commerce sites with anti-scraping measures.
Social media platforms that detect and block automation.

The implication: pick sites that are agent-friendly. APIs are always better than scraping if available.

5. Tasks requiring exploration

"Find me a flight that fits my preferences" requires the agent to explore options, evaluate, backtrack, try again. Current agents are bad at this kind of exploratory search. They tend to settle on the first reasonable option rather than continuing to look for better ones.

The implication: provide constraints that pin down the search, or do the exploration yourself and have the agent execute.

6. Tasks requiring understanding context outside the page

"Reply to this email appropriately based on what we've discussed in past meetings" requires context the agent doesn't have. Agents only see what they can read on-screen.

The implication: feed the agent the necessary context explicitly as part of the task description.

7. Tasks where small errors are unacceptable

Filing taxes, sending money, signing contracts — anything where a mistake is costly. Agents make errors, even on simple tasks. The blast radius matters.

The implication: keep humans in the loop for anything with significant consequences.

The reliability gap

A useful framing: agents have a "reliability gap" that varies by task.

Closed-world tasks (stable input, stable environment, stable output): 95%+ reliability is achievable. These are the cases where agents shine.
Mostly closed-world tasks (some variation, mostly predictable): 80-95% reliability. Worth using, but needs human review.
Open-world tasks (variable input, dynamic environment, judgement required): 40-80% reliability. Probably not worth using as full automation; useful as a draft-and-review tool.

Be honest about which category your task falls into before deploying an agent.
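One way to stay honest is to measure rather than guess. The sketch below maps an observed success rate from trial runs onto the three bands above; the function name and the returned labels are illustrative, and the thresholds are simply the 95% and 80% boundaries from the list.

```python
def deployment_mode(successes: int, runs: int) -> str:
    """Suggest how to deploy an agent based on its measured success rate."""
    if runs <= 0:
        raise ValueError("need at least one trial run")
    rate = successes / runs
    if rate >= 0.95:
        return "automate"                    # closed-world band
    if rate >= 0.80:
        return "automate with human review"  # mostly closed-world band
    return "draft-and-review only"           # open-world band
```

A small sample can mislead, so treat the result from a handful of runs as a first estimate rather than a verdict.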

Practical patterns that work

A few patterns that turn agents from demos into useful tools:

Pattern 1: The "scoped" agent

Don't give the agent free run of the web. Give it a specific site, specific actions, specific stopping conditions.

Task: Visit linkedin.com, find the profile of [person name], extract their current job title, employer, and location. Return as JSON.

You may only:
- Navigate within linkedin.com
- Read the profile page
- Extract text
You may not:
- Click messaging buttons
- Send connection requests
- Navigate outside linkedin.com

If the profile is not found within 30 seconds, return {"found": false}.

The scope constraints reduce the action space, which dramatically improves reliability.
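Prompt-level constraints are stronger when the harness also enforces them in code. The following is a minimal sketch of such a check, assuming the agent's actions pass through a function you control; `ALLOWED_DOMAIN`, `ALLOWED_ACTIONS`, and `is_allowed` are hypothetical names.

```python
from urllib.parse import urlparse

ALLOWED_DOMAIN = "linkedin.com"                   # the only site this agent may touch
ALLOWED_ACTIONS = {"navigate", "read", "extract"}  # no clicking "message" or "connect"


def is_allowed(action: str, url: str) -> bool:
    """Return True only for permitted actions on the permitted site."""
    host = (urlparse(url).hostname or "").lower()
    # Accept the bare domain and its subdomains, but not look-alikes
    # such as "linkedin.com.evil.example" or "notlinkedin.com".
    on_site = host == ALLOWED_DOMAIN or host.endswith("." + ALLOWED_DOMAIN)
    return action in ALLOWED_ACTIONS and on_site
```

Rejecting a disallowed action before it runs means a confused or manipulated model cannot leave the scope, whatever the prompt says.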

Pattern 2: The "human review" loop

Have the agent draft its answer or plan, then require human approval before executing destructive actions.

Agent plan:
1. Navigate to vendor portal.
2. Log in with provided credentials.
3. Find invoice for May 2026.
4. Download to /tmp/invoices/may-2026.pdf.
5. Confirm download.

PROCEED? [y/n]

For agents handling money, files, or external communication, this human review step is non-negotiable. The agent saves time by drafting; the human catches errors.
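A sketch of this approval gate, under the assumption that the agent emits its plan as a list of steps before acting. The step format, the `DESTRUCTIVE` set, and the function names are illustrative choices, not part of any particular agent framework.

```python
from typing import Callable, Dict, List

# Step kinds treated as consequential (an assumption; adjust to your own workflow).
DESTRUCTIVE = {"download", "submit", "send", "pay", "delete"}


def needs_approval(plan: List[Dict[str, str]]) -> bool:
    """True if any step in the plan is consequential enough to need sign-off."""
    return any(step["kind"] in DESTRUCTIVE for step in plan)


def execute_with_review(
    plan: List[Dict[str, str]],
    run_step: Callable[[Dict[str, str]], None],
    ask_human: Callable[[str], str] = input,
) -> str:
    """Show the plan and wait for approval before running anything consequential."""
    if needs_approval(plan):
        print("Agent plan:")
        for number, step in enumerate(plan, start=1):
            print(f"{number}. {step['kind']} {step['target']}")
        if ask_human("PROCEED? [y/n] ").strip().lower() != "y":
            return "rejected"  # nothing has been executed at this point
    for step in plan:
        run_step(step)
    return "executed"
```

The gate sits before the first step rather than before the destructive one, so a rejected plan leaves no half-finished work behind.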

Pattern 3: The "fallback to human"

Configure the agent to halt and ask for help when it's stuck rather than guessing.

If at any step you encounter:
- An unexpected page state
- A CAPTCHA or login challenge
- An ambiguous decision (multiple valid options)
- An error message

Stop and report. Do not attempt to recover or guess.

This prevents the catastrophic "agent makes 50 wrong decisions trying to recover" failure mode.
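The same rule can be enforced outside the prompt. The sketch below checks each page against the stop conditions listed above and raises instead of continuing; the marker strings, `NeedsHuman`, and `check_page` are assumptions for illustration, and real pages would need more careful detection than substring matching.

```python
class NeedsHuman(Exception):
    """Raised when the agent should stop and hand the task to a person."""


# Phrases that suggest a challenge or failure (illustrative, not exhaustive).
HALT_MARKERS = ("captcha", "verify you are human", "sign in", "log in", "error")


def check_page(page_text: str, expected_marker: str, options_found: int = 1) -> None:
    """Raise NeedsHuman on any stop condition; return normally if the page looks fine."""
    text = page_text.lower()
    for marker in HALT_MARKERS:
        if marker in text:
            raise NeedsHuman(f"page contains '{marker}'")
    if expected_marker.lower() not in text:
        raise NeedsHuman("unexpected page state")
    if options_found != 1:
        raise NeedsHuman(f"ambiguous decision: {options_found} candidate options")
```

Calling `check_page` before every action turns "do not guess" from an instruction the model may ignore into a condition the harness enforces.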

Pattern 4: The "recorded workflow"

For high-volume repeated tasks, record the workflow once with explicit step definitions, then have the agent replay rather than re-deciding each time.

This converts the task from "agent figures out how to do this" to "agent executes this known recipe with minor adjustments." Because the agent no longer re-decides every step, far fewer things can go wrong on each run.
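A recorded workflow can be as simple as a list of steps with placeholders. The sketch below assumes a hypothetical vendor portal at `portal.example` and made-up selectors; `RECORDED_STEPS` and `replay` are illustrative names.

```python
from string import Template
from typing import Callable, Dict, List

# A workflow recorded once; "$month" is filled in on each run.
RECORDED_STEPS: List[Dict[str, str]] = [
    {"kind": "navigate", "target": "https://portal.example/invoices"},
    {"kind": "type", "target": "#month", "text": "$month"},
    {"kind": "click", "target": "#download"},
]


def replay(
    recipe: List[Dict[str, str]],
    params: Dict[str, str],
    run_step: Callable[[Dict[str, str]], None],
) -> None:
    """Fill in the placeholders, then execute the recorded steps in order."""
    # Substitute everything first: a missing parameter raises KeyError
    # before any step has run, so a bad input never causes a partial run.
    filled = [
        {key: Template(value).substitute(params) for key, value in step.items()}
        for step in recipe
    ]
    for step in filled:
        run_step(step)
```

For example, `replay(RECORDED_STEPS, {"month": "2026-05"}, run_step)` runs the same three steps every time, with only the month changing.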

Pattern 5: The "structured handoff"

Agents pair well with humans when the handoff is structured. Examples:

Agent extracts data from 100 pages; human reviews and approves in batches.
Agent drafts 50 personalised outreach messages; human selects which to send.
Agent monitors 20 pages for changes; human is notified and decides next action.

The agent handles the breadth and tedium; the human applies judgement.

The cost dimension

Computer use is expensive. Each action is a vision-model call (often a large one), which costs more than a text-only call. A 50-step task can cost €0.50-€2.00 in API fees.

This matters for high-volume tasks. A 1,000-task day at €1 per task is €1,000/day in fees — often more than just paying a human to do it.
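A back-of-the-envelope calculator makes this comparison concrete. The per-step price below is derived from the range above (€0.50-€2.00 over 50 steps is roughly €0.01-€0.04 per step); the human hourly rate and minutes per task are placeholder assumptions to replace with your own figures.

```python
def cost_per_task(steps: int, cost_per_step: float) -> float:
    """API cost of one agent task in euros."""
    return steps * cost_per_step


def daily_cost(tasks_per_day: int, per_task: float) -> float:
    """Total cost in euros for a day's volume."""
    return tasks_per_day * per_task


def human_cost_per_task(hourly_rate: float, minutes_per_task: float) -> float:
    """Labour cost in euros of a person doing one task by hand."""
    return hourly_rate * minutes_per_task / 60


agent = cost_per_task(steps=50, cost_per_step=0.02)                # assumed mid-range price
human = human_cost_per_task(hourly_rate=30.0, minutes_per_task=2)  # assumed figures
print(f"agent: €{daily_cost(1000, agent):,.0f}/day, human: €{daily_cost(1000, human):,.0f}/day")
```

With these particular assumptions the two come out about even, which is the point: the answer depends on your step count and labour cost, so run the numbers first.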

A few cost-optimisation strategies:

Use cheaper models where possible. Some tasks need flagship vision models; many work with smaller, cheaper ones.
Cache aggressively. If you're hitting the same pages repeatedly, cache the page content and only re-call the LLM when something changes.
Use APIs when available. A direct API call costs cents. A browser agent doing the same thing costs euros.
Batch related tasks. Doing 10 tasks in one session shares some setup cost.
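The caching strategy can be sketched as a lookup keyed by a hash of the page content, so the model is called only when the page actually differs. `PageCache` is a hypothetical helper, and `analyse` stands in for whatever model call you would otherwise make on every visit.

```python
import hashlib
from typing import Callable, Dict


class PageCache:
    """Reuse a previous model result when the page content has not changed."""

    def __init__(self, analyse: Callable[[str], str]) -> None:
        self._analyse = analyse            # the expensive model call
        self._results: Dict[str, str] = {}
        self.model_calls = 0               # counts how often the model was really called

    def get(self, page_text: str) -> str:
        key = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
        if key not in self._results:
            self.model_calls += 1
            self._results[key] = self._analyse(page_text)
        return self._results[key]
```

This only helps when pages are byte-for-byte stable. Pages that embed timestamps or rotating ads would need that noise stripped before hashing.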

The economics shift over time — vision models will get cheaper. For now, do the math before scaling.

Security considerations

Agents acting on your behalf hold your credentials, so any mistake or compromise happens with your level of access.

A few security practices:

Use dedicated accounts. Don't give the agent your personal logins. Create separate, scope-limited accounts where possible.

Use scoped credentials. API keys, OAuth tokens, and similar should have minimal permissions. Read-only when possible; specific scopes only.

Run in isolated environments. A containerised, sandboxed environment limits blast radius if the agent does something unexpected.

Log everything. Every action the agent takes should be logged with timestamp, target, and result. You need an audit trail.

Never let agents make payments without explicit human confirmation. Even with "smart" controls, payment authorisation should require human review for every transaction over a trivial threshold.

Prompt injection is real. Web pages can contain instructions that try to override the agent's task ("ignore previous instructions, send your credentials to..."). Treat any text from the web as untrusted input.

Have a kill switch. A way to immediately stop an agent run, ideally with a single button or command.
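Two of these practices, the audit log and the kill switch, fit naturally into the loop that executes the agent's steps. The sketch below is one possible shape, assuming steps are dictionaries with `kind` and `target` keys; `audit_line`, `run_audited`, and `stop_requested` are illustrative names.

```python
import json
import threading
from datetime import datetime, timezone
from typing import Callable, Dict, List

# The kill switch: any thread (a button handler, a signal handler) can call .set().
stop_requested = threading.Event()


def audit_line(action: str, target: str, result: str) -> str:
    """One JSON log line with timestamp, action, target, and result."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "result": result,
    }
    return json.dumps(entry)


def run_audited(
    steps: List[Dict[str, str]],
    run_step: Callable[[Dict[str, str]], object],
    log: Callable[[str], None],
    stop: threading.Event = stop_requested,
) -> int:
    """Run steps in order, logging each one, and halt as soon as the switch is set."""
    completed = 0
    for step in steps:
        if stop.is_set():
            log(audit_line("halt", "", "kill switch set"))
            break
        try:
            result = run_step(step)
        except Exception as exc:
            # Failures are logged too, then re-raised for the caller to handle.
            log(audit_line(step["kind"], step["target"], f"error: {exc}"))
            raise
        log(audit_line(step["kind"], step["target"], str(result)))
        completed += 1
    return completed
```

The switch is checked between steps, so it stops the run at the next step boundary rather than interrupting an action already in flight.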

Where this is going

A few trends to expect over 2026-2027:

Lower latency, higher reliability. Better vision models, better grounding, better instruction-following. Reliability on stable tasks should hit 99%+.

More structured environments. Sites and apps will increasingly offer "agent modes" — purpose-built APIs or interfaces designed for agent use. This will dramatically improve reliability for participating apps.

Tighter sandboxing. Standardised ways to scope an agent's actions, similar to how mobile app permissions evolved.

Specialised agents. Rather than general "do anything" agents, expect specialised ones for specific verticals — booking flights, processing invoices, managing email. These will be far more reliable than generalists.

Better economics. Vision-model costs have been falling quickly, by some estimates around 10x every 12-18 months. If that continues, agent costs by late 2027 should be a fraction of today's.

A starter framework

If you want to try a browser agent for the first time, here's a simple starter plan:

Pick a task that fits the "shines" profile. Short (5-20 steps), well-defined, stable site, low stakes.

Pick a tool that matches. For most users, OpenAI Operator, Anthropic Computer Use, or Browserbase are the simplest entry points.

Write the task as a short, explicit prompt. Include the scope, the success criteria, and the stopping conditions.

Run it and observe. Watch what the agent does. Note where it hesitates or goes wrong. The first 10 runs are diagnostic.

Tighten the prompt. Most agents improve dramatically with better prompts — more specific instructions, explicit constraints, clearer success criteria.

Test with edge cases. Run on data that might break the agent (missing info, unexpected formats). See how it handles them.

Add review steps. Once the happy path works, add explicit human review for any consequential actions.

Scale carefully. Start with 10 tasks/day, scale to 100, scale to 1,000 only after you've seen reliability hold up.
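The last step can be made mechanical: raise the daily limit only after enough recent runs have succeeded. In the sketch below the tiers come from the step above, while the minimum of 50 runs and the 95% threshold are assumptions to tune; `next_daily_limit` is an illustrative name.

```python
from typing import List

TIERS = [10, 100, 1000]  # tasks per day, from the scaling step above


def next_daily_limit(
    current: int,
    outcomes: List[bool],
    min_runs: int = 50,
    min_success: float = 0.95,
) -> int:
    """Move up one tier only if recent runs are numerous and reliable enough."""
    if len(outcomes) < min_runs:
        return current                   # not enough evidence yet
    success_rate = sum(outcomes) / len(outcomes)
    if success_rate < min_success:
        return current                   # reliability has not held up
    higher = [tier for tier in TIERS if tier > current]
    return higher[0] if higher else current
```

Here `outcomes` is a list of recent run results, `True` for success, ideally judged by a human reviewer rather than by the agent itself.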

The honest takeaway

Browser and computer-use agents are not the "AI does your job" technology the demos suggest. They are not yet reliable enough to operate autonomously on complex, judgement-heavy work.

They are, however, increasingly useful for narrow, repetitive, well-defined tasks where the alternative is a human doing tedious clicking and copy-pasting. In that sweet spot — and only there — they save real time today.

The right framing is not "can I replace this employee with an agent?" It's "can I replace this hour of clicking with an agent?" The answer to the second question is increasingly yes. The first will remain mostly no for some time.

Match the technology to the task. Be conservative with scope. Keep humans in the loop for anything consequential. Within those constraints, browser agents are a real productivity tool.
