Org in a Box
Browser Automation

How agents interact with the web using accessibility-tree snapshots instead of screenshots.

Overview

The browser plugin gives agents full web browsing capability inside the sandbox. Instead of screenshots (images of around 5 MB that consume large token budgets), Org in a Box (OIAB) uses accessibility-tree snapshots: text representations of the page structure in which each element carries a ref=N ID.

A typical snapshot looks like:

[page] Acme Corp — Dashboard
  [nav ref=1] Main Navigation
    [link ref=2] Home
    [link ref=3] Reports
  [main ref=4]
    [heading ref=5] Q2 2026 Pipeline
    [button ref=6] Export CSV
    [table ref=7] Opportunities
      [row ref=8] Acme Deal | $50k | Closing Q2

The agent reads the text and issues browser_click ref=6, which the plugin translates into a Playwright page.click() call. This is roughly 100× cheaper in tokens than working from screenshots.
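
To make the ref lookup concrete, here is a minimal sketch that parses a snapshot in the format shown above and resolves a ref to its role and label. The names SnapshotNode and parseSnapshot are hypothetical and not part of the plugin's API, and the two-space indentation is an assumption taken from the example; how the plugin itself maps a ref to a Playwright call is not specified here.

```typescript
// Hypothetical helper type; not part of the OIAB plugin API.
interface SnapshotNode {
  role: string;
  ref: number;
  label: string;
  depth: number;
}

// Matches lines such as "    [button ref=6] Export CSV".
const NODE_LINE = /^(\s*)\[(\w+) ref=(\d+)\]\s*(.*)$/;

// Builds a ref -> node index from the snapshot text.
function parseSnapshot(snapshot: string): Map<number, SnapshotNode> {
  const nodes = new Map<number, SnapshotNode>();
  for (const line of snapshot.split("\n")) {
    const match = NODE_LINE.exec(line);
    if (!match) continue; // the "[page] ..." title line carries no ref
    const [, indent, role, ref, label] = match;
    nodes.set(Number(ref), {
      role,
      ref: Number(ref),
      label: label.trim(),
      depth: indent.length / 2, // assumes two-space indentation
    });
  }
  return nodes;
}

const exampleSnapshot = [
  "[page] Acme Corp Dashboard",
  "  [nav ref=1] Main Navigation",
  "    [link ref=2] Home",
  "    [link ref=3] Reports",
  "  [main ref=4]",
  "    [heading ref=5] Q2 2026 Pipeline",
  "    [button ref=6] Export CSV",
  "    [table ref=7] Opportunities",
  "      [row ref=8] Acme Deal | $50k | Closing Q2",
].join("\n");

// Look up the element an agent would target with browser_click ref=6.
const target = parseSnapshot(exampleSnapshot).get(6);
console.log(target);
```

An index keyed by ref keeps the lookup constant-time, which matters little for one click but keeps the sketch close to how an agent addresses elements: by number, never by CSS selector.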

Available Tools

Tool               | Arguments          | Description
browser_navigate   | url                | Open URL; returns accessibility snapshot
browser_click      | ref                | Click element by ref ID; returns new snapshot
browser_type       | ref, text          | Type into element; returns new snapshot
browser_snapshot   | (none)             | Return current page accessibility tree
browser_screenshot | (none)             | Return base64 PNG screenshot (use sparingly)
browser_scroll     | direction, amount? | Scroll page; returns new snapshot
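
The tool list above can be sketched as a TypeScript discriminated union, which lets the compiler reject calls with missing or extra arguments. This is an illustration under assumptions, not the plugin's actual schema: the type name BrowserToolCall, the helper formatToolCall, and the argument types (ref and amount as numbers, direction as a free-form string) are all hypothetical.

```typescript
// Hypothetical typing of the tools listed above; the plugin's real schema
// may differ. Argument types (number, string) are assumptions.
type BrowserToolCall =
  | { tool: "browser_navigate"; url: string }
  | { tool: "browser_click"; ref: number }
  | { tool: "browser_type"; ref: number; text: string }
  | { tool: "browser_snapshot" }
  | { tool: "browser_screenshot" }
  | { tool: "browser_scroll"; direction: string; amount?: number };

// Renders a call in the bracketed notation used on this page.
function formatToolCall(call: BrowserToolCall): string {
  switch (call.tool) {
    case "browser_navigate":
      return `[browser_navigate url="${call.url}"]`;
    case "browser_click":
      return `[browser_click ref=${call.ref}]`;
    case "browser_type":
      return `[browser_type ref=${call.ref} text="${call.text}"]`;
    case "browser_snapshot":
    case "browser_screenshot":
      return `[${call.tool}]`;
    case "browser_scroll":
      // amount is optional, so it is only rendered when present
      return call.amount === undefined
        ? `[browser_scroll direction=${call.direction}]`
        : `[browser_scroll direction=${call.direction} amount=${call.amount}]`;
  }
}

console.log(formatToolCall({ tool: "browser_click", ref: 6 }));
```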

Example Agent Interaction

User: Go to acme.com and download the latest invoice

Agent: [browser_navigate url="https://acme.com"]
→ snapshot shows login form

Agent: [browser_type ref=12 text="alice@acme.com"]
Agent: [browser_type ref=13 text="***"]
Agent: [browser_click ref=14]  (Sign In button)
→ snapshot shows dashboard

Agent: [browser_click ref=22]  (Invoices link)
→ snapshot shows invoice list

Agent: [browser_click ref=31]  (Download PDF for latest invoice)
→ File saved to /workspace/user/invoice-2026-03.pdf

Sandbox Requirements

Browser automation requires the sandbox image, which already includes:

  • Chromium + chromium-codecs-ffmpeg
  • Playwright (installed in the sandbox via bun add playwright)
  • Xvfb virtual display (:1)
  • noVNC at :6080 for visual debugging

The browser plugin uses a lazy singleton: one Chromium instance per sandbox process, created on the first browser_navigate call and closed on process exit.
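
The lazy-singleton pattern can be sketched as follows. This is a generic illustration, not the plugin's source: lazySingleton and Closable are hypothetical names, and a fake launcher stands in for Playwright's chromium.launch() so the sketch runs without Chromium installed.

```typescript
// Hypothetical stand-in for a browser handle; with Playwright this would be
// the Browser returned by chromium.launch().
interface Closable {
  close(): Promise<void>;
}

// Wraps an async factory so the resource is created once, on first use.
function lazySingleton<T extends Closable>(launch: () => Promise<T>) {
  let instance: Promise<T> | undefined;
  return {
    // The promise is cached rather than the resolved value, so concurrent
    // first calls share one launch instead of starting two browsers.
    get(): Promise<T> {
      if (instance === undefined) {
        instance = launch();
      }
      return instance;
    },
    // Closes the resource if it was ever created; safe to call otherwise.
    async dispose(): Promise<void> {
      if (instance === undefined) return;
      const pending = instance;
      instance = undefined;
      await (await pending).close();
    },
  };
}

// Fake launcher that only counts how many times it is invoked.
let launches = 0;
const browser = lazySingleton(async () => {
  launches += 1;
  return { close: async () => {} };
});

const first = browser.get();
const second = browser.get();
console.log(first === second, launches);
```

How the plugin hooks dispose() into process shutdown is not described here; a real implementation would need to run it from a shutdown path that can await asynchronous work.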

Debugging Visually

Open http://localhost:6080 in your browser to see the live Chromium session via noVNC. This is invaluable for debugging complex web interactions.

Audit Trail

Every browser action is logged to the audit log:

Action             | Logged Fields
browser.navigate   | url, sessionId
browser.click      | ref, url
browser.type       | ref (content redacted)
browser.screenshot | url
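
As an illustration of the redaction rule for browser.type, the sketch below builds an audit entry that records the ref but never the typed text. The AuditEntry shape and the auditType helper are hypothetical; only the field names url, sessionId, and ref come from the table above, and the "action" key is an assumption.

```typescript
// Hypothetical audit-entry shape based on the table above. The "action" key
// and the value types are assumptions.
type AuditEntry =
  | { action: "browser.navigate"; url: string; sessionId: string }
  | { action: "browser.click"; ref: number; url: string }
  | { action: "browser.type"; ref: number }
  | { action: "browser.screenshot"; url: string };

// Builds the entry for a browser_type call. The typed text is accepted as a
// parameter only to show that it is deliberately left out of the entry.
function auditType(ref: number, _text: string): AuditEntry {
  return { action: "browser.type", ref };
}

const entry = auditType(13, "hunter2");
console.log(JSON.stringify(entry)); // {"action":"browser.type","ref":13}
```

Leaving the text out of the entry type altogether, rather than masking it at write time, means a password cannot reach the log through a forgotten masking step.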

Performance Tips

  • Prefer browser_snapshot over browser_screenshot: snapshots are text and don't consume image tokens
  • Use browser_click with specific ref IDs rather than broad selectors
  • Don't try to manage tabs: the plugin supports a single page instance, so navigating to a new URL replaces the current page
