Browser

sb.browser is a Playwright-driven Chromium running inside the sandbox. It's stealthed by default (passes bot.sannysoft.com fingerprint checks), threads its calls through a per-VM worker, and keeps state across calls — same tab, same cookies — until you close().

sb.browser.start() launches a visible Chromium under Xvfb by default, so the session is ready to stream / VNC. Pass headless=True to run full Chromium with --headless=new (same binary as headed Chrome, no HeadlessChrome UA) when you don't need a screen.

The agent loop

The canonical pattern, matching Manus / browser-use:

sb.browser.start()
sb.browser.navigate("https://example.com", wait_for_load="networkidle")

last_token = None   # first turn has no prior view (assumes None is accepted)
done = False
while not done:
    obs = sb.browser.view(since_view_token=last_token)   # perception
    decision = llm.decide(obs)                            # reason (llm is your own client)
    if decision.action == "click":
        sb.browser.click_idx(decision.idx)                # act
    elif decision.action == "type":
        sb.browser.input_idx(decision.idx, decision.text)
    elif decision.action == "scroll":
        sb.browser.scroll_by(decision.direction)
    elif decision.action == "done":                       # hypothetical stop signal
        done = True
    last_token = obs["view_token"]

view() is the perception primitive. click_idx / input_idx / select_option_idx are the action primitives. The agent never writes CSS selectors; it picks an idx from the indexed element list view() returned.
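One way to hand the indexed list to a text-only LLM is to render each element as a single line keyed by its idx. The sketch below relies only on the element fields shown in this page's view() example; format_elements is a hypothetical helper, not part of the SDK.

```python
def format_elements(elements, max_text=60):
    """Render view()["elements"] as one '[idx] <tag> text' line per element."""
    lines = []
    for el in elements:
        text = (el.get("text") or "").strip().replace("\n", " ")
        if len(text) > max_text:
            text = text[: max_text - 3] + "..."
        line = f'[{el["idx"]}] <{el["tag"]}> {text}'
        if el.get("href"):
            line += f' -> {el["href"]}'
        lines.append(line)
    return "\n".join(lines)


# Stand-in data shaped like the elements list; not real view() output.
sample = [
    {"idx": 0, "tag": "a", "text": "More info", "href": "https://www.iana.org/"},
    {"idx": 1, "tag": "button", "text": "Submit"},
]
prompt_block = format_elements(sample)
```

The model then only has to answer with an action and an idx, which maps directly onto click_idx / input_idx.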

What view() returns

v = sb.browser.view()
# {
#   "view_token":     "8fcc2271f82f40cf",        # feed back next turn
#   "url":            "https://example.com",
#   "title":          "Example Domain",
#   "markdown":       "# Example Domain\n\nThis domain is for use ...",
#   "elements":       [                           # indexed, frame-aware
#     {"idx": 0, "tag": "a", "text": "More info", "x": 297, "y": 209,
#      "width": 87, "height": 19, "href": "https://www.iana.org/..."},
#     {"idx": 1, "tag": "a", "text": "Learn more",
#      "frame": "https://example.com/", "x": 169, "y": 275, ...},
#   ],
#   "screenshot_b64": "iVBORw0KGgoAAAANSUhE...",  # annotated PNG
#   "viewport":       {"width": 1280, "height": 800},
#   "wait_state":     "networkidle",
#   "total_chars":    1247, "returned_chars": 1247, "truncated": False,
# }

By default it:

  • Walks every same-origin AND cross-origin iframe — captchas, embedded payments, third-party widgets are visible. iframe-resident elements include a "frame" field; coords are translated to top-level viewport so click_idx works through frame boundaries.
  • Waits for network-idle (500 ms of network silence, capped at 5 s) before reading. Pass wait_for_load=None to skip and read immediately.
  • Annotates the screenshot with yellow numbered boxes that match each element's idx. Pair with a vision LLM for "click 5".
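To consume these fields on the client side, a minimal sketch follows. It assumes screenshot_b64 is standard base64 of a PNG and that iframe-resident elements are exactly those carrying a "frame" key, as described above; decode_screenshot and iframe_elements are hypothetical helper names.

```python
import base64

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(view):
    """Return the annotated screenshot as raw PNG bytes, or None if absent."""
    b64 = view.get("screenshot_b64")
    if not b64:
        return None
    data = base64.b64decode(b64)
    if not data.startswith(PNG_MAGIC):
        raise ValueError("screenshot_b64 did not decode to a PNG")
    return data


def iframe_elements(view):
    """Elements that live inside an iframe (they carry a 'frame' field)."""
    return [el for el in view.get("elements") or [] if "frame" in el]


# Stand-in payload: only a PNG signature, not a real screenshot.
fake_view = {
    "screenshot_b64": base64.b64encode(PNG_MAGIC + b"rest").decode("ascii"),
    "elements": [
        {"idx": 0, "tag": "a", "text": "More info"},
        {"idx": 1, "tag": "a", "text": "Learn more", "frame": "https://example.com/"},
    ],
}
png = decode_screenshot(fake_view)
framed = iframe_elements(fake_view)
```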

Diff-since-last-view

Pass the prior view_token to get only what changed:

v1 = sb.browser.view()
last = v1["view_token"]

sb.browser.click_idx(some_button)   # some_button: an idx taken from v1["elements"]

v2 = sb.browser.view(since_view_token=last, delta_only=True)
# v2 = {
#   "view_token": "...",
#   "diff": {
#     "added_count": 3, "removed_count": 1, "stable_count": 12,
#     "added_elements":   [{idx, tag, text, ...}],
#     "removed_elements": [{tag, text, href}],
#     "markdown_unchanged": False,
#     "url_changed": False,
#   },
#   "markdown": None, "elements": None, "screenshot_b64": None,
#   # (delta_only suppresses redundant full payload)
# }

This cuts token usage 5–10× in long-running agent loops. Snapshots are held in an LRU cache capped at 16 per worker.
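A delta payload can be condensed further before it reaches the model. The sketch below assumes the diff shape shown in the example above; describe_diff is a hypothetical helper, and the handling of a missing "diff" key is an assumption, since this page does not say what view() returns when a token is no longer cached.

```python
def describe_diff(view):
    """Summarise a delta_only view() result as a short text block."""
    diff = view.get("diff")
    if diff is None:
        # Assumption: no "diff" key means the caller got a full view instead.
        return "full view (no diff available)"
    parts = [
        f'+{diff["added_count"]} / -{diff["removed_count"]} / '
        f'={diff["stable_count"]} elements'
    ]
    if diff.get("url_changed"):
        parts.append("url changed")
    if not diff.get("markdown_unchanged", True):
        parts.append("page text changed")
    for el in diff.get("added_elements", []):
        parts.append(f'added [{el["idx"]}] <{el["tag"]}> {el.get("text", "")}')
    for el in diff.get("removed_elements", []):
        parts.append(f'removed <{el["tag"]}> {el.get("text", "")}')
    return "\n".join(parts)


# Stand-in delta shaped like the example above; values are made up.
sample_delta = {
    "view_token": "abc",
    "diff": {
        "added_count": 1, "removed_count": 1, "stable_count": 12,
        "added_elements": [{"idx": 13, "tag": "div", "text": "Saved"}],
        "removed_elements": [{"tag": "button", "text": "Save", "href": None}],
        "markdown_unchanged": False,
        "url_changed": False,
    },
}
summary = describe_diff(sample_delta)
```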

Index-based interaction

Pick an element by its idx from view() or clickables():

sb.browser.click_idx(7, humanlike=True)            # Bezier cursor + click
sb.browser.input_idx(3, "alice@example.com",       # focus + clear + type
                     clear=True, humanlike=True)
sb.browser.select_option_idx(12, "Europe")         # <select> by value or label
sb.browser.move_mouse(idx=4)                       # hover (for menus)
sb.browser.scroll_in_element(8, "down", 0.5)       # scroll within element

Each idx is re-resolved server-side on every call, so small DOM updates between view() and the action are tolerated as long as the element at that position is still present.
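Because resolution is positional, a cautious agent may want to confirm that the element at an idx is still the one it saw before a destructive action. This is a sketch of one possible guard, comparing tag and text from a fresh view() against the earlier element; idx_matches is a hypothetical helper, and tag plus text is only a heuristic identity.

```python
def idx_matches(elements, idx, expected):
    """True if the element now at `idx` has the same tag and text as `expected`."""
    current = next((el for el in elements if el["idx"] == idx), None)
    if current is None:
        return False
    return (current["tag"], current.get("text")) == (
        expected["tag"], expected.get("text"))


# Stand-in data: the list shifted by one after a DOM update.
before = {"idx": 7, "tag": "button", "text": "Delete"}
after_elements = [
    {"idx": 7, "tag": "button", "text": "Archive"},
    {"idx": 8, "tag": "button", "text": "Delete"},
]
still_same = idx_matches(after_elements, 7, before)
```

In a sandbox the second argument would come from a fresh sb.browser.view()["elements"]; when the check fails, re-plan from the new view rather than clicking.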

Selector / coordinate / JS escape hatches

When idx isn't the right fit (you wrote the agent loop yourself, you have stable selectors, or you need raw JS):

sb.browser.click("button.login")
sb.browser.fill("input[name='email']", "alice@example.com")
sb.browser.text("h1")
items = sb.browser.clickables()
img   = sb.browser.screenshot_annotated()      # numbered overlays
sb.browser.click_at(items[5]["x"], items[5]["y"], humanlike=True)
titles = sb.browser.evaluate("""
    () => [...document.querySelectorAll(".titleline > a")]
              .map(a => a.innerText).slice(0, 5)
""")

Network capture

Read XHR data the page received, instead of scraping the rendered DOM:

sb.browser.navigate("https://api-driven-spa.example.com")
log = sb.browser.network_log(url_contains="/api/", phase="response")

for entry in log["entries"]:
    print(entry["status"], entry["url"])
    print(entry["body_preview"])     # first 10 KB of JSON / text bodies

Bodies are captured for application/json, text/*, and javascript content types. Available filters: url_contains, method, phase (request/response), and min_status. The log is cleared on restart() or network_clear().
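Since body_preview holds only the first 10 KB, a large JSON body may arrive cut off mid-document. The sketch below parses what it can and skips the rest; it assumes body_preview is a string (or missing) on each entry, and json_bodies is a hypothetical helper.

```python
import json


def json_bodies(log):
    """Return (url, parsed_json) pairs for entries whose preview parses cleanly."""
    out = []
    for entry in log.get("entries", []):
        preview = entry.get("body_preview")
        if not preview:
            continue
        try:
            out.append((entry["url"], json.loads(preview)))
        except json.JSONDecodeError:
            # Truncated or non-JSON preview: skip instead of failing the loop.
            continue
    return out


# Stand-in log shaped like the network_log() result above.
sample_log = {
    "entries": [
        {"status": 200, "url": "https://x.test/api/items", "body_preview": '{"items": [1, 2]}'},
        {"status": 200, "url": "https://x.test/api/big", "body_preview": '{"items": [1, 2'},
        {"status": 204, "url": "https://x.test/api/ping", "body_preview": None},
    ]
}
parsed = json_bodies(sample_log)
```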

Tabs

Multi-tab work happens in the same browser context. Tab IDs are 4-character strings, matching browser-use's convention.

sb.browser.navigate("https://google.com/search?q=foo")
v = sb.browser.view()
top_links = [e for e in v["elements"] if e["tag"] == "a" and e.get("href")][:5]

# Open all 5 in parallel tabs
new_tabs = [sb.browser.tabs_new(link["href"]) for link in top_links]

# Switch between them; subsequent sb.browser.* calls act on the active tab.
sb.browser.tabs_switch(new_tabs[2]["id"])
text = sb.browser.text("article")

sb.browser.tabs_close(new_tabs[2]["id"])
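The same three calls can be wrapped into a loop that reads every opened tab and closes it afterwards. This is a sketch under assumptions: text() is taken to return a string, and FakeBrowser is a stand-in so the example runs without a sandbox. collect_tab_text is a hypothetical helper.

```python
def collect_tab_text(browser, tabs, selector="article"):
    """Visit each tab, read `selector`, close the tab; return {tab_id: text}."""
    texts = {}
    for tab in tabs:
        browser.tabs_switch(tab["id"])      # make this tab the active one
        try:
            texts[tab["id"]] = browser.text(selector)
        finally:
            browser.tabs_close(tab["id"])   # close even if the read failed
    return texts


class FakeBrowser:
    """Stand-in exposing only the three calls used above."""

    def __init__(self, pages):
        self.pages, self.active, self.closed = pages, None, []

    def tabs_switch(self, tab_id):
        self.active = tab_id

    def text(self, selector):
        return self.pages[self.active]

    def tabs_close(self, tab_id):
        self.closed.append(tab_id)


fake = FakeBrowser({"a1b2": "first article", "c3d4": "second article"})
result = collect_tab_text(fake, [{"id": "a1b2"}, {"id": "c3d4"}])
```

Switching explicitly before every read avoids depending on which tab becomes active after a close, which this page does not specify.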

What's exposed

flowchart TB
    subgraph SDK["sb.browser API"]
        VIEW[view / extract_markdown / find_elements / search_page]
        ACT[click_idx / input_idx / select_option_idx / move_mouse / drag]
        BROWSE[navigate / back / forward / reload / restart]
        TABS[tabs_list / tabs_new / tabs_switch / tabs_close]
        SCROLL[scroll_by / scroll_in_element / send_keys]
        READ[text / text_all / attr / html / dropdown_options]
        VISUAL[screenshot / screenshot_annotated / clickables]
        SAVE[save_pdf / upload_file / save_profile]
        OBS[network_log / console_view]
        EVAL[evaluate]
    end

More extraction primitives

For tasks where view() is overkill or under-specified:

# Standalone richer markdown — supports pagination
md = sb.browser.extract_markdown(
    extract_links=True, extract_images=True,
    start_from_char=0, max_chars=20000,
)
# {url, title, markdown, total_chars, returned_chars, truncated}

# Selector-driven enumeration (any CSS, not just visible-interactive)
rows = sb.browser.find_elements("table.products tr",
                                  attributes=["data-id", "class"])
# {selector, count, total, items: [{idx, tag, text, attributes, bbox}]}

# Text/regex search across the page with surrounding context
hits = sb.browser.search_page(r"\$\d+", regex=True, context_chars=80)
# {pattern, count, matches: [{match, before, after, offset}]}

# List <option>s for a <select> at idx
opts = sb.browser.dropdown_options(12)
# {tag, name, multiple, options: [{value, label, selected, disabled}]}
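The pagination parameters on extract_markdown suggest a simple read-until-done loop. The sketch below assumes that truncated is True whenever text remains past the returned window and that returned_chars counts the characters in markdown; read_all_markdown and FakeBrowser are hypothetical names used for illustration.

```python
def read_all_markdown(browser, page_size=20000, max_pages=50):
    """Page through extract_markdown() until the result is no longer truncated."""
    chunks, offset = [], 0
    for _ in range(max_pages):
        page = browser.extract_markdown(start_from_char=offset, max_chars=page_size)
        chunks.append(page["markdown"])
        offset += page["returned_chars"]
        # Stop when complete, or when no progress is made (guards against loops).
        if not page["truncated"] or page["returned_chars"] == 0:
            break
    return "".join(chunks)


class FakeBrowser:
    """Stand-in that pages over a fixed string with the assumed semantics."""

    def __init__(self, text):
        self.text = text

    def extract_markdown(self, start_from_char=0, max_chars=20000, **_):
        piece = self.text[start_from_char:start_from_char + max_chars]
        end = start_from_char + len(piece)
        return {"markdown": piece, "total_chars": len(self.text),
                "returned_chars": len(piece), "truncated": end < len(self.text)}


doc = "abcdefghij" * 5
full = read_all_markdown(FakeBrowser(doc), page_size=16)
```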

Save a PDF, upload a file

sb.browser.save_pdf(path="/tmp/page.pdf",
                     paper_format="A4", landscape=False)
pdf_bytes = sb.files.read("/tmp/page.pdf")

# Upload via the indexed <input type="file">
csv_text = "id,name\n1,alice\n"                # placeholder payload
sb.files.write("/tmp/data.csv", csv_text)
sb.browser.upload_file(idx=4, paths="/tmp/data.csv")

Console messages and uncaught errors

log = sb.browser.console_view(limit=50, clear=False)
# {"messages": [{type: "warning", text: "...", location: {...}, ts: ...}],
#  "total": 50}

Useful for debugging agents that fail because the page logged a JS error mid-flow.
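A small filter makes that check cheap to run after every step. Only the "warning" type appears in the example above; the "error" value used here is an assumption that the type strings follow Playwright's console message types. console_errors is a hypothetical helper.

```python
def console_errors(log, types=("error",)):
    """Return console_view() messages whose type is in `types`."""
    return [m for m in log.get("messages", []) if m.get("type") in types]


# Stand-in log shaped like the console_view() result above.
sample_log = {
    "messages": [
        {"type": "warning", "text": "deprecated API"},
        {"type": "error", "text": "TypeError: x is undefined"},
    ],
    "total": 2,
}
errors = console_errors(sample_log)
lines = [f'{m["type"]}: {m["text"]}' for m in errors]
```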

Stealth profile

sb.browser.start() defaults to stealth=True. That turns on:

| Layer | Patch |
|---|---|
| Launch flags | --disable-blink-features=AutomationControlled, drop --enable-automation |
| Backend | Patchright (drop-in fork that fixes Runtime.enable + console.debug leaks at the chromium binary level) |
| navigator.webdriver | scrubbed from prototype + own |
| window.chrome | populated |
| navigator.plugins | five-entry PluginArray (real plugins, real prototype) |
| navigator.languages | ['en-US', 'en'] |
| navigator.permissions.query | consistent with native Chrome |
| navigator.hardwareConcurrency | 8 |
| navigator.deviceMemory | 8 |
| navigator.maxTouchPoints | 0 |
| Notification.permission | 'default' |
| window.outerHeight/Width | offset from inner |
| navigator.userAgentData | Chrome 120 brands + Linux platform |
| Canvas | sub-pixel noise on getImageData / toDataURL |
| AudioContext | float-buffer noise on getChannelData |
| WebRTC | host-candidate IPs stripped |
| WebGL | vendor / renderer reported as Intel Inc. / Iris OpenGL Engine |
| User-Agent | Linux Chrome 120 (no HeadlessChrome string) |
| Locale | en-US |
| Timezone | Europe/Ljubljana |

To get a vanilla automation profile (debugging, captcha tests):

sb.browser.start(stealth=False)
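To check which profile a session is actually running, a few of the listed values can be probed from the page and compared against the stealth defaults. This is a sketch: it assumes evaluate() returns the JS object as a Python dict, and it probes only values the stealth list states exactly. PROBE_JS, EXPECTED, and fingerprint_mismatches are hypothetical names.

```python
PROBE_JS = """
    () => ({
        languages: [...navigator.languages],
        hardwareConcurrency: navigator.hardwareConcurrency,
        deviceMemory: navigator.deviceMemory,
        maxTouchPoints: navigator.maxTouchPoints,
    })
"""

# Values taken from the stealth profile listed on this page.
EXPECTED = {
    "languages": ["en-US", "en"],
    "hardwareConcurrency": 8,
    "deviceMemory": 8,
    "maxTouchPoints": 0,
}


def fingerprint_mismatches(probe, expected=EXPECTED):
    """Return {key: (actual, expected)} for every probed value that differs."""
    return {k: (probe.get(k), v) for k, v in expected.items() if probe.get(k) != v}


# In a sandbox: probe = sb.browser.evaluate(PROBE_JS)
# Stand-in result so the sketch runs anywhere:
probe = {"languages": ["en-US", "en"], "hardwareConcurrency": 4,
         "deviceMemory": 8, "maxTouchPoints": 0}
mismatches = fingerprint_mismatches(probe)
```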

Captcha solving

Two built-in solving paths, plus a bring-your-own-VLM option. Use whichever is cheaper:

Audio challenge: for reCAPTCHA v2 / hCaptcha on sites that allow audio mode. Cost: $0, accuracy ~70-80%, ~5-10 s per solve.

if sb.browser.detect_captcha():
    sb.captcha.solve_recaptcha_audio(retries=3)
    # → {"verified": True, "attempts": 1, "text": "..."}

Third-party solver (2captcha): works for reCAPTCHA v2 / v3, hCaptcha, and Cloudflare Turnstile. Cost: $0.001-0.003 per solve, ~30-60 s.

info = sb.browser.detect_captcha()
if info:
    sb.browser.solve_captcha_on_page(api_key="<2captcha-key>")

Image challenge with your own VLM: surfaces the reCAPTCHA image puzzle as data your VLM can act on.

challenge = sb.captcha.recaptcha_open_image_challenge()
# ... your VLM picks indices ...
sb.captcha.recaptcha_click_cells([0, 4, 7])
sb.captcha.recaptcha_verify_image()
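The free and paid solvers can be chained so the free one is always tried first. The sketch below assumes solve_recaptcha_audio reports failure through "verified": False rather than raising, and it does not inspect the return value of solve_captcha_on_page, since this page does not show it. solve_cheapest and fake_sandbox are hypothetical names, and the detect_captcha payload is a stand-in.

```python
from types import SimpleNamespace


def solve_cheapest(sb, api_key=None, audio_retries=3):
    """Try the free audio solver, then the paid one. Returns the path taken."""
    if not sb.browser.detect_captcha():
        return "none"
    result = sb.captcha.solve_recaptcha_audio(retries=audio_retries)
    if result.get("verified"):
        return "audio"
    if api_key is None:
        return "unsolved"
    sb.browser.solve_captcha_on_page(api_key=api_key)
    return "paid"


def fake_sandbox(audio_ok):
    """Stand-in sandbox exposing only the three calls used above."""
    calls = []
    browser = SimpleNamespace(
        detect_captcha=lambda: {"type": "recaptcha"},   # made-up payload
        solve_captcha_on_page=lambda api_key: calls.append("paid"),
    )
    captcha = SimpleNamespace(
        solve_recaptcha_audio=lambda retries: {"verified": audio_ok, "attempts": retries},
    )
    return SimpleNamespace(browser=browser, captcha=captcha, calls=calls)


outcome = solve_cheapest(fake_sandbox(audio_ok=False), api_key="<2captcha-key>")
```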

Profile persistence

Cookies + localStorage + sessionStorage survive across sandbox lifetimes when you save them as a named profile:

# First sandbox: log in once.
with Sandbox.create(template="browser-use") as sb:
    sb.browser.start()
    sb.browser.navigate("https://app.example.com/login")
    sb.browser.fill("#email", "alice@example.com")
    sb.browser.fill("#password", "...")
    sb.browser.click("button[type=submit]")
    sb.browser.save_profile("alice-app")

# Later sandbox: restore — already logged in.
with Sandbox.create(template="browser-use") as sb:
    sb.browser.start(profile="alice-app")
    sb.browser.navigate("https://app.example.com/dashboard")

Profiles live in /var/firebox-profiles/<name>.json inside the template's rootfs. They're template-scoped (alice-app saved from browser-use only loads in browser-use).

Live preview via noVNC

Hand the user a URL they can open in any browser tab to watch the agent work in real time:

stream = sb.stream.start()
print(f"Watch the agent: {stream['url']}")
# → http://your-host:51234/vnc.html?autoconnect=1&password=...

sb.stream spins up Xvfb + x11vnc + websockify + noVNC inside the sandbox and DNATs the noVNC port out via sb.ports.expose. If the template lacks websockify and novnc, they are apt-installed automatically.

sb.stream.stop() tears down the public mapping; sandbox close cleans the rest.
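If the public mapping should never outlive the task, the start/stop pair fits naturally into a context manager. This is a minimal sketch using only sb.stream.start() and sb.stream.stop() as described above; live_preview and FakeStream are hypothetical names, the latter a stand-in so the example runs without a sandbox.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def live_preview(sb):
    """Start the noVNC stream and always stop it on exit, even on errors."""
    stream = sb.stream.start()
    try:
        yield stream["url"]
    finally:
        sb.stream.stop()


class FakeStream:
    """Stand-in for sb.stream."""

    def __init__(self):
        self.stopped = False

    def start(self):
        return {"url": "http://your-host:51234/vnc.html"}

    def stop(self):
        self.stopped = True


fake_sb = SimpleNamespace(stream=FakeStream())
with live_preview(fake_sb) as url:
    seen_url = url   # share this with the user, then run the agent loop here
```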

What it can't do

  • Real Chrome TLS — Chromium's TLS hello differs from Chrome's. Cloudflare Bot Manager / Akamai do fingerprint at the TLS layer. Workaround: use sb.http (curl_cffi with real Chrome 120 JA3) for raw HTTP calls behind the same cookies.
  • Mobile emulation — viewport-based only; no touch event fidelity.
  • Pause / resume of the whole VM mid-test — no Firecracker snapshot integration yet (warm pool is on the roadmap).
  • GPU rendering — Chromium runs on CPU. The Spark's Blackwell is idle from the sandbox's POV.