Browser¶
sb.browser is a Playwright-driven Chromium running inside the
sandbox. It's stealthed by default (passes
bot.sannysoft.com fingerprint checks),
threads its calls through a per-VM worker, and keeps state across
calls — same tab, same cookies — until you close().
sb.browser.start() launches a visible Chromium under Xvfb by
default, so the session is ready to stream / VNC. Pass
headless=True to run full Chromium with --headless=new (same
binary as headed Chrome, no HeadlessChrome UA) when you don't need
a screen.
The agent loop¶
The canonical pattern, matching Manus / browser-use:
sb.browser.start()
sb.browser.navigate("https://example.com", wait_for_load="networkidle")
last_token = None                                       # assumes None means "no prior view"
done = False
while not done:
    obs = sb.browser.view(since_view_token=last_token)  # perception
    decision = llm.decide(obs)                          # reason (llm is your own wrapper)
    if decision.action == "click":
        sb.browser.click_idx(decision.idx)              # act
    elif decision.action == "type":
        sb.browser.input_idx(decision.idx, decision.text)
    elif decision.action == "scroll":
        sb.browser.scroll_by(decision.direction)
    else:
        done = True                                     # assumption: any other action ends the loop
    last_token = obs["view_token"]
view() is the perception primitive. click_idx / input_idx /
select_option_idx are the action primitives. The agent never writes
CSS selectors; it picks an idx from the indexed element list view()
returned.
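The act step of the loop can be pulled into a small dispatcher. The sketch below is one hedged way to do it: the `dispatch` name and the shape of the `decision` dict are assumptions made for illustration, and `browser` is any object exposing the action primitives named above (normally `sb.browser`).

```python
from typing import Any


def dispatch(browser: Any, decision: dict) -> bool:
    """Apply one LLM decision to the browser.

    Returns False when the decision means "stop", True otherwise.
    The decision keys (action, idx, text, value, direction) are assumed
    for this sketch; they are not part of the SDK.
    """
    action = decision.get("action")
    if action == "click":
        browser.click_idx(decision["idx"])
    elif action == "type":
        browser.input_idx(decision["idx"], decision["text"])
    elif action == "select":
        browser.select_option_idx(decision["idx"], decision["value"])
    elif action == "scroll":
        browser.scroll_by(decision["direction"])
    elif action == "done":
        return False
    else:
        # Fail loudly instead of silently ignoring a malformed decision.
        raise ValueError(f"unknown action: {action!r}")
    return True
```

Raising on an unknown action keeps a confused model from spinning the loop without doing anything.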
What view() returns¶
v = sb.browser.view()
# {
# "view_token": "8fcc2271f82f40cf", # feed back next turn
# "url": "https://example.com",
# "title": "Example Domain",
# "markdown": "# Example Domain\n\nThis domain is for use ...",
# "elements": [ # indexed, frame-aware
# {"idx": 0, "tag": "a", "text": "More info", "x": 297, "y": 209,
# "width": 87, "height": 19, "href": "https://www.iana.org/..."},
# {"idx": 1, "tag": "a", "text": "Learn more",
# "frame": "https://example.com/", "x": 169, "y": 275, ...},
# ],
# "screenshot_b64": "iVBORw0KGgoAAAANSUhE...", # annotated PNG
# "viewport": {"width": 1280, "height": 800},
# "wait_state": "networkidle",
# "total_chars": 1247, "returned_chars": 1247, "truncated": False,
# }
By default it:
- Walks every same-origin AND cross-origin iframe, so captchas,
  embedded payments, and third-party widgets are visible. iframe-resident
  elements include a "frame" field; coords are translated to the top-level
  viewport so click_idx works through frame boundaries.
- Waits for network-idle (500 ms of network silence, capped at 5 s)
  before reading. Pass wait_for_load=None to skip and read immediately.
- Annotates the screenshot with yellow numbered boxes that match
  each element's idx. Pair with a vision LLM for "click 5".
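For a text-only model, the element list can be flattened into one line per element. This is a minimal sketch that relies only on the fields shown in the sample payload above (`idx`, `tag`, `text`, `href`, `frame`); the `elements_to_prompt` name and the line format are assumptions, not part of the SDK.

```python
def elements_to_prompt(view: dict, max_text: int = 60) -> str:
    """Render view()["elements"] as one line per element for a text-only LLM."""
    lines = []
    for el in view.get("elements") or []:
        # Collapse internal whitespace and cap very long labels.
        text = " ".join((el.get("text") or "").split())[:max_text]
        line = f'[{el["idx"]}] <{el["tag"]}> {text}'.rstrip()
        if el.get("href"):
            line += f' -> {el["href"]}'
        if el.get("frame"):
            line += f' (frame: {el["frame"]})'
        lines.append(line)
    return "\n".join(lines)
```

The model can then answer with a bare index, which maps directly onto click_idx or input_idx.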
Diff-since-last-view¶
Pass the prior view_token to get only what changed:
v1 = sb.browser.view()
last = v1["view_token"]
sb.browser.click_idx(some_button)
v2 = sb.browser.view(since_view_token=last, delta_only=True)
# v2 = {
# "view_token": "...",
# "diff": {
# "added_count": 3, "removed_count": 1, "stable_count": 12,
# "added_elements": [{idx, tag, text, ...}],
# "removed_elements": [{tag, text, href}],
# "markdown_unchanged": False,
# "url_changed": False,
# },
# "markdown": None, "elements": None, "screenshot_b64": None,
# # (delta_only suppresses redundant full payload)
# }
This cuts session tokens 5–10× in long-running agent loops. Snapshots are LRU-capped at 16 per worker.
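To feed a delta to the model, the diff payload can be compressed into a single line. The sketch below assumes the payload shape shown in the example above; `summarize_diff` is a hypothetical helper name, and the fallback string for a missing "diff" key is an assumption about how you want to handle full views.

```python
def summarize_diff(view: dict) -> str:
    """Turn the "diff" block of a delta view into a compact one-line summary."""
    diff = view.get("diff")
    if not diff:
        return "no diff available"
    parts = [
        f'+{diff.get("added_count", 0)} / -{diff.get("removed_count", 0)} '
        f'/ ={diff.get("stable_count", 0)} elements'
    ]
    if diff.get("url_changed"):
        parts.append("url changed")
    if diff.get("markdown_unchanged"):
        parts.append("page text unchanged")
    # Only newly added elements carry an idx the agent can act on.
    for el in diff.get("added_elements") or []:
        parts.append(f'new [{el.get("idx")}] <{el.get("tag")}> {el.get("text", "")}'.rstrip())
    return "; ".join(parts)
```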
Index-based interaction¶
Pick an element by its idx from view() or clickables():
sb.browser.click_idx(7, humanlike=True) # Bezier cursor + click
sb.browser.input_idx(3, "alice@example.com",     # focus + clear + type
                     clear=True, humanlike=True)
sb.browser.select_option_idx(12, "Europe") # <select> by value or label
sb.browser.move_mouse(idx=4) # hover (for menus)
sb.browser.scroll_in_element(8, "down", 0.5) # scroll within element
Each idx is re-resolved server-side on every call, so small DOM updates between view and act are tolerated as long as the element at that position is still present.
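Because an idx is positional, a different element can end up at the same position after a larger DOM change. One hedged guard is to re-read the page and confirm the label before clicking. The sketch below assumes that idx assignment is stable between two views of an unchanged page; `click_if_unchanged` is a hypothetical helper, not an SDK call.

```python
from typing import Any


def click_if_unchanged(browser: Any, idx: int, expected_text: str) -> bool:
    """Click idx only if a fresh view() still shows the expected text there."""
    # wait_for_load=None skips the network-idle wait and reads immediately.
    fresh = browser.view(wait_for_load=None)
    for el in fresh.get("elements") or []:
        if el["idx"] == idx:
            if expected_text.lower() in (el.get("text") or "").lower():
                browser.click_idx(idx)
                return True
            return False  # position is now occupied by something else
    return False  # idx no longer present
```

The extra view() costs a round trip, so this is best reserved for destructive actions such as submitting a form.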
Selector / coordinate / JS escape hatches¶
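When idx isn't the right fit (you wrote the agent loop yourself, you have
stable selectors, or you need raw JS), fall back to the selector-based calls
used elsewhere on this page, such as click(selector), fill(selector, text),
and text(selector), or to evaluate for raw JS.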
Network capture¶
Read XHR data the page received, instead of scraping the rendered DOM:
sb.browser.navigate("https://api-driven-spa.example.com")
log = sb.browser.network_log(url_contains="/api/", phase="response")
for entry in log["entries"]:
    print(entry["status"], entry["url"])
    print(entry["body_preview"])  # first 10 KB of JSON / text bodies
Bodies are captured for application/json, text/*, and javascript
content types. Filters: url_contains, method, phase
(request/response), min_status. The log is cleared on restart() or
network_clear().
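Since body_preview holds only the first 10 KB, a large JSON body arrives truncated and will not parse. The sketch below decodes what it can and skips the rest; it assumes the entry fields shown above (`status`, `url`, `body_preview`), and `json_bodies` is a hypothetical helper name.

```python
import json


def json_bodies(log: dict, max_status: int = 399) -> list:
    """Parse body_preview as JSON for successful entries; skip what doesn't parse."""
    parsed = []
    for entry in log.get("entries", []):
        status = entry.get("status")
        if status is None or status > max_status:
            continue  # request-phase entries or error responses
        try:
            parsed.append((entry["url"], json.loads(entry.get("body_preview") or "")))
        except (json.JSONDecodeError, TypeError):
            # Truncated, empty, or non-JSON preview: nothing usable here.
            continue
    return parsed
```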
Tabs¶
Multi-tab work in the same browser context. Tab IDs are 4-char strings (matches browser-use's convention).
sb.browser.navigate("https://google.com/search?q=foo")
v = sb.browser.view()
top_links = [e for e in v["elements"] if e["tag"] == "a" and e.get("href")][:5]
# Open all 5 in parallel tabs
new_tabs = [sb.browser.tabs_new(link["href"]) for link in top_links]
# Switch between them; subsequent browser_* calls act on the active tab.
sb.browser.tabs_switch(new_tabs[2]["id"])
text = sb.browser.text("article")
sb.browser.tabs_close(new_tabs[2]["id"])
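The open, switch, read, close sequence generalizes to a loop. This is a sketch under the assumption that tabs_new returns a dict with an "id" key, as the example above implies; `read_in_tabs` is a hypothetical helper, and `browser` stands in for `sb.browser`.

```python
from typing import Any


def read_in_tabs(browser: Any, urls: list, selector: str) -> dict:
    """Open each URL in its own tab, read `selector` there, then close the tab."""
    # Open everything first so the pages load while earlier tabs are being read.
    tabs = [(url, browser.tabs_new(url)["id"]) for url in urls]
    results = {}
    for url, tab_id in tabs:
        browser.tabs_switch(tab_id)
        try:
            results[url] = browser.text(selector)
        finally:
            # Close even if the read fails, so tabs don't pile up.
            browser.tabs_close(tab_id)
    return results
```

Which tab is active after the last close is not stated here, so switch explicitly before issuing further calls.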
What's exposed¶
flowchart TB
subgraph SDK["sb.browser API"]
VIEW[view / extract_markdown / find_elements / search_page]
ACT[click_idx / input_idx / select_option_idx / move_mouse / drag]
BROWSE[navigate / back / forward / reload / restart]
TABS[tabs_list / tabs_new / tabs_switch / tabs_close]
SCROLL[scroll_by / scroll_in_element / send_keys]
READ[text / text_all / attr / html / dropdown_options]
VISUAL[screenshot / screenshot_annotated / clickables]
SAVE[save_pdf / upload_file / save_profile]
OBS[network_log / console_view]
EVAL[evaluate]
end
More extraction primitives¶
For tasks where view() is overkill or under-specified:
# Standalone richer markdown — supports pagination
md = sb.browser.extract_markdown(
    extract_links=True, extract_images=True,
    start_from_char=0, max_chars=20000,
)
# {url, title, markdown, total_chars, returned_chars, truncated}
# Selector-driven enumeration (any CSS, not just visible-interactive)
rows = sb.browser.find_elements("table.products tr",
                                attributes=["data-id", "class"])
# {selector, count, total, items: [{idx, tag, text, attributes, bbox}]}
# Text/regex search across the page with surrounding context
hits = sb.browser.search_page(r"\$\d+", regex=True, context_chars=80)
# {pattern, count, matches: [{match, before, after, offset}]}
# List <option>s for a <select> at idx
opts = sb.browser.dropdown_options(12)
# {tag, name, multiple, options: [{value, label, selected, disabled}]}
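The pagination parameters of extract_markdown can be driven in a loop to read a long page in chunks. The sketch below assumes that `truncated` stays True while text remains beyond the returned window and that `start_from_char` counts the same characters as `returned_chars`; `read_all_markdown` is a hypothetical helper.

```python
from typing import Any


def read_all_markdown(browser: Any, chunk_chars: int = 20000, max_pages: int = 50) -> str:
    """Concatenate extract_markdown() pages until the page reports no truncation."""
    parts = []
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving page can't loop forever
        page = browser.extract_markdown(start_from_char=offset, max_chars=chunk_chars)
        parts.append(page["markdown"])
        got = page["returned_chars"]
        offset += got
        if not page["truncated"] or got == 0:
            break
    return "".join(parts)
```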
Save a PDF, upload a file¶
sb.browser.save_pdf(path="/tmp/page.pdf",
                    paper_format="A4", landscape=False)
pdf_bytes = sb.files.read("/tmp/page.pdf")
# Upload via the indexed <input type="file">
sb.files.write("/tmp/data.csv", csv_text)
sb.browser.upload_file(idx=4, paths="/tmp/data.csv")
Console messages and uncaught errors¶
log = sb.browser.console_view(limit=50, clear=False)
# {"messages": [{type: "warning", text: "...", location: {...}, ts: ...}],
# "total": 50}
Useful for debugging agents that fail because the page logged a JS error mid-flow.
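A small filter makes it easy to surface only the errors after each step. The sketch below assumes the message shape shown above and that error messages carry type "error" (the sample only shows "warning", so treat that value as an assumption); `console_errors` is a hypothetical helper.

```python
def console_errors(log: dict, types: tuple = ("error",)) -> list:
    """Return the text of console messages whose type is in `types`."""
    return [
        m.get("text", "")
        for m in log.get("messages", [])
        if m.get("type") in types
    ]
```

Calling it on the result of console_view() after each action lets the agent see a page-side failure before it plans the next step.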
Stealth profile¶
sb.browser.start() defaults to stealth=True. That turns on:
| Layer | Patch |
|---|---|
| Launch flags | --disable-blink-features=AutomationControlled, drop --enable-automation |
| Backend | Patchright (drop-in fork that fixes Runtime.enable + console.debug leaks at the chromium binary level) |
| navigator.webdriver | scrubbed from prototype + own |
| window.chrome | populated |
| navigator.plugins | five-entry PluginArray (real plugins, real prototype) |
| navigator.languages | ['en-US', 'en'] |
| navigator.permissions.query | consistent with native Chrome |
| navigator.hardwareConcurrency | 8 |
| navigator.deviceMemory | 8 |
| navigator.maxTouchPoints | 0 |
| Notification.permission | 'default' |
| window.outerHeight/Width | offset from inner |
| navigator.userAgentData | Chrome 120 brands + Linux platform |
| Canvas | sub-pixel noise on getImageData / toDataURL |
| AudioContext | float-buffer noise on getChannelData |
| WebRTC | host-candidate IPs stripped |
| WebGL | vendor / renderer reported as Intel Inc. / Iris OpenGL Engine |
| User-Agent | Linux Chrome 120 (no HeadlessChrome string) |
| Locale | en-US |
| Timezone | Europe/Ljubljana |
To get a vanilla automation profile (debugging, captcha tests), pass stealth=False to sb.browser.start().
Captcha solving¶
Two paths. Use whichever is cheaper for your case:
- Audio challenge: for reCAPTCHA v2 / hCaptcha on sites that allow audio mode. Cost: $0, accuracy ~70-80 %, ~5-10 s.
- Paid solver: works for reCAPTCHA v2 / v3, hCaptcha, Cloudflare Turnstile. Cost: $0.001-0.003 per solve, ~30-60 s.
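The choice between the two paths can be written down as a small chooser. This is only a sketch of the trade-off stated above: the kind strings, the "audio" / "paid" labels, and the `pick_captcha_path` name are all hypothetical, and the actual solver calls are not shown.

```python
# Captcha kinds each path is described as handling (labels are assumptions).
AUDIO_KINDS = {"recaptcha_v2", "hcaptcha"}
PAID_KINDS = {"recaptcha_v2", "recaptcha_v3", "hcaptcha", "turnstile"}


def pick_captcha_path(kind: str, audio_allowed: bool) -> str:
    """Prefer the free audio path when it applies, else fall back to the paid one."""
    if kind in AUDIO_KINDS and audio_allowed:
        return "audio"
    if kind in PAID_KINDS:
        return "paid"
    raise ValueError(f"no solving path for captcha kind: {kind!r}")
```

Given the lower accuracy quoted for the audio path, a caller may want to retry with the paid path when an audio attempt fails.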
Profile persistence¶
Cookies + localStorage + sessionStorage survive across sandbox lifetimes when you save them as a named profile:
# First sandbox: log in once.
with Sandbox.create(template="browser-use") as sb:
    sb.browser.start()
    sb.browser.navigate("https://app.example.com/login")
    sb.browser.fill("#email", "alice@example.com")
    sb.browser.fill("#password", "...")
    sb.browser.click("button[type=submit]")
    sb.browser.save_profile("alice-app")
# Later sandbox: restore. Already logged in.
with Sandbox.create(template="browser-use") as sb:
    sb.browser.start(profile="alice-app")
    sb.browser.navigate("https://app.example.com/dashboard")
Profiles live in /var/firebox-profiles/<name>.json inside the
template's rootfs. They're template-scoped (alice-app saved from
browser-use only loads in browser-use).
Live preview via noVNC¶
Hand the user a URL they can open in any browser tab to watch the agent work in real time:
stream = sb.stream.start()
print(f"Watch the agent: {stream['url']}")
# → http://your-host:51234/vnc.html?autoconnect=1&password=...
sb.stream spins up Xvfb + x11vnc + websockify + noVNC inside the
sandbox and DNATs the noVNC port out via sb.ports.expose. It
apt-installs websockify and novnc automatically if the template
doesn't have them. sb.stream.stop() tears down the public mapping;
closing the sandbox cleans up the rest.
What it can't do¶
- Real Chrome TLS: Chromium's TLS hello differs from Chrome's.
  Cloudflare Bot Manager / Akamai do fingerprint at the TLS layer.
  Workaround: use sb.http (curl_cffi with real Chrome 120 JA3) for raw HTTP calls behind the same cookies.
- Mobile emulation: viewport-based only; no touch event fidelity.
- Pause / resume of the whole VM mid-test: no Firecracker snapshot integration yet (warm pool is on the roadmap).
- GPU rendering: Chromium runs on CPU. The Spark's Blackwell is idle from the sandbox's POV.