Tools showcase¶
Every capability the agent (yours, or an MCP host like Claude Desktop) can use, with a 5-line example. Python SDK on the left, the matching MCP tool name on the right. Click the section title to jump.
| Section | Use it for |
|---|---|
| Run a command | Synchronous shell exec |
| Stream output | Long-running jobs, live tail |
| Multi-session shells | Named long-running shells with stdin |
| Background processes | Servers that outlive a single call |
| Read & write files | Inject code/data, fetch results |
| Upload & download | Move files between local and VM |
| Open the browser | Stealth Chromium |
| view() — perceive a page | One-call markdown + indexed elements + screenshot |
| Index-based interaction | click_idx / input_idx / select_option_idx |
| Diff-since-last-view | 5–10× token savings on long agent loops |
| Network capture | Read XHR responses, skip DOM scraping |
| Tabs | Multi-tab work in one browser context |
| Selector / coords / JS | When idx isn't right |
| Extraction primitives | extract_markdown / find_elements / search_page |
| More browser actions | scroll_by / send_keys / save_pdf / upload_file |
| Read page content | Text / DOM / attributes |
| Screenshots | Plain or annotated |
| Run JavaScript | Escape hatch |
| Persist sessions | Cookies / localStorage save/load |
| OS-level input | sb.os.click/type/key/screenshot via xdotool |
| Live noVNC stream | URL the user opens to watch the agent |
| Public URL for a service | sb.ports.expose(port) |
| Real-Chrome HTTP | Bypass TLS-fingerprint blocks |
| Aggregated search | SearxNG fan-out |
| Transcribe audio | Whisper inside the VM |
| Solve captcha | 2captcha API or local Whisper |
Run a command¶
Synchronous shell exec inside the sandbox. Returns stdout, stderr, exit code, duration.
Stream output¶
Same as run but yields chunks as they arrive — perfect for progress
bars, tail -f, long compiles.
NDJSON over the wire; the SDK turns each line into a StreamChunk.
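To make the wire format concrete, here is a minimal sketch of turning NDJSON lines into chunk objects. The StreamChunk fields (stream, data) and the parse_ndjson helper are assumptions for illustration, not the SDK's actual definitions.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamChunk:
    # Hypothetical fields; the real StreamChunk may carry different ones.
    stream: str  # e.g. "stdout" or "stderr"
    data: str


def parse_ndjson(lines: Iterable[str]) -> Iterator[StreamChunk]:
    """Turn each non-empty NDJSON line into one StreamChunk."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        obj = json.loads(line)
        yield StreamChunk(stream=obj.get("stream", "stdout"), data=obj.get("data", ""))


# Three wire lines: two records and one blank line.
wire = ['{"stream": "stdout", "data": "step 1\\n"}', "", '{"stream": "stderr", "data": "warn\\n"}']
chunks = list(parse_ndjson(wire))
```

Because each line is a complete JSON document, the consumer can act on a chunk as soon as its newline arrives, without buffering the whole response.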
Multi-session shells¶
Named long-running shells the agent can stream stdin into. Manus parity.
view() supports incremental tailing via since_byte. Creating a shell
under a name that is already in use is a conflict: kill the existing
shell or pick a different name.
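The idea behind since_byte tailing can be sketched with a plain append-only buffer. The ShellLog class and the next_byte field are hypothetical names for illustration; only the since_byte parameter comes from the text above.

```python
class ShellLog:
    """Append-only output buffer that supports tailing from a byte offset."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def append(self, data: bytes) -> None:
        self._buf.extend(data)

    def view(self, since_byte: int = 0) -> dict:
        # Return only the bytes the caller has not seen yet,
        # plus the offset to pass on the next call.
        new = bytes(self._buf[since_byte:])
        return {"output": new, "next_byte": len(self._buf)}


log = ShellLog()
log.append(b"$ make\n")
first = log.view()                                  # everything so far
log.append(b"done\n")
second = log.view(since_byte=first["next_byte"])    # only the new tail
```

An agent loop that stores the last offset pays only for new output on each poll, instead of re-reading the whole scrollback.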
Background processes¶
Start a server that outlives the call that started it; query it from later calls in the same sandbox.
Process registry endpoints: start / list / kill / wait / logs. Every
process gets its stdout / stderr captured to a tempfile so logs() is
re-readable.
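Why a tempfile makes logs() re-readable can be shown with a toy registry built on the standard library. This is a sketch under assumptions, not the daemon's implementation; ProcessRegistry and its method shapes are illustrative only.

```python
import subprocess
import sys
import tempfile


class ProcessRegistry:
    """Toy registry: each process writes stdout/stderr to its own tempfile."""

    def __init__(self) -> None:
        self._procs = {}

    def start(self, name: str, argv: list) -> None:
        log = tempfile.NamedTemporaryFile(delete=False)
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        self._procs[name] = (proc, log.name)
        log.close()  # the child keeps its own handle to the file

    def wait(self, name: str) -> int:
        return self._procs[name][0].wait()

    def logs(self, name: str) -> str:
        # Reads from the file, not a pipe, so repeated calls see the same data.
        with open(self._procs[name][1], "r") as fh:
            return fh.read()


reg = ProcessRegistry()
reg.start("hello", [sys.executable, "-c", "print('ready')"])
exit_code = reg.wait("hello")
```

A pipe can be drained only once; a file on disk can be read as many times as needed, which is what lets a later call fetch the same logs again.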
Read & write files¶
Binary-safe (base64 on the wire). Use to inject Python scripts, config, datasets; fetch results back.
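A minimal sketch of what "base64 on the wire" buys you: arbitrary bytes survive a JSON round trip. The field names (path, content_b64) and both helpers are assumptions for illustration, not the actual request schema.

```python
import base64
import json


def encode_write(path: str, data: bytes) -> str:
    """Build a JSON body that carries arbitrary bytes as base64 text."""
    return json.dumps({"path": path, "content_b64": base64.b64encode(data).decode("ascii")})


def decode_read(body: str) -> bytes:
    """Recover the original bytes from a JSON body."""
    return base64.b64decode(json.loads(body)["content_b64"])


# NUL, 0xFF, newline, carriage return, and a multi-byte UTF-8 character.
payload = bytes([0, 255, 10, 13]) + "é".encode("utf-8")
body = encode_write("/tmp/blob.bin", payload)
```

JSON strings cannot hold raw bytes such as 0x00 or 0xFF, so encoding to base64 first is what keeps binaries, archives, and images intact.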
Upload & download¶
Convenience wrappers around read / write for whole files.
The file mode (executable bit, etc.) is carried across.
Open the browser¶
Stealth Chromium driven from outside. One call to start, state
persists. Defaults to visible (Xvfb) so the session is stream/VNC
ready; pass headless=true for full Chromium under --headless=new.
Template browser-use has Chromium + Playwright + patchright + curl_cffi
+ Whisper + Xvfb + x11vnc + websockify + noVNC + xdotool baked in.
view() — perceive a page¶
The headline browsing primitive. One call returns markdown + indexed elements (across same-origin AND cross-origin iframes) + an annotated PNG screenshot. Auto-waits for network-idle so SPAs settle before the agent reads.
obs = sb.browser.view()
# {
# "view_token": "8fcc2271f82f40cf",
# "url": "...", "title": "...",
# "markdown": "# Heading\n\nText...",
# "elements": [{idx, tag, text, x, y, width, height,
# href, role, frame?}, ...],
# "screenshot_b64": "iVBORw0KGgo...",
# "viewport": {"width": 1280, "height": 800},
# }
browser_view {
sandbox_id, max_markdown_chars?, screenshot?, annotated?,
wait_for_load? (load|domcontentloaded|networkidle|none),
wait_timeout?, traverse_iframes?,
since_view_token?, delta_only?,
extract_links?, extract_images?,
} → { view_token, url, title, markdown, elements, screenshot_b64,
viewport, ... }
iframe-resident elements include a "frame" field; coords are
translated to top-level viewport so click_idx works through frame
boundaries.
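The coordinate translation can be pictured as a simple offset. This sketch assumes a single level of nesting and a known iframe position; to_top_level and click_point are hypothetical helpers, and the real server-side logic may also account for nested frames and scrolling.

```python
def to_top_level(element: dict, frame_offset: tuple) -> dict:
    """Shift a frame-local bounding box by the iframe's position in the top-level viewport."""
    fx, fy = frame_offset
    return {**element, "x": element["x"] + fx, "y": element["y"] + fy}


def click_point(element: dict) -> tuple:
    """Center of the bounding box, a natural click target."""
    return (element["x"] + element["width"] / 2, element["y"] + element["height"] / 2)


# A button at (10, 20) inside an iframe whose top-left corner sits at (300, 150).
local = {"idx": 7, "tag": "button", "x": 10, "y": 20,
         "width": 80, "height": 30, "frame": "checkout"}
top = to_top_level(local, (300, 150))
target = click_point(top)
```

Once every element is expressed in top-level coordinates, a click needs no knowledge of which frame the element lives in.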
Index-based interaction¶
Pick an element by its idx from view() or clickables() — no
CSS selectors needed. Re-resolves server-side per call so small DOM
updates are tolerated.
browser_click_idx { sandbox_id, idx, button?, click_count?, humanlike? }
browser_input_idx { sandbox_id, idx, text, clear?, humanlike? }
browser_select_option_idx { sandbox_id, idx, value }
browser_move_mouse { sandbox_id, idx | (x,y) }
browser_scroll_in_element { sandbox_id, idx, direction?, amount? }
Diff-since-last-view¶
Pass the prior view_token to get only what changed. Cuts session
tokens 5–10× for long-running agent loops.
v1 = sb.browser.view()
last = v1["view_token"]
sb.browser.click_idx(some_button)
v2 = sb.browser.view(since_view_token=last, delta_only=True)
# v2["diff"] = {
# added_count, removed_count, stable_count,
# added_elements: [{idx, tag, text, ...}],
# removed_elements: [{tag, text, href}],
# markdown_unchanged, url_changed,
# }
delta_only=True suppresses redundant full markdown / elements /
screenshot payload when a prior snapshot exists.
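For intuition, a diff of this shape could be computed client-side from two snapshots. This is a sketch, not the server's algorithm: identifying elements by (tag, text, href) is an assumption, and diff_views is a hypothetical helper.

```python
def element_key(el: dict) -> tuple:
    # Assumption: identity is (tag, text, href); idx is left out because it can shift.
    return (el.get("tag"), el.get("text"), el.get("href"))


def diff_views(prev: dict, curr: dict) -> dict:
    """Compare two view() snapshots and summarize what changed."""
    prev_keys = {element_key(e) for e in prev["elements"]}
    curr_keys = {element_key(e) for e in curr["elements"]}
    added = [e for e in curr["elements"] if element_key(e) not in prev_keys]
    removed = [e for e in prev["elements"] if element_key(e) not in curr_keys]
    return {
        "added_count": len(added),
        "removed_count": len(removed),
        "stable_count": len(curr["elements"]) - len(added),
        "added_elements": added,
        "removed_elements": [
            {"tag": e.get("tag"), "text": e.get("text"), "href": e.get("href")}
            for e in removed
        ],
        "markdown_unchanged": prev["markdown"] == curr["markdown"],
        "url_changed": prev["url"] != curr["url"],
    }


v1 = {"url": "https://example.com/", "markdown": "# Cart", "elements": [
    {"idx": 0, "tag": "a", "text": "Home", "href": "/"},
    {"idx": 1, "tag": "button", "text": "Add to cart"},
]}
v2 = {"url": "https://example.com/", "markdown": "# Cart (1)", "elements": [
    {"idx": 0, "tag": "a", "text": "Home", "href": "/"},
    {"idx": 1, "tag": "button", "text": "Checkout"},
]}
d = diff_views(v1, v2)
```

On a page where one button changed, the agent receives two small records instead of the full element list, which is where the token savings come from.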
Network capture¶
Read XHR data the page received instead of scraping the rendered DOM.
Bodies captured for application/json / text/* / javascript.
Cleared on browser restart().
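The body-capture rule can be read as a Content-Type filter. The exact matching logic is not documented here, so this sketch is one plausible interpretation; should_capture_body is a hypothetical helper.

```python
def should_capture_body(content_type: str) -> bool:
    """Decide whether a response body is worth storing, based on its Content-Type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime == "application/json" or mime.startswith("text/") or "javascript" in mime


seen = [
    "application/json; charset=utf-8",
    "text/html",
    "application/javascript",
    "image/png",
    "font/woff2",
]
captured = [ct for ct in seen if should_capture_body(ct)]
```

Skipping images, fonts, and other binary payloads keeps the capture buffer small while retaining the API responses an agent usually wants.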
Tabs¶
Multi-tab work in the same browser context. 4-char tab IDs.
new_tab = sb.browser.tabs_new("https://example.com") # becomes active
sb.browser.tabs_list() # {tabs: [{id, url, title, active}], active}
sb.browser.tabs_switch(new_tab["id"])
sb.browser.tabs_close(new_tab["id"])
MCP: browser_tabs_list / new / switch / close.
Selector / coords / JS¶
When idx isn't right (you have stable selectors, you wrote the
agent loop yourself, or you need raw JS):
humanlike=True (default) curves the cursor along a Bezier path with
per-step jitter; False does an instant teleport.
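A minimal sketch of a jittered Bezier cursor path, to show what the humanlike mode does conceptually. The control-point placement, step count, and jitter size are assumptions for illustration; the actual parameters are not documented here.

```python
import random


def bezier_path(start, end, steps=20, jitter=1.5, seed=None):
    """Points along a cubic Bezier from start to end, with small random jitter per step."""
    rng = random.Random(seed)
    (x0, y0), (x3, y3) = start, end
    # Two randomized control points bend the path away from a straight line.
    x1 = x0 + (x3 - x0) * 0.3 + rng.uniform(-50, 50)
    y1 = y0 + (y3 - y0) * 0.3 + rng.uniform(-50, 50)
    x2 = x0 + (x3 - x0) * 0.7 + rng.uniform(-50, 50)
    y2 = y0 + (y3 - y0) * 0.7 + rng.uniform(-50, 50)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        if 0 < i < steps:  # keep the endpoints exact
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((x, y))
    return points


path = bezier_path((0, 0), (400, 300), steps=20, seed=1)
```

Feeding these points to successive mouse-move calls produces a curved, slightly noisy trajectory that still lands exactly on the target.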
Extraction primitives¶
For tasks where view() is overkill or under-specified.
Selector-driven enumeration. Different from clickables —
arbitrary CSS, not just visible-interactive.
Text / regex search with surrounding context. Saves tokens vs dumping the whole DOM.
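The token saving from context-limited search is easy to see in a local sketch. search_text below is a hypothetical stand-in operating on a plain string, not the SDK's search_page; its parameters and return shape are assumptions.

```python
import re


def search_text(text: str, pattern: str, context: int = 30, regex: bool = False) -> list:
    """Return each match with `context` characters of surrounding text."""
    rx = re.compile(pattern if regex else re.escape(pattern), re.IGNORECASE)
    hits = []
    for m in rx.finditer(text):
        lo = max(0, m.start() - context)
        hi = min(len(text), m.end() + context)
        hits.append({"match": m.group(0), "start": m.start(), "snippet": text[lo:hi]})
    return hits


page = "Plans: Free tier is $0. Pro tier is $20 per month. Enterprise: contact sales."
hits = search_text(page, r"\$\d+", context=12, regex=True)
```

Each hit carries a short window of text around the match, so the agent can read prices in context without receiving the rest of the page.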
More browser actions¶
sb.browser.scroll_by("down", 0.8) # viewport-fraction units
sb.browser.scroll_by("bottom") # jump to bottom
sb.browser.send_keys("Control+O") # chord (modifiers)
sb.browser.send_keys("Escape")
sb.browser.save_pdf(path="/tmp/page.pdf",
paper_format="A4", landscape=False)
pdf_bytes = sb.files.read("/tmp/page.pdf")
sb.files.write("/tmp/data.csv", csv_text)
sb.browser.upload_file(idx=4, paths="/tmp/data.csv")
sb.browser.restart("https://example.com") # fresh context + nav
sb.browser.console_view(limit=50) # captured console + pageerror
MCP equivalents: browser_scroll, browser_send_keys,
browser_save_pdf, browser_upload_file, browser_restart,
browser_console_view.
Read page content¶
Screenshots¶
ImageContent is what Claude / GPT-4o see directly in the conversation.
Run JavaScript¶
Escape hatch for anything Playwright can't express cleanly.
The script must return JSON-serializable data.
Persist sessions¶
Save a logged-in browser context and restore in a future sandbox.
# First sandbox: log in once.
sb.browser.start()
sb.browser.navigate("https://app.example.com/login")
sb.browser.fill("#email", "alice@example.com")
sb.browser.fill("#password", "...")
sb.browser.click("button[type=submit]")
sb.browser.save_profile("alice-app") # /var/firebox-profiles/alice-app.json
# Days later, fresh sandbox: come back already logged in.
with Sandbox.create(template="browser-use") as sb2:
    sb2.browser.start(profile="alice-app")
    sb2.browser.navigate("https://app.example.com/dashboard")
Profiles include cookies + localStorage + sessionStorage. They're
template-scoped (saved from browser-use only loads in browser-use).
Real-Chrome HTTP¶
When Cloudflare or Akamai block the Chromium browser at the TLS layer, fire raw HTTP with curl_cffi's real Chrome 120 JA3/JA4 fingerprint.
Methods: get / post / put / delete / request.
Cookies from the browser carry over via headers.
Aggregated search¶
Self-hosted SearxNG fans out to 5–15 engines per query. No API key, no rate limit.
sb.search.web("rust web framework") # general
sb.search.news("AI agents", time_range="day") # last day
sb.search.papers("microvm performance") # arxiv + Scholar
sb.search.code("playwright stealth github") # GitHub + SO + Arch
sb.search.images("hacker news logo")
sb.search.videos("rust async tutorial")
sb.search.wiki("Firecracker (software)")
sb.search.maps("Ljubljana")
Cached in-sandbox for 5 minutes; same query twice is microseconds.
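The caching behaviour can be sketched as a small time-to-live cache. TTLCache and its get_or_fetch method are hypothetical; only the 5-minute lifetime comes from the text above. The clock is injectable so expiry can be exercised without waiting.

```python
import time


class TTLCache:
    """Tiny time-based cache: entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl: float = 300.0, clock=time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the search engines entirely
        value = fetch()
        self._store[key] = (now, value)
        return value


fake_now = [0.0]   # mutable stand-in for the clock
calls = []         # records each real fetch


def fetch():
    calls.append(1)
    return ["result"]


cache = TTLCache(ttl=300.0, clock=lambda: fake_now[0])
key = ("web", "rust web framework")
cache.get_or_fetch(key, fetch)   # miss: fetches
cache.get_or_fetch(key, fetch)   # hit: served from cache
fake_now[0] = 301.0
cache.get_or_fetch(key, fetch)   # expired: fetches again
```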
Transcribe audio¶
Local Whisper (tiny.en, 39 MB) baked into the browser-use
template. ~5× realtime on the sandbox's CPU.
audio = sb.http.get("https://example.com/podcast.mp3").content
result = sb.audio.transcribe(audio, format="mp3")
print(result.text) # "..."
print(result.language) # "en"
print(result.segments) # [{start, end, text}, ...]
Language can be auto-detected or pinned with language="en".
Solve captcha¶
Four paths; pick one by cost and target site.
Local Whisper audio challenge: free, roughly 70-80% success per attempt, works on sites that allow audio mode. Same shape for reCAPTCHA v2 and hCaptcha.
For sites that force image challenges. Firebox surfaces the puzzle as plain data (instructions + screenshot + cell bboxes). Your agent's vision LLM decides which cells; firebox runs no model.
challenge = sb.captcha.recaptcha_open_image_challenge()
# challenge.screenshot_b64 PNG of the grid
# challenge.instructions "Click verify once there are no more"
# challenge.target "crosswalks"
# challenge.cells [{idx, x, y, width, height}, ...]
# ... your LLM looks at the screenshot + instructions ...
# ... decides indices = [0, 4, 7] ...
sb.captcha.recaptcha_click_cells([0, 4, 7])
result = sb.captcha.recaptcha_verify_image()
# If reCAPTCHA wants more clicks, result has the next puzzle baked in:
while result.get("more_to_click"):
    # hand result back to the LLM, repeat
    ...
The 2captcha API path works for reCAPTCHA v2, v3, hCaptcha, Cloudflare Turnstile, and Funcaptcha. The caller never sees the puzzle.
Last resort: ping the human, expose VNC, let them solve manually.
info = sb.captcha.handoff_to_vnc(poll_until_solved=True,
poll_timeout=300)
# info = {vnc_url, vnc_password, sandbox_ip,
# solved: True, waited: 23.4}
The caller is responsible for routing port 5900 from the sandbox
to the human (DNAT, Tailscale, whatever your topology requires).
poll_until_solved=True blocks until detect_captcha() returns None.
detect_captcha returns {type, sitekey, iframe_url, callback} for
reCAPTCHA v2/v3/enterprise, hCaptcha, Cloudflare Turnstile, Funcaptcha.
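The blocking behaviour of poll_until_solved amounts to a poll loop with a timeout. This sketch assumes a detector callable that returns None once no captcha is present; wait_until_solved and the "recaptcha_v2" type string are hypothetical.

```python
import time


def wait_until_solved(detect, timeout: float = 300.0, interval: float = 1.0,
                      clock=time.monotonic, sleep=time.sleep) -> dict:
    """Poll `detect` until it returns None (no captcha left) or the timeout passes."""
    start = clock()
    while True:
        if detect() is None:
            return {"solved": True, "waited": clock() - start}
        if clock() - start >= timeout:
            return {"solved": False, "waited": clock() - start}
        sleep(interval)


# Stand-in detector: reports a captcha twice, then nothing.
answers = iter([{"type": "recaptcha_v2"}, {"type": "recaptcha_v2"}, None])
info = wait_until_solved(lambda: next(answers), timeout=5.0, interval=0.0)

# A detector that never clears hits the timeout instead.
stuck = wait_until_solved(lambda: {"type": "recaptcha_v2"}, timeout=0.0, interval=0.0)
```

Returning the solved flag rather than raising lets the agent decide what to do when the human never shows up.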
OS-level input¶
sb.os.* operates at the X server level (DISPLAY=:99) via xdotool, so
it reaches native dialogs, file pickers, browser chrome (download bar,
tabs, address bar) — anything Playwright can't reach because it's
outside the page DOM. This is the layer Anthropic Computer Use and
OpenAI CUA target.
sb.run("firebox-display test") # bring up Xvfb
sb.os.click(200, 300) # left click at coords
sb.os.type("hello world", delay_ms=25)
sb.os.key("ctrl+s") # X keysym, modifiers OK
sb.os.scroll("down", amount=3) # wheel ticks
sb.os.drag(50, 50, 250, 250)
png = sb.os.screenshot() # full-display PNG
state = sb.os.state() # geom + cursor
# {display, width, height, cursor_x, cursor_y}
os_click { sandbox_id, x, y, button?, click_count? }
os_double_click { sandbox_id, x, y }
os_right_click { sandbox_id, x, y }
os_move_mouse { sandbox_id, x, y }
os_drag { sandbox_id, x1, y1, x2, y2, button? }
os_scroll { sandbox_id, direction?, amount? }
os_type { sandbox_id, text, delay_ms? }
os_key { sandbox_id, key } # 'Return', 'ctrl+c', 'alt+F4', ...
os_screenshot { sandbox_id } → ImageContent (PNG)
os_state { sandbox_id }
xdotool and scrot are apt-installed lazily on first use if missing
(both are baked into the browser-use template).
Live noVNC stream¶
Hand the user a URL they open in any browser tab to watch the agent work in real time.
Pipeline: Xvfb on :99 → x11vnc on 5900 → websockify bridges 6080 ⇄ 5900
with embedded noVNC web client → sb.ports.expose(6080) DNATs out.
Auto-installs websockify+novnc if missing. Stream stays live until
stop() or sandbox close.
Public URL for a service¶
DNAT a free host port to a port the sandbox is listening on. Returns a URL the agent can hand the user.
Idempotent on (port, scheme) — re-exposing the same vm_port returns
the existing mapping. Cleanup happens automatically on sandbox close.
Service inside the sandbox must bind 0.0.0.0 (not 127.0.0.1).
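Idempotency on (port, scheme) can be sketched as a lookup table keyed by that pair. PortMapper, the host name, and the starting host port are all hypothetical; the real daemon programs DNAT rules rather than filling a dict.

```python
import itertools


class PortMapper:
    """Toy mapping table: one host port per (vm_port, scheme) pair."""

    def __init__(self, host: str, first_host_port: int = 40000) -> None:
        self.host = host
        self._next = itertools.count(first_host_port)
        self._mappings = {}

    def expose(self, vm_port: int, scheme: str = "http") -> dict:
        key = (vm_port, scheme)
        if key not in self._mappings:  # allocate only on the first request
            host_port = next(self._next)
            self._mappings[key] = {
                "vm_port": vm_port,
                "host_port": host_port,
                "url": f"{scheme}://{self.host}:{host_port}",
            }
        return self._mappings[key]

    def close(self) -> None:
        self._mappings.clear()  # mirrors cleanup on sandbox close


mapper = PortMapper("sandbox-host.example")
a = mapper.expose(8000)
b = mapper.expose(8000)                   # same mapping, no new host port
c = mapper.expose(8000, scheme="https")   # different key, new mapping
```

An agent can therefore call expose at the top of every step without leaking host ports.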
What's not yet here¶
These would be obvious additions; nothing's blocking them, just not written yet:
- PTY / interactive shell: firebox sandbox attach <id> for a real terminal. The SDK and daemon would need TTY allocation + a bidirectional WebSocket. (Note: sb.shells.* already provides named long-running shells with stdin write; only the interactive TTY case is missing.)
- GPU passthrough: pass a Blackwell partition through to the VM for in-sandbox inference. Firecracker supports VFIO; we don't wire it up yet.
- Pause / resume snapshots + warm pool: Firecracker has native snapshot support; we'd boot from a paused snapshot for sub-50 ms cold start.
- Volume mounts: persistent state across sandbox lifetimes for the same template / token.
- Anthropic Computer Use / OpenAI CUA adapters: the sandbox primitives are all there (sb.os.* + sb.browser.view); the shape-mapping layer is the missing glue.
If any of these are blocking your use case, the host setup docs tell you where each piece would slot in.