
Tools showcase

Every capability the agent (yours, or an MCP host like Claude Desktop) can use, with a 5-line example. Each section shows the Python SDK example first, followed by the matching MCP tool (or CLI command) where one exists.

Section                   Use it for
Run a command             Synchronous shell exec
Stream output             Long-running jobs, live tail
Multi-session shells      Named long-running shells with stdin
Background processes      Servers that outlive a single call
Read & write files        Inject code/data, fetch results
Upload & download         Move files between local and VM
Open the browser          Stealth Chromium
view() — perceive a page  One-call markdown + indexed elements + screenshot
Index-based interaction   click_idx / input_idx / select_option_idx
Diff-since-last-view      5–10× token savings on long agent loops
Network capture           Read XHR responses, skip DOM scraping
Tabs                      Multi-tab work in one browser context
Selector / coords / JS    When idx isn't right
Extraction primitives     extract_markdown / find_elements / search_page
More browser actions      scroll_by / send_keys / save_pdf / upload_file
Read page content         Text / DOM / attributes
Screenshots               Plain or annotated
Run JavaScript            Escape hatch
Persist sessions          Cookies / localStorage save/load
OS-level input            sb.os.click/type/key/screenshot via xdotool
Live noVNC stream         URL the user opens to watch the agent
Public URL for a service  sb.ports.expose(port)
Real-Chrome HTTP          Bypass TLS-fingerprint blocks
Aggregated search         SearxNG fan-out
Transcribe audio          Whisper inside the VM
Solve captcha             2captcha API or local Whisper

Run a command

Synchronous shell exec inside the sandbox. Returns stdout, stderr, exit code, duration.

r = sb.run("uname -a; df -h /")
print(r.stdout)            # captured
print(r.exit_code)         # 0

MCP:
sandbox_run { sandbox_id, cmd, timeout?, cwd? }
→ { stdout, stderr, exit_code }
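
If the call reports failure through exit_code rather than by raising (an assumption; the page does not say), callers will usually want a guard. A minimal sketch using only the documented result fields: `RunResult` is a local stand-in for whatever `sb.run` returns, and `check` is a hypothetical helper, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Local stand-in mirroring the documented fields of a run result."""
    stdout: str
    stderr: str
    exit_code: int


def check(result):
    """Return stdout on success; raise with stderr attached on failure."""
    if result.exit_code != 0:
        raise RuntimeError(
            f"command failed with exit code {result.exit_code}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


ok = RunResult(stdout="Linux\n", stderr="", exit_code=0)
print(check(ok), end="")
```

With a real sandbox the same helper would wrap the call directly, for example `check(sb.run("make test"))`.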

Stream output

Same as run but yields chunks as they arrive — perfect for progress bars, tail -f, long compiles.

for c in sb.stream("for i in 1 2 3; do echo $i; sleep 1; done"):
    if c.stream == "stdout": print(c.data, end="")
    elif c.stream == "final": print(f"\nexit {c.exit_code}")

CLI:
firebox sandbox stream <id> "long-running-cmd"

NDJSON over the wire; the SDK turns each line into a StreamChunk.
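
To make the NDJSON-to-chunk step concrete, here is a rough sketch of a parser. The field names `stream`, `data` and `exit_code` come from the example above; the exact wire layout (one JSON object per line carrying those keys) and the `"stderr"` stream value are assumptions, and this `StreamChunk` is a local stand-in rather than the SDK class.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamChunk:
    stream: str                       # "stdout", "final" (and, assumed, "stderr")
    data: str = ""
    exit_code: Optional[int] = None   # only set on the final chunk


def parse_ndjson(lines: Iterable[str]) -> Iterator[StreamChunk]:
    """Turn newline-delimited JSON into StreamChunk objects, skipping blanks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        yield StreamChunk(
            stream=obj["stream"],
            data=obj.get("data", ""),
            exit_code=obj.get("exit_code"),
        )


# Hypothetical wire data, shaped after the loop in the SDK example.
wire = [
    '{"stream": "stdout", "data": "1\\n"}',
    '{"stream": "stdout", "data": "2\\n"}',
    '',
    '{"stream": "final", "exit_code": 0}',
]
chunks = list(parse_ndjson(wire))
```

Parsing line by line is what lets a consumer act on each chunk as it arrives instead of waiting for the command to finish.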


Multi-session shells

Named long-running shells the agent can stream stdin into, for feature parity with Manus.

dev   = sb.shells.start("dev",   "cd /work && npm run dev")
tests = sb.shells.start("tests", "cd /work && pytest -xvs")
tests.wait(timeout=120)
print(tests.view()["stdout"])

# Answer a prompt on stdin
deploy = sb.shells.start("deploy", "deploy.sh")
deploy.write("y", append_newline=True)

MCP:
shell_exec              { sandbox_id, name, cmd, cwd?, shell? }
shell_view              { sandbox_id, name, since_byte? }
shell_write_to_process  { sandbox_id, name, input, append_newline?, close_stdin? }
shell_wait              { sandbox_id, name, timeout? }
shell_kill_process      { sandbox_id, name }
shell_list              { sandbox_id }

view() supports incremental tailing via since_byte. Starting a shell under a name that is already in use returns a conflict; kill the existing shell or pick a different name.
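
Incremental tailing comes down to remembering a byte offset between calls. The sketch below assumes the Python `view()` accepts a `since_byte` keyword mirroring the MCP parameter and returns a dict with a `"stdout"` key; `FakeShell` and `tail_new` are hypothetical names used only so the example runs on its own.

```python
class FakeShell:
    """Stand-in for a named shell; only view(since_byte=...) is modelled."""

    def __init__(self):
        self._buf = b""

    def feed(self, data: bytes):
        self._buf += data

    def view(self, since_byte=0):
        return {"stdout": self._buf[since_byte:].decode("utf-8", "replace")}


def tail_new(shell, offset):
    """Fetch output produced since `offset`; return (text, new_offset).

    The offset is counted in bytes, so advance it by the encoded length
    of the text rather than by the number of characters.
    """
    text = shell.view(since_byte=offset)["stdout"]
    return text, offset + len(text.encode("utf-8"))


shell = FakeShell()
shell.feed(b"collected 3 items\n")
first, offset = tail_new(shell, 0)
shell.feed(b"3 passed\n")
second, offset = tail_new(shell, offset)
```

Each call returns only the new output, so an agent polling a long test run never re-reads what it has already seen.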


Background processes

Start a server that outlives the call that started it; query it from later calls in the same sandbox.

p = sb.process.start("python3 -m http.server 8000 --bind 0.0.0.0")
sb.run("curl -s http://127.0.0.1:8000/")    # talks to the bg server
p.kill()

Process registry endpoints: start / list / kill / wait / logs. Every process gets its stdout / stderr captured to a tempfile so logs() is re-readable.
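
The re-readable logs follow from the capture-to-file design. A rough local sketch of that idea using only the standard library: `LoggedProcess` is a hypothetical stand-in, not the firebox process registry, and it merges stderr into stdout in a single file for brevity.

```python
import subprocess
import sys
import tempfile


class LoggedProcess:
    """Run a child process with its output captured to a temp file."""

    def __init__(self, argv):
        # delete=False keeps the file around so logs() can reopen it later.
        log = tempfile.NamedTemporaryFile(prefix="proc-", suffix=".log",
                                          delete=False)
        self.log_path = log.name
        self._proc = subprocess.Popen(argv, stdout=log,
                                      stderr=subprocess.STDOUT)
        log.close()  # the child holds its own handle to the file

    def wait(self, timeout=None):
        return self._proc.wait(timeout=timeout)

    def logs(self):
        # Reading from the file consumes nothing, so repeated calls agree.
        with open(self.log_path, "r", encoding="utf-8",
                  errors="replace") as f:
            return f.read()


p = LoggedProcess([sys.executable, "-c", "print('ready')"])
exit_code = p.wait(timeout=30)
print(p.logs(), end="")
```

Reading from a pipe would drain it on the first call; reading from a file is what makes a second `logs()` return the same text.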


Read & write files

Binary-safe (base64 on the wire). Use to inject Python scripts, config, datasets; fetch results back.

sb.files.write("/work/main.py", "print('hi')")
sb.files.write("/data/blob.bin", b"\x00\x01\x02", mode=0o600)

text = sb.files.read_text("/etc/firebox-template-name")
blob = sb.files.read("/work/output.parquet")    # bytes
listing = sb.files.list("/work")                # [FileEntry, ...]

MCP:
file_write  { sandbox_id, path, content }   → { bytes_written }
file_read   { sandbox_id, path }            → { content }
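
"Base64 on the wire" means arbitrary bytes survive the JSON transport. The sketch below shows the round trip under the assumption that the `content` field carries the base64 text; the helper names and the `"sb-123"` id are made up for illustration.

```python
import base64
import json


def encode_file_write(sandbox_id: str, path: str, data: bytes) -> str:
    """Build a JSON request body with the file bytes base64-encoded."""
    payload = {
        "sandbox_id": sandbox_id,
        "path": path,
        "content": base64.b64encode(data).decode("ascii"),
    }
    return json.dumps(payload)


def decode_file_read(response_json: str) -> bytes:
    """Recover the original bytes from a JSON response body."""
    return base64.b64decode(json.loads(response_json)["content"])


blob = b"\x00\x01\x02"
request = encode_file_write("sb-123", "/data/blob.bin", blob)
# Simulate the server echoing the stored content back on file_read.
echoed = json.dumps({"content": json.loads(request)["content"]})
restored = decode_file_read(echoed)
```

Raw NUL bytes cannot appear in a JSON string, which is why the payload is encoded before it is sent.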

Upload & download

Convenience wrappers around read / write for whole files.

sb.files.upload("local.csv",  "/data/in.csv")
sb.files.download("/work/out.png", "local-out.png")

The file mode (executable bit, etc.) is carried across.


Open the browser

Stealth Chromium driven from outside. One call to start, state persists. Defaults to visible (Xvfb) so the session is stream/VNC ready; pass headless=True for full Chromium under --headless=new.

sb.browser.start()                                  # visible (Xvfb) + stealth on by default
sb.browser.navigate("https://example.com")
print(sb.browser.text("h1"))                        # → "Example Domain"

MCP:
browser_start    { sandbox_id, headless?, stealth?, user_agent?, ... }
browser_navigate { sandbox_id, url, wait_until?, timeout? }
browser_close    { sandbox_id }

Template browser-use has Chromium + Playwright + patchright + curl_cffi + Whisper + Xvfb + x11vnc + websockify + noVNC + xdotool baked in.


view() — perceive a page

The headline browsing primitive. One call returns markdown + indexed elements (across same-origin AND cross-origin iframes) + an annotated PNG screenshot. Auto-waits for network-idle so SPAs settle before the agent reads.

obs = sb.browser.view()
# {
#   "view_token":     "8fcc2271f82f40cf",
#   "url":            "...",  "title": "...",
#   "markdown":       "# Heading\n\nText...",
#   "elements":       [{idx, tag, text, x, y, width, height,
#                       href, role, frame?}, ...],
#   "screenshot_b64": "iVBORw0KGgo...",
#   "viewport":       {"width": 1280, "height": 800},
# }

MCP:
browser_view {
    sandbox_id, max_markdown_chars?, screenshot?, annotated?,
    wait_for_load? (load|domcontentloaded|networkidle|none),
    wait_timeout?, traverse_iframes?,
    since_view_token?, delta_only?,
    extract_links?, extract_images?,
} → { view_token, url, title, markdown, elements, screenshot_b64,
      viewport, ... }

iframe-resident elements include a "frame" field; coords are translated to top-level viewport so click_idx works through frame boundaries.
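
Since the observation is plain data, picking an element is an ordinary list search. A small sketch over the documented `elements` shape; `find_idx` is a hypothetical helper and the sample observation is invented for the example.

```python
def find_idx(elements, text, tag=None):
    """Return the idx of the first element whose text contains `text`
    (case-insensitive), optionally restricted to one tag; None if absent."""
    needle = text.lower()
    for el in elements:
        if tag is not None and el.get("tag") != tag:
            continue
        if needle in (el.get("text") or "").lower():
            return el["idx"]
    return None


# Invented sample, trimmed to the fields the helper reads.
obs = {
    "elements": [
        {"idx": 0, "tag": "a", "text": "Home", "href": "/"},
        {"idx": 1, "tag": "input", "text": "", "role": "textbox"},
        {"idx": 2, "tag": "button", "text": "Sign in"},
    ]
}
sign_in = find_idx(obs["elements"], "sign in")
```

The returned index can then be handed to an index-based call such as `sb.browser.click_idx(sign_in)`.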


Index-based interaction

Pick an element by its idx from view() or clickables() — no CSS selectors needed. Re-resolves server-side per call so small DOM updates are tolerated.

sb.browser.click_idx(7, humanlike=True)
sb.browser.input_idx(3, "alice@example.com", clear=True)
sb.browser.select_option_idx(12, "Europe")
sb.browser.move_mouse(idx=4)                   # hover
sb.browser.scroll_in_element(8, "down", 0.5)   # scroll within element

MCP:
browser_click_idx          { sandbox_id, idx, button?, click_count?, humanlike? }
browser_input_idx          { sandbox_id, idx, text, clear?, humanlike? }
browser_select_option_idx  { sandbox_id, idx, value }
browser_move_mouse         { sandbox_id, idx | (x,y) }
browser_scroll_in_element  { sandbox_id, idx, direction?, amount? }

Diff-since-last-view

Pass the prior view_token to get only what changed. Cuts session tokens 5–10× for long-running agent loops.

v1 = sb.browser.view()
last = v1["view_token"]

sb.browser.click_idx(some_button)

v2 = sb.browser.view(since_view_token=last, delta_only=True)
# v2["diff"] = {
#   added_count, removed_count, stable_count,
#   added_elements:   [{idx, tag, text, ...}],
#   removed_elements: [{tag, text, href}],
#   markdown_unchanged, url_changed,
# }

delta_only=True suppresses redundant full markdown / elements / screenshot payload when a prior snapshot exists.
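
One way the token savings show up in practice is by turning the diff into a few lines of prompt text instead of resending the page. A sketch over the documented diff keys; `summarize_diff` and the sample diff are hypothetical.

```python
def summarize_diff(diff, max_items=5):
    """Render a view() diff as a short, prompt-friendly text block."""
    lines = []
    if diff.get("url_changed"):
        lines.append("URL changed.")
    if diff.get("markdown_unchanged"):
        lines.append("Page text unchanged.")
    for el in diff.get("added_elements", [])[:max_items]:
        lines.append(f"+ [{el['idx']}] <{el['tag']}> {el.get('text', '')}".rstrip())
    for el in diff.get("removed_elements", [])[:max_items]:
        lines.append(f"- <{el['tag']}> {el.get('text', '')}".rstrip())
    lines.append(
        f"{diff.get('added_count', 0)} added, "
        f"{diff.get('removed_count', 0)} removed, "
        f"{diff.get('stable_count', 0)} stable"
    )
    return "\n".join(lines)


# Invented sample following the documented shape.
diff = {
    "added_count": 1, "removed_count": 1, "stable_count": 40,
    "added_elements": [{"idx": 41, "tag": "div", "text": "Saved"}],
    "removed_elements": [{"tag": "button", "text": "Save", "href": None}],
    "markdown_unchanged": False, "url_changed": False,
}
summary = summarize_diff(diff)
print(summary)
```

Only added elements carry an idx, so the summary lists removed ones by tag and text alone.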


Network capture

Read XHR data the page received instead of scraping the rendered DOM.

sb.browser.navigate("https://api-driven-spa.example.com")
log = sb.browser.network_log(url_contains="/api/", phase="response")
for entry in log["entries"]:
    print(entry["status"], entry["url"])
    print(entry["body_preview"])    # first 10 KB of JSON / text

MCP:
browser_network_log {
    sandbox_id, limit?, clear?, url_contains?, method?,
    phase? (request|response), min_status?,
} → { entries: [{ts, phase, method, url, status?, response_headers?,
                 body_preview?, ...}], total }
browser_network_clear { sandbox_id }

Bodies captured for application/json / text/* / javascript. Cleared on browser restart().
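
Because body_preview is capped, a JSON body may arrive cut off and fail to parse. A defensive sketch that keeps only successful responses whose preview is complete JSON; `json_bodies` is a hypothetical helper and the log below is invented sample data in the documented shape.

```python
import json


def json_bodies(log, min_status=200, max_status=299):
    """Return (url, parsed_body) pairs for responses with parseable JSON."""
    out = []
    for entry in log.get("entries", []):
        status = entry.get("status")
        if status is None or not (min_status <= status <= max_status):
            continue
        try:
            body = json.loads(entry.get("body_preview") or "")
        except json.JSONDecodeError:
            continue  # truncated preview or a non-JSON body
        out.append((entry["url"], body))
    return out


log = {
    "entries": [
        {"phase": "response", "url": "https://example.com/api/items",
         "status": 200, "body_preview": '{"items": [1, 2, 3]}'},
        {"phase": "response", "url": "https://example.com/api/fail",
         "status": 500, "body_preview": '{"error": "boom"}'},
        {"phase": "response", "url": "https://example.com/api/big",
         "status": 200, "body_preview": '{"items": [1, 2'},  # cut off
    ],
    "total": 3,
}
bodies = json_bodies(log)
```

For responses larger than the preview, fetching the URL again with `sb.http.get` is one fallback.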


Tabs

Multi-tab work in the same browser context. 4-char tab IDs.

new_tab = sb.browser.tabs_new("https://example.com")  # becomes active
sb.browser.tabs_list()        # {tabs: [{id, url, title, active}], active}
sb.browser.tabs_switch(new_tab["id"])
sb.browser.tabs_close(new_tab["id"])

MCP: browser_tabs_list / new / switch / close.


Selector / coords / JS

When idx isn't right (you have stable selectors, you wrote the agent loop yourself, or you need raw JS):

sb.browser.click("button.submit")
sb.browser.fill("input[name='email']", "alice@example.com")
sb.browser.press("Enter")
sb.browser.type("manual typing", humanlike=True)    # 60-100 WPM jitter
sb.browser.wait_for("div.results", state="visible")
items = sb.browser.clickables()
img   = sb.browser.screenshot_annotated()
sb.browser.click_at(items[5]["x"], items[5]["y"], humanlike=True)

humanlike=True (default) curves the cursor along a Bezier path with per-step jitter; False does an instant teleport.
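
For intuition, a curved cursor path with jitter can be produced as below. This is a generic cubic Bezier sketch, not firebox's actual curve: the control-point offsets, step count and jitter size are all assumptions chosen for illustration.

```python
import random


def bezier_path(start, end, steps=20, jitter=1.5, seed=None):
    """Points along a cubic Bezier from start to end with per-step jitter."""
    rng = random.Random(seed)
    (x0, y0), (x3, y3) = start, end
    dx, dy = x3 - x0, y3 - y0
    # Control points sit off the straight line so the path bends.
    x1 = x0 + 0.3 * dx + rng.uniform(-40, 40)
    y1 = y0 + 0.3 * dy + rng.uniform(-40, 40)
    x2 = x0 + 0.7 * dx + rng.uniform(-40, 40)
    y2 = y0 + 0.7 * dy + rng.uniform(-40, 40)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1.0 - t
        x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
        if 0 < i < steps:  # keep the endpoints exact
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((x, y))
    return points


path = bezier_path((10, 10), (400, 300), steps=20, seed=1)
```

Moving the pointer through these points in order gives a bent, slightly noisy trajectory, whereas the instant teleport corresponds to using only the last point.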


Extraction primitives

For tasks where view() is overkill or under-specified.

md = sb.browser.extract_markdown(
    extract_links=True, extract_images=True,
    start_from_char=0, max_chars=20000, traverse_iframes=True,
)
# {url, title, markdown, total_chars, returned_chars, truncated}

Selector-driven enumeration. Different from clickables — arbitrary CSS, not just visible-interactive.

rows = sb.browser.find_elements("table.products tr",
                                  attributes=["data-id", "class"])
# {selector, count, total, items: [{idx, tag, text, attributes, bbox}]}

Text / regex search with surrounding context. Saves tokens vs dumping the whole DOM.

hits = sb.browser.search_page(r"\$\d+", regex=True, context_chars=80)
# {pattern, count, matches: [{match, before, after, offset}]}

List <option>s before select_option_idx so the agent picks a known value.

opts = sb.browser.dropdown_options(12)
# {tag, name, multiple, options: [{value, label, selected, disabled}]}
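
Mapping a human-readable label to the option's value is then a small lookup. A sketch over the documented `options` shape; `option_value` is a hypothetical helper and the dropdown contents are invented.

```python
def option_value(opts, label):
    """Return the value of the enabled option whose label matches `label`
    (case-insensitive); raise ValueError if there is none."""
    wanted = label.strip().lower()
    for opt in opts["options"]:
        if opt.get("disabled"):
            continue
        if opt["label"].strip().lower() == wanted:
            return opt["value"]
    raise ValueError(f"no enabled option labelled {label!r}")


opts = {
    "tag": "select", "name": "region", "multiple": False,
    "options": [
        {"value": "", "label": "Choose...", "selected": True, "disabled": True},
        {"value": "eu", "label": "Europe", "selected": False, "disabled": False},
        {"value": "na", "label": "North America", "selected": False, "disabled": False},
    ],
}
value = option_value(opts, "europe")
```

The result would be passed on as the second argument of `sb.browser.select_option_idx(12, value)`.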

More browser actions

sb.browser.scroll_by("down", 0.8)              # viewport-fraction units
sb.browser.scroll_by("bottom")                  # jump to bottom

sb.browser.send_keys("Control+O")               # chord (modifiers)
sb.browser.send_keys("Escape")

sb.browser.save_pdf(path="/tmp/page.pdf",
                     paper_format="A4", landscape=False)
pdf_bytes = sb.files.read("/tmp/page.pdf")

sb.files.write("/tmp/data.csv", csv_text)
sb.browser.upload_file(idx=4, paths="/tmp/data.csv")

sb.browser.restart("https://example.com")       # fresh context + nav

sb.browser.console_view(limit=50)               # captured console + pageerror

MCP equivalents: browser_scroll, browser_send_keys, browser_save_pdf, browser_upload_file, browser_restart, browser_console_view.


Read page content

title    = sb.browser.text("h1")                       # one element
titles   = sb.browser.text_all(".titleline > a")       # list[str]
href     = sb.browser.attr("a.cta", "href")            # attribute
full_html = sb.browser.html()                          # whole page

MCP:
browser_text     { sandbox_id, selector? }   → { text }
browser_text_all { sandbox_id, selector }    → { items: [...] }
browser_html     { sandbox_id, selector? }   → { html }

Screenshots

sb.browser.screenshot(save_path="page.png")            # plain
sb.browser.screenshot(selector="header", save_path="h.png")
sb.browser.screenshot(full_page=True, save_path="full.png")

# With numbered overlays for LLM-friendly visual click:
sb.browser.screenshot_annotated(save_path="annotated.png")

MCP:
browser_screenshot           → ImageContent (PNG)
browser_screenshot_annotated → ImageContent (PNG, yellow numbers)

ImageContent is what Claude / GPT-4o see directly in the conversation.


Run JavaScript

Escape hatch for anything Playwright can't express cleanly.

titles = sb.browser.evaluate("""
    () => [...document.querySelectorAll('.titleline > a')]
              .map(a => a.innerText).slice(0, 5)
""")

MCP:
browser_evaluate { sandbox_id, script }   → { result }

The script must return JSON-serializable data.


Persist sessions

Save a logged-in browser context and restore in a future sandbox.

# First sandbox: log in once.
sb.browser.start()
sb.browser.navigate("https://app.example.com/login")
sb.browser.fill("#email", "alice@example.com")
sb.browser.fill("#password", "...")
sb.browser.click("button[type=submit]")
sb.browser.save_profile("alice-app")                 # /var/firebox-profiles/alice-app.json

# Days later, fresh sandbox: come back already logged in.
with Sandbox.create(template="browser-use") as sb2:
    sb2.browser.start(profile="alice-app")
    sb2.browser.navigate("https://app.example.com/dashboard")

Profiles include cookies + localStorage + sessionStorage. They're template-scoped: a profile saved from browser-use only loads in browser-use.


Real-Chrome HTTP

When Cloudflare or Akamai block the Chromium browser at the TLS layer, fire raw HTTP with curl_cffi's real Chrome 120 JA3/JA4 fingerprint.

r = sb.http.get("https://api.cloudflared-site.com/v1/data",
                headers={"Accept": "application/json"})
print(r.status, r.json())
r = sb.http.get("https://tls.peet.ws/api/all").json()
print(r["tls"]["ja3_hash"])    # matches real Chrome

Methods: get / post / put / delete / request. Cookies from the browser carry over via headers.


Aggregated search

Self-hosted SearxNG fans out to 5–15 engines per query. No API key, no rate limit.

sb.search.web("rust web framework")              # general
sb.search.news("AI agents", time_range="day")    # last day
sb.search.papers("microvm performance")          # arxiv + Scholar
sb.search.code("playwright stealth github")      # GitHub + SO + Arch
sb.search.images("hacker news logo")
sb.search.videos("rust async tutorial")
sb.search.wiki("Firecracker (software)")
sb.search.maps("Ljubljana")

MCP:
search { sandbox_id, q, categories?, engines?, language?,
         time_range?, pageno?, safesearch?, limit? }

CLI:
firebox search "calorie tracker" -c news -t week -n 5
firebox search "..." -u | xargs -P5 curl -sI -o /dev/null

Cached in-sandbox for 5 minutes; same query twice is microseconds.
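
The caching behaviour can be pictured as a time-to-live cache keyed on the query. How firebox actually keys or evicts entries is not documented here, so treat this as a generic sketch; `TTLCache` and the injectable clock are illustrative choices.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh entry: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value


# A fake clock makes the expiry visible without sleeping.
fake_now = [0.0]
calls = []


def slow_search():
    calls.append(1)
    return ["result"]


cache = TTLCache(ttl_seconds=300.0, clock=lambda: fake_now[0])
cache.get_or_compute("rust web framework", slow_search)   # miss
cache.get_or_compute("rust web framework", slow_search)   # hit
fake_now[0] = 301.0
cache.get_or_compute("rust web framework", slow_search)   # expired, recomputed
```

The second lookup never reaches the search backend, which is where the speed-up for repeated queries comes from.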


Transcribe audio

Local Whisper (tiny.en, 39 MB) baked into the browser-use template. ~5× realtime on the sandbox's CPU.

audio = sb.http.get("https://example.com/podcast.mp3").content
result = sb.audio.transcribe(audio, format="mp3")
print(result.text)            # "..."
print(result.language)        # "en"
print(result.segments)        # [{start, end, text}, ...]

Language can be auto-detected or pinned with language="en".
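
The segment list maps directly onto subtitle formats. A short sketch that renders segments as SRT, assuming each segment is a dict with `start` and `end` in seconds plus `text`, as the shape above suggests; the helper names and sample segments are made up.

```python
def fmt_ts(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments):
    """Render [{start, end, text}, ...] as an SRT subtitle document."""
    blocks = []
    for n, seg in enumerate(segments, start=1):
        blocks.append(
            f"{n}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 4.0, "text": " Thanks for listening."},
]
srt = to_srt(segments)
print(srt)
```

If the SDK returns segment objects instead of dicts, swap the key lookups for attribute access.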


Solve captcha

Four paths; pick by cost and target site.

Audio challenge (local Whisper): free, roughly 70-80% success per attempt, and works on sites that allow audio mode. Same shape for reCAPTCHA v2 and hCaptcha.

if sb.browser.detect_captcha():
    out = sb.captcha.solve_recaptcha_audio(retries=3)    # reCAPTCHA v2
    # or, on an hCaptcha page:
    # out = sb.captcha.solve_hcaptcha_audio(retries=3)
    # → { verified, attempts, text, language }

Image challenge (your vision LLM): for sites that force image challenges. Firebox surfaces the puzzle as plain data (instructions + screenshot + cell bboxes). Your agent's vision LLM decides which cells to click; firebox runs no model.

challenge = sb.captcha.recaptcha_open_image_challenge()
# challenge.screenshot_b64   PNG of the grid
# challenge.instructions     "Click verify once there are no more"
# challenge.target           "crosswalks"
# challenge.cells            [{idx, x, y, width, height}, ...]

# ... your LLM looks at the screenshot + instructions ...
# ... decides indices = [0, 4, 7] ...
sb.captcha.recaptcha_click_cells([0, 4, 7])
result = sb.captcha.recaptcha_verify_image()
# If reCAPTCHA wants more clicks, result has the next puzzle baked in:
while result.get("more_to_click"):
    # hand result back to the LLM, repeat
    ...
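
The retry loop above is left open-ended. One bounded way to drive it is sketched here with the three steps injected as callables so the example runs without a sandbox; `solve_image_rounds` is hypothetical, and the assumption that a verify result can be handed back as the next challenge comes from the comment in the example above. The fake results, including the `"verified"` key, are invented.

```python
def solve_image_rounds(first_challenge, pick_cells, click_cells, verify,
                       max_rounds=5):
    """Run click/verify rounds until no more clicks are requested.

    Returns the final verify result, or raises after max_rounds.
    """
    challenge = first_challenge
    for _ in range(max_rounds):
        click_cells(pick_cells(challenge))
        result = verify()
        if not result.get("more_to_click"):
            return result
        challenge = result  # assumed: the result carries the next puzzle
    raise RuntimeError(f"captcha still unsolved after {max_rounds} rounds")


# Fakes standing in for the vision LLM and the sb.captcha calls.
clicks = []
answers = iter([
    {"more_to_click": True, "target": "crosswalks"},
    {"more_to_click": False, "verified": True},
])
final = solve_image_rounds(
    first_challenge={"target": "crosswalks"},
    pick_cells=lambda challenge: [0, 4, 7],
    click_cells=clicks.append,
    verify=lambda: next(answers),
)
```

In real use, `click_cells` and `verify` would be `sb.captcha.recaptcha_click_cells` and `sb.captcha.recaptcha_verify_image`, and `pick_cells` would call your vision model. The round cap keeps a confused model from looping forever.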

2captcha API: works for reCAPTCHA v2 and v3, hCaptcha, Cloudflare Turnstile, and Funcaptcha. The caller never sees the puzzle.

info = sb.browser.detect_captcha()
if info:
    sb.browser.solve_captcha_on_page(api_key="<2captcha-key>")

Human handoff: the last resort. Ping the human, expose VNC, and let them solve it manually.

info = sb.captcha.handoff_to_vnc(poll_until_solved=True,
                                  poll_timeout=300)
# info = {vnc_url, vnc_password, sandbox_ip,
#         solved: True, waited: 23.4}

The caller is responsible for routing port 5900 from the sandbox to the human (DNAT, Tailscale, whatever your topology requires). poll_until_solved=True blocks until detect_captcha() returns None.

detect_captcha returns {type, sitekey, iframe_url, callback} for reCAPTCHA v2/v3/enterprise, hCaptcha, Cloudflare Turnstile, Funcaptcha.


OS-level input

sb.os.* operates at the X server level (DISPLAY=:99) via xdotool, so it reaches native dialogs, file pickers, browser chrome (download bar, tabs, address bar) — anything Playwright can't reach because it's outside the page DOM. This is the layer Anthropic Computer Use and OpenAI CUA target.

sb.run("firebox-display test")           # bring up Xvfb
sb.os.click(200, 300)                     # left click at coords
sb.os.type("hello world", delay_ms=25)
sb.os.key("ctrl+s")                        # X keysym, modifiers OK
sb.os.scroll("down", amount=3)             # wheel ticks
sb.os.drag(50, 50, 250, 250)
png = sb.os.screenshot()                   # full-display PNG
state = sb.os.state()                      # geom + cursor
# {display, width, height, cursor_x, cursor_y}

MCP:
os_click          { sandbox_id, x, y, button?, click_count? }
os_double_click   { sandbox_id, x, y }
os_right_click    { sandbox_id, x, y }
os_move_mouse     { sandbox_id, x, y }
os_drag           { sandbox_id, x1, y1, x2, y2, button? }
os_scroll         { sandbox_id, direction?, amount? }
os_type           { sandbox_id, text, delay_ms? }
os_key            { sandbox_id, key }       # 'Return', 'ctrl+c', 'alt+F4', ...
os_screenshot     { sandbox_id } → ImageContent (PNG)
os_state          { sandbox_id }

Lazy apt-installs xdotool + scrot on first use if missing (baked into the browser-use template).
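
Models that work from a downscaled screenshot return coordinates in the image's pixel space, which then need mapping back to the display before calling `sb.os.click`. A sketch using the width and height from `sb.os.state()`; `to_display_coords` is a hypothetical helper, and the screenshot size is whatever your agent resized the image to.

```python
def to_display_coords(x, y, shot_size, state):
    """Scale a point from screenshot pixels to display pixels, clamped
    to the visible area reported by the state dict."""
    shot_w, shot_h = shot_size
    dx = round(x * state["width"] / shot_w)
    dy = round(y * state["height"] / shot_h)
    dx = min(max(dx, 0), state["width"] - 1)
    dy = min(max(dy, 0), state["height"] - 1)
    return dx, dy


# Invented state in the documented shape; screenshot halved to 640x400.
state = {"display": ":99", "width": 1280, "height": 800,
         "cursor_x": 0, "cursor_y": 0}
target = to_display_coords(100, 150, (640, 400), state)
```

Clamping protects against a model pointing just outside the image, which would otherwise click off-screen.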


Live noVNC stream

Hand the user a URL they open in any browser tab to watch the agent work in real time.

info = sb.stream.start()
print(f"Watch the agent: {info['url']}")
# → http://your-host:51234/vnc.html?autoconnect=1&password=...
# info = {url, password, host_port, vm_port}

# ... agent does its thing ...

sb.stream.stop()        # tear down the public mapping

MCP:
stream_start    { sandbox_id, password?, view_only?, require_auth? }
stream_stop     { sandbox_id }
stream_get_url  { sandbox_id }

Pipeline: Xvfb on :99 → x11vnc on 5900 → websockify bridges 6080 ⇄ 5900 with embedded noVNC web client → sb.ports.expose(6080) DNATs out. Auto-installs websockify+novnc if missing. Stream stays live until stop() or sandbox close.


Public URL for a service

DNAT a free host port to a port the sandbox is listening on. Returns a URL the agent can hand the user.

sb.run("cd /work && python3 -m http.server 3000 --bind 0.0.0.0 &")
ep = sb.ports.expose(3000)
print(ep["url"])       # → http://your-host:51234

sb.ports.list()         # active mappings
sb.ports.unexpose(ep["host_port"])

MCP:
sandbox_expose_port    { sandbox_id, port, scheme? }
    → { vm_port, host_port, url, scheme }
sandbox_list_ports     { sandbox_id }
sandbox_unexpose_port  { sandbox_id, host_port }

Idempotent on (port, scheme) — re-exposing the same vm_port returns the existing mapping. Cleanup happens automatically on sandbox close. Service inside the sandbox must bind 0.0.0.0 (not 127.0.0.1).
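
The idempotency rule is easiest to see in a toy model. The class below only illustrates the documented semantics (same port and scheme returns the same mapping); it is not the firebox daemon, and the default scheme, host name and starting host port are assumptions.

```python
class PortRegistry:
    """Toy model of expose/unexpose semantics."""

    def __init__(self, host="your-host", first_host_port=51234):
        self._host = host
        self._next_port = first_host_port
        self._by_key = {}  # (vm_port, scheme) -> mapping

    def expose(self, vm_port, scheme="http"):
        key = (vm_port, scheme)
        if key in self._by_key:
            return self._by_key[key]  # idempotent: reuse the mapping
        host_port = self._next_port
        self._next_port += 1
        mapping = {
            "vm_port": vm_port,
            "host_port": host_port,
            "url": f"{scheme}://{self._host}:{host_port}",
            "scheme": scheme,
        }
        self._by_key[key] = mapping
        return mapping

    def unexpose(self, host_port):
        for key, mapping in list(self._by_key.items()):
            if mapping["host_port"] == host_port:
                del self._by_key[key]
                return True
        return False

    def mappings(self):
        return list(self._by_key.values())


reg = PortRegistry()
a = reg.expose(3000)
b = reg.expose(3000)      # same mapping, no second host port consumed
c = reg.expose(8000)
```

An agent can therefore call expose on every turn without leaking host ports.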


What's not yet here

These would be obvious additions; nothing's blocking them, just not written yet:

  • PTY / interactive shell: firebox sandbox attach <id> for a real terminal. The SDK and daemon would need TTY allocation + a bidirectional WebSocket. (Note: sb.shells.* already provides named long-running shells with stdin write; only the interactive TTY case is missing.)
  • GPU passthrough — pass a Blackwell partition through to the VM for in-sandbox inference. Firecracker supports VFIO; we don't wire it up yet.
  • Pause / resume snapshots + warm pool — Firecracker has native snapshot support; we'd boot from a paused snapshot for sub-50 ms cold start.
  • Volume mounts — persistent state across sandbox lifetimes for the same template / token.
  • Anthropic Computer Use / OpenAI CUA adapters — the sandbox primitives are all there (sb.os.* + sb.browser.view); the shape-mapping layer is the missing glue.

If any of these are blocking your use case, the host setup docs tell you where each piece would slot in.