Tools showcase
Every capability the agent (your own, or an MCP host such as Claude Desktop) can use, each with a short example. Each Python SDK call has a matching MCP tool.
| Section | Use it for |
|---|---|
| Run a command | Synchronous shell exec |
| Stream output | Long-running jobs, live tail |
| Background processes | Servers that outlive a single call |
| Read & write files | Inject code/data, fetch results |
| Upload & download | Move files between local and VM |
| Open the browser | Stealth Chromium |
| Click, fill, type | Selector-based interaction |
| Visual click | Coordinate-based, screenshot-driven |
| Read page content | Text / DOM / attributes |
| Screenshots | Plain or annotated |
| Run JavaScript | Escape hatch |
| Persist sessions | Cookies / localStorage save/load |
| Real-Chrome HTTP | Bypass TLS-fingerprint blocks |
| Aggregated search | SearxNG fan-out |
| Transcribe audio | Whisper inside the VM |
| Solve captcha | 2captcha API or local Whisper |
| Expose a public URL | DNAT host port to a sandbox port |
Run a command
Synchronous shell exec inside the sandbox. Returns stdout, stderr, exit code, duration.
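The SDK call itself is not reproduced on this page. As a local stand-in, the sketch below runs a command with Python's `subprocess` and packs the same four fields into a result object. `CommandResult` and `run_local` are hypothetical names for illustration, not part of the firebox SDK.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class CommandResult:
    # Hypothetical container mirroring the four fields named above.
    stdout: str
    stderr: str
    exit_code: int
    duration: float  # seconds


def run_local(argv: list[str], timeout: float = 30.0) -> CommandResult:
    """Run argv synchronously on the local machine and collect the result."""
    started = time.monotonic()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return CommandResult(
        stdout=proc.stdout,
        stderr=proc.stderr,
        exit_code=proc.returncode,
        duration=time.monotonic() - started,
    )


result = run_local([sys.executable, "-c", "print('hello')"])
print(result.exit_code, result.stdout.strip())
```

A non-zero `exit_code` is returned as data rather than raised, which lets an agent inspect `stderr` and decide what to do next.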
Stream output
Same as run but yields chunks as they arrive — perfect for progress
bars, tail -f, long compiles.
NDJSON over the wire; the SDK turns each line into a StreamChunk.
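To make the NDJSON step concrete, here is a minimal sketch of turning newline-delimited JSON into chunk objects. The `stream` and `data` field names are assumptions chosen for illustration; the real `StreamChunk` may carry different fields.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamChunk:
    # Field names are assumptions for illustration only.
    stream: str  # "stdout" or "stderr"
    data: str


def parse_ndjson(lines: Iterable[str]) -> Iterator[StreamChunk]:
    """Yield one StreamChunk per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield StreamChunk(stream=record["stream"], data=record["data"])


wire = [
    '{"stream": "stdout", "data": "compiling...\\n"}',
    "",
    '{"stream": "stderr", "data": "warning: unused variable\\n"}',
]
for chunk in parse_ndjson(wire):
    print(f"[{chunk.stream}] {chunk.data}", end="")
```

Because the parser is a generator, each chunk can be handled as soon as its line arrives instead of waiting for the command to finish.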
Background processes
Start a server that outlives the call that started it; query it from later calls in the same sandbox.
Process registry endpoints: start / list / kill / wait / logs. Every
process gets its stdout / stderr captured to a tempfile so logs() is
re-readable.
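The registry idea can be sketched locally with `subprocess` and a temp file per process. This is an illustration of the technique only: the class and method names (`ProcessRegistry`, `list_pids`, and so on) are hypothetical, and the real daemon's endpoints may behave differently.

```python
import subprocess
import sys
import tempfile
from typing import Optional


class ProcessRegistry:
    """Local stand-in: start / list / kill / wait / logs for child processes."""

    def __init__(self) -> None:
        self._procs: dict[int, tuple[subprocess.Popen, str]] = {}

    def start(self, argv: list[str]) -> int:
        # delete=False keeps the log file on disk after this handle is closed.
        log = tempfile.NamedTemporaryFile(prefix="proc-", suffix=".log", delete=False)
        # stdout and stderr are merged into the same file in this sketch.
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        log.close()  # the child keeps its own copy of the descriptor
        self._procs[proc.pid] = (proc, log.name)
        return proc.pid

    def list_pids(self) -> list[int]:
        # Only processes that have not exited yet.
        return [pid for pid, (proc, _) in self._procs.items() if proc.poll() is None]

    def kill(self, pid: int) -> None:
        self._procs[pid][0].kill()

    def wait(self, pid: int, timeout: Optional[float] = None) -> int:
        return self._procs[pid][0].wait(timeout=timeout)

    def logs(self, pid: int) -> str:
        # Reading from the file on every call is what makes logs re-readable.
        with open(self._procs[pid][1], "r", errors="replace") as fh:
            return fh.read()


registry = ProcessRegistry()
pid = registry.start([sys.executable, "-c", "print('server ready')"])
registry.wait(pid, timeout=30)
print(registry.logs(pid), end="")
```

Writing output to a file rather than a pipe means a slow or absent reader never blocks the process, and the same log can be fetched repeatedly.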
Read & write files
Binary-safe (base64 on the wire). Use to inject Python scripts, config, datasets; fetch results back.
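Why base64 makes the transfer binary-safe can be shown in a few lines: arbitrary bytes become ASCII text that survives a JSON body unchanged. The payload field names (`path`, `content_b64`) are assumptions for illustration, not the documented wire format.

```python
import base64
import json


def encode_file_payload(path: str, content: bytes) -> str:
    """Build a JSON body carrying arbitrary bytes as base64 text."""
    return json.dumps({
        "path": path,
        "content_b64": base64.b64encode(content).decode("ascii"),
    })


def decode_file_payload(body: str) -> tuple[str, bytes]:
    """Recover the path and the original bytes from the JSON body."""
    record = json.loads(body)
    return record["path"], base64.b64decode(record["content_b64"])


raw = bytes(range(256))  # every byte value, including NUL and invalid UTF-8
body = encode_file_payload("/tmp/blob.bin", raw)
path, restored = decode_file_payload(body)
print(path, restored == raw)
```

The cost is size: base64 output is about a third larger than the input, which matters for large datasets.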
Upload & download
Convenience wrappers around read / write for whole files.
The file mode (executable bit, etc.) is carried across.
Open the browser
Headless stealth Chromium driven from outside. One call to start, state persists.
The browser-use template has Chromium, Playwright, patchright, curl_cffi, and Whisper baked in.
Click, fill, type
Selector-based interaction.
Visual click
Click by viewport coordinates instead of selector. Pair with
screenshot_annotated + clickables so the agent sees numbered boxes.
humanlike=True (default) curves the cursor along a Bezier path with
per-step jitter; False does an instant teleport.
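The humanlike movement described above can be approximated as follows. This is a sketch of the general technique (a cubic Bezier curve with random control points and per-step jitter), not firebox's actual implementation; `bezier_cursor_path` and its parameters are hypothetical.

```python
import random


def bezier_cursor_path(start, end, steps=25, jitter=1.5, seed=None):
    """Return steps + 1 (x, y) points from start to end along a cubic Bezier curve.

    Interior points get uniform jitter of up to +/- `jitter` pixels. The two
    endpoints are left exact so the click lands on the requested coordinate.
    """
    rng = random.Random(seed)
    (x0, y0), (x3, y3) = start, end
    # Control points sit near the 1/3 and 2/3 marks, pushed off the straight line.
    spread = 0.25 * max(abs(x3 - x0), abs(y3 - y0), 1.0)
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    path = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        if 0 < i < steps:  # jitter interior points only
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        path.append((x, y))
    return path


for point in bezier_cursor_path((10, 20), (400, 300), steps=5, seed=1):
    print(f"{point[0]:.1f}, {point[1]:.1f}")
```

The instant-teleport variant corresponds to a path containing only the end point.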
Read page content
Screenshots
ImageContent is what Claude / GPT-4o see directly in the conversation.
Run JavaScript
Escape hatch for anything Playwright can't express cleanly.
The script must return JSON-serializable data.
Persist sessions
Save a logged-in browser context and restore it in a future sandbox.
# First sandbox: log in once.
sb.browser.start()
sb.browser.navigate("https://app.example.com/login")
sb.browser.fill("#email", "alice@example.com")
sb.browser.fill("#password", "...")
sb.browser.click("button[type=submit]")
sb.browser.save_profile("alice-app") # /var/firebox-profiles/alice-app.json
# Days later, fresh sandbox: come back already logged in.
with Sandbox.create(template="browser-use") as sb2:
    sb2.browser.start(profile="alice-app")
    sb2.browser.navigate("https://app.example.com/dashboard")
Profiles include cookies, localStorage, and sessionStorage. They are
template-scoped: a profile saved from browser-use loads only in browser-use.
Real-Chrome HTTP
When Cloudflare or Akamai blocks the Chromium browser at the TLS layer, send raw HTTP requests with curl_cffi's real Chrome 120 JA3/JA4 fingerprint.
Methods: get / post / put / delete / request.
Cookies from the browser carry over via headers.
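How browser cookies can travel in a header is sketched below. It assumes cookies arrive as dicts with `name`, `value`, and `domain` keys (the shape Playwright's `context.cookies()` returns); `cookie_header` is a hypothetical helper, and how firebox performs this step internally is not shown on this page.

```python
def cookie_header(cookies: list[dict], host: str) -> str:
    """Build a Cookie header value from the browser cookies that match `host`.

    Simplification: only the domain is checked. Path, expiry, the Secure flag
    and host-only cookie rules are ignored in this sketch.
    """
    pairs = []
    for cookie in cookies:
        domain = cookie["domain"].lstrip(".")
        # A cookie applies to its own domain and to any subdomain of it.
        if host == domain or host.endswith("." + domain):
            pairs.append(f'{cookie["name"]}={cookie["value"]}')
    return "; ".join(pairs)


browser_cookies = [
    {"name": "sid", "value": "abc", "domain": ".example.com"},
    {"name": "theme", "value": "dark", "domain": "app.example.com"},
    {"name": "other", "value": "x", "domain": "unrelated.test"},
]
print(cookie_header(browser_cookies, "app.example.com"))
```

With curl_cffi, a string like this would typically be sent as `headers={"Cookie": ...}` alongside an `impersonate` target such as `"chrome120"`.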
Aggregated search
Self-hosted SearxNG fans out to 5–15 engines per query. No API key, no rate limit.
sb.search.web("rust web framework") # general
sb.search.news("AI agents", time_range="day") # last day
sb.search.papers("microvm performance") # arxiv + Scholar
sb.search.code("playwright stealth github") # GitHub + SO + Arch
sb.search.images("hacker news logo")
sb.search.videos("rust async tutorial")
sb.search.wiki("Firecracker (software)")
sb.search.maps("Ljubljana")
Results are cached in-sandbox for 5 minutes; a repeated query returns in microseconds.
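The caching behavior can be pictured as a small time-to-live cache keyed by query. This is a generic sketch, not the in-sandbox implementation; `TTLCache` and `get_or_fetch` are hypothetical names, and the injectable clock exists only to make the expiry easy to test.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Keep each value for ttl_seconds, then fetch it again on the next request."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[tuple, tuple[float, Any]] = {}

    def get_or_fetch(self, key: tuple, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: skip the engine fan-out entirely
        value = fetch()
        self._entries[key] = (now, value)
        return value


cache = TTLCache(ttl_seconds=300.0)
first = cache.get_or_fetch(("web", "rust web framework"), lambda: ["result A"])
again = cache.get_or_fetch(("web", "rust web framework"), lambda: ["never called"])
print(first is again)
```

Including the search category in the key keeps a `web` query and a `news` query with the same text from sharing an entry.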
Transcribe audio
Local Whisper (tiny.en, 39 MB) baked into the browser-use
template. ~5× realtime on the sandbox's CPU.
audio = sb.http.get("https://example.com/podcast.mp3").content
result = sb.audio.transcribe(audio, format="mp3")
print(result.text) # "..."
print(result.language) # "en"
print(result.segments) # [{start, end, text}, ...]
Language can be auto-detected or pinned with language="en".
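Once segments are available, a common next step is rendering them as a timestamped transcript. The sketch below assumes each segment is a dict with `start` and `end` in seconds plus `text`, as in the comment above; both helper names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments: list[dict]) -> list[str]:
    """Turn [{start, end, text}, ...] into one readable line per segment."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]


sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " Welcome to the show."},
]
print("\n".join(segments_to_lines(sample)))
```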
Solve captcha
Four paths; pick one by cost and target site.
Audio challenge with local Whisper: free, roughly 70-80% success per attempt, and works on sites that allow audio mode. Same shape for reCAPTCHA v2 and hCaptcha.
Image challenge, for sites that force one: Firebox surfaces the puzzle as plain data (instructions, screenshot, and cell bounding boxes). Your agent's vision LLM decides which cells to click; Firebox runs no model.
challenge = sb.captcha.recaptcha_open_image_challenge()
# challenge.screenshot_b64 PNG of the grid
# challenge.instructions "Click verify once there are no more"
# challenge.target "crosswalks"
# challenge.cells [{idx, x, y, width, height}, ...]
# ... your LLM looks at the screenshot + instructions ...
# ... decides indices = [0, 4, 7] ...
sb.captcha.recaptcha_click_cells([0, 4, 7])
result = sb.captcha.recaptcha_verify_image()
# If reCAPTCHA wants more clicks, result has the next puzzle baked in:
while result.get("more_to_click"):
    # hand result back to the LLM, repeat
    ...
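The open / click / verify cycle above can be wrapped in a bounded loop so a stubborn challenge cannot spin forever. This is a sketch under assumptions: `solve_image_challenge`, `choose_cells`, and `max_rounds` are hypothetical, and passing the verify result straight back to the LLM is one reading of "the next puzzle baked in".

```python
def solve_image_challenge(captcha, choose_cells, max_rounds=5):
    """Drive the open / click / verify cycle until no more clicks are requested.

    `captcha` is any object exposing the three methods used above, for example
    sb.captcha. `choose_cells(puzzle)` is your own vision-LLM call and returns
    a list of cell indices. Returns the final verify result, or None when
    max_rounds is exhausted.
    """
    puzzle = captcha.recaptcha_open_image_challenge()
    for _ in range(max_rounds):
        captcha.recaptcha_click_cells(choose_cells(puzzle))
        result = captcha.recaptcha_verify_image()
        if not result.get("more_to_click"):
            return result
        puzzle = result  # assumed: the next puzzle arrives inside the result
    return None
```

A `None` return is a natural point to fall back to one of the other paths.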
2captcha API: works for reCAPTCHA v2 and v3, hCaptcha, Cloudflare Turnstile, and Funcaptcha. The caller never sees the puzzle.
Last resort: ping the human, expose VNC, let them solve manually.
info = sb.captcha.handoff_to_vnc(poll_until_solved=True,
                                 poll_timeout=300)
# info = {vnc_url, vnc_password, sandbox_ip,
#         solved: True, waited: 23.4}
The caller is responsible for routing port 5900 from the sandbox
to the human (DNAT, Tailscale, whatever your topology requires).
poll_until_solved=True blocks until detect_captcha() returns None.
detect_captcha returns {type, sitekey, iframe_url, callback} for
reCAPTCHA v2/v3/enterprise, hCaptcha, Cloudflare Turnstile, Funcaptcha.
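The blocking behavior of `poll_until_solved` amounts to a polling loop with a deadline. The sketch below shows that pattern in isolation; `wait_until_solved`, the `interval` parameter, and the injectable `clock` and `sleep` are hypothetical, and the returned keys simply mirror the `solved` and `waited` fields shown above.

```python
import time


def wait_until_solved(detect_captcha, poll_timeout=300.0, interval=2.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll detect_captcha() until it returns None or poll_timeout elapses."""
    started = clock()
    while True:
        if detect_captcha() is None:
            # No captcha on the page any more: treat it as solved.
            return {"solved": True, "waited": clock() - started}
        if clock() - started >= poll_timeout:
            return {"solved": False, "waited": clock() - started}
        sleep(interval)


# Simulated page: the captcha is present twice, then gone.
answers = iter([{"type": "hcaptcha"}, {"type": "hcaptcha"}, None])
print(wait_until_solved(lambda: next(answers), interval=0.01))
```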
Expose a public URL
DNAT a host port to a port the sandbox is listening on. Returns a URL reachable from the public internet.
The SDK doesn't have a one-liner — use firebox.vmm.run_exposed or
set up the iptables rules yourself before launching the sandbox.
See examples/browser-use/run_agent_vnc.py for a worked example
(DNAT 5900 → in-VM x11vnc).
Cleanup happens automatically when the sandbox closes: an exit trap removes the iptables rules.
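For the do-it-yourself route, the core of the setup is a single DNAT rule in the `nat` table plus its matching delete. The sketch below only builds the argument lists and does not run them; `dnat_rule` is a hypothetical helper and the sandbox address `172.16.0.2` is a made-up example.

```python
def dnat_rule(action: str, host_port: int, sandbox_ip: str, sandbox_port: int) -> list[str]:
    """Build an iptables argv that forwards host_port to sandbox_ip:sandbox_port.

    action is "-A" to append the rule or "-D" to delete the identical rule.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A or -D")
    return [
        "iptables", "-t", "nat", action, "PREROUTING",
        "-p", "tcp", "--dport", str(host_port),
        "-j", "DNAT", "--to-destination", f"{sandbox_ip}:{sandbox_port}",
    ]


# Hypothetical sandbox address; use whatever IP your sandbox actually has.
add = dnat_rule("-A", 5900, "172.16.0.2", 5900)
remove = dnat_rule("-D", 5900, "172.16.0.2", 5900)
print(" ".join(add))
print(" ".join(remove))
```

Running the commands requires root, and a working setup usually also needs IP forwarding enabled and the forwarded traffic accepted in the `FORWARD` chain. Building the delete rule from the same arguments as the add rule keeps cleanup exact.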
What's not yet here
These would be obvious additions; nothing's blocking them, just not written yet:
- PTY / interactive shell: firebox sandbox attach <id> for a real terminal. The SDK and daemon would need TTY allocation + a bidirectional WebSocket.
- GPU passthrough: pass a Blackwell partition through to the VM for in-sandbox inference. Firecracker supports VFIO; we don't wire it up yet.
- Pause / resume snapshots: Firecracker has native snapshot support; we'd boot from a paused snapshot for sub-50 ms cold start.
- Volume mounts: persistent state across sandbox lifetimes for the same template / token.
If any of these are blocking your use case, the host setup docs tell you where each piece would slot in.