agent-browser
by inferen-sh
agent-browser lets AI agents control a Playwright-powered browser via inference.sh. Open pages, use @e element refs to click, type, drag, upload files, scrape content, and capture screenshots or video. Ideal for web automation, data extraction, and agent-driven browsing workflows.
Overview
What is agent-browser?
agent-browser is a browser automation skill designed for AI agents running on top of the inference.sh platform. It uses Playwright under the hood and exposes a simple, JSON-based interface so agents can:
- Open and navigate web pages in a real browser
- Interact with elements using stable @e references
- Click, type, drag-and-drop, and upload files
- Extract structured data for scraping and research
- Capture screenshots and record video of sessions
Instead of hand-writing Playwright code, you call agent-browser through the infsh CLI (or from an agent that can run Bash commands). The skill coordinates the browser session, returns machine-friendly descriptions of the page, and lets your agent drive the interaction step by step.
Who is agent-browser for?
agent-browser is aimed at:
- Developers wiring AI agents to real websites
- Automation engineers who need repeatable browser workflows
- Data and research teams doing targeted web scraping or UI-driven research
- Workflow builders using inference.sh as an orchestration layer
It fits best when you already use, or are willing to use, inference.sh and want the browser to be a controlled, agent-accessible tool.
What problems does it solve?
agent-browser handles common browser-automation jobs:
- Automating login, navigation, and form workflows
- Scraping structured content that requires interaction (search forms, filters, pagination)
- Running agent-driven “testing-like” flows on live sites
- Recording videos of an automated browsing session for review
It abstracts away direct Playwright scripting and gives the agent a higher-level set of actions using @e element references, which helps keep interactions stable across multiple steps.
When is agent-browser a good fit?
Use agent-browser when:
- You run agents via inference.sh and need them to browse the web
- You want Playwright-level reliability without writing Playwright code
- Your flows consist of opening pages, interacting with elements, and reading results
It may not be a good fit when:
- You cannot use the infsh CLI or Bash-like tooling
- You need extremely custom Playwright features beyond what the skill exposes
- Your use case is purely API-based and does not require a real browser
If you need fine-grained control of browser internals or frameworks beyond what the skill exposes, you might prefer direct Playwright scripts. For typical agent-driven automation, agent-browser provides a simpler, higher-level interface.
How to Use
Prerequisites
Before using agent-browser, make sure you have:
- An environment where you can run Bash commands
- The inference.sh CLI (infsh) installed
- An inference.sh account you can log into from the CLI
The skill’s Quick Start explicitly requires the infsh CLI. You can follow the official CLI install instructions from the repository:
- CLI install documentation: https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md
Once infsh is installed and configured, you can invoke the agent-browser skill from your terminal or from any agent tooling that can run infsh commands.
Installation and skill activation
agent-browser is distributed as part of the inferen-sh/skills repository. In most inference.sh–based environments you do not need to install a separate npm package; instead, you make sure the skill is available and then call it via infsh.
Typical setup steps:
- Install the inference.sh CLI
  - Follow cli-install.md from the repo.
- Authenticate
  - Run: infsh login
  - Follow the prompts to authenticate with inference.sh.
- Confirm skill availability
  - Ensure your inference.sh environment has access to the agent-browser app/skill under tools/utilities/agent-browser in the inferen-sh/skills repository.
If you are integrating with a broader “skills” ecosystem that supports npx skills add, you can also wire this repository as a source, but the canonical flow for agent-browser usage is through infsh app run.
Core browser automation workflow
The skill documentation describes a consistent 4-step pattern:
- Open – Start a browser session and navigate to a URL.
- Interact – Use returned @e element references to click, type, drag, or upload.
- Re-snapshot – Request an updated snapshot to get new @e refs after navigation or DOM changes.
- Close – End the session; optionally retrieve a video recording if enabled.
This pattern lets your agent maintain a mental model of the page state. Each call passes JSON input and receives structured JSON output, which you feed into your agent’s reasoning loop.
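The four steps above can be sketched as thin shell wrappers around infsh app run. This is a minimal sketch, not the skill's own tooling: the INFSH variable and the ab_* function names are hypothetical helpers, and the JSON input shapes for interact, snapshot, and close are assumptions that should be checked against SKILL.md. Only the open call with a url input and --session new comes from the Quick Start.

```shell
# Binary to invoke. Setting INFSH=echo gives a dry run that prints the
# arguments instead of contacting inference.sh (hypothetical convenience).
INFSH="${INFSH:-infsh}"

# Run one agent-browser function in a session with a JSON input.
# Usage: ab_call <function> <session> <json-input>
ab_call() {
  "$INFSH" app run agent-browser --function "$1" --session "$2" --input "$3"
}

# Step 1: open. '--session new' and the 'url' key follow the Quick Start.
# The URL is inserted as-is, so it must not contain quotes or backslashes.
ab_open() { ab_call open new "{\"url\": \"$1\"}"; }

# Steps 2-4: the function names are the ones this guide uses;
# the input shapes below are assumptions.
ab_click()    { ab_call interact "$1" "{\"action\": \"click\", \"element\": \"$2\"}"; }
ab_snapshot() { ab_call snapshot "$1" '{}'; }
ab_close()    { ab_call close "$1" '{}'; }
```

With INFSH=echo, running ab_open https://example.com prints the command line that would be sent, which is a cheap way to review an agent's planned calls before running them for real.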
Quick start example
To see agent-browser in action with a simple one-page open, follow the Quick Start pattern from the repo:
infsh login
# Open a page and get interactive elements
infsh app run agent-browser --function open --input '{"url": "https://example.com"}' --session new
What this does:
- --function open tells agent-browser to launch a new browser page at the given URL.
- --session new creates a new session so later actions can reuse the same browser state.
- The skill returns JSON that typically includes element descriptions and @e references your agent can use in follow-up interact calls.
Working with @e element references
A central feature of agent-browser is its use of @e refs. Rather than requiring CSS selectors or XPath, the skill returns handles like @e:button-1 (the exact format depends on the implementation) along with human-readable descriptions.
Your agent then:
- Reads the list of available elements and their descriptions.
- Chooses the appropriate @e ref (for example, the button labeled “Search”).
- Calls an interaction function (such as click or fill) using that @e ref.
This design is optimized for AI agents because they can reason over descriptions rather than low-level DOM details. It also helps keep interactions robust even if the underlying selectors change, as long as descriptions remain interpretable.
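When debugging, it can help to list every @e ref that appears in a response without depending on the exact JSON layout. The sketch below assumes refs look like the @e:button-1 example above (letters, digits, colons, underscores, hyphens); the list_refs helper and the sample JSON are hypothetical and do not reflect the skill's real output schema.

```shell
# Print each distinct @e reference found on stdin, one per line, sorted.
# The character class is an assumption based on the '@e:button-1' example;
# text such as an email address ending in '@example.com' would also match.
list_refs() {
  grep -oE '@e[A-Za-z0-9:_-]*' | sort -u
}

# Hypothetical sample response, for illustration only.
sample='{"elements": [{"ref": "@e:button-1", "description": "Search"}, {"ref": "@e:input-1", "description": "Query"}]}'
printf '%s\n' "$sample" | list_refs
```

For the sample above this prints @e:button-1 and @e:input-1 on separate lines. An agent would normally parse the JSON properly and use the descriptions to choose a ref; this filter is only a quick inspection aid.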
Example: open, click, and re-snapshot
A typical multi-step flow might look like this (pattern only; adjust to your specific functions):
# 1. Start a session and open a page
OPEN_RESULT=$(infsh app run agent-browser \
  --function open \
  --session new \
  --input '{"url": "https://example.com"}')

# 2. In your agent logic, read OPEN_RESULT to pick an @e ref (e.g. @e:search-button)
#    and to find the session ID for follow-up calls. The placeholder below stands in
#    for that ID; SKILL.md defines where it appears in the output.
INF_SH_SESSION="<session-id-from-OPEN_RESULT>"

# 3. Interact with that element
INTERACT_RESULT=$(infsh app run agent-browser \
  --function interact \
  --session "$INF_SH_SESSION" \
  --input '{"action": "click", "element": "@e:search-button"}')

# 4. Re-snapshot after the click to get updated elements
SNAPSHOT_RESULT=$(infsh app run agent-browser \
  --function snapshot \
  --session "$INF_SH_SESSION" \
  --input '{}')
The names of functions beyond open can vary, so always check the latest SKILL.md and any associated docs in tools/utilities/agent-browser for the exact function signatures and input schema.
Screenshots and video recording
agent-browser can capture visual artifacts of the browsing session:
- Screenshots – Useful for debugging agent behavior or storing visual confirmations.
- Video – When you close the session with recording enabled, the skill can return or link to a video file of the full automated flow.
These features are configured through the skill’s input options. For details on enabling recording and accessing the outputs, consult the SKILL.md definition and any additional docs under tools/utilities/agent-browser in the repo.
Integration tips for agents and workflows
To make the most of agent-browser in your automation or research workflows:
- Persist --session IDs: Make sure your agent stores the session identifier between calls so that multiple actions occur in the same browser.
- Parse JSON output carefully: Use robust JSON parsing in your agent’s runtime; element lists and metadata can be rich.
- Throttle interactions if needed: If a page is slow or dynamic, ensure your agent accounts for timing and waits for elements to appear between steps when the skill supports that configuration.
- Log key actions: Keep a log of open, interact, and close calls if you need to audit or debug your agent’s browsing behavior.
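Two of these tips, persisting the session ID and logging key actions, can be sketched in a few lines of shell. Everything named here (INFSH, AB_LOG, AB_SESSION_FILE, save_session, load_session, logged_call) is a hypothetical helper rather than part of the skill, and how you obtain the session ID from the open output is left to SKILL.md.

```shell
# Hypothetical settings: binary to run, log file, and session file.
INFSH="${INFSH:-infsh}"
AB_LOG="${AB_LOG:-agent-browser.log}"
AB_SESSION_FILE="${AB_SESSION_FILE:-.agent-browser-session}"

# Store and read back the session ID so separate steps reuse one browser.
save_session() { printf '%s\n' "$1" > "$AB_SESSION_FILE"; }
load_session() { cat "$AB_SESSION_FILE"; }

# Append a timestamped audit line, then run the call.
# Usage: logged_call <function> <session> <json-input>
logged_call() {
  printf '%s function=%s session=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$AB_LOG"
  "$INFSH" app run agent-browser --function "$1" --session "$2" --input "$3"
}
```

A later step can then run logged_call snapshot "$(load_session)" '{}' without the agent having to carry the ID in memory, and the log file gives a simple record of which functions ran in which session.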
FAQ
What is the relationship between agent-browser, inference.sh, and Playwright?
agent-browser is a skill that runs inside the inference.sh ecosystem. When you invoke it via infsh app run, it uses Playwright as the underlying browser automation engine. You do not call Playwright directly; instead, you work with the higher-level skill functions and @e element references.
How do I install agent-browser?
You do not install agent-browser as a standalone binary or npm package. Instead:
- Install the inference.sh CLI (infsh) using the official cli-install.md instructions.
- Log in with infsh login.
- Ensure your environment has access to the agent-browser skill from the inferen-sh/skills repository (under tools/utilities/agent-browser).
From there, you can immediately invoke the skill via infsh app run agent-browser.
Do I need programming experience to use agent-browser?
Basic command-line and JSON familiarity are strongly recommended. You do not need to write Playwright scripts, but you should be comfortable:
- Running infsh commands
- Passing JSON as --input
- Parsing JSON output in your agent or scripts
For more advanced workflows (conditional logic, loops, error handling), general scripting or programming knowledge is helpful.
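Passing JSON as --input from a shell script is where quoting mistakes tend to creep in, because user-supplied text may contain double quotes or backslashes. The following sketch shows one way to escape such values before embedding them. The json_escape and build_fill_input helpers are hypothetical, and the action, element, and text keys are assumed input fields; the real schema is defined in SKILL.md.

```shell
# Escape a value for use inside a JSON string literal.
# Handles backslashes and double quotes only; control characters
# such as newlines and tabs are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the input for a "fill" action (assumed key names).
# Usage: build_fill_input <element-ref> <text>
build_fill_input() {
  printf '{"action": "fill", "element": "%s", "text": "%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

The result can be passed directly, for example --input "$(build_fill_input '@e:input-1' 'blue "suede" shoes')". If a JSON-aware tool is available in your environment, generating the input with it is more robust than hand escaping.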
Can I use agent-browser outside of inference.sh?
The skill is built specifically for use with inference.sh and is described as “Browser automation for AI agents via inference.sh.” The supported and documented way to run it is through the infsh CLI. If you require a standalone library, you may prefer using Playwright directly in your language of choice.
Is agent-browser suitable for large-scale web scraping?
agent-browser can be used for targeted scraping, especially when pages require interaction or JavaScript rendering. For very high-volume scraping, however, you should consider:
- inference.sh account limits and pricing
- Respect for target site terms of service and robots.txt
- Performance, concurrency, and rate limiting
For smaller-scale or workflow-specific scraping embedded in an agent, agent-browser is a strong fit. For massive crawling across many sites, a dedicated scraping stack may be more appropriate.
How does session management work?
Session management is controlled via the --session flag when calling infsh app run. A typical pattern is:
- --session new when you call open for the first time
- Reusing that session ID for subsequent interact and snapshot calls
- Calling the appropriate close function to end the session and optionally retrieve video
Always consult the current SKILL.md for the exact options and outputs related to session management.
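A common failure mode is a script that exits midway and leaves the session open. One way to guard against that, sketched here under assumptions, is a wrapper that runs a step and then always issues the close call, preserving the step's exit status. The with_session helper and the INFSH variable are hypothetical, and the close function name and its empty input follow this guide's wording rather than a confirmed signature.

```shell
# Binary to invoke; INFSH=echo gives a dry run (hypothetical convenience).
INFSH="${INFSH:-infsh}"

# Run a command, then close the session whether or not the command succeeded.
# Returns the command's exit status.
# Usage: with_session <session-id> <command> [args...]
with_session() {
  ws_session="$1"; shift
  ws_status=0
  "$@" || ws_status=$?
  "$INFSH" app run agent-browser --function close --session "$ws_session" --input '{}'
  return "$ws_status"
}
```

For example, with_session "$SESSION_ID" ./scrape-step.sh would run a hypothetical scrape-step.sh script and close the session afterwards even if the script fails. This sketch does not handle signals such as Ctrl-C; a trap would be needed for that.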
Where can I find the full specification of functions and inputs?
The authoritative reference for agent-browser lives in the repository:
- SKILL.md at the root of the inferen-sh/skills repo
- The tools/utilities/agent-browser directory for implementation details, examples, and any additional documentation
Open these files to see the current list of functions, expected JSON inputs, and output formats, then model your agent or scripts around those definitions.
