
Playwright-MCP Deep Dive: The Perfect Combination of Large Language Models and Browser Automation

2025-10-29 18:11

When you build a component and need to test its functionality, you usually write a large number of automation scripts to simulate user behavior and ensure it works as expected. However, as the project grows more complex, these testing scripts also become larger and harder to maintain.

At this point, you might wonder:

“Can I just tell the tool in natural language — ‘please test the click effect of this button for me’?”

In the past, this idea might have seemed far-fetched. But today, with the rise of large language models (LLMs) and the continuous innovation of the Playwright ecosystem, this fantasy is becoming reality.
The key enabler is what we’ll explore today — playwright-mcp.

Part 1: Application Guide

The core role of playwright-mcp, a Model Context Protocol (MCP) server, is to act as a bridge.
It runs as a local service that your AI assistant connects to in order to control the browser.

Step 1: Installation and Configuration

You don’t need complex global installation.

The design philosophy of playwright-mcp is on-demand startup — the simplest way is to use npx.

• General configuration method:

Most AI tools that support MCP (such as Cursor) allow you to add an mcpServers field in their JSON configuration. Example:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": [
        "@playwright/mcp@latest"
      ]
    }
  }
}

• VS Code one-click configuration:

If you are a VS Code or Cursor user, simply run the following command in your terminal:

code --add-mcp '{"name":"playwright","command":"npx","args":["@playwright/mcp@latest"]}'

After execution, VS Code will automatically add this service info to the configuration.

Once done, you’ll see this MCP configuration entry in your Cursor MCP Tools.

[Screenshot: MCP Tools]

Step 2: Customize Your Automation Environment

playwright-mcp provides rich command-line parameters, allowing you to refine control of the automation environment through the args array.

--allowed-origins <origins>
Allowed request origins, separated by semicolons. By default, all origins are allowed.
--blocked-origins <origins>
Blocked request origins, separated by semicolons. The blocklist takes precedence over the allowlist.
If no allowlist is specified, any request not in the blocklist will still be allowed.
--block-service-workers
Blocks Service Workers.
--browser <browser>
Specifies the browser or Chrome channel to use. Possible values: chrome, firefox, webkit, msedge.
--caps <caps>
Enables additional capabilities, provided as a comma-separated list. Possible values: vision, pdf.
--cdp-endpoint <endpoint>
Specifies the CDP (Chrome DevTools Protocol) endpoint to connect to.
--cdp-header <headers...>
Specifies custom headers to include when connecting to a CDP endpoint. Multiple headers can be defined.
--config <path>
Specifies the path to a configuration file.
--device <device>
Simulates a device, for example: "iPhone 15".
--executable-path <path>
Specifies the path to the browser executable file.
--extension
Connects to an already running browser instance (Edge/Chrome only).
Requires the installation of the Playwright MCP Bridge browser extension.
--headless
Runs the browser in headless mode (default is with GUI).
--host <host>
Specifies the server hostname to bind to. The default is localhost.
Use 0.0.0.0 to bind to all available network interfaces.
--ignore-https-errors
Ignores HTTPS errors.
--isolated
Stores browser profile data only in memory, without writing it to disk.
--image-responses <mode>
Specifies whether to send image responses to the client.
Possible values: allow or omit. Default is allow.
--no-sandbox
Disables sandboxing for all processes that are normally isolated by the sandbox.
--output-dir <path>
Specifies the directory path where output files will be stored.
--port <port>
Specifies the port for SSE (Server-Sent Events) transport listening.
--proxy-bypass <bypass>
Comma-separated list of domain names that should bypass the proxy.
For example: .com,chromium.org,.domain.com.
--proxy-server <proxy>
Specifies the proxy server to use.
For example:
http://myproxy:3128 or socks5://myproxy:8080.
--save-session
Determines whether to save the Playwright MCP session to the output directory.
--save-trace
Determines whether to save the Playwright Trace (debug trace) to the output directory.
--secrets <path>
Specifies the path to a dotenv file that contains secrets.
--storage-state <path>
Specifies the path to a storage state file, used for isolated sessions.
--timeout-action <timeout>
Specifies the action timeout in milliseconds. Default: 5000ms.
--timeout-navigation <timeout>
Specifies the page navigation timeout in milliseconds. Default: 60000ms.
--user-agent <ua string>
Specifies a custom User-Agent string.
--user-data-dir <path>
Specifies the path for the user data directory.
If not specified, a temporary directory will be created automatically.
--viewport-size <size>
Specifies the browser viewport size in pixels, for example: "1280,720".
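
To make the args array concrete, here is a minimal sketch that assembles an mcpServers entry from a handful of the flags listed above. The helper buildPlaywrightEntry and the McpOptions type are hypothetical names introduced only for illustration; they are not part of playwright-mcp, and only the flag names come from the list.

```typescript
// Hypothetical helper (not part of playwright-mcp): builds an "mcpServers"
// entry whose args array carries a few of the flags described above.
interface McpOptions {
  browser?: "chrome" | "firefox" | "webkit" | "msedge";
  headless?: boolean;
  isolated?: boolean;
  viewportSize?: { width: number; height: number };
}

interface McpServerEntry {
  command: string;
  args: string[];
}

function buildPlaywrightEntry(options: McpOptions = {}): McpServerEntry {
  // The package name always comes first; flags are appended after it.
  const args: string[] = ["@playwright/mcp@latest"];
  if (options.browser) args.push("--browser", options.browser);
  if (options.headless) args.push("--headless");
  if (options.isolated) args.push("--isolated");
  if (options.viewportSize) {
    const { width, height } = options.viewportSize;
    args.push("--viewport-size", `${width},${height}`);
  }
  return { command: "npx", args };
}

// Example: headless Firefox with a 1280x720 viewport.
const config = {
  mcpServers: {
    playwright: buildPlaywrightEntry({
      browser: "firefox",
      headless: true,
      viewportSize: { width: 1280, height: 720 },
    }),
  },
};

console.log(JSON.stringify(config, null, 2));
```

The printed JSON can be pasted into the configuration file of an MCP-capable client; flags you omit simply keep their defaults.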

Step 3: Start Talking to the AI

Once configured, you can start interacting!
In your AI assistant chat window, just use natural language to give commands.

Example:

➡ You:

“Use playwright to open github.com, enter microsoft/playwright-mcp in the search bar, run the search, then click the first result.”

➡ AI Assistant (in the background):

Runs npx @playwright/mcp@latest to start the service and the browser.
Calls browser_navigate to open https://github.com.
Uses browser_snapshot to “see” the page and find the search box.
Uses browser_type to input text.
Uses another browser_snapshot to detect results.
Calls browser_click to click the first match.
Replies: “Operation complete. Navigated to the microsoft/playwright-mcp repository page.”
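
Under the hood, each of these steps is an MCP tools/call request encoded as JSON-RPC 2.0. The sketch below shows roughly what the sequence might look like on the wire. The tool names come from the steps above; the argument names (url, element, ref, text, submit) and the ref values e42 and e87 are illustrative assumptions, since the real refs come from whatever snapshot the server returned.

```typescript
// Shape of an MCP "tools/call" request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Wraps a tool name and its arguments in a JSON-RPC envelope.
function toolCall(name: string, args: Record<string, unknown> = {}): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Argument names and ref values are assumptions made for illustration.
const sequence: ToolCallRequest[] = [
  toolCall("browser_navigate", { url: "https://github.com" }),
  toolCall("browser_snapshot"),
  toolCall("browser_type", {
    element: "Search box",
    ref: "e42",
    text: "microsoft/playwright-mcp",
    submit: true,
  }),
  toolCall("browser_snapshot"),
  toolCall("browser_click", { element: "First search result", ref: "e87" }),
];

for (const request of sequence) {
  console.log(JSON.stringify(request));
}
```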

At this point, Cursor successfully opened the browser and navigated there.

[Screenshot: microsoft/playwright-mcp]

At the same time, we can see the output from Cursor and the entire MCP call.

[Screenshot: Cursor Output]

Now, you’ve mastered the basic workflow of collaborating with AI. Next, let’s dive beneath the surface and see how all of this actually works.

Part 2: Principle Dive

The success of playwright-mcp lies in abandoning fragile screenshot-based visual analysis and instead adopting a structured understanding approach — more stable and efficient.

Core Technology 1: The AOM as the “Raw Material”

This is playwright-mcp’s secret weapon — the most revolutionary part.
When AI needs to “see” a page, it doesn’t use screenshots but calls Playwright’s
page.accessibility.snapshot().

Example code in
playwright/packages/playwright-core/src/server/dispatchers/pageDispatcher.ts:

async accessibilitySnapshot(params: channels.PageAccessibilitySnapshotParams, progress: Progress): Promise<channels.PageAccessibilitySnapshotResult> {
  const rootAXNode = await progress.race(this._page.accessibility.snapshot({
    interestingOnly: params.interestingOnly,
    root: params.root ? (params.root as ElementHandleDispatcher)._elementHandle : undefined
  }));
  return { rootAXNode: rootAXNode || undefined };
}

What is the AOM Tree?

The AOM (Accessibility Object Model) tree is a semantic tree that the browser generates for assistive technologies (like screen readers).
It includes only meaningful elements, each described by its role (e.g., button, link, heading), name (e.g., the button text), and state (e.g., checked, disabled).

Structured Data vs. Pixel Data

Imagine giving the AI a screenshot of a login page — what it sees is just a cluster of pixels, requiring complex image recognition to guess where the input box is.

In contrast, page.accessibility.snapshot() gives the AI a piece of JSON like this:

[
  {
    "role": "textbox",
    "name": "Username or email address"
  },
  {
    "role": "button",
    "name": "Sign in"
  }
]
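
To illustrate why structured data is easier to act on than pixels, here is a minimal sketch of locating an element by role and name in such a tree. The AXNode type and the findByRole helper are hypothetical names used only for this example, and the WebArea root is an assumed wrapper around the two nodes shown above.

```typescript
// Simplified accessibility node: only the fields this example needs.
interface AXNode {
  role: string;
  name?: string;
  children?: AXNode[];
}

// Depth-first search for the first node matching both role and name.
function findByRole(node: AXNode, role: string, name: string): AXNode | undefined {
  if (node.role === role && node.name === name) return node;
  for (const child of node.children ?? []) {
    const match = findByRole(child, role, name);
    if (match) return match;
  }
  return undefined;
}

// Assumed tree for a login page, wrapping the two nodes from the JSON above.
const loginPage: AXNode = {
  role: "WebArea",
  name: "Sign in",
  children: [
    { role: "textbox", name: "Username or email address" },
    { role: "button", name: "Sign in" },
  ],
};

const signIn = findByRole(loginPage, "button", "Sign in");
console.log(signIn);
```

No image recognition is involved: finding the "Sign in" button is a plain tree lookup.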

Core Technology 2: Custom Serialization Engine – snapshotter.ts

However, passing the raw AOM data directly to the LLM would make the context enormous.

Therefore, playwright-mcp does not feed the massive raw JSON object returned by this API to the LLM.

Instead, it first serializes the data, which is the most ingenious step in the design of playwright-mcp.

After obtaining the raw accessibility tree, it runs a custom serializer that converts the data into a YAML-style text format tailored to LLMs.

For example, in the operation we just performed, when Cursor invoked the browser_snapshot tool, playwright-mcp returned the following content to Cursor:

### Page state
- Page URL: https://github.com/search?q=microsoft%2Fplaywright-mcp&type=repositories
- Page Title: Repository search results · GitHub
- Page Snapshot:
```yaml
- generic [ref=e1]:
- generic [ref=e2]:
- generic [ref=e3]:
- link "Skip to content" [ref=e4] [cursor=pointer]:
- /url: "#start-of-content"
- banner [ref=e6]:
- heading "Navigation Menu" [level=2] [ref=e7]
- generic [ref=e8]:
- link "Homepage" [ref=e10] [cursor=pointer]:
- /url: /
- img [ref=e11]
- generic [ref=e13]:
- navigation "Global" [ref=e14]:
- list [ref=e15]:
- listitem [ref=e16]:
- button "Platform" [ref=e17] [cursor=pointer]:
- text: Platform
- img [ref=e18]
- listitem [ref=e20]:
- button "Solutions" [ref=e21] [cursor=pointer]:
- text: Solutions
- img [ref=e22]
- listitem [ref=e24]:
- button "Resources" [ref=e25] [cursor=pointer]:
- text: Resources
- img [ref=e26]
- listitem [ref=e28]:
- button "Open Source" [ref=e29] [cursor=pointer]:
- text: Open Source
- img [ref=e30]
- listitem [ref=e32]:
- button "Enterprise" [ref=e33] [cursor=pointer]:
- text: Enterprise
- img [ref=e34]
- listitem [ref=e36]:
- link "Pricing" [ref=e37] [cursor=pointer]:
- /url: https://github.com/pricing
…
```

Let’s analyze a few key features of this format and understand why it is so efficient for LLMs:

Hierarchy and Structure:
Through simple indentation, it reproduces the page's element hierarchy, allowing the LLM to easily understand the parent-child and nested relationships between elements.
Semantic Description:
Each line clearly indicates the element’s role (such as link, button, heading) and name (such as “Skip to content”). These are crucial for the LLM to understand the function of each element.
Unique and Stable Reference [ref=eX]:
This is the core of the entire design. During serialization, playwright-mcp assigns each meaningful element on the page a unique temporary ID of the form ref=e23. This ID remains stable throughout the lifecycle of a single snapshot, which addresses one of the most challenging problems in automation: element positioning.
Across the entire snapshot tool invocation chain, we can see Cursor referencing these ref values multiple times.

[Screenshot: Cursor Toolchain Invocation]

Information Density and Simplicity:
This format retains only the information most valuable for the LLM’s decision-making, such as role, name, URL, and reference ID, while discarding a large number of irrelevant DOM attributes. This greatly reduces prompt length and improves efficiency.
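
The features above can be illustrated with a toy serializer. This is a minimal sketch of the idea, not playwright-mcp's actual implementation: the AXNode type, the serialize function, and the numbering scheme are assumptions made for the example. It walks a small tree, assigns each node a ref, emits one indented line per node, and keeps a map from ref back to node so that a later action can resolve the ref.

```typescript
// Simplified accessibility node used only for this sketch.
interface AXNode {
  role: string;
  name?: string;
  children?: AXNode[];
}

interface Snapshot {
  text: string;              // YAML-style text handed to the LLM
  refs: Map<string, AXNode>; // lookup table: ref -> original node
}

function serialize(root: AXNode): Snapshot {
  const lines: string[] = [];
  const refs = new Map<string, AXNode>();
  let counter = 0;

  const visit = (node: AXNode, depth: number): void => {
    const ref = `e${++counter}`;
    refs.set(ref, node);
    const label = node.name ? ` ${JSON.stringify(node.name)}` : "";
    const hasChildren = (node.children?.length ?? 0) > 0;
    // Two spaces per nesting level; a trailing colon marks a parent node.
    lines.push(`${"  ".repeat(depth)}- ${node.role}${label} [ref=${ref}]${hasChildren ? ":" : ""}`);
    for (const child of node.children ?? []) visit(child, depth + 1);
  };

  visit(root, 0);
  return { text: lines.join("\n"), refs };
}

const page: AXNode = {
  role: "banner",
  children: [
    { role: "heading", name: "Navigation Menu" },
    { role: "link", name: "Pricing" },
  ],
};

const snapshot = serialize(page);
console.log(snapshot.text);
// - banner [ref=e1]:
//   - heading "Navigation Menu" [ref=e2]
//   - link "Pricing" [ref=e3]
console.log(snapshot.refs.get("e3")?.name); // Pricing
```

Because the map is rebuilt on every call, a ref is only meaningful for the snapshot that produced it, which matches the single-snapshot lifetime described above.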

Full Technical Workflow

Startup & Handshake:
npx launches the playwright-mcp server, which by default talks to the AI client over standard input/output (or over SSE when --port is set). The AI client then fetches the tool definitions (like browser_click).
Observe (Snapshot):
The LLM first calls browser_snapshot to “understand” the page.
The server captures the accessibility snapshot and sends the serialized result back.
Reasoning:
The LLM analyzes it, e.g., “find the textbox named ‘Username’, then the button named ‘Sign in’.”
Action:
The LLM sends JSON-RPC requests to execute actions (browser_type, browser_click).
Execution:
The server translates the requests into Playwright API calls, like
getByRole('button', { name: 'Sign in' }), achieving stable matching.
Feedback:
Execution results (success/failure) are returned via JSON-RPC.
The LLM decides the next steps based on that feedback.
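
The observe, reason, act, and feedback steps above can be sketched as a small loop. Everything here is a stand-in: CallTool plays the role of an MCP client, Decide plays the role of the LLM, and fakeServer and scriptedModel are hypothetical stubs so the example runs without a browser or a model.

```typescript
// One tool invocation, as the model would request it.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Hypothetical stand-in for an MCP client: sends a call, returns the text result.
type CallTool = (call: ToolCall) => Promise<string>;

// Hypothetical stand-in for the LLM: picks the next call, or null when done.
type Decide = (observation: string) => ToolCall | null;

async function runLoop(callTool: CallTool, decide: Decide, maxSteps = 10): Promise<string[]> {
  const log: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    // Observe: ask the server for the current page snapshot.
    const observation = await callTool({ name: "browser_snapshot", args: {} });
    // Reason: let the model choose the next action from the snapshot.
    const action = decide(observation);
    if (action === null) break;
    // Act and execute: the server performs the action in the browser.
    const result = await callTool(action);
    // Feedback: record the result; the next iteration observes again.
    log.push(`${action.name}: ${result}`);
  }
  return log;
}

// Fake server: always reports a page with a single "Sign in" button.
const fakeServer: CallTool = async (call) =>
  call.name === "browser_snapshot" ? '- button "Sign in" [ref=e2]' : "ok";

// Scripted model: clicks the "Sign in" button once, then stops.
let clicked = false;
const scriptedModel: Decide = (observation) => {
  if (clicked) return null;
  const match = observation.match(/button "Sign in" \[ref=(e\d+)\]/);
  if (!match) return null;
  clicked = true;
  return { name: "browser_click", args: { element: "Sign in button", ref: match[1] } };
};

runLoop(fakeServer, scriptedModel).then((log) => console.log(log));
```

The maxSteps bound is a design choice for the sketch: it keeps a confused model from looping forever on a page it cannot make progress on.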

Conclusion: A New Paradigm for Automation

playwright-mcp is not just a tool; it represents a new automation paradigm.
By combining the LLM’s natural-language reasoning with Playwright’s precise browser control, it makes intelligent automation practical.

It avoids the unreliability and latency of traditional approaches that pair computer vision with AI, offering a structured, efficient form of “AI vision” via the accessibility tree.

This could change how automated tests, RPA, smart data scraping, and accessibility testing are written, opening up new possibilities.

FAQ

Q: What is Playwright-MCP and how does it change the way browser automation works?

A: Playwright-MCP is a bridge between large language models (LLMs) and browsers. Instead of relying on fragile screenshot recognition, it lets AI understand a page through its AOM Tree — the same structure used by screen readers. That means your AI can “see” elements semantically (buttons, inputs, links) and operate with stable references, not pixels. The result: smarter, faster, and far more reliable automated testing.

Q: How does ZStack Cloud support GPU virtualization for AI workloads, and can it be tested through Playwright-MCP?

A: ZStack Cloud provides full GPU resource management — from SR-IOV passthrough to mediated devices (mdev). These features let enterprises run AI training, inference, and visualization workloads efficiently. Using Playwright-MCP, testers can simulate user actions on the ZStack console (like attaching a GPU to a VM) and confirm that backend resources respond correctly. It brings human-like validation into GPU-heavy environments.

Q: I’m planning to migrate from VMware. How can ZStack and Playwright-MCP work together during the transition?

A: During VMware replacement projects, teams often need to validate functional parity between the old and new platforms. With Playwright-MCP, you can describe validation tasks in natural language that repeatedly check VM creation, storage mounting, or HA behavior across both environments. This shortens testing cycles and helps ensure migration stability.

Q: Who is ZStack and why is it relevant to AI-driven automation?

A: ZStack is a leading Chinese enterprise cloud provider offering full-stack products — ZStack Cloud for private cloud and virtualization, ZStack Cube for hyper-converged deployment, and ZStack Edge/ZStack Zaku for container and edge computing.

For AI and automation teams, ZStack provides the infrastructure layer (VMs, GPUs, storage, APIs) that pairs perfectly with tools like Playwright-MCP, allowing developers to build intelligent, repeatable workflows that test, deploy, and scale cloud resources with minimal manual effort.
