Autonomous App Walker — find bugs the prompt never thought to test

Point the walker at any app URL on a tester VM and it explores the UI for you: breadth-first across reachable routes, clicking real DOM elements via the Chromium DevTools Protocol, recording console errors and failed network requests, and emitting runnable scenarios for every blocker or major finding it traps. Use it on a freshly vibecoded app, a staging build, or a production URL to surface the failure modes a hand-written test plan would miss.

What this is for. Catching the bugs nobody scripted a test for — dead buttons, console errors on a route the prompt never named, broken redirects, a TypeError two clicks deep. For a curated login or checkout flow you already know about, write a scenario instead.

Status today: M1 through M5 are live end to end. DevTools primitives talk to live CDP on port 9222 on tester VMs, the BFS explorer drives the browser via the VM Interaction API, walks and findings persist under /TEST_STORAGE_DIR/app-walks/, the REST surface is authenticated, and the Walks tab in the Apps UI renders real walk records. There is no stub mode: every walk you start runs against a real browser.

How it works

  1. Pick a tester VM and a target URL: ensure_tester_vm gives you a persistent desktop VM with Chromium running under CDP. Any reachable URL works (deployed app, preview, local dev server).
  2. Start a walk: walk_app returns a walk_id and a dashboard_url immediately. The explorer runs in the background.
  3. The explorer crawls the app: at each state it snapshots the page, enumerates real clickable elements via CDP, collects browser diagnostics, filters destructive actions, and pushes unvisited children into the BFS frontier (bounded by max_states and max_depth).
  4. Findings get deduped and ranked: console errors, failed requests, dead controls, and missing assertions are persisted with a sha256 dedup key over (category, route_template, selector, message[:80]) and tagged blocker | major | minor | info.
  5. Scenarios get emitted: for each blocker/major finding the walker traces back through parent_state_id and synthesizes a ScenarioStep sequence (open_url → clicks → assert_no_error → assert_text) and POSTs it to /api/tests/app-scenarios, ready to replay.
  6. Review the report: get_app_walk returns the full record; the server also renders a standalone report.html with a Mermaid state graph, severity-grouped findings, and the emitted scenarios.
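
The bounded crawl in step 3 can be pictured with a small sketch. This is not the walker's implementation: the toy route graph TOY_APP and the function walk are hypothetical stand-ins for a real browser, and only the names max_states, max_depth, and parent_state_id come from the text above.

```python
from collections import deque

# Hypothetical toy app: each route maps to the routes reachable by one click.
TOY_APP = {
    "/": ["/dashboard", "/settings"],
    "/dashboard": ["/reports", "/"],
    "/settings": ["/settings/billing"],
    "/reports": ["/reports/2024"],
    "/settings/billing": [],
    "/reports/2024": [],
}


def walk(start, max_states, max_depth):
    """Breadth-first walk bounded by a state budget and a click depth."""
    # Each record keeps parent_state_id so a path can be traced back later.
    states = {start: {"depth": 0, "parent_state_id": None}}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        depth = states[current]["depth"]
        if depth >= max_depth:
            continue  # do not expand children beyond the depth bound
        for child in TOY_APP.get(current, []):
            if child in states:
                continue  # already visited
            if len(states) >= max_states:
                return states  # state budget exhausted
            states[child] = {"depth": depth + 1, "parent_state_id": current}
            frontier.append(child)
    return states


visited = walk("/", max_states=4, max_depth=2)
print(list(visited))  # ['/', '/dashboard', '/settings', '/reports']
```

With max_states=4 the sketch stops before /settings/billing is recorded, which is the kind of cut-off to expect when a walk hits its state budget before the frontier is empty.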

Walker modes

  • mechanical — pure BFS over enumerated DOM. Deterministic, exhaustive within bounds, no LLM cost.
  • ai — at each state-seed the LLM ranks candidate actions against ai_goal (“Find broken buttons and flows”); the explorer follows the ranked indices first.
  • hybrid — LLM ranks the top candidates per state, mechanical BFS sweeps the remainder. Default for app-shaped targets.
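
A rough sketch of the hybrid ordering, assuming the LLM's answer arrives as a list of indices into the candidate list. The function hybrid_order and the sample labels are hypothetical; the real explorer's ranking format is not shown in this page.

```python
def hybrid_order(candidates, ranked_indices):
    """Put LLM-ranked candidates first, then the rest in enumerated order."""
    seen = set()
    ranked = []
    for idx in ranked_indices:
        # Ignore out-of-range or repeated indices from the (hypothetical) ranking.
        if 0 <= idx < len(candidates) and idx not in seen:
            seen.add(idx)
            ranked.append(candidates[idx])
    # The mechanical sweep covers whatever the ranking did not mention.
    remainder = [c for i, c in enumerate(candidates) if i not in seen]
    return ranked + remainder


order = hybrid_order(["Home", "Save", "Export", "Help"], [2, 0, 2, 9])
print(order)  # ['Export', 'Home', 'Save', 'Help']
```

Passing an empty ranking returns the candidates in enumerated order, which corresponds to the mechanical mode in this sketch.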

Destructive actions

By default the walker refuses to click anything that looks destructive (Delete, Remove, Clear, Unsubscribe, Pay, etc). Opt back in with destructive_allowed=True and a destructive_label_allowlist of literal labels or re: regex patterns. Anything not on the allowlist stays blocked even when destructive actions are otherwise permitted.
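
The filter can be sketched as follows. The keyword list reuses the examples from the paragraph above, and whole-word matching is an assumption; the walker's actual detection rules are not documented here. The helper names looks_destructive, on_allowlist, and may_click are hypothetical.

```python
import re

# Keywords from the examples in the text; the walker's real list is not shown.
DESTRUCTIVE_WORDS = ("delete", "remove", "clear", "unsubscribe", "pay")


def looks_destructive(label):
    """Assumed rule: a label is destructive if any whole word matches."""
    words = re.findall(r"[a-z]+", label.lower())
    return any(word in DESTRUCTIVE_WORDS for word in words)


def on_allowlist(label, allowlist):
    """Entries are literal labels, or regex patterns prefixed with 're:'."""
    for entry in allowlist:
        if entry.startswith("re:"):
            if re.search(entry[3:], label):
                return True
        elif entry == label:
            return True
    return False


def may_click(label, destructive_allowed=False, allowlist=()):
    if not looks_destructive(label):
        return True
    # Destructive labels need both the opt-in flag and an allowlist match.
    return destructive_allowed and on_allowlist(label, allowlist)


allow = ["Remove from cart", "re:^Clear "]
print(may_click("Clear filters", True, allow))   # True
print(may_click("Delete account", True, allow))  # False
```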

MCP tools

walk_app

Start an autonomous walk over a target URL on a tester VM with configurable BFS bounds, mode, and destructive-action filters.

walk_app(
    app_name='walker-canary',
    app_url='https://example.com/app',
    vm_name='of-tester-xyz',
    max_states=20,
    max_depth=4,
    mode='hybrid',
    ai_goal='Find broken buttons and flows',
    destructive_allowed=True,
    destructive_label_allowlist=['Remove from cart', 're:^Clear ']
)
→ {
  "walk_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "vm_name": "of-tester-xyz",
  "app_url": "https://example.com/app",
  "config": {"mode": "hybrid", "max_states": 20, "max_depth": 4, "destructive_allowed": true},
  "dashboard_url": "http://libvirt-backend/api/tests/app-walks/550e8400-...",
  "report_url": "http://libvirt-backend/api/tests/app-walks/550e8400-.../report.html",
  "started_at": "2026-06-26T10:15:23.456Z"
}

get_app_walk

Fetch the full walk record: status, state graph, transitions, findings with evidence, and emitted scenario IDs.

get_app_walk(
    walk_id='550e8400-e29b-41d4-a716-446655440000',
    include_states=True,
    include_transitions=True,
    include_findings=True
)
→ {
  "walk_id": "550e8400-...",
  "status": "completed",
  "app_name": "walker-canary",
  "totals": {"states": 8, "transitions": 12, "scenarios_emitted": 3, "findings_by_severity": {"blocker": 1, "major": 2, "minor": 5, "info": 3}},
  "states": [{"state_id": "a1b2c3d4...", "url": "...", "title": "Dashboard"}],
  "findings": [{"finding_id": "f123", "category": "console_error", "severity": "blocker", "message": "TypeError: Cannot read property 'map'", "state_id": "a1b2c3d4..."}]
}
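
Once a record like the one above is in hand, a caller will often want only the findings that lead to scenarios. A minimal sketch, assuming the record is already parsed into a Python dict; the function scenario_worthy and the findings f124 and f125 are hypothetical additions for illustration.

```python
SEVERITY_ORDER = ["blocker", "major", "minor", "info"]


def scenario_worthy(record):
    """Return blocker and major findings, most severe first."""
    findings = record.get("findings", [])
    picked = [f for f in findings if f["severity"] in ("blocker", "major")]
    return sorted(picked, key=lambda f: SEVERITY_ORDER.index(f["severity"]))


# Sample record shaped like the response above; f124 and f125 are made up.
record = {
    "walk_id": "550e8400-...",
    "status": "completed",
    "findings": [
        {"finding_id": "f125", "category": "console_error", "severity": "minor",
         "message": "Deprecated API call"},
        {"finding_id": "f124", "category": "console_error", "severity": "major",
         "message": "Unhandled promise rejection"},
        {"finding_id": "f123", "category": "console_error", "severity": "blocker",
         "message": "TypeError: Cannot read property 'map'"},
    ],
}

for finding in scenario_worthy(record):
    print(finding["severity"], finding["finding_id"], finding["message"])
```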

list_app_walks

List recent walks filtered by app name, project, or status (running, completed, stopped, error).

list_app_walks(app_name='walker-canary', status='completed', limit=20)
→ {
  "walks": [{
    "walk_id": "550e8400-...",
    "app_name": "walker-canary",
    "status": "completed",
    "started_at": "2026-06-26T10:15:23Z",
    "totals": {"states": 8, "transitions": 12, "scenarios_emitted": 3, "findings_by_severity": {"blocker": 1, "major": 2, "minor": 5, "info": 3}}
  }],
  "count": 1
}

stop_app_walk

Request cooperative cancellation of a running walk; the explorer terminates within one poll cycle.

stop_app_walk(walk_id='550e8400-e29b-41d4-a716-446655440000')
→ {
  "walk_id": "550e8400-...",
  "status": "stopped",
  "cancel_requested": true,
  "finished_at": "2026-06-26T10:22:15Z"
}
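
Cooperative cancellation means the explorer checks a flag at a safe point instead of being killed mid-click. The sketch below illustrates the idea with a thread and a threading.Event; the function explore and its parameters are hypothetical, and the real walker's poll interval and flag storage are not described in this page.

```python
import threading
import time


def explore(cancel_requested, poll_interval=0.01, max_cycles=1000):
    """Toy explorer loop that checks the cancel flag once per poll cycle."""
    cycles = 0
    while cycles < max_cycles:
        if cancel_requested.is_set():
            return {"status": "stopped", "cycles": cycles}
        cycles += 1  # stand-in for visiting one state
        time.sleep(poll_interval)
    return {"status": "completed", "cycles": cycles}


cancel = threading.Event()
result = {}
worker = threading.Thread(target=lambda: result.update(explore(cancel)))
worker.start()
time.sleep(0.05)
cancel.set()  # in this sketch, what a stop request would flip
worker.join()
print(result["status"])
```

Because the flag is only read at the top of each cycle, the state being visited when the request arrives finishes cleanly before the loop exits.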

enumerate_dom_links

List every clickable or typable DOM element on the active browser tab via CDP, with selector, accessible name, href, bbox, and state flags. Useful for debugging what the walker actually sees on a given page.

enumerate_dom_links(vm_name='of-tester-xyz', tab_filter='https://example.com', include_perf=True)
→ [
  {"type": "text", "elements": [
    {"idx": 0, "selector": "button.submit", "tag": "button", "role": "button", "accessible_name": "Submit", "bbox": [100, 200, 200, 230], "in_viewport": true, "visible": true, "disabled": false}
  ]},
  {"type": "text", "text": "1 interactive element", "perf": {"screenshot_ms": 120, "cdp_enumerate_ms": 180}}
]

get_browser_diagnostics

Capture browser console errors, failed network requests, page violations, and performance metrics via CDP. Snapshot mode by default; pass collect_duration_ms for live collection.

get_browser_diagnostics(vm_name='of-tester-xyz', collect_duration_ms=1500, tab_filter='https://example.com')
→ [
  {"type": "text", "text": "Browser diagnostics for of-tester-xyz"},
  {"type": "text", "text": "Snapshot: tab=https://example.com, collected 0ms"},
  {"type": "text", "text": "Console errors: 2\n - TypeError: Cannot read property 'map' of undefined (at app.js:145:12)\n - ReferenceError: globalThis is not defined"},
  {"type": "text", "text": "Failed requests: 1\n - GET /api/user/profile (404 Not Found)"}
]
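
The diagnostics come back as text blocks, so a caller that wants counts has to parse them. A minimal sketch, assuming every section follows the "Name: N" header plus " - item" lines seen in the example above; that layout is inferred from the sample only, and parse_diagnostics is a hypothetical helper.

```python
def parse_diagnostics(blocks):
    """Turn 'Name: N' text blocks into {name: {"count": N, "items": [...]}}."""
    summary = {}
    for block in blocks:
        text = block.get("text", "")
        header, _, body = text.partition("\n")
        name, sep, count = header.partition(": ")
        if not sep or not count.strip().isdigit():
            continue  # title and snapshot lines carry no numeric count
        items = [
            line.strip()[2:]
            for line in body.splitlines()
            if line.strip().startswith("- ")
        ]
        summary[name] = {"count": int(count), "items": items}
    return summary


blocks = [
    {"type": "text", "text": "Browser diagnostics for of-tester-xyz"},
    {"type": "text", "text": "Snapshot: tab=https://example.com, collected 0ms"},
    {"type": "text", "text": "Console errors: 2\n - TypeError: Cannot read property 'map' of undefined (at app.js:145:12)\n - ReferenceError: globalThis is not defined"},
    {"type": "text", "text": "Failed requests: 1\n - GET /api/user/profile (404 Not Found)"},
]

summary = parse_diagnostics(blocks)
print(summary["Console errors"]["count"], summary["Failed requests"]["items"])
```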

get_pending_checkpoint

Long-poll (0 to 25 s) for the oldest unresolved harness checkpoint visible to the caller. Used by external harnesses to gate mining-time scenario verification or replay-time step verification.

get_pending_checkpoint(walk_id='550e8400-e29b-41d4-a716-446655440000', wait_seconds=25)
→ {
  "checkpoint": {
    "checkpoint_id": "cp-abc123def456",
    "walk_id": "550e8400-...",
    "kind": "mine",
    "status": "pending",
    "prompt": "Verify proposed scenario captures the bug",
    "context": {
      "scenario_steps": [
        {"action_type": "open_url", "value": "https://example.com"},
        {"action_type": "click", "target_label": "Submit"}
      ],
      "finding": {"finding_id": "f123", "category": "console_error", "severity": "blocker", "message": "TypeError: Cannot read property 'map'"}
    },
    "screenshot_url": "http://libvirt-backend/api/tests/app-walks/550e8400-.../states/a1b2c3d4.../screenshot"
  },
  "pending_count": 1
}

submit_scenario_checkpoint

Submit a harness verdict (pass, fail, investigate, skip, timeout) on a paused mining or replay checkpoint to resume the walker or runner.

submit_scenario_checkpoint(checkpoint_id='cp-abc123def456', verdict='pass', notes='Scenario correctly reproduced the TypeError')
→ {
  "checkpoint_id": "cp-abc123def456",
  "status": "resolved",
  "verdict": "pass",
  "verdict_notes": "Scenario correctly reproduced the TypeError",
  "resolved_at": "2026-06-26T10:25:10Z"
}
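
An external harness typically pairs the two tools above in a loop: long-poll for a checkpoint, judge it, submit a verdict, repeat. The sketch below uses in-memory fakes in place of the MCP tools so it runs on its own. The names run_harness, judge, fake_get_pending, and fake_submit are hypothetical, and treating a null checkpoint as "nothing pending" is an assumption about the empty-queue response.

```python
def run_harness(get_pending, submit, judge, walk_id, max_polls=10):
    """Resolve pending checkpoints one at a time until none remain."""
    resolved = []
    for _ in range(max_polls):
        reply = get_pending(walk_id=walk_id, wait_seconds=25)
        checkpoint = reply.get("checkpoint")
        if checkpoint is None:  # assumed shape when nothing is pending
            break
        verdict, notes = judge(checkpoint)
        resolved.append(
            submit(checkpoint_id=checkpoint["checkpoint_id"],
                   verdict=verdict, notes=notes)
        )
    return resolved


# In-memory fakes standing in for get_pending_checkpoint and
# submit_scenario_checkpoint; no server is contacted.
pending = [{
    "checkpoint_id": "cp-abc123def456",
    "kind": "mine",
    "context": {"scenario_steps": [
        {"action_type": "open_url", "value": "https://example.com"},
        {"action_type": "click", "target_label": "Submit"},
    ]},
}]


def fake_get_pending(walk_id, wait_seconds):
    return {"checkpoint": pending[0] if pending else None,
            "pending_count": len(pending)}


def fake_submit(checkpoint_id, verdict, notes):
    pending.pop(0)
    return {"checkpoint_id": checkpoint_id, "status": "resolved",
            "verdict": verdict, "verdict_notes": notes}


def judge(checkpoint):
    """Toy policy: a replayable scenario must start with open_url."""
    steps = checkpoint["context"]["scenario_steps"]
    if steps and steps[0]["action_type"] == "open_url":
        return "pass", "Scenario starts from a URL"
    return "investigate", "Scenario has no open_url step"


results = run_harness(fake_get_pending, fake_submit, judge, "550e8400-...")
print(results)
```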

get_checkpoint

Fetch a single checkpoint record (pending or resolved) by ID to read its verdict and audit trail.

get_checkpoint(checkpoint_id='cp-abc123def456')
→ {
  "checkpoint_id": "cp-abc123def456",
  "walk_id": "550e8400-...",
  "kind": "mine",
  "status": "resolved",
  "verdict": "pass",
  "created_at": "2026-06-26T10:20:00Z",
  "resolved_at": "2026-06-26T10:25:10Z"
}

REST surface

The MCP tools are thin wrappers over an authenticated REST API. All routes require Authorization: Bearer <token> plus X-User-Id (or X-User-Email for loopback). Callers see only their own walks unless they are admin.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/tests/app-walks | Create a new walk record |
| GET | /api/tests/app-walks | List walks (app_name, project_id, status, limit) |
| GET | /api/tests/app-walks/{walk_id} | Fetch full record (include_states, include_transitions, include_findings) |
| PATCH | /api/tests/app-walks/{walk_id} | Update status, totals, error, ai_state, emitted scenario IDs |
| POST | /api/tests/app-walks/{walk_id}/stop | Request cooperative cancellation |
| POST | /api/tests/app-walks/{walk_id}/login | Run an ephemeral login scenario as auth preamble |
| POST | /api/tests/app-walks/{walk_id}/states | Append an ExplorationState (BFS node) |
| POST | /api/tests/app-walks/{walk_id}/transitions | Append an ExplorationTransition (BFS edge) |
| POST | /api/tests/app-walks/{walk_id}/findings | Append a Finding (deduped by sha256) |
| POST | /api/tests/app-walks/{walk_id}/screenshots/{state_id} | Upload PNG for a state |
| GET | /api/tests/app-walks/{walk_id}/states/{state_id}/screenshot | Retrieve PNG for a state |
| GET | /api/tests/app-walks/{walk_id}/report.html | Server-rendered standalone HTML report |
| POST | /api/tests/checkpoints | Walker/runner creates a mining (kind=mine) or replay (kind=verify) checkpoint |
| GET | /api/tests/checkpoints | Long-poll oldest pending checkpoint scoped to walk_id or run_id |
| GET | /api/tests/checkpoints/{checkpoint_id} | Fetch a single checkpoint record |
| POST | /api/tests/checkpoints/{checkpoint_id}/verdict | Harness submits verdict, sets status=resolved, notifies waiter |
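
For callers that skip MCP and hit the REST routes directly, the sketch below builds an authenticated request with the Python standard library. It only constructs the request and does not send it. The helper build_request and the user id user-123 are hypothetical, the base URL is borrowed from the dashboard_url in the walk_app example, and the JSON content type for request bodies is an assumption.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_request(base_url, path, token, user_id,
                  method="GET", query=None, body=None):
    """Build (but do not send) a request carrying the two required headers."""
    url = base_url.rstrip("/") + path
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}", "X-User-Id": user_id}
    if data is not None:
        headers["Content-Type"] = "application/json"  # assumed body format
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# List completed walks for one app; token and user id are placeholders.
req = build_request(
    "http://libvirt-backend",
    "/api/tests/app-walks",
    token="<token>",
    user_id="user-123",
    query={"app_name": "walker-canary", "status": "completed", "limit": 20},
)
print(req.get_method(), req.full_url)
```

Sending it is then a matter of passing req to urllib.request.urlopen and decoding the JSON response.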

Putting it together

  1. Operator calls walk_app(app_name='myapp', app_url='https://myapp.dev', vm_name='of-tester-1', mode='hybrid', ai_goal='Find broken buttons'). MCP returns walk_id, status='running', and dashboard_url immediately.
  2. The background explorer opens the URL in the VM’s browser, snapshots the page, calls enumerate_dom_links via CDP to read the true clickable elements, filters out destructive actions, and pushes the root state into the BFS frontier.
  3. For each queued state the explorer calls get_browser_diagnostics to collect any errors or failed requests, takes a screenshot, persists the state via POST /api/tests/app-walks/{walk_id}/states, and POSTs every click action as a transition.
  4. In hybrid or AI mode, on state-seeds the explorer ranks candidates by calling the LLM with the candidate brief plus the current screenshot, then prioritizes the LLM’s ranked indices and continues mechanical BFS for the remainder.
  5. When a click or page load triggers a console error, network failure, dead control, or missing assertion, the explorer computes a dedup key (normalized message plus category plus route template), POSTs a Finding, and if severity is blocker or major marks that branch for scenario emission.
  6. At finalize, the walker traces back through parent_state_id for each blocker or major finding, synthesizes a ScenarioStep list (open_url → clicks → assert_no_error with extracted error phrases → assert_text), and POSTs to /api/tests/app-scenarios. With mining_verification='polling', each proposed scenario creates a checkpoint, pauses until the harness submits a verdict, then persists or drops the scenario.
  7. The operator calls get_app_walk(walk_id) for the final record, or polls list_app_walks while it runs.
  8. In the UI, the Walks tab polls listAppWalks every 10 s, renders the walk list with status badges and severity-color-coded finding counts, and a row click opens the detail dialog: Mermaid state graph, findings table, emitted scenarios, AI decisions when ai_state is non-null, and the server-rendered report.html embedded at the bottom.
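
Steps 5 and 6 can be sketched in a few lines. The field list for the dedup key follows the "How it works" section (category, route_template, selector, message[:80]); how the walker serializes those fields before hashing is not documented, so the separator below is an assumption. The via_label field and the value field on assert_no_error are hypothetical, used here to carry the click label and the error phrase.

```python
import hashlib


def dedup_key(category, route_template, selector, message):
    """sha256 over the four fields, with the message cut at 80 characters."""
    parts = [category, route_template, selector, message[:80]]
    # Joining with a unit separator is an assumed serialization.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def synthesize_steps(states, finding, start_url):
    """Follow parent_state_id links back to the root, then replay forward."""
    clicks = []
    state_id = finding["state_id"]
    while states[state_id]["parent_state_id"] is not None:
        clicks.append(states[state_id]["via_label"])
        state_id = states[state_id]["parent_state_id"]
    steps = [{"action_type": "open_url", "value": start_url}]
    steps += [{"action_type": "click", "target_label": label}
              for label in reversed(clicks)]
    steps.append({"action_type": "assert_no_error", "value": finding["message"]})
    return steps


# Toy walk: root -> "Reports" -> "Export", with the error on the last state.
states = {
    "s0": {"parent_state_id": None, "via_label": None},
    "s1": {"parent_state_id": "s0", "via_label": "Reports"},
    "s2": {"parent_state_id": "s1", "via_label": "Export"},
}
finding = {"state_id": "s2", "category": "console_error",
           "message": "TypeError: Cannot read property 'map'"}

steps = synthesize_steps(states, finding, "https://example.com/app")
for step in steps:
    print(step)
```

Truncating the message before hashing means two errors that differ only after the 80th character collapse into one finding, which is what keeps long stack-trace tails from producing duplicates.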

Example prompts

Walk https://walker-canary.apps.openfactory.tech on my tester VM in hybrid mode with goal "find broken buttons and unhandled errors". Cap at 25 states and depth 4. When it's done, show me the blocker findings and the scenarios it emitted.
List my last 10 completed walks for app "checkout-staging" and tell me which ones produced new blocker findings vs. only repeats.
Stop walk 550e8400-... it's stuck looping the modal. Then enumerate the DOM on the tester VM so I can see what selectors it thought were clickable on the current page.

Related

  • App UI Testing — author and replay curated GUI scenarios. The walker emits these directly from findings.
  • App Deployment — deploy a Git repo to a public preview URL the walker can crawl.
  • MCP Integration — set up the OpenFactory MCP server.