Checkpoints & Rollback — safe iteration for agents
Snapshot and restore an app’s code SHA, VM disk, and database state together, then revert (or re-apply) a change in a single bi-directional call. Agents can take a risky swing at a refactor, a migration, or a dependency bump and know they can land back at a known-good state without losing the failed attempt for debugging.
Today (M1): the orchestration layer is live. CheckpointService, REST, MCP, state transitions, retention pruning, and the six-stage rollback pipeline all work end to end. The snapshot adapter is a stub that fabricates snapshot IDs without touching real libvirt disk snapshots or DB dumps. M1.5 swaps in the real app_snapshots.py implementation behind the same interface, so any scenarios you write against the MCP tools today keep working unchanged.
Why checkpoints
Agent-driven iteration produces a lot of failed attempts. Without checkpoints you have to either (a) trust the agent’s diff-and-revert logic, or (b) carry the cost of rebuilding VM and DB state by hand every time something breaks. Checkpoints make rollback atomic across all three layers:
- Code — the git SHA the app was deployed from.
- VM disk — the qcow2 state of the app’s tester VM at snapshot time.
- Database — a logical dump of any managed database attached to the app.
And rollback is bi-directional: before reverting, the service captures a
pre-rollback safety checkpoint of the current (failed) state, so the agent can
roll forward into the failure again to keep debugging.
How it works
- Snapshot before the risky change: create_checkpoint quiesces the app, records the deployed git SHA, snapshots the VM disk, dumps each attached DB, and returns a checkpoint_id.
- Let the agent iterate: modify code, run a migration, deploy, walk the app. If it works, keep going. If it doesn’t, you have a known-good state to revert to.
- Roll back: rollback_app runs a six-stage pipeline (safety-checkpoint, stop-unit, vm-revert, db-restore, start-unit, health-check), capturing a safety checkpoint of the broken state before anything is reverted. The stages list comes back on the response so a GUI can render progress without polling.
- Pin what you want to keep: retention prunes to the last 10 checkpoints plus one daily for 7 days. pin_checkpoint exempts a checkpoint from pruning entirely.
The app’s slug, public URL, and gateway route are preserved across rollback: the same https://<slug>.apps.openfactory.tech URL keeps serving, just from the restored state.
MCP tools
| Tool | Use |
|---|---|
| create_checkpoint | Capture a manual checkpoint (code, VM, DB) before a risky iteration |
| list_checkpoints | List checkpoints (newest first) and per-app storage / retention usage |
| rollback_app | Bi-directional rollback with auto safety checkpoint and staged progress |
| pin_checkpoint | Exempt a checkpoint from retention pruning |
| delete_checkpoint | Remove a checkpoint (fails if pinned; unpin first) |
create_checkpoint
Capture a manual checkpoint before a risky agent iteration. Optionally pin it so retention never prunes it.
create_checkpoint(
app_id="abc-123-def",
notes="before risky refactor",
pinned=True
)
→ {
"checkpoint_id": "cp-a1b2c3d4",
"ref": "abc1234567890def",
"trigger": "manual",
"vm_snapshot_id": "stub-snap-a1b2c3d4",
"db_snapshot_id": "stub-db-snap-e5f6g7h8",
"quiesced": true,
"pinned": true,
"size_bytes": 0,
"notes": "before risky refactor",
"deploy_id": null,
"created_at": "2026-06-26T14:32:15Z",
"session_token": "user-abc123",
"next": "rollback_app(app_id='abc-123-def', checkpoint_id='cp-a1b2c3d4') to restore this state. Bi-directional: rollback captures a pre-rollback safety cp so you can roll forward again."
}
trigger is one of deploy, manual, agent, or pre-rollback. This is useful when filtering history for “what did the agent do here”.
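As a rough illustration of that kind of filtering, the sketch below groups checkpoint records by their trigger value. The helper name group_by_trigger and the sample history are hypothetical; only the checkpoint_id and trigger fields come from the responses shown on this page.

```python
from collections import defaultdict


def group_by_trigger(checkpoints):
    """Group checkpoint IDs by the trigger that created them."""
    groups = defaultdict(list)
    for cp in checkpoints:
        groups[cp["trigger"]].append(cp["checkpoint_id"])
    return dict(groups)


# Hypothetical history, trimmed to the two fields the helper reads.
history = [
    {"checkpoint_id": "cp-a1b2c3d4", "trigger": "manual"},
    {"checkpoint_id": "cp-safety-xyz", "trigger": "pre-rollback"},
    {"checkpoint_id": "cp-x9y8z7w6", "trigger": "manual"},
]

by_trigger = group_by_trigger(history)
print(by_trigger)
```

Grouping keeps the input order within each trigger, so a newest-first list stays newest-first per group.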
list_checkpoints
List every checkpoint for the app plus the current storage footprint and the retention policy in effect.
list_checkpoints(app_id="abc-123-def")
→ {
"checkpoints": [
{
"checkpoint_id": "cp-a1b2c3d4",
"ref": "abc1234567890def",
"trigger": "manual",
"vm_snapshot_id": "stub-snap-a1b2c3d4",
"db_snapshot_id": "stub-db-snap-e5f6g7h8",
"quiesced": true,
"pinned": true,
"size_bytes": 0,
"notes": "before risky refactor",
"deploy_id": null,
"created_at": "2026-06-26T14:32:15Z"
},
{
"checkpoint_id": "cp-x9y8z7w6",
"ref": "def456",
"trigger": "manual",
"vm_snapshot_id": "stub-snap-x9y8z7w6",
"db_snapshot_id": null,
"quiesced": true,
"pinned": false,
"size_bytes": 0,
"notes": null,
"deploy_id": null,
"created_at": "2026-06-25T10:00:00Z"
}
],
"storage_usage": {
"checkpoint_count": 2,
"pinned_count": 1,
"total_bytes": 0,
"keep_last": 10,
"daily_days": 7
},
"session_token": "user-abc123"
}
total_bytes is 0 under the stub adapter; real disk usage shows up in M1.5.
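One way a caller might consume this response is sketched below: pick the newest pinned checkpoint as a rollback baseline and print a one-line usage summary. The helper names newest_pinned and summarize_usage are hypothetical, and the sample is the response above trimmed to the fields the helpers read.

```python
def newest_pinned(response):
    """Return the ID of the newest pinned checkpoint, or None.

    Relies on list_checkpoints returning checkpoints newest first.
    """
    for cp in response["checkpoints"]:
        if cp["pinned"]:
            return cp["checkpoint_id"]
    return None


def summarize_usage(response):
    """Format the storage_usage block as a single human-readable line."""
    usage = response["storage_usage"]
    return (
        f"{usage['checkpoint_count']} checkpoints "
        f"({usage['pinned_count']} pinned), {usage['total_bytes']} bytes; "
        f"policy keeps last {usage['keep_last']} "
        f"plus one daily for {usage['daily_days']} days"
    )


sample = {
    "checkpoints": [
        {"checkpoint_id": "cp-a1b2c3d4", "pinned": True},
        {"checkpoint_id": "cp-x9y8z7w6", "pinned": False},
    ],
    "storage_usage": {
        "checkpoint_count": 2,
        "pinned_count": 1,
        "total_bytes": 0,
        "keep_last": 10,
        "daily_days": 7,
    },
}

baseline = newest_pinned(sample)
summary = summarize_usage(sample)
print(baseline)
print(summary)
```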
rollback_app
Bi-directional rollback. The service captures a pre-rollback safety
checkpoint first (so you can roll forward into the failed state for debugging),
then walks the six-stage pipeline.
rollback_app(app_id="abc-123-def", checkpoint_id="cp-a1b2c3d4")
→ {
"rolled_back_to": {
"checkpoint_id": "cp-a1b2c3d4",
"ref": "abc1234567890def",
"trigger": "manual",
"vm_snapshot_id": "stub-snap-a1b2c3d4",
"db_snapshot_id": "stub-db-snap-e5f6g7h8",
"quiesced": true,
"pinned": true,
"size_bytes": 0,
"notes": "before risky refactor",
"deploy_id": null,
"created_at": "2026-06-26T14:32:15Z"
},
"safety_checkpoint": {
"checkpoint_id": "cp-safety-xyz",
"ref": "abc999",
"trigger": "pre-rollback",
"vm_snapshot_id": "stub-snap-xyz",
"db_snapshot_id": null,
"quiesced": true,
"pinned": false,
"size_bytes": 0,
"notes": "safety checkpoint before rollback to cp-a1b2c3d4",
"deploy_id": null,
"created_at": "2026-06-26T14:35:00Z"
},
"stages": [
{
"stage": "safety-checkpoint",
"status": "ok",
"ts": "2026-06-26T14:35:00Z",
"notes": "id=cp-safety-xyz"
},
{
"stage": "stop-unit",
"status": "ok",
"ts": "2026-06-26T14:35:01Z",
"notes": "stub mode: would systemctl stop of-app-<slug>"
},
{
"stage": "vm-revert",
"status": "ok",
"ts": "2026-06-26T14:35:02Z",
"notes": "snap=stub-snap-a1b2c3d4"
},
{
"stage": "db-restore",
"status": "ok",
"ts": "2026-06-26T14:35:03Z",
"notes": "snap=stub-db-snap-e5f6g7h8"
},
{
"stage": "start-unit",
"status": "ok",
"ts": "2026-06-26T14:35:04Z",
"notes": "stub mode: would systemctl start of-app-<slug>"
},
{
"stage": "health-check",
"status": "ok",
"ts": "2026-06-26T14:35:05Z",
"notes": "stub mode: would curl localhost:<port>/"
}
],
"session_token": "user-abc123",
"next": "To roll FORWARD (undo the rollback): rollback_app(app_id='abc-123-def', checkpoint_id='cp-safety-xyz')"
}
The stages list is ordered and timestamped. A GUI can render it as an SSE-style progress strip without a second round-trip.
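A minimal sketch of rendering that list as a text progress strip follows. The function names are hypothetical, and treating any status other than ok as a failure is an assumption: the responses on this page only ever show ok, so the "failed" value in the sample is invented for illustration.

```python
def render_stages(stages):
    """Render the ordered stages list as one text line per stage."""
    lines = []
    for index, stage in enumerate(stages, start=1):
        # Assumption: any status other than "ok" marks a failed stage.
        marker = "ok" if stage["status"] == "ok" else "!!"
        lines.append(
            f"[{marker}] {index}/{len(stages)} {stage['stage']} ({stage['notes']})"
        )
    return "\n".join(lines)


def first_failure(stages):
    """Return the name of the first stage that is not ok, or None."""
    return next((s["stage"] for s in stages if s["status"] != "ok"), None)


# Hypothetical two-stage sample; "failed" is an invented status value.
sample_stages = [
    {"stage": "safety-checkpoint", "status": "ok",
     "ts": "2026-06-26T14:35:00Z", "notes": "id=cp-safety-xyz"},
    {"stage": "stop-unit", "status": "failed",
     "ts": "2026-06-26T14:35:01Z", "notes": "unit did not stop"},
]

strip = render_stages(sample_stages)
failed_stage = first_failure(sample_stages)
print(strip)
```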
pin_checkpoint
Pin a checkpoint so the retention policy never prunes it. Use this for known-good baselines you want to keep around indefinitely.
pin_checkpoint(app_id="abc-123-def", checkpoint_id="cp-a1b2c3d4")
→ {
"checkpoint_id": "cp-a1b2c3d4",
"ref": "abc1234567890def",
"trigger": "manual",
"vm_snapshot_id": "stub-snap-a1b2c3d4",
"db_snapshot_id": "stub-db-snap-e5f6g7h8",
"quiesced": true,
"pinned": true,
"size_bytes": 0,
"notes": "before risky refactor",
"deploy_id": null,
"created_at": "2026-06-26T14:32:15Z",
"session_token": "user-abc123"
}
delete_checkpoint
Remove a checkpoint. Fails with an error if the checkpoint is pinned; unpin it first.
delete_checkpoint(app_id="abc-123-def", checkpoint_id="cp-a1b2c3d4")
→ {
"deleted": true,
"checkpoint_id": "cp-a1b2c3d4",
"session_token": "user-abc123"
}
REST endpoints
All endpoints are owner-scoped (get_optional_user plus X-Guest-Id header)
and mirror the MCP surface for direct UI consumption.
| Method | Path | Purpose |
|---|---|---|
| POST | /api/apps/{app_id}/checkpoints | Create a manual checkpoint with optional notes and pin flag |
| GET | /api/apps/{app_id}/checkpoints | List checkpoints (newest first) plus storage usage and retention policy |
| GET | /api/apps/{app_id}/checkpoints/{cp_id} | Fetch a single checkpoint by ID |
| DELETE | /api/apps/{app_id}/checkpoints/{cp_id} | Remove a checkpoint (returns 204 No Content; returns 409 if pinned) |
| POST | /api/apps/{app_id}/checkpoints/{cp_id}/pin | Pin a checkpoint to exempt it from pruning |
| POST | /api/apps/{app_id}/checkpoints/{cp_id}/unpin | Unpin a checkpoint (eligible for pruning again) |
| POST | /api/apps/{app_id}/checkpoints/{cp_id}/rollback | Bi-directional rollback, returns staged progress |
| POST | /api/apps/{app_id}/checkpoints/prune | Manually trigger retention pruning (returns {deleted: [...], kept: count}) |
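The sketch below builds (but does not send) a request for the create endpoint using Python's standard library. Several details are assumptions rather than documented behavior: the BASE_URL host is a placeholder, the JSON body field names notes and pinned are guessed from the MCP tool arguments, and the guest ID value is invented. Only the path and the X-Guest-Id header name come from this page.

```python
import json
import urllib.request

# Placeholder host: substitute the real API origin for your deployment.
BASE_URL = "https://openfactory.example"


def build_create_checkpoint_request(app_id, guest_id, notes=None, pinned=False):
    """Build a POST request for /api/apps/{app_id}/checkpoints.

    Assumption: the body mirrors the MCP tool arguments (notes, pinned).
    """
    body = json.dumps({"notes": notes, "pinned": pinned}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/apps/{app_id}/checkpoints",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Guest-Id": guest_id},
    )


req = build_create_checkpoint_request(
    "abc-123-def", "guest-42", notes="before risky refactor", pinned=True
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. For the DELETE endpoint, expect an empty body on success (204) and an HTTPError with code 409 when the checkpoint is pinned.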
Checkpoint model fields
Each checkpoint record includes these fields:
| Field | Type | Notes |
|---|---|---|
| checkpoint_id | string | Unique ID (cp-…) |
| ref | string or null | Git SHA from the deployed code, or null if no deploy history |
| trigger | string | One of deploy, manual, agent, pre-rollback |
| vm_snapshot_id | string | Snapshot ID (stub-snap-… in M1) |
| db_snapshot_id | string or null | DB snapshot ID if a managed database is attached |
| quiesced | boolean | Whether fsfreeze succeeded (True in M1 stub) |
| pinned | boolean | Exempt from retention pruning when True |
| size_bytes | integer | Disk footprint (0 in M1, populated in M1.5) |
| notes | string or null | Optional freeform note |
| deploy_id | string or null | Links to deploy history entry when trigger=‘deploy’ |
| created_at | string | ISO 8601 timestamp |
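For client code, the field table could be mirrored as a typed record along these lines. The Checkpoint class and its from_dict helper are a hypothetical client-side convenience, not part of the service; the field names, types, and allowed trigger values are taken from the table above.

```python
from dataclasses import dataclass, fields
from typing import Optional

ALLOWED_TRIGGERS = {"deploy", "manual", "agent", "pre-rollback"}


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    ref: Optional[str]
    trigger: str
    vm_snapshot_id: str
    db_snapshot_id: Optional[str]
    quiesced: bool
    pinned: bool
    size_bytes: int
    notes: Optional[str]
    deploy_id: Optional[str]
    created_at: str

    def __post_init__(self):
        # Reject trigger values the field table does not list.
        if self.trigger not in ALLOWED_TRIGGERS:
            raise ValueError(f"unknown trigger: {self.trigger!r}")

    @classmethod
    def from_dict(cls, raw):
        """Build a record, ignoring extra keys such as session_token or next."""
        return cls(**{f.name: raw[f.name] for f in fields(cls)})


record = Checkpoint.from_dict({
    "checkpoint_id": "cp-a1b2c3d4",
    "ref": "abc1234567890def",
    "trigger": "manual",
    "vm_snapshot_id": "stub-snap-a1b2c3d4",
    "db_snapshot_id": "stub-db-snap-e5f6g7h8",
    "quiesced": True,
    "pinned": True,
    "size_bytes": 0,
    "notes": "before risky refactor",
    "deploy_id": None,
    "created_at": "2026-06-26T14:32:15Z",
    "session_token": "user-abc123",
})
print(record.checkpoint_id, record.trigger)
```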
Retention policy
The default policy keeps:
- The last 10 checkpoints regardless of age.
- One checkpoint per day for the last 7 days (the oldest of each day wins).
- Every pinned checkpoint, indefinitely.
Anything outside those buckets is eligible for pruning. Pruning runs implicitly
after a successful create_checkpoint. Trigger it manually with the
POST /api/apps/{app_id}/checkpoints/prune endpoint if you need to reclaim
space immediately.
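The three buckets can be sketched as a selection function. This is an illustration of the policy as described here, not the service's actual pruning code: select_kept and the demo data are hypothetical, and reading "the last 7 days" as a rolling window ending at now is an assumption.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse an ISO 8601 'Z' timestamp into a timezone-aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def select_kept(checkpoints, now, keep_last=10, daily_days=7):
    """Return the set of checkpoint IDs that retention would keep."""
    ordered = sorted(
        checkpoints, key=lambda cp: parse_ts(cp["created_at"]), reverse=True
    )
    # Bucket 1: the newest keep_last checkpoints, regardless of age.
    kept = {cp["checkpoint_id"] for cp in ordered[:keep_last]}
    # Bucket 2: every pinned checkpoint.
    kept |= {cp["checkpoint_id"] for cp in ordered if cp["pinned"]}
    # Bucket 3: one per calendar day inside the window. Iterating newest
    # first means the last write for a day is that day's oldest checkpoint.
    cutoff = now - timedelta(days=daily_days)
    daily = {}
    for cp in ordered:
        created = parse_ts(cp["created_at"])
        if created >= cutoff:
            daily[created.date()] = cp["checkpoint_id"]
    return kept | set(daily.values())


def cp(cp_id, ts, pinned=False):
    return {"checkpoint_id": cp_id, "created_at": ts, "pinned": pinned}


demo = [
    cp("a", "2026-06-26T14:00:00Z"),
    cp("b", "2026-06-26T10:00:00Z"),
    cp("c", "2026-06-26T08:00:00Z"),
    cp("d", "2026-06-25T09:00:00Z"),
    cp("e", "2026-06-25T07:00:00Z"),
    cp("f", "2026-06-20T12:00:00Z", pinned=True),
    cp("g", "2026-06-19T12:00:00Z"),
]
now = datetime(2026, 6, 26, 15, 0, tzinfo=timezone.utc)
# Smaller limits than the defaults so the demo stays readable.
kept = select_kept(demo, now, keep_last=2, daily_days=3)
print(sorted(kept))
```

With these reduced limits, a and b survive as the two newest, c and e as the oldest of their days, and f because it is pinned; d and g fall outside every bucket.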
Putting it together
A typical agent loop looks like this:
- Agent calls create_checkpoint(app_id, notes="before risky refactor") to snapshot code SHA, VM disk, and DB state.
- Agent modifies the app: refactors code, runs a migration, redeploys.
- Agent calls walk_app to test the changes. The walk fails with a regression.
- Agent calls rollback_app(app_id, checkpoint_id). The service captures a pre-rollback safety checkpoint (preserving the failed state), then stages through stop-unit, vm-revert (restoring the VM disk), db-restore (restoring the DB), start-unit, and health-check.
- The GUI renders the six-stage progress directly from the returned stages list, no polling needed.
- The app is now at its pre-refactor state, serving from the same URL and gateway route.
- If the agent wants to keep debugging the failure, it calls rollback_app again with the safety checkpoint ID to roll forward into the failed state. Bi-directional recovery.
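The loop above can be sketched against an in-memory stand-in. FakeCheckpointClient is entirely hypothetical: it models code, VM disk, and DB as a single string so the control flow is visible, and the walk_app return shape ({"passed": ...}) is an assumption, since this page does not show it.

```python
class FakeCheckpointClient:
    """In-memory stand-in for the MCP tools, for illustration only."""

    def __init__(self):
        self.state = "good"      # stands in for code + VM disk + DB
        self.checkpoints = {}    # checkpoint_id -> saved state
        self._counter = 0

    def create_checkpoint(self, app_id, notes=None, trigger="manual"):
        self._counter += 1
        cp_id = f"cp-{self._counter:04d}"
        self.checkpoints[cp_id] = self.state
        return {"checkpoint_id": cp_id, "trigger": trigger, "notes": notes}

    def rollback_app(self, app_id, checkpoint_id):
        # Capture the current (possibly broken) state before reverting.
        safety = self.create_checkpoint(
            app_id,
            notes=f"safety checkpoint before rollback to {checkpoint_id}",
            trigger="pre-rollback",
        )
        self.state = self.checkpoints[checkpoint_id]
        return {
            "rolled_back_to": {"checkpoint_id": checkpoint_id},
            "safety_checkpoint": safety,
        }

    def walk_app(self, app_id):
        # Assumed response shape; the real walker output is not shown here.
        return {"passed": self.state == "good"}


def risky_iteration(client, app_id, change):
    """Checkpoint, apply a change, and roll back if the walk fails."""
    cp = client.create_checkpoint(app_id, notes="before risky refactor")
    change(client)
    if client.walk_app(app_id)["passed"]:
        return {"kept": True, "safety_checkpoint_id": None}
    result = client.rollback_app(app_id, cp["checkpoint_id"])
    safety_id = result["safety_checkpoint"]["checkpoint_id"]
    return {"kept": False, "safety_checkpoint_id": safety_id}


def break_app(client):
    client.state = "broken"


client = FakeCheckpointClient()
outcome = risky_iteration(client, "abc-123-def", break_app)
state_after_rollback = client.state

# Roll forward into the failed state to keep debugging.
client.rollback_app("abc-123-def", outcome["safety_checkpoint_id"])
state_after_roll_forward = client.state
print(state_after_rollback, state_after_roll_forward)
```

Note that the roll-forward call itself creates another pre-rollback checkpoint of the good state, which is what keeps the operation reversible in both directions.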
Notes
- Stub today, real tomorrow. M1’s snapshot adapter fabricates IDs; the state transitions, retention, and rollback pipeline are real. M1.5 swaps in libvirt disk snapshots and DB dumps behind the same interface.
- Quiesced snapshots. create_checkpoint stops the app unit briefly to capture a consistent disk and DB pair, then restarts it. The quiesced: true field on the response confirms this happened.
- Pinned checkpoints cannot be deleted. Unpin first, or leave them pinned. That is the point of pinning.
- pre-rollback checkpoints are normal checkpoints. They count against retention and can themselves be rolled back to, pinned, or deleted.
- REST DELETE returns 204 No Content. There is no response body, only the status code.
Related
- App Deployment: register a Git repo as an app and get the public preview URL that rollback preserves.
- Autonomous App Walker: walk_app and scenario runs are what typically catch the regression that triggers a rollback.
- Managed Databases: provision databases that get snapshotted as part of each checkpoint.