Checkpoints & Rollback — safe iteration for agents

Snapshot and restore an app’s code SHA, VM disk, and database state together, then revert (or re-apply) a change in a single bi-directional call. Agents can take a risky swing at a refactor, a migration, or a dependency bump and know they can land back at a known-good state without losing the failed attempt for debugging.

Today (M1): the orchestration layer is live — CheckpointService, REST, MCP, state transitions, retention pruning, and the six-stage rollback pipeline all work end to end. The snapshot adapter is a stub that fabricates snapshot IDs without touching real libvirt disk snapshots or DB dumps. M1.5 swaps in the real app_snapshots.py implementation behind the same interface, so any scenarios you write against the MCP tools today keep working unchanged.

Why checkpoints

Agent-driven iteration produces a lot of failed attempts. Without checkpoints you have to either (a) trust the agent’s diff-and-revert logic, or (b) carry the cost of rebuilding VM and DB state by hand every time something breaks. Checkpoints make rollback atomic across all three layers:

Code — the git SHA the app was deployed from.
VM disk — the qcow2 state of the app’s tester VM at snapshot time.
Database — a logical dump of any managed database attached to the app.

And rollback is bi-directional: before reverting, the service captures a pre-rollback safety checkpoint of the current (failed) state, so the agent can roll forward into the failure again to keep debugging.

How it works

Snapshot before the risky change — create_checkpoint quiesces the app, records the deployed git SHA, snapshots the VM disk, dumps each attached DB, and returns a checkpoint_id.
Let the agent iterate — modify code, run a migration, deploy, walk the app. If it works, keep going. If it doesn’t, you have a known-good state to revert to.
Roll back — rollback_app captures a safety checkpoint of the broken state first, then runs a six-stage pipeline: safety-checkpoint, stop-unit, vm-revert, db-restore, start-unit, health-check. The stages list comes back on the response so a GUI can render progress without polling.
Pin what you want to keep — retention prunes to the last 10 checkpoints plus one daily for 7 days. pin_checkpoint exempts a checkpoint from pruning entirely.

The app’s slug, public URL, and gateway route are preserved across rollback: the same https://<slug>.apps.openfactory.tech URL keeps serving, just from the restored state.

MCP tools

Tool	Use
`create_checkpoint`	Capture a manual checkpoint (code, VM, DB) before a risky iteration
`list_checkpoints`	List checkpoints (newest first) and per-app storage / retention usage
`rollback_app`	Bi-directional rollback with auto safety checkpoint and staged progress
`pin_checkpoint`	Exempt a checkpoint from retention pruning
`delete_checkpoint`	Remove a checkpoint (fails if pinned; unpin it first)

`create_checkpoint`

Capture a manual checkpoint before a risky agent iteration. Optionally pin it so retention never prunes it.


create_checkpoint(
  app_id="abc-123-def",
  notes="before risky refactor",
  pinned=True
)
  → {
      "checkpoint_id": "cp-a1b2c3d4",
      "ref": "abc1234567890def",
      "trigger": "manual",
      "vm_snapshot_id": "stub-snap-a1b2c3d4",
      "db_snapshot_id": "stub-db-snap-e5f6g7h8",
      "quiesced": true,
      "pinned": true,
      "size_bytes": 0,
      "notes": "before risky refactor",
      "deploy_id": null,
      "created_at": "2026-06-26T14:32:15Z",
      "session_token": "user-abc123",
      "next": "rollback_app(app_id='abc-123-def', checkpoint_id='cp-a1b2c3d4') to restore this state. Bi-directional: rollback captures a pre-rollback safety cp so you can roll forward again."
    }

trigger is one of deploy, manual, agent, or pre-rollback. Useful when filtering history for “what did the agent do here”.

`list_checkpoints`

List every checkpoint for the app plus the current storage footprint and the retention policy in effect.


list_checkpoints(app_id="abc-123-def")
  → {
      "checkpoints": [
        {
          "checkpoint_id": "cp-a1b2c3d4",
          "ref": "abc1234567890def",
          "trigger": "manual",
          "vm_snapshot_id": "stub-snap-a1b2c3d4",
          "db_snapshot_id": "stub-db-snap-e5f6g7h8",
          "quiesced": true,
          "pinned": true,
          "size_bytes": 0,
          "notes": "before risky refactor",
          "deploy_id": null,
          "created_at": "2026-06-26T14:32:15Z"
        },
        {
          "checkpoint_id": "cp-x9y8z7w6",
          "ref": "def456",
          "trigger": "manual",
          "vm_snapshot_id": "stub-snap-x9y8z7w6",
          "db_snapshot_id": null,
          "quiesced": true,
          "pinned": false,
          "size_bytes": 0,
          "notes": null,
          "deploy_id": null,
          "created_at": "2026-06-25T10:00:00Z"
        }
      ],
      "storage_usage": {
        "checkpoint_count": 2,
        "pinned_count": 1,
        "total_bytes": 0,
        "keep_last": 10,
        "daily_days": 7
      },
      "session_token": "user-abc123"
    }

total_bytes is 0 under the stub adapter; real disk usage shows up in M1.5.
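The trigger field makes it easy to pull one kind of checkpoint out of the history. The sketch below filters a list_checkpoints response by trigger; the helper name checkpoints_by_trigger is hypothetical, and the sample response is a trimmed copy of the shape shown above rather than real output.

```python
from typing import Any


def checkpoints_by_trigger(response: dict[str, Any], trigger: str) -> list[str]:
    """Return the checkpoint IDs in a list_checkpoints response that match a trigger."""
    return [
        cp["checkpoint_id"]
        for cp in response.get("checkpoints", [])
        if cp.get("trigger") == trigger
    ]


# Trimmed-down sample in the documented response shape (not real service output).
response = {
    "checkpoints": [
        {"checkpoint_id": "cp-safety-xyz", "trigger": "pre-rollback", "pinned": False},
        {"checkpoint_id": "cp-a1b2c3d4", "trigger": "manual", "pinned": True},
    ]
}

print(checkpoints_by_trigger(response, "pre-rollback"))
```

Filtering on pre-rollback is a quick way to find the safety checkpoints that a roll-forward would target.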

`rollback_app`

Bi-directional rollback. The service captures a pre-rollback safety checkpoint first (so you can roll forward into the failed state for debugging), then walks the six-stage pipeline.


rollback_app(app_id="abc-123-def", checkpoint_id="cp-a1b2c3d4")
  → {
      "rolled_back_to": {
        "checkpoint_id": "cp-a1b2c3d4",
        "ref": "abc1234567890def",
        "trigger": "manual",
        "vm_snapshot_id": "stub-snap-a1b2c3d4",
        "db_snapshot_id": "stub-db-snap-e5f6g7h8",
        "quiesced": true,
        "pinned": true,
        "size_bytes": 0,
        "notes": "before risky refactor",
        "deploy_id": null,
        "created_at": "2026-06-26T14:32:15Z"
      },
      "safety_checkpoint": {
        "checkpoint_id": "cp-safety-xyz",
        "ref": "abc999",
        "trigger": "pre-rollback",
        "vm_snapshot_id": "stub-snap-xyz",
        "db_snapshot_id": null,
        "quiesced": true,
        "pinned": false,
        "size_bytes": 0,
        "notes": "safety checkpoint before rollback to cp-a1b2c3d4",
        "deploy_id": null,
        "created_at": "2026-06-26T14:35:00Z"
      },
      "stages": [
        {
          "stage": "safety-checkpoint",
          "status": "ok",
          "ts": "2026-06-26T14:35:00Z",
          "notes": "id=cp-safety-xyz"
        },
        {
          "stage": "stop-unit",
          "status": "ok",
          "ts": "2026-06-26T14:35:01Z",
          "notes": "stub mode: would systemctl stop of-app-<slug>"
        },
        {
          "stage": "vm-revert",
          "status": "ok",
          "ts": "2026-06-26T14:35:02Z",
          "notes": "snap=stub-snap-a1b2c3d4"
        },
        {
          "stage": "db-restore",
          "status": "ok",
          "ts": "2026-06-26T14:35:03Z",
          "notes": "snap=stub-db-snap-e5f6g7h8"
        },
        {
          "stage": "start-unit",
          "status": "ok",
          "ts": "2026-06-26T14:35:04Z",
          "notes": "stub mode: would systemctl start of-app-<slug>"
        },
        {
          "stage": "health-check",
          "status": "ok",
          "ts": "2026-06-26T14:35:05Z",
          "notes": "stub mode: would curl localhost:<port>/"
        }
      ],
      "session_token": "user-abc123",
      "next": "To roll FORWARD (undo the rollback): rollback_app(app_id='abc-123-def', checkpoint_id='cp-safety-xyz')"
    }

The stages list is ordered and timestamped. A GUI can render it as an SSE-style progress strip without a second round-trip.
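As a rough illustration of consuming the stages list, the sketch below turns it into one text line per stage and reports the first stage that did not finish cleanly. The function name summarize_stages is hypothetical. The examples above only show the status value "ok", so treating every other value as a failure is an assumption, not documented behavior.

```python
from typing import Any, Optional


def summarize_stages(stages: list[dict[str, Any]]) -> tuple[list[str], Optional[str]]:
    """Render each stage as a text line and return the first non-ok stage, if any."""
    lines: list[str] = []
    first_failure: Optional[str] = None
    for entry in stages:
        # Assumption: any status other than "ok" counts as a failure.
        ok = entry.get("status") == "ok"
        mark = "[ok]" if ok else "[!!]"
        lines.append(f"{mark} {entry['stage']}: {entry.get('notes', '')}")
        if not ok and first_failure is None:
            first_failure = entry["stage"]
    return lines, first_failure


# Two stages copied from the rollback_app response shape above.
stages = [
    {"stage": "safety-checkpoint", "status": "ok", "notes": "id=cp-safety-xyz"},
    {"stage": "vm-revert", "status": "ok", "notes": "snap=stub-snap-a1b2c3d4"},
]
lines, failed_at = summarize_stages(stages)
print("\n".join(lines))
```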

`pin_checkpoint`

Pin a checkpoint so the retention policy never prunes it. Use this for known-good baselines you want to keep around indefinitely.


pin_checkpoint(app_id="abc-123-def", checkpoint_id="cp-a1b2c3d4")
  → {
      "checkpoint_id": "cp-a1b2c3d4",
      "ref": "abc1234567890def",
      "trigger": "manual",
      "vm_snapshot_id": "stub-snap-a1b2c3d4",
      "db_snapshot_id": "stub-db-snap-e5f6g7h8",
      "quiesced": true,
      "pinned": true,
      "size_bytes": 0,
      "notes": "before risky refactor",
      "deploy_id": null,
      "created_at": "2026-06-26T14:32:15Z",
      "session_token": "user-abc123"
    }

`delete_checkpoint`

Remove a checkpoint. Fails with an error if the checkpoint is pinned; unpin it first.


delete_checkpoint(app_id="abc-123-def", checkpoint_id="cp-a1b2c3d4")
  → {
      "deleted": true,
      "checkpoint_id": "cp-a1b2c3d4",
      "session_token": "user-abc123"
    }

REST endpoints

All endpoints are owner-scoped (get_optional_user plus X-Guest-Id header) and mirror the MCP surface for direct UI consumption.

Method	Path	Purpose
`POST`	`/api/apps/{app_id}/checkpoints`	Create a manual checkpoint with optional notes and pin flag
`GET`	`/api/apps/{app_id}/checkpoints`	List checkpoints (newest first) plus storage usage and retention policy
`GET`	`/api/apps/{app_id}/checkpoints/{cp_id}`	Fetch a single checkpoint by ID
`DELETE`	`/api/apps/{app_id}/checkpoints/{cp_id}`	Remove a checkpoint (returns 204 No Content; returns 409 if pinned)
`POST`	`/api/apps/{app_id}/checkpoints/{cp_id}/pin`	Pin a checkpoint to exempt it from pruning
`POST`	`/api/apps/{app_id}/checkpoints/{cp_id}/unpin`	Unpin a checkpoint (eligible for pruning again)
`POST`	`/api/apps/{app_id}/checkpoints/{cp_id}/rollback`	Bi-directional rollback, returns staged progress
`POST`	`/api/apps/{app_id}/checkpoints/prune`	Manually trigger retention pruning (returns `{deleted: [...], kept: count}`)
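To make the path layout concrete, here is a minimal sketch that builds (but does not send) requests for these endpoints using the standard library. BASE_URL is a placeholder host and checkpoint_request is a hypothetical helper. The JSON body keys notes and pinned are assumed to mirror the MCP parameters, and authentication beyond the X-Guest-Id header is left out.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "https://example.invalid"  # placeholder host, not the real API origin


def checkpoint_request(
    method: str,
    app_id: str,
    suffix: str = "",
    body: Optional[dict[str, Any]] = None,
    guest_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build a Request for /api/apps/{app_id}/checkpoints{suffix} without sending it."""
    url = f"{BASE_URL}/api/apps/{app_id}/checkpoints{suffix}"
    headers: dict[str, str] = {}
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    if guest_id:
        headers["X-Guest-Id"] = guest_id
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Body keys are an assumption that mirrors the MCP tool's parameters.
create = checkpoint_request(
    "POST", "abc-123-def", body={"notes": "before risky refactor", "pinned": True}
)
rollback = checkpoint_request("POST", "abc-123-def", "/cp-a1b2c3d4/rollback")
print(create.get_method(), create.full_url)
print(rollback.get_method(), rollback.full_url)
```

Passing one of these objects to urllib.request.urlopen would perform the call against a real host.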

Checkpoint model fields

Each checkpoint record includes these fields:

Field	Type	Notes
`checkpoint_id`	string	Unique ID (cp-…)
`ref`	string or null	Git SHA from the deployed code, or null if no deploy history
`trigger`	string	One of `deploy`, `manual`, `agent`, `pre-rollback`
`vm_snapshot_id`	string	Snapshot ID (stub-snap-… in M1)
`db_snapshot_id`	string or null	DB snapshot ID if a managed database is attached
`quiesced`	boolean	Whether fsfreeze succeeded (True in M1 stub)
`pinned`	boolean	Exempt from retention pruning when True
`size_bytes`	integer	Disk footprint (0 in M1, populated in M1.5)
`notes`	string or null	Optional freeform note
`deploy_id`	string or null	Links to deploy history entry when `trigger` is `deploy`
`created_at`	string	ISO 8601 timestamp
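For clients that prefer typed records, the field table could be mirrored as a dataclass along these lines. This is a client-side sketch, not the service's own model: the class name Checkpoint and the from_dict helper are hypothetical, and extra keys such as session_token or next are simply dropped.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional

VALID_TRIGGERS = {"deploy", "manual", "agent", "pre-rollback"}


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    ref: Optional[str]
    trigger: str
    vm_snapshot_id: str
    db_snapshot_id: Optional[str]
    quiesced: bool
    pinned: bool
    size_bytes: int
    notes: Optional[str]
    deploy_id: Optional[str]
    created_at: str

    def __post_init__(self) -> None:
        # Reject trigger values outside the four documented ones.
        if self.trigger not in VALID_TRIGGERS:
            raise ValueError(f"unknown trigger: {self.trigger!r}")

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "Checkpoint":
        """Build a record from a response dict, ignoring keys outside the model."""
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in names})


cp = Checkpoint.from_dict({
    "checkpoint_id": "cp-a1b2c3d4",
    "ref": "abc1234567890def",
    "trigger": "manual",
    "vm_snapshot_id": "stub-snap-a1b2c3d4",
    "db_snapshot_id": None,
    "quiesced": True,
    "pinned": True,
    "size_bytes": 0,
    "notes": "before risky refactor",
    "deploy_id": None,
    "created_at": "2026-06-26T14:32:15Z",
    "session_token": "user-abc123",  # not part of the model, dropped by from_dict
})
print(cp.checkpoint_id, cp.trigger, cp.pinned)
```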

Retention policy

The default policy keeps:

The last 10 checkpoints regardless of age.
One checkpoint per day for the last 7 days (the oldest of each day wins).
Every pinned checkpoint, indefinitely.

Anything outside those buckets is eligible for pruning. Pruning runs implicitly after a successful create_checkpoint. Trigger it manually with the POST /api/apps/{app_id}/checkpoints/prune endpoint if you need to reclaim space immediately.
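The three buckets can be sketched as a pure function over checkpoint records. This is one reading of the policy, not the service's implementation: it assumes "the last 7 days" is a rolling window measured back from now, that days are UTC calendar days, and that created_at uses the Z-suffixed format shown in the examples. The names select_kept and parse_ts are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_kept(
    checkpoints: list[dict[str, Any]],
    now: datetime,
    keep_last: int = 10,
    daily_days: int = 7,
) -> set[str]:
    """Return the IDs that survive pruning under the three retention buckets."""
    ordered = sorted(checkpoints, key=lambda cp: parse_ts(cp["created_at"]), reverse=True)
    # Bucket 1: the newest keep_last checkpoints, regardless of age.
    kept = {cp["checkpoint_id"] for cp in ordered[:keep_last]}
    # Bucket 2: one per day inside the window. Iterating newest first means each
    # overwrite replaces the entry with an older one, so the oldest of the day wins.
    cutoff = now - timedelta(days=daily_days)
    oldest_per_day: dict[Any, str] = {}
    for cp in ordered:
        ts = parse_ts(cp["created_at"])
        if ts >= cutoff:
            oldest_per_day[ts.date()] = cp["checkpoint_id"]
    kept |= set(oldest_per_day.values())
    # Bucket 3: every pinned checkpoint.
    kept |= {cp["checkpoint_id"] for cp in ordered if cp.get("pinned")}
    return kept


now = datetime(2026, 6, 26, 15, 0, tzinfo=timezone.utc)
history = [
    {"checkpoint_id": "a", "created_at": "2026-06-26T14:00:00Z", "pinned": False},
    {"checkpoint_id": "b", "created_at": "2026-06-26T10:00:00Z", "pinned": False},
    {"checkpoint_id": "c", "created_at": "2026-06-26T08:00:00Z", "pinned": False},
    {"checkpoint_id": "d", "created_at": "2026-06-25T12:00:00Z", "pinned": False},
    {"checkpoint_id": "e", "created_at": "2026-06-10T12:00:00Z", "pinned": True},
    {"checkpoint_id": "f", "created_at": "2026-06-11T12:00:00Z", "pinned": False},
]
# keep_last is lowered to 2 here only to keep the sample small.
print(sorted(select_kept(history, now, keep_last=2)))
```

In this sample, f is the only checkpoint outside all three buckets, so it is the only one eligible for pruning.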

Putting it together

A typical agent loop looks like this:

Agent calls create_checkpoint(app_id, notes="before risky refactor") to snapshot code SHA, VM disk, and DB state.
Agent modifies the app: refactors code, runs a migration, redeploys.
Agent calls walk_app to test the changes. The walk fails with a regression.
Agent calls rollback_app(app_id, checkpoint_id). The service captures a pre-rollback safety checkpoint (preserving the failed state), then stages through stop-unit, vm-revert (restoring the VM disk), db-restore (restoring the DB), start-unit, and health-check.
The GUI renders the six-stage progress directly from the returned stages list; no polling is needed.
The app is now at its pre-refactor state, serving from the same URL and gateway route.
If the agent wants to keep debugging the failure, it calls rollback_app again with the safety checkpoint ID to roll forward into the failed state. Bi-directional recovery.
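The loop above can be sketched in a few lines. Everything here other than the tool names and response keys is an assumption: client stands for some hypothetical wrapper that exposes the MCP tools as methods, attempt and verify are caller-supplied callables (verify playing the role of a walk_app run), and FakeClient exists only so the sketch runs without a server.

```python
from typing import Any, Callable


def iterate_with_checkpoint(
    client: Any,
    app_id: str,
    attempt: Callable[[], None],
    verify: Callable[[], bool],
) -> dict[str, Any]:
    """Checkpoint, try a change, and roll back if verification fails."""
    cp = client.create_checkpoint(app_id=app_id, notes="before risky change")
    attempt()
    if verify():
        return {"outcome": "kept", "checkpoint_id": cp["checkpoint_id"]}
    result = client.rollback_app(app_id=app_id, checkpoint_id=cp["checkpoint_id"])
    return {
        "outcome": "rolled_back",
        "checkpoint_id": cp["checkpoint_id"],
        # Rolling back to this ID later re-enters the failed state for debugging.
        "roll_forward_to": result["safety_checkpoint"]["checkpoint_id"],
    }


class FakeClient:
    """In-memory stand-in that returns canned responses in the documented shape."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def create_checkpoint(self, app_id: str, notes: str = "", pinned: bool = False) -> dict[str, Any]:
        self.calls.append("create_checkpoint")
        return {"checkpoint_id": "cp-a1b2c3d4", "notes": notes, "pinned": pinned}

    def rollback_app(self, app_id: str, checkpoint_id: str) -> dict[str, Any]:
        self.calls.append("rollback_app")
        return {
            "rolled_back_to": {"checkpoint_id": checkpoint_id},
            "safety_checkpoint": {"checkpoint_id": "cp-safety-xyz"},
        }


client = FakeClient()
result = iterate_with_checkpoint(client, "abc-123-def", attempt=lambda: None, verify=lambda: False)
print(result["outcome"], result["roll_forward_to"])
```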

Notes

Stub today, real tomorrow. M1’s snapshot adapter fabricates IDs; the state transitions, retention, and rollback pipeline are real. M1.5 swaps in libvirt disk snapshots and DB dumps behind the same interface.
Quiesced snapshots. create_checkpoint stops the app unit briefly to capture a consistent disk and DB pair, then restarts it. The quiesced: true field on the response confirms this happened.
Pinned checkpoints cannot be deleted. Unpin first, or leave them pinned. That is the point of pinning.
pre-rollback checkpoints are normal checkpoints. They count against retention and can themselves be rolled back to, pinned, or deleted.
REST DELETE returns 204 No Content. There is no response body, only the status code.
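Because the REST DELETE carries no body, a caller has to branch on the status code alone. A minimal sketch, using only the two codes documented above (204 and 409); the function name and the returned labels are hypothetical, and any other code is reported as unexpected rather than interpreted.

```python
def describe_delete_status(status: int) -> str:
    """Map a DELETE status code to a short outcome label."""
    if status == 204:
        return "deleted"
    if status == 409:
        return "pinned: unpin the checkpoint, then retry"
    return f"unexpected status {status}"


print(describe_delete_status(204))
print(describe_delete_status(409))
```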

App Deployment — register a Git repo as an app and get the public preview URL that rollback preserves.
Autonomous App Walker — walk_app and scenario runs are what typically catch the regression that triggers a rollback.
Managed Databases — provision databases that get snapshotted as part of each checkpoint.