Table of Contents

Introduction: the 72 hours after release determine long-term success
1. Observability: not “more logs,” but “faster diagnosis”
2. Alerting: define “what triggers action” before incidents happen
3. Rollback: rollback readiness is trained, not declared
4. Postmortems: from blame to mechanism upgrades
5. Team operating model: convert heroics into defaults
6. Series wrap-up: from “faster output” to “stable delivery”

Introduction: the 72 hours after release determine long-term success

Teams often over-focus on the release moment and under-invest in post-release operations. In production:

user inputs differ from your test sets,
upstream dependencies fluctuate more than expected,
latency and cost shift structurally during peak traffic,
low-probability failures become frequent at scale.

To close the loop in Agentic Engineering, three capabilities are non-negotiable: observability, rollback readiness, and postmortem discipline.

1. Observability: not “more logs,” but “faster diagnosis”

A practical observability model has four layers:

Task layer: did a user request succeed, and if not, in which failure class?
Inference layer: model latency, token use, retries, refusals.
Tool layer: tool-call success rates, latency, error classes, idempotency collisions.
Business layer: conversion, retention, handoff-to-human rate, complaint rate.

If you only watch model metrics, you will misdiagnose many incidents. The model can return successfully while the tool path fails, so the request still ends in a business failure.
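One way to make the layered view concrete is to attribute each failed request to the lowest layer that broke. The sketch below is a minimal illustration, not a prescribed method: the function name `attribute_failure` and the three boolean signals are assumptions introduced here, and a real system would derive them from its own traces.

```python
from typing import Optional


def attribute_failure(inference_ok: bool, tool_ok: bool, task_ok: bool) -> Optional[str]:
    """Return the lowest layer that failed, or None if the request succeeded.

    Layers are checked bottom-up so that a tool failure is not
    misreported as a model problem or a generic task failure.
    """
    if not inference_ok:
        return "inference"
    if not tool_ok:
        return "tool"
    if not task_ok:
        return "task"
    return None


# The model returned successfully, but the tool call failed,
# so the user request failed as well.
print(attribute_failure(inference_ok=True, tool_ok=False, task_ok=False))
```

Watching only `inference_ok` would report this request as healthy, which is exactly the misdiagnosis described above.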

A minimal telemetry schema should include:

trace_id / request_id / tenant / use_case / model / tool / latency / tokens / cost / error_class
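As a rough sketch, the schema can be expressed as a typed record that serializes to one JSON line per request. The field names come from the schema above; the units (`latency_ms`, `cost_usd`), the class name `TelemetryEvent`, and all example values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TelemetryEvent:
    trace_id: str
    request_id: str
    tenant: str
    use_case: str
    model: str
    tool: Optional[str]                # None when no tool was called
    latency_ms: float                  # assumed unit: milliseconds
    tokens: int
    cost_usd: float                    # assumed unit: US dollars
    error_class: Optional[str] = None  # None means the request succeeded

    def to_json(self) -> str:
        """Serialize to a single JSON line suitable for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example: the model call succeeded but the tool path timed out.
event = TelemetryEvent(
    trace_id="t-001", request_id="r-001", tenant="acme", use_case="refund",
    model="example-model", tool="payments_api", latency_ms=2310.0,
    tokens=812, cost_usd=0.004, error_class="tool_timeout",
)
print(event.to_json())
```

Keeping `error_class` as a small, fixed vocabulary rather than free text makes it possible to group failures by class across tenants and use cases.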

2. Alerting: define “what triggers action” before incidents happen

Alerting is not about noticing anomalies. It is about triggering predefined actions.

A practical 3-tier policy:

P1 (immediate rollback): critical-path error spikes, unauthorized actions, sensitive-data leaks.
P2 (degraded mode): sustained latency breaches, budget anomalies, persistent upstream throttling.
P3 (observe and optimize): non-critical feature instability.

Predefine the action map:

P1 -> rollback immediately + notify on-call + freeze new releases
P2 -> degrade model/path + continue close observation
P3 -> create issue ticket + schedule in next iteration
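The action map can also live in code so that an alert resolves to actions without a human lookup. This is a minimal sketch: the signal names, action names, and the choice to treat unknown signals as P3 are all assumptions for illustration, not part of any particular alerting tool.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Hypothetical signal names, grouped by the tiers described above.
SIGNAL_SEVERITY: Dict[str, Severity] = {
    "critical_path_error_spike": Severity.P1,
    "unauthorized_action": Severity.P1,
    "sensitive_data_leak": Severity.P1,
    "sustained_latency_breach": Severity.P2,
    "budget_anomaly": Severity.P2,
    "upstream_throttling": Severity.P2,
}

ACTION_MAP: Dict[Severity, List[str]] = {
    Severity.P1: ["rollback", "notify_on_call", "freeze_releases"],
    Severity.P2: ["degrade_model_or_path", "observe_closely"],
    Severity.P3: ["create_ticket", "schedule_next_iteration"],
}


def actions_for(signal: str) -> List[str]:
    """Resolve a signal to its predefined actions (unknown signals -> P3)."""
    severity = SIGNAL_SEVERITY.get(signal, Severity.P3)
    return ACTION_MAP[severity]


print(actions_for("sensitive_data_leak"))
```

Defaulting unknown signals to P3 is a judgment call; a team that prefers to fail loudly could raise an error for unmapped signals instead.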

3. Rollback: rollback readiness is trained, not declared

Many systems are “rollback-capable in theory” but fail in real incidents. Common reasons:

rollback depends on migrated data structures,
old configs/routes are not preserved,
teams lack rehearsal and panic under pressure.

Run one small rollback drill each iteration:

simulate upstream timeout,
simulate malformed model output,
simulate tool permission failures,
measure recovery time against target.
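A drill of this kind can be reduced to three injected callables and a timer. The sketch below assumes a hypothetical `run_drill` helper and a toy in-memory state standing in for a real deployment; the actual fault injection and rollback mechanisms depend entirely on your platform.

```python
import time
from typing import Callable


def run_drill(
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.01,
) -> float:
    """Inject a fault, roll back, and return seconds until healthy again."""
    inject_fault()
    start = time.monotonic()
    rollback()
    while not is_healthy():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("recovery exceeded the drill timeout")
        time.sleep(poll_interval_s)
    return time.monotonic() - start


# Toy in-memory "system": the fault flag simulates an upstream timeout.
state = {"upstream_timeout": False}
recovery_s = run_drill(
    inject_fault=lambda: state.update(upstream_timeout=True),
    rollback=lambda: state.update(upstream_timeout=False),
    is_healthy=lambda: not state["upstream_timeout"],
    timeout_s=5.0,
)
print(f"recovered in {recovery_s:.3f}s")
```

The same harness can be reused for malformed model output or tool permission failures by swapping the `inject_fault` and `is_healthy` callables.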

Suggested SLO-style targets:

traffic switchback within 5 minutes,
core path recovery within 15 minutes,
internal incident update within 30 minutes.
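These targets are easy to check mechanically once recovery times are recorded. The following sketch assumes hypothetical target keys and a `missed_targets` helper; it treats a missing measurement as a miss, which is a design assumption rather than something the targets themselves require.

```python
from datetime import timedelta
from typing import Dict, List

# Targets taken from the list above; the key names are illustrative.
TARGETS: Dict[str, timedelta] = {
    "traffic_switchback": timedelta(minutes=5),
    "core_path_recovery": timedelta(minutes=15),
    "internal_incident_update": timedelta(minutes=30),
}


def missed_targets(measured: Dict[str, timedelta]) -> List[str]:
    """Return the names of targets that were exceeded or never measured."""
    return [
        name
        for name, limit in TARGETS.items()
        if measured.get(name, timedelta.max) > limit
    ]


# Hypothetical drill result: switchback was fast, core recovery was slow.
drill = {
    "traffic_switchback": timedelta(minutes=4),
    "core_path_recovery": timedelta(minutes=20),
    "internal_incident_update": timedelta(minutes=30),
}
print(missed_targets(drill))
```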

4. Postmortems: from blame to mechanism upgrades

A high-quality postmortem answers five questions:

What exactly triggered the incident?
Which guardrail should have blocked it but did not?
Why didn’t observability surface the problem early enough?
Why was rollback fast or slow?
Which mechanism changes prevent recurrence?

Use a mechanism-first structure:

Timeline -> Decision points -> Guardrail failures -> Reinforcements -> Owner & due date

A postmortem is only complete when conclusions are encoded back into gates, tests, and alert rules.
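The completeness rule can be enforced by the postmortem record itself. In this sketch the class and field names are assumptions, and "complete" is interpreted as: at least one reinforcement exists, and every reinforcement has been encoded as a gate, a test, or an alert rule.

```python
from dataclasses import dataclass, field
from typing import List

ENCODED_TARGETS = {"gate", "test", "alert_rule"}


@dataclass
class Reinforcement:
    description: str
    owner: str
    due_date: str          # ISO date string, e.g. "2025-01-31"
    encoded_in: str = ""   # "gate", "test", or "alert_rule"; empty if pending


@dataclass
class Postmortem:
    timeline: List[str]
    decision_points: List[str]
    guardrail_failures: List[str]
    reinforcements: List[Reinforcement] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True only when every reinforcement is encoded in a mechanism."""
        return bool(self.reinforcements) and all(
            r.encoded_in in ENCODED_TARGETS for r in self.reinforcements
        )


# Hypothetical incident record with one reinforcement still pending.
pm = Postmortem(
    timeline=["10:02 error spike", "10:09 rollback started"],
    decision_points=["on-call chose rollback over degrade"],
    guardrail_failures=["no alert on tool error rate"],
    reinforcements=[Reinforcement("alert on tool error rate", "alice", "2025-01-31")],
)
print(pm.is_complete())  # pending reinforcement, so not complete yet
pm.reinforcements[0].encoded_in = "alert_rule"
print(pm.is_complete())
```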

5. Team operating model: convert heroics into defaults

Agentic Engineering maturity is not measured by one expert’s skill. It is measured by whether a new engineer can deliver safely in two weeks.

Institutionalize three assets:

Runbook: who does what in incidents and in what order.
Gate template: mandatory release checks for every change.
Postmortem knowledge base: historical failure patterns and prevention rules.

These assets reduce quality variance caused by staffing changes.
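A gate template can be as small as a named set of checks that must all pass before a release proceeds. The check names below are hypothetical placeholders, and each lambda stands in for a real verification step in your pipeline.

```python
from typing import Callable, Dict, List

GateCheck = Callable[[], bool]


def evaluate_gate(checks: Dict[str, GateCheck]) -> List[str]:
    """Run every check and return the names of those that failed.

    An empty list means the release may proceed.
    """
    return [name for name, check in checks.items() if not check()]


# Hypothetical template; replace each lambda with a real verification.
release_gate: Dict[str, GateCheck] = {
    "telemetry_schema_instrumented": lambda: True,
    "rollback_drill_passed": lambda: False,
    "runbook_linked": lambda: True,
}

failed = evaluate_gate(release_gate)
print("blocked by:" if failed else "release allowed", failed)
```

Because the template is data, a new engineer inherits the same checks as everyone else instead of relying on memory or on one expert's review.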

6. Series wrap-up: from “faster output” to “stable delivery”

Across this three-part series:

Part 1 built the requirement-to-release pipeline.
Part 2 operationalized evals and quality gates.
Part 3 closed the loop with operations and postmortems.

That is the practical value of Agentic Engineering for frontline engineers: not replacing human decisions, but turning those decisions into executable, testable, reusable system behavior.

If you are adopting this in your team, start with a minimal loop:

choose one critical path,
instrument a minimal telemetry schema,
enforce one release gate template,
run one rollback drill.

Once these are stable, expand to more agent-powered workflows with far lower risk.

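This is an original work, licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Please credit the source when reposting.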

DavidChan's Blog

Agentic Engineering in Practice (3): Observability, Rollbacks, and Team Postmortems