Table of Contents
- Introduction: the 72 hours after release determine long-term success
- 1. Observability: not “more logs,” but “faster diagnosis”
- 2. Alerting: define “what triggers action” before incidents happen
- 3. Rollback: rollback readiness is trained, not declared
- 4. Postmortems: from blame to mechanism upgrades
- 5. Team operating model: convert heroics into defaults
- 6. Series wrap-up: from “faster output” to “stable delivery”
Introduction: the 72 hours after release determine long-term success
Teams often over-focus on the release moment and under-invest in post-release operations. In production:
- user inputs differ from your test sets,
- upstream dependencies fluctuate more than expected,
- latency and cost shift structurally during peak traffic,
- low-probability failures become frequent at scale.
To close the loop in Agentic Engineering, three capabilities are non-negotiable: observability, rollback readiness, and postmortem discipline.
1. Observability: not “more logs,” but “faster diagnosis”
A practical observability model has four layers:
- Task layer: did a user request succeed, and if not, in which failure class?
- Inference layer: model latency, token use, retries, refusals.
- Tool layer: tool-call success rates, latency, error classes, idempotency collisions.
- Business layer: conversion, retention, handoff-to-human rate, complaint rate.
If you only watch model metrics, you will misdiagnose many incidents. The model can return successfully while the tool path fails, and the result is still a business failure.
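One way to make this concrete is to record a success flag per layer and report the first layer that failed. The sketch below is only an illustration: `RequestOutcome`, its field names, and `failing_layer` are hypothetical and not part of any specific library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestOutcome:
    """Hypothetical per-request summary; field names are illustrative."""
    model_ok: bool  # inference layer: the model returned a usable response
    tool_ok: bool   # tool layer: every tool call succeeded
    task_ok: bool   # task layer: the user's request was actually fulfilled


def failing_layer(outcome: RequestOutcome) -> Optional[str]:
    """Return the first failing layer, or None if the request succeeded."""
    if not outcome.model_ok:
        return "inference"
    if not outcome.tool_ok:
        return "tool"
    if not outcome.task_ok:
        return "task"
    return None


# The model answered, but a downstream tool call failed.
outcome = RequestOutcome(model_ok=True, tool_ok=False, task_ok=False)
print(failing_layer(outcome))  # prints "tool"
```

A dashboard that tracks only `model_ok` would count this request as healthy, which is the misdiagnosis described above.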
A minimal telemetry schema should include:
trace_id / request_id / tenant / use_case / model / tool / latency / tokens / cost / error_class
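As a rough sketch, the schema above could be represented as a typed record that serializes to one JSON line per request. The units (`latency_ms`, `cost_usd`), the optional fields, and all sample values are assumptions made for illustration; the original schema does not specify them.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TelemetryEvent:
    trace_id: str
    request_id: str
    tenant: str
    use_case: str
    model: str
    tool: Optional[str]                # None when no tool was called (assumption)
    latency_ms: float                  # unit is an assumption
    tokens: int
    cost_usd: float                    # currency is an assumption
    error_class: Optional[str] = None  # None on success (assumption)

    def to_json(self) -> str:
        """Serialize to a single JSON line suitable for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)


# Sample values are made up for illustration.
event = TelemetryEvent(
    trace_id="t-001", request_id="r-001", tenant="acme",
    use_case="refund_request", model="example-model", tool="payments.refund",
    latency_ms=840.0, tokens=1250, cost_usd=0.004, error_class="tool_timeout",
)
print(event.to_json())
```

Keeping `error_class` as a small, fixed vocabulary rather than free text makes it possible to group incidents by failure class later.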
2. Alerting: define “what triggers action” before incidents happen
Alerting is not about noticing anomalies. It is about triggering predefined actions.
A practical 3-tier policy:
- P1 (immediate rollback): critical-path error spikes, unauthorized actions, sensitive-data leaks.
- P2 (degraded mode): sustained latency breaches, budget anomalies, persistent upstream throttling.
- P3 (observe and optimize): non-critical feature instability.
Predefine the action map:
P1 -> rollback immediately + notify on-call + freeze new releases
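A minimal sketch of such an action map as a lookup table follows. The P1 actions come from the line above; the P2 and P3 entries are inferred from the tier names, and the signal identifiers are hypothetical labels for the triggers listed in the policy.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# P1 actions are from the article; P2 and P3 are inferred from the tier names.
ACTIONS: Dict[Severity, List[str]] = {
    Severity.P1: ["rollback immediately", "notify on-call", "freeze new releases"],
    Severity.P2: ["switch to degraded mode"],
    Severity.P3: ["observe and optimize"],
}

# Hypothetical signal names for the triggers listed in the 3-tier policy.
SIGNAL_SEVERITY: Dict[str, Severity] = {
    "critical_path_error_spike": Severity.P1,
    "unauthorized_action": Severity.P1,
    "sensitive_data_leak": Severity.P1,
    "sustained_latency_breach": Severity.P2,
    "budget_anomaly": Severity.P2,
    "persistent_upstream_throttling": Severity.P2,
    "non_critical_feature_instability": Severity.P3,
}


def actions_for(signal: str) -> List[str]:
    """Look up the predefined actions for an alert signal."""
    severity = SIGNAL_SEVERITY.get(signal)
    if severity is None:
        # An unmapped signal means the policy was not decided ahead of time.
        raise KeyError(f"no severity defined for signal {signal!r}")
    return ACTIONS[severity]


print(actions_for("sensitive_data_leak"))
```

Raising on an unmapped signal is a deliberate choice in this sketch: it surfaces gaps in the policy during testing rather than during an incident.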
3. Rollback: rollback readiness is trained, not declared
Many systems are “rollback-capable in theory” but fail in real incidents. Common reasons:
- rollback depends on migrated data structures,
- old configs/routes are not preserved,
- teams lack rehearsal and panic under pressure.
Run one small rollback drill each iteration:
- simulate upstream timeout,
- simulate malformed model output,
- simulate tool permission failures,
- measure recovery time against target.
Suggested SLO-style targets:
- traffic switchback within 5 minutes,
- core path recovery within 15 minutes,
- internal incident update within 30 minutes.
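To measure a drill against these targets, the check can be as simple as the sketch below. The target values come from the list above; the key names, the `evaluate_drill` helper, and the sample measurements are hypothetical.

```python
from typing import Dict, Optional

# Targets in minutes, taken from the suggested SLO-style list.
TARGETS_MIN: Dict[str, float] = {
    "traffic_switchback": 5,
    "core_path_recovery": 15,
    "internal_incident_update": 30,
}


def evaluate_drill(measured_min: Dict[str, Optional[float]]) -> Dict[str, bool]:
    """Map each target to True if it was met; a missing measurement is a miss."""
    results = {}
    for name, target in TARGETS_MIN.items():
        measured = measured_min.get(name)
        results[name] = measured is not None and measured <= target
    return results


# Made-up measurements from one drill, in minutes.
drill = {
    "traffic_switchback": 4.0,
    "core_path_recovery": 18.5,
    "internal_incident_update": 22.0,
}
report = evaluate_drill(drill)
print(report)
```

In this sample, `core_path_recovery` misses its 15-minute target, which is the kind of finding a drill is meant to surface before a real incident does.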
4. Postmortems: from blame to mechanism upgrades
A high-quality postmortem answers five questions:
- What exactly triggered the incident?
- Which guardrail should have blocked it but did not?
- Why didn’t observability trigger early enough?
- Why was rollback fast or slow?
- Which mechanism changes prevent recurrence?
Use a mechanism-first structure:
Timeline -> Decision points -> Guardrail failures -> Reinforcements -> Owner & due date
A postmortem is only complete when conclusions are encoded back into gates, tests, and alert rules.
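The completeness rule can be made checkable. The sketch below mirrors the mechanism-first structure as a record and treats a postmortem as complete only when every reinforcement has an owner, a due date, and at least one place where it was encoded. The class names, field names, and sample data are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reinforcement:
    description: str
    owner: str
    due_date: str  # ISO date string (format is an assumption)
    # Where the fix was encoded, e.g. "gate", "test", "alert_rule".
    encoded_in: List[str] = field(default_factory=list)


@dataclass
class Postmortem:
    timeline: List[str]
    decision_points: List[str]
    guardrail_failures: List[str]
    reinforcements: List[Reinforcement]

    def is_complete(self) -> bool:
        """True only if there is at least one reinforcement and each one
        has an owner, a due date, and at least one encoded mechanism."""
        return bool(self.reinforcements) and all(
            r.owner and r.due_date and r.encoded_in for r in self.reinforcements
        )


# Made-up example: the fix is agreed on but not yet encoded anywhere.
pm = Postmortem(
    timeline=["10:02 error spike", "10:09 rollback started"],
    decision_points=["chose rollback over hotfix"],
    guardrail_failures=["schema validation did not cover tool output"],
    reinforcements=[
        Reinforcement("validate tool output schema", owner="alice", due_date="2025-01-31"),
    ],
)
print(pm.is_complete())  # prints "False"

pm.reinforcements[0].encoded_in.append("test")
print(pm.is_complete())  # prints "True"
```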
5. Team operating model: convert heroics into defaults
Agentic Engineering maturity is not measured by one expert’s skill. It is measured by whether a new engineer can deliver safely in two weeks.
Institutionalize three assets:
- Runbook: who does what in incidents and in what order.
- Gate template: mandatory release checks for every change.
- Postmortem knowledge base: historical failure patterns and prevention rules.
These assets reduce quality variance caused by staffing changes.
6. Series wrap-up: from “faster output” to “stable delivery”
Across this three-part series:
- Part 1 built the requirement-to-release pipeline.
- Part 2 operationalized evals and quality gates.
- Part 3 closed the loop with operations and postmortems.
That is the practical value of Agentic Engineering for frontline engineers: not replacing human decisions, but turning those decisions into executable, testable, reusable system behavior.
If you are adopting this in your team, start with a minimal loop:
- choose one critical path,
- instrument a minimal telemetry schema,
- enforce one release gate template,
- run one rollback drill.
Once these are stable, expand to more agent-powered workflows with far lower risk.
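This is an original work, licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Please credit the source when republishing.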
