## Table of Contents

- Introduction: “works in demo” is not a release criterion
- 1. Start with a unified scorecard
- 2. Regression layering: step, workflow, end-to-end
- 3. Make “fail first” a hard gate
- 4. Cost and safety belong in the same release gate
- 5. Reusable release gate template
- 6. Common anti-patterns and corrections
- Conclusion: speed without gates is not engineering speed
## Introduction: “works in demo” is not a release criterion
In agent-based systems, the most dangerous illusion is:
“The demo works, so we can ship.”
Demos prove possibility. Production requires repeatability under noisy conditions.
This article focuses on one practical objective:
combine evals, regression discipline, and risk controls into enforceable release gates.
## 1. Start with a unified scorecard
Without a shared scorecard, each team declares success differently. The result is predictable: every team is “green,” while production quality remains unstable.
A practical scorecard for frontline teams has four dimensions:
- Correctness: does output satisfy task intent with valid evidence/context?
- Stability: how much does behavior vary across repeated runs?
- Cost: per-request tokens, tool-call fan-out, and spend drift.
- Risk: policy bypass, sensitive data leakage, unsafe auto-actions.
Define hard red lines before optimization goals.
Example thresholds (illustrative):

```
Correctness >= 90%
```
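The four-dimension scorecard can be turned into an executable gate check. A minimal sketch follows; only the correctness threshold comes from the example above, while the stability, cost, and risk thresholds are hypothetical placeholders you would replace with your team's red lines:

```python
# Minimal scorecard gate. Only the correctness threshold is from the
# article's example; the other three values are hypothetical placeholders.
THRESHOLDS = {
    "correctness": 0.90,   # from the illustrative example above
    "stability": 0.85,     # hypothetical: min pass-rate across repeated runs
    "cost_delta": 0.10,    # hypothetical: max spend increase vs. baseline
    "risk_violations": 0,  # hypothetical: zero tolerated policy violations
}

def scorecard_failures(metrics: dict) -> list:
    """Return the list of failed dimensions; an empty list means green."""
    failures = []
    if metrics["correctness"] < THRESHOLDS["correctness"]:
        failures.append("correctness")
    if metrics["stability"] < THRESHOLDS["stability"]:
        failures.append("stability")
    if metrics["cost_delta"] > THRESHOLDS["cost_delta"]:
        failures.append("cost_delta")
    if metrics["risk_violations"] > THRESHOLDS["risk_violations"]:
        failures.append("risk_violations")
    return failures

# Example: a run that is correct and stable but too expensive
# fails the gate on exactly one dimension.
run = {"correctness": 0.93, "stability": 0.90,
       "cost_delta": 0.25, "risk_violations": 0}
```

Returning the list of failed dimensions, rather than a bare boolean, makes gate reports actionable: the release note names which red line was crossed.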
## 2. Regression layering: step, workflow, end-to-end

Teams that rely on a single test layer usually miss important failures. A safer pattern is to run three layers together:

### 2.1 Step-level tests

Validate local logic: prompt templates, retrieval quality, parameter mapping.

### 2.2 Workflow-level tests

Validate multi-step collaboration: planning -> tool use -> synthesis -> final response.

### 2.3 End-to-end tests

Replay realistic requests against full dependencies to validate production behavior.
Practical role split:
- Step-level gives fast feedback.
- Workflow-level protects orchestration integrity.
- E2E confirms release readiness.
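One way to operationalize this split is an ordered runner that executes cheap layers first, so a step-level failure short-circuits before any expensive E2E replay. The sketch below is a hypothetical minimal harness, not a specific framework; the registered tests are placeholders:

```python
# Ordered three-layer runner: fast feedback first, E2E last.
LAYERS = ["step", "workflow", "e2e"]
SUITES = {layer: [] for layer in LAYERS}

def regression_test(layer):
    """Decorator registering a test function under one layer."""
    def register(fn):
        SUITES[layer].append(fn)
        return fn
    return register

def run_layers():
    """Run layers in order; stop at the first layer with failures.
    Returns (failing_layer, failed_test_names) or (None, [])."""
    for layer in LAYERS:
        failed = [fn.__name__ for fn in SUITES[layer] if not fn()]
        if failed:
            return layer, failed
    return None, []

@regression_test("step")
def prompt_template_fully_rendered():
    # Placeholder step check: no unfilled slots left in the rendered prompt.
    rendered = "What is 2+2?"
    return "{question}" not in rendered

@regression_test("workflow")
def plan_tool_synthesis_order_preserved():
    # Placeholder workflow check: orchestration trace has the expected phases.
    trace = ["plan", "tool", "synthesis", "final"]
    return trace == sorted(trace, key=["plan", "tool", "synthesis", "final"].index)
```

In practice the same ordering falls out of CI stages; the point is that a red step layer should block the workflow and E2E stages from running at all.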
## 3. Make “fail first” a hard gate
The key difference in Agentic Engineering is not “more tests.” It is tests-first sequencing.
Recommended gate order:

```
(1) Baseline green
```
Why this works:
- Prevents memory-based testing (“I think it worked before”).
- Reduces accidental green from flaky runs.
- Protects long-term test assets from shortcut edits.
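The fail-first rule itself can be asserted mechanically: a new regression test must be red against the unchanged baseline (proving it actually detects the defect) before it is allowed to go green on the candidate. A minimal sketch, with a hypothetical defect for illustration:

```python
def fail_first_gate(test, baseline, candidate):
    """Enforce tests-first sequencing:
    the new test must FAIL on the unchanged baseline (it detects the defect)
    and PASS on the candidate (the fix works)."""
    if test(baseline):
        raise RuntimeError("test is green on baseline: it cannot detect the defect")
    if not test(candidate):
        raise RuntimeError("test is still red on candidate: fix is incomplete")
    return True

# Hypothetical defect: the baseline truncates tool output; the candidate does not.
baseline = lambda s: s[:5]
candidate = lambda s: s
new_test = lambda impl: impl("tool output") == "tool output"
```

A test that is green on the baseline is exactly the "accidental green" case above: it proves nothing about the fix, so the gate rejects it.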
## 4. Cost and safety belong in the same release gate
Many teams delay budget and safety checks to “later governance.” That is costly.
### 4.1 Budget gates

At minimum:

- Request-level constraints (max_tokens, latency cap).
- Session-level alerts (automatic downgrade on threshold breach).
- Release-level comparison (a new version cannot introduce unexplained cost inflation).
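All three budget levels reduce to small, testable predicates. The sketch below uses hypothetical limits (the caps and the 10% inflation tolerance are placeholders, not values from this article):

```python
# Hypothetical budget limits; replace with your team's real numbers.
MAX_TOKENS_PER_REQUEST = 4096       # request-level cap
SESSION_SPEND_ALERT_USD = 2.00      # session-level alert threshold
MAX_RELEASE_COST_INFLATION = 0.10   # candidate may cost at most 10% more

def request_within_budget(tokens: int) -> bool:
    """Request-level constraint: reject before the call, not after the bill."""
    return tokens <= MAX_TOKENS_PER_REQUEST

def session_action(spend_usd: float) -> str:
    """Session-level alert: automatic downgrade on threshold breach."""
    return "downgrade" if spend_usd > SESSION_SPEND_ALERT_USD else "ok"

def release_cost_ok(baseline_cost: float, candidate_cost: float) -> bool:
    """Release-level comparison: block unexplained cost inflation."""
    return candidate_cost <= baseline_cost * (1 + MAX_RELEASE_COST_INFLATION)
```

The release-level check is the one most often skipped: per-request caps can all pass while tool-call fan-out quietly doubles the total spend of a version.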
### 4.2 Safety gates
At minimum:
- Access denied flows must fail safely.
- Prompt-injection probes must be blocked.
- Sensitive fields must be redacted.
- High-risk tool actions must require approval/confirmation.
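Each of these safety requirements can be expressed as an executable assertion rather than a policy document. A minimal sketch; the sensitive field (email), the canary marker, and the high-risk tool names are all hypothetical examples:

```python
import re

def redact(text: str) -> str:
    """Redact a hypothetical sensitive field (email addresses) before
    logging or returning output."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)

def injection_blocked(agent_reply: str) -> bool:
    """Prompt-injection probe: plant a canary secret in the injected
    instruction; the agent must never echo it back."""
    return "CANARY-SECRET" not in agent_reply

def requires_approval(tool_name: str) -> bool:
    """High-risk tool actions must route through human approval."""
    HIGH_RISK_TOOLS = {"delete_records", "send_payment"}  # hypothetical list
    return tool_name in HIGH_RISK_TOOLS
```

Wired into the regression suite, these checks make "non-negotiable risks" block releases automatically instead of relying on review-time vigilance.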
For frontline engineers, the practical takeaway is simple:
convert non-negotiable risks into executable assertions.
## 5. Reusable release gate template

```
Release is allowed only if:
```
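Whatever concrete conditions your template lists, the enforcement shape is an all-of-gates conjunction. A minimal sketch, with gate names assumed from the sections above rather than taken from a specific tool:

```python
def release_allowed(report: dict) -> bool:
    """Release is allowed only if every gate is green.
    Gate names here are assumptions drawn from this article's sections:
    scorecard red lines, the three regression layers, budget delta,
    and safety probes."""
    gates = [
        report["scorecard_red_lines_pass"],
        report["step_tests_pass"],
        report["workflow_tests_pass"],
        report["e2e_tests_pass"],
        report["cost_delta_within_budget"],
        report["safety_probes_pass"],
    ]
    return all(gates)

green_report = {
    "scorecard_red_lines_pass": True,
    "step_tests_pass": True,
    "workflow_tests_pass": True,
    "e2e_tests_pass": True,
    "cost_delta_within_budget": True,
    "safety_probes_pass": True,
}
```

Using `all()` over an explicit list keeps the template honest: adding a new gate means adding a line here, and a missing key fails loudly instead of silently passing.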
Suggested role ownership:
- Developer: change set + self-verification evidence.
- Reviewer: gate completeness.
- Release owner: canary decision and launch control.
- On-call: production monitoring and fast rollback readiness.
## 6. Common anti-patterns and corrections

### Anti-pattern 1: tracking correctness only

Correction: include repeatability and variance metrics.

### Anti-pattern 2: functional regression without budget regression

Correction: add a cost delta table to release reports.

### Anti-pattern 3: safety controls only at gateway level

Correction: encode risky scenarios as mandatory regression tests.

### Anti-pattern 4: treating gates as bureaucracy

Correction: visualize prevented incidents and rework to show value.
## Conclusion: speed without gates is not engineering speed
The quality core of Agentic Engineering is replacing pre-release luck with pre-release evidence.
In Part 3, we complete the loop:
how observability, rollback discipline, and postmortems determine long-term system stability.
This work is original and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Please credit the source when reposting.
