Table of Contents#

Introduction: “works in demo” is not a release criterion#

In agent-based systems, the most dangerous illusion is:

“The demo works, so we can ship.”

Demos prove possibility. Production requires repeatability under noisy conditions.

This article focuses on one practical objective:
combine evals, regression discipline, and risk controls into enforceable release gates.

1. Start with a unified scorecard#

Without a shared scorecard, each team declares success differently. The result is predictable: every team is “green,” while production quality remains unstable.

A practical scorecard for frontline teams has four dimensions:

  1. Correctness: does output satisfy task intent with valid evidence/context?
  2. Stability: how much does behavior vary across repeated runs?
  3. Cost: per-request tokens, tool-call fan-out, and spend drift.
  4. Risk: policy bypass, sensitive data leakage, unsafe auto-actions.

Define hard red lines before optimization goals.

Example thresholds (illustrative):

1
2
3
4
5
Correctness >= 90%
Critical-path failure rate <= 2%
p95 latency <= 3s
Per-request cost <= budget cap
Unsafe high-risk action count = 0

2. Regression layering: step, workflow, end-to-end#

Teams that rely on a single test layer usually miss important failures. A safer pattern is to run three layers together:

2.1 Step-level tests#

Validate local logic: prompt templates, retrieval quality, parameter mapping.

2.2 Workflow-level tests#

Validate multi-step collaboration: planning -> tool use -> synthesis -> final response.

2.3 End-to-end tests#

Replay realistic requests against full dependencies to validate production behavior.

Practical role split:

  • Step-level gives fast feedback.
  • Workflow-level protects orchestration integrity.
  • E2E confirms release readiness.

3. Make “fail first” a hard gate#

The key difference in Agentic Engineering is not “more tests.” It is tests-first sequencing.

Recommended gate order:

1
2
3
4
5
6
7
(1) Baseline green
(2) Add/update tests
(3) Capture first failure
(4) Minimal implementation
(5) Run affected + full verification
(6) Pass twice consecutively
(7) Only then enter release

Why this works:

  • Prevents memory-based testing (“I think it worked before”).
  • Reduces accidental green from flaky runs.
  • Protects long-term test assets from shortcut edits.

4. Cost and safety belong in the same release gate#

Many teams delay budget and safety checks to “later governance.” That is costly.

4.1 Budget gates#

At minimum:

  • Request-level constraints (max_tokens, latency cap).
  • Session-level alerts (automatic downgrade on threshold breach).
  • Release-level comparison (new version cannot introduce unexplained cost inflation).

4.2 Safety gates#

At minimum:

  • Access denied flows must fail safely.
  • Prompt-injection probes must be blocked.
  • Sensitive fields must be redacted.
  • High-risk tool actions must require approval/confirmation.

For frontline engineers, the practical takeaway is simple:
convert non-negotiable risks into executable assertions.

5. Reusable release gate template#

1
2
3
4
5
6
7
8
9
10
Release is allowed only if:
- Baseline: pass
- New tests: added
- First-fail evidence: captured
- Post-fix affected tests: pass
- Full verification: pass
- Double-run consistency: pass
- Budget thresholds: pass
- Safety thresholds: pass
- Rollback plan: verified

Suggested role ownership:

  • Developer: change set + self-verification evidence.
  • Reviewer: gate completeness.
  • Release owner: canary decision and launch control.
  • On-call: production monitoring and fast rollback readiness.

6. Common anti-patterns and corrections#

Anti-pattern 1: tracking correctness only#

Correction: include repeatability and variance metrics.

Anti-pattern 2: functional regression without budget regression#

Correction: add cost delta table to release reports.

Anti-pattern 3: safety controls only at gateway level#

Correction: encode risky scenarios as mandatory regression tests.

Anti-pattern 4: treating gates as bureaucracy#

Correction: visualize prevented incidents and rework to show value.

Conclusion: speed without gates is not engineering speed#

The quality core of Agentic Engineering is replacing pre-release luck with pre-release evidence.

In Part 3, we complete the loop:
how observability, rollback discipline, and postmortems determine long-term system stability.


本作品系原创,采用知识共享署名-非商业性使用-禁止演绎 4.0 国际许可协议进行许可,转载请注明出处。