Regression monitoring is the practice of running Test Runs repeatedly as your configuration and models evolve, then using comparisons to catch quality regressions early.
A practical regression workflow
- Build one “golden” set per Intent Group
  - Keep it focused: a smaller, high-signal set is easier to maintain and reason about.
  - Mark it as Golden when appropriate.
- Create a Test Run for every meaningful change
  - Examples: a new system prompt revision, a model swap, a temperature change, or after a fine-tune.
  - Use descriptions that make comparisons easy later (e.g. “prompt v4”, “temp 0.0”, “model X”, “post-finetune run”).
- Compare runs instead of trusting a single metric
  - Use Compare Runs on the Test Set to see request-by-request score shifts across runs (see the sketch after this list).
  - Use Compare on a single request to track how that specific scenario changed over time.
- Focus attention where it matters
  - Click into low buckets in the score distribution (e.g. “Poor”, “Fair”) to quickly identify regressions.
  - Use the criteria breakdown tooltips to understand which criterion degraded.
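If you can export run results (for example as JSON), the same request-by-request comparison can be scripted outside the UI. The sketch below is illustrative only: the export format, the `request_id` and `score` field names, the drop threshold, and the file names are assumptions, not part of the product.

```python
# Minimal sketch: flag per-request score drops between two exported Test Runs.
# Assumes each export is a JSON list of {"request_id": ..., "score": ...} records;
# the export format, field names, and file names below are hypothetical.
import json


def load_scores(path):
    with open(path) as f:
        return {row["request_id"]: row["score"] for row in json.load(f)}


def score_shifts(baseline_path, candidate_path, threshold=0.1):
    baseline = load_scores(baseline_path)
    candidate = load_scores(candidate_path)
    regressions = []
    for request_id, old_score in baseline.items():
        new_score = candidate.get(request_id)
        if new_score is None:
            continue  # request not present in the newer run
        if old_score - new_score >= threshold:
            regressions.append((request_id, old_score, new_score))
    # Largest drops first, so the worst regressions surface at the top.
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


if __name__ == "__main__":
    for request_id, old, new in score_shifts("prompt_v3.json", "prompt_v4.json"):
        print(f"{request_id}: {old:.2f} -> {new:.2f}")
```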
Common “gotchas” when monitoring regressions
- Error % matters: a run with a higher pass rate but also a higher error rate can still be a regression (see the sketch after this list).
- Latency tradeoffs: track response time percentiles (e.g. p50/p95) when comparing configurations.
- Coverage drift: as your product evolves, add new real-world failure cases to the Test Set so regressions don’t hide in untested corners.
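To make the first two gotchas concrete, here is a small, illustrative Python sketch that summarizes a run by pass rate, error rate, and latency percentiles. The per-request record shape (`status`, `latency_ms`) is hypothetical; the point is that run B looks better on pass rate over scored requests while being worse once errors are counted.

```python
# Minimal sketch: why a higher pass rate can still be a regression.
# Each run is assumed to be a list of per-request records with hypothetical
# "status" ("pass" / "fail" / "error") and "latency_ms" fields.


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    idx = round(pct / 100 * (len(ordered) - 1))
    return ordered[idx]


def summarize(run):
    total = len(run)
    errors = sum(1 for r in run if r["status"] == "error")
    passes = sum(1 for r in run if r["status"] == "pass")
    latencies = [r["latency_ms"] for r in run]
    return {
        # Pass rate over scored (non-error) requests only: the flattering number.
        "pass_rate_scored": passes / (total - errors) if total > errors else 0.0,
        # Pass rate over all requests: errors count against the run.
        "pass_rate_all": passes / total,
        "error_rate": errors / total,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
    }


# Example: run B "passes more" among scored requests but errors far more often,
# so its pass rate over all requests is actually lower than run A's.
run_a = [{"status": "pass", "latency_ms": 900}] * 80 + \
        [{"status": "fail", "latency_ms": 900}] * 18 + \
        [{"status": "error", "latency_ms": 100}] * 2
run_b = [{"status": "pass", "latency_ms": 1400}] * 72 + \
        [{"status": "fail", "latency_ms": 1400}] * 8 + \
        [{"status": "error", "latency_ms": 100}] * 20

print(summarize(run_a))  # pass_rate_scored ≈ 0.82, pass_rate_all = 0.80, error_rate = 0.02
print(summarize(run_b))  # pass_rate_scored = 0.90, pass_rate_all = 0.72, error_rate = 0.20
```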
Keeping Test Sets healthy
- Continuously add new failure modes discovered in Sessions/Requests via “Add to Test Set”.
- Tag requests consistently so you can quickly spot which categories are regressing (see the sketch below).
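As a rough illustration of tag-driven triage, the sketch below groups regressing requests by tag so the worst categories surface first. The record shape (`request_id`, `tags`, `score_delta`) and the threshold are hypothetical and stand in for whatever comparison output you have on hand.

```python
# Minimal sketch: group per-request score drops by tag to see which categories regress.
# Assumes each compared request has hypothetical "request_id", "tags", and
# "score_delta" (new score minus old score) fields.
from collections import defaultdict


def regressions_by_tag(compared_requests, drop_threshold=-0.1):
    buckets = defaultdict(list)
    for req in compared_requests:
        if req["score_delta"] <= drop_threshold:
            for tag in req["tags"]:
                buckets[tag].append(req["request_id"])
    # Tags with the most regressing requests first.
    return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)


compared = [
    {"request_id": "r1", "tags": ["billing"], "score_delta": -0.4},
    {"request_id": "r2", "tags": ["billing", "refunds"], "score_delta": -0.2},
    {"request_id": "r3", "tags": ["onboarding"], "score_delta": 0.1},
]
for tag, ids in regressions_by_tag(compared):
    print(f"{tag}: {len(ids)} regressing request(s) -> {ids}")
```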