Regression monitoring is the practice of running Test Runs repeatedly as your configuration and models evolve, then using comparisons to catch quality regressions early.
A practical regression workflow
- Build one “golden” set per Intent Group
  - Keep it focused: a smaller, high-signal set is easier to maintain and reason about.
  - Mark it as Golden when appropriate.
- Create a Test Run for every meaningful change
  - Examples: a new system prompt revision, a model swap, a temperature change, or after a fine-tune.
  - Use descriptions that make comparisons easy later (e.g. “prompt v4”, “temp 0.0”, “model X”, “post-finetune run”).
- Compare runs instead of trusting a single metric
  - Use Compare Runs on the Test Set to see request-by-request score shifts across runs (see the sketch after this list).
  - Use Compare on a single request to track how that specific scenario changed over time.
- Focus attention where it matters
  - Click into low buckets in the score distribution (e.g. “Poor”, “Fair”) to quickly identify regressions.
  - Use the criteria breakdown tooltips to understand which criterion degraded.
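If you can export run results (for example as JSON), the same request-by-request comparison can be scripted outside the UI. The sketch below is illustrative only: the export format, the `request_id` and `score` field names, the drop threshold, and the file names are assumptions, not part of the product.

```python
# Minimal sketch: flag per-request score drops between two exported Test Runs.
# Assumes each export is a JSON list of {"request_id": ..., "score": ...} records;
# the export format, field names, and file names below are hypothetical.
import json


def load_scores(path):
    with open(path) as f:
        return {row["request_id"]: row["score"] for row in json.load(f)}


def score_shifts(baseline_path, candidate_path, threshold=0.1):
    baseline = load_scores(baseline_path)
    candidate = load_scores(candidate_path)
    regressions = []
    for request_id, old_score in baseline.items():
        new_score = candidate.get(request_id)
        if new_score is None:
            continue  # request not present in the newer run
        if old_score - new_score >= threshold:
            regressions.append((request_id, old_score, new_score))
    # Largest drops first, so the worst regressions surface at the top.
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


if __name__ == "__main__":
    for request_id, old, new in score_shifts("prompt_v3.json", "prompt_v4.json"):
        print(f"{request_id}: {old:.2f} -> {new:.2f}")
```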
Common “gotchas” when monitoring regressions
- Error % matters: a run with a higher pass rate but also a higher error rate can still be a regression (see the sketch after this list).
- Latency tradeoffs: track response time percentiles (e.g. p50/p95) when comparing configurations.
- Coverage drift: as your product evolves, add new real-world failure cases to the Test Set so regressions don’t hide in untested corners.
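To make the first two gotchas concrete, here is a small, illustrative Python sketch that summarizes a run by pass rate, error rate, and latency percentiles. The per-request record shape (`status`, `latency_ms`) is hypothetical; the point is that run B looks better on pass rate over scored requests while being worse once errors are counted.

```python
# Minimal sketch: why a higher pass rate can still be a regression.
# Each run is assumed to be a list of per-request records with hypothetical
# "status" ("pass" / "fail" / "error") and "latency_ms" fields.


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    idx = round(pct / 100 * (len(ordered) - 1))
    return ordered[idx]


def summarize(run):
    total = len(run)
    errors = sum(1 for r in run if r["status"] == "error")
    passes = sum(1 for r in run if r["status"] == "pass")
    latencies = [r["latency_ms"] for r in run]
    return {
        # Pass rate over scored (non-error) requests only: the flattering number.
        "pass_rate_scored": passes / (total - errors) if total > errors else 0.0,
        # Pass rate over all requests: errors count against the run.
        "pass_rate_all": passes / total,
        "error_rate": errors / total,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
    }


# Example: run B "passes more" among scored requests but errors far more often,
# so its pass rate over all requests is actually lower than run A's.
run_a = [{"status": "pass", "latency_ms": 900}] * 80 + \
        [{"status": "fail", "latency_ms": 900}] * 18 + \
        [{"status": "error", "latency_ms": 100}] * 2
run_b = [{"status": "pass", "latency_ms": 1400}] * 72 + \
        [{"status": "fail", "latency_ms": 1400}] * 8 + \
        [{"status": "error", "latency_ms": 100}] * 20

print(summarize(run_a))  # pass_rate_scored ≈ 0.82, pass_rate_all = 0.80, error_rate = 0.02
print(summarize(run_b))  # pass_rate_scored = 0.90, pass_rate_all = 0.72, error_rate = 0.20
```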
Keeping Test Sets healthy
- Continuously add new failure modes discovered in Sessions/Requests via “Add to Test Set”.
- Tag requests consistently so you can quickly spot which categories are regressing (see the sketch below).
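As a rough illustration of tag-driven triage, the sketch below groups regressing requests by tag so the worst categories surface first. The record shape (`request_id`, `tags`, `score_delta`) and the threshold are hypothetical and stand in for whatever comparison output you have on hand.

```python
# Minimal sketch: group per-request score drops by tag to see which categories regress.
# Assumes each compared request has hypothetical "request_id", "tags", and
# "score_delta" (new score minus old score) fields.
from collections import defaultdict


def regressions_by_tag(compared_requests, drop_threshold=-0.1):
    buckets = defaultdict(list)
    for req in compared_requests:
        if req["score_delta"] <= drop_threshold:
            for tag in req["tags"]:
                buckets[tag].append(req["request_id"])
    # Tags with the most regressing requests first.
    return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)


compared = [
    {"request_id": "r1", "tags": ["billing"], "score_delta": -0.4},
    {"request_id": "r2", "tags": ["billing", "refunds"], "score_delta": -0.2},
    {"request_id": "r3", "tags": ["onboarding"], "score_delta": 0.1},
]
for tag, ids in regressions_by_tag(compared):
    print(f"{tag}: {len(ids)} regressing request(s) -> {ids}")
```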