Regression monitoring is the practice of running Test Runs repeatedly as your configuration and models evolve, then using comparisons to catch quality regressions early.

A practical regression workflow

  • Build one “golden” set per Intent Group
    • Keep it focused: a smaller, high-signal set is easier to maintain and reason about.
    • Mark it as Golden when appropriate.
  • Create a Test Run for every meaningful change
    • Examples: a new system prompt revision, a model swap, a temperature change, or after a fine-tune.
    • Use descriptions that make comparisons easy later (e.g. “prompt v4”, “temp 0.0”, “model X”, “post-finetune run”).
  • Compare runs instead of trusting a single metric
    • Use Compare Runs on the Test Set to see request-by-request score shifts across runs (see the sketch after this list).
    • Use Compare on a single request to track how that specific scenario changed over time.
  • Focus attention where it matters
    • Click into low buckets in the score distribution (e.g. “Poor”, “Fair”) to quickly identify regressions.
    • Use the criteria breakdown tooltips to understand which criterion degraded.
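
The Compare Runs view shows these request-by-request shifts in the UI. If you also export run results for your own tooling, the sketch below shows the same idea in plain Python: flag every request whose score dropped by more than a threshold. The export format, the "request_id"/"score" fields, and the threshold are illustrative assumptions, not the product's API.

```python
# Minimal sketch: request-by-request comparison between two exported runs.
# Assumes each run is a list of dicts with hypothetical "request_id" and
# "score" fields -- an assumption for illustration, not the product's schema.

def compare_runs(baseline, candidate, threshold=0.1):
    """Return requests whose score dropped by more than `threshold`."""
    baseline_scores = {r["request_id"]: r["score"] for r in baseline}
    regressions = []
    for result in candidate:
        before = baseline_scores.get(result["request_id"])
        if before is None:
            continue  # new request, nothing to compare against
        delta = result["score"] - before
        if delta < -threshold:
            regressions.append((result["request_id"], before, result["score"], delta))
    # Worst drops first, so reviewers can start with the biggest regressions.
    return sorted(regressions, key=lambda r: r[3])


if __name__ == "__main__":
    prompt_v3 = [{"request_id": "req-1", "score": 0.90}, {"request_id": "req-2", "score": 0.80}]
    prompt_v4 = [{"request_id": "req-1", "score": 0.60}, {"request_id": "req-2", "score": 0.85}]
    for req_id, before, after, delta in compare_runs(prompt_v3, prompt_v4):
        print(f"{req_id}: {before:.2f} -> {after:.2f} ({delta:+.2f})")
```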

Common “gotchas” when monitoring regressions

  • Error % matters: a run with a higher pass rate but also a higher error rate can still be a net regression (see the sketch after this list).
  • Latency tradeoffs: track response time percentiles when comparing configurations.
  • Coverage drift: as your product evolves, add new real-world failure cases to the Test Set so regressions don’t hide in untested corners.
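
To make these tradeoffs concrete, the sketch below checks pass rate, error rate, and latency together instead of any single metric. The `RunStats` fields and the thresholds are illustrative assumptions for your own tooling, not values defined by the product.

```python
# Minimal sketch: a regression check that weighs pass rate, error rate,
# and latency together. Fields and thresholds are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class RunStats:
    pass_rate: float       # fraction of requests scored as passing
    error_rate: float      # fraction of requests that errored out
    p95_latency_ms: float  # 95th percentile response time


def is_regression(baseline: RunStats, candidate: RunStats,
                  max_error_increase=0.02, max_latency_increase_ms=500) -> bool:
    """Flag the candidate if any tracked dimension got meaningfully worse."""
    if candidate.pass_rate < baseline.pass_rate:
        return True
    # A higher pass rate can still hide a regression if more requests errored.
    if candidate.error_rate > baseline.error_rate + max_error_increase:
        return True
    # Latency tradeoffs: a big jump in p95 is a regression even if scores held.
    if candidate.p95_latency_ms > baseline.p95_latency_ms + max_latency_increase_ms:
        return True
    return False


prompt_v3 = RunStats(pass_rate=0.88, error_rate=0.01, p95_latency_ms=1800)
prompt_v4 = RunStats(pass_rate=0.91, error_rate=0.06, p95_latency_ms=1900)
print(is_regression(prompt_v3, prompt_v4))  # True: error rate jumped despite a higher pass rate
```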

Keeping Test Sets healthy

  • Continuously add new failure modes discovered in Sessions/Requests via “Add to Test Set”.
  • Tag requests consistently so you can quickly spot which categories are regressing (see the sketch below).
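
If you export results with their tags, a simple aggregation like the one below surfaces which categories moved the most between two runs. The "tags" and "score" fields are illustrative assumptions about an export format, not the product's schema.

```python
# Minimal sketch: average score change per tag between two exported runs.
# "request_id", "score", and "tags" fields are illustrative assumptions.
from collections import defaultdict
from statistics import mean


def score_delta_by_tag(baseline, candidate):
    """Average score change per tag between two runs, most-regressed first."""
    baseline_scores = {r["request_id"]: r["score"] for r in baseline}
    deltas = defaultdict(list)
    for result in candidate:
        before = baseline_scores.get(result["request_id"])
        if before is None:
            continue
        for tag in result.get("tags", []):
            deltas[tag].append(result["score"] - before)
    return sorted(((tag, mean(d)) for tag, d in deltas.items()), key=lambda x: x[1])


baseline = [{"request_id": "req-1", "score": 0.90, "tags": ["billing"]},
            {"request_id": "req-2", "score": 0.80, "tags": ["refunds"]}]
candidate = [{"request_id": "req-1", "score": 0.50, "tags": ["billing"]},
             {"request_id": "req-2", "score": 0.85, "tags": ["refunds"]}]
for tag, delta in score_delta_by_tag(baseline, candidate):
    print(f"{tag}: {delta:+.2f}")
```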