Test Sets let you curate a stable collection of real requests so you can benchmark changes (model configuration changes, prompt changes, and fine-tuned models) consistently over time. In the Portal, Test Sets are organized by Intent Group.

Where to find Test Sets

You can access Test Sets from two places:
  • Global view: Test > Test Sets shows all Test Sets across your Intent Groups.
  • Intent Group view: open an Intent Group, then go to its Test Sets tab to work with the Test Sets scoped to that Intent Group.

Create a new Test Set (wizard)

From either Test Sets view, click New Test Set. The creation flow is a short wizard:
  • 1) Select an Intent Group
    • A Test Set belongs to exactly one Intent Group.
  • 2) Add details
    • Name is required.
    • Description is optional.
  • 3) Select requests
    • The table lists the eligible requests for that Intent Group (requests that can be added to the Test Set).
    • Use the checkbox column to select one or more requests.
    • Add tags to selected requests to make later analysis easier.
    • Use View to inspect a request before you add it.
  • 4) Review
    • Confirm the name/description and the number of selected requests.
    • Optionally mark the Test Set as Golden.
The UI describes Golden Test Sets as the sets used to evaluate model performance after fine-tuning. Keep Golden sets small, high-signal, and well-maintained.
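To make what the wizard produces concrete, here is a minimal sketch of the information a Test Set ends up holding, assuming a simple dictionary shape. The field names below are illustrative only, not the Portal's actual schema.

```python
# Illustrative only: the shape of a Test Set as the wizard assembles it.
# Field names are assumptions for explanation, not the Portal's real schema.
test_set = {
    "intent_group": "billing_support",        # step 1: exactly one Intent Group
    "name": "Refund edge cases",              # step 2: required
    "description": "Tricky refund requests",  # step 2: optional
    "golden": False,                          # step 4: optional Golden flag
    "requests": [                             # step 3: selected eligible requests
        {"request_id": "req_001", "tags": ["edge_case", "billing"]},
        {"request_id": "req_002", "tags": ["spanish"]},
    ],
}
```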
If a request’s response is almost correct (or totally wrong), you can set a “ground truth” answer for the Test Set by manually editing the request’s final assistant message before you add it. This is useful when you want your Test Set to encode what should have been said, not just what the model happened to output at the time.
How it works in the Portal (a conceptual sketch follows the list):
  • The Portal lets you edit only the last assistant message for a request.
  • When you save, the edited content is stored on the request as a Test-Set-specific override (_maitai_test_request_response), and is used as the “original / expected” response in Test Run comparisons.
  • This does not change the original production request. It only changes the baseline used for this test request.
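A minimal sketch of that behavior, assuming a dictionary-shaped test request; only the field name _maitai_test_request_response comes from the Portal, and everything else here is an illustrative assumption.

```python
# Illustrative only: how a ground-truth override conceptually sits on a test
# request. Only the field name "_maitai_test_request_response" comes from the
# Portal; the surrounding structure is assumed for the example.
test_request = {
    "request_id": "req_001",
    "messages": [
        {"role": "user", "content": "Can I return an opened item?"},
        # Original model output, left untouched on the production request:
        {"role": "assistant", "content": "No, opened items cannot be returned."},
    ],
    # Test-Set-specific override saved from the Portal edit:
    "_maitai_test_request_response": {
        "role": "assistant",
        "content": "Opened items can be returned within 30 days for store credit.",
    },
}

def expected_response(req: dict) -> tuple:
    """Return the baseline used in Test Run comparisons and whether it was
    corrected (corrected baselines are marked with an asterisk in the UI)."""
    override = req.get("_maitai_test_request_response")
    if override is not None:
        return override["content"], True
    return req["messages"][-1]["content"], False
```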
Steps:
  1. In the Eligible Requests table, hover over View to open the request preview.
  2. Find the last assistant message (this is the only editable message).
  3. Edit the message content (and, if applicable, adjust tool calls).
  4. Click Save.
  5. Proceed to tag and select the request as usual.
In response comparison views, an asterisk (*) indicates that the “Original Response” is the corrected response from the Test Set, not the original model output.

Add a single request to a Test Set (from Observe pages)

The fastest way to build a Test Set is often “promoting” interesting requests you’ve already observed:
  • From a Session page: on each request row, click the + button to open Add to Test Set.
  • From a Request page: click Add to Test Set near the top of the request details.
From the Add to Test Set menu you can:
  • Pick an existing Test Set for that request’s action/intent context.
  • Create a new Test Set inline (name plus optional description), then add the request to it.
  • Add tags while adding the request (the UI enforces a maximum of 5 tags in this flow; see the sketch below).
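As a rough sketch of what this flow collects, the helper below mirrors the choices in the menu. The function name, payload shape, and validation are assumptions made for illustration, not a documented Maitai API.

```python
# Illustrative only: the choices the Add to Test Set flow gathers. This helper
# and its payload shape are assumptions for explanation, not a Maitai API.
from typing import Optional

MAX_TAGS_IN_FLOW = 5  # the UI enforces at most 5 tags in this flow

def build_add_to_test_set_payload(
    request_id: str,
    existing_test_set_id: Optional[str] = None,
    new_test_set_name: Optional[str] = None,
    new_test_set_description: str = "",
    tags: Optional[list] = None,
) -> dict:
    """Either target an existing Test Set or create one inline, then attach the request."""
    tags = tags or []
    if (existing_test_set_id is None) == (new_test_set_name is None):
        raise ValueError("Choose exactly one: an existing Test Set or a new Test Set name.")
    if len(tags) > MAX_TAGS_IN_FLOW:
        raise ValueError(f"This flow allows at most {MAX_TAGS_IN_FLOW} tags.")
    return {
        "request_id": request_id,
        "test_set": existing_test_set_id
        or {"name": new_test_set_name, "description": new_test_set_description},
        "tags": tags,
    }
```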

Manage an existing Test Set

Open any Test Set to see three key areas:
  • Overview
    • Core metadata (created date, request count, Golden status, description)
    • Tag distribution (a quick breakdown of how many requests have each tag)
    • Semantic distribution (a chart view of the Test Set; you can add layers from within the chart)
  • Test Runs
  • Requests
    • Browse the requests currently included in the Test Set
Click Edit on a Test Set to:
  • Change the name, description, and Golden flag
  • Bulk add eligible requests to the Test Set
  • Bulk remove requests from the Test Set

Practical guidance

  • Prefer real production requests: Test Sets are strongest when they mirror what users actually do.
  • Tag intentionally: Use tags to group scenarios you’ll want to scan later (e.g. edge_case, billing, spanish, tool_calling).
  • Keep sets “evergreen”: Regularly add newly discovered failure modes, and remove stale scenarios that no longer reflect expected behavior.