Create evaluations
Evaluations are automatically created by the Chat endpoint. The model output is what is evaluated, with the request used as context.

If a callback is provided, the evaluation is passed to that function, and inference is not affected.

If no callback is provided and stream is true, then the evaluation is available on the last chunk.

If no callback is provided and stream is false or None, then the evaluation can be found on the completion response.
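For the non-callback paths, a rough sketch of how the evaluation might be read is shown below. It assumes the streamed response is an async iterator (as with OpenAI-style clients) and that the evaluation is exposed on the last chunk and on the completion object; the evaluate_response attribute name used here is only a placeholder for illustration, not confirmed by these docs.

import maitai

maitai_client = maitai.MaitaiAsync()

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Generate numbers 1-10"},
]

# Streaming: the evaluation arrives on the last chunk.
stream = await maitai_client.chat.completions.create(
    messages=messages,
    model="gpt-4o",
    session_id="YOUR_SESSION_ID",
    intent="NUMBER_GENERATOR",
    application="demo_app",
    stream=True,
)
last_chunk = None
async for chunk in stream:
    last_chunk = chunk
# "evaluate_response" is a placeholder attribute name for illustration only.
evaluation = getattr(last_chunk, "evaluate_response", None)

# Non-streaming, no callback: the evaluation is attached to the completion response.
response = await maitai_client.chat.completions.create(
    messages=messages,
    model="gpt-4o",
    session_id="YOUR_SESSION_ID",
    intent="NUMBER_GENERATOR",
    application="demo_app",
)
evaluation = getattr(response, "evaluate_response", None)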
Evaluate Response
import maitai
from maitai import types as maitai_types

async def maitai_callback(eval_response: maitai_types.EvaluateResponse):
    # Handle the evaluate response
    pass

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Generate numbers 1-10"},
]

maitai_client = maitai.MaitaiAsync()

response = await maitai_client.chat.completions.create(
    messages=messages,
    model="gpt-4o",
    session_id="YOUR_SESSION_ID",
    intent="NUMBER_GENERATOR",
    application="demo_app",
    callback=maitai_callback,
)
{
  "application_id": 16,
  "session_id": "xxx",
  "evaluation_results": [
    {
      "status": "FAULT",
      "description": "Test with random number",
      "confidence": 100.0,
      "correction": "Test with random number 42.",
      "sentinel_id": 41,
      "eval_time": 489,
      "date_created": 1716682520088,
      "usage": {
        "prompt_tokens": 380,
        "completion_tokens": 54,
        "total_tokens": 434
      }
    }
  ],
  "evaluation_request_id": "b84f0d55-7c7e-4af4-8d84-106385682250"
}
application_id: The Maitai identifier for the application.
session_id: A unique identifier for the session, passed in from the Chat endpoint.
The identifier of the evaluated Chat Completion request.
evaluation_results: A list of individual Sentinel results, each containing:
    A unique identifier for the evaluation result.
    status: The status of the evaluation. FAULT means a fault was detected, PASS means the LLM output passed testing, and NA means an evaluation was not performed.
    description: A detailed description of the evaluation outcome.
    confidence: The confidence level of the evaluation result.
    Metadata associated with the evaluation.
    correction: The suggested alternative output to use in lieu of the faulty output.
    sentinel_id: An identifier for the Sentinel associated with this result.
    eval_time: The evaluation time in milliseconds.
    date_created: The Unix timestamp marking the creation of the evaluation result.
    usage: Usage statistics for the Sentinel:
        completion_tokens: Number of tokens in the generated completion.
        prompt_tokens: Number of tokens in the prompt.
        total_tokens: Total number of tokens used in the request (prompt + completion).
evaluation_request_id: The unique identifier for the evaluation request.
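As a usage note, a callback will typically branch on status and fall back to correction when a fault is detected. The sketch below is illustrative only; it assumes the fields documented above are exposed as attributes on the EvaluateResponse object and its result entries, which the SDK may instead surface differently (for example as dictionaries).

from maitai import types as maitai_types

async def maitai_callback(eval_response: maitai_types.EvaluateResponse):
    # Assumes the documented fields are exposed as attributes on the result
    # objects; adjust the access pattern if the SDK returns dictionaries.
    for result in eval_response.evaluation_results:
        if result.status == "FAULT":
            # A fault was detected; the Sentinel supplies a suggested rewrite.
            print(
                f"Sentinel {result.sentinel_id} flagged a fault "
                f"(confidence {result.confidence}): {result.description}"
            )
            print(f"Suggested correction: {result.correction}")
        elif result.status == "PASS":
            print(f"Sentinel {result.sentinel_id} passed in {result.eval_time} ms")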