Create evaluations
Evaluations are created when you call Chat with evaluations enabled (via Portal configuration or SDK parameters). The model output is what gets evaluated, with the request used as context.
- If a callback is provided, the evaluation is passed to that function asynchronously.
- If no callback is provided and `stream` is `true`, the evaluation is available on the last chunk.
- If no callback is provided and `stream` is `false`, the evaluation can be found on the completion response.
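The three delivery paths above can be sketched as plain control flow. The dispatcher below is purely illustrative: the dict shapes, keys, and the `deliver_evaluation` name are assumptions for the sketch, not the SDK's actual types or internals.

```python
import asyncio

async def deliver_evaluation(evaluation, callback=None, stream=False):
    """Illustrative only: mirrors the three delivery modes, not real SDK code."""
    if callback is not None:
        # Callback provided: pass the evaluation to that function asynchronously.
        await callback(evaluation)
        return {"response": {"content": "..."}}
    if stream:
        # No callback, stream=True: the evaluation arrives on the last chunk.
        return {"chunks": [{"delta": "..."}, {"delta": "", "evaluation": evaluation}]}
    # No callback, stream=False: the evaluation sits on the completion response.
    return {"response": {"content": "...", "evaluation": evaluation}}
```

Whichever path applies, the payload delivered is the same `EvaluateResponse`.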
Example (callback)
Python (async)
```python
import maitai

async def on_eval(eval_response):
    # Handle the evaluation response (EvaluateResponse)
    print(eval_response)

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Generate numbers 1-10"},
]

client = maitai.MaitaiAsync()

response = await client.chat.completions.create(
    messages=messages,
    model="gpt-4o",
    session_id="YOUR_SESSION_ID",
    intent="NUMBER_GENERATOR",
    application="demo_app",
    callback=on_eval,
)
```
The `EvaluateResponse` delivered to the callback looks like:

```json
{
  "application_id": 16,
  "session_id": "xxx",
  "evaluation_results": [
    {
      "status": "FAULT",
      "description": "Test with random number",
      "confidence": 100.0,
      "correction": "Test with random number 42.",
      "sentinel_id": 41,
      "eval_time": 489,
      "date_created": 1716682520088,
      "usage": {
        "prompt_tokens": 380,
        "completion_tokens": 54,
        "total_tokens": 434
      }
    }
  ],
  "evaluation_request_id": "b84f0d55-7c7e-4af4-8d84-106385682250"
}
```
Response fields

- `application_id`: The Maitai identifier for the application.
- `session_id`: A unique identifier for the session, passed in from the Chat endpoint.
- The identifier of the evaluated chat completion request.
- `evaluation_results`: A list of individual Sentinel results. Each result includes:
  - A unique identifier for the evaluation result.
  - `status`: The status of the evaluation. `FAULT` means a fault was detected, `PASS` means the LLM output passed testing, and `NA` means an evaluation wasn't performed.
  - `description`: A detailed description of the evaluation outcome.
  - `confidence`: The confidence level of the evaluation result.
  - `correction`: Optional suggested correction text when a `FAULT` is detected.
  - Additional metadata associated with the evaluation result.
  - `sentinel_id`: An identifier for the Sentinel associated with this result.
  - `eval_time`: The evaluation time in milliseconds.
  - `date_created`: The Unix timestamp (in milliseconds) marking the creation of the evaluation result.
  - `usage`: Usage statistics for the Sentinel.
    - `prompt_tokens`: Number of tokens in the prompt.
    - `completion_tokens`: Number of tokens in the generated completion.
    - `total_tokens`: Total number of tokens used in the request (prompt + completion).
- `evaluation_request_id`: The unique identifier for the evaluation request.
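Putting the field reference to work, here is a hedged sketch of how a consumer might apply a Sentinel's `correction` only when a `FAULT` is reported with high `confidence`. The `apply_correction` helper and the 90.0 threshold are illustrative assumptions, not part of the Maitai API:

```python
def apply_correction(evaluate_response, original_text, min_confidence=90.0):
    """Return the first high-confidence correction, else the original text."""
    for result in evaluate_response["evaluation_results"]:
        if (
            result["status"] == "FAULT"
            and result["confidence"] >= min_confidence
            and result.get("correction")
        ):
            return result["correction"]
    return original_text
```

With the example payload above (a `FAULT` at confidence 100.0), this would swap in "Test with random number 42."; a `PASS` or low-confidence result leaves the model output untouched.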