Supported input file formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Supported Models
Maitai supports a variety of speech-to-text models across different providers:

Maitai Models
- breeze - Custom transcription model by Maitai

Groq Models
- whisper-large-v3 - High-quality transcription model
- whisper-large-v3-turbo - Fast transcription model
- distil-whisper-large-v3-en - English-optimized transcription model

OpenAI Models
- whisper-1 - OpenAI’s Whisper model
- gpt-4o-mini-transcribe - GPT-4o mini transcription model
- gpt-4o-transcribe - GPT-4o transcription model

Deepgram Models
- nova-3 - Latest model from Deepgram
- nova-2 - Deepgram’s Nova 2 model
Quickstart
Using the Maitai SDK
Both application and intent parameters are required for all transcription requests. These are used to associate the request with your specific application and intent within the Maitai platform.
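Below is a minimal sketch of what a transcription request might look like with the Maitai Python SDK. The client constructor and the audio.transcriptions.create method shape shown here are assumptions (modeled on the OpenAI client); check the SDK reference for exact names.

```python
# Sketch only: assumes the Maitai SDK mirrors the OpenAI client shape and
# accepts the Maitai application and intent parameters directly.
import maitai

client = maitai.Maitai()  # assumption: reads MAITAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        application="my_app",       # required: your Maitai application reference
        intent="TRANSCRIPTION",     # required: the intent for this request
        model="whisper-large-v3",   # any model from Supported Models
        file=audio_file,
    )

print(transcription.text)
```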
Using OpenAI SDK with Maitai
You can also use the OpenAI SDK with Maitai by setting the base URL and adding the required Maitai parameters as headers:
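The sketch below uses the OpenAI Python SDK pointed at Maitai. The base URL is a placeholder and the header names for application and intent are assumptions; substitute the values shown in your Maitai dashboard or the SDK reference.

```python
# Sketch only: base URL and header names are placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(
    api_key="<your-maitai-api-key>",
    base_url="https://<maitai-api-base>/v1",   # placeholder base URL
    default_headers={
        "x-maitai-application": "my_app",      # assumed header name
        "x-maitai-intent": "TRANSCRIPTION",    # assumed header name
    },
)

with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcription.text)
```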
Create transcription

Creates a transcription of the given audio file.

Maitai-Specific Parameters

application (required)
The reference to the application the request is created from.

intent (required)
Specifies the intention type of the request. Examples: TRANSCRIPTION, TRANSLATION.

session_id
A unique identifier you set for the session.
Model Provider Parameters
file
The audio file to transcribe, in one of the supported formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.

model
ID of the model to use. See Supported Models for available options.

response_format
The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.

language
The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.

prompt
An optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.

temperature
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

timestamp_granularities[]
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options: word, segment.

stream
If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.

Returns
Returns a transcription object, or a streamed sequence of transcription chunk objects if the request is streamed.

The transcription object

Represents a transcription response returned by the model, based on the provided input.

text
The transcribed text.

language
The language of the input audio.

duration
The duration of the input audio.

words
Extracted words and their corresponding timestamps.

segments
Segmented transcript content, with start and end timestamps.
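As a quick illustration, the snippet below reads these fields from a verbose_json response. The attribute-style access assumes OpenAI-style response objects, and `client` is the client configured in the Quickstart.

```python
# Illustration only: request a verbose_json transcription and read its fields.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
    )

print(transcription.text)        # full transcript
print(transcription.language)    # e.g. "english"
print(transcription.duration)    # audio length in seconds

for segment in transcription.segments or []:
    print(segment.start, segment.end, segment.text)
```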
Streaming transcriptions
You can stream transcriptions to receive partial results as they become available. This is useful for real-time applications or when processing long audio files.

Using Maitai SDK for streaming
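A minimal sketch of streaming with the Maitai SDK is shown below, assuming the SDK accepts a stream flag and yields OpenAI-style transcript delta events; the event and attribute names are assumptions, so confirm them against the SDK reference.

```python
# Sketch only: stream flag, event shape, and client constructor are assumptions.
import maitai

client = maitai.Maitai()

with open("meeting.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        application="my_app",
        intent="TRANSCRIPTION",
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        # Print partial text as it arrives (attribute name is an assumption).
        if getattr(event, "delta", None):
            print(event.delta, end="", flush=True)
```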
Using OpenAI SDK for streaming
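The sketch below points the OpenAI Python SDK at Maitai (base URL and header names are the same placeholders used in the Quickstart). The transcript.text.delta and transcript.text.done event types follow OpenAI's streaming transcription events; confirm which Maitai-served models support streaming.

```python
# Sketch only: base URL and header names are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<your-maitai-api-key>",
    base_url="https://<maitai-api-base>/v1",   # placeholder base URL
    default_headers={
        "x-maitai-application": "my_app",      # assumed header name
        "x-maitai-intent": "TRANSCRIPTION",    # assumed header name
    },
)

with open("meeting.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            print()  # newline once the final transcript has arrived
```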
Supported languages
Maitai supports transcription in the following languages through our supported models: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

While the underlying models were trained on 98 languages, we only list the languages with less than 50% word error rate (WER), an industry-standard benchmark for speech-to-text model accuracy. The models will return results for languages not listed above, but the quality will be low.

Timestamps
By default, the Transcriptions API will output a transcript of the provided audio in text. The timestamp_granularities[] parameter enables a more structured and timestamped json output format, with timestamps at the segment level, word level, or both. This enables word-level precision for transcripts and video edits, which allows for the removal of specific frames tied to individual words.
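For example, the hedged snippet below requests word-level timestamps; response_format must be verbose_json, and `client` is configured as in the Quickstart.

```python
# Example: request word-level timestamps and print each word with its times.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

for word in transcription.words:
    print(f"{word.word}: {word.start:.2f}s -> {word.end:.2f}s")
```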
Longer inputs
For very long audio files, you may want to break them up into smaller chunks to get the best performance. We suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost. One way to handle this is to use the PyDub open-source Python package to split the audio:
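```python
# Split a long recording into 10-minute chunks with PyDub (file name and chunk
# size are illustrative); PyDub slices audio by milliseconds.
from pydub import AudioSegment

audio = AudioSegment.from_mp3("long_interview.mp3")

ten_minutes = 10 * 60 * 1000  # PyDub works in milliseconds
for i, start in enumerate(range(0, len(audio), ten_minutes)):
    chunk = audio[start:start + ten_minutes]
    chunk.export(f"chunk_{i}.mp3", format="mp3")
```

Each exported chunk can then be sent as a separate transcription request; to preserve context across chunks, pass the transcript of the previous chunk as the prompt for the next one (see Prompting below).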
Prompting

You can use a prompt parameter to improve the quality of the transcripts generated by the Transcriptions API; see the example after the list below.
- Correcting specific words or acronyms: Prompts can help correct specific words or acronyms that the model misrecognizes in the audio.
- Preserving context: To preserve the context of a file that was split into segments, prompt the model with the transcript of the preceding segment.
- Punctuation: Sometimes the model skips punctuation in the transcript. To prevent this, use a simple prompt that includes punctuation.
- Filler words: The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, use a prompt that contains them.
- Writing style: Some languages can be written in different ways, such as simplified or traditional Chinese. You can improve this by using a prompt in your preferred writing style.
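The snippet below is an illustrative use of the prompt parameter for the first case, spelling out names and acronyms the model tends to misrecognize; `client` is configured as in the Quickstart, and the prompt text is only an example.

```python
# Example: use the prompt to bias spelling of domain-specific terms.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        prompt="The transcript mentions Maitai, Groq, and WER.",
    )

print(transcription.text)
```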
Notes
- TypeScript SDK: Currently, the Maitai SDK only supports Python. TypeScript support is coming soon.
- Model availability: Not all models support all features. For example, timestamp granularities are only available with certain models and response formats.
- Streaming support: Streaming is not supported by all models. Check the model documentation for specific capabilities.