Convert audio files to text using Maitai’s speech-to-text API.
The Audio API provides transcription capabilities with support for multiple providers and models. You can transcribe audio into whatever language the audio is in, or translate and transcribe the audio into English.
The following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Maitai supports a variety of speech-to-text models across different providers:
- breeze - Custom transcription model by Maitai
- whisper-large-v3 - High-quality transcription model
- whisper-large-v3-turbo - Fast transcription model
- distil-whisper-large-v3-en - English-optimized transcription model
- whisper-1 - OpenAI's Whisper model
- gpt-4o-mini-transcribe - GPT-4o mini transcription model
- gpt-4o-transcribe - GPT-4o transcription model

Both application and intent parameters are required for all transcription requests. These are used to associate the request with your specific application and intent within the Maitai platform.
You can also use the OpenAI SDK with Maitai by setting the base URL and adding the required Maitai parameters as headers:
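For example, a minimal sketch with the openai Python package might look like the following. The base URL and header names shown here are placeholders, not confirmed values; use the endpoint and header names provided for your Maitai account.

```python
from openai import OpenAI

# Point the OpenAI SDK at Maitai instead of OpenAI.
# NOTE: the base URL and header names below are illustrative placeholders;
# replace them with the values documented for your Maitai account.
client = OpenAI(
    api_key="<your-maitai-api-key>",
    base_url="https://api.trymaitai.ai/v1",   # placeholder Maitai base URL
    default_headers={
        "x-application": "my-voice-app",      # hypothetical header for the application parameter
        "x-intent": "TRANSCRIPTION",          # hypothetical header for the intent parameter
    },
)

# Transcribe an audio file with one of the supported models.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcription.text)
```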
Creates a transcription of the given audio file.
The reference to the application the request is created from.
Specifies the intention type of the request.
Examples: TRANSCRIPTION, TRANSLATION
A unique identifier you set for the session.
The audio file to transcribe, in one of the supported formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.
ID of the model to use. See Supported Models for available options.
The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.
An optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. One or more of the following: word or segment.
If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.
Returns a transcription object, or a streamed sequence of transcription chunk objects if the request is streamed.
Represents a transcription response returned by the model, based on the provided input.
The transcribed text.
The language of the input audio.
The duration of the input audio.
Extracted words and their corresponding timestamps.
Segmented transcript content, with start and end timestamps.
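As an illustration, with an OpenAI-SDK-style client like the one sketched above, the fields of a verbose_json transcription can be read as follows (a sketch; the file name and model choice are arbitrary):

```python
# Request a verbose_json transcript so the response includes language,
# duration, and segments in addition to the transcribed text.
with open("meeting.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
    )

print(result.text)      # the transcribed text
print(result.language)  # the language of the input audio
print(result.duration)  # the duration of the input audio, in seconds

# Segmented transcript content with start and end timestamps.
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
```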
You can stream transcriptions to receive partial results as they become available. This is useful for real-time applications or when processing long audio files.
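A minimal streaming sketch, assuming an OpenAI-SDK-style client as configured above and that the stream events mirror OpenAI's transcript.text.delta and transcript.text.done event shapes:

```python
# Stream partial transcription results as they become available.
with open("long_call.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True,
    )

    for event in stream:
        # Delta events carry incremental text; the done event carries the
        # completed transcript. Event types are assumed to follow OpenAI's
        # transcription streaming format.
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            print()  # newline after the finished transcript
```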
Maitai supports transcription in the following languages through our supported models:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the underlying models were trained on 98 languages, we only list the languages that achieved a word error rate (WER) below 50%, an industry-standard benchmark for speech-to-text model accuracy. The models will return results for languages not listed above, but the quality will be low.
By default, the Transcriptions API outputs a transcript of the provided audio in text. The timestamp_granularities[] parameter enables a more structured, timestamped JSON output format, with timestamps at the segment level, word level, or both. This enables word-level precision for transcripts and video edits, which allows for the removal of specific frames tied to individual words.
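For example, a request for word-level timestamps might look like this sketch (reusing the client from above; the file name is arbitrary):

```python
# Word-level timestamps require the verbose_json response format.
with open("interview.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry pairs a word with its start and end time, which makes it
# possible to locate the exact frames tied to an individual word.
for word in result.words:
    print(f"{word.word}: {word.start:.2f}s -> {word.end:.2f}s")
```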
For very long audio files, you may want to break them up into smaller chunks to get the best performance. We suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.
One way to handle this is to use the PyDub open source Python package to split the audio:
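A sketch of that approach is shown below; the file name and the 10-minute chunk length are arbitrary, and PyDub needs ffmpeg installed to read compressed formats:

```python
from pydub import AudioSegment

# Load the source audio (ffmpeg/libav is required for non-wav formats).
audio = AudioSegment.from_mp3("long_recording.mp3")

# PyDub slices audio in milliseconds; split into 10-minute chunks.
ten_minutes = 10 * 60 * 1000
chunks = [audio[i:i + ten_minutes] for i in range(0, len(audio), ten_minutes)]

# Export each chunk as its own file so it can be transcribed separately.
for index, chunk in enumerate(chunks):
    chunk.export(f"chunk_{index}.mp3", format="mp3")
```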
You can use the prompt parameter to improve the quality of the transcripts generated by the Transcriptions API.
Here are some examples of how prompting can help in different scenarios:
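For instance, a prompt can list uncommon product names or acronyms so they are spelled correctly, or carry the transcript of a previous chunk into the next request so context is preserved across split audio. A sketch (the file and product names are made up):

```python
# 1) Correct spelling of unusual words and acronyms by listing them in the prompt.
with open("product_demo_part1.mp3", "rb") as audio_file:
    first_part = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        prompt="The speakers discuss Maitai, QZTrack, and the SOC 2 audit.",
    )

# 2) Preserve context across split audio by passing the previous chunk's
#    transcript as the prompt for the next chunk.
with open("product_demo_part2.mp3", "rb") as audio_file:
    second_part = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        prompt=first_part.text,
    )
```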