Convert audio files to text using Maitai’s speech-to-text API.
The Audio API provides transcription capabilities with support for multiple providers and models. You can transcribe audio into whatever language the audio is in, or translate and transcribe the audio into English.
The following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Maitai supports a variety of speech-to-text models across different providers:
- breeze - Custom transcription model by Maitai
- whisper-large-v3 - High-quality transcription model
- whisper-large-v3-turbo - Fast transcription model
- distil-whisper-large-v3-en - English-optimized transcription model
- whisper-1 - OpenAI's Whisper model
- gpt-4o-mini-transcribe - GPT-4o mini transcription model
- gpt-4o-transcribe - GPT-4o transcription model

Both application and intent parameters are required for all transcription requests. These are used to associate the request with your specific application and intent within the Maitai platform.
You can also use the OpenAI SDK with Maitai by setting the base URL and adding the required Maitai parameters as headers:
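For example, a minimal sketch with the openai Python package might look like the following. The base URL and header names shown here are placeholders, not confirmed values; use the endpoint and header names provided for your Maitai account.

```python
from openai import OpenAI

# Point the OpenAI SDK at Maitai instead of OpenAI.
# NOTE: the base URL and header names below are illustrative placeholders;
# replace them with the values documented for your Maitai account.
client = OpenAI(
    api_key="<your-maitai-api-key>",
    base_url="https://api.trymaitai.ai/v1",   # placeholder Maitai base URL
    default_headers={
        "x-application": "my-voice-app",      # hypothetical header for the application parameter
        "x-intent": "TRANSCRIPTION",          # hypothetical header for the intent parameter
    },
)

# Transcribe an audio file with one of the supported models.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcription.text)
```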
Creates a transcription of the given audio file.
The reference to the application the request is created from.
Specifies the intention type of the request.
Examples: TRANSCRIPTION, TRANSLATION
A unique identifier you set for the session.
The audio file to transcribe, in one of the supported formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.
ID of the model to use. See Supported Models for available options.
The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.
An optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. One or more of the following: word or segment.
If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.
Returns a transcription object, or a streamed sequence of transcription chunk objects if the request is streamed.
Represents a transcription response returned by the model, based on the provided input.
The transcribed text.
The language of the input audio.
The duration of the input audio.
Extracted words and their corresponding timestamps.
Segmented transcript content, with start and end timestamps.
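As an illustration, with an OpenAI-SDK-style client like the one sketched above, the fields of a verbose_json transcription can be read as follows (a sketch; the file name and model choice are arbitrary):

```python
# Request a verbose_json transcript so the response includes language,
# duration, and segments in addition to the transcribed text.
with open("meeting.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
    )

print(result.text)      # the transcribed text
print(result.language)  # the language of the input audio
print(result.duration)  # the duration of the input audio, in seconds

# Segmented transcript content with start and end timestamps.
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
```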
You can stream transcriptions to receive partial results as they become available. This is useful for real-time applications or when processing long audio files.
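A minimal streaming sketch, assuming an OpenAI-SDK-style client as configured above and that the stream events mirror OpenAI's transcript.text.delta and transcript.text.done event shapes:

```python
# Stream partial transcription results as they become available.
with open("long_call.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True,
    )

    for event in stream:
        # Delta events carry incremental text; the done event carries the
        # completed transcript. Event types are assumed to follow OpenAI's
        # transcription streaming format.
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            print()  # newline after the finished transcript
```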
Maitai supports transcription in the following languages through our supported models:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the underlying models were trained on 98 languages, we only list the languages that achieved a word error rate (WER) below 50%, an industry-standard benchmark for speech-to-text model accuracy. The models will return results for languages not listed above, but the quality will be low.
By default, the Transcriptions API outputs a transcript of the provided audio in text. The timestamp_granularities[] parameter enables a more structured, timestamped JSON output format, with timestamps at the segment level, word level, or both. This enables word-level precision for transcripts and video edits, which allows for the removal of specific frames tied to individual words.
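For example, a request for word-level timestamps might look like this sketch (reusing the client from above; the file name is arbitrary):

```python
# Word-level timestamps require the verbose_json response format.
with open("interview.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry pairs a word with its start and end time, which makes it
# possible to locate the exact frames tied to an individual word.
for word in result.words:
    print(f"{word.word}: {word.start:.2f}s -> {word.end:.2f}s")
```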
For very long audio files, you may want to break them up into smaller chunks to get the best performance. We suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.
One way to handle this is to use the PyDub open source Python package to split the audio:
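A sketch of that approach is shown below; the file name and the 10-minute chunk length are arbitrary, and PyDub needs ffmpeg installed to read compressed formats:

```python
from pydub import AudioSegment

# Load the source audio (ffmpeg/libav is required for non-wav formats).
audio = AudioSegment.from_mp3("long_recording.mp3")

# PyDub slices audio in milliseconds; split into 10-minute chunks.
ten_minutes = 10 * 60 * 1000
chunks = [audio[i:i + ten_minutes] for i in range(0, len(audio), ten_minutes)]

# Export each chunk as its own file so it can be transcribed separately.
for index, chunk in enumerate(chunks):
    chunk.export(f"chunk_{index}.mp3", format="mp3")
```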
You can use the prompt parameter to improve the quality of the transcripts generated by the Transcriptions API.
Here are some examples of how prompting can help in different scenarios:
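For instance, a prompt can list uncommon product names or acronyms so they are spelled correctly, or carry the transcript of a previous chunk into the next request so context is preserved across split audio. A sketch (the file and product names are made up):

```python
# 1) Correct spelling of unusual words and acronyms by listing them in the prompt.
with open("product_demo_part1.mp3", "rb") as audio_file:
    first_part = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        prompt="The speakers discuss Maitai, QZTrack, and the SOC 2 audit.",
    )

# 2) Preserve context across split audio by passing the previous chunk's
#    transcript as the prompt for the next chunk.
with open("product_demo_part2.mp3", "rb") as audio_file:
    second_part = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        prompt=first_part.text,
    )
```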