
The Audio API provides transcription capabilities with support for multiple providers and models. You can transcribe audio into whatever language the audio is in, or translate and transcribe the audio into English.

The following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.

Supported Models

Maitai supports a variety of speech-to-text models across different providers:

Maitai Models

  • breeze - Custom transcription model by Maitai

Groq Models

  • whisper-large-v3 - High-quality transcription model
  • whisper-large-v3-turbo - Fast transcription model
  • distil-whisper-large-v3-en - English-optimized transcription model

OpenAI Models

  • whisper-1 - OpenAI’s Whisper model
  • gpt-4o-mini-transcribe - GPT-4o mini transcription model
  • gpt-4o-transcribe - GPT-4o transcription model

Quickstart

Using the Maitai SDK

Both application and intent parameters are required for all transcription requests. These are used to associate the request with your specific application and intent within the Maitai platform.

import maitai

maitai_client = maitai.MaitaiAsync()

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = await maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="json",
        application="demo_app",
        intent="TRANSCRIPTION",
        session_id="YOUR_SESSION_ID"
    )

print(transcription.text)
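
With response_format="json", the response looks like this:

{
    "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger.",
    "language": "en"
}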

Using OpenAI SDK with Maitai

You can also use the OpenAI SDK with Maitai by setting the base URL and adding the required Maitai parameters as headers:

import openai

client = openai.OpenAI(
    base_url="https://api.trymaitai.ai",
    api_key="your_maitai_api_key"
)

# Add Maitai-specific headers
headers = {
    "X-Maitai-Application": "demo_app",
    "X-Maitai-Intent": "TRANSCRIPTION",
    "X-Maitai-Session-Id": "YOUR_SESSION_ID",
    "X-Maitai-Metadata-custom_tag": "value"
}

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="json",
        extra_headers=headers
    )

print(transcription.text)

Create transcription

Creates a transcription of the given audio file.

Maitai Specific Parameters

  • application (string, required) - The reference to the application the request is created from.
  • intent (string, required) - Specifies the intention type of the request. Examples: TRANSCRIPTION, TRANSLATION.
  • session_id (string, required) - A unique identifier you set for the session.

Model Provider Parameters

  • file (file, required) - The audio file to transcribe, in one of the supported formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.
  • model (string, required) - ID of the model to use. See Supported Models for available options.
  • response_format (string, default: "json") - The format of the transcript output: json, text, srt, verbose_json, or vtt.
  • language (string) - The language of the input audio. Supplying the input language in ISO-639-1 format improves accuracy and latency.
  • prompt (string) - Optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
  • temperature (number, default: 0) - The sampling temperature, between 0 and 1. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit.
  • timestamp_granularities[] (array) - The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. One or both of: word, segment.
  • stream (boolean, default: false) - If set, partial message deltas are sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.

Returns

Returns a transcription object, or a streamed sequence of transcription chunk objects if the request is streamed.

The transcription object

Represents a transcription response returned by the model, based on the provided input.

  • text (string) - The transcribed text.
  • language (string) - The language of the input audio.
  • duration (number) - The duration of the input audio.
  • words (array) - Extracted words and their corresponding timestamps.
  • segments (array) - Segmented transcript content, with start and end timestamps.
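
For orientation, a verbose_json-style response could look roughly like the sketch below. The field names match the reference above, but the values and the exact shape of the word and segment entries are illustrative assumptions, not taken from the Maitai reference:

{
    "text": "Good morning, everyone.",
    "language": "en",
    "duration": 2.1,
    "words": [
        {"word": "Good", "start": 0.0, "end": 0.32},
        {"word": "morning,", "start": 0.32, "end": 0.78},
        {"word": "everyone.", "start": 0.78, "end": 1.4}
    ],
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.4, "text": "Good morning, everyone."}
    ]
}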

Streaming transcriptions

You can stream transcriptions to receive partial results as they become available. This is useful for real-time applications or when processing long audio files.

Using Maitai SDK for streaming

import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    stream = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="json",
        stream=True,
        application="demo_app",
        intent="TRANSCRIPTION",
        session_id="YOUR_SESSION_ID"
    )

    for chunk in stream:
        print(f"Chunk: {chunk}")
        if hasattr(chunk, 'delta') and chunk.delta:
            print(f"Delta: {chunk.delta}")

Using OpenAI SDK for streaming

import openai

client = openai.OpenAI(
    base_url="https://api.trymaitai.ai",
    api_key="your_maitai_api_key"
)

headers = {
    "X-Maitai-Application": "demo_app",
    "X-Maitai-Intent": "TRANSCRIPTION",
    "X-Maitai-Session-Id": "YOUR_SESSION_ID"
}

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="json",
        stream=True,
        extra_headers=headers
    )

    for chunk in stream:
        print(f"Chunk: {chunk}")

Supported languages

Maitai supports transcription in the following languages through our supported models:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

While the underlying models were trained on 98 languages, we only list the languages that achieved less than 50% word error rate (WER), an industry-standard benchmark for speech-to-text model accuracy. The models will return results for languages not listed above, but the quality will be low.
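
If you already know the input language, you can pass it through the language parameter (ISO-639-1) described above. Here is a minimal sketch using the same client and parameters as the quickstart; the file path and language code are placeholders:

import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="json",
        language="es",  # ISO-639-1 code for Spanish input audio
        application="demo_app",
        intent="TRANSCRIPTION",
        session_id="YOUR_SESSION_ID"
    )

print(transcription.text)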

Timestamps

By default, the Transcriptions API outputs a plain-text transcript of the provided audio. The timestamp_granularities[] parameter enables a more structured, timestamped JSON output format, with timestamps at the segment level, the word level, or both. This enables word-level precision for transcripts and video edits, allowing the removal of specific frames tied to individual words.

import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-large-v3",
        response_format="verbose_json",
        timestamp_granularities=["word"],
        application="demo_app",
        intent="TRANSCRIPTION",
        session_id="YOUR_SESSION_ID"
    )

    print(transcription.words)

Longer inputs

For very long audio files, you may want to break them up into smaller chunks to get the best performance. We suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.

One way to handle this is to use the PyDub open source Python package to split the audio:

from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000

first_10_minutes = song[:ten_minutes]

first_10_minutes.export("good_morning_10.mp3", format="mp3")
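
Building on that, the following sketch splits a longer recording into ten-minute chunks, transcribes each chunk, and feeds the tail of the previous transcript back in as the prompt to preserve context across chunk boundaries (see Prompting below). The chunking loop, file names, and 200-character prompt tail are illustrative assumptions, not part of the Maitai API:

import math

import maitai
from pydub import AudioSegment

maitai_client = maitai.Maitai()

CHUNK_MS = 10 * 60 * 1000  # PyDub handles time in milliseconds

audio = AudioSegment.from_mp3("good_morning.mp3")
num_chunks = math.ceil(len(audio) / CHUNK_MS)

full_transcript = ""
previous_text = ""

for i in range(num_chunks):
    # Export each chunk to its own file before uploading.
    chunk = audio[i * CHUNK_MS:(i + 1) * CHUNK_MS]
    chunk_path = f"chunk_{i}.mp3"
    chunk.export(chunk_path, format="mp3")

    with open(chunk_path, "rb") as audio_file:
        transcription = maitai_client.audio.transcriptions.create(
            file=audio_file,
            model="breeze",
            response_format="json",
            # Carry context over from the previous chunk's transcript.
            prompt=previous_text[-200:],
            application="demo_app",
            intent="TRANSCRIPTION",
            session_id="YOUR_SESSION_ID"
        )

    previous_text = transcription.text
    full_transcript += (" " if full_transcript else "") + transcription.text

print(full_transcript)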

Prompting

You can use a prompt parameter to improve the quality of the transcripts generated by the Transcriptions API.

import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/speech.mp3", "rb") as audio_file:
    transcription = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="text",
        prompt="The following conversation is a lecture about the recent developments around AI and machine learning.",
        application="demo_app",
        intent="TRANSCRIPTION",
        session_id="YOUR_SESSION_ID"
    )

    print(transcription.text)

Here are some examples of how prompting can help in different scenarios:

  • Correcting specific words or acronyms: Prompts can help correct specific words or acronyms that the model misrecognizes in the audio.
  • Preserving context: To preserve the context of a file that was split into segments, prompt the model with the transcript of the preceding segment.
  • Punctuation: Sometimes the model skips punctuation in the transcript. To prevent this, use a simple prompt that includes punctuation.
  • Filler words: The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, use a prompt that contains them.
  • Writing style: Some languages can be written in different ways, such as simplified or traditional Chinese. You can improve this by using a prompt in your preferred writing style.

Notes

  • TypeScript SDK: Currently, the Maitai SDK only supports Python. TypeScript support is coming soon.
  • Model availability: Not all models support all features. For example, timestamp granularities are only available with certain models and response formats.
  • Streaming support: Streaming is not supported by all models. Check the model documentation for specific capabilities.