{ "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger.", "language": "en"}
SDK Reference
Speech to text
Convert audio files to text using Maitai’s speech-to-text API.
{ "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger.", "language": "en"}
{ "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger.", "language": "en"}
The Audio API provides transcription capabilities with support for multiple providers and models. You can transcribe audio into text in whatever language the audio is in, or translate and transcribe the audio into English. The following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Both application and intent parameters are required for all transcription requests. These are used to associate the request with your specific application and intent within the Maitai platform.
temperature: The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
timestamp_granularities[]: The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. One or both of the following: word, segment.
stream: If set, partial transcript deltas will be sent as they become available. Tokens will be sent as data-only server-sent events, with the stream terminated by a data: [DONE] message.
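Putting the required and optional parameters together, a basic request might look like the following sketch. The file path, application, and intent values are placeholders, and temperature is included only to illustrate an optional parameter:

import maitai

maitai_client = maitai.Maitai()

# application and intent are required on every transcription request;
# the other values below are placeholders for illustration.
with open("/path/to/file/speech.mp3", "rb") as audio_file:
    transcription = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="text",
        temperature=0.2,  # optional: lower values are more deterministic
        application="demo_app",
        intent="TRANSCRIPTION",
    )

print(transcription.text)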
You can stream transcriptions to receive partial results as they become available. This is useful for real-time applications or when processing long audio files.
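A rough sketch of a streaming request, assuming the create call accepts a stream flag and returns an iterable of partial results (the exact event payload may vary by model):

import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/speech.mp3", "rb") as audio_file:
    # Assumption: stream=True yields partial transcript deltas as they arrive
    # instead of returning a single completed transcription.
    stream = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        application="demo_app",
        intent="TRANSCRIPTION",
        stream=True,
    )

    for event in stream:
        # Print each partial delta as it arrives.
        print(event)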
Maitai supports transcription in the following languages through our supported models: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh. While the underlying models were trained on 98 languages, we only list the languages that stayed below a 50% word error rate (WER), an industry-standard benchmark for speech-to-text model accuracy. The models will return results for languages not listed above, but the quality will be low.
By default, the Transcriptions API will output a transcript of the provided audio in text. The timestamp_granularities[] parameter enables a more structured and timestamped json output format, with timestamps at the segment level, the word level, or both. This enables word-level precision for transcripts and video edits, which allows for the removal of specific frames tied to individual words.
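For example, a word-level timestamp request might look like the sketch below. It assumes the Python SDK accepts the granularities as a timestamp_granularities list and that word entries are exposed on the response as a words field; check the model documentation for the exact response shape:

import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/speech.mp3", "rb") as audio_file:
    transcription = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="verbose_json",    # required for timestamp granularities
        timestamp_granularities=["word"],  # "word", "segment", or both
        application="demo_app",
        intent="TRANSCRIPTION",
    )

# Assumption: each entry carries the word plus its start and end times.
print(transcription.words)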
For very long audio files, you may want to break them up into smaller chunks to get the best performance. We suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost. One way to handle this is to use the PyDub open source Python package to split the audio:
from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000
first_10_minutes = song[:ten_minutes]
first_10_minutes.export("good_morning_10.mp3", format="mp3")
You can use a prompt parameter to improve the quality of the transcripts generated by the Transcriptions API.
import maitai

maitai_client = maitai.Maitai()

with open("/path/to/file/speech.mp3", "rb") as audio_file:
    transcription = maitai_client.audio.transcriptions.create(
        file=audio_file,
        model="breeze",
        response_format="text",
        prompt="The following conversation is a lecture about the recent developments around AI and machine learning.",
        application="demo_app",
        intent="TRANSCRIPTION",
        session_id="YOUR_SESSION_ID"
    )
    print(transcription.text)
Here are some examples of how prompting can help in different scenarios:
Correcting specific words or acronyms: Prompts can help correct specific words or acronyms that the model misrecognizes in the audio.
Preserving context: To preserve the context of a file that was split into segments, prompt the model with the transcript of the preceding segment (see the sketch after this list).
Punctuation: Sometimes the model skips punctuation in the transcript. To prevent this, use a simple prompt that includes punctuation.
Filler words: The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, use a prompt that contains them.
Writing style: Some languages can be written in different ways, such as simplified or traditional Chinese. You can improve this by using a prompt in your preferred writing style.
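For the context-preservation case, one possible sketch is to transcribe a long recording chunk by chunk (splitting with PyDub as shown earlier) and pass each chunk's transcript in as the prompt for the next request. File names and chunk length are placeholders:

import maitai
from pydub import AudioSegment

maitai_client = maitai.Maitai()

# Split the recording into 10-minute chunks (placeholder length) and carry the
# previous chunk's transcript forward as the prompt for the next request.
song = AudioSegment.from_mp3("lecture.mp3")
chunk_ms = 10 * 60 * 1000
previous_transcript = ""
full_transcript = []

for i, start in enumerate(range(0, len(song), chunk_ms)):
    chunk_path = f"lecture_chunk_{i}.mp3"
    song[start:start + chunk_ms].export(chunk_path, format="mp3")

    with open(chunk_path, "rb") as audio_file:
        transcription = maitai_client.audio.transcriptions.create(
            file=audio_file,
            model="breeze",
            response_format="text",
            prompt=previous_transcript,  # preceding segment's transcript preserves context
            application="demo_app",
            intent="TRANSCRIPTION",
        )

    previous_transcript = transcription.text
    full_transcript.append(transcription.text)

print(" ".join(full_transcript))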
TypeScript SDK: Currently, the Maitai SDK only supports Python. TypeScript support is coming soon.
Model availability: Not all models support all features. For example, timestamp granularities are only available with certain models and response formats.
Streaming support: Streaming is not supported by all models. Check the model documentation for specific capabilities.