For Speech-To-Text (STT), Transcription returns all text extracted from the audio file, organized into paragraphs that are delimited by speech intervals. It also returns a confidence score ("confidence") in the range [0, 1.0]; values closer to 1.0 indicate a higher likelihood that the transcription is correct.
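
As an illustration of how the confidence score might be used, the sketch below keeps only the paragraphs whose confidence meets a threshold. The response layout is an assumption made for this example: the keys "paragraphs" and "text", and the idea that "confidence" is attached to each paragraph, are hypothetical and may not match the actual response schema.

```python
from typing import Any


def confident_paragraphs(response: dict[str, Any], threshold: float = 0.8) -> list[str]:
    """Return the text of paragraphs whose confidence is at least `threshold`.

    Assumes a hypothetical layout: response["paragraphs"] is a list of
    objects with "text" and "confidence" keys.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1.0]")
    return [
        paragraph["text"]
        for paragraph in response.get("paragraphs", [])
        # A paragraph without a score is treated as confidence 0.0.
        if paragraph.get("confidence", 0.0) >= threshold
    ]


# Hypothetical STT response, for illustration only.
sample_response = {
    "paragraphs": [
        {"text": "Hello, I would like to book a table.", "confidence": 0.94},
        {"text": "For to people at ate.", "confidence": 0.41},
    ]
}

kept = confident_paragraphs(sample_response, threshold=0.8)
```

The threshold of 0.8 is arbitrary; a suitable value depends on how costly a wrong transcription is for the application.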


For Text-To-Speech (TTS), Transcription returns the audio in MP3 format, encoded in Base64. The original expression is also included in the response.
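
Because the audio arrives as Base64 text, it has to be decoded before it can be played or saved. The following is a minimal sketch using only the Python standard library. The key name "audio" is an assumption for this example, so substitute whichever field actually carries the encoded MP3 in the response.

```python
import base64
from pathlib import Path
from typing import Any


def decode_tts_audio(response: dict[str, Any], audio_key: str = "audio") -> bytes:
    """Decode the Base64-encoded MP3 payload into raw bytes.

    `audio_key` is a hypothetical field name, not a confirmed part of the API.
    """
    # validate=True rejects characters outside the Base64 alphabet
    # instead of silently discarding them.
    return base64.b64decode(response[audio_key], validate=True)


def save_tts_audio(response: dict[str, Any], out_path: str, audio_key: str = "audio") -> int:
    """Write the decoded MP3 bytes to `out_path` and return the byte count."""
    audio_bytes = decode_tts_audio(response, audio_key)
    Path(out_path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

Calling save_tts_audio(response, "speech.mp3") would then produce a file that a standard MP3 player can open, provided the field holds the complete encoded audio.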


In both cases, if NLU analysis is also requested, the response additionally includes the intents, entities, sentiment polarity (positive, negative, or neutral), and emotions (happy, sad, angry, fear, surprise, disgust). Learn more about NLU.

