Whisper: a speech-to-text model

Whisper is a speech recognition model that transcribes spoken audio into written text.

You can upload your audio in a few ways:

  • From the Internet: Provide a direct link (URL) to the audio file.
  • From your computer: Upload the audio file directly from your device.

File String

File String refers to providing the audio content in the form of a Base64-encoded string. This is useful when you want to directly include the audio content within the API request, rather than uploading a file.

  • The audio file (e.g., .mp3, .wav, etc.) is converted into a Base64 string and sent to the API for transcription.
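
As a concrete example, the following Python sketch converts a local audio file into a Base64 string (the file path is illustrative):

import base64

# Read the raw audio bytes and encode them as a Base64 string.
with open("speech.mp3", "rb") as audio_file:  # illustrative path
    file_string = base64.b64encode(audio_file.read()).decode("utf-8")

# file_string now holds the value to pass as the File String parameter.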

File URL

File URL allows you to provide a URL pointing to the audio file you want to transcribe. The API will download and process the audio file from the given URL.

  • The API fetches the audio from the provided URL and transcribes it.

File

File is where you upload the audio file directly for transcription. This would be used if you were sending an actual audio file with the API request.

  • This is for file uploads, so the file can be in formats like .mp3, .wav, .ogg, etc.
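
Taken together, File String, File URL, and File are alternative ways of supplying the same audio. A minimal sketch of what the corresponding request payloads might look like, assuming snake_case parameter names derived from the sections above (the exact request shape depends on your client):

# Option 1: File String - the audio embedded as a Base64-encoded string
payload = {"file_string": "<base64-encoded audio>"}

# Option 2: File URL - the audio fetched from a direct link
payload = {"file_url": "https://example.com/audio/speech.mp3"}

# Option 3: File - the audio file itself attached to the request
payload = {"file": "speech.mp3"}  # illustrative; upload handling depends on the client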

Group Segments

Group Segments determines whether the transcription should group similar speech segments. This can be useful for long audio files, where you want to divide the transcription into logical segments based on speakers or pauses.

  • If set to true, the transcription will group speech segments together, making it easier to follow in continuous speech.

Transcript Output Format

Transcript Output Format specifies how the transcribed text will be returned. You can choose from:

  • Words Only: Returns just the transcribed words without any timing or segment information.

  • Segments Only: Returns segments of speech along with timestamps and other metadata, but without the full word-level transcription.

  • Both: Provides the full transcript along with speech segments, including timestamps and metadata for a more detailed output.
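
To make the formats concrete, here is a sketch of the general shape a segments-style result might take. The text field matches the extraction code later in this guide, while start, end, and speaker are illustrative names for the timestamp and speaker metadata:

# "Words Only" returns plain text; "Segments Only" and "Both" also return
# a list of segments of roughly this shape (field names other than "text" are illustrative):
output = {
    "segments": [
        {"text": "Hello, and welcome to the show.", "start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
        {"text": "Thanks for having me.", "start": 2.6, "end": 4.1, "speaker": "SPEAKER_01"},
    ]
}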

Num Speakers

Num Speakers lets you specify the number of speakers in the audio. The API will then attempt to differentiate between speakers and segment the transcription accordingly.

  • If you know the audio contains two speakers, you can specify this, and the API will attempt to distinguish between them in the transcript.

Translate

Translate allows you to translate the transcribed audio from one language to another. For example, if the audio is in another language and you want the transcription in English, you can use this option.

  • Setting this to true will enable translation, and the transcription will be returned in the target language (often English).

Language

Language is where you specify the language of the input audio. This helps the API understand which language it’s dealing with, leading to more accurate transcription.

  • If your audio is in English, you would set the language to "en". For Spanish, it would be "es", and so on.

Prompt

Prompt is a hint or additional context you provide to Whisper about the content of the audio. This helps improve the accuracy of the transcription, especially in cases with specialized vocabulary or unclear speech.

  • Whisper will use the context from the prompt to improve transcription accuracy.

Offset Seconds

Offset Seconds allows you to start transcribing the audio from a specific time (in seconds). This is useful if you want to skip a certain portion of the audio.

Private Container

Private Container refers to whether the transcription result is stored in a private container, meaning only you or authorized users can access it.

  • Setting this to true ensures that the transcriptions and results are stored privately and securely.
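
Putting the parameters together, a configuration for a two-speaker English recording might look like the sketch below. The snake_case keys are assumed from the parameter names above, and all values are illustrative:

config = {
    "file_url": "https://example.com/audio/interview.mp3",  # one of the three audio inputs
    "group_segments": True,                # group similar speech segments
    "transcript_output_format": "both",    # plain text plus timestamped segments
    "num_speakers": 2,                     # expected number of speakers
    "translate": False,                    # keep the transcription in the source language
    "language": "en",                      # language of the input audio
    "prompt": "An interview about product design.",  # context hint (illustrative)
    "offset_seconds": 0,                   # start from the beginning of the audio
    "private_container": True,             # store results privately
}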

How to use on Scade:

  1. Configure the Start Node:

Create a file-type input in the start node and name it “audio.” This will allow the node to accept audio files for processing.

  2. Set Up the Whisper Node:

Set the language field to the language of the audio if you want the transcription returned in that language. If you leave this field blank, Whisper will automatically translate the audio into English.

Optionally, add a prompt to improve the transcription accuracy and choose the desired output format. The default setting is “both,” which will return both text and timestamped output.

  3. Extract the Plain Transcribed Text:

Add a “Run Python Code” node to your flow. This node allows you to insert any Python code for processing within the flow.


If you’re not familiar with this, use the following code to extract the plain text from the Whisper output:

text = context["CwQ1-thomasmol-whisper-diarization"]["success"]["segments"][0]["text"]

_result = text

In this code:

["CwQ1-thomasmol-whisper-diarization"] refers to the identifier of the Whisper node.

["success"] indicates that the node ran successfully, and we have a result.

["segments"][0]["text"] is the location of the transcribed text in the output.

Note: If the audio contains multiple speakers, you will receive a corresponding number of segments.
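
If the output contains several segments, a small variation of the same code gathers all of them instead of only the first; this sketch reuses the output path shown above:

# Collect the text of every segment so multi-speaker audio is not cut off at the first one.
segments = context["CwQ1-thomasmol-whisper-diarization"]["success"]["segments"]
text = " ".join(segment["text"] for segment in segments)

_result = text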

To add the variable, change the input type of the Python node to "expression".


Drag the text segment from the output of the successfully completed Whisper node into the code, then change the input type back to Python code.

  4. Configure the End Node:

Add an input text field named “result” to capture and display the final transcribed text.

This setup ensures a smooth transcription flow, allowing for language configuration, prompt customization, and plain text extraction.