Recognizer: Whisper transcribe/translate
Produced By: Whisper: OpenAI, ELAN extension: TLA, Max Planck Institute for Psycholinguistics, Nijmegen

How to use Whisper transcribe and translate from within ELAN (locally)

Version 1.2, May 2024

Whisper by OpenAI (https://openai.com/blog/whisper/) is an automatic speech recognition (ASR) system trained on many hours of data collected from the web. This extension implements ELAN's Recognizer API and lets you call a local installation of Whisper to transcribe or translate a media file linked in the current document.

As a prerequisite, Whisper (and everything it depends on) must be properly installed and must work correctly when invoked from the command line. If that is the case, the Whisper extension in ELAN's user interface lets you configure the parameters and start the recognition process.
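The prerequisite can be checked outside ELAN before configuring the extension. The following is a minimal Python sketch, not part of the extension: the helper names find_whisper and whisper_responds are hypothetical, and it assumes the executable is called whisper and answers to --help.

```python
import shutil
import subprocess


def find_whisper(command="whisper"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(command)


def whisper_responds(executable):
    """Return True if `<executable> --help` runs and exits successfully."""
    try:
        result = subprocess.run(
            [executable, "--help"],
            capture_output=True,
            text=True,
            timeout=120,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The executable is missing, cannot be started, or hangs.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    path = find_whisper()
    if path is None:
        print("whisper was not found on PATH; a full path may be needed as run-command")
    elif whisper_responds(path):
        print("whisper found and responding at:", path)
    else:
        print("whisper found at", path, "but it did not run successfully")
```

If the executable is found, the printed path is a candidate value for the run-command parameter in cases where the plain command whisper does not work from within ELAN.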

Extensions based on the Recognizer API are installed in a sub-folder of ELAN's extensions folder. The whisper folder contains:

The main parameters and parameter categories are:

Parameter type | Parameter id | Description
<input> | audio | ELAN automatically preselects the first suitable media file of the current annotation session, but you can change this to any other supported file belonging to the session. WAVE audio files are preferred, but since the media file is processed by FFmpeg, other audio or video files will probably work as well.
<textparam> | run-command | By default the command is set to whisper. Depending on the platform and on how Python, Whisper, etc. are installed and configured, this might not work, and the full path to the executable might need to be entered instead. See the "known issues" below.
<textparam> | --task | By default the task to perform is set to transcribe. This can be changed to translate.
<textparam> | --model | By default the model to use is set to base. This can be changed to one of the other generic or English language models listed.
<textparam> | --language | By default the language is set to None, which makes Whisper auto-detect the language. This can be changed to one of the languages listed.
<textparam> | --word_timestamps | By default this is set to False. When set to True, Whisper (version 03-2023) extracts word-level timestamps and refines the results based on those timestamps (experimental). See the "known issues" below.
<numparam> | --*** | Several numerical parameters; please refer to the Whisper documentation for descriptions.
<output> | --output_dir | A folder where Whisper should store the resulting output files (.srt, .txt, .vtt). If not specified, the files are stored in whatever the "current directory" is. If the folder does not exist, Whisper will try to create it. Depending on the platform, it might be necessary to specify a path with a trailing slash, e.g. C:\Temp\.
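
To illustrate how these parameters correspond to a Whisper command line, here is a minimal Python sketch. It is an assumption about the general shape of the call, not the extension's actual implementation: the function build_whisper_command is hypothetical, and only the flags and default values listed in the table above are used.

```python
def build_whisper_command(media_file, run_command="whisper", task="transcribe",
                          model="base", language=None, word_timestamps=False,
                          output_dir=None):
    """Assemble an argument list from the parameters described in the table.

    The defaults mirror the defaults listed above; how ELAN itself builds
    the command is not documented here, so this is only an illustration.
    """
    command = [run_command, media_file, "--task", task, "--model", model]
    if language is not None:
        # Leaving --language out lets Whisper auto-detect the language.
        command += ["--language", language]
    # Whisper expects the literal strings True or False for this flag.
    command += ["--word_timestamps", str(word_timestamps)]
    if output_dir is not None:
        command += ["--output_dir", output_dir]
    return command


if __name__ == "__main__":
    # Hypothetical file and folder names, for illustration only.
    print(build_whisper_command("session.wav", task="translate",
                                output_dir="whisper_out"))
```

Running the resulting command by hand in a terminal is a quick way to check whether a problem lies in the Whisper installation or in the ELAN configuration.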

If a parameter you would like to use (i.e. whose default value you would like to change) is not in the list in the recognizer's .cmdi file, you can add it there and it will become an option in the user interface.

There are a few known issues, some depending on the platform: