Whisper transcribe and translate from within ELAN (locally)

Version 1.2, May 2024

Whisper by OpenAI (https://openai.com/blog/whisper/) is an automatic speech recognition
(ASR) system trained on many hours of data collected from the web. This extension
implements ELAN's Recognizer API and makes it possible to call a local installation of
Whisper to transcribe or translate a media file linked in the current document.
A prerequisite is that Whisper (and everything it depends on) is properly installed and works correctly when invoked from the command line. If that is the case, the Whisper extension in ELAN's user interface lets you configure the parameters and start the recognition process.
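
One way to check this prerequisite outside ELAN is a short script that looks for the executables on the search path. The following is a minimal sketch and not part of the extension; the helper name `find_tools` is made up for illustration, and it assumes only that the executables are called `whisper` and `ffmpeg`.

```python
import shutil


def find_tools(names=("whisper", "ffmpeg")):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        if path is None:
            print(f"{name}: not found on PATH")
        else:
            # This full path is a candidate value for the run-command field.
            print(f"{name}: {path}")
```

If `whisper` is reported as found here but ELAN still cannot start it, entering the printed full path as the run-command is a reasonable first thing to try.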
Extensions based on the Recognizer API are installed in a sub-folder of ELAN's
extensions folder. The `whisper` folder contains:

- a `readme.html` file
- a `.png` file
- a `.jar` file with the actual implementation
- a `recognizer.cmdi` file identifying the extension and its parameters to ELAN

The main parameters and parameter categories are:

Parameter type | Parameter id | Description |
---|---|---|
`<input>` | audio | ELAN automatically preselects the first suitable media file of your current annotation session, but you can change that to another supported file belonging to the session. WAVE audio files are preferred, but since the media file is processed by FFmpeg, other audio or video files will probably work as well. |
`<textparam>` | run-command | By default the command is set to `whisper`. Depending on the platform and on how Python, Whisper, etc. are installed and configured, this might not work and, for example, the full path to the executable might need to be entered. See the "known issues" below. |
`<textparam>` | --task | By default the task to perform is set to `transcribe`. This can be changed to `translate`. |
`<textparam>` | --model | By default the model to use is set to `base`. This can be changed to one of the other generic or English language models listed. |
`<textparam>` | --language | By default the language is set to `None`, which triggers Whisper to auto-detect the language. This can be changed to one of the languages listed. |
`<textparam>` | --word_timestamps | By default this is set to `False`. When set to `True`, Whisper (version 03-2023 or higher) extracts word-level timestamps and refines the results based on those timestamps (experimental). See the "known issues" below. |
`<numparam>` | --*** | Several numerical parameters; please refer to the Whisper documentation for descriptions. |
`<output>` | --output_dir | A folder where Whisper should store the resulting output files (`.srt`, `.txt`, `.vtt`). If not specified, the files will be stored in whatever the "current directory" is. If the folder does not exist, Whisper will try to create it. Depending on the platform it might be necessary to specify a path with a trailing slash, e.g. `C:\Temp\`. |

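
To illustrate how these parameters map onto a Whisper command line, the sketch below assembles an argument list from the values in the table. It is an illustration under assumptions, not the extension's actual code: the function `build_whisper_command` and the file and folder names are hypothetical, and leaving out `--language` when auto-detection is wanted is a choice made here.

```python
def build_whisper_command(audio, run_command=("whisper",), task="transcribe",
                          model="base", language=None,
                          word_timestamps=False, output_dir=None):
    """Return the command as a list of arguments (hypothetical helper)."""
    # The run-command is a sequence because it may consist of several
    # words, for example when Whisper is started through conda.
    cmd = list(run_command) + [audio, "--task", task, "--model", model]
    if language is not None:
        # Leaving the option out lets Whisper auto-detect the language.
        cmd += ["--language", language]
    if word_timestamps:
        cmd += ["--word_timestamps", "True"]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    # "session.wav" and "whisper_out" are placeholder names.
    print(build_whisper_command("session.wav", task="translate",
                                output_dir="whisper_out"))
```

Building a list rather than a single string avoids quoting problems with paths that contain spaces or backslashes; such a list can be passed directly to `subprocess.run`.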
If a parameter you would like to use (i.e. whose default value you would like to
change) is not in the list in `recognizer.cmdi`, you can add it there and it will
become an option in the user interface.

There are a few known issues, some depending on the platform:

- With an Anaconda installation on Windows, the run-command may need to be a full conda invocation, e.g. `C:\anaconda3\python.exe C:\anaconda3\Scripts\conda-script.py run --no-capture-output whisper`. The `--no-capture-output` option ensures that output is not captured by Anaconda. Without this option the results only become available at the very end of the processing.
- Even if `whisper` can be run from the Terminal, it is probably still necessary to provide the full path as the run-command, e.g. `/opt/homebrew/bin/whisper`.
- In some setups `whisper` subsequently cannot find `ffmpeg`.
- If `--word_timestamps` is set to `True` and the installed version of Whisper supports it (version 03-2023 or higher), the word-level segmentation will only be available in some of the exported output files. ELAN currently uses the `.srt` file to extract the word-level segmentation and add it as a second tier to the results. Therefore word-level segmentation will not be available in partial results, after canceling the recognition process. Also, the `.srt` import function (via the File->Import submenu) does not yet support the Whisper-specific encoding of words in `.srt`.
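
For readers who want to inspect the segmentation in a `.srt` file themselves, the following is a minimal sketch of reading standard SubRip cues into start time, end time and text. It is not ELAN's implementation and it does not decode the Whisper-specific word encoding; the function names and the sample text are made up for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour, minute, second and millisecond strings to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Return a list of (start_ms, end_ms, text) tuples, one per cue."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                start, end = to_ms(*groups[:4]), to_ms(*groups[4:])
                # Everything after the timing line is the cue text.
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues


# Made-up sample content in plain SubRip format.
SAMPLE = """1
00:00:00,000 --> 00:00:01,200
Hello there.

2
00:00:01,200 --> 00:00:02,750
How are you?
"""

if __name__ == "__main__":
    for cue in parse_srt(SAMPLE):
        print(cue)
```

Times are kept as integer milliseconds, which is a convenient unit for comparing cue boundaries with annotation boundaries.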