Automatic speech recognition in MATLAB/Octave based on the excellent whisper.cpp from Georgi Gerganov and models from OpenAI's Whisper.
First, clone the repository with submodules:
git clone --recurse-submodules https://github.com/gllmflndn/whisper.m.git
Then compile the MEX file using make in a Terminal:
make
The Accelerate and Metal frameworks will be used on macOS. On Windows, use MSYS2 and MinGW-w64; see MATLAB Support.
If compiling for Octave, execute the following instead from a Terminal:
make MEXBIN="mkoctfile --mex" MEXEXT=mex MEXOPT=""
To run whisper.m on a pre-recorded audio file (mono, 16 kHz) called input.wav:
w = whisper('small');
[segments,tokens] = w.transcribe('input.wav',...
'print_realtime', true,...
'print_progress', false);
whisper.display_tokens(tokens);
Pre-trained models will be downloaded automatically from Hugging Face when needed and stored in a models directory. Model options are tiny, tiny.en, base, base.en, small, small.en, medium, medium.en and large.
Another example that records audio data and runs whisper.m on it:
Fs = 16000;
nbits = 16;
nchannels = 1;
id = 1; % see audiodevinfo to select the audio device
rec = audiorecorder(Fs, nbits, nchannels, id);
recDuration = 10;
disp('Begin speaking.')
recordblocking(rec, recDuration);
disp('End of recording.')
y = getaudiodata(rec);
w = whisper('small');
[segments,tokens] = w.transcribe(y','print_progress', false);
whisper.display_tokens(tokens);
To extract the audio track from a video at 16 kHz mono, you can use ffmpeg:
ffmpeg -i video.mp4 -f wav -ar 16000 -ac 1 -vn audio.wav
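If the audio is already loaded in MATLAB/Octave at another sampling rate or with several channels, the same conversion can probably be done without ffmpeg. The following is a minimal sketch, not part of whisper.m: it assumes that resample is available (Signal Processing Toolbox in MATLAB, signal package in Octave) and that transcribe accepts a row vector of samples, as in the recording example above. A synthetic stereo signal stands in for the output of audioread, and the file name in the comment is hypothetical.

```matlab
% Stand-in for: [y, Fs] = audioread('interview.wav');  % hypothetical file name
Fs = 44100;                               % source sampling rate (Hz)
t  = (0:Fs-1)' / Fs;                      % one second of sample times
y  = [sin(2*pi*440*t), sin(2*pi*660*t)];  % stereo signal, one column per channel

% resample() comes from the Signal Processing Toolbox (MATLAB)
% or from the signal package (Octave), which must be loaded first.
if exist('OCTAVE_VERSION', 'builtin')
    pkg load signal
end

targetFs = 16000;                         % rate stated in this README
yMono = mean(y, 2);                       % downmix: average the channels
if Fs ~= targetFs
    y16 = resample(yMono, targetFs, Fs);  % includes an anti-aliasing filter
else
    y16 = yMono;
end

% Pass the samples as a row vector, as in the recording example:
% w = whisper('small');
% [segments, tokens] = w.transcribe(y16', 'print_progress', false);
```

Averaging the channels before resampling keeps the filter work to a single column. If resample is not available, converting the file with ffmpeg as shown above remains the simpler route.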
There is also a demo that uses an audio file shipped with whisper.cpp:
>> whisper.demo()
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 1
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 73.62 MB
whisper_model_load: model size = 73.54 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
whisper_init_state: compute buffer (conv) = 11.17 MB
whisper_init_state: compute buffer (encode) = 61.76 MB
whisper_init_state: compute buffer (cross) = 3.67 MB
whisper_init_state: compute buffer (decode) = 18.82 MB
And so my fellow Americans ask not what your country can do for you ask what you can do for your country