
Draft: Add near real-time audio transcription feature with Whisper.cpp and Vosk models.

Flore Harlé requested to merge feature/audio-transcription into master

Whisper.cpp and Vosk are external libraries. See the README of whisper.cpp for hardware optimization.

Whisper.cpp is preferred for English and Vosk for French. Choose the model size according to your hardware (CPU, iGPU or GPU). To build with this feature, use the preset option "ENABLE_AUDIO_TRANSCRIPTION": "ON", build in Release mode and disable the sanitizer.
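
For example, a CMake user preset could carry these settings. Only the "ENABLE_AUDIO_TRANSCRIPTION" key is quoted from this MR; the preset name, the inherited base, the Release build type and the sanitizer variable name are placeholders to adapt to the SDK's own presets:

{
  "name": "audio-transcription",
  "inherits": "default",
  "cacheVariables": {
    "ENABLE_AUDIO_TRANSCRIPTION": "ON",
    "CMAKE_BUILD_TYPE": "Release",
    "ENABLE_SANITIZER": "OFF"
  }
}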

The model and the path to it can be set in the linphonerc file, for example:

[transcription]
enabled=1
model_path=/[path to whisper.cpp]/ggml-base.en-q8_0.bin
method=whispercpp_overlap

or

[transcription]
enabled=1
model_path=/[path to Vosk]/vosk-model-fr-0.6-linto-2.2.0
method=vosk
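
For reference, these keys can be read back through liblinphone's generic config API. This is only an illustrative sketch: the section and key names come from the examples above, while the surrounding function is hypothetical.

#include <linphone/core.h>
#include <stdio.h>

// Read the [transcription] section shown above via the LinphoneConfig accessors.
static void dump_transcription_settings(LinphoneCore *core) {
    LinphoneConfig *cfg = linphone_core_get_config(core);
    int enabled = linphone_config_get_int(cfg, "transcription", "enabled", 0);
    const char *model = linphone_config_get_string(cfg, "transcription", "model_path", NULL);
    const char *method = linphone_config_get_string(cfg, "transcription", "method", "whispercpp_overlap");
    printf("transcription: enabled=%d method=%s model_path=%s\n", enabled, method, model ? model : "(unset)");
}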

A whisper.cpp model is downloaded by default during the build to [build folder]/linphone-sdk/desktop/whisper.cpp/models/ggml-base.en-q8_0.bin; this is the model used for the tests. To use Vosk in the tests, you have to download the model into the build folder yourself. The models are available here for Vosk and whisper.cpp.

The transcription is performed by a new MSFilter, MSTranscript, which can use a Whisper.cpp model or a Vosk model. For Whisper.cpp, a real-time algorithm sends the audio to the model in chunks of 3 s, each overlapping the previous one. The transcribed words are sent to Liblinphone through successive events, and a LinphoneTranscription instance assembles them into a sentence ready to be displayed. The application can display it through the transcription API. In conference mode, the name of the speaker is given with each sentence.
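
As an illustration of this overlapping-window idea (a minimal sketch, not the actual MSTranscript code: the 3 s chunk comes from the description above, while the 0.5 s overlap, the buffering and the whisper.cpp parameters are assumptions):

#include <string>
#include <vector>
#include "whisper.h"

// Illustrative constants: 16 kHz mono input, 3 s windows, 0.5 s overlap (the overlap value is an assumption).
static const int kSampleRate = 16000;
static const size_t kChunkSamples = 3 * kSampleRate;
static const size_t kOverlapSamples = kSampleRate / 2;

class WhisperChunker {
public:
    explicit WhisperChunker(const char *modelPath) : mCtx(whisper_init_from_file(modelPath)) {}
    ~WhisperChunker() {
        if (mCtx) whisper_free(mCtx);
    }

    // Feed 16 kHz float samples; returns the text of the current window once 3 s have accumulated.
    std::string feed(const float *samples, size_t count) {
        mBuffer.insert(mBuffer.end(), samples, samples + count);
        if (mBuffer.size() < kChunkSamples) return "";

        whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        params.print_progress = false;
        params.no_context = true; // each window is decoded independently in this sketch

        std::string text;
        if (whisper_full(mCtx, params, mBuffer.data(), (int)mBuffer.size()) == 0) {
            for (int i = 0; i < whisper_full_n_segments(mCtx); ++i)
                text += whisper_full_get_segment_text(mCtx, i);
        }
        // Keep the tail of the window so that the next one overlaps the previous one.
        std::vector<float> tail(mBuffer.end() - kOverlapSamples, mBuffer.end());
        mBuffer.swap(tail);
        return text;
    }

private:
    whisper_context *mCtx = nullptr;
    std::vector<float> mBuffer;
};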

The transcription process in the MSTranscript filter runs on a dedicated thread. The sampling rate of the audio must be 16000 Hz.
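
The thread hand-off can be pictured as follows; this is a generic sketch with hypothetical names, not the filter's actual code. The audio path only pushes 16 kHz samples into a queue, while the dedicated thread drains it and runs the model:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical worker: the audio thread calls push() from the filter's process callback,
// the dedicated thread pops blocks and feeds them to the ASR model.
class TranscriptionWorker {
public:
    TranscriptionWorker() : mThread([this] { run(); }) {}
    ~TranscriptionWorker() {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStop = true;
        }
        mCond.notify_one();
        mThread.join();
    }

    void push(std::vector<float> samples16k) { // samples must already be 16000 Hz mono
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mQueue.push_back(std::move(samples16k));
        }
        mCond.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::vector<float> block;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCond.wait(lock, [this] { return mStop || !mQueue.empty(); });
                if (mStop && mQueue.empty()) return;
                block = std::move(mQueue.front());
                mQueue.pop_front();
            }
            transcribe(block); // e.g. feed a WhisperChunker or a Vosk recognizer here
        }
    }
    void transcribe(const std::vector<float> & /*block*/) { /* model call goes here */ }

    std::deque<std::vector<float>> mQueue;
    std::mutex mMutex;
    std::condition_variable mCond;
    bool mStop = false;
    std::thread mThread; // declared last so the other members are ready before the thread starts
};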

New ASR models can easily be added or updated thanks to the AbstractTranscript class of the MSTranscript filter.
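
As a rough picture of that extension point (only the AbstractTranscript name comes from this MR; the methods and subclasses below are hypothetical), adding a backend amounts to implementing one subclass and selecting it from the "method" setting:

#include <memory>
#include <string>
#include <vector>

// Hypothetical shape of the extension point; the real interface may differ.
class AbstractTranscript {
public:
    virtual ~AbstractTranscript() = default;
    virtual bool loadModel(const std::string &path) = 0;
    // Consume 16 kHz mono samples and return any newly transcribed words.
    virtual std::string process(const std::vector<float> &samples16k) = 0;
};

class WhisperCppTranscript : public AbstractTranscript {
public:
    bool loadModel(const std::string &path) override { /* whisper_init_from_file(path.c_str()) */ return true; }
    std::string process(const std::vector<float> &) override { /* overlapping 3 s windows */ return ""; }
};

class VoskTranscript : public AbstractTranscript {
public:
    bool loadModel(const std::string &path) override { /* vosk_model_new(path.c_str()) */ return true; }
    std::string process(const std::vector<float> &) override { /* vosk_recognizer_accept_waveform(...) */ return ""; }
};

// A factory could then pick the backend from the "method" key of the [transcription] section.
std::unique_ptr<AbstractTranscript> makeTranscript(const std::string &method) {
    if (method == "vosk") return std::make_unique<VoskTranscript>();
    return std::make_unique<WhisperCppTranscript>();
}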

TODO:

  • remove the dependency on mediastreamer2 in transcription.cpp (liblinphone)
  • move the creation and start/pause of transcription filter to mediasession
  • move some methods from linphone core to call
  • write a test for a conference with several speakers
  • test a real conference with a custom desktop app to check the speaker names
  • manage the transcription duration so as not to increase the delay between spoken and transcribed words
  • work on correcting already transcribed words
  • define and implement the procedure to download, build and install
  • fix the build on macOS and Windows

Links:

BC/public/external/whisper.cpp!1

https://linphone.atlassian.net/wiki/spaces/TA/overview

MRs on the previous branch (project beginning):

!5074

liblinphone!3636

mediastreamer2!1070

For testing with the desktop app:

branch feature/audio-transcription-on-release based on release 5.4 https://gitlab.linphone.org/BC/public/linphone-sdk/-/tree/feature/audio-transcription-on-release?ref_type=heads

https://gitlab.linphone.org/BC/public/linphone-desktop/-/commits/feature/audio-transcription
