Draft: Add near real-time audio transcription feature with Whisper.cpp and Vosk models.
Whisper.cpp and Vosk are external libraries. See the README of whisper.cpp for hardware optimization.
Whisper.cpp is preferred for English and Vosk for French. Choose the model size according to your hardware (CPU, iGPU or GPU). To build with this feature, set "ENABLE_AUDIO_TRANSCRIPTION": "ON" in the CMake preset, build in Release mode and disable the sanitizer.
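For reference, a possible configure command, assuming the SDK's "default" CMake preset (adapt the preset name to your setup and disable the sanitizer through the corresponding SDK option):
cmake --preset=default -DENABLE_AUDIO_TRANSCRIPTION=ON -DCMAKE_BUILD_TYPE=Release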
The model and its path can be set in the linphonerc file, for example:
[transcription]
enabled=1
model_path=/[path to whisper.cpp]/ggml-base.en-q8_0.bin
method=whispercpp_overlap
or
[transcription]
enabled=1
model_path=/[path to Vosk]/vosk-model-fr-0.6-linto-2.2.0
method=vosk
A whisper.cpp model is downloaded by default during the build, into [build folder]/linphone-sdk/desktop/whisper.cpp/models/ggml-base.en-q8_0.bin; this is the model used for the tests. To run the tests with Vosk, you have to download the model into the build folder yourself. The models are available here for Vosk and whisper.cpp.
The transcription is performed by a new MSFilter, MSTranscript, that can use a Whisper.cpp model or a Vosk model. For Whisper.cpp, a real-time algorithm sends the audio to the model in chunks of 3 s that overlap with the previous one. The transcribed words are sent to Liblinphone through successive events, and a LinphoneTranscription instance assembles them into a sentence ready to be displayed. The application can display it thanks to the transcription API. In conference mode, the name of the speaker is given with each sentence.
The transcription process in the MSTranscript filter runs on a dedicated thread. The sampling rate of the audio must be 16000 Hz.
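To make the real-time behaviour concrete, here is a minimal C++ sketch of the overlapping-chunk logic described above. It is not the MSTranscript code: the 1 s overlap value, the OverlapChunker class and the callback shapes are assumptions chosen for illustration; only the 3 s chunk size and the 16000 Hz input come from this description.

// Minimal sketch of the overlapping 3 s chunking described above. Not the
// MSTranscript implementation: the 1 s overlap and the class/callback names
// are assumptions made for this example.
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

constexpr int kSampleRate = 16000;                  // MSTranscript expects 16000 Hz audio
constexpr size_t kChunkSamples = 3 * kSampleRate;   // 3 s chunks sent to the model
constexpr size_t kOverlapSamples = 1 * kSampleRate; // assumed overlap with the previous chunk

class OverlapChunker {
public:
    // transcribe: runs the ASR model on one chunk (whisper.cpp or Vosk behind the scenes).
    // onWords: receives the transcribed words, like the successive events sent to Liblinphone.
    OverlapChunker(std::function<std::string(const std::vector<int16_t> &)> transcribe,
                   std::function<void(const std::string &)> onWords)
        : mTranscribe(std::move(transcribe)), mOnWords(std::move(onWords)) {}

    // Feed 16 kHz mono PCM samples as they arrive from the audio graph.
    void push(const int16_t *samples, size_t count) {
        mBuffer.insert(mBuffer.end(), samples, samples + count);
        while (mBuffer.size() >= kChunkSamples) {
            std::vector<int16_t> chunk(mBuffer.begin(), mBuffer.begin() + kChunkSamples);
            // In the real filter the inference runs on the dedicated thread; called inline here.
            mOnWords(mTranscribe(chunk));
            // Keep the tail of this chunk so that the next one overlaps with it.
            mBuffer.erase(mBuffer.begin(), mBuffer.begin() + (kChunkSamples - kOverlapSamples));
        }
    }

private:
    std::deque<int16_t> mBuffer;
    std::function<std::string(const std::vector<int16_t> &)> mTranscribe;
    std::function<void(const std::string &)> mOnWords;
};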
The ASR models can be easily added or updated thanks to the AbstractTranscript class of the MSTranscript filter.
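For illustration only, a possible shape of such a backend abstraction is sketched below; the actual AbstractTranscript interface in the MSTranscript filter may differ, and all class and method names here are assumptions.

// Illustration only: the real AbstractTranscript API may differ, the method
// names below are assumptions.
#include <cstdint>
#include <string>
#include <vector>

class AbstractTranscriptSketch {
public:
    virtual ~AbstractTranscriptSketch() = default;
    // Load the model pointed to by model_path in the [transcription] section.
    virtual bool load(const std::string &modelPath) = 0;
    // Transcribe one chunk of 16 kHz mono PCM and return the recognized words.
    virtual std::string transcribe(const std::vector<int16_t> &pcm) = 0;
};

// Adding a new ASR engine then amounts to one more subclass, selected for
// instance through the "method" key of the [transcription] section.
class MyEngineTranscript : public AbstractTranscriptSketch {
public:
    bool load(const std::string &modelPath) override {
        // Initialize the third-party engine with modelPath here.
        return true;
    }
    std::string transcribe(const std::vector<int16_t> &pcm) override {
        // Run inference on the chunk and return the text.
        return {};
    }
};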
TODO:
- remove the dependency on mediastreamer2 in transcription.cpp (liblinphone)
- move the creation and start/pause of the transcription filter to MediaSession
- move some methods from linphone core to call
- write a test for a conference with several speakers
- test a real conference with a custom desktop app to check the speakers' names
- manage the transcription duration so as not to increase the delay between spoken and transcribed words
- work on the correction of an already transcribed word
- define and implement the procedure to download, build and install
- fix the build on macOS and Windows
Links:
BC/public/external/whisper.cpp!1
https://linphone.atlassian.net/wiki/spaces/TA/overview
MR on previous branch (project beginning):
For testing with the desktop app:
branch feature/audio-transcription-on-release based on release 5.4 https://gitlab.linphone.org/BC/public/linphone-sdk/-/tree/feature/audio-transcription-on-release?ref_type=heads
https://gitlab.linphone.org/BC/public/linphone-desktop/-/commits/feature/audio-transcription