Speech Recognition with VOSK-API

Introduction



VOSK is an offline speech recognition toolkit that gives users an easy way to do speech recognition in 20+ languages. VOSK modules are simple and lightweight. It is a pre-built module created on top of the Kaldi project, so it is easy to deploy.

VOSK also provides two types of models: small and large. We can swap in whichever model suits us best. According to the VOSK documentation, most small models allow dynamic vocabulary reconfiguration, while big models are static: their vocabulary cannot be modified at runtime. You can download the models from the VOSK website. A small model is typically around 50 MB in size and requires about 300 MB of memory at runtime. Big models are meant for high-accuracy transcription on a server and require up to 16 GB of memory, since they apply advanced AI algorithms.
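To make the "dynamic vocabulary" point concrete, here is a minimal sketch of how a recognizer can be limited to a handful of phrases. It assumes a small model is already unpacked in a local directory; the helper names build_grammar and make_restricted_recognizer are my own and not part of VOSK.

```python
import json


def build_grammar(phrases):
    """Return the JSON phrase list that KaldiRecognizer accepts as its third argument.

    "[unk]" is appended so that words outside the list map to an unknown token.
    """
    return json.dumps(list(phrases) + ["[unk]"])


def make_restricted_recognizer(model_path, phrases, sample_rate=16000):
    """Create a recognizer limited to `phrases` (only works with small models)."""
    # Imported here so build_grammar can be tried without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_path)
    return KaldiRecognizer(model, sample_rate, build_grammar(phrases))
```

With a big model this restriction is not available, which is the trade-off the documentation describes.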

Supported languages and dialects:

  • English

  • Indian English

  • German

  • French

  • Spanish

  • Portuguese

  • Chinese

  • Russian

  • Turkish

  • Italian

  • Dutch

  • Catalan

  • Arabic

  • Greek

  • Farsi

  • Filipino

  • Ukrainian

  • Kazakh

  • Japanese

  • Esperanto

  • Hindi

  • Czech

  • Polish

  • Vietnamese

  • Swedish

More to come.


VOSK currently supports the platforms listed below. An Android build and an iOS build are also available, but for those you have to contact the authors by sending an email to contact@alphacephei.com.

  • Linux on x86_64

  • Raspbian on Raspberry Pi 3/4

  • Linux on arm64

  • OSX (both x86 and M1)

  • Windows x86 and 64

Please note that speech recognition tools do not give perfectly accurate results. Sometimes you will get unexpected output, but the average accuracy of VOSK is very high.

Ok..! Let’s get started


Step 1: Create the virtual environment and activate it

    virtualenv venv
    source venv/bin/activate


Step 2: Install python3 and pip3

As mentioned in the documentation, make sure that you have Python 3.5 or above installed. I highly recommend Python 3.7 or above if you are going to containerize your project. I faced lots of issues with Python 3.6 and was only able to fix them after switching to Python 3.7.


    python3 --version

    pip3 --version
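If you prefer to check the version from inside Python (for example at the top of your own script), a small sketch could look like this. The 3.7 threshold is simply the recommendation from this post, and the function name is my own.

```python
import sys

# Minimum version recommended in this post (not a hard VOSK requirement).
MIN_VERSION = (3, 7)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python %d.%d or newer is recommended" % MIN_VERSION)
```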

Step 3: Install VOSK


    pip install vosk


Step 4: Clone the GitHub repository

    git clone https://github.com/alphacep/vosk-api.git

This will clone the vosk-api repository from GitHub and create a directory called “vosk-api”.


Then change into the vosk-api directory and take a look at its directory structure.

It contains modules that support Python, C, C#, Java, NodeJS, Ruby, etc. But here, we’re only talking about the Python module.

Ok..! Now we are all set up.

Speech recognition using an audio file

First of all, go to the directory called “python”, then to the “example” directory inside it, and see what files we have there.

For this first example, I’m going to use the default test.wav file in the vosk project.

Note that the first time you run these commands, the English small model is downloaded and set up automatically by default.

1. Process a .wav file as text.

Note that when you use your own audio file, it must be in the correct format: PCM, 16 kHz, 16-bit, mono. Otherwise, if you have ffmpeg installed, you can use test_ffmpeg.py, which does the conversion for you.
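You can check a file against that format with Python's standard wave module before handing it to VOSK. This is only a sketch: the function name is my own, and the 16 kHz default is the rate stated above.

```python
import wave


def is_vosk_ready(path, sample_rate=16000):
    """Return True if `path` is an uncompressed 16-bit mono WAV at `sample_rate`."""
    try:
        with wave.open(path, "rb") as wf:
            return (
                wf.getnchannels() == 1          # mono
                and wf.getsampwidth() == 2      # 16-bit samples (2 bytes)
                and wf.getcomptype() == "NONE"  # plain PCM, no compression
                and wf.getframerate() == sample_rate
            )
    except wave.Error:
        # Not a WAV file the wave module can read (e.g. a compressed format).
        return False
```

If this returns False for your file, the ffmpeg route is the easier way to go.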

The test_text.py script takes the audio file path as a command line argument. Simply run the following command to get the output.


    python test_text.py test.wav
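For readers who want to see what happens under the hood, here is a rough sketch of what a script like this has to do: load a model, feed the audio to a recognizer in chunks, and collect the text. It is my own simplified version, not a copy of test_text.py, and the helper names are hypothetical.

```python
import json
import wave


def text_from_result(result_json):
    """Pull the recognized text out of one JSON result string."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(path, lang="en-us"):
    """Transcribe a PCM 16-bit mono WAV file and return the text."""
    # Imported here so text_from_result can be tried without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(lang=lang)  # downloads the default small model on first use
    pieces = []
    with wave.open(path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when an utterance is complete.
            if rec.AcceptWaveform(data):
                pieces.append(text_from_result(rec.Result()))
        pieces.append(text_from_result(rec.FinalResult()))
    return " ".join(p for p in pieces if p)
```

Calling transcribe_wav("test.wav") should then return the transcription as a plain string.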


2. Transcribe different types of audio files.

To transcribe other types of audio files we can use the Python script called “test_ffmpeg.py”. This script requires FFmpeg to be installed in your environment. To install FFmpeg, run the following commands.
        
    sudo apt update
    sudo apt install ffmpeg
    ffmpeg -version

If you are on a different Linux distribution, install FFmpeg with its default package manager instead.

Then simply run the following command giving the audio file path as a command line argument.

    
    python test_ffmpeg.py test.wav


To try out different types of audio files, copy them to this location and give the correct file name as the command line argument. For example,

    python test_ffmpeg.py test.mp4
    python test_ffmpeg.py test.mp3
    python test_ffmpeg.py test.ogg

I tried this out with .mp4, .mp3 and .ogg files and got the same results as before.
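The reason all of these formats work is that FFmpeg decodes them into the raw PCM stream VOSK expects. A minimal sketch of that conversion step is shown below; the function names are my own, and the exact flags used inside test_ffmpeg.py may differ.

```python
import subprocess


def ffmpeg_pcm_command(path, sample_rate=16000):
    """Build an ffmpeg command that writes 16-bit mono PCM to stdout."""
    return [
        "ffmpeg", "-loglevel", "quiet",
        "-i", path,
        "-ar", str(sample_rate),  # resample to the target rate
        "-ac", "1",               # downmix to mono
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout instead of a file
    ]


def read_pcm(path, sample_rate=16000):
    """Decode any file ffmpeg understands into raw PCM bytes."""
    result = subprocess.run(
        ffmpeg_pcm_command(path, sample_rate),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

The bytes returned by read_pcm can be passed to a recognizer's AcceptWaveform in chunks, just like frames read from a .wav file.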

Replace the default model with another language model

First, download the model into your project. Here, I’ll choose the Arabic language and use a small Arabic model. If you want to use a large model, no worries: the steps are exactly the same.

After downloading the model, unzip it and copy it to our project directory (vosk-api/python/example/). I renamed my model directory to “model-ar” for convenience.


Now open our Python script in a text editor.

    vi test_ffmpeg.py

In the code you can see the line “model = Model(lang="en-us")”. It passes en-us as the language when creating the “Model” object, which means we are using the English model. Remove that parameter and pass the path to our model directory as a string instead, for example Model("model-ar"). Then save the file and quit.
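As a sketch, the change amounts to something like the following. The load_model wrapper and its directory check are my own additions for illustration; in the example script you only need to edit the single Model(...) line.

```python
import os


def load_model(model_dir="model-ar"):
    """Load a VOSK model from a local directory instead of the default download."""
    if not os.path.isdir(model_dir):
        raise FileNotFoundError("model directory not found: " + model_dir)

    # Imported after the check so a wrong path fails fast with a clear message.
    from vosk import Model

    # Before: Model(lang="en-us")  -> default English small model
    # After:  Model("model-ar")    -> the model we unpacked ourselves
    return Model(model_dir)
```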


Then, run the following command and see the output. I have already placed an Arabic audio file as test.mp3 in my current directory.

    python test_ffmpeg.py test.mp3



That’s it. Pretty cool. :)

Do I need to keep all of these files?

The answer is no. You can keep only the files you need. Let’s take a look. First, I’ll create a directory called vosk-isolated in a different location (../../../).


Then go back to the directory where our files are located and copy the following files to the newly created directory.

  • The Python file that you want (for this example, test_ffmpeg.py)

  • Model directory

  • Audio file (test.mp3)


    cp test_ffmpeg.py ../../../vosk-isolated/
    cp -r model-ar/ ../../../vosk-isolated/
    cp test.mp3 ../../../vosk-isolated/

Then go to your new directory and check its contents.

These are the only things that you need. Run the following command again and see the output.

    python test_ffmpeg.py test.mp3

That’s very simple :)

Conclusion

Nowadays speech recognition is in high demand, and many people are looking for speech recognition solutions. More and more resources are available for online speech recognition, but there are fewer solutions that can be deployed on-premise without an internet connection. In that case, VOSK-API is a valuable option. Here we discussed the installation of VOSK and some of its uses, such as processing audio files in different formats and switching language models. VOSK also has modules for many programming languages, so we can choose the one we prefer. All in all, VOSK-API is a powerful tool for offline speech recognition.

Thank you..!

Sandares Dhanujaya
Undergraduate,
University of Colombo School of Computing
