Offline Speech Recognition

Introduction



Most businesses try to reach their customers in ways that match the customers' culture, so that customers interact more with the business. Reaching customers in their own language is an excellent way to do this, so recognizing a customer's intention in their own language is very important. Here, we're mainly focusing on what speech recognition is, the options we have for doing speech recognition, and speech recognition in English and Arabic. We'll also provide enough guidance to switch to other languages easily.

Natural Language Processing (NLP) is the core of speech recognition. It's like teaching a language to a small child: we train a model on how the language is used and what its rules are by providing sample data; the model learns from that data, and afterwards we can get predictions from it. NLP is a trending machine learning application that is used in many areas to give people a better experience.

Let’s start with the content..!

Speech Recognition

Basically, lots of speech recognition services are available on the internet as online services. As an example,

( Documentation: Azure Speech Recognition, Google Speech-to-Text ) can be considered huge platforms where we can perform speech recognition, and they are very accurate in their predictions. However, some organizations are concerned about the cost they have to pay for these services, some don't have cloud subscriptions, and some don't want to use cloud platforms at all because they don't want to share their business data with third parties, since the business holds very sensitive data. At this point, organizations start looking for applications that can run in offline mode, connected to our own model. In some cases, of course, it doesn't matter whether the internet is used or not.

Offline Speech Recognition

We are going to focus widely on offline speech recognition systems here. These are the methods we're going to demonstrate:

  1. Mozilla DeepSpeech

  2. Arabic Speech Recognition with Klaam

  3. Vosk-API (developed using the Kaldi project)

Let's go through them one by one.

1. Mozilla DeepSpeech

DeepSpeech is an open-source Speech-to-Text engine. It is the easiest way to do speech recognition using TensorFlow. TensorFlow, developed by Google, is a huge platform for performing Artificial Intelligence, Machine Learning, and Deep Learning operations in various ways. Before using DeepSpeech, make sure that TensorFlow is properly installed on your machine. DeepSpeech Documentation.

Step 1: Creating a virtual environment

    virtualenv -p python3 $HOME/tmp/deepspeech-venv/
    source $HOME/tmp/deepspeech-venv/bin/activate

Step 2: Install Tensorflow

    # Requires the latest pip
    pip install --upgrade pip
    # Current stable release for CPU and GPU
    pip install tensorflow

If you want further support, you can refer to the TensorFlow Official Documentation.
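A quick sanity check from Python can confirm the installation worked before moving on (a minimal sketch; it simply reports whether TensorFlow can be imported):

```python
import importlib.util


def tensorflow_status():
    """Return the installed TensorFlow version, or None if it is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    return tf.__version__


# Prints the version when TensorFlow is available, a notice otherwise.
print(tensorflow_status() or "TensorFlow is not installed")
```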

Step 3: Install DeepSpeech

    pip3 install deepspeech

Step 4: Download English Model Files

    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

Step 5: Download the Sample Audio file

    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
    tar xvf audio-0.9.3.tar.gz

Step 6: Transcribe an audio file

    deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
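The same transcription can also be done from Python with the deepspeech package's Model API. Here is a minimal sketch, assuming the v0.9.3 model and scorer files downloaded in Step 4 (the helper `read_wav_mono_int16` is our own, not part of DeepSpeech):

```python
import struct
import wave


def read_wav_mono_int16(path):
    """Read a 16-bit mono WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    # Unpack little-endian signed 16-bit samples
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return list(samples), rate


def transcribe(audio_path, model_path, scorer_path):
    """Transcribe a WAV file using the DeepSpeech Python API."""
    import numpy as np
    from deepspeech import Model

    ds = Model(model_path)                # load the acoustic model (.pbmm)
    ds.enableExternalScorer(scorer_path)  # attach the language-model scorer
    samples, rate = read_wav_mono_int16(audio_path)
    # DeepSpeech 0.9.x expects 16 kHz, 16-bit mono PCM audio
    return ds.stt(np.array(samples, dtype=np.int16))
```

Usage would look like `transcribe("audio/2830-3980-0043.wav", "deepspeech-0.9.3-models.pbmm", "deepspeech-0.9.3-models.scorer")`.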

This is pretty simple because there's a pre-trained model available for English speech recognition. But if you want to do speech recognition in other languages, you can create your own model. Check out the latest release, including pre-trained models, on GitHub by clicking here. You can check which language datasets are available by clicking here. DeepSpeech mainly uses Common Voice datasets for its training. There are also some ready-made models we can use; for Arabic, you can find these models here. Note that some models require a CUDA device, which means your machine must have a GPU.

2. Arabic Speech Recognition with Klaam

Klaam is a powerful project that supports Natural Language Understanding in the Arabic language. It provides Speech Recognition, Text-to-Speech, and Speech Classification.

Let’s dive into how to create a klaam project.

It's a best practice to create a virtual environment before starting your project, because tools already installed on your machine or the server can sometimes affect it. We can isolate our project by creating a virtual environment.

Step 1: Create a virtual environment

    virtualenv venv  # or: python3 -m venv venv
    source venv/bin/activate

Step 2: Clone the klaam project from GitHub and move into its directory.

    git clone https://github.com/ARBML/klaam.git
    cd klaam


Step 3: Install the dependencies

    pip install -r requirements.txt

Note that if you run into issues while installing packages, make sure your Python version is 3.7 or above.

Some Python installations do not ship with the IPython library built in. If you hit this issue, run the following command.

    pip install IPython

2.1. Speech Recognition
First, copy your “.wav” audio file into your project directory and replace “samples/demo.wav” with the path to your audio file.
speech_recognition.py

    from IPython.display import Audio
    from klaam import SpeechRecognition

    model = SpeechRecognition()
    data = model.transcribe('samples/demo.wav')
    print(data)

Run your application: python speech_recognition.py

2.2. Speech Classification

As with Speech Recognition, place your “.wav” file in your project directory and replace “samples/demo.wav” with the path to your audio file.

speech_classification.py

    from IPython.display import Audio
    from klaam import SpeechClassification

    model = SpeechClassification()
    data = model.classify('samples/demo.wav')
    print(data)

Run your application: python speech_classification.py

2.3. Text-To-Speech

Here you have to consider the root path. Make sure root_path points to the directory where the “cfgs” folder is located, and assign your Arabic sentence to the arabic_sentence variable.

text_to_speech.py

    from IPython.display import Audio
    from klaam import TextToSpeech

    root_path = "./"
    prepare_tts_model_path = "cfgs/FastSpeech2/config/Arabic/preprocess.yaml"
    model_config_path = "cfgs/FastSpeech2/config/Arabic/model.yaml"
    train_config_path = "cfgs/FastSpeech2/config/Arabic/train.yaml"
    vocoder_config_path = "cfgs/FastSpeech2/model_config/hifigan/config.json"
    speaker_pre_trained_path = "data/model_weights/hifigan/generator_universal.pth.tar"

    model = TextToSpeech(prepare_tts_model_path, model_config_path, train_config_path, vocoder_config_path, speaker_pre_trained_path, root_path)

    arabic_sentence = "..."  # put your Arabic sentence here
    model.synthesize(arabic_sentence)

Run your application: python text_to_speech.py

This will create a “.wav” file called “sample.wav” containing the synthesized speech.


3. VOSK-API

The Vosk API is a powerful tool for offline speech recognition, with bindings for many programming languages such as Python, C#, Java, Node.js, and Ruby.
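To give a feel for the API, here is a minimal Python sketch using the vosk package. The model directory path is an assumption — point it at whichever language model you download from the Vosk model list; the `frame_chunks` helper is our own, not part of Vosk:

```python
import json
import wave


def frame_chunks(raw_bytes, chunk_size=4000):
    """Split raw PCM bytes into fixed-size chunks for streaming recognition."""
    return [raw_bytes[i:i + chunk_size]
            for i in range(0, len(raw_bytes), chunk_size)]


def transcribe_wav(wav_path, model_dir):
    """Transcribe a 16-bit mono WAV file with a downloaded Vosk model."""
    from vosk import KaldiRecognizer, Model

    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        raw = wf.readframes(wf.getnframes())

    pieces = []
    for chunk in frame_chunks(raw):
        if rec.AcceptWaveform(chunk):  # a complete utterance was decoded
            pieces.append(json.loads(rec.Result())["text"])
    # Flush whatever audio is still buffered in the recognizer
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

A call would look like `transcribe_wav("audio/demo.wav", "model")`, where “model” is the unpacked model directory.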

Check out my post on Speech Recognition with VOSK-API.

Conclusion

Nowadays, most organizations are trying to reach their customers in the customers' own languages, and voice recognition models play a major role in automating these processes. In most cases, organizations look for offline solutions because they want to reduce internet costs, and because their data is sensitive they don't want to rely on online services. Here we discussed powerful solutions for overcoming these issues with Mozilla DeepSpeech, the Kaldi project, and the VOSK-API.

Thank you..!

Sandares Dhanujaya
Undergraduate,
University of Colombo School of Computing
