Speech commands inside Android.

George Soloupis
4 min read · Sep 19, 2022

Written by George Soloupis, ML GDE.

This is a project that demonstrates how to use Google’s Speech Commands example inside an Android application. Starting from a TensorFlow 1.x frozen graph, we will demonstrate the conversion to a .tflite file and then how inference is performed on the mobile phone.

With this machine learning example, users can capture audio input and recognize commands like “up”, “down”, “left”, “right”, etc. You can download the frozen graph and the audio labels using this link. If you do not want to capture audio and would rather have some .wav files to use right away, download a dataset with this link.

After downloading the frozen graph, open it with Netron to visualize its structure:

Model’s layers.

We can see that the model uses the Mfcc operator (Mel-Frequency Cepstral Coefficient), which transforms a spectrogram into a form that’s useful for speech recognition.

Mel Frequency Cepstral Coefficients are a way of representing audio data that’s been effective as an input feature for machine learning. They are created by taking the spectrum of a spectrogram (a ‘cepstrum’), and discarding some of the higher frequencies that are less significant to the human ear. They have a long history in the speech recognition world, and this Wiki link is a good resource to learn more.

Inside Colab, clone the TensorFlow repository:

!git clone https://github.com/tensorflow/tensorflow.git

Navigate to the Speech Commands folder:

%cd tensorflow/tensorflow/examples/speech_commands

Use the label_wav.py Python file and get the results right away:

!python label_wav.py \
--graph=/content/speech_commands_files/conv_actions_frozen.pb \
--labels=/content/speech_commands_files/conv_actions_labels.txt \
--wav=/content/down.wav

This example was created with TensorFlow version 1.x. We have to convert it to the saved_model format and use that to generate the .tflite file that is going to be used inside Android.

The procedure is as follows:

  1. View the input and output layers’ names:
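
A minimal sketch of how to do this, assuming the frozen graph sits at the path used above and that the input placeholder is the first node and the output the last one:

import tensorflow as tf

# Load the frozen graph definition.
with tf.io.gfile.GFile('/content/speech_commands_files/conv_actions_frozen.pb', 'rb') as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Print the names of the input and output layers.
print('Input :', graph_def.node[0].name)   # e.g. wav_data
print('Output:', graph_def.node[-1].name)  # e.g. labels_softmax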

2. You can also view all the layers’ names:
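
For example:

# Reusing the graph_def loaded above, print every layer (op type and name).
for node in graph_def.node:
    print(node.op, node.name)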

3. After adding the appropriate imports, we convert to the saved_model format:

import tensorflow as tf
import numpy as np
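
A minimal sketch of the export, assuming the input and output tensors are wav_data:0 and labels_softmax:0 (the defaults of label_wav.py) and that the SavedModel is written to /content/saved_model:

export_dir = '/content/saved_model'

with tf.io.gfile.GFile('/content/speech_commands_files/conv_actions_frozen.pb', 'rb') as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_dir)
with tf.compat.v1.Session(graph=tf.Graph()) as sess:
    tf.compat.v1.import_graph_def(graph_def, name='')
    inputs = sess.graph.get_tensor_by_name('wav_data:0')
    outputs = sess.graph.get_tensor_by_name('labels_softmax:0')
    signature = tf.compat.v1.saved_model.predict_signature_def(
        inputs={'wav_data': inputs}, outputs={'labels_softmax': outputs})
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.compat.v1.saved_model.tag_constants.SERVING],
        signature_def_map={'serving_default': signature})
builder.save()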

4. Having the saved_model folder, you convert it to .tflite:
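
For example:

converter = tf.lite.TFLiteConverter.from_saved_model('/content/saved_model')
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.SELECT_TF_OPS     # enable TensorFlow ops.
]
tflite_model = converter.convert()

with open('speech_commands_model.tflite', 'wb') as f:
    f.write(tflite_model)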

5. Finally, we verify the conversion by running inference with the .tflite file. That way we are sure the outputs of the TensorFlow Lite Interpreter are the same as the outputs of label_wav.py:
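
Continuing in the same Colab session, a sketch of this check (the exact wrapping of the string input may need adjustment depending on the TensorFlow version):

interpreter = tf.lite.Interpreter(model_path='speech_commands_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# The model expects the raw bytes of the .wav file as a string tensor.
with open('/content/down.wav', 'rb') as f:
    wav_bytes = f.read()

interpreter.set_tensor(input_details[0]['index'], np.array(wav_bytes, dtype=object))
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]['index']))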

Visualizing the speech_commands_model.tflite file with Netron, we can see the input and output types:

Netron visualization of the file.

Notice that we have the FlexDecodeWav operator, and that is why we used:

converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.SELECT_TF_OPS     # enable TensorFlow ops.
]

during the conversion. We can also see that the model expects a (string) bytes object and that it will output a 2D float array. To open the .wav file we use Python’s open() method. Based on the documentation:

“Python Open() function: Files opened in binary mode (including ‘b’ in the mode argument) return contents as bytes objects without any decoding.”

Reading the file in binary mode converts the .wav file into a bytes object that contains the header and the actual bytes of the audio. Check an explanation of how a .wav file is formatted in this Medium post.

The exact same procedure of loading and converting the .wav file has to be performed inside Android, so that we end up with a byte array that can be used by the Interpreter.

  1. First, copy-paste the speech_commands_model.tflite and the down.wav files into the assets folder of the Android project:

2. Inside the app’s build.gradle file we have to include these dependencies:

For the TensorFlow Lite Interpreter:
implementation 'org.tensorflow:tensorflow-lite:2.9.0'

For the TensorFlow Flex Delegate:
implementation 'org.tensorflow:tensorflow-lite-select-tf-ops:2.9.0'

The select_tf_ops dependency will add 100 MB to the final .apk. If you are building a production application and size is an issue, you can create custom .aar files using the procedure here.

3. We instantiate the TensorFlow Lite Interpreter using the .tflite file from assets:
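
A minimal Kotlin sketch (the loadModelFile helper is illustrative); with the select-tf-ops dependency on the classpath, the Flex delegate is linked automatically:

import android.content.Context
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import org.tensorflow.lite.Interpreter

// Memory-maps the model that we copied into the assets folder.
// Note: .tflite assets must not be compressed (aaptOptions { noCompress "tflite" }).
fun loadModelFile(context: Context, fileName: String): MappedByteBuffer {
    val fileDescriptor = context.assets.openFd(fileName)
    FileInputStream(fileDescriptor.fileDescriptor).use { stream ->
        return stream.channel.map(
            FileChannel.MapMode.READ_ONLY,
            fileDescriptor.startOffset,
            fileDescriptor.declaredLength
        )
    }
}

// Inside an Activity (or any place with a Context):
val interpreter = Interpreter(loadModelFile(context, "speech_commands_model.tflite"))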

4. Load the down.wav file, convert it to a byte array and perform inference:
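
A sketch of the inference call, assuming we are inside the same Activity and that the model scores the 12 labels of conv_actions_labels.txt (the Java/Kotlin Interpreter accepts a byte array for a string input tensor):

// Read the .wav file from assets as raw bytes (RIFF header + audio data).
val wavBytes: ByteArray = assets.open("down.wav").use { it.readBytes() }

// The model fills a [1, numLabels] float array with the scores.
val output = Array(1) { FloatArray(12) }
interpreter.run(wavBytes, output)
println(output[0].joinToString())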

Printing the outputs, we get the exact same results as when we performed inference with the Python API. Based on the results we can get the hotword from the labels text file that is placed inside the assets folder. The TensorFlow Lite Support library can also be of great use, following the example here.
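
For example, a quick Kotlin sketch (the labels file name is the one we downloaded earlier):

// Load the labels (one per line) and pick the one with the highest score.
val labels = assets.open("conv_actions_labels.txt").bufferedReader().readLines()
val scores = output[0]
val bestIndex = scores.indices.maxByOrNull { scores[it] } ?: 0
println("Recognized command: ${labels[bestIndex]} (score ${scores[bestIndex]})")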

If you want to collect sound from the phone’s microphone, you can follow the procedure demonstrated in this project. With it you can record audio, create an appropriate .wav file, or directly create a byte array with the proper header bytes prepended.

Conclusion
Using a TensorFlow 1.x frozen graph, we can follow a procedure that generates a .tflite file which can be used inside an Android application. Visualization tools such as Netron can be of great help to determine the input and output data types. The procedure is straightforward after training the model, and it gives us speech commands inside a mobile phone.


George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I’m a senior Android developer at Invisalign and an ML & Android GDE.