Classification of sounds using android mobile phone and the YAMNet ML model

George Soloupis
3 min read · Dec 9, 2020


Written by George Soloupis, ML GDE

This is part 2 of a tutorial on how to classify sound recorded with a phone’s microphone into more than 500 classes using the extraordinary YAMNet machine learning model. (Part 1)

Now that we have explained the architecture of the model and benchmarked it, we have a .tflite file that can be downloaded from TensorFlow Hub and used on a mobile phone. This model file has no metadata, so the application uses the TensorFlow Lite interpreter directly for inference.

The procedure is as follows:

  1. The phone’s microphone records sound, which is converted into an array of floats.
  2. The array is passed to the interpreter, which uses the model file stored inside the assets folder.
  3. The interpreter generates three outputs: scores, embeddings and spectrograms.
  4. The output scores are used to get the top-K classes of the inference.
  5. The top classes are displayed on screen.

The code can be found in this GitHub repository. There you will also find info about the model and an executable Colab notebook where you can run inference on your .wav files with TensorFlow and the TensorFlow Lite interpreter.

Collecting the sound is straightforward. Using the AudioRecord class you start recording and produce a ByteArrayOutputStream and an ArrayList<Short>.
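With 16-bit PCM encoding, AudioRecord delivers pairs of little-endian bytes that are combined into Short samples before further processing. A minimal sketch of that conversion (the function name is illustrative, not the exact code from the project):

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Convert a buffer of 16-bit little-endian PCM bytes (as produced by
// AudioRecord with ENCODING_PCM_16BIT) into a list of Short samples.
fun pcm16BytesToShorts(bytes: ByteArray): List<Short> {
    val buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val shorts = ArrayList<Short>(bytes.size / 2)
    while (buffer.remaining() >= 2) {
        shorts.add(buffer.short)
    }
    return shorts
}
```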

Pay attention to how we increase the microphone gain:
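A common way to boost gain in software is to multiply each sample by a factor and clamp the result to the valid 16-bit range so that loud sounds don’t overflow and distort. A hedged sketch of that idea (the function name and the gain value are illustrative; the project’s exact factor may differ):

```kotlin
// Boost microphone gain by multiplying each 16-bit sample by a factor,
// clamping to the Short range to avoid integer overflow distortion.
fun applyGain(samples: ShortArray, gain: Float = 2.0f): ShortArray =
    ShortArray(samples.size) { i ->
        (samples[i] * gain)
            .coerceIn(Short.MIN_VALUE.toFloat(), Short.MAX_VALUE.toFloat())
            .toInt()
            .toShort()
    }
```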

You can find the full sound-collection implementation in the ListeningRecorder class.

The input must be normalized into floats between -1 and 1. To normalize it, we just divide all the values by 2^15, or 32768 in our code (for 16-bit integers, the range is -32768 … +32767).

val floatsForInference = FloatArray(arrayListShorts.size)
for ((index, value) in arrayListShorts.withIndex()) {
    floatsForInference[index] = (value / 32768F)
}

This FloatArray is passed to the YamnetModelExecutor class, and inference is done inside the execute function.
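Because the model exposes three outputs, the execute step can use the interpreter’s multi-output API. A sketch of what that call looks like, assuming the standard `org.tensorflow.lite.Interpreter` class (the output shapes below are illustrative for a clip that yields 4 frames, not values taken from the project’s code):

```kotlin
import org.tensorflow.lite.Interpreter

// Run YAMNet with the plain TFLite Interpreter. The model has one input
// (the float waveform) and three outputs: scores, embeddings and the
// log-mel spectrogram. Shapes here assume 4 frames and are illustrative.
fun runYamnet(interpreter: Interpreter, waveform: FloatArray): Array<FloatArray> {
    val inputs = arrayOf<Any>(waveform)
    val scores = Array(4) { FloatArray(521) }       // output 0: per-frame class scores
    val embeddings = Array(4) { FloatArray(1024) }  // output 1: per-frame embeddings
    val spectrogram = Array(240) { FloatArray(64) } // output 2: log-mel spectrogram
    val outputs = mapOf<Int, Any>(0 to scores, 1 to embeddings, 2 to spectrogram)
    interpreter.runForMultipleInputsOutputs(inputs, outputs)
    return scores
}
```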

The phone is tuned to collect 2 seconds of sound repeatedly. In the first part of the YAMNet model, input features are framed into 50%-overlapping examples of 0.96 seconds each. Because of this, in our example the model outputs 4 arrays of scores. We then average them along axis 0.

val arrayMeanScores = FloatArray(521) { 0f }
for (i in 0 until 521) {
    // Find the average of the 4 arrays at axis = 0
    arrayMeanScores[i] = arrayListOf(
        arrayScores[0][i],
        arrayScores[1][i],
        arrayScores[2][i],
        arrayScores[3][i]
    ).average().toFloat()
}

The model’s classes are provided as a .txt file stored inside the assets folder. Using the TensorFlow Lite Support library, converting it to an ArrayList is very easy.

Add this to the build.gradle file:

implementation('org.tensorflow:tensorflow-lite-support:0.0.0-nightly')

Get the ArrayList of classes:

val labels = FileUtil.loadLabels(context, "classes.txt")

Here are the first values of the text file:

Speech
Child speech, kid speaking
Conversation
Narration, monologue
Babbling
Speech synthesizer
Shout
Bellow
Whoop
Yell
Children shouting
Screaming
Whispering
Laughter
Baby laughter
Giggle
Snicker
Belly laugh
………………………

Having the ArrayList with the average scores of the 521 classes, we find the top 10 classes and their labels:
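The top-10 selection can be sketched by pairing each mean score with its index, sorting, and looking up the labels (the function name is illustrative; `labels` is the list returned by FileUtil.loadLabels above):

```kotlin
// Take the k highest-scoring classes and pair each score with its label.
fun topK(scores: FloatArray, labels: List<String>, k: Int = 10): List<Pair<String, Float>> =
    scores.withIndex()
        .sortedByDescending { it.value }
        .take(k)
        .map { labels[it.index] to it.value }
```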

So in the end we have the probabilities and the names of the top 10 classes. These values are then passed to the main UI and displayed on screen (due to screen-space limitations, only 5 classes are shown):

You can see the application in use below:

Project available here:

https://github.com/farmaker47/Yamnet_classification_project

This project is written in Kotlin and features:

  1. TensorFlow Lite Support Library usage
  2. TensorFlow Lite interpreter usage
  3. Databinding
  4. MVVM with Coroutines
  5. Koin for dependency injection

Future scope for improvement:

  • Tune audio recording to less than 2 seconds and find the optimal duration that gives faster results while keeping great accuracy.
  • Add metadata to the .tflite file so it can be used with ML Model Binding.

This brings us to the end of the tutorial. I hope you have enjoyed reading it and will apply what you learned to your real-world applications with TensorFlow Lite. Visit TensorFlow Hub for a vast variety of model files! For more information and contributions visit:

Thanks to Sayak Paul and Le Viet Gia Khanh for their reviews and support.


I am a pharmacist turned Android developer and machine learning engineer. I am currently a senior Android developer at Invisalign and an ML & Android GDE.