Classification of sounds using android mobile phone and the YAMNet ML model
Written by George Soloupis, ML GDE
This is part 1 of a tutorial on how to classify sound recorded with a phone’s microphone into more than 500 classes using the extraordinary YAMNet machine learning model.
The tutorial is divided into two parts; feel free to follow along or skip to the part that is most interesting or relevant to you:
- Part 1: Architecture of ML model, conversion to TensorFlow Lite (TFLite), benchmarking of the model
- Part 2: Android implementation
Architecture of ML model
YAMNet is a pretrained deep net that predicts 521 audio event classes based on the AudioSet-YouTube corpus, employing the Mobilenet_v1 depthwise-separable convolution architecture. You can see how the model is constructed by downloading this image.
The model was trained on audio features computed as follows:
- All audio is resampled to 16 kHz mono.
- A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
- A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125–7500 Hz.
- A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.001) where the offset is used to avoid taking a logarithm of zero.
- These features are then framed into 50%-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
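The framing arithmetic in the steps above can be sketched in a few lines of plain Python. The constants come straight from the feature description; the function names are ours:

```python
SAMPLE_RATE = 16000   # all audio is resampled to 16 kHz mono
STFT_WINDOW = 400     # 25 ms window at 16 kHz
STFT_HOP = 160        # 10 ms hop at 16 kHz
PATCH_FRAMES = 96     # 96 frames of 10 ms = 0.96 s per example
PATCH_HOP = 48        # 50%-overlapping examples

def num_stft_frames(num_samples):
    # Number of complete 25 ms windows that fit with a 10 ms hop.
    return 1 + (num_samples - STFT_WINDOW) // STFT_HOP

def num_patches(num_samples):
    # Number of 96-frame, 50%-overlapping examples.
    return 1 + (num_stft_frames(num_samples) - PATCH_FRAMES) // PATCH_HOP

# 975 ms of audio yields exactly 96 STFT frames, i.e. the first patch:
print(num_stft_frames(int(0.975 * SAMPLE_RATE)))  # 96
print(num_patches(int(0.975 * SAMPLE_RATE)))      # 1
```

This also explains the note below about needing at least 975 ms of waveform: that is the shortest input for which a full 96-frame patch exists.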
These 96x64 patches are then fed into the Mobilenet_v1 model to yield a 3x2 array of activations for 1024 kernels at the top of the convolution. These are averaged to give a 1024-dimension embedding, then put through a single logistic layer to get the 521 per-class output scores corresponding to the 960 ms input waveform segment. (Because of the window framing, you need at least 975 ms of input waveform to get the first frame of output scores.)
The model uses a large number of convolution and depthwise separable convolution layers.
You can find extensive information about depthwise separable convolutions in this article. Briefly, the differences between a standard convolution, a depthwise convolution, and a depthwise separable convolution are as follows:
Depthwise separable convolution:
In the last example, a 1x1 pointwise filter follows the depthwise convolution to combine information across the depth (channel) dimension, separating channel mixing from the spatial filtering.
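To make the parameter savings concrete, here is a small back-of-the-envelope sketch (pure Python, bias terms ignored) comparing a standard convolution with its depthwise separable counterpart:

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: every output channel has its own
    # k x k kernel spanning all input channels.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel,
    # plus a 1x1 pointwise convolution to mix channels.
    return k * k * c_in + c_in * c_out

# A typical Mobilenet-style layer: 3x3 kernel, 64 -> 128 channels.
std = conv_params(3, 64, 128)                  # 73728 parameters
sep = depthwise_separable_params(3, 64, 128)   # 576 + 8192 = 8768 parameters
print(std / sep)                               # roughly 8x fewer parameters
```

This reduction in parameters (and multiply-adds) is what makes Mobilenet_v1 a good fit for on-device inference.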
Conversion of the model
Converting the model to a TensorFlow Lite (TFLite) file is not straightforward. You can see an example of the procedure here, where it is well explained how the problem of converting the specific operators (RFFT and ComplexAbs) used during spectrogram computation was overcome.
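As a rough sketch of that kind of workaround, the TFLite converter can be told to fall back to full TensorFlow ops (Select TF ops) for operators such as RFFT that have no TFLite builtin. The tiny spectrogram model below is only an illustration of the conversion flags, not the actual YAMNet export:

```python
import tensorflow as tf

class SpectrogramModel(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, waveform):
        # tf.signal.stft uses RFFT + ComplexAbs under the hood,
        # which is exactly what causes trouble in plain TFLite.
        stft = tf.signal.stft(waveform, frame_length=400, frame_step=160)
        return tf.abs(stft)

model = SpectrogramModel()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.__call__.get_concrete_function()])
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # regular TFLite builtins
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops such as RFFT
]
tflite_model = converter.convert()
```

Note that shipping Select TF ops enlarges the app binary, which is one reason the linked procedure rewrites the spectrogram with TFLite-compatible ops instead.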
The outputs of the model are:
- Scores, a float32 Tensor of shape (N, 521) containing the per-frame predicted scores for each of the 521 classes in the AudioSet ontology that are supported by YAMNet.
- Embeddings, a float32 Tensor of shape (N, 1024) containing per-frame embeddings, where the embedding vector is the average-pooled output that feeds into the final classifier layer.
- log_mel_spectrogram, a float32 Tensor representing the log mel spectrogram of the entire waveform. These are the audio features passed into the model.
There is a great tool that helps us benchmark the TFLite model. The TensorFlow Lite benchmark tools currently measure and calculate statistics for the following important performance metrics:
- Initialization time
- Inference time of warmup state
- Inference time of steady state
- Memory usage during initialization time
- Overall memory usage
The benchmark tools are available as benchmark apps for Android and iOS and as native command-line binaries, and they all share the same core performance measurement logic.
After benchmarking the TFLite model we get a great result:
In this picture we can see that using the CPU with 2 or 4 threads gives the minimum inference time on both the Pixel 3 and Pixel 4 devices. The tool also reported a GPU error when handling the model. Looking at the logcat we can see:
Error: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors.
These results were of great use for Android application development. We were able to avoid the GPU delegate code path and choose parameters that made the app run faster!
That brings us to the end of our look at the model architecture, conversion to TFLite, and benchmarking. The next steps involve recording sound with the Android phone’s microphone, embedding the TFLite model inside the app, and displaying the results on screen. For detailed information, switch to part 2 of this tutorial.