Estimating musical scores (pitch) on Android with TensorFlow’s SPICE model

George Soloupis
5 min read · Aug 22, 2020


Written by George Soloupis and reviewed by Khanh LeViet, Sayak Paul and Luis Gustavo Martins.

The objectives of this tutorial are to:

  • Understand what the pitch attribute is and how machines have historically detected it in songs
  • Visualize a song’s data and the results of model execution
  • Provide information about the techniques used to collect sound with the phone’s microphone
  • Deploy the ML model inside an Android application
  • Transform the song’s data and run inference with the SPICE model
  • Render the results on the Android phone’s screen

Pitch is a perceptual property of sounds that allows their ordering on a frequency-related scale, or more commonly, pitch is the quality that makes it possible to judge sounds as “higher” and “lower” in the sense associated with musical melodies. Pitch is a major auditory attribute of musical tones, along with duration, loudness, and timbre. It is quantified by frequency and measured in Hertz (Hz), where one Hz corresponds to one cycle per second.

Pitch detection is an interesting challenge. Historically, the study of pitch and pitch perception has been a central problem in psychoacoustics, and has been instrumental in forming and testing theories of sound representation, signal-processing algorithms, and perception in the auditory system. A lot of techniques have been used for this purpose. Efforts have also been made to separate the relevant frequency from background noise and backing instruments.

Today, we can do that with machine learning, more specifically with the SPICE model (SPICE: Self-Supervised Pitch Estimation). This is a pretrained model that can recognize the fundamental pitch from mixed audio recordings (including noise and backing instruments). The model is available through TensorFlow Hub, on the web with TensorFlow.js, and on mobile devices with TensorFlow Lite.

Audio is recorded in .wav format with one audio channel (mono) at a 16 kHz sampling rate. Let’s use a simple audio file with that format. If we load it, use a logarithmic frequency scale (to make the singing more clearly visible) and visualize the output, we get a spectrogram which shows frequencies over time:

After running the model on the song’s data, we plot its outputs. The blue curve shows the pitch values the model predicts and the orange one shows the confidence of these predictions:

If we keep the results with confidence over 90% and overlay them on the spectrogram in grayscale, we get:

Great accuracy across the whole length of the song!

Note that for this particular example, a spectrogram-based heuristic for extracting pitch may have worked as well. In general, ML-based models perform better than hand-crafted signal processing methods, especially when background noise and backing instruments are present in the audio. For a comparison of SPICE with a spectrogram-based algorithm (SWIPE), see here.

Inside the Android application we have to collect the sound from the microphone. First, we set up the variables:
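Below is a minimal sketch of how these variables might look, assuming the standard Android AudioRecord API; the names (SAMPLE_RATE, CHANNEL_CONFIG, AUDIO_FORMAT, minBufferSize, isRecording) are illustrative and not necessarily those used in the project.

import android.media.AudioFormat
import android.media.AudioRecord

// Recording parameters matching what SPICE expects: 16 kHz, mono, 16-bit PCM.
private const val SAMPLE_RATE = 16000
private const val CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO
private const val AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT

// Minimum buffer size the device supports for this configuration.
private val minBufferSize = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL_CONFIG, AUDIO_FORMAT)

private var audioRecord: AudioRecord? = null
private var isRecording = false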

Then we select MediaRecorder.AudioSource.VOICE_RECOGNITION to tune the microphone source for voice recognition and apply noise cancellation. The audio format is the desired one: 16-bit PCM, mono channel and a 16 kHz sample rate. Finally, we start the recording process:
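A hedged sketch of what that setup can look like with AudioRecord, reusing the variables from the previous snippet (the function name startRecording is illustrative; recording requires the RECORD_AUDIO permission):

import android.media.AudioRecord
import android.media.MediaRecorder

fun startRecording() {
    audioRecord = AudioRecord(
        MediaRecorder.AudioSource.VOICE_RECOGNITION, // microphone source tuned for voice, with noise suppression
        SAMPLE_RATE,                                 // 16000 Hz
        CHANNEL_CONFIG,                              // mono
        AUDIO_FORMAT,                                // 16-bit PCM
        minBufferSize
    )
    audioRecord?.startRecording()
    isRecording = true
}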

After stopping the recording process with mRecorder.stopRecording(), we read the audio from the audio recorder stream:
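A simplified sketch of that reading step, assuming the remaining samples are drained from the AudioRecord buffer into an ArrayList<Short> (the project’s actual reader lives in its recorder class):

fun readAudio(): ArrayList<Short> {
    val audioSamples = ArrayList<Short>()
    val buffer = ShortArray(minBufferSize)
    // Drain whatever the recorder captured into a list of samples.
    var read = audioRecord?.read(buffer, 0, buffer.size) ?: 0
    while (read > 0) {
        for (i in 0 until read) {
            // The gain multiplication discussed right below; clamp to the 16-bit range to avoid overflow.
            val amplified = (buffer[i] * 6.7)
                .coerceIn(Short.MIN_VALUE.toDouble(), Short.MAX_VALUE.toDouble())
            audioSamples.add(amplified.toInt().toShort())
        }
        read = audioRecord?.read(buffer, 0, buffer.size) ?: 0
    }
    return audioSamples
}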

Pay attention to the multiplication buffer[i] * 6.7. It is used to control the microphone gain and give higher sensitivity (you can use a different value instead of 6.7 to suit your needs)!

Inside this class there is also a function that transforms a byte array into a .wav file. This file is stored in a Pitch Detector folder in the phone’s internal storage and can be used to verify that the mobile model’s output agrees with the original Colab notebook.
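A rough sketch of such a conversion, wrapping raw 16-bit mono PCM bytes in a standard 44-byte RIFF/WAV header (the function name, default sample rate and output location are assumptions for illustration):

import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun writeWavFile(pcm: ByteArray, outFile: File, sampleRate: Int = 16000) {
    val channels = 1
    val bitsPerSample = 16
    val byteRate = sampleRate * channels * bitsPerSample / 8
    val header = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN).apply {
        put("RIFF".toByteArray())
        putInt(36 + pcm.size)                               // overall chunk size
        put("WAVE".toByteArray())
        put("fmt ".toByteArray())
        putInt(16)                                          // fmt sub-chunk size (PCM)
        putShort(1.toShort())                               // audio format: PCM
        putShort(channels.toShort())
        putInt(sampleRate)
        putInt(byteRate)
        putShort((channels * bitsPerSample / 8).toShort())  // block align
        putShort(bitsPerSample.toShort())
        put("data".toByteArray())
        putInt(pcm.size)                                    // data sub-chunk size
    }
    outFile.outputStream().use { out ->
        out.write(header.array())
        out.write(pcm)
    }
}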

Deploying the SPICE model starts with copying the .tflite file into the assets folder. Then we include these dependencies in the app’s build.gradle file:

implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly'
implementation 'org.tensorflow:tensorflow-lite-gpu:0.0.0-nightly'
implementation 'org.tensorflow:tensorflow-lite-select-tf-ops:0.0.0-nightly'

The last dependency is for the Select TF ops. It is mandatory for this project, but it adds significant size to the final .apk because the model uses some operations that are not yet available in the base TensorFlow Lite dependency. You can find more info here.

We initialize the interpreter by loading the model file from the assets folder:
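A minimal sketch of that initialization, memory-mapping the model from assets (the file name "spice.tflite", the helper names and the thread count are assumptions):

import android.content.Context
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Memory-map the .tflite file stored in the assets folder.
private fun loadModelFile(context: Context, modelName: String = "spice.tflite"): MappedByteBuffer {
    val fileDescriptor = context.assets.openFd(modelName)
    FileInputStream(fileDescriptor.fileDescriptor).channel.use { channel ->
        return channel.map(
            FileChannel.MapMode.READ_ONLY,
            fileDescriptor.startOffset,
            fileDescriptor.declaredLength
        )
    }
}

private fun buildInterpreter(context: Context): Interpreter =
    Interpreter(loadModelFile(context), Interpreter.Options().setNumThreads(4))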

And we are ready to use an audio stream to make inferences!

The audio stream comes as an ArrayList<Short>. To use it as input for our model, we have to convert it into float values normalized to the range -1 to 1. To achieve this, we divide every value by MAX_ABS_INT16 = 32768:
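A one-liner sketch of that normalization (the function name is illustrative):

private const val MAX_ABS_INT16 = 32768f

// Convert the recorded 16-bit samples to floats in [-1, 1].
fun normalize(samples: ArrayList<Short>): FloatArray =
    FloatArray(samples.size) { i -> samples[i] / MAX_ABS_INT16 }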

Then we execute the inference with the interpreter:
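A hedged sketch of the inference call. SPICE takes one float input (the audio samples) and produces two float outputs, pitch and uncertainty, one value per analysis frame; the frame count below assumes the model’s documented hop of 512 samples (32 ms at 16 kHz), and the output ordering (pitch first, uncertainty second) is also an assumption. Confidence is simply 1 minus the uncertainty.

import org.tensorflow.lite.Interpreter

fun runSpice(interpreter: Interpreter, input: FloatArray): Pair<FloatArray, FloatArray> {
    // Resize the input tensor in case the audio length differs from the model's default.
    interpreter.resizeInput(0, intArrayOf(input.size))

    val numFrames = input.size / 512 + 1   // assumed: one prediction every 512 samples
    val pitches = FloatArray(numFrames)
    val uncertainties = FloatArray(numFrames)
    val outputs = mapOf<Int, Any>(0 to pitches, 1 to uncertainties)

    interpreter.runForMultipleInputsOutputs(arrayOf<Any>(input), outputs)
    return pitches to uncertainties
}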

When we get the results, we:

  • Keep outputs that have confidence over 90%
  • Convert the absolute pitch values to Hz (see the sketch after this list)
  • Calculate the offset during singing
  • Use some heuristics to estimate the most likely sequence of notes that were sung
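The confidence filter and the pitch-to-Hz conversion can be sketched like this, using the calibration constants (PT_OFFSET, PT_SLOPE, FMIN, BINS_PER_OCTAVE) documented in the SPICE tutorial on TensorFlow Hub:

import kotlin.math.pow

// Constants from the SPICE documentation: the model output maps linearly to a constant-Q transform bin.
private const val PT_OFFSET = 25.58
private const val PT_SLOPE = 63.07
private const val FMIN = 10.0
private const val BINS_PER_OCTAVE = 12.0

fun output2hz(pitchOutput: Float): Double {
    val cqtBin = pitchOutput * PT_SLOPE + PT_OFFSET
    return FMIN * 2.0.pow(cqtBin / BINS_PER_OCTAVE)
}

// Keep only the frames with confidence (1 - uncertainty) of at least 90%.
fun confidentPitchesHz(pitches: FloatArray, uncertainties: FloatArray): List<Double> =
    pitches.indices
        .filter { 1f - uncertainties[it] >= 0.9f }
        .map { output2hz(pitches[it]) }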

The ideal offset computed above is one ingredient, but we also need to know the speed (how many predictions make up, say, an eighth note?) and the time offset at which to start quantizing. To keep it simple, we just try different speeds and time offsets, measure the quantization error, and in the end use the values that minimize this error. You can follow along in the PitchModelExecutor.kt file.
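A heavily simplified sketch of that search, purely for illustration (the real logic in PitchModelExecutor.kt differs): group the per-frame note offsets into eighth-note buckets for each candidate speed and start offset, measure how far each group is from the nearest semitone, and keep the combination with the smallest total error.

import kotlin.math.abs
import kotlin.math.roundToInt

// Returns the (predictionsPerEighth, startOffset) pair with the lowest quantization error.
fun bestQuantization(noteOffsets: List<Double>): Pair<Int, Int> {
    var best = Pair(1, 0)
    var bestError = Double.MAX_VALUE
    for (predictionsPerEighth in 1..5) {
        for (startOffset in 0 until predictionsPerEighth) {
            val error = noteOffsets.drop(startOffset)
                .chunked(predictionsPerEighth)
                .sumOf { group ->
                    val nearestSemitone = group.average().roundToInt()
                    group.sumOf { abs(it - nearestSemitone) }   // distance of each prediction to that semitone
                }
            if (error < bestError) {
                bestError = error
                best = Pair(predictionsPerEighth, startOffset)
            }
        }
    }
    return best
}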

With the above procedure we get an ArrayList of note strings, for example [A2, F2, G#2, C3], which is displayed on screen.

The TensorFlow Hub example of the SPICE model has a great visualization tool that displays the notes on a static musical staff (pentagram), so it was time to bring this “live” effect into the mobile application. An Android WebView is used to render some custom HTML code, which we load inside a binding adapter:
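A minimal sketch of such a binding adapter, assuming the HTML is passed as a string (the attribute name htmlContent is illustrative):

import android.webkit.WebView
import androidx.databinding.BindingAdapter

@BindingAdapter("htmlContent")
fun bindHtmlContent(webView: WebView, html: String?) {
    if (html.isNullOrEmpty()) return
    webView.settings.javaScriptEnabled = true   // needed for the myMove() calls shown later
    webView.loadDataWithBaseURL(null, html, "text/html", "UTF-8", null)
}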

You can view the HTML code that the WebView renders in this GitHub gist!

When notes such as [A2, F2] are displayed on screen, we execute:
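A hedged sketch of that call, pushing each note into the WebView through the gist’s myMove JavaScript function (the note-to-offset values below are placeholders):

import android.webkit.WebView

// Hypothetical lookup from a note name (e.g. "A2") to its vertical position on the staff.
fun noteToOffset(note: String): Int = when (note) {
    "A2" -> 120
    "F2" -> 140
    else -> 100   // placeholder values; the real offsets depend on the gist's layout
}

fun showNotes(webView: WebView, notes: List<String>) {
    notes.forEach { note ->
        webView.evaluateJavascript("myMove(${noteToOffset(note)});", null)
    }
}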

Here we observe note changes every two seconds, and for every note inside the list we execute the JavaScript functions. The values passed to the myMove functions are the vertical offsets for the notes.

You can see the application in use below:

Project available here:

https://github.com/farmaker47/Pitch_Estimator

This project is written in Kotlin and uses:

  1. A WebView with custom HTML loading
  2. TensorFlow Lite with .tflite models
  3. Data binding
  4. MVVM with coroutines
  5. Koin for dependency injection

Future scope for improvement:

  • In the app’s build.gradle file we added the Select TF ops dependency:
implementation 'org.tensorflow:tensorflow-lite-select-tf-ops:0.0.0-nightly'

This dependency adds a significant amount of size to the final .apk file. By selecting only the ops that are needed for the model, we aim to reduce the final .apk size.

  • With improvements to the algorithm, we will be able to display whole notes, half notes, rests and other musical symbols.

This brings us to the end of the tutorial. We hope you have enjoyed reading it and will apply what you learned to your real-world applications with TensorFlow Lite. Visit TensorFlow Hub for a vast variety of model files, and see the project repository above for more information and contributions.

Thanks to Sayak Paul, Le Viet Gia Khanh and Luis Gustavo Martins.
