Using MediaPipe for an audio classification task

George Soloupis
4 min read · Nov 12, 2023

Written by George Soloupis, ML and Android GDE.

In this blog post, we delve into the integration of MediaPipe’s audio library within an Android project, showcasing classification with a .tflite file generated by TensorFlow Lite Model Maker. We break down the coding process step by step and highlight how this high-level, low-code API streamlines the deployment of machine learning solutions on embedded devices, keeping the whole tutorial user-friendly.

The MediaPipe Audio Classifier function enables the categorization of audio snippets into predefined classes, such as guitar melodies, train whistles, or avian vocalizations. These categories are established during the model training phase. This functionality processes audio data using a machine learning (ML) model, treating independent audio clips or a continuous stream, ultimately producing a ranked list of potential categories based on descending probability scores.

The inputs for this library’s task can be audio clips or an audio stream; in this example we focus on the stream case. The output is a list of Category objects containing the following fields (a short reading sketch follows the list):

  1. Index: Denotes the position of the category within the model outputs.
  2. Score: Represents the confidence level assigned to this category, typically expressed as a probability within the [0,1] range.
  3. Category name (optional): Identifies the category by name as outlined in the TFLite Model Metadata (when provided).
  4. Category display name (optional): Offers a designated display name for the category, as specified in the TFLite Model Metadata. This can be in the language specified through display names locale options, if applicable.
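
To make that output structure concrete, here is a minimal sketch of reading these fields in Kotlin. It assumes the same result types used later in this post; the function name and log tag are illustrative, not part of the original project.

import android.util.Log
import com.google.mediapipe.tasks.audio.audioclassifier.AudioClassifierResult

// Illustrative helper: logs every category of the first classification head.
fun logCategories(result: AudioClassifierResult) {
    result.classificationResults().first()
        .classifications().first()
        .categories().forEach { category ->
            Log.d(
                "AudioCategories",
                "index=${category.index()} score=${category.score()} " +
                    "name=${category.categoryName()} displayName=${category.displayName()}"
            )
        }
}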

There are also numerous configuration options that can be found here.

Setting up the project

Include the dependency below in the app’s build.gradle file:

implementation("com.google.mediapipe:tasks-audio:0.20230731")

You can find the releases in the Maven repository, or use the latest.release tag in the dependency:

implementation("com.google.mediapipe:tasks-audio:latest.release")

The Audio Classifier requires an audio classification model to be stored in your project directory. You can put it in the assets folder:

.tflite file inside the assets folder
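
Depending on your build setup, you may also want to keep the .tflite asset uncompressed so it can be memory-mapped at load time. A minimal sketch for the app’s build.gradle is shown below; Kotlin DSL is assumed here, and the exact noCompress syntax varies with the Android Gradle Plugin version.

android {
    // Keep .tflite models uncompressed in the APK so they can be memory-mapped.
    androidResources {
        noCompress += "tflite"
    }
}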

Create the Audio Task

AudioClassifier can be initialized for an audio stream as shown below (the model asset path is a placeholder for the .tflite file you placed in the assets folder):

private var audioClassifier: AudioClassifier? = null

// Point the base options to the .tflite model stored in the assets folder
// (the file name below is a placeholder).
val baseOptions = BaseOptions.builder()
    .setModelAssetPath("model.tflite")
    .build()

val optionsBuilder = AudioClassifier.AudioClassifierOptions.builder()
    .setScoreThreshold(classificationThreshold)
    .setMaxResults(numOfResults)
    .setBaseOptions(baseOptions)
    .setRunningMode(runningMode)
// For RunningMode.AUDIO_STREAM, a result listener (and optionally an error
// listener) also has to be registered on the builder.

val options = optionsBuilder.build()
audioClassifier = AudioClassifier.createFromOptions(context, options)

You can check more configuration options in the documentation here.

Preparation of the data and execution

The Audio Classifier seamlessly processes both individual audio clips and continuous audio streams. The task handles the input data preprocessing, encompassing crucial steps such as resampling, buffering, and framing. Nevertheless, you need to convert the input audio data into a com.google.mediapipe.tasks.components.containers.AudioData object before submitting it to the Audio Classifier task.

private fun classifyAudioAsync(audioRecord: AudioRecord) {
    // Wrap the recorder's format and one second of samples in a MediaPipe
    // AudioData container before handing it to the classifier.
    val audioData = AudioData.create(
        AudioData.AudioDataFormat.create(recorder?.format), /* sampleCounts= */ SAMPLING_RATE_IN_HZ
    )
    audioData.load(audioRecord)

    // classifyAsync needs a timestamp (ms) to keep track of the stream position.
    val inferenceTime = SystemClock.uptimeMillis()
    audioClassifier?.classifyAsync(audioData, inferenceTime)
}

Note above that, to run the task in stream mode, you need to provide the Audio Classifier with a timestamp so it can track which audio data within the stream was used for the inference.
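
For context, here is a minimal sketch of how such a stream could be driven. It assumes an AudioRecord recording float PCM at 16 kHz (the rate YAMNet-based models expect) and reuses a helper like classifyAudioAsync from the snippet above; the function name, buffer size, and 500 ms interval are illustrative choices, not values taken from the original project.

import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import java.util.concurrent.ScheduledThreadPoolExecutor
import java.util.concurrent.TimeUnit

const val SAMPLING_RATE_IN_HZ = 16000

// Illustrative: creates a float-PCM AudioRecord and periodically hands it to the
// classifier, e.g. by passing classifyAudioAsync as the classify lambda.
@SuppressLint("MissingPermission") // RECORD_AUDIO must already be granted.
fun startStreamClassification(classify: (AudioRecord) -> Unit): ScheduledThreadPoolExecutor {
    val bufferSize = AudioRecord.getMinBufferSize(
        SAMPLING_RATE_IN_HZ,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_FLOAT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.DEFAULT,
        SAMPLING_RATE_IN_HZ,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_FLOAT,
        bufferSize
    )
    recorder.startRecording()

    // Run a classification pass every 500 ms; remember to stop/release the
    // recorder and shut down the executor when classification ends.
    val executor = ScheduledThreadPoolExecutor(1)
    executor.scheduleWithFixedDelay({ classify(recorder) }, 0L, 500L, TimeUnit.MILLISECONDS)
    return executor
}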

Results

Using a listener, you can receive the results inside a ViewModel, for example:

private val audioClassificationListener = object : AudioClassifierHelper.AudioClassifierListener {
    override fun onResult(resultBundle: AudioClassifierHelper.ResultBundle) {
        if (resultBundle.results.isNotEmpty()) {
            // Head at index 1 is the custom head (background noise vs. gunshot).
            resultBundle.results[0].classificationResults().first()
                .classifications()?.get(1)?.categories()?.let {
                    try {
                        // Category index 1 corresponds to the gunshot class.
                        if (it[0].index() == 1) {
                            gunshotNumber++
                        }
                        uiState = uiState.copy(gunshotNumber = gunshotNumber.toString())
                    } catch (e: Exception) {
                        Log.e("Error", e.toString())
                    }
                }
        }
    }

    override fun onError(error: String) {
        Log.v(TAG, error)
    }
}

Remember from the associated blog post that this model has multiple outputs:

Model with 2 outputs.

The two distinct outputs are:
1. One from the Yamnet model, providing probabilities for 521 audio classes
2. The other from our custom dataset, offering probabilities for the two specific classes — background noises and gunshots.

This differentiation is crucial because testing environments are multifaceted and diverse, extending beyond simple scenarios like gunshot sounds. By utilizing Yamnet’s output, we can effectively filter out irrelevant audio data. For instance, in a gunshot-sounds use case, if Yamnet doesn’t classify certain sounds as gunshots, it indicates that the output from our model might be an inaccurate or irrelevant classification for those instances. This interplay between the broader Yamnet model and our custom dataset output ensures a more nuanced and accurate analysis of complex audio environments, enhancing the overall reliability of our classification system.
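
As a sketch of that filtering idea, one could require that both heads agree before counting a detection, following the head ordering used in the listener above. The helper name, the 0.3 score cutoff, and the substring check for gun-related Yamnet labels are assumptions made for illustration, not logic from the original project.

import com.google.mediapipe.tasks.audio.audioclassifier.AudioClassifierResult

// Illustrative filter: count a gunshot only when the custom head (index 1)
// reports it AND the Yamnet head (index 0) also returns a gun-related label.
fun isLikelyGunshot(result: AudioClassifierResult): Boolean {
    val heads = result.classificationResults().first().classifications()

    // Custom head: two classes, with index 1 assumed to be the gunshot class.
    val customTop = heads[1].categories().firstOrNull() ?: return false
    val customSaysGunshot = customTop.index() == 1

    // Yamnet head: 521 classes; check the returned top categories for a gun-related one.
    val yamnetAgrees = heads[0].categories().any { category ->
        category.categoryName().contains("gun", ignoreCase = true) && category.score() > 0.3f
    }

    return customSaysGunshot && yamnetAgrees
}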

You can download the .tflite file from this link and build the whole Android project to explore the code.

Conclusion

The blog post explored the integration of MediaPipe’s audio library in an Android project for audio classification. Utilizing a .tflite file generated by TensorFlow Lite Model Maker, the post detailed the coding process, emphasizing the user-friendly, high-level, low-code API offered by MediaPipe.

