Whisper TFLite model inside an Android application

Dec 29, 2023

Written by George Soloupis, ML and Android GDE.

This blog post provides a concise overview of integrating the Whisper TensorFlow Lite model into an Android application. It is a follow-up to this blog post, where we showcased the process of converting the Whisper “tiny” English model into the appropriate format. Since the code base is fairly large, we will highlight only the most important parts of the code.

This task does not align with the specific categories covered by the remarkable MediaPipe or Task libraries, which is why we had to employ TensorFlow Lite’s Interpreter API directly. While this necessitates slightly more code, its versatility and ability to address a wide range of tasks make it a worthwhile choice.

The first step is to place the model inside the project’s assets folder:

Assets folder contents.

Then we need to load the model into memory by memory-mapping it into a ByteBuffer:

private fun loadModel(modelPath: String?) {
    // Memory-map the model file directly from the assets folder.
    val fileDescriptor = context.assets.openFd(modelPath!!)
    val inputStream = FileInputStream(fileDescriptor.fileDescriptor)
    val fileChannel = inputStream.channel
    val startOffset = fileDescriptor.startOffset
    val declaredLength = fileDescriptor.declaredLength
    val retFile = fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength)
    fileDescriptor.close()

    // Create the interpreter, using all available cores for inference.
    val tfliteOptions = Interpreter.Options()
    tfliteOptions.setNumThreads(Runtime.getRuntime().availableProcessors())
    tensorflowInterpreter = Interpreter(retFile, tfliteOptions)
}
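Loading is then a one-liner at startup; the asset filename here is illustrative, not necessarily the one used in the project:

loadModel("whisper-tiny-en.tflite") // illustrative asset name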

The application needs to record sound from the microphone. The model expects data from a .wav or a .flac voice file (that is how the Whisper model was developed), and inside the app this is done using the Recorder class, which saves the file into the application’s file system.
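Below is a minimal sketch of such a recorder, assuming Whisper’s expected 16 kHz mono 16-bit PCM input; the WavRecorder class and its layout are illustrative, not the project’s actual Recorder class:

import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import java.io.File
import java.io.RandomAccessFile
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Illustrative recorder: captures 16 kHz mono 16-bit PCM and wraps it in a WAV header.
class WavRecorder(private val sampleRate: Int = 16000) {

    @SuppressLint("MissingPermission") // RECORD_AUDIO must already be granted by the caller
    fun record(outputFile: File, seconds: Int) {
        val minBuf = AudioRecord.getMinBufferSize(
            sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
        )
        val recorder = AudioRecord(
            MediaRecorder.AudioSource.MIC, sampleRate,
            AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, minBuf
        )
        val pcm = ByteArray(sampleRate * 2 * seconds) // 2 bytes per 16-bit sample
        recorder.startRecording()
        var offset = 0
        while (offset < pcm.size) {
            val read = recorder.read(pcm, offset, pcm.size - offset)
            if (read <= 0) break
            offset += read
        }
        recorder.stop()
        recorder.release()
        writeWav(outputFile, pcm, offset)
    }

    // Prepends the canonical 44-byte WAV header to the raw PCM data.
    private fun writeWav(file: File, pcm: ByteArray, dataLen: Int) {
        val header = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN)
        header.put("RIFF".toByteArray()).putInt(36 + dataLen).put("WAVE".toByteArray())
        header.put("fmt ".toByteArray()).putInt(16)
        header.putShort(1.toShort()) // audio format: PCM
        header.putShort(1.toShort()) // channels: mono
        header.putInt(sampleRate)
        header.putInt(sampleRate * 2) // byte rate
        header.putShort(2.toShort()) // block align
        header.putShort(16.toShort()) // bits per sample
        header.put("data".toByteArray()).putInt(dataLen)
        RandomAccessFile(file, "rw").use {
            it.write(header.array())
            it.write(pcm, 0, dataLen)
        }
    }
}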

Based on the .tflite file we created in the previous blog post, the model expects an input tensor of specific dimensions: its shape must be [batch_size, sequence_length, num_features]. In other words, the input tensor has three dimensions: a batch dimension, a sequence dimension, and a feature dimension.

Model’s inputs and outputs.
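To double-check an input against the converted file, you can query the shapes straight from the interpreter at runtime. A small sketch follows; the [1, 80, 3000] shape mentioned in the comment is what the Whisper “tiny” conversion typically produces, so treat it as an assumption:

// Query the converted model's tensor shapes at runtime instead of hardcoding them.
val inputTensor = tensorflowInterpreter!!.getInputTensor(0)
val outputTensor = tensorflowInterpreter!!.getOutputTensor(0)

// For the Whisper "tiny" conversion this is typically [1, 80, 3000]:
// batch size 1, 80 mel bins, 3000 spectrogram frames (30 seconds of audio).
Log.d(TAG, "input shape:  ${inputTensor.shape().contentToString()}")
Log.d(TAG, "input type:   ${inputTensor.dataType()}")
Log.d(TAG, "output shape: ${outputTensor.shape().contentToString()}")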

The conversion of the .wav file to a tensor of floats is done by computing the log-mel spectrogram of the audio. Inside the project you can find a Kotlin equivalent of this procedure in the WhisperUtil.kt file. We found that, for a specific input file, this calculation took more than 4 seconds, far more than the model’s inference itself. That is why we switched to a C++ equivalent, which dropped the time to under 1 second. You can find that implementation in the whisper.h C++ file.
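A common way to expose such a native routine to Kotlin is a thin JNI wrapper. The sketch below is purely illustrative; the library and function names are placeholders, not the actual bindings around whisper.h:

// Illustrative JNI bridge to a native log-mel implementation (all names are placeholders).
object MelSpectrogram {
    init {
        System.loadLibrary("whisper_mel") // hypothetical .so built from the C++ sources
    }

    // Takes raw 16 kHz mono PCM samples as floats in [-1, 1] and returns the
    // flattened log-mel spectrogram, ready to be copied into the input tensor.
    external fun logMel(samples: FloatArray): FloatArray
}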

Having the model loaded and the inputs in place, we run the inference with a single line of code:

tensorflowInterpreter!!.run(inputBuffer.buffer, outputBuffer.buffer)

The output is an array of integers that represent tokens in a dictionary. Imagine the procedure as iterating through the array elements and looking up the corresponding word for each one. The assets folder contains a .bin vocabulary file that is loaded when the application starts. At the beginning and at the end of the array there are special tokens that mark where the sentence starts and ends. You can see the whole inference procedure in the WhisperEngine.kt file.

private fun runInference(inputData: FloatArray): String {
    // Create input tensor
    val inputTensor = tensorflowInterpreter!!.getInputTensor(0)
    val inputBuffer = TensorBuffer.createFixedSize(inputTensor.shape(), inputTensor.dataType())
    Log.d(TAG, "Input Tensor Dump ===>")
    printTensorDump(inputTensor)

    // Create output tensor
    val outputTensor = tensorflowInterpreter!!.getOutputTensor(0)
    val outputBuffer = TensorBuffer.createFixedSize(outputTensor.shape(), DataType.FLOAT32)
    Log.d(TAG, "Output Tensor Dump ===>")
    printTensorDump(outputTensor)

    // Copy the float input into a direct ByteBuffer in native byte order
    val inputSize =
        inputTensor.shape()[0] * inputTensor.shape()[1] * inputTensor.shape()[2] * java.lang.Float.BYTES
    val inputBuf = ByteBuffer.allocateDirect(inputSize)
    inputBuf.order(ByteOrder.nativeOrder())
    for (input in inputData) {
        inputBuf.putFloat(input)
    }
    inputBuffer.loadBuffer(inputBuf)

    // Run inference
    tensorflowInterpreter!!.run(inputBuffer.buffer, outputBuffer.buffer)

    // Decode the output tokens one by one until the end-of-transcript token
    val outputLen = outputBuffer.intArray.size
    Log.d(TAG, "output_len: $outputLen")
    val result = StringBuilder()
    for (i in 0 until outputLen) {
        val token = outputBuffer.buffer.getInt()
        if (token == mWhisperUtil.tokenEOT) break

        // Append the word for regular tokens; skip special tokens
        if (token < mWhisperUtil.tokenEOT) {
            val word = mWhisperUtil.getWordFromToken(token)
            Log.d(TAG, "Adding token: $token, word: $word")
            result.append(word)
        } else {
            if (token == mWhisperUtil.tokenTranscribe) Log.d(TAG, "It is Transcription...")
            val word = mWhisperUtil.getWordFromToken(token)
            Log.d(TAG, "Skipping token: $token, word: $word")
        }
    }
    return result.toString()
}
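Putting the pieces together, an end-to-end call could look like the sketch below. decodeWaveFile() and MelSpectrogram.logMel() are hypothetical stand-ins for the project’s WAV-decoding and native mel routines, and measureTimeMillis is simply a convenient way to confirm the timings discussed next:

import kotlin.system.measureTimeMillis

// Hypothetical end-to-end transcription; decodeWaveFile() is a placeholder
// for the step that reads the recorded .wav into a FloatArray of samples.
fun transcribe(wavPath: String): String {
    lateinit var mel: FloatArray
    val melMs = measureTimeMillis {
        mel = MelSpectrogram.logMel(decodeWaveFile(wavPath))
    }
    lateinit var text: String
    val inferMs = measureTimeMillis {
        text = runInference(mel)
    }
    Log.d(TAG, "mel: $melMs ms, inference: $inferMs ms")
    return text
}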

For short audio clips of 1 to 4 seconds, the inference completes in under 2 seconds. Combined with the time required for the mel spectrogram calculation, the overall processing time is approximately 3 seconds, making it an exceptionally efficient offline model for speech recognition.

This implementation was introduced as a Proof of Concept of a small assistant inside an Android phone, which was first implemented with the PaLM SDK for Android. In this project you can also find custom equivalents that use ChatGPT API calls. The offline speech-to-text lives in this specific branch, where inside the ViewModel you can see the online and offline equivalents for using the Whisper model. The online API call uses code that is not yet covered by OpenAI’s official documentation for Java/Kotlin.

The above work has been accelerated by the awesome work done in the repositories below:

  1. https://github.com/usefulsensors/openai-whisper/tree/main
  2. https://github.com/vilassn/whisper_android

Conclusion

This blog post offers a concise walkthrough of integrating the Whisper TensorFlow Lite model into an Android app. Following a prior post on converting the Whisper “tiny” English model, it highlights the crucial sections of the code for clarity. Using TensorFlow Lite’s Interpreter API, the model is placed in the project’s assets folder and memory-mapped at startup. The conversion of .wav files to tensors of floats involves log-mel spectrogram calculations, optimized with a C++ equivalent for efficiency. Inference, producing an array of integer tokens that map to words, completes in around 3 seconds for short audio clips. The implementation, serving as a Proof of Concept for an Android assistant, demonstrates offline speech-to-text with privacy protection, enhanced reliability, reduced latency, and minimal cost.


George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I am a senior Android developer at Invisalign and an ML & Android GDE.