Using Audio in a Multimodal Prompt on Android for the Gemini API with Vertex AI in Firebase


Written by George Soloupis, ML and Android GDE.

When using the Vertex AI in Firebase SDK to call the Gemini API from your app, you can prompt the Gemini model to generate text from multimodal inputs. These multimodal prompts can combine various types of input, such as text with images, PDFs, video, or audio.
The documentation provides an excellent example of multimodal input with text and video on Android, but it lacks an example for text and audio. In this blog post, we’ll demonstrate how to incorporate audio into a multimodal prompt.

To use Gemini with the Vertex AI in Firebase SDK, you can follow the documentation steps here. The implementation below was tested with the following dependencies, placed inside the app’s build.gradle file:

// Vertex AI for Firebase SDK for Android
implementation("com.google.firebase:firebase-vertexai:16.0.0-beta06")

// Import the BoM for the Firebase platform
implementation(platform("com.google.firebase:firebase-bom:33.4.0"))
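
The snippets later in this post call a generativeModel instance without showing how it is created. Following the documentation’s setup, a minimal sketch could look like the code below; the model name is only illustrative, so pick whichever Gemini model you have enabled.

import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI

// Create the model once and reuse it across prompts.
// The model name here is illustrative; check the documentation for
// the models available through Vertex AI in Firebase.
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")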

First, we need to convert an audio file into a ByteArray. The simplest way to use an audio file is to place it inside the assets folder of the Android project:

.mp3 inside the assets folder.

Then we load the .mp3 file and convert it into a ByteArray:

private fun readAudioFromAssets(fileName: String): ByteArray? {
    return try {
        // use {} closes the stream for us, and readBytes() reads the whole
        // file, which is more reliable than sizing a buffer with available()
        context.assets.open(fileName).use { inputStream ->
            inputStream.readBytes()
        }
    } catch (e: IOException) {
        e.printStackTrace()
        null
    }
}

// Use with
val bytes = readAudioFromAssets("audio_guitar.mp3")
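
If the audio comes from the device instead of the assets folder (for example, a file picked with the system file picker), a similar helper can read it through a content Uri. This is a sketch built on standard Android APIs rather than code from the original post:

import android.net.Uri
import java.io.IOException

// Sketch: read the bytes of a user-picked file via the ContentResolver.
private fun readBytesFromUri(uri: Uri): ByteArray? {
    return try {
        context.contentResolver.openInputStream(uri)?.use { it.readBytes() }
    } catch (e: IOException) {
        e.printStackTrace()
        null
    }
}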

With the ByteArray ready, we can use it in a multimodal prompt with the SDK:

private val _spokenText = MutableStateFlow("")

viewModelScope.launch(Dispatchers.Default) {
    try {
        val bytes = readAudioFromAssets("audio_guitar.mp3")
        // Provide a prompt that includes the audio specified above and text
        val prompt = content {
            bytes?.let { audioBytes -> blob("audio/mp3", audioBytes) }
            text("Tell me what audio instrument you hear")
        }
        generativeModel.generateContentStream(prompt).collect { chunk ->
            _spokenText.value += chunk.text ?: ""
        }
    } catch (e: Exception) {
        _spokenText.value =
            context.getString(R.string.an_error_occurred_please_try_again)
    }
}
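
On the UI side, the streaming text can be exposed as a read-only StateFlow and observed from a composable, so each chunk appears as soon as it is collected. This is a minimal sketch; the AudioPromptViewModel and SpokenTextScreen names are made up for illustration:

import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.runtime.collectAsState
import androidx.compose.runtime.getValue
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow

// In the ViewModel: expose the mutable flow as a read-only StateFlow
val spokenText: StateFlow<String> = _spokenText.asStateFlow()

// In the UI: the Text recomposes as each new chunk is appended
@Composable
fun SpokenTextScreen(viewModel: AudioPromptViewModel) {
    val text by viewModel.spokenText.collectAsState()
    Text(text = text)
}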

Extra information: use the snippet below for a .pdf file:

viewModelScope.launch(Dispatchers.Default) {
    try {
        // The same helper works for any file placed in the assets folder
        val bytes = readAudioFromAssets("cv.pdf")
        val prompt = content {
            bytes?.let { pdfBytes -> blob("application/pdf", pdfBytes) }
            text("What programming languages does George know?")
        }
        generativeModel.generateContentStream(prompt).collect { chunk ->
            _spokenText.value += chunk.text ?: ""
        }
    } catch (e: Exception) {
        _spokenText.value =
            context.getString(R.string.an_error_occurred_please_try_again)
    }
}

Check more MIME types that Gemini supports here.
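
If your app handles more than one file type, you can derive the MIME type string passed to blob() from the file extension. A small sketch, with a few of the types listed in the documentation (check the link above for the full list):

// Sketch: map common extensions to the MIME type string used by blob().
private fun mimeTypeFor(fileName: String): String =
    when (fileName.substringAfterLast('.').lowercase()) {
        "mp3" -> "audio/mp3"
        "wav" -> "audio/wav"
        "pdf" -> "application/pdf"
        "jpg", "jpeg" -> "image/jpeg"
        "png" -> "image/png"
        "mp4" -> "video/mp4"
        else -> "application/octet-stream"
    }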

And that’s it! The Gemini model delivers its response in chunks, providing faster output to users. Alternatively, you can use the API without streaming:

// generateContent() is a suspend function, so call it from a coroutine
val response = generativeModel.generateContent(prompt)
Log.d(TAG, response.text ?: "")

Conclusion
In this guide, we demonstrated how to incorporate audio into multimodal prompts when calling the Gemini API with the Vertex AI in Firebase SDK for Android. While the current documentation covers text and video examples, this post provides a step-by-step approach for adding audio input. The implementation included instructions for setting up dependencies in `build.gradle`, loading an audio file as a `ByteArray` from the assets folder, and using it in a multimodal prompt with the Gemini API. The response can be returned in chunks, allowing for faster, real-time output.


George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I’m a senior Android developer at Invisalign, and an ML & Android GDE.