Use Gemma-it version 1.1 for a QnA task

George Soloupis
6 min read · Apr 11, 2024

Written by George Soloupis, ML and Android GDE.

This blog post tackles the challenge of deploying Gemma 1.1, a latest-generation instruction-tuned language model, for offline question answering within an Android application. Question answering is a particularly demanding task for large language models (LLMs), and this post demonstrates how to set up, deploy, and achieve results with Gemma 1.1 using clear code snippets. We’ll showcase the ease of integrating this powerful model for offline use.

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is named after the Latin gemma, meaning “precious stone.” The Gemma model weights are supported by developer tools that promote innovation, collaboration, and the responsible use of artificial intelligence (AI). The variant of the model that we are showcasing here is:

  • Instruction tuned — These versions of the model are trained with human language interactions and can respond to conversational input, similar to a chat bot.

Recently, Gemma-it received a significant update to version 1.1, which makes it capable of handling more demanding tasks like QnA. Below we show how you can use it inside an Android application.

One library that can directly use the Gemma models (in a specific format) is MediaPipe. Inside the documentation for LLM inference you can find a wealth of information about converting specific LLMs and using them inside Android, Web and iOS apps.

Gemma comes in a number of formats, such as Keras, PyTorch, Transformers, C++, TensorRT, TensorFlow Lite and others. For our QnA task we can directly download the TensorFlow Lite format from the Kaggle website, but it is worth mentioning how you can convert the Transformers model, which ships in the .safetensors format, into a .bin file that can be used on a mobile device. If you check the link above you will see that we are using the int8 variant, which is more appropriate than the int4 model for our demanding task.

You can use Colaboratory for the conversion procedure, which consists of the steps below:

  1. Import the libraries
import ipywidgets as widgets
from IPython.display import display

install_out = widgets.Output()
display(install_out)

with install_out:
    !pip install mediapipe
    !pip install huggingface_hub

    import os
    from huggingface_hub import hf_hub_download
    from mediapipe.tasks.python.genai import converter

install_out.clear_output()
with install_out:
    print("Setup done.")

2. Log in to Hugging Face (you should have an account and an HF token)

from huggingface_hub import notebook_login
notebook_login()

3. Create dropdown lists

model = widgets.Dropdown(
    options=["Gemma 2B", "Falcon 1B", "StableLM 3B", "Phi 2"],
    value='Falcon 1B',
    description='model',
    disabled=False,
)

backend = widgets.Dropdown(
    options=["cpu", "gpu"],
    value='cpu',
    description='backend',
    disabled=False,
)

token_description = widgets.Label(value='A Hugging Face token is required to download the Gemma checkpoints.')

token = widgets.Password(
    value='',
    placeholder='huggingface token',
    description='HF token:',
    disabled=False
)

def on_change_model(change):
    # The Hugging Face token is only needed when the gated Gemma model is selected.
    if change["new"] != 'Gemma 2B':
        token_description.layout.display = "none"
        token.layout.display = "none"
    else:
        token_description.layout.display = ""
        token.layout.display = ""

model.observe(on_change_model, names="value")
display(model, backend, token_description, token)

4. Set the links for the files to download

def gemma_download(token):
    REPO_ID = "google/gemma-1.1-2b-it"
    FILENAMES = ["tokenizer.json", "tokenizer_config.json", "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
    # The HF token is needed because the Gemma repository is gated.
    os.environ['HF_TOKEN'] = token
    for filename in FILENAMES:
        hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./gemma-2b-it")

gemma_download(token.value)

5. Convert the .safetensors format to an appropriate one

def gemma_convert_config(backend):
    input_ckpt = '/content/gemma-2b-it/'
    vocab_model_file = '/content/gemma-2b-it/'
    output_dir = '/content/intermediate/gemma-2b-it/'
    output_tflite_file = f'/content/converted_models/gemma_{backend}.tflite'
    return converter.ConversionConfig(input_ckpt=input_ckpt, ckpt_format='safetensors', model_type='GEMMA_2B', backend=backend, output_dir=output_dir, combine_file_only=False, vocab_model_file=vocab_model_file, output_tflite_file=output_tflite_file)

# Build the config for the selected backend and run the conversion.
config = gemma_convert_config(backend.value)
converter.convert_checkpoint(config)

You can use the Colab notebook, which has been slightly altered from the official one so that you can log in to Hugging Face directly.

After converting the latest model, we can use the MediaPipe solution for LLM Inference. The procedure consists of:

  1. Adding the library dependency
dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.11'
}

2. Pushing the converted .bin model to the device at a specific path (e.g. using adb push)

/data/local/tmp/llm/model.bin

3. Setting up the library inside the project

class InferenceModel private constructor(context: Context) {
    private var llmInference: LlmInference

    private val modelExists: Boolean
        get() = File(MODEL_PATH).exists()

    // Streams (partial result, isDone) pairs to collectors such as a ViewModel.
    private val _partialResults = MutableSharedFlow<Pair<String, Boolean>>(
        extraBufferCapacity = 1,
        onBufferOverflow = BufferOverflow.DROP_OLDEST
    )
    val partialResults: SharedFlow<Pair<String, Boolean>> = _partialResults.asSharedFlow()

    init {
        if (!modelExists) {
            throw IllegalArgumentException("Model not found at path: $MODEL_PATH")
        }

        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(MODEL_PATH)
            .setMaxTokens(1024)
            .setTemperature(0.0f)
            .setResultListener { partialResult, done ->
                _partialResults.tryEmit(partialResult to done)
            }
            .build()

        llmInference = LlmInference.createFromOptions(context, options)
    }

    fun generateResponseAsync(prompt: String) {
        llmInference.generateResponseAsync(prompt)
    }

    companion object {
        private const val MODEL_PATH = "/data/local/tmp/llm/model.bin"
        private var instance: InferenceModel? = null

        fun getInstance(context: Context): InferenceModel {
            return if (instance != null) {
                instance!!
            } else {
                InferenceModel(context).also { instance = it }
            }
        }
    }
}

Pay attention above to the:

.setTemperature(0.0f)

This makes the model deterministic, meaning we want the exact answer and not a creative one. Other parameters that can be used are summarized here.
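
As a reference for how the class above can be consumed, here is a minimal sketch of a ViewModel that collects the streamed partial results and triggers generation. The ChatViewModel name and the accumulation of chunks into a single string are illustrative choices, not part of the original project:

import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.launch

// Illustrative ViewModel that streams Gemma's partial results into UI state.
class ChatViewModel(private val inferenceModel: InferenceModel) : ViewModel() {

    // Accumulates the streamed chunks into the full answer shown in the UI.
    private val _answer = MutableStateFlow("")
    val answer: StateFlow<String> = _answer.asStateFlow()

    init {
        viewModelScope.launch {
            inferenceModel.partialResults.collect { (chunk, done) ->
                _answer.value += chunk
                // `done` becomes true when the model has finished generating.
            }
        }
    }

    fun ask(prompt: String) {
        _answer.value = ""   // clear the previous answer before a new request
        inferenceModel.generateResponseAsync(prompt)
    }
}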

All of the above has been inserted into the existing TensorFlow Lite Android app that is used to demonstrate QnA. Inside it there are multiple paragraphs that serve as context and various questions to ask. That project is a case study when it comes to the QnA task!

How does it work? You select a context at the beginning, then you click on a question and wait for the model to find the answer based only on that context!

GIF showcasing the original TensorFlow Lite app.

We’ve made a modification to incorporate not only the BERT model but also the Gemma model. This adjustment was made primarily to address two key factors:

  1. The need to accommodate larger input text sizes. Gemma 2B supports inputs of up to 2048 tokens, while the 7B model supports up to 3072 input tokens.
  2. The aspiration for answers that transcend mere simplicity, aiming instead for responses crafted into sentences. For instance, while BERT might provide the exact text from the context for a question like “When was TensorFlow released?” (e.g., “November 9, 2015”), LLMs such as Gemma would offer a more refined response like “TensorFlow was released on the 9th of November, 2015”.

For the BERT model we had to take the context and append the question at the end. We then tokenized the result and padded it to the length the model expects (the official documentation uses a model that requires an input array of 384 integers).
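
As a rough illustration only (not the project's exact code), that preprocessing could look like the sketch below, where SimpleTokenizer stands in as a hypothetical tokenizer helper:

// Hypothetical tokenizer interface, standing in for the project's actual WordPiece tokenizer.
interface SimpleTokenizer {
    fun tokenize(text: String): List<Int>
}

// Builds the fixed-size integer input described above: context + question, tokenized and zero-padded.
fun prepareBertInput(tokenizer: SimpleTokenizer, context: String, question: String, maxLen: Int = 384): IntArray {
    val tokenIds = tokenizer.tokenize("$context $question")
    val input = IntArray(maxLen)   // zero-padded to the model's fixed input length
    tokenIds.take(maxLen).forEachIndexed { i, id -> input[i] = id }
    return input
}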

Passing the Normandy context plus the question to the BERT model, the answer is:

"France"

Now, with the release of Gemma version 1.1, we just need to format the prompt in a simple, specific way:

Passage: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) 
were the people who in the 10th and 11th centuries gave their name
to Normandy, a region in France. They were descended
from Norse ("Norman" comes from "Norseman") raiders and pirates
from Denmark, Iceland and Norway who, under their leader Rollo,
agreed to swear fealty to King Charles III of West Francia.
Through generations of assimilation and mixing with the native
Frankish and Roman-Gaulish populations, their descendants would
gradually merge with the Carolingian-based cultures of West Francia.
The distinct cultural and ethnic identity of the Normans emerged
initially in the first half of the 10th century, and it continued
to evolve over the succeeding centuries.
Question: In what country is Normandy located?
Answer:

Passing the above as a whole string, the answer from the Gemma model is:

The passage states that "The Normans were the people who gave their name 
to Normandy a region in France".
So the Normans are located in France."

A totally better response!
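
For completeness, here is a minimal sketch of how such a Passage/Question/Answer prompt could be assembled and sent through the InferenceModel class shown earlier; the askGemma helper is just an illustration:

// Illustrative helper: assembles the Passage/Question/Answer prompt and sends it to Gemma.
fun askGemma(inferenceModel: InferenceModel, passage: String, question: String) {
    val prompt = buildString {
        appendLine("Passage: $passage")
        appendLine("Question: $question")
        append("Answer:")   // the trailing "Answer:" cues the model to complete the answer
    }
    // The streamed answer arrives through inferenceModel.partialResults.
    inferenceModel.generateResponseAsync(prompt)
}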

You can find the whole implementation inside the altered Android project, in the branch with the Gemma implementation here. Building it, you will see an option to use a trained BERT model with 512 inputs, the Gemma option, and an online Gemini implementation to compare results against the offline models (an API key has to be used, and a VPN where required).

Response from the Gemma model.

Conclusion
This blog post introduces the integration of Gemma 1.1, an advanced instruction-tuned language model, into an Android application for an offline question answering task. The post addresses the challenges of deploying Gemma 1.1, demonstrating setup, deployment, and results using clear code snippets. Notably, Gemma’s capacity for larger input text sizes enhances its suitability for demanding tasks like QnA. Unlike simple one-word responses, Gemma delivers nuanced answers crafted in concise sentences, enriching user interactions. The post also outlines the conversion process for integrating Gemma models into mobile applications using libraries like MediaPipe. Through detailed steps and examples, it showcases Gemma’s capability to provide more refined responses compared to traditional models like BERT, making it a valuable addition to AI-driven applications.


George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I am a senior Android developer at Invisalign, and an ML & Android GDE.