Fine-tune Gemma for any language on Vertex AI and deploy it to Android

Georgios Soloupis
Feb 13, 2025


Written by George Soloupis, ML and Android GDE

Gemma’s logo

Looking to use the power of a large language model without the complexity? Google’s Gemma is a safe choice for AI development due to its focus on safety, efficiency, and performance. In this post, we’ll show you how to efficiently fine-tune Gemma for your specific language using Vertex AI, Google Cloud Platform’s (GCP) leading service for building, deploying, and scaling machine learning models. GCP provides a complete ecosystem, including data storage, powerful GPUs, and streamlined deployment tools, making it easier than ever to customize Gemma for your applications.

We will provide a step-by-step guide on how to run a Jupyter notebook within the Vertex AI environment and then convert the resulting model into a format suitable for deployment on an Android device.

1. Open GCP and go to the Vertex AI environment:
Vertex AI environment inside Google Cloud Platform

2. Select the Colab Enterprise option:

Colab Enterprise environment in Vertex AI

You can import a notebook or create a new one.

3. Create a runtime that will be used while fine-tuning the Gemma model. An L4 GPU was sufficient in our case:

Runtime configuration with the machine type selection

4. Once created, the runtime will appear in the Runtimes section.

5. Click your desired notebook in the My notebooks section to open it, then select the runtime with the GPU. After clicking Connect you are ready to fine-tune the Gemma model:

Gemma notebook with the runtime selection on the right

Fine-tuning the model involves quite a few steps. Here we highlight the important ones:

1. For this project, we used Google Translate (via the googletrans library) to generate Greek question-answer examples based on a subset of the Alpaca dataset. The subset contained only 1,000 pairs, but you can experiment with more data or with your own dataset (a sketch of this translation step follows the imports below):


from peft import PeftModel, LoraConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer, TrainingArguments, DataCollatorForSeq2Seq
from trl import SFTTrainer
import json

from googletrans import Translator ## 113 languages available. Read at https://readthedocs.org/projects/py-googletrans/downloads/pdf/latest/

from datasets import Dataset, DatasetDict, load_dataset
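The translation step itself is not reproduced in full in this post, but a minimal sketch of the idea, reusing the imports above, could look like the following. It assumes the synchronous googletrans API (e.g. version 3.1.0a0; in newer 4.x releases translate() is a coroutine) and uses the tatsu-lab/alpaca copy of the dataset as an illustrative source; both are assumptions, not the exact code from the notebook:

```python
# Hypothetical sketch: translate a small Alpaca-style subset to Greek with googletrans.
# Assumes the synchronous googletrans API and the tatsu-lab/alpaca dataset on the Hub.
translator = Translator()

def translate_to_greek(text: str) -> str:
    # Empty strings are returned as-is to avoid unnecessary API calls.
    return translator.translate(text, dest="el").text if text else text

alpaca_subset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")  # first 1,000 pairs

greek_examples = []
for item in alpaca_subset:
    greek_examples.append({
        "instruction": translate_to_greek(item["instruction"]),
        "input": translate_to_greek(item["input"]),
        "output": translate_to_greek(item["output"]),
    })

with open("greek_output_1000.json", "w", encoding="utf-8") as f:
    json.dump(greek_examples, f, ensure_ascii=False, indent=2)
```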

2. We used LoRA (Low-Rank Adaptation) for fine-tuning. LoRA is a powerful technique for fine-tuning LLMs efficiently: instead of updating all the model’s parameters, it freezes the pre-trained weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture (a quick parameter-count check is sketched after the model-loading code in step 3):

# Configure LoRA parameters
lora_config = LoraConfig(
    r=32,  # Rank of the low-rank matrices. Smaller values lead to faster inference and a smaller adapter, but potentially reduced performance. 32 is a common starting point.
    lora_alpha=32,  # Scaling factor applied to the merged LoRA weights. Often set equal to `r`. Affects the learning-rate scaling of the LoRA parameters.
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],  # Modules within the transformer architecture to apply LoRA to. These are the projection matrices in the attention and feed-forward layers. Choosing appropriate target modules is crucial for effective LoRA training.
    task_type="CAUSAL_LM",  # Type of task being performed. "CAUSAL_LM" indicates causal language modeling (like text generation).
)

3. We experimented with the Gemma 2 2B instruction-tuned model. Its size is small enough to fit and run on a mobile device:

# Download the model from Hugging Face.
modelName = "google/gemma-2-2b-it" # google/gemma-2-2b-it or google/gemma-2-2b

eval_tokenizer = AutoTokenizer.from_pretrained(modelName, token=hf_token)
base_model = AutoModelForCausalLM.from_pretrained(modelName,
                                                  device_map="auto",
                                                  token=hf_token)
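As a quick sanity check (not part of the original notebook), you can wrap the downloaded base model with the LoRA config from step 2 and see how small the trainable fraction really is. SFTTrainer applies the config itself when you pass peft_config, so this preview is purely illustrative:

```python
from peft import get_peft_model

# Purely illustrative: preview how few parameters LoRA actually trains.
# SFTTrainer applies `lora_config` itself when `peft_config` is passed, so if you
# run this preview, either hand `preview_model` to the trainer instead of
# `base_model`, or reload the base model afterwards to avoid wrapping it twice.
preview_model = get_peft_model(base_model, lora_config)
preview_model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with r=32 only a small
# fraction of the roughly 2.6B parameters is trainable.
```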

4. Before using the dataset, all the examples have to be converted into Gemma’s expected prompt format (you can find more information in Gemma’s prompt-formatting documentation):

gemma_prompt = """user
{}: {}
model
{}"""

eos_token = eval_tokenizer.eos_token
pad_token = eval_tokenizer.pad_token
eval_tokenizer.padding_side = "right"

eos_token, pad_token

# Convert to Gemma format
def convert_json_to_gemma_format(json_file_path, gemma_prompt):
    """Converts a JSON file to the Gemma format.

    Args:
        json_file_path: Path to the JSON file.
        gemma_prompt: The Gemma prompt template string.

    Returns:
        A dictionary containing the formatted text data.
    """
    with open(json_file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    texts = []
    for item in data:
        instruction = item.get("instruction", "")
        input_text = item.get("input", "")  # Using input_text to avoid shadowing built-in 'input'
        output = item.get("output", "")
        text = gemma_prompt.format(instruction, input_text, output) + eos_token
        texts.append(text)

    return {"text": texts}

json_file_path = "/content/greek_output_1000.json"

gemma_data = convert_json_to_gemma_format(json_file_path, gemma_prompt)
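The notebook also builds the train/validation datasets and the data collator that the trainer in step 6 expects (dataset_train, dataset_val and data_collator). That code is not shown in this post, but a sketch under the same variable names could look like this; the 90/10 split ratio is an assumption:

```python
# Build Hugging Face datasets from the formatted examples (assumed 90/10 split).
full_dataset = Dataset.from_dict(gemma_data)
split = full_dataset.train_test_split(test_size=0.1, seed=42)
dataset_train = split["train"]
dataset_val = split["test"]

# Pads input_ids and labels to the same length within each batch.
data_collator = DataCollatorForSeq2Seq(eval_tokenizer, padding=True, return_tensors="pt")
```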

5. The training parameters are really important. Note that we do not use any quantization here, as it will be applied later when we convert the model to the GGUF format for use on an Android device:

# Training arguments
# Adjust per your needs and how powerful your working environment is
train_args = TrainingArguments(
    per_device_train_batch_size=4,  # Each GPU processes 4 examples per step.
    gradient_accumulation_steps=1,  # Gradients are accumulated over 1 step before updating weights.
    warmup_steps=30,  # Learning rate warms up (gradually increases) for the first 30 steps.
    max_steps=1000,  # Total number of optimization steps for training.
    # num_train_epochs=3,  # Not used because `max_steps` defines the training duration.
    gradient_checkpointing=True,  # Saves memory by recomputing activations during backpropagation.
    learning_rate=3e-4,  # Base learning rate for the optimizer.
    fp16=False,  # FP16 precision is disabled.
    bf16=False,  # bfloat16 precision is disabled here. Setting it to True (on GPUs that support it) creates more stable outputs on mobile.
    logging_steps=20,  # Logs training metrics every 20 steps.
    optim="adamw_8bit",  # Uses AdamW optimizer with 8-bit precision for optimizer states to save memory.
    weight_decay=0.01,  # Regularization to prevent overfitting by penalizing large weights.
    lr_scheduler_type="linear",  # Linearly decays learning rate after the warmup period.
    output_dir="outputs",  # Directory where model checkpoints and logs will be saved.
    report_to="none",  # Disables logging to external tools like TensorBoard or WandB.
    # evaluation_strategy="steps",  # Evaluation is performed every eval_steps
    # eval_steps=80  # Evaluate every 80 steps
)

6. Finally, you start the fine-tuning procedure:

# Create the trainer
trainer = SFTTrainer(
    model=base_model,
    tokenizer=eval_tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    data_collator=data_collator
)

trainer.train()

7. Once training is done, you save the merged model and the tokenizer to a directory:

# Save the trained adapter
trainer.save_model("trainer_gemma_2_2b")

# Load your fine-tuned model
ft_model = PeftModel.from_pretrained(base_model, "trainer_gemma_2_2b")

# Merge adapters with the base model
merged_model = ft_model.merge_and_unload()

# Save the merged model to a directory
output_dir = "/content/merged_model"
merged_model.save_pretrained(output_dir)
eval_tokenizer.save_pretrained(output_dir)
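Before pushing the model anywhere, a quick smoke test is worthwhile. The snippet below is an addition, not part of the original notebook; the Greek prompt is just an illustrative example that reuses the gemma_prompt template from step 4:

```python
# Quick smoke test of the merged model (illustrative prompt, reusing gemma_prompt).
test_prompt = gemma_prompt.format(
    "Απάντησε στην ερώτηση",                  # instruction: "Answer the question"
    "Ποια είναι η πρωτεύουσα της Ελλάδας;",   # input: "What is the capital of Greece?"
    "",                                        # leave the model turn empty for generation
)

inputs = eval_tokenizer(test_prompt, return_tensors="pt").to(merged_model.device)
with torch.no_grad():
    outputs = merged_model.generate(**inputs, max_new_tokens=64)
print(eval_tokenizer.decode(outputs[0], skip_special_tokens=True))
```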

8. You can push your model to the Hugging Face Hub:

# HuggingFace repository ID
repo_id = f"gsoloupis/gemma2_2B_it_greek_full_32"

# Push the model and tokenizer to HuggingFace Hub
merged_model.push_to_hub(repo_id, token=True, max_shard_size="5GB", safe_serialization=True)
eval_tokenizer.push_to_hub(repo_id, token=True)
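If you want to confirm the upload worked, you can reload the model and tokenizer straight from the Hub. This check is an addition to the original notebook:

```python
# Optional check: reload the pushed model and tokenizer from the Hub.
hub_tokenizer = AutoTokenizer.from_pretrained(repo_id, token=hf_token)
hub_model = AutoModelForCausalLM.from_pretrained(repo_id, token=hf_token, device_map="auto")
```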

9. You can convert your model directly inside the Jupyter notebook and get the .gguf file:

!git clone https://github.com/ggerganov/llama.cpp.git

!pip install -r llama.cpp/requirements.txt

!python llama.cpp/convert_hf_to_gguf.py -h

!python llama.cpp/convert_hf_to_gguf.py /content/merged_model \
--outfile gemma_greek_2_2b_it_q8_0.gguf \
--outtype q8_0
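As a final check (not covered in the original post), you can load the resulting .gguf file with the llama-cpp-python bindings, which you would need to pip install separately; the Greek prompt below is illustrative and follows the Gemma turn format from step 4:

```python
# Optional: sanity-check the quantized GGUF file with llama-cpp-python
# (install first with: pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="gemma_greek_2_2b_it_q8_0.gguf", n_ctx=2048)
result = llm(
    "<start_of_turn>user\nΓεια σου, πώς είσαι;<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=64,
)
print(result["choices"][0]["text"])
```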

10. Or you can use this Hugging Face Space to convert it automatically:

https://huggingface.co/spaces/ggml-org/gguf-my-repo

After getting the .gguf file, you can use an Android project to host and run it. One example project for reference is this. Fine-tuning Gemma 2, even with a limited dataset in your desired language, yields remarkably high-quality results:

Example of Gemma usage inside Android (in Greek)
Example of translation
Example of summarization
Example of poem explanation

GitHub repository with the whole Python notebook.

Hugging Face repository with models and datasets.

Conclusion
Vertex AI simplifies the process of fine-tuning Gemma on your own dataset. The platform offers an accessible environment, including a Jupyter notebook interface and pre-configured runtimes with powerful GPUs, making it easy to start fine-tuning. Gemma 2 proves to be well suited for fine-tuning, even with relatively small datasets. The resulting models deliver remarkably high-quality performance, as demonstrated by the examples of translation, summarization, and explanation, showcasing Gemma’s potential in various applications after being fine-tuned on Vertex AI. A key advantage is the ability to convert the fine-tuned model directly into a .gguf file within the Vertex AI environment (or through a Hugging Face Space) for easy deployment on Android devices.

