Train a tiny Llama model to help with a specific domain task.

Feb 6, 2024

Written by George Soloupis, ML and Android GDE.

In this blog post, we explore the fascinating realm of leveraging an open-source Large Language Model (LLM), affectionately known as Llama, to enhance performance on a targeted domain task. While LLMs have demonstrated remarkable capabilities across diverse tasks, their typical size poses challenges for deployment on devices with constrained memory and computational resources, such as mobile phones, single-board computers, and microcontrollers. However, an innovative alternative emerges through Karpathy’s implementation of the Llama model in the C language. This implementation offers a high level of customization, enabling the fine-tuning and compression of the Llama model into compact .bin files. These files integrate seamlessly into a C environment, making them well-suited for deployment on resource-constrained devices. You can read how these files are used inside Android in this blog post.

The domain we are targeting is a small assistant that can interpret the user’s language into specific commands and alter the volume of the mobile phone automatically. In previous work we showcased offline speech-to-text with TensorFlow and the Whisper model, and the usage of the Whisper .tflite model inside an Android application. After the Whisper model generated text, the application previously forwarded that text to either ChatGPT or the Gemini API to summarize it into commands. In the current implementation, however, we’ve integrated a Llama model in the C language, which performs the same task offline!

Screenshot of mobile phone with options.
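
To make the interpretation step concrete, here is a minimal, hypothetical sketch of how generated text could be mapped to a volume command. The command names and keywords below are illustrative assumptions, not the actual outputs of the model described in this article:

# Hypothetical post-processing: map the LLM's generated text to a volume
# command. Command names and keywords are illustrative assumptions.
def text_to_command(generated: str) -> str:
    text = generated.lower()
    if "up" in text or "increase" in text or "louder" in text:
        return "VOLUME_UP"
    if "down" in text or "decrease" in text or "lower" in text:
        return "VOLUME_DOWN"
    if "mute" in text or "silent" in text:
        return "MUTE"
    return "NO_OP"

print(text_to_command("turn the music up a bit"))  # VOLUME_UP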

To train the model you can use the Colab notebook I have prepared here. First of all, you have to clone Karpathy’s GitHub repository and install the requirements that will be used to train the model. Inside it you will find the SentencePiece library, which is the heart of the tokenization procedure in this example.

!git clone https://github.com/karpathy/llama2.c.git
%cd llama2.c
!pip install -r requirements.txt
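
SentencePiece can also be driven directly from Python. As a rough illustration of what the tokenization step does (the file name and the parameters below are assumptions for this sketch, not necessarily what pre_training_script.py uses):

import sentencepiece as spm

# Train a small BPE vocabulary on a plain-text corpus (illustrative values).
spm.SentencePieceTrainer.train(
    input='output.txt',      # assumed path to your training text
    model_prefix='tok1200',
    model_type='bpe',
    vocab_size=1200,
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file='tok1200.model')
print(sp.encode('what is the maximum depth of the atlantic ocean', out_type=str))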

The original repository uses the TinyStories dataset. I have altered the code so you can use whatever .txt file you have. For example, for a QnA task you can use the ‘natural_questions/longt5’ dataset from TensorFlow Datasets and create a .txt file of the form below:

'what is the definition of the name tiffany = Epiphany'
'what is the maximum depth of the atlantic ocean = 8 , 486 m ( 27 , 841 ft ) '
'what is the population of nashville tennessee metropolitan area = 1 , 865 , 298'
'when was the first animal sent into space = 1947'

To convert the dataset to a .txt file, use the last cells of the Colab notebook:


import tensorflow as tf
import tensorflow_datasets as tfds

# Load the Natural Questions dataset from TensorFlow Datasets.
nqa = tfds.load('natural_questions/longt5', as_supervised=False)
print(nqa['train'])

prefetchdataset = nqa['train']
print(len(prefetchdataset))

def remove_bytes_wrapper(string):
    # str(tensor.numpy()) looks like "b'...'"; strip the leading b' and trailing '.
    return string[2:-1]

all_qna = []
n = 0
samples = 307373
for element in prefetchdataset:
    # Skip examples that have no answer.
    if "NULL" in str(element['answer'].numpy()):
        continue
    # Join question and answer as "question = answer".
    tensordata = element['question'] + " = " + element['answer']
    stringdata = remove_bytes_wrapper(str(tensordata.numpy()))
    all_qna.append(stringdata)

    n += 1
    if n == samples:
        break
print(all_qna)

# Name of the text file
file_name = "output.txt"

# Open the file in write mode and write each string on a new line.
with open(file_name, 'w') as file:
    for string in all_qna:
        file.write(f"{string}\n")

print(f"Strings written to {file_name}")

Once the .txt file is ready, we have to pretokenize the dataset and create the vocabulary used to train the model. I have created the pre_training_script.py file, which can be used as follows:

import time
!python pre_training_script.py train_vocab --vocab_size=1200 --path_to_text=/content/llama2.c/TinyStories-train.txt
time.sleep(5)
!python pre_training_script.py pretokenize --vocab_size=1200 --path_to_text=/content/llama2.c/TinyStories-train.txt
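
If you want to sanity-check the pretokenized output, llama2.c’s TinyStories pipeline stores the token ids as a flat array of uint16 values in .bin files. Assuming the custom script follows the same convention (the file path below is hypothetical), you can peek at the first tokens:

import numpy as np

# Assumption: pretokenized shards are flat uint16 token-id arrays, as in
# llama2.c's TinyStories pipeline. The file path is hypothetical.
tokens = np.memmap('/content/llama2.c/data/tok1200/TinyStories-train.bin',
                   dtype=np.uint16, mode='r')
print(len(tokens), tokens[:32])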

You can alter the vocabulary size to suit the task and the size of the dataset.

Three files will be generated in the data directory. You can convert the tok1200.model file into a proper .bin file, ready to be used on embedded devices, as follows:

!python tokenizer.py --tokenizer-model=/content/llama2.c/data/tok1200.model
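
As a sanity check, you can also read the exported tokenizer back. This sketch assumes the layout used by llama2.c’s tokenizer.py export (a uint32 max token length, then per token a float32 score, a uint32 byte length, and the raw token bytes); verify against the script before relying on it:

import struct

# Read the tokenizer .bin, assuming llama2.c's export layout (see above).
with open('/content/llama2.c/data/tok1200.bin', 'rb') as f:
    max_token_length = struct.unpack('I', f.read(4))[0]
    vocab = []
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        score, length = struct.unpack('fI', header)
        vocab.append((f.read(length), score))

print(max_token_length, len(vocab), vocab[:5])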

Then, with the train_abstract.py script, you can start the training:

!python train_abstract.py --vocab_source=custom --vocab_size=1200

At the top of the training script you can see all the parameters you can alter so that the final model is efficient yet small. You can consult the last pages of the Chinchilla paper for guidance on choosing the model’s parameters.
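
As a concrete, hypothetical example, the training script accepts its hyperparameters as command-line overrides, so a smaller configuration could look like the line below. The flag names follow llama2.c’s train.py and should be verified against the parameters listed at the top of train_abstract.py:

# Hypothetical model-size overrides; check the flag names in the script.
!python train_abstract.py --vocab_source=custom --vocab_size=1200 --dim=128 --n_layers=6 --n_heads=8 --max_seq_len=256 --batch_size=32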

After the training you can create an executable file:

!make runfast
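
For reference, runfast is a Makefile target in the repository that compiles run.c with aggressive compiler optimizations; the equivalent manual command is roughly the following (an approximation, check the repo’s Makefile):

# Approximation of what `make runfast` does under the hood.
!gcc -Ofast -o run run.c -lm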

Then you can use the model and the tokenizer .bin files:

model_file = '/content/llama2.c/out/model.bin'
tokenizer = '/content/llama2.c/data/tok1200.bin'

# Generation args
max_token = 96 #@param {type:"slider", min:32, max:1024, step:32}
# A temperature of 0 makes decoding greedy and deterministic.
temperature = 0 #@param {type:"slider", min:0.0, max:1, step:0.05}
top_p = 0.9 #@param {type:"slider", min:0.0, max:1.0, step:0.05}
prompt = "the music" #@param {type:"string"}

print(f"model: {model_file}, max_token: {max_token}, temperature: {temperature}, top_p: {top_p}, prompt: {prompt}")
print("----------------------------\n")

# Run inference with the compiled binary: the tokenizer, sampling
# parameters, and prompt are passed as flags.
cmd = f'./run {model_file} -z {tokenizer} -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"'
!{cmd}

The above two files can then be used inside the Android project with the Native Development Kit (NDK).

Conclusion

In this blog post we explored the potential of integrating a compact Large Language Model (LLM) into constrained devices, addressing the challenges posed by the size of traditional LLMs. The implementation in the C language offers a solution, enabling compression into .bin files for deployment on small devices. We showcased the development of a mobile assistant capable of interpreting user language and executing commands, with previous iterations utilizing the Whisper model together with ChatGPT or the Gemini API. The latest implementation integrates the Llama model seamlessly and performs the task fully offline. Detailed instructions guide developers through dataset preparation, tokenization, vocabulary creation, and model training. The flexibility to customize the training parameters ensures efficiency and size optimization. The trained model and tokenizer files are easily integrated into Android projects, marking a significant step forward for AI deployment on resource-limited devices.
