Use a tiny Llama model inside Android

George Soloupis
4 min read · Feb 7, 2024


Written by George Soloupis, ML and Android GDE.

In this blog post, we explore the integration of a lightweight Llama model, developed in C, within an Android application. This specialized model serves a singular purpose: assisting users with volume adjustments. While the training process for this model has been detailed in a previous post, here our focus shifts to presenting the Android code for a seamless implementation.

The model has been integrated into a previous PoC, demonstrated here, where speech is converted into text and the text is sent to APIs like ChatGPT or Gemini Pro for summarization. The implementation is based on Karpathy’s GitHub repo, where a Llama model can be trained from scratch with altered model parameters and, after training, saved so it can be read in C environments. Here we are going to show only the Android code necessary to run the model inside the project, which is based on the Android wrapper included in that repository.

First, the two files generated by the training procedure have to be placed inside the assets folder of the project:

Models from the training procedure.

Above, you’ll find the files model_2570.bin and tok2570.bin, which respectively correspond to the model and the tokenizer designed for a specific task. Remarkably, the model size is around 27MB, while the tokenizer is a mere 33KB. This stands out considering that Large Language Models (LLMs) typically require gigabytes of storage to perform effectively across various tasks. However, given our narrow focus on a single task, the compact size of these files proves more than sufficient for our purposes.

Then the models can either be read directly from the assets folder or, during initialization of the application, be copied to the app’s storage on the device:

viewModelScope.launch {
    val checkpoint = "model_2570.bin"
    val tokenizer = "tok2570.bin"
    // Copy both files out of the APK assets so the native code can open them by path.
    val assetsFolder = application.copyAssets(arrayOf(checkpoint, tokenizer))
}

private fun userAssetPath(context: Context?): String {
    if (context == null) return ""
    // App-private directory where the copied files will live.
    val extDir = context.getDir("assets", 0).absolutePath
    return extDir
}

fun Context.copyAssets(listFiles: Array<String>): String {
    val extFolder = userAssetPath(this)
    try {
        assets.list("")
            ?.filter { listFiles.contains(it) }
            ?.filter { !File(extFolder, it).exists() } // skip files that were already copied
            ?.forEach {
                val target = File(extFolder, it)
                assets.open(it).use { input ->
                    FileOutputStream(target).use { output ->
                        input.copyTo(output)
                        Log.i("Utils", "Copied from apk assets folder to ${target.absolutePath}")
                    }
                }
            }
    } catch (e: Exception) {
        Log.e("Utils", "asset copy failed", e)
    }
    return extFolder
}

Having the models in place, we can initialize the inference runner, pointing it at the folder returned by copyAssets, and wait for the text to be generated so we can process it:

class InferenceRunnerManager(
    callback: InferenceRunner.InferenceCallback,
    private val folderPath: String,
    private val checkpointFileName: String,
    private val tokenizerFileName: String,
    private val ompThreads: Int = DEFAULT_OMP_THREADS
) {
    private val applicationScope = CoroutineScope(Dispatchers.IO + SupervisorJob())

    init {
        InferenceRunner.setInferenceCallback(callback)
    }

    fun run(
        prompt: String = "",
        temperature: Float = DEFAULT_TEMPERATURE,
        steps: Int = DEFAULT_STEPS,
        topp: Float = DEFAULT_TOPP
    ) {
        applicationScope.launch {
            InferenceRunner.run(
                checkpoint = "$folderPath/$checkpointFileName",
                tokenizer = "$folderPath/$tokenizerFileName",
                temperature = temperature,
                steps = steps,
                topp = topp,
                prompt = prompt,
                ompthreads = ompThreads
            )
        }
    }

    companion object {
        private const val DEFAULT_OMP_THREADS = 4
        private const val DEFAULT_TEMPERATURE = 0.0f
        private const val DEFAULT_STEPS = 64
        private const val DEFAULT_TOPP = 0.9f
    }
}
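
To show how the pieces fit together, here is a minimal sketch of wiring the manager inside a ViewModel. The class name, the generatedText property and the callback body are assumptions for illustration, not code from the project; the folder path comes from the copyAssets extension shown earlier.

// Illustrative sketch: the class name, generatedText property and callback body
// are assumptions for this example, not code from the project.
class TalkViewModel(application: Application) : AndroidViewModel(application) {

    private var inferenceRunnerManager: InferenceRunnerManager? = null
    private var generatedText = ""

    init {
        viewModelScope.launch(Dispatchers.IO) {
            // Copy the model and tokenizer out of the APK, then wire up the manager.
            val assetsFolder = application.copyAssets(arrayOf("model_2570.bin", "tok2570.bin"))
            inferenceRunnerManager = InferenceRunnerManager(
                callback = object : InferenceRunner.InferenceCallback {
                    override fun onNewResult(token: String?) {
                        // Tokens are streamed back from the native side one by one.
                        generatedText += token.orEmpty()
                    }
                },
                folderPath = assetsFolder,
                checkpointFileName = "model_2570.bin",
                tokenizerFileName = "tok2570.bin"
            )
        }
    }
}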

Note above that we run the LLM with TEMPERATURE = 0, since here we do not want the model to be creative but deterministic: the same prompt should always produce the same answer.
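
For intuition, the sampling step at the end of each forward pass behaves roughly like the sketch below: with temperature 0 the token with the highest score is always picked, while with a temperature above 0 the logits are scaled and a token is drawn at random. This Kotlin function is only an illustration of that idea, not code from the project; the actual sampling happens in the C code and also applies top-p filtering.

// Illustrative sketch (not project code): why temperature 0 is deterministic.
fun sampleNextToken(logits: FloatArray, temperature: Float): Int {
    if (temperature == 0.0f) {
        // Greedy decoding: the same prompt always yields the same output.
        return logits.indices.maxByOrNull { logits[it] }!!
    }
    // Otherwise scale the logits and turn them into probabilities with a softmax...
    val scaled = logits.map { (it / temperature).toDouble() }
    val maxLogit = scaled.maxOrNull() ?: 0.0
    val exps = scaled.map { kotlin.math.exp(it - maxLogit) }
    val sum = exps.sum()
    val probs = exps.map { it / sum }
    // ...and draw a token at random from that distribution (top-p filtering omitted here).
    var r = Math.random()
    probs.forEachIndexed { index, p ->
        r -= p
        if (r <= 0) return index
    }
    return probs.lastIndex
}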

Within the InferenceRunner object, we have two functions tightly integrated with native development:

object InferenceRunner {

    private var inferenceCallback: InferenceCallback? = null

    fun setInferenceCallback(callback: InferenceCallback) {
        inferenceCallback = callback
    }

    external fun run(
        checkpoint: String,
        tokenizer: String,
        temperature: Float,
        steps: Int,
        topp: Float,
        prompt: String,
        ompthreads: Int
    )

    external fun stop()

    // Invoked from the native side for every generated token.
    fun onNewToken(token: String) {
        inferenceCallback?.onNewResult(token)
    }

    interface InferenceCallback {
        fun onNewResult(token: String?)
    }
}

and in C:

JNIEXPORT void JNICALL
Java_com_example_talkandexecute_llm_InferenceRunner_run(JNIEnv *env, jobject thiz, jstring checkpoint,
                                                        jstring tokenizer, jfloat temperature,
                                                        jint steps, jfloat topp, jstring prompt,
                                                        jint ompthreads) {

    const char *checkpoint_path = (*env)->GetStringUTFChars(env, checkpoint, 0);
    const char *tokenizer_path = (*env)->GetStringUTFChars(env, tokenizer, 0);
    const char *_prompt = (*env)->GetStringUTFChars(env, prompt, 0);

    // Constrain the OpenMP thread count to the expected range before running inference.
    if (ompthreads >= 1 && ompthreads <= 8) {
        omp_set_num_threads(ompthreads);
    } else {
        LOGE("incorrect number of threads for openMP! expect: 1..8\n");
    }

    LOGI("inference loaded checkpoint path: %s tokenizer path: %s temperature %f, steps %d prompt %s",
         checkpoint_path, tokenizer_path, temperature, steps, _prompt);

    run_inference(env, thiz, checkpoint_path, tokenizer_path, temperature, steps, topp, _prompt);

    (*env)->ReleaseStringUTFChars(env, checkpoint, checkpoint_path);
    (*env)->ReleaseStringUTFChars(env, tokenizer, tokenizer_path);
    (*env)->ReleaseStringUTFChars(env, prompt, _prompt);
}

JNIEXPORT void JNICALL
Java_com_example_talkandexecute_llm_InferenceRunner_stop(JNIEnv *env, jobject thiz) {
    stop();
}
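
One detail to keep in mind: the external functions can only be resolved at runtime once the shared library built from these C sources has been loaded on the Kotlin side, typically with System.loadLibrary in an init block of the InferenceRunner object. The library name below is only a placeholder; use the target name defined in the project’s CMake/NDK configuration.

init {
    // Placeholder library name: replace with the native target declared in the project's CMake/NDK build.
    System.loadLibrary("inference")
}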

Now, by clicking the button on screen, instead of sending the text to the online APIs we can use the local LLM:

fun runOfflineLLM() {
    inferenceRunnerManager.run(transcribedText)
}

Result using the Local LLM.

You can find the whole C implementation, with the model and tokenizer loading, the encoding and decoding of the text, and the final text generation, inside the inference.c file. The whole project is available for cloning here.

Conclusion

In this blog post we explored the integration of a lightweight Llama model into an Android application. Developed in C, this specialized model aims to assist users with volume adjustments, offering a compact alternative to larger language models. The implementation involves transferring two generated files, model_2570.bin and tok2570.bin, into the assets folder of the Android project. Despite their modest sizes, these models prove effective, showcasing the efficiency of focused, task-oriented models. Key functions within the Android code manage the initialization and execution of the inference runner, bridging the gap between Kotlin and native development in C. Overall, the post provides practical insight into integrating specialized models into Android applications, promoting efficiency and reducing reliance on online APIs for text generation.


George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I’m a senior Android developer at Invisalign, and an ML & Android GDE.