Stable Diffusion example in an Android application — Part 1

George Soloupis
6 min read · Jan 25, 2023

Written by George Soloupis, ML GDE.

This is a blog post that demonstrates how you can deploy a Stable Diffusion pipeline inside an Android app. Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. It consists of four parts — a Tokenizer, an Encoder, the Diffusion model and the Decoder. In Part 1 we focus on the Tokenizer and the Encoder. Check Part 2 for the Diffusion model and the Decoder.

The Tokenizer

Tokenization is a common task in Natural Language Processing (NLP). It is a fundamental step in both traditional NLP methods and advanced deep-learning-based architectures like Transformers. Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types — word, character, and subword (n-gram characters) tokenization. One of the most common problems is OOV, which stands for Out Of Vocabulary words. The CLIP tokenizer that goes hand in hand with the diffusion pipeline handles that problem with subword tokenization by doing Byte Pair Encoding (BPE).
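To make the subword idea concrete, here is a toy, greedy longest-match splitter. This is an illustration only — the real CLIP tokenizer applies learned BPE merge rules from its vocabulary, and `bpe_split` and the tiny `vocab` below are made up for the example:

```python
# Toy illustration of subword tokenization: an out-of-vocabulary word is
# split into subword units that ARE in the vocabulary. Real BPE applies
# learned merge rules; this greedy longest-match version just shows the idea.

def bpe_split(word, vocab):
    """Greedily split a word into the longest subwords found in `vocab`."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

vocab = {"surf", "ing", "cat", "s"}
print(bpe_split("surfing", vocab))  # ['surf', 'ing']
print(bpe_split("cats", vocab))     # ['cat', 's']
```

Even though "surfing" is not in the toy vocabulary, it is still represented — no OOV token needed.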

You can find the source code of the tokenizer procedure in the keras-cv Stable Diffusion example, where you can also see the vocabulary it uses. The procedure the tokenizer follows is pretty straightforward for a Byte Pair Encoding implementation, but hard to port to a Java equivalent. For that reason, the decision was to run the exact same Python code inside Android.

Using Python inside Android
Chaquopy comes in really handy for this experiment, as it is a Python SDK for Android development. Since mid-2022 it has been free and open source, and the SDK's full source code is available on GitHub under the MIT license. Following the basic setup, you can add it to your projects; for each ABI you will see an increase of about 10 MB in the final .apk size. Usually the `armeabi-v7a` ABI is enough for experiments and covers most Android devices. The final setup in the app's build.gradle file, along with the libraries that have to be downloaded for the tokenizer, is:

    sourceSets {
        main {
            python {
                srcDirs = ["src/main/python"]
            }
        }
    }
    python {
        //buildPython "/usr/bin/python3.8"
        //buildPython "/usr/bin/python3"
        pip {
            // A requirement specifier, with or without a version number:
            install "numpy"
            install "ftfy"
            install "regex"
            install "requests"
        }
    }
    ndk {
        abiFilters "armeabi-v7a" //, 'arm64-v8a'
    }

If you follow the Chaquopy docs, you should have a python folder in the project structure. Inside it, create a text_encoder.py file and place the code that you can find here. A truncated code snippet is below:

import gzip
import ftfy
import regex as re
import html
import os
from os.path import dirname, join

# Below implementation from Keras-cv clip_tokenizer.py
# https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion/clip_tokenizer.py


def bytes_to_unicode():
    """Return a list of utf-8 bytes and a corresponding list of unicode strings.

    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

.....
.....
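As a quick sanity check, `bytes_to_unicode()` builds a reversible, one-to-one mapping that covers all 256 byte values. The snippet below (with the function repeated so it runs standalone) verifies that:

```python
# Sanity check for bytes_to_unicode(): the table must have an entry for
# every one of the 256 byte values, and the mapping must be one-to-one
# so that the BPE step is fully reversible.

def bytes_to_unicode():
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

table = bytes_to_unicode()
assert len(table) == 256                # every byte value has an entry
assert len(set(table.values())) == 256  # the mapping is one-to-one
print(table[ord("a")])  # printable bytes map to themselves: 'a'
```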

In the project's Application class, call Python.start():

class DiffusionModelsApp : Application() {

    override fun onCreate() {
        super.onCreate()
        // Start Python
        if (!Python.isStarted()) {
            Python.start(AndroidPlatform(this))
        }
    }
}

and in an Activity or a Fragment get the tokenization result by passing the text, as in this example:

val python = Python.getInstance()
val pythonFile = python.getModule("encode_text")
val encodedObject: IntArray =
    pythonFile.callAttr("encodeText", editText.text.toString())
        .toJava(IntArray::class.java)

The above is straightforward, fast, and guarantees that the result will be identical to what you get running the Python code locally or in the cloud. Let's take a look at what the final array of integers looks like for a text prompt such as “two cats doing surfing”:

49406, 1237, 3989, 1960, 2379, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407

The array starts with the integer 49406 (the <|startoftext|> token), continues with the integers from the vocabulary that represent the words “two”, “cats”, “doing” and “surfing”, and is finally padded to the end with the <|endoftext|> equivalent, which is the integer 49407. Now the array is ready to be used by the encoder.
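The padding step is easy to sketch in Python. The token ids below are the ones from the example above, and the fixed length of 77 matches the keras-cv pipeline's prompt length; `pad_tokens` is a hypothetical helper written for this illustration, not part of the actual code:

```python
# Sketch of how the tokenizer output is padded to the fixed prompt length.
# 49406/49407 are the CLIP <|startoftext|>/<|endoftext|> markers, and 77 is
# the prompt length used by the keras-cv Stable Diffusion pipeline.

MAX_PROMPT_LENGTH = 77
START_OF_TEXT = 49406
END_OF_TEXT = 49407

def pad_tokens(word_tokens):
    # Wrap the word tokens with the start/end markers...
    tokens = [START_OF_TEXT] + word_tokens + [END_OF_TEXT]
    # ...then pad with the end-of-text token up to the fixed length.
    return tokens + [END_OF_TEXT] * (MAX_PROMPT_LENGTH - len(tokens))

encoded = pad_tokens([1237, 3989, 1960, 2379])  # "two cats doing surfing"
print(len(encoded))   # 77
print(encoded[:6])    # [49406, 1237, 3989, 1960, 2379, 49407]
```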

The Encoder

To get the result from the encoder, we are going to use the Java API of the TensorFlow Lite Interpreter. In the build.gradle file we have to include the dependency:

implementation 'org.tensorflow:tensorflow-lite:2.11.0'

We can save a huge amount of the final .apk size and get shorter build times by placing the encoder's .tflite file in the phone's internal storage instead of the assets folder. To do that, before proceeding with the encoder, build and run your app on your phone. This will create a directory in the device's internal storage, and you can place the .tflite file at:

data/data/com.example.diffusionmodelsapp/files

as shown in the image:

Internal storage of the mobile phone.

You can download the model from this link.

Then you can initialize the Interpreter as you can see in this file:

    val interpreter = getInterpreter(
        context,
        "text_encoder_chollet_float_16.tflite",
        false
    )
    ...
    ...
    @Throws(IOException::class)
    private fun getInterpreter(
        context: Context,
        modelName: String,
        useGpu: Boolean = false
    ): Interpreter {
        val tfliteOptions = Interpreter.Options()
        tfliteOptions.numThreads = numberThreads
        val mByteBuffer = loadModelFromInternalStorage(context, modelName)
        // Pass the configured options so the thread setting takes effect.
        return Interpreter(mByteBuffer, tfliteOptions)
    }

    private fun loadModelFromInternalStorage(
        context: Context,
        modelName: String
    ): MappedByteBuffer {
        val modelPath: String = context.filesDir.path + "/" + modelName
        val file = File(modelPath)
        val inputStream = FileInputStream(file)
        return inputStream.channel.map(FileChannel.MapMode.READ_ONLY, 0, file.length())
    }

The .tflite file is pretty big, but it will not give you OOM errors and it runs pretty fast even on old devices. You can download it from here. With the interpreter loaded, we can feed it arrays and get a result. For the diffusion model we need to run the encoder's interpreter twice: once for the context and once for the unconditional context. For the context we use the array from the tokenizer, and for the unconditional context a dummy array:

49406, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407
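The dummy array above is simply the “empty prompt”: the start-of-text token followed by end-of-text padding, which can be built in one line:

```python
# The unconditional "empty prompt": <|startoftext|> (49406) followed by 76
# copies of <|endoftext|> (49407), giving the same fixed length of 77 as
# the real prompt.
MAX_PROMPT_LENGTH = 77
unconditional_tokens = [49406] + [49407] * (MAX_PROMPT_LENGTH - 1)
print(len(unconditional_tokens))  # 77
```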

We can visualize the inputs and the outputs of the .tflite file at netron.app:

Inputs and outputs of the graph.

The code to get the two results will be like:

    val arrayOutputsContext = Array(1) {
        Array(77) {
            FloatArray(768)
        }
    }
    val arrayOutputsUnconditionalContext = Array(1) {
        Array(77) {
            FloatArray(768)
        }
    }
    val contextInput = Array(1) {
        intArray // the array of the text tokenizer
    }
    val unconditionalContextInput = Array(1) {
        unconditionalTokens
    }
    val positionInput = Array(1) {
        intArrayOfPositions
    }
    val outputsContext: MutableMap<Int, Any> = HashMap()
    outputsContext[0] = arrayOutputsContext
    interpreterEncoder.runForMultipleInputsOutputs(
        arrayOf<Any>(
            contextInput,
            positionInput
        ), outputsContext
    )
    val outputsUnconditionalContext: MutableMap<Int, Any> = HashMap()
    outputsUnconditionalContext[0] = arrayOutputsUnconditionalContext
    interpreterEncoder.runForMultipleInputsOutputs(
        arrayOf<Any>(
            unconditionalContextInput,
            positionInput
        ), outputsUnconditionalContext
    )

Full code at this file.
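Assuming `intArrayOfPositions` in the snippet above holds the token positions 0..76, as in the keras-cv text encoder, it can be sketched as:

```python
# The encoder's second input is the token positions: simply 0..76,
# matching the fixed prompt length of 77 used by the keras-cv pipeline.
MAX_PROMPT_LENGTH = 77
position_ids = list(range(MAX_PROMPT_LENGTH))
print(position_ids[:5], position_ids[-1])  # [0, 1, 2, 3, 4] 76
```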

The above procedure gives us two arrays: one for the context, of shape [1, 77, 768], and a second for the unconditional context, of the same shape. These two arrays will be fed to the diffusion model along with other parameters such as the number of images to generate and the number of steps.

Conclusion
The Stable Diffusion pipeline has several steps that have to be followed in order to deploy it inside an Android application. This blog post covered the Tokenizer and the Encoder. To see the diffusion model in action and the final generation of the image, read Part 2 (to be published shortly).

Credits: Special thanks to Chansung Park and Sayak Paul for their hints and their help with the Stable Diffusion pipeline.

George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I am a senior Android developer at Invisalign, and an ML & Android GDE.