Estimate depth in RGB images.
Written by George Soloupis, ML GDE.
This is a tutorial on how to convert a PyTorch model to TensorFlow Lite, deploy the model on an Android phone, and produce the input buffer for inference. Because the model expects an input of shape [1, 3, w, h] instead of the usual [1, w, h, 3], the generation of the ByteBuffer is different on this occasion.
General info about the depth estimation procedure.
The PyTorch model is taken from this paper and is a CNN-based single-image depth estimation model. CNN-based methods have recently demonstrated promising results for estimating depth from a single image. Due to the difficulty of collecting labeled datasets, earlier approaches often focus on specific visual domains such as indoor scenes or street views. While the accuracy of these approaches is not yet competitive with multi-view stereo algorithms, that research and the resulting model are particularly promising due to the availability of larger and more diverse training datasets from relative depth annotations, multi-view stereo, 3D movies and synthetic data. For cases where only a single color image is available, the authors obtain the depth estimate through a pre-trained depth estimation model. Removing the dependency on stereo or multiple images as input makes their method applicable to all existing photos.
Following along with the provided Colab notebook you can see the conversion to ONNX, then TensorFlow and finally TensorFlow Lite to obtain the model that is used inside the Android application. Inside the notebook you can observe all the pre- and post-processing of the images, so that an array is available to be used with the TensorFlow Lite Interpreter. As stated above, the PyTorch model expects the [1, 3, width, height] input format, and so does the final TensorFlow Lite model.
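To make the notebook's result concrete, here is a minimal sketch (not the project's original code; the model file name and helper name are assumptions) of loading the converted .tflite file in the Android app and checking that the input tensor really is channels-first:

```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.channels.FileChannel

// Hypothetical helper: load the converted model from assets and print its
// input shape, which for this model should look like [1, 3, width, height].
fun loadDepthInterpreter(context: Context, modelFile: String = "depth_estimation.tflite"): Interpreter {
    val fileDescriptor = context.assets.openFd(modelFile)
    val modelBuffer = FileInputStream(fileDescriptor.fileDescriptor).channel.map(
        FileChannel.MapMode.READ_ONLY,
        fileDescriptor.startOffset,
        fileDescriptor.declaredLength
    )
    val interpreter = Interpreter(modelBuffer)
    println(interpreter.getInputTensor(0).shape().contentToString()) // e.g. [1, 3, h, w]
    return interpreter
}
```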
The output of the model is an array of shape [1, 1, width, height]. This array is converted to a grayscale image, and on screen you can observe the input image alongside a grayscale one showing the depth estimation in various tones of gray. By selecting pixels whose values are above a certain threshold, we focus on the objects in the image that are closer to the camera. Those objects remain unchanged while the background is converted to B/W, blurred or turned sepia. Below you can see some mobile screenshots:


Explore the code
As stated above, the difference in this project is the [1, 3, w, h] input shape. So instead of creating the ByteBuffer as usual for a [1, w, h, 3] input shape, we have to use a different method. We can use a float array or a ByteBuffer as the input for our model (see the sketches after the list):
1. Float array
2. ByteBuffer
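For illustration, here is a minimal sketch of the float array approach (the function name bitmapToFloatArray and the division by 255 are assumptions, not the project's original code); the key point is that values are written in channels-first order, one full plane per channel:

```kotlin
import android.graphics.Bitmap
import android.graphics.Color

// Hypothetical helper: build a nested [1][3][height][width] array in
// channels-first order that the TensorFlow Lite Interpreter can consume.
fun bitmapToFloatArray(bitmap: Bitmap, width: Int, height: Int): Array<Array<Array<FloatArray>>> {
    val scaled = Bitmap.createScaledBitmap(bitmap, width, height, true)
    val pixels = IntArray(width * height)
    scaled.getPixels(pixels, 0, width, 0, 0, width, height)

    // input[batch][channel][y][x]
    val input = Array(1) { Array(3) { Array(height) { FloatArray(width) } } }
    for (y in 0 until height) {
        for (x in 0 until width) {
            val pixel = pixels[y * width + x]
            input[0][0][y][x] = Color.red(pixel) / 255.0f   // R plane
            input[0][1][y][x] = Color.green(pixel) / 255.0f // G plane
            input[0][2][y][x] = Color.blue(pixel) / 255.0f  // B plane
        }
    }
    return input
}
```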
The difference in inference time is about 300 ms, so it is better to use the bitmapToByteBuffer method to gain some speed!
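And a hedged sketch of what a bitmapToByteBuffer-style method could look like for a channels-first model (the normalization to [0, 1] is an assumption):

```kotlin
import android.graphics.Bitmap
import android.graphics.Color
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Sketch of a channels-first ByteBuffer: the whole red plane is written first,
// then the green plane, then the blue plane (4 bytes per float value).
fun bitmapToByteBuffer(bitmap: Bitmap, width: Int, height: Int): ByteBuffer {
    val scaled = Bitmap.createScaledBitmap(bitmap, width, height, true)
    val pixels = IntArray(width * height)
    scaled.getPixels(pixels, 0, width, 0, 0, width, height)

    val buffer = ByteBuffer.allocateDirect(4 * 3 * width * height)
        .order(ByteOrder.nativeOrder())
    for (channel in 0 until 3) {
        for (pixel in pixels) {
            val value = when (channel) {
                0 -> Color.red(pixel)
                1 -> Color.green(pixel)
                else -> Color.blue(pixel)
            }
            buffer.putFloat(value / 255.0f)
        }
    }
    buffer.rewind()
    return buffer
}
```

Either input can then be passed to Interpreter.run() together with a [1][1][height][width] float output array.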
The output of the model is an array of shape [1, 1, w, h]. This array has to be transformed into a grayscale image so it can be shown on screen. The code for this procedure is shown below:
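A minimal sketch of what this conversion could look like (the helper name depthToGrayscaleBitmap and the min/max scaling of the raw depth values are assumptions, not the project's original code):

```kotlin
import android.graphics.Bitmap
import android.graphics.Color

// Hypothetical helper: scale the [1, 1, h, w] output to 0..255 and build a
// grayscale Bitmap. Pixels above the threshold keep their gray value, the rest
// become transparent, giving the mask used later for the background effect.
fun depthToGrayscaleBitmap(
    output: Array<Array<Array<FloatArray>>>, // [1][1][height][width]
    threshold: Int = 150
): Bitmap {
    val depth = output[0][0]
    val height = depth.size
    val width = depth[0].size
    val values = depth.flatMap { it.asIterable() }
    val min = values.minOrNull() ?: 0f
    val range = ((values.maxOrNull() ?: 1f) - min).takeIf { it > 0f } ?: 1f

    val pixels = IntArray(width * height)
    for (y in 0 until height) {
        for (x in 0 until width) {
            // The same value for R, G and B gives a shade of gray.
            val gray = (((depth[y][x] - min) / range) * 255f).toInt().coerceIn(0, 255)
            pixels[y * width + x] = if (gray > threshold) {
                Color.argb(255, gray, gray, gray) // close objects stay visible
            } else {
                Color.argb(0, 0, 0, 0)            // far pixels become transparent
            }
        }
    }
    return Bitmap.createBitmap(pixels, width, height, Bitmap.Config.ARGB_8888)
}
```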
The above method is based on using the same value from the float array for each channel of the final bitmap, since it is a grayscale image. This method also produces a black/transparent image that is later used to bring the background effect on screen! By changing the pixel color threshold (here it is 150) you can choose which objects are shown unchanged on screen. You can choose any value between 0 (total black) and 255 (total white). With values closer to 255, only objects that are really close to the mobile device remain unchanged on screen.
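To give an idea of how such a mask might drive the background effect, here is a hedged sketch of one possible compositing step (the helper applyBackgroundEffect is hypothetical and only shows the B/W variant; the project may do this differently):

```kotlin
import android.graphics.*

// Hypothetical compositing: draw the whole photo desaturated, then use the
// depth mask to draw the close objects back on top in full color.
// The mask is assumed to be scaled to the original photo's size.
fun applyBackgroundEffect(original: Bitmap, foregroundMask: Bitmap): Bitmap {
    val result = Bitmap.createBitmap(original.width, original.height, Bitmap.Config.ARGB_8888)
    val canvas = Canvas(result)

    // 1. B/W background: a saturation of 0 removes all color.
    val bwPaint = Paint().apply {
        colorFilter = ColorMatrixColorFilter(ColorMatrix().apply { setSaturation(0f) })
    }
    canvas.drawBitmap(original, 0f, 0f, bwPaint)

    // 2. Foreground: keep the original colors only where the mask is opaque.
    val foreground = Bitmap.createBitmap(original.width, original.height, Bitmap.Config.ARGB_8888)
    val maskCanvas = Canvas(foreground)
    maskCanvas.drawBitmap(original, 0f, 0f, null)
    val keepWhereMaskIsOpaque = Paint().apply { xfermode = PorterDuffXfermode(PorterDuff.Mode.DST_IN) }
    maskCanvas.drawBitmap(foregroundMask, 0f, 0f, keepWhereMaskIsOpaque)

    canvas.drawBitmap(foreground, 0f, 0f, null)
    return result
}
```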
You can see the application in use below:
Project available here:
This project is written in Kotlin and uses:
- The TensorFlow Lite Interpreter
- Databinding
- MVVM with Coroutines
- Koin DI
Future scope for improvement:
- Try to improve inference time.
- Try to develop a custom TensorFlow Lite Support Library operator that can create a ByteBuffer for models that expect inputs of shape [1, 3, w, h].
This brings us to the end of the tutorial. I hope you have enjoyed reading it and will apply what you learned to your real-world applications with TensorFlow Lite.