Living on the Edge: On-Device Image Captioning

ml systems
ml
Lightweight image captioning model that runs on an iPhone
Author

Vibha Masti

Published

December 1, 2023

My introduction to the efficiency space was through a course I took at CMU called On-Device Machine Learning, co-taught by Prof. Emma Strubell and Prof. Yonatan Bisk. I enjoyed the course so much that I ended up TAing for it right after. I learned how to optimize models for inference on low-resource hardware through quantization, pruning, and distillation, and how to use parameter-efficient techniques for fine-tuning and post-training. For the class's final project, my teammate and I built a lightweight image captioning model that runs on an iPhone with low latency.

Living on the Edge: On-Device Image Captioning Model

Team: Vibha Masti, Eesha Shetty

Motivated by accessibility tools such as screen readers, our goal was to create an on-device, near real-time image scene description model that runs on hardware available to everyone. Running the model offline and on-device preserves privacy while keeping it lightweight with low inference latency. We deployed the model on an iPhone 13 Pro and achieved an inference latency of ~150 ms per image. We used Apple's CoreML [10] framework to convert and optimize our model for iOS deployment.


Figure 1: Living on the Edge logo generated by Gemini nano banana

Background

Image captioning is generally a two-step process: (1) extract image features with a vision encoder, and (2) generate text with a language decoder. For the longest time, CNNs [13, 14, 15, 16] remained the de facto backbone for downstream image tasks, until the introduction of the Vision Transformer (ViT) [1]. CLIP [2] introduced contrastive learning to jointly train and align image and text representations. BLIP [3], BLIP-2 [4], and CLIPCap [5] followed shortly after. LightCap [7] and MobileViT [8] are lightweight architectures designed for edge devices. Apple has also open-sourced a multi-task neural architecture for on-device scene analysis [9], built on an optimized MobileNetV3 [11] backbone. Camera2Caption [12] implemented a simple encoder-decoder model for low-power devices.

Method

We used a subset of the Flickr30k dataset [17, 18] containing 5000 training and 5000 validation samples to fine-tune BLIP, which we chose as our baseline model. We measured n-gram overlap using BLEU (Bilingual Evaluation Understudy) scores [19]. We tried both quantization and pruning to reduce inference latency and memory footprint, and we experimented with different input image sizes and downsampling techniques (crop, resize), comparing BLEU scores across model sizes (base, large). The figure below shows the effect of input size on model performance.


Figure 2: BLEU vs image input size for crop and resize downsampling techniques on BLIP.
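To make the evaluation loop concrete, here is a minimal sketch of generating captions with a BLIP checkpoint and scoring them with corpus BLEU. The Hugging Face checkpoint name, the nltk-based scoring, and the placeholder validation data are assumptions for illustration, not our exact pipeline.

```python
# Minimal sketch: caption images with a BLIP checkpoint and score with corpus BLEU.
# Checkpoint name, nltk scoring, and placeholder data are illustrative assumptions.
import torch
from PIL import Image
from nltk.translate.bleu_score import corpus_bleu
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

def caption(image_path: str) -> str:
    """Generate a single caption for one image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Placeholder validation data (hypothetical paths and reference captions).
val_image_paths = ["val/0001.jpg", "val/0002.jpg"]
val_references = [
    ["a dog runs across a grassy field", "a brown dog running on grass"],
    ["two people ride bicycles down a city street"],
]

hypotheses = [caption(p).split() for p in val_image_paths]
references = [[r.split() for r in refs] for refs in val_references]
print("Corpus BLEU:", corpus_bleu(references, hypotheses))
```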

We experimented with both post-training quantization and post-training pruning to see what memory and latency reductions we could achieve. We measured latency by running the CoreML .mlpackage model on an iPhone 13 Pro running iOS 17.1.1 using Xcode's latency calculator. We used the coremltools Python package to convert our PyTorch model to CoreML and apply optimizations.
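The conversion itself roughly follows coremltools' TorchScript tracing path. Here is a minimal sketch with a stand-in module (the real model was our fine-tuned BLIP wrapper) and an assumed input resolution:

```python
# Sketch of the PyTorch -> CoreML conversion path via TorchScript tracing.
# The stand-in module and the 384x384 input shape are placeholder assumptions.
import torch
import torch.nn as nn
import coremltools as ct

# Placeholder model standing in for the traced captioning network.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()

example_input = torch.rand(1, 3, 384, 384)
traced = torch.jit.trace(model, example_input)  # compile to a TorchScript graph

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("CaptionModel.mlpackage")  # CoreML Model Package, ready for on-device profiling
```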

Quantization

We converted the PyTorch model first to a compiled graph and then to a CoreML Model Package (.mlpackage). We applied post-training linear quantization to the weights (not activations), going from 32-bit float to 16-bit float and 8-bit int, and measured latency and on-disk model size.
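The weight-only int8 step can be sketched with the coremltools optimization API as below; the options shown are illustrative defaults rather than our exact configuration, and an fp16 variant is commonly produced at conversion time instead.

```python
# Sketch: post-training linear quantization of weights (not activations) with coremltools.
# Options shown are illustrative defaults, not necessarily our exact configuration.
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel_fp32 = ct.models.MLModel("CaptionModel.mlpackage")

# int8 weight-only linear quantization (the optimizer's default dtype is 8-bit).
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel_int8 = cto.linear_quantize_weights(mlmodel_fp32, config=config)
mlmodel_int8.save("CaptionModel_int8.mlpackage")

# An fp16 variant can be requested at conversion time instead, e.g.
#   ct.convert(traced, ..., compute_precision=ct.precision.FLOAT16)
```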

Comparison of Non-Quantized (fp32) and Quantized (fp16, int8) Models

| Measure | fp32 | fp16 | int8 |
| --- | --- | --- | --- |
| Inference Latency (ms) | 238.84 | 143.06 | 142.46 |
| Model Size (MB) | 989.9 | 448.3 | 226.2 |

Pruning

We performed iterative unstructured magnitude pruning on our fine-tuned model, removing 20% of the remaining weights at each iteration, and compared BLEU, disk size, and latency. We measured BLEU against the validation dataset by running on macOS and continued to measure latency on an iPhone 13 Pro. We see a significant drop in BLEU after 48.8% sparsity.
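A rough sketch of the pruning loop using `torch.nn.utils.prune` is below; the choice of prunable layer types and what happens between iterations (re-export, re-evaluation) is simplified, so treat it as an illustration of the technique rather than our exact script.

```python
# Sketch: iterative unstructured magnitude pruning with torch.nn.utils.prune.
# Pruning 20% of the *remaining* weights each iteration yields cumulative
# sparsities of 20%, 36%, 48.8%, 59.0%, 67.2% (i.e. 1 - 0.8**n).
import torch.nn as nn
import torch.nn.utils.prune as prune

def prunable_parameters(model):
    # Layer selection here is an assumption; the actual set of modules may differ.
    return [
        (module, "weight")
        for module in model.modules()
        if isinstance(module, (nn.Linear, nn.Conv2d))
    ]

def prune_step(model, amount=0.2):
    # Zero out the `amount` fraction of remaining weights with the smallest magnitude.
    prune.global_unstructured(
        prunable_parameters(model),
        pruning_method=prune.L1Unstructured,
        amount=amount,
    )

def finalize(model):
    # Fold the pruning masks into the weights so the model can be exported.
    for module, name in prunable_parameters(model):
        if prune.is_pruned(module):
            prune.remove(module, name)
```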

Performance Metrics at Different Sparsity Levels

| Iteration | Sparsity (%) | BLEU | Latency (ms) | Disk Size (MB) |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.241 | 238.84 | 989.92 |
| 1 | 20.0 | 0.236 | 400.37 | 823.21 |
| 2 | 36.0 | 0.172 | 318.12 | 665.55 |
| 3 | 48.8 | 0.072 | 321.07 | 539.44 |
| 4 | 59.0 | 0.000 | 301.12 | 438.01 |
| 5 | 67.2 | 0.000 | 323.74 | 357.29 |

We also ran a layer-based sensitivity analysis to compare the effects of pruning the embedding layers versus the convolution layers. Interestingly, increased sparsity in the embedding layers slows inference down, while increased sparsity in the convolutional layers speeds it up. Both, however, lead to an overall increase in latency relative to the unpruned model, likely due to the overhead of handling sparse matrices on mobile hardware.

Figure 3: Example captions generated at different sparsity levels: (a) 0%, (b) 20%, (c) 36%, (d) 48.8%, (e) 59%, (f) 67.2%.

Sensitivity Analysis of Pruning Embedding and Convolution Layers

| Pruned Layers | Sparsity (%) | BLEU | Latency (ms) | Disk Size (MB) |
| --- | --- | --- | --- | --- |
| Conv | 20.0 | 0.2406 | 281.04 | 972.83 |
| Conv | 40.0 | 0.2220 | 264.68 | 953.18 |
| Embedding | 20.0 | 0.2406 | 275.90 | 972.83 |
| Embedding | 40.0 | 0.2220 | 280.68 | 953.18 |
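The sensitivity runs above can be approximated with the same pruning utility, restricted to one module type at a time; again, this is a simplified sketch rather than our exact script.

```python
# Sketch: prune only one layer type (nn.Conv2d or nn.Embedding) to probe sensitivity.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_layer_type(model, layer_type, amount):
    params = [
        (module, "weight")
        for module in model.modules()
        if isinstance(module, layer_type)
    ]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)

# e.g. the four settings in the table above:
# prune_layer_type(model, nn.Conv2d, amount=0.20)
# prune_layer_type(model, nn.Conv2d, amount=0.40)
# prune_layer_type(model, nn.Embedding, amount=0.20)
# prune_layer_type(model, nn.Embedding, amount=0.40)
```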

Conclusion

The overall best model we found was the int8 quantized model with an inference latency of 142.46 ms and a disk size of 226.2 MB. We found that pruning did not help with latency, but quantization did. We also found that resizing the image to 348x348 led to the best BLEU score.

References

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

  2. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision

  3. Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  4. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  5. Mokady, R., Hertz, A., & Bermano, A. H. (2021). CLIPCap: CLIP Prefix for Image Captioning

  6. Radford, A., Wu, J., Child, R., et al. (2019). Language Models are Unsupervised Multitask Learners, OpenAI Blog

  7. Wang, N., Xie, J., Luo, H., et al. (2023). Efficient Image Captioning for Edge Devices

  8. Mehta, S., & Rastegari, M. (2021). MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

  9. Apple. (2022). A multi-task neural architecture for on-device scene analysis

  10. Apple. CoreML Documentation

  11. Howard, A., Sandler, M., Chu, G., et al. (2019). Searching for MobileNetV3

  12. Mathur, P., Gill, A., Yadav, A., et al. (2017). Camera2Caption: a real-time image caption generator

  13. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks

  14. Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition

  15. Szegedy, C., Liu, W., Jia, Y., et al. (2015). Going Deeper with Convolutions

  16. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition

  17. Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision.

  18. HuggingFace. Flickr30k Dataset

  19. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).