Leveraging Pretrained Unimodal Models for Efficient Image-Text Retrieval
- Existing approaches to image-text retrieval often require large-scale models, extensive data, and substantial computational resources, limiting their accessibility for smaller research groups. We introduce LiteITR, an efficient self-supervised vision-language model that leverages pretrained unimodal encoders with contrastive learning and self-supervised knowledge distillation. While it does not reach state-of-the-art performance, our approach achieves reasonable retrieval quality with dramatically reduced resources, requiring only 3M image-text pairs and approximately $20 in training cost. These findings underscore the potential for designing efficient multimodal retrieval systems that are trainable by researchers with limited resources.
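For context, the training signal named in the abstract can be illustrated with a CLIP-style symmetric contrastive (InfoNCE) objective over features from frozen unimodal encoders, combined with a soft-target self-distillation term. The sketch below is a minimal, hypothetical PyTorch illustration: the feature dimensions, the `ContrastiveHead` projection module, and the KL-based distillation form are assumptions for illustration, not the paper's exact LiteITR architecture or objective.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveHead(nn.Module):
    """Projects unimodal features into a shared embedding space.

    In a setup like LiteITR's, the inputs would come from pretrained
    unimodal encoders; here they are random stand-ins.
    """

    def __init__(self, img_dim=768, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Learnable temperature, initialised to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt


def contrastive_loss(img, txt, logit_scale):
    """Symmetric InfoNCE over in-batch image-text pairs."""
    logits = logit_scale.exp() * img @ txt.t()  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def distillation_loss(student_logits, teacher_logits, tau=4.0):
    """Soft-target KL distillation (an illustrative form only)."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau


# Toy batch of precomputed unimodal features.
B = 8
head = ContrastiveHead()
img_feats = torch.randn(B, 768)  # e.g. ViT [CLS] features
txt_feats = torch.randn(B, 768)  # e.g. BERT [CLS] features
img, txt = head(img_feats, txt_feats)

# Teacher: a frozen copy of the head. A momentum (EMA) update is one
# common way to maintain such a teacher for self-distillation.
teacher = copy.deepcopy(head)
with torch.no_grad():
    t_img, t_txt = teacher(img_feats, txt_feats)

loss = (contrastive_loss(img, txt, head.logit_scale)
        + 0.5 * distillation_loss(img @ txt.t(), t_img @ t_txt.t()))
loss.backward()
```

Training only lightweight projection heads on top of frozen pretrained encoders, as in this sketch, is one plausible route to the low data and compute budget reported in the abstract.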
| Author: | Tim Cares, Adrian Pigors, Volker Ahlers |
|---|---|
| URN: | urn:nbn:de:bsz:960-opus4-38185 |
| DOI: | https://doi.org/10.25968/opus-3818 |
| DOI original: | https://doi.org/10.1109/IDAACS68557.2025.11322222 |
| ISBN: | 979-8-3315-8045-2 |
| ISSN: | 2770-4254 |
| Parent Title (English): | 2025 IEEE 13th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS) |
| Publisher: | IEEE |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of Completion: | 2025 |
| Publishing Institution: | Hochschule Hannover |
| Release Date: | 2026/01/16 |
| Tag: | efficiency; image-text retrieval; knowledge distillation; self-supervised learning; vision-language model |
| GND Keyword: | Freitextsuche; Information Retrieval |
| First Page: | 741 |
| Last Page: | 746 |
| Institutes: | Fakultät IV - Wirtschaft und Informatik; Data\|H - Institute for Applied Data Science Hannover |
| DDC classes: | 004 Computer science |
| Licence (German): | Urheberrechtlich geschützt (protected by copyright) |