Leveraging Pretrained Unimodal Models for Efficient Image-Text Retrieval

  • Existing approaches to image-text retrieval often require large-scale models, extensive data, and substantial computational resources, limiting their accessibility for smaller research groups. We introduce LiteITR, an efficient self-supervised vision-language model that leverages pretrained unimodal encoders with contrastive learning and self-supervised knowledge distillation. While not reaching state-of-the-art performance, our approach demonstrates reasonable performance on retrieval tasks with dramatically reduced resources, requiring only 3M image-text pairs and costing approximately $20 to train. These findings underscore the potential for designing efficient multimodal retrieval systems that are trainable by researchers with limited resources.
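The contrastive component mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common symmetric InfoNCE formulation over a batch of paired embeddings, uses a temperature of 0.07 as an illustrative default, and leaves out the knowledge-distillation part entirely. The function names (`contrastive_loss`, `retrieve`) are hypothetical, and the embeddings stand in for the outputs of pretrained unimodal image and text encoders.

```python
import math


def l2_normalize(vec):
    # Scale a non-zero vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images and texts is the positive (assumed setup)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]
    # Image-to-text direction: each image should rank its own caption highest.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def retrieve(query_emb, candidate_embs):
    # Rank candidate indices by cosine similarity to the query, best first.
    q = l2_normalize(query_emb)
    scores = [dot(q, l2_normalize(c)) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

In this sketch, a batch whose image and text embeddings are correctly paired yields a lower loss than the same batch with the captions shuffled, and `retrieve` returns candidate indices ordered from most to least similar, which is how retrieval with the trained encoders would typically be scored.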

Accepted Manuscript. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI of the original publication: https://doi.org/10.1109/IDAACS68557.2025.11322222

Metadata
Author: Tim Cares, Adrian Pigors, Volker Ahlers
URN: urn:nbn:de:bsz:960-opus4-38185
DOI: https://doi.org/10.25968/opus-3818
DOI original: https://doi.org/10.1109/IDAACS68557.2025.11322222
ISBN: 979-8-3315-8045-2
ISSN: 2770-4254
Parent Title (English): 2025 IEEE 13th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS)
Publisher: IEEE
Document Type: Conference Proceeding
Language: English
Year of Completion: 2025
Publishing Institution: Hochschule Hannover
Release Date: 2026/01/16
Tag: efficiency; image-text retrieval; knowledge distillation; self-supervised learning; vision-language model
GND Keyword: Freitextsuche (free-text search); Information Retrieval
First Page: 741
Last Page: 746
Institutes: Fakultät IV - Wirtschaft und Informatik; Data|H - Institute for Applied Data Science Hannover
DDC classes: 004 Computer science
Licence (German): Urheberrechtlich geschützt (protected by copyright)