Large-scale Vision-Language Models, such as CLIP, demonstrate impressive capabilities and have multiple applications, from text-to-image generation to zero-shot classification. Recent work has suggested that visual prompts, such as a red circle, can steer the vision encoder to the circled region. While such visual prompts have now been used in various applications, they might be model-specific and depend on the model learning these behaviours from its training data. Manually discovering and evaluating prompts for every combination of model, task, and dataset is not feasible. In this paper, we propose Highlight, a method to learn a visual prompt that highlights a region in an image or refines a manually engineered visual prompt. Using our framework, we can learn to highlight in a supervised way using a dataset of text-image region pairs, or in an unsupervised way using synthetic captions or images only. Highlight outperforms other visual prompts, prompt learning approaches, and compute-intensive methods that use ensembles of multiple models and visual prompts.
Highlight generates a visual prompt, which is then alpha-blended with object proposals in the image. To learn the visual prompt, we construct positive and negative pairs for each object using: (i) supervised text and bounding box pairs (e.g. from RefCOCO), (ii) unsupervised text and bounding box pairs obtained by captioning the bounding box, or (iii) a visual representation of the object, e.g. a crop of the bounding box (as in the figure) or the original image visually prompted with a red circle. Here, (ii) and (iii) are unsupervised in that they do not require manual text-image region annotations. The CLIP image and text encoders are kept frozen.
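The alpha-blending step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the learned prompt is an RGBA image that is resized to an axis-aligned box and composited over that region; the function and parameter names are hypothetical.

```python
import numpy as np

def alpha_blend_prompt(image, prompt_rgba, box):
    """Alpha-blend a (hypothetical) learned RGBA visual prompt onto a box region.

    image:       (H, W, 3) float array in [0, 1]
    prompt_rgba: (h, w, 4) float array in [0, 1]; last channel is alpha
    box:         (x0, y0, x1, y1) object proposal in pixel coordinates
    """
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    # Nearest-neighbour resize of the prompt to the box size (illustrative).
    ys = np.arange(h) * prompt_rgba.shape[0] // h
    xs = np.arange(w) * prompt_rgba.shape[1] // w
    p = prompt_rgba[ys][:, xs]
    rgb, alpha = p[..., :3], p[..., 3:]
    # Composite: alpha-weighted mix of prompt colour and original pixels.
    out = image.copy()
    out[y0:y1, x0:x1] = alpha * rgb + (1 - alpha) * out[y0:y1, x0:x1]
    return out
```

The prompted image would then be fed through the frozen CLIP image encoder, with the prompt's pixels optimised against the positive/negative pairs described above.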
Our supervised prompt outperforms prior works that use ensembles of several models and prompts, which require up to 6x more forward passes. Our unsupervised prompt outperforms all other single-prompt methods by 6.2% on average.
When training with our proposed image-image loss, Highlight can be used to optimise existing prompts. We see the biggest gains when using Crop, but the best performance when using Reverse Blur.
We compare training Highlight from scratch with pretraining it to initially output a red circle. We observe that the weaker the learning signal, i.e. the weaker the supervision, the more important it is to pretrain Highlight meaningfully.
We train with our proposed image-image loss using different pretraining shapes and see that not only is pretraining itself important, but the colour and shape of the pretrained prompt also affect the final result.
We train Highlight in three different modes: (i) unsupervised im2im, using image-image pairs only, (ii) unsupervised t2im, using synthetic text-image pairs, and (iii) supervised t2im, using ground truth text-image pairs. Overall, we see that pretraining improves performance in the unsupervised image-image regime, and does not help, or even hurts, performance when either ground-truth or unsupervised captions are used.
@misc{zeller2024highlight,
title={Highlight: Learning Visual Prompts for Vision-Language Models},
author={Jana Zeller and Aleksandar Shtedritski and Christian Rupprecht},
year={2024},
eprint={TODO},
archivePrefix={arXiv},
primaryClass={cs.LG}
}