Highlight: Learning Visual Prompts for Vision-Language Models

{jana, suny, chrisr}@robots.ox.ac.uk
Visual Geometry Group, University of Oxford

Given an image collection and, optionally, image descriptions, we automatically learn a Highlight visual prompt to mark image regions. Highlight outperforms a red circle by 15% on average on the RefCOCO, RefCOCO+ and RefCOCOg datasets.

Abstract

Large-scale Vision-Language Models, such as CLIP, demonstrate impressive capabilities and have multiple applications, from text-to-image generation to zero-shot classification. Recent work has suggested that visual prompts, such as a red circle, can steer the vision encoder to the circled region. While such visual prompts have now been used in various applications, they might be model-specific and depend on the model having learned these behaviours from its training data. Discovering and evaluating suitable prompts by hand might not be feasible across different models, tasks, and datasets. In this paper, we propose Highlight, a method to learn a visual prompt that highlights a region in an image or refines a manually engineered visual prompt. Using our framework, we can learn to highlight in a supervised way using a dataset of text-image region pairs, or in an unsupervised way using synthetic captions or images only. Highlight outperforms other visual prompts, prompt learning approaches, and compute-intensive methods that use ensembles of multiple models and visual prompts.

Method

Highlight generates a visual prompt, which is then alpha-blended with object proposals in the image. To learn the visual prompt, we construct positive and negative pairs for each object using: (i) supervised text and bounding-box pairs (e.g. from RefCOCO), (ii) unsupervised text and bounding-box pairs obtained by captioning the bounding box, or (iii) a visual representation of the object, e.g. a crop of the bounding box as in the Figure, or the original image visually prompted with a red circle. Here, (ii) and (iii) are unsupervised in that they do not require manual text-image region annotations. The CLIP image and text encoders are kept frozen.
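The alpha-blending step described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `apply_highlight`, the tensor shapes, and the idea of a learnable per-pixel opacity map `alpha` are our assumptions about one plausible parameterisation.

```python
import torch
import torch.nn.functional as F

def apply_highlight(image, prompt, alpha, box):
    """Alpha-blend a learned visual prompt into a bounding-box region.

    image:  (3, H, W) float tensor in [0, 1]
    prompt: (3, h, w) learnable prompt appearance, resized to the box
    alpha:  (1, h, w) learnable opacity map in [0, 1]
    box:    (x1, y1, x2, y2) object proposal in pixel coordinates
    """
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    # Resize the learned prompt and opacity map to the proposal size.
    p = F.interpolate(prompt[None], size=(h, w), mode="bilinear",
                      align_corners=False)[0]
    a = F.interpolate(alpha[None], size=(h, w), mode="bilinear",
                      align_corners=False)[0].clamp(0, 1)
    # Blend only inside the proposal; the rest of the image is untouched.
    out = image.clone()
    out[:, y1:y2, x1:x2] = a * p + (1 - a) * image[:, y1:y2, x1:x2]
    return out
```

Because blending is differentiable, gradients from a loss on the CLIP embedding of the prompted image flow back into `prompt` and `alpha` while the encoders stay frozen.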

Results

Our supervised prompt outperforms prior works that use ensembles of several models and prompts, which require up to 6x more forward passes. Our unsupervised prompt outperforms all other single-prompt methods by 6.2% on average.

† as reported by Razei et al. ‡ as reported by Shtedritski et al.

Optimised Highlights

More Models

We train Highlight in three different modes: (i) unsupervised im2im, using image-image pairs only, (ii) unsupervised t2im, using synthetic text-image pairs, and (iii) supervised t2im, using ground-truth text-image pairs. Overall, we see that pretraining improves performance in the unsupervised image-image regime, and does not help or hurts performance when either ground-truth or unsupervised captions are used.
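All three modes pair the embedding of a prompted image with the embedding of its matched target (a caption for t2im, a crop or red-circle view for im2im). One natural objective for such pairs is a symmetric InfoNCE loss; the sketch below is our assumption about that objective, with illustrative names and shapes, where only the prompt parameters would receive gradients and CLIP stays frozen.

```python
import torch
import torch.nn.functional as F

def highlight_contrastive_loss(region_feats, target_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched pairs.

    region_feats: (B, D) embeddings of prompted images (frozen image encoder)
    target_feats: (B, D) embeddings of the matched targets: ground-truth or
                  synthetic captions (t2im), or crop/red-circle views (im2im)
    Row i of each tensor forms a positive pair; other rows act as negatives.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    target_feats = F.normalize(target_feats, dim=-1)
    logits = region_feats @ target_feats.t() / temperature  # (B, B) similarities
    labels = torch.arange(logits.size(0))
    # Average the region->target and target->region cross-entropies.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```

Swapping the target embeddings between caption features and visual features is what switches the loss between the t2im and im2im regimes without changing the objective itself.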

BibTeX

@misc{zeller2024highlight,
  title={Highlight: Learning Visual Prompts for Vision-Language Models},
  author={Jana Zeller and Aleksandar Shtedritski and Christian Rupprecht},
  year={2024},
  eprint={TODO},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}