BDC-Adapter:
Brownian Distance Covariance for
Better Vision-Language Reasoning

1Harbin Institute of Technology    2Southern University of Science and Technology
3Carnegie Mellon University    4Pengcheng Laboratory
BMVC 2023


Abstract

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there has been a surge of interest among researchers in developing lightweight fine-tuning techniques to adapt these models to downstream visual tasks. We recognize that current state-of-the-art fine-tuning methods, such as Tip-Adapter, consider only the covariance between the query image feature and the features of the support few-shot training samples, which captures only linear relations and can create a deceptive perception of independence. To address this issue, in this work we introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning. The BDC metric can model all possible relations, providing a robust measure of feature dependence. Based on this, we present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks. Our extensive experimental results show that the proposed BDC-Adapter can freely handle non-linear relations and fully characterize independence, outperforming the current state-of-the-art methods by large margins.

Motivation

  • The current state-of-the-art Tip-Adapter method establishes a key-value cache model and evaluates the similarities between the query image feature and the features of the support few-shot training samples to perform classification.
  • However, we recognize that Tip-Adapter considers only the covariance between each pair of image features, which measures marginal distributions and captures only linear relations.
  • In this paper, we introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning to provide a robust metric for measuring feature dependence. While classical covariance can only capture linear relations, Brownian covariance can model all possible relations (see the computation sketch after this list).
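
The sketch below shows, under stated assumptions, how a BDC matrix can be computed and compared: pairwise Euclidean distances between the rows of a feature matrix are double-centered, and two BDC matrices are compared via their inner product. The reshaping of a feature vector into an (n, d) observation matrix and the helper names bdc_matrix and bdc_similarity are illustrative assumptions, not the paper's exact implementation.

    import torch

    def bdc_matrix(x: torch.Tensor) -> torch.Tensor:
        # x: (n, d) observation matrix, e.g. a CLIP feature vector
        # reshaped into n rows of d dimensions (an assumption here).
        a = torch.cdist(x, x, p=2)              # pairwise Euclidean distances
        row_mean = a.mean(dim=1, keepdim=True)  # per-row means
        col_mean = a.mean(dim=0, keepdim=True)  # per-column means
        grand_mean = a.mean()                   # overall mean
        # Double-centering yields the n x n BDC matrix.
        return a - row_mean - col_mean + grand_mean

    def bdc_similarity(b1: torch.Tensor, b2: torch.Tensor) -> torch.Tensor:
        # Dependence-aware similarity as the Frobenius inner product
        # of two BDC matrices.
        return (b1 * b2).sum()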


Method

  • Multi-Modal Few-Shot Learning. After feature extraction, we concatenate the image and text features and use these joint features to train a one-layer multi-modal reasoning network with a cross-entropy loss.
  • Class-Specific BDC Prototype Generation. Given the BDC matrices of all M images within class y, we define the prototype of class y as the average of these BDC matrices.
  • BDC-Adapter Inference. During inference, BDC-Adapter integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks (see the sketches after this list).
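
As a minimal sketch of the first step, assuming CLIP-style 512-dimensional features and a placeholder class count, the one-layer multi-modal reasoning network trained with cross-entropy can look like this; the exact dimensions and optimizer settings in the paper may differ.

    import torch
    import torch.nn as nn

    class MultiModalReasoningNet(nn.Module):
        # A single linear layer over concatenated image + text features.
        def __init__(self, feat_dim: int, num_classes: int):
            super().__init__()
            self.fc = nn.Linear(2 * feat_dim, num_classes)

        def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
            joint = torch.cat([img_feat, txt_feat], dim=-1)  # joint multi-modal features
            return self.fc(joint)                            # class logits

    net = MultiModalReasoningNet(feat_dim=512, num_classes=100)  # placeholder sizes
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(net.parameters(), lr=1e-3)     # assumed optimizer

    # One training step on dummy few-shot features and labels.
    img = torch.randn(4, 512)
    txt = torch.randn(4, 512)
    labels = torch.randint(0, 100, (4,))
    loss = criterion(net(img, txt), labels)
    loss.backward()
    optimizer.step()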
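
The prototype generation and fused inference steps can be sketched as follows; the flattened inner-product similarity and the fusion weight alpha are assumptions standing in for the paper's exact fusion rule.

    import torch

    def class_prototypes(bdc_mats: torch.Tensor, labels: torch.Tensor,
                         num_classes: int) -> torch.Tensor:
        # bdc_mats: (M_total, n, n) BDC matrices of all support images.
        # The prototype of class y is the mean of the matrices with label y.
        return torch.stack([bdc_mats[labels == y].mean(dim=0)
                            for y in range(num_classes)])

    def bdc_adapter_infer(test_bdc: torch.Tensor, protos: torch.Tensor,
                          net_logits: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Inner-product similarity between the test BDC matrix and each prototype.
        sim = (protos * test_bdc.unsqueeze(0)).flatten(1).sum(dim=1)  # (num_classes,)
        # Fuse with the reasoning-network prediction; alpha is a hypothetical weight.
        return net_logits + alpha * sim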


Experiments

Performance comparisons on few-shot learning on 11 datasets. For each dataset, we report the accuracy on 1-/2-/4-/8-/16-shot settings. The top-left subfigure shows the average accuracy over all 11 datasets.




Performance comparisons on robustness to natural distribution shifts. All the experiments are conducted with the ResNet-50 visual backbone. The best results are in bold and the second best are underlined.




Illustration of a few-shot learning instance from the Bongard-HOI benchmark. The left side shows positive images that depict the visual relationship of a person washing a dog. In contrast, negative examples do not exhibit such relationships. The right side shows query images, where the ground-truth labels are positive or negative, respectively.




Performance comparisons on the Bongard-HOI dataset. The last column shows the average accuracy. The best results are in bold and the second best are underlined.


Conclusion

In this work, we introduce Brownian Distance Covariance to the field of vision-language reasoning, which provides a more robust metric for measuring feature dependence and enables better generalization capability. Based on this, we present a novel method called BDC-Adapter, which takes advantage of the BDC metric in computing the similarities between the few-shot BDC prototypes and the BDC matrix of the test image. Meanwhile, BDC-Adapter introduces only a one-layer multi-modal reasoning network that learns from multi-modal few-shot instances to adapt VLMs to downstream tasks using limited training data. Our extensive experimental results demonstrate the effectiveness of our proposed BDC-Adapter method for fine-tuning VLMs. With its lightweight and parameter-efficient design, BDC-Adapter not only exhibits better vision-language reasoning capabilities but also has lower computational complexity, which makes it suitable for practical applications.

Video Presentation

Poster

BibTeX

@inproceedings{zhang2023bdc,
  title={BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning},
  author={Zhang, Yi and Zhang, Ce and Liao, Zihan and Tang, Yushun and He, Zhihai},
  booktitle={British Machine Vision Conference},
  year={2023}
}