Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there has been a surge of interest in developing lightweight fine-tuning techniques to adapt these models to downstream visual tasks. We observe that current state-of-the-art fine-tuning methods, such as Tip-Adapter, simply consider the covariance between the query image feature and the features of the support few-shot training samples, which captures only linear relations and can give a misleading impression of independence. To address this issue, in this work, we introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning. The BDC metric can model all possible relations, providing a robust measure of feature dependence. Based on this, we present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks. Our extensive experimental results show that the proposed BDC-Adapter can freely handle non-linear relations and fully characterize independence, outperforming the current state-of-the-art methods by large margins.
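To make the BDC metric concrete, the sketch below computes a BDC matrix from a set of local feature vectors: build the pairwise Euclidean distance matrix between feature channels, then double-center it (subtract row and column means, add back the grand mean). This is a minimal NumPy illustration of the standard BDC computation, not the authors' implementation; the shapes and function name are our own assumptions.

```python
import numpy as np

def bdc_matrix(feats):
    """Compute a Brownian Distance Covariance (BDC) matrix.

    feats: array of shape (d, k), e.g. d channels, each described by
           k values (assumed layout for illustration).
    Returns the (d, d) double-centered pairwise-distance matrix.
    """
    # Pairwise squared Euclidean distances between the d channel vectors
    sq = np.sum(feats ** 2, axis=1)
    sq_dist = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    dist = np.sqrt(np.maximum(sq_dist, 0.0))  # clamp tiny negatives

    # Double centering: subtract row/column means, add back the grand mean
    row_mean = dist.mean(axis=1, keepdims=True)
    col_mean = dist.mean(axis=0, keepdims=True)
    grand_mean = dist.mean()
    return dist - row_mean - col_mean + grand_mean
```

After double centering, every row and column of the BDC matrix sums to zero, and the inner product of two such matrices yields the BDC-based similarity used for dependence measurement.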
In this work, we introduce Brownian Distance Covariance to the field of vision-language reasoning, which provides a more robust metric for measuring feature dependence and enables better generalization capability. Based on this, we present a novel method called BDC-Adapter, which leverages the BDC metric to compute the similarities between the few-shot BDC prototypes and the BDC matrix of the test image. Meanwhile, BDC-Adapter introduces only a one-layer multi-modal reasoning network that learns from multi-modal few-shot instances to adapt VLMs to downstream tasks using limited training data. Our extensive experimental results demonstrate the effectiveness of the proposed BDC-Adapter for fine-tuning VLMs. With its lightweight and parameter-efficient design, BDC-Adapter not only exhibits better vision-language reasoning capabilities but also has lower computational complexity, making it suitable for practical applications.
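The prediction pipeline described above can be sketched as follows: score the test image's BDC matrix against per-class prototype BDC matrices via an inner product, then fuse those similarities with the logits of the reasoning network. This is a hedged illustration under our own assumptions; the fusion weight `alpha`, the prototype construction (per-class mean of support BDC matrices), and the function names are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def bdc_similarity(A, B):
    """Inner product between two (double-centered) BDC matrices."""
    return float(np.sum(A * B))

def predict(test_bdc, class_prototypes, reasoning_logits, alpha=1.0):
    """Fuse BDC prototype similarities with reasoning-network logits.

    test_bdc:         (d, d) BDC matrix of the test image.
    class_prototypes: list of C per-class prototype BDC matrices
                      (assumed here: mean over support-set BDC matrices).
    reasoning_logits: (C,) logits from the one-layer multi-modal
                      reasoning network (pre-computed for this sketch).
    alpha:            hypothetical fusion weight.
    """
    sims = np.array([bdc_similarity(test_bdc, P) for P in class_prototypes])
    return reasoning_logits + alpha * sims
```

In this sketch, a larger `alpha` shifts the decision toward the few-shot BDC prototype evidence, while `alpha = 0` falls back to the reasoning network alone.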
@inproceedings{zhang2023bdc,
title={BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning},
author={Zhang, Yi and Zhang, Ce and Liao, Zihan and Tang, Yushun and He, Zhihai},
booktitle={British Machine Vision Conference},
year={2023}
}