Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

School of Computer Science, Carnegie Mellon University
NeurIPS 2024

Highlights

  • We propose dual prototype evolving (DPE), a novel test-time adaptation method for VLMs that progressively captures more accurate multi-modal representations for target classes during test time.
  • To promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes across modalities.
  • Experimental evaluations demonstrate that our DPE consistently outperforms current state-of-the-art methods across 15 diverse datasets while maintaining competitive computational efficiency.

Abstract

Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multiple modalities. Specifically, we create and evolve two sets of prototypes, textual and visual, to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
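To make the prototype-evolving mechanism concrete, here is a minimal PyTorch sketch of how two prototype sets could be maintained during testing. It is an illustrative reading of the abstract, not the official implementation: the class name DualPrototypes, the per-class priority queues of size queue_size, and the entropy-based admission rule are all assumptions introduced for exposition.

import heapq
import itertools
import torch
import torch.nn.functional as F

class DualPrototypes:
    """Illustrative sketch of dual prototype evolving (not the paper's code)."""

    def __init__(self, text_features, queue_size=3):
        # Textual prototypes start from the zero-shot CLIP class embeddings
        # and serve as the initial class representations.
        self.text_protos = F.normalize(text_features, dim=-1)   # [C, D]
        self.num_classes = self.text_protos.shape[0]
        # One bounded priority queue per class keeps the most confident
        # (lowest-entropy) visual features observed so far at test time.
        self.queues = [[] for _ in range(self.num_classes)]
        self.queue_size = queue_size
        self._tie = itertools.count()  # tie-breaker for heap ordering

    def visual_protos(self):
        protos = []
        for c, q in enumerate(self.queues):
            if q:
                feats = torch.stack([f for _, _, f in q])
                protos.append(F.normalize(feats.mean(0), dim=-1))
            else:
                # Before any evidence arrives, fall back to the textual prototype.
                protos.append(self.text_protos[c])
        return torch.stack(protos)                               # [C, D]

    def update(self, image_feat, probs):
        # Queue the sample under its predicted class; once the queue exceeds
        # its capacity, evict the least confident (highest-entropy) entry,
        # which sits at the heap root because entropies are negated.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        c = int(probs.argmax())
        heapq.heappush(self.queues[c], (-entropy, next(self._tie), image_feat.detach()))
        if len(self.queues[c]) > self.queue_size:
            heapq.heappop(self.queues[c])

    def predict(self, image_feat, temp=100.0):
        # Average the two prototype sets and score by similarity.
        logits = image_feat @ (self.text_protos + self.visual_protos()).t() / 2
        return (temp * logits).softmax(dim=-1)

In a test loop, one would call predict(...) on each incoming normalized CLIP image feature and then update(...) to admit the sample, so both prototype sets gradually absorb task-specific knowledge as more samples are processed.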

Motivation

  • Background: Current VLMs excel at general-purpose classification but often struggle in highly specialized domains due to distribution shift.
  • Goal: In this work, we propose a novel approach for test-time adaptation (TTA) that refines predictions using only unlabeled data.
  • Current Limitations of SOTA for unlabeled data: (1) Existing methods treat each test sample as independent and must restart from the original model for every sample, so no task-specific knowledge accumulates; (2) TTA could benefit from leveraging multiple modalities, yet existing methods typically adapt only a single one.
  • Our Approach: We propose Dual Prototype Evolving (DPE), a novel test-time VLM adaptation approach that effectively accumulates task-specific knowledge from multiple modalities; a sketch of its per-sample residual alignment follows below.
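The abstract describes learnable residuals that are optimized for each test sample to align the two prototype sets. Below is a hedged sketch of what such a step could look like; the objective (self-entropy plus a cosine alignment term weighted by a scale factor lam), the AdamW optimizer, and the one-step schedule are assumptions for illustration rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def align_with_residuals(image_feat, text_protos, visual_protos,
                         steps=1, lr=1e-3, temp=100.0, lam=0.5):
    # Learnable residuals, one per class prototype and per modality;
    # only these receive gradients, the prototypes themselves are fixed.
    r_t = torch.zeros_like(text_protos, requires_grad=True)
    r_v = torch.zeros_like(visual_protos, requires_grad=True)
    opt = torch.optim.AdamW([r_t, r_v], lr=lr)

    for _ in range(steps):
        p_t = F.normalize(text_protos + r_t, dim=-1)
        p_v = F.normalize(visual_protos + r_v, dim=-1)
        probs = (temp * image_feat @ ((p_t + p_v) / 2).t()).softmax(-1)
        # Self-entropy: encourage a confident prediction on this sample.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        # Alignment: pull each class's textual and visual prototypes together.
        align = (1 - (p_t * p_v).sum(-1)).mean()
        loss = entropy + lam * align
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Predict with the residual-adjusted, cross-modally aligned prototypes.
    with torch.no_grad():
        p_t = F.normalize(text_protos + r_t, dim=-1)
        p_v = F.normalize(visual_protos + r_v, dim=-1)
        return (temp * image_feat @ ((p_t + p_v) / 2).t()).softmax(-1)

Because only the residuals receive gradients, the underlying CLIP weights and the evolved prototypes stay intact; the residuals merely nudge the textual and visual prototypes toward a consistent, confident prediction for the current sample.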

Experimental Results

Results on Robustness to Natural Distribution Shifts
Table 1. Performance comparisons on robustness to natural distribution shifts. We present top-1 accuracy (%) results for all evaluated methods employing both ResNet-50 and ViT-B/16 visual backbones of CLIP. The best results are highlighted in bold.

Results on Cross-Dataset Generalization
Table 2. Performance comparisons on cross-dataset generalization. We also present top-1 accuracy (%) for all methods on both backbones of CLIP. The best results are highlighted in bold.

Efficiency Comparison
Table 3. Efficiency comparison on ImageNet. We report the testing time, the achieved accuracy, and the performance gains compared to zero-shot CLIP.

Ablation Studies
Figure 3. Ablation Studies. (Left) Sensitivity analysis of $\tau_t$ and $M$ on Caltech101; (Middle) Analysis of the performance contributions from various learnable parameter settings across three datasets; (Right) Performance on three datasets with varying scale factor $\lambda$.

Video Demonstration

Video demonstration coming soon.

BibTeX

@article{zhang2024dual,
  title={Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models},
  author={Zhang, Ce and Stepputtis, Simon and Sycara, Katia and Xie, Yaqi},
  journal={arXiv preprint arXiv:2410.12790},
  year={2024}
}