Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

Highlights

We propose dual prototype evolving (DPE), a novel test-time adaptation method for VLMs that progressively captures more accurate multi-modal representations for target classes during test time.
To promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes across modalities.
Experimental evaluations demonstrate that our DPE consistently outperforms current state-of-the-art methods across 15 diverse datasets while maintaining competitive computational efficiency.

Abstract

Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes—textual and visual—to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
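To make the idea of accumulating knowledge across test samples concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `DualPrototypes`, the confidence threshold, the running-mean update rule, and the logit scale are all assumptions chosen for illustration. It keeps the textual prototypes fixed and omits the per-sample learnable residuals described above. It only shows how visual prototypes built from confident test samples can be combined with textual prototypes at prediction time.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class DualPrototypes:
    """Toy sketch (hypothetical, not the paper's code): fixed textual
    prototypes plus visual prototypes that evolve as a running mean of
    confidently predicted test features."""

    def __init__(self, text_protos, conf_threshold=0.5,
                 visual_weight=1.0, logit_scale=10.0):
        # One textual prototype per class, e.g. a text-encoder embedding.
        self.text = l2_normalize(np.asarray(text_protos, dtype=float))
        # Running sum of accepted image features per class.
        self.visual_sum = np.zeros_like(self.text)
        self.counts = np.zeros(len(self.text), dtype=int)
        self.conf_threshold = conf_threshold  # assumed hyperparameter
        self.visual_weight = visual_weight    # assumed hyperparameter
        self.logit_scale = logit_scale        # assumed hyperparameter

    def predict(self, feat):
        """Return class probabilities from both prototype sets."""
        feat = l2_normalize(np.asarray(feat, dtype=float))
        logits = self.logit_scale * (self.text @ feat)
        seen = self.counts > 0
        if seen.any():
            # Only classes that already have a visual prototype contribute.
            visual = l2_normalize(self.visual_sum[seen])
            logits[seen] += (self.visual_weight * self.logit_scale
                             * (visual @ feat))
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        return exp / exp.sum()

    def update(self, feat):
        """Predict, then evolve the visual prototype if the prediction
        is confident enough. No labels are used."""
        probs = self.predict(feat)
        label = int(np.argmax(probs))
        if probs[label] >= self.conf_threshold:
            self.visual_sum[label] += l2_normalize(np.asarray(feat, dtype=float))
            self.counts[label] += 1
        return label, probs
```

In this sketch each call to `update` leaves state behind for later samples, in contrast to methods that restart from the original model for every test sample. A fuller version would also evolve the textual prototypes and optimize residuals per sample, which is left out here.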

Motivation

Background: Current VLMs excel at general-purpose classification but often struggle in highly specialized domains because of distribution shift.
Goal: In this work, we propose a novel approach for test-time adaptation (TTA) that refines predictions using only unlabeled data.
Limitations of current state-of-the-art methods: (1) Existing methods treat each test sample independently and restart from the original model for every sample, so no task-specific knowledge carries over between samples; (2) they typically adapt only a single modality, although TTA could benefit from using multiple modalities.
Our Approach: We propose Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multiple modalities.

Experimental Results

Results on Robustness to Natural Distribution Shifts

**Table 1. Performance comparisons on robustness to natural distribution shifts.** We present top-1 accuracy (%) results for all evaluated methods employing both ResNet-50 and ViT-B/16 visual backbones of CLIP. The best results are highlighted in **bold**.

Results on Cross-Datasets Generalization

Efficiency Comparison

Ablation Studies

BibTeX

@article{zhang2024dual,
  title={Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models},
  author={Zhang, Ce and Stepputtis, Simon and Sycara, Katia and Xie, Yaqi},
  journal={arXiv preprint arXiv:2403.12964},
  year={2024}
}