Ce Zhang | Carnegie Mellon University

About Me

Hi there! I’m Ce Zhang. I am currently a third-year PhD candidate in the Robotics Institute at Carnegie Mellon University (CMU), and I expect to graduate in 2028.

Ph.D. in Robotics

Carnegie Mellon University

🗓️ 2025 – 2028 (expected)

🤝 Advisor: Prof. Katia Sycara

M.Sc. in Machine Learning

Carnegie Mellon University

🗓️ 2023 – 2024

🤝 Advisor: Prof. Katia Sycara

B.Eng. in Communication Engineering

SUSTech

🗓️ 2019 – 2023

🤝 Advisor: Prof. Zhihai He

Feel free to reach out if you're interested in my work or would like to discuss potential collaborations!

Research Interests

I build multi-modal AI systems that are efficient and reliable enough for real-world use.

Currently working on

My research focuses on two questions:

How can these models seamlessly interact with humans and their environments—advancing capabilities in long-form/streaming video understanding and spatial reasoning?
How can we responsibly deploy these models while addressing real-world concerns—mitigating misinformation and hallucinations, and promoting efficiency and safety?

News

Jun 2026 Our paper “LENS: Adaptive Spatio-Temporal Zooming for Keyframe Sampling in Long-Form Videos” is accepted to ECCV 2026.
May 2026 I joined TikTok as a research scientist intern, working on streaming video understanding.
Feb 2026 Our paper “Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory” is accepted to CVPR 2026.
Jan 2026 Our paper “pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning” is accepted to ICLR 2026.


Jan 2026 Our paper "VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models" is accepted to TMLR.
Mar 2025 I will be joining the Ph.D. in Robotics program at Carnegie Mellon University (CMU) in Fall 2025.
Jan 2025 Our paper "Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models" is accepted to ICLR 2025.

Publications

2026 (5 papers)
ECCV 2026 | LENS: Adaptive Spatio-Temporal Zooming for Keyframe Sampling in Long-Form Videos
ACL 2026 (Main) | WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models
CVPR 2026 | Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
ICLR 2026 | pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
TMLR 2026 | VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

2025 (5 papers)
ICCV 2025 | ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
ICIP 2025 | Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation
ACL 2025 (Main) | InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
ICLR 2025 | Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models
WACV 2025 | Enhancing Vision-Language Few-Shot Adaptation with Negative Learning

2024 (4 papers)
NeurIPS 2024 | Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
CVPR 2024 | HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
AAAI 2024 | Concept-Guided Prompt Learning for Generalization in Vision-Language Models
WACV 2024 | Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation

2023 (4 papers)
BMVC 2023 | BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning
CVPR 2023 | Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation
CVPR 2023 | Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation
IEEE TAI | Critical Sampling for Robust Evolution Operator Learning of Unknown Dynamical Systems

2026

ECCV 2026
LENS: Adaptive Spatio-Temporal Zooming for Keyframe Sampling in Long-Form Videos

Ce Zhang, Jinxi He, Katia Sycara, Yaqi Xie

European Conference on Computer Vision (ECCV), 2026.

Malmö, Sweden, September 8–12, 2026


Despite rapid progress in Multi-modal Large Language Models (MLLMs), understanding long-form videos is still bottlenecked by limited context windows. While recent keyframe sampling methods attempt to mitigate this by distilling video inputs into a compact set of query-relevant frames, navigating the vast spatio-temporal search space remains challenging, as spatial detail and temporal coverage often conflict. To address this, we introduce LENS, a training-free keyframe sampling framework that dynamically decides when to zoom in for fine-grained details and when to zoom out for broader context based on the text query. Concretely, LENS adaptively allocates a limited frame budget between spatial zoom-ins, which highlight query-relevant regions within individual frames, and temporal zoom-outs, which expand the temporal scope through multi-frame aggregation, enabling the model to reason across multiple granularities while capturing both high-fidelity details and long-range context. Across diverse long-form video benchmarks, LENS consistently outperforms prior state-of-the-art keyframe sampling methods and delivers substantial gains over uniform sampling, improving Video-MME accuracy from 53.3% to 60.7% with Qwen2.5-VL. Code is available at https://github.com/zhangce01/LENS.
```
@inproceedings{zhang2026lens,
    title     = {LENS: Adaptive Spatio-Temporal Zooming for Keyframe Sampling in Long-Form Videos},
    author    = {Zhang, Ce and He, Jinxi and Sycara, Katia and Xie, Yaqi},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year      = {2026}
}
```
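For context on the two simpler strategies the abstract compares against, here is a minimal Python sketch of uniform sampling and of plain top-k keyframe selection under a fixed frame budget. It is not the LENS algorithm (no spatial zoom-in or temporal zoom-out is modeled); the function names and the relevance scores are hypothetical and only meant to illustrate what "a limited frame budget" means in practice.

```python
def uniform_sample(num_frames, budget):
    """Pick `budget` evenly spaced frame indices (centers of equal segments)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def topk_keyframes(relevance, budget):
    """Pick the `budget` frames with the highest query-relevance score.

    `relevance` is a hypothetical list of per-frame scores (for example,
    text-frame similarity). Indices are returned in temporal order.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical example: a 10-frame clip and a budget of 2 frames.
print(uniform_sample(10, 2))
print(topk_keyframes([0.1, 0.9, 0.2, 0.8, 0.3], 2))
```

Uniform sampling ignores the query entirely, while top-k selection follows the relevance scores but may cluster frames in one part of the video; this is the kind of tension between detail and coverage that the abstract refers to.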

ACL 2026 (Main)
WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models

Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong

Annual Meeting of the Association for Computational Linguistics (ACL), 2026.

San Diego, CA, US, July 2-7, 2026


Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.
```
@inproceedings{wang2026webaggregator,
    title     = {WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models},
    author    = {Rui Wang and Ce Zhang and Jun-Yu Ma and Jianshu Zhang and Hongru Wang and Yi Chen and Boyang Xue and Tianqing Fang and Zhisong Zhang and Hongming Zhang and Haitao Mi and Dong Yu and Kam-Fai Wong},
    booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
    pages     = {24486--24517},
    year      = {2026}
}
```

CVPR 2026
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang*, Jinxi He*, Junyi He, Katia Sycara, Yaqi Xie (*Equal contribution)

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

Also at ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving.

Denver, CO, US, June 3-7, 2026


Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs.
```
@inproceedings{zhang2026evolving,
    title     = {Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory},
    author    = {Zhang, Ce and He, Jinxi and He, Junyi and Sycara, Katia and Xie, Yaqi},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {41182--41192},
    year      = {2026}
}
```
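The memory-bank idea in the abstract (accumulate insights, retrieve the relevant ones, integrate them into the current prompt) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the EchoSafe implementation: embeddings are plain float lists, retrieval uses cosine similarity, and the names `MemoryBank` and `build_prompt` as well as the prompt wording are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length float lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Stores (embedding, insight) pairs gathered from earlier interactions."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, insight):
        self.entries.append((embedding, insight))

    def retrieve(self, query_embedding, k=2):
        """Return the k insights whose embeddings are closest to the query."""
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_embedding, entry[0]),
            reverse=True,
        )
        return [insight for _, insight in ranked[:k]]


def build_prompt(user_query, insights):
    """Prepend retrieved insights to the user query (hypothetical format)."""
    if not insights:
        return user_query
    notes = "\n".join(f"- {text}" for text in insights)
    return f"Safety notes from earlier interactions:\n{notes}\n\nUser request: {user_query}"
```

Because nothing is trained, the only state that changes over time in this sketch is the list of stored entries, which is one way to read "training-free" together with "continual evolution during inference".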

ICLR 2026
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

Zhanpeng Luo*, Ce Zhang*, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie (*Equal contribution)

International Conference on Learning Representations (ICLR), 2026.

Rio de Janeiro, Brazil, April 23-27, 2026


Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
```
@inproceedings{luo2026pyspatial,
    title     = {pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning},
    author    = {Zhanpeng Luo and Ce Zhang and Silong Yong and Cunxi Dai and Qianwei Wang and Haoxi Ran and Guanya Shi and Katia P. Sycara and Yaqi Xie},
    booktitle = {The Fourteenth International Conference on Learning Representations},
    year      = {2026},
    url       = {https://openreview.net/forum?id=yv15C8ql24}
}
```

TMLR 2026
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Haitao Mi, Dong Yu

Transactions on Machine Learning Research (TMLR), 2026.

Also at ICML 2025 Workshop on Efficient Systems for Foundation Models (ES-FoMo III).


Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
```
@article{zhang2026vscan,
    title   = {{VS}can: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models},
    author  = {Ce Zhang and Kaixin Ma and Tianqing Fang and Wenhao Yu and Hongming Zhang and Zhisong Zhang and Haitao Mi and Dong Yu},
    journal = {Transactions on Machine Learning Research},
    issn    = {2835-8856},
    year    = {2026},
    url     = {https://openreview.net/forum?id=KZYhyilFnt}
}
```
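As a toy illustration of what pruning unimportant visual tokens means, the sketch below keeps a fixed fraction of tokens ranked by an importance score and preserves their original order. It is a generic top-k pruning step, not VScan's two-stage procedure (there is no global or local scan and no token merging); `prune_tokens`, the scores, and the keep ratio are hypothetical.

```python
import math


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, in their original order.

    `scores` is a hypothetical per-token importance value, for example an
    attention weight. Returns the kept tokens and their original indices.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices], kept_indices


# Hypothetical example: four visual tokens, keep half of them.
kept, indices = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.7], 0.5)
print(kept, indices)
```

Where in the model such a step is applied (encoder output, early layers, or intermediate layers) is exactly the design choice the abstract revisits.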

2025

ICCV 2025
ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Zifu Wan*, Ce Zhang*, Silong Yong, Martin Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie (*Equal contribution)

IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

Honolulu, HI, US, October 19–23, 2025


Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our ONLY approach consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
```
@inproceedings{wan2025only,
    title     = {ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models},
    author    = {Wan, Zifu and Zhang, Ce and Yong, Silong and Ma, Martin Q. and Stepputtis, Simon and Morency, Louis-Philippe and Ramanan, Deva and Sycara, Katia and Xie, Yaqi},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
    pages     = {3225--3234},
    year      = {2025}
}
```
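The contrastive decoding baselines mentioned in the abstract compare the logits of the original model with those of a perturbed version. The sketch below shows one common formulation, assumed here for illustration: adjusted = (1 + alpha) * original - alpha * perturbed. It is not the ONLY method (which needs a single query); the logit values, `alpha`, and the function names are hypothetical.

```python
def contrastive_logits(original, perturbed, alpha=1.0):
    """Combine two logit vectors: amplify what the original pass supports
    and penalize what the perturbed pass also supports."""
    if len(original) != len(perturbed):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def greedy_token(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits over a 3-token vocabulary.
original = [2.0, 1.9, 0.1]   # pass with the real image
perturbed = [2.0, 0.5, 0.1]  # pass with a distorted image

print(greedy_token(original))
print(greedy_token(contrastive_logits(original, perturbed, alpha=1.0)))
```

Note that the two input vectors come from two separate forward passes, which is the extra cost the abstract points to when it says these methods require two or more queries.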

ICIP 2025
Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation

Ce Zhang, Zifu Wan, Simon Stepputtis, Katia Sycara, Yaqi Xie

International Conference on Image Processing (ICIP), 2025.

Anchorage, AK, US, September 14–17, 2025

Selected for a Lecture presentation.


Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
```
@inproceedings{zhang2025spectral,
    title     = {Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation},
    author    = {Zhang, Ce and Wan, Zifu and Stepputtis, Simon and Sycara, Katia and Xie, Yaqi},
    booktitle = {IEEE International Conference on Image Processing (ICIP)},
    pages     = {43--48},
    publisher = {IEEE},
    year      = {2025}
}
```

ACL 2025 (Main)
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara

Annual Meeting of the Association for Computational Linguistics (ACL), 2025.

Also at AAAI 2024 Workshop on Public Sector LLMs: Algorithmic and Sociotechnical Design.

Vienna, Austria, July 27 - August 1, 2025


Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields.
```
@inproceedings{wan2025instructpart,
    title     = {InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning},
    author    = {Wan, Zifu and Xie, Yaqi and Zhang, Ce and Lin, Zhiqiu and Wang, Zihan and Stepputtis, Simon and Ramanan, Deva and Sycara, Katia P.},
    booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month     = {July},
    year      = {2025},
    address   = {Vienna, Austria},
    publisher = {Association for Computational Linguistics},
    pages     = {24202--24227},
    doi       = {10.18653/v1/2025.acl-long.1179}
}
```

ICLR 2025
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

Ce Zhang*, Zifu Wan*, Zhehan Kan, Martin Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie (*Equal contribution)

International Conference on Learning Representations (ICLR), 2025.

Also at NeurIPS 2024 Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models.

Singapore, April 24-28, 2025


While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
```
@inproceedings{zhang2025self,
    title     = {Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models},
    author    = {Ce Zhang and Zifu Wan and Zhehan Kan and Martin Q. Ma and Simon Stepputtis and Deva Ramanan and Russ Salakhutdinov and Louis-Philippe Morency and Katia P. Sycara and Yaqi Xie},
    booktitle = {The Thirteenth International Conference on Learning Representations},
    year      = {2025},
    url       = {https://openreview.net/forum?id=tTBXePRKSx}
}
```

WACV 2025
Enhancing Vision-Language Few-Shot Adaptation with Negative Learning

Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

Also at ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.

Tucson, AZ, US, February 28 - March 4, 2025


Large-scale pre-trained Vision-Language Models (VLMs) have exhibited impressive zero-shot performance and transferability, allowing them to adapt to downstream tasks in a data-efficient manner. However, when only a few labeled samples are available, adapting VLMs to distinguish subtle differences between similar classes in specific downstream tasks remains challenging. In this work, we propose a Simple yet effective Negative Learning approach, SimNL, to more efficiently exploit the task-specific knowledge from few-shot labeled samples. Unlike previous methods that focus on identifying a set of representative positive features defining "what is a {CLASS}", SimNL discovers a complementary set of negative features that define "what is not a {CLASS}", providing additional insights that supplement the positive features to enhance task-specific recognition capability. Further, we identify that current adaptation approaches are particularly vulnerable to potential noise in the few-shot sample set. To mitigate this issue, we introduce a plug-and-play few-shot instance reweighting technique to suppress noisy outliers and amplify clean samples for more stable adaptation. Our extensive experimental results across 15 datasets validate that the proposed SimNL outperforms existing state-of-the-art methods on both few-shot learning and domain generalization tasks while achieving competitive computational efficiency. Code is available at https://github.com/zhangce01/SimNL.
```
@inproceedings{zhang2025enhancing,
    author    = {Ce Zhang and Simon Stepputtis and Katia P. Sycara and Yaqi Xie},
    title     = {Enhancing Vision-Language Few-Shot Adaptation with Negative Learning},
    booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    pages     = {5905--5915},
    publisher = {IEEE},
    year      = {2025},
    doi       = {10.1109/WACV61041.2025.00576}
}
```

2024

NeurIPS 2024
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie

Conference on Neural Information Processing Systems (NeurIPS), 2024.

Also at ICML 2024 Workshop on Foundation Models in the Wild.

Vancouver, Canada, December 10-15, 2024


Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes—textual and visual—to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
```
@inproceedings{zhang2024dual,
    title     = {Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models},
    author    = {Ce Zhang and Simon Stepputtis and Katia P. Sycara and Yaqi Xie},
    booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year      = {2024},
    url       = {https://openreview.net/forum?id=jsgYYXaSiS}
}
```
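To make "accumulating task-specific knowledge as more samples are processed" concrete, here is a minimal sketch of a per-class prototype kept as a running mean of the features seen so far. The running-mean update is an assumption chosen for simplicity; the abstract does not specify DPE's update rule, and this sketch omits the textual prototypes and the learnable residuals. `PrototypeBank` and the feature values are hypothetical.

```python
class PrototypeBank:
    """Keeps one prototype per class as the mean of all features assigned to it."""

    def __init__(self):
        self._sums = {}    # class label -> element-wise sum of features
        self._counts = {}  # class label -> number of features seen

    def update(self, label, feature):
        """Fold a new test-time feature into the prototype of `label`."""
        if label not in self._sums:
            self._sums[label] = [0.0] * len(feature)
            self._counts[label] = 0
        if len(feature) != len(self._sums[label]):
            raise ValueError("feature dimension mismatch")
        self._sums[label] = [s + f for s, f in zip(self._sums[label], feature)]
        self._counts[label] += 1

    def prototype(self, label):
        """Current mean feature for `label`."""
        count = self._counts[label]
        return [s / count for s in self._sums[label]]


# Hypothetical 2-D features pseudo-labeled as "cat" during test time.
bank = PrototypeBank()
bank.update("cat", [1.0, 0.0])
bank.update("cat", [0.0, 1.0])
print(bank.prototype("cat"))
```

The point of the sketch is only that the prototype changes with every processed sample, in contrast to methods that adapt to each test sample independently.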

CVPR 2024
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

Ce Zhang, Simon Stepputtis, Joseph Campbell, Katia Sycara, Yaqi Xie

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Also at NeurIPS 2023 New Frontiers in Graph Learning Workshop.

Seattle, WA, US, June 17-21, 2024


Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, and smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such challenging settings. At its core, HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at https://github.com/zhangce01/HiKER-SGG.
```
@inproceedings{zhang2024hiker,
    title     = {HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation},
    author    = {Zhang, Ce and Stepputtis, Simon and Campbell, Joseph and Sycara, Katia and Xie, Yaqi},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {28233--28243},
    year      = {2024}
}
```

AAAI 2024
Concept-Guided Prompt Learning for Generalization in Vision-Language Models

Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He

AAAI Conference on Artificial Intelligence (AAAI), 2024.

Vancouver, Canada, February 22-25, 2024


The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications through fine-tuning. However, for generalization tasks, the current fine-tuning methods for CLIP, such as CoOp and CoCoOp, demonstrate relatively low performance on some fine-grained datasets. We recognize the underlying reason is that these previous methods only projected global features into the prompt, neglecting the various visual concepts, such as colors, shapes, and sizes, which are naturally transferable across domains and play a crucial role in generalization tasks. To address this issue, in this work, we propose Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge of CLIP to create a visual concept cache to enable concept-guided prompting. In order to refine the text features, we further develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided prompt learning approach is able to achieve enhanced consistency between visual and linguistic modalities. Extensive experimental results demonstrate that our CPL method significantly improves generalization capabilities compared to the current state-of-the-art methods.
```
@inproceedings{zhang2024concept,
    author    = {Yi Zhang and Ce Zhang and Ke Yu and Yushun Tang and Zhihai He},
    title     = {Concept-Guided Prompt Learning for Generalization in Vision-Language Models},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
    volume    = {38},
    number    = {7},
    pages     = {7377--7386},
    year      = {2024},
    doi       = {10.1609/aaai.v38i7.28568}
}
```

WACV 2024
Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation

Xueting Hu, Ce Zhang, Yi Zhang, Bowen Hai, Ke Yu, Zhihai He

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.

Waikoloa, HI, US, January 4-8, 2024

Pre-trained Visual-Language Models (VLMs), such as CLIP, have shown enhanced performance across a range of tasks that involve the integration of visual and linguistic elements. When CLIP is used for depth estimation tasks, the patches, divided from the input images, can be combined with a series of semantic descriptions of the depth information to obtain similarity results. The coarse estimation of depth is then achieved by weighting and summing the depth values, called depth bins, corresponding to the predefined semantic descriptions. The zero-shot approach circumvents the computational and time-intensive nature of traditional fully-supervised depth estimation methods. However, this method, utilizing fixed depth bins, may not effectively generalize as images from different scenes may exhibit distinct depth distributions. To address this challenge, we propose a few-shot method that learns to adapt VLMs for monocular depth estimation, balancing training costs and generalization capabilities. Specifically, it assigns different depth bins to different scenes, which can be selected by the model during inference. Additionally, we incorporate learnable prompts to preprocess the input text, converting human-readable text into vectors the model can readily use, and further enhance the performance. With only one image per scene for training, our extensive experimental results on the NYU V2 dataset demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
```
@inproceedings{hu2024learning,
    author    = {Xueting Hu and Ce Zhang and Yi Zhang and Bowen Hai and Ke Yu and Zhihai He},
    title     = {Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation},
    booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    pages     = {5582--5591},
    publisher = {IEEE},
    year      = {2024},
    doi       = {10.1109/WACV57701.2024.00550}
}
```
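
A minimal sketch of the weighting-and-summing step described in the abstract above, not the paper's implementation. The softmax weighting, the temperature parameter, the scene names, and all bin and score values are assumptions chosen for illustration; in the actual method the similarities would come from CLIP patch and text features.

```python
import math

# Hypothetical per-scene depth bins in metres (illustrative values only).
SCENE_BINS = {
    "kitchen": [1.0, 2.0, 3.0],
    "corridor": [2.0, 5.0, 8.0],
}

def softmax(scores, temperature=1.0):
    """Turn similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_depth(similarities, depth_bins, temperature=1.0):
    """Weight each depth bin by its similarity score and sum the result."""
    if len(similarities) != len(depth_bins):
        raise ValueError("need one similarity score per depth bin")
    weights = softmax(similarities, temperature)
    return sum(w * d for w, d in zip(weights, depth_bins))

# One image patch scored against three depth descriptions (made-up scores).
patch_scores = [0.2, 1.5, 0.1]
kitchen_depth = coarse_depth(patch_scores, SCENE_BINS["kitchen"])
corridor_depth = coarse_depth(patch_scores, SCENE_BINS["corridor"])
```

The same similarity scores map to different depths depending on which bins are used, which is why a single fixed set of bins may not suit every scene.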

2023

BMVC 2023
BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning

Yi Zhang*, Ce Zhang*, Zihan Liao, Yushun Tang, Zhihai He (*Equal contribution)

British Machine Vision Conference (BMVC), 2023.

Aberdeen, UK, November 20-24, 2023

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there has been a surge of interest among researchers in developing lightweight fine-tuning techniques to adapt these models to downstream visual tasks. We recognize that current state-of-the-art fine-tuning methods, such as Tip-Adapter, simply consider the covariance between the query image feature and features of support few-shot training samples, which only captures linear relations and potentially instigates a deceptive perception of independence. To address this issue, in this work, we innovatively introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning. The BDC metric can model all possible relations, providing a robust metric for measuring feature dependence. Based on this, we present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks. Our extensive experimental results show that the proposed BDC-Adapter can freely handle non-linear relations and fully characterize independence, outperforming the current state-of-the-art methods by large margins.
```
@inproceedings{zhang2023bdc,
    author    = {Yi Zhang and Ce Zhang and Zihan Liao and Yushun Tang and Zhihai He},
    title     = {BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning},
    booktitle = {34th British Machine Vision Conference (BMVC)},
    publisher = {BMVA},
    year      = {2023}
}
```
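
To illustrate the point in the abstract above that plain covariance only captures linear relations, here is a small sketch of the textbook sample distance covariance (the statistic underlying Brownian distance covariance). It is not the BDC-Adapter code, which operates on image and text features; the function names and the toy data are assumptions for illustration.

```python
import math

def centered_distances(points):
    """Pairwise Euclidean distances, double-centered by row, column, and grand means."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def distance_covariance_sq(xs, ys):
    """Squared sample distance covariance between two paired samples."""
    n = len(xs)
    a = centered_distances(xs)
    b = centered_distances(ys)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

def linear_covariance(xs, ys):
    """Ordinary (population) covariance for one-dimensional samples."""
    n = len(xs)
    mx = sum(x[0] for x in xs) / n
    my = sum(y[0] for y in ys) / n
    return sum((x[0] - mx) * (y[0] - my) for x, y in zip(xs, ys)) / n

# Toy data: y = x^2 is fully determined by x, yet linearly uncorrelated with it.
xs = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
ys = [(x[0] ** 2,) for x in xs]
lin = linear_covariance(xs, ys)
dcov = distance_covariance_sq(xs, ys)
```

On this toy data the ordinary covariance vanishes while the distance covariance stays clearly positive, so the non-linear dependence is detected.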

CVPR 2023
Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation

Yushun Tang, Ce Zhang, Heng Xu, Shuoshuo Chen, Jie Cheng, Luziwei Leng, Qinghai Guo, Zhihai He

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Vancouver, Canada, June 18-22, 2023

Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. We take inspiration from biologically plausible learning, where neuron responses are tuned based on a local synapse-change procedure and activated by competitive lateral inhibition rules. Based on these feed-forward learning rules, we design a soft Hebbian learning process which provides an unsupervised and effective mechanism for online adaptation. We observe that the performance of this feed-forward Hebbian learning for fully test-time adaptation can be significantly improved by incorporating a feedback neuro-modulation layer. It is able to fine-tune the neuron responses based on the external feedback generated by the error back-propagation from the top inference layers. This leads to our proposed neuro-modulated Hebbian learning (NHL) method for fully test-time adaptation. With the unsupervised feed-forward soft Hebbian learning being combined with a learned neuro-modulator to capture feedback from external responses, the source model can be effectively adapted during the testing process. Experimental results on benchmark datasets demonstrate that our proposed method can significantly improve the adaptation performance of network models and outperforms existing state-of-the-art methods.
```
@inproceedings{tang2023neuro,
    title     = {Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation},
    author    = {Tang, Yushun and Zhang, Ce and Xu, Heng and Chen, Shuoshuo and Cheng, Jie and Leng, Luziwei and Guo, Qinghai and He, Zhihai},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {3728--3738},
    year      = {2023}
}
```

CVPR 2023
Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation

Zhehan Kan, Shuoshuo Chen, Ce Zhang, Yushun Tang, Zhihai He

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Vancouver, Canada, June 18-22, 2023

A central challenge in human pose estimation, as well as in many other machine learning and prediction tasks, is the generalization problem. The learned network does not have the capability to characterize the prediction error, generate feedback information from the test sample, and correct the prediction error on the fly for each individual test sample, which results in degraded performance in generalization. In this work, we introduce a self-correctable and adaptable inference (SCAI) method to address the generalization challenge of network prediction and use human pose estimation as an example to demonstrate its effectiveness and performance. We learn a correction network to correct the prediction result conditioned by a fitness feedback error. This feedback error is generated by a learned fitness feedback network which maps the prediction result to the original input domain and compares it against the original input. Interestingly, we find that this self-referential feedback error is highly correlated with the actual prediction error. This strong correlation suggests that we can use this error as feedback to guide the correction process. It can be also used as a loss function to quickly adapt and optimize the correction network during the inference process. Our extensive experimental results on human pose estimation demonstrate that the proposed SCAI method is able to significantly improve the generalization capability and performance of human pose estimation.
```
@inproceedings{kan2023self,
    title     = {Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation},
    author    = {Kan, Zhehan and Chen, Shuoshuo and Zhang, Ce and Tang, Yushun and He, Zhihai},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {5537--5546},
    year      = {2023}
}
```

IEEE TAI
Critical Sampling for Robust Evolution Operator Learning of Unknown Dynamical Systems

Ce Zhang, Kailiang Wu, Zhihai He

IEEE Transactions on Artificial Intelligence, 2023.

Also at First Workshop on Out-of-Distribution Generalization in Robotics at CoRL 2023.

Atlanta, GA, US, November 6-9, 2023

Given an unknown dynamical system, what is the minimum number of samples needed for effective learning of its governing laws and accurate prediction of its future evolution behavior, and how to select these critical samples? In this work, we propose to explore this problem based on a design approach. Starting from a small initial set of samples, we adaptively discover critical samples to achieve increasingly accurate learning of the system evolution. One central challenge here is that we do not know the network modeling error since the ground-truth system state is unknown, which is, however, needed for critical sampling. To address this challenge, we introduce a multistep reciprocal prediction network where forward and backward evolution networks are designed to learn the temporal evolution behavior in the forward and backward time directions, respectively. Very interestingly, we find that the desired network modeling error is highly correlated with the multistep reciprocal prediction error, which can be directly computed from the current system state. This allows us to perform a dynamic selection of critical samples from regions with high network modeling errors for dynamical systems. In addition, a joint spatial-temporal evolution network is introduced, which incorporates spatial dynamics modeling into the temporal evolution prediction for robust learning of the system evolution operator with few samples. Our extensive experimental results demonstrate that our proposed method is able to dramatically reduce the number of samples needed for effective learning and accurate prediction of evolution behaviors of unknown dynamical systems by up to hundreds of times.
```
@article{zhang2024critical,
    author  = {Ce Zhang and Kailiang Wu and Zhihai He},
    title   = {Critical Sampling for Robust Evolution Operator Learning of Unknown Dynamical Systems},
    journal = {IEEE Transactions on Artificial Intelligence},
    volume  = {5},
    number  = {6},
    pages   = {2856--2871},
    year    = {2024},
    doi     = {10.1109/TAI.2023.3327676}
}
```
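
A rough sketch of the reciprocal-prediction idea from the abstract above: roll a state forward for a few steps, roll it back for the same number of steps, and treat the mismatch with the starting state as a proxy for modeling error. The callables below are toy stand-ins for the learned forward and backward evolution networks, and the "pick the largest error" rule is a hypothetical simplification, not the paper's sampling procedure.

```python
import math

def reciprocal_error(state, forward, backward, steps=3):
    """Distance between a state and its forward-then-backward reconstruction."""
    current = state
    for _ in range(steps):
        current = forward(current)
    for _ in range(steps):
        current = backward(current)
    return math.dist(state, current)

def pick_critical(candidates, forward, backward, k=1, steps=3):
    """Return the k candidate states with the largest reciprocal error."""
    ranked = sorted(
        candidates,
        key=lambda s: reciprocal_error(s, forward, backward, steps),
        reverse=True,
    )
    return ranked[:k]

# Toy dynamics: the true system doubles the state at every step.
forward_model = lambda s: tuple(2.0 * v for v in s)
# An exact inverse, and an imperfect one whose error grows with magnitude.
exact_backward = lambda s: tuple(v / 2.0 for v in s)
rough_backward = lambda s: tuple(v / 2.0 + 0.01 * v * v for v in s)

candidates = [(0.1,), (1.0,), (3.0,)]
chosen = pick_critical(candidates, forward_model, rough_backward, k=1)
```

With the exact inverse the error is zero everywhere; with the imperfect one, the candidate farthest from the origin is selected because that is where the backward model is least accurate. Note that the error is computed from the current state alone, without access to ground-truth future states.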

Experiences

Research Scientist Intern

TikTok

📍 San Jose, CA, USA 🗓️ May – Aug 2026 🤝 Mentor: Ming Zhou

Research Intern

Tencent AI Lab

📍 Shenzhen, China 🗓️ Feb – Jul 2025 🤝 Mentor: Dr. Kaixin Ma

Honors and Awards

Liang Zhao Fellowship, RI Departmental PhD Fellowship, CMU, October 2025
Top 10 Summa Cum Laude Graduates, Southern University of Science and Technology, June 2023
National Scholarship, Ministry of Education of the People’s Republic of China, November 2022

Services

Journal Reviewer: IJCV, IEEE TCSVT (>10), IEEE TIP, IEEE TMM, IEEE TAI, Neurocomputing, Knowledge-Based Systems, Pattern Recognition
Conference Reviewer: NeurIPS ’25 ’26, ICLR ’25 ’26, CVPR ’26, ICCV ’25, AAAI ’26 ’27, WACV ’25 ’26 ’27 (Outstanding Reviewer), ICASSP ’25, BMVC ’24 ’25 ’26 (Outstanding Reviewer), ICME ’24 ’25, IJCNN ’25 ’26, AVSS ’25
