HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

School of Computer Science, Carnegie Mellon University
CVPR 2024

Highlights

  • We propose HiKER-SGG, a novel method for generating scene graphs through hierarchical inference over structured domain knowledge, progressively refining coarse predictions into increasingly granular classifications through iterative sub-selection.
  • We introduce a new synthetic VG-C benchmark for SGG, containing 20 challenging image corruptions, including simple transformations and severe weather conditions.
  • Extensive experiments demonstrate that HiKER-SGG outperforms current state-of-the-art methods on SGG tasks, while simultaneously providing a strong zero-shot baseline for generating scene graphs from corrupted images.

Abstract

Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach to enable reasoning over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, and smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such challenging settings. At its core, HiKER-SGG utilizes a hierarchical knowledge graph to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks.
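To make the coarse-to-fine idea concrete, the snippet below is a minimal, self-contained sketch of hierarchical prediction over a two-level class hierarchy. The hierarchy, predicate names, and scores are illustrative placeholders and do not reproduce the actual HiKER-SGG knowledge graph or inference procedure.

# Minimal sketch of coarse-to-fine prediction over a class hierarchy.
# The hierarchy and class names below are illustrative placeholders,
# not the actual HiKER-SGG knowledge graph.
import numpy as np

# Hypothetical two-level predicate hierarchy: super-class -> fine classes.
HIERARCHY = {
    "geometric": ["on", "under", "near"],
    "possessive": ["has", "wearing"],
    "semantic": ["riding", "eating", "looking at"],
}
FINE_CLASSES = [c for children in HIERARCHY.values() for c in children]

def hierarchical_predict(coarse_logits, fine_logits):
    """Pick a super-class first, then sub-select among its children."""
    super_names = list(HIERARCHY.keys())
    # Stage 1: coarse decision over super-classes.
    best_super = super_names[int(np.argmax(coarse_logits))]
    # Stage 2: restrict the fine-grained decision to that branch.
    children = HIERARCHY[best_super]
    child_idx = [FINE_CLASSES.index(c) for c in children]
    best_child = children[int(np.argmax(fine_logits[child_idx]))]
    return best_super, best_child

# Example with random scores standing in for model outputs.
rng = np.random.default_rng(0)
print(hierarchical_predict(rng.normal(size=3), rng.normal(size=len(FINE_CLASSES))))

Restricting the second stage to the children of the selected super-class is what makes the final decision a sub-selection rather than a flat classification over all predicates.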

VG-C Benchmark

To standardize the evaluation of SGG robustness, we create a corrupted Visual Genome (VG-C) benchmark comprising 20 corruption types designed to simulate realistic degradations that may occur in real-world scenarios. The first 15 corruption types are those introduced by ImageNet-C, which are widely recognized as a standard benchmark for evaluating robustness. To further align with real-world scenarios, we add 5 types of natural corruption to our evaluation: sun glare, water drop, wildfire smoke, rain, and dust.
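The sketch below illustrates how ImageNet-C style corruptions could be applied to Visual Genome images, assuming the open-source imagecorruptions Python package; the 5 additional natural corruptions (sun glare, water drop, wildfire smoke, rain, and dust) are not part of that package and would require custom rendering. File paths and severity are placeholders, not the exact settings used for VG-C.

# Sketch of generating ImageNet-C style corruptions for VG images using
# the open-source `imagecorruptions` package (pip install imagecorruptions).
# Illustrative only: the 5 natural corruptions added in VG-C (sun glare,
# water drop, wildfire smoke, rain, dust) need custom implementations.
import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

def corrupt_image(path, corruption_name, severity=3):
    """Load an image and apply one corruption at the given severity (1-5)."""
    img = np.asarray(Image.open(path).convert("RGB"))
    return corrupt(img, corruption_name=corruption_name, severity=severity)

if __name__ == "__main__":
    # get_corruption_names() returns the 15 standard ImageNet-C corruptions.
    for name in get_corruption_names():
        out = corrupt_image("example_vg_image.jpg", name)  # placeholder path
        Image.fromarray(out).save(f"{name}.jpg")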

Figure 3. All 20 corruption types used in our corruption experiments. The first 15 corruption types are introduced by ImageNet-C, and we introduce 5 additional types of natural corruption for a more comprehensive and practical evaluation.

Experimental Results

Results on Clean VG Dataset
Table 1. Performance comparison with the state-of-the-art SGG methods on the Visual Genome dataset. The best results for each metric are in bold, while the second-best results are underlined. "-" denotes unavailable results due to incompatible experimental settings.

Results on Corrupted VG-C Dataset
Table 2. Performance comparison with the state-of-the-art SGG methods for the PredCls task on the corrupted Visual Genome dataset. We report results in percentage for the mR@20: UC/C, mR@50: UC/C, and mR@100: UC/C metrics, structured in six rows. The best results for each metric are in bold. The last column reports the average mean recall across all 20 types of corruption, and the percentage decrease in blue when compared to the mean recall on clean images. We evaluate these methods using the code provided by the authors.
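As a reference for the reported metric, the snippet below sketches how mean recall@K (mR@K) is commonly computed in SGG: recall@K is evaluated separately for each predicate class and then averaged over classes. The data structures and exact-match criterion are simplified placeholders, not the evaluation code used in the paper.

# Minimal sketch of mean recall@K (mR@K) as commonly defined for SGG:
# recall@K is computed per predicate class over all images, then averaged
# across classes. Data structures here are illustrative placeholders.
from collections import defaultdict

def mean_recall_at_k(gt_triplets_per_image, pred_triplets_per_image, k):
    """gt/pred triplets are lists of (subject, predicate, object) per image;
    predictions are assumed to be sorted by confidence."""
    hits = defaultdict(int)    # per-predicate matched ground-truth triplets
    totals = defaultdict(int)  # per-predicate ground-truth counts
    for gts, preds in zip(gt_triplets_per_image, pred_triplets_per_image):
        top_k = set(preds[:k])
        for triplet in gts:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls)

# Toy example: one image, two ground-truth triplets, three predictions.
gt = [[("man", "riding", "horse"), ("man", "wearing", "hat")]]
pred = [[("man", "riding", "horse"), ("man", "on", "horse"), ("hat", "on", "man")]]
print(mean_recall_at_k(gt, pred, k=2))  # riding: 1/1, wearing: 0/1 -> 0.5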

Qualitative Comparison
Figure 4. Qualitative comparisons on the PredCls task. The visualized predicted predicates are picked from the top 50 predicted triplets. Here, red dashed lines denote undetected predicates, solid red lines denote incorrect predictions, and solid green lines indicate correct predictions. For an easier comparison, predicates correctly predicted by our method but incorrectly by GB-Net are highlighted in dark green.

Poster

Video Presentation

BibTeX

@inproceedings{zhang2024hiker,
  title={HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation},
  author={Zhang, Ce and Stepputtis, Simon and Campbell, Joseph and Sycara, Katia and Xie, Yaqi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={28233--28243},
  year={2024}
}