Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Authors: Park, Subin; Kim, Jung Uk
Subject:
Computer Vision and Pattern Recognition
Record type:
Working Paper
Online access:
https://arxiv.org/abs/2604.06824

Additional information
- Publication year:
  2026
- Abstract:
  The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.
  Accepted to CVPR 2026
- Accession number:
  edsarx.2604.06824
