Training-Free Region-Level Semantic Understanding via Interpreting Internal Representations of VLMs
  • Category: August 2026
  • Posted: 2026.04.07
  • Author: 이종휘

Thesis title: Training-Free Region-Level Semantic Understanding via Interpreting Internal Representations of Vision-Language Models


Abstract: Vision-Language Models (VLMs) have demonstrated remarkable performance across a wide range of computer vision tasks that require understanding the overall semantics of an image, such as image captioning and visual question answering. However, they still struggle with region-level scene understanding, which involves distinguishing and interpreting specific objects or regions within an image. To address this limitation, recent studies construct large-scale datasets of region annotations, such as bounding boxes and segmentation masks, paired with textual descriptions, and fine-tune VLMs on them. Such approaches not only incur substantial costs for dataset construction and model training, but may also degrade the generalization capability that VLMs acquired through pretraining. In this paper, we propose a method that enables region-level scene understanding by interpreting the internal representations of VLMs, without additional dataset construction or fine-tuning.



Degree date: August 2026


E-mail: tmvlzj49@pusan.ac.kr



Advisor: Prof. 전상률


Keywords: Vision-Language Models, Region-Level Scene Understanding, Multi-Modal Representation Learning, Interpretability of Vision-Language Models, Training-Free Method


Webpage: https://sites.google.com/pusan.ac.kr/interpret-vlms/%ED%99%88
