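1. Introduction to VLMs
Vision Language Model (VLM)
A model that takes visual information as input and converts it into language (text) output
Example: given an image, the model generates text that describes the image
Example: in autonomous driving, a model that explains the reasoning behind the control commands the driving model outputs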
2. Survey papers
Vision-Language Models for Vision Tasks: A Survey, 2023
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10445007
An Introduction to Vision-Language Modeling, 2024
https://arxiv.org/abs/2405.17247
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges, 2025
https://arxiv.org/abs/2501.02189
3. Key papers
CLIP: Learning Transferable Visual Models From Natural Language Supervision, 2021, OpenAI
https://arxiv.org/pdf/2103.00020
https://github.com/OpenAI/CLIP
-> Showed that contrastive learning on image-text pairs enables zero-shot prediction without fine-tuning
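
A minimal sketch of the zero-shot prediction step noted for CLIP: embed the image and one text prompt per class, then pick the class whose text embedding has the highest cosine similarity to the image embedding. The random vectors below are stand-ins for real encoder outputs, and the function names, the embedding size (16), and the temperature value are assumptions for illustration, not CLIP's actual API or settings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_predict(image_emb, text_emb, temperature=0.01):
    """image_emb: (N, D) image embeddings, text_emb: (C, D) class-prompt embeddings.
    Returns predicted class indices (N,) and class probabilities (N, C)."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)      # softmax over classes
    return probs.argmax(axis=1), probs

# Toy data: text embeddings for three hypothetical class prompts, and two
# "images" built as noisy copies of the "car" and "cat" text embeddings.
rng = np.random.default_rng(0)
class_names = ["cat", "dog", "car"]
text_emb = rng.normal(size=(3, 16))
image_emb = text_emb[[2, 0]] + 0.05 * rng.normal(size=(2, 16))

pred, probs = zero_shot_predict(image_emb, text_emb)
print([class_names[i] for i in pred])
```

No classifier is trained here: adding a class only requires adding one more text prompt embedding, which is what makes the prediction "zero-shot".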
SLIP: Self-supervision meets Language-Image Pre-training, 2021, UC Berkeley
https://arxiv.org/abs/2112.12750
-> CLIP is trained with language supervision only; SLIP proposes adding image self-supervised learning to CLIP-style pre-training in a multi-task setup
UniCL: Unified Contrastive Learning in Image-Text-Label Space, 2022, Microsoft
https://arxiv.org/abs/2204.03610
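-> Proposes training in a unified image-text-label space, so that image-label data and image-text data can be learned together with one contrastive objective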
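
A rough sketch of label-aware contrastive learning in the spirit of UniCL: instead of treating only the matching image-text pair as positive, every pair in the batch that shares a label counts as positive. The function name, temperature, and toy batch are assumptions for illustration, and the exact loss in the paper may differ in normalization details.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def label_contrastive_loss(image_emb, text_emb, labels, temperature=0.07):
    """Bidirectional contrastive loss where positives are defined by labels.
    image_emb, text_emb: (B, D); labels: (B,) integer ids.
    For web image-text pairs without a class, give each pair a unique id."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.asarray(labels)
    # pos[i, j] = 1 when image i and text j carry the same label.
    pos = (labels[:, None] == labels[None, :]).astype(float)
    # Image-to-text: average log-probability over each image's positive texts.
    i2t = -(pos * log_softmax(logits, axis=1)).sum(axis=1) / pos.sum(axis=1)
    # Text-to-image: the same, normalizing over images for each text.
    t2i = -(pos * log_softmax(logits, axis=0)).sum(axis=0) / pos.sum(axis=0)
    return i2t.mean() + t2i.mean(), pos

# Toy batch: samples 0 and 1 share a class label, samples 2 and 3 do not.
rng = np.random.default_rng(0)
labels = [0, 0, 1, 2]
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))
loss, pos = label_contrastive_loss(image_emb, text_emb, labels)
print(pos)
print(float(loss))
```

When every sample has a unique label, the positive mask becomes the identity matrix and the loss reduces to the usual CLIP-style pairwise contrastive loss.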
CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022, Google
https://arxiv.org/abs/2205.01917v2
-> Image-text encoder-decoder model trained jointly with a contrastive loss and a captioning (image -> caption) loss
FLAVA: A Foundational Language And Vision Alignment Model, 2021, Facebook
https://arxiv.org/abs/2112.04482
-> A single foundation model that aligns language and vision, targeting vision-only, language-only, and multimodal tasks