COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models
Preview
arXiv:2606.28696v1
Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided gen