Mechanistic Personality Analysis of LLMs: Steering Personality via Latent Feature Interventions
arXiv:2606.28770v1, announce type: new
Abstract: Large Language Models (LLMs) have demonstrated the ability to simulate human-like OCEAN personality traits in generated text. Previous efforts have focused on prompt engineering or fine-tuning to shape LLM personality. In this work, we propose a mechanistic interpretability approach that directly intervenes on the model's latent features. Our method identifies latent directions in the residual stream corresponding to a target OCEAN trait using spa
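To make the idea of intervening on a residual-stream direction concrete, here is a minimal sketch of one common form of such an intervention: additive steering, where a scaled unit vector is added to a hidden state. This is an assumption for illustration, not the paper's method. The abstract breaks off before it says how the trait direction is found or applied, so the sketch takes the direction as given. The names `steer`, `project`, `trait_direction`, and `alpha` are hypothetical, and the vectors are toy values rather than real model activations.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def project(h, direction):
    """Scalar projection of hidden state h onto the (normalized) direction."""
    u = unit(direction)
    return sum(a * b for a, b in zip(h, u))


def steer(h, direction, alpha):
    """Additive steering: shift h by alpha along the normalized direction.

    Assumption: a single additive shift; the source does not specify
    the form of its intervention.
    """
    u = unit(direction)
    return [a + alpha * b for a, b in zip(h, u)]


# Toy stand-ins: a 4-dimensional "residual stream" vector and a
# hypothetical trait direction (not taken from any real model).
hidden = [0.5, -1.0, 2.0, 0.0]
trait_direction = [1.0, 0.0, 1.0, 0.0]

steered = steer(hidden, trait_direction, alpha=2.0)

# The projection onto the direction moves by alpha; coordinates where
# the direction is zero are left unchanged.
shift = project(steered, trait_direction) - project(hidden, trait_direction)
```

In a real model the same shift would typically be applied to a chosen layer's activations during the forward pass, with the sign and size of `alpha` controlling whether the trait is amplified or suppressed. Those choices are likewise assumptions here.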