Mechanistic Personality Analysis of LLMs: Steering Personality via Latent Feature Interventions

Preview

arXiv:2606.28770v1 (announce type: new)

Abstract: Large Language Models (LLMs) have demonstrated the ability to simulate human-like OCEAN personality traits in generated text. Previous efforts have focused on prompt engineering or fine-tuning to shape LLM personality. In this work, we propose a mechanistic interpretability approach that directly intervenes on the model's latent features. Our method identifies latent directions in the residual stream corresponding to a target OCEAN trait using spa

Related posts

What Drives Interactive Improvement from Feedback?

Contrastive Reflection for Iterative Prompt Optimization

How Can AI Find My Model? A Model-Finding Experimental Study Considering Data Formats, Embeddings, and Retrieval Strategies

BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation