Bypassing Refusal Behavior in Qwen Models via Activation Steering — LessWrong
Summary
Preventing AI misalignment, potentially by bad actors, is the most important goal of AI safety. I ran experiments exploring refusal behavior…