Title: Understanding Safety & Alignment with Mechanistic Theory
Speaker: Eric Wong (University of Pennsylvania)
Date and time: 4:00 p.m. on June 30
Online Venue (Zoom): The URL will be provided only to registered participants.
Abstract: Why are LLM guardrails so easily broken, and how can we enforce them? This talk formalizes a mechanistic theory for studying safety problems. We begin with one-layer transformers, identifying rule-breaking as an inherent architectural vulnerability in the model's attention mechanism. This mechanistic framework (LogicBreaks) taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them. Building on this insight, we extend the theory to analyze attention-based interventions, arriving at InstaBoost: a simple yet highly effective steering method that boosts the model's attention on user-provided instructions during generation. This technique, developed from an analysis of one-layer transformers, provides state-of-the-art control over large-scale LLMs with just five lines of code.
Bio: Eric Wong is an assistant professor in the Department of Computer and Information Science at the University of Pennsylvania. He leads the Brachio Lab, which works on debugging machine learning and making systems actually do what we want them to do. He is also part of the ASSET Center on safe, explainable, and trustworthy AI systems. Previously, he completed his PhD at CMU, advised by Zico Kolter, and did a postdoc with Aleksander Madry.