I'm Alex McKenzie. I care a lot about mitigating risks from Artificial Intelligence. I also care about music, Effective Altruism, and cats.
I work at AE Studio on alignment.
I blog occasionally at alexmck.com.
- Detecting High-Stakes Interactions with Activation Probes — a paper on using activation probes to efficiently detect high-stakes LLM interactions, with accompanying codebase
- Endogenous Resistance to Activation Steering in Language Models — a paper on how LLMs can internally resist task-misaligned activation steering, with accompanying codebase



