Stephen Casper


Hi, I’m Cas.


Hi, I’m Stephen Casper, but most people call me Cas. I’m a second-year Ph.D. student in Computer Science (EECS) at MIT, in the Algorithmic Alignment Group, advised by Dylan Hadfield-Menell. I’m supported by the Vitalik Buterin Fellowship from the Future of Life Institute. Previously, I worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI. My main focus is developing tools for more interpretable and robust AI by studying interpretability, adversaries, and diagnostic tools in deep learning.

You can find me on Google Scholar, GitHub, Twitter, or LessWrong. If you want to get to know me, I made an autobiographical document to introduce myself to people and make friends. I also have a personal feedback form. Feel free to use it to send me anonymous, constructive feedback about how I can be better. For now, I’m not posting my resume/CV here, but please email me if you’d like to talk about projects or opportunities.

You can also ask me about my hissing cockroaches, a bracelet I made for a monkey, witchcraft, what I learned from getting my genome sequenced, jumping spiders, a time I helped with a “mammoth” undertaking, a necklace that I wear every full moon, or a jar I keep in my windowsill.


Casper, S.*, Li, Y.*, Li, J.*, Bu, T.*, Zhang, K.*, & Hadfield-Menell, D. (2023). Benchmarking Interpretability Tools for Deep Neural Networks. arXiv preprint arXiv:2302.10894

Casper, S.*, Hariharan, K.*, & Hadfield-Menell, D. (2022). Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks. arXiv preprint. ***Best paper award — 2022 NeurIPS Machine Learning Safety Workshop***

Räuker, T.*, Ho, A.*, Casper, S.*, & Hadfield-Menell, D. (2022). Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. In SaTML 2023.

Casper, S., Hadfield-Menell, D., & Kreiman, G. (2022). White-Box Adversarial Policies in Deep Reinforcement Learning. arXiv preprint arXiv:2209.02167

Casper, S.*, Hod, S.*, Filan, D.*, Wild, C., Critch, A., & Russell, S. (2021). Graphical Clusterability and Local Specialization in Deep Neural Networks, Pair^2Struct Workshop, ICLR 2022.

Hod, S.*, Casper, S.*, Filan, D.*, Wild, C., Critch, A., & Russell, S. (2021). Detecting Modularity in Deep Neural Networks. arXiv preprint

Casper, S.*, Nadeau, M.*, Hadfield-Menell, D., & Kreiman, G. (2021). Robust Feature-Level Adversaries are Interpretability Tools. In NeurIPS, 2022.

Chen, Y.*, Hysolli, E.*, Chen, A.*, Casper, S.*, Liu, S., Yang, K., … & Church, G. (2022). Multiplex base editing to convert TAG into TAA codons in the human genome. Nature Communications, 13(1), 1-13.

Casper, S.*, Boix, X.*, D’Amario, V., Guo, L., Schrimpf, M., Vinken, K., & Kreiman, G. (2021). Frivolous Units: Wider Networks Are Not Really That Wide. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35).

Filan, D.*, Casper, S.*, Hod, S.*, Wild, C., Critch, A., & Russell, S. (2021). Clusterability in Neural Networks. arXiv

Casper, S. (2020). Achilles Heels for AGI/ASI via Decision Theoretic Adversaries. arXiv

Saleh, A., Deutsch, T., Casper, S., Belinkov, Y., & Shieber, S. M. (2020, July). Probing Neural Dialog Models for Conversational Understanding. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI (pp. 132-143).

Posts and More

AXRP Podcast 21 – Interpretability for Engineers with Stephen Casper

The Slippery Slope from DALLE-2 to Deepfake Anarchy

Pitfalls with Proofs

Functional Decision Theory

Procrastination Paradoxes

A daily routine I do for AI safety research work

Deep Dives: My Advice for Pursuing Work in Research

Avoiding Perpetual Risk from TAI