Stephen Casper


Hi, I’m Stephen Casper, but most people call me Cas. I’m a second-year Ph.D. student in Computer Science (EECS) at MIT in the Algorithmic Alignment Group, advised by Dylan Hadfield-Menell. I’m supported by the Vitalik Buterin Fellowship from the Future of Life Institute. Previously, I worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI. My main focus is developing tools for more interpretable and robust AI. Lately, I have been particularly interested in (mostly) automated ways of finding and fixing flaws in how deep neural networks handle human-interpretable concepts. I’m also an Effective Altruist trying to do the most good I can.

You’re welcome to email me (and do it a second time if I don’t respond). I like meeting and talking with new people. You can find me on Google Scholar, Github, Twitter, or LessWrong. I also have a personal feedback form. Feel free to use it to send me anonymous, constructive feedback about how I can be better. For now, I’m not posting my resume/CV here, but please email me if you’d like to talk about projects or opportunities.

You can also ask me about my hissing cockroaches, a bracelet I made for a monkey, witchcraft, what I learned from getting my genome sequenced, jumping spiders, a time I helped with a “mammoth” undertaking, a necklace that I wear every full moon, or a jar I keep on my windowsill.

Publications

Casper, S.*, Hariharan, K.*, & Hadfield-Menell, D. (2022). Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks. arXiv preprint.

Räuker, T.*, Ho, A.*, Casper, S.*, & Hadfield-Menell, D. (2022). Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. In SATML 2023.

Casper, S., Hadfield-Menell, D., & Kreiman, G. (2022). White-Box Adversarial Policies in Deep Reinforcement Learning. arXiv preprint arXiv:2209.02167.

Casper, S.*, Hod, S.*, Filan, D.*, Wild, C., Critch, A., & Russell, S. (2021). Graphical Clusterability and Local Specialization in Deep Neural Networks. In Pair^2Struct Workshop, ICLR 2022.

Hod, S.*, Casper, S.*, Filan, D.*, Wild, C., Critch, A., & Russell, S. (2021). Detecting Modularity in Deep Neural Networks. arXiv preprint.

Casper, S.*, Nadeau, M.*, Hadfield-Menell, D., & Kreiman, G. (2021). Robust Feature-Level Adversaries are Interpretability Tools. In NeurIPS 2022.

Chen, Y.*, Hysolli, E.*, Chen, A.*, Casper, S.*, Liu, S., Yang, K., … & Church, G. (2022). Multiplex base editing to convert TAG into TAA codons in the human genome. Nature Communications, 13(1), 1-13.

Casper, S.*, Boix, X.*, D’Amario, V., Guo, L., Schrimpf, M., Vinken, K., & Kreiman, G. (2021). Frivolous Units: Wider Networks Are Not Really That Wide. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35).

Filan, D.*, Casper, S.*, Hod, S.*, Wild, C., Critch, A., & Russell, S. (2021). Clusterability in Neural Networks. arXiv preprint.

Casper, S. (2020). Achilles Heels for AGI/ASI via Decision Theoretic Adversaries. arXiv preprint.

Saleh, A., Deutsch, T., Casper, S., Belinkov, Y., & Shieber, S. M. (2020, July). Probing Neural Dialog Models for Conversational Understanding. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI (pp. 132-143).

Posts and Such

The Slippery Slope from DALLE-2 to Deepfake Anarchy

Pitfalls with Proofs

Functional Decision Theory

Procrastination Paradoxes

A daily routine I do for AI safety research work

Deep Dives: My Advice for Pursuing Work in Research

Current Projects

I’m working on a few projects involving automated discovery of interpretable adversarial examples for vision models, red-teaming language models, and benchmarking interpretability tools. Feel free to reach out.


The Arete Fellowship is a program I founded with the Harvard Effective Altruism club. It has been adopted or adapted by over 90 Effective Altruism groups in the US, Canada, and China and has thousands of alumni.