Stephen Casper



Hi, I’m Stephen Casper. I’m in the Harvard College class of 2021 and doing research with the Center for Brains, Minds, and Machines under the HMS Kreiman Lab. For the summer of 2020, I also interned with the Center for Human-Compatible AI. I’m majoring in statistics, and my main interests are machine learning and technical AI alignment. More specifically, research interests of mine include interpretability, adversaries, and decision theory. I’m also an Effective Altruist trying to do the most good I can.

Find me on Google Scholar, GitHub, LinkedIn, LessWrong, EA Forum, Medium, and Facebook. I also have a personal feedback form. Feel free to use it to send me anonymous, constructive feedback about how I can be a better or more effective person. For now, I’m not posting my resume/CV here, but please email me if you’d like to talk about projects or jobs.

Talk to me about AI alignment, machine learning, Effective Altruism, rationality, decision theory, or paradoxes.

Some Projects

Clusterability in Neural Networks: Deep neural networks are often only understood at the level of a black box. In this paper, however, we find that networks typically develop a structure in which their neurons cluster well into groups. This suggests an opportunity to interpret networks in terms of modules of neurons within them. Read our paper on arXiv: Clusterability in Neural Networks. A followup project focused on bridging this clusterability and functional modularity is in the works.

Frivolous Units in Neural Networks: What types of features do deep neural networks develop to constrain their effective capacities and avoid overfitting? Read our paper on arXiv: Frivolous Units: Wider Neural Networks are not Really That Wide. In it, we present novel findings relating to network design, interpreting units, compression, and initialization. This paper is featured in the AAAI 2021 conference.

Learned Adversarial Policies: Understanding adversaries is key to building robust and safe AI. In reinforcement learning, certain types of adversaries can be created by simply training one agent with the goal of making another fail. A few works have investigated them, but they tend to use brute-force techniques that would not be realistic threat models in the real world. I’m working on several strategies for improving sample efficiency in these attacks. Feel free to read my (slightly dated) research proposal. (Photo credit to Kurach et al. 2019)

Achilles Heels for AGI/ASI via Decision Theoretic Adversaries: Given rapid progress in AI and the possibility of systems with par-human or superhuman intelligence, it is important to understand how AI systems will behave and in what ways they may fail. In a paper on what I call the “Achilles Heel hypothesis,” I argue that even if an AI system is generally very good at achieving its goals, it still can have delusions which can cause egregious failures in unique circumstances.

Research Blog Posts: I’m interested in paradoxes and tricky decision theoretic dilemmas. Two posts in which I discuss key issues and present novel frameworks for understanding them are Dissolving Confusion around Functional Decision Theory and Procrastination Paradoxes: The Good, the Bad, and the Ugly. I also like adversarial machine learning and wrote a post called A PAC Framework for Bayesian Black Box Adversarial Attacks in which I focus on techniques for gradient modeling and derive a surprisingly tight bound for MVN variable estimation.

The Arete Fellowship: I founded, chaired, and designed the curriculum for the Arete Fellowship program under the Harvard College Effective Altruism club. The fellowship is a semester-long program based on reading, discussing, and writing which introduces participants to key themes in rationality, philosophy, cause evaluation, and contemporary issues. The fellowship, which began in the fall of 2018, has since been adopted by over 25 other Effective Altruism university groups in the US, Canada, and Hong Kong.

What I think about…