How do we solve the alignment problem?
Introduction to an essay series on paths to safe, useful superintelligence.
(Podcast version here (read by the author), or search for “Joe Carlsmith Audio” on your podcast app.[1])
We want the benefits that superintelligent AI agents could create. And some people are trying hard to build such agents. I expect efforts like this to succeed – and maybe, very soon.
But superintelligent AI agents might also be difficult to control. They are, to us, as adults to children, except much more so. In the same direction, relative to us, as advanced aliens; as demi-gods; as humans relative to ants. If such agents “go rogue” – if they start ignoring human instructions, resisting correction or shut-down, trying to escape from their operating environment, seeking unauthorized resources and other forms of power, etc. – we might not be able to stop them.
Worse, because power/resources/freedom/survival etc. are useful for many goals, superintelligent agents with a variety of different motivations would plausibly have incentives to go rogue in this way, suggesting that problems with AI motivations could easily lead to such behavior. And if this behavior goes uncorrected at scale, humans might lose control over civilization entirely – permanently, involuntarily, maybe violently. Superintelligent AI agents, acting on their own, would be the dominant actors on the planet. Humans would be sidelined, or dead.
Getting safe access to the benefits of superintelligence requires avoiding this kind of outcome. And this despite incentives among human actors to build more and more capable and agentic systems (including incentives to do so faster than someone else), and despite the variety of actors that might proceed unsafely. Call this the “alignment problem.”
I’ve written, before, about why I’m worried about this problem.[2] But I’ve said much less about how we might solve it. In this series of essays, I try to say more.[3] Here’s a summary of the essays I’ve released thus far:
In the first essay, “What is it to solve the alignment problem?”, I define solving the alignment problem as: building full-blown superintelligent AI agents, and becoming able to safely elicit their main beneficial capabilities, while avoiding the sort of “loss of control” scenario discussed above. I also define some alternatives to both solving the problem and failing on the problem – namely, what I call “avoiding” the problem (i.e., not building superintelligent AI agents at all, and looking for other ways to get access to similar benefits), and “handling” the problem (namely, using superintelligent AI agents in more restricted ways, and looking for other ways to get access to the sort of benefits their full capabilities would unlock). I think these alternatives should be on the table too. I also contrast my definition of solving the problem with some more exacting standards – namely, what I call “safety at all scales,” “fully competitive safety,” “permanent safety,” “near-term safety,” and “complete alignment.” And I discuss how solving the problem, in my less-exacting sense, fits into the bigger picture.
I may add more overall remarks here later. But I think it’s possible that my perspective on the series as a whole will change as I finish it. So for now, I’ll stick with a few notes.
First: the series is not a solution to the alignment problem. It’s more like: a high-level vision of how we get to a solution, and of what the space of possible solutions looks like. I, at least, have wanted more of this sort of vision over the years, and it feels at least clearer now, even if still disturbingly vague. And while many of my conclusions are not new, still: I wanted to think it through, and to write it down, for myself.
Second: as far as I can currently tell, one of the most important sources of controllable variance in the outcome, here, is the safety, efficacy, and scale of frontier AI labor that gets used for well-chosen, safety-relevant applications – e.g., alignment research, monitoring/oversight, risk evaluation, cybersecurity, hardening-against-AI-attack, coordination, governance, etc. In the series, I call this “AI for AI safety.” I think it’s a big part of the game – in particular, whether we can figure out how to do it well, and how much we invest in it relative to pushing forward AI capabilities. AI companies, governments, and other actors with the potential to access and direct large amounts of compute have an especially important role to play, here. But I think that safety-focused efforts, in general, should place special emphasis on figuring out how to use safe AI labor as productively as possible – and especially if time is short, as early as possible – and then doing it.
Third: the discussion of “solutions” in the series might create a false sense of comfort. I am trying to chart the best paths forward. I am trying to figure out what will help most on the margin. And I am indeed more optimistic about our prospects than some vocal pessimists. But I want to be very clear: our current trajectory appears to me extremely dangerous. We are hurtling headlong towards the development of artificial agents that will plausibly be powerful enough to destroy everything we care about if we fail to control their options and motivations in the right way. And we do not know if we will be able to control their options and motivations in the right way. Nor are we on any clear track to have adequate mechanisms and political will for halting further AI development, if efforts at such control are failing, or are likely to fail if we continue forward.
And if we fail hard enough, then you, personally, will be killed, or forcibly disempowered. And not just you. Your family. Your friends. Everyone. And the human project will have failed forever.
These are the stakes. This is what fucking around with superintelligent agents means. And it looks, to me, like we’re at serious risk of fucking around.
I don’t know what will happen. I expect we’ll find out soon enough.
Here’s one more effort to help.
This series represents my personal views, not the views of my employer.
Thanks to Nick Beckstead, Sam Bowman, Catherine Brewer, Collin Burns, Joshua Clymer, Owen Cotton-Barratt, Ajeya Cotra, Tom Davidson, Sebastian Farquhar, Peter Favaloro, Lukas Finnveden, Katja Grace, Ryan Greenblatt, Evan Hubinger, Holden Karnofsky, Daniel Kokotajlo, Jan Leike, David Lorell, Max Nadeau, Richard Ngo, Buck Shlegeris, Rohin Shah, Carl Shulman, Nate Soares, John Wentworth, Mark Xu, and many others for comments and/or discussion. And thanks to Claude for comments and suggestions as well.
1. This essay is cross-posted from my website (as are the other essays on this substack). Normally I link people to the website versions of my essays, but I’m trying out linking to the substack versions instead.
2. In 2021, I wrote a report about this problem and the probability of failure; and in 2023, I wrote another report about the version of the problem that worries me most – what I called “scheming.”
3. Some content in the series is drawn/adapted from content that I've posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I'm aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.