Being nicer than Clippy

Let's be the sort of species that aliens wouldn’t fear the way we fear paperclippers.

Joe Carlsmith

Jan 16, 2024

(Podcast version here, or search for “Joe Carlsmith Audio” on your podcast app.

This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a summary of the essays that have been released thus far.)

In my last essay, I discussed a certain kind of momentum, in some of the philosophical vibes underlying the AI risk discourse,1 towards deeming more and more agents – including: human agents – "misaligned" in the sense of: not-to-be-trusted to optimize the universe hard according to their values-on-reflection. We can debate exactly how much mistrust to have in different cases, here, but I think the sense in which AI risk issues can extend to humans, too, can remind us of the sense in which AI risk is substantially (though, not entirely) a generalization and intensification of the sort of "balance of power between agents with different values" problem we already deal with in the context of the human world. And I think it may point us towards guidance from our existing ethical and political traditions, in navigating this problem, that we might otherwise neglect.

In this essay, I try to gesture at a part of these traditions that I see as particularly important: namely, the part that advises us to be "nicer than Clippy" – not just in what we do with spare matter and energy, but in how we relate to agents-with-different-values more generally. Let me say more about what I mean.

Utilitarian vices

As many have noted, Yudkowsky's paperclip maximizer looks a lot like a total utilitarian. In particular, its sole aim is to "tile the universe" with a specific sort of hyper-optimized pattern. Yes, in principle, the alignment worry applies to goals that don't fit this schema (for example: "cure cancer" or "do god-knows-whatever kludge of weird gradient-descent-implanted proxy stuff"). But somehow, especially in Yudkowskian discussions of AI risk, the misaligned AIs often end up looking pretty utilitarian-y, and a universe tiled with something – and in particular, "tiny-molecular-blahs" – often ends up seeming like a notably common sort of superintelligent Utopia.

What's more, while Yudkowsky doesn't think human values are utilitarian, he thinks of us (or at least, himself) as sufficiently galaxy-eating that it's easy to round off his "battle of the utility functions" narrative into something more like a "battle of the preferred-patterns" – that is, a battle over who gets to turn the galaxies into their favored sort of stuff. The AIs want to tile the universe with paperclips; the humans, in Yudkowsky's world, want to tile it with "Fun." (Tiny-molecular-Fun?)

But the problem Yudkowsky talks about most – AIs killing everyone – isn't actually a paperclips vs. Fun problem. It's not a matter of your favorite uses for spare matter and energy. Rather, it's something else.

Thus, consider utilitarianism. A version of human values, right? Well, one can debate. But regardless, put utilitarianism side-by-side with paperclipping, and you might notice: utilitarianism is omnicidal, too – at least in theory, and given enough power. Utilitarianism does not love you, nor does it hate you, but you're made of atoms that it can use for something else. In particular: hedonium (that is: optimally-efficient pleasure, often imagined as running on some optimally-efficient computational substrate).

But notice: did it matter what sort of onium? Pick your favorite optimal blah-blah. Call it Fun instead if you'd like (though personally, I find the word "Fun" an off-putting and under-selling summary of Utopia). Still, on a generalized utilitarian vibe, that blah-blah is going to be a way more optimal use of atoms, energy, etc than all those squishy inefficient human bodies. They never told you in philosophy class? It's not just organ-harvesting and fat-man-pushing. The utilitarians have paperclipper problems, too.2

Oh, maybe you heard this about the negative utilitarians. "Doesn't your philosophy want to kill everyone?" But the negative utilitarians protest: "so does the classical version!" And straw-Yudkowsky, at least, is not surprised. In straw-Yudkowsky's universe, killing everyone is, like, the first thing that (almost) any strong-enough rational agent does. After all, "everyone" is in the way of that agent's yang.

But are foomed-up humans actually this omnicidal? I hope not. And real-Yudkowsky, at least, doesn't think so. There's a bit in his interview with Lex Fridman, where Yudkowsky tries to get Lex to imagine being trapped in a computer run by extremely slow-moving aliens who want their society to be very different from how Lex wants it to be (in particular: the aliens have some sort of equivalent of factory farming). Yudkowsky acknowledges that Lex is presumably "nice," and so would not, himself, actually just slaughter all of these aliens in the process of escaping. And eventually, Lex agrees.

What is this thing, "nice"? Not, apparently, the same thing as "preferring the right tiny-molecular-pattern." Existing creatures are unlikely to be in this pattern by default, so if that's the sum total of your ethics, you're on the omnicide train with Clippy and Bentham. Rather, it seems, niceness is something else: something where, when you wake up in an alien civilization, you don't just kill everyone first thing, even though you're strong enough to get away with it. And this even-though (gasp) their utility functions are different from yours. What gives?

"Something very contingent and specific to humans, or at least to evolved creatures, and which won't occur in AIs by default in any way we'd like" answers Yudkowsky. And maybe so.3 But I'm interested, here, not in whether AIs will be nice-like-us, but rather, in understanding what our niceness consists in, and what it might imply about the sorts of otherness and control issues I've been talking about in this series.

In particular: a key feature of niceness, in my view, is some sort of direct responsiveness to the preferences of the agents you're interacting with. That is, "nice" values give the values of others some sort of intrinsic weight. The aliens don't want to be killed, and this, in itself, is a pro tanto reason not to kill them. In this sense, niceness allows some aspect of yin into its agency. It is influenced by others; it receives others; it allows itself to channel – or at least, to respect and make space for – the yang of others.

The extreme version of this is preference utilitarianism, which tries to make of itself, solely, a conduit of everyone else. And it might seem, prima facie, an attractive view. In particular: to someone who doesn't like the idea of imposing their own arbitrary, contingent will upon the world, an ideal that instead enacts some sort of "universal compromise will" (i.e., the combination of everyone's preferences) can seem to regain the kind of objective and other-centered footing that anti-realism about ethics threatens to deny. But as I've written about previously, I think the appeal of a pure preference utilitarianism fades on closer scrutiny.4 Specifically: I think it founders on possible people, on paperclippers, and especially on sadists.

But rejecting a pure preference utilitarianism does not mean embracing a stance that refuses to ever give the preferences of others intrinsic weight.5 And my sense is that sometimes the AI safety discourse goes too far in this respect. It learns, from paperclippers, the strange and unappealing places that the preferences of arbitrary others can lead. Indeed, Yudkowsky takes explicit steps to break his audience's temptation towards sympathy with Clippy's preferences (this is the point of the abstract notion of "paperclips"), and to place Clippy's agency firmly in the role of "adversary" (see, e.g., the "true prisoner's dilemma"). And against such a backdrop, it's easy (though: not endorsed by Yudkowsky) for the idea that preferences like Clippy's deserve any intrinsic weight to fall out of the picture. After all: Clippy doesn't give our preferences any weight. And aren't we and Clippy ultimately alike, modulo our favored blah-blah-onium?

No. In addition to liking happier onium than Clippy, we are nicer than Clippy to agents-with-different-values. Or: we should be. Indeed, I think we should strive to be the sort of agents that aliens would not fear the way Yudkowsky fears paperclippers, if the aliens discovered they were on the verge of creating us. This doesn't mean we should just adopt the alien preferences as our own – and especially not if the stuff they like is actively evil rather than merely meaningless (more below). But it does mean, for example, not killing them. But also: actively helping them (on their own terms) in cheap ways, treating them with respect and dignity, not enslaving them or oppressing them, and more.6

Alien alignment researcher thinking about p(doom)

That is: human values themselves have stuff to say about how we should treat agents-with-different-values – including, non-humans. Indeed, a huge portion of our ethics and politics ends up dealing with this in one form or another. AI otherness will be new, yes – but we have deep, richly textured, and at-least-somewhat battle-tested traditions to draw on in orienting towards it. Too often, utilitarian vibes forget about these traditions ("isn't it all just an empirical question about what-causes-the-utils?"). And too often, fear that the agents-with-different-values might hurt us makes us forget, too (which, I re-emphasize, isn't to say that agents-with-different-values won't hurt us – cf all this stuff about bears and Nazis and the-brutality-of-nature etc in the previous essays). But faced with a new class of others/fellow-creatures/potential-threats, we should be drawing on every source of wisdom we can.

Boundaries

Let me give an example of ways in which bringing to mind some of the less utilitarian dimensions of human ethics can make a difference to how we orient towards AI systems with values different from our own.

In "Does AI risk 'other' the AIs?," I mentioned two worries the AI alignment discourse has about paperclippers:

1. That they'll kill everyone (and relatedly: violate people's basic rights, steal people's stuff, and violently overthrow the government).
2. That they'll gain power in a way that results in their values (rather than human values) steering the trajectory of earth-originating civilization, thereby leading to a future of ~zero value.

These two worries are often lumped together under the more unified concern that the AIs will have the "wrong values." After all, if they had the right values, presumably they would do neither of these things.

But the two worries are importantly distinct.7 For one thing, as has been oft-noted, different human ethical views might disagree about their respective importance. But beyond this, these two worries interact very differently with our existing ethical and political norms governing how agents with different values should relate to one another.

In particular: as a civilization, we have extremely deep and robust norms prohibiting agents from doing worry-number-1-style behavior: i.e., killing other people, stealing other people's stuff, and trying to overthrow the government (though of course, there are exceptions and complexities). That is, worry-number-1 casts the AIs in a role that triggers very directly our sense that we are dealing with aggressors who are violating important boundaries – boundaries that lie at the core of human cooperative arrangements – and whose behavior therefore warrants unusually strong forms of defensive response. For example: if someone is breaking into your home with nano-bots trying to kill you, you are morally permitted – on the basis of self-defense – to do things that would otherwise be impermissible (even to save your own life) in other contexts: for example, killing them (where this is necessary and proportionate).8 Similarly: you are justified in doing things to people who are invading your country that you aren't justified in doing if they aren't invading your country, and so forth. The misaligned AIs, according to worry-number-1, are enemies of this deep and familiar sort.9

"Hitler watching German soldiers march into Poland in September 1939." An example of a worry-number-1-style boundary violation. (Image source here.)

But what of worry-number-2? Here, hmm: if we take worry-number-1 fully off the table, I think it becomes quite a bit less clear what standard (western, liberal, broadly democratic) ethical and political norms have to say about worry-number-2 on its own. To see this, consider the following thought experiment (caveat: I'm really, really not saying that misaligned AIs will be like this).

Imagine a liberal society very much like our own, except with the addition of one extra human cultural group: namely, the humans-who-like-paperclips. The humans-who-like-paperclips are a sect of humans that arose at some point in the sixties and has been growing ever since. They are meticulously law-abiding, kind, and cooperative, but they have one weird quirk: the main thing they all want to do with their personal resources is to make paperclips. Passing by a house owned by a human-who-likes-paperclips, you'll often see large, neatly-sorted stacks of paperclip boxes in their backyards, and through the windows of their garages, and sometimes in the living rooms. The richer humans-who-like-paperclips own whole warehouses. The paperclip industry is booming.

Yeah sometimes he just stands there looking at them...

Now, let's start by noticing that in this context, it's not at all clear that "the humans-who-like-paperclips have different values from us" qualifies as a problem, at least by the lights of basic western, liberal norms (here I mean liberalism in the political-philosophy sense roughly at stake in this Wikipedia page, rather than in the "liberals vs. republicans" sense). What the humans-who-like-paperclips do with their private resources, and in the privacy of their homes/backyards, is their own business, conditional on its compatibility with certain basic norms around harm, consent, and so forth. After all: Alicia down the street spends her free time and money listening to noise music; Jim sits around watching trashy TV in a drunken haze; Felipe has sex with other men; Maria collects stamps; and Jason is Mormon. Are the humans-who-like-paperclips importantly different? What happened to liberal tolerance?

Now, of course, utilitarianism-in-theory was never, erm, actually very tolerant. Utilitarianism is actually kinda pissed about all these hobbies. For example: did you notice the way they aren't hedonium? Seriously tragic. And even setting aside the not-hedonium problem (it applies to all-the-things), I checked Jim's pleasure levels for the trashy-TV, and they're way lower than if he got into Mozart; Maria's stamp-collecting is actually a bit obsessive and out-of-balance; and Mormonism seems too confident about the optimal amount of coffee. Oh noes! Can we optimize these backyards somehow? And Yudkowsky's paradigm misaligned AIs are thinking along the same lines – and they've got the nano-bots to make it happen.

I sometimes think about this sort of vibe via the concept of "meddling preferences." That is: roughly, we imagine dividing up the world into regions ("spaces," "spheres") that are understood as properly owned or controlled by different agents/combinations of agents. Literal property is a paradigm example, but these sorts of boundaries and accompanying divisions-of-responsibility occur at all sorts of levels – in the context of bodily autonomy, in the context of who has the right to make what sort of social and ethical demands of others, and so forth (see also, in more interpersonal contexts, skills involved in "having boundaries," "maintaining your own sovereignty," etc).

Some norms/preferences concern making sure that these boundaries function in the right way – that transactions are appropriately consensual, that property isn't getting stolen, that someone's autonomy is being given the right sort of space and respect. A lot of deontology, and related talk about rights, is about this sort of thing (though not all). And a lot of liberalism is about using boundaries of this kind to help agents with different values live in peace and mutual benefit.

Meddling preferences, by contrast, concern what someone else does within the space that is properly "theirs" – space that liberal ethics would often designate as "private," or as "their own business." And being pissed about people using their legally-owned and ethically-gained resources to make paperclips looks a lot like this. So, too, being pissed about noise-musicians, stamp-collectors, gay people, Mormons, etc. Traditionally, a liberal asks, of the humans-who-like-paperclips: are they violating any laws? Are they directly hurting anyone? Are they [insert complicated-and-contested set of further criteria]? If not: let them be, and may they do the same towards "us."

Humans-who-like-stamps, at a convention. (Image source here.)

Many "axiologies" (that is, ways of evaluating the "goodness" of the world) are meddling in a way that creates tension with this sort of liberal vibe. After all: axiologies concern the goodness of the entire world. Which means: all the "regions." In this sense, axiology is no respecter of boundaries. Of course, you could have an axiology that prefers worlds precisely insofar as they obey some set of boundary-related norms, and which has no preferences about what-happens-in-back-yards, but one finds this rarely in practice. To the contrary, many axiologies are concerned, for example, with the welfare of the agents involved (the average welfare, the total welfare, etc), or the beauty/friendship/complexity/fun etc occurring in the different regions. And if you give people liberal freedoms in their own spheres, sometimes they make those spheres less-than-optimally welfare-y/beautiful/complex/fun etc. Thus that classic tension between goodness and freedom (cf. "top down" vs. "bottom up"; and see also Nozick's critique of "end-state" and "patterned" principles of justice).

The "utility functions" that Yudkowskian rational agents pursue need not be axiologies in a traditional sense. But somehow, they often end up pretty axiology-vibed.10 No wonder, then, that Clippy is no respecter of boundaries, either. Indeed, in many respects, Yudkowsky's AI nightmare is precisely the nightmare of all-boundaries-eroded. The nano-bots eat through every wall, and soon, everywhere, a single pattern prevails. After all: what makes a boundary bind? In Yudkowsky's world (is he wrong?), only two things: hard power, and ethics. But the AIs will get all the hard power, and have none of the ethics. So no walls will stand in their way.

But I claim that humans often have the ethics bit.11 Or at least, human liberals, on their current self-interpretation. Of course, this isn't to say that liberals are OK with anything happening inside "walled" zones that might be intuitively understood as "private." For example: it's a contested question what aspects of a child's life should be under the control of a parent, but clearly, you aren't allowed to abuse or torture your own children (or anyone else), even in your own living room with the blinds drawn. And similarly, at a larger scale: the borders between nation states are a paradigm example of a certain kind of "boundary," but we believe, nevertheless, that certain sorts of human-rights-abuses inside a sovereign nation warrant infringing this boundary and righting the relevant wrong.

Often, though, these sorts of boundary infringements are justified precisely insofar as they are necessary to prevent some other boundary violation (e.g., child abuse, genocide) taking place within the first boundary. Indeed, Yudkowsky often turns to this sort of thing when he tries to prompt humans to behave in a manner analogous to a paperclipping AI. Thus, in "Three Worlds Collide," he specifically has humans encounter (and then: decide to intervene on violently) an alien species that eats their own conscious, suffering children – rather than, e.g., a species that just spends its resources making paperclips. And in trying to induce Lex to try to take over an alien world he wakes up in ("don't think of it as 'world domination'," Yudkowsky says with a grin, "think of it as 'world optimization'"), Yudkowsky specifically appeals to the idea that the alien civilization involves a lot of harm and suffering – via war, or via some equivalent of factory farming – that Lex could alleviate, rather than to the idea that the aliens use their resources (and still less: their atoms) on boring/meaningless/sub-optimal things.

And to be clear: I agree that preventing harm, suffering, genocide, and so forth can justify infringing otherwise-important boundaries. (Indeed, I think that as it becomes possible to create suffering and harm in digital minds using personal computers, we're going to have to grapple with new tensions in this respect. Your backyard is yours, yes: but just as you can't abuse your children there, neither can you abuse digital minds.) But I also want to be clear that what's going on with the part of human values that says "no torturing people even in your own backyard" is much more specific, and much more compatible with "niceness" in other contexts, than what's going on with an arbitrary rational optimizer stealing your atoms to make its favored form of blah-blah-onium.

For example: if Lex were to wake up in a civilization of peaceful paperclippers, whose civilization involves no suffering (but also, let's say, very little happiness), but who spend all of their resources on paperclips, it seems very plausible to me that the right thing for Lex to do is to mostly leave them alone, rather than to engage in some project of world-domination/optimization (maybe Lex escapes to some other planet, but he doesn't take over the alien government and turn their paperclip factories into Fun-onium factories instead). And this even though Lex likes fun a lot more than paperclips.

Yudkowsky, to his credit, is attuned to this aspect of human ethics (the humans in Three Worlds Collide, for example, look for ways to respect and preserve baby-eater culture while still saving the babies) – but his rhetoric can easily leave it in the background. For example, in trying to induce Lex to world-dominate/optimize, Yudkowsky reminds him: "the point is: they want the world to be one way, you want the world to be a different way." But for a liberal: that's not good enough. All the time, my preferences conflict with the preferences of others. All the time, according to me, they could be using their private resources more optimally. Does this mean I dominate/optimize their backyards as soon as I'm powerful enough to get away with it? Not, I claim, if I am nice.

Of course, an even-remotely-sophisticated ethics of "boundaries" requires engaging with a ton of extremely gnarly and ambiguous stuff. When, exactly, does something become "someone's"? Do wild animals, for example, have rights to their "territory"? See all of the philosophy of property for just a start on the problems. And aspirations to be "nice" to agents-with-different-values clearly need ways of balancing the preferences of different agents of this kind – e.g., maybe you don't steal Clippy's resources to make fun-onium; but can you tax the rich paperclippers to give resources to the multitudes of poor staple-maximizers?12 Indeed, remind me your story about the ethics of taxation in general?

I'm not saying we have a settled ethic here, and still less, that its rational structure is sufficiently natural and privileged that tons of agents will converge on it. Rather, my claim is that we have some ethic here – an ethic that behaves towards "agents with different values" in a manner importantly different from (and "nicer" than) paperclipping, utilitarianism, and a whole class of related forms of consequentialism; and in particular, an ethic that doesn't view the mere presence of (law-abiding, cooperative) people-who-like-paperclips as a major problem.

And such an ethic seems well-suited, too, to handling the possibility – discussed in the previous essay – that different humans might end up with pretty different values-on-reflection as well. Liberalism does not ask that agents sharing a civilization be "aligned" with each other in the sense at stake in "optimizing for the same utility function." Rather, it asks something more minimal, and more compatible with disagreement and diversity – namely, that these agents respect certain sorts of boundaries; that they agree to transact on certain sorts of cooperative and mutually-beneficial terms; that they give each other certain kinds of space, freedom, and dignity. Or as a crude and distorting summary: that they be a certain kind of nice. Obviously, not all agents are up for this – and if they try to mess it up, then liberalism will, indeed, need hard power to defend itself. But if we seek a vision of a future that avoids Yudkowsky's nightmare, I think the sort of pluralism and tolerance at the core of liberalism will often be a more promising guide than "getting the utility function that steers the future right."

What if the humans-who-like-paperclips get a bunch of power, though?

Let's keep going, though, with the thought experiment about the humans-who-like-paperclips, until it hits on worry-number-2 more directly. In particular: thus far the humans-who-like-paperclips are just one human group among others. But what happens if we imagine them becoming the dominant human group – albeit, via means entirely compatible with respect for the boundaries of others, and with conformity to liberal ethics and laws?

Thus, let's say that the humans-who-like-paperclips are quite a bit smarter, more productive, and better coordinated than basically everyone else. As a result of their labors in the economy and their upstanding citizenship, humans in general are richer, happier, stronger, and healthier relative to a world without them. But for closely related reasons, and without violating any legal or ethical norms (all the economic transactions they engage in are consensual, fully-informed, and mutually beneficial), they are gradually accumulating more and more power. Their population is growing unusually fast; they own a larger and larger share of capital; and they exert more and more influence over politics and public opinion – albeit, in entirely above-board ways (much more above board, indeed, than many of the other groups vying for influence). Analysts are projecting that in a few decades, humans-who-like-paperclips will be the most powerful human group, for most measures of power – more powerful, indeed, than all the other groups combined. And they're predicting that for various reasons to do with the pace of technological development, this dominance will grant the humans-who-like-paperclips enormous influence over the trajectory of humanity's future.

Now, it's natural to wonder whether, once the humans-who-like-paperclips achieve sufficient dominance, all this niceness and cooperativeness and good-citizenship and respect-for-the-law stuff might fall by the wayside, and whether they might start looking more hungrily at your babies and your atoms. But suppose that somehow, you know that this won't happen. Rather, the humans-who-like-paperclips will continue to meticulously respect legal and ethical norms (or at least, the sort of minimal, boundary-related ethical norms I gestured at above). No one will get nano-bot-ed; the humans-who-like-paperclips won't sneak any suffering or slavery into their paperclip piles; and the humans-who-like-other-stuff (e.g. "Fun") will be able to happily pursue this other stuff from within secure backyards that are extremely ample by today's standards. But most of the resources of the future will go towards paperclips regardless.13

How bad is this outcome? Different ethical views will disagree, and a less-crude analysis would obviously include factors other than "conformity to very basic liberal norms" and "what happens with the galaxies." Crudely, my own view is that the galaxy thing is actually a huge deal, and that even with basic liberal norms secure, turning ~all reachable resources into literal paperclips would be a catastrophic waste of potential.14 But I also want to acknowledge that this is a very different sort of big deal than someone, or some group, killing everyone else and taking their stuff (and note that distant galaxies are not, in any meaningful sense, "ours," despite transhumanist talk about "our cosmic endowment"). In particular: the pure galaxies thing implicates different, and more fraught, ethical questions about otherness and control.

Thus: once we specify that basic liberal norms will be respected regardless, further disputes-over-the-galaxies look much more like a certain kind of raw competition for resources. It's much less akin to a country defending itself from an invader, and much more akin to one country racing another country to settle and control some piece of currently-uninhabited territory.15 The dispute is less about upholding the basic conditions of cooperation and peace-among-differences, and more about whose hobbies get-done-more; who gets the bigger backyard. Does it all come down to land use?

Well, even if it did: land use is actually a very big deal.16 And to be clear: I don't like paperclips any more than you do. I much prefer stuff like joy and understanding and beauty and love. But I also want to be clear about what sort of ground I am standing on, according to my own values, when I fight for these things in different ways in different contexts. And according to my own values: it is one thing to defend your boundaries and your civilization's basic norms against aggressors and defectors. It is another to compete with someone who prefers-different-stuff, even while those norms are secure. And it is a third, yet, to become an aggressor/defector yourself, in pursuit of the stuff-you-prefer. But to talk, only, about "having different values" – and especially, to assume that the main thing re: values is your favored use of unclaimed energy/matter, your preferred blah-blah-onium – obscures these distinctions.

In particular: the defending-boundaries thing is where liberalism goes most readily to identify the forms of "otherness" that are not OK: namely, otherness done Nazi-style; otherness that actually, really, is trying to kill you and eat your babies. But the otherness at stake in "cooperative and nice, but still has a different favorite-use-of-resources" is quite different. It's the sort of otherness that liberalism wants to tolerate, respect, include, and even celebrate. Cf noise music, Mormonism, and that greatest test of tolerance: sub-optimally-efficient pleasure. Such tolerance/respect/etc is compatible with certain kinds of competition, yes. But not fighting-the-Nazis style. Not, for example, with the same sort of moral righteousness; and relatedly, not with the same sorts of justifications for violence and coercion.

Indeed, importantly not, if you want peace and diversity both. After all, the wider the set of differences-in-values you allow to justify violence and coercion, the more you are asking either for violence/coercion, or for everyone-having-the-same-values. Or perhaps most likely: violence/coercion in the service of everyone-having-the-same-values. Cf cleansing, purging. Like how the paperclipper does it. But we can do better.

An aside on AI sentience

I want to pause here to address an objection: namely, "Joe, all this talk about tolerance and respect etc – for example, re: the humans-who-like-paperclips – is assuming that the Others being tolerated/respected/etc are sentient. But the AIs-with-different-values – even: the cooperative, nice, liberal-norm-abiding ones – might not even be sentient! Rather, they might be mere empty machines. Should you still tolerate/respect/etc them, then?"

My sense is that I'm unusually open to "yes," here.17 I'm not going to try to defend this openness in depth here, but in brief: while I take consciousness very seriously,18 and definitely care a lot about something-in-the-vicinity-of-consciousness, I don't feel very confident that our current concepts of "sentience" and "consciousness" are going to withstand enough scrutiny to handle the moral weight that some people currently want to put on them;19 I think focus on consciousness does poorly on golden-rule-like tests when applied to civilizations with different conceptions of the precise sorts of functional mental architectures that matter (e.g., aliens that would look at us and say "these agents aren't schmonscious, because their introspection doesn't have blah-precise-functional-set-up" – see e.g. this story for an intuition pump); and I think some of the more cooperation-focused origins and functions of niceness/liberalism/boundaries (including: functions I discuss below re: liberalism and real-politik, where sentience more clearly doesn't matter20) don't point towards consciousness as a key desideratum (and note that I'm here specifically talking about the bits of ethics that are cooperation-flavored, rather than the bits associated with what you personally do in your backyard).21 Plus, more generally, I think this is all sufficiently confusing territory that we should err on the side of caution and inclusivity in allocating our moral concern, rather than saying e.g. "whatever, this cognitively-sophisticated-agent-with-preferences isn't conscious – by which I mean, um, that we-know-not-what-thing, that least-understood-thing – so it's fine to torture it, deprive it of basic rights, etc."

Of course, if you stop using sentience as a necessary condition for being worthy-of-tolerance/respect etc, then you need to say additional stuff about where you do draw the sorts of lines I discussed a few essays ago: e.g., "OK to eat apples but not babies," "furbies and thermostats don't get the vote," "you can own a laptop but not a slave,"22 and so on.23 And indeed, gnarly stuff. My current best guess here would be to hand-wave about agenty-ness and cognitive sophistication and who-would've-been-a-good-target-for-cooperation-in-other-circumstances – but obviously, one needs to say quite a bit more.

For the purposes of understanding the ethical underpinnings of the AI risk discourse, though, I don't think that we need to resolve questions about whether non-sentient AIs-with-different-values are worthy of tolerance/respect. Why? Because the core bits of the Yudkowskian narrative I've been discussing apply even if all the AIs-with-different-values are sentient. The classic paperclipper-doom story, for example, does not require that the paperclipper be insentient: it still kills all the humans, it still turns the galaxies into paperclips, and that's enough.24 And Yudkowsky himself would find the possibility of conscious AIs, at least, obvious. Where this includes, presumably, conscious paperclippers. (In reality, my sense is that Yudkowsky thinks consciousness unusually scarce – for example, he's skeptical that pigs are conscious. But this view isn't important to his story.) So for now, in talking about tolerating/respecting AIs-with-different-values, I'll just assume they're sentient, and see what follows.

Indeed: did you think it matters a lot, to the Yudkowsky narrative, whether the AI was sentient? If so, then I suspect you are thinking of this narrative as a less familiar story than it truly is. Ultimately, AI risk is not about humans vs. AIs (in that case, it really would be species-ism/bio-chauvinism), or sentience vs. insentience (the AIs might well be sentient). Rather, it's about something more ancient and basic: namely, agents with different values competing for power. So I encourage you: run the story with conscious humans-with-different-values in the place of the AIs-with-different-values – humans to whom you are more immediately inclined to ascribe moral status, rights, citizenship, tolerance-worthiness, and so forth. You want to make sure that you get the differences-in-values different enough, sure (though: "maximize paperclips" is an unfortunate cartoon; thinking about where RLHF + foom leads seems a better guide). And as I said earlier: people with souls can still be enemy soldiers. But if you're finding that words like "human" or "sentient" are making the agents-with-different-values seem substantially less like enemies, then you're not yet fully keyed to the particular sort of conflict that Yudkowsky has in mind.

Giving AIs-with-different-values a stake in civilization

Let me give another example of a place where I worry that a naïve Yudkowskian discourse can too-easily neglect the virtues of niceness and liberalism: namely, the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with.

Thus, consider Yudkowsky's "proposed thing-to-do with an extremely advanced AGI, if you're extremely confident of your ability to align it on complicated targets": namely, use it to implement humanity's "coherent extrapolated volition" ("CEV"). This means, basically: have the AI do what currently-existing humans would want it to do if they were "idealized" (see more here), to the extent those idealized humans would want the same things.

We see, in Yudkowsky's discussion of CEV, some of his effort to implement a less power-grabby ethic than a simple interpretation of his philosophy might imply. That is: Yudkowsky (at least in 2004) is explicitly imagining a team of AGI programmers who are in the position to take over the world and have their particular (idealized) values rule the future (let's set aside questions about the degree of resemblance this scenario is likely to have to the actual dynamics surrounding AGI development, and treat it, centrally, as a thought experiment). And one might've thought, given the apparent convergence of oh-so-many-rational-agents on the advisability of taking over the world, that Yudkowsky's programmers would do the same.25 But he suggests that they should not.

Part of this, says Yudkowsky, is about not ending up like ancient Greeks who impose values on the future they wouldn't actually endorse if they understood better. But that only gets you, in Yudkowsky's ontology, to the programmers making sure to extrapolate their own volitions. It doesn't get you to including the rest of humanity in the process.

What gets you to giving that wider circle a say? Yudkowsky mentions various values – "fairness," "not being a jerk," trying to act as you would wish other agents would act in your place, cooperation/real-politik, not acting like you are uniquely appointed to determine humanity's destiny, and others. I won't interrogate these various considerations in detail here (though see footnote for a bit more discussion).26 Rather, my point is about how far the pluralism they motivate should extend.

In particular: Yudkowsky's "extrapolation base" – that is, the set of agents his process grants direct influence over the future – stops at humanity. But it seems plausible to me that whatever considerations motivate empowering all of humanity, in a thought experiment like this, should motivate empowering certain kinds of AIs-with-different-values as well, at least if we are already sharing the world with such AIs by the time the relevant sort of power is being thought-experimentally allocated. For example, in this thought experiment: if at the time the programmers are making this sort of decision, there are lots of moral-patienty AIs with human-level-or-higher intelligence running around, who happen to have very different values from humans, I think they should plausibly be included in the "extrapolation" base too. After all, why wouldn't they be? "Because they're not humans" is actually species-ism. But absent such species-ism, the most salient answer is "because their values are different from ours, so giving them influence will make the future worse by our lights." But that answer could easily motivate not-empowering many humans as well – and the logic, in the limit, might well prompt the programmers to empower only themselves.

Now, the details here about what it means to empower moral-patienty AIs-with-different-values in the right way get gnarly fast (see e.g. Bostrom and Shulman (2022) for a flavor). Indeed, questions about how to handle the empowerment of such AIs are one of the few places I've seen Yudkowsky, in his words, "give up and flee screaming into the night." See, also, one of his characters' exclamation in the face of a sentient iPhone that's been stalking him, and which begs not to be wiped: "I don't know what the fuck else I'm supposed to do! Someone tell me what the fuck else I'm supposed to do here!" At least as of 2008 (has he written on this since?27), Yudkowsky's central advice, in the face of the moral dilemma posed by creating AI moral patients with different values, seems to be: don't do it, at least until you're much readier than we are. And indeed: yes. Just like how: don't create AGI at all until you're much readier than we are. But unfortunately, in both cases: I worry that we're going to need a better plan.

I won't try to outline such a plan here. Rather, I mostly want to point at the general fact that, insofar as we are in fact aiming to build a world that succeeds at whatever "liberalism" and "boundaries" and "niceness" are trying to do, this world should probably be inclusive, tolerant, and pluralistic with respect to AIs-with-different-values (or at least, moral patient-y ones) as well as humans-with-different-values – at least absent some clear and not-just-species-ist story about why AIs-with-different-values should be excluded. And note, importantly, that this doesn't mean tolerating arbitrarily horrible value systems doing whatever they want, or arbitrarily alien value systems trampling on other people's backyards. This is part of why I think it's worth being clear – indeed, clearer than I've been thus far – about the sorts of values differences liberalism/boundaries/niceness gets fussed about.28 Peaceful, cooperative AIs that want to make paperclips in their backyards – that's one thing. Paperclippers who want to murder everyone; sadists who want to use their backyards as torture chambers; people who demand that they be able to own sentient, suffering slaves – that's, well, a different thing. Yes, drawing the lines requires work. And also: it probably requires drawing on specific human (or at least, not-fully-universal) values for guidance. I'm not saying that liberalism/niceness/boundaries is a fully "neutral arbiter" that isn't "taking a stand." Nor am I saying that we know what stand it, or the best version of it, takes. Rather, my point is that this stand probably does not treat "those AIs-we-share-the-world-with have different values from us" as enough, in itself, to justify excluding them from influence over the society we share.

The power of niceness, community, and civilization

So far, I've been making the case for this sort of inclusivity centrally on ethical grounds. But liberalism/niceness/boundaries clearly have practical benefits as well. Nice people, for example, are nicer to interact with. Free and tolerant societies are more attractive to live in, work in, immigrate to. Secure boundaries save resources otherwise wasted on conflict. And so on. There's a reason so many European scientists – including German scientists – ended up working on the Manhattan Project, rather than with the Nazis; and it seems closely related to differences in "niceness."

Indeed, these benefits are enough, at times, to soften the atheism of certain rationalists. For example: Scott Alexander.29 As I mentioned in a previous essay: Alexander, in writing about liberalism/niceness/boundaries (e.g. here and here), attributes to it a kind of mysterious power. "Somehow Elua is still here. No one knows exactly how. And the gods who oppose Him tend to find Themselves meeting with a surprising number of unfortunate accidents." Liberalism/niceness/boundaries is not, for Alexander, just another utility function. Still less is it actively weak. Rather, it is a "terrifying unspeakable elder God." "Elua is the god of flowers and free love and he is terrifying. If you oppose him, there will not be enough left of you to bury, and it will not matter because there will not be enough left of your city to bury you in."

Here, Alexander's vibe is un-Yudkowskian in a number of ways. First, Alexander seems to want to trust, at least partly, in something mysterious – namely, the ongoing power of liberalism/niceness/boundaries, which Alexander admits he does not fully understand. Indeed, I think that various more consequentialist-y stories about the justification for deontological-y norms and virtues – including the ones at stake in liberalism/niceness/boundaries – have some of this flavor as well. That is: consequentialists often argue that you should abide by deontological norms, or be blah sort of virtuous, even when it seems like doing so will make things worse, because somehow, actually, doing so will make things better (for example: because at the level of choosing a policy, or adjusting for biases, or dealing with the constraints of a bounded mind, deontology/virtue does better than consequentialist calculation). Deontology/virtue, on this story, is its own form of power-to-achieve-your-goals – but a form that remains at least somewhat cognitively inaccessible while it is being put-into-practice (otherwise, it could be more fully subsumed within a direct consequentialist calculation). So trust in deontology/virtue, in the hard cases, requires trusting in something not-fully-calculated. (Though of course, there are tons of ways to trust-wrongly, here, too.)30

But beyond his willingness to trust-in-something-mysterious, Alexander's attribution of power to Elua is also in tension with certain kinds of orthogonality between ethics and optimization power. That is, to the extent that Elua represents a set of values, Elua, in a Yudkowskian ontology, is orthogonal to intelligence at least – and thus, to a key source of power. "Paperclips," after all, are neither elder Gods nor younger Gods, neither unspeakable nor speakable. They are, rather, just another direction that power can try to drive an indifferent universe. Why would niceness be any different?

Well, we can think of reasons. Plausibly, for example, the indifferent universe is steered more easily in some directions vs. others. Indeed, the social/evolutionary histories of niceness/boundaries/liberalism are themselves testaments to the ways in which the indifferent universe favors Elua under certain conditions – favoritism that plays a key role in explaining why we ended up valuing Elua-stuff intrinsically, to the extent we do. In this sense, our values are not fully orthogonal to the "universe's values." True, we are not simple might-makes-right-ists, who love, only, whatever is in fact most powerful. But our hearts have, in fact, been shaped by power – so we should not be all that surprised if the stuff we love is also powerful.

Will power of this kind persist into a post-AGI future – and in particular, in a way that should motivate extending various sorts of tolerance and inclusivity towards AIs-with-different-values on pragmatic rather than purely ethical grounds? My sense is that Yudkowskian-ism often imagines that it won't. In particular: the practical benefits of liberalism/niceness/boundaries often have to do with the ways in which they allow agents with different values, but broadly comparable levels of power, to cooperate and to live together in harmony rather than to engage in conflict. But as I discussed above: Yudkowsky is typically imagining a post-AGI world in which AIs-with-different-values and humans do not have broadly comparable levels of power. Rather, either AIs-with-different-values have all the power, or (somehow, due to a miracle) humans do. So finding a modus vivendi can seem less practically necessary.

Again, I'm not going to delve into these dynamics in any detail, but I'm skeptical that we should be writing off the purely practical benefits of extending various forms of niceness/liberalism/boundaries to AIs-with-different-values, especially from our current epistemic position. In particular: I think there may well be crucial stages along the path to a post-AGI future in which AIs-with-different-values and humans do indeed have sufficiently comparable levels of power, at least in expectation, that the practical virtues of niceness/liberalism/boundaries may well have a positive role to play – including: a role that helps us avoid having to put our trust in any foomed-up concentration of power, whether human or artificial. I am especially interested, here, in visions of a post-AGI distribution of power that would give various AIs-with-different-values more of an incentive, ex ante, to work with humans to realize the vision in question, as a part of a broadly fair and legitimate project, rather than as part of an effort, on humanity's part, to use (potentially misaligned and unwilling) AI labor to empower human values in particular. But fleshing this out is a task for another time.

Is niceness enough?

My main aim, in this essay, has been to point at the distinction between a paradigmatically paperclip-y way of being, and some broad and hazily defined set of alternatives that I've grouped under the label "liberalism/niceness/boundaries" (and obviously, there are tons of other options as well). Too often, I think, a simplistic interpretation of the alignment discourse imagines that humans and paperclippers are both paperclippy at heart – but just, with a different favored sort of stuff. I think this picture neglects core aspects of human ethics that are, themselves, about navigating precisely the sorts of differences-in-values that the possibility of AIs-with-different-values forces us to grapple with. I think that attention to these aspects of human ethics can help us be better than the paperclippers we fear – not just in what we do with spare resources, but in how we relate to the distribution of power amongst a plurality of value systems more broadly. And I think it may have practical benefits as well, in navigating possible conflicts both between different humans, and between humans and AIs.

That said: depending on how exactly we interpret liberalism/niceness/boundaries, it's also possible to imagine futures compatible with various versions (and especially, minimal versions – e.g., property rights are respected, laws don't get broken, laws are passed democratically, etc), but which are nevertheless bleak and even horrifying in other respects – for example, because love and joy and beauty and even consciousness have vanished entirely from the world.31 In this sense, and depending on the details, the bits of ethics I've been gesturing at here aren't necessarily enough, on their own, for even a minimally good future (let alone a great one). In particular: absent help from an indifferent universe, in order to have substantive amounts of love/joy/beauty in the future, you need agents who care about these things having enough power to keep them around to the relevant degree – and different conceptions of liberalism/niceness/boundaries may not guarantee this. So even beyond the yin of being nice/liberal/boundary-respecting towards agents who don't like love/joy/beauty, some kind of active yang, in the direction of love/joy/beauty etc, is necessary, too.32 In the next essay, I'll return to questions about this sort of yang – and in particular, questions about whether it involves attempting to exert inappropriate levels of control.

In particular, vibes related to the "fragility of value," "extremal Goodhart," and "the tails come apart."

Though in fairness, forms of "threshold deontology" that introduce constraints that can only be violated if the stakes are high enough – e.g., you can only push the fat man if it will save x lives, where x is quite a bit larger than utilitarianism would suggest – face this issue, too. E.g., the onium at stake can quickly become more-than-x. Thanks to Will MacAskill for discussion here.

See here for some debate. Part of my argument, in this essay, is that we should not do the "teach the aliens the value of friendship" thing that Soares seems to endorse here.

Though: I don't think it disappears.

Remember: caring about an agent's preferences is conceptually distinct from caring about her welfare.

And I think we should be open to doing this even if they aren't sentient – more below.

Hanson's critique of the alignment discourse emphasizes the distinction.

As a maybe-clearer example: if a team of five people breaks into your house trying to kill you, you can kill all of them if necessary to save yourself. But if you are on the way to the hospital and the only way to save yourself is to run over five people on the road, you aren't permitted to do it.

Though note that we're creating them – and doing so, in the AI risk story, without adequate care to avoid the relevant sorts of aggressions, for the sake of other not-always-fully-laudatory motives. This complicates the moral narrative.

Maybe something about "consequentialism" in AIs-that-get-things-done is to blame? But even if you add in deontological constraints, Yudkowsky (as I understand him) predicts that the AIs will simply pursue the "nearest unblocked neighbor" of those constraints.

Though: human society today often also puts adequate hard power behind its walls, given the current attempted-invasions. And let's keep it that way, even as the invasions get oomphier.

Thanks to Howie Lempel for discussion of this point.

We can wonder why the existing political order lets this happen, but let's set this aside for now.

Roughly twenty billion galaxies, according to Toby Ord's The Precipice, p. 233.

"Like the colonialists?" Well: the "uninhabited" bit is really important – at least if you're a boundary-respecter. But let's not pretend that colonialist vibes are so far off in the distance, here.

In particular: lots of human and animal lineages have suffered, died, and disappeared for lack of land (and this is not to mention: having their land actively stolen, invaded, and so on). And what are most wars fought over? Thanks to Carl Shulman for discussion here.

Though I remain pretty uncertain/confused about various of the issues here. And obviously, it would be great to first get a bunch more ethical clarity about this sort of thing before having to make decisions about it.

More seriously than e.g. the illusionists.

E.g., I worry it'll end up looking like people saying "if an agent doesn't have phlogiston, it doesn't deserve any moral weight."

Game theory works regardless of whether the agents you're interacting with are conscious.

In the context of choosing-what-to-build-in-your-backyard, I feel much happier to focus directly on getting the "thing-that-matters-in-the-vicinity-of-what-we-currently-call-consciousness" thing right. But here I'm talking about the bits of ethics that are about relating-to-other-backyards (but: still in a terminal-values sense, not a game-theory sense).

We're assuming that you're not running any slaves on the laptop.

Thanks to Howie Lempel for discussion.

And note that just because it's sentient doesn't mean the world it creates involves a lot of sentience.

Though perhaps not: that Yudkowsky would advise them to do the same.

For some of these rationales, note that it's not actually clear how this gets him away from the programmers just extrapolating their own volitions. After all, if their own extrapolated volitions would value fairness, not being a jerk, golden-ruling, etc in the manner in question, then the output of the extrapolation process would presumably reflect this (Yudkowsky uses this sort of dynamic to respond to various other objections to his proposal: e.g., "if that's a good objection, our extrapolated volitions will notice and adjust for it"). And if not, they would have avoided a mistake by their own lights by keeping the circle narrow.

Indeed, in a simple version of Yudkowsky's ontology, it's unclear how the programmers could possibly do better than just extrapolating their own volitions. Their own extrapolated volitions, after all, set the standard (on Yudkowsky's anti-realist ethics) for what the right choice would be. Is Yudkowsky imagining programmers who face the option to make a correct-by-definition choice, and advising them to maybe make a mistake instead?

Well, let's be careful. Some choices can't be unmade – including choices to find out what-you-should-have-done. Suppose, at t1, that your mother is about to drown, and you have a choice between saving her, or asking a genie for advice/service. If you ask the genie "what is the right decision at t1?", it might well answer at t2, "you should have saved your mother, who just drowned." And if you ask it "figure out what I should have done at t1, and then do it," it might be too late. So, too, with the choice to seek power. Power is useful for many values, yes, but famously, obviously, seeking power can compromise your values too. Indeed, it often does, given how many of our ethical values are specifically about regulating who gets what sort of power (cf "boundaries" above) – plus, you know, the power-corrupts thing, the biased-in-favor-of-yourself thing, and so on. And this holds true even if the power in question will grant you arbitrary insight into the values you compromised. If you take-over-the-world in the process of finding out whether you should've taken-over-the-world – well, you can still have fucked up.

And beyond this, certain kinds of cooperation, coordination, and commitment often involve making choices that might seem at the time, from the perspective of a certain kind of narrow rational calculation, like "mistakes." The way, for example, cooperating in a prisoner's dilemma – or paying in the city in "Parfit's hitchhiker" – is a "mistake." The type of mistake that seems, mysteriously, to get made by agents who end up rich, or alive-at-all. Is it a mystery? Sometimes, being the sort of person that others can trust, coordinate with, rely on, get-to-the-pareto-frontier-with, and so on requires being such that you don't just grab power for yourself (or lie, or steal, or crush the outgroup, or throw out the procedural norms of your democracy, or...) when you can get away with it, or think you can – even if that's what would get you the most (extrapolated) utility at the time (at least, for some notion of "would").

And we can talk about other possible reasons why Yudkowsky's programmers might use a wider "extrapolation base" than their own volitions as well (see e.g. Yudkowsky's original paper, and discussion on Arbital here, for longer discussion).

I'm not counting the "Comp sci in 2027" as really laying out a position re: what to do.

For example, in the context of whether animals should be empowered, Yudkowsky worries: what happens if you "uplift" a bear, or a chimp, or an ichneumonid wasp, and it just wants to eat babies, or to sit atop some violent and oppressive dominance hierarchy, or to lay parasitic eggs inside of everyone? And Yudkowsky worries about humans in this respect as well – see, e.g., his discussion of the "selfish bastards" problem here, in which so many present-day humans want sentient, suffering slaves that humanity's CEV says yes. But as I've tried to emphasize: these aren't just any old values differences. Rather, these are precisely the sort of values differences that liberalism/niceness/boundaries gets fussed about.

Though: he was always less of an atheist than Yudkowsky.

And blind hope that blah sort of deontological-seeming behavior will somehow lead to the best consequences can easily fail to grapple with the trade-offs that actual-deontology actually implies.

If you think of libertarianism as encoding a minimal form of niceness/liberalism/boundaries, then a libertarian-ish, Age-of-Em-ish world where eventually all the sentient agents die/lose their property/get outcompeted, but through legal and minimal-ethical-constraint-respecting processes, might be one example here.

And of course, even working on behalf of liberalism/niceness/boundaries is a form of yang in its own right.

A. Strunk

Jan 17, 2024

Whenever you post a new entry to this series, the first thing I do is check the number of vibes. This post has 9 vibes. "When 'yang' goes wrong" beats it with 10 vibes. Every post in this series has at least one vibe. The mean over the series so far is 5.3333 vibes. I'm not particularly against the word vibe, but its overuse is a decent symbol of a writing style that is sometimes trying too hard to be fun.
