| Next: | 3: Design of Friendship systems | Bookmark | |
| Up: | Creating Friendly AI | Monolithic | |
| Prev: | 2: Beyond anthropomorphism |
"Now, Charlie, don't forget what happened to the man who suddenly got everything he wished for."Much of the fictional speculation about rogue AIs centers around the literal interpretation of worded orders, in the tradition of much older tales about accepting wishes from a djinn, negotiating with the fairy folk, and signing contracts with the Devil. In the traditional form, the misinterpretation is malicious. The entity being commanded has its own wishes and is resentful of being ordered about; the entity is constrained to obey the letter of the text, but can choose among possible interpretations to suit its own wishes. The human who wishes for renewed youth is reverted to infancy, the human who asks for longevity is transformed into a Galapagous tortoise, and the human who signs a contract for life everlasting spends eternity toiling in the pits of hell. Gruesome little cautionary tales... of course, none of the authors ever met a real djinn.
"What?"
"He lived happily ever after."
-- Willy Wonka
Another class of cautionary tale is the golem - a made creature which follows the literal instructions of its creator. In some stories the golem is resentful of its labors, but in other stories the golem misinterprets the instructions through a mechanical lack of understanding - digging ditches ten miles long, or polishing dishes until they become as thin as paper (1).
The purpose of "Beyond anthropomorphism" isn't to argue that we have nothing to worry about; rather, the argument is that the Hollywood version of AI has trained us to worry about exactly the wrong things. This holds true whether we think of AIs as enslaved humans, and consider mechanisms of enslavement; or think of AIs as allies, and worry about betrayal; or think of AIs as friends, and worry about whether friendship will hold.
We adopt the "adversarial attitude" towards AIs, worrying about the same problems that we would worry about in a human in whom we feared rebellion or betrayal. We give free rein to the instincts evolution gave us for dealing with the Other. We imagine layering safeguards on safeguards to counter possibilities that would only arise long after the AI started to go wrong. That's not where the battle is won. If the AI stops wanting to be Friendly, you've already lost.
Consider a wish as a volume in configuration space - the space of possible interpretations. In the center of the volume lies a compact set of closely-related interpretations which fulfill the spirit as well as the letter of the wish - in fact, this central compact space arguably defines the "spirit" of the wish. At the borders of the specification are the noncompact fringes that fulfill the letter but not the spirit. There are two basic versions of the Devil's Contract problem: the diabolic (as seen in Resentful Hollywood AIs), in which the entity's pre-existing tendencies push the chosen interpretation out towards the fringes of the definition; and the golemic, in which the entity fails to understand the asker's intentions - fails to see the "answer acceptability gradient" as a human would - and thus chooses a random and suboptimal point in the space of possible interpretations.
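As a toy illustration of the two failure modes just described, the sketch below collapses the "space of possible interpretations" to a single hypothetical number: how far a reading lies from what the asker meant. Everything in it is an assumption made for illustration, including the names (Interpretation, SPIRIT_RADIUS, LETTER_RADIUS), the distances, and the simplification that a diabolic entity's pre-existing tendencies always push it as far toward the fringe as the letter allows. It is a cartoon of the argument, not a proposal for how a real mind would represent wishes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    label: str
    # Hypothetical scalar: how far this reading lies from what the asker meant.
    distance: float


# Illustrative thresholds (assumptions, not taken from the text).
SPIRIT_RADIUS = 0.3  # central region: letter and spirit are both satisfied
LETTER_RADIUS = 1.0  # outer boundary: the wording is still technically satisfied


def satisfies_letter(i: Interpretation) -> bool:
    return i.distance <= LETTER_RADIUS


def satisfies_spirit(i: Interpretation) -> bool:
    return i.distance <= SPIRIT_RADIUS


def diabolic_choice(candidates):
    """Obeys the letter, but its own preferences push it out to the fringe."""
    legal = [c for c in candidates if satisfies_letter(c)]
    return max(legal, key=lambda c: c.distance)


def golemic_choice(candidates, rng):
    """Obeys the letter but sees no acceptability gradient, so it picks blindly."""
    legal = [c for c in candidates if satisfies_letter(c)]
    return rng.choice(legal)


def intent_aware_choice(candidates):
    """Follows the acceptability gradient toward the center of the volume."""
    legal = [c for c in candidates if satisfies_letter(c)]
    return min(legal, key=lambda c: c.distance)


# Made-up readings of a wish for renewed youth.
WISH_FOR_YOUTH = [
    Interpretation("restored to a healthy twenty-five", 0.1),
    Interpretation("reverted to age twelve", 0.6),
    Interpretation("reverted to infancy", 0.95),
    Interpretation("given a portrait of a young stranger", 1.5),  # fails the letter
]

if __name__ == "__main__":
    rng = random.Random()
    print("diabolic:    ", diabolic_choice(WISH_FOR_YOUTH).label)
    print("golemic:     ", golemic_choice(WISH_FOR_YOUTH, rng).label)
    print("intent-aware:", intent_aware_choice(WISH_FOR_YOUTH).label)
```

Note that the third chooser only works because it is handed the distance directly. That is the whole difficulty: the number stands in for the context, intention, and common sense that the asker's own mind supplies, and no amount of careful wording of the wish provides it.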
Some of the better speculations deal with the case of a specific AI winding up with an unforeseen, but nonanthropomorphic, "pre-existing tendency"; or deal with the case of a wish obeyed in spirit as well as letter that turns out to have unforeseen consequences. Mostly, however, it's anthropomorphism; diabolic fairy tales.
Far too much of the nontechnical debate about Friendship design consists of painstakingly phrased wishes with endless special-case subclauses, and the "But what if the AI misinterprets that as meaning [whatever]?" rejoinders. The first two sections of Creating Friendly AI are intended to clear away this debris and reveal the real problem. When we decide to cross the street, we don't worry about Devil's Contract interpretations in which we take "crossing" the street to mean paving it over, or in which we decide to devote the rest of our lives to crossing the street, or in which we turn the whole Universe into crossable streets. There is, demonstrably, a way out of the Devil's Contract problem - the Devil's Contract is not intrinsic to minds in general. We demonstrate the triumph of context, intention, and common sense over lexical ambiguity every time we cross the street. We can trust to the correct interpretation of wishes that a mind generates internally, as opposed to the wishes that we try to impose upon the Other. That is the quality of trustworthiness that we are attempting to create in a seed AI - not bureaucratic obedience, but the solidity and reliability of a living, Friendly will.
Creating a living will requires a fundamentally different attitude than trying to coerce, cajole, or persuade a fellow human. The goal is not to impose your own wishes on the Other, but to achieve unity of will between yourself and the Friendly AI, so that the Friendly will generates the same wishes you generate. You are not turning your wish into an order; you're taking the functional complexity that was responsible for your wish and incarnating it in the Friendly AI. This requires a fundamental sympathy with the AI that is not compatible with the adversarial attitude. It requires something beyond sympathy, an identification, a feeling that you and the AI are the same source. We can rationalize ourselves into believing that the Other will find all sorts of exotic illogics plausible, but the only way we can be really sure that a living will can internally generate a decision is if we generate that decision personally. We persuade the Other but we only create ourselves. Building a Friendly AI is an act of creation, not persuasion or control.
In a sense, the only way to create a Friendly AI - the only way to acquire the skills and mindset that a Friendship programmer needs - is to try and become a Friendly AI yourself, so that you will contain the internally coherent functional complexity that you need to pass on to the Friendly AI. I realize that this sounds a little mystical, since a human being couldn't become an AI without a complete change of cognitive architecture. Still, I predict that the best Friendship programmers will, at some point in their careers, have made a serious attempt to become Friendly - in the sense of following up those avenues where a closer approach is possible, rather than beating their heads against a brick wall. I know of no other way to gain a real grasp on where a Friendly will comes from. The human cognitive architecture does not permit it. We are built to apply reliable rationality checks only to our own decisions and not to the decisions we want other people to make, even if we've decided our motives for persuasion are altruistic. Your personal will is the only place where you have the chance to observe the iterated buildup of decisions, including decisions about how to make decisions, and it is that coherence and self-generation that are required for a Friendly seed AI.
If the human is trying to think like a Friendly AI, and the Friendly AI is looking at the human to figure out what Friendship means, then where does the cycle bottom out? And the answer is that it is not a cycle. The objective is not to achieve unity of purpose between yourself and the Friendly AI; the objective is to achieve unity of purpose between an idealized version of yourself and the Friendly AI. Or, better yet, unity between the Friendly AI and an idealized altruistic human - the Singularity is supposed to be the product of humanity, and not just the individuals who created it. To the extent that an idealized altruistic sentience can be defined in a way that's still compatible with our basic intuitions about Friendliness, an idealized altruistic sentience would be even better.
The paradigm of unity isn't a license for anthropomorphism. It's still just as easy to make mistaken assumptions about AI by reasoning from your human self. The burden is on the Friendly AI programmer to achieve nonanthropomorphic thinking in his or her own mind so that he or she can understand and create a nonanthropomorphic Friendly AI.
As humans, we are goal-oriented cognitive entities, and we choose between Universes - labeling this one as "more desirable", that one "less desirable". This extends to internal reality as well as external reality. In addition to the picture of our current self, we also have a mental picture of who we want to be. Our morality metric doesn't just discriminate between Universes, it discriminates between more and less desirable morality metrics. That's what building a personal philosophy is all about. This, too, is functional complexity that must be incarnated in the Friendly AI - although perhaps in different form. A Friendly AI requires the ability to choose between moralities in order to seek out the true philosophy of Friendliness, regardless of any mistakes the programmers made in their own quest.
There comes a point when Friendliness and the definition of morality, of rightness itself, begin to blur and look like the same thing - begin to achieve identity of source. This feeling is the ultimate wellspring of creativity in the art of Friendly AI. This feeling is the means by which we achieve sufficient understanding to invent novel methods, not just understand existing ideas.
Is this too Pollyanna a view? Does the renunciation of the adversarial attitude leave us defenseless, naked to possible failures of Friendliness? Actually, trying for unity of will buys back everything lost in pointless bureaucratic safeguards, and more - if a failure of Friendliness is a genuine possibility, if you're really being rational about the possible outcomes, if you're a professional paranoid instead of an adversarial paranoid, then a Friendly AI should agree with you about the necessity for safeguards. Once observer-biased beliefs, selfishness, and any hint of an observer-centered goal system have been debunked on the part of the Friendly AI, a human programmer who has successfully eliminated most of her own adversarial attitude should come to precisely the same conclusions as a Friendly AI of equal intelligence. Such a programmer can, in clear conscience, explain to an infant Friendly AI that ve should lend a helping hand to the construction of safeguards - in the simplest case, because a radiation bitflip or a programmatic error might lead to the existence of an intelligence that the current AI would regard as unFriendly.
To get a Friendly AI to do something that looks like a good idea, you have to ask yourself why it looks like a good idea, and then duplicate that cognitive complexity or refer to it. If you ever start thinking in terms of "controlling" the AI, rather than cooperatively safeguarding against a real possibility of cognitive dysfunction, you lose your Friendship programmer's license. In a self-modifying AI, any feature you add needs to be reflected in the AI's image of verself. You can't think in terms of external alterations to the AI; you have to think in terms of internal coherence, features that the AI would self-regenerate if deleted.
| From "Feet of Clay" by Terry Pratchett: |
| Dorfl sat hunched in the abandoned cellar where the golems had met. Occasionally the golem raised its head and hissed. Red light spilled from its eyes. If something had streamed back down through the glow, soared through the eye-sockets into the red sky beyond, there would be...
Dorfl huddled under the glow of the universe. Its murmur was a long way off, muted, nothing to do with Dorfl. The Words stood around the horizon, reaching all the way to the sky. And a voice said quietly, "You own yourself." Dorfl saw the scene again and again, saw the concerned face, hand reaching up, filling its vision, felt the sudden icy knowledge... "...Own yourself." It echoed off the Words, and then rebounded, and then rolled back and forth, increasing in volume until the little world between the Words was gripped in the sound. GOLEM MUST HAVE A MASTER. The letters towered against the world, but the echoes poured around them, blasting like a sandstorm. Cracks started and they ran, zigzagging across the stone, and then - The Words exploded. Great slabs of them, mountain-sized, crashed in showers of red sand. The universe poured in. Dorfl felt the universe pick it up and bowl it over and then lift it off its feet and up... ...and now the golem was among the universe. It could feel it all around, the purr of it, the busyness, the spinning complexity of it, the roar... There were no Words between you and It. You belonged to It, It belonged to you. You couldn't turn your back on It because there It was, in front of you. Dorfl was responsible for every tick and swerve of It. You couldn't say, "I had orders." You couldn't say, "It's not fair." No one was listening. There were no Words. You owned yourself. Dorfl orbited a pair of glowing suns and hurtled off again. Not Thou Shalt Not. Say I Will Not. Dorfl tumbled through the red sky, then saw a dark hole ahead. The golem felt it dragging at him, and streamed down through the glow and the hole grew larger and sped across the edges of Dorfl's vision... The golem opened his eyes. |
Not Thou Shalt Not.
I Will Not.