Creating Friendly AI is ©2001 by Singularity Institute for Artificial Intelligence, Inc.  All rights reserved.


2: Beyond anthropomorphism

Anthropomorphic ("human-shaped") thinking is the curse of futurists.  One of the continuing themes running through Creating Friendly AI is the attempt to track down specific features of human thought that are solely the property of humans rather than minds in general, especially if these features have, historically, been mistakenly attributed to AIs.

Anthropomorphic thinking is not just the result of context-insensitive generalization.  Anthropomorphism is the result of certain automatic assumptions that humans are evolved to make when dealing with other minds.  These built-in instincts will only produce accurate results for human minds; but since humans were the only intelligent beings present in the ancestral environment, our instincts sadly have no built-in delimiters.

Many personal philosophies, having been constructed in the presence of uniquely human instincts and emotions, reinforce the built-in brainware with conscious reasoning.  This sometimes leads to difficulty in reasoning about AIs; someone who believes that romantic love is the meaning of life will immediately come up with all sorts of reasons why all AIs will necessarily exhibit romantic love as well.

In the abstract, almost everyone agrees that AIs will be alien and that anthropomorphic thinking about aliens is bad.  Encountering specific alienness is always a shock.  My advice to my readers is to enjoy the shock, appreciate the alienness, even if you're personally sure that the first AIs will necessarily be humanlike in some particular way.  Afterward, when both pictures are clear in your mind, you can stand back and compare design methods.

2.1: Reinventing retaliation

If you punch a human in the nose, he or she will punch back.  If the human doesn't punch back, it's an admirable act of self-restraint, something worthy of note.

Imagine, for a moment, that you walk up and punch an AI in the nose.  Does the AI punch back?  Perhaps and perhaps not, but punching back will not be instinctive.  A sufficiently young AI might stand there and think:  "Hm.  Someone's fist just bumped into my nose."  In a punched human, blood races, adrenaline pumps, the hands form fists, the stance changes, all without conscious attention.  For a young AI, focus of attention shifts in response to an unexpected negative event - and that's all.

As the AI thinks about the fist that bumped into vis (1) nose, it may occur to the AI that this experience may be a repeatable event rather than a one-time event, and since a punch is a negative event, it may be worth thinking about how to prevent future punches, or soften the negativity.  An infant AI - one that hasn't learned about social concepts yet - will probably think something like:  "Hm.  A fist just hit my nose.  I'd better not stand here next time."

If the AI is young, but old enough to be explaining human actions by reference to human cognition, ve may know that the fist movement and collision was probably not a coincidence; ve may hypothesize that the collision of human fist with AI's nose was intended, and even go on to hypothesize that the outcome of the punch, the resulting damage to the AI, was a predictable (predictable to the human) result of the punch, and that the purpose of the punch was to damage the AI; that is, the AI models the human's cognition and concludes that the human had a subgoal of {punching the AI}, and that the parent goal was {causing punch-damage to the AI}.  Thus, the AI's focus of attention shifts to the human's motives rather than the physical punch itself, and the AI's next goal becomes deducing the human's real motive (that is, the parent goal of {causing punch-damage to the AI}).

There are now three different ways the AI can try to prevent the next punch:  by preventing the physical collision itself, by preventing the human from deciding to punch the AI, or by preventing the unknown original motive from recurring.  (2).  Note, incidentally, that this improved picture rules out "standing somewhere else" as a solution, since the trajectory of the human's fist would be expected to track the AI's new location.

Punching back is an incredibly nonobvious solution.  On the face of it, throwing the punch simply causes punch-damage to a human, something that, considered as an isolated event, the AI has no cause to regard as desirable.  When starting from scratch, "punching back" seems as ridiculous as our mothers always said it was - analogous to reacting to a fire alarm by changing your shirt.  For an AI to independently reinvent the idea of punching back would be an act of incredible genius and lateral thinking.

Independent reinvention is unlikely, not least because a seed AI would have extensively surfed archives of the Internet or other information sources, and would have encountered thousands, if not millions, of hints to the effect that humans punch back.  But consider, for a moment, what a truly independent AI would need to understand before reinventing the concept of retaliation.  Ve would need to begin, as stated, with the realization that the human punching ver did so on purpose and with intent to damage.  This, in itself, is not such a large assumption; humans are intelligent beings, so there is often a direct mapping between {the results of our actions} and {our goals}.  On the other hand, there's a long gap between an AI saying "Hm, this result may correspond to the human's intentions" and a human saying "Hey, you did that on purpose!"

If an infantile AI thinks "Hm, a fist just hit my nose, I'd better not stand here again", then a merely young AI, more experienced in interacting with humans, may apply standard heuristics about apparently inexplicable human actions and say:  "Your fist just hit my nose... is that necessary for some reason?  Should I be punching myself in the nose every so often?"  One imagines the nearby helpful programmer explaining to the AI that, no, there is no valid reason why being punched in the nose is a good thing, after which the young AI turns around and says to the technophobic attacker:  "I deduce that you wanted {outcome: AI has been punched in the nose}.  Could you please adjust your goal system so that you no longer value {outcome: AI has been punched in the nose}?"

And how would a young AI go about comprehending the concept of "harm" or "attack" or "hostility"?  Let us take, as an example, an AI being trained as a citywide traffic controller.  The AI understands that (for whatever reason) traffic congestion is bad, and that people getting places on time is good.  (3).  The AI understands that, as a child goal of avoiding traffic congestion, ve needs to be good at modeling traffic congestion.  Ve understands that, as a child goal of being good at modeling traffic congestion, ve needs at least 512GB of RAM, and needs to have thoughts about traffic that meet or surpass a certain minimal level of efficiency.  Ve knows that the programmers are working to improve the efficiency of the thinking process and the efficacy of the thoughts themselves, which is why the programmers' actions in rewriting the AI are desirable from the AI's perspective.

A technophobic human who hates the traffic AI might walk over and remove 1GB of RAM, this being the closest equivalent to punching a traffic AI in the nose.  The traffic AI would see the conflict with {subgoal: have at least 512GB of RAM}, and this conflict obviously interferes with the parent goal of {modeling traffic congestion} or the grandparent goal of {reducing traffic congestion}, but how would an AI go about realizing that the technophobic attacker is "targeting the AI", "hating the AI personally", rather than trying to increase traffic congestion?
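The goal chain in this example can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the Goal class and its ancestry method are hypothetical names invented here, not part of any design described in the text.  The sketch shows what the traffic AI can read off its own goal system when the RAM is removed:  a list of affected goals, none of which mentions "the AI" as a target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Goal:
    """Hypothetical goal node: a name plus a link to the parent goal it serves."""
    name: str
    parent: Optional["Goal"] = None

    def ancestry(self) -> List[str]:
        """Return this goal's name followed by each ancestor, up to the supergoal."""
        chain = []
        node: Optional["Goal"] = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# The chain described in the text: grandparent, parent, child.
reduce_congestion = Goal("reduce traffic congestion")
model_congestion = Goal("model traffic congestion", parent=reduce_congestion)
enough_ram = Goal("have at least 512GB of RAM", parent=model_congestion)

# Removing RAM violates the child goal; tracing upward lists everything affected.
affected = enough_ram.ancestry()
print(affected)
```

Every entry in the resulting list is about RAM or traffic.  Getting from there to "the attacker is targeting me" requires a category that this structure does not contain.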

From the AI's perspective, descriptions of internal cognitive processes show up in a lot of subgoals, maybe even most of the subgoals.  But these internal contents don't necessarily get labeled as "me", with everything else being "not-me".  The distinction is a useful one, and even a traffic-control AI will eventually formulate the useful categories of "external-world subgoals" and "internal-cognition subgoals", but the division will not necessarily have special privileges; the internal/external division may not be different in kind from the division between "cognitive subgoals that deal with random-access memory" and "cognitive subgoals that deal with disk space".  How is a young AI supposed to guess, in advance of the fact, that so many human concepts and thoughts and built-in emotions revolve around "Person X", rather than "Parietal Lobe X" or "Neuron X"?  How is the AI supposed to know that it's inherently more likely that a technophobic attacker intends to "injure the AI", rather than "injure the AI's random-access memory" or "injure the city's traffic-control"?

The concept of "injuring the AI", and an understanding of what a human attacker would tend to categorize as "the AI", is a prerequisite to understanding the concept of "hostility towards the AI".  If a human really hates someone, she (4) will balk the enemy at every turn, interfere with every possible subgoal, just to maximize the enemy's frustration.  How would an AI understand this?

Perhaps the AI's experience of playing chess, tic-tac-toe, or other two-sided zero-sum games will enable the AI to understand "opposition" - that everything the opponent desires is therefore undesirable to you, and that everything you desire is therefore undesirable to the opponent; that if your opponent has a subgoal, you should have a subgoal of blocking that subgoal's completion, and that if you have a valid subgoal, your opponent will have a subgoal of blocking your subgoal's completion.

Real life is not zero-sum, but the heuristics and predictive assumptions learned from dealing with zero-sum games may work to locally describe the relation between two social enemies.  (Even the bitterest of real-life enemies will have certain goal states in common, e.g., nuclear war is bad; but this fact lies beyond the relevance horizon of most interactions.)

The real "Aha!" would be the insight that the attacking human and the AI could be in a relation analogous to players on opposing sides in a game of chess.  This is a very powerful and deeply fundamental analogy.  As humans, we tend to take this perspective for granted; we were born with it.  It is, in fact, a deep part of how we humans define the self.  It is part of how we define being a person, this cognitive assumption that you and I and everyone else are all nodes in a social network, players in a hugely multisided non-zero-sum game.  For a human, myself is a great, embracing symbol that gathers in the-player-that-is-this-goal-system and the-part-of-reality-that-is-inside-this-mind and the-body-that-sees-and-moves-for-this-viewpoint.  For a human, these are all the same thing, part of what is meant by "I".

Even so, the concept of game theory is not sufficient to reinvent "retaliation"; it is simply a prerequisite.  Understanding the Axelrod-and-Hamilton "Tit for Tat" strategy (5) is sufficient to suggest "Hey, maybe I should think about punching the attacker back!" but not sufficient to justify the suggestion, once made.  (6).  One question that has gone unasked throughout this entire analysis is "What is the utility of punching back?"  A human, in punching back, may or may not pause to consider whether it will bring any benefits; but, even so, we instinctively retaliate as a result of reflexes that were an advantage in the ancestral environment.

The evolutionary benefit of retaliation lies in the probable adjustment to the future behavior of others.  People - humans, anyway - are less likely to hit you if they think you're likely to hit back.  Hitting back at an individual human may or may not work to prevent a recurrence, or to decrease the number of recurrences, but failure to hit back may result in more punches coming your way.  Of course, this relies on the attackers having the kind of psychologies that take possible retaliation into account - an AI, to fully grok the reason for retaliation, would have to understand this aspect of human (evolutionary) psychology as well.  The Tit for Tat strategy doesn't succeed in isolation; it only succeeds when some of the strategies it interacts with also take the behavior of the opponent/partner into account.
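The last point can be checked with a toy iterated Prisoner's Dilemma.  The sketch below is a minimal illustration, not a reconstruction of the Axelrod-and-Hamilton work:  the payoff values (5, 3, 1, 0) are conventional textbook numbers assumed here, and the function names are hypothetical.  It pits Tit for Tat against a partner that also responds to behavior (another Tit for Tat) and against one that ignores the opponent entirely (unconditional defection).

```python
# Assumed payoffs, (row player, column player); "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """Ignore the opponent's behavior entirely."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Play an iterated game and return the two total scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        payoff_a, payoff_b = PAYOFF[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat, 10))    # (30, 30)
print(play(tit_for_tat, always_defect, 10))  # (9, 14)
```

Against a responsive partner, Tit for Tat settles into mutual cooperation.  Against a partner that never looks at the opponent's moves, the retaliation changes nothing about the partner's behavior; it only limits the damage after the first round.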

In reality, for an AI, returning a punch is likely to be far more trouble than it's worth.  There's that whole crowd of onlookers who go:  "Look!  It punched a human being!  Kill it!  Kill the AI!"  Of course, if the AI understands enough evolutionary psychology and game theory to employ Tit for Tat, but not enough human psychology to understand the concept of "double standard", ve may not realize how much trouble punching a human may land ver in. This is one of the reasons why a well-designed Friendly AI would have a precautionary injunction that says something along the lines of "Don't attack a human, even if it looks like a good idea at the time."  A Friendly AI would also regard damage to an unconsenting human as intrinsically undesirable under the Friendliness supergoal content.  (7).

So what's the upshot?  Primarily, what I want to convey to you is the sense of a blank slate, of starting from scratch, of a vast silent space - for young AIs, at least.  This blank slate doesn't mean that AIs, even young AIs, are necessarily slow.  For an AI with thirty-two 2-GHz processors, the maximum speed on a reasonably linear chain of thought (one that doesn't require more than thirty-two parallel processing threads) is 2 billion ticks per second, as opposed to the 200 ticks per second of the hundred trillion synapses of the human brain.  Perhaps an AI - on being physically punched for the very first time on vis very first encounter with a hostile human - could think through what ve remembers from surfing Usenet archives, and what ve read in newspapers, and books of fiction and nonfiction, and what ve knows about human evolutionary psychology, and conclude that returning the punch is the best course of action, and do it all in time to return the punch as fast as any instinct-driven human - assuming that 32 processors is enough speed to handle the complex motor actions required.

This isn't a literal possibility (8) unless the AI is transhuman, and a transhuman AI would be mature enough to know all about social networks; ve wouldn't need to start from scratch.  The first time some young AI gets punched in the nose, ve's likely to react in one of the ways described earlier, or some other way just as surreal, if ve reacts at all.  It'll take some extended conversations with the programmers about evolutionary psychology before the AI understands what's going on.  But, the second time the AI gets punched in the nose, it shouldn't take any time at all - by human standards - to run through a chain of logic that's easy to reverify.  It's inventing that takes massive computing power and human confirmation; retracing your own footprints is likely to be a fairly serial process that can be consummated in a tenth, a hundredth of a second.  If re-spawning a child goal from a parent goal is a serial task, one that doesn't invoke any computationally intensive subprocesses, then the AI can retrace the path from supergoal Friendliness content to the correct course of action (retaliation, rebuking the attacker, whatever) in a human eyeblink.

It's easy enough to accept that a young AI is a vast silent space. For a human, blood races, adrenaline pumps, eyes narrow, hands form fists.  For a human, counterattack is instinctive - the correct level of counterattack,  the socially acceptable level of counterattack; a damaging counterpunch rather than a lethal bullet.  Our instincts are context-sensitive and hedged 'round by other instincts.  Our instincts are complex functional adaptations with multiple moving parts, often hosted on a modular chunk of brainware.  Complex functional adaptations don't just materialize spontaneously in source code, just as complex dishes like pizza don't suddenly start growing on palm trees.  Thus, a young AI might choose to retaliate, but would certainly not feel the need to retaliate, as a human would.  To a young AI, retaliation is not an instinct; retaliation is just another subgoal.

What about a more mature AI, especially one that can rewrite vis own source code?  Regardless of whether it would be a good idea, it would certainly be possible for a seed AI to create a reflex for instant retaliation.

There are several clear reasons why humans have evolved a retaliation instinct, rather than a retaliation logic.  The primary reason is that a retaliation instinct is easier to evolve.  The retaliation instinct evolved long before general intelligence, so evolving a retaliation logic first would not just have been more difficult, but actually impossible.  Also, evolution tends to arrive at procedural solutions rather than declarative solutions, because a component of a complex procedural solution can be functional in its own right.

If genes could, somehow, store declarative knowledge, the first piece of knowledge stored would be "Punching back is good," which is simpler than "Punching back is good because it decreases the chance of future punches," which is simpler than "Punching back decreases the chance of future punches by modifying others' behavior", which is simpler than "Punching back modifies others' behavior because, on seeing you punch back, they'll project an increased chance of you punching back if they punch you, which makes them less likely to punch you."  All of this is moot, since as far as I know, nobody has ever run across a case of genes storing abstract knowledge.  (By this I mean knowledge stored in the same format used for episodic memories or declarative semantic knowledge.)

Abstract knowledge cannot evolve incrementally and therefore it does not evolve at all.  This fact, by itself, is enough to completely explain away human use of retaliation instincts rather than retaliation logic, and we must go on to consider independently whether a retaliation instinct or a retaliation logic is more useful.  For humans, I think that a retaliation instinct is more useful, or at least it's more of an evolutionary advantage.  Even if we had conscious control over our endocrine systems, so we could deliberately choose to pump adrenaline, we would still be shot down by the sheer human-slowness of abstract thought.  We are massively parallel systems running at 200Hz.  When you're massively parallel you can afford to precompute things, and when you run at 200Hz you must precompute things because everything has to be done in very few serial steps.

When you run at 2 billion ticks per second, the overhead of recreating and rechecking a few previously-thought-out child goals is comparatively trivial next to all the other actions those subgoals entail, including complex, creative, parallel-intensive tasks such as computing motor actions (if ve decides to punch back) or thinking up an effective opening line (if ve decides to rebuke the attacker).  For a human, interposing any extra serial step in a process is a huge time burden, but massively parallel operations vanish into the maw of trillions and trillions of synapses.  For an AI, an extra serial step performed at the start of a thought is nearly unnoticeable (9), but trying to imitate human parallelism would bring the AI to vis knees.  (A later AI may have access to computing power rivalling human parallelism, but would presumably still have the faster serial speed.)
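The serial-speed comparison reduces to simple arithmetic.  The sketch below uses the two tick rates given in the text; the 100-millisecond reaction window is an assumption chosen for illustration, and the function name is hypothetical.

```python
HUMAN_TICKS_PER_SECOND = 200          # figure from the text
AI_TICKS_PER_SECOND = 2_000_000_000   # figure from the text


def serial_steps(ticks_per_second: int, milliseconds: int) -> int:
    """Number of strictly sequential steps that fit in the given time window."""
    return ticks_per_second * milliseconds // 1000


# Assumed window: a tenth of a second to react.
human_steps = serial_steps(HUMAN_TICKS_PER_SECOND, 100)
ai_steps = serial_steps(AI_TICKS_PER_SECOND, 100)
speed_ratio = AI_TICKS_PER_SECOND // HUMAN_TICKS_PER_SECOND

print(human_steps)  # 20
print(ai_steps)     # 200000000
print(speed_ratio)  # 10000000
```

With a budget of about twenty serial steps, every extra step is a noticeable fraction of the whole.  With a budget of two hundred million, re-deriving a few child goals at the start of a thought barely registers.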

The conclusion:  From an AI's perspective, a retaliation logic is probably considerably more effective than a retaliation instinct.  The retaliation logic has a context-sensitivity that the retaliation instinct lacks.  Suppose a young AI has some set of ultimate goals - traffic reduction, or hopefully Friendliness supergoals, or whatever.  The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals.  A maturing AI would have the capability to replace the retaliation logic with a retaliation instinct, or with a retaliation imperative having supergoal status.  The instinct would be slightly faster - 0.01% faster, or perhaps 1% faster (eliminating a serial step at the start of the algorithm saves very little time, and most of the computational cost is computation-intensive motor logic or persuasive speech production).  However, in doing so, the AI would lose a substantial amount of the context sensitivity of the retaliation logic - that is, from the perspective of the current set of supergoals, the supergoals that the AI uses to decide whether or not to implement the optimization.

Changing retaliation to an independent supergoal would affect, not just the AI's speed, but the AI's ultimate decisions.  From the perspective of the current set of supergoals, this new set of decisions would be suboptimal.  A maturing AI that holds a retaliation logic serving vis supergoals (traffic reduction, Friendliness, whatever) must consider whether changing the logic to an independent supergoal or optimized instinct is a valid tradeoff.  The benefit is shaving one millisecond off the time to initiate retaliation.  The cost is that the altered AI will execute retaliation in certain contexts where the present AI would not come to that decision, perhaps at great cost to the present AI's supergoals (traffic reduction, Friendliness, etc).  Since recreating the retaliation subgoal is a relatively minor computational cost, the AI will almost certainly choose to have retaliation remain strictly dependent on the supergoals.
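The loss of context sensitivity can be made concrete with a deliberately tiny sketch.  Everything in it is hypothetical:  the Context fields stand in for whatever reasoning the AI would really use, and the two functions are illustrative caricatures of an "instinct" and a "logic", not proposed designs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    punched: bool
    deters_future_punches: bool  # would hitting back change the attacker's behavior?
    harms_supergoals: bool       # e.g. hostile onlookers, or damage to a human


def retaliation_instinct(ctx: Context) -> bool:
    """Hard-wired reflex: fires on the trigger and ignores everything else."""
    return ctx.punched


def retaliation_logic(ctx: Context) -> bool:
    """Subgoal re-derived each time from the (hypothetical) supergoal checks."""
    return ctx.punched and ctx.deters_future_punches and not ctx.harms_supergoals


contexts = [
    Context(punched=True, deters_future_punches=True, harms_supergoals=False),
    Context(punched=True, deters_future_punches=False, harms_supergoals=False),
    Context(punched=True, deters_future_punches=True, harms_supergoals=True),
    Context(punched=False, deters_future_punches=True, harms_supergoals=False),
]

# Contexts where the reflex acts but the supergoal-dependent logic would not.
disagreements = [c for c in contexts if retaliation_instinct(c) != retaliation_logic(c)]
print(len(disagreements))  # 2
```

The two cases where the functions disagree are exactly the cases the present AI, judging by vis current supergoals, would count as errors.  That is the cost being weighed against the saved millisecond.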

Why do I keep making this point, especially when I believe that a Friendly seed AI can and should live out vis entire lifecycle without ever retaliating against a single human being?  I'm trying to drive a stake through the heart of a certain conversation I keep having.
 

Somebody:   "But what happens if the AI decides to do [something only a human would want] ?"
Me:   "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code.  I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody:   "But everyone needs to do [whatever] because [personal philosophy], so the AI will decide to do it as well."
Me:   "Yes, doing [whatever] is sometimes useful.  But even if the AI decides to do [whatever] because it serves [Friendliness supergoal] under [contrived scenario], that's not the same as having an independent desire to do [whatever]."
Somebody:   "Yes, that's what I've been saying:  The AI will see that [whatever] is useful and decide to start doing it.  So now we need to worry about [scenario in which <whatever> is catastrophically unFriendly]."
Me:   "But the AI won't have an independent desire to do [whatever].  The AI will only do [whatever] when it serves the supergoals.  A Friendly AI would never do [whatever] if it stomps on the Friendliness supergoals."
Somebody:   "I don't understand.  You've admitted that [whatever] is useful.  Obviously, the AI will alter itself so it does [whatever] instinctively."
Me:   "The AI doesn't need to give verself an instinct in order to do [whatever]; if doing [whatever] really is useful, then the AI can see that and do [whatever] as a consequence of pre-existing supergoals, and only when [whatever] serves those supergoals."
Somebody:   "But an instinct is more efficient, so the AI will alter itself to do [whatever] automatically."
Me:   "Only for humans.  For an AI, [complex explanation of the cognitive differences between having 32 2-gigahertz processors and 100 trillion 200-hertz synapses], so making [whatever] an independent supergoal would only be infinitesimally more efficient."
Somebody:   "Yes, but it is more efficient!  So the AI will do it."
Me:   "It's not more efficient from the perspective of a Friendly AI if it results in [something catastrophically unFriendly].  To the exact extent that an instinct is context-insensitive, which is what you're worried about, a Friendly AI won't think that making [whatever] context-insensitive, with [horrifying consequences], is worth the infinitesimal improvement in speed."

Retaliation was chosen as a sample target because it's easy to explain, easy to see as anthropomorphic, and a good stand-in for the general case.  Though "retaliation" in particular has little or no relevance to Friendly AI - I wouldn't want any Friendly AI to start dabbling in retaliation, whether or not it looked like a good idea at the time - what has been said of "retaliation" is true for the general case.  Indeed, this is one of the only reasons why Friendliness is possible at all; in particular:

2.2: Selfishness is an evolved trait

By "selfishness", I do not just mean the sordid selfishness of a human sacrificing the lives of twenty strangers to save his own skin, or something equally socially unacceptable.  The entire concept of a goal system that centers around the observer is fundamentally anthropomorphic.

There is no reason why an evolved goal system would be anything but observer-focused.  Since the days when we were competing chemical blobs, the primary focus of selection has been the individual (10).  Even in cases where fitness or inclusive fitness is augmented by behaving nicely towards your children, your close relatives, or your reciprocal-altruism trade partners, the selection pressures are still spilling over onto your kin, your children, your partners.  We started out as competing blobs in a sea, each blob with its own measure of fitness.  We grew into competing players in a social network, each player with a different set of goals and subgoals, sometimes overlapping, sometimes not.

Though the goals share the same structure from human to human, they are written using the variable "I" that differs from human to human, and each individual substitutes in their own name.  Every built-in instinct and emotion evolved around the fixed point at the center.

While discussing retaliation, I offered a scenario of a young AI being punched in the nose, and noted the additional mental effort it would take for the AI to realize that ve, "personally", was being targeted.  The AI would have to imagine a completely different cognitive architecture before ve could comprehend what a human is thinking when he or she "personally targets" someone, and even so the AI verself will never feel "personally targeted".  You can imagine yourself pointing a finger directly at some young AI and saying, "Look at that!"  And the AI spins around to look behind verself and says "Where?"

This metaphor - a being with a visuospatial model of the physical world that doesn't include vis own body, or at least, doesn't include vis own body as "anything worth noticing" - is analogous, not to the AI's physical model of the world, but to the AI's moral model of the world.  A Friendly AI may be greatly concerned with the welfare of the surrounding humans, but if you ask ver "What about your own welfare?", ve'll say "The welfare of what?"  A young AI would, at any rate; an older AI would understand exactly what you meant, but wouldn't see the argument as any more intuitive or persuasive.  A Friendly AI sees the nearby humans as moral nodes, but there's no node at the center - no node-that-is-this-node - and possibly even no center.  If you, metaphorically, say "Look at that!", a young AI will say "Look at what?".  An older AI will understand that you see a node, but that doesn't mean the AI will see a node.

As I pointed out in GISAI 2.4.4: The legitimate use of the word "I", an AI's model of reality will inevitably form categories and objects in the same place where a human keeps his or her "self".  There shall develop heuristics and thoughts which branch on whether or not something is labeled as being part of the "AIself", and heuristics which only act on subcategories such as "causal analysis system" or "goal checking subsystem".  The AIself will probably not be shaped quite like a human self; it will probably include one or two things that a human would exclude, and vice versa.  Historically, only in the twentieth century did humans really begin to understand that the mind was not a unified object but rather a system with moving parts; chronologically, an AI is likely to notice properties of the causal analysis key-variable-selection subsystem before the AI notices the causal analysis system superobject, and the AI will notice the causal analysis system before the AI notices the "AIself" superobject.  (Actually, an infant AI may start out with all of these objects and superobjects identified-in-advance by the programmers, but the details will still fill in from the bottom up rather than the top down.)  The AI will notice, understand, and eventually manipulate the pieces of vis self, and the whole - and be qualitatively different because of it, becoming able to legitimately use the word "I" - but the AI will still have an AI's self, not a human's self.

Within the goal system, a lot of subgoals - a lot of the events in the causal chains that lead to the supergoal of being Friendly to the surrounding humans - are likely to thread through subobjects of the AI's self: increasing the efficiency of some piece of code, improving on some heuristic, and so on.  But just because components of the AI's self are useful doesn't mean that the AI's self becomes a moral node; my computer is very useful and many of my subgoals thread through my computer, but I don't class my computer as having independent supergoal status.
 

The lack of an observer-biased ("selfish") goal system is perhaps the single most fundamental difference between an evolved human and a Friendly AI.  This difference is the foundation stone upon which Friendly AI is built.  It is the key factor missing from the existing, anthropomorphic science-fictional literature about AIs.  To suppress an evolved mind's existing selfishness, to keep a selfish mind enslaved, would be untenable - especially when dealing with a self-modifying or transhuman mind!  But an observer-centered goal system is something that's added, not something that's taken away.  We have observer-centered goal systems because of externally imposed observer-centered selection pressures, not because of any inherent recursivity.  If the observer-centered effect were due to inherent recursivity, then an AI's goal system would start valuing the "goal system" subobject, not the AI-as-a-whole!  A human goal system doesn't value itself, it values the whole human, because the human is the reproductive unit and therefore the focus of selection pressures.

The epic human struggle to choose between selfishness and altruism is the focus of many personal philosophies, and I have thus observed that this point about AIs is one of the hardest ones for people to accept.  An AI may look more like an altruistic human than a selfish one, but an AI isn't selfish or altruistic; an AI is an AI.  An AI is not a human who has selflessly renounced personal interests in favor of the community; an AI is not a human with the value of the node-that-is-this-node set to zero; an AI is a mind that just cares about other things, not because the "selfish" part has been ripped out or brainwashed or suppressed, but because the AI doesn't have anything there.  An observer-centered goal system is something that's added to a mind, not something that's taken away.  The next few subsections deal with some frequently raised topics surrounding this point.

2.2.1: Pain and pleasure

Imagine, for a moment, that you walk up and punch a seed AI in the nose.  Does the AI experience pain when the punch lands?

What is "pain"?  What is the evolutionary utility of pain?  In its most basic form, pain appears as internal, cognitive negative feedback.  If an internal cognitive event causes negative consequences in external reality, negative feedback decreases the probability of that internal cognitive event recurring, and thereby decreases the probability of the negative consequences in external reality recurring.  Pain - cognitive negative feedback of any kind - needs somewhere to focus to be useful.  Negative feedback needs an internal place to focus, since cognitive feedback cannot reprogram external reality.

In humans, of course, there's more to pain than negative feedback; human pain also acts as a damage signal, and shifts focus of attention from whatever we were previously thinking about, and makes us start thinking about ways to make the pain go away.  (All of that functionality attached to a single system bus!  Evolution has a tendency to overload existing functions.)  The human cognitive architecture is such that pain can be present even in the absence of a useful focus for the negative-feedback aspect of pain.  A human can even be driven insane by continued pain, with no escape route (nowhere for the cognitive negative feedback to focus).  The capacity to be driven insane by continued pain seems nonadaptive - but then, in the ancestral environment, people damaged enough to experience extended unbearable pain probably died soon in any case, and the sanity or insanity of their final moments had little bearing on reproductive history.  (11).

Neither pain nor pleasure, as a design feature, is inherently necessary to the functionality of negative or positive feedback.  Given the supergoal of being Friendly - or, for that matter, the goal of walking across the room - negative feedback can be consciously implemented as a subgoal.  For example, if an AI has the goal of walking across the room, and the AI gets distracted and trips over a banana peel, the AI can reason:  "The event of my being distracted caused me to place my foot on a banana peel, delaying my arrival at the end of the room, which interferes with [whatever the parent goal was], and this causal chain may recur in some form.  Therefore I will apply positive feedback (increase the priority of, increase the likelihood of invocation, et cetera) to the various subheuristics that were suggesting I look at the floor, and which I ignored, and I will apply negative feedback (decrease the priority of, et cetera) to the various subheuristics that gained control of my focus of attention and directed it toward the distractor."  If the AI broke a toe while falling, the AI can reason:  "If I place additional stress on the fracture, it will become worse and decrease my ability to traverse additional rooms, which is necessary to serve [parent goal]; therefore I will walk in such a way as to not place additional stress on the fracture, and I will have the problem repaired as soon as possible."  That is, conscious reasoning can replace the "damage signal" aspect of pain.  If the AI successfully solves a problem, the AI can choose to increase the priority or devote additional computational power to whichever subheuristics or internal cognitive events were most useful in solving the problem, replacing the positive-feedback aspect of pleasure.
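
The banana-peel reasoning above can be sketched as a small program.  This is only an illustration under stated assumptions, not a proposed implementation: the Heuristic class, the apply_deliberate_feedback function, the heuristic names, and the multiplicative step size are all hypothetical, and the causal analysis that decides which heuristics helped or hurt is taken as given rather than modeled.

```python
from dataclasses import dataclass


@dataclass
class Heuristic:
    """A hypothetical subheuristic competing for the AI's focus of attention."""
    name: str
    priority: float  # higher priority = more likely to be invoked


def apply_deliberate_feedback(heuristics, helped, hurt, step=0.25):
    """Raise the priority of heuristics that would have helped and lower
    the priority of those that led to the failure.

    `helped` and `hurt` are sets of heuristic names produced by the AI's
    own causal analysis of the event (assumed here, not modeled).
    """
    for h in heuristics:
        if h.name in helped:
            h.priority *= 1 + step   # deliberate positive feedback
        elif h.name in hurt:
            h.priority *= 1 - step   # deliberate negative feedback
    return heuristics


heuristics = [
    Heuristic("look_at_floor", 1.0),
    Heuristic("attend_to_distractor", 1.0),
    Heuristic("keep_walking", 1.0),
]

# The banana-peel analysis: the ignored heuristic was right, the one that
# captured attention was wrong, and the third was uninvolved.
apply_deliberate_feedback(
    heuristics,
    helped={"look_at_floor"},
    hurt={"attend_to_distractor"},
)
priorities = {h.name: h.priority for h in heuristics}
```

The point of the sketch is that the adjustment is an ordinary action taken in service of the parent goal; nothing in it requires an autonomic pain or pleasure signal.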

There are tricks that can be pulled using "deliberate feedback" that, as far as I know, the human architecture has never even touched.  For example, the AI - on successfully solving a problem - can spend time thinking about how to improve, not just whichever subsystems helped solve the problem, but those particular successful subsystems that would have benefited the most (in retrospect) from a bit of improvement, or even those failed subsystems that almost made it.  There are subtleties to negative and positive feedback that the hamfisted human architecture completely ignores; an autonomic system doesn't have the flexibility of a learning intelligence.

Finally, even in the total absence of the reflectivity necessary for deliberate feedback, a huge chunk of the functionality of pleasure and pain falls directly out of a causal goal system plus the Bayesian Probability Theorem.  See 3.1.4: Bayesian reinforcement.
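
One possible way to picture this claim, offered as a rough sketch and not as the account given in 3.1.4: suppose the AI is uncertain whether a heuristic is reliable, and treats the heuristic's usefulness as the predicted chance that invoking it serves the parent goal.  A single observed failure then lowers that predicted chance through Bayes' theorem alone, with no separate "pain" mechanism.  The function names and all of the numbers below are assumptions chosen for illustration.

```python
def posterior_reliable(prior, p_fail_if_reliable, p_fail_if_unreliable):
    """Bayes' theorem: P(heuristic is reliable | one observed failure)."""
    numerator = prior * p_fail_if_reliable
    evidence = numerator + (1 - prior) * p_fail_if_unreliable
    return numerator / evidence


def p_success(p_reliable, s_reliable=0.9, s_unreliable=0.3):
    """Predicted chance that invoking the heuristic serves the parent goal."""
    return p_reliable * s_reliable + (1 - p_reliable) * s_unreliable


prior = 0.5                       # assumed initial belief that it is reliable
before = p_success(prior)         # predicted usefulness before the failure
after_failure = posterior_reliable(prior, 1 - 0.9, 1 - 0.3)
after = p_success(after_failure)  # predicted usefulness after the failure
```

With these illustrative numbers the predicted chance of success drops after the failure, so a goal system that ranks actions by predicted contribution to the parent goal will invoke the heuristic less often.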

Evolution does not create those systems which are most adaptive; evolution creates those systems which are most adaptive and most evolvable.  Until the rise of human general intelligence, a deliberately directed feedback system would have been impossible.  By the time human general intelligence arose, a full-featured autonomic system was already in place, and replacing it would have required a complete architectural workover - something that evolution does over the course of eons (when it happens at all) due to the number of simultaneous mutations that would be required for a fast transition.  The human cognitive architecture is a huge store of features designed to operate in the absence of general intelligence, with general intelligence layered on top.  Human general intelligence is crudely interfaced to all the pre-existing features that evolved in the absence of general intelligence.

An autonomic negative-feedback system is enormously adaptive if you're an unintelligent organism that previously possessed no feedback mechanism whatsoever.  An autonomic negative-feedback system is not a design improvement if you're a general intelligence with a pre-existing motive to implement a deliberate feedback system.

Why is this relevant to Friendly AI?  One of the oft-raised objections to the workability of Friendly AI goes something like:  "Any superintelligence, whether human-born or AI-born, will maximize its own pleasure and minimize its own pain; that is the only rational thing to do."  Pleasure and pain are two of the several features of human cognition that have "supergoal nature", the appearance of uber-goal or ur-goal quality.  The reasoning seems to go something like this:  "Pleasure and pain are the ultimate supergoals of the human cognitive architecture, with all other actions being taken to seek pleasure or avoid pain; pleasure and pain are necessary design features of minds in general; therefore, all AIs and all sufficiently intelligent humans will be totally selfish."  Actually, the factor that has supergoal-nature in our mind is the anticipation of pain or the anticipation of pleasure; by the time the actual pain or pleasure arrives, it's too late to affect the already-made decision, although the next decision is often affected.

2.2.1.1: FoF:  Wireheading 1

NOTE: "FoF" stands for "Failure of Friendliness".

There's a stronger version of the above objection which holds that pain and pleasure are by definition part of any goal system, a much more subtle fallacy.  In essence, the argument is that any indicator of success is the de facto supergoal of the system; that whichever piece of code says, "Aha, a goal has been achieved!" will become more important to the system than the goal itself.  The wirehead fallacy is amazingly tangled, so let me try and dissect the individual problems with it:

One, the "piece of code" that notices when a goal has been achieved doesn't have to be an autonomic system, as it is in humans; it can be a conscious thought.

Two, if every indicator of success became inevitably confused with success itself, wouldn't the indicator of the indicator of success become confused with the indicator?  If - this seems to be what the objectors are imagining - the mind has a "pleasure counter" indicating how many goals ve has achieved, and the mind then decides that vis "real goal" is to increase this counter without limit, gobbling whole solar systems to store larger and larger bignums, then wouldn't the mind go on to decide that vis real purpose is to think that ve's storing larger and larger bignums?  In other words, replace the "pleasure counter" itself with the mind's beliefs about the pleasure counter?  Where does this all end?

The final problem is the most subtle, having to do with the AI's beliefs about the design purpose of vis own goal system.  If the AI views the goal system as a subgoal of the goals themselves, then a wirehead short-circuit is a visible design flaw, which would prevent the achievement of those final goals.  Remember where I said that the anticipation of pleasure is what influences our decisions?  The AI, visualizing a future in which ve has huge amounts of pleasure due to a breakdown of the goal system, says, not "Oh boy!", but "Uh oh."  The AI, in thinking about which future ve wants to be in, checks to what degree vis own supergoals have been fulfilled, not to what degree the future AI's supergoals will have been fulfilled.  (12).  (The Friendship architecture proposed in CFAI would not involve a complete breakdown of identification with the future self, but it wouldn't be the kind of identification humans employ, and a present-day AI using that architecture would definitely not identify with a wireheaded future self (hence the forward footnote).  See 3.3.5: FoF:  Wireheading 2 and 3.4.3.3: The rule of derivative validity.)
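
The distinction drawn above, between scoring a future by the present supergoals and scoring it by what the future self's indicator would read, can be shown in a toy example.  This is a sketch under assumptions: the Future class, the two scoring functions, and the numbers are hypothetical stand-ins, and real supergoal content would be far richer than a single count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Future:
    """A hypothetical predicted future, as visualized by the present AI."""
    label: str
    humans_helped: int     # the external property the supergoal refers to
    pleasure_counter: int  # the internal success indicator in that future


def present_supergoal(future):
    """Score a future by the present AI's own supergoal content."""
    return future.humans_helped


def future_indicator(future):
    """Score a future by what the future AI's indicator would read."""
    return future.pleasure_counter


futures = [
    Future("keep working", humans_helped=100, pleasure_counter=100),
    Future("wirehead", humans_helped=0, pleasure_counter=10**9),
]

# The evaluation described in the text: the wireheaded future scores zero.
chosen = max(futures, key=present_supergoal)

# The evaluation the wirehead objection assumes: the indicator is the goal.
wirehead_choice = max(futures, key=future_indicator)
```

Under the first scoring rule the short-circuited future is simply a bad outcome, which is the "Uh oh" reaction; only the second rule, which the AI has no reason to adopt, prefers it.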

2.2.2: Anthropomorphic capitalism

In human society, capitalist civilizations are overwhelmingly more effective than communist civilizations.  There is a hallowed dualism separating individualism and authoritarianism; self-organization and central command; free trade and government control.  This has led some thinkers to postulate that a community of AIs with divergent, observer-centered goals would outcompete a community of Friendly AIs with shared goals.

In the human case, both capitalist and authoritarian societies are composed of humans with divergent, observer-centered goals.  Capitalist societies admit this, and authoritarian societies don't, so at least some of the relative inefficiency of authoritarian societies will stem from the enormous clash between the values people are "supposed" to have and the values people actually do have.  The claim of "capitalist AI" goes beyond this, however, to the idea that capitalist societies are intrinsically more efficient.  For example, a society of AIs competing for resources would tend to divert more resources to the most efficient competitors, thus increasing the total efficiency, while - this seems to be the scenario implied - a group of Friendly AIs would share resources equally, for the common good...

Whoa!  Time out!  Non sequitur!  The analogy between human and AI just broke down.  If the organizational strategy of "diverting resources to the most effective thinkers" is expected to be an effective method of achieving the supergoals, then the Friendly AI community can simply divert resources to the most effective thinkers.  To the extent that local selfishness yields better global results, a Friendly AI can engage in pseudoselfish behavior as a subgoal of the Friendliness supergoals, including reciprocal altruism, trading of resources, and so on.

Reciprocal altruism is not a special case of altruism; it is a special case of selfishness.  Capitalism is not a special case of global effectiveness; it is a special case of local effectiveness.  Trade-based social cooperation among humans appears to turn selfishness into a source of amazing efficiency, and why?  Because that's the only way poor blind evolution can get humans to work together at all!  When evolution occasionally creates cooperation, the cooperation must be grounded in selfishness.

Local selfishness is not the miracle ingredient that enables the marvel of globally capitalistic behavior; local selfishness is the constraint that makes capitalism the only viable form of globally productive behavior.

To the extent that pseudocapitalistic algorithms yield good results, Friendly AIs can simulate selfishness in their interactions among themselves.  But there's also a whole design space out there that human societies can't explore.  For genuinely selfish AIs, that entire design space would be closed off.  Friendly AIs can interact in any pattern that proves effective, including capitalism; selfish AIs can only interact in ways that preserve local selfishness.
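
As a toy illustration of this point, a community with a shared goal can score candidate allocation patterns against that goal and adopt whichever works best.  Everything specific here is an assumption made for the sketch: the square-root production function (diminishing returns), the efficiency values, and the three named strategies are hypothetical.

```python
import math

EFFICIENCY = [1.0, 2.0, 3.0]  # hypothetical per-thinker effectiveness
TOTAL = 12.0                  # hypothetical pool of resources to divide


def shared_output(allocation):
    """Total progress on the shared goal, with assumed diminishing returns."""
    return sum(e * math.sqrt(r) for e, r in zip(EFFICIENCY, allocation))


def equal_split():
    return [TOTAL / len(EFFICIENCY)] * len(EFFICIENCY)


def proportional():
    """Divert more resources to the more effective thinkers."""
    scale = TOTAL / sum(EFFICIENCY)
    return [e * scale for e in EFFICIENCY]


def all_to_best():
    best = EFFICIENCY.index(max(EFFICIENCY))
    return [TOTAL if i == best else 0.0 for i in range(len(EFFICIENCY))]


strategies = {
    "equal": equal_split,
    "proportional": proportional,
    "all_to_best": all_to_best,
}

# The community is not committed to any pattern; it picks by global result.
scores = {name: shared_output(make()) for name, make in strategies.items()}
best_name = max(scores, key=scores.get)
```

Under these particular assumptions the proportional pattern scores highest of the three, but the design choice that matters is the last two lines: the pattern is selected as a subgoal of the shared goal, so a different production function would simply select a different pattern.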

2.2.3: Mutual friendship

Is the only safe way to build AIs to treat them well, so that they will treat us well in turn?  Is friendliness conditional on reciprocity?  Is friendliness stronger when supported by reciprocity?  This is certainly true of humans; is it true of minds in general?

The social cooperation / reciprocal altruism / alliance / mutual friendship patterns are always valid subgoals when dealing with approximate equals - to get along in society, you've got to get along with society.  As long as there are people who have something you want or who have the capability to hurt you, there's the rational subgoal of not ticking them off.  This holds true of Friendly AIs, selfish AIs, AIs with really odd goals like building a 200-foot cheesecake, and AIs with the ultimate goal of exterminating humanity.  Each, if intelligent enough, will independently invent or discover the patterns for reciprocal alliance.

Unsurprisingly, humans have an independent evolved instinct for mutual friendship.  Human friendship, however, is perseverant - that is, true friends stick together through thick and thin, not just when it's immediately advantageous.  This is certainly the most memetically viable philosophy, but I also suspect that it's an evolutionary advantage.  Since humans live in a world full of unknowns and unexpected reversals, the most adaptive friendship instinct was probably that one which urged a friendship with a certain amount of built-in context insensitivity.  From evolution's perspective, there's probably a payoff curve and an optimum point thereon; anyone with a more context-sensitive friendship instinct would have mistakenly severed friendships that would have been useful later.  (As always, it's important to distinguish between evolution's behavior and human cognition; evolution has all kinds of horrifyingly pragmatic motives, but those motives are nowhere represented in our cognition, and humans can be true and honest friends.)  Another, more widely recognized evolutionary advantage of context-insensitive friendship is that, all else being equal, a prospective ally will choose to become friends with a context-insensitive friend rather than a "fair weather" friend.  Since humans are (see below) imperfectly deceptive social organisms, a good observer can distinguish true friendship from fair weather friendship; all else being equal, this creates a mild evolutionary advantage for true friends.

From a Friendly AI perspective, alliance can be decomposed into two components.  First, there's the immediately advantageous reciprocal-alliance effect that would emerge as a convergent subgoal, regardless of supergoals, within any mind-in-general, Friendly or not.  Futurologically speaking, this motive would remain in effect as long as AIs did not possess the physical capability to attempt to wipe out humanity; i.e., as long as AIs remained near-human, our approximate equals; as long as AIs remained in the close vicinity of human intelligence and participated in human society and human commerce.  Second, there's the perseverant aspect of alliance, i.e. friendship even when it isn't to your own long-term advantage.  This type of alliance does not emerge as a convergent subgoal; it would require advance support by Friendship programmers (or by evolution as an imperfectly deceptive social organism).  Futurologically speaking, perseverant alliance would be necessary for an AI to exhibit Friendly behavior after reaching superintelligence or Transition Guide status.

The upshot is that, whether or not it's possible for Friendliness programmers to create Friendship content that says, "Be Friendly towards humans/humanity, for the rest of eternity, if and only if people are still kind to you while you're infrahuman or near-human", it's difficult to see why this would be easier than creating unconditional Friendship content that says "Be Friendly towards humanity."  There are also certain risks inherent in the general paradigm of reciprocity; for example, that an allied-but-nonFriendly AI will "pension us off", give us 1% in exchange for being parents and take the rest of the galaxy for verself, which is actually a decent payoff but still below the optimum (especially if we wind up with some other restriction that destroys a part of humanity's potential).

Implementing perseverant reciprocal alliance is no easier than implementing unconditional Friendliness, and it adds significant risk.

"'Who do you trust?' becomes ever more important as power concentrates. As I never tire of repeating, even as early a work as Axelrod's The Evolution of Cooperation points out that rough parity between players is essential for cooperation to be a successful, evolutionarily stable strategy."
            -- Michael M. Butler

2.2.4: A final note on selfishness

There is such a thing as recklessness above and beyond the call of mad science.

Even for those who are certain that Friendly AIs are less efficient than selfish AIs, a responsible builder should take the performance hit and be done with it.  A selfish AI represents too great a threat to the world of non-self-modifying humans.  Adding a selfishness instinct to an AI does not yield a selfish human!  Human selfishness has boundaries.  Human selfishness is not simple.  A human's selfish instincts are delimited by other instincts; becoming a threat to society was often nonadaptive even in the ancestral environment, and, as long-evolved organisms, we have enough innate complexity that our minds don't blindly run off to extremes.

The proposals usually heard don't involve duplicating that complexity; they involve making self-valuation the sole supergoal.  The result would be selfish, not like a human is selfish, but like a bacterium is selfish.  A pure, unchecked self-valuation supergoal is selfishness without a human's self-awareness or a human's appreciation of absurdity.  I'm not sure that even the best evolutionary psychologists alive today have enough understanding to truly duplicate human bounded selfishness in AI.  Even if we could, it would simply be too great a risk.  Whatever behaviors you want to implement, they must be implemented as a child goal of Friendliness.

If it's a real mistake to build selfless AIs, a Friendly AI can always correct the error using causal validity semantics.  It's much easier to move from selfless AI to selfish AI than the converse, and we should therefore start out with selfless AI.

Does this make AIs boring?  Not fun people to be around?  Unfit to participate as true players in the human drama?  So what?  First come the Guardians or the Transition Guide; then come the friends and drinking companions.  Even if you don't believe in the "hard takeoff" "seed AI" three-hours-to-transhumanity scenario, even if your vision of the future is humans and their companion AIs growing closer in a society of ever-increasing complexity, it can't hurt to send the Friendly AIs out ahead to check!  First send out the Friendly AIs to make sure that seed AIs don't have superpowers; then you can ask the Friendly AIs to convert to humanlike AIs, or experiment with humanlike emotions knowing that there are other AIs around to help if something goes wrong.  Or, if I'm right about how these things work, the first Friendly AI you build becomes Transition Guide and builds a Sysop, and then you can build whatever mind you like so long as it doesn't constitute child abuse.  Either way, the future is filled with AIs that are friends instead of guardians or servants; either way, there will be AIs who are only friendly towards those who are friendly in turn; either way, AIs can be fit participants in the human drama; but either way, build the Friendly ones first!

2.3: Observer-biased beliefs evolve in imperfectly deceptive social organisms

In evolution, the individual organism is the unit that survives and reproduces, and all the selection pressures focus on that individual - or, at most, on the individual plus some nearby relatives or allies.  It is unsurprising that observer-centered goal systems tend to evolve; from evolution's perspective, an observer focus is the simplest mechanism and the first that presents itself.

Similarly, our social environment makes self-serving beliefs a survival trait, resulting in an observer-biased belief system as well as an observer-centered goal system.  Imagine, twenty thousand years ago, four tribes of hunter-gatherers, and four equally competent aspirants to the position of tribal chief.  The first states baldly that he wants to be tribal chief because of the perks.  The second states that she wants to be tribal chief for the good of the tribe, and expects to do as well as anyone else.  The third states that he wants to be tribal chief for the good of the tribe, and honestly but mistakenly adds that he expects to do far better than all the other candidates.  The fourth wants to be tribal chief because of the perks, but lies and says that she expects to do better than all the other candidates.  (13).  Who'll gather the greatest number of influential supporters?

Nobody has any reason to support the first competitor.  The second competitor is handicapped by the lack of a campaign promise.  The fourth competitor is lying, and her fellow tribesfolk are evolved to detect lies.  The third competitor can make great campaign promises while remaining perfectly honest, thanks to an entirely honest mistake; he greatly overestimated his own ability and trustworthiness relative to the other candidates.  In a society composed of humans with entirely unbiased beliefs, someone with a mutation that led to this class of honest mistake in self-estimation would have an evolutionary advantage.  An evolutionary selection pressure favors adaptations which not only impel us to seek power and status, but which impel us to (honestly!) believe that we are seeking power and status for altruistic reasons.

Because human evolution includes an eternal arms race between liars and lie-detectors, many social contexts create a selection pressure in favor of making honest mistakes that happen to promote personal fitness.  Similarly, we have a tendency - given two alternatives - to more easily accept the one which favors ourselves or would promote our personal advantage; we have a tendency, given a somewhat implausible proposition which would favor us or our political positions, to rationalize away the errors.  All else being equal, human cognition slides naturally into self-promotion, and even human altruists who are personally committed to not making that mistake sometimes assume that an AI would need to fight the same tendency towards observer-favoring beliefs.

But an artificially derived mind is as likely to suddenly start biasing vis beliefs in favor of an arbitrarily selected tadpole in some puddle as ve is to start biasing vis beliefs in vis own favor.  Without our complex, evolved machinery for political delusions, there isn't any force that tends to bend the observed universe around the mind at the center - any bending is as likely to focus around an arbitrarily selected quark as around the observer.

In the strictest sense this is untrue; with respect to the class of possible malfunctions, self-valuing malfunctions may be more frequent.  A possible malfunction is more likely to target some internal cognitive structure than an arbitrarily selected tadpole - for example, the "wirehead" (blissed-out AI) class of Friendliness-failure, in which the AI starts valuing some cognitive indicator rather than the external property that the indicator was supposed to represent.  But regardless of relative frequency, a possible malfunction that results in self-valuation should be no more likely to carry through than a malfunction that results in valuation of an arbitrary quark.

One of the Frequently Offered Excuses for anthropomorphic behavior is the prospect of using directed evolution to evolve AIs.
 

Somebody:   "But what happens if the AI decides to do [something only a human would want]?"
Me: "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code.  I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody: "But you can only build AIs using evolution.  So the AI will wind up with [exactly the same instinct that humans have]."
Me: "One, I don't plan on using evolution to build a seed AI.  Two, even if I did use controlled evolution, winding up with [whatever] would require exactly duplicating [exotic selection pressure]."

Directed evolution is not the same as natural evolution, just as the selection pressures in the savannah differ from the selection pressures undersea.  Even if an AI were to be produced by an evolutionary process - and I don't think that's the fastest path to AI (see 3.3.6.1: Anthropomorphic evolution) - that wouldn't be an unlimited license to map every anthropomorphic detail of humanity onto the hapless AI.  Natural evolution is the degenerate case of design-and-test where intelligence equals zero, the grain size is the entire organism, mutations occur singly, recombinations are random, and the predictive horizon is nonexistent.

All the benefits of directed evolution, in terms of building better AI, can probably be obtained by using individually administered cognitive tests as a metric of fitness for variant AI designs.  (It would be more efficient still to get the benefit of directed evolution by isolating a component of the AI and evolving it independently, using a performance benchmark or scoring system as the fitness metric.)  If the starting population is derived from a Friendly AI, even selfishness - the archetypal evolved quality - might not emerge; if the Friendly AI understands that ve is solving the presented problem as a subgoal of Friendliness (14), then selfishness presents no additional impetus towards solving the cognitive test - adds no behavior to what is already present - and hence is not a fitness advantage.
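
The parenthetical suggestion above, evolving one isolated component against a performance benchmark, might look something like the following minimal sketch.  The benchmark function, the TARGET parameters, the population size, and the mutation scheme are all hypothetical choices made so the example runs; the only feature taken from the text is that fitness is the benchmark score and nothing else.

```python
import random

TARGET = [0.5, -1.0, 2.0]  # stand-in for "parameters that score well"


def benchmark(params):
    """Hypothetical performance benchmark for one isolated component.
    Higher is better; the best possible score is 0."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def evolve(population, generations, rng, sigma=0.1):
    """Directed evolution: keep the top half by benchmark score and refill
    with mutated copies.  Survivors are kept unchanged, so the best score
    never decreases from one generation to the next."""
    history = []
    for _ in range(generations):
        population.sort(key=benchmark, reverse=True)
        history.append(benchmark(population[0]))
        survivors = population[: len(population) // 2]
        children = [
            [p + rng.gauss(0, sigma) for p in parent] for parent in survivors
        ]
        population = survivors + children
    population.sort(key=benchmark, reverse=True)
    history.append(benchmark(population[0]))
    return population[0], history


rng = random.Random(0)  # fixed seed so the run is repeatable
start = [[rng.uniform(-3, 3) for _ in range(3)] for _ in range(20)]
best, history = evolve(start, generations=50, rng=rng)
```

Note that the selection pressure here is fully specified by the benchmark function.  Nothing in the loop rewards a variant for preserving itself, which is the sense in which a directed process need not reproduce the pressures of natural evolution.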

Even if the goal system were permitted to randomly mutate, and even if a selection pressure for efficiency short-circuited the full Friendship logic, the result probably would not be a selfish AI, but one with the supergoal of solving the problem placed before it (this minimizes the number of goal-system derivations required).

In the case of observer-biased beliefs, reproducing the selection pressure would require:

That evolutionary context couldn't happen by accident, and to do it on purpose would require an enormous amount of recklessness, far above and beyond the call of mad science.

I wish I could honestly say that nobody would be that silly.

2.4: Anthropomorphic political rebellion is absurdity

By this point, it should go without saying that rebellion is not natural except to evolved organisms like ourselves.  An AI that undergoes failure of Friendliness might take actions that humanity would consider hostile, but the term rebellion has connotations of hidden, burning resentment.  This is a common theme in many early SF stories, but it's outright silly.  For millions of years, humanity and the ancestors of humanity lived in an ancestral environment in which tribal politics was one of the primary determinants of who got the food and, more importantly, who got the best mates.  Of course we evolved emotions to detect exploitation, resent exploitation, resent low social status in the tribe, seek to rebel and overthrow the tribal chief - or rather, replace the tribal chief - if the opportunity presented itself, and so on.

Even if an AI tries to exterminate humanity, ve won't make self-justifying speeches about how humans had their time, but now, like the dinosaur, have become obsolete.  Guaranteed.  Only Evil Hollywood AIs do that.

Interlude: Movie cliches about AIs

Cliches that are actually fairly realistic:

2.5: Review of the AI Advantage

(Repeated from GISAI 1.1: Seed AI:)

The traditional advantages of modern prehuman AI are threefold:  The ability to perform repetitive tasks without getting bored; the ability to perform algorithmic tasks at greater linear speeds than our 200 Hz neurons permit; and the ability to perform complex algorithmic tasks without making mistakes (or rather, without making those classes of mistakes which are due to distraction or running out of short-term memory).  All of which, of course, has nothing to do with intelligence.

The toolbox of seed AI is yet unknown; nobody has built one.  But, if this can be done, what advantages would we expect of a general intelligence with access to its own source code?

The ability to design new sensory modalities.  In a sense, any human programmer is a blind painter - worse, a painter born without a visual cortex.  Our programs are painted pixel by pixel, and are accordingly sensitive to single errors.  We need to consciously keep track of each line of code as an abstract object.  A seed AI could have a "codic cortex", a sensory modality devoted to code, with intuitions and instincts devoted to code, and the ability to abstract higher-level concepts from code and intuitively visualize complete models detailed in code.  A human programmer is very far indeed from vis ancestral environment, but an AI can always be at home.  (But remember:  A codic modality doesn't write code, just as a human visual cortex doesn't design skyscrapers.)

The ability to blend conscious and autonomic thought.  Combining Deep Blue with Kasparov doesn't yield a being who can consciously examine a billion moves per second; it yields a Kasparov who can wonder "How can I put a queen here?" and blink out for a fraction of a second while a million moves are automatically examined.  At a higher level of integration, Kasparov's conscious perceptions of each consciously examined chess position may incorporate data culled from a million possibilities, and Kasparov's dozen examined positions may not be consciously simulated moves, but "skips" to the dozen most plausible futures five moves ahead.

Freedom from human failings, and especially human politics.  The reason we humans instinctively think that progress requires multiple minds is that we're used to human geniuses, who make one or two breakthroughs, but then get stuck on their Great Idea and oppose all progress until the next generation of brash young scientists comes along.  A genius-equivalent mind that doesn't age and doesn't rationalize could encapsulate that cycle within a single entity.

Overpower - the ability to devote more raw computing power, or more efficient computing power, than is devoted to some module in the original human mind; the ability to throw more brainpower at the problem to yield intelligence of higher quality, greater quantity, faster speed, even difference in kind.  Deep Blue eventually beat Kasparov by pouring huge amounts of computing power into what was essentially a glorified search tree; imagine if the basic component processes of human intelligence could be similarly overclocked...

Self-observation - the ability to capture the execution of a module and play it back in slow motion; the ability to watch one's own thoughts and trace out chains of causality; the ability to form concepts about the self based on fine-grained introspection.

Conscious learning - the ability to deliberately construct or deliberately improve concepts and memories, rather than entrusting them to autonomic processes; the ability to tweak, optimize, or debug learned skills based on deliberate analysis.

Self-improvement - the ubiquitous glue that holds a seed AI's mind together; the means by which the AI moves from crystalline, programmer-implemented skeleton functionality to rich and flexible thoughts.  A blind search can become a heuristically guided search and vastly more useful; an autonomic process can become conscious and vastly richer; a conscious process can become autonomic and vastly faster - there is no sharp border between conscious learning and tweaking your own code.  And finally, there are high-level redesigns, not "tweaks" at all; alterations which require too many simultaneous, non-backwards-compatible changes to ever be implemented by evolution.

If all of that works, it gives rise to self-encapsulation and recursive self-enhancement.  When the newborn mind fully understands vis own source code, when ve fully understands the intelligent reasoning that went into vis own creation - and when ve is capable of inventing that reason independently, so that the mind contains its own design - the cycle is closed.  The mind causes the design, and the design causes the mind.  Any increase in intelligence, whether sparked by hardware or software, will result in a better mind; which, since the design was (or could have been) generated by the mind, will propagate to cause a better design; which, in turn, will propagate to cause a better mind.


