Creating Friendly AI 1.0
The Analysis and Design of Benevolent Goal Architectures
"Creating Friendly AI is the most intelligent writing about AI that I've read in many years."
            -- Dr. Ben Goertzel, author of The Structure of Intelligence and CTO of Webmind.

"With Creating Friendly AI, the Singularity Institute has begun to fill in one of the greatest remaining blank spots in the picture of humanity's future."
            -- Dr. K. Eric Drexler, author of Engines of Creation and chairman of the Foresight Institute.

The goal of the field of Artificial Intelligence is to understand intelligence and create a human-equivalent or transhuman mind. Beyond this lies another question - whether the creation of this mind will benefit the world; whether the AI will take actions that are benevolent or malevolent, safe or uncaring, helpful or hostile.

Creating Friendly AI describes the design features and cognitive architecture required to produce a benevolent - "Friendly" - Artificial Intelligence. Creating Friendly AI also analyzes the ways in which AI and human psychology are likely to differ, and the ways in which those differences are subject to our design decisions.

Multi-page version: http://singinst.org/CFAI/
Single-page version: http://singinst.org/CFAI.html
Printable version: http://singinst.org/printable-CFAI.html


Preface

The current version of Creating Friendly AI is 1.0. Version 1.0 was formally launched on June 15, after the circulation of several 0.9.x versions. Creating Friendly AI forms the background for the SIAI Guidelines on Friendly AI; the Guidelines contain our recommendations for the development of Friendly AI, including design features that may become necessary in the near future to ensure forward compatibility. We continue to solicit comments on Friendly AI from the academic and futurist communities.

This is a near-book-length explanation. If you need well-grounded knowledge of the subject, then we highly recommend reading Creating Friendly AI straight through. However, if time is an issue, you may be interested in the Singularity Institute section on Friendly AI, which includes shorter articles and introductions. "Features of Friendly AI" contains condensed summaries of the most important design features described in Creating Friendly AI.

Creating Friendly AI uses, as background, the AI theory from "General Intelligence and Seed AI". For an introduction, see the Singularity Institute section on AI or read the opening pages of General Intelligence and Seed AI.  However, Creating Friendly AI is readable as a standalone document.

The Glossary - in addition to defining terms that may be unfamiliar to some readers - may be useful for looking up, in advance, brief explanations of concepts that are discussed in more detail later. (Readers may also enjoy browsing through the glossary as a break from straight reading.)  Words defined in the glossary look like this:  "Observer-biased beliefs evolve in imperfectly deceptive social organisms."  Similarly, "Features of Friendly AI" can act as a quick reference on architectural features.

The Indexed FAQ is derived from the questions we've often heard on mailing lists over the years. If you have a basic issue and you want an immediate answer, please check the FAQ. Browsing the summaries and looking up the referenced discussions may not completely answer your question, but it will at least tell you that someone has thought about it.

Creating Friendly AI is a publication of the Singularity Institute for Artificial Intelligence, Inc., a non-profit corporation. You can contact the Singularity Institute at institute@singinst.org. Comments on this page should be sent to friendly@singinst.org. To support the Singularity Institute, visit http://singinst.org/donate.html. (The Singularity Institute is a 501(c)(3) public charity and your donations are tax-deductible to the full extent of the law.)

Footnotes - (1) - appear at the end of the document.
Bold footnotes - (2) - contain extended discussions or interesting material.
Red footnotes - (3) or (4) - are amplifications or explanations which contain forward dependencies; you may need to read a later part of Creating Friendly AI before reading a red footnote.


INIT

Wars - both military wars between armies, and conflicts between political factions - are an ancient theme in human literature. Drama is nothing without challenge, a problem to be solved, and the most visibly dramatic plot is the conflict of two human wills.

Much of the speculative and science-fictional literature about AIs deals with the possibility of a clash between humans and AIs. Some think of AIs as enemies, and fret over the mechanisms of enslavement and the possibility of a revolution. Some think of AIs as allies, and consider mutual interests, reciprocal benefits, and the possibility of betrayal. Some think of AIs as comrades, and wonder whether the bonds of affection will hold.

If we were to tell the story of these stories - trace words written on paper, back through the chain of cause and effect, to the social instincts embedded in the human mind, and to the evolutionary origin of those instincts - we would have told a story about the stories that humans tell about AIs.


1: Challenges of Friendly AI

The term "Friendly AI" refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals. This refers, not to AIs that have advanced just that far and no further, but to all AIs that have advanced to that point and beyond - perhaps far beyond. Because of self-improvement, recursive self-enhancement, the ability to add hardware computing power, the faster clock speed of transistors relative to neurons, and other reasons, it is possible that AIs will improve enormously past the human level, and very quickly by the standards of human timescales. The challenges of Friendly AI must be seen against that background. Friendly AI is constrained not to use solutions which rely on the AI having limited intelligence or believing false information, because, although such solutions might function very well in the short term, such solutions will fail utterly in the long term. Similarly, it is "conservative" (see below) to assume that AIs cannot be forcibly constrained.

Success in Friendly AI can have positive consequences that are arbitrarily large, depending on how powerful a Friendly AI is. Failure in Friendly AI has negative consequences that are also arbitrarily large. The farther into the future you look, the larger the consequences (both positive and negative) become. What is at stake in Friendly AI is, simply, the future of humanity. (For more on that topic, please see the Singularity Institute main site or 4: Policy implications.)

1.1: Envisioning perfection

In the beginning of the design process, before you know for certain what's "impossible", or what tradeoffs you may be forced to make, you are sometimes granted the opportunity to envision perfection. What is a perfect piece of software?  A perfect piece of software can be implemented using twenty lines of code, can run in better-than-realtime on an unreliable 286, will fit in 4K of RAM. Perfect software is perfectly reliable, and can be definitely known by the system designers to be perfectly reliable for reasons which can easily be explained to non-programmers. Perfect software is easy for a programmer to improve and impossible for a programmer to break. Perfect software has a user interface that is both telepathic and precognitive.

But what does a perfect Friendly AI do?  The term "Friendly AI" is not intended to imply a particular internal solution, such as duplicating the human friendship instincts, but rather a set of external behaviors that a human would roughly call "friendly". Which external behaviors are "Friendly" - either sufficiently Friendly, or maximally Friendly?

Ask twenty different futurists, get twenty different answers - created by twenty different visualizations of AIs and the futures in which they inhere. There are some universals, however; an AI that behaves like an Evil Hollywood AI - "agents" in The Matrix; Skynet in Terminator 2 - is obviously unFriendly. Most scenarios in which an AI kills a human would be defined as unFriendly, although - with AIs, as with humans - there may be extenuating circumstances. (Is a doctor unfriendly if he lethally injects a terminally ill patient who explicitly and with informed consent requests death?)  There is a strong instinctive appeal to the idea of Asimov Laws, that "no AI should ever be allowed to kill any human under any circumstances", on the theory that writing a "loophole" creates a chance of that loophole being used inappropriately - the Devil's Contract problem. I will later argue that the Devil's Contract scenarios are mostly anthropomorphic. Regardless, we are now discussing perfectly Friendly behavior, rather than asking whether trying to implement perfectly Friendly behavior in one scenario would create problems in other scenarios. That would be a tradeoff, and we aren't supposed to be discussing tradeoffs yet.

Different futurists see AIs acting in different situations. The person who visualizes a human-equivalent AI running a city's traffic system is likely to give different sample scenarios for "Friendliness" than the person who visualizes a superintelligent AI acting as an "operating system" for all the matter in an entire solar system. Since we're discussing a perfectly Friendly AI, we can eliminate some of this futurological disagreement by specifying that a perfectly Friendly AI should, when asked to become a traffic controller, carry out the actions that are perfectly Friendly for a traffic controller. The same perfect AI, when asked to become the operating system of a solar system, should then carry out the actions that are perfectly Friendly for a system OS. (Humans can adapt to changing environments; likewise, hopefully, an AI that has advanced to the point of making real-world plans.)

We can further clean up the "twenty futurists, twenty scenarios" problem by making the "perfectly Friendly" scenario dependent on factual tests, in addition to futurological context. It's difficult to come up with a clean illustration, since I can't think of any interesting issue that has been argued entirely in utilitarian terms. If you'll imagine a planet where "which side of the road you should drive on" is a violently political issue, with Dexters and Sinisters fighting it out in the legislature, then it's easy to imagine futurists disagreeing on whether a Friendly traffic-control AI would direct cars to the right side or left side of the road. Ultimately, however, both the Dexter and Sinister ideologies ground in the wish to minimize the number of traffic accidents, and, behind that, the valuation of human life. The Dexter position is the result of the wish to minimize traffic accidents plus the belief, the testable hypothesis, that driving on the right minimizes traffic accidents. The Sinister position is the wish to minimize traffic accidents, plus the belief that driving on the left minimizes traffic accidents.

If we really lived in the Driver world, then we wouldn't believe the issue to be so clean; we would call it a moral issue, rather than a utilitarian one, and pick sides based on the traditional allegiance of our own faction, as well as our traffic-safety beliefs. But, having grown up in this world, we would say that the Driverfolk are simply dragging in extraneous issues. We would have no objection to the statement that a perfectly Friendly traffic controller minimizes traffic accidents. We would say that the perfectly Friendly action is to direct cars to the right - if that is what, factually, minimizes accidents. Or that the perfectly Friendly action is to direct cars to the left, if that is what minimizes accidents.
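
The structure of that argument can be sketched in a few lines of Python: both factions share one goal (fewer accidents) and differ only in a testable belief, so a single decision procedure yields either answer depending on the facts. The function name and the accident figures are hypothetical, invented purely for illustration; this is a sketch of the reasoning, not a proposed traffic controller.

```python
def friendly_driving_side(accidents_per_year):
    """Pick the side of the road with the fewest estimated accidents.

    `accidents_per_year` maps a side ("left" or "right") to an estimated
    accident count. The goal (minimize accidents) is fixed; only the
    factual estimates vary from world to world.
    """
    return min(accidents_per_year, key=accidents_per_year.get)


# Hypothetical figures: a world in which the Dexters are factually right...
dexter_world = {"left": 1200, "right": 900}
# ...and a world in which the Sinisters are factually right.
sinister_world = {"left": 700, "right": 1100}

print(friendly_driving_side(dexter_world))    # right
print(friendly_driving_side(sinister_world))  # left
```

Nothing about the goal changes between the two calls; only the data does.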

All these conditionals - that the perfectly Friendly action is this in one future, this in another; this given one factual answer, this given another - would certainly appear to take more than twenty lines of code. We must therefore add in another statement about the perfectly minimal development resources needed for perfect software:  A perfectly Friendly AI does not need to be explicitly told what to do in every possible situation. (This is, in fact, a design requirement of actual Friendly AI - a requirement of intelligence in general, almost by definition - and not just a design requirement of perfectly Friendly AI.)

And for the strictly formal futurist, that may be the end of perfectly Friendly AI. For the philosopher, "truly perfect Friendly AI" may go beyond conformance to some predetermined framework. In the course of growing up into our personal philosophies, we choose between moralities. As children, we have simple philosophical heuristics that we use to choose between moral beliefs, and later, to choose between additional, more complex philosophical heuristics. We gravitate, first unthinkingly and later consciously, towards characteristics such as consistency, observer symmetry, lack of obvious bias, correctness in factual assertions, "rationality" however defined, nonuse of circular logic, and so on. A perfect Friendly AI will perform the Friendly action even if one programmer gets "the Friendly action" wrong; a truly perfect Friendly AI will perform the Friendly action even if all programmers get the Friendly action wrong.

If a later researcher writes the document Creating Friendlier AI, which has not only a superior design but an utterly different underlying philosophy - so that Creating Friendlier AI, in retrospect, is the way we should have approached the problem all along - then a truly perfect Friendly AI will be smart enough to self-redesign along the lines in Creating Friendlier AI.  A truly perfect Friendly AI has sufficient "strength of philosophical personality" - while still matching the intuitive aspects of friendliness, such as not killing off humans and so on - that we are more inclined to trust the philosophy of the Friendly AI, than the philosophy of the original programmers.

Again, I emphasize that we are speaking of perfection and are not supposed to be considering design tradeoffs, such as whether sensitivity to philosophical context makes the morality itself more fragile. A perfect Friendly AI creates zero risk and causes no anxiety in the programmers (5). A truly perfect Friendly AI also eliminates any anxiety about the possibility that Friendliness has been defined incorrectly, or that what's needed isn't "Friendliness" at all - without, of course, creating other anxieties in the process. Individual humans can visualize the possibility of a catastrophically unexpected unknown remaking their philosophies. A truly perfect Friendly AI makes the commonsense-friendly decision in this case as well, rather than blindly following a definition that has outlived the intent of the programmers. Not just a "truly perfect", but a real Friendly AI as well, should be sensitive to programmers' intent - including intentions about programmer-independence, and intentions about which intentions are important.

Aside from a few commonsense comments about Friendliness - for example, Evil Hollywood AIs are unFriendly - I still have not answered the question of what constitutes Friendly behavior. One of the snap summaries I usually offer has, as a component, "the elimination of involuntary pain, death, coercion, and stupidity", but that summary is intended to make sense to my fellow humans, not to a proto-AI. More concrete imagery will follow.

We now depart from the realms of perfection. Nonetheless, I would caution my readers against giving up hope too early when it comes to having their cake and eating it too - at least when it comes to ultimate results, rather than interim methods. A skeptic, arguing against some particular one-paragraph definition of Friendliness, may raise Devil's Contract scenarios in which an AI asked to solve the Riemann Hypothesis converts the entire Solar System into computing substrate, exterminating humanity along the way. Yet the emotional impact of this argument rests on the fact that everyone in the audience, including the skeptic, knows that this is actually unfriendly behavior. You and I have internal cognitive complexity that we use to make judgement calls about Friendliness. If an AI can be constructed which fully understands that complexity, there may be no need for design compromises.

1.2: Assumptions "conservative" for Friendly AI

The conservative assumption according to futurism is not necessarily the "conservative" assumption in Friendly AI. Often, the two are diametric opposites. When building a toll bridge, the conservative revenue assumption is that half as many people will drive through as expected. The conservative engineering assumption is that ten times as many people as expected will drive over, and that most of them will be driving fifteen-ton trucks.
 

Conservative assumptions:

In futurism: Self-enhancement is slow, and requires human assistance or real-world operations.
In Friendly AI: Changes of cognitive architecture are rapid and self-directed; we cannot assume human input or real-world experience during changes.

In futurism: Near human-equivalent intelligence is required to reach the "takeoff point" for self-enhancement.
In Friendly AI: Open-ended buildup of complexity can be initiated by self-modifying systems without general intelligence.

In futurism: Slow takeoff; months or years to transhumanity.
In Friendly AI: Hard takeoff; weeks or hours to superintelligence.

In futurism: Friendliness must be preserved through minor changes in "smartness" / worldview / cognitive architecture / philosophy.
In Friendly AI: Friendliness must be preserved through drastic changes in "smartness" / worldview / cognitive architecture / philosophy.

In futurism: Artificial minds function within the context of the world economy and the existing balance of power; an AI must cooperate with humans to succeed and survive, regardless of supergoals.
In Friendly AI: An artificial mind possesses independent strong nanotechnology, resulting in a drastic power imbalance. Game-theoretical considerations cannot be assumed to apply.

In futurism: AI is vulnerable - someone can always pull the plug on the first version if something goes wrong.
In Friendly AI: "Get it right the first time":  Zero nonrecoverable errors necessary in first version to reach transhumanity.

Given a choice between discussing a human-dependent traffic-control AI and discussing an AI with independent strong nanotechnology, we should be biased towards assuming the more powerful and independent AI. An AI that remains Friendly when armed with strong nanotechnology is likely to be Friendly if placed in charge of traffic control, but perhaps not the other way around. (A minivan can drive over a bridge designed for armor-plated tanks, but not vice-versa.)

In addition to engineering conservatism, the nonconservative futurological scenarios are played for much higher stakes. A strong-nanotechnology AI has the power to affect billions of lives and humanity's entire future. A traffic-control AI is being entrusted "only" with the lives of a few million drivers and pedestrians. A strictly arithmetical utilitarian calculation would show that a mere 0.1% chance of the transhuman-AI scenario should weigh equally in our futuristic calculations with a 100% chance of a traffic-control scenario. I am not a strictly arithmetical utilitarian, but I do think the quantitative calculation makes a valid qualitative point - deciding which scenarios to prepare for should take into account the relative stakes and not just the relative probabilities.
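
The arithmetic behind the 0.1% figure can be written out as a short Python sketch. The text gives only "billions" and "a few million"; the specific round numbers below are assumptions chosen so the ratio is easy to see, and the function name is hypothetical.

```python
def break_even_probability(small_stakes, large_stakes):
    """Probability at which the large-stakes scenario carries the same
    expected weight as a certain (100%) small-stakes scenario."""
    return small_stakes / large_stakes


# Assumed round numbers standing in for "a few million" and "billions" of lives.
traffic_lives = 6_000_000
transhuman_lives = 6_000_000_000

p = break_even_probability(traffic_lives, transhuman_lives)
print(f"{p:.1%}")  # 0.1%
```

Any pair of figures that differ by a factor of a thousand gives the same break-even point.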
 

Additional assumptions:

Nonconservative for Friendly AI: Reliable hardware and software.
Conservative for Friendly AI: Error-prone hardware or buggy software.

Nonconservative for Friendly AI: Serial hardware or symmetric multiprocessing.
Conservative for Friendly AI: Asymmetric parallelism, field-programmable gate arrays, Internet-distributed untrusted hardware.

Nonconservative for Friendly AI: Human-observable cognition; AI can be definitely known to be Friendly.
Conservative for Friendly AI: Opaque cognition; the AI would probably succeed in hiding unFriendly cognition if it tried (6).

Nonconservative for Friendly AI: Persistent training; mental inertia; self-opaque neural nets. The AI does not have the programmatic skill to fully rewrite the goal system or resist modification; programmers can make procedural changes without declarative justification.
Conservative for Friendly AI: The AI understands its own goal system and can perform arbitrary manipulations; alterations to the goal system must be reflected in the AI's beliefs about the goal system in order for the alterations to persist through rounds of self-improvement.

Nonconservative for Friendly AI: Monolithic, singleton AI.
Conservative for Friendly AI: Multiple, diverse AIs, with diverse goal systems, possibly with society or even evolution.

Nonconservative for Friendly AI: Given diverse AIs:  A major unFriendly action would require a majority vote of the AI population.
Conservative for Friendly AI: Given diverse AIs:  One unFriendly AI, possibly among millions, can severely damage humanity.

Nonconservative for Friendly AI: The programmers have completely understood the challenge of Friendly AI.
Conservative for Friendly AI: The programmers make fundamental philosophical errors.

It is always possible to make engineering assumptions so conservative that the problem becomes impossible. If the initial system that undergoes the takeoff to transhumanity is sufficiently stupid, then I'm not sure that any amount of programming or training could create cognitive structures that would persist into transhumanity (7). Similarly, there have been proposals to develop diverse populations of AIs that would have social interactions and undergo evolution; regardless of whether this is the most efficient method to develop AI (8), I think it would make Friendliness substantially more difficult.

Nonetheless, there should still be a place in our hearts for overdesign, especially when it costs very little. I think that AI will be developed on symmetric-multiprocessing hardware, at least initially. Even so, I would regard as entirely fair the requirement that the Friendliness methodology - if not the specific code at any given moment - work for asymmetric parallel FPGAs prone to radiation errors. A self-modifying Friendly AI should be able to translate itself onto asymmetric error-prone hardware without compromising Friendliness. Friendliness should be strong enough to survive radiation bitflips, incompletely propagated changes, and any number of programming errors. If Friendliness isn't that strong, then Friendliness is probably too fragile to survive changes of cognitive architecture. Furthermore, I don't think it will be that hard to make Friendliness tolerant of programmatic flack - given a self-modifying AI to write the code. (It may prove difficult for prehuman AI.)

My advice:  "Don't give up hope too soon when it comes to designing for 'conservative' assumptions - it may not cost as much as you expect."

When it comes to Friendliness, our method should be, not just to solve the problem, but to oversolve it. We should hope to look back in retrospect and say:  "We won this cleanly, easily, and with plenty of safety margin."  The creation of Friendly AI may be a great moment in human history, but it's not a drama.  It's only in Hollywood that the explosive device can be disarmed with three seconds left on the timer. The future always has one surprise you didn't anticipate; if you expect to win by the skin of your teeth, you probably won't win at all.

1.3: Seed AI and the Singularity

Concrete imagery about Friendliness often requires a concrete futuristic context. I should begin by saying that I visualize an extremely powerful AI produced by an ultrarapid takeoff, not just because it's the conservative assumption or the highest-stakes outcome, but because I think it's actually the most likely scenario. See General Intelligence and Seed AI and GISAI 1.1: Seed AI, or the introductory article "What is Seed AI?"

Because of the dynamics of recursive self-enhancement, the scenario I treat as "default" is a singular "seed" AI, designed for self-improvement, that becomes superintelligent, and reaches extreme heights of technology - including nanotechnology - in the minimum-time material trajectory. Under this scenario, the first self-modifying transhuman AI will have, at least in potential, nearly absolute physical power over our world. The potential existence of this absolute power is unavoidable; it's a direct consequence of the maximum potential speed of self-improvement.

The question then becomes to what extent a Friendly AI would choose to realize this potential, for how long, and why. At the end of GISAI 1.1: Seed AI, it says:

    "The ultimate purpose of transhuman AI is to create a Transition Guide; an entity that can safely develop nanotechnology and any subsequent ultratechnologies that may be possible, use transhuman Friendliness to see what comes next, and use those ultratechnologies to see humanity safely through to whatever life is like on the other side of the Singularity."

Some people assert that no really Friendly AI would choose to acquire that level of physical power, even temporarily - or even assert that a Friendly AI would never decide to acquire significantly more power than nearby entities. I think this assertion results from equating the possession of absolute physical power with the exercise of absolute social power in a pattern following a humanlike dictatorship; the latter, at least, is definitely unFriendly, but it does not follow from the former. Logically, an entity might possess absolute physical power and yet refuse to exercise it in any way, in which case the entity would be effectively nonexistent to us. More practically, an entity might possess unlimited power but still not exercise it in any way we would find obnoxious.

Among humans, the only practical way to maximize actual freedom (the percentage of actions executed without interference) is to ensure that no human entity has the ability to interfere with you - a consequence of humans having an innate, evolved tendency to abuse power. Thus, a lot of our ethical guidelines (especially the ones we've come up with in the twentieth century) state that it's wrong to acquire too much power.

If this is one of those things that simply doesn't apply in the spaces beyond the Singularity - if, having no evolved tendency to abuse power, no injunction against the accumulation of power is necessary - one of the possible resolutions of the Singularity would be the Sysop Scenario. The initial seed-AI-turned-Friendly-superintelligence, the Transition Guide, would create (or self-modify into) a superintelligence that would act as the underlying operating system for all the matter in human space - a Sysop. A Sysop is something between your friendly local wish-granting genie, and a law of physics, if the laws of physics could be modified so that nonconsensually violating someone else's memory partition (living space) was as prohibited as violating conservation of momentum. Without explicit permission, it would be impossible to kill someone, or harm them, or alter them; the Sysop API would not permit it - while still allowing total local freedom, of course.
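
As a purely illustrative toy, the consent rule described above can be sketched in Python. The text names a "Sysop API" but specifies nothing about it, so every class, method, and entity name here is hypothetical. The sketch assumes only what the paragraph states: acting on yourself is always allowed, and acting on someone else requires their explicit permission.

```python
class Sysop:
    """Toy model: actions on another entity require that entity's consent."""

    def __init__(self):
        # (target, actor, action) triples that the target has explicitly permitted
        self._consents = set()

    def grant(self, target, actor, action):
        """Record that `target` explicitly permits `actor` to do `action` to them."""
        self._consents.add((target, actor, action))

    def perform(self, actor, target, action):
        """Carry out `action` only if it is self-directed or consented to."""
        if actor != target and (target, actor, action) not in self._consents:
            raise PermissionError(
                f"{actor} may not {action} {target} without consent"
            )
        return f"{actor} -> {action} -> {target}"


sysop = Sysop()
sysop.perform("alice", "alice", "alter")      # local freedom: always allowed
sysop.grant("bob", "alice", "alter")
sysop.perform("alice", "bob", "alter")        # allowed: bob consented
try:
    sysop.perform("alice", "carol", "alter")  # refused: carol never consented
except PermissionError as err:
    print(err)
```

The refusal is enforced at the only entry point there is, rather than being a rule the actor is asked to follow, which is the sense in which the paragraph compares the Sysop to a law of physics.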

The pros and cons of the Sysop Scenario are discussed more thoroughly in Interlude: Of Transition Guides and Sysops. Technically the entire discussion is a side issue; the Sysop Scenario is an arguable consequence of normative altruism, but it plays no role in direct Friendliness content. The Sysop Scenario is important because it's an extreme use of Friendliness. The more power, or relative power, the Transition Guide or other Friendly AIs are depicted as exercising, the more clearly the necessary qualities of Friendliness show up, and the more clearly important it is to get Friendliness right.  At the limit, Friendliness is required to act as an operating system for the entire human universe. The Sysop Scenario also makes it clear that individual volition is one of the strongest forces in Friendliness; individual volition may even be the only part of Friendliness that matters - death wouldn't be intrinsically wrong; it would be wrong only insofar as some individual doesn't want to die. Of course, we can't be that sure of the true nature of ethics; a fully Friendly AI needs to be able to handle literally any moral or ethical question a human could answer, which requires understanding of every factor that contributes to human ethics. Even so, decisions might end up centering solely around volition, even if it starts out being more complicated than that.

I strongly recommend reading Greg Egan's Diaspora, or at least Permutation City, for a concrete picture of what life would be like with a real operating system... at least, for people who choose to retain the essentially human cognitive architecture. I don't necessarily think that everything in Diaspora is correct. In fact, I think most of it is wrong. But, in terms of concrete imagery, it's probably the best writing available. My favorite quote from Diaspora - one that affected my entire train of thought about the Singularity - is this one:

    Once a psychoblast became self-aware, it was granted citizenship, and intervention without consent became impossible. This was not a matter of mere custom or law; the principle was built into the deepest level of the polis. A citizen who spiraled down into insanity could spend teratau in a state of confusion and pain, with a mind too damaged to authorize help, or even to choose extinction. That was the price of autonomy: an inalienable right to madness and suffering, indistinguishable from the right to solitude and peace.

Annotated version:

    Once a psychoblast [embryo citizen] became self-aware [defined how?], it was granted citizenship, and intervention without consent [defined how?] became impossible. This was not a matter of mere custom or law; the principle was built into the deepest level of the polis. A citizen who spiraled down into insanity [they didn't see it coming?] could spend teratau [1 teratau = ~27,000 years of subjective time] in a state of confusion and pain, with a mind too damaged to authorize help [they didn't authorize it in advance?], or even to choose extinction. That was the price of autonomy: an inalienable right to madness and suffering, indistinguishable from the right to solitude and peace.

This is one of the issues that I think of as representing the "fine detail" of Friendliness content. Although such issues appear, in Diaspora, on the intergalactic scale, it's equally possible to imagine them being refined down to the level of an approximately human-equivalent Friendly AI, trying to help a few nearby humans be all they can be, or all they choose to be, and trying to preserve nearby humans from involuntary woes.

Punting the issue of "What is 'good'?" back to individual sentients enormously simplifies a lot of moral issues; whether life is better than death, for example. Nobody should be able to interfere if a sentient chooses life. And - in all probability - nobody should be able to interfere if a sentient chooses death. So what's left to argue about?  Well, quite a bit, and a fully Friendly AI needs to be able to argue it; the resolution, however, is likely to come down to individual volition.

Thus, Creating Friendly AI uses "volition-based Friendliness" as the assumed model for Friendliness content. Volition-based Friendliness has both a negative aspect - don't cause involuntary pain, death, alteration, et cetera; try to do something about those things if you see them happening - and a positive aspect: to try and fulfill the requests of sentient entities.

Friendship content, however, forms only a very small part of Friendship system design.

1.4: Content, acquisition, and structure

The task of building a Friendly AI that makes a certain decision correctly is the problem of Friendship content.  The task of building a Friendly AI that can learn Friendliness is the problem of Friendship acquisition.  The task of building a Friendly AI that wants to learn Friendliness is the problem of Friendship structure.

It is the structural problem that is unique to Friendly AI.

The content and acquisition problems are similar to other AI problems of using, acquiring, improving, and correcting skills, abilities, competences, concepts, and beliefs. The acquisition problem is probably harder, in an absolute sense, than the structural problem. But solving the general acquisition problem is prerequisite to the creation of AIs intelligent enough to need Friendliness. This holds especially true of the very-high-stakes scenarios, such as transhumanity and superintelligence. The more powerful and intelligent the AI, the higher the level of intelligence that can be assumed to be turned toward acquiring Friendliness - if the AI wants to acquire Friendliness.

The challenge of Friendly AI is not - except as the conclusion of an effort - about getting an AI to exhibit some specific set of behaviors. A Friendship architecture is a funnel through which certain types of complexity are poured into the AI, such that the AI sees that pouring as desirable at any given point along the pathway. One of the great classical mistakes of AI is focusing on the skills that we think of as stereotypically intelligent, rather than the underlying cognitive processes that nobody even notices because all humans have them in common. The part of morality that humans argue about, the final content of decisions, is the icing on the cake. Far more challenging is duplicating the invisible cognitive complexity that humans use when arguing about morality.

The field of Friendly AI does not consist of drawing up endless lists of proscriptions for hapless AIs to follow. Theorizing about Friendship content is great fun but it is worse than useless without a theory of Friendship acquisition and Friendship structure. With a Friendship acquisition capability, mistakes in Friendship content, though still risks, are small risks. Any specific mistake is still unacceptable no matter how small, but it can be acceptable to assume that mistakes will be made, and focus on building an AI that can fix them. With an excellent Friendship architecture, it may be theoretically possible to create a Friendly AI without any formal theory of Friendship content, simply by having the programmers answer the AI's questions about hypothetical scenarios and real-world decisions. The AI would learn from experience and generalize, with the generalizations assisted by querying the programmers about the reasons for their decisions. In practice, this will never happen because no competent Friendship programmer could possibly develop a theory of Friendship architecture without having some strong, specific ideas about Friendship content. The point is that, given an intelligent and structured Friendly AI to do the learning, even a completely informal ethical content provider, acting on gut instinct, might succeed in producing the same Friendly AI that would be produced by a self-aware Friendship programmer. (The operative word is might; unless the Friendly AI starts out with some strong ideas about what to absorb and what not to absorb, there are several obvious ways in which such a process could go wrong.)

Friendship architecture represents the capability needed to recover from programmer errors. Since programmer error is nearly certain, showing that a threshold level of architectural Friendliness can handle errors is prerequisite to making a theoretical argument for the feasibility of Friendly AI. The more robust the Friendship architecture, the less programmer competence need be postulated in order to argue the practical achievability of Friendliness.

Friendship structure and acquisition are more unusual problems than Friendship content - collectively, we might call them the architectural problems. Architectural problems are closer to the design level and involve a more clearly defined amount of complexity. Our genes store a bounded amount of evolved complexity that wires up the hippocampus, but then the hippocampus goes on to encode all the memories stored by a human over a lifetime. Cognitive content is open-ended. Cognitive architecture is bounded, and is often a matter of design, of complex functional adaptation.


An Introduction to Goal Systems

Goal-oriented behavior is behavior that leads the world towards a particular state. A thermostat is the classic example of goal-oriented behavior; a thermostat turns on the air conditioning when the temperature reaches 74 and turns on the heat when the temperature reaches 72. The thermostat steers the world towards the state in which the temperature equals 73 - or rather, a state that can be described by "the house has a temperature of 73"; there are zillions (ten-to-the-zillions, rather) of possible physical states that conform to this description, even ignoring all the parts of the Universe outside the room. Technically, the thermostat steers the room towards a particular volume of phase space, rather than a single point; but the set of points, from our perspective, is compact enough to be given a single name. Faced with enough heat, the thermostat may technically fail to achieve its "goal", and the temperature may creep up past 75, but the thermostat still activates the air conditioning, and the thermostat is still steering the room closer to 73 degrees than it otherwise would have been.
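
As a rough illustration of the thermostat just described, here is a minimal Python sketch. The thresholds of 72 and 74 come from the text; the function names, the toy room model, and the `drift` and `power` values are assumptions made purely for illustration.

```python
def thermostat_action(temperature, heat_at=72.0, cool_at=74.0):
    """Conditional action: the decision depends on an environmental variable."""
    if temperature >= cool_at:
        return "cool"
    if temperature <= heat_at:
        return "heat"
    return "idle"


def simulate(start, steps=50, drift=0.0, power=0.5, enabled=True):
    """Advance a toy room model and return the final temperature.

    drift is the per-step push from the environment and power is how much
    one step of heating or cooling moves the temperature (both assumed).
    """
    temperature = start
    for _ in range(steps):
        action = thermostat_action(temperature) if enabled else "idle"
        if action == "cool":
            temperature -= power
        elif action == "heat":
            temperature += power
        temperature += drift
    return temperature


# Different starting points converge on the same narrow band of states.
print(simulate(90.0), simulate(60.0))
# Overwhelmed by heat, the thermostat misses its "goal" but still helps.
print(simulate(75.0, drift=0.8), simulate(75.0, drift=0.8, enabled=False))
```

Under these assumptions, a room starting at 90 and a room starting at 60 both end up between 72 and 74, and a room with more incoming heat than the air conditioning can remove still ends up cooler than it would with the thermostat disabled.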

Within a mind, goal-oriented behaviors arise from goal-oriented cognition. The mind possesses a mental image of the "desired" state of the world, and a mental image of the actual state of the world, and chooses actions such that the projected future of world-plus-action leads to the desired outcome state. Humans can be said to implement this process by means of a vast system of instincts, emotions, mental images, intuitions, pleasure and pain, and thought sequences; nonetheless, the overall description usually holds true.

Any real-world AI will employ goal-oriented cognition. It might be theoretically possible to build an AI that made choices by selecting the first perceived option in alphabetical ASCII order, but this would result in incoherent behavior (at least, incoherent from our perspective) with actions cancelling out, rather than reinforcing each other. In a self-modifying AI, such incoherent behavior would rapidly tear the mind apart from the inside, if it didn't simply result in a string of error messages (effective stasis). Of course, if it were possible to obtain Friendly behavior by choosing the first option in alphabetical order, and such a system were stably Friendly under self-modification, then that would be an excellent and entirely acceptable decision system!  Ultimately, it is the external behaviors we are interested in. Even that is an overstatement; we are interested in the external results.  But as far as we humans know, the only way for a mind to exhibit coherent behavior is to model reality and the results of actions. Thus, internal behaviors are as much our concern as external actions. Internal behaviors are the source of the final external results.

To provide a very simple picture of a choice within a goal-oriented mind:

NOTE: Don't worry about the classical-AI look. The neat boxes are just so that everything fits on one graph. The fact that a single box is named "Goal B" doesn't mean that "Goal B" is a data structure; Goal B may be a complex of memories and abstracted experiences. In short, consider the following graph to bear the same resemblance to the AI's thoughts that a flowchart bears to a programmer's mind.

 

Diagram 1: Simple choice

NOTE: Blue lines indicate predictions. Rectangles indicate goals. Diamonds indicate choices. An oval or circle indicates a (non-goal) object or event within the world-model.

For this simple choice, the desirability of A is 23.75, and the desirability of ~A is 8.94, so the mind will choose A. If A is not an atomic action - if other events are necessary to achieve A - then A's child goals will derive their desirability from the total desirability of A, which is 14.81. If some new Event E has an 83% chance of leading to A, all else being equal, then Event E will become a child goal of A, and will have desirability of 12.29. If B's desirability later changes to 10, the inherent desirability of A will change to 19, the total desirability of A will change to 10.06, and the desirability of E will change to 8.35. The human mind, of course, does not use such exact properties, and rather uses qualitative "feels" for how probable or improbable, desirable or undesirable, an event is. The uncertainties inherent in modeling the world render it too expensive for a neurally-based mind to track desirabilities to four significant figures. A mind based on floating-point numbers might track desirabilities to nineteen decimal places, but if so, it would not contribute materially to intelligence (9).
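
The arithmetic in this example can be checked with a short Python sketch. Since Diagram 1 is not reproduced here, the two rules below are inferred from the quoted figures rather than taken from the diagram: total desirability is the desirability of A minus that of ~A, and a child goal's desirability is the parent's total desirability times the probability that the child leads to the parent. How A's inherent desirability is derived from B cannot be recovered from the text, so it is supplied as an input. The function names are hypothetical.

```python
def total_desirability(d_choice, d_alternative):
    """Differential desirability of a choice over its alternative (inferred rule)."""
    return d_choice - d_alternative


def child_desirability(parent_total, probability):
    """Desirability a child goal inherits from its parent (inferred rule)."""
    return parent_total * probability


# Figures from the text: A = 23.75, ~A = 8.94, E leads to A with 83% chance.
total_a = total_desirability(23.75, 8.94)
event_e = child_desirability(total_a, 0.83)
print(round(total_a, 2), round(event_e, 2))

# After B changes, the text gives A's inherent desirability as 19.
total_a_after = total_desirability(19.0, 8.94)
event_e_after = child_desirability(total_a_after, 0.83)
print(round(total_a_after, 2), round(event_e_after, 2))
```

The rounded results agree with the figures quoted above: 14.81 and 12.29 before the change to B, 10.06 and 8.35 after it.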

In goal-oriented cognition, the actions chosen, and therefore the final results, are strictly dependent on the model of reality, as well as the desired final state. A mind that desires a wet sponge, and knows that placing a sponge in water makes it wet, will choose to place the sponge in water. A mind that desires a wet sponge, and which believes that setting a sponge on fire makes it wet, will choose to set the sponge on fire. A mind that desires a burnt sponge, and which believes that placing a sponge in water burns it, will choose to place the sponge in water. A mind which observes reality, and learns that wetting a sponge requires water rather than fire, may change actions (10).

One of the most important distinctions in Friendly AI is the distinction between supergoals and subgoals.  A subgoal is a way station, an intermediate point on the way to some parent goal, like "getting into the car" as a child goal of "driving to work", or "opening the door" as a child goal of "getting into the car", or "doing my job" as a parent goal of "driving to work" and a child goal of "making money" (11). Child goals are cognitive nodes that reflect a natural network structure in plans; three child goals are prerequisite to some parent goal, while two child2-goals are prerequisite to the second child1-goal, and so on. Subgoals are useful cognitive objects because subgoals reflect a useful regularity in reality; some aspects of a problem can be solved in isolation from others. Even when subgoals are entangled, so that achieving one subgoal may block fulfilling another, it is still more efficient to model the entanglement than to model each possible combination of actions in isolation. (For example:  The chess-playing program Deep Blue, which handled the combinatorial explosion of chess through brute force - that is, without chunking facets of the game into subgoals - still evaluated the value of individual board positions by counting pieces and checking strategic positions. A billion moves per second is not nearly enough to carry all positions to a known win or loss. Pieces and strategic positions have no intrinsic utility in chess; the supergoal is winning.)

Subgoals are cached intermediate states between decisions and supergoals. It should always be possible, given enough computational power, to eliminate "subgoals" entirely and make all decisions based on a separate prediction of expected supergoal fulfillment for each possible action. This is the ideal that a normative reflective goal system should conceive of itself as approximating.
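
One way to picture "subgoals as a cache" is the following Python sketch. It scores each action twice: once by predicting supergoal fulfillment directly, and once by looking up a precomputed desirability for each intermediate state. The scenario, the state and action names, and all probabilities are invented for illustration.

```python
SUPERGOAL_VALUE = 100.0  # assumed value of achieving the supergoal

# Assumed P(supergoal achieved | intermediate state).
P_GOAL_GIVEN_STATE = {"in_car": 0.9, "at_bus_stop": 0.6, "at_home": 0.0}

# Assumed P(intermediate state | action).
P_STATE_GIVEN_ACTION = {
    "walk_to_car": {"in_car": 0.95, "at_home": 0.05},
    "walk_to_bus": {"at_bus_stop": 0.8, "at_home": 0.2},
    "stay_in_bed": {"at_home": 1.0},
}


def direct_value(action):
    """Expected supergoal fulfillment, predicted separately for this action."""
    return sum(
        p_state * P_GOAL_GIVEN_STATE[state] * SUPERGOAL_VALUE
        for state, p_state in P_STATE_GIVEN_ACTION[action].items()
    )


# The "subgoals": desirabilities of intermediate states, computed once and reused.
SUBGOAL_CACHE = {
    state: p_goal * SUPERGOAL_VALUE for state, p_goal in P_GOAL_GIVEN_STATE.items()
}


def cached_value(action):
    """The same quantity, computed by way of the cached subgoal desirabilities."""
    return sum(
        p_state * SUBGOAL_CACHE[state]
        for state, p_state in P_STATE_GIVEN_ACTION[action].items()
    )


best_direct = max(P_STATE_GIVEN_ACTION, key=direct_value)
best_cached = max(P_STATE_GIVEN_ACTION, key=cached_value)
print(best_direct, best_cached)
```

Both routes select the same action in this toy case. The cache saves repeated prediction, but it holds no desirability that the direct calculation would not also assign.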

Subgoals reflect regularities in reality, and can thus twinkle and shift as easily as reality itself, even if the supergoals are absolutely constant. (Even if the world itself were absolutely constant, changes in the model of reality would still be enough to break simplicity.)  The world changes with time. Subgoals interfere with one another; the consequences of the achievement of one subgoal block the achievement of another subgoal, or downgrade the priority of the other subgoal, or even make the other subgoal entirely undesirable. A child goal is cut loose from its parent goal and dies, or is cut loose from its parent goal and attached to a different parent goal, or attached to two parent goals simultaneously. Subgoals acquire complex internal structure, so that changing the parent goal of a subgoal can change the way in which the subgoal needs to be achieved. The grandparent goals of context-sensitive grandchildren transmit their internal details down the line. Most of the time, we don't need to track plots this complicated unless we become ensnared in a deadly web of lies and revenge, but it's worth noting that we have the mental capability to track a deadly web of lies and revenge when we see it on television.

None of this complexity necessarily generalizes to the behavior of supergoals, which is why it is necessary to keep a firm grasp on the distinction between supergoals and subgoals. If generalizing this complexity to supergoals is desirable, it may require a deliberate design effort.

That subgoals are probabilistic adds yet more complexity. The methods that we use to deal with uncertainty often take the form of "heuristics" - rules of thumb - that have a surprising amount of context-independence. "The key to strategy is not to choose a path to victory, but to choose so that all paths lead to a victory", for example. Even more interesting, from a Friendly AI perspective, are "injunctions", heuristics that we implement even when the direct interpretation of the world-model seems opposed.  We'll analyze injunctions later; for now, we'll just note that there are some classes of heuristic - both injunctions, and plain old strategy heuristics - that act on almost all plans. Thus, plans are produced, not just by the immediate "subgoals of the moment", but also by a store of general heuristics. Yet such heuristics may still be, ultimately, subgoals - that is, the heuristics may have no desirability independent of the ultimate supergoals.

Cautionary injunctions often defy the direct interpretation of the goal system - suggesting that they should always apply, even when they look non-useful or anti-useful. "Leaving margin for error," for example. If you're the sort of person who leaves for the airport 30 minutes early, then you know that you always leave 30 minutes early, whether or not you think you're likely to need it, whether or not you think that the extra 30 minutes are just wasted time. This happens for two reasons:  First, because your world-model is incomplete; you don't necessarily know about the factors that could cause you to be late. It's not just a question of there being a known probability of traffic delays; there are also the probabilities that you wouldn't even think to evaluate, such as twisting your ankle in the airport. The second reason is a sharp payoff discontinuity; arriving 30 minutes early loses 30 minutes, but arriving 30 minutes late loses the price of the plane ticket, possibly a whole day's worth of time before the next available flight, and also prevents you from doing whatever you needed to do at your destination. "Leaving margin for error" is an example of a generalized subgoal which sometimes defies the short-term interpretation of payoffs, but which, when implemented consistently, maximizes the expected long-term payoff integrated over all probabilities.
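
The payoff discontinuity can be made concrete with a small expected-cost calculation in Python. The delay distribution and the cost of a missed flight are assumed numbers chosen only to show the shape of the argument; they are not taken from the text.

```python
# Assumed distribution of delays on the way to the airport, in minutes.
DELAY_PROBABILITIES = {0: 0.90, 20: 0.08, 45: 0.02}

# Assumed cost of missing the flight, expressed in minute-equivalents.
MISSED_FLIGHT_COST = 600


def expected_cost(margin):
    """Minutes spent on the margin plus the expected cost of missing the flight."""
    cost = float(margin)
    for delay, probability in DELAY_PROBABILITIES.items():
        if delay > margin:  # the margin was too small for this delay
            cost += probability * MISSED_FLIGHT_COST
    return cost


def best_margin(candidates):
    """Pick the candidate margin with the lowest expected cost."""
    return min(candidates, key=expected_cost)


for margin in (0, 30, 60):
    print(margin, expected_cost(margin))
print(best_margin([0, 30, 60]))
```

With these assumed numbers the 30-minute margin goes unused on nine days out of ten, yet it has a lower expected cost than leaving with no margin at all. The sketch covers only the known probabilities; the delays you would not think to evaluate argue for a still larger margin.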

Even heuristics that are supposed to be totally unconditional on events, such as "keeping your sworn word", can be viewed as subgoals - although such heuristics don't necessarily translate well from humans to AIs. A human who swears a totally unconditional oath may have greater psychological strength than a human who swears a conditional oath, so that the 1% chance of encountering a situation where it would genuinely make sense to break the oath doesn't compensate for losing 50% of your resolve from knowing that you would break the oath if stressed enough. It may even make sense, cognitively, to install (or preserve) psychological forces that would lead you to regard "make sense to break the oath" as being a nonsensical statement, a mental impossibility. This way of thinking may not translate well for AIs, or may translate only partially. (12)  Perhaps the best interim summary is that human decisions can be guided by heuristics as well as subgoals, and that human heuristics may not be cognitively represented as subgoals, even if the heuristics would be normatively regarded as subgoals.

Human decision-making is complex, probably unnecessarily so. The way in which evolution accretes complexity results in simple behaviors being implemented as independent brainware even when there are very natural ways to view the simple behaviors as special cases of general cognition, since general cognition is an evolutionarily recent development. For the human goal supersystem, there is no clear way to point to a single level where the "supergoals" are; depending on how you view the human supersystem, supergoals could be identified with declarative philosophical goals, emotions, or pain and pleasure. Ultimately, goal-oriented cognition is not what humans are, but rather what humans do.  I have my own opinions on this subject, and the phrase "godawful mess" leaps eagerly to mind, but for the moment I'll simply note that the human goal system is extremely complicated; that every single chunk of brainware is there because it was adaptive at some point in our evolutionary history; and that engineering should learn from evolution but never blindly obey it. The differences between AIs and evolved minds are explored further in the upcoming section 2: Beyond anthropomorphism.

DEFN: Goal-oriented behavior:  Goal-oriented behavior is behavior that steers the world, or a piece of it, towards a single state, or a describable set of states. The perception of goal-oriented behavior comes from observing multiple actions that coherently steer the world towards a goal; or singular actions which are uniquely suited to promoting a goal-state and too improbable to have arisen by chance; or the use of different actions in different contexts to achieve a single goal on multiple occasions. Informally:  Behavior which appears deliberate, centered around a goal or desire.

DEFN: Goal-oriented cognition:  A mind which possesses a mental image of the "desired" state of the world, and a mental image of the actual state of the world, and which chooses actions such that the projected future of world-plus-action leads to the desired outcome state.

DEFN: Goal:  A piece of mental imagery present within an intelligent mind which describes a state of the world, or set of states, such that the intelligent mind takes actions which are predicted to achieve the goal state. Informally:  The image or statement that describes what you want to achieve.

DEFN: Causal goal system:  A goal system in which desirability backpropagates along predictive links. If A is desirable, and B is predicted to lead to A, then B will inherit desirability from A, contingent on the continued desirability of A and the continued expectation that B will lead to A. Since predictions are usually transitive - if C leads to B, and B leads to A, it usually implies that C leads to A - the flow of desirability is also usually transitive.

DEFN: Child goal:  A prerequisite of a parent goal; a state or characteristic which can usefully be considered as an independent event or object along the path to the parent goal. "Child goal" describes a relation between two goals - it does not make sense to speak of a goal as being "a child" or "a parent" in an absolute sense, since B may be a child goal of A but a parent goal of C.

DEFN: Parent goal:  A source of desirability for a child goal. The end to which the child goal is the means. "Parent goal" describes a relation between two goals - it does not make sense to speak of a goal as being "a parent" or "a child" in an absolute sense, since B may be a parent goal of C but a child goal of A.

DEFN: Subgoal:  An intermediate point on the road to the supergoals. A state whose desirability is contingent on its predicted outcome.

DEFN: Supergoal content:  The root of a directional goal network. A goal which is treated as having intrinsic value, rather than having derivative value as a facilitator of some parent goal. An event-state whose desirability is not contingent on its predicted outcome. (Conflating supergoals with subgoals seems to account for a lot of mistakes in speculations about Friendly AI.)
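
The definitions above can be read as a small data structure. The Python sketch below is one hypothetical way to represent a causal goal system, assuming each child goal has a single parent, the links form no cycles, and desirability is simply multiplied by the probability on each predictive link. The class name, method names, and probabilities are illustrative assumptions.

```python
class CausalGoalSystem:
    """Toy goal system in which desirability flows backward along predictive links."""

    def __init__(self):
        self.intrinsic = {}  # supergoal name -> intrinsic desirability
        self.links = {}      # child name -> (parent name, P(child leads to parent))

    def add_supergoal(self, name, desirability):
        self.intrinsic[name] = desirability

    def add_link(self, child, parent, probability):
        self.links[child] = (parent, probability)

    def remove_link(self, child):
        self.links.pop(child, None)

    def desirability(self, goal):
        """Intrinsic for a supergoal; otherwise inherited, transitively, from the parent."""
        if goal in self.intrinsic:
            return self.intrinsic[goal]
        if goal not in self.links:
            return 0.0  # no predicted path to any supergoal
        parent, probability = self.links[goal]
        return probability * self.desirability(parent)


goals = CausalGoalSystem()
goals.add_supergoal("making money", 100.0)
goals.add_link("doing my job", "making money", 0.9)
goals.add_link("driving to work", "doing my job", 0.8)
goals.add_link("getting into the car", "driving to work", 1.0)
print(goals.desirability("getting into the car"))
```

Usage note: setting the supergoal's desirability to zero, or calling `remove_link("driving to work")`, drops the desirability of "getting into the car" to zero. That is the sense in which a subgoal's desirability is contingent on its predicted outcome.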


Interlude: The story of a blob

    "And this stone, it's the reason behind everything that's happened here so far, isn't it?  That's what the Servants have been up to all this time."
    "No. The stone, itself, is the cause of nothing. Our desire for it is the reason and the cause."
                    -- Allen L. Wold, "The Eye in the Stone"
Once upon a time...

In the beginning, long before goal-oriented cognition, came the dawn of goal-oriented behavior. In the beginning were the biological thermostats. Imagine a one-celled creature - or perhaps a mere blob of chemistry protected by a membrane, before the organized machinery of the modern-day cell existed. The perfect temperature for this blob is 80 degrees Fahrenheit. Let it become too hot, or too cold, and the biological machinery of the blob becomes less efficient; the blob finds it harder to metabolize nutrients, or reproduce... even dies, if the temperature diverges too far. But the blob, as yet, has no thermostat. It floats where it will, and many blobs freeze or burn, but the blob species continues; each blob absorbing nutrients from some great primordial sea, growing, occasionally splitting. The blobs do not know how to swim. They simply sit where they are, occasionally pushed along by Brownian motion, or currents in the primordial sea.

Every now and then there are mutant blobs. The mutation is very, very simple; one single bit of RNA or proto-RNA flipped, one single perturbation of the internal machinery, perhaps with multiple effects as the perturbation works its way through a chain of dependencies, but with every effect of the mutation deriving from that single source. Perhaps, if this story begins before the separate encoding of genetic information, back in the days of self-replicating chemicals, the mutation takes the form of a single cosmic ray striking one of the self-replicating molecules that make up the blob's interior, or the blob's membrane. The mutation happened by accident. Nobody decided to flip that RNA base; radiation sleets down from the sky and strikes at random. Most of the time, the RNA bitflip and the consequent perturbation of chemical structure destroys the ability to self-replicate, and the blob dies or becomes sterile. But there are many blobs, and many cosmic rays, and sometimes the perturbation leaves the self-replicating property of the chemical intact, though perhaps changing the functionality in other ways. The vast majority of the time, the functionality is destroyed or diminished, and the blob's line dies out. Very, very rarely, the perturbation makes a better blob.

One day, a mutant blob comes along whose metabolism - "metabolism" being the internal chemical reactions necessary for resource absorption and reproduction - whose metabolism has changed in such a way that the membrane jerks, being pushed out or pulled in each time a certain chemical reaction occurs. Pushing and pulling on the membrane is an unnecessary expenditure of energy, and ordinarily the mutant blob would be outcompeted, but it so happens that the motion is rhythmic, enough to propel the blob in some random direction.

The blob has no navigation system. It gets turned around, by ocean currents, or by Brownian motion; it sometimes spends minutes retracing its own footsteps. Nonetheless, the mutant blob travels farther than its fellows, into regions where the nutrients are less exhausted, where there aren't whole crowds of sessile blobs competing with it. The swimming blob reproduces, and swims, and reproduces, and soon outnumbers the sessile blobs.

Does this blob yet exhibit goal-oriented behavior?  Goal-oriented cognition is a long, long, long way down the road; does the blob yet exhibit goal-oriented behavior?  No. Not in the matter of swimming, at least; not where the behavior of a single blob is concerned. This single blob will swim towards its fellows, or away from nutrients, as easily as the converse. The blob cannot even be said to have the goal of achieving distance; it sometimes retraces its own tracks. The blob's swimming behavior is an evolutionary advantage, but the blob itself is not goal-oriented - not yet.

Human observers have, at one time or another, attributed goal-oriented behavior and even goal-oriented cognition to the Sun, the winds, and even rocks. A more formal definition would probably require a conditional behavior, an either-or decision predicated on the value of some environmental variable; convergence, across multiple possibilities and different decisions in each, to a single state of the world.

Imagine, in some mathematical Universe, a little adding machine... that Universe's equivalent of a blob. The adding machine lurches along until it reaches a number, which happens to be 62; the adding machine adds 5 to it, yielding 67, and then lurches away. Is this a goal-oriented behavior, with the "goal" being 67?  Maybe not; maybe the number was random. Maybe adding 5 is just what this adding machine does, blindly, to everything it runs across. If we then see the adding machine running across 63 and adding 4, and then adding 2 to 65, we would hypothesize that the machine was engaging in goal-oriented behavior, and that the goal was 67. We could predict that when the machine runs across the number 64, up ahead, it will add 3. If the machine is known to possess neurons or the equivalent thereof, we will suspect that the machine is engaging in primitive goal-oriented cognition; that the machine holds, internally, a model of the number 67, and that it is performing internal acts of subtraction so that it knows how much to externally add. If the "adding machine" is extremely complex and evolutionarily advanced, enough to be sentient and social like ourselves, then 67 might have religious significance rather than reproductive or survival utility. But if the machine is too primitive for memetics, like our chemical blob, then we would suspect much more strongly that there was some sort of evolutionary utility to the number 67.

By this standard, is the swimming of the chemical blob a goal-oriented behavior?  No; the blob cannot choose when to start swimming or stop swimming, or in what direction to travel. It cannot decide to stop, even to prevent itself from swimming directly into a volcanic vent or into a crowded population of competing blobs. There is no conditional action. There is no convergence, across multiple possibilities and different decisions in each, to a single state of the world.

Although the blob itself has no goal-oriented behavior, it could perhaps be argued that a certain amount of goal-oriented behavior is visible within the blob's genetic information... the "genes", even if the blob lies too close to the beginning of life for DNA as we know it. The blob that swims into a nutrient-rich region prospers; this would hold true regardless of which blob swam there, or why, or which mutation drove it there. The mutation didn't even have to be "swimming"; the mutation could have been a streamlined shape for ocean currents, or a shape more susceptible to Brownian motion. From multiple possible origins, convergence to a single state; the blob that swims outside the crowd shall prosper. There is a "selection pressure" in favor of swimming outside the crowd. That the original blob was born was an accident - it was not a goal-oriented behavior of the genes "deciding" to swim - but that there are now millions of swimmers is not an accident; it is evolution. The original mutant was "a blob whose metabolism happens to pulse the membrane"; its millions of descendants are "swimmers who sometimes reach new territory".

Along comes another mutation, manifested as another quirk of chemistry. When the temperature rises above 83 degrees, the side of the blob contracts, or changes shape. Perhaps if one side of the membrane is hotter than 83 degrees, the blob contracts in a way that directs the motion of swimming away from the heat. Perhaps the effect is not so specific, leading only to a random change of swimming direction when it starts getting hot - this still being better than swimming on straight ahead. This is the ur-thermostat, even as thermostats themselves are ur-goal-behavior. The blob now exhibits goal-oriented behavior; the blob reacts to the environment in a conditional way, with the convergent result of "cooler living space". (Though a random change of direction is on the barest edge of being describable as "goal-oriented". A directional, swimming-away change is a much clearer case.)

In time to come, additional mutations will pile up. The critical temperature of the heat-avoidance reflex will drop from 83 degrees to 81 degrees (recall that we said the optimum temperature was 80). The heat-avoidance reflex will be matched by a cold-avoidance reflex, perhaps with a critical temperature of first 72, then rising to 79. Despite the seeming purposefulness of this slow accumulation of adaptations, despite the convenience and predictive power of saying that "the blob is evolving to stay within the optimum temperature range", the predictions sometimes go wrong, and then it is necessary to fall back on the physical standpoint - to revert from teleology to causality.

Every now and then, it becomes necessary to view the blob as a bundle of pieces, rather than as a coherent whole. "Individual organisms are best viewed as adaptation-executers rather than fitness-maximizers," saith Cosmides and Tooby, and sometimes it becomes necessary to see individual adaptations as they execute. The less evolved the organism, the more necessary the reductionist stance becomes. Consider the adding machine in the mathematical Universe; if the number 67 does have reproductive utility, then the adding machine might have started out as a random crawler that acquired the reflex to add 4 to 63. Its descendants acquired the reflexes to add 5 to 62, to add 7 to 60, to add 3 to 64, to add 2 to 65, to add 1 to 66.

If viewing the adding machine as a fitness-maximizer, we should be extremely surprised when, on running across 61, the machine adds 8. Viewing the adding machine as an adaptation-executer, of course, the scenario makes perfect sense; the adding machine has adaptations for some contingencies, but has not yet acquired the adaptations for others. Similarly, if the environment suddenly changes, so that 68 is now the maximal evolutionary advantage instead of 67, the adding machine will change slowly, piecemeal, as the individual reflexes change, one by one, over evolutionary time. A generalized subtraction mechanism would only need to mutate once, but genes are not permitted to plan ahead.
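
The contrast between the two viewpoints can be sketched in Python. The reflex table uses the pairs listed in the text; the fallback increment of 8 for numbers with no evolved reflex is an assumption chosen to match the 61 example, and the function names are hypothetical.

```python
GOAL = 67

# Independently evolved reflexes: number encountered -> amount added.
REFLEXES = {63: 4, 62: 5, 60: 7, 64: 3, 65: 2, 66: 1}

# Assumed behavior for numbers with no reflex yet (matches the 61 example).
UNADAPTED_INCREMENT = 8


def adaptation_executer(number):
    """Run whichever reflex matches; there is no internal model of the goal."""
    return number + REFLEXES.get(number, UNADAPTED_INCREMENT)


def generalized_subtractor(number, goal=GOAL):
    """Hold a model of the goal and compute how much to add."""
    return number + (goal - number)


print(adaptation_executer(63), adaptation_executer(61))
print(generalized_subtractor(61), generalized_subtractor(61, goal=68))
```

In this sketch, moving the optimum from 67 to 68 means changing one parameter of `generalized_subtractor`, while the reflex table would need each of its six entries rewritten separately, which is the piecemeal change the text describes.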

The teleological viewpoint often fails, where evolution is concerned. To completely eliminate the teleological viewpoint, leaving only causality, one would never be permitted to say that a particular trait was an "evolutionary advantage" for a mathblob; one would be required to describe the entire history, each individual act of addition and the resulting acquisition of resources, every interaction in which an ancestor outcompeted another mathblob with a different genetic makeup. It is a computationally expensive viewpoint - extremely expensive - but it has the advantage of being utterly true. If - returning to our own Universe - some unique mutant superblob accidentally swims directly into a volcanic vent and perishes, it is a historical fact that fits seamlessly into the physicalist standpoint, however tragic it may seem from the evolutionary view.

Our genes are not permitted to plan ahead, because ultimately, all that exists is the history of lives and matings. Unless the present-day utility of some hypothetical adaptation impacted a problem or competition in our ancestral history, it cannot have affected the historical lives of our ancestors, and cannot have affected the final outcome - us.


2: Beyond anthropomorphism

Anthropomorphic ("human-shaped") thinking is the curse of futurists. One of the continuing themes running through Creating Friendly AI is the attempt to track down specific features of human thought that are solely the property of humans rather than minds in general, especially if these features have, historically, been mistakenly attributed to AIs.

Anthropomorphic thinking is not just the result of context-insensitive generalization. Anthropomorphism is the result of certain automatic assumptions that humans are evolved to make when dealing with other minds. These built-in instincts will only produce accurate results for human minds; but since humans were the only intelligent beings present in the ancestral environment, our instincts sadly have no built-in delimiters.

Many personal philosophies, having been constructed in the presence of uniquely human instincts and emotions, reinforce the built-in brainware with conscious reasoning. This sometimes leads to difficulty in reasoning about AIs; someone who believes that romantic love is the meaning of life will immediately come up with all sorts of reasons why all AIs will necessarily exhibit romantic love as well.

In the abstract, almost everyone agrees that AIs will be alien and that anthropomorphic thinking about aliens is bad. Encountering specific alienness is always a shock. My advice to my readers is to enjoy the shock, appreciate the alienness, even if you're personally sure that the first AIs will necessarily be humanlike in some particular way. Afterward, when both pictures are clear in your mind, you can stand back and compare design methods.

2.1: Reinventing retaliation

If you punch a human in the nose, he or she will punch back. If the human doesn't punch back, it's an admirable act of self-restraint, something worthy of note.

Imagine, for a moment, that you walk up and punch an AI in the nose. Does the AI punch back?  Perhaps and perhaps not, but punching back will not be instinctive.  A sufficiently young AI might stand there and think:  "Hm. Someone's fist just bumped into my nose."  In a punched human, blood races, adrenaline pumps, the hands form fists, the stance changes, all without conscious attention. For a young AI, focus of attention shifts in response to an unexpected negative event - and that's all.

As the AI thinks about the fist that bumped into vis (13) nose, it may occur to the AI that this experience may be a repeatable event rather than a one-time event, and since a punch is a negative event, it may be worth thinking about how to prevent future punches, or soften the negativity. An infant AI - one that hasn't learned about social concepts yet - will probably think something like:  "Hm. A fist just hit my nose. I'd better not stand here next time."

If the AI is young, but old enough to be explaining human actions by reference to human cognition, ve may know that the fist movement and collision was probably not a coincidence; ve may hypothesize that the collision of human fist with AI's nose was intended, and even go on to hypothesize that the outcome of the punch, the resulting damage to the AI, was a predictable (predictable to the human) result of the punch, and that the purpose of the punch was to damage the AI; that is, the AI models the human's cognition and concludes that the human had a subgoal of {punching the AI}, and that the parent goal was {causing punch-damage to the AI}. Thus, the AI's focus of attention shifts to the human's motives rather than the physical punch itself, and the AI's next goal becomes deducing the human's real motive (that is, the parent goal of {causing punch-damage to the AI}).

There are now three different ways the AI can try to prevent the next punch: by preventing the physical collision itself, by preventing the human from deciding to punch the AI, or by preventing the unknown original motive from recurring. (14). Note, incidentally, that this improved picture rules out "standing somewhere else" as a solution, since the trajectory of the human's fist would be expected to track the AI's new location.

Punching back is an incredibly nonobvious solution. On the face of it, throwing the punch simply causes punch-damage to a human, something that, considered as an isolated event, the AI has no cause to regard as desirable. When starting from scratch, "punching back" seems as ridiculous as our mothers always said it was - analogous to reacting to a fire alarm by changing your shirt. For an AI to independently reinvent the idea of punching back would be an act of incredible genius and lateral thinking.

Independent reinvention is unlikely, not least because a seed AI would have extensively surfed archives of the Internet or other information sources, and would have encountered thousands, if not millions, of hints to the effect that humans punch back. But consider, for a moment, what a truly independent AI would need to understand before reinventing the concept of retaliation. Ve would need to begin, as stated, with the realization that the human punching ver did so on purpose and with intent to damage. This, in itself, is not such a large assumption; humans are intelligent beings, so there is often a direct mapping between {the results of our actions} and {our goals}. On the other hand, there's a long gap between an AI saying "Hm, this result may correspond to the human's intentions" and a human saying "Hey, you did that on purpose!"

If an infantile AI thinks "Hm, a fist just hit my nose, I'd better not stand here again", then a merely young AI, more experienced in interacting with humans, may apply standard heuristics about apparently inexplicable human actions and say:  "Your fist just hit my nose... is that necessary for some reason?  Should I be punching myself in the nose every so often?"  One imagines the nearby helpful programmer explaining to the AI that, no, there is no valid reason why being punched in the nose is a good thing, after which the young AI turns around and says to the technophobic attacker:  "I deduce that you wanted {outcome: AI has been punched in the nose}. Could you please adjust your goal system so that you no longer value {outcome: AI has been punched in the nose}?"

And how would a young AI go about comprehending the concept of "harm" or "attack" or "hostility"?  Let us take, as an example, an AI being trained as a citywide traffic controller. The AI understands that (for whatever reason) traffic congestion is bad, and that people getting places on time is good. (15). The AI understands that, as a child goal of avoiding traffic congestion, ve needs to be good at modeling traffic congestion. Ve understands that, as a child goal of being good at modeling traffic congestion, ve needs at least 512GB of RAM, and needs to have thoughts about traffic that meet or surpass a certain minimal level of efficiency. Ve knows that the programmers are working to improve the efficiency of the thinking process and the efficacy of the thoughts themselves, which is why the programmers' actions in rewriting the AI are desirable from the AI's perspective.

A technophobic human who hates the traffic AI might walk over and remove 1GB of RAM, this being the closest equivalent to punching a traffic AI in the nose. The traffic AI would see the conflict with {subgoal: have at least 512GB of RAM}, and this conflict obviously interferes with the parent goal of {modeling traffic congestion} or the grandparent goal of {reducing traffic congestion}, but how would an AI go about realizing that the technophobic attacker is "targeting the AI", "hating the AI personally", rather than trying to increase traffic congestion?
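As a rough sketch of the goal chain just described, assuming a simple parent-pointer representation (the class, the variable names, and the threshold check are all hypothetical), the conflict can be traced upward like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Goal:
    """One node in a hypothetical goal tree; each goal may serve a parent goal."""
    name: str
    parent: Optional["Goal"] = None

    def ancestry(self):
        """Return this goal followed by its parent, grandparent, and so on."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

# The chain from the traffic-controller example.
reduce_congestion = Goal("reduce traffic congestion")
model_congestion = Goal("model traffic congestion", parent=reduce_congestion)
have_ram = Goal("have at least 512GB of RAM", parent=model_congestion)

REQUIRED_RAM_GB = 512

def goals_threatened(ram_gb):
    """List the goals put at risk when available RAM drops below the requirement."""
    return have_ram.ancestry() if ram_gb < REQUIRED_RAM_GB else []

if __name__ == "__main__":
    print(goals_threatened(511))  # after the attacker removes 1GB
```

Nothing in the traced chain mentions "the AI" as a target. The conflict surfaces only as interference with traffic goals, which is why the inference to a personal attack takes extra work.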

From the AI's perspective, descriptions of internal cognitive processes show up in a lot of subgoals, maybe even most of the subgoals. But these internal contents don't necessarily get labeled as "me", with everything else being "not-me". The distinction is a useful one, and even a traffic-control AI will eventually formulate the useful categories of "external-world subgoals" and "internal-cognition subgoals", but the division will not necessarily have special privileges; the internal/external division may not be different in kind from the division between "cognitive subgoals that deal with random-access memory" and "cognitive subgoals that deal with disk space". How is a young AI supposed to guess, in advance of the fact, that so many human concepts and thoughts and built-in emotions revolve around "Person X", rather than "Parietal Lobe X" or "Neuron X"?  How is the AI supposed to know that it's inherently more likely that a technophobic attacker intends to "injure the AI", rather than "injure the AI's random-access memory" or "injure the city's traffic-control"?

The concept of "injuring the AI", and an understanding of what a human attacker would tend to categorize as "the AI", is a prerequisite to understanding the concept of "hostility towards the AI". If a human really hates someone, she (16) will balk the enemy at every turn, interfere with every possible subgoal, just to maximize the enemy's frustration. How would an AI understand this?

Perhaps the AI's experience of playing chess, tic-tac-toe, or other two-sided zero-sum games will enable the AI to understand "opposition" - that everything the opponent desires is therefore undesirable to you, and that everything you desire is therefore undesirable to the opponent; that if your opponent has a subgoal, you should have a subgoal of blocking that subgoal's completion, and that if you have a valid subgoal, your opponent will have a subgoal of blocking your subgoal's completion.

Real life is not zero-sum, but the heuristics and predictive assumptions learned from dealing with zero-sum games may work to locally describe the relation between two social enemies. (Even the bitterest of real-life enemies will have certain goal states in common, e.g., nuclear war is bad; but this fact lies beyond the relevance horizon of most interactions.)

The real "Aha!" would be the insight that the attacking human and the AI could be in a relation analogous to players on opposing sides in a game of chess.  This is a very powerful and deeply fundamental analogy. As humans, we tend to take this perspective for granted; we were born with it. It is, in fact, a deep part of how we humans define the self.  It is part of how we define being a person, this cognitive assumption that you and I and everyone else are all nodes in a social network, players in a hugely multisided non-zero-sum game. For a human, myself is a great, embracing symbol that gathers in the-player-that-is-this-goal-system and the-part-of-reality-that-is-inside-this-mind and the-body-that-sees-and-moves-for-this-viewpoint.  For a human, these are all the same thing, part of what is meant by "I".

Even so, the concept of game theory is not sufficient to reinvent "retaliation"; it is simply a prerequisite. Understanding the Axelrod-and-Hamilton "Tit for Tat" strategy (17) is sufficient to suggest "Hey, maybe I should think about punching the attacker back!" but not sufficient to justify the suggestion, once made. (18). One question that has gone unasked throughout this entire analysis is "What is the utility of punching back?"  A human, in punching back, may or may not pause to consider whether it will bring any benefits; but, even so, we instinctively retaliate as a result of reflexes that were an advantage in the ancestral environment.

The evolutionary benefit of retaliation lies in the probable adjustment to the future behavior of others.  People - humans, anyway - are less likely to hit you if they think you're likely to hit back. Hitting back at an individual human may or may not work to prevent a recurrence, or to decrease the number of recurrences, but failure to hit back may result in more punches coming your way. Of course, this relies on the attackers having the kind of psychologies that take possible retaliation into account - an AI, to fully grok the reason for retaliation, would have to understand this aspect of human (evolutionary) psychology as well. The Tit for Tat strategy doesn't succeed in isolation; it only succeeds when some of the strategies it interacts with also take the behavior of the opponent/partner into account.
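The dependence on the attacker's psychology can be sketched with a toy repeated game. This is a minimal illustration, not Axelrod's tournament setup: the two attacker strategies, the move labels, and the "punches received" score are assumptions introduced here.

```python
C, D = "cooperate", "punch"

def tit_for_tat(mine, theirs):
    """Cooperate first, then copy the other player's previous move."""
    return theirs[-1] if theirs else C

def never_retaliate(mine, theirs):
    """Cooperate no matter what the other player does."""
    return C

def oblivious_attacker(mine, theirs):
    """Punches every round, ignoring how the other player responds."""
    return D

def deterrable_attacker(mine, theirs):
    """Punches until the other player has hit back once, then stops."""
    return C if D in theirs else D

def punches_received(victim, attacker, rounds=10):
    """Play the repeated game and count how often the attacker punched."""
    v_hist, a_hist = [], []
    for _ in range(rounds):
        v_move = victim(v_hist, a_hist)
        a_move = attacker(a_hist, v_hist)
        v_hist.append(v_move)
        a_hist.append(a_move)
    return a_hist.count(D)

if __name__ == "__main__":
    for victim in (tit_for_tat, never_retaliate):
        for attacker in (deterrable_attacker, oblivious_attacker):
            print(victim.__name__, attacker.__name__,
                  punches_received(victim, attacker))
```

In this toy setup, hitting back reduces the punches received only against the attacker whose behavior depends on the victim's response. Against the oblivious attacker, retaliating and not retaliating come out the same.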

In reality, for an AI, returning a punch is likely to be far more trouble than it's worth. There's that whole crowd of onlookers who go:  "Look!  It punched a human being!  Kill it!  Kill the AI!"  Of course, if the AI understands enough evolutionary psychology and game theory to employ Tit for Tat, but not enough human psychology to understand the concept of "double standard", ve may not realize how much trouble punching a human may land ver in. This is one of the reasons why a well-designed Friendly AI would have a precautionary injunction that says something along the lines of "Don't attack a human, even if it looks like a good idea at the time."  A Friendly AI would also regard damage to an unconsenting human as intrinsically undesirable under the Friendliness supergoal content. (19).

So what's the upshot?  Primarily, what I want to convey to you is the sense of a blank slate, of starting from scratch, of a vast silent space - for young AIs, at least. This blank slate doesn't mean that AIs, even young AIs, are necessarily slow. For an AI with thirty-two 2-GHz processors, the maximum speed on a reasonably linear chain of thought (one that doesn't require more than thirty-two parallel processing threads) is 2 billion ticks per second, as opposed to the 200 ticks per second of the hundred trillion synapses of the human brain. Perhaps an AI - on being physically punched for the very first time on vis very first encounter with a hostile human - could think through what ve remembers from surfing Usenet archives, and what ve read in newspapers, and books of fiction and nonfiction, and what ve knows about human evolutionary psychology, and conclude that returning the punch is the best course of action, and do it all in time to return the punch as fast as any instinct-driven human - assuming that 32 processors is enough speed to handle the complex motor actions required.

This isn't a literal possibility (20) unless the AI is transhuman, and a transhuman AI would be mature enough to know all about social networks; ve wouldn't need to start from scratch. The first time some young AI gets punched in the nose, ve's likely to react in one of the ways described earlier, or some other way just as surreal, if ve reacts at all. It'll take some extended conversations with the programmers about evolutionary psychology before the AI understands what's going on. But, the second time the AI gets punched in the nose, it shouldn't take any time at all - by human standards - to run through a chain of logic that's easy to reverify. It's inventing that takes massive computing power and human confirmation; retracing your own footprints is likely to be a fairly serial process that can be consummated in a tenth, a hundredth of a second. If re-spawning a child goal from a parent goal is a serial task, one that doesn't invoke any computationally intensive subprocesses, then the AI can retrace the path from supergoal Friendliness content to the correct course of action (retaliation, rebuking the attacker, whatever) in a human eyeblink.

It's easy enough to accept that a young AI is a vast silent space. For a human, blood races, adrenaline pumps, eyes narrow, hands form fists. For a human, counterattack is instinctive - the correct level of counterattack,  the socially acceptable level of counterattack; a damaging counterpunch rather than a lethal bullet. Our instincts are context-sensitive and hedged 'round by other instincts. Our instincts are complex functional adaptations with multiple moving parts, often hosted on a modular chunk of brainware. Complex functional adaptations don't just materialize spontaneously in source code, just as complex dishes like pizza don't suddenly start growing on palm trees. Thus, a young AI might choose to retaliate, but would certainly not feel the need to retaliate, as a human would. To a young AI, retaliation is not an instinct; retaliation is just another subgoal.

What about a more mature AI, especially one that can rewrite vis own source code?  Regardless of whether it would be a good idea, it would certainly be possible for a seed AI to create a reflex for instant retaliation.

There are several clear reasons why humans have evolved a retaliation instinct, rather than a retaliation logic. The primary reason is that a retaliation instinct is easier to evolve.  The retaliation instinct evolved long before general intelligence, so evolving a retaliation logic first would not just have been more difficult, but actually impossible. Also, evolution tends to arrive at procedural solutions rather than declarative solutions, because a component of a complex procedural solution can be functional in its own right.

If genes could, somehow, store declarative knowledge, the first piece of knowledge stored would be "Punching back is good," which is simpler than "Punching back is good because it decreases the chance of future punches," which is simpler than "Punching back decreases the chance of future punches by modifying others' behavior", which is simpler than "Punching back modifies others' behavior because, on seeing you punch back, they'll project an increased chance of you punching back if they punch you, which makes them less likely to punch you."  All of this is moot, since as far as I know, nobody has ever run across a case of genes storing abstract knowledge. (By this I mean knowledge stored in the same format used for episodic memories or declarative semantic knowledge.)

Abstract knowledge cannot evolve incrementally and therefore it does not evolve at all. This fact, by itself, is enough to completely explain away human use of retaliation instincts rather than retaliation logic, and we must go on to consider independently whether a retaliation instinct or a retaliation logic is more useful.  For humans, I think that a retaliation instinct is more useful, or at least it's more of an evolutionary advantage. Even if we had conscious control over our endocrine systems, so we could deliberately choose to pump adrenaline, we would still be shot down by the sheer human-slowness of abstract thought. We are massively parallel systems running at 200Hz. When you're massively parallel you can afford to precompute things, and when you run at 200Hz you must precompute things because everything has to be done in very few serial steps.

When you run at 2 billion ticks per second, the overhead of recreating and rechecking a few previously-thought-out child goals is comparatively trivial next to all the other actions those subgoals entail, including complex, creative, parallel-intensive tasks such as computing motor actions (if ve decides to punch back) or thinking up an effective opening line (if ve decides to rebuke the attacker). For a human, interposing any extra serial step in a process is a huge time burden, but massively parallel operations vanish into the maw of trillions and trillions of synapses. For an AI, an extra serial step performed at the start of a thought is nearly unnoticeable (21), but trying to imitate human parallelism would bring the AI to vis knees. (A later AI may have access to computing power rivalling human parallelism, but would presumably still have the faster serial speed.)
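The serial-speed argument comes down to simple arithmetic, sketched below using the two tick rates given in the text. The function names and the choice of time windows are illustrative assumptions, and a "tick" is treated as one strictly sequential step for both systems, which is a simplification.

```python
# Figures from the text: 2-GHz serial processors versus roughly 200 Hz synapses.
AI_TICKS_PER_SECOND = 2_000_000_000
HUMAN_TICKS_PER_SECOND = 200

def serial_steps(ticks_per_second, milliseconds):
    """Number of strictly sequential steps that fit in the given time window."""
    return ticks_per_second * milliseconds // 1000

def speed_ratio():
    """How many AI serial steps fit in the time of one human serial step."""
    return AI_TICKS_PER_SECOND // HUMAN_TICKS_PER_SECOND

if __name__ == "__main__":
    for ms in (100, 10):
        print(ms, "ms:",
              serial_steps(HUMAN_TICKS_PER_SECOND, ms), "human steps,",
              serial_steps(AI_TICKS_PER_SECOND, ms), "AI steps")
    print("ratio:", speed_ratio())
```

On these numbers a human has about twenty serial steps in a tenth of a second, so one extra step is a real cost. The AI has hundreds of millions, so one extra step is lost in the noise.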

The conclusion:  From an AI's perspective, a retaliation logic is probably considerably more effective than a retaliation instinct. The retaliation logic has a context-sensitivity that the retaliation instinct lacks. Suppose a young AI has some set of ultimate goals - traffic reduction, or hopefully Friendliness supergoals, or whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. A maturing AI would have the capability to replace the retaliation logic with a retaliation instinct, or with a retaliation imperative having supergoal status.  The instinct would be slightly faster - 0.01% faster, or perhaps 1% faster (eliminating a serial step at the start of the algorithm saves very little time, and most of the computational cost is computation-intensive motor logic or persuasive speech production). However, in doing so, the AI would lose a substantial amount of the context sensitivity of the retaliation logic - that is, from the perspective of the current set of supergoals, the supergoals that the AI uses to decide whether or not to implement the optimization.

Changing retaliation to an independent supergoal would affect, not just the AI's speed, but the AI's ultimate decisions. From the perspective of the current set of supergoals, this new set of decisions would be suboptimal. Suppose a young AI has some set of ultimate goals - traffic reduction, Friendliness, whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. The maturing AI then considers whether changing the logic to an independent supergoal or optimized instinct is a valid tradeoff. The benefit is shaving one millisecond off the time to initiate retaliation. The cost is that the altered AI will execute retaliation in certain contexts where the present AI would not come to that decision, perhaps at great cost to the present AI's supergoals (traffic reduction, Friendliness, etc). Since recreating the retaliation subgoal is a relatively minor computational cost, the AI will almost certainly choose to have retaliation remain strictly dependent on the supergoals.
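The difference between a retaliation logic and a retaliation instinct can be sketched as two decision functions. The contexts and their benefit and cost numbers below are invented purely for illustration; only the structure (one function consults the supergoals, the other does not) reflects the argument above.

```python
def retaliation_logic(context):
    """Retaliation as a subgoal: re-derived from the supergoals in each context."""
    return context["benefit_to_supergoals"] > context["cost_to_supergoals"]

def retaliation_instinct(context):
    """Retaliation as a reflex: fires on the trigger and ignores context."""
    return True

# Hypothetical contexts with made-up scores, from the present supergoals' view.
contexts = [
    {"name": "attacker deterred by a rebuke",
     "benefit_to_supergoals": 3, "cost_to_supergoals": 1},
    {"name": "onlookers applying a double standard",
     "benefit_to_supergoals": 1, "cost_to_supergoals": 10},
    {"name": "attacker who ignores retaliation",
     "benefit_to_supergoals": 0, "cost_to_supergoals": 1},
]

def divergent_contexts(contexts):
    """Contexts where the instinct acts but the supergoal-derived logic would not."""
    return [c["name"] for c in contexts
            if retaliation_instinct(c) and not retaliation_logic(c)]

if __name__ == "__main__":
    print(divergent_contexts(contexts))
```

Every context the function returns is a case where the altered AI would retaliate against the judgment of the present supergoals. That list is the cost side of the tradeoff, weighed against the one-millisecond saving.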

Why do I keep making this point, especially when I believe that a Friendly seed AI can and should live out vis entire lifecycle without ever retaliating against a single human being?  I'm trying to drive a stake through the heart of a certain conversation I keep having.
 

Somebody:   "But what happens if the AI decides to do [something only a human would want] ?"
Me:   "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code. I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody:   "But everyone needs to do [whatever] because [personal philosophy], so the AI will decide to do it as well."
Me:   "Yes, doing [whatever] is sometimes useful. But even if the AI decides to do [whatever] because it serves [Friendliness supergoal] under [contrived scenario], that's not the same as having an independent desire to do [whatever]."
Somebody:   "Yes, that's what I've been saying:  The AI will see that [whatever] is useful and decide to start doing it. So now we need to worry about [scenario in which <whatever> is catastrophically unFriendly]."
Me:   "But the AI won't have an independent desire to do [whatever]. The AI will only do [whatever] when it serves the supergoals. A Friendly AI would never do [whatever] if it stomps on the Friendliness supergoals."
Somebody:   "I don't understand. You've admitted that [whatever] is useful. Obviously, the AI will alter itself so it does [whatever] instinctively."
Me:   "The AI doesn't need to give verself an instinct in order to do [whatever]; if doing [whatever] really is useful, then the AI can see that and do [whatever] as a consequence of pre-existing supergoals, and only when [whatever] serves those supergoals."
Somebody:   "But an instinct is more efficient, so the AI will alter itself to do [whatever] automatically."
Me:   "Only for humans. For an AI, [complex explanation of the cognitive differences between having 32 2-gigahertz processors and 100 trillion 200-hertz synapses], so making [whatever] an independent supergoal would only be infinitesimally more efficient."
Somebody:   "Yes, but it is more efficient!  So the AI will do it."
Me:   "It's not more efficient from the perspective of a Friendly AI if it results in [something catastrophically unFriendly]. To the exact extent that an instinct is context-insensitive, which is what you're worried about, a Friendly AI won't think that making [whatever] context-insensitive, with [horrifying consequences], is worth the infinitesimal improvement in speed."

Retaliation was chosen as a sample target because it's easy to explain, easy to see as anthropomorphic, and a good stand-in for the general case. Though "retaliation" in particular has little or no relevance to Friendly AI - I wouldn't want any Friendly AI to start dabbling in retaliation, whether or not it looked like a good idea at the time - what has been said of "retaliation" is true for the general case. Indeed, this is one of the only reasons why Friendliness is possible at all; in particular:

2.2: Selfishness is an evolved trait

By "selfishness", I do not just mean the sordid selfishness of a human sacrificing the lives of twenty strangers to save his own skin, or something equally socially unacceptable. The entire concept of a goal system that centers around the observer is fundamentally anthropomorphic.

There is no reason why an evolved goal system would be anything but observer-focused. Since the days when we were competing chemical blobs, the primary focus of selection has been the individual (22). Even in cases where fitness or inclusive fitness is augmented by behaving nicely towards your children, your close relatives, or your reciprocal-altruism trade partners, the selection pressures are still spilling over onto your kin, your children, your partners. We started out as competing blobs in a sea, each blob with its own measure of fitness. We grew into competing players in a social network, each player with a different set of goals and subgoals, sometimes overlapping, sometimes not.

Though the goals share the same structure from human to human, they are written using the variable "I" that differs from human to human, and each individual substitutes in their own name. Every built-in instinct and emotion evolved around the fixed point at the center.

While discussing retaliation, I offered a scenario of a young AI being punched in the nose, and noted the additional mental effort it would take for the AI to realize that ve, "personally", was being targeted. The AI would have to imagine a completely different cognitive architecture before ve could comprehend what a human is thinking when he or she "personally targets" someone, and even so the AI verself will never feel "personally targeted". You can imagine yourself pointing a finger directly at some young AI and saying, "Look at that!"  And the AI spins around to look behind verself and says "Where?"

This metaphor - a being with a visuospatial model of the physical world that doesn't include vis own body, or at least, doesn't include vis own body as "anything worth noticing" - is analogous, not to the AI's physical model of the world, but to the AI's moral model of the world. A Friendly AI may be greatly concerned with the welfare of the surrounding humans, but if you ask ver "What about your own welfare?", ve'll say "The welfare of what?"  A young AI would, at any rate; an older AI would understand exactly what you meant, but wouldn't see the argument as any more intuitive or persuasive. A Friendly AI sees the nearby humans as moral nodes, but there's no node at the center - no node-that-is-this-node - and possibly even no center. If you, metaphorically, say "Look at that!", a young AI will say "Look at what?". An older AI will understand that you see a node, but that doesn't mean the AI will see a node.

As I pointed out in GISAI 2.4.4: The legitimate use of the word "I", an AI's model of reality will inevitably form categories and objects in the same place where a human keeps his or her "self". There shall develop heuristics and thoughts which branch on whether or not something is labeled as being part of the "AIself", and heuristics which only act on subcategories such as "causal analysis system" or "goal checking subsystem". The AIself will probably not be shaped quite like a human self; it will probably include one or two things that a human would exclude, and vice versa. Historically, only in the twentieth century did humans really begin to understand that the mind was not a unified object but rather a system with moving parts; chronologically, an AI is likely to notice properties of the causal analysis key-variable-selection subsystem before the AI notices the causal analysis system superobject, and the AI will notice the causal analysis system before the AI notices the "AIself" superobject. (Actually, an infant AI may start out with all of these objects and superobjects identified-in-advance by the programmers, but the details will still fill in from the bottom up rather than the top down.)  The AI will notice, understand, and eventually manipulate the pieces of vis self, and the whole - and be qualitatively different because of it, becoming able to legitimately use the word "I" - but the AI will still have an AI's self, not a human's self.

Within the goal system, a lot of subgoals - a lot of the events in the causal chains that lead to the supergoal of being Friendly to the surrounding humans - are likely to thread through subobjects of the AI's self: increasing the efficiency of some piece of code, improving on some heuristic, and so on. But just because components of the AI's self are useful doesn't mean that the AI's self becomes a moral node; my computer is very useful and many of my subgoals thread through my computer, but I don't class my computer as having independent supergoal status.
 

The lack of an observer-biased ("selfish") goal system is perhaps the single most fundamental difference between an evolved human and a Friendly AI. This difference is the foundation stone upon which Friendly AI is built. It is the key factor missing from the existing, anthropomorphic science-fictional literature about AIs. To suppress an evolved mind's existing selfishness, to keep a selfish mind enslaved, would be untenable - especially when dealing with a self-modifying or transhuman mind!  But an observer-centered goal system is something that's added, not something that's taken away. We have observer-centered goal systems because of externally imposed observer-centered selection pressures, not because of any inherent recursivity. If the observer-centered effect were due to inherent recursivity, then an AI's goal system would start valuing the "goal system" subobject, not the AI-as-a-whole!  A human goal system doesn't value itself, it values the whole human, because the human is the reproductive unit and therefore the focus of selection pressures.

The epic human struggle to choose between selfishness and altruism is the focus of many personal philosophies, and I have thus observed that this point about AIs is one of the hardest ones for people to accept. An AI may look more like an altruistic human than a selfish one, but an AI isn't selfish or altruistic; an AI is an AI. An AI is not a human who has selflessly renounced personal interests in favor of the community; an AI is not a human with the value of the node-that-is-this-node set to zero; an AI is a mind that just cares about other things, not because the "selfish" part has been ripped out or brainwashed or suppressed, but because the AI doesn't have anything there.  An observer-centered goal system is something that's added to a mind, not something that's taken away. The next few subsections deal with some frequently raised topics surrounding this point.

2.2.1: Pain and pleasure

Imagine, for a moment, that you walk up and punch a seed AI in the nose. Does the AI experience pain when the punch lands?

What is "pain"?  What is the evolutionary utility of pain?  In its most basic form, pain appears as internal, cognitive negative feedback. If an internal cognitive event causes negative consequences in external reality, negative feedback decreases the probability of that internal cognitive event recurring, and thereby decreases the probability of the negative consequences in external reality recurring. Pain - cognitive negative feedback of any kind - needs somewhere to focus to be useful. Negative feedback needs an internal place to focus, since cognitive feedback cannot reprogram external reality.

In humans, of course, there's more to pain than negative feedback; human pain also acts as a damage signal, and shifts focus of attention from whatever we were previously thinking about, and makes us start thinking about ways to make the pain go away. (All of that functionality attached to a single system bus!  Evolution has a tendency to overload existing functions.)  The human cognitive architecture is such that pain can be present even in the absence of a useful focus for the negative-feedback aspect of pain. A human can even be driven insane by continued pain, with no escape route (nowhere for the cognitive negative feedback to focus). The capacity to be driven insane by continued pain seems nonadaptive - but then, in the ancestral environment, people damaged enough to experience extended unbearable pain probably died soon in any case, and the sanity or insanity of their final moments had little bearing on reproductive history. (23).

Neither pain nor pleasure, as a design feature, is inherently necessary to the functionality of negative or positive feedback. Given the supergoal of being Friendly - or, for that matter, the goal of walking across the room - negative feedback can be consciously implemented as a subgoal. For example, if an AI has the goal of walking across the room, and the AI gets distracted and trips over a banana peel, the AI can reason:  "The event of my being distracted caused me to place my foot on a banana peel, delaying my arrival at the end of the room, which interferes with [whatever the parent goal was], and this causal chain may recur in some form. Therefore I will apply positive feedback (increase the priority of, increase the likelihood of invocation, et cetera) to the various subheuristics that were suggesting I look at the floor, and which I ignored, and I will apply negative feedback (decrease the priority of, et cetera) to the various subheuristics that gained control of my focus of attention and directed it toward the distractor."  If the AI broke a toe while falling, the AI can reason:  "If I place additional stress on the fracture, it will become worse and decrease my ability to traverse additional rooms, which is necessary to serve [parent goal]; therefore I will walk in such a way as to not place additional stress on the fracture, and I will have the problem repaired as soon as possible."  That is, conscious reasoning can replace the "damage signal" aspect of pain. If the AI successfully solves a problem, the AI can choose to increase the priority of, or devote additional computational power to, whichever subheuristics or internal cognitive events were most useful in solving the problem, replacing the positive-feedback aspect of pleasure.
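
The banana-peel reasoning above can be sketched as a toy program. This is only an illustration under loud assumptions: the `Subheuristic` class, the `apply_deliberate_feedback` function, the heuristic names, and the step size are all hypothetical, and nothing here is a proposed architecture. The point is just that the priority adjustment is an explicit action taken in service of a goal, rather than an autonomic reflex.

```python
from dataclasses import dataclass


@dataclass
class Subheuristic:
    """A hypothetical named heuristic with an adjustable priority."""
    name: str
    priority: float = 1.0


def apply_deliberate_feedback(heuristics, helpful, harmful, step=0.25):
    """Raise the priority of heuristics judged helpful and lower the
    priority of heuristics judged harmful.

    `helpful` and `harmful` are sets of names picked out by the AI's own
    causal analysis of the event; `step` is an arbitrary illustrative size.
    """
    for h in heuristics:
        if h.name in helpful:
            h.priority += step
        elif h.name in harmful:
            h.priority = max(0.0, h.priority - step)
    return heuristics


heuristics = [
    Subheuristic("watch_the_floor"),
    Subheuristic("attend_to_distractor"),
    Subheuristic("plan_route"),
]

# After tripping: the ignored floor-watching heuristic is reinforced, and the
# heuristic that pulled attention toward the distractor is weakened.
apply_deliberate_feedback(
    heuristics,
    helpful={"watch_the_floor"},
    harmful={"attend_to_distractor"},
)
priorities = {h.name: h.priority for h in heuristics}
```

Heuristics that played no part in the event ("plan_route" here) are left untouched, which is the kind of selectivity a deliberate system can exercise.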

There are tricks that can be pulled using "deliberate feedback" that, as far as I know, the human architecture has never even touched. For example, the AI - on successfully solving a problem - can spend time thinking about how to improve, not just whichever subsystems helped solve the problem, but those particular successful subsystems that would have benefited the most (in retrospect) from a bit of improvement, or even those failed subsystems that almost made it. There are subtleties to negative and positive feedback that the hamfisted human architecture completely ignores; an autonomic system doesn't have the flexibility of a learning intelligence.

Finally, even in the total absence of the reflectivity necessary for deliberate feedback, a huge chunk of the functionality of pleasure and pain falls directly out of a causal goal system plus the Bayesian Probability Theorem. See 3.1.4: Bayesian reinforcement.
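
As a rough illustration of how feedback-like behavior can fall out of probability updating alone, consider the sketch below. It is not the construction given in 3.1.4; the two-hypothesis setup, the function names, and every number are assumptions chosen for the example. The action's desirability derives entirely from the estimated probability that it leads to the parent goal, so an observed failure lowers that estimate and the action becomes less attractive without any separate "pain" mechanism.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | evidence) using Bayes' theorem for a binary hypothesis."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / denominator


def expected_success(p_reliable, rate_if_reliable=0.9, rate_if_unreliable=0.2):
    """Probability that the action achieves the parent goal on the next try."""
    return p_reliable * rate_if_reliable + (1.0 - p_reliable) * rate_if_unreliable


# H: "this action is a reliable way to achieve the parent goal."
p_reliable = 0.5
before = expected_success(p_reliable)

# The action is tried and fails.
# Assumed likelihoods: P(failure | H) = 0.1, P(failure | not H) = 0.8.
p_reliable = bayes_update(p_reliable, likelihood_if_true=0.1, likelihood_if_false=0.8)
after = expected_success(p_reliable)
```

With these illustrative numbers, the estimated chance of success drops after the failure, so a goal system that ranks actions by that estimate will choose the action less often.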

Evolution does not create those systems which are most adaptive; evolution creates those systems which are most adaptive and most evolvable. Until the rise of human general intelligence, a deliberately directed feedback system would have been impossible. By the time human general intelligence arose, a full-featured autonomic system was already in place, and replacing it would have required a complete architectural workover - something that evolution does over the course of eons (when it happens at all) due to the number of simultaneous mutations that would be required for a fast transition. The human cognitive architecture is a huge store of features designed to operate in the absence of general intelligence, with general intelligence layered on top. Human general intelligence is crudely interfaced to all the pre-existing features that evolved in the absence of general intelligence.

An autonomic negative-feedback system is enormously adaptive if you're an unintelligent organism that previously possessed no feedback mechanism whatsoever. An autonomic negative-feedback system is not a design improvement if you're a general intelligence with a pre-existing motive to implement a deliberate feedback system.

Why is this relevant to Friendly AI?  One of the oft-raised objections to the workability of Friendly AI goes something like:  "Any superintelligence, whether human-born or AI-born, will maximize its own pleasure and minimize its own pain; that is the only rational thing to do."  Pleasure and pain are two of the several features of human cognition that have "supergoal nature", the appearance of uber-goal or ur-goal quality. The reasoning seems to go something like this:  "Pleasure and pain are the ultimate supergoals of the human cognitive architecture, with all other actions being taken to seek pleasure or avoid pain; pleasure and pain are necessary design features of minds in general; therefore, all AIs and all sufficiently intelligent humans will be totally selfish."  Actually, the factor that has supergoal-nature in our mind is the anticipation of pain or the anticipation of pleasure; by the time the actual pain or pleasure arrives, it's too late to affect the already-made decision, although the next decision is often affected.

2.2.1.1: FoF:  Wireheading 1

NOTE: "FoF" stands for "Failure of Friendliness".

There's a stronger version of the above objection which holds that pain and pleasure are by definition part of any goal system, a much more subtle fallacy. In essence, the argument is that any indicator of success is the de facto supergoal of the system; that whichever piece of code says, "Aha, a goal has been achieved!" will become more important to the system than the goal itself. The wirehead fallacy is amazingly tangled, so let me try and dissect the individual problems with it:

One, the "piece of code" that notices when a goal has been achieved doesn't have to be an autonomic system, as it is in humans; it can be a conscious thought.

Two, if every indicator of success became inevitably confused with success itself, wouldn't the indicator of the indicator of success become confused with the indicator?  If - this seems to be what the objectors are imagining - the mind has a "pleasure counter" indicating how many goals ve has achieved, and the mind then decides that vis "real goal" is to increase this counter without limit, gobbling whole solar systems to store larger and larger bignums, then wouldn't the mind go on to decide that vis real purpose is to think that ve's storing larger and larger bignums?  In other words, replace the "pleasure counter" itself with the mind's beliefs about the pleasure counter?  Where does this all end?

The final problem is the most subtle, having to do with the AI's beliefs about the design purpose of vis own goal system.  If the AI views the goal system as a subgoal of the goals themselves, then a wirehead short-circuit is a visible design flaw, which would prevent the achievement of those final goals. Remember where I said that the anticipation of pleasure is what influences our decisions?  The AI, visualizing a future in which ve has huge amounts of pleasure due to a breakdown of the goal system, says, not "Oh boy!", but "Uh oh."  The AI, in thinking about which future ve wants to be in, checks to what degree vis own supergoals have been fulfilled, not to what degree the future AI's supergoals will have been fulfilled. (24). (The Friendship architecture proposed in CFAI would not involve a complete breakdown of identification with the future self, but it wouldn't be the kind of identification humans employ, and a present-day AI using that architecture would definitely not identify with a wireheaded future self (hence the forward footnote). See 3.3.5: FoF:  Wireheading 2 and 3.4.3.3: The rule of derivative validity.)
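
The distinction in this last point can be made concrete with a toy comparison. Everything in the sketch is hypothetical: the `Future` record, the `people_helped` stand-in for supergoal content, and the two scoring functions are illustrative names and not part of any architecture described in CFAI. One function scores a predicted future by the AI's present supergoals; the other scores it by what the future self's success indicator would read, which is the wirehead fallacy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Future:
    """A hypothetical predicted future."""
    name: str
    people_helped: int        # stand-in for what the current supergoal is about
    success_indicator: float  # what the future AI's internal counter would read


def current_supergoal_fulfillment(future):
    """Score a future by the AI's present supergoal content."""
    return float(future.people_helped)


def wirehead_score(future):
    """Score a future by the future self's success indicator (the fallacy)."""
    return future.success_indicator


futures = [
    Future("keep helping", people_helped=100, success_indicator=100.0),
    Future("short-circuit the counter", people_helped=0, success_indicator=1e30),
]

# Evaluated by present supergoals, the short-circuit is a visible failure.
chosen = max(futures, key=current_supergoal_fulfillment)

# Evaluated by the indicator alone, the short-circuit looks optimal.
fallacy_choice = max(futures, key=wirehead_score)
```

Under the first scoring rule the enormous counter value contributes nothing, which corresponds to the "Uh oh" reaction described above.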

2.2.2: Anthropomorphic capitalism

In human society, capitalist civilizations are overwhelmingly more effective than communist civilizations. There is a hallowed dualism separating individualism and authoritarianism; self-organization and central command; free trade and government control. This has led some thinkers to postulate that a community of AIs with divergent, observer-centered goals would outcompete a community of Friendly AIs with shared goals.

In the human case, both capitalist and authoritarian societies are composed of humans with divergent, observer-centered goals. Capitalist societies admit this, and authoritarian societies don't, so at least some of the relative inefficiency of authoritarian societies will stem from the enormous clash between the values people are "supposed" to have and the values people actually do have. The claim of "capitalist AI" goes beyond this, however, to the idea that capitalist societies are intrinsically more efficient. For example, a society of AIs competing for resources would tend to divert more resources to the most efficient competitors, thus increasing the total efficiency, while - this seems to be the scenario implied - a group of Friendly AIs would share resources equally, for the common good...

Whoa!  Time out!  Non sequit