Creating Friendly AI is ©2001 by Singularity Institute for Artificial Intelligence, Inc.  All rights reserved.



1: Challenges of Friendly AI

The term "Friendly AI" refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.  This refers, not to AIs that have advanced just that far and no further, but to all AIs that have advanced to that point and beyond - perhaps far beyond.  Because of self-improvement, recursive self-enhancement, the ability to add hardware computing power, the faster clock speed of transistors relative to neurons, and other reasons, it is possible that AIs will improve enormously past the human level, and very quickly by the standards of human timescales.  The challenges of Friendly AI must be seen against that background.  Friendly AI is constrained not to use solutions which rely on the AI having limited intelligence or believing false information, because, although such solutions might function very well in the short term, they will fail utterly in the long term.  Similarly, it is "conservative" (see 1.2 below) to assume that AIs cannot be forcibly constrained.

Success in Friendly AI can have positive consequences that are arbitrarily large, depending on how powerful a Friendly AI is.  Failure in Friendly AI has negative consequences that are also arbitrarily large.  The farther into the future you look, the larger the consequences (both positive and negative) become.  What is at stake in Friendly AI is, simply, the future of humanity.  (For more on that topic, please see the Singularity Institute main site or 4: Policy implications.)

1.1: Envisioning perfection

In the beginning of the design process, before you know for certain what's "impossible", or what tradeoffs you may be forced to make, you are sometimes granted the opportunity to envision perfection.  What is a perfect piece of software?  A perfect piece of software can be implemented using twenty lines of code, can run in better-than-realtime on an unreliable 286, and will fit in 4K of RAM.  Perfect software is perfectly reliable, and can be definitely known by the system designers to be perfectly reliable for reasons which can easily be explained to non-programmers.  Perfect software is easy for a programmer to improve and impossible for a programmer to break.  Perfect software has a user interface that is both telepathic and precognitive.

But what does a perfect Friendly AI do?  The term "Friendly AI" is not intended to imply a particular internal solution, such as duplicating the human friendship instincts, but rather a set of external behaviors that a human would roughly call "friendly".  Which external behaviors are "Friendly" - either sufficiently Friendly, or maximally Friendly?

Ask twenty different futurists, get twenty different answers - created by twenty different visualizations of AIs and the futures in which they inhere.  There are some universals, however; an AI that behaves like an Evil Hollywood AI - "agents" in The Matrix; Skynet in Terminator 2 - is obviously unFriendly.  Most scenarios in which an AI kills a human would be defined as unFriendly, although - with AIs, as with humans - there may be extenuating circumstances.  (Is a doctor unfriendly if he lethally injects a terminally ill patient who explicitly and with informed consent requests death?)  There is a strong instinctive appeal to the idea of Asimov Laws, that "no AI should ever be allowed to kill any human under any circumstances", on the theory that writing a "loophole" creates a chance of that loophole being used inappropriately - the Devil's Contract problem.  I will later argue that the Devil's Contract scenarios are mostly anthropomorphic.  Regardless, we are now discussing perfectly Friendly behavior, rather than asking whether trying to implement perfectly Friendly behavior in one scenario would create problems in other scenarios.  That would be a tradeoff, and we aren't supposed to be discussing tradeoffs yet.

Different futurists see AIs acting in different situations.  The person who visualizes a human-equivalent AI running a city's traffic system is likely to give different sample scenarios for "Friendliness" than the person who visualizes a superintelligent AI acting as an "operating system" for all the matter in an entire solar system.  Since we're discussing a perfectly Friendly AI, we can eliminate some of this futurological disagreement by specifying that a perfectly Friendly AI should, when asked to become a traffic controller, carry out the actions that are perfectly Friendly for a traffic controller.  The same perfect AI, when asked to become the operating system of a solar system, should then carry out the actions that are perfectly Friendly for a system OS.  (Humans can adapt to changing environments; likewise, hopefully, an AI that has advanced to the point of making real-world plans.)

We can further clean up the "twenty futurists, twenty scenarios" problem by making the "perfectly Friendly" scenario dependent on factual tests, in addition to futurological context.  It's difficult to come up with a clean illustration, since I can't think of any interesting issue that has been argued entirely in utilitarian terms.  If you'll imagine a planet where "which side of the road you should drive on" is a violently political issue, with Dexters and Sinisters fighting it out in the legislature, then it's easy to imagine futurists disagreeing on whether a Friendly traffic-control AI would direct cars to the right side or left side of the road.  Ultimately, however, both the Dexter and Sinister ideologies ground in the wish to minimize the number of traffic accidents, and, behind that, the valuation of human life.  The Dexter position is the result of the wish to minimize traffic accidents plus the belief, the testable hypothesis, that driving on the right minimizes traffic accidents.  The Sinister position is the wish to minimize traffic accidents, plus the belief that driving on the left minimizes traffic accidents.

If we really lived in the Driver world, then we wouldn't believe the issue to be so clean; we would call it a moral issue, rather than a utilitarian one, and pick sides based on the traditional allegiance of our own faction, as well as our traffic-safety beliefs.  But, having grown up in this world, we would say that the Driverfolk are simply dragging in extraneous issues.  We would have no objection to the statement that a perfectly Friendly traffic controller minimizes traffic accidents.  We would say that the perfectly Friendly action is to direct cars to the right - if that is what, factually, minimizes accidents.  Or that the perfectly Friendly action is to direct cars to the left, if that is what minimizes accidents.
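
The structure of that argument - a shared goal plus a testable factual belief - can be put in a few lines.  What follows is only a minimal sketch; the function name and the accident figures are invented for illustration and are not part of any actual design.  The point is that the "Dexter" and "Sinister" answers fall out of the same goal once the factual question is settled.

```python
from typing import Mapping


def friendly_driving_side(accidents_per_year: Mapping[str, float]) -> str:
    """Return the side of the road with the fewest measured accidents.

    The goal (minimize accidents) is fixed; only the factual input varies.
    """
    return min(accidents_per_year, key=accidents_per_year.get)


# Hypothetical measurements; the numbers are made up for this example.
dexter_world = {"left": 120.0, "right": 95.0}
sinister_world = {"left": 80.0, "right": 95.0}

print(friendly_driving_side(dexter_world))    # right
print(friendly_driving_side(sinister_world))  # left
```

The same function yields opposite directives in the two worlds, which is all the conditional in the text claims.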

All these conditionals - that the perfectly Friendly action is this in one future, this in another; this given one factual answer, this given another - would certainly appear to take more than twenty lines of code.  We must therefore add in another statement about the perfectly minimal development resources needed for perfect software:  A perfectly Friendly AI does not need to be explicitly told what to do in every possible situation.  (This is, in fact, a design requirement of actual Friendly AI - a requirement of intelligence in general, almost by definition - and not just a design requirement of perfectly Friendly AI.)

And for the strictly formal futurist, that may be the end of perfectly Friendly AI.  For the philosopher, "truly perfect Friendly AI" may go beyond conformance to some predetermined framework.  In the course of growing up into our personal philosophies, we choose between moralities.  As children, we have simple philosophical heuristics that we use to choose between moral beliefs, and later, to choose between additional, more complex philosophical heuristics.  We gravitate, first unthinkingly and later consciously, towards characteristics such as consistency, observer symmetry, lack of obvious bias, correctness in factual assertions, "rationality" however defined, nonuse of circular logic, and so on.  A perfect Friendly AI will perform the Friendly action even if one programmer gets "the Friendly action" wrong; a truly perfect Friendly AI will perform the Friendly action even if all programmers get the Friendly action wrong.

If a later researcher writes the document Creating Friendlier AI, which has not only a superior design but an utterly different underlying philosophy - so that Creating Friendlier AI, in retrospect, is the way we should have approached the problem all along - then a truly perfect Friendly AI will be smart enough to self-redesign along the lines in Creating Friendlier AI.  A truly perfect Friendly AI has sufficient "strength of philosophical personality" - while still matching the intuitive aspects of friendliness, such as not killing off humans and so on - that we are more inclined to trust the philosophy of the Friendly AI, than the philosophy of the original programmers.

Again, I emphasize that we are speaking of perfection and are not supposed to be considering design tradeoffs, such as whether sensitivity to philosophical context makes the morality itself more fragile.  A perfect Friendly AI creates zero risk and causes no anxiety in the programmers (1).  A truly perfect Friendly AI also eliminates any anxiety about the possibility that Friendliness has been defined incorrectly, or that what's needed isn't "Friendliness" at all - without, of course, creating other anxieties in the process.  Individual humans can visualize the possibility of a catastrophically unexpected unknown remaking their philosophies.  A truly perfect Friendly AI makes the commonsense-friendly decision in this case as well, rather than blindly following a definition that has outlived the intent of the programmers.  Not just a "truly perfect", but a real Friendly AI as well, should be sensitive to programmers' intent - including intentions about programmer-independence, and intentions about which intentions are important.

Aside from a few commonsense comments about Friendliness - for example, Evil Hollywood AIs are unFriendly - I still have not answered the question of what constitutes Friendly behavior.  One of the snap summaries I usually offer has, as a component, "the elimination of involuntary pain, death, coercion, and stupidity", but that summary is intended to make sense to my fellow humans, not to a proto-AI.  More concrete imagery will follow.

We now depart from the realms of perfection.  Nonetheless, I would caution my readers against giving up hope too early when it comes to having their cake and eating it too - at least when it comes to ultimate results, rather than interim methods.  A skeptic, arguing against some particular one-paragraph definition of Friendliness, may raise Devil's Contract scenarios in which an AI asked to solve the Riemann Hypothesis converts the entire Solar System into computing substrate, exterminating humanity along the way.  Yet the emotional impact of this argument rests on the fact that everyone in the audience, including the skeptic, knows that this is actually unfriendly behavior.  You and I have internal cognitive complexity that we use to make judgement calls about Friendliness.  If an AI can be constructed which fully understands that complexity, there may be no need for design compromises.

1.2: Assumptions "conservative" for Friendly AI

The conservative assumption according to futurism is not necessarily the "conservative" assumption in Friendly AI.  Often, the two are diametric opposites.  When building a toll bridge, the conservative revenue assumption is that half as many people will drive through as expected.  The conservative engineering assumption is that ten times as many people as expected will drive over, and that most of them will be driving fifteen-ton trucks.
 

Conservative assumptions:

In futurism: Self-enhancement is slow, and requires human assistance or real-world operations.
In Friendly AI: Changes of cognitive architecture are rapid and self-directed; we cannot assume human input or real-world experience during changes.

In futurism: Near human-equivalent intelligence is required to reach the "takeoff point" for self-enhancement.
In Friendly AI: Open-ended buildup of complexity can be initiated by self-modifying systems without general intelligence.

In futurism: Slow takeoff; months or years to transhumanity.
In Friendly AI: Hard takeoff; weeks or hours to superintelligence.

In futurism: Friendliness must be preserved through minor changes in "smartness" / worldview / cognitive architecture / philosophy.
In Friendly AI: Friendliness must be preserved through drastic changes in "smartness" / worldview / cognitive architecture / philosophy.

In futurism: Artificial minds function within the context of the world economy and the existing balance of power; an AI must cooperate with humans to succeed and survive, regardless of supergoals.
In Friendly AI: An artificial mind possesses independent strong nanotechnology, resulting in a drastic power imbalance.  Game-theoretical considerations cannot be assumed to apply.

In futurism: AI is vulnerable - someone can always pull the plug on the first version if something goes wrong.
In Friendly AI: "Get it right the first time":  Zero nonrecoverable errors necessary in first version to reach transhumanity.

Given a choice between discussing a human-dependent traffic-control AI and discussing an AI with independent strong nanotechnology, we should be biased towards assuming the more powerful and independent AI.  An AI that remains Friendly when armed with strong nanotechnology is likely to be Friendly if placed in charge of traffic control, but perhaps not the other way around.  (A minivan can drive over a bridge designed for armor-plated tanks, but not vice-versa.)

In addition to engineering conservatism, the nonconservative futurological scenarios are played for much higher stakes.  A strong-nanotechnology AI has the power to affect billions of lives and humanity's entire future.  A traffic-control AI is being entrusted "only" with the lives of a few million drivers and pedestrians.  A strictly arithmetical utilitarian calculation would show that a mere 0.1% chance of the transhuman-AI scenario should weigh equally in our futuristic calculations with a 100% chance of a traffic-control scenario.  I am not a strictly arithmetical utilitarian, but I do think the quantitative calculation makes a valid qualitative point - deciding which scenarios to prepare for should take into account the relative stakes and not just the relative probabilities.
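
The arithmetic behind the 0.1% figure can be spelled out.  This is a sketch under stated assumptions: "a few million" is taken as 6 million and "billions" as 6 billion, purely so that the ratio of stakes is 1000 to 1; the function names are likewise illustrative.

```python
def expected_stakes(probability: float, lives_at_stake: float) -> float:
    """Probability-weighted number of lives riding on a scenario."""
    return probability * lives_at_stake


def break_even_probability(reference_probability: float,
                           reference_lives: float,
                           larger_lives: float) -> float:
    """Probability at which the larger scenario matches the reference one."""
    return reference_probability * reference_lives / larger_lives


# Assumed round numbers, chosen only to give a 1000:1 ratio of stakes.
TRAFFIC_LIVES = 6_000_000
TRANSHUMAN_LIVES = 6_000_000_000

p = break_even_probability(1.0, TRAFFIC_LIVES, TRANSHUMAN_LIVES)
print(p)  # 0.001, i.e. 0.1%
```

With different estimates of the stakes the break-even probability shifts, but the qualitative point survives: the weight of a scenario is its probability times what is at risk.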
 

Additional assumptions:

Nonconservative for Friendly AI: Reliable hardware and software.
Conservative for Friendly AI: Error-prone hardware or buggy software.

Nonconservative for Friendly AI: Serial hardware or symmetric multiprocessing.
Conservative for Friendly AI: Asymmetric parallelism, field-programmable gate arrays, Internet-distributed untrusted hardware.

Nonconservative for Friendly AI: Human-observable cognition; AI can be definitely known to be Friendly.
Conservative for Friendly AI: Opaque cognition; the AI would probably succeed in hiding unFriendly cognition if it tried (2).

Nonconservative for Friendly AI: Persistent training; mental inertia; self-opaque neural nets.  The AI does not have the programmatic skill to fully rewrite the goal system or resist modification; programmers can make procedural changes without declarative justification.
Conservative for Friendly AI: The AI understands its own goal system and can perform arbitrary manipulations; alterations to the goal system must be reflected in the AI's beliefs about the goal system in order for the alterations to persist through rounds of self-improvement.

Nonconservative for Friendly AI: Monolithic, singleton AI.
Conservative for Friendly AI: Multiple, diverse AIs, with diverse goal systems, possibly with society or even evolution.

Nonconservative for Friendly AI: Given diverse AIs:  A major unFriendly action would require a majority vote of the AI population.
Conservative for Friendly AI: Given diverse AIs:  One unFriendly AI, possibly among millions, can severely damage humanity.

Nonconservative for Friendly AI: The programmers have completely understood the challenge of Friendly AI.
Conservative for Friendly AI: The programmers make fundamental philosophical errors.

It is always possible to make engineering assumptions so conservative that the problem becomes impossible.  If the initial system that undergoes the takeoff to transhumanity is sufficiently stupid, then I'm not sure that any amount of programming or training could create cognitive structures that would persist into transhumanity (3).  Similarly, there have been proposals to develop diverse populations of AIs that would have social interactions and undergo evolution; regardless of whether this is the most efficient method to develop AI (4), I think it would make Friendliness substantially more difficult.

Nonetheless, there should still be a place in our hearts for overdesign, especially when it costs very little.  I think that AI will be developed on symmetric-multiprocessing hardware, at least initially.  Even so, I would regard as entirely fair the requirement that the Friendliness methodology - if not the specific code at any given moment - work for asymmetric parallel FPGAs prone to radiation errors.  A self-modifying Friendly AI should be able to translate itself onto asymmetric error-prone hardware without compromising Friendliness.  Friendliness should be strong enough to survive radiation bitflips, incompletely propagated changes, and any number of programming errors.  If Friendliness isn't that strong, then Friendliness is probably too fragile to survive changes of cognitive architecture.  Furthermore, I don't think it will be that hard to make Friendliness tolerant of programmatic flak - given a self-modifying AI to write the code.  (It may prove difficult for prehuman AI.)

My advice:  "Don't give up hope too soon when it comes to designing for 'conservative' assumptions - it may not cost as much as you expect."

When it comes to Friendliness, our method should be, not just to solve the problem, but to oversolve it.  We should hope to look back in retrospect and say:  "We won this cleanly, easily, and with plenty of safety margin."  The creation of Friendly AI may be a great moment in human history, but it's not a drama.  It's only in Hollywood that the explosive device can be disarmed with three seconds left on the timer.  The future always has one surprise you didn't anticipate; if you expect to win by the skin of your teeth, you probably won't win at all.

1.3: Seed AI and the Singularity

Concrete imagery about Friendliness often requires a concrete futuristic context.  I should begin by saying that I visualize an extremely powerful AI produced by an ultrarapid takeoff, not just because it's the conservative assumption or the highest-stakes outcome, but because I think it's actually the most likely scenario.  See General Intelligence and Seed AI and GISAI 1.1: Seed AI, or the introductory article "What is Seed AI?"

Because of the dynamics of recursive self-enhancement, the scenario I treat as "default" is a singular "seed" AI, designed for self-improvement, that becomes superintelligent, and reaches extreme heights of technology - including nanotechnology - in the minimum-time material trajectory.  Under this scenario, the first self-modifying transhuman AI will have, at least in potential, nearly absolute physical power over our world.  The potential existence of this absolute power is unavoidable; it's a direct consequence of the maximum potential speed of self-improvement.

The question then becomes to what extent a Friendly AI would choose to realize this potential, for how long, and why.  At the end of GISAI 1.1: Seed AI, it says:

    "The ultimate purpose of transhuman AI is to create a Transition Guide; an entity that can safely develop nanotechnology and any subsequent ultratechnologies that may be possible, use transhuman Friendliness to see what comes next, and use those ultratechnologies to see humanity safely through to whatever life is like on the other side of the Singularity."

Some people assert that no really Friendly AI would choose to acquire that level of physical power, even temporarily - or even assert that a Friendly AI would never decide to acquire significantly more power than nearby entities.  I think this assertion results from equating the possession of absolute physical power with the exercise of absolute social power in a pattern following a humanlike dictatorship; the latter, at least, is definitely unFriendly, but it does not follow from the former.  Logically, an entity might possess absolute physical power and yet refuse to exercise it in any way, in which case the entity would be effectively nonexistent to us.  More practically, an entity might possess unlimited power but still not exercise it in any way we would find obnoxious.

Among humans, the only practical way to maximize actual freedom (the percentage of actions executed without interference) is to ensure that no human entity has the ability to interfere with you - a consequence of humans having an innate, evolved tendency to abuse power.  Thus, a lot of our ethical guidelines (especially the ones we've come up with in the twentieth century) state that it's wrong to acquire too much power.

If this is one of those things that simply doesn't apply in the spaces beyond the Singularity - if, having no evolved tendency to abuse power, no injunction against the accumulation of power is necessary - one of the possible resolutions of the Singularity would be the Sysop Scenario.  The initial seed-AI-turned-Friendly-superintelligence, the Transition Guide, would create (or self-modify into) a superintelligence that would act as the underlying operating system for all the matter in human space - a Sysop.  A Sysop is something between your friendly local wish-granting genie, and a law of physics, if the laws of physics could be modified so that nonconsensually violating someone else's memory partition (living space) was as prohibited as violating conservation of momentum.  Without explicit permission, it would be impossible to kill someone, or harm them, or alter them; the Sysop API would not permit it - while still allowing total local freedom, of course.
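
To make the "operating system" analogy slightly more concrete, here is a toy sketch of a consent-gated interface.  Everything in it is hypothetical: the class, the method names, and the idea of representing a living space as a list of changes are assumptions made for illustration, since no actual Sysop API is specified.  It shows only the rule stated above: an owner may always act on their own partition, and anyone else needs that owner's explicit permission.

```python
class Sysop:
    """Toy consent gate; every name here is hypothetical."""

    def __init__(self):
        # owner -> set of actors the owner has explicitly authorized
        self._consent = {}
        # owner -> list of changes applied to that owner's partition
        self._partitions = {}

    def grant(self, owner, actor):
        """Record that `owner` explicitly permits `actor` to act on them."""
        self._consent.setdefault(owner, set()).add(actor)

    def is_permitted(self, actor, owner):
        # Total local freedom: owners may always act on their own partition.
        return actor == owner or actor in self._consent.get(owner, set())

    def alter(self, actor, owner, change):
        """Apply `change` to `owner`'s partition, or refuse outright."""
        if not self.is_permitted(actor, owner):
            raise PermissionError(f"{actor} lacks consent from {owner}")
        self._partitions.setdefault(owner, []).append(change)

    def partition(self, owner):
        """Return a copy of the changes applied to `owner`'s partition."""
        return list(self._partitions.get(owner, []))
```

In this sketch a nonconsensual call does not fail "by policy" after the fact; it simply never takes effect, which is the sense in which the text compares the Sysop to a law of physics.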

The pros and cons of the Sysop Scenario are discussed more thoroughly in Interlude: Of Transition Guides and Sysops.  Technically the entire discussion is a side issue; the Sysop Scenario is an arguable consequence of normative altruism, but it plays no role in direct Friendliness content.  The Sysop Scenario is important because it's an extreme use of Friendliness.  The more power, or relative power, the Transition Guide or other Friendly AIs are depicted as exercising, the more clearly the necessary qualities of Friendliness show up, and the more clearly important it is to get Friendliness right.  At the limit, Friendliness is required to act as an operating system for the entire human universe.  The Sysop Scenario also makes it clear that individual volition is one of the strongest forces in Friendliness; individual volition may even be the only part of Friendliness that matters - death wouldn't be intrinsically wrong; it would be wrong only insofar as some individual doesn't want to die.  Of course, we can't be that sure of the true nature of ethics; a fully Friendly AI needs to be able to handle literally any moral or ethical question a human could answer, which requires understanding of every factor that contributes to human ethics.  Even so, decisions might end up centering solely around volition, even if it starts out being more complicated than that.

I strongly recommend reading Greg Egan's Diaspora, or at least Permutation City, for a concrete picture of what life would be like with a real operating system... at least, for people who choose to retain the essentially human cognitive architecture.  I don't necessarily think that everything in Diaspora is correct.  In fact, I think most of it is wrong.  But, in terms of concrete imagery, it's probably the best writing available.  My favorite quote from Diaspora - one that affected my entire train of thought about the Singularity - is this one:

    Once a psychoblast became self-aware, it was granted citizenship, and intervention without consent became impossible.  This was not a matter of mere custom or law; the principle was built into the deepest level of the polis.  A citizen who spiraled down into insanity could spend teratau in a state of confusion and pain, with a mind too damaged to authorize help, or even to choose extinction.  That was the price of autonomy: an inalienable right to madness and suffering, indistinguishable from the right to solitude and peace.

Annotated version:

    Once a psychoblast [embryo citizen] became self-aware [defined how?], it was granted citizenship, and intervention without consent [defined how?] became impossible.  This was not a matter of mere custom or law; the principle was built into the deepest level of the polis.  A citizen who spiraled down into insanity [they didn't see it coming?] could spend teratau [1 teratau = ~27,000 years of subjective time] in a state of confusion and pain, with a mind too damaged to authorize help [they didn't authorize it in advance?], or even to choose extinction.  That was the price of autonomy: an inalienable right to madness and suffering, indistinguishable from the right to solitude and peace.

This is one of the issues that I think of as representing the "fine detail" of Friendliness content.  Although such issues appear, in Diaspora, on the intergalactic scale, it's equally possible to imagine them being refined down to the level of an approximately human-equivalent Friendly AI, trying to help a few nearby humans be all they can be, or all they choose to be, and trying to preserve nearby humans from involuntary woes.

Punting the issue of "What is 'good'?" back to individual sentients enormously simplifies a lot of moral issues; whether life is better than death, for example.  Nobody should be able to interfere if a sentient chooses life.  And - in all probability - nobody should be able to interfere if a sentient chooses death.  So what's left to argue about?  Well, quite a bit, and a fully Friendly AI needs to be able to argue it; the resolution, however, is likely to come down to individual volition.

Thus, Creating Friendly AI uses "volition-based Friendliness" as the assumed model for Friendliness content.  Volition-based Friendliness has both a negative aspect - don't cause involuntary pain, death, alteration, et cetera; try to do something about those things if you see them happening - and a positive aspect: to try and fulfill the requests of sentient entities.

Friendship content, however, forms only a very small part of Friendship system design.

1.4: Content, acquisition, and structure

The task of building a Friendly AI that makes a certain decision correctly is the problem of Friendship content.  The task of building a Friendly AI that can learn Friendliness is the problem of Friendship acquisition.  The task of building a Friendly AI that wants to learn Friendliness is the problem of Friendship structure.

It is the structural problem that is unique to Friendly AI.

The content and acquisition problems are similar to other AI problems of using, acquiring, improving, and correcting skills, abilities, competences, concepts, and beliefs.  The acquisition problem is probably harder, in an absolute sense, than the structural problem.  But solving the general acquisition problem is prerequisite to the creation of AIs intelligent enough to need Friendliness.  This holds especially true of the very-high-stakes scenarios, such as transhumanity and superintelligence.  The more powerful and intelligent the AI, the higher the level of intelligence that can be assumed to be turned toward acquiring Friendliness - if the AI wants to acquire Friendliness.

The challenge of Friendly AI is not - except as the conclusion of an effort - about getting an AI to exhibit some specific set of behaviors.  A Friendship architecture is a funnel through which certain types of complexity are poured into the AI, such that the AI sees that pouring as desirable at any given point along the pathway.  One of the great classical mistakes of AI is focusing on the skills that we think of as stereotypically intelligent, rather than the underlying cognitive processes that nobody even notices because all humans have them in common.  The part of morality that humans argue about, the final content of decisions, is the icing on the cake.  Far more challenging is duplicating the invisible cognitive complexity that humans use when arguing about morality.

The field of Friendly AI does not consist of drawing up endless lists of proscriptions for hapless AIs to follow.  Theorizing about Friendship content is great fun but it is worse than useless without a theory of Friendship acquisition and Friendship structure.  With a Friendship acquisition capability, mistakes in Friendship content, though still risks, are small risks.  Any specific mistake is still unacceptable no matter how small, but it can be acceptable to assume that mistakes will be made, and focus on building an AI that can fix them.  With an excellent Friendship architecture, it may be theoretically possible to create a Friendly AI without any formal theory of Friendship content, simply by having the programmers answer the AI's questions about hypothetical scenarios and real-world decisions.  The AI would learn from experience and generalize, with the generalizations assisted by querying the programmers about the reasons for their decisions.  In practice, this will never happen because no competent Friendship programmer could possibly develop a theory of Friendship architecture without having some strong, specific ideas about Friendship content.  The point is that, given an intelligent and structured Friendly AI to do the learning, even a completely informal ethical content provider, acting on gut instinct, might succeed in producing the same Friendly AI that would be produced by a self-aware Friendship programmer.  (The operative word is might; unless the Friendly AI starts out with some strong ideas about what to absorb and what not to absorb, there are several obvious ways in which such a process could go wrong.)

Friendship architecture represents the capability needed to recover from programmer errors.  Since programmer error is nearly certain, showing that a threshold level of architectural Friendliness can handle errors is prerequisite to making a theoretical argument for the feasibility of Friendly AI.  The more robust the Friendship architecture, the less programmer competence need be postulated in order to argue the practical achievability of Friendliness.

Friendship structure and acquisition are more unusual problems than Friendship content - collectively, we might call them the architectural problems.  Architectural problems are closer to the design level and involve a more clearly defined amount of complexity.  Our genes store a bounded amount of evolved complexity that wires up the hippocampus, but then the hippocampus goes on to encode all the memories stored by a human over a lifetime.  Cognitive content is open-ended.  Cognitive architecture is bounded, and is often a matter of design, of complex functional adaptation.


