SIAI Guidelines on Friendly AI

Version 1.0:  June 14, 2001.
Version 1.0.1:  Dec 22, 2001.
©2001 by Singularity Institute for Artificial Intelligence, Inc.
URL:  http://www.singinst.org/ourresearch/publications/guidelines.html
Comments:  friendly@singinst.org


Foreword
Principles
Design
Conclusions



1: Foreword

The term "Friendly AI" refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals. Present-day AIs are enormously inferior to humans in almost every capacity and do not possess the capability to significantly harm or benefit humans. Yet growth in AI intelligence - though slow by the standards of most technologies - is astronomically faster than the rate of human evolution. There are also powerful theoretical reasons to believe that AI growth rates can move quickly compared to human cultural evolution. These reasons include steady exponential growth in underlying computing power, recursive self-improvement in self-modifying AIs, and the maximum switching speed of transistors relative to neurons. Although some presently consider it controversial whether real-world Artificial Intelligence can be achieved at all, let alone whether AIs will someday exceed human capacities, the need for advance planning is established by a strong theoretical argument for the possibility.

The SIAI Guidelines on Friendly AI are produced by the Singularity Institute for Artificial Intelligence, Inc., a 501(c)(3) nonprofit corporation. The field of Artificial Intelligence is presently only beginning to explore the problems bound up in Friendly AI. Thus, the Guidelines do not currently represent an academic consensus or an industry standard. Rather, the Singularity Institute's commitment to Friendly AI is intended as a focal point around which debate and consensus can accrete. Friendly AI is a frontier AI challenge as well as a public safety issue, and creativity may turn out to be more in demand than standardization, but there is still a definite public safety benefit in the open sharing of any concrete suggestions for Friendly AI.

Our development of the Guidelines was sparked by a theoretical analysis of Friendly AI which suggested several identifiable features of human cognition that would need to be duplicated in order to achieve Friendly AI, suggested specific design methods and cognitive architectures, and suggested that Friendship features might need to be implemented early in the course of AI development for maximum safety and to ensure forward compatibility with later versions. Furthermore, debates about the dangers and benefits of AI and other advanced technologies have recently begun to appear, with increasing frequency, in academic and public venues; thus, it is of immediate importance whether a strong theoretical case can be made for the feasibility of Friendly AI.

Making safety recommendations for Artificial Intelligence is a unique challenge because the problem of Friendly AI is inextricably intertwined with the problem of AI itself. Creating Friendly cognition requires creating cognition. In other technologies where the need for safety guidelines is recognized, the safety guidelines are simpler, more obvious, and less controversial than the technical and scientific challenges of the field's frontiers.

Example:  Although biotechnology itself is still a rapidly growing science, the NIH Guidelines on Recombinant DNA precisely describe multiple levels of risk and provide detailed, technical instructions for containment of each risk group. Thus, although mandatory only for federally funded programs, the NIH Guidelines continue to be voluntarily and universally accepted within the biotechnology industry.

Example:  The Foresight Guidelines on Nanotechnology are designed to ensure safety of a technology which does not yet exist, and the Foresight Institute acknowledges that the recommendations made are probably only a small subset of the safety precautions needed, but the Foresight recommendations are both simple and obvious in retrospect. For example, the Foresight Guidelines recommend that molecular blueprints, and especially the blueprints of manufacturing devices, should be encrypted in such a way that any transmission error between memory storage and manufacturing randomizes the blueprint.

Friendly AI, by contrast, is a challenge which lies at the frontiers of AI. Thus, these Guidelines are not intended as a proposal for future regulation or legislation. The current state of AI is such that it would be impossible to create a human-equivalent AI, or even a workable theory of AI, by appointing a panel of experts. A panel of experts would be unlikely to agree even on first principles. Any project that succeeds in developing AI has demonstrated exceptional competence, more than would be expected from a group selected by any other criterion. Thus, it would be very dangerous to take away the responsibility for implementing Friendliness, or even take away the responsibility for developing a basic theory of Friendliness, from whichever AI project is first successful in developing real AI. It is simply not possible, given the current condition of the field, to convene a committee to solve an AI problem, and Friendliness is a frontier AI problem as well as a public safety issue.

Although local AI projects should have final authority, local projects should nonetheless be aware of that authority. Future projects may have their own theories of Friendly AI, but they should be aware of their responsibility to have some theory of Friendly AI. If a particular safeguard is generally held to be a good idea, a project that decides not to implement the safeguard should have made the deliberate and explicit decision that the safeguard is unworkable, or unsafe, or incompatible with their theory of Friendly AI, or impossible under their cognitive architecture. Any sufficiently advanced AI project needs to be "Friendliness-aware". Such awareness is currently nonexistent. This is not currently as dangerous as it might be - the lack of safety awareness does not present the immediate crisis it would present in a more developed technology. Nearly all present-day AI projects are not "sufficiently advanced"; they are neither real-world AIs nor the intended precursors of real-world AIs. An AI that is not self-improving, and is not intended to become self-improving, probably does not need to implement the Guidelines' recommendations. An AI that does not possess a sufficiently general cognitive architecture cannot implement the Guidelines' design recommendations. But regardless of when specific Friendly AI features become necessary, we believe any AI project that states the future goals of general intelligence and self-improvement thereby incurs the responsibility to be Friendliness-aware.

2: Principles

Theoretical grounds for analyzing Friendly AI are drawn from existing theories of normative decision-making and evolutionary psychology. Humans are presently the only subjects of cognitive science - the only intelligent systems that have been studied - but modern theories of human cognition are sophisticated enough that a principled attempt can be made to adjust the human theories for other types of mind. It is possible to link effects to causes, and to distinguish between causes that are unique to humans, causes that carry over to minds in general, and causes whose presence or absence is a design decision. Unfortunately, humans are also the only researchers of cognitive science. As humans, we have built-in, hardwired assumptions about other minds. In our ancestral environment, all other intelligent entities were humans sharing our built-in emotional and cognitive architecture. We are thus adapted to expect, in others, what is "natural" for us; we are adapted to expect human behaviors from minds in general because humans were the only minds present in our ancestral environment. Even today, humans are the only form of intelligent life of which we have experience, thus depriving us of perspective. Our experience tends to indicate that anthropomorphism - the inappropriate application of human-anticipating instincts or human-descriptory experience to nonhuman minds - is the single greatest source of human error in forward analysis of AI psychology, and in Friendly AI especially. Because our social instincts are emotional instincts, anthropomorphic errors often carry with them a weight of emotional investment, making them unusually hard to dispel. A detailed analysis of common anthropomorphisms is beyond the scope of the Guidelines; please see Creating Friendly AI and  CFAI 2: Beyond anthropomorphism. 
Once anthropomorphism is dispensed with, the task of creating a Friendly AI is found to not remotely resemble the task of ensuring ethical behavior in a possibly hostile human, or even the task of instilling ethical behavior in a growing human child. Human analogies are dangerous, both because they assume far too much built-in positive functionality, and because they warn against negative outcomes resulting from human behaviors probably not shared by an AI. It is a truism in AI that researchers, as humans, tend to notice those problems that are difficult for humans and that rise to the level of our conscious attention. Tasks that are automatically handled by our preconscious systems do not rise to our conscious attention, even if the tasks are extremely complex, or are prerequisite to the solution of the conscious problem at hand. Typically such preconscious tasks only come to the attention of AI after years of failure to solve the high-level problem without first implementing the prerequisite low-level cognition. Where the cognition at hand is moral cognition, the semantics of human moral disputes exacerbate the problem. The first class of error is the assertion of objectivity, which results in the programmer perceiving non-automatic functionality as "natural" or "obvious". This leads to the non-implementation of positive functionality. The second class of error is the assertion of arbitrariness, which interferes with programmer perception of error correction, context sensitivity, and design elegance. This leads to the non-prevention of negative functionality. The conclusion offered by Creating Friendly AI is that Friendliness is neither automatic nor arbitrary. That is a prerequisite condition for the existence of any Guidelines - effort is required to create a Friendship system, and there are constraints upon what can be created. It is necessary to take action and possible to make mistakes. 
However, it does not follow that Friendly AI researchers must make zero mistakes or that they must solve the entire problem immediately. A fundamental problem of AI is building an AI that can, given some threshold of ability, acquire further abilities on its own - either through humanlike learning or "seed AI" self-improvement. The task of building a Friendly AI that makes a certain decision correctly is the problem of Friendship content. The task of building a Friendly AI that can learn Friendliness is the problem of Friendship acquisition. The task of building a Friendly AI that wants to learn Friendliness is the problem of Friendship structure. The content and acquisition problems are similar to other AI problems of acquiring, improving, and correcting skills, abilities, competences, concepts, and beliefs. The structural problem is unique to Friendly AI. The acquisition problem is probably harder than the structural problem, but solving the general acquisition problem is prerequisite to the creation of AIs advanced enough to require Friendliness. The more powerful and intelligent the AI, the more Friendliness content is required; but also, in turn, the higher the level of intelligence that can be assumed to be turned toward acquiring Friendliness - so long as that AI chooses to acquire Friendliness. The onset of the need for Friendship content is defined by the timing of the need to make real-world decisions that may benefit or harm humans. The onset of the need for Friendship structure is defined by the onset of the AI's ability to resist human manipulation if the AI does not see that manipulation as desirable. Given an AI with the structural Friendliness needed to accept human advice in situations where programmer competence in Friendliness exceeds the AI's own, a structurally correct AI needs only that threshold level of Friendship content required to know when to ask for advice; this is not true "competence" by the standards of AI, but it is safety. 
Because of the extremely high stakes associated with the creation of novel intelligent entities, it is necessary to be conservative in estimating how much Friendship content and structure is required at a given point in time. "Conservative", for Friendly AI, has the opposite polarity of "conservative" for AI in general; it means attempting to set upper bounds on the AI's potential rather than lower bounds on the AI's current abilities. The Singularity Institute presently distinguishes two conservative methods for Friendship-preparedness. The first method is "supersaturated" Friendliness, in which the maximum possible amounts of Friendship content and structure are infused; as soon as it becomes possible for an AI to usefully represent a Friendship feature, that feature is implemented. The second method is to pursue a "90/10" strategy for Friendship content and a "one step ahead" strategy for Friendship structure. It is proverbial in computer programming that the last 10% of the functionality requires 90% of the effort; thus, "90/10" refers to the strategy of implementing that 90% of Friendship content that requires 10% of the effort. "One step ahead" implies a development schedule divided into stages, with a given feature for structural Friendliness scheduled for completion at least one stage in advance of the stage where that feature is (conservatively) expected to become necessary. (Again, Friendship content becomes necessary in response to real-world capabilities; Friendship structure becomes necessary in response to internal capabilities for self-modification or cognitive content modification.) Supersaturated Friendliness is the safest policy, and also ensures maximal forward compatibility by implementing architectural features as early as possible. In an ideal world, all projects seriously striving for a sufficiently advanced AI would subscribe to the ideal of supersaturated Friendliness. 
90/10 Friendliness would be reserved for AI projects trying for general intelligence or self-improvement, but without the explicit goal of real-world independent planning or transhumanity. In practice, the distinction between supersaturated Friendliness and 90/10 Friendliness is more likely to reflect the distinction between non-profit and for-profit projects; or the distinction between well-funded and shoestring projects; or the distinction between projects that strongly believe in the need for Friendly AI, and projects that were persuaded to implement some minimal level of Friendliness "just in case". However, we maintain that anything less than 90/10 Friendliness should probably not be considered Friendliness-aware. Furthermore, forward compatibility may require implementation of cognitive architectures in advance of when those architectures become directly necessary to Friendship structure. What is to be particularly avoided is the cognitive equivalent of a Y2K bug; a design requirement which is trivial to fulfill if anticipated in advance, but which is difficult and expensive if there already exists an "installed base" of source code or cognitive content. Thus, a Friendliness-aware AI project should be conscious of all architectural features currently predicted to be later required, no matter how far off. Where should efforts in Friendly AI research be concentrated? Friendship structure and acquisition are more unusual problems than Friendship content. Friendship structure and acquisition are closer to the design level and involve a more clearly defined amount of complexity. (Consider the difference between the bounded amount of adapted complexity required for humans to form memories, and the vast amount of complex data contained in all the memories formed over a lifetime.)  
Friendship structure and acquisition are closer to the level of underlying cognition, and are thus less likely to be visible to naive introspection, arguing that these areas are likely to be underserved in existing speculations. Furthermore, Friendship architecture represents the capability needed to recover from programmer errors. Since programmer error is nearly certain, showing that a threshold level of architectural Friendliness can naturally handle a very wide range of errors is prerequisite to making a theoretical argument for the feasibility of Friendly AI. The more robust the Friendship architecture, the less programmer competence need be postulated in order to argue the practical achievability of Friendliness. Thus, the Guidelines' specific design recommendations focus on architecture.

3: Design

The specific design recommendations found in the Guidelines are highly condensed summaries of the features listed in "Features of Friendly AI", which in turn are highly condensed summaries of the discussions found in section 3 of Creating Friendly AI, "Design of Friendship systems". For more information see "Features of Friendly AI" or Creating Friendly AI.

1. Friendliness-topped goal system.  Friendliness should not be an afterthought or a tacked-on "control system" that rules out certain actions. Friendliness should be the sole top-level goal ("supergoal") within the system. Other goals, such as "self-improvement", should be subgoals deriving their desirability from the desirability of Friendliness. For example, self-improvement is predicted to lead to a more effective future AI, which, if the future AI is Friendly, is predicted to lead to greater fulfillment of the Friendliness supergoal. Friendliness does not overrule other goals; rather, all other goals' desirabilities are derived from Friendliness.
This is not how the human goal system is organized, but it appears to embrace all necessary behaviors. If a programmer correctly sees a behavior as necessary and nonharmful to the existence and growth of a (Friendly) AI, then the behavior is, for that reason, a valid subgoal of Friendliness. The necessity of such a behavior may be supported by programmer affirmation of the link, even if the link is not understandable to an early AI.
We recommend (to Friendliness-aware projects) the use of a Friendliness-topped goal system in any AI having a declarative goal system.

2. Cleanly causal goal system.  A "causal" goal system is one in which desirability is the inverse of prediction; that is, if action A is predicted to lead to desirable state B, then B's desirability will flow to A. In a cleanly causal goal system, there are no other sources of desirability, and all desirability flows from supergoal content. Furthermore, desirability in a cleanly causal goal system is not perseverant. If desirability flows from B to A, then the continuing desirability of A is strictly contingent on the continuing desirability of B and on the continuing strength of the causal link between A and B. Thus, a behavior which is usually a subgoal of Friendliness, but which has unFriendly consequences in some specific case, will not be seen as desirable in that case. In a cleanly causal goal system, the property desirable behaves identically with the property is-predicted-to-lead-to-supergoal.
We recommend (to Friendliness-aware projects) the use of a cleanly causal goal system in any AI having a causal goal system.
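
The flow of desirability described in item 2 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than part of the Guidelines: the `Goal` class, the goal names, the link probabilities, and the simplification of scoring a goal by its single strongest causal path. The point it tries to show is that desirability is recomputed from the supergoal each time, so it is never stored on a subgoal and cannot persevere once the causal link weakens.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    """A node in a hypothetical goal graph (names are illustrative only)."""
    name: str
    # Maps a parent goal's name to the predicted probability that
    # achieving this goal leads to that parent.
    leads_to: dict = field(default_factory=dict)


def desirability(goal_name, goals, supergoal="Friendliness"):
    """Recompute desirability from the supergoal on every call.

    Nothing is cached, so a subgoal's desirability cannot persist
    after its causal link to the supergoal weakens or disappears.
    Assumes the goal graph is acyclic.
    """
    if goal_name == supergoal:
        return 1.0
    goal = goals[goal_name]
    # Simplification: score a goal by its strongest single causal path.
    return max(
        (p * desirability(parent, goals, supergoal)
         for parent, p in goal.leads_to.items()),
        default=0.0,
    )


goals = {
    "Friendliness": Goal("Friendliness"),
    "more_effective_future_ai": Goal(
        "more_effective_future_ai", {"Friendliness": 0.9}),
    "self_improvement": Goal(
        "self_improvement", {"more_effective_future_ai": 0.8}),
}

# Usual case: self-improvement inherits desirability through the chain.
usual = desirability("self_improvement", goals)

# Specific case: the improvement is predicted not to serve Friendliness,
# so the same subgoal is no longer seen as desirable.
goals["more_effective_future_ai"].leads_to["Friendliness"] = 0.0
specific_case = desirability("self_improvement", goals)
```

In this toy graph `usual` is the product of the two link strengths, and `specific_case` drops to zero as soon as the upstream link does, with no special-purpose override code.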

3. Probabilistic supergoal content.  If supergoals are certain or "correct by definition", a self-modifying AI may resist all alterations to supergoals. For example:  If the supergoal is to achieve A, then switching the supergoal to achieving B is obviously not an action that serves A. If, however, the supergoal has a "90% probability" of being A, and new information arrives indicating that the supergoal is actually B, then no conflict should exist. The capacity to make changes to supergoal content will be perceived by the AI as desirable only if there is uncertainty in the current supergoals; such a capacity cannot inherit desirability from any specific supergoal content.
We recommend (to Friendliness-aware projects) the use of probabilistic supergoal content for any declarative goal system capable of learning, and we recommend that this feature be implemented at least one stage in advance of the system becoming capable of reflective reasoning or self-modification.
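
The contrast drawn in item 3 between certain and probabilistic supergoals can be sketched as an ordinary Bayesian update. This is only an illustration under stated assumptions: the function name, the two hypotheses "A" and "B", and the likelihood numbers are hypothetical, and the Guidelines do not prescribe this particular representation.

```python
def update_supergoal_beliefs(prior, likelihood):
    """Bayesian update over hypotheses about what the supergoal content is.

    prior:      dict mapping hypothesis name -> probability
    likelihood: dict mapping hypothesis name -> P(evidence | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical evidence: the programmers state that the supergoal is B.
# The likelihoods are made-up numbers chosen for illustration.
evidence = {"A": 0.01, "B": 0.95}

# Supergoal held with "90% probability": the new information moves it.
probabilistic = update_supergoal_beliefs({"A": 0.9, "B": 0.1}, evidence)

# Supergoal held as certain: no evidence can ever change it.
certain = update_supergoal_beliefs({"A": 1.0, "B": 0.0}, evidence)
```

With the 90% prior, most of the probability mass shifts to B after the programmers' statement. With the certain prior, the posterior stays entirely on A, which is the resistance to correction that item 3 warns about.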

4. Acquisition of Friendliness sources.  A growing AI should acquire the cognitive complexity used by humans to make decisions about Friendliness, so that the AI can model, anticipate, and eventually improve on the decisions of the human programmers. The AI should preferentially acquire that cognitive complexity which is designated as "valid" - by the human programmers, or by the AI's current store of cognitive complexity. That is, the AI should only acquire "good" sources of Friendliness decisions.
We recommend (to Friendliness-aware projects) the implementation of Friendship acquisition architectures at least one stage in advance of when the system is expected to acquire the intelligence necessary to begin making actual hypotheses about Friendliness sources.

5. Causal validity semantics.  (Note:  This is a very advanced topic.)  Causal validity semantics requires that the AI model the causal process that led to the AI's creation and that the AI use its existing cognitive complexity (or programmer assistance) to make judgements about the validity or invalidity of factors in that causal process. Causal validity content would provide the cognitive framework for an AI to decide how it "should have been" built, to perceive as erroneous the unintended consequences of a programmatic architecture, to detect very deep errors on the part of the programmers, or to perceive as desirable the transition to new Friendliness architectures. Causal validity structure provides the conceptual capacity to see, as possibly desirable, architectural changes as well as content changes. The cognitive process enabling these benefits is the AI's causal model of the AI's creation, in which the programmers' intentions are designated as the cause of the AI's goal system, and the validity of the goal system content (and architecture) is derived from and contingent on the validity of the intentions of the programmers. This enables the AI to see shifts in expressed programmer intentions as possibly implying the desirability of changes to the goal system architecture, not just goal content.
We recommend (to Friendliness-aware projects) the implementation of causal validity semantics at least one stage in advance of the system becoming generally intelligent. We note that this is a highly advanced feature.

6. Injunctions.  Human actions, and particularly our refusals to take certain actions, are not always motivated by directly visible consequences. However, many such cases can be translated into normative goal reasoning by reference to consequences which are not directly visible. For example, a small probability of a very large long-term negative payoff may injunct against an action with visible short-term benefits. In cases where short-term benefits may become AI-predictable before the long-term penalties are understood, or other cases where negative outcomes may not be immediately visible to the AI, the designers should provide programmer-affirmed information about possible consequences. Note that this is normative goal-system content and does not require special-purpose code.
Experimentation with injunctions will probably be required in any AI project where the AI's actions can have negative consequences not completely understood by the AI. We recommend (to Friendliness-aware projects) the proactive use of specific injunctive content to prevent possible negative outcomes.
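
The arithmetic behind the example in item 6 (a small probability of a very large negative payoff outweighing a visible short-term benefit) can be worked through in a few lines of Python. The payoff values and the probability are invented for illustration; the only claim is that programmer-affirmed consequences enter the same expected-value calculation as everything else, consistent with the note that no special-purpose code is required.

```python
def expected_value(outcomes):
    """Expected payoff over mutually exclusive (probability, payoff) outcomes."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


# All numbers below are hypothetical.
SHORT_TERM_BENEFIT = 10.0          # visible to the AI
LONG_TERM_PENALTY = -1_000_000.0   # known only through programmer affirmation
PENALTY_PROBABILITY = 0.001

# The AI's own model: only the visible benefit.
ev_visible_only = expected_value([(1.0, SHORT_TERM_BENEFIT)])

# The same action once the programmer-affirmed consequence is included.
ev_with_affirmation = expected_value([
    (1.0 - PENALTY_PROBABILITY, SHORT_TERM_BENEFIT),
    (PENALTY_PROBABILITY, SHORT_TERM_BENEFIT + LONG_TERM_PENALTY),
])

# Baseline: not taking the action at all.
ev_of_refraining = 0.0
```

Under these assumed numbers the action looks attractive on visible consequences alone and clearly worse than refraining once the affirmed consequence is included.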

7. Self-modeling of fallibility.  A thought may be mistaken; the thought "X is green" does not have a 100% certain Bayesian association with the actual greenness of X. The same holds true for the thought "X is desirable". A goal system may be mistaken under its own standards of normativeness; a probabilistic goal system with reflection can imagine the possibility of an error. Modeling of fallibility enables the current AI and the programmer to cooperate against failures of Friendliness in future AIs; that is, the current AI will estimate such cooperation to be desirable.
Modeling of ordinary fallibility is required by AI projects investigating intelligence, in general. We recommend (to Friendliness-aware projects) the proactive use of programmer-assisted modeling of fallibility, or programmer-affirmed knowledge about fallibility, to prevent negative outcomes stemming from nonawareness of fallibility, or to enable important conclusions and behaviors based on self-modeling of fallibility.
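
The statement in item 7 that a thought does not have a 100% certain Bayesian association with its subject can be made concrete with Bayes' rule. The function below is a hypothetical sketch; the prior and the two error rates are assumed numbers, not measurements of any system.

```python
def posterior_given_thought(prior, hit_rate, false_alarm_rate):
    """P(X is desirable | the system thinks "X is desirable"), by Bayes' rule.

    prior:            P(X is desirable) before the thought occurs
    hit_rate:         P(thought occurs | X is desirable)
    false_alarm_rate: P(thought occurs | X is not desirable)
    """
    evidence = hit_rate * prior + false_alarm_rate * (1.0 - prior)
    if evidence == 0:
        raise ValueError("the thought can never occur under these rates")
    return hit_rate * prior / evidence


# Assumed numbers: a fairly reliable but imperfect judgment.
confidence = posterior_given_thought(
    prior=0.5, hit_rate=0.95, false_alarm_rate=0.10)
```

Any nonzero false-alarm rate leaves the posterior below 1, which is the residual uncertainty that lets a reflective goal system imagine its own error.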

8. Controlled ascent.  A self-improving system should have an "improvements counter" which increments each time an improvement of a recognized type is made. This enables detection if improvements begin occurring at a rate much faster than usual. By measuring the rate of change of the improvements counter under normal conditions, the programmers can designate some safe level of improvement which, if exceeded, causes the system to halt and page the programmers and not continue until approval is received.
Within a primitive system, a "controlled ascent" feature can be implemented programmatically, using special-purpose code. Since this is a very simple and inexpensive precaution, it should be taken for any recursively self-improving system, no matter how primitive, on general principles. (Recursive self-improvement should be distinguished from learning systems that improve, but not self-improve.)
For general intelligences and self-understanding AIs, a controlled ascent subgoal can be desirable because of a self-model in which too much unsupervised self-improvement has a probability of leading to Friendliness errors.
The purpose of a controlled ascent feature is not to prevent an AI from "awakening", but rather to ensure that the process occurs under human supervision, and can be slowed or paused to allow the installation of further Friendship features if the project is unready. Controlled ascent is strictly a temporary measure and is not viable as a permanent policy.
We recommend the implementation of a programmatic controlled ascent feature in any recursively self-improving AI where there exists an obvious metric for the number of self-improvements made. We recommend (to Friendliness-aware projects) the programmer affirmation of a controlled ascent subgoal as soon as this cognitive content can be represented by the AI.
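
For the programmatic version of controlled ascent in a primitive system, a minimal Python sketch of an improvements counter with a rate limit might look as follows. The class name, the sliding-window approach, the threshold values, and the `page_programmers` callback are all assumptions made for illustration; the Guidelines specify only the counter, the safe rate, the halt, and the wait for approval. Timestamps are passed in explicitly so the example is deterministic.

```python
from collections import deque


class ControlledAscent:
    """Improvements counter with a rate limit (illustrative names)."""

    def __init__(self, max_per_window, window_seconds, page_programmers):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.page_programmers = page_programmers  # callback, e.g. sends an alert
        self.total_improvements = 0
        self.recent = deque()  # timestamps of improvements inside the window
        self.halted = False

    def record_improvement(self, now):
        """Count one improvement; halt and page if the rate is exceeded.

        Returns True if the system may continue, False if it must wait
        for programmer approval.
        """
        self.total_improvements += 1
        self.recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if not self.halted and len(self.recent) > self.max_per_window:
            self.halted = True
            self.page_programmers(self.total_improvements, len(self.recent))
        return not self.halted

    def approve(self):
        """Called by a programmer after review; clears the halt."""
        self.recent.clear()
        self.halted = False


pages = []
guard = ControlledAscent(
    max_per_window=3,
    window_seconds=60.0,
    page_programmers=lambda total, recent: pages.append((total, recent)),
)

# Normal pace: one improvement every 30 seconds stays under the limit.
slow = [guard.record_improvement(t) for t in (0.0, 30.0, 60.0, 90.0)]

# Burst: several improvements within a few seconds trip the limit.
fast = [guard.record_improvement(t) for t in (100.0, 101.0, 102.0, 103.0)]
```

In a real system the caller would stop making improvements as soon as `record_improvement` returns False and resume only after `approve()`; the sketch keeps counting during the halt purely so the burst can be shown in one list.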

4: Conclusions

The present-day Internet contains an amount of networked computing power exceeding most estimates of the computing power of a single human brain. AI still appears to be very far off, and this distance is real, but the distance may consist of software rather than hardware, and software capabilities can improve very quickly. This does not mean predicting that software capabilities will move quickly; only that software capabilities may move quickly. Because Friendly AI inherently implies the extremely advanced capability of making real-world plans in pursuit of goals, the present-day field of AI has been reluctant to discuss the topic at all. Speculation about the future of AI has been largely seen as the reserved subject of "popular" works. Trying to discuss extreme-sounding scenarios in most academic venues - or even daring to indicate that one takes the topic seriously - leads to the loss of status, as other scientists, eager to display their own constraint and conservatism, chime in with multiple choruses of criticism. Normally, this is not a problem; it is a useful part of the social process of science, which counteracts the tendency of individuals to focus on ideas that sound glamorous and exciting, to the detriment of the rational global distribution of effort. Friendly AI is admittedly exciting, perhaps even "glamorous", but it seems quite safe to say that Friendly AI would remain genuinely important and necessary even if the possibility of real-world AI utterly failed to excite the human psyche and the research itself were as boring as grade-school arithmetic. So, in this case, overcompensation has occurred. Friendly AI is important and any opportunity to develop the theory further should be taken as soon as it becomes available, rather than researchers waiting because "It isn't necessary yet" or "AI is a long way off". Such statements may be true but they are not knowably true. As an attitude, that is simply the wrong way to approach a public safety issue. 
To say that the rational distribution of effort has failed is to implicitly state that useful present-day efforts can be made in Friendly AI. We believe that it is both possible, and desirable, to build a firm theoretical foundation for future Friendly AI efforts before Friendly AI becomes immediately necessary, and that much of this work can be advanced by present-day investigations. These present-day investigations have not been undertaken because the field of Artificial Intelligence is averse to the detailed visualization of extreme possibilities, which visualization is necessary in order to understand which present-day investigations could be useful. It is a legitimate topic of cognitive science to examine, for example, "Qualitatively different behaviors exhibited by probabilistic and non-probabilistic goal systems under reflection."  But to realize that this investigation is worthwhile - that it is relevant to Friendly AI - requires extended thinking about self-improving reflective Artificial Intelligence, a presently nonexistent technology. We recommend that Friendly AI be regarded as a legitimate area of research by the field of Artificial Intelligence. An AI project implementing a Friendly AI feature should be praised for making progress in Friendly AI and for setting a good safety precedent, rather than being accused of implicit unconservatism. We recommend that any AI project stating the future goals of general intelligence and self-improvement regard itself as having a responsibility to be Friendliness-aware. Because Friendly AI is a frontier research problem as well as a public safety issue, it is impossible to legislate or centrally regulate the design of Friendly AI. We recommend that centrally imposed requirements, if any, take the form of an obligation to produce a coherent policy or report on efforts made, rather than an obligation to implement specific features. 
(We believe any attempt to formalize responsibilities is extremely premature given the current state of Friendly AI, but this may change in the future.) We reaffirm the feature recommendations made in the Design section of these Guidelines. We recommend that all desirable Friendly AI features be implemented as soon as they become possible. Failing that, we recommend that all easy Friendly AI content be implemented as soon as possible, and that all architectural features be implemented at least one stage of development in advance of when they are expected to become necessary. In the future, we envision that nonprofit organizations such as the Singularity Institute may employ Friendship programmers who are loaned out, on a part-time or full-time basis, to AI projects requiring Friendliness-awareness but having no resources or inadequate resources allocated to Friendly AI. We recommend this as an important public safety measure and a legitimate future use of philanthropic or charitable resources. We recommend that all AI projects which obtain concrete experience in problems of Friendly AI should publish their results, or as much of those results as can possibly be published without revealing trade secrets. There is no conceivable benefit of withholding Friendly AI results that outweighs the benefit of making the information public. We recommend that relevance to a Friendly AI problem be regarded as a legitimate affirmation of the importance of an issue in cognitive science, and that such issues be investigated as part of an effort to build a theoretical foundation for Friendly AI in advance of when it is needed. We recommend that theoretical efforts to extrapolate a foundation for Friendly AI focus on architectural issues.

