The following design features have been mined from Creating Friendly AI, and specifically CFAI 3: Design of Friendship systems. These highly condensed summaries are offered for the benefit of individuals looking for a fast visualization of how a Friendship architecture might operate. Features of Friendly AI does not explain the Friendly AI paradigms that make these design features, rather than others, necessary. For an introduction to Friendly AI, please see "What is Friendly AI?" For a complete exposition on Friendly AI, please see the book-length online paper Creating Friendly AI, here abbreviated "CFAI". References are also made to General Intelligence and Seed AI, abbreviated "GISAI".
A causal goal system backpropagates desirability along predictive links between actions and outcomes. [CFAI 3.1.1: Cleanly causal goal systems].
Intelligence is an evolutionary advantage because it enables us to model, predict, and manipulate reality. Whether a model of reality effectively "mirrors" reality is often ambiguous, but whether a model can be used to successfully predict future sensory information is a stronger test. A still stronger test is the use of a model to decide between a limited set of possible actions based on the set of predicted outcomes of world-plus-action. The strongest test is the use of a model for manipulation by starting with an image of a desired goal-state and reasoning backwards to find the necessary actions. [GISAI 2.1: World-model.]
Goal-oriented behavior is behavior that coherently acts to steer the world toward a particular target state, or toward a state meeting a particular description. [CFAI: An Introduction to Goal Systems.] Decision and manipulation are two forms of goal-oriented cognition that enable an intelligent being to act in such a way that desired outcomes are more likely.
A crude model of goal-oriented cognition nonetheless suffices to show the backward flow of desirability. (The following diagram should not be taken as referring directly to the computational token level, but rather as representing a very high-level description of complex mental imagery.)
[Diagram: backward flow of desirability through a goal system.]
Given a model capable of making accurate predictions, and a model capable of inventing complex actions, a causal goal system suffices to yield normative goal-oriented cognition. In fact, causal goal systems are usually taken as the standard which defines normative goal cognition.
A causal goal system, in combination with sensory feedback and the Bayesian Probability Theorem, suffices in itself to produce normative positive and negative reinforcement, which in the human goal system are implemented by pain and pleasure. [CFAI 3.1.4: Bayesian reinforcement.] Positive and negative reinforcement are evolutionary advantages because they increase or decrease the probability of repetition of behaviors which have produced positive or negative outcomes, thereby increasing or decreasing the probability of repetition of the positive or negative outcomes themselves. Since Bayesian reinforcement requires deliberative general intelligence in order to operate, and general intelligence is an evolutionarily recent innovation, the development of separate hardware by natural evolution to implement positive and negative reinforcement should not be taken as evidence that a separate system is the optimal design. [CFAI 2.2.1: Pain and pleasure.]
In CFAI 3.1.4: Bayesian reinforcement, a scenario is offered involving an AI with two sensory devices, a microphone and an optical pickup, and a "motor" peripheral consisting of a computer monitor under the AI's direct control. The AI, in this scenario, has the goal of "producing a loud noise", which would be indicated by sound entering the microphone. We will suppose that this AI has observed at least one thunderstorm. The AI, having correctly observed that lightning is often followed by thunder, has incorrectly hypothesized that bright lights directly cause loud noises. To be specific: The AI, having observed a thunderstorm, has learned to predict that a flash (RGB spike detected by optical pickup) will be followed by noise in the microphone 95% of the time, and has incorrectly hypothesized, with 80% probability, that the light itself is the direct cause of the thunder.
The AI has also learned that its actions in controlling the monitor's luminosity are always reflected in incoming sensory information from the optical pickup - that is, that the AI has the "motor" capability to generate RGB spikes. The AI has thus incorrectly hypothesized that by flashing the monitor, the AI will be able to generate noise.
(This is the essential difference between a causal prediction system and a causal goal system. Within a causal prediction system, there is no inherent distinction between direct causes and indirect causes; it is enough to know that the flash of light is usually followed by the sound of thunder. A causal goal system must distinguish between the hypothesis that the flash itself causes thunder directly, and the hypothesis that both the flash and the thunder are generated by a mutual third cause such as an electrical discharge. Both the former and the latter case are useful for prediction; only the former case is useful for manipulation.)

The 80% confidence in the hypothesis and the 95% hypothesized correlation are multiplied to give a 76% chance that an AI-produced RGB spike will lead to a loud noise. If the "loud noise" has desirability 100, that desirability flows back to give the RGB spike a desirability of 76, which flows to the "world plus flash" without noticeably diminishing, which flows back to give the action of flashing the monitor a payoff of 76. We'll suppose that the expected cost of flashing the monitor is 1; thus, the total desirability of flashing the monitor is 75. (Adding in some corrections discussed in CFAI 3.1.4: Bayesian reinforcement, the differential desirability of flashing is actually 74.24.) Since the differential desirability is positive, the AI will decide to flash.
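The arithmetic of this backward flow can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are hypothetical, and the additional corrections from CFAI 3.1.4 (which bring the figure to 74.24) are omitted.

```python
def action_desirability(confidence, correlation, payoff, cost):
    """Expected payoff flowing back to an action, minus the action's cost."""
    # Confidence in the causal hypothesis times the hypothesized
    # correlation gives the chance that the action yields the outcome.
    expected_payoff = confidence * correlation * payoff
    return expected_payoff - cost

# Figures from the thunderstorm scenario in the text.
flash = action_desirability(confidence=0.80, correlation=0.95, payoff=100, cost=1)
print(round(flash, 2))  # 75.0
```

Because the result is positive, the sketch agrees with the text: flashing the monitor is chosen.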
After the AI takes the flash action, the monitor's flash reflects off nearby objects and adds to the ambient light, the camera picks up the increased ambient light, and the AI observes the expected RGB spike. After the RGB spike, the expected noise fails to materialize and no sound is detected by the microphone - the action has failed.
However, no hardcoded analogue of "pain" or "frustration" is necessary to prevent the system from repeatedly flashing the monitor (at a cost of 1 each time) in hopes of obtaining the expected payoff of 76. The Bayesian Probability Theorem suffices in itself.

Using the possible-worlds formulation of the Bayesian Probability Theorem:
Given a hundred possible worlds, all of them contain monitor flashes. The monitor flash always leads to an RGB spike (an RGB spike is expected in all 100 of the possible worlds). The hypothesis has an 80% confidence; it is correct in 80 possible worlds, incorrect in 20. If the hypothesis is correct, the RGB spike will lead to a noise 95% of the time. If the hypothesis is incorrect, then no noise is expected to occur. Thus, in 76 possible worlds, the hypothesis is correct and a noise occurs. In 4 possible worlds, the hypothesis is correct and no noise occurs. In 0 possible worlds, the hypothesis is incorrect and a noise occurs. In 20 possible worlds, the hypothesis is incorrect and no noise occurs.
The AI now flashes the monitor. The expected RGB spike is observed. However, no noise materializes. Thus, in accordance with the Bayesian Probability Theorem, the probability that the hypothesis is correct goes from 80/(80 + 20) to 4/(20 + 4), or 17%. Formerly, the expected payoff of flashing the monitor was a confidence of 80% times a correlation of 95% times a payoff of 100, for a total payoff of 76. After a single failed flash, the probability of the hypothesis goes from 80% to 17%, and the expected payoff of flashing the monitor becomes (17% * 95% * 100) = 16.15; with some minor corrections, the differential desirability is now 14.99. Since the differential desirability is still positive, another attempt is made to flash the monitor. After another failure, the probability goes from 17% to 1%, and the differential desirability goes from positive 14.99 to negative .06. The hypothesis now has a probability so low that, with the cost of flashing the monitor factored in, it is no longer worthwhile to test the hypothesis.
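The update described above can be reproduced with a minimal Python sketch. The function names are hypothetical, and the sketch assumes (as the possible-worlds count does) that no noise occurs when the hypothesis is false; it omits the "minor corrections" mentioned in the text.

```python
def update_on_failure(prior, p_noise_if_true, p_noise_if_false=0.0):
    """Posterior probability of the hypothesis after a flash yields no noise."""
    # Weight of worlds where the hypothesis holds but no noise occurred.
    true_and_silent = prior * (1.0 - p_noise_if_true)
    # Weight of worlds where the hypothesis is false and no noise occurred.
    false_and_silent = (1.0 - prior) * (1.0 - p_noise_if_false)
    return true_and_silent / (true_and_silent + false_and_silent)

def net_desirability(confidence, correlation=0.95, payoff=100, cost=1):
    """Expected payoff of flashing, minus the cost of flashing."""
    return confidence * correlation * payoff - cost

belief = 0.80
history = [belief]
# Keep flashing (and observing silence) while flashing looks worthwhile.
while net_desirability(belief) > 0:
    belief = update_on_failure(belief, 0.95)
    history.append(belief)

print([round(p, 4) for p in history])  # [0.8, 0.1667, 0.0099]
```

Without the rounding and corrections used in the text, the intermediate payoffs differ slightly (about 15.83 rather than 16.15 after the first failure), but the behavior is the same: the loop stops after two failed flashes, with no separate "frustration" mechanism.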
Several interesting behaviors emerge directly from Bayesian reinforcement:
Surprisingly, I have not been able to find any reference in the literature to Bayesian reinforcement. There are references to the use of Bayesian belief networks to make choices, and (of course) the use of Bayesian sensory data to alter Bayesian belief networks, but no reference to Bayesian reinforcement proper. If anyone knows of such a reference, please contact the author at friendly@singinst.org.
"Subgoal" content describes events whose desirabilities are strictly contingent on their predicted outcomes. "Supergoal" content describes events that are considered intrinsically desirable; i.e., whose desirabilities are not contingent on their predicted outcomes.
Friendliness should be the sole top-level goal ("supergoal") within the system. Other subgoals, such as "self-improvement", should derive their desirability from the desirability of Friendliness. For example, self-improvement is predicted to lead to a more effective future AI, which, if the future AI is Friendly, is predicted to lead to greater fulfillment of the Friendliness supergoal.
Friendliness does not overrule other goals; rather, other goals' desirabilities are derived from Friendliness. [CFAI 3.1: Cleanly Friendly goal systems.]
If a programmer correctly sees a behavior as necessary and nonharmful to the existence and growth of a (Friendly) AI, then the behavior is, for that reason, a valid subgoal of Friendliness. [CFAI 3.1.2: Friendliness-derived operating behaviors.] The necessity of a behavior may be affirmed by the programmers even if the link is not immediately visible to an unaided AI. [CFAI 3.1.3: Programmer affirmations.]
There is never any valid reason to raise any subgoal of the programmers' to supergoal status within the AI. The derivations of desirability within the AI's goal system should structurally mirror the derivations of desirability within the programmers' minds. If this seems impossible, it indicates that some key facet of goal cognition has not been implemented within the AI, or that the programmers' motives have not been fully documented.
For example, the programmers may wish the AI to focus on long-term self-improvement rather than immediate Friendliness to those humans within visible reach. An incorrect "hack" would be promoting self-improvement to an independent supergoal of greater value than Friendliness. The correct action is for the programmers, by self-examination of their own goal systems, to realize that the reason they want the AI to focus on long-term self-improvement is that a more powerful future Friendly AI would benefit humanity. Thus, the desired distribution of efforts by the AI can be made to fall directly out of the following goal-system content:
[Diagram: goal-system content in which most desirability flows from the "Future Friendliness" subgoal.]

This goal system content shows an AI whose primary motivation is derived from the prospect of future Friendliness. The largest desirabilities flowing through the system originate in the "Future Friendliness" subgoal; thus, most of the AI's present-day actions will be focused on self-improvement, or, in the case of a commercial system, performing tasks for present-day users. However, the AI also tracks present-day Friendliness, allowing the AI to continue gaining direct experience in what constitutes "Friendliness".
Evolution did not produce this system architecture in humanity because most of humanity's evolutionary history was spent without general intelligence, and because evolution tends to accrete small pieces of behavior as independent adaptations even where a more general behavior would be more useful and more flexible. [CFAI: Interlude: The story of a blob.]
Similarly, one of the frequently asked questions about strict subgoals is whether a Friendly AI using that architecture would be too "utilitarian" to understand things like curiosity, aesthetic appreciation, and so on. Since these things are so incredibly useful that people automatically conclude that a Friendly AI without them would fail, they are, by virtue of that very fact, valid subgoals of Friendliness. These subgoals may not be obvious to young AIs; if so, the statement that "curiosity behavior X is a valid child goal of 'discovery'" can be programmer-affirmed. [CFAI 3.1.4.2: Perseverant affirmation (of curiosity, injunctions, et cetera).]
The subgoal nature of curiosity does not mean that curiosity must be justified by a visible, specific expected result in each particular instance. The programmer-affirmed statement that "curiosity is useful" can describe "curiosity" in general, context-insensitive terms; the "curiosity" behaviors described can look - to a human - like exploration for its own sake. The programmer affirmation suffices to draw a predictive line between these curiosity behaviors and the expectation of useful discoveries; no specific expectation of a specific discovery is required for this predictive link to be drawn.
Even if the curiosity behaviors fail to exhibit immediate results the first few times they are employed, the heuristic need not be immediately disconfirmed by Bayesian negative reinforcement (see 1.2: Bayesian reinforcement). As described above, curiosity need only be affirmed as a behavior with a very rare, very large payoff. (Similarly, many negative injunctions [see CFAI 3.2.4: Injunctions] are behaviors supported by the belief that these injunctions, very rarely, prevent a very large negative outcome.) Graphically:
[Diagram: programmer-affirmed link between curiosity and discovery ("D").]

"D" stands for "discovery". This diagram shows that the very rare (one out of a thousand tries) usefulness of curiosity, at a very high payoff, is affirmed at very high confidence by the human programmers. If the original affirmed probability is 98%, then it will take 416 failed tries before the probability goes down to 97%, 947 tries before the probability goes down to 95%, 1694 tries before the probability goes down to 90%, 6086 failed tries before the probability goes down to 10%, and 8483 failed tries without a single success before the probability goes down to 1%. The curiosity subgoal can be as perseverant as the human independent drive, if the programmers tell the AI in advance that curiosity often doesn't work.
Anything other than Friendliness - any scenarios other than those directly desirable under the programmers' philosophical purpose for the AI - should be subgoal content; that is, their desirabilities should be contingent on their predicted outcomes. The link between child goal and parent goal should be subject to Bayesian positive and negative reinforcement. Alterations to the desirability of the parent goal should be immediately and directly backpropagated to all child goals; alternatively, major actions should reverify the desirability of the parent goals. (The latter is a computationally inexpensive serial action.) [CFAI 3.1.1: Cleanly causal goal systems.]
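The perseverance figures quoted earlier for programmer-affirmed curiosity (a 98% affirmed probability, with one success expected per thousand tries) follow from applying Bayesian negative reinforcement to the child-goal link repeatedly. A minimal sketch, assuming that a failed try is certain when the affirmed hypothesis is false; the function name is hypothetical:

```python
import math

def failures_until(prior, success_rate, target):
    """Smallest number of consecutive failed tries after which the
    probability of the affirmed hypothesis is at or below `target`."""
    prior_odds = prior / (1.0 - prior)
    target_odds = target / (1.0 - target)
    # Each failure multiplies the odds by (1 - success_rate), because a
    # failure is assumed certain if the hypothesis is false.
    per_failure = math.log(1.0 - success_rate)
    return math.ceil(math.log(target_odds / prior_odds) / per_failure)

for target in (0.97, 0.95, 0.90, 0.10, 0.01):
    print(target, failures_until(0.98, 0.001, target))
```

Run as written, this prints 416, 947, 1694, 6086, and 8483 for the five thresholds, matching the figures given for the curiosity example.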
A cleanly causal goal system is one that can be viewed as containing supergoals, decisions, and beliefs, with all subgoal content being identical to beliefs about which events lead to other events. Under a cleanly causal goal system, desirability is equivalent to leads-to-supergoal-ness and backpropagates along predictive links with exactly the same degree of transitivity and contingency as a normal "will-lead-to-scenario-X" property. For example, if C normally leads to B, and B normally leads to A, it can usually but not always be concluded that C normally leads to A. If A is a supergoal, B would inherit desirability from A, and C would probably but not necessarily inherit desirability (leads-to-A-ness) from B. If the predictive link between C and B were disconfirmed, C would lose both "desirability" and "leads-to-A-ness"; in fact, these are the same property. [CFAI 3.1.1: Cleanly causal goal systems.]
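The identity between "desirability" and "leads-to-supergoal-ness" can be illustrated with a toy model. The class below is a hypothetical sketch, not CFAI's design: it assumes an acyclic set of predictive links and simply multiplies link probabilities along a chain, whereas the text notes that such transitivity holds only "usually but not always".

```python
class CausalGoalSystem:
    """Toy goal system in which subgoal content is just predictive beliefs."""

    def __init__(self, supergoals):
        # Maps each supergoal event to its intrinsic desirability.
        self.supergoals = dict(supergoals)
        # Maps each event to {outcome: probability that event leads to outcome}.
        self.links = {}

    def believe(self, cause, outcome, probability):
        """Record or revise the belief that `cause` leads to `outcome`."""
        self.links.setdefault(cause, {})[outcome] = probability

    def desirability(self, event):
        """Desirability inherited along predictive links (assumes no cycles)."""
        if event in self.supergoals:
            return self.supergoals[event]
        return sum(p * self.desirability(outcome)
                   for outcome, p in self.links.get(event, {}).items())

goals = CausalGoalSystem({"A": 100})
goals.believe("B", "A", 0.9)
goals.believe("C", "B", 0.9)
before = goals.desirability("C")   # C inherits leads-to-A-ness through B

goals.believe("C", "B", 0.0)       # the link between C and B is disconfirmed
after = goals.desirability("C")    # C's desirability vanishes with the link
print(round(before, 2), after)
```

No separate bookkeeping is needed to "revoke" the subgoal: once the belief changes, the desirability is already gone, because the two are the same quantity.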
Clean contingency should be considered a powerful feature of a goal system, rather than a constraint. This is made clearer by considering, for example, the idea of an associative or spreading-activation goal system, in which desirability travels along similarity links (rather than predictive links) and is perseverant rather than contingent. Such a system would exhibit very odd, non-normative, non-useful behaviors. For example, if a loud noise were desirable, and the system observed lightning flashes in association with thunder, the system would - rather than hypothesizing causation - acquire a "fondness" for luminosity spikes, and would then begin happily flashing the monitor, on and off, without noticing or caring that the action failed to produce a loud noise. An AI with a causal goal system will preferentially seek out useful behaviors. This not only produces a more useful AI, it produces a smarter AI. The realm of useful plans exhibits far more interesting complexity and exposes fundamental regularities in underlying reality. [CFAI 3.1.5: Cleanliness is an advantage.]
"External reference semantics" refers to a set of architectural features which enable a Friendly AI to refine the Friendliness supergoal content and see that refinement as desirable. [CFAI 3.4.1: External reference semantics.]
Where supergoals are absolutely certain - "correct by definition" - the AI has a motive to resist any attempt on the part of the programmers to change the supergoals. If supergoals are certain, any change to the supergoals is automatically in conflict with the current supergoals. [CFAI 3.4.1.1: Probabilistic supergoal content.]
For example:
The change in priorities is subtle and becomes truly apparent only when discussing reflective AIs. Suppose the AI, before new information arrives, considers, in the abstract, the possibility that new information will arrive. While S1 currently appears desirable, it is undesirable to remove the subgoal S1 spontaneously or without justification. However, the AI, using its current knowledge, can perceive the hypothetical desirability of removing S1 if new information arrives disconfirming the link between S1 and G1. In Bayesian terms, information disconfirming S1 is expected to arrive if and only if S1 is actually undesirable; thus, the hypothetical rule of action "If disconfirming information arrives, remove S1" is evaluated as desirable.
Probabilistic supergoals allow a self-modifying AI to hypothetically consider conditions under which it is desirable to change the supergoals. For the AI to cooperate with programmers in modifying supergoal content, the AI must be able to conceive of the current supergoal content as being possibly "wrong". [CFAI 3.4.1.1: Probabilistic supergoal content.] More precise information about the semantics is supplied in the sections below.
An example of an incorrect way to solve the problem is to have supergoals G1, G2, G3, and G4, where G4 is "Obey the programmers' instructions about supergoals." G1, G2, and G3 would have desirability 10 and G4 would have desirability 1000. This hack only works until one attempts to modify G4. It is not a solution to the problem of Friendship structure. [CFAI: Interlude: Why structure matters.]
The Bayesian Probability Theorem describes the general binding between sensory information and cognitive models. Under the model of external reference semantics used in Creating Friendly AI, the Bayesian Probability Theorem is also used to describe cognition about supergoal content. [CFAI 3.4.1.2: Bayesian affirmed supergoal content.]
"Bayesian affirmation" describes the AI's model of the derivation of its probabilistic supergoals. The target for Friendliness is considered to be an external referent; that is, information about Friendliness is considered to have sources external to the AI. (The semantics of external referents are described in more detail in 3.3: Distinct reflection and CFAI 3.4.1.3: Semantics of external objects and external referents.) The programmers are assumed to know about "Friendliness"; Friendliness, or rather the referent of "Friendliness", has an effect on the programmers. (Whether this referent exists outside the programmers or inside the programmers' minds is irrelevant, so long as the AI understands that the referent exists outside the AI.) The programmers, in turn, affect the sensory inputs of the AI, such as the keyboard. Thus, the keystrokes "X is Friendly" or "X is unFriendly" are hypothesized to represent sensory information about Friendliness - Friendliness affects the programmers, who affect the keyboard, which affects the AI.

It is also noteworthy that the above arrangement allows the AI to, for example, conceive of the possibility that fulfilling a user request might not be Friendly. Feedback provided from users can only modify the description of the referent of "satisfied user", and not the description of Friendliness. Although fulfilling user requests, in general, is predicted to lead to Friendliness, most of the time, it is quite possible for a specific instance of a fulfilled user request to have led to unFriendliness. The correlation of the inputs "The user is happy" and "You have been unFriendly" will not confuse the AI, but will rather (correctly) imply that there are some subcategories of user request which should be refused, or queried to higher authority.
The actual target, the definition for the referent of Friendliness - which, in Creating Friendly AI, is described in programmer-independent but not human-independent terms - may be found in CFAI 3.4.4: The actual definition of Friendliness. The definition is such that, given the definition, it is accurate to describe programmers as sources of usually-correct information about Friendliness.
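Treating programmers as sources of usually-correct information, rather than as correct by definition, amounts to an ordinary Bayesian update along the chain from Friendliness to programmers to keyboard to AI. The sketch below is illustrative only: the function names and the accuracy figures (95% for programmer judgment, 99% for error-free typing) are assumptions chosen for the example and do not come from CFAI.

```python
def p_keys_say_friendly(x_is_friendly, programmer_accuracy, typing_accuracy):
    """Probability that the observed keystrokes read 'X is Friendly'."""
    # Probability that the programmer believes X is Friendly.
    believes = programmer_accuracy if x_is_friendly else 1.0 - programmer_accuracy
    # The keystrokes match the belief unless a typing error flips them.
    return believes * typing_accuracy + (1.0 - believes) * (1.0 - typing_accuracy)

def posterior_friendly(prior, programmer_accuracy=0.95, typing_accuracy=0.99):
    """Belief that X is Friendly after seeing the keystrokes 'X is Friendly'."""
    like_true = p_keys_say_friendly(True, programmer_accuracy, typing_accuracy)
    like_false = p_keys_say_friendly(False, programmer_accuracy, typing_accuracy)
    return prior * like_true / (prior * like_true + (1.0 - prior) * like_false)

posterior = posterior_friendly(prior=0.5)
print(round(posterior, 3))  # 0.941
```

The statement raises the AI's confidence substantially but never to certainty, which leaves room for a later correction to be treated as further evidence about the same external referent.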
What is an "external referent"?
There are three ways in which a reflective AI might conceptualize its attempts to achieve the supergoal:
The second case is the "correct by definition" case that is structurally unsafe because it does not permit the AI to see as desirable the refinement, extension, or correction of the supergoals.
The third case involves "external reference semantics": the ability of the AI to visualize the referent of a concept as something apart from the content of the concept. [CFAI 3.4.1.3: Semantics of external objects and external referents.]
Since all the AI's thoughts are necessarily internal - the map is not the territory; the thought of an apple is not an apple - it is tempting to think of external reference as an impossibility. Any attempt to take a concept and "dereference" it will inevitably arrive at merely another piece of mental imagery, rather than the external object itself. If you think in terms of the "referent" as a special property of the concept, then you can take the referent, and the referent's referent, and the referent's referent's referent, and never once wind up at the external object.
The answer is to think in terms of referencing rather than dereferencing. A mental image of the "sky", where it occurs, is itself - directly - the "referent". Thoughts that refer to image X are thoughts about the sky. A reflective AI can also have an image Y for image X, "my image of the sky", which, when referred to, is a thought about X. A concept in ordinary usage is thought of in terms of its referent; under exceptional circumstances it can be thought about as a concept.
Reflective thought needs a way to distinguish between map and territory. The condition where the "territory" is a special case turns out to be unworkable because of an infinite recursion problem. If thinking about the "map" is the special case, then distinguishing between the levels is both finite and workable. Discoveries that apply to the referent will change the image of the referent. If discoveries are made that apply to the mental image itself, the discovery will change the image of the image, rather than changing the image itself.
Thus, probabilistic supergoals turn out to be a series of probabilistic statements made about the referent of "Friendliness". Incoming information does not change Friendliness itself, but rather changes the AI's beliefs about Friendliness.
Subgoals for "improving the supergoals" or "improving the goal-system architecture" are child goals of the unknowns in the supergoal content. The behavior of seeking additional information from the programmers to resolve an ambiguity, for example, is not a super-supergoal or a meta-supergoal, nor is it inherently desirable. The desirability of "resolving a supergoal ambiguity" derives from the prediction that the unknown referent of Friendliness will be better served (see above), and not from a prediction that one of the current probabilistic descriptions of Friendliness will be fulfilled. [CFAI 3.4.1.4: Deriving desirability from supergoal content uncertainty.]
Decisions and behaviors having to do with the improvement, correction, refinement, learning, et cetera of Friendliness, should be conceptualized as child goals of the currently unknown supergoal content - that is, whichever parts of the target supergoals differ from the current approximation. Another way of putting it is that Friendliness-improvement behaviors must derive desirability from the naked referent of Friendliness, and cannot be attached to any of the current cognitive beliefs about specific Friendliness. This can occur either by abstracting away from specific details ("Regardless of what 'Friendliness' is, I can find out by asking the programmers, because the effectiveness of that method is not sensitive to the actual content of the specific details I'm currently wondering about.") or by branching on multiple possibilities ("If Friendliness turns out to be X, and I ask the programmers, the programmers are likely to say that Friendliness is X, and I have a 90% probability of choosing X. If Friendliness turns out to be X, and I don't ask the programmers, I have a 50% probability of choosing X. If, on the other hand, Friendliness turns out to be Y, and I ask the programmers, the programmers are likely to say that Friendliness is Y, and I have a 90% probability of choosing Y...").
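The "branching on multiple possibilities" argument can be written out as a small expected-value calculation. The 90% and 50% figures are the ones used in the example above; the prior over candidate contents and the function names are hypothetical.

```python
def p_correct_choice(ask_programmers):
    """Chance of choosing the true content, whatever that content is."""
    # Figures from the text's example: 90% when asking, 50% otherwise.
    return 0.9 if ask_programmers else 0.5

def expected_success(prior_over_contents, ask_programmers):
    """Average the chance of success over every candidate for Friendliness."""
    return sum(prior * p_correct_choice(ask_programmers)
               for prior in prior_over_contents.values())

# Hypothetical current beliefs about what Friendliness turns out to be.
beliefs = {"X": 0.3, "Y": 0.7}
advantage = expected_success(beliefs, True) - expected_success(beliefs, False)
print(round(advantage, 2))  # 0.4
```

The advantage of asking comes out the same for any prior over X and Y, which is the sense in which the behavior derives its desirability from the unknown referent rather than from any current belief about specific Friendliness content.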
Programmer statements about what constitutes "Friendliness" are not arbitrary, and should not be viewed by the AI as being "correct by definition". It is, for example, possible for a programmer to say something that the programmer later views as a "mistake". The ability for a programmer to correct programmer-affirmed content, for example, implies that later programmers know more about Friendliness than earlier programmers - that a programmer can spot a "mistake" and "correct" it. For this to make sense, a programmer must be viewed as a system rather than a black box, and Friendliness supergoal content must be viewed as a complex output of complex forces rather than an arbitrary dataset. [CFAI 3.4.2: Shaper/anchor semantics.]
A "shaper", within an AI, is one of the forces hypothesized to be a producer of supergoal content. From another viewpoint, a shaper is an element within the cause-and-effect chain that ends with a specific Friendship programmer making a specific statement about Friendliness. Knowledge about the actual causes of the current supergoal content is simple historical knowledge. "Shaper content", while based on this historical knowledge, consists of current beliefs about the normative causes of the current supergoal content. Thus, shaper semantics implicitly invoke causal validity semantics (see below), since the AI is using current content to distinguish between "valid" and "invalid" forces in the causal model of the programmers' beliefs about Friendliness.
Shaper content is sometimes referred to as philosophical content in Creating Friendly AI. Shaper content can have unbounded internal complexity, with constraints or forces acting on other forces as well as acting directly on supergoal content. Actual examples of shapers are usually subtle and complex; much of the beginning of CFAI 3.4.2: Shaper/anchor semantics is devoted to providing some specific examples. [CFAI 3.4.2.2: Some forces that shape Friendliness: Moral symmetry, semantics of objectivity.] Shaper/anchor semantics are an advanced topic in Friendly AI.
In the same way that external reference semantics permit the AI to visualize the current supergoal content as "wrong", shaper semantics permit the AI to visualize the programmers' current statements or beliefs as being "wrong". The beginning of shaper semantics can be seen in the model whereby Friendliness affects the programmers which affect the programmers' statements which affect the AI; this allows for "errors" (in causal validity terms, "invalid effects") to creep in along the line of transmission. A typing error by a programmer is an obvious instance of such a mistake. A cognitive error by the programmer is a deeper instance of such a mistake.
Suppose that a programmer views "moral symmetry" as a philosophical constraint on Friendliness, and that a separate programmer belief violates moral symmetry (unrealized by the programmer). This would constitute a cognitive error on the part of the programmer. On the programmer realizing that a belief about Friendliness violated moral symmetry, the programmer would internally correct the belief and provide an amended statement to the AI. If the AI has an accurate model of these forces, the AI can anticipate the programmer correction in advance. If the AI's causal validity semantics currently designate, as "valid" or "normative", the programmer's belief in moral symmetry, then the AI can, not just anticipate the programmer correction as sensory data, but also see the normative Friendliness that the programmer is correcting to.
Insofar as the ultimate grounding of shapers can be phrased in programmer-independent terms - as references to complex functional adaptations which all humans have in common - the formulation of normative Friendliness can be independent of specific programmers, although not of humanity. [CFAI 3.4.4: The actual definition of Friendliness.]
The use of shaper semantics to visualize the abstract possibility of a programmer error is an early use of shaper semantics, requiring only the abstract, reflective knowledge of how shapers operate. The use of shaper semantics to spot programmer errors, and report them to the programmers, may be considered a testing ground for the acquisition of specific knowledge about the historical and normative causes of Friendliness. The use of shaper semantics to contradict programmer mistakes and become programmer-independent should probably be limited, by injunction and ethical injunction, to transhuman AIs.
An AI that uses shaper/anchor semantics can acquire Friendliness sources even if the programmers themselves are partially unclear (or mistaken) about the reasons for their decisions. This occurs through the use of probabilistic reasoning about the causes of the programmers' statements. An anchor holds where the programmers are aware that they may not have fully understood the causes of their own beliefs, and hence advise the AI that the desirability of some subgoal, or the probability of a hypothesis about supergoal content, or the validity of some shaper, may not be completely contingent on the specific justifications provided. An AI may also independently hypothesize an anchor where it believes that a subgoal/supergoal/shaper was supported by valid causes which did not rise to the programmers' conscious attention. [CFAI 3.4.2.5: Anchors.]
As stated in 3.4: Unknown supergoal content, curiosity about shapers derives desirability from the currently unknown supergoal content, or the possibility of error in the currently known supergoal content. Shapers are not super-supergoals or meta-supergoals. Shapers are not parent goals; other goals cannot inherit desirability from them. Shapers make statements about the normativeness or non-normativeness of the supergoal description, rather than statements about the desirability or undesirability of states in external reality.
The only "stuff" that shapers can mold is the RAM containing the cognitive content describing the supergoal. Shapers cannot directly make statements about the desirability or undesirability of, e.g., saving a human from a burning building. An indirect effect can, of course, occur; a shaper can state that "seeing saving humans from a burning building as desirable" is normative, which leads to seeing "saving humans from a burning building" as desirable, which leads to actual actions to save the humans from the building. However, shaper fulfillment is only important insofar as it enables the achievement of the referent of Friendliness in external reality. Curiosity about shapers derives desirability from uncertainty in the supergoal content. "Thinking about shapers" can never be more desirable than Friendliness because it is desirable as a means to Friendliness.
A meta-supergoal - for example, "Maximally satisfy the programmers" - is structurally unsafe because, e.g., a transhuman AI might directly adjust the programmers' brains for maximum satisfaction, or (on a more mundane scale) simply lie to generate additional satisfaction. [CFAI: Interlude: Why structure matters.] A hypothetical shaper stating "Supergoals should maximally satisfy the programmers" - not a valid shaper, although a good heuristic - could not override the actual supergoal content stating that direct modification of a brain is unFriendly, especially if the causal validity semantics label direct modification as an invalid cause of satisfaction.
As briefly introduced above, "causal validity semantics" are concerned with the AI's modeling of the historical events that led to its creation and its current mind-state, and the AI's use of current philosophical content to label certain effects as "valid" or "invalid" and thereby create beliefs about the normative causal chain, with its associated causal output of a normative AI. [CFAI 3.4.3: Causal validity semantics.]
Causal validity semantics provide a framework for generalized architectural changes as well as specific content changes - that is, the entire code base of the AI can be viewed as the result of programmer intentions, and "mistakes" in those intentions leading to "mistaken code" can be identified and corrected. In the same way that external reference semantics enable the AI to see supergoal content as not being "correct by definition", and shaper/anchor semantics enable the AI to see programmer statements as not being "correct by definition", causal validity semantics enable a reflective seed AI to see its own source code and cognitive processes as not being "correct by definition".
Causal validity semantics provide a general framework that subsumes both external reference semantics and shaper/anchor semantics. Causal validity semantics are the most advanced form of Friendship structure currently described in Creating Friendly AI or known to this author, and should supply the necessary structural capability to see open-ended improvements and corrections as "desirable" using a bounded amount of initial content.
Causal validity semantics provide a framework for:
The human mind, in thinking about beliefs and goals, automatically employs an enormously powerful, fundamentally flawed intuition: "Effects cannot have greater validity than causes." Quick examples would be our assumption that a valid deduction cannot arise from untrue premises, that the discovery of ulterior motives or rationalization indicts a moral argument, and the general intuition that pointing out an invalid cause of a belief or moral principle constitutes an attack on the belief or moral principle itself. In Creating Friendly AI, this intuition (which might be better referred to as an emergent result of the local rules employed for modelling reality and goals) is referred to as the "rule of derivative validity". [CFAI 3.4.3.3: The rule of derivative validity.] Although derivative validity has no tail-end recursion - a problem discussed in Creating Friendly AI - it usually works extremely well as a local heuristic.
Derivative validity has two important consequences for Friendship structure. First, it acts to ensure that a mistaken piece of code does not circularly self-protect or otherwise become regarded as part of normative cognition; code, or design, that the programmer later comes to regard as "mistaken" is presumably invalid, since (under the definition of target Friendliness in CFAI) the source of validity is the programmer's valid intentions. In other words, derivative validity in combination with the CFAI definition states that the AI's cognition cannot be more valid than the programmer's intentions for the cognition (or rather, a normative programmer's intentions for the cognition). Thus, derivative validity is part of the Friendship structure that ensures an AI can want to make open-ended changes in response to a valid change in the programmers' intentions.
Second, derivative validity reduces the likelihood of a class of "philosophical crisis"; specifically, that class of philosophical crisis which results from computing the "Bayesian prior before programmer affirmation" that the programmer affirmations are flawed. [CFAI: Crisis of Bayesian affirmation.] Under Bayesian reasoning, no amount of programmer affirmation can reduce the prior probability of the possibility that "programmer affirmations are worthless [for reason X]". This is correct reasoning and cannot be "fixed". However, it is also correct to point out that programmer intentions gave rise to the AI's code, and most hypothesized scenarios which would invalidate all programmer affirmations will be correctly seen as also invalidating all the AI's reasoning processes, including the reasoning which called the programmer intentions into doubt. [CFAI 3.4.3.3: The rule of derivative validity.] Also, by decreasing the perceived likelihood of a philosophical crisis, derivative validity acts to increase unity of will. [CFAI 3.3.3: Unity of will.]
An "injunction" is a heuristic which has primarily nonlocal support, or very abstract support, or which derives from a small probability of large consequences, or for other reasons exhibits significant context-insensitivity to a class of usually-salient details. An injunction may suggest either positive actions or negative avoidants, although the latter is more obviously relevant to Friendly AI. It should be noted that injunctions are still normative subgoal content, even if the parent goals are distant or abstract rather than near and concrete. [CFAI 3.2.4: Injunctions.]
Injunctions are useful under these circumstances:
An ethical injunction is an injunction that has no direct preconditions for violation because the probability of a mistaken violation exceeds that of a valid violation. (More exactly, the total payoff from the future AI avoiding apparently valid violations, considered abstractly and in advance, is expected to be positive.) [CFAI 3.2.5: Ethical injunctions.]
Ethical injunctions rely on awareness of fallibility. Rationality theorists would regard ethical injunctions as consequences of "meta-rationality", the self-estimate of how rational one is (either in absolute terms, or relative to others). A more formal viewpoint is that of "Bayesian self-awareness" - the probability that a given thought is associated with a corresponding fact in external reality. For example, experience may show that first-order cognition assigns "90% probability" to a collection of facts which have so far proven to actually be true 85% of the time. Thus, second-order cognition would regard the thought "I am 90% sure of this" as sensory evidence indicating an 85% actual surety. That is, Bayesian self-awareness consists of regarding one's own thoughts as sensory data rather than conclusions.
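The 90%-versus-85% example can be sketched as a small calibration table. This is only an illustration of treating "I am 90% sure" as sensory data; the class name CalibrationTable, its methods, and the simulated track record of 100 claims are assumptions made for this sketch, not anything specified in CFAI.

```python
from collections import defaultdict


class CalibrationTable:
    """Hypothetical record of how often claims at a stated confidence proved true."""

    def __init__(self):
        # stated confidence -> [number that proved true, number recorded]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, stated, was_true):
        entry = self._counts[stated]
        if was_true:
            entry[0] += 1
        entry[1] += 1

    def calibrated(self, stated):
        """Second-order estimate: the observed frequency for this stated
        confidence, or the stated value itself if there is no track record."""
        true_count, total = self._counts.get(stated, (0, 0))
        if total == 0:
            return stated
        return true_count / total


# Simulated track record (an assumption): 100 claims made at "90% sure",
# of which 85 turned out to be true.
table = CalibrationTable()
for i in range(100):
    table.record(0.90, was_true=(i < 85))

adjusted = table.calibrated(0.90)   # the thought "90% sure" read as evidence
untracked = table.calibrated(0.60)  # no history, so the stated value stands
```

Here first-order cognition reports 0.90, while second-order cognition acts on the recorded frequency for that report.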
The secondary utility of ethical injunctions is that they prevent errors in the possibly narrow class of scenarios in which an AI has abandoned the direct support of an injunction, but has not abandoned the abstract reasoning which led to the use of the ethical injunction. The primary utility of ethical injunctions is that they enable a present-day AI to see, as desirable, cooperation with the programmers to prevent a future AI from taking certain actions, since a future AI that attempts to break the hypothetical safeguard will (in most possible futures) be doing so because of failure of Friendliness. [CFAI 3.3.3.1: Cooperative safeguards.]
Ethical injunctions rely on the AI's self-evaluation being fair. Fairness can be expected, given the lack of observer-biased beliefs resulting from natural evolution in a political environment. [CFAI 2.3: Observer-biased beliefs evolve in imperfectly deceptive social organisms.] Also, leaving margin for error - in this case, by trying to overestimate fallibility - is a standard precaution. [CFAI 3.2.4: Injunctions.] The combination of fairness with safety margin should be enough to ensure almost complete avoidance of the human class of self-overestimation errors.
Ethical injunctions are useful under these circumstances:
[CFAI 3.5.3.2: Controlled ascent.]
First, note that "controlled ascent" is not a means of restraining AIs, or of "keeping them under control". All safeguards, to function, must be undertaken with the cooperation of the AI, and it does not appear to be rational for a normative altruist to forgo self-improvement indefinitely; trying to implement controlled ascent as a permanent constraint rather than a temporary safeguard would break unity of will. [CFAI 3.3.3: Unity of will.]
The rationale for controlled ascent is that certain types of Friendship structure may need to be acquired before transhumanity to be effective. Once (a bounded amount of) Friendship structure is in place, a transhuman AI can anticipate whichever (valid) programmer actions would be taken and carry those out independently. However, if the transhuman AI lacks enough Friendship structure to do this correctly, it may no longer be possible for the programmers to fix the problem. For example, an external reference semantics system can turn anticipated programmer actions into supergoal content, but has no explicit facilities for turning anticipated programmer actions into code or Friendship structure. [CFAI 3.5.3.1: The Bayesian Boundary.] Thus, there are certain things that must be done before a hard takeoff if they are to be done at all.
The primary context and use of controlled ascent is to "Singularity-safe" an early AI that lacks general intelligence and structural Friendliness, so that the AI slows down and asks for directions before continuing. This ensures that programmers have time to react in the event that rapid self-improvement ("hard takeoff") begins substantially earlier than anticipated. [CFAI 3.5.3: Singularity-safing ("In case of Singularity, break glass"), GISAI 1.1: Seed AI, "What is Seed AI?"]
[CFAI 3.5.3.2.1: Programmatic controlled ascent via an "improvements counter".]
Suppose that RSIS is a recursively self-improving system. For the purposes of discussion, RSIS will be assumed to be a primitive AI having an architecture generally along the lines of Eurisko. Certain RSIS actions would count as "self-improvements" - creating a heuristic, improving a heuristic, successfully tweaking a parameter, making a discovery, and so on. An "improvements counter" would keep a log of the time of each improvement and a numerical count of the number of improvements.
If an average day's work for RSIS is 1000 "improvements", then RSIS should page the programmers and trigger a save-and-shutdown if 50000 "improvements" occur in 24 hours or less. The trigger for this action should not be checked "every hour", but should rather be checked each time an improvement registers. (Computationally, this is an extremely trivial expenditure; conversely, a lot can happen in an hour.)
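A minimal sketch of such a counter follows, assuming improvements are reported with explicit timestamps in seconds and that paging the programmers and the save-and-shutdown are wrapped in a single hypothetical on_trigger callback. The class and parameter names are illustrative only. The check runs inside register, so it happens each time an improvement registers rather than on a timer.

```python
from collections import deque


class ImprovementsCounter:
    """Logs each improvement and checks the trigger on every registration."""

    def __init__(self, limit=50_000, window_seconds=24 * 60 * 60, on_trigger=None):
        self.limit = limit
        self.window_seconds = window_seconds
        self.on_trigger = on_trigger  # hypothetical page-and-shutdown hook
        self.log = []                 # time of every improvement registered
        self._recent = deque()        # times inside the sliding window
        self.triggered = False

    @property
    def count(self):
        """Numerical count of all improvements so far."""
        return len(self.log)

    def register(self, timestamp):
        """Record one improvement; return True once the alarm has fired."""
        self.log.append(timestamp)
        self._recent.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop improvements that are older than the window.
        while self._recent and self._recent[0] < cutoff:
            self._recent.popleft()
        if not self.triggered and len(self._recent) >= self.limit:
            self.triggered = True
            if self.on_trigger is not None:
                self.on_trigger()
        return self.triggered


# Scaled-down demonstration: a limit of 5 within a 10-second window.
alerts = []
slow = ImprovementsCounter(limit=5, window_seconds=10,
                           on_trigger=lambda: alerts.append("page"))
for t in (0, 20, 40, 60, 80):   # spread out: never 5 inside one window
    slow.register(t)

fast = ImprovementsCounter(limit=5, window_seconds=10,
                           on_trigger=lambda: alerts.append("page"))
for t in (0, 1, 2, 3, 4):       # a burst: 5 inside one window
    fast.register(t)
```

The per-registration cost is a few deque operations, which matches the observation that the check is computationally trivial.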
An interesting, more general version of the above feature would be one that caught 50000 improvements without validation, rather than in some specified timeframe. A programmer hitting some specific UI switch might validate all the improvements up to half an hour ago. If 1000 improvements is the per-day norm, and the programmers remember to validate at the end of each working day, then the ascent feature should never trigger under ordinary circumstances. Similarly, if 10000 improvements are going by per hour, and the programmers are watching interestedly but see no reason to slow down as yet, then validating repeatedly (rather than once at the end of the day) would enable RSIS to keep running at full speed.
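The validation-based variant might look like the following sketch. It assumes the UI switch calls a hypothetical validate method with the current time, and that "up to half an hour ago" means every improvement at least grace_seconds old is cleared; both the names and that reading are assumptions.

```python
from collections import deque


class UnvalidatedCounter:
    """Counts improvements the programmers have not yet validated."""

    def __init__(self, limit=50_000, grace_seconds=30 * 60):
        self.limit = limit
        self.grace_seconds = grace_seconds
        self.pending = deque()  # times of unvalidated improvements, oldest first

    def register(self, timestamp):
        """Record one improvement; return True if the ascent feature should fire."""
        self.pending.append(timestamp)
        return len(self.pending) >= self.limit

    def validate(self, now):
        """Programmer hits the switch: clear everything older than the grace period."""
        cutoff = now - self.grace_seconds
        while self.pending and self.pending[0] <= cutoff:
            self.pending.popleft()


# Scaled-down demonstration: a limit of 3 and a 30-second grace period.
counter = UnvalidatedCounter(limit=3, grace_seconds=30)
counter.register(0)
counter.register(40)
counter.validate(now=60)           # clears the improvement at t=0, keeps t=40
remaining = len(counter.pending)
second_pending = counter.register(61)   # two unvalidated improvements
third_pending = counter.register(62)    # three unvalidated: limit reached
```

Because the trigger depends only on the unvalidated count, repeated validation lets a fast-running system continue at full speed, while an unattended one stops at the limit regardless of how long it took to get there.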
As with almost every aspect of a Friendship system, this feature serves a double purpose. The start of one jump doesn't necessarily herald the start of a hard takeoff, so any given alert is almost certainly a "false alarm"... but "false" only from a Singularity standpoint. Even if RSIS is just climbing a one-time but steep curve, it might be a good idea for it to happen with the programming team present and watching, rather than happening unsupervised at 3AM. A controlled ascent feature thus has applications beyond Singularity-safety.
The precise nature of the triggered action also deserves further exploration. A save-and-shutdown order is easy to implement, but it is also an overly extreme response, verging on the adversarial. Passing 50000 improvements should not trigger a shutdown, but should instead trigger an altered system state in which self-improvement is no longer pursued. If an additional 1000 improvements pass - perhaps due to some autonomic learning process - and the number of unvalidated improvements reaches 51000, then a save-and-shutdown should be triggered.
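The graduated response can be sketched as a mapping from the unvalidated count to a system state. The state names and the function ascent_state are hypothetical, and treating "passing 50000" as "at or above 50000" is an assumption of this sketch.

```python
from enum import Enum


class AscentState(Enum):
    RUNNING = "running"                        # self-improvement pursued normally
    IMPROVEMENT_PAUSED = "improvement_paused"  # self-improvement no longer pursued
    SHUT_DOWN = "shut_down"                    # save-and-shutdown triggered


def ascent_state(unvalidated, pause_at=50_000, shutdown_at=51_000):
    """Map the number of unvalidated improvements to a system state."""
    if unvalidated >= shutdown_at:
        return AscentState.SHUT_DOWN
    if unvalidated >= pause_at:
        return AscentState.IMPROVEMENT_PAUSED
    return AscentState.RUNNING


below = ascent_state(49_999)
paused = ascent_state(50_000)
still_paused = ascent_state(50_999)   # autonomic learning may still add to the count
stopped = ascent_state(51_000)
```

The gap between the two thresholds gives the milder response room to work before the more extreme one is needed.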
The "temporarily stop self-improvement" state also has applications beyond Singularity-safety. There may be many times at which the programmers want to carry out some involved operation, such as a set of coordinated changes, without worrying about the system mutating out from under them.
A programmatic controlled ascent feature is a simple programmatic task; it is the easiest of all Friendly AI features to implement, can be implemented in any self-improving system, should be implemented in any self-improving system (1), and sets an excellent precedent for a Friendliness-aware AI project.
[CFAI 3.5.3.2.2: Controlled ascent as ethical injunction.]
Beyond a certain point, safeguards must be cooperative, which means "justified within the goal system". Controlled ascent past a certain point requires a controlled ascent subgoal. Similarly, only a controlled ascent subgoal is effective for a seed AI that has advanced to the point that it would notice any attempt to implement a save-and-shutdown feature. For relatively young AIs, intermediate in intelligence between "too dumb to represent the subgoal" and "aware of all code in the system", probably the best course is a "stop-improving" subgoal kicking in at 50000, a "save-and-shutdown" subgoal kicking in at 51000, and an emergency programmatic "save-and-shutdown" feature kicking in at 51050.
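The layering suggested for relatively young AIs can be written down as a small table of safeguards. This sketch only records which safeguards would be active at a given count; the Safeguard type, the layer labels, and the use of "at or above" for each threshold are assumptions, and nothing here models how a goal system would actually represent a subgoal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Safeguard:
    threshold: int   # unvalidated improvements at which it kicks in
    action: str      # what it asks for
    layer: str       # "subgoal" (cooperative) or "programmatic" (emergency)


# The three tiers named in the text, in ascending order.
SAFEGUARDS = (
    Safeguard(50_000, "stop-improving", "subgoal"),
    Safeguard(51_000, "save-and-shutdown", "subgoal"),
    Safeguard(51_050, "save-and-shutdown", "programmatic"),
)


def active_safeguards(unvalidated):
    """Return every safeguard whose threshold has been reached."""
    return [s for s in SAFEGUARDS if unvalidated >= s.threshold]


none_yet = active_safeguards(49_999)
first_tier = active_safeguards(50_000)
all_tiers = active_safeguards(51_050)
```

The programmatic tier sits above both subgoals, so it is reached only if the cooperative safeguards did not already halt the ascent.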
Self-improvement is affirmed/predicted to result in future Friendliness, which is why self-improvement is desirable for a cleanly Friendly seed AI. [CFAI 3.1.2: Friendliness-derived operating behaviors.] A controlled ascent subgoal is justified by an assertion that too much self-improvement without external guidance may not lead to future Friendliness, and may even lead to future unFriendliness.
The AI has a self-observational metric of self-improvement and a metric of external guidance. If self-improvement gets out of phase with external guidance, then further self-improvement begins increasing the probability of the negative scenario "future unFriendliness" instead of, or to a greater degree than, it increases the probability of the positive scenario "future Friendliness". Thus, the desirability of further self-improvement shifts from positive to negative.
The trigger of the "controlled ascent" injunction is: