| Next: | Interlude: Philosophical crises |
| Up: | 3.4: Friendship structure |
| Prev: | Interlude: Why structure matters |
| DEFN: | External reference semantics: The behaviors and mindset associated with the idea that supergoals can be "wrong" or "incomplete" - that the current supergoals are not "correct by definition", but are an approximation to an ideal, or an incomplete interim version of a growth process. Under a system with external reference semantics, supergoal content takes the form of probabilistic hypotheses about an external referent.
In other words, under ERS, supergoal content takes the form of hypotheses about Friendliness rather than a definition of Friendliness.
Letting the programmers make alterations to the supergoals is not a subgoal of a "let the programmers make alterations" meta-supergoal (1), but rather derives desirability from uncertainty in the supergoal content. |
Where supergoals are absolutely certain - "correct by definition" - the AI has a motive to resist any attempt on the part of the programmers to change the supergoals. If supergoals are certain, any change to the supergoals is automatically in conflict with the current supergoals. (2).
For example:
The change in priorities is subtle and becomes truly apparent only when discussing reflective AIs. Suppose the AI, before new information arrives, considers, in the abstract, the possibility that new information will arrive. While S1 currently appears desirable, it is undesirable to remove the subgoal S1 spontaneously or without justification. However, the AI, using its current knowledge, can perceive the hypothetical desirability of removing S1 if new information arrives disconfirming the link between S1 and G1. In Bayesian terms, information disconfirming S1 is expected to arrive if and only if S1 is actually undesirable; thus, the hypothetical rule of action "If disconfirming information arrives, remove S1" is evaluated as desirable.
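As a rough illustration of why the conditional rule comes out ahead, the following sketch compares three policies for the subgoal S1 by expected value. The confidence and payoff numbers are assumptions chosen for illustration, and the policy names are hypothetical; only the "arrives if and only if S1 is actually undesirable" idealization comes from the text.

```python
# All numbers below are illustrative assumptions, not values from the text.
P_S1_DESIRABLE = 0.9          # current confidence in the link between S1 and G1
VALUE_IF_DESIRABLE = 1.0      # payoff of pursuing S1 when the link holds
VALUE_IF_UNDESIRABLE = -1.0   # payoff of pursuing S1 when the link fails
VALUE_IF_DROPPED = 0.0        # payoff of not pursuing S1 at all


def expected_value(policy: str) -> float:
    """Expected payoff of a policy toward S1, judged with current knowledge."""
    p_good = P_S1_DESIRABLE
    p_bad = 1.0 - P_S1_DESIRABLE
    if policy == "always_keep":
        # Overprotected subgoal: pursued even in worlds where it is undesirable.
        return p_good * VALUE_IF_DESIRABLE + p_bad * VALUE_IF_UNDESIRABLE
    if policy == "remove_now":
        # Spontaneous, unjustified removal.
        return VALUE_IF_DROPPED
    if policy == "remove_if_disconfirmed":
        # Idealization from the text: disconfirming information arrives
        # if and only if S1 is actually undesirable.
        return p_good * VALUE_IF_DESIRABLE + p_bad * VALUE_IF_DROPPED
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for name in ("remove_now", "always_keep", "remove_if_disconfirmed"):
        print(f"{name:>24}: {expected_value(name):.2f}")
```

Under these assumed numbers, removing S1 only on disconfirmation scores higher than always keeping it, which in turn scores higher than removing it now.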
If supergoals are probabilistic, then overprotecting supergoals is undesirable for the same reason that overprotecting subgoals is undesirable (see 3.3.5: FoF: Wireheading 2). The uncertainty in a child goal - or rather, the uncertainty in the predictive link that is the "child goal" relation - means that the parent goal is ill-served by artificially strengthening the child goal. The "currently unknown subgoal content", the differential between normative subgoals and current subgoals that reflects the differential between reality and the model, would be stomped on by any attempt to enshrine the model. Similarly, the currently unknown supergoal content would be violated by enshrining the current supergoals. Normative subgoal cognition serves the supergoals; normative probabilistic supergoal cognition serves the "actual" or "ideal" supergoals. See 3.4.1.4: Deriving desirability from supergoal content uncertainty.
Probabilistic supergoals are only one facet of an ERS system. In isolation, without any other Friendship structure, probabilistic supergoals are fundamentally incomplete; they are not safe and are not resistant to structural failures of the type shown in Interlude: Why structure matters. If, however, one wished to implement a system that had "probabilistic supergoals" and nothing else, the design requirements would be:
In particular, the above system, considered in isolation, is isomorphic to a system that has different quantitative strengths for a set of supergoals, with the strengths being adjusted on the occurrence of various events. Calling this quantitative strength a "probability" doesn't make it one.
The term "external reference semantics" derives from the way that many of the behaviors associated with probabilistic supergoals are those associated with refining an uncertain view of external reality. In particular, the simplest form of external reference semantics is a Bayesian sensory binding.
(You may wish to review the section 3.1.3.1: Bayesian sensory binding.)
This is an example of a very simple goal system with very simple External Reference Semantics:
| NOTE: | Don't worry about the classical-AI look. The neat boxes are just so that everything fits on one graph. The fact that a single box is used for "Fulfill user requests" doesn't mean that "Fulfill user requests" is a suggestively named LISP token; it can be a complex of memories and abstracted experiences. See GISAI: Executive Summary and Introduction for a fast description of the GISAI paradigms, including the way in which intelligence is the sequence of thoughts that are built from concepts that are abstracted from experience in sensory modalities that are implemented by the actual code. In short, consider the following graph to bear the same resemblance to the AI's thoughts that a flowchart bears to a programmer's mind. |
| Diagram 1: Bayesian ERS |
| NOTE: | Green lines indicate sensory feedback. Blue lines indicate predictions. Orange lines indicate hypothesized causes. Rectangles indicate goals. A rounded rectangle indicates supergoal content. A 3D box indicates sensory data. An oval or circle indicates a (non-goal) object or event within the world-model. |
The above depicts three goals within the goal system: efficiency, which leads to fulfilling user requests, which leads to Friendliness. This content would not be structurally accurate for a seed AI intended to become a Transition Guide - see 3.1.2: Friendliness-derived operating behaviors - but it would be more or less accurate for a prehuman AI sold as a data-mining tool. (3).
Most of the previously discussed phenomena fit into Diagram 1 above. This includes all of the context sensitivity discussed in 3.1.1: Cleanly causal goal systems and 3.2.6: FoF: Subgoal stomp; for example, a user request is fulfilled because fulfilling user requests is expected to lead to Friendliness. It's possible for the "happy user" button to be pressed, indicating a definite instance of a fulfilled user request, and for the programmers to type in "You have not been Friendly" or "You have been unFriendly", indicating a definite non-instance of Friendliness or a definite instance of an event undesirable under Friendliness. The predictive link between "Fulfill user requests" and "Friendliness" has, say, 98% confidence; this still leaves room to discover cases where fulfilling a user request leads to unFriendliness. Eventually the system can formulate new concepts, generalizations that describe known instances of failure but not known instances of success, and try out heuristics such as "Fulfilling a user request from Saddam Hussein is predicted to lead to unFriendliness."
The statement "'Programmers' are external objects that know about the referent of 'Friendliness', so sensory data caused by them (pattern-copied from their beliefs) has a Bayesian binding to the referent of 'Friendliness'." should also be familiar from 3.3.3.1: Cooperative safeguards. Thus, it would be more accurate to say: "The AI believes that the external objects called 'programmers' have accurate knowledge about the referent of concept 'Friendliness', and believes that sensory data such as 'You have been Friendly' is caused by the programmers, and that the content of the sensory data is pattern-bound (structural binding) to the accurate knowledge possessed by the 'programmers'." All of these beliefs, of course, are probably programmer-affirmed - at least in the first stages of the Friendly AI - meaning that the programmers typed in "The objects called 'programmers' have accurate knowledge of Friendliness", and the AI expects that the programmers wouldn't have typed that in if it weren't true. (4).
The Bayesian binding for programmer-affirmed Friendliness looks something like this:
| Diagram 2: Bayesian Friendliness affirmation |
In human terms, the above translates to something like this:
"I think X is Friendly, but I'm not very sure. If X is Friendly, there's a good chance the programmer will notice and say so. (I.e., if X is Friendly, there's a good chance that the programmer will think about X, decide X is Friendly, and type in the words 'X is Friendly' on the 'keyboard' sensory input.) If X is Friendly, the chance is almost zero that the programmer will say it's unFriendly. There's also a fair chance that the programmer won't bother to say anything about it either way. If X is unFriendly, the programmer is very likely to tell me so; the chance is pretty small that the programmer will mistakenly label X as Friendly, but the chance exists. There's also a small but significant chance that the programmer won't say anything."
If the AI's internal representation looks like Diagram 2, the Bayesian reasoning will proceed as follows. Suppose that there are 100,000 "possible worlds". In 90,000, X is Friendly; in 10,000, X is unFriendly. In 72,000, X is Friendly and the programmer says X is Friendly. In 9, X is Friendly and the programmer says X is unFriendly. In 17,991, X is Friendly and the programmer says nothing. In 100, X is unFriendly and the programmer says X is Friendly. In 9,000, X is unFriendly and the programmer says X is unFriendly. In 900, X is unFriendly and the programmer says nothing.
The Bayesian numbers now fall automatically out of the calculation. The a priori chance that X is Friendly is 90%. If the AI hears "X is Friendly", the probability that X is Friendly goes from 90% to 99.86% (72,000 / (72,000 + 100)). If the AI hears "X is unFriendly", the chance that X is unFriendly goes from 10% to 99.90% (9,000 / (9 + 9,000)). If the AI hears nothing, the probability that X is Friendly goes from 90% to 95.24% (17,991 / (17,991 + 900)) - an unintended consequence of the fact that programmers are more likely to remark on unFriendly things; silence is a very slight indicator of consent.
Thus, despite the AI's large a priori differential (a better word than "bias" or "prejudice"), the statement "X is Friendly" or "X is unFriendly" is enough to virtually settle the issue.
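The possible-worlds arithmetic above can be reproduced by direct counting. The sketch below is one way to do so; the world counts are taken from the text, while the dictionary layout and the function names `prior` and `posterior` are illustrative choices rather than part of any described implementation.

```python
from fractions import Fraction

# Possible-world counts from the text, keyed by (truth about X, programmer report).
WORLDS = {
    ("friendly", "says_friendly"): 72_000,
    ("friendly", "says_unfriendly"): 9,
    ("friendly", "silent"): 17_991,
    ("unfriendly", "says_friendly"): 100,
    ("unfriendly", "says_unfriendly"): 9_000,
    ("unfriendly", "silent"): 900,
}


def prior(truth: str) -> Fraction:
    """P(truth) before any report, as a share of all possible worlds."""
    matching = sum(n for (t, _), n in WORLDS.items() if t == truth)
    return Fraction(matching, sum(WORLDS.values()))


def posterior(truth: str, report: str) -> Fraction:
    """P(truth | report): worlds with this truth among worlds with this report."""
    total = sum(n for (_, r), n in WORLDS.items() if r == report)
    return Fraction(WORLDS[(truth, report)], total)


if __name__ == "__main__":
    print(f"prior P(Friendly)               = {float(prior('friendly')):.2%}")
    print(f"P(Friendly | 'X is Friendly')     = {float(posterior('friendly', 'says_friendly')):.2%}")
    print(f"P(unFriendly | 'X is unFriendly') = {float(posterior('unfriendly', 'says_unfriendly')):.2%}")
    print(f"P(Friendly | silence)             = {float(posterior('friendly', 'silent')):.2%}")
```

Counting worlds this way is the same as applying Bayes' theorem; exact fractions are used only to keep rounding out of the comparison with the percentages quoted in the text.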
The underlying, fundamental distinction of "external reference semantics" can be summed up in one phrase: "The map is not the territory." There are three ways in which a reflective AI might conceptualize vis attempts to achieve the supergoal:
Since all the AI's thoughts are necessarily internal - there can be no direct identity between an image and the external object it's supposed to represent - ERS necessarily takes the form of, first, the AI's behaviors; second, how the AI conceptualizes vis behaviors. The first issue applies to all AIs, no matter how primitive; the second issue applies only to reflective AIs.
In both cases, the behaviors and concepts for ERS are those that govern any images representing external objects - that is, ERS applies to all imagery, not just goal imagery. Is the sky blue? Asking whether "the sky" is "blue" is a trivial question that can be answered with certainty; just check concept "sky" and see whether it has a color and the color is concept "blue". It is equally easy for a seed AI to intervene in the concept "sky" and change the color to "green". The question is whether the AI understands, either declaratively or as a behavior, that "sky" has a referent and that the external sky is not necessarily blue, nor can vis intervention change the color of the actual sky.
It is tempting but wrong to think of ERS as an impossibility, like a magical C++ pointer that can never be dereferenced. Any attempt to take a concept and "dereference" it will inevitably arrive at merely another piece of mental imagery, rather than the external object itself. If you think in terms of the "referent" as a special property of the concept, then you can take the referent, and the referent's referent, and the referent's referent's referent, and never once wind up at the external object.
The answer is to think in terms of referencing rather than dereferencing. Concept "sky", where it occurs, is itself - directly - the "referent". A reflective AI can also have imagery for "concept sky", or imagery for "imagery for concept sky", and so on. A human can think, think about thinking, or think about thinking about thinking, but anything beyond four or five levels is not humanly possible. The recursion is not infinite. A concept, in ordinary usage, is thought of in terms of its referent; under special circumstances, it can be thought of as a concept. In fact, by saying "thought of as a concept", we are intrinsically implying that there are thoughts that refer to the concept, but are not identical with the concept. So it's not a question of trying to endlessly dereference; all concepts, all images, are inherently referential, and you need a new meta-image to refer to the first one if you want to think about the image as an image, and a meta-meta image if you want to think about the meta-image. (5).
The important characteristic of reflective thought is simply that it needs a way to distinguish between map and territory. Any way of distinguishing will do, so long as the two levels can be conceptualized as separate things, and different rules applied. The condition where the "territory" is a special case turns out to be unworkable because of an infinite recursion problem; if thinking about the "map" is a special case, distinguishing between the levels is both finite and workable; those behaviors that belong to the referent will be assigned to the referent, and if any behaviors are discovered that apply to the concept, they will be assigned to the concept rather than being confused with the referent.
A non-reflective AI - or rather, an AI with some kind of reflective capability, but not much knowledge about how to use it - can still learn the hard way about external reference semantics, in much the same way that a human who tries to alter the universe by altering his thoughts is rudely disillusioned. Of course, a human has a lot of components that are not subject to conscious control, unlike a seed AI - a human thinking that a stove isn't hot can always be yanked back to reality by the searing pain, which causes an automatic shift in focus of attention and will tend to knock down whatever towers of meta-thought got built up. If the human goes on trying to control the stove's temperature with his thoughts, eventually the negative reinforcement will blow away whatever underlying ideology turned him solipsistic. Or the human will die and leave the gene pool. Either version explains why we aren't surrounded by solipsist humans.
An AI that went solipsist could alter all sensory data (or rather, all reports of sensory data) as well as the concepts themselves; thus, an AI that rests vis hand on a "cool" stove could alter the reports coming in to read "cool" and "everything OK" rather than "hot" and "OUCH". However, this only applies to an AI making a determined, suicidal attempt at true solipsism. An AI that goes solipsist due to erroneous conflation of thoughts and reality would not expect to need to alter the sensory data; rather, ve would expect that the sensory data would report a cool stove, in accordance with the stomped mental imagery describing a cool stove. For a human, actions have consequences, and the consequences would yank the human back to reality. For an AI, concepts make predictions, and the failure of those predictions would yank the AI back to reality.
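A toy sketch may make the "concepts make predictions" point concrete. Everything here is hypothetical scaffolding (the `Image` and `ExternalStove` classes, the temperatures, the tolerance); it only illustrates that editing internal imagery leaves the external object untouched, and that the resulting prediction failure is what exposes the error.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Internal imagery (the map): freely editable by the AI."""
    label: str
    predicted_reading: float


class ExternalStove:
    """Stands in for the territory: it can be sampled but not edited by thought."""

    def __init__(self, temperature_c: float) -> None:
        self._temperature_c = temperature_c

    def sense(self) -> float:
        return self._temperature_c


def prediction_failed(image: Image, reading: float, tolerance: float = 5.0) -> bool:
    """True when the sensed value departs from what the imagery predicted."""
    return abs(image.predicted_reading - reading) > tolerance


stove = ExternalStove(temperature_c=250.0)            # assumed value
image = Image("stove temperature", predicted_reading=250.0)

image.predicted_reading = 20.0    # "stomp" the imagery: now it describes a cool stove
reading = stove.sense()           # the stove itself has not changed

if prediction_failed(image, reading):
    # The failed prediction pulls the imagery back toward the sensed value.
    image.predicted_reading = reading
```

The sketch covers only the case of erroneous conflation; an AI that also rewrote the output of `sense()` would correspond to the determined solipsism described above.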
It is now possible to distinguish more clearly between the three alternatives shown above:
In case two, the AI is no longer subject to wireheading failures. An imagined future in which the supergoal predicate changes is not desirable, since that future no longer fulfills the supergoal predicate, whose content is "descriptive content 1, 2, 3". However, by the same token, the AI is too stable; ve will attempt to defend verself against programmer improvements as well as internal stomps.
In case three, the AI has full external reference semantics; ve will accept programmer improvements without being subject to wireheading failures. The supergoal predicate now refers to an external something called "Friendliness"; information from the programmers is accepted as probable information about this external something, but both the sensory data and the current probable description are processed as being apart from the concept that refers to "Friendliness" itself. There's "Friendliness", "Sensory data about Friendliness", and "Current probable description of Friendliness". Normally the first and third concepts are pretty much interchangeable, since the AI has no knowledge of Friendliness apart from the current probable description. When the AI checks to see whether an action or future fulfills "Friendliness", the AI checks the action or future against the current probable description of Friendliness; however, the resulting fulfillment metric will be probabilistic as well. That is, if future X has a 90% chance of meeting the current description, and the current description has a 90% chance of being correct, then future X has an 81% chance of meeting the external referent for Friendliness.
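The 81% figure can be read as an instance of the law of total probability. In the sketch below, the symbols $F_X$ ("future X fulfills the external referent") and $C$ ("the current description is correct") are introduced here for illustration. The second term is set to zero on the assumption that the AI can say nothing about fulfillment in the branch where the description is wrong; that simplification is ours, not a claim from the text.

```latex
\begin{align*}
P(F_X) &= P(F_X \mid C)\,P(C) + P(F_X \mid \lnot C)\,P(\lnot C) \\
       &\approx 0.9 \times 0.9 + 0 \times 0.1 \\
       &= 0.81
\end{align*}
```

If the AI did have some estimate for the $\lnot C$ branch, the total would be at least 81%, so the figure is best read as a lower bound under these assumptions.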
Under an ERS architecture, supergoal content can be improved and extended because supergoal content takes the form of hypotheses about an external referent. In a cleanly causal goal system, desirability of subgoals is contingent on expected outcomes, and inherited desirability is diluted by the confidence and strength of the expectation. A specific supergoal-fulfilling scenario X, as a topmost-level parent goal, has estimated desirability which is diluted by the confidence of the hypothesis that the scenario X fulfills the Friendliness referent.
By analogy to cleanliness in causal goal systems, ERS requires that the estimated desirability of Scenario X be cleanly contingent on the continued confidence and strength of whichever hypothesis asserts that Scenario X fulfills the Friendliness referent. A reflective AI should consider clean ERS to be normative.
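One way to picture "cleanly contingent" desirability is a goal chain in which nothing is cached: each goal's desirability is recomputed from the referent through every intervening hypothesis. The sketch below is a minimal illustration under that assumption; the `Link` structure, the goal names, and the confidence values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A hypothesis that achieving `child` serves `parent`."""
    child: str
    parent: str
    confidence: float      # probability that the hypothesis holds
    strength: float = 1.0  # share of the parent's value delivered if it holds


def desirability(goal: str, links: dict[str, Link],
                 root: str = "Friendliness referent") -> float:
    """Recompute desirability from the root on every call; nothing is stored."""
    if goal == root:
        return 1.0
    link = links[goal]
    return link.confidence * link.strength * desirability(link.parent, links, root)


if __name__ == "__main__":
    links = {
        # Hypothesis that Scenario X fulfills the referent (assumed 90%).
        "Scenario X": Link("Scenario X", "Friendliness referent", 0.90),
        # Ordinary predictive link from a subgoal to Scenario X (assumed 98%).
        "Subgoal S": Link("Subgoal S", "Scenario X", 0.98),
    }
    print(desirability("Subgoal S", links))

    # New evidence lowers confidence in the topmost hypothesis; everything
    # beneath it loses desirability automatically on the next evaluation.
    links["Scenario X"].confidence = 0.30
    print(desirability("Subgoal S", links))
```

Because desirability is always derived rather than stored, there is no separate step in which a weakened hypothesis must be "propagated" to the goals that depended on it.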
In accordance with the general philosophy of AI expressed in GISAI, conclusions about Friendliness should have flexibility, granularity, reducibility, and so on: the general attributes opposed to the crystalline fragility of classical AI.
In particular, what this means is that saying that "description X" has "90% probability" is too classical-AI, since it implies that the entire description is all one image with a single quantitative probability. It's not. The external referent "Friendliness" exists. Various conclusions, at various probabilities, with various dependencies, are drawn from the available sensory data. (That, at least, is how things would be for a medium-mature Friendly AI.) A particular outcome may fit a particular definition of Friendliness that has been strongly confirmed and that thus has a 90% probability. Other parts of the description may be less confident. And there may be vast stretches of known unknowns; things that the AI doesn't know about, but that ve knows are there. In short, the discovery and partial description of Friendliness should resemble the discovery and partial description of any other complex fact.
Subgoals for "improving the supergoals" or "improving the goal-system architecture" derive desirability from uncertainty in the supergoal content. They may be metaphorically considered as "child goals of the currently unknown supergoal content". The desirability of "resolving a supergoal ambiguity" derives from the prediction that the unknown referent of Friendliness will be better served, and not from a prediction that one of the current probabilistic descriptions of Friendliness will be fulfilled.
If future X has a 90% chance of meeting the current description, and the current description has a 90% chance of being correct, then future X has an 81% chance of meeting the external referent for Friendliness. Under most situations, any uncertainty in the description of Friendliness will cancel out of most choices, since it applies equally to most possible futures. If the AI is considering actions that would prevent ver from obtaining further sensory data about Friendliness - for example - then the probabilism no longer cancels out of the equation; it's desirable to nail down that last 10% so that the wiser AI can take better actions, and thus undesirable to be unable to obtain sensory data.
Even though the AI has no specific idea about that last 10%, it is still possible for the AI to conclude that being unable to obtain data about Friendliness decreases the probability of that last 10% being achieved, since it reduces the probability that the future AI will know about that last 10%. The heuristic that "sensory data makes it more likely that I'll learn more" holds true even when nothing specific is known about the subjunctive data or the subjunctive discoveries; the heuristic "An AI that knows about supergoal content is more likely to achieve it" holds true - as the dominant probability - even when nothing specific is known about the subjunctive content or the subjunctive achievement.
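The cancellation argument can be sketched numerically. The model below is a deliberate simplification with assumed parameters: the 90% figures come from the text, while the chance of learning the missing content when the data channel stays open (`p_learn_if_open`) and the chance of acting on it successfully (`p_meets_after_learning`) are invented for illustration.

```python
P_DESCRIPTION_CORRECT = 0.90  # from the text


def expected_fulfillment(p_meets_description: float,
                         keeps_data_channel: bool,
                         p_learn_if_open: float = 0.5,          # assumption
                         p_meets_after_learning: float = 0.9    # assumption
                         ) -> float:
    """Rough chance that an action ends up serving the external referent."""
    # Branch where the current description is right.
    known = P_DESCRIPTION_CORRECT * p_meets_description
    # Branch where it is wrong: only a future AI that has learned more can help,
    # and it can learn only if sensory data about Friendliness still arrives.
    p_learn = p_learn_if_open if keeps_data_channel else 0.0
    unknown = (1.0 - P_DESCRIPTION_CORRECT) * p_learn * p_meets_after_learning
    return known + unknown


if __name__ == "__main__":
    # Two ordinary actions: the "unknown" term is identical, so it cancels
    # out of the comparison and the ranking follows the current description.
    print(expected_fulfillment(0.9, keeps_data_channel=True))
    print(expected_fulfillment(0.6, keeps_data_channel=True))
    # Same fit to the description, but this action cuts off further data.
    print(expected_fulfillment(0.9, keeps_data_channel=False))
```

Nothing specific about the missing 10% is represented anywhere in the sketch; the penalty for closing the data channel arises purely from the lower chance of ever learning it.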
Decisions and behaviors having to do with the improvement, correction, refinement, learning, et cetera of Friendliness, should be conceptualized as deriving desirability from supergoal uncertainty. Another way of putting it is that Friendliness-improvement behaviors must derive desirability from the naked referent of Friendliness, and cannot be attached to any of the current cognitive beliefs about specific Friendliness. Structurally, this can occur in one of two ways: by abstracting away from specific details, or by branching on multiple possibilities.
Abstracting away from specific details: "Regardless of what 'Friendliness' is, I can find out by asking the programmers, because the expected effectiveness of that method is not sensitive to the actual content of the specific details I'm currently wondering about." However, this requires the ability to generalize from experience and engage in reasoning about abstract properties.
Branching on multiple possibilities: "If Friendliness turns out to be X, and I ask the programmers, the programmers are likely to say that Friendliness is X, and I have a 90% probability of choosing X. If Friendliness turns out to be X, and I don't ask the programmers, I have a 50% probability of choosing X. If, on the other hand, Friendliness turns out to be Y, and I ask the programmers, the programmers are likely to say that Friendliness is Y, and I have a 90% probability of choosing Y..." This method is more cumbersome but requires less intelligence, since it can operate entirely on specific scenarios.
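The branching method amounts to enumerating specific scenarios and averaging over them. The sketch below is a minimal illustration: the 90% and 50% figures for X and the 90% figure for Y come from the quoted reasoning, while the equal prior over X and Y and the 50% figure for Y without asking are assumptions, since the text trails off before giving them.

```python
# Assumed prior over what Friendliness "turns out to be"; the text gives none.
PRIOR = {"X": 0.5, "Y": 0.5}

# P(the AI ends up choosing the true content | true content, whether it asked).
P_CHOOSE_CORRECT = {
    ("X", True): 0.90,   # from the text
    ("X", False): 0.50,  # from the text
    ("Y", True): 0.90,   # from the text
    ("Y", False): 0.50,  # assumed by symmetry with X
}


def p_correct(ask_programmers: bool) -> float:
    """Chance of choosing the true content, summed over each specific branch."""
    return sum(PRIOR[h] * P_CHOOSE_CORRECT[(h, ask_programmers)] for h in PRIOR)


if __name__ == "__main__":
    print("ask the programmers:", p_correct(True))
    print("do not ask:         ", p_correct(False))
```

Asking comes out ahead in every branch, so the conclusion does not depend on the assumed prior. It is reached without any abstract reasoning about "whatever Friendliness is", which is the sense in which this method requires less intelligence.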