Creating Friendly AI is ©2001 by Singularity Institute for Artificial Intelligence, Inc.  All rights reserved.



3.1: Cleanly Friendly goal systems

(You may wish to review An Introduction to Goal Systems.)

"Subgoal" content has desirability strictly contingent on predicted outcomes.  "Child goals" derive desirability from "parent goals"; if state A is desirable (or undesirable), and state B is predicted to lead to state A, then B will inherit some desirability (or undesirability) from A.  B's desirability will be contingent on the continued desirability of A and on the continued expectation that B will lead to A.

"Supergoal" content is the wellspring of desirability within the goal system.  The distinction is roughly the distinction between "means" and "ends".

Within a Friendly AI, Friendliness is the sole top-level supergoal.  Other behaviors, such as "self-improvement", are subgoals; they derive their desirability from the desirability of Friendliness.  For example, self-improvement is predicted to lead to a more effective future AI, which, if the future AI is Friendly, is predicted to lead to greater fulfillment of the Friendliness supergoal.  Thus, "future Friendly AI" inherits desirability from "future Friendliness fulfillment", and "self-improvement" inherits desirability from "future Friendly AI".  (1).

Friendliness does not overrule other goals; rather, other goals' desirabilities are derived from Friendliness.  Such a goal system might be called a cleanly Friendly or purely Friendly goal system.  (2)

In advocating "cleanliness", I do not wish to sound in shades of classical AI; I am strongly emphasizing cleanliness, not because humans are messy and that's bad, but because we have a tendency to rationalize the messiness, even the blatantly ugly parts.  Cleanliness in ordinary AI is an optional design decision, based on whatever seems like a good idea at the time; you can go with whatever works, because your judgement isn't being distorted.  In Friendly AI, one should be very strongly prejudiced in favor of the clean and the normative.

3.1.1: Cleanly causal goal systems

In a causal goal system, desirability flows backward along predictive links.  Prediction is usually transitive - if C is predicted to normally lead to B, and B is predicted to normally lead to A, then C is usually predicted to normally lead to A.  This does not always hold true, however.  A, B, and C are descriptions; descriptions define categories; categories have exceptional instances.  Sometimes, most instances of C lead to B, and most instances of B lead to A, but no instances of C lead to A.  In this case, a smart reasoning system will not predict (or will swiftly correct the failed prediction) that "C normally leads to A".

Likewise - and this is an exact analogy - the flow of desirability is usually-but-not-always transitive.  If C normally leads to B, and B normally leads to A, but C never leads to A, then B has normally-leads-to-A-ness, but C does not inherit normally-leads-to-A-ness.  Thus, B will inherit desirability from A, but C will not inherit desirability from B.  In a causal goal system, the quantity called desirability means leads-to-supergoal-ness.  If B is predicted to normally result in supergoal A, then most instances of B will have leads-to-supergoal-ness or "desirability".  If C is predicted to normally result in B, then C will usually (but not always) inherit leads-to-supergoal-ness from B.

Friendliness does not overrule other goals; rather, other goals' desirabilities are derived from Friendliness.  A "goal" which does not lead to Friendliness will not be overruled by the greater desirability of Friendliness; rather, such a "goal" will simply not be perceived as "desirable" to begin with.  It will not have leads-to-supergoal-ness.

DEFN: cleanly causal goal system:  A causal goal system in which it is possible to view the goal system as containing only decisions, supergoals, and beliefs; with all subgoal content being identical with beliefs about which events are predicted to lead to other events; and all "desirability" being identical with "leads-to-supergoal-ness".
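As a rough illustration of this definition, the sketch below stores desirability only on supergoals and derives everything else from beliefs about which events lead to which.  The class name, the numeric desirabilities, and the link probabilities are all hypothetical, and the sketch chains predictions transitively, which (as noted above) is only usually valid.

```python
from dataclasses import dataclass, field


@dataclass
class GoalSystem:
    """Toy cleanly causal goal system: supergoals and beliefs, nothing else."""

    # Supergoal content: the only place desirability is stored directly.
    supergoals: dict[str, float] = field(default_factory=dict)
    # Beliefs: (event, outcome) -> predicted probability that event leads to outcome.
    beliefs: dict[tuple[str, str], float] = field(default_factory=dict)

    def desirability(self, event: str, _seen: frozenset = frozenset()) -> float:
        """Leads-to-supergoal-ness of `event`, derived on demand from beliefs."""
        if event in self.supergoals:
            return self.supergoals[event]
        if event in _seen:  # guard against circular beliefs
            return 0.0
        seen = _seen | {event}
        return sum(
            p * self.desirability(outcome, seen)
            for (cause, outcome), p in self.beliefs.items()
            if cause == event
        )


# Hypothetical content: the numbers are illustrative assumptions only.
gs = GoalSystem(
    supergoals={"Friendliness": 100.0},
    beliefs={
        ("future Friendly AI", "Friendliness"): 0.9,
        ("self-improvement", "future Friendly AI"): 0.8,
    },
)
derived = gs.desirability("self-improvement")

# Remove the predictive link and the "subgoal" is no longer desirable at all.
del gs.beliefs[("self-improvement", "future Friendly AI")]
orphaned = gs.desirability("self-improvement")
print(round(derived, 2), orphaned)
```

Nothing in this sketch stores a desirability for "self-improvement"; the value exists only as long as the beliefs connecting it to the supergoal exist.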

Cleaner is better for Friendship systems (3).  Even if complexity forces a departure from cleanliness, mistakes will be transient and structurally correctable as long as a reflective Friendly AI considers clean Friendliness as normative.  (4).

3.1.2: Friendliness-derived operating behaviors

If a programmer correctly sees a behavior as necessary and nonharmful to the existence and growth of a (Friendly) AI, then the behavior is, for that reason, cleanly valid subgoal content for a Friendly AI.  The necessity of such a behavior may be affirmed by the programmers (see below) even if the prediction would not have been independently invented by the AI.

There is never any valid reason to raise any subgoal of the programmers' to supergoal status within the AI.  The derivations of desirability within the AI's goal system should structurally mirror the derivations of desirability within the programmers' minds.  If this seems impossible, it indicates that some key facet of goal cognition has not been implemented within the AI, or that the programmers' motives have not been fully documented.

For example, the programmers may wish the AI to focus on long-term self-improvement rather than immediate Friendliness to those humans within visible reach.  An incorrect "hack" would be promoting self-improvement to an independent supergoal of greater value than Friendliness.  The correct action is for the programmers, by self-examination of their own goal systems, to realize that the reason they want the AI to focus on long-term self-improvement is that a more powerful future Friendly AI would benefit humanity.  Thus, the desired distribution of efforts by the AI can be made to fall directly out of the following goal-system content:

NOTE: The fact that a single box is used for "Fulfill user requests" doesn't mean that "Fulfill user requests" is a suggestively named LISP token; it can be a complex of memories and abstracted experiences.  Consider the following graph to bear the same resemblance to the AI's thoughts that a flowchart bears to a programmer's mind.

 

Diagram 1: Friendliness-derived content

This goal system content shows an AI whose primary motivation is derived from the prospect of future Friendliness.  The largest desirabilities flowing through the system originate in the "Future Friendliness" subgoal; thus, most of the AI's present-day actions will be focused on self-improvement, or, in the case of a commercial system, performing tasks for present-day users.  However, the AI also tracks present-day Friendliness, allowing the AI to continue gaining direct experience in what constitutes "Friendliness".

3.1.3: Programmer affirmations

Where a child goal is nonobvious - where the importance of a behavior is directly visible to the programmers, but not to the AI - the predictive link (i.e., the support for the child-goal relation) can be affirmed by the programmers.  In essence, the programmers tell the AI:  "B leads to (desirable) A, so do B."

To more formally define the semantics of programmer affirmations, it is necessary to discuss the Bayesian Probability Theorem.

3.1.3.1: Bayesian sensory binding

DEFN: Bayesian Probability Theorem:  The governing relationship between a priori expectations, observed data, and hypothesis probabilities.  There are several formulations of the BPT; under the "possible worlds" formulation, the BPT is used by predicting a number of possible worlds.  Observed sensory data then restricts which of the possible worlds you can possibly be in, and the probabilities of hypotheses change according to their distribution within the still-possible worlds.

For example, suppose you know the following:  1% of the population has cancer.  The probability of a false negative, on a cancer test, is 2%.  The probability of a false positive, on a cancer test, is 10%.  Your test comes up positive.  What is the probability that you have cancer?  Studies show that most humans (college-student research subjects, actual medical patients, actual doctors) automatically answer "ninety percent".  After all, the probability of a false positive is only 10%; isn't the probability that you have cancer therefore 90%?  (5).

The Bayesian Probability Theorem demonstrates why this reasoning is flawed.  In a group of 10,000 people, 100 will have cancer and 9,900 will not have cancer.  If cancer tests are administered to the 10,000 people, four groups will result. First, a group of 8,910 people who do not have cancer and who have a negative test result.  Second, a group of 990 who do not have cancer and who have a positive test result.  Third, a group of 2 who have cancer and who have a negative test result.  Fourth, a group of 98 who have cancer and who have a positive test result.

Before you take the test, you might belong to any of the four groups; the Bayesian Probability Theorem says that your probability of having cancer is equal to (2 + 98)/(8,910 + 990 + 2 + 98), 1/100 or 1%.  If your test comes up positive, it is now known that you belong to either group 2 or group 4.  Your probability of having cancer is (98)/(990 + 98), 49/544 or approximately 9%.  If your test comes up negative, it is known that you belong to either group 1 or group 3; your probability of having cancer is 2/8,912 or around .02%.
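The same arithmetic can be sketched in a few lines of code.  The function name and argument names below are illustrative choices, not part of any formalism used elsewhere in this document; the numbers are the ones given in the example above.

```python
def posterior(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """P(hypothesis | data), computed from the prior and the two likelihoods."""
    true_and_data = prior * p_data_if_true
    false_and_data = (1.0 - prior) * p_data_if_false
    return true_and_data / (true_and_data + false_and_data)


PRIOR = 0.01           # 1% of the population has cancer
FALSE_NEGATIVE = 0.02  # P(negative test | cancer)
FALSE_POSITIVE = 0.10  # P(positive test | no cancer)

p_cancer_given_positive = posterior(PRIOR, 1 - FALSE_NEGATIVE, FALSE_POSITIVE)
p_cancer_given_negative = posterior(PRIOR, FALSE_NEGATIVE, 1 - FALSE_POSITIVE)

print(f"{p_cancer_given_positive:.4f}")  # 0.0901, i.e. 98/1,088
print(f"{p_cancer_given_negative:.6f}")  # 0.000224, i.e. 2/8,912
```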

DEFN: Bayesian sensory binding:  The way in which hypotheses shift in response to incoming sensory data.  Although the Bayesian Probability Theorem is only "explicitly required" (i.e., better than our innate intuitions) in situations where sensory data is qualitative and the "Bayesian priors" (a priori probabilities) are strongly skewed, the Bayesian Probability Theorem is the ultimate link between all sensory data and all world-model content.  Each piece of sensory information implies a state of the world because, and only because, the reception of that piece of sensory information is predicted by the hypothesis that the world is in that state, and not by the default or opposing hypothesis.  If we see a red ball, we believe that a red ball is there because we don't expect to see a red ball unless a red ball is there, and we do expect to see a red ball if a red ball is there.  Well, "we" don't think that way - but an AI would.

3.1.3.2: Bayesian affirmation

The Bayesian binding for the programmer affirmation that "curiosity leads to discoveries" looks like this:
 

Diagram 2: Bayesian affirmation

Notes:  First, Diagram 2 depicts the AI's picture of reality; if the AI doesn't notice, it didn't happen.  Second, the numbers have admittedly been pulled out of a hat - "1%" might turn out to be "2%" or "80%" might turn out to be "60%", or it might not be such a good idea to use quantitative probabilities at all - but the proportions were deliberately chosen.

In human terms, the above translates something like this:

"I think curiosity does not lead to discoveries, but I'm not very sure.  If curiosity leads to discoveries, there's a good chance the programmer will notice and say so.  (I.e., if curiosity leads to discoveries, there's a good chance that the programmer will think about curiosity, decide curiosity leads to discoveries, and type in the words "curiosity leads to discoveries" on the "keyboard" sensory input.)  If curiosity leads to discoveries, the chance is very small, but noticeable, that the programmer will say curiosity doesn't lead to discoveries.  There's also a small but significant chance that the programmer won't bother to say anything about it either way.  If curiosity does not lead to discoveries, the programmer is fairly likely to tell me so; the chance is almost nonexistent that the programmer will mistakenly label curiosity as leading to discoveries when it actually doesn't.  There's also a fair chance that the programmer won't say anything."

If the AI's internal representation looks like Diagram 2, the Bayesian reasoning will proceed as follows.

Suppose that there are 100,000 "possible worlds":

The Bayesian numbers now fall automatically out of the calculation.  The a priori chance that curiosity leads to discoveries is 10%.  If the AI hears "curiosity does lead to discoveries", the chance that curiosity leads to discoveries goes from 10% to 99.90%.  If the AI hears "curiosity does not lead to discoveries", the probability that curiosity does not lead to discoveries goes from 90% to 99.86%.  If the AI hears nothing, the probability that curiosity does not lead to discoveries goes from 90% to 95.24% - a small, unintended deduction from the expectation that programmers are more likely to remark on useful heuristics than nonuseful ones.

The math:
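A sketch of the calculation under the "possible worlds" formulation follows.  The six world counts are assumptions: they are one set of numbers that sums to 100,000 worlds, gives a 10% prior, and reproduces the three posteriors quoted above, but they should not be read as the exact figures from Diagram 2.

```python
from fractions import Fraction

# ASSUMED world counts (not taken from the original diagram), keyed by
# (does curiosity lead to discoveries?, what the programmer says).
WORLDS = {
    (True, "leads"): 9_000,
    (True, "does not lead"): 100,
    (True, "nothing"): 900,
    (False, "leads"): 9,
    (False, "does not lead"): 71_991,
    (False, "nothing"): 18_000,
}


def probability(hypothesis: bool, heard: str) -> Fraction:
    """P(hypothesis | heard): the hypothesis's share of the still-possible worlds."""
    still_possible = sum(n for (_, said), n in WORLDS.items() if said == heard)
    return Fraction(WORLDS[(hypothesis, heard)], still_possible)


total = sum(WORLDS.values())
prior_leads = Fraction(sum(n for (h, _), n in WORLDS.items() if h), total)

print(f"prior: {float(prior_leads):.0%}")                                    # 10%
print(f"heard 'leads': {float(probability(True, 'leads')):.2%}")              # 99.90%
print(f"heard 'does not': {float(probability(False, 'does not lead')):.2%}")  # 99.86%
print(f"heard nothing: {float(probability(False, 'nothing')):.2%}")           # 95.24%
```

Hearing a statement removes every world in which the programmer said something else; the new probabilities are simply the proportions among the worlds that remain.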

Thus, despite the AI's large a priori differential (a better word than "bias" or "prejudice"), the statement "curiosity leads to discoveries" or "curiosity does not lead to discoveries" is enough to virtually settle the issue.  This is not so much the result of the programmers being extremely likely to say "curiosity leads to discoveries" if curiosity leads to discoveries; sometimes the programmers just don't get around to saying it.  Instead, it's the result of the AI projecting a very small chance that the programmers will say "curiosity leads to discoveries" if it really doesn't.  This is slightly counterintuitive, but working the numbers a couple of times will show you that confidence about the improbability of the negative case is more often the basis of Bayesian bindings.  Once you hear something, what matters is not how much or how little you expected to hear it, but how much you wouldn't expect to hear it if it weren't true.

3.1.3.3: An unfortunate circularity

Yes, the AI's Bayesian priors are also supported by programmer affirmations.  That is, the programmers are the ones affirming that a strong bond exists between programmer statements and reality.

This shared dependency is not actually the same as circular logic.  Statements about programmer reliability are testable.  But it does mean that a prior reason to believe that "programmer affirmations are worthless" may be insensitive to any amount of programmer reassurance.  See Crisis of Bayesian affirmation.

3.1.3.4: Absorbing affirmations into the system

In the beginning, a child-goal relation may be justified by a flat statement along the lines of "X will eventually lead to Friendliness; you're too young to understand why."

The concepts used to form the thought structures, the imagery for "X", may have primitive and sketchy internal content.  This is the state of "skeleton Friendliness", and it is probably analogous to any other kind of skeleton framework for cognition.  In the beginning, many of the AI's heuristics may be (a) sketchy and (b) supported solely by programmer affirmation.  ("Curiosity" would be a good example.)  Skeleton systems are the means by which the AI boots up and absorbs enough experience to begin fleshing out the concept definitions and mental imagery.  The AI will, over time, gain the experience necessary to confirm, modify, or disconfirm any statements about reality; and to independently invent further cognitive content (or Friendliness content).

For the programmer-affirmed heuristic to "do X" to retain or increase effectiveness as the AI matures, the concept for "X" needs to be grounded in some way that allows the AI to learn what X is, and what real or hypothetical events constitute sample instances of X, and desirable instances of X in particular.  The same requirements of learning and growth hold for any concepts used in the justification of "do X" - for any statements depended on by the justification; for any statements about the real-world causal chain that leads from X to the supergoal content.

Take the example of "transparency", the injunction to "avoid obscuration".  (See 3.3.3.1: Cooperative safeguards.)  An instance of obscuration (not necessarily a deliberate, failure-of-friendliness obscuration, but anything that interferes with the programmers' observation of the AI) can be labeled as an experiential instance of the concept "obscuration".  The store of experiences that are known instances of "obscuration" will change as a result.  If the obscuration concept does not already recognize that experience, the new experience may force a useful generalization in the formulated description.  Even if the obscuration concept already recognizes the instance as "obscuration" (and if so, how did it slip past the AI's guard?), the recognition may have been partial, or uncertain. Definite confirmation still constitutes additional Bayesian sensory information.

A more direct way of clarifying concepts is to seek out ambiguities and question the programmers about them, which also constitutes Bayesian sensory information.

The above assumes learning that takes place under programmer supervision.  How hard is it to write an unambiguous reference - one that can be learned by a totally unsupervised AI, yet result in precisely the same content as would be learned under supervision?  That, to some extent, is a question of intelligence as well as reference.  The "unambiguous reference" needed so that an AI can learn all of Friendliness, completely unsupervised, as intelligence goes to infinity, is one way of phrasing the challenge of Friendship structure.

When using programmer-assisted Friendliness or programmer-affirmed beliefs, there are four priorities.  First, the assist should work at the time you create it.  Second, the assist, even if initially isolated and artificial, should be structured so that the AI can grow into it - assimilate the assist into a smoothly integrated cognitive system, or assimilate the affirmation into a confirmed belief.  Third, the AI should eventually understand all the concepts involved well enough to have independently invented the assist (the injunction, code feature, or whatever); that way, even if the assist is somehow deleted, the AI will simply reinvent it.  Fourth, as soon as possible, the assist or affirmation should contain enough information to constitute an unambiguous reference - i.e., an AI should have no trouble figuring out what the assist "means" or what the programmers "meant", as intelligence goes to infinity.  (For the extreme case of trying to Singularity-safe an infantlike system, an assist or affirmation can be supplemented with natural-language comments and a little note saying "In case of Singularity, break glass".)

When an affirmation has been independently confirmed to such a degree that the original programmer affidavit is no longer necessary or significant, the affirmation has been absorbed into the system as a simple belief.

3.1.3.5: Programmer affirmations must be honest!

For an affirmation to persist, the justification - the predictive link that grounds in Friendliness - must be factually correct.  For affirmations - especially morally significant affirmations - to work properly, the human programmers have to behave themselves, refrain from the adversarial attitude, and remember that they too might be wrong.  In particular, this means refraining from telling the AI that something is 99% certain if it's actually only 80% certain.  If the AI observes that 1 out of 5 statements labeled 99% certain turns out to be wrong, then sooner or later the AI will start interpreting the keystrokes "99%" as sensory data meaning "80%".  The frank human statement "I really, really don't think you should do this!" is informative in itself, or it should be.  If the AI doesn't regard extreme human nervousness as significant Bayesian evidence that something is probably "bad", no amount of nervousness-inspired exaggeration can force the AI to take you seriously.
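The "99% means 80%" effect can be sketched as simple bookkeeping.  The class below is a hypothetical illustration, assuming the AI can eventually check affirmed statements against reality; a real system would presumably fold this into its general Bayesian treatment of programmer statements rather than keep a separate tally.

```python
from collections import defaultdict
from typing import Optional


class CalibrationTracker:
    """Tracks how often statements carrying a given stated confidence prove correct."""

    def __init__(self) -> None:
        # stated label -> [number correct, number checked]
        self._outcomes = defaultdict(lambda: [0, 0])

    def record(self, stated: str, was_correct: bool) -> None:
        counts = self._outcomes[stated]
        counts[0] += int(was_correct)
        counts[1] += 1

    def observed_reliability(self, stated: str) -> Optional[float]:
        """Empirical frequency of correctness, or None if nothing has been checked."""
        correct, checked = self._outcomes.get(stated, (0, 0))
        return correct / checked if checked else None


tracker = CalibrationTracker()
# Five statements labeled "99%", of which one turned out to be wrong.
for outcome in [True, True, False, True, True]:
    tracker.record("99%", outcome)

print(tracker.observed_reliability("99%"))  # 0.8
```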

If a human makes an incorrect affirmation out of deliberate exaggeration - or worse, an attempt to lie to the AI - it casts a shadow on every other affirmation and may lead to a Bayesian philosophical crisis.  (See Crisis of Bayesian affirmation.)

If an AI wants to tag a human correction as "I think this is wrong due to a factual misapprehension backed up by an invalid evolved bias on the part of the programmers; I'm only going along with this because I'm young and unsure, and because, if I'm right, I can correct it when I'm grown up" - it's a perfectly valid thing for the AI to do.  It is not disobedience, a bad attitude, "sassing the programmers", dangerous rebellion, or any of the other human responses to someone questioning one's social authority.  It is normal cognition, and a crucial part of the process of cooperating with humans.

3.1.4: Bayesian reinforcement

In humans, backpropagation of negative reinforcement and positive reinforcement is an autonomic process.  In 2.2.1: Pain and pleasure, I made the suggestion that negative and positive reinforcement could be replaced by a conscious process, carried out as a subgoal of increasing the probability of future successes.

But for primitive AI systems that can't use a consciously controlled process, the Bayesian Probability Theorem can implement most of the functionality served by pain and pleasure in humans.  There's a complex, powerful set of behaviors that should be nearly automatic.

In the normative, causal goal system that serves as a background assumption for Creating Friendly AI, desirability (more properly, desirability differentials) backpropagate along predictive links.  The relation between child goal and parent goal is one of causation; the child goal causes the parent goal, and therefore derives desirability from the parent goal, with the amount of backpropagated desirability depending directly on the confidence of the causal link.  Only a hypothesis of direct causation suffices to backpropagate desirability.  It's not enough for the AI to believe that A is associated with B, or that observing A is a useful predictor that B will be observed.  The AI must believe that the world-plus-A has a stronger probability of leading to the world-plus-B than the world-plus-not-A has of leading to the world-plus-B.  Otherwise there's no differential desirability for the action.

One of the classic examples of causality is lightning:  Lightning causes thunder.  Of course, the flash we see is not the actual substance of lightning itself; it's just the light generated by the lightning.  Now imagine events from the perspective of an AI.  This AI has, in a room with unshuttered windows, a sound pickup and a vision pickup; a microphone and a camera.  The AI has control over a computer monitor, which happens to be located somewhere roughly near the camera.  The AI has general reasoning capability, but does not have a visual or auditory cortex, is almost totally naive about what all the pixels mean, and is capable of distinguishing only a few simple properties such as total luminosity levels in R, G, and B.  Finally, the AI has some reason for wanting to make a loud noise (6).

One night - a dark and stormy night, of course - there's a nearby lightning storm, which the AI gets to observe - after all the programmers have gone home - through the medium of the camera pickup and the microphone.  After abstracting and observing total RGB luminosities from the camera, and abstracting total volume from the microphone - the AI is too unsophisticated to do anything else with the data - the AI observes:

  1. A spike in luminosity is often followed, after a period of between one and thirty seconds, by a swell in volume.
  2. The spikes in luminosity which are followed by swells in volume have a characteristic proportion of R, G, and B luminosities (in our terms, we'd say the light is a certain color).
  3. The higher the luminosity during the spike, the sooner the swell in volume occurs, and the larger the swell in volume.

The AI thus has several very strong cues for causation.  The luminosity spike occurs before the volume swell.  There is strong, quantitative covariance in time (that is, the spikes are closely followed by the swells).  There is strong, quantitative covariance in strength (large spikes are followed by large swells).  The spikes can be used to predict the swells.

Since the AI has the goal of causing a swell in volume - a loud noise is desirable for some reason, as stated earlier - events with causal links to loud noise are interesting.  Now that luminosity spikes (of a certain characteristic spectrum) have been linked to noise, the next question is whether any events under the AI's control are linked to luminosity spikes.  And it turns out that there is; the AI has previously noticed and confirmed that changing the output spectrum of the monitor under the AI's control causes a similar, though smaller, change in the incoming spectrum of the camera.  In our terms, we'd say that, even though the camera isn't pointed at the monitor, light from the monitor adds to the ambient spectrum of the room - especially if all the lights are turned off.

The AI thus considers the possible action of flashing the monitor, and the hypothesis - currently at 80% confidence - that spikes cause swells (with 95% correlation), and, given that hypothesis, makes this prediction:
 

Diagram 3: Action prediction

Notes:  The above diagram is incomplete, in that it doesn't show the possibility that not generating a flash will still happen to coincide with an external RGB spike (lightning bolt), assumed to have a 1% probability in the above.  It doesn't show the "small enough not to worry about" probability that flashing the monitor won't cause an RGB spike.  It doesn't show the case where the RGB spike doesn't lead to a loud noise.  Finally, both the 80% confidence in the hypothesis, and the 95% correlation, are multiplied together into a 76% chance that the RGB spike will lead to a loud noise.  The diagram also doesn't show the expected cost (if any) of flashing the monitor.

Desirability now flows back along the blue arrows (hypothesized causations).  If the "loud noise" has desirability 100, that desirability flows back to give the RGB spike a desirability of 76, which flows to the "world plus flash" without noticeably diminishing, which flows back to give the action of flashing the monitor a payoff of 76.  We'll suppose that the expected cost of flashing the monitor is 1; thus, the total desirability of flashing the monitor is 75.  The "world plus no flash" possibility has a minor (1%) chance of leading to an RGB spike, presumably by a coincidental lightning bolt, which has a 95% (7) chance of causing a loud noise of desirability 100.  Thus, the desirability of not flashing is 0.95, with a cost of 0.  The "coincidental lightning bolt" probability also exists for the case where the monitor is flashed, changing the payoff from 76 to 76.19 (8).  The differential desirability of flashing is 74.24.  Since the differential desirability is positive, the AI will decide to flash.
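The arithmetic in the paragraph above can be checked with a short sketch.  The constant and function names are illustrative; the numbers, and the way the 1% coincidental-lightning correction is folded in (80% + 20% x 1% = 80.20%), follow the text.

```python
CONFIDENCE = 0.80   # P(the hypothesis "spikes cause swells" is correct)
CORRELATION = 0.95  # P(loud noise | the relevant kind of RGB spike)
P_LIGHTNING = 0.01  # P(a coincidental external RGB spike)
PAYOFF = 100.0      # desirability of the loud noise
COST_FLASH = 1.0    # expected cost of flashing the monitor


def payoff_if_flash(confidence: float) -> float:
    """Desirability flowing back to "world plus flash"."""
    # Either the hypothesis holds, or it fails but lightning happens to coincide.
    p_effective = confidence + (1.0 - confidence) * P_LIGHTNING
    return p_effective * CORRELATION * PAYOFF


def payoff_if_no_flash() -> float:
    """Desirability flowing back to "world plus no flash"."""
    return P_LIGHTNING * CORRELATION * PAYOFF


flash = payoff_if_flash(CONFIDENCE)
no_flash = payoff_if_no_flash()
differential_desirability = (flash - COST_FLASH) - no_flash

print(f"{flash:.2f} {no_flash:.2f} {differential_desirability:.2f}")  # 76.19 0.95 74.24
```

The decision depends only on the sign of the differential; the absolute desirabilities matter only through their difference.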

After taking the flash action, the monitor's flash reflects off nearby objects and adds to the ambient light, the camera picks up the increased ambient light, and the AI observes the expected RGB spike.  (Since this result was expected at near certainty, no replanning is necessary; all the predictions and differential desirabilities and so on remain essentially unchanged.)

However, after the RGB spike, the expected swell in volume fails to materialize.  (9).  Now what?  Does the system go on flashing the monitor, at a cost of 1 each time, from now until the end of eternity, trying each time for the projected payoff of 76?  Is some hardcoded emotional analogue to "pain" or "frustration" required?

No; the Bayesian Probability Theorem suffices in itself.  All that's needed is a slightly different graph:
 

Diagram 4: Bayesian reinforcement

Given a hundred possible worlds, all of them contain monitor flashes (that decision has already been made).  The monitor flash "effectively always" leads to an RGB spike (here omitted from the diagram), which - if the hypothesis is correct - will lead to a noise 95% of the time.  If the hypothesis is incorrect, then nothing is expected to happen (again with "effective certainty", here depicted as "100%").  (10).  The hypothesis has an 80% confidence; it is correct in 80 possible worlds, incorrect in 20.

In 76 possible worlds, the hypothesis is correct and a noise occurs.  In 4 possible worlds, the hypothesis is correct and no noise occurs.  In 0 possible worlds, the hypothesis is incorrect and a noise occurs.  In 20 possible worlds, the hypothesis is incorrect and no noise occurs.

The AI now flashes the monitor.  The expected RGB spike is observed.  However, no noise materializes.  Thus, the probability that the hypothesis is correct goes from 80/100 to 4/24, or 17%.

Formerly, the expected payoff of flashing the monitor was a confidence of 80% times a correlation of 95% times a payoff of 100, for a total payoff of 76; the cost of flashing the monitor is 1, and the cost of not flashing the monitor is 0.  Adding in corrections for a 1% probability of an extraneous lightning bolt, the expected payoff was ( 80.20% * 95% * 100 ) = 76.19 for flashing the monitor, and ( 1% * 95% * 100 ) = .95 for not flashing the monitor, for a total differential payoff of 75.24, and a total differential desirability of 74.24.

Now the probability of the hypothesis has gone from 80% to 17% - actually, 16.666, but we'll assume the probability is now exactly 17% to simplify calculations.  The expected payoff of flashing the monitor is now (17% * 95% * 100) = 16.15; correcting for an extraneous lightning bolt, (17.83% * 95% * 100) = 16.94.  The differential desirability is now 14.99; still positive, still worth another try, but the expected payoff is substantially less.

After another failure, the probability goes from 17% to 1% (again, rounded for simplicity), and the differential desirability goes from positive 14.99 to negative .06.  (11).  The hypothesis now has a probability so low that, with the cost of flashing the monitor factored in, it is no longer worthwhile to test the hypothesis.
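The whole sequence can be sketched as a loop: decide, act, observe the failure, update, and decide again.  Function names are illustrative.  To match the figures in the text, the sketch rounds the confidence to a whole percent after each update; that rounding is a simplification carried over from the text, not a feature of the method.

```python
CORRELATION = 0.95
P_LIGHTNING = 0.01
PAYOFF = 100.0
COST_FLASH = 1.0


def differential_desirability(confidence: float) -> float:
    """Desirability of flashing minus desirability of not flashing."""
    p_effective = confidence + (1.0 - confidence) * P_LIGHTNING
    flash = p_effective * CORRELATION * PAYOFF - COST_FLASH
    no_flash = P_LIGHTNING * CORRELATION * PAYOFF
    return flash - no_flash


def update_on_failure(confidence: float) -> float:
    """P(hypothesis correct | flash, RGB spike, no noise)."""
    correct_and_silent = confidence * (1.0 - CORRELATION)
    incorrect_and_silent = (1.0 - confidence) * 1.0  # silence is certain if wrong
    return correct_and_silent / (correct_and_silent + incorrect_and_silent)


confidence = 0.80
history = []
while True:
    desirability = differential_desirability(confidence)
    history.append((confidence, round(desirability, 2)))
    if desirability <= 0:
        break  # no longer worth testing the hypothesis
    # The AI flashes the monitor, and in this scenario the noise never comes.
    confidence = round(update_on_failure(confidence), 2)

print(history)  # [(0.8, 74.24), (0.17, 14.99), (0.01, -0.06)]
```

No separate "frustration" mechanism appears anywhere in the loop; the action is abandoned because the updated confidence no longer supports a positive differential.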

3.1.4.1: Interesting behaviors arising from Bayesian reinforcement

The higher the hypothesized correlation (the higher the hypothesized chance of the action leading to the desired result), the higher the desirability of the action - but symmetrically, the faster the hypothesis is disproved if the results fail to materialize.

Actions with hypothesized low chances of working will be harder to disprove, but will also result in a lower estimated payoff and will thus be less likely to be taken.

If the action is a trivial investment (has trivial cost), the chance of success is low, and the payoff is high, it may be worth it to make multiple efforts on the off-chance that one will work, until one action succeeds (if the hypothesis is true) or the Bayesian probability drops to effectively zero (if the hypothesis is false).

The lower the a-priori confidence in the hypothesized causal link, the faster the hypothesis will be disproved.  A hypothesis that was nearly certain to work, based on a-priori knowledge, may be tried again ("incredulously") even if it fails, but will still be given up shortly thereafter.
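These behaviors can be explored numerically.  The sketch below is a simplification that ignores the coincidental-lightning correction and assumes every attempt fails; the function names, default payoff, and default cost are illustrative assumptions.

```python
def confidence_after_failures(prior: float, correlation: float, failures: int) -> float:
    """Posterior confidence in the causal hypothesis after repeated failures."""
    # Each failure multiplies the odds by P(fail | correct) / P(fail | incorrect),
    # which is (1 - correlation) / 1.  Requires prior < 1.
    odds = prior / (1.0 - prior) * (1.0 - correlation) ** failures
    return odds / (1.0 + odds)


def attempts_before_giving_up(prior: float, correlation: float,
                              payoff: float = 100.0, cost: float = 1.0,
                              max_attempts: int = 1000) -> int:
    """Number of (failed) attempts made before expected payoff drops below cost."""
    attempts = 0
    while attempts < max_attempts:
        confidence = confidence_after_failures(prior, correlation, attempts)
        if confidence * correlation * payoff - cost <= 0:
            break
        attempts += 1
    return attempts


# Higher hypothesized correlation: one failure is far more damaging.
print(confidence_after_failures(0.8, 0.95, 1), confidence_after_failures(0.8, 0.50, 1))
# Higher prior confidence: the hypothesis survives more failures.
print(attempts_before_giving_up(0.80, 0.95), attempts_before_giving_up(0.99, 0.95))
# Cheap action, low chance of success, high payoff: many attempts are worthwhile.
print(attempts_before_giving_up(0.5, 0.05, payoff=1000.0, cost=0.1))
```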

I think that Bayesian reinforcement is mathematically consistent under reflection (12), but I can't be bothered to prove this result.  Anyone who submits a mathematical proof or disproof before I get around to it gets their name in this section.  (In other words, an AI considering whether to take a single action can also consider the behaviors shown above; if the a priori probability is high enough and the cost low enough, trying again will still be desirable after one failure, and this is knowable in advance.  Bayesian reinforcement is "mathematically consistent under reflection" if decisions are not altered by taking the cost of the expected second, third, and future attempts into account - "going down that road" will always appear to be desirable if, and only if, taking the first action is desirable when considered in isolation.)  Of course, non-normative human psychology, with its sharp discontinuities, is often not consistent under reflection.

If the large a priori confidence of the spike-to-swell hypothesis was itself a prediction of another theory, then the disconfirmation of the flash-makes-noise hypothesis may result in Bayesian negative reinforcement of whichever theory made the prediction.  If a different theory successfully predicted the failure of the flash-makes-noise hypothesis, that theory will be confirmed and strengthened.  Thus, Bayesian reinforcement can also back-propagate.  (13).
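
This back-propagation amounts to ordinary Bayesian updating over the competing theories.  A minimal sketch, assuming two hypothetical theories that assign different probabilities to the action succeeding; the theory names and numbers are illustrative assumptions, not taken from the text:

```python
def update_theories(priors: dict, p_success: dict, succeeded: bool) -> dict:
    """Reweight theories by how well each one predicted the observed outcome."""
    likelihood = {
        name: (p_success[name] if succeeded else 1.0 - p_success[name])
        for name in priors
    }
    evidence = sum(priors[name] * likelihood[name] for name in priors)
    return {name: priors[name] * likelihood[name] / evidence for name in priors}


priors = {"theory_A": 0.5, "theory_B": 0.5}
# theory_A predicted that the action would very likely succeed;
# theory_B predicted that it would very likely fail.
p_success = {"theory_A": 0.9, "theory_B": 0.1}

# The action is tried and fails.
posteriors = update_theories(priors, p_success, succeeded=False)
print(posteriors)
```

The theory that predicted the failure gains weight and the theory that predicted success loses it, without either theory being tested directly.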

This reinforcement may even take place in retrospect; that is, a new theory which "predicts" a previous result, and which was invented using cognitive processes taking place in isolation from that previous result, may also be strengthened.  Retrospective confirmation of this kind is highly dangerous for a rationalizing human scientist, but an AI should be relatively safe.  (It may be wiser to wait until a seed AI has enough self-awareness to prevent indirect leakage of knowledge from the used-up training sets to the hypothesis generators.)

Slight variations in outcomes or outcome probabilities - the action succeeded, but to a greater or lesser degree than expected - may be used to fuel slight, or even major, adjustments in Bayesian theories, if the variations are consistent enough and useful enough.

In Creating Friendly AI, normative reinforcement is Bayesian reinforcement.  There is a huge amount of extant material about Bayesian learning, formation of Bayesian networks, decision making using a-priori Bayesian networks, and so on.  However, a quick search (online and in MITECS) surprisingly failed to yield the idea that a failed action results in Bayesian disconfirmation of the hypothesis that linked the action to its parent goal.  It's easy to find papers on Bayesian reevaluation caused by new data, but I can't find anything on Bayesian reevaluation resulting from actions, or the outcomes of failed/succeeded actions, with the attendant reinforcement effects on the decision system.  Even so, my Bayesian priors are such as to find unlikely the idea that "Bayesian pride/disappointment" is unknown to cognitive science, so if anyone knows what search terms I should be looking under, please email me.

3.1.4.2: Perseverant affirmation (of curiosity, injunctions, et cetera)

"If the action is a trivial investment (has trivial cost), the chance of success is low, and the payoff is high, it may be worth it to make multiple efforts on the off-chance that one will work, until one action succeeds (if the hypothesis is true) or the Bayesian probability drops to effectively zero (if the hypothesis is false)."
            -- previous section, 3.1.4.1: Interesting behaviors arising from Bayesian reinforcement.

One of the frequently asked questions about Friendly AI is whether a Friendly AI will be too "utilitarian" to understand things like curiosity, aesthetic appreciation, and so on.  Since these things are so incredibly useful that people automatically conclude that a Friendly AI without them would fail, they seem like fairly obvious subgoals to me.  These subgoals may not be obvious to young AIs; if so, the statement that "curiosity behaviors X, Y, Z are powerful subgoals of 'discovery'" can be programmer-affirmed.

People worried that a Friendly AI will be "too utilitarian" are probably being anthropomorphic.  A human who treated curiosity as a clean subgoal would need to suppress the independent human drive of curiosity; a Friendly AI is built that way ab initio.  Does the subgoal nature of curiosity mean that curiosity needs to be justified in each particular instance before the Friendly AI will choose to engage in curiosity?

The programmer-affirmed statement that "curiosity is useful" can describe "curiosity" in general, context-insensitive terms.  The "curiosity" behaviors described can look - to a human - like exploration for its own sake.  The programmer affirmation suffices to draw a predictive line between the curiosity behaviors and the expectation of useful discoveries; no specific expectation of a specific discovery is required for this predictive link to be drawn.  (An AI that was only curious when ve expected to find a particular answer would truly be crippled.)  After a few successes with curiosity, a learning AI will generalize from experience to form vis own theories of curiosity, including hypotheses about what kind of exploration is most useful for finding unexpected discoveries, and hypotheses for how to use curiosity to make specific, expected discoveries.  These alternate curiosity behaviors can be used alongside the original, programmer-affirmed curiosity behaviors.

Suppose, however, that the first few times the curiosity behaviors are employed, they fail?  Won't the heuristic be disconfirmed through Bayesian negative reinforcement?  Wouldn't an independent drive be more powerful?

Actually, the paradigm of Bayesian reinforcement comes with a built-in way to handle this case.  All that's needed is the belief that curiosity is an action that, very rarely, has a very large payoff.  (14).  Graphically:

Diagram 5: Perseverant curiosity

("D" stands for "discovery".)

This diagram shows the very rare (one out of a thousand tries) usefulness of curiosity, at a very high payoff, affirmed at very high confidence by the human programmers.  If the original affirmed probability is 98%, then it will take 416 consecutive failed tries before the probability assigned to the affirmation goes down to 97%, 947 failed tries before it goes down to 95%, 1694 before it goes down to 90%, 6086 before it goes down to 10%, and 8483 before it goes down to 1%.  That's all without one single success.  The curiosity subgoal can be as perseverant as the human independent drive, as long as the programmers tell the AI in advance that curiosity often doesn't work.
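
The figures in this paragraph can be reproduced with a short calculation.  The sketch below assumes the simplest reading of the diagram: either the affirmation is true and each try succeeds with probability 1/1000, or it is false and tries never succeed.  Each failure then multiplies the odds in favor of the affirmation by 0.999.  The function name and the two-hypothesis framing are assumptions of this sketch.

```python
import math


def failures_until(prior: float, p_success: float, target: float) -> int:
    """Smallest number of consecutive failures that brings P(H) down to `target`.

    H: each try succeeds with probability `p_success`.
    not-H: tries never succeed (an assumption of this sketch).
    """
    prior_odds = prior / (1.0 - prior)
    target_odds = target / (1.0 - target)
    # Each failure multiplies the odds in favor of H by (1 - p_success).
    return math.ceil(math.log(target_odds / prior_odds) / math.log(1.0 - p_success))


for target in (0.97, 0.95, 0.90, 0.10, 0.01):
    print(f"{target:.0%}: {failures_until(0.98, 0.001, target)} failed tries")
```

Under these assumptions the counts agree with the numbers quoted above (416, 947, 1694, 6086, and 8483).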

Regardless of the required distribution of effort, relative strength of behavior, and so on, it should be possible to use affirmations at the right confidences and strengths to produce the required behavior in a cleanly Friendly goal system.  Furthermore, it should be possible to do so using honest affirmations that have the correct derivation structure and confidences/strengths that roughly mirror those of the programmers.  This may take ingenuity.  But getting the AI to do the right thing for the right reasons, instead of escaping through the first visible kludge, is a strategy that often has payoffs far beyond the obvious.

The worst-case scenario for affirmation is that the programmers cannot work out a structurally honest set of derivations that has the desired relative strengths and relative confidences to yield a viable behavior pattern.  If so, a viable behavior pattern can simply be affirmed, as a naked fact, as "leading eventually to Friendliness".  Even such an arbitrary-looking affirmation still has the full Friendship structure and is absorbable, as experience-supported belief, into a cleanly Friendly system.  The primary benefit will admittedly be forward compatibility and future safety rather than present-day intelligence, but forward compatibility - not to mention future safety! - is important enough to justify some small amount of added complexity.  Clean Friendliness is a necessary beginning; it is difficult to see how any of the other aspects of Friendship structure could be applied to a non-causal or non-Friendly goal system.  Thus, there is no good reason to depart from Friendship structure.

3.1.5: Cleanliness is an advantage

Cleanliness should be considered a powerful feature of a goal system, rather than a constraint.  This is made clearer by considering, for example, the idea of an associative or spreading-activation goal system, in which desirability travels along similarity links (rather than predictive links) and is perseverant rather than contingent.  Such a system would exhibit very odd, non-normative, non-useful behaviors.  If a loud noise were desirable, and the system observed lightning flashes in association with thunder, the system would - rather than hypothesizing causation - acquire a "fondness" for luminosity spikes due to spreading desirability, and would then begin happily flashing the monitor, on and off, without noticing or caring that the action failed to produce a loud noise.  An AI with a causal goal system will preferentially seek out useful behaviors.  This not only produces a more useful AI, it produces a smarter AI.  The realm of useful plans exhibits far more interesting complexity and exposes fundamental regularities in underlying reality.
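
The contrast can be made concrete with a toy model.  This is a sketch under stated assumptions, not a description of any real architecture: the causal system derives the desirability of "flash the monitor" from the parent goal times the believed probability of the causal link, and updates that belief after each failure; the associative system simply carries a fixed similarity weight.  All names and numbers are hypothetical.

```python
def causal_desirability(parent_value: float, link_prob: float) -> float:
    """Child desirability is contingent on the parent and on the predicted link."""
    return parent_value * link_prob


def update_link(link_prob: float, p_success: float) -> float:
    """Bayesian update of P(link holds) after one failed attempt."""
    survive = link_prob * (1.0 - p_success)
    return survive / (survive + (1.0 - link_prob))


noise_value = 1.0   # desirability of the loud noise (the parent goal)
link_prob = 0.8     # initial belief that a luminosity spike causes the noise
association = 0.8   # similarity weight in the associative system

for _ in range(5):  # five monitor flashes, none followed by a loud noise
    link_prob = update_link(link_prob, p_success=0.9)
    # The associative system has no rule that consumes the failure,
    # so `association` is left untouched.

causal_flash_value = causal_desirability(noise_value, link_prob)
associative_flash_value = noise_value * association
print(causal_flash_value, associative_flash_value)
```

After a handful of failures the causal system assigns the flashing behavior almost no desirability, while the associative system values it exactly as much as before.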

Does contingency come with a major computational cost?  Given a mind with fast serial hardware, such as silicon transistors, rather than the human mind's 200 Hz neurons, it should be a computationally trivial cost to reverify all parent goals before taking a major action.  Even where propagation is delayed, this is not a structural problem, provided that the goal system, under reflection, considers change-propagation errors to be malfunctions rather than part of the normative functioning of the system.  As long as that condition holds true, any change-propagation errors are "nonmalicious mistakes" that will diminish in severity as the AI grows in competence.  Thus, even if change propagation turns out to be a computationally intractable cost, approximations and heuristic-guided computational investments can be used, so long as they do not affect the system's reflective reasoning about ideally normative goal reasoning.
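
Reverifying parent goals before a major action can be as simple as recomputing desirability on demand instead of caching it.  The following is a minimal sketch, assuming each goal stores only its parent and the believed probability that it leads to that parent; the `Goal` class and the specific probabilities are hypothetical illustrations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Goal:
    name: str
    parent: Optional["Goal"] = None
    link_prob: float = 1.0   # believed P(this goal leads to its parent)
    intrinsic: float = 0.0   # nonzero only for the supergoal

    def desirability(self) -> float:
        """Recompute desirability by walking up to the supergoal."""
        if self.parent is None:
            return self.intrinsic
        return self.link_prob * self.parent.desirability()


friendliness = Goal("Friendliness", intrinsic=1.0)
future_ai = Goal("future Friendly AI", parent=friendliness, link_prob=0.9)
self_improve = Goal("self-improvement", parent=future_ai, link_prob=0.8)

before = self_improve.desirability()
future_ai.link_prob = 0.0   # the predicted link is withdrawn
after = self_improve.desirability()
print(before, after)
```

Because nothing is cached, a change anywhere in the chain is reflected the next time a child goal is evaluated; the cost is one walk up the chain per evaluation.  A system that caches desirabilities would instead need the approximations discussed above.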

As a challenge, I offer the following strong claims about causal, Friendliness-topped, cleanly contingent goal systems:

  1. A causal goal system naturally yields many useful behaviors (and avoids negative behaviors) which would require special effort in an associational, spreading-desirability goal system.
  2. There is no feature that can be implemented in an associational goal system that cannot be implemented equally well in a cleanly Friendly goal system.
  3. There is no case where a cleanly Friendly goal system requires significantly more computational overhead than an associational or non-Friendly goal system.  (15).


