Creating Friendly AI is ©2001 by Singularity Institute for Artificial Intelligence, Inc.  All rights reserved.



3.4.2: Shaper/anchor semantics

3.4.2.1: "Travel AI":  Convergence begins to dawn

According to 3.4.1: External reference semantics, above, the simplest method of grounding the supergoals is an AI model in which there's a something called Friendliness, that the programmers know about, that they tell the AI about, which means that programmer statements are information about Friendliness.

Eventually, of course, the AI will want to know:  How do the programmers know about Friendliness?  Where do they get their knowledge from?  A human, asked this question, may give any number of possible answers; but let us quiet that cacophony by considering the problem from the AI's perspective.

Is there a problem at all, from the AI's perspective?  There are certain assumptions humans are built to make about everything in the Universe.  On hearing that there's a something called "Friendliness" that affects the programmers, a human will have certain expectations about this something, this referent or external cause.  One of the cognitive pressures is called the "minimum encoding" or "minimum coding"; it means that we experience a strong cognitive pressure to prefer compact descriptions to incompact ones.  Thus a human, shown a 2D 100 x 100 array of Cs with one single D in a random position, will either assume that the D is an error, or try to figure out what the cause of the D was.  This happens because all the Cs are not represented, internally, as independent causes.  If the internal representation had an independent cause for each and every one of the 9,999 Cs, then the 1 D would call no attention to itself; it would simply be one of 10,000 independent causes.  But we don't represent the Cs as independent causes.  Instead, we automatically translate our perception into the internal format "100 x 100 array of Cs, with one D."  That is the minimum encoding; much more compact than "C at 1, 1; C at 1, 2; C at 1, 3..."
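As a rough illustration of the minimum-encoding idea, the sketch below compares the two descriptions mentioned above for a 100 x 100 grid.  The function names and the exact wording of the compact description are assumptions made for this example, not part of any particular AI design.

```python
import random
from collections import Counter

def per_cell_description(grid):
    """Describe every cell as an independent fact: 'C at 1, 1; C at 1, 2; ...'."""
    return "; ".join(
        f"{cell} at {r}, {c}"
        for r, row in enumerate(grid, start=1)
        for c, cell in enumerate(row, start=1)
    )

def compact_description(grid):
    """Describe the grid as one background symbol plus a list of exceptions."""
    counts = Counter(cell for row in grid for cell in row)
    background = counts.most_common(1)[0][0]
    exceptions = [
        f"{cell} at {r}, {c}"
        for r, row in enumerate(grid, start=1)
        for c, cell in enumerate(row, start=1)
        if cell != background
    ]
    summary = f"{len(grid)} x {len(grid[0])} array of {background}s"
    if exceptions:
        summary += ", except " + "; ".join(exceptions)
    return summary

def make_grid(size=100, seed=0):
    """Build a size x size grid of 'C' with a single 'D' at a random position."""
    rng = random.Random(seed)
    grid = [["C"] * size for _ in range(size)]
    grid[rng.randrange(size)][rng.randrange(size)] = "D"
    return grid

grid = make_grid()
long_form = per_cell_description(grid)
short_form = compact_description(grid)
print(len(long_form), "characters versus", len(short_form))
print(short_form)
```

In the compact form the single D is the only thing left to explain, which is why it draws attention; in the per-cell form it is just one entry among ten thousand.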

This cognitive pressure, experienced at the level of sensory modalities or concept formation, has its analogue in deliberate thought processes, known as Occam's Razor:  Even if there are only two options for each pixel, C or D, the chance of almost all the pixels being C by sheer coincidence is infinitesimal.  (2^(-9,998) is around 1e-3000, which is "infinitesimal" for our purposes.)  Even a binary qualitative match turns into one heck of a strong structural binding when there are 9,999 terms involved.  Thus, for almost any fact perceived - in this, our low-entropy Universe - it makes sense to hypothesize a common cause, somewhere in the ancestry.  Every star in our Universe would turn to cold iron before a single 1e-3000 chance popped up by pure coincidence.
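The size of that number can be checked with a few lines of arithmetic.  This is only a sanity check; working in logarithms is a choice made here because the value itself is far too small for an ordinary floating-point number.  The result comes out at roughly 2 x 10^-3010, the same ballpark as the round figure quoted above.

```python
import math

matching_cells = 9_998  # exponent used in the text above

# log10 of 2**(-9998); working in logs avoids floating-point underflow.
log10_p = -matching_cells * math.log10(2)
exponent = math.floor(log10_p)
mantissa = 10 ** (log10_p - exponent)

print(f"2^-{matching_cells} is about {mantissa:.1f}e{exponent}")
```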

The question is whether these considerations would apply - either on the level of sensory modalities, or as a conscious process - to the external referent for Friendliness or supergoal.  To take a more morally neutral example, suppose that a technically nonFriendly traffic-control AI, having the "external reference" architecture but not "Friendliness" content, is given, via programmer affirmation, a set of goals that includes getting each vehicle in the city to wherever it's going at 40 miles per hour.  This violates several rules of Friendly AI, but let's consider it anyway, as a possible example.  Instead of "Friendliness", the supergoal's external referent will herein be referred to as "Travelness".

Let's also suppose that, instead of giving a blanket instruction that Travelness means getting "vehicles" to where they're going at 40mph, the programmers instead give a different statement for each individual vehicle:  "It is Travelly to get vehicle 4,383 to its destination at 40mph;" "It is Travelly to get vehicle 4,384 to its destination at 40mph", and so on.  Now suppose that the Travel AI is presented with vehicle 4,145.  Does the AI have any tentative conclusions in the moment before ve hears from the programmers?  Will the AI see anything unusual if ve's told that V4145 should travel at 39mph?  What happens if the programmers are all suddenly hit by a falling refrigerator and the AI needs to make vis own decision about V4145?

In the absence of any attempt by the programmers to set up these intuitions in advance, the answer is hard to predict.  (1).  I would certainly expect any general intelligence to have formulated some expectations about the programmers at this point, but whether this translates into expectations about Friendliness(Travelness) is an entirely different issue.  The sensory/conceptual cognitive pressure of minimum coding, and the heuristics for Occam's Razor, will undoubtedly have been learned, and can correctly be generalized to all physical objects, but it's still possible that the AI may conclude that Friendliness/Travelness is made out of a different kind of stuff than other external referents, and decide not to generalize.

If the AI forms any expectations at all about Travelness - if the AI has any reason at all to believe that Travelness is a low-entropy something - then it's very easy for the AI to conclude that Travelness probably (though not, of course, certainly) says that vehicle 4145 should go to its destination at 40mph.  There are, in fact, several different ways this can happen:

  1. The Travelness supergoal can be believed to have temporal coherence.  The AI can predict that the programmer (a physical, low-entropy object) will produce the sensory information "Vehicle 4145 should go to its destination at 40mph".  If the Travelness supergoal is believed to be temporally coherent, in the sense that what's true today is (by default assumption) true tomorrow and vice versa, then a probabilistic prediction of future sensory information about Travelness is a probabilistic statement about Travelness.  (If the AI believes that Travelness is temporally incoherent - that Travelness actually changes at the exact moment when the programmer delivers the sensory information, or that Travelness is actually unspecified before then - then predictions of this kind have nothing to say about present-day Travelness.)
  2. The Travelness supergoal can be believed to have some kind of content coherence.  Thus, it would be valid to try and find a minimum encoding for known Travelness content, and this minimum encoding could be used to make probabilistic statements about unknown Travelness content.  (If Travelness is believed to be totally content-incoherent, so that each and every truth about each vehicle is a totally independent cause, then all the previous instances of vehicles needing to go say nothing about the next vehicle in line.)
  3. Finally, the AI may make deductions by reasoning about the nature of the (unspecified) link between Travelness and the programmers.  A nondestructive brain scan would demonstrate that the programmers were producing their statements about individual vehicles by cognitive reference to a statement about vehicles in general.  The AI might thereby conclude that the programmers, having been affected by Travelness, were affected in such a way as to produce in the programmers' minds a statement about vehicles in general, rather than Travelness affecting them so as to produce multiple independent statements about individual vehicles, and that the "general statement" contained in the programmers' minds is therefore valid sensory information about Travelness.

Please note the very narrow focus of the above possibilities.  We are not concerned with "proving" that vehicle 4145 should travel at 40mph, nor even the question of whether the AI's expectations are strong enough to compete with programmer-provided sensory information.  We are dealing rather with the question of whether the AI has any expectations at all, be they marked as ever so improbable.  If the AI concludes there's a 10% probability that Travelness has temporal coherence, and a 10% probability that the programmer will say "V4145 should go to its destination at 40mph", that's enough to establish a 1% probability for a statement about Travelness content.  If there's no equal and opposite factor, say a 1% probability that the Travelness content is "V4145 should remain halted (at the usual desirability for Travelness vehicle goals)", then just that isolated "1% probability that V4145 should go" will be enough to move V4145 around... at 1/100th of the usual differential desirability, perhaps, but the vehicle will still move.  Of course, a Friendly AI might have (or deduce) injunctions or ethical injunctions that advise against taking actions based on tenuous reasoning without programmer confirmation, but we are presently discussing a mind somewhat more primitive than that.
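The arithmetic in the preceding paragraph can be written out as a small sketch.  The function names, and the convention of measuring desirability in units of "the usual differential", are assumptions made for illustration only.

```python
import math

def statement_probability(p_coherent, p_programmer_says):
    """Probability of a statement about Travelness content, treating the
    two factors as independent (a simplifying assumption)."""
    return p_coherent * p_programmer_says

def differential_desirability(p_for, p_against, usual_differential=1.0):
    """Net push toward acting, in units of the usual differential."""
    return (p_for - p_against) * usual_differential

p_go = statement_probability(0.10, 0.10)          # 10% times 10%
unopposed = differential_desirability(p_go, 0.0)  # no opposite factor
cancelled = differential_desirability(p_go, 0.01) # equal and opposite 1%

print(p_go, unopposed, cancelled)
```

With no opposing factor, the vehicle is pushed at about one hundredth of the usual strength; an equal and opposite 1% belief that the vehicle should remain halted cancels the push almost exactly.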

A human is veritably supersaturated with cognitive forces that would let us make deductions about Travelness, and any mind-in-general will be saturated with information about how a low-entropy universe works.  However, it is still possible that the generalizations will stop completely short of Travelness - that the AI will not only regard hypotheses about Travelness coherence as low probability, but as entirely unsupported by any known piece of sensory information, because Travelness is believed to be made out of different stuff.

Is there any imaginable configuration of an AI's mind that will naturally avoid all hypotheses about Travelness?  I think so, but it's a pretty exotic configuration.  The AI has to believe - in the sense of seeing no reason to believe otherwise - that Travelness is temporally noncoherent, that Travelness has noncoherent content, and that the programmers' use of a general statement to produce specific statements is irrelevant to the content of Travelness.  Travelness being defined as "What programmers say", if translated into a specifically exotic model, could meet these requirements.  The AI would need to assume that the Travelness referent changes at the exact moment when the programmer's lips move, and that any cognitive or physical processes that take place previous to the programmer's lips moving are irrelevant (insofar as sensory information about Travelness is concerned).  Under those circumstances, all of the above methods for reasoning about Travelness would, in fact, be incorrect; would produce incorrect predictions.  Of course this definition can very easily short-circuit, as depicted in Interlude: Why structure matters.

In fact, it looks to me like any definition which does not enable probabilistic reasoning about the supergoal referent must define the referent as incoherent both temporally and "spatially", and must define the referent as being identical with the sensory information produced about it (or rather, becoming identical at the instant of production of such information, then remaining identical thereafter).

I myself evaluate a very high probability that an AI would somehow wind up with expectations about Travelness unless the programmers made a deliberate attempt to prevent it.  For example, even given an exotic structure which permits no expectations about Travelness in advance of sensory information, the AI could still evaluate a finite (albeit very, very small) probability that sensory information had been produced but dropped, or erased from memory by later intervention, and so on.  In the absence of diabolic misinterpretation, of course, this is a nearly infinitesimal probability and will generate nearly infinitesimal desirability differentials, but still.  Technically, the AI is trying to correct mistakes in the transmission of sensory data, rather than forming expectations about supergoal content, so this doesn't really count from a Friendly-AI structural standpoint.  It does, however, show how hard it is to develop totally incoherent supergoals.  Similarly, even an AI with a "short-circuited" definition of "Travelness" might conclude that the programmer's lips are likely to move and thereby alter Travelness in a certain way, and move vehicle 4145 into position in anticipation of greater future supergoal fulfillment; this is sort of half-way between the two possibilities as far as relevance is concerned.

The atomic definition of convergence, as you'll recall from Interlude: The story of a blob -- what, you say?  You don't recall what was in that topic?  You're not even sure that it was in the same paper?  You can't recall anything from before you started reading 3: Design of Friendship systems, including your own childhood?  I guess you'll just have to start reading again after you've finished.  Anyway, the atomic definition of convergence is when a system makes the same choice given two different conditions; a blob turning to swim towards nutrients, regardless of whether the blob was originally swimming east or west; a "mathblob" adding 2 to 65 and 3 to 64, to achieve 67 in either case.
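The atomic definition can be stated as a one-line test: a system is convergent over a set of conditions if it makes the same choice under all of them.  The sketch below is purely illustrative; the function names, and the "drifter" contrast case, are hypothetical and introduced only for this example.

```python
def is_convergent(system, conditions):
    """True if the system makes the same choice under every listed condition."""
    outcomes = {system(condition) for condition in conditions}
    return len(outcomes) == 1

def blob(initial_heading, nutrient_direction="north"):
    """A blob that turns toward nutrients whatever its starting heading."""
    return nutrient_direction

def mathblob(operands):
    """A 'mathblob' that adds its two operands."""
    a, b = operands
    return a + b

def drifter(initial_heading):
    """A non-convergent contrast case: it just keeps its starting heading."""
    return initial_heading

print(is_convergent(blob, ["east", "west"]))
print(is_convergent(mathblob, [(2, 65), (3, 64)]))
print(is_convergent(drifter, ["east", "west"]))
```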

If a Travel AI with external reference semantics has any expectations at all that "Travelness" is a something, a thing in a low-entropy Universe that at least might obey some of the same rules as other things, then the Travel AI will form expectations about Travelness.  This doesn't require that Travelness be defined as a sentience-independent physical object floating out in space somewhere; all that's required is that Travelness have some definition that's physically derived or that has some connection to our low-entropy Universe.  If you show the Travel AI four thousand similar statements, they'll have at least a little inertia, a little effect on the four-thousand-and-first.
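One way to put a number on that "little inertia" is sketched below.  It uses Laplace's rule of succession as a stand-in for whatever generalization the AI would actually perform, and mixes it with a baseline according to how likely the AI thinks it is that Travelness is content-coherent.  The technique, the function names, and the specific probabilities are all assumptions chosen for illustration.

```python
from fractions import Fraction

def rule_of_succession(matches, trials):
    """Laplace's estimate that the next case matches, given past matches."""
    return Fraction(matches + 1, trials + 2)

def p_next_matches(n_matching, p_coherent, p_baseline):
    """Mix the learned estimate with a baseline.

    p_coherent: probability that Travelness content is coherent at all.
    p_baseline: probability assigned to '40mph' if past statements say nothing.
    """
    learned = rule_of_succession(n_matching, n_matching)
    return p_coherent * learned + (1 - p_coherent) * p_baseline

baseline = Fraction(1, 1000)
with_inertia = p_next_matches(4000, Fraction(1, 100), baseline)
no_coherence = p_next_matches(4000, 0, baseline)

print(float(with_inertia), float(no_coherence))
```

Even a 1% belief in coherence lifts the estimate well above the baseline, while a belief of exactly zero leaves the four thousand statements with no effect at all on the next one.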

This small degree of convergence doesn't prove that the Travel AI will suddenly break free of all human connections - if the Travel AI has a 1%-confidence belief in a 1%-strength correlation, the differential desirabilities are pretty small compared to higher-priority content.  Even so, it would be possible for a programmer to deliberately "damp down" the asserted confidence of some piece of programmer-derived sensory information - tell the AI, "With a confidence of 0.001%, vehicle 4145 should move at 39mph" - and, in the absence of injunctions, the Travel AI will still think it more likely that vehicle 4145 should move at 40mph.  That is, the Travel AI will think it most likely that vehicle 4145 should move at 40mph, regardless of whether (binary branch) the programmer says "With a confidence of 0.001%, vehicle 4145 should move at 40mph", or "With a confidence of 0.001%, vehicle 4145 should move at 39mph."  This is a tiny, tiny amount of convergence, and a tiny, tiny amount of programmer independence (2) - but it's there.
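The binary branch described above can be sketched as a toy model.  The additive scoring rule, the function name, and the particular prior weight are assumptions made for this example; the point is only that a sufficiently damped programmer statement does not change the outcome.

```python
def most_likely_speed(prior, asserted_speed, asserted_confidence):
    """Add the programmer's (possibly damped) assertion to the AI's prior
    weights and return the speed with the highest total weight."""
    scores = dict(prior)
    scores[asserted_speed] = scores.get(asserted_speed, 0.0) + asserted_confidence
    return max(scores, key=scores.get)

prior = {40: 0.01}   # weak expectation formed from earlier vehicles
damped = 0.00001     # "with a confidence of 0.001%"

# Both branches of the programmer's statement lead to the same choice.
choices = {most_likely_speed(prior, speed, damped) for speed in (39, 40)}
print(choices)

# An undamped statement still overrides the weak prior.
print(most_likely_speed(prior, 39, 0.99))
```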

3.4.2.2: Some forces that shape Friendliness:  Moral symmetry, semantics of objectivity

The basic cognitive structure for external reference semantics leaves unspecified where Friendliness content actually comes from; so, as depicted above, it's possible to come up with different definitions, and different evaluated probabilities based on the different definitions.  A "Friendly AI" with external reference semantics and nothing else may not contain any information at all that would help the Friendly AI make a decision about where the programmers get their knowledge about supergoals.  A positive outcome would be if the Friendly AI assumed that the programmers, who know about Friendliness content, are also the most reliable source for information about where Friendliness comes from, and thus accepted the programmers' statements as sensory information.  However, without a priori knowledge or causal validity semantics, this generalization would have to be made blindly.  (3).

As with Friendliness supergoal content itself, the issue of "Where does Friendliness come from?" is complex enough that no snap answer should be embedded as "correct by definition" in the AI.  We can thus immediately see that the reply to "Where does Friendliness come from?" requires a method for learning the answer, rather than a snap answer embedded in the code and enshrined as correct by definition.  Similarly, external reference semantics provide a method for growing and correcting whatever interim answer is being used as Friendliness content.  However, this section, "shaper/anchor semantics", is just about the interim answer used by Creating Friendly AI for "Where does Friendliness come from?", in the same sense that Creating Friendly AI uses the volition-based definition when talking about Friendliness content.  It's the later section 3.4.3: Causal validity semantics that closes the loop, explaining how, e.g., a Friendly AI with only external reference semantics could acquire shaper/anchor semantics.

As humans, of course, we leave the factory with built-in causal validity semantics.  These intuitions will now be applied to the question at hand:  Where does Friendliness come from?

Supposing I were to ask you a question about Friendliness, where would you get your answer?  What are some of the forces that might affect your answer?  Suppose, for example, that someone were to propose to you that the request of a human whose last name begins with a 'p' is worth only half as much (has half as much desirability) as the request of a human whose last name begins with any other letter.  (4).  In this case, the primary reason for your instant rejection is fairly simple:  You don't perceive any cause whatsoever for that modification, and you have a minimum-encoding, low-entropy perception of Friendliness.

Suppose the suggester presses her case - for example, by offering a reason such as "Well, I'm human, a volitional entity, and what I want matters, and I want a Friendly AI to dislike people whose last names start with 'p' - in fact, I'll experience mental anguish if I live in a world where Pfolk are equal citizens."  In this case, you would probably bring up an argument to the effect that all people (all humans, all sentient beings, et cetera) should be treated as morally equal, which you regret to announce is an overriding factor as far as you're concerned.  (I would say the same thing, by the way.)  If the suggester presses the case further, it will probably be by announcing that the Pfolk were shown to be responsible for 90% of all mime performances, and therefore deserve whatever they get.  Let's stop the argument here for a moment, and try to look at some of the underlying forces.

Moral equality is not only a powerful ambient meme of the post-WWII era, but also a very direct manifestation of a panhuman cognitive pressure towards moral symmetry.  This is probably not the best term, since it's very close to "moral equality", but it's the best one I can offer.  Moral symmetry is supported by three cognitive forces; first, the way we model causality; second, our having evolved to persuade others; third, our having evolved to resist persuasion.

For humans engaged in moral argument, everything needs to be justified.  We expect a cause for all decisions as we expect a cause for all physical events.  Someone who doesn't think that a decision requires a cause is not only cognitively unrealistic - it's really hard to imagine acausal anything - but subject to exploitation by anyone with a more coherent mind.  "Could you give me all of your money?"  "Why should I?"  "There's no reason whatsoever why you should."  "Okay, I'll do it!"  Similarly, someone who believes that decisions are acausal is likely to be an ineffective persuader:  "Could you give me all of your money?"  "Why should I?"  "There's no reason why you should."  "Ummm... no."

In the discussion on "moral deixis", the example was given of John Doe saying to Sally Smith, "My philosophy is:  Look out for John Doe.", and Sally Smith hearing, "Your philosophy should be:  Look out for Sally Smith.", rather than hearing:  "Your philosophy should be:  Look out for John Doe."  The conclusion there was that we have very strong expectations of speaker deixis and automatically substitute the [speaker] variable for any heard self-mention.  The conclusion here is that if John Doe expects Sally, like himself, to have a built-in instinct for the protection of John Doe, John is doomed to disappointment.  For John Doe to be an effective persuader, he must make use only of cognitive forces that he has a reasonable expectation will exist in Sally's mind.  He can send an argument across the gap either by recasting his arguments to appeal to Sally's own observer-centered goals, or by using the semantics of objectivity.

The latter really ticks off the moral relativists, of course.  Moral relativists insist that no objective standard of morality exists, and that arguments that use the semantics of objectivity are automatically flawed, thereby appealing to the universal human preference for unflawed arguments; they then go on to use moral relativism to argue against some specific moral principle as being ultimately arbitrary, thereby appealing to the universal human prejudice against arbitrariness.  Um, full disclosure:  I hate moral relativism with a fiery vengeance, and I hate cultural relativism even more, but rather than going on a full-scale rant, I'll (for the moment) just state my position that any public argument is, de facto, phrased in terms which appeal to a majority of listeners.  If a moral relativist wants to appeal to an audience prejudice against arbitrariness by saying that all morality is arbitrary and therefore Friendliness is arbitrary, I'm justified in using the criteria of that audience prejudice against arbitrariness as my objective standard for arguing whether or not Friendliness is arbitrary.

This doesn't prove that total moral relativism is logically inconsistent; (honest) evangelism of total moral relativism is logically inconsistent, but it's theoretically possible that there could be millions of logically consistent, honest, total moral relativists keeping their opinions private.  However, arguing with me in front of an audience about moral relativism only makes sense relative to some agreed-upon base layer held in common by the audience and both debaters; all I need to do is show that Friendliness meets the criterion of that base layer.  See 3.4.3.6: Objective morality, moral relativism, and renormalization below.

Anyway, the upshot is that, for any sort of diverse audience, humans generally use the semantics of objectivity, by which I mean that a statement is argued to be "true" or "false" without reference to data that the audience/persuadee would cognitively process as "individual".  (Whether the appealed-to criteria are human-variant data that the audience happens to have in common, or panhuman complex functional adaptations, or characteristics of minds in general, or even a genuine, external objective morality, is irrelevant to this particular structural distinction.)  Appeals to individualized goals are usually saved for arguing over what kind of pizza to get, or convincing someone to be your ally in office politics, and so on - individual interactions, or interactions with a united audience.  Thus, when humans talk about "morality", we generally refer to the body of cognitive material that uses the semantics of objectivity.

This holds especially true of any civilization that's been around long enough to codify the semantics of objectivity into a set of declarative philosophical principles, or to evolve philosophical memes stating that observer-centered goals are morally wrong.  Even if some subgroup within that civilization (Satanists, moral relativists, Ayn Rand's folk) has a philosophy that makes explicit reference to observer-centered goals, the philosophy will have an attached justification stating, in the semantics of objectivity, the reason why it's okay to appeal to observer-centered goals.  Anyone who grows up in a civilization like that is likely to have a personal philosophy built from building blocks and structure that grounds almost exclusively in statements phrased in the semantics of objectivity, and has a reasonable expectation that a randomly selected other citizen will have a similarly constructed philosophy, enabling the semantics of objectivity to be used in individual interactions as well.

The semantics of objectivity are also ubiquitous because they fit very well into the way our brain processes statements; statements about morality (containing the word "should") are not evaluated by some separate, isolated subsystem, but by the same stream of consciousness that does everything else in the mind.  Thus, for example, we cognitively expect the same kind of coherence and sensibility from morality as we expect from any other fact in our Universe.

In the example given at the start of this subsection, someone had just proposed discrimination against the Pfolk; that the request of a person whose last name starts with "p" should be valued (by a Friendly AI) at one-half the value of any other citizen.  So far, the conversation has gone like this:  "Discriminate against the Pfolk."  "Not without a reason."  "I'll be unhappy if you don't."  "That's not a reason strong enough to override my belief in moral equality."  "Pfolk are responsible for 90% of all mime performances, so they deserve what they get."

In that last parry, in particular, we see an appeal to moral symmetry.  Moral symmetry is a cognitive force, not a moral principle (the moral principle is "moral equality"), but if we were to try and describe it, it would go something like this:  "To apply an exceptional moral characteristic to some individual, the exceptional moral characteristic needs to be the consequence of an exceptional attribute of the individual.  The relation between individual attribute and moral characteristic is subject to objectivity semantics."  (5).  There's a very strong cognitive pressure to justify philosophies using justifications, and justifications of justifications, that keep digging until morally symmetric, semantics-of-objectivity territory is reached.

There's an obvious factual component to the statement "Pfolk are responsible for 90% of all mime performances, so they deserve what they get" - as far as I know, Pfolk are not responsible for 90% of all mime performances, not that I've checked.  In this case, the factual reference is very near the surface; however, factual references quite often pop up, not just during moral debates, but during philosophical debates (in discussions about how to choose between moralities).  This, again, is a consequence of our brains using the same semantics of objectivity for facts and morality; the very distinction between "facts" and "morality" (or "supergoals" and "subgoals", for that matter) is a highly sophisticated discrimination, so it's not surprising that the two are mixed up in ordinary discourse.  This is not important to the present discussion, but it will become important shortly.

We've now seen several factors affecting our beliefs about Friendliness, our beliefs about supergoals, and our beliefs about morality (communicable supergoals).  Some of them are high-level moral beliefs, such as moral equality.  Some of them are more intuitive, such as moral symmetry and our tendency to "put yourself in the other's shoes".  Some lie very close to the bottom layer of cognition, such as our using a single brainwide set of causal semantics for all thoughts, including thoughts about morality.

3.4.2.3: Beyond rationalization

We use the whole of our existing morality to make judgements about the parts.  Of course, since we're humans, with observer-biased beliefs and so on, this trick often doesn't work too well.  However, so long as you have enough seed morality to deprecate the use of observer-biased beliefs, and you happen to be a seed AI with access to your own source code, "nepotistic" self-judgements should not occur - that is, if the system-as-a-whole has a valid reason to make a negative judgement of some particular moral factor, then that negative judgement (modification of beliefs) will not be impeded by the current beliefs.  Nor will the fact that some particular judgement winds up contradicting a previously "cherished" (high confidence, high strength, whatever) belief be experienced as a cognitive pressure to regard that judgement as invalid - that's also a strictly human experience.

A default human philosophy (i.e., one that operates under the evolved design conditions) is a system that interestingly contradicts itself.  (Note that the word is default, not average.)  A default human will phrase all his moral beliefs using the semantics of objectivity (for the reasons already discussed), but trash all actual objectivity through the use of observer-biased beliefs - a complex process of "rationalization", whereby conclusions give birth to justifications rather than the other way around.  Because of this rationalization-based disjunction between morality and actions, the two have been, to some degree, pushed around independently by evolution.  As long as rationalization (at sufficient strength and reliability) is already present as an adaptation, evolution can freely modify moral reasoning to use the semantics of objectivity, or justification by public benefit, without causing the organism to actually make objective decisions or act for the public benefit, both of which might be a disadvantage (from evolution's perspective, anyway).  (As always in discussions of evolutionary psychology, readers should bear in mind that evolutionary motives are not the same as cognitive motives, and that evolutionary hypocrisy does not imply cognitive hypocrisy.  All actual actions being equal, cognitive hypocrisy is a disadvantage in social situations; humans are imperfectly deceptive.)

Actually, rationalization does not totally disjoint morality and actions; it simply gives evolution a greater degree of freedom by loosely decoupling the two.  Every now and then, the gene pool or the memetic environment spits out a genuine altruist; who, from evolution's perspective, may turn out to be a lost cause.  The really interesting point is that evolution is free to load us with beliefs and adaptations which, if executed in the absence of rationalization, would turn us into total altruists ninety-nine point nine percent of the time.  Thus, even though our "carnal" desires are almost entirely observer-centered, and our social desires are about evenly split between the personal and the altruistic, the adaptations that control our moral justifications have strong biases toward moral symmetry, fairness, truth, altruism, working for the public benefit, and so on.

Evolution, however, being blind, foresaw neither the printing press nor the Internet.  Rather than social morality and personal rationalization being equally balanced, those philosophies that are socially acceptable now have a chance to build up over time.  Since the second half of the twentieth century, in particular, humanity has accumulated a lot of ambient memes explaining how to overcome rationalization.  The conflict between rationalization-corrupted altruism and genuine altruism is now proverbial; equally, our conscious tendency to believe that we are altruistic automatically biases us towards thinking that a way to overcome rationalization is a good thing.  Humanity as a whole has not yet won the conflict between moral intuitions and rationalization, but the balance is definitely not what it was in the ancestral environment.

A human trained in evolutionary psychology can pretty much win completely, as long as he starts his journey with the surface belief that altruism is good and rationalization is bad.  For an AI, there isn't even a contest - again, as long as the seed philosophy starts out with a surface belief validating altruism and deprecating rationalization.  It shouldn't even matter where this belief comes from, so long as it's there.

3.4.2.4: Shapers of philosophies

DEFN: Shaper:  A shaper is a philosophical affector, a source of supergoal content or a modifier for other shapers; a belief in the AI's philosophy; a node in the causal network that produces supergoal content.

The essential idea underlying shaper/anchor semantics is that, rather than transferring only the outputs of our philosophies to the AI as supergoal content, we also transfer the philosophies themselves, so that the AI can guess our responses, produce new supergoal content and revise our mistakes.  This doesn't just mean the first-order causes of our decisions, such as moral equality, but the second-order and third-order causes of our decisions, such as moral symmetry and causal semantics.  Shapers can validate or deprecate other shapers:  For example, memetic survival rates play a large part in morality, but we're likely to think that the differential survival of memes using objectivity semantics is a "valid" shaper, while deprecating and trying to compensate for the differential survival of memes appealing to hatred and fear.  Rationalization is strongly deprecated; false factual beliefs acting as shapers are even more strongly deprecated.

Anchors are described in more detail below and allow the AI to acquire shapers by observing humans, or by inquiring into the causes of current philosophical content.

Shaper/anchor semantics serve the following design purposes:

3.4.2.4.1: SAS:  Correction of programmer errors

A generic goal system can detect and correct programmer errors in reasoning from supergoals to subgoals.

A shaper-based goal system can detect and correct programmer errors where the error is made explicit - for example, where the programmer says "Moral principle B is the result of shaper A", and it's not - the programmer is (detectably) engaging in cognitive activity of a type labeled by the current system as "rationalization", or is making a conclusion influenced by beliefs that are factually incorrect.

A shaper/anchor Friendly AI can detect and correct implicit errors - for example, where the programmer says "A is moral", and the AI deduces that the statement is made as a result of within-the-programmer shaper B, plus a factual error C.  The Friendly AI can assimilate B, correct the factual error C (producing correct factual belief D), and use B plus D to produce the correct moral statement E.  (There are several emendations to this general principle, both under anchoring semantics and under causal validity semantics; see below.)
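The correction of an implicit error can be illustrated with a toy sketch.  Everything in it is a hypothetical stand-in rather than a proposed design:  a "shaper" is reduced to a plain rule mapping factual beliefs to a moral statement, and the names Shaper, correct_implicit_error, and x_hurts_people are invented for illustration.  The point is only the shape of the inference:  keep shaper B, swap factual error C for corrected belief D, and re-derive the conclusion E.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Facts = Dict[str, bool]


@dataclass(frozen=True)
class Shaper:
    """Toy stand-in for a shaper: a rule from factual beliefs to a moral statement."""
    name: str
    rule: Callable[[Facts], str]


def correct_implicit_error(shaper: Shaper, programmer_facts: Facts,
                           verified_facts: Facts) -> str:
    """Re-derive the moral statement from the assimilated shaper (B)
    and the corrected factual beliefs (D)."""
    corrected = {**programmer_facts, **verified_facts}  # verified facts override errors
    return shaper.rule(corrected)


# Shaper B (hypothetical): "things that hurt people are not moral".
avoid_harm = Shaper(
    name="avoid_harm",
    rule=lambda facts: "X is not moral" if facts["x_hurts_people"] else "X is moral",
)

programmer_facts = {"x_hurts_people": False}  # factual error C
verified_facts = {"x_hurts_people": True}     # correct factual belief D

statement_a = avoid_harm.rule(programmer_facts)  # what the programmer said (A)
statement_e = correct_implicit_error(avoid_harm, programmer_facts, verified_facts)  # E

print(statement_a)  # X is moral
print(statement_e)  # X is not moral
```

The hard part, which the sketch assumes away, is deducing B and C from the bare statement "A is moral" in the first place.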

3.4.2.4.2: SAS:  Programmer-independence

Thanks to the way human society works, humans have a strong need to justify themselves.  Thanks to the memetic rules that govern an interaction between any two people, the more often you ask "Why?" - the farther down you dig - the more likely the person is to attempt to justify herself in terms that are universal relative to the target audience.  Sometimes these universals are cultural, but, since nobody makes a deliberate effort to use only cultural justifications, sometimes these universals happen to be panhuman attributes, or even very deep attributes like causal semantics (which could plausibly be a property of minds in general).  Very often the justifications are rationalizations, of course, but that doesn't matter; what matters is that as long as the AI learns the given justifications instead of the surface decisions, the AI will tend to wind up with a morality that grounds in human universals - certainly, a philosophy that's a lot closer to grounding in human universals than the philosophy of whichever human was used as a source of philosophical data.  Another way of phrasing this is that if you seat two different humans in front of two different AIs with complete structural semantics, you'll tend to wind up with two AI philosophies that are a lot closer than the philosophies of the two humans.

Also, of course, the surface of the philosophy initially embedded in a competently created Friendly AI would have a strong explicit preference that validates programmer-independence and deprecates programmer-dependence.  An extension of this principle is what would enable a Friendly AI to move from exhibiting normative human altruism to exhibiting the normative altruism of minds in general, if the philosophy at any point identifies a specific difference and sees it as valid.

NOTE: When I say the "surface" of the philosophy, I refer to the proximate causes that would be used to make any given immediate decision.  It doesn't imply shallowness so much as it implies touching the surface, if you think of the philosophy as a system in which causes give rise to other causes and eventually effects; a really deep, bottom-layer shaper can still be "surface" - can still produce direct effects as well as indirect effects.

If you seat two different humans with an explicit, surface-level preference for programmer-independent Friendliness in front of two AIs with complete structural semantics, you will quite probably wind up with two identical AIs.  (In later sections I'll make all the necessary conditions explicit, define programmer-independence, and so on, but we're getting there.)

3.4.2.4.3: SAS:  Grounding for external reference semantics

In a later section, I give the actual, highly secret, no-peeking target definition of Friendliness that is sufficiently convergent, totally programmer-independent, and so on.  Hopefully, you've seen enough already to accept, as a working hypothesis, the idea that a philosophy can be grounded in panhuman affectors.  The programmers try to produce a philosophy that's an approximation to that one.  Then, they pass it on to the Friendly AI.  The Friendly AI's external referent is supposed to refer to that programmer-independent philosophy, about which the programmers are good sources of information, as long as the programmers give it their honest best shot.  This is not a complete grounding - that takes causal validity semantics - but it does work to describe all the ways that external reference semantics should behave.  For example, morality does not change when words leave the programmers' lips, it is possible for a programmer to say the wrong thing, the cognitive cause of a statement almost always has priority over the statement itself, manipulating the programmer's brain doesn't change morality, and so on.

(Note also that an AI with shaper semantics cannot nonconsensually change the programmer's brain in order to satisfy a shaper.  Shapers are not meta-supergoals, but rather the causes of the current supergoal content.  Supergoals satisfy shapers, and reality satisfies supergoals; manipulating reality to satisfy shapers is a non-sequitur.  Thus, manipulating the universe to be "morally symmetric", or whatever, is a non-sequitur in the first place, and violates the volition-based Friendliness that is the output of moral symmetry in the second place.)

(Of course, if the anticipation of supergoal satisfaction is set up in the wrong way, and the definition of Friendliness is at least partially grounded in the human species, it is conceivable that an ends-justify-the-means case of manipulating the human species' beliefs about morality, in anticipation of greater "supergoal satisfaction" later on, could short-circuit the goal system.  More about that later.  But at least it would be a case of ends justifying means; under meta-supergoals, where the supergoal of avoiding nonconsensual mindspace violation is a strict subgoal of maximizing human satisfaction with the supergoals, the short-circuit isn't even seen as temporarily undesirable.)

3.4.2.5: Anchors

DEFN: Anchor:  Anchor semantics are a structural attribute that enables the AI to discover and absorb philosophical content even if the programmers themselves are unaware of it.  An "anchor" is an element of the AI's current philosophical content - supergoal content, subgoal content, or shaper content - whose causal antecedents are not fully known.

An anchor happens when the human says "I think X is a good idea" but isn't clear about why, or isn't sure she is fully aware of all the valid factors that played a part in the decision.  An anchor is a reference to whichever cognitive events were occurring at the space-time event that was the human making the decision to add the anchor.  Dereferencing the anchor will probably be carried out by (externally) asking the programmer probing questions, or by (internally) using Bayesian reasoning, Occam's Razor, known facts about humans, et cetera.  (Of course, internal reasoning of this type may tend strongly towards the probabilistic rather than the near-certain.)  In theory, a Transition Guide - a successful seed AI turned superintelligence - could initiate a nondestructive brain scan of the programmer's neural mind state and find out all the cognitive causes of the human's belief in the anchoring point.  I mention this possibility mainly to show that an anchor is a reference to information that is entirely objective - a physical fact.  However, the referent of the anchor, once the anchor is successfully dereferenced, becomes standard information about humans - i.e., it has as much (or as little) effect as would a direct statement by the human.

3.4.2.5.1: Positive anchors

A positive anchor is an action or state marked as anchor-desirable; a positive anchor is the AI's conclusion about probabilistic shaper/supergoal/subgoal content, made from the sensory information that is a human stating "This action/state is desirable [for reasons X], but I'm not completely sure why [that those are all the reasons]."

A positive anchor would enable (for example) the humans to ask the AI to do something that is clearly a subgoal rather than an end in itself, but which was not a subgoal of any currently known (to the AI) supergoal content, nor a subgoal of the unknown supergoal content that supports subgoals regarding the goal system itself.  This could also apply to shaper content or supergoal content.  If the humans felt that something was a good idea, but not an end in itself, and weren't sure why, it would become a positive anchor.  I strongly feel that a positive anchor should be used with caution to ensure that it behaves as a reference to the causes that were in fact in the minds of the humans, and not as a rationalization target that promotes any line of reasoning that happens to intersect it.

A positive anchor should be distinguished from an ordinary programmer affirmation.  An anchor is when the humans don't know, or aren't sure they know completely.  A programmer affirmation is when the AI is too young to understand the justification, or the humans just haven't got around to explaining it.  A positive anchor is a request that the AI try to discover the philosophical causes on vis own.  A programmer affirmation is a request to trust the programmers today until the explanation is provided tomorrow.  Under vanilla external reference semantics, without any information about shapers, all supergoal content is programmer-affirmed.  (However, an AI that has shaper/anchor semantics but is provided only with external-reference content still has the philosophical capability to hypothesize a line of reasoning under which the programmers are "wrong", just as a general intelligence acting entirely on programmer-affirmed data would possess the same capability with respect to hypotheses about the physical world.)  ERS content might also be analogized to a programmer-affirmed subgoal supported entirely on trust (i.e., rather than supported by a programmer-affirmed link to a parent goal) - the programmers saying "We think this is a good idea, and we know more than you do, and you don't have any reason to believe we're wrong, so go ahead and do it - it's more likely to work out than not."

3.4.2.5.2: Negative anchors

A negative anchor is an action or state marked as anchor-undesirable; a negative anchor is the AI's conclusion about probabilistic shaper/supergoal/subgoal content, made from the sensory information that is a human stating "This action/state is undesirable [for reasons X], but I'm not completely sure why [that those are all the reasons]."

Like a positive anchor, the power of a negative anchor is to lend greater resilience to a chain of logic.  Suppose there's a negative subgoal marked as a negative anchor.  Even if a key justification - dimly understood by the programmers, but not represented at all by the AI - is left out, and the humans originally justified the negative subgoal using a fallacious argument, the collapse of that fallacious argument will not completely extinguish the negative anchor.  Instead, the AI will hypothesize, at whatever degree of probability, that there is an unknown but valid reason for the negative anchor, and go looking for it.
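As a rough numerical illustration of this resilience, the sketch below combines two sources of support for "this action is undesirable":  the explicitly stated argument, and the unknown-but-possibly-valid reason that the anchor stands in for.  The noisy-OR combination rule and the specific probabilities are assumptions made for the sake of the example; they are not part of the anchor definition.

```python
import math


def undesirability(stated_reason: float, unknown_reason: float) -> float:
    """Noisy-OR (an assumed combination rule): the action counts as undesirable
    if at least one of the two supporting reasons is valid."""
    return 1.0 - (1.0 - stated_reason) * (1.0 - unknown_reason)


ANCHOR_PRIOR = 0.3  # hypothetical probability that some unstated valid reason exists

before = undesirability(0.9, ANCHOR_PRIOR)          # stated argument still believed
after_collapse = undesirability(0.0, ANCHOR_PRIOR)  # stated argument shown fallacious
without_anchor = undesirability(0.0, 0.0)           # same collapse, no anchor at all

print(round(before, 2), round(after_collapse, 2), without_anchor)  # 0.93 0.3 0.0
assert math.isclose(after_collapse, ANCHOR_PRIOR)
```

With the anchor in place, the collapse of the stated argument lowers the AI's estimate to the anchor's own probability instead of to zero, which is what leaves the AI with a reason to go looking for the missing justification.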

Rationalization away from negative anchors seems less inherently dangerous than rationalization towards positive anchors, but psychological realism is still an important constraint.

Note also that a Friendly AI is capable of hypothesizing that something has, or should have, anchor status - i.e., that another shaper exists behind it, or that it's a good idea to look for such a shaper - even if the humans don't tell ver so.

3.4.2.5.3: Anchor abuse

Programmer affirmation of anchor status, like any other programmer affirmation, is dependent on the AI's belief that the programmers are likely to be correct - in Bayes logic, that the programmer has a small chance of offering an anchor in a case where no anchor is necessary.

Anchors, like any assertions, have probabilities and strengths.  Because of our human absolutist psychology, it's tempting to take something that we really don't want the AI to discard and assign it maximum-probability, maximum-strength anchor status - for example, to make "Don't kill humans" a maximum-probability negative anchor.  One, this is adversarial.  Two, this is confusing negative anchors with ethical injunctions.  Three, this is lying to the AI - technically, making a mistake that has a high probability of being a common cause for lots of other mistakes - and goes that little extra length towards precipitating a philosophical crisis.

Negative anchors, as a tool, exist when there is uncertainty on your part that you know all the reasons why something is undesirable, or when you have cause to believe there's a reason that isn't on the list.  If you think that there would be a large negative penalty for forgetting a reason, causing the AI to incorrectly deprecate [whatever], then that's an injunction which says that [whatever] should be treated as having negative anchor status.

Ultimately, what you say is simply sensory information to the AI.  If your nervousness about some negative point being violated causes you to fill the air with negative anchors, then the AI will, quite correctly, deduce that all the negative anchors are a result of your nervousness.  So if the AI has some reason (technically, the Bayesian prior, independent of programmer affirmations) to think that your nervousness is wrong, then that hypothesis would invalidate all the anchors as well - or rather, would invalidate all the sensory information produced as a consequence of your nervousness, no matter how strong or how high a probability you asked the AI to assign to the negative anchors.  If you honestly think logic of this type is more likely to indicate failure of Friendliness on the AI's part than failure of normative altruism on your own part, then that is an ethical injunction, not a negative anchor.
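The common-cause point can be made concrete with a small Bayesian sketch.  It is a minimal illustration under stated assumptions:  the posterior function, the prior of 0.5, and the likelihood ratios are all hypothetical, and the model treats "how much the AI trusts the programmer's nervousness" as a single likelihood ratio.

```python
def posterior(prior: float, likelihood_ratio: float, n_pieces: int) -> float:
    """Odds-form Bayes update for n_pieces of evidence assumed independent."""
    odds = (prior / (1.0 - prior)) * likelihood_ratio ** n_pieces
    return odds / (1.0 + odds)


PRIOR = 0.5  # hypothetical prior that the point really is undesirable

# Ten anchors naively counted as ten independent pieces of evidence.
naive = posterior(PRIOR, 2.0, 10)        # about 0.999

# The same ten anchors traced back to one cause (the programmer's nervousness):
# they count once, however many times they were asserted.
common_cause = posterior(PRIOR, 2.0, 1)  # about 0.667

# If the AI has reason to think the nervousness is uninformative (ratio 1.0),
# the anchors move nothing, no matter how many there are.
doubted = posterior(PRIOR, 1.0, 10)      # 0.5

print(round(naive, 3), round(common_cause, 3), doubted)
```

Piling on further anchors changes only the naive figure; the evidential weight of the whole batch is set by the credibility of the single cause behind it.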

3.4.2.6: Useful shaper content requires high intelligence

Shapers take place in a domain - philosophy - that is one of the most sophisticated domains around.  It is possible that even a young AI will be able to manipulate information about shapers in such a way as to make useful deductions in advance of programmer-provided information, and even possible that a young AI will be able to correctly second-guess the programmers.  But I wouldn't bet on it.  I would be less surprised to find the young AI's predictions being totally useless, or right only by coincidence.

It might be wise to try and teach an AI shaper content before vis intelligence has reached the necessary level, as long as everything is running at very low intrinsic probabilities, so that the AI simply gains experience by being proved wrong over and over again - still useful data.  However, this does mean that it may be harder to test shaper content (philosophy) at this stage - to verify that it actually does work, as opposed to doing as much untested work in advance as possible.  Depending on the intrinsic structure of the hard takeoff trajectory, this may mean that it's impossible to achieve supersaturated Friendliness in this area in advance of the point in time where a hard takeoff becomes possible.  In fact, it may even mean that shaper content is still basically useless - insufficient intelligence to be reliable at all, much less human-competitive - at the point where a hard takeoff first becomes possible.  I'll talk about how to solve this class of problem later (in 3.5: Developmental Friendliness).

However, abstract reasoning about shaper semantics, without reference to specific shapers, may still be enough to establish important early beliefs.  For example, the idea that a later programmer knows more than the same programmer at an earlier point in time, and hence that later statements should override earlier ones, requires understanding that the programmer is not a black box; that the programmer can make a "mistake" and "correct" it.  Thus, an AI may have need of shaper semantics in an early stage of development, even if shaper content must wait on the acquisition of sufficient intelligence.


