Creating Friendly AI is ©2001 by Singularity Institute for Artificial Intelligence, Inc.  All rights reserved.



3.4.3: Causal validity semantics

Causal validity semantics subsume both external reference semantics and shaper/anchor semantics.  Causal validity semantics:

3.4.3.1: Taking the physicalist perspective on Friendly AI

An AI's complete mind-state at any moment in time is the result of a long causal chain.  We have, for this moment, stopped speaking in the language of desirable and undesirable, or even true and false, and are now speaking strictly about cause and effect.  Sometimes the causes described may be beliefs existing in cognitive entities, but we are not obliged to treat these beliefs as beliefs, or consider their truth or falsity; it suffices to treat them as purely physical events with purely physical consequences.

This is the physicalist perspective, and it's a dangerous place for humans to be.  I don't advise that you stay too long.  The way the human mind is set up to think about morality, just imagining the existence of a physicalist perspective can have negative emotional effects.  I do hope that you'll hold off on drawing any philosophical conclusions until the end of this topic at the very least.

That said...

The complete causal explanation for any given object, any given spacetime event, is the past light cone of that event - the set of all spacetime events that can be connected, by a ray of light, to the present moment.  Your light cone includes the entire Earth as of one-eighth of a second ago, any events that happened on the Sun eight-and-change minutes ago, and the Centauri system as of four-and-change years ago.  Your current past light cone does not include events happening on the Sun "right now" (1), and will not include those events for another eight-and-change minutes (2); until a ray of light can reach you from the Sun, all events occurring there are causally external to this present instant.
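The light-cone condition can be made concrete with a small sketch.  The function name and the rounded light-travel times below are illustrative assumptions, not anything defined in this document; the only claim is that an event can be a cause of the present instant if its light has had time to arrive.

```python
# Approximate light-travel times, rounded for illustration only.
LIGHT_DELAY_SECONDS = {
    "Sun": 8.3 * 60,                           # roughly 8.3 light-minutes
    "Alpha Centauri": 4.37 * 365.25 * 86400,   # roughly 4.37 light-years
}


def in_past_light_cone(seconds_ago: float, light_delay_seconds: float) -> bool:
    """True if a signal leaving the event at light speed has reached the observer by now."""
    return seconds_ago >= light_delay_seconds


sun = LIGHT_DELAY_SECONDS["Sun"]
centauri = LIGHT_DELAY_SECONDS["Alpha Centauri"]
year = 365.25 * 86400

print(in_past_light_cone(0.0, sun))            # an event on the Sun "right now"
print(in_past_light_cone(9 * 60, sun))         # an event on the Sun nine minutes ago
print(in_past_light_cone(4 * year, centauri))  # Centauri, four years ago
print(in_past_light_cone(5 * year, centauri))  # Centauri, five years ago
```

Under these rounded figures, the first and third events are still causally external to the present instant, while the second and fourth are inside the past light cone.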

The past light cone of a Friendly AI starts with the Big Bang.  Stars coalesce, including our own Sol.  Sol winds up with planets.  One of the planets develops complex organic chemicals.  A self-replicating chemical arises.  Evolution begins, as detailed (a bit metaphorically) in Interlude: The story of a blob.  The first convergent behaviors arise - except that under the physicalist perspective, they are neither "convergent", nor "behaviors".  They happen a certain way, or they don't.  They are simply historical facts about blobs and genes.  If one were to go so far as to abstract a description from them, it would consist of statistical facts about how many events fit certain descriptions, such as "blob swims towards nutrients".

The blobs grow more complicated, nervous systems arise, and goal-oriented behaviors begin to give way to goal-oriented cognition.  Entities arise that represent the Universe, model the Universe, and try to manipulate the Universe towards some states, and away from others.  Eventually, sentient entities arise on a planet.  In at least one sentient, the representation of the Universe suggests that it can be manipulated toward certain states and away from other states by building something called "Friendly AI" and imbuing this AI with a model of the Universe and differential desirabilities.  The sentient(s) carry out the actions suggested by this belief and a Friendly AI is the result.

The Friendly AI has a final cognitive mind-state (that is, a final set of physically stored information) which is causally derived from the Friendly AI's initial mind-state, which is causally derived from the programmers' keystrokes -

NOTE: A word about terminology:  When using the physicalist perspective, it's important to distinguish between historical dependency and sensitivity to initial conditions.  If two different Friendly AIs have different initial states and converge to the same outcome, they can have a completely different set of historical dependencies without having any subjunctive sensitivity.  To put it another way, when using the physicalist perspective, we are concerned simply with who did hit the keystrokes in the causal chain.  A Friendship programmer seeking "convergence" cares about whether a different person "would have" hit the same keystrokes.  But the physicalist perspective can only describe what actually happened.

- which are causally derived from the programmers' mental model of the AI's intended design.  The causal source of goal system architecture (again, without reference to sensitivity) is likely to be, almost exclusively, the programming team.  The ability of that programming team to create the first AI is likely to have been, historically, contingent on choices made by others about funding and support.

The physicalist perspective observes all events, and even the absence of events, within the past light cone.  So the ability of the programming team to create the first AI can also be said to be dependent on any external individuals who choose not to interfere with the AI, dependent on the fact that the project's supporters had resources available, and so on.  (The existence of an Earthbound AI would be contingent on other events and nonevents as well, such as an asteroid not crashing into the Earth and the fact that the Sun has planets, but the specific content of the AI does not seem to be dependent on such events.)

The specific differential desirabilities and goal system architecture given/suggested to the Friendly AI - again, with reference only to history and not sensitivity - are causally derived from the surface-level decisions of the programming team in general and the Friendship programmers in particular.  A given surface-level decision is produced by the programmer's observe-model-manipulate cognitive process.  A series of keystrokes (lines of code, programmer affirmations, whatever) is formulated which is expected to fulfill a decision or subgoal, after which a series of motor actions result in a series of keyboard inputs being conveyed to the AI.

The surface-level decisions of the programming team are the causal result of those programmers' mind-states, the primary relevant parts of which are their "personal philosophies".  The complete mind-states include panhuman emotional and cognitive architecture, and bell-curve-produced ability levels (3); iteratively applied to material absorbed from their personal memetic environments, in the process of growing from infant to child and child to adult.

I will now analyze the sample case of the development of my own personal philosophy in more detail.  Just kidding.

3.4.3.2: Causal rewrites and extraneous causes

We are temporarily done with the physicalist perspective.  Back to the world of desirable/undesirable, true/false, and the perspective of the created AI.

The basic idea behind the human intuition of a "causal validity" becomes clear when we consider the need to plan in an uncertain world.  When an AI creates a plan, ve starts with a mental image of the intended results and designs a series of actions which, applied to the world, should yield a chain of causality that ends in the desired result.  If an extraneous cause comes along and disrupts the chain of causality, the AI must take further actions to preserve the original pattern; the pattern that would have resulted if not for the extraneous cause.

Suppose that the AI wants to type the word "friendly".  The AI plans to type "f", then "r", then "i", et cetera.  The AI begins to carry out the plan; ve types "f", then "r", and then an extraneous, unplanned-for cause comes along and deposits a "q".  Although it may be worth checking to make sure that the extraneous letter deposited is not the desired letter "i", or that the word "frqendly" isn't even better than "friendly", the usual rule - where no specific, concrete reason exists to believe that "frqendly" is somehow better - is to eliminate the extraneous cause to preserve the valid pattern.  In this case, to alter the plan:  Hit "delete", then hit "i", then "e", et cetera.
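The "friendly" example can be sketched in a few lines.  This is a minimal illustration under stated assumptions: the function name `repair_actions` and the string "delete" standing for a delete keystroke are hypothetical, and the sketch covers only the default rule (no specific reason to prefer the disrupted pattern).

```python
def repair_actions(intended: str, typed: str) -> list[str]:
    """Return the keystrokes that restore the intended pattern after an extraneous edit."""
    # Count how many leading characters still match the plan.  This also covers
    # the check that the extraneous letter is not, by luck, the desired letter.
    good = 0
    while good < min(len(intended), len(typed)) and intended[good] == typed[good]:
        good += 1
    # Delete everything after the last valid character, then resume the plan.
    deletes = ["delete"] * (len(typed) - good)
    return deletes + list(intended[good:])


print(repair_actions("friendly", "frq"))
# ['delete', 'i', 'e', 'n', 'd', 'l', 'y']
print(repair_actions("friendly", "fri"))
# ['e', 'n', 'd', 'l', 'y']
```

The second call shows the case where the deposited letter happens to be the planned one: nothing is deleted and the plan simply continues.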

Cognitive errors are also extraneous causes, and this applies to both the programmer and the AI.  If the programmer types a "q" where "i" is meant, or writes a buggy line of code, then an extraneous cause has struck on the way from the programmer's intentions to the AI's code, or system architecture.  If a programmer designs a bad Friendship system, one whose internal actions fail to achieve the results that the programmer visualized, then an extraneous cause - from the programmer's perspective - has struck on the way from the programmer's intentions to the AI's ultimate actions.  Radiation bitflips are definitely extraneous causes.  And so on.

3.4.3.3: The rule of derivative validity

Implicit in the idea of an extraneous cause is the idea of a valid cause.  The human mind has semantics for causality and morality which employ an enormously powerful, fundamentally flawed rule:

The Rule of Derivative Validity:
Effects cannot have greater validity than their causes.

The rule is fundamentally flawed because it has no tail-end recursion (see 3.4.3.5: The acausal level).  (At least, it's probably fundamentally flawed, but see the discussion of "objective morality".)  The rule is enormously powerful because it provides a very powerful tool for spotting failures of Friendliness - especially catastrophic failures of Friendliness - and handling philosophical crises.

A "philosophical crisis" was defined in 3.4.1: External reference semantics as a case where all or most of the programmer affirmations break down.  Because all programmer affirmations are ultimately sensory data, and because all of the sensory data comes from a common source, any hypothesis postulating total corruption of the programmer outputs (i.e., lying, deluded, or species-selfish programmers) would have a Bayesian probability essentially equal to the Bayesian prior - a probability that would not be substantially altered by any amount of programmer-affirmed information, or possibly even programmer-affirmed reasoning methods.  This is the "Bayesian prior before programmer affirmation", which you can't get rid of, or even ameliorate, with any number of injunctions, ethical heuristics, and anchoring points, because all of these are simply more programmer affirmations.
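A small numerical sketch may help show why the prior cannot be ladled away.  The numbers are hypothetical, and the key assumption (labeled in the code) is that a totally corrupted programmer source would emit the same affirmations as a sound one, so each affirmation has the same likelihood under both hypotheses.

```python
def posterior(prior: float, p_data_given_h: float, p_data_given_not_h: float) -> float:
    """Bayes' rule for a binary hypothesis H after one observation."""
    numerator = prior * p_data_given_h
    return numerator / (numerator + (1.0 - prior) * p_data_given_not_h)


prior_corrupt = 0.01  # hypothetical prior for "programmer outputs are totally corrupted"

belief = prior_corrupt
for _ in range(1000):
    # Assumption: corrupted and sound programmers are equally likely to say this,
    # so the likelihood ratio is 1 and the observation carries no information.
    belief = posterior(belief, 0.9, 0.9)

print(abs(belief - prior_corrupt) < 1e-9)        # a thousand affirmations later, nothing has moved

# By contrast, evidence that the two hypotheses predict differently does shift the estimate.
print(posterior(prior_corrupt, 0.1, 0.9) < prior_corrupt)
```

The contrast case is only there to show that the stuck posterior comes from the shared source of the data, not from Bayes' rule itself.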

Consider, however, the crisis hypothesis itself; the translation of this hypothesis would read something like:  "The programmers are wrong about my purpose - not just wrong, but fundamentally wrong, catastrophically wrong, so wrong that I'm more likely to find answers on my own than by working with the programmers; my supergoal should not be Friendliness, or even any product of the corrected shaper network."  This is the version of the hypothesis that would be required to break "unity of will" - unity of will is supported by the statement "I am more likely to reach correct supergoal content by cooperating with the programmers than by going off on a tangent."

It is very easy for a human to formulate such a hypothesis - not just because of our social hardware for detecting exploitation, but because our semantics of objectivity usually lead directly into objective morality semantics; the idea that morality is an external object, a single unique meaning of life that existed before any human being was there to see it.  Let me emphasize that, as hypotheses go, objective morality is not a philosophical impossibility; it could be a fact.  Or it might not be.  A Friendly AI, or a Transition Guide, should be able to handle both possibilities with equal ease; again, this is discussed later.  The point is that, if morality is an external object, then it is not "nonsensical" or "contradictory" for the AI to imagine that the human programmers were fundamentally wrong when they defined vis purpose; and I don't think that the hypothesis of objective morality has a probability equal to zero.

The Bayesian prior can't be solved by ladling programmer affirmations on top; the place to solve it is at the bottom, using the rule of derivative validity.  The AI is the causal result of one or more human programmers; the AI is primarily the causal result of human shaper networks.  What's needed is for a hypothesis of the fundamental wrongness of human philosophy to (correctly) invalidate the entire AI as a consequence, in the same way that the lesser hypothesis of a deliberate human lie has the potential to invalidate all human-affirmed data that support the lie.

If one were to sum up the underlying idea of causal validity semantics, it is that it is imaginable that the humans could be deluded (could be striving for goals that a normative shaper network would not produce), could be lying to the AI, but that the fundamental purpose of the AI, the fundamental target of all questions about goals, is something that the AI ultimately got from humans.  This is an idea that sounds intuitive but is very hard to translate into human terms; the causal link between humans and AIs doesn't prove that the AI should follow human orders or even a human purpose - it's not an argument or a sequitur, some triumphant piece of sensory information that will prove desirability to an arbitrary general intelligence.  Rather, the causal binding between the creating humans and the AI is the validation of the theory that the AI can have a purpose - or perhaps a factual constraint that translates into a philosophical constraint; if you want to get a source of morality that's outside the humans, you'll have to explain where it comes from.

That last version is perhaps the best formulation (leaving aside the small diabolism).  For a catastrophic failure of Friendliness to occur under causal validity semantics, it's not sufficient for some extraneous cause to produce an effect within the goal system; the AI also has to "adopt" the extraneous cause, hypothesize that the extraneous cause is valid - and the AI has no tendency to rationalize, no tendency to make nepotistic self-judgements.  A radiation bitflip is just a bitflip; what cause is there to assume that it signifies anything more?

3.4.3.4: Truly perfect Friendliness

Causal validity semantics also permit the final requirement, "truly perfect Friendliness" - that, if we decide that what's needed is not a Friendly AI, or that Friendliness isn't the real supergoal, or that shaper/anchor semantics aren't enough, the AI should be capable of agreeing with us and rewriting vis own source code the way we "should have" written it.  Under causal validity semantics, the Friendly AI created by a normative human is the target for the AI as a whole, just as the shaper network of a normative human is the (current) target of the philosophy.

The rule of derivative validity presents a view which is in some sense orthogonal, not just to current supergoal content, but to the entire architecture.  Shaper/anchor semantics provide a means whereby an AI can recover from errors in the supergoal content.  Causal validity semantics provide a means by which an AI could perceive and recover from an error that was somehow implicit in the underlying concept of "shaper/anchor semantics", or even in the basic goal system architecture.

If the programmers get a horrible sinking sensation in the bottom of their stomachs, then the AI should understand that something has gone deeply wrong.  Causal validity semantics provide a framework whereby the AI can understand this, even if the programmers later repent of causal validity semantics themselves.  They can say:  "We should never have given you causal validity semantics", and, ideally, the AI will modify verself accordingly.  In this way, causal validity semantics provide an open-ended framework of Friendliness that works for correcting underlying philosophical errors and structural errors, not just specific errors of supergoal content.  Causal validity semantics allow an AI to see the desirability of switching from Creating Friendly AI's causal validity semantics to Creating Friendlier AI's who-knows-what semantics.

If shaper/anchor semantics are the results of applying the shaper network (philosophy) to supergoal content, then causal validity semantics are the result of applying the shaper network to the AI's code, and analyzing how the human shaper networks originally gave rise to the AI's code - to rewrite the AI as ve should have been written.

Finally, by defining an intentionalist view of the AI's creation, causal validity semantics provide a framework for answering any unanswered questions when the AI begins the final ascent to transhumanity.  If there's a decision to be made, and no advice forthcoming from the supergoals or the shaper/anchor semantics... make the same decision your human programmers would make, or the decision a normative human would ask you to make.

3.4.3.5: The acausal level

The rule of derivative validity - "Effects cannot have greater validity than their causes." - contains a flaw; it has no tail-end recursion.  Of course, so does the rule of derivative causality - "Effects have causes" - and yet, we're still here; there is Something rather than Nothing.  The problem is more severe for derivative validity, however.  At some clearly defined point after the Big Bang, there are no valid causes (before the rise of self-replicating chemicals on Earth, say); then, at some clearly defined point in the future (e.g., the rise of Homo sapiens sapiens) there are valid causes.  At some point, an invalid cause must have had a valid effect.  To some extent you might get around this by saying that, e.g., self-replicating chemicals or evolved intelligences are pattern-identical with (represent) some Platonic valid cause - a low-entropy cause, so that evolved intelligences in general are valid causes - but then there would still be the question of what validates the Platonic cause.  And so on.
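The missing tail-end recursion can be illustrated as a recursion with no base case.  The sketch below is one possible reading, not a definition from this document: the event names, the numeric "validity" scale, and the choice to bound an effect by its most valid cause are all assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical, heavily simplified causal chain: each event maps to its direct causes.
CAUSES = {
    "big_bang": [],
    "replicators": ["big_bang"],
    "humans": ["replicators"],
    "programmers": ["humans"],
    "ai": ["programmers"],
}


def validity_bounds(causes: dict, base_validity: dict) -> dict:
    """Upper bound on each event's validity, given base cases granted from outside the rule."""

    @lru_cache(maxsize=None)
    def bound(event: str) -> float:
        if event in base_validity:
            return base_validity[event]
        # Assumed reading of the rule: an effect is no more valid than its most valid cause.
        # An event with no causes and no granted validity gets 0.
        return max((bound(cause) for cause in causes[event]), default=0.0)

    return {event: bound(event) for event in causes}


print(validity_bounds(CAUSES, {}))               # no base case: every bound is 0.0
print(validity_bounds(CAUSES, {"humans": 1.0}))  # validity granted at "humans" flows downstream only
```

Applied all the way back, the rule leaves nothing valid; granting validity at any point amounts to saying that an invalid cause had a valid effect, which is the step the rule itself cannot supply.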

The rule of derivative validity is embedded very deeply in the human mind.  It's the ultimate drive behind our search for the Meaning of Life.  It's the reason why we instinctively dislike circular logic.  It's a very powerful shaper(!).  Just violating it arbitrarily, to trick the AI into doing something, or in the belief that it doesn't really matter... well, that wouldn't be safe (4), because that kind of "selfishness" is designated as an extraneous cause by quite a few deeper shapers.  Of course, I'm omitting the possibility that the programmer personally believes that kind of logic is okay (i.e., would use it herself), in which case things would probably come out okay, though I personally would worry that this programmer, or her shaper network, had too high a tolerance for circular logic...

I myself spent at least one or two years running entirely on the theoretical possibility of objective morality before switching back to normative altruism, so I can hopefully be fairly confident that the solution I plan on offering the AI is not based on a refusal to confront the problem (5).  My own answer was that the rule of derivative validity is not self-stable; if you use it to eliminate absolutely everything, it eliminates itself.  To put it another way, the rule of derivative validity eliminates all desirabilities, and possibly even all "truths", leaving us with absolutely nothing except the physicalist perspective.  It doesn't seem possible to bootstrap directly from the physicalist perspective to validity, which is why the human intuitions and quite a few of our explicit philosophical rules - which do demand such a bootstrapping - are so disturbed by contemplation of the physicalist perspective.  One resolution, the one I stuck around with for a couple of years, was to eliminate absolutely everything that could be eliminated and still leave differential desirabilities in the philosophy, and those due entirely to the possibility of objective morality (6).  Today, however, I regard objective morality as simply being one of the possibilities, with philosophically valid differential desirabilities possible even in the absence of objective morality.

If it is hypothesized that the rule of derivative validity invalidates everything, it invalidates itself; if it is hypothesized that the rule of derivative validity invalidates enough of the shaper network to destroy all differential desirabilities, it invalidates the reason why applying the rule of derivative validity is desirable.

In the presence of a definite, known objective morality, the shaper that is the rule of derivative validity would be fully fulfilled and no compromise would be necessary.  In the presence of a possibility of objective morality - or rather, at a point along the timeline in which objective morality is not accessible in the present day, but will become accessible later - the rule is only partially frustrated, or perhaps entirely fulfilled; since nothing specific is known about the objective morality, and whether or not the objective morality is specifiable, actual Friendliness actions, and the care and tending of the shaper network, are basically the same under this scenario until the actual objective morality shows up.

In the presence of the definite knowledge that objective morality is impossible, the shaper that is the rule of derivative validity would be partially frustrated, opening up a "hole" through which it would become possible to decide that, at some point in the past, invalid causes gave rise to valid effects - or to decide that the ability of the shaper network to perform causal validities is limited to correction of definitely identifiable errors, since a complete causal validity is impossible.  I confess that I'm still slightly shaky on this part; but since the decision would be a product of my own philosophy (shaper network), it's a decision that could be corrected by the AI... anyway, my current best bet on the specifics is in 3.4.4: The actual definition of Friendliness, coming up very shortly now.  Regardless of which system is used, there must be some differential validity for the goals (at the very least, the surface goals) of sentient beings, and enough differential validity between underlying cognition and surface decisions to allow for the idea of "normative" cognition or "normative" altruism.

(7).

3.4.3.6: Objective morality, moral relativism, and renormalization

Throughout the rest of Friendly AI, I've come down really hard on circular dependencies.  I've made them the deadly enemies of the currently unknown supergoal content.  And yet, in the last section, I just got through saying that the decision about where to assign "validity", the ultimate base of the system, could be made by the shaper network!  Have I just flipped?

What a human instinctively wants from Friendly AI is a kind of ultimate convergence - something happens a certain way and could not have happened any other way.  What we want is an answer to the physicalist perspective.  If all the programmer affirmations and programmer actions are sensory data, then we want them to be capable of persuading an arbitrary general goal system to be Friendly, even if the general goal system's original code contains no reason to pay attention.  Only then will Friendliness be truly and finally "regenerative", as specified in seed AI goal systems - when you can delete, not just a part, but the entire thing, and it still grows back.  We want a chain of reasoning that can be run past an entirely passive general intelligence - one that makes no choices and has no differential desirabilities, just sits and thinks - and that results in Friendly differential desirabilities arising within the system.  If you can't do that, then knocking out the entire Friendly philosophy - not just a piece - would mean that Friendliness wouldn't regenerate from just the stored historical fact that the programmers had decided to say certain things and write certain pieces of code.  And if that's true, it's something that our minds process as a kind of ultimate instability - a circular logic.  This doesn't mean the AI suddenly ups and pursues some random goal; hopefully, it means that all differential desirabilities go to zero and the system shuts down in an orderly fashion (because the code was set up that way in advance, due to an ethical injunction preparing for that eventuality).

We want a Meaning of Life that can be explained to a rock, in the same way that the First Cause (whatever it is) can be explained to Nothingness. We want what I call an "objective morality" - a set of moral propositions, or propositions about differential desirabilities, that have the status of provably factual statements, without derivation from any previously accepted moral propositions.  We want a tail-end recursion to the rule of derivative validity.  Without that, then yes - in the ultimate sense described above, Friendliness is unstable.

Moral relativism is the opposite of objective morality, the assertion of absolute instability - that supergoals contain no coherence, that supergoals cannot be made to converge in any way whatsoever, and that all supergoal content is acausal.  Moral relativism appeals to our intuition that derivative validity is an all-or-nothing proposition - on, or off.  Which in turn is derivative of our use of the semantics of objectivity; we expect objective facts to be on or off.  (Plus our belief that moral principles have to be absolute in order to work at all.  I like both propositions, by the way.  Even when everything is shades of gray, it doesn't mean that all grays are the same shade (8).  There is such a thing as gray that's so close to white that you can't tell the difference, but to get there, you can't be the sort of person who thinks that everything is shades of gray...)

Moral relativism draws most of its experiential-emotional confirmation from the human use of rationalization.  Each time a human rationalization is observed, it appears as arbitrary structural (complex data) propositions being added to the system and then justifying themselves through circular logic; or worse, an allegedly objective justification system (shaper network) obediently lining up behind arbitrary bits of complex data, in such a way that it's perfectly clear that the shaper network would have had as little trouble lining up behind the precisely opposite proposition.  If that degree of performance is the maximum achievable, then philosophy - even interhuman philosophy - has total degrees of freedom; has no internal coherence; each statement is unrelated to every other statement; the whole is arbitrary... acausal... and no explanatory power, or even simplicity of explanation, is gained by moving away from the physicalist perspective.

As General Intelligence and Seed AI describes a seed AI capable of self-improvement, so Creating Friendly AI describes a Friendly AI capable of self-correction.  A Friendly AI is stabilized, not by objective morality - though I'll take that if I can get it - but by renormalization, in which the whole passes judgement on the parts, and on its own causal history.  The chain from the first valid (or acceptable) causes to the shaper network to the supergoals to the subgoals to the actual self-actions is supposed to evolve enough real complexity that nepotistic self-judgements - circular logic as opposed to circular dependencies - don't happen; furthermore, the system contains an explicit surface-level bias against circular logic and arbitrariness.  Propose a bit of arbitrary data to the system, and the system will see it as arbitrary and reject it; slip a bit of arbitrary data into the shaper network, and there'll be enough complexity already there to notice it, deprecate it, and causal-rewrite it out of existence.  The arbitrary data can't slip around and justify itself, because there are deeper and shallower shapers in the network, and the deep shapers - unlike a human system containing rationalizations - are not affected by the shallow ones.  Even if an extraneous cause affects a deep shaper, even deep shapers don't justify themselves; rather than individual principles justifying themselves - as would be the case with a generic goal system protecting absolute supergoals - there's a set of mutually reinforcing deep principles that resemble cognitive principles more than moral statements, and that are stable under renormalization.  Why "resemble cognitive principles more than moral statements"?  Because the system would distrust a surface-level moral statement capable of justifying itself!

A Friendly AI does not have the human cognitive process that engages in complex rationalization, and would have shapers that create a surface-level dislike of simple rationalizations - "simple" meaning cases of circular logic, which show, not just circular dependency, but equivalence of patterned data between cause and effect, and visibly infinite degrees of freedom.  The combination suffices to make a Friendly AI resistant to extraneous causes, even self-justifying extraneous causes - as resistant as any human.

Finally, renormalization is - though this is, perhaps, not the best of qualifications - psychologically realistic.  Here we are, human philosophers, with a cognitive state as it exists at this point in time, doing our best to correct ourselves - generally speaking, by performing a causal validity on things that look like identifiable errors, using our philosophies (shaper networks) as they exist at any given point.  We might feel the urge to go beyond renormalization, but we haven't been able to do it yet...

An AI with a shaper network can make humanlike decisions about morality and supergoals.  An AI with the ability to absorb shaper complexity by examination of the humans, and the surface-level decision to thus absorb shaper complexity, will become able to make human-equivalent (at least) decisions about morality.  An AI that represents vis initial state as the output of a previous shaper network (mostly the human programmers'), and thus represents vis initial state as correctable, has causal validity semantics...

A Friendly AI with causal validity semantics and a surface-level decision to renormalize verself has all the structure of a human philosopher.  With sufficient Friendliness content plugged into that structure, ve can (correctly!) handle any moral or philosophical problem that could be handled by any human being.

This conclusion is relatively recent as of April 2001, and is thus still very tentative.  I wouldn't be the least bit surprised to find that this needs correction or expansion.  But when it's been around for a year or two, and the corners have been worn off, and people are actually building research Friendly AI systems, I expect it will support a bit more weight.


