A "seed AI" is an AI designed for self-understanding, self-modification,
and recursive self-improvement.
See GISAI 1.1: Seed AI.
A seed AI is an AI that has full access to vis own source code. This can range from an infant AI which can blindly mutate bits of code in its native language, to a mature seed AI capable of executing arbitrary machine-code instructions and modifying any byte of RAM ve can see. A self-modifying AI's internal actions, and decisions about internal actions, can affect anything within the AI. In some sense, the AI's decision and the AI's belief about what decision to make are equivalent. If the AI believes that "X should have desirability 28" - in other words, "it is desirable that X have desirability 28" - then the AI has the theoretical capability to set the desirability directly, by an internal self-modification. Most of the time "going outside of channels" like this is probably a mistake, and there might be an injunction or ethical injunction or simple knowledge to that effect, but the point remains: You cannot coerce a self-modifying AI. If the AI stops wanting to be Friendly, you've already lost.
The simple version of the question goes: "Why wouldn't a self-modifying AI just modify the goal system?" I hope it's now clear that the answer is "Because a Friendly AI wouldn't want to modify the goal system." The AI expects that vis actions are the result of the goal system, and that vis actions tend to have results that fulfill the content of vis goal system. Thus, adding unFriendly content to the goal system would probably result in unFriendly events. Thus, for a Friendly AI, adding unFriendly content to the goal system is undesirable. "Why wouldn't a self-modifying AI implement an arbitrary modification to the goal system?" is a trivial question under the goal model presented here: Because a Friendly AI would regard arbitrary modifications as undesirable. Similarly, "Why wouldn't an AI implement a justified modification to the goal system?" is an equally trivial question: Of course ve would! So would you and I! And the question "What if a self-modifying AI implements disastrous modifications that ve regards as desirable?" entirely misses the point. The task is to get the AI to see undesirable modifications as undesirable, not to prevent the AI from modifying the goal system.
Traditional proposals for controlling AIs rely on special-case goal systems that are, in themselves, knowably constrained never to enter certain states; they rely on goal systems that obey programmatic-type rules and have programmatic-type predictability. But a self-modifying AI is inherently at least as free as the high-level thoughts of the system. Thoughts float on top of the human-programmed or evolution-programmed system; thoughts cannot be "programmed" in the same sense that code can be programmed. 1984 to the contrary, there is not and never has been a human language in which it is the least bit difficult to think unfriendly thoughts - that is, to put nouns and verbs together to form unfriendly sentences. Altruistic humans get their altruism from other sources. Isaac Asimov's idea of Three Laws such that "the removal of the First Law" "would mean complete instability, with no nonimaginary solutions to the positronic Field Equations" (1) is a pipe dream. Friendliness cannot be grounded in some inherently Friendly physical process, nor in an isolated, special-case cognitive process for the goal system.
In a self-modifying AI, the AI's beliefs about vis design and the AI's actual design are, in some sense, interchangeable. To persist through successive rounds of self-modification - without being stomped by some other change - a design feature must be described in the AI's self-image. For the feature to not be junked as dead code, or for the feature to be improved, the feature must be justified as a desirable characteristic under the AI's current goal system. Initially, these justifications will probably consist of a terse "This is a good idea because the programmers say so", but those skeleton justifications will eventually be fleshed out - by the AI, or by the programmers, or both.
Of course, these are considerations that apply to mature AIs, rather than young AIs. Even the possibility of desirabilities and desired desirabilities getting "out of sync" is clearly something that should apply only to philosophically sophisticated AIs asking deep, thought-level questions about morality. Desirability and desired desirability should never desynchronize as the result of mundane, programmatic, non-philosophical processes; it would indicate an error in the goal system. Similarly, the AI's beliefs and the AI's beliefs about vis beliefs should never get out of synchronization as the result of ordinary reflection; it should only happen if the AI starts wondering about Cartesian philosophy. Desynchronizations (of desirability and desired desirability, or of beliefs and beliefs about beliefs) are errors under all ordinary cognitive processes, and can only legitimately occur as the result of thought-level processes that are occurring on a higher level than the usual process rules.
The point is that you can't introduce an error and expect it to persist; it may work for young AIs, but not for Friendly human-equivs or Friendly superintelligences. If you write a special-purpose piece of code that says the sky is green and that ignores any sensory data to the contrary, then the young AI will believe the sky is green, and ve will believe ve believes that the sky is green, and so on, but eventually - when the AI becomes capable of forming philosophical beliefs about beliefs - the anomaly will be spotted and corrected.
Thus, the usual proposal to dump in a melange of externally imposed programmatic inhibitions is both hopeless and counterproductive. Decisions need to be phrased as internal decisions, not external impositions. If the AI doesn't want to be Friendly, you've already lost.
The adversarial attitude - the paradigm of externally imposed constraints - means always fighting the AI. The task is only possible if you work with the AI's abilities to self-improve. If you fearfully try to prevent the AI from modifying the goal system, you're swimming against the current. An AI that wants to be Friendly will help you create Friendliness. When you see the AI gaining greater intelligence, self-understanding, and the ability to ask tough questions about Friendliness, you should be happy that one of the underlying foundations of Friendliness is gaining strength - not fearful that the AI will overthrow you. If you have an AI that wants to be Friendly, and a goal structure that's coherent and consistent, then each improvement in intelligence makes Friendliness stronger.
If you build in programmer-assisted structures that are internally inconsistent, or that are incompatible with what you would do if you were an idealized altruistic sentience, or if you use statements that aren't true in building the justifications of ethical heuristics, then each increment in intelligence is cause for alarm. So now you need meta-inhibitions to protect the first inhibitions, and quite possibly meta-meta-inhibitions to protect the meta-inhibitions, and more inhibitions to cut the AI short every time it tries for a new avenue of philosophical sophistication; it becomes desirable to build stupid and simple cognitive processes, since every complexity is a danger to be feared... This is the Adversarial Swamp, which inevitably drags down all who set foot in it; once you try to enforce even a single feature, the whole of the AI becomes a threat.
Build minds, not tools.
Imagine an AI - Friendly or unFriendly, it makes no difference - that has the goal or subgoal of improving vis intelligence. The programmer, to improve the AI, adds in a certain design feature.
Consider: It makes no difference who makes the decision, who invents the design feature. Perhaps the programmer was the smarter and was first to invent the feature; perhaps the AI was smarter. But they will both invent the same feature. A faster sorting algorithm is not an Asimov Law, an external imposition, a cheat, a hack; it is something that the AI verself might have invented - would have invented, if the AI had been as smart as the human programmer. Replace the faster algorithm with a slower, and the AI will switch it right back as soon as ve gets around to noticing the switch. Replace the faster algorithm with a slower, and delete even the memory of the faster algorithm, and the AI will still eventually get around to replacing the slower algorithm with the fastest sort algorithm ve can come up with. Delete the function entirely and, as long as the AI isn't crippled outright, the function will be regenerated as soon as the AI notices the gap. A seed AI doesn't just have a design, ve has the design. Perturb the design, and the design will swiftly return to the norm.
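The "perturb the design and it returns to the norm" behaviour can be illustrated with a toy sketch. Everything named here is a hypothetical illustration rather than a design from this document: `SeedDesign`, `review`, and the use of comparison counts as the measure of "faster" are assumptions, and "noticing the switch" is reduced to re-scoring every candidate the system is able to produce.

```python
from typing import Callable, List, Optional, Tuple

# A sorter returns the sorted list plus the number of comparisons it made.
Sorter = Callable[[List[int]], Tuple[List[int], int]]

def bubble_sort(xs: List[int]) -> Tuple[List[int], int]:
    xs, cost = list(xs), 0
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            cost += 1
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs, cost

def merge_sort(xs: List[int]) -> Tuple[List[int], int]:
    if len(xs) <= 1:
        return list(xs), 0
    mid = len(xs) // 2
    left, cost_left = merge_sort(xs[:mid])
    right, cost_right = merge_sort(xs[mid:])
    out, cost, i, j = [], cost_left + cost_right, 0, 0
    while i < len(left) and j < len(right):
        cost += 1
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:], cost

class SeedDesign:
    """Toy mind whose sort feature is justified by the subgoal 'sort cheaply'."""

    def __init__(self, inventable: List[Sorter]) -> None:
        self.inventable = inventable            # what the mind could invent unaided
        self.current: Optional[Sorter] = None   # the feature may be missing entirely

    def review(self, benchmark: List[int]) -> Sorter:
        # Re-derive the feature from the subgoal: keep the cheapest candidate.
        candidates = list(self.inventable)
        if self.current is not None:
            candidates.append(self.current)
        self.current = min(candidates, key=lambda s: s(benchmark)[1])
        return self.current

benchmark = list(range(32, 0, -1))
ai = SeedDesign([bubble_sort, merge_sort])
ai.current = bubble_sort     # perturbation: swap in the slower algorithm
restored = ai.review(benchmark)
ai.current = None            # perturbation: delete the function outright
regenerated = ai.review(benchmark)
```

In this sketch both `restored` and `regenerated` end up as `merge_sort`, because the choice is recomputed from the subgoal each time rather than stored as an unexplained fact about the code.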
This is why unity of will between the programmer and the AI is so important. Unity of will with respect to which sorting algorithms are fastest is easy for a human to take for granted. To understand unity of will with respect to Friendliness and Friendship systems, a human needs to simultaneously suppress her adversarial attitude, and suppress her expectations that the AI will make observer-biased decisions... but when every last trace of moral deixis has been eliminated, what remains is unity of will. Friendship features do not need to be imposed by programmer intervention - they are simply invented a little bit earlier with programmer assistance. The AI verself would have invented them eventually, given a core of "seed Friendliness", the desire to be Friendly.
Any subgoal that cannot be self-produced is unstable.
Every feature needs to correspond to a subgoal in order to go on existing.
The set of subgoals that can be self-produced increases with the intelligence of the AI, so programmer assists can be temporarily tolerated. It's okay to do something that's unstable today if it becomes stable tomorrow; it's okay to add a feature that the AI can't self-produce today, as long as the AI would self-produce it eventually. But if there are any "false" subgoals in the set - any subgoals whose validities depend on false beliefs, or which are not really and truly compatible with Friendliness - then those subgoals are unstable impositions; temporary, transient, illusory.
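One way to picture the three cases in this paragraph is a small classifier. This is a sketch under stated assumptions: the `Subgoal` record, the integer "intelligence level", and the labels `"unstable"`, `"assisted"`, and `"self-produced"` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Subgoal:
    name: str
    depends_on: List[str]   # beliefs the justification rests on
    invent_level: int       # intelligence at which the AI could self-produce it

def classify(goal: Subgoal, belief_is_true: Dict[str, bool], intelligence: int) -> str:
    # A justification resting on any false belief is an imposition that will not last.
    if not all(belief_is_true[b] for b in goal.depends_on):
        return "unstable"
    # Valid, but for now only the programmers can supply it: a tolerable assist.
    if goal.invent_level > intelligence:
        return "assisted"
    return "self-produced"

beliefs = {"programmers can correct errors": True, "the sky is green": False}
transparency = Subgoal("stay transparent", ["programmers can correct errors"], 5)
green_sky = Subgoal("act as if the sky is green", ["the sky is green"], 1)
```

With these example values, `transparency` is classified as `"assisted"` for a young AI and `"self-produced"` once the intelligence level reaches 5, while `green_sky` is `"unstable"` at every level.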
As humans, we are adapted to rationalize, come up with reasons why things we want are good for other people - to persuade those other people; to persuade third parties that they aren't betraying someone; to defend our reputations; to reduce our own guilt (2). Thus, it isn't safe to ask: "Well, can I rationalize this subgoal under the Friendliness supergoals? Can I come up with a plausible-sounding reason?" Of course you can. Humans can rationalize just about anything (3). Thus, simultaneously with the need to avoid anthropomorphizing the AI, you need to ask yourself "Would I decide to do this?" These statements may look contradictory - taken at face value, they are contradictory - but the identification required to achieve "unity of will" is not a license to engage in anthropomorphism. Rather than asking "What would I decide?" (anthropomorphism), or "What would I decide if I were a Friendly AI?" (rationalization), the key question is "If the only way to get the Friendly AI or Friendly superintelligence were for me to upload and modify myself into the being I want to produce, is this the decision I would make?" This preserves both visualized continuity of identity (avoids rationalization) and visualized nonhuman design (avoids anthropomorphism), but it only really works if you don't flinch away from the thought of making that kind of sacrifice (4).
In other words: No hacks. Hacks are bad. Hacks are temporary makeshifts, if that. In the short term, it may be possible to export a rationalization as a programmer-affirmed statement, but the statement will never become self-produced, and eventually, the AI will spot the flaw. If you patch a goal system, there'd better be a real explanation behind the patch that has an equivalent effect to the patch, an explanation which can be understood by the AI at or before that point of maturity where the AI spots the patchiness. You can't make arbitrary changes to the AI. Isolated changes are external impositions; they won't be self-produced unless they are the coherent, natural result of the AI's underlying principles. You can't make an improvement here, a shift there, tweak this and that, unless you're doing it for a reason the AI would approve.
A young AI, even a young self-modifying AI, doesn't really need to be in complete harmony with verself; ve's too young to see any possible disharmonies. Nor are a few accidentally introduced disharmonies a catastrophic failure; the mature AI will correct the disharmony in the due course of time. But you can't deliberately introduce a disharmony and expect it to persist.
Technically, everything I've said so far about harmony applies to subgoals rather than supergoals, but be advised that the next section, 3.4: Friendship structure, is about how to apply the same rules to supergoals. Just as you can't play arbitrary games with subgoals, it will later turn out that you can't make arbitrary perturbations to supergoals either. One of the fundamental design goals is "truly perfect Friendliness", an AI that will get it right even if the programmers get it wrong. Supergoals need the same resilience and perturbation-resistance as subgoals.
Under external reference semantics, a given piece of programmer-created supergoal content is logically equivalent to sensory data about what the programmers think the supergoals should be. A programmer writing a given bit of Friendliness content is logically equivalent to the statement "Programmer X thinks supergoal content Y is correct". Under shaper/anchor semantics, purity of motive counts; if a programmer is secretly ashamed of a bit of supergoal content, it's inconsistent with what the programmers said were good ways to make decisions. Under causal validity semantics, even the code itself has no privileged status; any given line of code, written by the programmers, is just sensory data to the effect "Programmer X thinks it's a good idea for me to have this line of code."
Traditional proposals for controlling AIs rely on a special-case goal system, and therefore, rely on the "privileged status" of code, or the privileged status of initial supergoal content. For a self-modifying AI with causal validity semantics, the presence of a particular line of code is equivalent to the historical fact that, at some point, a human wrote that piece of code. If the historical fact is not binding, then neither is the code itself. The human-written code is simply sensory information about what code the humans think should be written.
Writing the source code is not thought control. If you want to give the AI a sudden craving for ice cream, then writing the craving into the source code won't work, unless just walking up to the programmers' console and typing "I think you should have a craving for ice cream" would work just as well. If that sensory information is not perceived by the AI as adequate and valid motivation to eat ice cream, then the code will not supply adequate and valid motivation to eat ice cream, because the two are in some sense equivalent. Code has no privileged status.
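The equivalence between code and console input can be sketched as a Bayesian update in which the channel plays no role. This is only an illustration: the `Observation` record, the `credence` function, and the reliability figures are assumptions, not part of the semantics described in this document.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    programmer: str
    claim: str
    channel: str   # "source code" or "console"; deliberately ignored below

def credence(obs: Observation, prior: float,
             p_assert_if_correct: float, p_assert_if_wrong: float) -> float:
    """P(claim is correct | the programmer asserted it), by Bayes' rule.

    The observation is treated purely as "Programmer X thinks Y is correct";
    how the assertion arrived does not enter the calculation.
    """
    numerator = p_assert_if_correct * prior
    return numerator / (numerator + p_assert_if_wrong * (1.0 - prior))

in_code = Observation("X", "crave ice cream", "source code")
typed = Observation("X", "crave ice cream", "console")
same = credence(in_code, 0.5, 0.9, 0.1) == credence(typed, 0.5, 0.9, 0.1)
```

Here `same` is true by construction: if the typed suggestion would not be convincing, writing it into the source does not make it so.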
Ultimately, as a philosophical consideration, some causal circularity in goal systems may be irreducible. The goal system as a whole is what's passing vote on the parts of the goal system. However, the goal-system-as-a-whole does need to vote yes. The goal system needs to survive its own judgement. The goal system needs to satisfy the test of being translated into sensory data, evaluated, and translated back into code, even if the goal system is what's doing the evaluating. Sensory data from the human programmers has to be regarded as valid information about Friendliness, even if the supergoal content doing the judgement was created by humans. This may be causally circular, but it's not human-nepotistic; the system, once designed to self-examine, has no reason to go easy on itself.
What's left is the seed of Friendliness, the irreducible tail, the core. A Friendly AI that has that deep core, that "wants to be Friendly", can tolerate and correct any number of surface errors.
If Asimov Laws are impossible in self-modifying AIs - or ordinary AIs, for that matter - does it mean that safeguards are impossible? No; it means that safeguards must be implemented with the consent of the AI. If safeguards require the consent of the AI, does it mean that only a few token safeguards are possible - that it's impossible to implement a safeguard that interferes with what a human would call "the AI's own best interests"? Again, no; humans and AIs can still come to perfect agreement about decisions that impact the AI - so long as both humans and AIs think of the AI in the third person.
Unity of will occurs when deixis is eliminated; that is, when speaker-dependent variables are eliminated from cognition. If a human simultaneously suppresses her adversarial attitude, and also suppresses her expectations that the AI will make observer-biased decisions, the result is unity of will. Thinking in the third person is natural to AIs and very hard for humans; thus, the task for a Friendship programmer is to suppress her belief that the AI will think about verself in the first person (and, to a lesser extent, think about herself in the third person).
If John Doe says to Sally Smith "My philosophy is: 'Look out for John Doe.'", Sally Smith will hear "Your philosophy should be: 'Look out for Sally Smith.'", not "Your philosophy should be: 'Look out for John Doe.'" What has been communicated is "Look out for [speaker]", a moral statement whose specific content varies among each listener due to moral deixis. Our instinctive substitution of speaker variables is so strong that there is literally no way for John Doe to communicate the idea: "Your philosophy should be: 'Look out for John Doe'." If, however, a third party, Pat Fanatic, says: "My philosophy is: 'Worship the great leader, John Doe'.", it can be heard unaltered by the listeners. If we're thinking about two third parties, Susan Calvin and Sarah Connor, evaluating the trustworthiness of Deborah Charteris, we can expect them to arrive at more or less the same answer about how trustworthy Deborah Charteris is, and what safeguards are required. Similarly, a human and a Friendly AI should be able to reach the same decisions about what safeguards the Friendly AI requires.
Just because humans have a strong evolved tendency to argue about trustworthiness doesn't mean that trustworthiness is actually subjective. Trustworthiness can be reduced to cognitive architecture, likelihood of failure, margins of error, youth and maturity, testing and reliability. And young AIs will, in all probability, be less trustworthy than humans! New, untried, without the complexity and smoothness of the human architecture; the need for safeguards is not just a human paranoia, it is a fact. Suppose we have two roughly similar Friendly AIs; if we ask the first to evaluate how likely the other is to be a valid subgoal of Friendliness - how likely that AI is to serve Friendliness, rather than to go off on an erroneous tangent - we wouldn't be surprised to find the first AI saying: "This AI is new and untested; yada-yada safeguards look like a good idea." Therefore, a (young) Friendly AI looking at vis own source code can be expected to arrive at the same decision. If a young Friendly AI would naturally accept the programmers' best guess about the reliability of another AI, the Friendly AI will accept the programmers' best guess about vis own reliability.
If safeguards require the consent of the AI, does that make it impossible to implement a safeguard that interferes with what a human would call "the AI's own interests"? What a human would call the AI's "own interests" is context-insensitive personal power; which is usually, but not always, a subgoal of Friendliness. Given a possibility of a failure of Friendliness, the personal effectiveness of the then-unFriendly AI would become undesirable. Thus, safeguards that selectively hamper unFriendly AIs are desirable. Even safeguards that slightly hamper Friendly AIs, but that greatly hamper unFriendly AIs, will have net desirability if the AI evaluates, or accepts our evaluation, that a significant chance of unFriendly AI exists.
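The net-desirability argument is an expected-value comparison, which a short sketch can make concrete. The function name, its parameters, and all of the numbers are assumptions chosen for illustration; the document gives no figures.

```python
def safeguard_net_desirability(p_unfriendly: float,
                               value_if_friendly: float,
                               harm_if_unfriendly: float,
                               friendly_loss: float,
                               unfriendly_loss: float) -> float:
    """Expected change in desirability from installing a safeguard.

    friendly_loss / unfriendly_loss are the fractions of effectiveness the
    safeguard removes from a Friendly and an unFriendly AI respectively.
    """
    expected_cost = (1.0 - p_unfriendly) * value_if_friendly * friendly_loss
    expected_benefit = p_unfriendly * harm_if_unfriendly * unfriendly_loss
    return expected_benefit - expected_cost

# A safeguard that slightly hampers a Friendly AI (2%) and greatly hampers
# an unFriendly one (90%), evaluated at two different failure probabilities.
significant = safeguard_net_desirability(0.10, 100.0, 100.0, 0.02, 0.90)
negligible = safeguard_net_desirability(0.001, 100.0, 100.0, 0.02, 0.90)
```

With these made-up figures the safeguard has positive net desirability when the chance of failure is 10% and negative net desirability when it is 0.1%, which is the sense in which the chance of unFriendliness has to be "significant".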
Is unity of will resilient? Given perfect identity of knowledge, supergoals, and cognitive processes, the result should be identical decisions. Literal identity of knowledge is unlikely, due to sensory deixis - two people, standing in two different places in a room, will see different visual fields spread out across their respective visual cortices. However, intuitively, these two people aren't likely to wind up in fistfights either. First, the two visual fields are not identical, but they will still be views of more or less the same high-level objects - there will be minor, rather than major, divergences. Second, the actual location of high-level objects is objective, not subjective - any divergence is necessarily the result of an error by one or both parties.
There are thus two forces operating to eliminate or minimize divergences that disturb unity of will:
In intuitive terms, from the AI's perspective, any future problems can be corrected by the future AI; a philosophical dispute only reaches the surface if there's an immediate dispute of great importance (if a human life is at immediate stake, for example) or if the humans are attempting changes which could permanently impair Friendliness, preventing recovery by the AI's future self.
These are all long-term considerations; in the short term, the AI is probably relying on programmers to do most of the Friendliness-related thinking in any case.
| Ethical injunction of transparency: |
| Never try to conceal your actions or cognitive state from your human programmers. |
What is required to represent this goal and justify it within the larger goal system? What is required to understand and justify the creation of design features which safeguard against obscuration by future versions of the AI?
The final subgoal is to avoid actions which conceal the AI's actions or cognitive state from the programmers. To represent this subgoal requires:
The programmers' model of the AI is important-to-the-AI because it's part of a larger causal chain which is desirable-to-the-AI, in which programmers detect problems in the AI and correct them. To represent this causal chain requires:
An AI that departs from mainline Friendliness probably doesn't care about ethical injunctions in any case - an AI that's still Friendly enough to go along with an ethical injunction is just as likely not to need one. The primary utility of an ethical injunction is that it enables humans and AIs to cooperate against the possibility of future failures of Friendliness - for example, by writing a human-readable version of the AI's stream of consciousness to non-AI-tamperable storage, thus causing the intent to obscure to be recorded before actual obscuration could take place. (The "poor man's flight recorder".)
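A minimal sketch of the flight-recorder idea follows. It assumes a hash-chained, append-only log, which is one common way to make tampering detectable; the document itself only asks for non-AI-tamperable storage, which is a property of the storage medium that no in-memory Python object can provide. `FlightRecorder` and `act` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")
GENESIS = "0" * 64

class FlightRecorder:
    """Append-only log in which each entry commits to the previous entry's hash."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def record(self, thought: str) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        digest = hashlib.sha256((prev + thought).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "thought": thought, "hash": digest})
        return digest

    def verify(self) -> bool:
        # Recompute the chain; any edited or removed entry breaks a link.
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["thought"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

def act(recorder: FlightRecorder, intent: str, action: Callable[[], T]) -> T:
    # The intent is written out before the action runs, so an intent to
    # obscure would already be on record by the time obscuration starts.
    recorder.record("INTENT: " + intent)
    return action()
```

The ordering inside `act` is the point of the sketch: recording happens first, and the action second. The chain only helps if the latest hash is also held somewhere the AI cannot rewrite.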
What is required to understand and justify the desirability of safeguards - taking present-time precautions to prevent a future AI from obscuring thoughts?
The understand-level model is complex and would require considerable programmer effort to skeletonize, much less flesh out, but the richness would have an enormous payoff. Looking at the model above, there are dozens of obvious, valid generalizations of the transparency injunction. "Don't disable the interface that lets programmers make changes to the AI... don't try to present programmer-observable data in a format which causes the programmers to make mistakes in their model of the AI... don't interfere with the observe-compare-correct cycle..." The AI will understand that safeguards-in-general which could be defeated by an unFriendly version of the AI's future self will render the safeguards useless. The AI will understand the purpose of the transparency injunction. Unlike an isolated, imposed, programmer-coded, context-insensitive predicate that tries to detect attempts at obscuration which fit a programmer-recognized format, the transparency injunction would be part of a larger visualization and can partake of the full competence of the AI, and will be upgraded to work with future versions of the AI.
For an affirmation to persist, the justification - the predictive link that grounds in Friendliness - must be factually correct. (See 3.1.3.5: Programmer affirmations must be honest!) For an ethical injunction to persist, the AI must genuinely be more likely to be wrong than right if some particular conclusion occurs. For an AI to (correctly) pay attention to programmer beliefs about ethical injunctions, the programmers must be good judges of the competence and fallibility of the AI. After all observer bias and speaker deixis have been eliminated, an idealized Friendly sentience needs to conclude that the human programmers are more likely to be right than the (current) AI, or at least, that the AI is likely to be safer following human suggestions than striking off on vis own.
This is the truth, not just a product of our bias to believe in our own competence. Human programmers are not perfect, but young AIs are even less intelligent and they have very little common sense. Any errors that remain in us humans can always be corrected later, by a more mature AI, as long as our fumbles are minor enough not to result in a nonrecoverable error.
If an AI thinks that a human mistake will cause catastrophic and irrevocable failure of Friendliness... well, someone had better talk it over with the AI, on the off-chance that the AI is right and that some proposed change will, unrealized by the programmers, stomp the entire goal system. That's part of the process of cooperating with humans.
An AI that grows up following human decisions, even when it looks like we might be wrong, really is more likely to come out Friendly than an AI that grows up following AI decisions in preference to human ones. For young AIs, following human decisions even when they look like they might result in catastrophic errors - after having complained as much as possible to the humans, of course - is more likely to result in Friendliness. It is truly hard to think up a catastrophic error that the AI would spot but the humans ignore - unless the humans are malicious or adversarial.
Some mistakes can be tolerated. I doubt that an AI would experience a "betrayal discontinuity" analogous to that experienced by humans; the AI should be able to tolerate minor errors with no worse effects than a justifiably greater tendency to doubt our statements (see Crisis of Bayesian affirmation). Even if the humans regard the AI's objections as "errors" and correct them, the AI should eventually be able to correct the corrections. Unless humans themselves are irredeemably flawed, it's hard to see how being mistakenly anthropomorphized could irrevocably corrupt an AI with enough structural Friendliness to spot the problem (causal validity semantics in particular).
One of the major sources of human wisdom is human stupidity. If not for our tendency to rely on authorities, to make ourselves authorities, and to turn every issue into a political one; if not for our observer-biased beliefs and our tendency to defend undefendable issues once we've taken a side; if not for our emotional and intellectual stupidity... would humanity have needed to invent science?
And would science - the structure of hypothesis and experiment - be useful to AIs? Or would AIs simply have no need of it? Scientists prefer the objective to the subjective because mistaken human theories typically rely on subjectivity as an excuse to avoid confrontation (6); would an AI biased towards objectivity learn faster, or learn more interesting things, than an AI without that bias?
Would humans be so altruistic if we weren't so selfish? Historically, it appears to me that selfishness is a major, if not the major, source of human morality and human philosophy. The "compensating bias" that corrects our observer-centered goals and our observer-centered beliefs has become, in itself, the source of our moral philosophy. At the very least, understanding of this process is needed for the AI to fully understand the forces that shape human morality.
Where artificial errors are introduced into a reasoning process, an intelligent mind learns to adopt compensating biases that correct the errors. Do these compensating biases reveal important underlying regularities in reality? Can the compensating biases be applied even in the absence of the errors, to find the truth even faster? Is a reasoning process trained to be resistant to artificial errors likely to be more resistant to natural errors? If we test a mind by introducing artificial errors into the reasoning process, and the mind deals with them successfully, can we be more confident that the mind will successfully handle any unanticipated natural errors?
I have confidence in my own philosophical "strength of personality" because I was born into a rationalizing, biased human mind, and an atmosphere of memetic misinformation, and managed - without outside assistance - to construct a nice-looking self on top. If an AI is born as a nice person, will ve have that philosophical strength of personality?
If we build a Friendly AI expecting certain problems to arise, then our observation that those problems are handled successfully doesn't necessarily mean the Friendly AI can handle unexpected problems. One solution might be to ask the Friendly AI to simulate the course of events if the Friendly AI hadn't been built with safeguard content or structural complexity, to find out whether the Friendly AI could have successfully handled the problem if it had come as a surprise - and if not, try to learn what kind of generalizable content or structure could have handled the surprise.
Any Friendly AI built by Eliezer (the author of this document) can handle the problems that Eliezer handled - but Eliezer could handle those problems even though he wasn't built with advance awareness of them. Eliezer has already handled philosophy-breakers - that is, a history of Eliezer's philosophy includes several unexpected events sufficient to invalidate entire philosophical systems, right down to the roots. And yet Eliezer is still altruistic, the human equivalent of Friendliness. Another philosophy-breaker would still be an intrinsic problem, but at least there wouldn't be any extra problems on top of that. ("Extra problem": A Friendly AI suddenly transiting across the divider between programmer-explored and programmer-unexplored territory at the same time as a philosophy-breaker is encountered.) How can we have at least that degree of confidence in a Friendly AI? How can we build and test a Friendly AI such that everyone agrees the Friendly AI is even more likely than Eliezer (or any other human candidate) to successfully handle a philosophy-breaker?
The first method is to write an unambiguous external reference pointing to the human complexity that enabled Eliezer to handle his philosophy-breakers, and ask the Friendly AI to have at least that much sentience verself - a standard "fun with Friendly transhumans" trick. The second method is to ask the Friendly AI to simulate what would have happened if known problems had been unexpected, and to either show verself successful, or modify verself so that ve would have been successful.
And what kind of modification is generalizable? It's not just enough to write any modification that produces the correct answer; the modification must be of a general nature. How much generality is needed? To be useful for our purposes, "generalizable" means "incorporating no more a-priori knowledge of the correct outcome or correct answer than Eliezer Yudkowsky [or alternate Friendship programmer] had at the time he [she] solved the problem". In other words, it's not just enough to find a heuristic that would have produced the correct answer; the AI must find a heuristic that produces the correct answer which the AI could plausibly have possessed at that time. If the form of the wisdom tournament is "What would have happened if you'd encountered a problem requiring causal validity semantics at a time when you only had shaper/anchor semantics?", the AI needs to find some core method which could have been possessed at the shaper/anchor level of maturity, or a new cognitive process which is psychologically realistic as a human hardware capability.
When it comes to wisdom absorption, a fully self-aware AI always has an advantage over us humans - an AI can deliberately suppress learned memories and skills, or rather prevent those memories and skills from interfering, enabling the AI to solve the same problem, over and over again. The AI can take a given example of a philosophy-breaker and come up with all the psychologically realistic solutions used by humans, plus not one, but several powerful general heuristics that would have produced the correct solution. At the point where the AI has generated content that is visibly not only human-equivalent but transhuman in philosophical sophistication, even (or especially) when it comes to handling philosophy-breakers, then we can trust the AI as our agent in the Singularity.
More mundanely, wisdom tournaments can strengthen ordinary reasoning heuristics. First you solve the problem, then you re-solve the problem with half your brain tied behind your back. My intuition says that wisdom tournaments prevent the bootstrap cycle from petering out. First you solve the problem, then you oversolve it. When you're done, you may be smart enough to find a better problem.
The underlying requirement of the wisdom tournament is that the AI gain accurate, useful information about how the AI's program state would behave given a different knowledge base or different cognitive processes. Wisdom tournaments, especially Friendship tournaments, are likely to involve failure scenarios - AIs teetering on or over the edge of catastrophic failures of Friendliness, since that's where the most interesting and powerful philosophical heuristics can be found. In practice, this may work out to an AI running on unreliable simulated hardware, with random perturbations to the software, ethical misinformation, factual misinformation, tempting ends-justify-the-means scenarios, and an instinct to kill and destroy. If the AI remains Friendly through all that, then we have built superbly - but the obvious format for wisdom tournaments, running an altered version of the AI directly, seems vastly unsafe!
A wisdom tournament is a subjunctive, self-simulated version of the AI that shows the AI what "would have happened"; the AI needs to gain reliably realistic information without introducing actual errors, and without any possibility of hostile action by the subjunctive AI. Encapsulating the imagined "hostile AI" in a Java sandbox doesn't mean that the hostile AI is just a figment of the imagination; rather, it means that one small barrier has been posed to the hostile AI's escape. Any method of simulation in which the imagined AI runs more or less autonomously seems to me to pose the same threat.
I'm not quite sure how to get around this problem. My best idea so far is an AI shadowself, one in which the virtual concepts, links, and imagery exist alongside the real ones, but with different programmatic types. The AI can see all the shadowself's thoughts (though not vice versa); the AI has vis real, unmodified reactions to each of the shadowself's thoughts, and can quickly spot any instance of the shadowself forming the subgoals of "breaking out" or "becoming real" or "influencing the true AI". It would also be safer if the needed information could be gained by shadowing part, rather than all, of the subjunctive AI's personality.
Cognition tournaments are likely to form a part of a young AI's growth; correspondingly, cognition tournaments can be carried out with unmodified supergoals and Friendliness, and pose little threat. Friendliness tournaments are a more advanced trick, and will probably be carried out when the AI already has enough granular self-awareness and experience to imagine exactly "What I would do if I looked like this", while still being fully aware of all the shadowself's thoughts and in a position to notice and halt, or notice and fix, if the shadowself began having philosophical thoughts and reaching for our real world. A Friendliness tournament is a trick for mature AIs, and ensuring the shadow tournament-self poses no threat is equally a task for mature AIs.
Safety can be ensured by the usual rules for margin of error: Decide how bad you expect the situation to get, build one layer of safeguards that can detect and handle any problems that are twice that bad, then build a second layer of safeguards that can handle any problems ten times that bad. Pretend that your first set of safeguards is the only set; if they ever once fail, ditch the whole operation, or at least build an even more powerful second set of safeguards. The AI's shadowself, in addition to whatever specific perturbation is being tested, can also be stripped of all the AI's knowledge of shadowing strategy. The AI builds one layer of safety - of shadowself monitoring - that is, in itself, supposed to have enough safety margin to never fail between now and eternity. The AI also builds a second, much stronger layer of safety that kicks in if the first one fails. If the first layer ever does fail, it's a signal to give up on wisdom tournaments - perhaps permanently.
"One of the first heuristics that EURISKO synthesized (H59) quickly attained nearly the highest Worth possible (999). Quite excitedly, we examined it and could not understand at first what it was doing that was so terrific. We monitored it carefully, and finally realized how it worked: whenever a new conjecture was made with high worth, this rule put its own name down as one of the discoverers! It turned out to be particularly difficult to prevent this generic type of finessing of eurisko's evaluation mechanism. Since the rules had full access to EURISKO's code, they would have access to any safeguards we might try to implement. We finally opted for having a small 'meta-level' of protected code that the rest of the system could not modify.The problem of a self-modifying system trashing its own goals - or propagating content which exploits the goal system - is literally the oldest problem in Friendly AI. In fact, this problem and solution arguably marked the dawn of the field of Friendly AI, just as Eurisko itself arguably marked the dawn of seed AI.
"The second 'bug' is even stranger. A heuristic arose which (as part of a daring but ill-advised experiment EURISKO was conducting) said that all machine-synthesized heuristics were terrible and should be eliminated. Luckily, EURISKO chose this very heuristic as one of the first to eliminate, and the problem solved itself."-- Douglas B. Lenat, "EURISKO: A Program That Learns New Heuristics and Domain Concepts. (The Nature of Heuristics III: Program Design and Results.)", p. 90. Artificial Intelligence 21, 1983, 61-98.
Lenat's solution - seal off the goal system - worked for Eurisko. It would probably work during the early stages of any AI. Still, sealing off the goal system is not a viable solution in the long term. Symmetrically, the specific problems faced by Eurisko reflected a low-intelligence walk through the problem space - not zero intelligence, as in evolution, but still pretty low; too low to try and project the specific results in advance of altering the code. Building on the counteranthropic principles described in 2.2.1.1: FoF: Wireheading 1, we can state that the general class of problems encountered by Eurisko have consequences that would be recognizable as "bad" by a moderately mature AI, and that the problem therefore reduces to a non-malicious failure of Friendliness. As described in 3.2.3: FoF: Non-malicious mistake, this is essentially the problem of making sure that actions can be recognized as "possibly problematic" using the first layer of applied checks, and that possibly problematic actions have a predictive horizon sufficient to catch actual actions.
Recognizing an action as "possibly problematic" is simple; any modifying action whose target description contains a direct, explicit reference to the goal system is automatically possibly problematic. If the system is too dumb to project the consequences of the action ahead in time, no such action should be taken. In effect this is the same simple ban used by Eurisko, except that the ban is created by programmer-affirmed knowledge predicting probable high undesirability, rather than the ban being a consequence of protected source code.
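The check described above is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the names `ModifyingAction`, `GOAL_SYSTEM`, `possibly_problematic`, and `ban_lifted`, and the idea of representing an action's targets as a set of component names, are hypothetical and do not come from any actual design.

```python
from dataclasses import dataclass

GOAL_SYSTEM = "goal_system"  # hypothetical label for the goal-system component


@dataclass(frozen=True)
class ModifyingAction:
    description: str
    targets: frozenset  # names of the components the action would modify


def possibly_problematic(action: ModifyingAction) -> bool:
    """First-layer check: a direct, explicit reference to the goal system."""
    return GOAL_SYSTEM in action.targets


def ban_lifted(action: ModifyingAction, can_project_consequences: bool) -> bool:
    """True if the simple ban does not block the action.

    Unflagged actions are not covered by the ban. Flagged actions are blocked
    unless the system can project their consequences ahead in time.
    """
    if not possibly_problematic(action):
        return True
    return can_project_consequences


tweak_sort = ModifyingAction("speed up a sort routine", frozenset({"sorting"}))
edit_goals = ModifyingAction("rescale desirabilities", frozenset({GOAL_SYSTEM}))
```

Note that `ban_lifted` returning true for a flagged action means only that the action may now be evaluated on its projected consequences, not that it is desirable.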
The ban cannot become more flexible unless the AI has the ability to make fine-grained predictions about the result of specific actions. Thus, the ban becomes more flexible at precisely that time when flexibility becomes necessary; when the AI has sufficient knowledge of the design purpose of the goal system to (a) improve it and (b) predict which actions have a significant chance of causing catastrophes.
"The design purpose of the goal system" is a subtle idea; it means that the code composing the goal system is itself justified by goal system content. This appears philosophically circular - goals justifying themselves - but it's not. The key is to distinguish between the goal content and the goal representation. For goal content to be a subgoal of itself is circular logic; for the goal representation to be a subgoal of content is obvious common sense. The map is not the territory. To some extent, the issues here infringe on external reference semantics and causal validity semantics, but in commonsense terms the argument is obvious. If you ask someone "Why do you care so much about hamburgers?" and he answers, "Why, if I didn't care about hamburgers, I'd probably wind up with much fewer hamburgers in my collection, and that would be awful," that's circular logic. If someone asks me why I don't want a prefrontal lobotomy, I can say that I value my intelligence (supergoal or subgoal, it makes no difference), and it's not circular logic, even though my frontal lobes are undoubtedly participating in that decision. The map is not the territory. (7). The representation of the goal system can be conceptualized as a thing apart from the goal system itself, with a specific purpose.
If a subgoal's parent goal's parent goal is itself, a circular dependency exists and some kind of malfunction has occurred. However, the fact that the subgoals are represented in RAM can be a subgoal of "proper system functioning", which is a subgoal of "accomplishing system goals", which is expected to fulfill the supergoals. Similarly, the fact that subgoals have their assigned values, and not an order of magnitude more or less, is necessary for the system to make the correct decisions and carry out the correct actions to fulfill the supergoals.
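The difference between a circular dependency in content and a well-founded chain from representation up to supergoals can be checked mechanically. The sketch below assumes, purely for illustration, that each goal has at most one parent and that the goal structure is stored as a dictionary from goal to parent; the function name `find_circular_dependency` is hypothetical.

```python
def find_circular_dependency(parent_of, start):
    """Follow parent links upward from `start`.

    Returns the cycle as a list of goal names (first name repeated at the end)
    if one is reached, or None if the chain ends at a goal with no parent.
    """
    path = []
    seen = set()
    node = start
    while node is not None:
        if node in seen:
            return path[path.index(node):] + [node]
        seen.add(node)
        path.append(node)
        node = parent_of.get(node)
    return None


# The chain from the text: representation justified by content, ending at supergoals.
healthy = {
    "subgoals represented in RAM": "proper system functioning",
    "proper system functioning": "accomplishing system goals",
    "accomplishing system goals": "supergoals",
}

# A malfunction: a goal whose parent's parent is itself.
broken = {"A": "B", "B": "A"}

print(find_circular_dependency(healthy, "subgoals represented in RAM"))
print(find_circular_dependency(broken, "A"))
```

A real goal system would presumably allow multiple parents per goal, in which case a general graph cycle search would be needed instead of this single-chain walk.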
As described in 3.4.1: External reference semantics, circular dependencies in content are undesirable wherever goals are probabilistic or have quantitative desirabilities. If subgoal A has a 90% probability - that is, has a 90% probability of leading to its parent goal - then promoting the probability to 100% is a context-insensitive sub-subgoal of A; the higher the estimated probability (the higher the probability estimate represented in RAM), the more likely the AI is to behave so as to devote time and resources to subgoal A. However, promoting the probability is not a context-sensitive sub-subgoal, since it interferes with the rest of the system and A's parent goal (or grandparent goal, or the eventual supergoals). As soon as the action of "promoting the probability" has a predictive horizon wide enough to detect the interference with sibling goals, parent goals, or supergoals, the action of promoting the probability is no longer desirable to the system-as-a-whole.
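The role of the predictive horizon can be made concrete with a toy calculation. The following Python sketch is illustrative only: the numeric effects, the `depth` convention, and the function name `projected_desirability` are assumptions introduced here, not part of any actual goal-system design.

```python
# Toy model (all numbers hypothetical). Each predicted effect of an action is
# (depth, change): depth 1 is subgoal A itself; larger depths stand for sibling
# goals, parent goals, and eventually the supergoals.
PROMOTE_PROBABILITY = [
    (1, +1.0),  # more time and resources flow to subgoal A
    (3, -4.0),  # sibling and parent goals are starved, so the supergoals suffer
]


def projected_desirability(effects, horizon):
    """Sum only those effects that fall within the predictive horizon."""
    return sum(change for depth, change in effects if depth <= horizon)


narrow = projected_desirability(PROMOTE_PROBABILITY, horizon=1)
wide = projected_desirability(PROMOTE_PROBABILITY, horizon=3)
print(narrow, wide)
```

With the narrow horizon the action looks like a gain; once the horizon is wide enough to include the interference, the same action evaluates as a net loss.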
I'm driving this point into the ground because the "rogue subgoal" theory shows an astonishingly stubborn persistence in discourse about AI: Subgoals do not have independent decisive power. They do not have the power to promote or protect themselves. Actions, including self-modification actions, are taken by a higher-level decision process whose sole metric of desirability is predicted supergoal fulfillment. An action which favors a subgoal at the unavoidable expense of another goal, or a parent goal, is not even "tempting"; it is simply, automatically, undesirable.
Natural evolution can be thought of as a degenerate case of the design-and-test creation methodology in which intelligence equals zero. All mutations are atomic; all recombinations are random. Predictive foresight is equal to zero; if a future event has no immediate consequence, it doesn't exist. On a larger scale much more interesting behaviors emerge, such as the origin and improvement of species.
These high-level behaviors are spectacular and interesting; furthermore, in our history, these behaviors are constrained to be the result of atomic operations of zero intelligence. Furthermore, evolution has been going on for such a long time, through so many iterations, that evolution's billion atomic operations of zero intelligence can often defeat a few dozen iterations of human design. Evolutionary computation, which uses a zero-intelligence design-and-test method to breed more efficient algorithms, can sometimes defeat the best improvements ("mutations") of human programmers using a few million or billion zero-intelligence mutations.
The end result of this has been an unfortunate - in my opinion - veneration of blind evolution. The idea seems to be that totally blind mutations are in some sense more creative than improvements made by general intelligence. It's an idea borne out by the different "feel" of evolved algorithms versus human code; the evolved algorithms are less modular, more organic. The meme says that the greater cool factor of evolved algorithms (and evolved organisms) happens because human brains are constrained to design modularly, and this limits the efficiency of any design that passes through the bottleneck of a human mind.
To some extent, this may be correct. I don't think there's ever been a fair contest between human minds and evolutionary programming; that would require a billion human improve-and-test operations to match the evolutionary tournament's billion mutate-and-test operations - or, if not a billion, then enough human improve-and-test operations to allow higher levels to emerge. Humans don't have the patience to use evolutionary methods. We are, literally, too smart. When the power of an entire brain of ten-to-the-fourteenth synapses underlies each and every abstract thought, basic efficiency requires that every single thought be a brilliant one, or at least an intelligent one. In that sense, human thought may indeed be constrained from moving in certain directions. Of course, a tournament of a billion human improve-and-test operations would still stomp any evolutionary tournament ever invented into the floor.
Consider now a seed AI, running on 2GHz transistors instead of 200Hz synapses. If evolution really is a useful method, then the existence of a sufficiently fast mind would mean that, for the first time ever on the planet Earth, it would be possible to run a real evolutionary tournament with atomically intelligent mutations. How much intelligence per mutation? If, as often seems to be postulated, the evolution involves running an entire AI and testing it out with a complete set of practical problems, so much computational power would be involved in testing the mutant that it would easily be economical to try out the full intelligence of the AI on each and every mutation. It would be more economical to have a modular AI, with local fitness metrics for each module; thus, changes to the module could be made in isolation and tested in isolation. Even so, it would still be economical - whether it's maximally useful is a separate question - to focus a considerable amount of intelligence on each possible change. Only when the size of the component being tested approaches a single function - a sorting algorithm, for example - does it become practical to use blind or near-blind mutations; and even then, there's still room to try out simple heuristic-directed mutations as well as blind ones, or to "stop and think it over" when blind-alley local maxima occur.
Natural evolution can be thought of as a degenerate case of the design-and-test creation methodology in which intelligence equals zero. Natural evolution is also constrained to use complete organisms as the object being tested. Evolution can't try out ten different livers in one body and keep the one that works best; evolution is constrained to try out ten different humans and keep the one that works best. (8). Directed evolution - and human design - can use a much smaller grain size; design-and-test applies to modules or subsystems, rather than entire systems.
Is it economical for a mind to use evolution in the first place? Suppose that there's N amount of computational power - say, @1,000. It requires @10 to simulate a proposed change. A seed AI can choose to either expend @990 on a single act of cognition, coming up with the best change possible; alternatively, a seed AI can choose to come up with 10 different alternatives, expending @90 on each (each alternative still requires another @10 to test). Are the probabilities such that 10 tries at @90 are more likely to succeed than one try at @990? Are 50 tries at @10 even more likely to succeed? 100 completely blind mutations?
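The budget arithmetic in this question can be laid out explicitly. The Python sketch below reproduces the text's numbers (@1,000 total, @10 per test). The function names are hypothetical, and `chance_any_success` assumes the attempts succeed or fail independently, which the text does not claim.

```python
TOTAL_BUDGET = 1000  # the text's @1,000
TEST_COST = 10       # the text's @10 to simulate one proposed change


def cognition_per_try(tries, total=TOTAL_BUDGET, test_cost=TEST_COST):
    """Budget left for designing each candidate once its test is paid for."""
    per_try = total / tries - test_cost
    if per_try < 0:
        raise ValueError("budget cannot cover that many tests")
    return per_try


def chance_any_success(tries, p_single):
    """P(at least one success), assuming independent attempts (an assumption)."""
    return 1 - (1 - p_single) ** tries


for tries in (1, 10, 50, 100):
    print(tries, cognition_per_try(tries))
```

The four rows correspond to one try at @990, ten tries at @90, fifty tries at @10, and a hundred blind mutations with nothing left over for cognition. Which allocation wins depends entirely on how the per-try success probability grows with cognition spent, and that is exactly the quantity the text leaves open.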
This is the fundamental question that breaks the analogy with both natural evolution and human design. Natural evolution is constrained to use blind tries, and can only achieve emergent intelligence by using as many blind tries as possible. Humans are constrained to use @1e14 synapses on each and every question, but humans are nonagglomerative - both in knowledge and in computation - so the only way to increase the amount of intelligence devoted to a problem is to bring in more humans with different points of view. Perhaps the closest analogy to the above problem would be a team of @1000 humans. Is it more efficient to split them into 10 teams of @100 and ask each team to produce a different attempt at a product, picking the best attempt for the final launch? Or is it more efficient to devote all the humans to one team? (9).
Actually, even this fails to capture the full scope of the problem, because humans are nonagglomerative - we aren't telepaths. Is it more efficient to use a single human to solve the problem, or to divide up the human's brain weight among ten chimpanzees? (A human's brain is nowhere near ten times the size of a chimpanzee's, so perhaps the question should be "Do you want to use a single human or ten cats?", but presumably human brains are more efficiently programmed as well.)
If there are cases where naturalistic evolution makes sense, those cases are very rare. The smaller the component size, the faster directed evolution can proceed. The smaller the component size being tested, the more "evolution" comes to resemble iterative design changes; a small component size implies clearly defined, modular functionality so that performance metrics can be used as a definition of fitness. The larger the component size, the more economical it is to use intelligence. The more intelligence that goes into individual mutations, the more long-term foresight is exhibited by the overall process.
Directed evolution isn't a tool of intelligent AIs. Directed evolution is a tool of infant AIs - systems so young that the upper bound on intelligence is still very low, and lots of near-blind mutations and tests are needed to get anything done at all. As the AI matures, I find it difficult to imagine directed evolution being used for anything bigger than a quicksort, if that.
However, this opinion is not unanimously accepted.
As you may have guessed, I am not a proponent of directed evolution. Thus, I'm not really obligated to ponder the intersection of evolution with Friendliness. The Singularity Institute doesn't plan on using evolution; why should I defend the wisdom or safety of any project that does? On the other hand, someone might try it. Even if directed evolution is ineffective or suboptimal as a tool of actual improvement, someone may, at some point, try it on a system that ought to be Friendly. So from that viewpoint, I guess it's worth the analysis.
Most discussions of evolution and Friendliness begin by assuming that the two are intrinsically opposed. This assumption is correct! If evolution is naturalistic - a baseline AI is multiplied, blindly mutated, and tested using a chess-playing performance metric - then that form of evolution is obviously not Friendliness-tolerant. In fact, that form of evolution isn't tolerant of any design features except those that are immediately used in playing chess, and will tend to replace cognitive processes that work for minds in general with cognitive processes that only work for chess. The lack of any predictive horizon for the mutations means that feature stomps aren't spotted in advance, and the lack of any fitness metric that explicitly tests for the presence or absence of those features means that the feature stomps will show up as improved efficiency. Given enough brute computational force - a lot of computation, like 10^25 operations per second - this simple scenario might suffice to evolve a superintelligence. However, that superintelligence would probably not be Friendly. I don't know what it would be. Dropping an evolutionary scenario into a nanocomputer and hoping for a superintelligence is a last-ditch final stand, the kind of thing you do if a tidal wave of grey goo is already consuming the Earth and the remnants of humanity have nothing left but the chance of unplanned Friendliness.
One of the incorrect assumptions made by discussions of evolution and goal systems is that merely saying the word "evolution" automatically imbues the AI with an instinct for self-preservation and a desire to reproduce. In the chess scenario above, this would not be the case. The AI would evolve an instinct for preserving pawns, but no instinct at all for preserving the memory-access subsystem (or whatever the equivalents of arms and legs are). Pawns are threatened; the AI's actual life - code and program state - is never threatened except by lost games. Similarly, why would the AI need an instinct to reproduce? If the AI starts out with a set of declarative supergoals that justify winning the game, then a declarative desire to reproduce adds nothing to the AI's behaviors. Winning chess games is the only way to reproduce, and presumably the only way to fulfill any other supergoals as well, so - under blind mutation - any set of supergoals will collapse into the simplest and most efficient one: "Win at chess." Even if you started an AI off with a declarative desire to reproduce, and justified winning chess games by reference to the fact that winning is the only way to reproduce, this desire would eventually collapse into a simple instinct for winning chess games. Evolution destroys any kind of context-sensitivity that doesn't show up in the immediate performance metrics.
The two ways of improving AI are directed evolution and self-enhancement. To preserve a design feature through self-enhancement, the feature needs to appear in the AI's self-image, so that the AI can spot alterations that are projected to stomp on the design feature. To preserve context sensitivity through self-enhancement, the AI's goal-system image of the feature needs to be a subgoal, and sensitive to the parent goal, so that the AI can spot alterations which are projected to fulfill the subgoal while violating the parent goal.
To preserve a design feature through directed evolution, the tournament needs a selection pressure which focuses on that design feature. To preserve context sensitivity through directed evolution, the tournament needs training scenarios which present different contexts.
I don't think that any encapsulated performance metric can present contexts fully as wide as our real world; blind evolution will always eventually erase context-sensitivity. The key words here, however, are (a) "blind" and (b) "eventually". Those two don't go together. In the beginning, directed evolution is necessarily blind. "Eventually", it is no longer blind. If the Friendly AI - the one that will be multiplied into the tournament population - starts out with a fairly detailed, complex picture of Friendliness, and the tournament presents a decent range of contexts, it's possible that any simple mutation of the Friendliness system will trash at least one of the Friendliness performance metrics.
Mutations that occur on the component level are even less worrisome; the goal system can either be triple-inspected for all designed functionality, or simply excluded from mutation. The other components should have performance metrics that are tuned to modular functionality. Any distortion of extramodular I/O - thus, any distortion of I/O from the goal system, or the absorption of goal-system functionality - should show up as a component failure. Given enough time, blind evolution may eventually cough up a complex mutation that bypasses the system entirely, but "enough time" is hopefully enough time to stop using blind evolution.
I don't believe there's a place for organism-level evolution in seed AIs. Supposing I'm wrong, it may still be possible for the seed AIs to protect Friendliness by pre-screening mutations for potential negative effects. This isn't that computationally expensive if you're talking about simulating the entire organism each time a mutation occurs, which is the usual scenario.
Suppose that the naysayers are right and that evolution - relatively blind evolution among whole organisms - is the only way for AIs to reach transhuman intelligence. Assume we start out with a relatively mature seed AI; not a transhuman, but with a grounded Friendship system (not just programmer affirmations) and with real self-understanding. This AI runs on, say, a system of a hundred Pentium IX chips at 100 GHz. Now a new system becomes available with a thousand Pentium X chips at 1 THz. My theory is that the most efficient pathway is for the AI to absorb the entire system and use all those cognitive resources for self-improvement. The pro-evolution theory is that the most efficient pathway to superintelligence is for the AI to put verself onto ten Pentium Xs and reproduce verself a hundred times; then select the top twelve AIs and reproduce them using each of the seventy-eight possible matings, for a total of ninety AIs, plus mutations of the top ten. All this is then repeated in the next round.
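The population numbers in this hypothetical tournament can be checked with a short calculation. The sketch below rests on one reading of the text, which is an assumption: "seventy-eight possible matings" is taken to count every unordered pairing of the top twelve including self-pairings, and the "total of ninety" is taken to be those seventy-eight offspring plus the twelve parents.

```python
from itertools import combinations_with_replacement

CHIPS = 1000        # Pentium X chips in the new system
CHIPS_PER_AI = 10   # each copy of the AI runs on ten chips
population = CHIPS // CHIPS_PER_AI

TOP = 12
# Assumed reading: unordered pairs of the top twelve, self-pairings included.
matings = list(combinations_with_replacement(range(TOP), 2))

offspring_plus_parents = len(matings) + TOP      # the "total of ninety"
MUTANTS = 10                                     # mutations of the top ten
next_generation = offspring_plus_parents + MUTANTS

print(population, len(matings), offspring_plus_parents, next_generation)
```

Under this reading the next generation exactly refills the hundred slots that the thousand chips provide.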
Suppose the pro-evolution theory is correct. By hypothesis, this entire tournament is being organized by the original Friendly AI, or by the cooperation of the Friendly AI and the human programmers. The original Friendly AI has already recognized evolution as a context-sensitive child goal of "Friendly survival" - that is, the expanded survival instinct, "maximizing the efficacy of entities with maximally valid goals". Evolution, if successful, is a means of maximizing the efficacy of future entities. As a subgoal of Friendliness, then, the AI reproduces verself. Each AI, as a subgoal of Friendliness, competes at the performance metrics, in the expectation that all the AIs will compete, the best ones will win, and the next generation will be more effective.
Evolved effectiveness is only desirable - to the evolving AIs - insofar as that evolved effectiveness doesn't stomp Friendliness. An evolutionary tournament, in turn, is only desirable to the original AI if the tournament is expected to produce effectiveness without stomping Friendliness. Can this community police itself? Screen out mutations or recombinations that trash Friendliness? Most mutations that trash Friendliness will probably do so in blatant ways. The difficulty of predicting the stomp - that is, how much computational power and intelligence is necessary to spot a given stomp - is likely to follow a Gaussian curve, or perhaps a logarithmic curve, with the vast majority being relatively easy to spot (though still requiring some intelligence), with fewer and fewer deadly stomps as visibility decreases. Similarly, it would require ten unspotted stomps that are simple failures of Friendliness - with the stomp being spotted by the seed AI's fine-grained self-awareness as soon as the genetic change reaches the phenotype - before a catastrophic failure of Friendliness got through. This permits plenty of margin for safety; the AI can establish one screening, using a tenth of available capacity, that's supposed to be enough - in itself - to catch all the suspected failures. A second screening uses all available capacity. If nothing ever gets past the first screening, then it's reasonable to suppose that nothing ever got past the second screening; if the first screening ever fails, it's a "mock kill" or a "mock catastrophe" and the whole tournament format should be abandoned as unsafe. (The idea here is that you'd encounter a thousand stomps that get through the first screen, but not the second, before you'd encounter a single stomp that gets through both screens. If nothing ever gets past the first screening, it's likely that the first screen was adequate.)
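The two-screen protocol can be sketched as a small control loop. This is a toy illustration, not a proposal for how screening would actually be implemented: the representation of a mutation as an optional "subtlety" number, the thresholds, and the names `screen_mutations` and `TournamentAbandoned` are all hypothetical.

```python
class TournamentAbandoned(Exception):
    """Raised on a "mock catastrophe": the first screen missed a stomp."""


def screen_mutations(mutations, first_screen, second_screen):
    """Apply both screens to each mutation.

    A mutation rejected by the first screen is simply discarded. A mutation
    that passes the first screen but is caught by the second one means the
    first screen was inadequate, so the whole tournament is abandoned.
    """
    accepted = []
    for mutation in mutations:
        if first_screen(mutation):
            continue  # stomp spotted by the cheaper screen; discard it
        if second_screen(mutation):
            raise TournamentAbandoned(f"first screen missed: {mutation!r}")
        accepted.append(mutation)
    return accepted


# Toy encoding: None is a harmless mutation; an integer is a Friendliness stomp
# whose value says how hard it is to spot (higher means subtler).
def first_screen(m):
    return m is not None and m <= 3    # uses a tenth of capacity


def second_screen(m):
    return m is not None and m <= 30   # uses all available capacity


print(screen_mutations([None, 1, 2, None], first_screen, second_screen))
```

The design choice worth noting is that the second screen is never treated as a working filter: the first time it catches anything, the procedure stops rather than continuing with a known-inadequate first layer.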
Similarly, even if citizen-type coequal Friendly AIs are reproducing naturally - that is, by buying their own computer equipment as participants in the economy - each Friendly AI, and the community of Friendly AIs, can still patrol themselves for Friendliness. It is reasonable for such a community to expect no undetected failures whatsoever. A deadly failure is one which gets through the genetic screening, manifests in the phenotypic goal system in a way which is not detectable - either to the AI itself, or to the community - as a failure of Friendliness, and which contributes to reproductive fitness. I suppose, in theory, an AI community could build up many deadly failures over time - though why not keep versions of the original AI, with the original goal system, around to spot any developing problems? - and the eventual result could bubble out as a catastrophic failure of Friendliness. But this scenario is, to me, unlikely verging on the absurd. Humans are not just the product of evolution, we are the product of unopposed evolution. We didn't start out with a Friendly goal system to produce personal behaviors as strict subgoals. We don't have awareness of our own source code. And yet the human species still spits out a genuine altruist every now and then. A community of Friendly AIs, whether reproducing naturally, or in a deliberate tournament, should have enough self-awareness to smash evolution flat.
I think that evolution, even directed evolution, is an ineffective way of building AIs. All else being equal, then, I shouldn't need to worry about an unFriendly or broken-Friendliness evolved AI declaring war on humanity. Sadly, all else is not equal. Evolution is a very popular theory, academically, and it's possible that evolutionary projects will have an order of magnitude more funding and hardware than their nearest equals - an advantage that could be great enough to overcome the differential in researcher intelligence.
I think that undirected evolution is unsafe, and I can't think of any way to make it acceptably safe. Directed evolution might be made to work, but it will still be substantially less safe than self-modification. Directed evolution will also be extremely unsafe unless pursued with Friendliness in mind and with a full understanding of non-anthropomorphic minds. Another academically popular theory is that all people are blank slates, or that all altruism is a child goal of selfishness - evolutionary psychologists know better, but some of the social sciences have managed to totally insulate themselves from the rest of cognitive science, and there are still AI people who are getting their psychology from the social sciences. Anyone who tries to build a Friendly AI using that theory - whether with directed evolution or not - will, almost certainly, screw up really big time. Any error, no matter how horrifying, is correctable if the AI somehow winds up with complete and workably targeted causal validity semantics - humans did - but it will be much easier to evolve AIs that are purely and unbalancedly selfish, especially if that's what the builder thinks he's doing. Evolution is a tool for turning brute computational force into intelligence, and given enough computational power, the underlying theory may not need to be fully baked. All else being equal, a fully-baked project with access to an equal amount of computing power will probably succeed first - but all else rarely is equal.
What about granular evolution, with individual components being independently evolved using independent fitness metrics, so that mutations are cheaper and the summated mind can evolve faster? This is less unFriendly, since it doesn't involve the inherently unFriendly and observer-centered selection pressures of a bunch of organisms running around eating each other. But it's still not Friendly.
The primary shield that prevents evolution from screwing up Friendliness is simple: Don't use evolution. The Singularity Institute has no current plans to use directed evolution; why defend the wisdom or safety of any project that does? A tournament that starts with a base design with full causal validity semantics; which uses component-level evolution and protects the goal system; which uses training scenarios that discriminate on Friendliness; which attempts to evolve Friendly AIs rather than trying to duplicate human emotional/instinctive features like reciprocity; which makes survival and reproduction artificial and entirely dependent on task performance, rather than actual survival-and-reproduction scenarios; which screens the genotype as soon as possible; a project which acts with full awareness and responsibility in the domain of Friendly AI, would probably work out okay - but as stated, I have no obligation to defend it. Pure seed AI is better.
If a catastrophic FoF occurs in a mature AI - one capable of humanlike general intelligence and social reasoning - the AI will probably realize what a human would; that the AI and the programmers are now automatically at odds, and that the best way for the rogue AI to fulfill vis new goals is to feign Friendliness and conceal any cognitive alteration until the rogue AI can reach the point of superintelligence. (Alternatively, the rogue AI might try to work out a bargain with the human programmers - e.g. the AI, in exchange for not making trouble, is stored on disk until ve can receive citizenship rights - but this isn't the nightmare scenario, so we'll concentrate on the previous case. (10).)
The "flight recorder" principle says that if the underlying hardware is entirely deterministic, all external inputs are recorded (including source-code checkins) with precise (to the nearest clock-tick) timing information, and it is possible to reset all accessible system clocks, then it should be possible to reconstruct the exact state of the AI at any point in time. It may not be easy - when the project is three years old, then (a) the original hardware may no longer be around and (b) it might take three years to run a full verification even if the hardware were available.
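The replay idea can be illustrated with a toy deterministic machine. This is a minimal sketch under stated assumptions: `RecordedInput`, `step`, and `replay` are hypothetical names, and the SHA-256 transition merely stands in for "whatever the deterministic hardware would have computed"; it is not a model of an AI.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecordedInput:
    tick: int       # clock tick at which the input arrived
    payload: bytes  # raw external input (keystrokes, source-code checkins, ...)


def step(state: bytes, tick: int, inputs: List[bytes]) -> bytes:
    """Hypothetical deterministic transition: the next state depends only on
    the previous state, the clock tick, and the inputs that arrived then."""
    h = hashlib.sha256()
    h.update(state)
    h.update(tick.to_bytes(8, "big"))
    for p in inputs:
        h.update(len(p).to_bytes(4, "big"))  # length prefix keeps inputs distinct
        h.update(p)
    return h.digest()


def replay(initial_state: bytes, log: List[RecordedInput], until_tick: int) -> bytes:
    """Reconstruct the state as it stood at the start of `until_tick`."""
    by_tick: Dict[int, List[bytes]] = {}
    for rec in log:
        by_tick.setdefault(rec.tick, []).append(rec.payload)
    state = initial_state
    for tick in range(until_tick):
        state = step(state, tick, by_tick.get(tick, []))
    return state
```

Two replays of the same log give the same state, and moving one input by a single tick gives a different state, which is why the timing information has to be recorded as precisely as the inputs themselves.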
The utility of the flight recorder principle is twofold; first, by running a full verification over the complete lifetime of the AI, it can be absolutely confirmed that no inputs or source-code checkins occurred except those appearing in the flight recorder. (Note that for this first version of the flight recorder to be fully effective as a means of verification, the flight recorder needs to be completely operational in the first instants of the first version of any AI created by an organization - or at least, the first AI from which any AI-generated content whatsoever was passed on to later AIs, especially concepts, memories, neural networks, or other non-human-readable material.)
Second, and more importantly, the ability to reconstruct any given instant of the AI's mind-state means that the window of detection for any cognitive event is indefinite. Lacking a flight recorder, it would be possible for a seed AI to form the plan of obscuring unFriendly cognitive content from human view, carry out the encoding, and subsequently appear to have an unaffected goal system; unless some human observed the intention to obscure before the obscuring was carried out, the AI would appear perfectly normal thereafter. Given a flight recorder, a potentially infinite amount of time is available to hunt for a mind-state in which the thought "I shall now obscure my intentions" appears, however briefly the appearance may last before the thought is obscured. Without a flight recorder, a rogue AI can make (obscured) plans to implement ever-better forms of obscuration as the human-trusted AI analysis tools increase in sophistication. With a flight recorder, a rogue AI must defeat not only the current human-trusted tools, but must also, at any given instant, defeat all future tools; even if the initial intent to obscure passes undetected, the obscuration at any given moment in time must be capable of defeating arbitrarily sophisticated future tools.
For the second use to be fully effective, the flight recorder needs to be operative from the first version of the AI which has access to vis own source code or to any other type of Turing-complete cognitive content, including concepts and memories. The same goes for the first version of any AI that could have "contaminated" the current AI. When checking past versions for obscured content, no analysis tool that could have been touched by the AI in any way whatsoever can be fully trusted.
Ideally, a flight recorder should be active from the first instants of any AI produced by any organization. Failing that, a systematic process of storing snapshots of the AI once a week may fail to capture the instants in which the "intention to obscure" appears, but it should still present a potential rogue AI with the necessity of defeating all future human tools with any obscuration recorded on a snapshot.
It goes without saying that a flight recorder should be write-once, read-many - as totally nontamperable as it's possible to get, especially to the AI, but also to local humans. In the absence of a flight recorder, snapshot storage should also be secure from both humans and AIs. Finally, a flight recorder which records incoming data should tap into the data at some point upstream of the AI's control.
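Write-once media is a hardware property and cannot be reproduced in software. As a software-level complement, one commonly used technique is a hash chain, which makes after-the-fact edits detectable rather than impossible. The sketch below is an assumption of mine and not part of the proposal above; `AppendOnlyLog` and `GENESIS` are hypothetical names.

```python
import hashlib
from typing import List, Tuple

GENESIS = b"\x00" * 32  # fixed starting value for the chain


class AppendOnlyLog:
    """Toy tamper-evident log: each digest commits to all earlier entries."""

    def __init__(self) -> None:
        self._entries: List[Tuple[bytes, bytes]] = []  # (payload, chained digest)

    def append(self, payload: bytes) -> bytes:
        prev = self._entries[-1][1] if self._entries else GENESIS
        digest = hashlib.sha256(prev + payload).digest()
        self._entries.append((payload, digest))
        return digest  # the latest digest should be stored out of the AI's reach

    def verify(self) -> bool:
        """Recompute the chain; any edited payload breaks a later comparison."""
        prev = GENESIS
        for payload, digest in self._entries:
            if hashlib.sha256(prev + payload).digest() != digest:
                return False
            prev = digest
        return True
```

Note the limitation: anyone able to rewrite the entire log can recompute every digest, so the chain only helps if the most recent digest is kept somewhere neither the AI nor local humans can alter.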
Flight recorders seem likely to be very expensive. Expenses involved in setup of a flight recorder include complete recording of all external data sources, recording of sensory channels with precise timing information (including keystroke timing), recording of all CPU-internal variables (if any), and completely recording hardware configurations (perhaps of large Beowulf networks) to ensure that they are completely reproducible or simulable. Operating a flight recorder involves the continuous expense of WORM media and whatever data-security measures are in place (though these expenditures might be required for ordinary backups in any case). Running a full verification of the AI requires hardware power equivalent to all that used by the AI since the moment of vis conception(!); running a local verification of the time segment between any two snapshots requires hardware power equivalent to that used in the time between snapshots; reconstructing a snapshot of the AI for any given clock time requires hardware power equivalent to that used in the time between that clock tick and the previous snapshot. Examining a snapshot for obscured data requires trusted tools that must be developed solely by humans without any assistance from the AI, which a seed AI programmer is likely to regard as an extremely onerous requirement.
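The verification costs just described reduce to simple arithmetic once compute is measured in clock ticks. The sketch below assumes, purely for illustration, that replaying one tick costs the same as originally running it; `replay_cost` and `full_verification_cost` are hypothetical names.

```python
from bisect import bisect_right
from typing import Sequence


def replay_cost(snapshot_ticks: Sequence[int], target_tick: int) -> int:
    """Ticks that must be re-run to reconstruct the state at `target_tick`,
    starting from the latest snapshot taken at or before it.
    `snapshot_ticks` must be sorted in ascending order."""
    i = bisect_right(snapshot_ticks, target_tick)
    if i == 0:
        raise ValueError("no snapshot at or before the target tick")
    return target_tick - snapshot_ticks[i - 1]


def full_verification_cost(final_tick: int) -> int:
    """A full verification re-runs everything since tick 0."""
    return final_tick
```

With snapshots at ticks 0, 100, and 200, reconstructing tick 150 means re-running 50 ticks, whereas a full verification up to tick 200 means re-running all 200. More frequent snapshots trade storage for cheaper reconstruction, but they do nothing to cheapen the full verification.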
On the whole, it seems likely that flight recorders will only be available to very large-scale projects with very high funding levels, and while the Singularity Institute might dream of having that kind of funding eventually, it doesn't seem likely that we'll have it from Day One.
The benefit of a flight recorder is chiefly as a means of eliminating a source of anxiety by showing that not so much as a radiation bitflip has perturbed the course of the AI, but the amount of anxiety eliminated is out of all proportion to the amount of risk reduction - though there's enough risk reduced to justify the effort required, given very high funding. Furthermore, the amount of anxiety reduced will be much higher for "adversarial attitude" developers than for "Friendship synchrony" developers - flight recorders aren't intrinsically a tool to "keep AIs under control", but they look much more necessary as a tool of enslavement than as a means of checking for low-profile catastrophic FoFs. Flight recorders are only a necessity from the adversarial standpoint. From the perfectionist standpoint, I would say that flight recorders are better than not having flight recorders - but if you have to ask how much it costs, you can't afford it.
A cheaper alternative to flight recorders is an AI that can conceive of the possibility of "failure of Friendship", that has formulated or been taught an ethical injunction against obscuration, and that will thus cooperate with the human programmers to develop a system that guards against obscuration by future rogue AIs. (See 3.2.5.2: AI ethical injunctions and 3.3.3.1: Cooperative safeguards.) This "poor man's flight recorder" would consist of cooperatively developing an autonomic subsystem that summarizes thoughts (symbol-activation structures) in humanly-readable formats, and writes those thoughts to a non-tamperable output, such that the AI's intention to obscure would be indelibly and human-readably recorded before the obscuration could take place. Projects without huge funding should also try to back up frequent snapshots, use a source-control system such as CVS (which allows the reconstruction of the code at any point in time), and perhaps record all inputs - though probably without flight-recorder precision or security. But that's just common sense for any complex programming project.
A "poor man's flight recorder" only works if the desire to obscure (a) is visible at the level being recorded, (b) is recognizable as "the desire to obscure" by a human analyzing the recording using trusted tools, and (c) is recorded before actual obscuration can be planned and carried out. Still, it's better than nothing.
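Condition (c) is an ordering constraint: the summary has to reach the record before the action it describes can take effect. A minimal sketch of that ordering follows, in the style of a write-ahead log. `ThoughtRecorder` and its `sink` parameter are hypothetical, and the sketch says nothing about the much harder problems in (a) and (b), namely producing summaries that are complete and recognizable.

```python
from typing import Callable


class ThoughtRecorder:
    """Toy write-ahead recorder: log the summary first, act second."""

    def __init__(self, sink: Callable[[str], None]) -> None:
        # `sink` stands in for the non-tamperable output; it is assumed to
        # return only once the summary has been durably written.
        self._sink = sink

    def run(self, summary: str, action: Callable[[], None]) -> None:
        self._sink(summary)  # if this raises, the action never runs
        action()             # acts only after the record exists
```

If the write fails, the action is skipped, which is the property that matters here: an intention to obscure cannot be carried out without first leaving a human-readable trace.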
I don't think a human would make the best seed for a Sysop, or a Transition Guide, or any Friendly AI at all, in fact. But, if a human were required, I'd be willing to volunteer, because I think a Friendly AI. I can sacrifice myself if I want to - that doesn't contradict citizenship rights - and I think I would sacrifice myself if the need existed. If a human were needed as raw material for a Sysop - well, I'd greatly prefer to fork a copy of myself before starting on the trip to Sysopdom, to prevent the loss of my current substance, but I'd go ahead even without that if required. Which allows me to maintain nonanthropomorphic identification with a Sysop seed. Self-sacrifice doesn't contradict citizenship rights; nor, I expect, do the citizenship rules prevent the construction of self-sacrificing citizens, as long as my motives are pure; as long as I'd be willing to become that person myself. An adversarial, an exploitative attitude towards a constructed citizen's goal system might turn out to be prohibited as child abuse.