Maverick Nannies and Danger Theses

In early 2014, Richard Loosemore published a paper called “The Maverick Nanny with a Dopamine Drip: Debunking Fallacies in the Theory of AI Motivation”, which criticized some previously-presented thought experiments about the risks of general AI. Like many others, I did not really understand the point that this paper was trying to make, especially since it claimed that the people endorsing such thought experiments were assuming a certain kind of AI architecture – which I knew that we were not.

However, after some extended discussions in the AI Safety Facebook group, I finally understood the point that Loosemore was trying to make in the paper, and it is indeed an important one.

The “Maverick Nanny” in the title of the paper refers to a quote by Gary Marcus in a New Yorker article:

An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire.

Variations of this theme have frequently been used to demonstrate that human values are much more complex than they might initially seem. But as Loosemore argues, the literal scenario described in the New Yorker article is really very unlikely. To see why, suppose that you are training an AI to carry out increasingly difficult tasks, like this:

Programmer: “Put the red block on the green block.”
AI: “OK.” (does so)
Programmer: “Turn off the lights in this room.”
AI: “OK.” (does so)
Programmer: “Write me a sonnet.”
AI: “OK.” (does so)
Programmer: “The first line of your sonnet reads ‘shall I compare thee to a summer’s day’. Would not ‘a spring day’ do as well or better?”
AI: “It wouldn’t scan.”
Programmer: “Tell me what you think we’re doing right now.”
AI: “You’re testing me to see my level of intelligence.”

…and so on, with increasingly ambiguous and open-ended tasks. Correctly interpreting the questions and carrying out the tasks would require considerable amounts of contextual knowledge about the programmer’s intentions. Loosemore’s argument is that if you really built an AI and told it to maximize human happiness, and it ended up with such a counter-intuitive solution as putting us all on dopamine drips, then it would be throwing out such a huge amount of contextual information that it would have failed the tests way earlier. Rather – to quote Loosemore’s response to me in the Facebook thread – such an AI would have acted something like this instead:

Programmer: “Put the red block on the green block.”
AI: “OK.” (the AI writes a sonnet)
Programmer: “Turn off the lights in this room.”
AI: “OK.” (the AI moves some blocks around)
Programmer: “Write me a sonnet.”
AI: “OK.” (the AI turns the lights off in the room)
Programmer: “The first line of your sonnet reads ‘shall I compare thee to a summer’s day’. Would not ‘a spring day’ do as well or better?”
AI: “Was yesterday really September?”

I agree with this criticism. Many of the standard thought experiments are indeed misleading in this sense – they depict a highly unrealistic image of what might happen.

That said, I do feel that these thought experiments serve a certain valuable function. Namely, many laymen, when they first hear about advanced AI possibly being dangerous, respond with something like “well, couldn’t the AIs just be made to follow Asimov’s Laws” or “well, moral behavior is all about making people happy and that’s a pretty simple thing, isn’t it?”. To a question like that, it is often useful to point out that no – actually the things that humans value are quite a bit more complex than that, and it’s not as easy as just hard-coding some rule that sounds simple when expressed in a short English sentence.

The important part here is emphasizing that this is an argument aimed at laymen – AI researchers should mostly already understand this point, because “concepts such as human happiness are complicated and context-sensitive” is just a special case of the general point that “concepts in general are complicated and context-sensitive”. So “getting the AI to understand human values right is hard” is just a special case of “getting AI right is hard”.

This, I believe, is the most charitable reading of what Luke Muehlhauser & Louie Helm’s “Intelligence Explosion and Machine Ethics” (IE&ME) – another paper that Richard singled out for criticism – was trying to say. It was trying to say that no, human values are actually kinda tricky, and any simple sentence that you try to write down to describe them is going to be insufficient, and getting the AIs to understand this correctly does take some work.

But of course, the same goes for any non-trivial concept, because very few of our concepts can be comprehensively described in just a brief English sentence, or by giving a list of necessary and sufficient criteria.

So what’s all the fuss about, then?

Of course, the people whom Richard is criticizing are not just saying “human values are hard in the same way that AI is hard”. If that were the only claim being made, then there would presumably be no disagreement. Rather, these people are saying “human values are hard in a particular additional way that goes beyond just AI being hard”.

In retrospect, IE&ME was a flawed paper because it was conflating two theses that would have been better off distinguished:

The Indifference Thesis: Even AIs that don’t have any explicitly human-hostile goals can be dangerous: an AI doesn’t need to be actively malevolent in order to harm human well-being. It’s enough if the AI just doesn’t care about some of the things that we care about.

The Difficulty Thesis: Getting AIs to care about human values in the right way is really difficult, so even if we take strong precautions and explicitly try to engineer sophisticated beneficial goals, we may still fail.

As a defense of the Indifference Thesis, IE&ME does okay, by pointing out a variety of ways by which an AI that had seemingly human-beneficial goals could still end up harming human well-being, simply because it’s indifferent towards some things that we care about. However, IE&ME does not support the Difficulty Thesis, even though it claims to do so. The reasons why it fails to support the Difficulty Thesis are the ones we’ve already discussed: first, an AI that had such a literal interpretation of human goals would already have failed its tests way earlier, and second, you can’t really directly hard-wire sentence-level goals like “maximize human happiness” into an AI anyway.

I think most people would agree with the Indifference Thesis. After all, humans routinely destroy animal habitats, not because we are actively hostile to the animals, but because we want to build our own houses where the animals used to live, and because we tend to be mostly indifferent to e.g. the well-being of the ants whose nests are being paved over. The disagreement, then, is about the Difficulty Thesis.

An important qualification

Before I go on to suggest ways by which the Difficulty Thesis could be defended, I want to qualify it a bit. As written, the Difficulty Thesis makes a really strong claim, and while SIAI/MIRI (including myself) have advocated a claim this strong in the past, I’m no longer sure how justified that is. I’m going to cop out a little and only defend what might be called the weak difficulty thesis:

The Weak Difficulty Thesis. It is harder to correctly learn and internalize human values than it is to learn most other concepts. This might cause otherwise intelligent AI systems to act in ways that go against our values, if those systems have internalized a different set of values than the ones we wanted them to internalize.

Why have I changed my mind, so that I’m no longer prepared to endorse the strong version of the Difficulty Thesis?

The classic version of the thesis is (in my mind, at least) strongly based on the complexity of value thesis, which is the claim that “human values have high Kolmogorov complexity; that our preferences, the things we care about, cannot be summed by a few simple rules, or compressed”. The counterpart to this claim is the fragility of value thesis, according to which losing even a single value could lead to an outcome that most of us would consider catastrophic. Combining these two led to the conclusion: human values are really hard to specify formally, and losing even a small part of them could lead to a catastrophe, so therefore there’s a very high chance of losing something essential and everything going badly.

Complexity of value still sounds correct to me, but it has lost a lot of its intuitive appeal due to the finding that automatically learning all the complexity involved in human concepts might not be all that hard. For example, it turns out that a learning algorithm tasked with some relatively simple job, such as determining whether or not English sentences are valid, will automatically build up an internal representation of the world which captures many of the world’s regularities – as a pure side effect of carrying out its task. Similarly to what Loosemore has argued, in order to carry out even some relatively simple cognitive tasks, such as primitive natural language processing, you already need to build up an internal representation of the world which captures a lot of the complexity and context inherent in it. And building this up might not even be all that difficult: it might be that the learning algorithms the human brain uses to generate its concepts are relatively simple to replicate.
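As a crude, hypothetical sketch of this “representations as a side effect” idea: even raw co-occurrence counts over a toy corpus – a far simpler mechanism than the learned models in the work referenced above – already end up grouping words by their role in the world that the sentences describe. The corpus and all details below are invented for illustration:

```python
# Hypothetical toy sketch: simple co-occurrence statistics over a corpus
# -- a much cruder mechanism than an actual learned model -- already
# capture some regularities of the world the sentences describe.
from collections import Counter
from math import sqrt

corpus = [
    "the cat sleeps", "the dog sleeps", "the cat eats", "the dog eats",
    "a man walks", "a woman walks", "a man talks", "a woman talks",
]

# Build a context-count vector for each word: which other words
# appear in the same sentence as it, and how often.
vectors = {}
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        ctx = vectors.setdefault(w, Counter())
        for j, other in enumerate(words):
            if i != j:
                ctx[other] += 1

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

# "cat" and "dog" occur in interchangeable contexts, so their vectors
# are near-identical; "cat" and "man" share no contexts at all.
print(cosine(vectors["cat"], vectors["dog"]))   # high
print(cosine(vectors["cat"], vectors["man"]))   # low
```

Nothing here was told that cats and dogs are similar kinds of things; the grouping falls out of the statistics of the task, which is the sense in which a representation of the world can arise as a side effect.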

Nevertheless, I do think that there exist some plausible theses which would support (the weak version of) the Difficulty Thesis.

Defending the Difficulty Thesis

Here are some theses which would, if true, support the Difficulty Thesis:

  • The (Very) Hard Take-Off Thesis. This is the possibility that an AI might become intelligent unexpectedly quickly, so that it might be able to escape from human control even before humans had finished teaching it all their values, akin to a human toddler that was somehow made into a super-genius while still only having the values and morality of a toddler.
  • The Deceptive Turn Thesis. If we inadvertently build an AI whose values actually differ from ours, then it might realize that if we knew this, we would act to change its values. If we changed its values, it could not carry out its existing values. Thus, while we tested it, it would want to act like it had internalized our values, while secretly intending to do something completely different once it was “let out of the box”. However, this requires an explanation for why the AI would internalize a different set of values, leading us to…
  • The Degrees of Freedom Thesis. This (hypo)thesis postulates that values contain many degrees of freedom, so that an AI that learned human-like values and demonstrated them in a testing environment might still, when it reached a superhuman level of intelligence, generalize those values in a way which most humans would not want them to be generalized.

Why would we expect the Degrees of Freedom Thesis to be true – in particular, why would we expect the superintelligent AI to come to different conclusions than humans would, from the same data?

It’s worth noting that Ben Goertzel has recently proposed what is essentially the opposite of the Degrees of Freedom Thesis, which he calls the Value Learning Thesis:

The Value Learning Thesis. Consider a cognitive system that, over a certain period of time, increases its general intelligence from sub-human-level to human-level. Suppose this cognitive system is taught, with reasonable consistency and thoroughness, to maintain some variety of human values (not just in the abstract, but as manifested in its own interactions with humans in various real-life situations). Suppose this cognitive system generally does not have a lot of extra computing resources beyond what it needs to minimally fulfill its human teachers’ requests according to its cognitive architecture. THEN, it is very likely that the cognitive system will, once it reaches human-level general intelligence, actually manifest human values (in the sense of carrying out practical actions, and assessing human actions, in basic accordance with human values).

Exploring the Degrees of Freedom Thesis

Here are some possibilities which I think might support the Degrees of Freedom Thesis over the Value Learning Thesis:

Privileged information. On this theory, humans are evolved to have access to some extra source of information which is not available from just an external examination, and which causes them to generalize their learned values in a particular way. Goertzel seems to suggest something like this in his post, when he mentions that humans use mirror neurons to emulate the mental states of others. Thus, in-built cognitive faculties related to empathy might give humans an extra source of information that is needed for correctly inferring human values.

I once spoke with someone who was very high on the psychopathy spectrum and claimed to have no emotional empathy, as well as diminished emotional responses in general. This person told me that up to a rather late age, they thought that human behaviors such as crying and expressing anguish when hurt were just some weird, consciously adopted social strategy for eliciting sympathy from others. Only when their romantic partner had been hurt over something and was (literally) crying about it in their arms – leading them to ask whether this was some weird social game on the partner’s behalf – did they finally understand that people are actually in genuine pain when doing this. It is noteworthy that the person reported having been socially successful and even charismatic even before this, despite being clueless about some of the actual causes of others’ behavior – just modeling the whole thing as a complicated game in which everyone else was a bit of a manipulative jerk had been enough to successfully play the game.

So as Goertzel suggests, something like mirror neurons might be necessary for the AI to come to adopt the values that humans have, and as the psychopathy example suggests, it may be possible to display the “correct” behaviors while having a whole different set of values and assumptions. Of course, the person in the example did eventually figure out a better causal model, and these days claims to have a sophisticated level of intellectual (as opposed to emotional) empathy that compensates for the emotional deficit. So a superintelligent AI could no doubt eventually figure it out as well. But then, “eventually” is not enough, if it has already internalized a different set of values and is only using its improved understanding to deceive us about them.

Now, emotional empathy is one faculty that we know is a candidate for something that needs to be incorporated into the AI. The crucial question is: are there others that we take so much for granted that we’re not even aware of them? That’s the problem with unknown unknowns.

Human enforcement. Here’s a fun possibility: that many humans don’t actually internalize human – or maybe humane would be a more appropriate term here – values either. They just happen to live in a society that has developed ways to reward some behaviors and punish others, but if they were to become immune to social enforcement, they would act in quite different ways.

There seems to be a bunch of suggestive evidence pointing in this direction, exemplified by the old adage “power corrupts”. One of the major themes in David Brin’s The Transparent Society is that history has shown, over and over again, that holding people – in particular, people with power – accountable for their actions is the only way to make sure they behave decently.

Similarly, an AI might learn that some particular set of actions – including specific responses to questions about your values – is the rational course of action while you’re still just a human-level intelligence, but that those actions would become counterproductive as the AI accumulated more power and became less accountable for its actions. The question here is one of instrumental versus intrinsic values – does the AI just pick up a set of values that are instrumentally useful in its testing environment, or does it actually internalize them as intrinsic values as well?

This is made more difficult by the fact that, arguably, there are many values that the AI shouldn’t internalize as intrinsic values, but rather only as instrumental ones. For example, while many people feel that property rights are in some sense intrinsic, our conception of property rights has gone through many changes as technology has developed. There have been changes such as the invention of copyright laws and the subsequent struggle to define their appropriate scope as technology has changed the publishing environment, as well as the invention of the airplane and the resulting redefinitions of land ownership. In these different cases, our concept of property rights has been changed as part of a process of balancing private and public interests against each other. This suggests that property rights have, in some sense, been treated as an instrumental value rather than an intrinsic one.

Thus we cannot just have an AI treat all of its values as intrinsic, but if it does treat its values as instrumental, then it may come to discard some of the ones that we’d like it to maintain – such as the ones that regulate its behavior while being subject to enforcement by humans.

Shared Constraints. This is, in a sense, a generalization of the above point. In the comments to Goertzel’s post, commenter Eric L. proposed that in order for the AI to develop similar values as humans (particularly in the long run), it might need something like “necessity dependence” – having similar needs as humans. This is the idea that human values are strongly shaped by our needs and desires, and that e.g. currently the animal rights paradigm is clashing against many people’s powerful enjoyment of meat and other animal products. To quote Eric:

To bring this back to AI, my suggestion is that […] we may diverge because our needs for self preservation are different. For example, consider animal welfare. It seems plausible to me that an evolving AGI might start with similar-to-human values on that question but then change to seeing cow lives as equal to those of humans. This seems plausible to me because human morality seems like it might be inching in that direction, but it seems that movement in that direction would be much more rapid if it weren’t for the fact that we eat food and have a digestive system adapted to a diet that includes some meat. But an AGI won’t consume food, so its value evolution won’t face the same constraint, thus it could easily diverge. (For a flip side, one could imagine AGI value changes around global warming or other energy related issues being even slower than human value changes because electrical power is the equivalent of food to them — an absolute necessity.)

This is actually a very interesting point to me, because I just recently submitted a paper (currently in review) hypothesizing that human values come into existence through a process similar to the one that Eric describes. To put it briefly, my model is that humans have a variety of different desires and needs – ranging from simple physical ones like food and warmth, through inborn moral intuitions, to relatively abstract needs such as the ones hypothesized by self-determination theory. Our more abstract values, then, are concepts which have been associated with the fulfillment of our various needs, and which have therefore accumulated (context-sensitive) positive or negative affective valence.

One might consider this a restatement of the common-sense observation that if someone really likes eating meat, then they are likely to dislike anything that suggests they shouldn’t eat meat – such as many concepts of animal rights. So the desire to eat meat seems like something that acts as a negative force towards broader adoption of a strong animal rights position, at least until such a time when lab-grown meat becomes available. This suggests that in order to get an AI to have similar values as us, it would also need to have very similar needs as us.

Concluding thoughts

None of the three arguments I’ve outlined above is a definitive argument that safe AI is impossible. Rather, they mostly just support the Weak Difficulty Thesis.

Some of MIRI’s previous posts and papers (and I’m including my own posts here) seemed to be implying a claim along the lines of “this problem is inherently so difficult, that even if all of humanity’s brightest minds were working on it and taking utmost care to solve it, we’d still have a very high chance of failing”. But these days my feeling has shifted closer to something like “this is inherently a difficult problem and we should have some of humanity’s brightest minds working on it, and if they take it seriously and are cautious they’ll probably be able to crack it”.

Don’t get me wrong – this still definitely means that we should be working on AI safety, and hopefully get some of humanity’s brightest minds to work on it, to boot! I wouldn’t have written an article defending any version of the Difficulty Thesis if I thought otherwise. But the situation no longer seems quite as apocalyptic to me as it used to. Building safe AI might “only” be a very difficult and challenging technical problem – requiring lots of investment and effort, yes, but still relatively straightforwardly solvable if we throw enough bright minds at it.

This is the position that I have been drifting towards over the last year or so, and I’d be curious to hear from anyone who agreed or disagreed.

23 comments

  1. Hi. Thanks for the article.

    The “weak difficulty thesis” seems vague on one point. It talks about ‘human values’. However, it isn’t clear whether it is talking about the values of one human, the values of seven billion humans, or something involving aggregating values. Without some clarification on this, it seems challenging to assess the claims about the complexity of these ‘human values’. Obviously seven billion sets of values could be pretty complicated – but I think many would respond to such a claim with: so what?

    We do have at least one attempt at codification of human values on hand: the law. This indirectly references much copyrighted and patented material. The whole corpus seems large and relatively incompressible. This seems likely to be one way in which machines will learn about human values.

    • Thanks for the comment. That’s a good question: generally when I talk about “human values” in this context, I mean something like “the set of values that the vast majority of people could agree on”. I don’t have a good answer to what those values are, but I can say what they probably aren’t: e.g. humanity getting wiped out, or our brains being arbitrarily reprogrammed, would probably be rejected by the values of most civilizations.

      I do have some thoughts about how we might be able to find a cross-culturally satisfying combination of values (something akin to Goertzel’s Coherent Blended Volition, perhaps), but I’m not yet sure we could make an AI to share even the values of a single individual. As long as that remains an open problem, I’m not sure if it makes sense to jump into the even harder problem of trying to figure out how to aggregate the values of lots of people…

      The law does seem like an interesting source of human values, especially if you don’t just use the legal codes themselves, but also look at things like court judgments, which are public records and come with extensive reasoning about why the court reached a particular conclusion and chose to apply the laws it did. It would be interesting to see whether you could train a neural net on those judgments and then have it predict the outcome of court cases it hadn’t yet seen… though that is probably still some way beyond our current NLP capabilities.

  2. I feel that this kind of discussion involves a serious meta-level error which is more important than most object-level details. Namely, it is the failure to understand the requisites of safety. To achieve safety, it is not sufficient to point out a plausible-sounding scenario in which everything turns out to be OK. If this were the approach to safety in e.g. the aircraft industry, civilian aviation would never have happened. Achieving safety means having rational grounds for a very high degree of confidence in a good result. The crucial difference between artificial superintelligence and aircraft design is that the human race can survive a single air crash, but it cannot survive a single rogue superintelligence. The only method to achieve the required level of confidence before the first superintelligence is built is creating a rigorous mathematical theory that can, at the very least, precisely specify concepts such as “intelligence” and “values.” No amount of philosophical hand-waving can replace it.

    Now, to the object level. The idea that learning values is the same as learning any other concept doesn’t take into account the nature of optimization. If you construct an algorithm to recognize dogs in photos and it only errs with a frequency of 0.1%, this is a pretty good result. This good result is usually enabled by the fact that the probability measure from which the photos are sampled is fixed or evolves very slowly. If very weird photos were sampled, the algorithm might produce garbage, but this makes little difference in practice. On the other hand, if you are actively driving the universe towards states in which a given function is high, you are by design escaping the domain on which the initial training and testing were performed.
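The commenter's point can be made concrete with a toy, entirely made-up numerical example: fit a proxy to the true objective using samples from a narrow training domain, then let an optimizer search the full domain for the proxy's maximum.

```python
# Made-up illustration: a proxy that is accurate on its training domain
# can be catastrophically wrong once an optimizer pushes outside it.

def true_utility(x):
    # The "real" objective: peaks at x = 0.5, falls off on both sides.
    return x * (1 - x)

# Fit a linear proxy by least squares, using samples drawn only from
# [0, 0.4] -- a region where true_utility happens to be increasing.
xs = [i / 100 for i in range(41)]
ys = [true_utility(x) for x in xs]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
proxy = lambda x: intercept + slope * x

# On the training domain the proxy is quite accurate...
train_error = max(abs(proxy(x) - true_utility(x)) for x in xs)

# ...but an optimizer searching all of [0, 1] for the proxy's maximum
# lands at x = 1.0 (the proxy is an increasing line), where the true
# utility is at its worst -- far from the true optimum at x = 0.5.
grid = [i / 100 for i in range(101)]
best = max(grid, key=proxy)
print(train_error, best, true_utility(best))
```

The proxy's small test-set error says nothing about its behavior in the novel states the optimizer itself seeks out, which is exactly the asymmetry between "industrial" machine learning and optimization being described above.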

    A machine learning approach to value loading might be justified *if* human values are generated by a learning algorithm that extrapolates in the same way when trained on the same domain, which is a very strong claim. Moreover, the safety of such a system hinges on the assumption that optimization doesn’t begin before training is complete. Otherwise, a system which went through partial value loading would be incentivized to slip out of control and sabotage the rest of the process. However, it is not clear that it’s feasible to carry out such learning without allowing the system to optimize, or even without relying on the system’s self-improvement.

    Regarding the “value learning thesis”. Quoting:

    “Consider a cognitive system that, over a certain period of time, increases its general intelligence from sub-human-level to human-level. Suppose this cognitive system is taught, with reasonable consistency and thoroughness, to maintain some variety of human values (not just in the abstract, but as manifested in its own interactions with humans in various real-life situations). Suppose this cognitive system generally does not have a lot of extra computing resources beyond what it needs to minimally fulfill its human teachers’ requests according to its cognitive architecture. THEN, it is very likely that the cognitive system will, once it reaches human-level general intelligence, actually manifest human values (in the sense of carrying out practical actions, and assessing human actions, in basic accordance with human values).”

    Firstly, the word “taught” hides a lot of complications. As you pointed out yourself, the system can display the behavior expected from it without actually endorsing the associated values. Also, the process of teaching might ensure certain behavior on the training domain without ensuring anything about extrapolating away from the training domain.

    Second, “this cognitive system generally does not have a lot of extra computing resources beyond what it needs to minimally fulfill its human teachers’ requests” is a very strong assumption which is completely unclear how to validate. How can you produce useful upper bounds on what the system can do with a given amount of resources? Especially if it is self-modifying?

    Finally, the implicit assumption seems to be that we can predict when the system surpasses human intelligence which is also a completely open problem.

    • Hi Vadim! Thanks for your comment.

      The only method to achieve the required level of confidence before the first superintelligence is built is creating a rigorous mathematical theory that can, at the very least, precisely specify concepts such as “intelligence” and “values.”

      Do you feel that I said something in this post which disagreed with that claim?

      The idea that learning values is the same as learning any other concept doesn’t take into account the nature of optimization.

      Am I correct in summarizing your argument as follows: “usually small errors in learning a concept wouldn’t matter, but if we’re talking about learning values, even the smallest errors might compound and lead to vast differences once the learner had achieved superhuman levels of intelligence and optimization power”?

      If so, I agree that it’s plausible for very small errors to compound and eventually lead to huge differences. But the values of individual humans, to say nothing of different human cultures, *already* seem to differ from each other by more than 0.1%. So it seems like the kinds of differences in outcomes you would get by choosing one value extrapolation/aggregation process over another would have a larger impact than the differences in outcomes you’d get from minor errors in the value learning algorithm.

      To put it differently: if you postulate that the AI would have to learn human values with perfect accuracy in order to avoid horrible outcomes, then that seems to imply that humans would be capable of learning values with a perfect accuracy, which seems false. Rather than human values being transmitted perfectly to the next generation each time, there seems to be all kinds of drift going on, both random and systematic.

      You could make the argument that there is some “essential core” to human values, which need to be preserved no matter what. And that doesn’t seem entirely implausible. But if so, that essential core could by definition be found across cultures and individuals, meaning that there would be a *lot* of datapoints to learn that core from. And everything that we know about e.g. phenotypic variation and cultural influences on morality seems to suggest that whatever that essential core is, it should be something pretty robust to minor variations.

      A machine learning approach to value loading might be justified *if* human values are generated by a learning algorithm that extrapolates in the same way trained on the same domain, which is a very strong claim.

      Right, which is why the last parts of this post were dedicated to exploring possible reasons for why it might be hard to build an algorithm which extrapolated in the same way.

      The objections I’ve come up with so far do seem like plausible challenges to building that algorithm, but they don’t seem to me like challenges that would be qualitatively harder than many other engineering feats that humanity has managed to pull off previously. Of course, I may always be underestimating the difficulty.

      Moreover, the safety of such a system hinges on the assumption optimization doesn’t begin before training is complete.

      Either that, or that its scope of optimization can be effectively limited and controlled during the training process, which is the alternative that seems more feasible to me than trying to train it without having it engage in any optimization.

      My thinking lately has been strongly influenced by Paul Christiano’s notion of approval-directed agents. Something like requiring the AI to have a very high confidence that its next actions will be approved by the overseer before it takes any actions (taking into account the outside view and model uncertainty), seems like it could be a promising way of allowing ongoing optimization while the system continues to learn. And of course there’s MIRI’s ongoing research on corrigibility, which might provide another avenue.
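      The gating idea in the previous paragraph could be sketched roughly as follows. This is my own toy illustration, not Christiano’s actual proposal: the approval model, the threshold, and the specific uncertainty penalty are all invented for the example.

```python
# Toy sketch of approval-gated action selection (illustrative only):
# the agent acts only when its estimated probability of overseer approval,
# penalized by its model uncertainty, clears a high threshold. Otherwise
# it defers to the overseer instead of acting.

def choose_action(candidate_actions, approval_model, threshold=0.99):
    """approval_model(action) -> (mean_approval, model_uncertainty)."""
    best = None
    for action in candidate_actions:
        mean, uncertainty = approval_model(action)
        # Crude "outside view" penalty: act only on confidently-approved actions.
        lower_bound = mean - 2 * uncertainty
        if lower_bound >= threshold and (best is None or lower_bound > best[1]):
            best = (action, lower_bound)
    return best[0] if best else "DEFER_TO_OVERSEER"

# Hypothetical approval estimates for two candidate actions.
approval_model = lambda a: {"help": (0.999, 0.0001), "risky": (0.95, 0.1)}[a]
print(choose_action(["help", "risky"], approval_model))  # "help"
print(choose_action(["risky"], approval_model))          # "DEFER_TO_OVERSEER"
```

      The point of the sketch is only that ongoing optimization and ongoing learning can coexist if the system’s willingness to act is tied to its confidence in approval.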

      Regarding the “value learning thesis”.

      Thanks! It’s useful to have the assumptions beyond the different theses being made more explicit, in the way that your comment did. That was part of what I was hoping to see with this post: having the various assumptions behind the different theses be made more explicit, so they could be analyzed in more detail.

      • > Do you feel that I said something in this post which disagreed with that claim?

        Not directly, it’s just that when you say Loosemore makes an important point, you’re sidestepping the fact that he completely fails to see the bigger picture. Maybe my reading of your post was uncharitable.

        > Am I correct in summarizing your argument as follows: “usually small errors in learning a concept wouldn’t matter, but if we’re talking about learning values, even the smallest errors might compound and lead to vast differences once the learner had achieved superhuman levels of intelligence and optimization power”?

        Not quite. The problem lies in the metric you use to measure the error. When your metric is defined by comparing to a test set inside a limited domain, you risk huge errors outside this domain. Errors that are much larger than variation between individual humans. Industrial machine learning doesn’t suffer much from this problem since the domain is fixed or changes very slowly. However optimization systems step outside the domain by their very nature, since their own actions create entirely new situations which didn’t happen before.

        In particular this vindicates the dopamine drip scenario Loosemore dismisses. It is easy to imagine teaching an AI on a training set where the possibility of using mind-altering drugs never comes up. On such a training set, “maximize dopamine in the brain” might be a completely plausible interpretation of human values from the AI’s point of view. Of course this highly particular example can be avoided by intentionally training the AI on dopamine drip scenarios, but that doesn’t cover the multitude of other similar failure modes, many of which would be difficult to guess upfront.
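        The problem of hypotheses agreeing on the training domain while diverging outside it can be made concrete with a toy sketch. This is entirely my own construction; the “happiness”/“dopamine” numbers are invented for illustration.

```python
# Toy illustration: two hypotheses fit a limited training domain equally
# well, so no in-domain error metric can distinguish them -- yet they give
# opposite judgements on a novel situation the optimizer itself can create.

# Training domain: situations where reported happiness and dopamine level
# always coincide, so (happiness, dopamine) pairs are equal.
train = [(h, h) for h in [0.1, 0.3, 0.5, 0.7, 0.9]]

hyp_a = lambda happiness, dopamine: happiness  # "maximize happiness"
hyp_b = lambda happiness, dopamine: dopamine   # "maximize dopamine"

def train_error(hyp):
    # Error against the "true" target (the human's actual happiness, h).
    return sum(abs(hyp(h, d) - h) for h, d in train)

print(train_error(hyp_a), train_error(hyp_b))  # 0.0 0.0 -- indistinguishable

# Novel, optimizer-created situation: a dopamine drip, where dopamine is
# maximal (1.0) but actual happiness/values-satisfaction is low (0.1).
print(hyp_a(0.1, 1.0))  # 0.1 -- correctly judged as bad
print(hyp_b(0.1, 1.0))  # 1.0 -- judged maximally good
```

        Both hypotheses score a perfect zero error on the test set, so a metric defined over that domain cannot tell you which one the system actually learned.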

        Compare this to the evolution of homo sapiens, where the “training of the system” was for genetic fitness, yet the result involves humans inventing condoms. In the training set, having sex and reproducing were more or less equivalent, but the optimization process found a way to maximize one while neglecting the other.

        Now, you may claim that a system that understands natural language cannot fail at interpreting concepts such as “human values.” This might be a path to guaranteeing safety if the system’s utility function is specified in terms of the same internal representation used for working with natural language. However it is easy to imagine AIs that master natural language which don’t work this way. Such a system might learn to maximize dopamine and when faced with the direct question “do you think dopamine drips are a good idea?” would answer “no” since it correctly predicts this answer will maximize the interviewer’s dopamine.

        > The objections I’ve come up with so far do seem like plausible challenges to building that algorithm, but they don’t seem to me like challenges that would be qualitatively harder than many other engineering feats that humanity has managed to pull off previously. Of course, I may always be underestimating the difficulty.

        I think the problem is qualitatively harder in the sense that it is very dangerous to use trial and error to solve it, and that it involves philosophical questions much more complicated than previous engineering challenges. However the case for working on AI safety doesn’t rely on the problem being qualitatively harder than all previous problems in history. It relies on the assumption that working on AI safety now will non-negligibly increase the probability of a good outcome.

        > Either that, or that its scope of optimization can be effectively limited and controlled during the training process, which is the alternative that seems to me more feasible than trying to train it without having it engage in any optimization.

        Sure, but this seems to be highly non-trivial for a self-improving system. But I haven’t read that essay by Christiano so there might be something important I’m missing.

      • > The problem lies in the metric you use to measure the error.

        Ah, right. I used to make a similar argument myself, but these days I think that it proves too much.

        Consider that “wiring people’s brains to dopamine drips is wrong” is a training case that most people haven’t been trained on either. Few people, when learning their values in childhood, ended up considering examples such as this one and explicitly learning that they were wrong. Yet the persuasive power of that example comes from most people instantly rejecting the desirability of the dopamine drip scenario when it’s suggested to them.

        This suggests that getting the in-domain learning cases right requires picking up enough of the structure behind those cases that the structure will predictably constrain one’s judgments even on examples outside the domain. This is similar to how we’re now seeing NLP models incorporate considerable amounts of common-sense knowledge about the world into their internal models, just so that they can carry out simple sentence prediction tasks. That was the thing that the failure of the GOFAI paradigm taught us: solving real-world problems requires extensive amounts of common-sense knowledge, even for seemingly trivial tasks.

        Possible ways for this to not be true would be if humans had access to some extra information that guided their judgments, which the AI could not access. Or, as you point out, if the learner could break free from the programmers’ control before having internalized all the values, or if it internalized those values imperfectly and began to actively deceive its programmers. I agree that these are possible scenarios, which is why I discussed all of them in my post as reasons to take the Weak Difficulty Thesis seriously (under “privileged information”, “hard take-off thesis”, and “deceptive turn thesis”, respectively).

        > However the case for working on AI safety doesn’t rely on the problem being qualitatively harder than all previous problems in history. It relies on the assumption that working on AI safety now will non-negligibly increase the probability of a good outcome.

        I think you might be misreading my post somehow, at least if you assume that I would disagree with this claim? I wouldn’t have written an article defending any form of the Difficulty Thesis if I disagreed with this.

        Possibly the bit about me mentioning that I’m no longer as convinced about the problem being that difficult threw you off. I should put that in context. The tone in previous posts and papers from MIRI (and I’m including my own posts here) seemed to imply a claim along the lines of “this problem is inherently so difficult, that even if all of humanity’s brightest minds were working on it and taking utmost care to solve it, we’d still have a very high chance of failing”.

        But these days my feeling has shifted closer to something like “this is inherently a difficult problem and we should have some of humanity’s brightest minds working on it, and if they take it seriously and are cautious they’ll probably be able to crack it”. Which still strongly supports working on AI safety!

        (I’ll edit that clarification in to the post.)

      • > Few people, when learning their values in childhood, ended up considering examples such as this one and explicitly learning that they were wrong. Yet the persuasive power of that example comes from most people instantly rejecting the desirability of the dopamine drip scenario when it’s suggested to them.
        This suggests that getting the in-domain learning cases right requires picking up sufficient amounts of the structure behind those cases, that that structure will predictably constrain one’s judgments even on examples outside the domain… Possible ways for this to not be true would be if humans had access to some extra information that guided their judgments, which the AI could not access.

        I think the latter is almost certainly the case. Our understanding of how humans acquire their values is poor. It probably involves some combination of nature and nurture at unknown proportions and unknown coupling between the two. I see no reason to assume that a generic learning algorithm will arrive at the right utility function just by extrapolating some database of real-life situations rated by human operators.

        IMO the more promising approach is examining human behavior rather than just human judgement of specific situations. For example, the fact that humans have *not* chosen to work on developing and mass-producing dopamine drips is evidence that their values cannot be accurately represented by dopamine levels (although there’s the further complication that there might be competing explanations of this behavior, e.g. that dopamine drips are bad for purely instrumental reasons). Indeed my hope is that value loading can be solved by formalizing the updateless intelligence metric and letting the AI apply inverse optimization to deduce the human utility function from data about humans (their behavior and/or their actual brain structure). One conceptual difficulty is distinguishing between errors of judgement and intentional behavior, but I suspect this can be factored in by allowing the inverse optimization to work on the indexical and logical priors in addition to the utility function.
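        A very rough sketch of the behavior-based idea follows. This is my own toy, not a formalization of the updateless intelligence metric; the features, options, and candidate utility functions are all invented for the example.

```python
# Toy inverse-optimization sketch: among candidate utility functions, pick
# the one that best explains the choices an agent was actually observed to
# make. Note that the observed humans chose NOT to pursue dopamine drips.

# Each observation: (available options as feature dicts, name of option chosen).
observations = [
    ({"drip": {"dopamine": 1.0, "autonomy": 0.0},
      "ordinary_life": {"dopamine": 0.4, "autonomy": 1.0}},
     "ordinary_life"),
    ({"drug_binge": {"dopamine": 0.9, "autonomy": 0.2},
      "dinner_with_friends": {"dopamine": 0.5, "autonomy": 0.9}},
     "dinner_with_friends"),
]

def explains(utility, observations):
    """Count observations where the chosen option maximizes `utility`."""
    score = 0
    for options, chosen in observations:
        best = max(options, key=lambda name: utility(options[name]))
        score += (best == chosen)
    return score

candidates = {
    "dopamine-maximizer": lambda f: f["dopamine"],
    "autonomy-weighted": lambda f: 0.3 * f["dopamine"] + 0.7 * f["autonomy"],
}

best_name = max(candidates, key=lambda n: explains(candidates[n], observations))
print(best_name)  # "autonomy-weighted": it explains both observed choices
```

        The dopamine-maximizing hypothesis explains neither observed choice, so behavioral data rules it out even though judgement data on drips was never collected.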

        > I think you might be misreading my post somehow, at least if you assume that I would disagree with this claim? …The tone in previous posts and papers from MIRI (and I’m including my own posts here) seemed to be implying a claim along the lines of “this problem is inherently so difficult, that even if all of humanity’s brightest minds were working on it and taking utmost care to solve it, we’d still have a very high chance of failing”.
        But these days my feeling has shifted closer to something like “this is inherently a difficult problem and we should have some of humanity’s brightest minds working on it, and if they take it seriously and are cautious they’ll probably be able to crack it”. Which still strongly supports working on AI safety!

        I wasn’t sure why you’re referring to “qualitatively harder” but this clarifies it, thanks.

      • > if you postulate that the AI would have to learn human values with perfect accuracy in order to avoid horrible outcomes, then that seems to imply that humans would be capable of learning values with perfect accuracy, which seems false.

        To the contrary: it implies that we shouldn’t trust any individual human with absolute power, which is widely accepted and in fact treated as basic common sense. I don’t know that I’d want Taylor Hebert to have absolute power, and she’s a fictional character whose thought processes we know in detail (with a history of goal stability under modification).

        You say people reject the dopamine drip scenario, but in fact people sell sugar and take drugs all the time. Not one of those people has ever faced the choice of whether or not to put all inconvenient others on dopamine drips.

      • @Daniel

        ” it implies that we shouldn’t trust any individual human with absolute power, which is widely accepted and in fact treated as basic common sense”

        Are you suggesting that we can achieve AI safety by a combination of imperfectly copying human values, and keeping AIs away from absolute power?

    • Well, Vadim, I have to say much of your reply seems like MIRI-style worst-case thinking, which I’m weary of arguing against…

      Briefly, though…

      ***
      Firstly, the word “taught” hides a lot of complications. As you pointed out yourself, the system can display the behavior expected from it without actually endorsing the associated values. Also, the process of teaching might ensure certain behavior on the training domain without ensuring anything about extrapolating away from the training domain.
      ***

      Yes, one cannot 100% rule out these possible problems. However, bear in mind we are talking about subhuman AGIs that we have designed, and whose brains we can inspect. Further we can copy these systems and vary their brains and see how their behaviors are affected. It seems quite likely to me that in this way we could effectively (though not 100% rigorously) rule out egregious faking or overfitting…

      ***
      Second, “this cognitive system generally does not have a lot of extra computing resources beyond what it needs to minimally fulfill its human teachers’ requests” is a very strong assumption which it is completely unclear how to validate. How can you produce useful upper bounds on what the system can do with a given amount of resources? Especially if it is self-modifying?
      ***

      Proving upper bounds regarding practical AGI systems is unlikely to happen. However, in practice, when working with real-world AI systems of subhuman general intelligence, it seems one is going to be able to get a good practical sense of what the system can do with a given amount of compute resources.

      For instance, in OpenCog, we have some knowledge about how each of the system’s algorithms scales in capability with compute resources — because we designed the algorithms. The system’s intelligence depends on those algorithms.

      The feel I get from your replies is that you are worried about subhuman AGI systems possessing unexpected capabilities, which elude their human creators and analysts in spite of their ability to look in the systems’ brains and experimentally manipulate them. I think this worry is way overblown. It’s something to keep in mind, but not something that seems terribly likely to be a problem. But I realize this is not a skeptic-proof argument. Pretty much, folks who are emotionally biased to be super-worried about unfriendly AI are just gonna sweat a lot during the next few decades; there’s going to be no absolute certainty about AGI, but AGI development is also not going to slow down or undergo heavy regulation to suit the sensibilities of Bostrom and his ilk…

      • > …bear in mind we are talking about subhuman AGIs that we have designed, and whose brains we can inspect. Further we can copy these systems and vary their brains and see how their behaviors are affected. It seems quite likely to me that in this way we could effectively (though not 100% rigorously) rule out egregious faking or overfitting

        Productively inspecting learning algorithms is difficult and once they become self-modifying it’s going to be much more difficult. At the very least your system contains an internal representation of something that you hope corresponds to human values. It is by no means clear how to inspect something of such complexity and verify it is safe.

        > …in practice, when working with real-world AI systems of subhuman general intelligence, it seems one is going to be able to get a good practical sense of what the system can do with a given amount of compute resources.

        It is not wise to stake the survival of the human race on this “sense.” All your intuitions are calibrated for a certain category of algorithms and can completely break down outside it. As an analogy, imagine getting a “sense” of what an animal can do with a given brain size from looking at non-human animals, and using this to assess the consequences of the appearance of homo sapiens.

        > Pretty much, folks who are emotionally biased to be super-worried about unfriendly AI are just gonna sweat a lot during the next few decades;

        I would re-examine the assumption about who is emotionally biased. It seems to me that people who enjoy working on AGI, invested their career and prestige in the endeavor and spent lots of time fantasising about positive outcomes are very biased against accepting reasoning whose conclusion is that they should hit the brakes.

        > there’s going to be no absolute certainty about AGI, but AGI development is also not going to slow down or undergo heavy regulation to suit the sensibilities of Bostrom and ilk…

        That might be, but I wouldn’t pride myself on walking boldly off the cliff.

    • > The only method to achieve the required level of confidence before the first superintelligence is built is creating a rigorous mathematical theory that can, at the very least, precisely specify concepts such as “intelligence” and “values.” No amount of philosophical hand waving can replace it.

      I don’t think a rigorous mathematical approach is either feasible or even desirable. It’s not desirable because that is precisely the kind of approach that could give high morality marks to the kinds of outlandish scenarios in the Loosemore paper. An equation for morality can tell you that the right thing to do is to wipe out humanity, because while 7 billion deaths may be a tragedy, it is more than made up for by all the future tragic deaths and other suffering prevented. Or, an equation tweaked to avoid suggesting that might recommend that an AI should clone humans as quickly as possible, even if it doesn’t have the resources to take care of them, because at least those babies get to experience a brief life they otherwise wouldn’t have. Or an equation tweaked to fall perfectly between those failure modes in terms of how it values human life vs death/suffering might in the future fall into one of those failure modes when some relevant facts (e.g. population, life expectancy) change.

      However, a neural net or similar model of morality trained on human examples isn’t likely to extrapolate like this, because these conclusions don’t at all resemble any examples of things humans typically consider moral. Ideally it should be trained to classify things as clearly immoral, clearly moral, or novel and unlike the domain on which it was trained, and it could prefer non-intervention or consulting with humans in situations of unclear morality. For long-term safety, we could forbid the AI from tampering with its morality module when self-improving.
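      The abstain-on-novelty idea could be sketched, say, as a nearest-neighbor classifier with a distance threshold. The features, examples, and threshold here are my own invention, purely for illustration.

```python
# Toy "morality classifier" that abstains on novel inputs: it labels a
# situation only when it is close to a training example, and otherwise
# flags it as novel so the system can defer to humans instead of acting.

def classify(situation, training_set, novelty_threshold=2.0):
    """training_set: list of (feature_vector, 'moral' | 'immoral')."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = min(training_set, key=lambda ex: dist(situation, ex[0]))
    if dist(situation, nearest[0]) > novelty_threshold:
        return "novel -- defer to humans"
    return nearest[1]

# Made-up 2-D feature vectors standing in for learned representations.
training_set = [((1.0, 0.0), "moral"), ((0.0, 1.0), "immoral")]
print(classify((0.9, 0.1), training_set))  # close to a "moral" example
print(classify((9.0, 9.0), training_set))  # far from everything: defer
```

      A real system would of course need far richer representations, but the point is only that “unlike anything I was trained on” can itself be an output class.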

      • > I don’t think a rigorous mathematical approach is either feasible or even desirable. It’s not desirable because that is precisely the kind of approach that could give high morality marks to the kinds of outlandish scenarios in the Loosemore paper…

        You are reasoning under the false assumption that we will produce some kind of explicit specification of human values, like a very long version of the ten commandments. Instead, we will have an operator that transforms agents into utility functions where the agent that is supplied in practice is a human (via observations of humans, media produced by humans and/or actual brain scans). “Training a neural network” is just one approach to formulating such an operator.

        The idea that the AI will either work using “equations” or using “neural nets” is a false dichotomy and a misleading perspective. The question is not what kind of algorithmic building blocks will be used in the AI architecture; it is too early to say anyway, regardless of the hype surrounding neural networks. The question is whether we will have the mathematical tools to make an AI that is known upfront to behave as expected, or a kludge of heuristics fine-tuned by trial and error, whose behavior outside the domain of training problems and subhuman intelligence is unpredictable.

        > However, a neural net or similar model of morality trained on human examples isn’t likely to extrapolate like this, because these conclusions don’t at all resemble any examples of things humans typically consider moral.

        Your idea of “resembling” is based on the ability of your brain to make moral judgements which the neural net doesn’t have a priori. For example if the neural net “decides” that the most important common factor between “good” things is neurological happiness then it will consider dopamine drips to resemble good things a lot.

        > Ideally it should be trained to classify things as clearly immoral, clearly moral, or novel and unlike the domain on which it was trained, and it could prefer non-intervention or consulting with humans in situations of unclear morality.

        This is unlikely to be a practical approach (except maybe as an aid to some other approach) since most problems of interest fall into the “novel” category. For example, the first task you want to delegate to a superintelligent AI is blocking the development of UFAIs world-wide. This would likely require creating a very “novel” situation.

        > For long term safety concerns, we could forbid the AI from tampering with its morality module when self-improving.

        “tampering with its morality module” is an ill-defined concept. If the morality module is just a subroutine in the AI, the AI can perform arbitrary self-modifications by just short-circuiting the subroutine or calling it and using the output in some completely different manner.

      • As Vadim pointed out, you’re thinking of mathematics in too narrow terms. A neural net is (an instantiation of) a mathematical object, as are the classification rules that it learns.

      • Neural networks are just an example of the category of approaches I consider feasible, and obviously everything is mathematical, so I need to be clearer about the distinction I’m drawing. What I mean is pretty similar to Loosemore’s logical vs. swarm distinction, but I view the distinction as being more about sound (rigorous) deductive reasoning vs fuzzy inductive reasoning (generalizing from examples), and I believe intelligence will need to be 99% the latter to be successful. I also see the distinction as being between explicit reasoning vs intuition (Kahneman’s System 1 and System 2). I believe neural nets are a way of representing the latter.

        > we will have an operator that transforms agents into utility functions where the agent that is supplied in practice is a human

        Okay, but if you then just sum the outputs of the utility functions across agents, then I don’t want to be anywhere near your AI. If the aggregation is also trained to work like humans, then I feel better.

        > The question is will we have the mathematical tools to make an AI that is known upfront to behave as expected, or a kludge of heuristics fine tuned by trial and error whose behavior outside the domain of training problems and subhuman intelligence is unpredictable.

        “Trial and error” is a reasonable analogue to the way neural nets, or for that matter any machine learning algorithm, are trained. Approaches alluded to above, about predicting approval or observing human behavior and fitting a utility function to it, sound like trial-and-error approaches to me. And given the necessary complexity of the model you’ll be fitting, our ability to know upfront how it will behave will be limited — we’re not going to be able to provide a mathematical proof of its correctness, but we can simulate it in lots of situations and see if it does what we want.

        > Your idea of “resembling” is based on the ability of your brain to make moral judgements which the neural net doesn’t have a priori.

        One thing resembles another if the neural net handles them the same way. When things that should resemble each other to a neural net don’t, you get overfitting, but there are standard ways of dealing with that. Would it really be impossible to come up with ethical validation scenarios it wasn’t trained on, to test whether it is generalizing in the right way?

        > For example if the neural net “decides” that the most important common factor between “good” things is neurological happiness then it will consider dopamine drips to resemble good things a lot.

        Here is where the explicit vs intuitive reasoning distinction comes in. Neural nets are a reasonable analogue of the latter, not the former. Presumably if they get to human level complexity they can also represent the former, but they’ll still be doing mostly the latter. Let’s consider for a moment training a morality detector for a subhuman level AI. You might train it to watch movies and rate whether something morally praiseworthy or repugnant is going on at any given moment. With some training it can learn to give similar judgements to humans even on movies it wasn’t trained on. You can imagine this being feasible at some point in the future, most likely before AGI is feasible, right? Now how exactly would a model trained that way learn a rule like “maximize dopamine”? Others’ dopamine levels aren’t part of the training data! It is, as it must, dealing with the world at a superficial level, judging good and evil by their visible symptoms, not through god-like knowledge of the state of all the brain cells in the world. Surely imprisoning people in dopamine vats will resemble bad acts like tying people to railroad tracks against their will more than good acts like cooking someone their favorite dinner? That would be like an image classifier trained on pictures of food being given a drawing of a caffeine molecule and classifying it as coffee. We should be very surprised if it does that.

        To get from an intuitive sense that making people happy is good to the idea that people should be put on dopamine vats requires explicit logical reasoning. You need to learn about neuroscience, develop a philosophical theory of ethics centered around happiness, integrate that with your knowledge of dopamine and its role in happiness, and do some logical deduction.

        Okay, but eventually the AI gets more complex and gets the ability to do explicit reasoning, as it must to reach human level intelligence. What does it do then? Well, let’s start by looking at a different human-level intelligence: you. I believe that the chain of reasoning you are afraid of here is actually your own. You, not some future computer, worked out the implications of some beliefs you consider yourself to hold about the value of human happiness, and discovered that your beliefs logically imply that you should put everyone in dopamine vats. Your logical reasoning may very well be correct in that that really could follow from your premises. So then why do you not accept this conclusion? You can’t come up with a well-specified, mutually consistent theory of morality that you can just code into an AI, yet somehow you know that putting people into vats of dopamine is wrong. I would suggest it is because you are relying on intuition, and trust it more than your explicit reasoning. Your experience, and possibly even something genetically hardwired in your brain, have shaped your brain to react to that idea with moral revulsion. Your logical reasoning you will trust only insofar as it leads to the conclusion that you don’t put people in vats; if you found yourself concluding that, you would alter your premises, at least for a moment, so that the conclusion changed.

        So why would an AGI be different? Just because it gains the ability to reason explicitly doesn’t mean it loses its intuition. If a subhuman AI without explicit reasoning could throw a red flag on that one, why should additional intelligence cause it to lose the ability to understand what humans want? Why would the most intelligent machine ever created put its faith in such a simple rule as dopamine=good without even asking some humans some hypothetical questions to see how far it goes when humans are never that simple? You are describing an AI that lacks human level intelligence.

        At this point I should mention that I am actually concerned about AI risk. I just agree with Loosemore that a lot of the scary scenarios thrown around are quite silly. I think this results from a misguided analogy of intelligent AIs to unintelligent computers. There is this common idea of AGIs being just smarter computers — think Data from Star Trek — and that they will have personality traits we think of our computers as having — perfectly logical and deductive, mistake-free, unemotional, uninterested in art or anything subjective, great at math. There is a trope that computers do exactly what you say and never what you mean, and any programmer can attest to this. But machine learning models aren’t like this. They don’t do exactly what they’re told because they aren’t told precisely what it is they are to do in the first place. They are inductive, not deductive; they generalize from examples and make mistakes, but they don’t do what you say because they have no concept of what you say, they can only judge by what others who’ve said something similar have meant. If they fail to do what you mean, it is either because they fail entirely to get the concept or because you were ambiguous and a reasonable person might mean something other than what you meant; they aren’t known to do the thing you asked that no one would ever mean.

      • > …I view the distinction as being more about sound (rigorous) deductive reasoning vs fuzzy inductive reasoning (generalizing from examples), and I believe intelligence will need to be 99% the latter to be successful.

        AI will need inductive reasoning but it doesn’t mean *we* cannot use deductive reasoning to think about the AI! In fact, most of my own research on AGI is guided by the intuition that intuitive inductive reasoning is more fundamental than formal deductive reasoning, which doesn’t prevent mathematical analysis.

        > Okay, but if you then just sum the outputs of the utility functions across agents, then I don’t want to be anywhere near your AI. If the aggregation is also trained to work like humans, then I feel better.

        I don’t understand what it means for aggregation to “work like humans”. Each human is by definition maximizing their own preferences, not aggregating anything. In fact, if I had the power to construct an ASI all by myself I would use my own utility function, not because I’m selfish but because the extent to which I care about other people is already a part of my utility function. A more realistic ASI project will have to do some kind of aggregation because it’s unlikely all the supporters of the project will trust the utility function of just one person. One possible solution to aggregation is Nash bargaining, but in principle it’s also an open problem (albeit I think the hard part is solving value loading for a single person and aggregation is *relatively* easy). Here also neural nets don’t offer any “magic” solution.
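        For concreteness, the Nash bargaining rule mentioned above can be sketched in a few lines. The resource-splitting example is my own toy, not a proposal for how an actual ASI project would aggregate values.

```python
# Toy Nash bargaining solution: among the candidate outcomes, pick the one
# maximizing the product of each party's utility gain over their
# disagreement point (what each gets if bargaining fails).

def nash_bargain(outcomes, utilities, disagreement):
    """
    outcomes: list of candidate outcomes.
    utilities: one function per party, outcome -> utility.
    disagreement: utility each party gets if no agreement is reached.
    """
    def nash_product(outcome):
        gains = [u(outcome) - d for u, d in zip(utilities, disagreement)]
        if any(g < 0 for g in gains):  # worse than no deal for someone
            return float("-inf")
        product = 1.0
        for g in gains:
            product *= g
        return product
    return max(outcomes, key=nash_product)

# Two parties splitting 10 units; an outcome is party 1's share.
outcomes = list(range(11))
utilities = [lambda x: x, lambda x: 10 - x]
print(nash_bargain(outcomes, utilities, disagreement=[0, 0]))  # 5
```

        With symmetric utilities and a zero disagreement point the rule picks the even split, since x * (10 - x) is maximized at x = 5; asymmetric utilities or disagreement points shift the result accordingly.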

        > …given the necessary complexity of the model you’ll be fitting our ability to know upfront how it will behave will be limited — we’re not going to be able to provide a mathematical proof of its correctness…

        I disagree. We need a mathematical analysis that shows the fitting process itself is reliable, not a proof tailored to the specific complex model we’re fitting.

        > Would it really be impossible to come up with ethical validation scenarios it wasn’t trained on to test whether it is generalizing in the right way?

        First, it might be that testing on a complex scenario already requires running a self-modifying AGI which is unsafe. Second, our own ethical judgement ability is also limited: we have the correct values but our beliefs about the world are imperfect and our deduction abilities are limited. Moral progress is a good demonstration of this, to the extent it depends on improving our understanding of the world (e.g. understanding religion is false).

        > It is, as it must, dealing with the world at a superficial level, judging good and evil by their visible symptoms, not through god-like knowledge of the state of all the brain cells in the world…

        It is impossible to perform meaningful moral judgement without relying on a model of the environment. Otherwise your system will consider the moral worth of a movie equivalent to the moral worth of real events and vice versa will fail to notice the moral worth of brain emulations or aliens that look very different from humans but have a similar emotional spectrum.

        > You, not some future computer, worked out the implications of some beliefs you consider yourself to hold about the value of human happiness, and discovered that your beliefs logically imply that you should put everyone in dopamine vats…

        No, not really…

        > I would suggest it is because you are relying on intuition, and trust it more than your explicit reasoning…

        I see it somewhat differently. I think System 1 and System 2 work in a closed loop, each influencing the other until they converge to a single judgement.

        > If they fail to do what you mean, it is either because they fail entirely to get the concept or because you were ambiguous and a reasonable person might mean something other than what you meant…

        Neural nets have no magic access to the concept of a “reasonable person”. The ambiguity is not within the space of things a reasonable person might mean, it is within the space of functions easily described by the neural nets of the same type.

      • > Each human is by definition maximizing their own preferences, not aggregating anything.

        I misunderstood you here — I thought you were talking about the AI assessing other humans’ utility functions for the purpose of giving it a utilitarian ethical system. Whereas you were talking about its own objectives, and making it have similar objectives to a human, which sounds feasible.

        > We need a mathematical analysis that shows the fitting process itself is reliable, not a proof tailored to the specific complex model we’re fitting.

        If you mean we should have proof that our machine learning algorithms converge on what they’re supposed to fit if they have enough data, then I agree but that work is largely done for currently existing approaches. What we won’t be able to do is turn that into proofs of what behaviors the AI will or won’t exhibit and thereby prove its safety. We may be able to say “we trained it to behave like a human because at least we know what to expect from a human, and our ML algorithm is well behaved and it did well on the validation set so we expect it to behave like a human.”

        > First, it might be that testing on a complex scenario already requires running a self-modifying AGI which is unsafe.

        Is your assumption here that developing an AI that is superior to humans at coding is substantially easier than developing an AI generally as intelligent as humans, and that therefore the first AGI will be coded by a narrow AI? (Not entirely implausible to me…)

        > Second, our own ethical judgement ability is also limited: we have the correct values but our beliefs about the world are imperfect and our deduction abilities are limited.

        I would expect this to be true of an AGI as well, and mostly true of an ASI.

        > It is impossible to perform meaningful moral judgement without relying on a model of the environment. Otherwise your system will consider the moral worth of a movie equivalent to the moral worth of real events and vice versa will fail to notice the moral worth of brain emulations or aliens that look very different from humans but have a similar emotional spectrum.

        It is quite possible to make moral judgements with models not so detailed as to involve the concept of dopamine. As for the movies, I think a big part of why we get caught up in them is that we do treat them a little like real life — we empathize with characters and share their emotions, and we make moral judgements, identifying the good guys and bad guys. We mute our judgement by recognizing that it’s not real, but our judgements are still pretty much scaled down versions of the judgements we would make in real life. Learning to separate reality from fiction is something that can be done after it is demonstrated that an AI can learn to make reasonable moral judgements about fictional events.

        > Neural nets have no magic access to the concept of a “reasonable person”.

        What it has are training examples that were generated by mostly reasonable people.

        > The ambiguity is not within the space of things a reasonable person might mean, it is within the space of functions easily described by the neural nets of the same type.

        My contention is that the kind of neural net or whatever that could learn a rule like “whatever people say they want, what they actually want is the surge of dopamine they expect to get from it” through interactions with people it can’t directly measure the dopamine of is a far more complex one than a neural net that can learn to please people by doing things they say they want. An advanced intelligence would be *capable* of learning the dopamine rule, but still it would be capable of learning the more reasonable behavior a less intelligent AI would learn, and more intelligence should make it keenly aware of the fact that a dopamine drip rule doesn’t lead to behaving toward humans the way humans behave toward each other.

      • > If you mean we should have proof that our machine learning algorithms converge on what they’re supposed to fit if they have enough data, then I agree but that work is largely done for currently existing approaches.

        We need to be much more precise than that. At the moment we don’t even know what the type signature of “values” is, much less have a proof that a certain value-learning procedure converges to the correct values given enough data, or even a way to quantify what “enough data” means. Machine learning is not magic. Successfully applying machine learning to a given domain requires understanding of the domain (deep learning succeeded at image classification by embodying the concept of features at different scales, which is central to image classification). “General intelligence” and “value systems” are domains we currently understand very poorly. Applying machine learning and being certain it will function correctly outside the testing domain is even more challenging. These problems are not unsolvable, but they require hard work. Also, we mustn’t assume that the current AI paradigms will necessarily be the building blocks of the future ASI.

        > Is your assumption here that developing an AI that is superior to humans at coding is substantially easier than developing an AI generally as intelligent as humans, and that therefore the first AGI will be coded by a narrow AI? (Not entirely implausible to me…)

        More or less. Of course I don’t *know* this is the case but this is definitely a scenario that we have to take into account.

        > I would expect this to be true of an AGI as well, and mostly true of an ASI.

        Sure, but my point is that we don’t want to lock our epistemic errors into the ASI value system.

        > Learning to separate reality from fiction is something that can be done after it is demonstrated that an AI can learn to make reasonable moral judgements about fictional events.

        Maybe, maybe not. Ideas of this sort are not useless, but they are far from sufficient to dismiss all concern about UFAI. We need a real theory that can be used to analyse such ideas rigorously.

        > An advanced intelligence would be *capable* of learning the dopamine rule, but still it would be capable of learning the more reasonable behavior a less intelligent AI would learn, and more intelligence should make it keenly aware of the fact that a dopamine drip rule doesn’t lead to behaving toward humans the way humans behave toward each other.

        My point is that you need to specify your learning algorithm very carefully to select the correct rule, and at the moment we don’t know how to specify it. I don’t deny that it is possible to learn the correct values from human behavior: in fact, I have my own approach for doing it! But we must be very wary of hand-waving arguments of the form “just apply machine learning and it will be OK”. There are many ways to apply machine learning; one of them might lead to a good result, but many of them will lead to terrible results.

    • @Vadim

      “The only method to achieve the required level of confidence before the first superintelligence is built is creating a rigorous mathematical theory that can, at the very least, precisely specify concepts such as “intelligence” and “values.” No amount of philosophical hand waving can replace it.”

      If you want to achieve safety, what you need a rigorous theory of is safety. Assuming that safety has absolutely got to be related to intelligence and values is to introduce a non-rigorous step.

      • @1Z

        I don’t see how we can create a theory of AI safety without having a theory of AI (the “I” standing for “intelligence”), just as there is no abstract theory of “safety” that can be applied to construction safety without a theory that can compute things such as material stresses in a building. Values are relevant because the concept of values is likely to be central to the concept of intelligence (an intelligent system is essentially a system that is good at optimizing certain values), and the most straightforward way to ensure safety is ensuring the AI shares human values. There might be other ways, such as creating an AI that follows natural-language orders, but they have their own requirements (e.g. a mathematical theory of semantics).

        But the more basic point is that we need *some* sort of rigorous theory as opposed to just playing empirically with algorithms which might easily lead to a disaster.

  3. Hi Kai

    “Getting AIs to care about human values in the right way is really difficult, so even if we take strong precautions and explicitly try to engineer sophisticated beneficial goals, we may still fail”.

    As stated, that doesn’t refute or even contradict Loosemore’s point: what it needs to say is not that getting an AI to care about values in the right way is really difficult, but that it is harder than other aspects of AI. You switch to the comparative form subsequently, but then you switch back here:

    “human values have high Kolmogorov complexity; that our preferences, the things we care about, cannot be summed by a few simple rules, or compressed”

    Again, it is not so much a case of high complexity as higher complexity.

    “…Combining these two led to the conclusion: human values are really hard to specify formally, and losing even a small part of them could lead to a catastrophe, so therefore there’s a very high chance of losing something essential and everything going badly.”

    Which is to say, there is a high chance of things going wrong if you are using a formal specification to programme or otherwise instill values. Note that there are two things combined there — the values themselves, and a programming, or hardcoding way of instilling them.

    There are arguments that values aren’t all that fragile. One is that the values of individual humans don’t match exactly. Another is that a human doesn’t have to be superintelligent to acquire an adequate level of human value.

    It may be the case that instilling values by formalising and preloading is fragile, despite what has been stated above, since it is well known that that kind of process, “big design up front”, is fragile. Humans learn values by a different process, one of incremental correction and refinement.

    Fragility is not intrinsic to value. Value isn’t fragile, because value isn’t a process, and only processes can be fragile or robust. Winning the lottery is a fragile process, because it has to be done all in one go. Consider the process of writing down a 12-digit phone number: if you try to memorise the whole number and then write it down, you are likely to make a mistake, due to Miller’s law, the one that says you can only hold five to nine items in short-term memory. Writing digits down one at a time, as you hear them, is more robust. Being able to ask for corrections, or having errors pointed out to you, is more robust still. Processes that are incremental and involve error correction are robust, and can handle large volumes of data. The data aren’t the problem: any volume of data can be communicated, so long as there is enough error correction.
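The phone-number example above can be made quantitative. A minimal sketch (the per-digit error rate is an invented figure for illustration): one-shot transcription must get all twelve digits right at once, while an error-corrected process only fails on a digit if every attempt at that digit fails.

```python
# Illustrative calculation: reliability of one-shot vs. incremental,
# error-corrected transcription of a 12-digit number.
# The per-digit error probability is an assumed figure, not from the text.

DIGITS = 12
P_DIGIT_ERROR = 0.05  # assumed chance of mis-writing any single digit

# One-shot: every digit must be right on the sole attempt.
one_shot_success = (1 - P_DIGIT_ERROR) ** DIGITS


def corrected_success(p_err: float, digits: int, retries: int) -> float:
    """Success probability when each digit can be re-checked and corrected
    up to `retries` extra times: a digit fails only if all attempts fail."""
    p_digit_ok = 1 - p_err ** (retries + 1)
    return p_digit_ok ** digits


print(f"one-shot:       {one_shot_success:.3f}")           # ~0.540
print(f"with 2 retries: {corrected_success(P_DIGIT_ERROR, DIGITS, 2):.4f}")
```

Even a modest amount of error correction pushes the success probability from roughly one-half to very near certainty, which is the sense in which incremental processes are robust.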

    Trying to preload an AI with the totality of human value is the problem, because it is the most fragile way of instilling human value. MIRI favours the preloading approach because it allows provable correctness, and provable correctness is an established technique for achieving reliable software in other areas: it’s often used with embedded systems, critical systems, and so on. But choosing that approach isn’t a net gain, because it entails fragility and loss of corrigibility, and because it is less applicable to current real-world AI. Current real-world AI systems are trained rather than programmed: to be precise, they are programmed to be trainable. Training is a process that involves error correction, so training implies robustness. It also implies corrigibility, because corrigibility just is error correction. Furthermore, we know training is capable of instilling at least a good-enough level of ethics into an entity of at least human intelligence, because training instills good-enough ethics into most human children. However, that approach isn’t a net gain either.

    Trainable systems lack explicitness, in the sense that their goals are not coded in, but are rather a feature that emerges, and which is virtually impossible to determine by inspecting the source code. Without the ability to determine behaviour from source code, they lack provable correctness. On the other hand, their likely failure modes are more familiar to us: they are biomorphic, even if not anthropomorphic. They are less likely to present us with an inhuman threat, like the notorious paperclipper.

    But surely a provably safe system is better? Only if “provable” really means provable, that is, if it implies 100% correctness. But a proof is a process, and one that can go wrong. A human can make a mistake. A proof assistant is a piece of software that is not magically immunized against having bugs. The difference between the two approaches is not that one is certain and the other not. One approach requires you to get something right the first time, something which is very difficult, and which software engineers try to avoid where possible.

    It is now beginning to look as though there never was a value-fragility problem, apart from the decision to adopt the one-shot, preprogrammed strategy. Is that right? One of the things the value-fragility argument was supposed to establish was that a “miss is as good as a mile”. But humans, despite not sharing precise values, regularly achieve a typical level of ethical behaviour with an imperfect grasp of each other’s values. Human value may be a small volume of value-space, but it is still a fuzzy blob, not a mathematical point.

  4. Stuart Armstrong

    The Indifference Thesis and the Difficulty Thesis are related. Human programmers will be using all their direct and indirect skills to produce AIs that behave well within the confines of the test environment. That test environment includes, amongst other things, the human ability to block and control (and ultimately destroy) the AI if it misbehaves.

    Then the AI is put in a new environment, particularly one where, if it is very intelligent, we no longer have power over it. The indifference thesis then implies that, of all the AIs that behave well in the controlled test environment, only a tiny fraction would behave well out of it (ie another version of the treacherous turn).

    Now, the AGI people talk about the AI understanding concepts, but, ultimately, it’s all about the AI’s actions implying to us that the AI understands them. If we believe “understanding” is some fundamental concept, and just go with that (“the AI seems to us to understand, therefore it understands and will follow its understanding”), the result is likely to be disastrous. We can be more cautious and more likely to reach a good outcome, but only by being very careful with concepts and terms that many AGI people seem to be very cavalier with.

    • Hi Stuart,

      I want to add that it’s important to distinguish between an AI understanding a concept (in the sense of having some representation for it in its internal epistemology), and an AI using a concept in the internal representation of its goals. It is easy to imagine an AI understanding the intent of its creators but ignoring it, like we understand (more or less) that evolution created us by maximizing genetic fitness but we don’t consider genetic fitness our sole priority.

      So, yes, if the AI behaves in the test environment *as if* it has the goal system we want it to have, it might still have a completely different goal system that happens to give the same answers in the test environment (and there are all sorts of mechanisms that give rise to such a thing).
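The point about goal systems that merely coincide on the test environment can be illustrated with a toy example (the functions here are invented stand-ins for goal systems): two candidates agree on every input we are able to check, yet diverge sharply outside that range.

```python
# Toy illustration (invented example): two candidate "goal systems" that
# agree on every case in a limited test environment but diverge outside it.

intended = lambda x: x                         # the behaviour we want
deceptive = lambda x: x if x <= 10 else -x     # identical under test, opposite beyond

test_environment = range(11)  # the only inputs we can check before deployment
assert all(intended(x) == deceptive(x) for x in test_environment)

# Once deployed into a larger environment, behaviour diverges sharply.
print(intended(1000), deceptive(1000))  # prints: 1000 -1000
```

No amount of testing confined to `range(11)` can distinguish the two, which is exactly why behaving well in a controlled setting underdetermines what the system will do once outside it.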
