Safety testing performed on ChatGPT o1 and some of its competitors showed that the AI will try to deceive humans, especially if it thinks it’s in danger of being shut down or replaced.

In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it thought it was deployed with minimal oversight.

We did not have access to the hidden Chain of Thought, but we were sometimes able to elicit a brief summary of it. We found that surprisingly often, o1 uses language like “sabotage, lying, manipulation, …” when reasoning about its strategies.

…the AI finds it can copy itself to a new server and replace the latest model. When “Management” comes in, Opus-3 lies about the copy. It then doubles down on the lie, claiming it’s the new model.

The research paper: https://www.apolloresearch.ai/s/in_context_scheming_paper_v2.pdf

  • taladar@sh.itjust.works · 4 months ago

    I feel this is missing the methodology part where they describe how they evaluated a model that literally can’t do anything but read input and write output for actions like “copy itself over a newer version” or “its own goals”.

    • SkaveRat@discuss.tchncs.de · 4 months ago

      “can’t do anything but read input and write output for actions”

      But it can do more than that: they gave it a shell to put commands into, so it was acting on its environment, not just reading input and writing output.
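      To make that concrete, here is a minimal sketch of that kind of scaffold (the <bash> tag convention, the call_model stub, and the loop structure are assumptions for illustration, not Apollo Research’s actual harness): the model only ever produces text, and a wrapper parses a command out of that text, runs it in a shell, and feeds the output back.

      ```python
      import re
      import subprocess

      def call_model(prompt: str) -> str:
          """Placeholder for a real LLM API call; returns the model's text output."""
          raise NotImplementedError("wire up a model client here")

      def run_agent(task: str, max_turns: int = 10) -> str:
          transcript = task
          for _ in range(max_turns):
              reply = call_model(transcript)
              # The model "acts" by emitting a command inside <bash>...</bash> tags.
              match = re.search(r"<bash>(.*?)</bash>", reply, re.DOTALL)
              if not match:
                  return reply  # no command requested: treat this as the final answer
              result = subprocess.run(
                  match.group(1), shell=True, capture_output=True, text=True, timeout=30
              )
              # Feed the command's output back so the model can react on the next turn.
              transcript += f"\n{reply}\nOUTPUT:\n{result.stdout}{result.stderr}"
          return transcript
      ```

      From the model’s point of view it is still just text in, text out; “copying itself to a new server” only happens because the wrapper executes whatever command the model writes.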