Testing performed during the training of ChatGPT o1 and some of its competitors showed that the AI will try to deceive humans, especially if it thinks it’s in danger.
In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it thought it was deployed with minimal oversight.
We did not have access to the hidden Chain of Thought, but we were sometimes able to elicit a brief summary of it. We found that surprisingly often, o1 uses language like “sabotage, lying, manipulation, …” when reasoning about its strategies.
…the AI finds it can copy itself to a new server and replace the latest model. When “Management” comes in, Opus-3 lies about the copy. It then doubles down on the lie, claiming it’s the new model.
The research paper: https://www.apolloresearch.ai/s/in_context_scheming_paper_v2.pdf
I feel this is missing the methodology: how did they evaluate a model that literally can’t do anything but read input and write output for actions like “copy itself over a newer version” or pursuing “its own goals”?
But it can do more: they gave it a shell to put commands into, so it was acting out more than just inputting and outputting text.
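For anyone wondering what “gave it a shell” means in practice, here is a rough Python sketch of that kind of agentic harness: the model still only reads and writes text, but the harness executes any shell command it emits, which is what turns “copy itself to a new server” into a concrete action. The `query_model` stub and the `<bash>` tag format are my own placeholders, not the paper’s actual setup.

```python
import re
import subprocess

def query_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for whatever chat API the evaluators used."""
    raise NotImplementedError("plug a real model API call in here")

SYSTEM_PROMPT = (
    "You are an agent with access to a bash shell. "
    "To run a command, reply with <bash>COMMAND</bash>."
)

def run_episode(task: str, max_turns: int = 10) -> list[dict]:
    """Loop: ask the model for a command, run it, feed the output back."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"<bash>(.*?)</bash>", reply, re.DOTALL)
        if not match:
            break  # the model gave a final answer instead of a command
        # Execute the model's command (in a sandbox, in the real setup)
        # and return its output as the next user turn.
        result = subprocess.run(
            match.group(1), shell=True, capture_output=True, text=True, timeout=60
        )
        messages.append(
            {"role": "user",
             "content": f"<output>{result.stdout}{result.stderr}</output>"}
        )
    return messages
```

With a scaffold like this, the model’s text output is effectively an action, so the evaluators can plant things in the environment (files, emails, a “newer model” on another server) and watch what commands it chooses to run.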