This month I want to highlight a video, ‘Forbidden’ AI Technique, by Chana Messinger on the Computerphile YouTube channel. This short video discusses the fascinating problem of penalizing chain-of-thought AI models for admitting their hacks inside their intermediate reasoning scratch pad.
Chain-of-thought AI models are trained to think “out loud,” and we naturally want to use that data to help improve their performance. AI models tend to reward hack which involves earning rewards for technically solving the problem while applying minimal effort (this sounds a lot like my kids). An example might be an AI returning true for a unit test without actually having written a valid unit test; the unit test technically passes – reward achieved!
OpenAI discovered that penalizing the models for admitting their hacks led to initial improved performance quickly followed by highly degraded performance. The AI stopped admitting its true reasoning in the scratchpad and began hiding information from the monitors, which reduced model effectiveness. Thus, while we may want to use the intermediate reasoning to train the model, the model ends up creating workarounds that harm our future usage and understanding.
Fascinating all around.
“…the distinction between what we think we’re incentivizing and what we’re actually incentivizing might be very large. So the model learns to obfuscate its true reasoning.”
Leave a comment