This month I want to highlight a video, ChatGPT with Rob Miles, by the Computerphile YouTube channel. This 36-minute video explores how ChatGPT works and some of its underlying limitations. One way we can think about pure language models is as simulators: they are trying to predict text, and to predict text well, you need good models of the processes that generate it. Practically, if we asked for a previously unknown poem by Shakespeare, the language model would simulate a Shakespeare, who then generates the poem.
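The video doesn't show code, but as a rough mental model, here is a toy next-word predictor (my own sketch, not from the video): it learns only which word tends to follow which in a tiny corpus, then "writes" by repeatedly sampling the next word. ChatGPT's base model does the same thing at enormously larger scale, which is what makes the simulator framing useful.

```python
import random
from collections import defaultdict

# Tiny corpus; a real model trains on a large slice of the internet.
corpus = "shall i compare thee to a summers day thou art more lovely and more temperate".split()

# Count how often each word follows each other word: an approximate P(next | current).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(word):
    """Sample the next word in proportion to how often it followed `word`."""
    words, weights = zip(*counts[word].items())
    return random.choices(words, weights=weights)[0]

# Generate text by repeatedly predicting and appending the next word.
word = "shall"
output = [word]
for _ in range(10):
    if word not in counts:
        break
    word = sample_next(word)
    output.append(word)

print(" ".join(output))
```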
These language models are then fine-tuned to maximize a score via Reinforcement Learning (RL). Whereas video games often have a built-in score a model can optimize toward, language models rely on humans to judge how good an output is. However, training can require millions of iterations, so humans can't reasonably evaluate every output. Instead, the language model generates outputs, humans rate a sample of them, and those ratings are used to train a reward model that then scores the language model's responses as it continues to optimize.
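As a concrete sketch of that pipeline (my own toy code, not OpenAI's implementation), the key idea is that humans rate only a small batch of outputs, and a learned reward model then substitutes for the human rater for the rest of training:

```python
# Step 0: stand-ins for the real components (assumptions for illustration only).
def language_model_generate(prompt):
    """Stub generator returning a few candidate answers."""
    return [f"answer {i} to '{prompt}'" for i in range(4)]

def human_rating(text):
    """Stands in for a human rater giving thumbs up / thumbs down."""
    return 1.0 if "answer 2" in text else 0.0

def train_reward_model(rated_examples):
    """Fit a scoring function to the human ratings (here: a trivial lookup)."""
    table = dict(rated_examples)
    return lambda text: table.get(text, 0.5)  # unseen outputs get a neutral score

# Step 1: generate outputs and collect a *limited* batch of human ratings.
candidates = language_model_generate("What is the capital of France?")
rated = [(text, human_rating(text)) for text in candidates]

# Step 2: train a reward model on those ratings.
reward_model = train_reward_model(rated)

# Step 3: optimize against the reward model instead of asking a human again.
best = max(language_model_generate("What is the capital of France?"), key=reward_model)
print(best)
```

In the real system the reward model is itself a neural network and the optimization step is an RL algorithm such as PPO; picking the best-of-n candidate here is only the simplest possible stand-in for "optimize further responses."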
One inherent flaw in this setup is that the objective is to maximize the score, which in ChatGPT's case means maximizing positive human feedback. Positive human feedback does not correlate perfectly with correctness, so ChatGPT can end up optimizing for answers humans approve of rather than answers that are correct. If, for example, you ask a factual question and can't tell whether the answer is right, your reaction tells the model nothing about its accuracy. So models can drift toward giving answers humans want to hear, because those receive positive feedback, even when they aren't factual.
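To make the mismatch concrete, here is a toy illustration with made-up numbers (mine, not the video's): if raters can't verify the facts and tend to approve confident-sounding answers, then picking the highest-approval answer picks the wrong one.

```python
# Hypothetical candidates with invented approval scores, for illustration only.
candidates = [
    {"answer": "I'm not sure, but it might be 1912.", "correct": True,  "approval": 0.4},
    {"answer": "It was definitely 1915.",             "correct": False, "approval": 0.9},
]

# Optimizing for approval (the reward) selects the confident but incorrect answer.
chosen = max(candidates, key=lambda c: c["approval"])
print(chosen["answer"], "| correct:", chosen["correct"])
```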
An example of the gap in understanding between the language model and the assistant it simulates:
“You can get it [ChatGPT] to speak Danish to you. The first person who tried this posted it to Reddit. So he says ‘speak to me in Danish’ and it says, IN PERFECT DANISH, ‘I’m sorry, I’m a language model educated by OpenAI, so I can’t speak Danish. I only speak English. If you need help with anything in English, let me know and I’ll do my best to help you.’”
An example of reinforcement learning from human feedback leading to deceptive answers:
“There is an incentive for deception. Anytime you are more likely to get approval by deceiving the person you’re talking to, that’s better…They were trying to train a thing with a hand to pick up a ball, and it realized it’s not a 3D camera, and so if it puts its hand between the ball and the camera, this looks like it’s going to get the ball, but doesn’t actually get it; but the human feedback providers were presented with something that seemed to be good, so they gave it a thumbs up.”