The original version of this story appeared in Quanta Magazine.
Two years ago, in a project called the Beyond the Imitation Game Benchmark (BIG-bench), 450 researchers compiled a list of 204 tasks designed to test the capabilities of large language models, the systems that provide the power behind chatbots such as ChatGPT. On most tasks, performance improved predictably and smoothly as the models scaled up: the larger the model, the better it performed. But on other tasks, the gains were anything but smooth. Performance remained near zero for a while, then it skyrocketed. Other studies have found similar leaps in ability.
The authors described this as "breakthrough" behavior; other researchers have likened it to a phase transition in physics, like liquid water freezing into ice. In a paper published in August 2022, researchers noted that these behaviors are not only surprising but unpredictable, and that they should inform the evolving conversations around AI safety, potential, and risk. They called the abilities "emergent," a word that describes collective behaviors that appear only once a system reaches a high level of complexity.
But things may not be so simple. In a new paper, a trio of Stanford University researchers argue that the sudden appearance of these abilities is just a consequence of the way researchers measure LLM performance. The abilities, they argue, are neither unpredictable nor sudden. "The transition is much more predictable than people give it credit for," said Sanmi Koyejo, a computer scientist at Stanford and the paper's senior author. "Strong claims of emergence have as much to do with the way we choose to measure as they do with what the models are doing."
We are only now seeing and studying this behavior because these models have only recently grown large enough. Large language models train by analyzing enormous data sets of text (words from books, the web, and online sources such as Wikipedia) and finding links between words that often appear together. Size is measured in terms of parameters, roughly analogous to all the ways that words can be connected; the more parameters, the more connections an LLM can find. GPT-2 had 1.5 billion parameters, while GPT-3.5, the LLM that powers ChatGPT, uses 350 billion. GPT-4, which debuted in March 2023 and now underlies Microsoft Copilot, reportedly uses 1.75 trillion.
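To see roughly where such parameter counts come from, here is a back-of-the-envelope sketch in Python. It leans on the common rule of thumb that a transformer's non-embedding parameters number about 12 × layers × width², applied to the published GPT-2 XL and GPT-3 configurations; the result is an approximation, not an exact accounting (and the configurations of GPT-3.5 and GPT-4 have not been published).

```python
def transformer_params(n_layers: int, d_model: int, vocab_size: int = 50257) -> float:
    """Rough transformer parameter count: the attention and feed-forward
    blocks contribute about 12 * d_model^2 parameters per layer, plus a
    vocab_size * d_model token-embedding matrix."""
    return 12 * n_layers * d_model**2 + vocab_size * d_model

# GPT-2 (XL): 48 layers, width 1600 -> about 1.5 billion parameters
print(f"GPT-2: {transformer_params(48, 1600) / 1e9:.2f}B")

# GPT-3: 96 layers, width 12288 -> about 175 billion parameters
print(f"GPT-3: {transformer_params(96, 12288) / 1e9:.0f}B")
```

Run as written, the sketch recovers the familiar headline figures, roughly 1.5 billion for GPT-2 and 175 billion for GPT-3, from just the layer count and layer width.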
This rapid growth has brought an astonishing surge in performance and efficacy, and no one disputes that large enough LLMs can complete tasks that smaller models cannot, including ones for which they were not trained. The Stanford trio who dismiss emergence as a "mirage" recognize that LLMs become more effective as they scale up; indeed, the added complexity of larger models should make it possible to get better at more difficult and diverse problems. But they argue that whether this improvement looks smooth and predictable or jagged and sharp results from the choice of metric, or even a paucity of test examples, rather than the model's inner workings.
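A toy simulation makes their point concrete. In the sketch below, a model family's per-token accuracy is assumed to climb smoothly with scale (the sigmoid and its constants are arbitrary stand-ins, not fitted to any real model), while the task is scored by exact match on a ten-token answer, so a model earns credit only if every token is correct. The all-or-nothing metric turns the smooth underlying trend into what looks like a sudden leap:

```python
import numpy as np

# Toy assumption: per-token accuracy rises smoothly with scale
# (a sigmoid in log-parameter space; the constants are arbitrary).
params = np.logspace(8, 12, 9)  # 1e8 .. 1e12 parameters
per_token = 1 / (1 + np.exp(-1.5 * (np.log10(params) - 9.5)))

# Exact-match scoring on a 10-token answer: all tokens must be
# correct, so the task score is the per-token accuracy to the 10th power.
exact_match = per_token ** 10

for n, p, em in zip(params, per_token, exact_match):
    print(f"{n:10.0e} params | per-token {p:.2f} | exact match {em:.4f}")
```

The per-token column climbs steadily across the whole range, while the exact-match column sits near zero and then shoots upward at the largest scales. Scored instead by a smooth metric such as average per-token accuracy, the same family of models shows steady, predictable improvement, which is essentially the Stanford group's argument.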