There are no “emergent abilities” in LLMs

Dan Carter
Published in Better Programming · 7 min read · May 13, 2023


Image generated by Midjourney

The 2022 paper titled “Emergent Abilities of Large Language Models” [1] has been the source of much speculation — and much hysteria — about the future of AI [2], [3]. The paper’s central claim is that large language models (LLMs) display “emergent abilities” as they scale to larger and larger parameter counts (hundreds of billions of parameters are common model sizes now).

In the paper, the authors define emergence as follows:

“We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.”

In this article, I’m going to demonstrate why these claims are flawed, and by the end, I’m confident you’ll agree that unexpected abilities do not suddenly “emerge” in large language models. The code to reproduce my plots can be found in this Colab notebook.

The Source of the Claim

In the paper, the authors base their claims of “unpredictable” and “emergent” abilities on a set of plots, with Figure 2 serving as the primary evidence.

In this article, I’ll scrutinize Figure 1 (shown below) from the paper, which is analogous, and virtually identical, to Figure 2.

Fig. 1 (Source: https://arxiv.org/abs/2206.07682)

In these figures, the authors compared results from different model types and sizes on BIG-bench [8] benchmark tasks like arithmetic, word unscrambling, and question answering.

As you can see from the plots, there is a sudden and unexpected jump in performance on most tasks as the models scale in the number of parameters. The authors claim that each of these plots shows emergence, and visually, that seems to fit their definition.

From here on, I will focus on GPT-3 (purple) in Figure 1A, since it shows the most sudden and steepest increase in arithmetic ability and so should be a good poster child for the other tasks where emergence is claimed to occur. This arithmetic task (Fig. 1A) is a particularly extreme-looking example: GPT-3 appears to go suddenly from no arithmetic ability to having some ability at around the 10B parameter mark, and this ability appears to keep growing rapidly as the parameter count scales up.

The Problem With Log Scales

Now, on to the core issue with the paper: the use of graphs with a semi-log scale, and the reliance on the qualitative visual properties of these graphs to claim “emergence.” The problem stems from the fact that the x-axis is log-scaled.

For those unfamiliar, a log scale graph uses the log function on one or more axes to scale that axis to better fit the data visually. This is usually done when the data points along one or more axes differ by orders of magnitude, as is the case with these language models: some model data points have one hundred million parameters, and others have over one hundred billion parameters.

The result of such scaling is that the data points are stretched horizontally to have more even spacing between them, when in reality (on a linear scale) the spacing between them would be much, much larger. Visually, this scaling creates an illusion where a straight diagonal line becomes a curve. Here’s a toy example: the same data points, generated at each power of 10 (from 10¹ to 10¹⁴), shown side by side on different scales:

Fig. 2 (Source: Author)

Now, the point becomes clear. The visualization on the left shows uneven spacing between the data points (the gap from 10¹³ to 10¹⁴ dwarfs every gap before it) but reveals a nice linear relationship between the input variable x and the response variable y. On the right-hand side, using a log scale on the x-axis, we instead have nice, even spacing between the data points, but the straight line we observed earlier now appears to be a very dramatic curve! You can picture this as pulling the leftmost data points rightward until they are all evenly spaced.
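If you want to reproduce this effect yourself, here’s a minimal matplotlib sketch. It’s a toy illustration of the same idea, not necessarily the exact code behind Fig. 2:

```python
import numpy as np
import matplotlib.pyplot as plt

# One data point at each power of 10, with y growing linearly in x
x = 10.0 ** np.arange(1, 15)  # 10^1 .. 10^14
y = x                         # a perfectly linear relationship

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.plot(x, y, "o-")
ax_lin.set_title("Linear x-axis: a straight line")

ax_log.plot(x, y, "o-")
ax_log.set_xscale("log")      # same data, only the x-axis is log-scaled
ax_log.set_title("Log x-axis: a dramatic-looking curve")

for ax in (ax_lin, ax_log):
    ax.set_xlabel("x")
    ax.set_ylabel("y")

plt.tight_layout()
plt.show()
```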

This same problem applies to the plots in the “emergent abilities” paper. When viewed with a log-scaled x-axis, the data appears to show that the performance of LLMs on arithmetic tasks increases suddenly and drastically. However, this is not the case. Here is the data from the paper plotted both ways: on a linear scale on the left and a log scale on the right:

Fig. 3 (Source: Author)

When viewed on a linear scale (left, above), the relationship between the number of parameters and the performance of GPT-3 on the arithmetic tests looks fairly linear. Indeed, the performance increases are not as sudden and unpredictable as the paper claims.

It should be clear now that the claimed ‘emergence’ is a visual artefact of these log plots and not representative of any phenomenon in the underlying data. And given that the plot in Figure 1A is the example with the most extreme and ‘sudden’ visual increase in performance, performing the same analysis on the other plots would likewise show that performance increases gradually, not suddenly.

Predicting the Unpredictable

There’s one remaining issue I want to address: the claim of unpredictability. When pressed on their use of log scales, the authors claimed that for this particular test, emergence happens in GPT-3 after 7 billion parameters [5]. The claim is that the performance of the models before this point is totally random and cannot predict the performance of subsequent model sizes. The authors have also recently followed up with a blog post [6], using this linear plot to argue that the performance between 7B and 13B parameters is unpredictable, and thus demonstrates emergence.

Let’s test these claims. Using only the first six data points — the GPT-3 model sizes up to 6.7B parameters — we fit a curve and use it to predict the performance of the next-largest model: GPT-3 with 13B parameters. Fitting a simple exponential curve to this data and extrapolating yields the following plot:

Fig. 4 (Source: Author)

The fitted curve is shown as a solid black line. The predicted performance of GPT-3 with 13B parameters is shown using the dotted lines, and its actual performance is shown by the orange dot.

This orange data point wasn’t used to fit the curve; only the first six blue data points were. This simple model predicts that GPT-3 with 13B parameters would score 5.43% accuracy on the arithmetic task, which is not terribly far off the real accuracy of 7.1%.
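Here’s a minimal sketch of this kind of fit using SciPy. The accuracy values below are illustrative placeholders rather than the exact numbers from the BIG-bench results [4], so the printed prediction won’t match the 5.43% figure above, but the procedure really is this simple:

```python
import numpy as np
from scipy.optimize import curve_fit

# GPT-3 model sizes in billions of parameters (the six smallest)
sizes_b = np.array([0.125, 0.35, 0.76, 1.3, 2.7, 6.7])

# Illustrative placeholder accuracies (%); the real values live in the
# BIG-bench modified_arithmetic results [4]
accs = np.array([0.0, 0.1, 0.2, 0.4, 0.9, 2.3])

def exp_curve(n, a, b):
    """A simple exponential in the parameter count (in billions)."""
    return a * np.exp(b * n)

# Fit using only the six smallest models; the 13B model is held out
params, _ = curve_fit(exp_curve, sizes_b, accs, p0=(0.1, 0.5), maxfev=10000)

# Extrapolate the fitted curve to the held-out 13B model
print(f"Predicted accuracy at 13B params: {exp_curve(13.0, *params):.2f}%")
```

An exponential is just one reasonable choice here; a power law or logistic curve would serve equally well. The point stands regardless: a few minutes of curve fitting produces a usable forecast.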

What I’m trying to communicate here is that performance clearly doesn’t increase suddenly and unpredictably (i.e., “emerge”); it is in fact quite predictable using simple methods that take only a few minutes to apply. So we have shown that the authors’ claims are wrong.

Linear Relationship, After All?

Another possibility is that the relationship between model size and performance is actually linear. The variance in performance between model sizes might be due to the stochastic nature of training deep learning models and the large variety of tasks within language training sets. With only one data point per model size, we can’t say for sure, but separate models trained with the same number of parameters would likely have some difference in performance on each task.

Despite this, when we fit a degree-1 polynomial (a straight line) to all the available data, a linear relationship certainly looks like a plausible fit, as shown below. We could be more confident of this relationship with more intermediate data points.

Fig. 5 (Source: Author)
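As a sketch of such a fit (again using placeholder accuracies for the smaller models, plus the real 7.1% reading for the 13B model), NumPy’s polyfit does the job in one line:

```python
import numpy as np

# Placeholder accuracies (%) for the smaller models, plus the real 7.1%
# reading for GPT-3 13B; extend with the larger model sizes as needed
sizes_b = np.array([0.125, 0.35, 0.76, 1.3, 2.7, 6.7, 13.0])
accs = np.array([0.0, 0.1, 0.2, 0.4, 0.9, 2.3, 7.1])

# Least-squares straight line (degree-1 polynomial) through all points
slope, intercept = np.polyfit(sizes_b, accs, deg=1)
print(f"accuracy ~= {slope:.3f} * billions_of_params {intercept:+.3f}")
```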

However, what we certainly cannot do is use the lack of these fine-grained data points at each model size to claim it is impossible to interpolate and accurately extrapolate the performance of LLMs as they scale. In this article, we have shown it is possible and trivial to do so. This remark on the lack of intermediate data points is closely related to the points raised in the recent paper “Are Emergent Abilities of Large Language Models a Mirage?” [7], which I highly recommend for further reading.

References

[1] J. Wei et al., ‘Emergent Abilities of Large Language Models’. arXiv, Oct. 26, 2022. doi: 10.48550/arXiv.2206.07682.

[2] J. Cox, ‘AI anxiety: The workers who fear losing their jobs to artificial intelligence.’ https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs (accessed May 08, 2023).

[3] ‘Pause Giant AI Experiments: An Open Letter,’ Future of Life Institute. https://futureoflife.org/open-letter/pause-giant-ai-experiments/ (accessed May 13, 2023).

[4] ‘BIG-bench modified_arithmetic benchmark task results,’ GitHub. https://github.com/google/BIG-bench (accessed May 09, 2023).

[5] J. Wei et al., ‘Emergent Abilities of Large Language Models’, Transactions on Machine Learning Research, Aug. 2022, Accessed: May 09, 2023. [Online]. Available: https://openreview.net/forum?id=yzkSU5zdwD

[6] ‘Common arguments regarding emergent abilities’, Jason Wei. https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities (accessed May 04, 2023).

[7] R. Schaeffer, B. Miranda, and S. Koyejo, ‘Are Emergent Abilities of Large Language Models a Mirage?’ arXiv, Apr. 28, 2023. doi: 10.48550/arXiv.2304.15004.

[8] ‘google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.’ https://github.com/google/BIG-bench (accessed May 09, 2023).

