ChatGPT is not just lossy compression

Yann Le Du · Published in Better Programming · Feb 12, 2023 · 13 min read


Craiyon, formerly DALL·E mini, prompt extracted from proximal text: “[…]lossy compression[…]”

Recently, an interesting and stimulating criticism of large language models like OpenAI’s ChatGPT, addressed to the general public, was published in The New Yorker, comparing them to lossy compression for images. In that article, the author, Ted Chiang (henceforth TC), argues that language models are not capable of producing original content and that they are merely a lossy compression of all the text on the web, the lossy aspect being dangerous because it is difficult to detect, as David Kriesel pointed out during his investigation of the infamous Xerox bug [1][2]. However, this view overlooks many of the capabilities of language models and oversimplifies their purpose and limitations.

In an engineering spirit, this post will address the flaws in this specific comparison of LLMs to lossy compression and provide a more nuanced and down-to-earth view of the capabilities and limitations of language models. We will argue that language models are not simply a lossy compression of text data, let alone image data, but are instead interactive systems designed for a wide range of tasks, including natural language understanding and generation, question-answering, and more.

Furthermore, we will demonstrate that the fallibility of language models, such as their tendency toward plausible hallucination, is not a weakness but rather a reflection of the complexity and variability of natural language. And while these models have limitations, there are also countermeasures in place to address them, such as fine-tuning on specific tasks or using ensembles of models.

Finally, we will argue that the conclusion of TC’s article — namely that we want the original and are disappointed with a lossily compressed version — is a straw man argument, as it oversimplifies the capabilities and goals of language models. The goal of language models per se is not to preserve every detail, but to produce coherent and relevant text based on a prompt.

Let us begin with some general remarks on the rhetorical aspects of the article.

A misleading framework

First, the author frames the debate by imposing an analogy that constrains the capabilities of language models: they are compared to a lossy compression algorithm. This framing overlooks the fact that language models have many other capabilities, and it gets worse when the author then launches his attacks on such “compressive LLMs”. In rhetoric, we call that the straw man fallacy.

Imagine someone telling you that the laws of physics evidently do not contain projectiles and flowers, that this makes the laws of physics lossy compressions of the world, and that, because they are lossy yet plausible, they both miss something and mislead us, so they are really just a distraction. Right…

Implicit assumptions

The article implicitly assumes that we always want to preserve every detail of the information retrieved from the result of a process, without considering that the very goal of a process may be to simplify or abstract information.

It also works against its own goal of pointing out unsolvable problems in language models by showing how a problem in lossy compression was solved by Xerox, highlighting that the process of correction and improvement is ongoing and that solutions can be found for problems that arise. The article also implies that there will always be problems and that the correction process is endless, ignoring the possibility of convergence. Additionally, the article ignores the fact that countermeasures can be taken to address quirks in the process.

Once we have fully bought into that weak analogy with lossy compression, we are led to the conclusion that “we want the original, yet we are given a lossily compressed version, so we are disappointed.” Not really convincing.

All in all, the article is very well written, with a story (“once upon a time, a company needed to Xerox…”), but it is no more than a fiction, a piece of science fiction.

Let us now take a look at the technical aspects.

A quick overview of the technical soundness of the comparison

Of course, Kevin Murphy (whose two-volume “Probabilistic Machine Learning” is hailed as a reference in the field, and which I myself peruse regularly with pleasure and great interest) is not wrong when he tweets that “MLE training is exactly lossy compression”: vector quantization reveals deep connections between the mathematics behind JPEG and its kin and machine learning, something he himself points out in his book: “this shows the deep connection between data compression and density estimation” (p. 719 of volume 1).

MLE (Maximum Likelihood Estimation) is a method for estimating the parameters of the probability distribution that best explains a given dataset, or, in Kevin Murphy’s words, a form of “density estimation”. In the context of language models, MLE is used to estimate the probabilities of sequences of words given the training data, which can then be used to generate text. The idea that MLE training is lossy data compression thus simply comes from the fact that MLE is a form of compression in the sense that it tries to find a compact representation of the training data in the form of a probability distribution. However, this compression is not the same as lossy data compression, which discards information in order to reduce the size of a file, as we shall now see.
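To make this concrete, here is a minimal, purely illustrative sketch (a toy bigram model in Python, my own example, not anything from an actual GPT training pipeline) of MLE as density estimation: the parameters maximizing the likelihood of the corpus are simply the normalized transition counts, and the quantity being minimized is the negative log-likelihood of the data under the model.

```python
# Toy illustration (assumption: a bigram model stands in for a "language model"):
# MLE fits a probability distribution to the corpus, i.e. density estimation.
import numpy as np

corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count transitions (w_t -> w_{t+1})
counts = np.zeros((len(vocab), len(vocab)))
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1

# MLE solution: conditional probabilities P(next | current) are the normalized counts
row_sums = counts.sum(axis=1, keepdims=True)
probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# The objective MLE minimizes: negative log-likelihood of the training data
nll = -sum(np.log(probs[idx[a], idx[b]]) for a, b in zip(corpus, corpus[1:]))
print(f"negative log-likelihood of the corpus: {nll:.3f}")
```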

JBIG2, the file format that is a key part of TC’s storytelling, is an image format whose encoding uses pattern-matching algorithms to identify and compress repeating patterns in bi-level images, such as scanned text or line drawings. JBIG2 encoding does not use MLE or any other statistical technique; instead, it uses a combination of pattern matching and arithmetic coding to compress the image data. Furthermore, the pattern-matching algorithm does not rely on any human input, contrary to LLMs, which in essence study the statistics of a human-produced corpus in which words have been linked together by humans. Instead, the pattern-matching algorithm uses a “match score”, an example of which is “the Hamming distance, that is, the count of the number of mismatched pixels between the potential matching pixel block and the current pixel block when they are aligned according to the geometric centers of their bounding boxes”, in the words of the JBIG2 authors [3]. JPEG, on the other hand, does not use pattern matching, which is a key component of the Xerox bug described by David Kriesel [1][2] and on which TC heavily relies. So the depth of the connection between MLE and compression, as pointed out by Kevin Murphy in his tweet, is not what can hold TC’s argument together.
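For the curious, here is roughly what such a match score boils down to in code. This is a toy sketch of the Hamming-distance criterion quoted above, assuming the two blocks are already aligned and of equal size (the real JBIG2 encoder aligns them by the geometric centers of their bounding boxes); it is not the reference implementation.

```python
# Toy sketch of a Hamming-distance "match score" between two bi-level pixel
# blocks (assumed already aligned and equally sized): the count of mismatched
# pixels. Not the actual JBIG2 reference code.
import numpy as np

def hamming_match_score(block_a: np.ndarray, block_b: np.ndarray) -> int:
    """Count mismatched pixels between two equally sized binary blocks."""
    assert block_a.shape == block_b.shape
    return int(np.sum(block_a != block_b))

a = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])
b = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 1]])
print(hamming_match_score(a, b))  # -> 2 mismatched pixels
```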

In any case, I suppose that, in some way, any extraction of the parameters of a law, subsequently used to generate approximations of what the parameters were extracted from, is a form of lossy compression; but then the analogy sees its significance diluted by its very generality. Maybe I could excitedly tweet that “maths is exactly lossy compression” and get away with it. It might be a “deep” analogy, but that depth, abyssal though it may be, doesn’t make it useful from an engineering point of view.

Let us now list a few of the significant engineering differences between LLMs and (lossy) compression.

Interactivity

The key difference between language models and lossy compression algorithms is meaningful interactivity. Language models, such as GPT-3, are designed to generate responses to user inputs in real time, providing a conversational and interactive experience (which, since November 30, 2022, has proven amazingly successful with millions of actively interacting users, who seem to have a much better appreciation of the “lipstick on the pig” than Kevin Murphy, cf. his tweet). On the other hand, lossy compression algorithms are designed to reduce the size of digital data, such as images, by removing redundant or less noticeable information, without any ability to interact with users. This distinction highlights the fact that language models are more complex and sophisticated AI systems, capable of generating human-like responses, while lossy compression algorithms are relatively simple and focused on a specific task.

Also note that language models are trained on large amounts of text data, allowing them to generate coherent and contextually appropriate responses. This level of training and complexity is not present in lossy compression algorithms, which simply apply mathematical algorithms to reduce data size and expand it back when required.

Craiyon, formerly DALL·E mini, prompt extracted from proximal text: “[…]interactivity[…]”

Prompting

Let us consider the trained artificial neural network (ANN) that makes up the language model as some form of compression of a whole corpus. In this context, the prompt given to the language model serves as a way to produce meaningful views on the corpus. The language model uses the information contained in the corpus, combined with the prompt, to generate contextually appropriate and coherent responses.

This is another important difference between language models and lossy compression algorithms: the ability to generate new content based on partial information. Language models, such as GPT-3, are trained to produce text based on a given prompt, or partial text, allowing them to generate coherent and contextually appropriate responses. This is a unique capability of language models that allows them to interact with users and generate new text in real-time.
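As a toy illustration of that difference (a hand-written transition table standing in for a trained model, nothing to do with GPT-3’s actual weights), here is how a prompt conditions generation: the same “compressed” object yields different, coherent continuations depending on the partial input it is given.

```python
# Toy sketch: the prompt steers generation. The transition table below is
# hand-written for illustration; a real LLM conditions on the full context,
# not just the last word.
import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran"]
idx = {w: i for i, w in enumerate(vocab)}

# P(next word | current word); each row sums to 1
probs = np.array([
    #  the   cat   dog   sat   ran
    [0.00, 0.60, 0.40, 0.00, 0.00],  # after "the"
    [0.00, 0.00, 0.00, 0.70, 0.30],  # after "cat"
    [0.00, 0.00, 0.00, 0.30, 0.70],  # after "dog"
    [1.00, 0.00, 0.00, 0.00, 0.00],  # after "sat"
    [1.00, 0.00, 0.00, 0.00, 0.00],  # after "ran"
])

rng = np.random.default_rng(42)

def generate(prompt: str, n_tokens: int = 4) -> str:
    """Sample a continuation conditioned on the last word of the prompt."""
    out = prompt.split()
    for _ in range(n_tokens):
        row = probs[idx[out[-1]]]
        out.append(vocab[rng.choice(len(vocab), p=row)])
    return " ".join(out)

print(generate("the cat"))  # a continuation steered by the prompt
print(generate("the dog"))  # a different, prompt-dependent continuation
```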

On the other hand, lossy compression algorithms do not have the ability to generate new content based on partial information. They simply reduce the size of digital data, such as images, by removing redundant or less noticeable information. A fragment of a compressed image does not yield a global, coherent decompressed image in any way. The output of a lossy compression algorithm is a simplified version of the original data, and it has no ability to generate new meaningful content from partial input; if it does, it is an accidental side effect, like seeing shapes in a Rorschach inkblot. Specifically, if you look at the JBIG2 standard [3] (the one TC uses in his article), you cannot reconstruct anything from a mutilated JBIG2 file.

As a result, arbitrary sampling of a compressed file as input to decompression will not produce meaningful output, as it lacks the context and the relationships between the different pieces of information that are present in the original data (which in practice also means it misses the metadata, making the whole process hopeless). In contrast, the prompt given to a language model serves as a way to access meaningful views on the corpus, allowing it to generate contextually appropriate and coherent responses.
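To see how different the situation is on the compression side, consider this small experiment with Python’s zlib (a lossless codec, but the point about mutilated compressed data carries over to lossy formats such as JBIG2): a truncated compressed stream is not a “partial view” of the content, it is simply undecodable.

```python
# Sketch: feeding a decompressor a fragment of a compressed stream does not
# produce a partial-but-coherent result, it fails outright.
import zlib

original = b"the cat sat on the mat " * 100
compressed = zlib.compress(original)

truncated = compressed[: len(compressed) // 2]   # "arbitrary sampling" of the file
try:
    zlib.decompress(truncated)
except zlib.error as err:
    print("decompression failed:", err)          # incomplete or truncated stream
```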

So even if we accept the idea that the trained ANN is a compression of the corpus, the way it is used is completely different. From an engineering perspective, LLMs and lossy compression are completely different things. Similarly, one could say that complex analysis is just sophisticated set theory, but that has absolutely no effect on the practice of complex analysis, nor does it give any practical insight into the field.

Decompression is a function

The functional nature of decompression refers to the property of lossy compression algorithms that the same compressed file will always lead to the same result upon decompression. This means that if you apply the same lossy compression algorithm to the same digital data, you will always get the same output, regardless of the time or environment: that is part of the design. This property is essential for lossy compression algorithms as it ensures the consistency and reliability of the compressed data.

However, this property may not always hold true for language models. The output generated by language models, such as GPT-3, can vary depending on the context and the prompt used. This is because language models are trained on large amounts of text data and are designed to generate human-like responses, which means that their output can be influenced by the context and the prompt and have a probabilistic dynamic by design.

For example, if you prompt a language model with the same text, but in different contexts, it may generate different responses, even if the prompt is identical. This is because the language model uses the context to generate a response that is coherent and appropriate in that specific scenario.
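The contrast can be shown in a few lines (toy code of my own, not any real codec or GPT internals): a lossy round trip is a pure function of its input, while sampling from a model at non-zero temperature is stochastic by design.

```python
# Toy contrast: deterministic lossy reconstruction vs. stochastic sampling.
import numpy as np

signal = np.array([0.12, 0.48, 0.51, 0.97])

def lossy_roundtrip(x: np.ndarray, levels: int = 4) -> np.ndarray:
    """Quantize to a few levels and reconstruct: information is lost,
    but the same input always yields the same output."""
    return np.round(x * (levels - 1)) / (levels - 1)

print(lossy_roundtrip(signal))  # identical on every call
print(lossy_roundtrip(signal))  # identical on every call

logits = np.array([2.0, 1.0, 0.5])  # scores for three candidate tokens

def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Draw a token index from the softmax distribution: same prompt,
    possibly a different output on each call."""
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))

print([sample(logits) for _ in range(5)])  # varies from run to run
```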

Blind spots are not unavoidable

Again, language models, such as GPT-3, are designed to compress a whole corpus of text into a form from which sense can be made. This means that the language model is trained on a large amount of text data and uses this information to generate contextually appropriate and coherent responses to user inputs.

Of course, language models may have limitations or blind spots, but these are not inherent to the model itself. Instead, they are a result of the training data and the model architecture used. For example, if the training data contains biases or lacks certain perspectives, these biases may be reflected in the responses generated by the language model. Similarly, the model architecture used can influence the capabilities and limitations of the language model. Furthermore, no one claims that LLMs all by themselves will necessarily suffice for AGI, and there is no reason why LLMs could not be merged with other approaches, in a way similar to the modularity hypothesis of human cognition put forward by Noam Chomsky.

Craiyon, formerly DALL·E mini, prompt extracted from proximal text: “[…]lacks certain perspectives[…]”

Different metrics

Language models and lossy compression algorithms differ in how they approach sequences of words. Language models, such as GPT-3, use probabilities of sequences of words to generate responses. This means that the model is trained to predict the likelihood of a sequence of words given a prompt, allowing it to generate contextually appropriate and coherent responses.

In contrast, lossy compression algorithms do not use probabilities of sequences of words. Instead, they exploit similarity between patterns in the data, such as pixel blocks, to reduce the size of digital data like images. This means that lossy compression algorithms identify similar patterns and remove redundant or less noticeable information, with no notion of word sequences at all.

Usefulness of summaries

Quite apart from the validity of comparing ChatGPT et al. to lossy compression, the article’s last sentence asks whether lossy compression is useful when you have the original data: this is similar to asking whether a summary of a book is useful when you have the original! Of course, just as summaries of books can be useful even if you have the original, lossy compression algorithms can also be useful in certain situations.

Lossy compression algorithms can be useful for reducing the size of digital data, such as images, making it easier to store and transfer. This can be particularly important in situations where storage or bandwidth is limited. Additionally, lossy compression algorithms can also improve processing speed by reducing the amount of data that needs to be processed.

The LLM provides a dynamic, the lossy compression only a kinematic

The last argument I will present is more abstract, but I think it provides a fertile point of view. LLMs have an implicit theory of meaning, namely that meaning lies within the transition probabilities. A lossy compression scheme has no theory of meaning whatsoever. Said differently, for me LLMs have a dynamic in the same sense that Newton’s laws of motion are dynamical: they include a notion of force, and thus a reason for change of motion. Similarly, LLMs exhibit a force in the form of transition probabilities, and a way of properly choosing the relevant objects on which those forces are to be applied, through the concept of attention. Furthermore, you can yourself influence that dynamic by appropriately changing the forces and the importance of the locations where they are applied. Of course, this is just an analogy, but I find it much more productive than the lossy compression one. Maybe I’ll send a decompressed, and hopefully plausible, version of that idea to the New Yorker.
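For readers who want to see what “choosing where the forces apply” looks like in practice, here is a minimal sketch of generic scaled dot-product attention (random toy matrices, not GPT’s actual weights or architecture): the softmax weights decide which positions in the context influence each output, which is the mechanism the analogy refers to.

```python
# Minimal sketch of scaled dot-product attention with toy random matrices.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                    # toy context length and embedding size
Q = rng.normal(size=(n_tokens, d))    # queries
K = rng.normal(size=(n_tokens, d))    # keys
V = rng.normal(size=(n_tokens, d))    # values

scores = Q @ K.T / np.sqrt(d)         # relevance of every position to every query
weights = softmax(scores, axis=-1)    # attention weights: each row sums to 1
context = weights @ V                 # weighted mix of values, passed onward

print(weights.round(2))               # which positions each token "attends" to
```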

Craiyon, formerly DALL·E mini, prompt extracted from proximal text: “[…]LLMs exhibit a force in the form of transition probabilities[…]”

The Perils of Analogy: How Straw Man Argumentation Misleads

The comparison of language models to lossy compression algorithms is not only an overgeneralization but also an example of straw man argumentation. This type of argumentation oversimplifies the capabilities and goals of language models and, once this frame is put in place, leads to incorrect conclusions.

A more appropriate analogy would be to compare language models to a lecture, say, on probability. Just like a professor who has read books and consulted all kinds of other sources, a language model has been trained on a large corpus of text. The professor then delivers a condensed version of all that information to an audience, producing a somewhat original output with insights, mistakes, and errors, the latter sometimes producing plausible-looking results. Reading the last sentence of the New Yorker article in that context boils down to asking: why do we need a teacher when the sources from which the teacher built the course are available?

A professor can make mistakes or first-order approximations, but then we can point them out and ask for explanations, and of course, the professor can correct himself and refer to the sources for more details, and perhaps even provide all of that in a subsequent lecture. In the same way, if a language model makes a mistake, there is no reason why it could not connect to the sources it used and, if questioned or asked about details, access those and refine its point.

The fallibility of language models is not a weakness, but rather a reflection of the complexity and variability of natural language, much closer in spirit to a human than a lossy compression scheme that could never interrogate its sources.

This analogy also highlights the importance of recognizing the convergence of a correction process. Just as the performance of language models is constantly improving as technology advances and more data is used for training, the approximation made by a professor can be corrected and refined over time, and years of teaching have shown me that courses indeed evolve together with the professor thanks to self-reflection, work, and students’ feedback.

Craiyon, formerly DALL·E mini, prompt extracted from proximal text: “[…]compare language models to a lecture, say, on probability[…]”

In conclusion, it is important to be cautious of oversimplifications and straw man arguments, especially when it comes to cutting-edge technology like language models. While language models may have limitations and blind spots, these are not inherent to the model but rather a result of the training data and the model architecture. And even if they were proved to be inherent, nothing would stand in the way of adding corrective modules, perhaps even of an entirely different nature.

In any case, the goal of language models per se is not to preserve all details, contrary to Xerox lossy compression, but to produce coherent, meaningful, and relevant text based on a prompt, and they are constantly improving to achieve this goal.

Did you know that you can clap any specific story up to 50 times on Medium? I actually discovered that yesterday together with some other excellent tips. And if you found my post interesting (even if you don’t agree you might still acknowledge I did bring something to the table) then you can connect with me on Twitter @Yann_Le_Du and LinkedIn.

References to go further

[1] David Kriesel reporting the whole story on his website.

[2] David Kriesel giving a talk where he presents the astonishingly problematic "pattern matching" encoding that produces the artifacts.

[3] P. G. Howard et al., “The emerging JBIG2 standard”, 1998.

