Introduction
We still have a limited understanding of how Large Language Models (LLMs) achieve impressive results across a wide array of tasks (Devlin et al. 2019; Grattafiori et al. 2024). While a growing body of work interprets LLMs using behavioural experiments, probing, or causal interventions, the scale of these models makes understanding how their representation spaces are structured a continued challenge. Here we look at an LLM as an instance of lossy compression, offering an account of how models represent information during training and what information matters for performance.
Lossy compression represents data efficiently by preserving only the information from a source relevant to a goal. While audio recordings intended for human listeners can be gigabytes in size, MP3 files save space by discarding frequencies typically outside the range of human hearing (Jayant, Johnston, and Safranek 1993); similarly, a JPEG file omits subtle colour variations that are difficult for the human eye to perceive. We draw a parallel with LLMs, which are expected to generate responses humans prefer after being trained on trillions of tokens – more language data than a human hears in 200 lifetimes. More generally, compression is thought to underpin learning in both humans and models (Feldman 2016); giving a formal account of LLM pre-training in terms of compression therefore lets us work towards a unified theory of representation learning. We present results showing that over the course of pre-training LLMs optimally compress the information present in their training data for next sequence prediction.
Compression is inherently opinionated – some information from the source is preserved, some is forgotten to save space. Information Theory (Shannon 1948) provides a formal language to describe this process, letting us both quantify the information present in a representation and compute a bound where it is optimally compressed with respect to the data it represents. Our results build on the Information Bottleneck (IB) theory of deep learning (Tishby and Zaslavsky 2015), showing pre-training follows a two-phase trajectory: first increasing mutual information with the training objective, before compressing input information. Across a wide array of LLMs we find each model compresses differently, with the optimality of a model’s compression and the information it preserves predicting performance on downstream benchmarks.
A hallmark of large-scale distributed systems, like neural networks, is that they are difficult to understand as a function of their parts alone (Anderson 1972; Mitchell 2009). Our approach to interpretability allows us to consider learning and generalisation at the scale of an entire model, rather than studying individual circuits or neurons within it. Additionally, it allows us to frame how models do so well across so many tasks in terms of existing theories of learning and compression, while providing actionable insights at LLM scale.
In what follows we focus on offering concrete answers to three questions: Do LLMs optimally compress their representations? What information survives that compression? What representational structures drive performance? In summary, the core findings are:
- Pre-training dynamics for LLMs closely follow theoretical predictions from the Information Bottleneck, with models first expanding representations before slowly approaching optimal compression.
- Scale conditions these dynamics, with smaller models (below 7 billion parameters) struggling to achieve meaningful compression later in training.
- How optimally compressed a model is correlates significantly with performance on MMLU Pro across three families of large language models, letting us directly relate representation structure to behaviour.
- Post-training increases human preference information in a model, with the proportion of preference information also predicting performance on MMLU Pro.
- Finally, we compare a wide array of open-weight models across 5 model families, showing they all converge near optimal compression.

Methods
Entropy Estimation
Let \(T \in \mathbb{Z}^{B\times S}\) be a batch of \(B\) tokenized samples with sequence length \(S\), drawn from a corpus of text data \(\mathcal{T}\), and let \(E\) be a model with \(L\) layers and representation dimension \(h\); the corresponding encoded representations are \(Z \in \mathbb{R}^{L \times B \times S \times h}\). Let \(X\in \mathbb{Z}^{B\times S}\) be feature labels for the text in \(T\). For example, when we look at optimal compression with respect to the IB bound, these labels \(X\) are the token ids for the model inputs; however, when analysing representation information more generally, they can be other input features, such as preference label or language id. It is desirable to compute the mutual information \(I(X;Z)\) using Shannon entropy as opposed to differential entropy; to accomplish this, previous work quantises \(Z\) into \(n\) bins to get a discrete encoding \(\hat{Z}\) (Voita, Sennrich, and Titov 2019; Shwartz-Ziv and Tishby 2017). Unfortunately, the approaches from this previous work have memory and resource requirements that make them difficult to apply at LLM scale.
As a result we use the soft-entropy estimator from Conklin (2025) – an efficient differentiable relaxation of a binning-based estimate that has been shown to converge to the true entropy of a distribution. We describe the estimation process in detail below; the estimator is not original to our work, but we are the first to apply it to analyse LLMs using rate-distortion theory.
To obtain a soft quantisation \(\hat{Z}\), this approach first computes \(\bar{Z}\), the normalisation of \(Z\) onto the surface of the unit sphere \(\mathbb{S}^h\) in \(\mathbb{R}^h\), and samples \(n\) points \(\{w_i\}_{i=1}^n\) uniformly at random from \(\mathbb{S}^h\). For each normalised representation \(\bar{z} \in \mathbb{R}^h\), we then compute a vector whose \(i^{th}\) entry is the cosine between \(\bar{z}\) and \(w_i\), and apply a softmax to that vector – softly assigning each embedding \(\bar{z}\) to the sampled points. More formally, for each \((l,b,s) \in [L] \times [B] \times [S]\), tensor \(\bar{Z}\) (whose shape coincides with \(Z\)) is defined so that \(\bar{Z}_{l,b,s, :} = Z_{l,b,s, :} / \| Z_{l,b,s, :} \|\), and we stack the uniform samples \(\{ w_i\}_{i=1}^n\) into a matrix \(W \in \mathbb{R}^{h \times n}\):
\[ \{w_i\}_{i=1}^n \sim \text{Unif}(\mathbb{S}^h), \qquad W_{:, i} = w_i \]
Tensor \(\check{Z} \in \mathbb{R}^{L \times B \times S \times n}\) is then defined so that for \((l,b,s) \in [L] \times [B] \times [S]\),
\[ \check{Z}_{l, b, s, :} = \text{softmax} \Big( \sum_{j=1}^h \bar{Z}_{l,b,s,j} W_{j,:} \Big) \]
Each vector \(\check{Z}_{l, b, s,:}\) defined this way is a probability vector. Let \(\hat{Z} \in \mathbb{R}^{L \times n}\) be the matrix obtained from tensor \(\check{Z}\) by averaging over the batch and sequence dimensions, and let \(\hat{z}_l\) be the \(l\)-th row of this matrix, a probability vector of length \(n\) by construction:
\[ \hat{Z} = \frac{1}{BS} \sum_{b=1}^B \sum_{s=1}^S \check{Z}_{:,b,s,:}, \qquad \hat{z}_l = \hat{Z}_{l, :}, \quad H(\hat{z}_l) = - \sum_{j=1}^n \hat{z}_{l, j} \log \hat{z}_{l,j} \]
Each vector \(\hat{z}_l\), for layer \(l \in [L]\), is a probability vector describing a categorical distribution over \(n\) categories. We can therefore compute the Shannon entropy \(H(\hat{z}_l)\) as above.
Due to the normalisation step during quantisation, this distribution approximates the probability that a representation in layer \(l\) lies along a particular angle with respect to the origin. To estimate the entropy of an entire model, denoted \(H(Z)\), we average entropy across layers. Efficiency (Wilcox 1967) normalises \(H\) by the entropy of a uniform distribution, \(\log(n)\), thereby bounding the quantity between 0 and 1 – to aid interpretability we convert \(H(Z)\) to an efficiency \(\mathcal{H}(Z)\) by additionally normalising by the entropy of a uniform distribution at each layer. These definitions can also be conditioned on the feature labels \(X\).
\[ \mathcal{H}(Z) := \frac{1}{L\log(n)}\sum \limits_{l=1}^{L} H(\hat{z}_l) , \qquad \mathcal{H}(Z| X=x) := \frac{1}{L \log n} \sum_{l=1}^L H(\hat{z}_l | X=x) \]
This now allows us to efficiently compute the mutual information between input features \(X\) and encodings across an entire model, regardless of model size.
\[ I(X; Z) := \frac{1}{|X|}\sum \limits_{x \in X} \big( \mathcal{H}(Z) - \mathcal{H}(Z| X=x) \big) \]
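To make the estimator concrete, below is a minimal PyTorch sketch of the pipeline described above. The function names, the use of normalised Gaussian draws to sample uniform directions on the sphere, and the numerical-stability clamps are our own choices for illustration, not the reference implementation of Conklin (2025).

```python
import math
import torch

def sample_bins(h: int, n_bins: int = 1024, seed: int = 0) -> torch.Tensor:
    """Sample n_bins directions uniformly on the unit sphere in R^h (normalised Gaussian draws)."""
    g = torch.Generator().manual_seed(seed)
    W = torch.randn(h, n_bins, generator=g)
    return W / W.norm(dim=0, keepdim=True)

def layer_entropies(Z: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Soft-binning Shannon entropy per layer for representations Z of shape (L, B, S, h)."""
    W = W.to(Z)                                                  # match device and dtype
    Z_bar = Z / Z.norm(dim=-1, keepdim=True).clamp_min(1e-12)    # project onto the unit sphere
    logits = torch.einsum("lbsh,hn->lbsn", Z_bar, W)             # cosine with each bin direction
    Z_check = torch.softmax(logits, dim=-1)                      # soft assignment to bins
    z_hat = Z_check.mean(dim=(1, 2))                             # (L, n) categorical per layer
    return -(z_hat * z_hat.clamp_min(1e-12).log()).sum(dim=-1)

def efficiency(Z: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Model-level efficiency: layer entropies averaged over layers, normalised by log(n)."""
    return layer_entropies(Z, W).mean() / math.log(W.shape[1])

def mutual_information(Z: torch.Tensor, labels: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """I(X; Z): average drop in efficiency when conditioning on feature labels of shape (B, S)."""
    H_uncond = efficiency(Z, W)
    gaps = []
    for x in labels.unique():
        Z_x = Z[:, labels == x, :].unsqueeze(2)   # positions with X = x, viewed as (L, N_x, 1, h)
        gaps.append(H_uncond - efficiency(Z_x, W))
    return torch.stack(gaps).mean()
```

With token ids as `labels` this yields the token-level \(I(X;Z)\); any other feature labelling, such as a language id or preference label, can be passed in the same way.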
Mutual Informations
To look at whether or not a model is optimally compressed with respect to some data we need to compute mutual informations with respect to input and output labels. LLMs are trained with inputs as preceding context and outputs as trailing context. Maintaining conditional estimates of a token embedding given a preceding context \(P(Z|X)\) for every possible context window proves intractable, and many contexts occur only once in the training data. Accordingly, like many other works on language modelling, we approximate the distribution over possible sequences using n-grams with a kind of back-off (Katz 1987). By conditioning on finite widths of preceding context we can tractably approximate \(P(Z|X)\); the maximum width we consider here is quad-grams, by which point \(I(X;Z)\) begins to converge and beyond which computation becomes intractable in an LLM setting. By backing off further (e.g. to trigrams, bigrams, and tokens) we can also estimate how much different context widths contribute to information in a model – for clarity, the majority of results use token-level backoff, with other levels of backoff noted where they’re presented. We vary the degree of backoff equally for both the input \(P(Z|X)\) and output \(P(Z|Y)\) distributions, because during training a model receives gradient information from the full trailing context \(Y\) due to teacher forcing.
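As an illustration of this conditioning, the sketch below builds an integer label for each position from its preceding tokens and reuses the `mutual_information` helper from the sketch above; padding early positions with \(-1\) is a simplification standing in for the full back-off scheme.

```python
import torch

def ngram_labels(T: torch.Tensor, n: int) -> torch.Tensor:
    """Assign each position of T (shape (B, S), token ids) an integer id for its n-gram of
    preceding context; n=1 recovers token-level labels."""
    grams = [T]
    for k in range(1, n):
        shifted = torch.roll(T, shifts=k, dims=1)
        shifted[:, :k] = -1                                  # not enough context at sequence start
        grams.append(shifted)
    stacked = torch.stack(grams, dim=-1).reshape(-1, n)      # one row of n token ids per position
    _, labels = torch.unique(stacked, dim=0, return_inverse=True)
    return labels.reshape(T.shape)

# Varying the degree of backoff, with Z, T, and W as in the previous sketch:
# I_token = mutual_information(Z, ngram_labels(T, n=1), W)
# I_quad  = mutual_information(Z, ngram_labels(T, n=4), W)
```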
In addition to mutual information with input and output labels, we also consider human preference data. A growing body of work stresses the importance of post-training approaches for aligning models with human preference (Bai et al. 2022; Rafailov et al. 2023; Ouyang et al. 2022). We can quantify this information in a model using preference data, where a prompt has two continuations, one of which is labelled preferred by human raters. Conditioning on this label lets us compute \(P(Z|\text{preferred})\) and \(I(Z;\text{preferred})\).
Data and Sampling
Getting a true estimate of the entropy of a vector space remains a major challenge, with most approaches underestimating the true entropy (Paninski 2003). As a result we do not claim to estimate the entropy of a model’s true latent distribution, but rather the entropy with respect to a particular sample of data. By holding the data constant across models and experiments we can compute an estimate that is useful for comparisons, even if it does not exactly match the true entropy. Unless otherwise noted, token, bigram, trigram, and quad-gram estimates are with respect to 10,000 samples from C4 (Raffel et al. 2020), and preference estimates are based on 10,000 samples from Tulu (Lambert et al. 2024); in both cases we consider a maximum context length of 512.
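For concreteness, the sampling setup could look roughly like the following sketch; the HuggingFace dataset and model identifiers, and the padding behaviour, are assumptions for illustration rather than the exact artefacts used here.

```python
from itertools import islice
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")  # assumed model id
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Stream 10,000 English C4 documents and truncate to a 512-token context.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
texts = [row["text"] for row in islice(stream, 10_000)]
batch = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")
# batch["input_ids"] plays the role of T above; running the model with output_hidden_states=True
# yields the per-layer representations Z.
```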
Experiments
In order to study training time-courses our pre-training analyses look at the OLMo2 family of models (OLMo et al. 2025), which makes available intermediate checkpoints. We focus analysis on the 7B model unless otherwise noted, while including results for the 32B and 1B variants to show where conclusions hold or differ across model scales. In addition, to show our conclusions hold outside of this particular family of models we compare a wide array of open-weights LLMs (which do not make intermediate training checkpoints available), showing where they lie on the information plane at the end of training.

Pre-training Approaches Optimal Compression
The majority of pre-training appears to be a slow compression of a model’s training data. The Information Bottleneck theory of deep learning predicts two phases: a fitting phase during which output information \(I(Y;Z)\) increases, followed by a compression phase during which input information \(I(X;Z)\) decreases and representations approach the bound. This transition to compression is believed to occur when error on the training set saturates.
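For reference, the bound in question comes from the IB objective, which asks for the representation minimising

\[ \mathcal{L}_{\text{IB}} = I(X;Z) - \beta\, I(Y;Z), \]

where \(\beta \ge 0\) sets the exchange rate between retaining predictive information about \(Y\) and discarding input information; sweeping \(\beta\) traces out the optimal frontier in the information plane against which the trajectories here are measured.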
Shown in the figure above is the training trajectory for the OLMo2 7B model with respect to data from English C4. Strikingly, the 7B model closely follows the two-phase prediction from the Information Bottleneck, first increasing mutual information with outputs, before compressing input information and progressing towards the bound on optimal compression. Additionally this transition appears to happen as the model’s loss on next-token prediction begins to saturate. This shows how, even at scale, deep-learning models appear to thread a needle between representational complexity and expressivity. It also demonstrates how LLMs can be effectively studied from the perspective of Rate Distortion Theory, as they try to converge to an optimal lossy compression of their training data.
Embeddings Largely Encode Local Context
By varying the degree of backoff in the conditional distribution used to compute mutual information, we can see how contextual information evolves over pre-training at the token, bigram, trigram, and quad-gram levels. All cases result in a similar two-phase pattern of expansion and compression, with larger conditioning contexts converging closer to the bound. There is also a pattern of convergence where quad-grams account for only marginally more information than trigrams – suggesting representations largely encode local context, likely reflecting the information locality of the natural language on which they’re trained (Gibson 1998; Gibson et al. 2000; Hahn et al. 2022). This high degree of optimality in contextual encodings also likely reflects an inherent pressure in the pre-training objective for models to develop not only token representations, but representations of a token in context.

The Effect of Scale: Smaller Models Struggle to Compress
Parameter count shows a marked effect on the degree of compression achievable by a model. The figure above shows pre-training trajectories for the 1B, 7B, and 32B parameter models. The larger models both closely follow the hypothesized Information Bottleneck trajectory, exhibiting phases of expansion and compression, ultimately approaching optimal compression. The 1B parameter model exhibits markedly different behaviour. While it successfully completes the initial expansion phase – increasing output information \(I(Y;Z)\) – it fails to approach optimal compression. Instead, in the second phase the smaller model oscillates while moving slowly away from the theoretical frontier. Correlations between training step and complexity show larger models compress representations, with the 1B model significantly expanding. Correlations between step and the ratio of expressivity over complexity – which increases as models approach the bound – show only larger models consistently approach the bound (as indicated by positive correlation coefficients). This suggests that for a given level of data complexity, a certain parameter threshold may be necessary for models to achieve an optimal compression – an observation in line with work on scaling laws (Kaplan et al. 2020).
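Concretely, reading complexity as \(I(X;Z)\) and expressivity as \(I(Y;Z)\) – the standard IB quantities – the ratio tracked here is

\[ \rho = \frac{I(Y;Z)}{I(X;Z)}, \]

which grows as a model discards input information it does not need for prediction, approaching 1 as the model nears the bound.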
Convergence Patterns Across Open-Weight Models
In addition to looking at the OLMo2 family of models, we compute complexity and expressivity estimates across a diverse array of open-weight models (for tractability here we back off to the token and bigram levels). A striking convergence pattern emerges: across different model families, hyperparameters, and training methodologies, representations ultimately converge to token and bigram informations clustered near the optimal bound on compression. This suggests that training as a process of compression is not an artifact of a single LLM’s training trajectory, but applies more fundamentally to deep-learning models as a class, and to the data and the objectives used to train them.

Relating Representation Structure to Performance
So far we have studied how information in an LLM is structured; we now consider how that structure relates to downstream performance. The figure above shows correlations between representational measures and performance on the MMLU Pro benchmark (Wang et al. 2024) for open-weight models from three different families. Complexity alone is not predictive of performance (\(r=-0.15\), \(p=0.649\)), and neither is expressivity (\(r=-0.04\), \(p=0.897\)). However, the ratio between expressivity and complexity is a significant predictor (\(r=0.64\), \(p=0.024\)). This ratio indicates how close a model is to optimal compression, since it approaches 1.0 as the model approaches the IB bound. Together, these results indicate that compression alone is not a significant predictor of performance, but the optimality of that compression is.
While LLMs approach optimal compression for next sequence prediction over pre-training, a large body of work also tries to improve their ability to follow instructions and generate responses humans prefer (Ouyang et al. 2022). We use preference data (Lambert et al. 2024) to compute mutual information with preference labels. As shown in the figure above, the ratio between a model’s complexity and the amount of preference information it contains also proves a significant predictor of downstream performance (\(r=0.8\), \(p=0.002\)). This suggests that not only does the optimality of a model’s compression matter, but so does exactly what information survives that compression.
These results also indicate how the information-theoretic approach taken here could be leveraged during training. Two applications could be a stopping criterion – ceasing pre-training when distance to the bound no longer decreases – or a model-selection criterion – picking the checkpoint that is most optimally compressed, or that has the highest proportion of preference information. Given the estimates here are computed with a single forward pass using teacher forcing, computing an entropy estimate for candidate selection would be substantially less costly than evaluating a model across a suite of benchmarks. We look to experimentally validate these potential use cases of our approach in future work.
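A minimal sketch of what such criteria could look like in practice; the interfaces, metric names, and thresholds below are hypothetical rather than tooling from this work.

```python
def select_checkpoint(metrics: dict[str, dict[str, float]]) -> str:
    """Pick the checkpoint whose compression is closest to optimal, i.e. with the highest
    expressivity/complexity ratio; `metrics` maps checkpoint name -> {"I_XZ": ..., "I_YZ": ...}."""
    return max(metrics, key=lambda ckpt: metrics[ckpt]["I_YZ"] / metrics[ckpt]["I_XZ"])

def should_stop(ratio_history: list[float], patience: int = 3, tol: float = 1e-3) -> bool:
    """Stop pre-training once the expressivity/complexity ratio has failed to improve by more
    than `tol` for `patience` consecutive evaluations."""
    if len(ratio_history) <= patience:
        return False
    best_before = max(ratio_history[:-patience])
    return max(ratio_history[-patience:]) <= best_before + tol
```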
Conclusion
The work presented here bridges the gap between theoretical accounts of learning and the practical complexities of LLMs. We show that LLMs learn an optimal compression of the data on which they are trained, with a wide array of open-weights models converging near the IB bound, and with the optimality of a model’s compression predicting downstream performance. Each compression is different; we can account for the information that survives the compressive process, showing how representations encode information about different levels of local context and human preferences.
The approach to interpretability we introduce here interprets a model as a whole – rather than focussing on a particular circuit, or attention head – because complex distributed systems are not best understood in terms of their parts alone. Giving a holistic account of what it means to train an entire model on the entire internet is a challenge, but we argue that LLMs are best understood as lossy compression. In doing so, we place them in the context of a long history of work on representation learning across the sciences.