so you want to estimate the entropy of a vector space

machine learning · information theory

Author: Henry Conklin
Published: December 9, 2025

say you want to know how much information is packed into the representations of a large language model. not what the model does — what it knows, in the information-theoretic sense. how much of the input space has it compressed into its weights?

to answer that, you need entropy estimates over a model’s latent representations. that turns out to be surprisingly hard.

the problem with doing this naively

the classic approach — from Shwartz-Ziv & Tishby (2017) — is dimension-wise discretisation. take a hidden representation with shape batch × hidden, bin each dimension independently, and treat the resulting strings of bin indices as symbols in a categorical distribution. this works fine on a 16-dimensional feedforward network trained on MNIST. at LLM scale it’s a non-starter: a single layer of OLMo2 32B has hidden dimension 5120, so with 512-token sequences the binned representation would require holding a batch × 512 × 5120 × n_bins tensor in memory. with 50 bins that’s 50× the cost of the forward pass itself, across 64 layers.
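to make the scheme concrete, here is a minimal numpy sketch of dimension-wise discretisation — my own illustrative reconstruction, not code from any of the papers mentioned (the function name and bin count are assumptions):

```python
import numpy as np

def binned_entropy(z, n_bins=50):
    """Dimension-wise discretisation: bin each hidden dimension
    independently, then treat each row's tuple of bin indices as
    one symbol of a categorical distribution."""
    # z: (batch, hidden) activations
    edges = np.linspace(z.min(), z.max(), n_bins + 1)
    idx = np.digitize(z, edges[1:-1])          # (batch, hidden) bin indices
    # count how often each distinct bin-index string occurs
    _, counts = np.unique(idx, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 16))                 # tiny: fine at this scale
h = binned_entropy(z)
```

note that the `idx` tensor already costs batch × hidden integers, and a one-hot version would cost batch × hidden × n_bins — exactly the blow-up described above once hidden is 5120.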

the other option is clustering — fit k-means to get cluster centroids, assign each embedding to its nearest cluster, treat membership counts as a distribution. this was the approach in Voita et al. (2019). it’s more memory-efficient, but it requires two passes through the data: one to fit the clusters, one to assign embeddings. you can’t use it online, and re-running it across 150 pre-training checkpoints for a 32B model is expensive.
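the two-pass structure is easiest to see in code. this is a self-contained numpy sketch in the spirit of that clustering estimator (the tiny hand-rolled k-means, function name, and defaults are all my assumptions):

```python
import numpy as np

def cluster_entropy(z, k=8, iters=10, seed=0):
    """Two-pass clustering estimator: pass 1 fits k-means centroids,
    pass 2 assigns embeddings and treats membership counts as a
    categorical distribution."""
    rng = np.random.default_rng(seed)
    # pass 1: fit centroids with a few Lloyd iterations
    c = z[rng.choice(len(z), k, replace=False)]
    for _ in range(iters):
        d = ((z[:, None, :] - c[None]) ** 2).sum(-1)   # (N, k) sq. dists
        a = d.argmin(1)
        for j in range(k):
            if (a == j).any():
                c[j] = z[a == j].mean(0)
    # pass 2: assign every embedding to its nearest centroid
    a = ((z[:, None, :] - c[None]) ** 2).sum(-1).argmin(1)
    p = np.bincount(a, minlength=k) / len(z)
    p = p[p > 0]
    return -np.sum(p * np.log(p))
```

the second pass is the problem: you cannot emit an estimate until the centroids are fixed, which is why this cannot run online over checkpoints.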

what you actually want is something that:

  • runs in a single forward pass
  • needs no separate fitting step
  • scales to any model size
  • gives you shannon entropy so the information bottleneck bound applies

soft entropy estimation

the estimator that does this is described in Conklin (2025). the core idea is to replace hard binning with a soft, differentiable version — and to do it on the unit sphere, where cosine similarity gives a natural notion of proximity.

here’s the procedure:

step 1 — normalise. project each embedding onto the unit sphere:

\[\bar{z} = \frac{z}{\|z\|}\]

this throws away norm information (which in LLMs tends to encode frequency rather than meaning), and lets everything live on a common surface regardless of hidden dimension.

step 2 — sample anchors. draw \(n\) reference points \(\{w_i\}_{i=1}^n\) uniformly at random from the sphere surface. this is equivalent to sampling from an isotropic gaussian and normalising:

\[\tilde{w}_i \sim \mathcal{N}(0, I_{d_h}), \quad w_i = \frac{\tilde{w}_i}{\|\tilde{w}_i\|}\]

these anchors act as the “bins” — but unlike hard bins they don’t require knowing the support of the distribution in advance.
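the gaussian-then-normalise trick above is two lines of numpy (a sketch; the helper name is mine):

```python
import numpy as np

def sample_anchors(n, d, seed=0):
    """Uniform points on the unit (d-1)-sphere: sample rows from an
    isotropic gaussian and normalise each one."""
    w = np.random.default_rng(seed).normal(size=(n, d))
    return w / np.linalg.norm(w, axis=1, keepdims=True)
```

uniformity follows because the isotropic gaussian is rotation-invariant, so the normalised directions have no preferred orientation.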

step 3 — soft assignment. for each normalized embedding \(\bar{z}\), compute cosine similarities to every anchor and pass through softmax with temperature \(\varepsilon\):

\[\check{Z}_{l,b,s,:} = \text{softmax}\left(\frac{\sum_j \bar{Z}_{l,b,s,j} W_{j,:}}{\varepsilon}\right)\]

this gives a probability vector over the \(n\) anchors — a soft assignment of each embedding to the reference points. lower temperature makes the assignment sharper; higher temperature spreads it out.

step 4 — aggregate and compute entropy. average the soft assignments over the batch and sequence dimensions to get a single categorical distribution per layer:

\[\hat{Z} = \frac{1}{BS} \sum_b \sum_s \check{Z}_{:,b,s,:}, \qquad \hat{z}_l = \hat{Z}_{l,:}\]

then shannon entropy is just:

\[H(\hat{z}_l) = -\sum_{j=1}^n \hat{z}_{l,j} \log \hat{z}_{l,j}\]

average across layers to get a model-level entropy estimate. this is the quantity we use to compute mutual information, and from there — once you condition on input or output labels — the full information plane.
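the four steps for a single layer can be sketched end to end in numpy — an illustrative reconstruction under my own naming, not the reference implementation:

```python
import numpy as np

def soft_entropy(z, anchors, eps):
    """Soft entropy of one layer's activations.
    z: (batch, seq, d) raw embeddings; anchors: (n, d) unit vectors;
    eps: softmax temperature."""
    # step 1: project embeddings onto the unit sphere
    zbar = z / np.linalg.norm(z, axis=-1, keepdims=True)
    # step 3: cosine similarity to every anchor, sharpened by 1/eps
    logits = zbar @ anchors.T / eps                  # (batch, seq, n)
    logits -= logits.max(-1, keepdims=True)          # numerically stable softmax
    soft = np.exp(logits)
    soft /= soft.sum(-1, keepdims=True)
    # step 4: average soft assignments over batch and sequence
    zhat = soft.reshape(-1, anchors.shape[0]).mean(0)
    return -np.sum(zhat * np.log(zhat + 1e-12))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(64, 8))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)  # step 2
z = rng.normal(size=(2, 16, 8))
h = soft_entropy(z, anchors, eps=0.25)
```

everything here is one matmul plus a softmax per layer, which is why it fits inside a single forward pass.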

temperature calibration

the one tricky part is temperature. different models have different hidden dimensions, and in high-dimensional spaces dot products concentrate — the softmax saturates and everything looks uniform. you need to calibrate \(\varepsilon\) per model so estimates are comparable across dimensionalities.

the calibration sets \(\varepsilon\) so that the maximum possible KL divergence between the soft assignment distribution and uniform exactly equals \(\log n\) (the entropy of a uniform distribution over \(n\) bins). using the von Mises–Fisher distribution on the unit sphere and Amos-type bounds on Bessel function ratios, this gives a closed-form solution to leading order in \(d\):

\[\varepsilon^\star(n, d) = \frac{1}{\sqrt{2d \log n}}\]

this has a pleasing resemblance to the \(1/\sqrt{d_k}\) scaling used in attention — both are correcting for the concentration of dot products in high dimensions.
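the closed form is a one-liner, with \(n\) the number of anchors and \(d\) the hidden dimension (the function name is mine):

```python
import numpy as np

def eps_star(n, d):
    """Calibrated softmax temperature: 1 / sqrt(2 * d * log n)."""
    return 1.0 / np.sqrt(2.0 * d * np.log(n))
```

note the temperature shrinks as \(d\) grows — sharper softmax to counteract the concentration of dot products in high dimensions, mirroring the \(1/\sqrt{d_k}\) attention scaling.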

what you can do with it

once you have efficient entropy estimates, you can compute mutual information between a model’s representations and any labelled partition of the data. condition on token identity to get \(I(X; Z)\) (complexity). condition on the following token(s) to get \(I(Y; Z)\) (expressivity). the ratio of the two — optimality — tells you how close the model is to the information bottleneck bound on optimal compression.
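one way to sketch that conditioning with the soft estimator: estimate \(H(Z)\) over everything, estimate \(H(Z \mid X = x)\) over each label's subset, and subtract. this is my own self-contained illustration of the idea (helper names and the decomposition \(I = H(Z) - \sum_x p(x)\,H(Z \mid x)\) applied to the soft distributions are assumptions, not the paper's code):

```python
import numpy as np

def soft_dist(z, anchors, eps):
    """Aggregate soft anchor assignments of embeddings (N, d) into one
    categorical distribution over the anchors."""
    zbar = z / np.linalg.norm(z, axis=-1, keepdims=True)
    logits = zbar @ anchors.T / eps
    logits -= logits.max(-1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(-1, keepdims=True)
    return p.mean(0)

def H(p):
    return -np.sum(p * np.log(p + 1e-12))

def mi(z, labels, anchors, eps):
    """I(label; Z) ~= H(Z) - sum_x p(x) * H(Z | label = x)."""
    total = H(soft_dist(z, anchors, eps))
    cond = 0.0
    for x in np.unique(labels):
        mask = labels == x
        cond += mask.mean() * H(soft_dist(z[mask], anchors, eps))
    return total - cond
```

because entropy is concave, the estimate is non-negative: the entropy of the pooled distribution is at least the weighted average of the per-label entropies.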

you can also condition on preference labels (preferred vs. rejected completions) to measure how much alignment information survives in the representations — something that turns out to predict downstream benchmark performance surprisingly well.

all of this runs in a single forward pass, at any model scale, with no caching or clustering required. that’s what makes it usable for studying pre-training dynamics across 150 checkpoints of a 32B model, or comparing 75 open-weights models in a single sweep.