$\renewcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\x}{\mathbf{x}} \newcommand{\Px}{P_\x} \newcommand{\I}{\mathbb{I}} \newcommand{\Pn}{\mathcal{P}_n}$

This article summarizes the method of types, as presented in Chapter 11 of the Cover and Thomas (C&T) text Elements of Information Theory.

## Types (Based on C&T Section 11.1)

Consider a sequence $X_1, \ldots, X_n$ of i.i.d. discrete random variables with alphabet $A = \{a_1, \ldots, a_m\}$. The distribution of each $X_i$ comprises the probabilities $Q = (q_1, \ldots, q_m)$ corresponding to the alphabet. By definition, the probabilities satisfy $q_i \geq 0$ and $\sum q_i = 1$, so $Q$ is confined to a bounded $(m-1)$-dimensional region within $\R^m$, known as the probability simplex in $\R^m$.

This distribution is often estimated by the empirical distribution $\Px$ of the observed sequence $\x = (x_1, \ldots, x_n)$. Component $i$ of $\Px$ is defined to be the proportion of times $a_i$ occurs in $\x$. Notice that this depends only on the number of occurrences of each symbol, not on their order: any permutation of a given sequence maps to the same empirical distribution. This observation allows us to split the $m^n$ possible sequences into equivalence classes, known as “types.” The type class corresponding to a distribution $P$ is denoted $T(P)$.
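As a small illustration (my own sketch, not from C&T), here is a hypothetical helper `empirical` that computes $\Px$ for a sequence over a three-letter alphabet, confirming that a permuted sequence lands in the same type class:

```r
# Sketch: empirical distribution of a sequence x over an alphabet A.
empirical <- function(x, A) {
  # Proportion of times each symbol of A occurs in x
  sapply(A, function(a) mean(x == a))
}

A <- c("a", "b", "c")
x <- c("a", "c", "a", "b")
empirical(x, A)  # a: 0.50, b: 0.25, c: 0.25
# Any permutation of x yields the same empirical distribution (same type class)
identical(empirical(x, A), empirical(rev(x), A))  # TRUE
```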

Notice that $\Px$ has the same constraints as $Q$, so it must lie within the simplex as well. Furthermore, for any given $n$, there are only a finite number of possible values for $\Px$. Each component of $\Px$ must be representable by a fraction in $[0,1]$ with denominator $n$. Therefore, the possible values of $\Px$ form a grid within the simplex. For example, the possible empirical distributions in the $m=3$ case are displayed below. (I used Picasion to merge the pictures into a single animation.)

```r
library(grid)

drawSimplex <- function() {
  # Draws the probability simplex for the alphabet of size 3
  grid.polygon(x = c(0, 1, 0.5), y = c(0, 0, 1))
  grid.text("p1=1", 0.5, 1.05)
  grid.text("p2=1", -0.03, -0.05)
  grid.text("p3=1", 1.03, -0.05)
}

for (n in 1:10) {
  png(paste(n, ".png", sep = ""), width = 400, height = 400)
  grid.newpage()
  pushViewport(viewport(x = 0.15, y = 0.15, w = unit(10, "cm"),
                        h = unit(10 * sqrt(3)/2, "cm"),
                        just = c("left", "bottom"),
                        xscale = c(0, n), yscale = c(0, n)))
  drawSimplex()
  grid.text("Possible Empirical Distributions", 0.5, 1.2,
            gp = gpar(fontsize = 20))
  grid.text(paste("n =", n), 0.04, 0.7, just = "left",
            gp = gpar(fontsize = 25, col = 2))
  for (i in 0:n) {
    x <- 0:(n - i) + i/2
    grid.circle(x, i, unit(0.015, "npc"), default.units = "native",
                gp = gpar(col = 2, fill = 2))
  }
  popViewport(1)
  dev.off()
}
```

The number of possible empirical distributions can be computed exactly, but the following rough upper bound proves to be more convenient.

### Theorem 1

The number of possible empirical distributions is bounded by $(n+1)^m$.

### Proof

There are only $n+1$ fractions in $[0,1]$ with denominator $n$, and each of the $m$ components must take one of these values, giving at most $(n+1)^m$ possibilities. The subset of these whose components sum to 1 consists of the possible empirical distributions in the simplex. $\square$
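The exact count mentioned above follows from a stars-and-bars argument: the number of nonnegative integer solutions to $k_1 + \cdots + k_m = n$ is $\binom{n+m-1}{m-1}$. A quick numerical check (my own, not from C&T) that the theorem's bound dominates the exact count:

```r
# Check of Theorem 1: the exact number of types with denominator n on an
# alphabet of size m is choose(n + m - 1, m - 1) by stars and bars, and the
# bound (n + 1)^m should always be at least this large.
for (m in 2:5) {
  for (n in 1:20) {
    exact <- choose(n + m - 1, m - 1)
    stopifnot(exact <= (n + 1)^m)
  }
}
```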

The remaining theorems relate to entropy and relative entropies.

### Theorem 2

The probability of any particular outcome $\x = (x_1, \ldots, x_n)$, drawn i.i.d. according to $Q$, is $2^{-n (H(\Px) + D(\Px \Vert Q))}$.

### Proof

Let $N(a \vert \x)$ denote the number of occurrences of $a$ in $\x$. Then (from C&T page 350)

\begin{align*}
\P \{X = \x\} &= \prod_{i=1}^n Q(x_i)\\
&= \prod_{a \in A} Q(a)^{N(a \vert \x)}\\
&= \prod_{a \in A} Q(a)^{n \Px(a)}\\
&= \prod_{a \in A} 2^{n \Px(a) \log Q(a)}\\
&= 2^{n \sum_{a \in A} \Px(a) \log Q(a)}\\
&= 2^{-n \sum_{a \in A} \left(\Px(a) \log \frac{\Px(a)}{Q(a)} - \Px(a) \log \Px(a)\right)}\\
&= 2^{-n (H(\Px) + D(\Px \Vert Q))}
\end{align*}

$\square$

Note that all outcomes in the same type class have the same probability. To illustrate entropy and relative entropies over the simplex, I wrote code to generate level curves for the $m=3$ case.

```r
H <- function(p1, p2, p3 = 1 - p1 - p2, ...) {
  # Entropy of a discrete distribution (0^0 = 1 handles zero probabilities)
  p <- c(p1, p2, p3)
  return(-sum(log2(p^p)))
}

D <- function(p1, p2, p3 = 1 - p1 - p2, q1, q2, q3 = 1 - q1 - q2) {
  # Relative entropy between two discrete distributions
  p <- c(p1, p2, p3)
  q <- c(q1, q2, q3)
  return(sum(log2(p^p) - log2(q^p)))
}

HplusD <- function(...) {
  # Simplifies to -sum(p * log2(q))
  return(H(...) + D(...))
}
```
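The comment in `HplusD` can be verified by expanding the definitions: $H(P) + D(P \Vert Q) = -\sum_a P(a) \log_2 Q(a)$. A quick standalone numerical check, with distributions of my own choosing:

```r
# Check the identity H(P) + D(P || Q) = -sum(P * log2(Q)) numerically.
p <- c(0.5, 0.25, 0.25)
q <- c(0.2, 0.3, 0.5)
lhs <- -sum(log2(p^p)) + sum(log2(p^p) - log2(q^p))  # H(p) + D(p || q)
rhs <- -sum(p * log2(q))
all.equal(lhs, rhs)  # TRUE
```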

roots <- function(f, ..., a = 0, b = 1, r = (b - a)/100) {
  # Finds the roots of f on the interior of (a, b), located via sign changes
  x <- seq(a + r, b - r, by = r)
  y <- sapply(x, f, ...)
  roots <- c()
  # Each time the sign changes, run uniroot on the interval
  for (i in 1:(length(x) - 1)) {
    if (sign(y[i + 1]) != sign(y[i])) {