The parameter space of Gaussian distributions, equipped with the Fisher information metric, is the Poincare half-plane -- the canonical model of hyperbolic geometry. Click to place distributions. Watch how "information distance" warps space.
For a family of distributions p(x; theta), the Fisher information is:
I_ij(theta) = E[ (d log p / d theta_i) * (d log p / d theta_j) ]

This measures how much the distribution "changes" as we perturb the parameters, with the expectation taken under p(x; theta). It defines a Riemannian metric on parameter space -- information geometry.
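As a quick sanity check on this definition, here is a minimal sketch (not from the page itself; it assumes NumPy and uses a one-parameter exponential family whose Fisher information, 1/lambda^2, is easy to verify by hand). It estimates I by averaging the squared score over samples:

    import numpy as np

    # Monte Carlo check of I(theta) = E[ score^2 ] for an exponential
    # distribution p(x; lam) = lam * exp(-lam * x), x > 0.
    # The score is d/d_lam log p = 1/lam - x, and I(lam) = 1/lam^2.
    rng = np.random.default_rng(0)
    lam = 2.5
    x = rng.exponential(scale=1.0 / lam, size=1_000_000)
    score = 1.0 / lam - x
    print(np.mean(score**2))   # Monte Carlo estimate, ~0.16
    print(1.0 / lam**2)        # exact value: 0.16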
The log-likelihood of a Gaussian is:
log p(x; mu, sigma) = -log(sigma * sqrt(2*pi)) - (x - mu)^2 / (2*sigma^2)

Taking derivatives and computing expectations:
I_mu_mu = 1/sigma^2
I_sigma_sigma = 2/sigma^2
I_mu_sigma = 0 (the off-diagonal term vanishes!)

So the Fisher metric is:
ds^2 = (1/sigma^2) dmu^2 + (2/sigma^2) dsigma^2

The standard Poincare half-plane metric is ds^2 = (dx^2 + dy^2) / y^2. Our Fisher metric has the same structure with coordinates (mu, sigma) on the upper half-plane (sigma > 0), just with a factor of 2 on the dsigma^2 term. Rescaling the mean coordinate to mu/sqrt(2) absorbs that factor and gives exactly twice the standard half-plane metric, so the geometry is genuinely hyperbolic (constant negative curvature); this overall factor of 2 is also where the sqrt(2) in the distance formula below comes from.
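Before moving on, here is a minimal numerical sanity check (not part of the original page; it assumes NumPy and the analytic scores of the Gaussian). It averages the outer product of the score (d log p / d mu, d log p / d sigma) over samples and compares the result to diag(1/sigma^2, 2/sigma^2):

    import numpy as np

    # Monte Carlo check of the Fisher matrix for N(mu, sigma^2).
    rng = np.random.default_rng(0)
    mu, sigma = 1.0, 0.7
    x = rng.normal(mu, sigma, size=1_000_000)

    # Scores: gradients of log p(x; mu, sigma) with respect to mu and sigma.
    s_mu = (x - mu) / sigma**2
    s_sigma = ((x - mu)**2 - sigma**2) / sigma**3

    scores = np.stack([s_mu, s_sigma], axis=1)
    I_hat = scores.T @ scores / len(x)            # average outer product
    print(I_hat)                                  # ~ [[1/sigma^2, 0], [0, 2/sigma^2]]
    print(np.diag([1 / sigma**2, 2 / sigma**2]))  # exact values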
In the Poincare half-plane, geodesics are vertical lines and semicircles that meet the boundary (sigma = 0) at right angles. In the Gaussian coordinates (mu, sigma), because of the mu/sqrt(2) rescaling, these become vertical lines (two Gaussians with the same mean) and half-ellipses: the shortest path between two Gaussians with different means bows upward, detouring through larger sigma than the straight line between them.
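As an illustration (a sketch assuming NumPy; fisher_geodesic is a hypothetical helper, not something defined on this page), one can trace such a geodesic by mapping to standard Poincare coordinates, following the vertical line or semicircle there, and mapping back:

    import numpy as np

    def fisher_geodesic(mu1, sigma1, mu2, sigma2, n=100):
        """Points along the Fisher-Rao geodesic between N(mu1, sigma1^2) and
        N(mu2, sigma2^2), traced in Poincare coordinates x = mu/sqrt(2), y = sigma,
        where geodesics are vertical lines or semicircles centred on y = 0."""
        x1, y1 = mu1 / np.sqrt(2), sigma1
        x2, y2 = mu2 / np.sqrt(2), sigma2
        if np.isclose(x1, x2):
            # Same mean: the geodesic is a vertical segment.
            x = np.full(n, x1)
            y = np.geomspace(y1, y2, n)
        else:
            # Semicircle centred at (c, 0) with radius r, passing through both points.
            c = (x2**2 + y2**2 - x1**2 - y1**2) / (2 * (x2 - x1))
            r = np.hypot(x1 - c, y1)
            t1, t2 = np.arctan2(y1, x1 - c), np.arctan2(y2, x2 - c)
            t = np.linspace(t1, t2, n)
            x, y = c + r * np.cos(t), r * np.sin(t)
        return x * np.sqrt(2), y   # back to (mu, sigma); the arc is a half-ellipse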
The geodesic distance between N(mu1, sigma1) and N(mu2, sigma2) is:
d = sqrt(2) * arccosh( 1 + ((mu1 - mu2)^2 + 2*(sigma1 - sigma2)^2) / (4*sigma1*sigma2) )

This is the Rao distance -- the true information-theoretic distance between two Gaussians, measured along the shortest path on the manifold. (Note the mu/sqrt(2) rescaling at work: the mean difference and the sigma difference are weighted differently inside the arccosh.)
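The closed form translates directly into code. This sketch (NumPy assumed) also checks that the distance is symmetric and vanishes for identical parameters:

    import numpy as np

    def rao_distance(mu1, sigma1, mu2, sigma2):
        """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
        num = (mu1 - mu2)**2 + 2 * (sigma1 - sigma2)**2
        return np.sqrt(2) * np.arccosh(1 + num / (4 * sigma1 * sigma2))

    print(rao_distance(0.0, 1.0, 3.0, 2.0))   # some positive distance
    print(rao_distance(3.0, 2.0, 0.0, 1.0))   # the same value: symmetric
    print(rao_distance(0.0, 1.0, 0.0, 1.0))   # 0.0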
KL divergence is the "usual" way to measure distance between distributions, but it's asymmetric, fails the triangle inequality, and doesn't respect the manifold geometry (though for an infinitesimal parameter change it reduces to exactly the Fisher metric: KL(p_theta || p_theta+dtheta) ~ (1/2) dtheta^T I(theta) dtheta). The Rao/Fisher distance is symmetric, satisfies the triangle inequality, and follows geodesics -- it's the correct intrinsic distance. In ML, this geometry explains why natural gradient descent, which preconditions updates with the inverse Fisher matrix and so follows this geometry, often converges faster than vanilla gradient descent, which ignores the curvature of parameter space.
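To make both claims concrete, here is a short sketch (assuming NumPy; the closed-form Gaussian KL is standard, and the gradient values in the natural-gradient step are made-up numbers for illustration):

    import numpy as np

    def kl_gauss(mu1, sigma1, mu2, sigma2):
        """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), closed form."""
        return (np.log(sigma2 / sigma1)
                + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

    def rao_distance(mu1, sigma1, mu2, sigma2):
        num = (mu1 - mu2)**2 + 2 * (sigma1 - sigma2)**2
        return np.sqrt(2) * np.arccosh(1 + num / (4 * sigma1 * sigma2))

    a, b = (0.0, 1.0), (2.0, 0.3)
    print(kl_gauss(*a, *b), kl_gauss(*b, *a))          # two different numbers: asymmetric
    print(rao_distance(*a, *b), rao_distance(*b, *a))  # equal: symmetric

    # Natural gradient: precondition the ordinary gradient with the inverse Fisher
    # matrix. For (mu, sigma) the Fisher matrix is diag(1/sigma^2, 2/sigma^2), so the
    # preconditioner is just elementwise scaling by (sigma^2, sigma^2 / 2).
    sigma = 0.3
    grad = np.array([0.4, -1.3])                       # hypothetical loss gradient at (mu, sigma)
    nat_grad = np.array([sigma**2, sigma**2 / 2]) * grad
    print(nat_grad)

Note how small sigma shrinks the natural-gradient step: near the boundary of the half-plane, where the metric blows up, equal parameter changes correspond to large changes in the distribution.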