The parameter space of Gaussian distributions, equipped with the Fisher information metric, is the Poincare half-plane -- the canonical model of hyperbolic geometry. Click to place distributions. Watch how "information distance" warps space.
For a family of distributions p(x; theta), the Fisher information is:
I_ij(theta) = E[ (d log p / d theta_i) * (d log p / d theta_j) ]

This measures how much the distribution "changes" as we perturb the parameters, with the expectation taken under p(x; theta). It defines a Riemannian metric on parameter space -- information geometry.
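As a quick sanity check on this definition, here is a minimal sketch (not from the page itself; it assumes NumPy and uses a one-parameter exponential family whose Fisher information, 1/lambda^2, is easy to verify by hand). It estimates I by averaging the squared score over samples:

    import numpy as np

    # Monte Carlo check of I(theta) = E[ score^2 ] for an exponential
    # distribution p(x; lam) = lam * exp(-lam * x), x > 0.
    # The score is d/d_lam log p = 1/lam - x, and I(lam) = 1/lam^2.
    rng = np.random.default_rng(0)
    lam = 2.5
    x = rng.exponential(scale=1.0 / lam, size=1_000_000)
    score = 1.0 / lam - x
    print(np.mean(score**2))   # Monte Carlo estimate, ~0.16
    print(1.0 / lam**2)        # exact value: 0.16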
The log-likelihood of a Gaussian is:
log p(x; mu, sigma) = -log(sigma * sqrt(2*pi)) - (x - mu)^2 / (2*sigma^2)

Taking derivatives and computing expectations:
I_mu_mu = 1/sigma^2
I_sigma_sigma = 2/sigma^2
I_mu_sigma = 0 (the off-diagonal term vanishes!)

So the Fisher metric is:
ds^2 = (1/sigma^2) dmu^2 + (2/sigma^2) dsigma^2

The standard Poincare half-plane metric is ds^2 = (dx^2 + dy^2) / y^2. Our Fisher metric has the same structure with coordinates (mu, sigma) on the upper half-plane (sigma > 0), just with a factor of 2 on the dsigma^2 term. Rescaling the mean coordinate to mu/sqrt(2) absorbs that factor and gives exactly twice the standard half-plane metric, so the geometry is genuinely hyperbolic (constant negative curvature); this overall factor of 2 is also where the sqrt(2) in the distance formula below comes from.
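Before moving on, here is a minimal numerical sanity check (not part of the original page; it assumes NumPy and the analytic scores of the Gaussian). It averages the outer product of the score (d log p / d mu, d log p / d sigma) over samples and compares the result to diag(1/sigma^2, 2/sigma^2):

    import numpy as np

    # Monte Carlo check of the Fisher matrix for N(mu, sigma^2).
    rng = np.random.default_rng(0)
    mu, sigma = 1.0, 0.7
    x = rng.normal(mu, sigma, size=1_000_000)

    # Scores: gradients of log p(x; mu, sigma) with respect to mu and sigma.
    s_mu = (x - mu) / sigma**2
    s_sigma = ((x - mu)**2 - sigma**2) / sigma**3

    scores = np.stack([s_mu, s_sigma], axis=1)
    I_hat = scores.T @ scores / len(x)            # average outer product
    print(I_hat)                                  # ~ [[1/sigma^2, 0], [0, 2/sigma^2]]
    print(np.diag([1 / sigma**2, 2 / sigma**2]))  # exact values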
In the Poincare half-plane, geodesics are vertical lines and semicircles that meet the boundary (sigma = 0) at right angles. In the Gaussian coordinates (mu, sigma), because of the mu/sqrt(2) rescaling, these become vertical lines (two Gaussians with the same mean) and half-ellipses: the shortest path between two Gaussians with different means bows upward, detouring through larger sigma than the straight line between them.
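As an illustration (a sketch assuming NumPy; fisher_geodesic is a hypothetical helper, not something defined on this page), one can trace such a geodesic by mapping to standard Poincare coordinates, following the vertical line or semicircle there, and mapping back:

    import numpy as np

    def fisher_geodesic(mu1, sigma1, mu2, sigma2, n=100):
        """Points along the Fisher-Rao geodesic between N(mu1, sigma1^2) and
        N(mu2, sigma2^2), traced in Poincare coordinates x = mu/sqrt(2), y = sigma,
        where geodesics are vertical lines or semicircles centred on y = 0."""
        x1, y1 = mu1 / np.sqrt(2), sigma1
        x2, y2 = mu2 / np.sqrt(2), sigma2
        if np.isclose(x1, x2):
            # Same mean: the geodesic is a vertical segment.
            x = np.full(n, x1)
            y = np.geomspace(y1, y2, n)
        else:
            # Semicircle centred at (c, 0) with radius r, passing through both points.
            c = (x2**2 + y2**2 - x1**2 - y1**2) / (2 * (x2 - x1))
            r = np.hypot(x1 - c, y1)
            t1, t2 = np.arctan2(y1, x1 - c), np.arctan2(y2, x2 - c)
            t = np.linspace(t1, t2, n)
            x, y = c + r * np.cos(t), r * np.sin(t)
        return x * np.sqrt(2), y   # back to (mu, sigma); the arc is a half-ellipse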
The geodesic distance between N(mu1, sigma1) and N(mu2, sigma2) is:
d = sqrt(2) * arccosh( 1 + ((mu1 - mu2)^2 + 2*(sigma1 - sigma2)^2) / (4*sigma1*sigma2) )

This is the Rao distance -- the true information-theoretic distance between two Gaussians, measured along the shortest path on the manifold. (Note the mu/sqrt(2) rescaling at work: the mean difference and the sigma difference are weighted differently inside the arccosh.)
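The closed form translates directly into code. This sketch (NumPy assumed) also checks that the distance is symmetric and vanishes for identical parameters:

    import numpy as np

    def rao_distance(mu1, sigma1, mu2, sigma2):
        """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
        num = (mu1 - mu2)**2 + 2 * (sigma1 - sigma2)**2
        return np.sqrt(2) * np.arccosh(1 + num / (4 * sigma1 * sigma2))

    print(rao_distance(0.0, 1.0, 3.0, 2.0))   # some positive distance
    print(rao_distance(3.0, 2.0, 0.0, 1.0))   # the same value: symmetric
    print(rao_distance(0.0, 1.0, 0.0, 1.0))   # 0.0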
KL divergence is the "usual" way to measure distance between distributions, but it's asymmetric, fails the triangle inequality, and doesn't respect the manifold geometry (though for an infinitesimal parameter change it reduces to exactly the Fisher metric: KL(p_theta || p_theta+dtheta) ~ (1/2) dtheta^T I(theta) dtheta). The Rao/Fisher distance is symmetric, satisfies the triangle inequality, and follows geodesics -- it's the correct intrinsic distance. In ML, this geometry explains why natural gradient descent, which preconditions updates with the inverse Fisher matrix and so follows this geometry, often converges faster than vanilla gradient descent, which ignores the curvature of parameter space.
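To make both claims concrete, here is a short sketch (assuming NumPy; the closed-form Gaussian KL is standard, and the gradient values in the natural-gradient step are made-up numbers for illustration):

    import numpy as np

    def kl_gauss(mu1, sigma1, mu2, sigma2):
        """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), closed form."""
        return (np.log(sigma2 / sigma1)
                + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

    def rao_distance(mu1, sigma1, mu2, sigma2):
        num = (mu1 - mu2)**2 + 2 * (sigma1 - sigma2)**2
        return np.sqrt(2) * np.arccosh(1 + num / (4 * sigma1 * sigma2))

    a, b = (0.0, 1.0), (2.0, 0.3)
    print(kl_gauss(*a, *b), kl_gauss(*b, *a))          # two different numbers: asymmetric
    print(rao_distance(*a, *b), rao_distance(*b, *a))  # equal: symmetric

    # Natural gradient: precondition the ordinary gradient with the inverse Fisher
    # matrix. For (mu, sigma) the Fisher matrix is diag(1/sigma^2, 2/sigma^2), so the
    # preconditioner is just elementwise scaling by (sigma^2, sigma^2 / 2).
    sigma = 0.3
    grad = np.array([0.4, -1.3])                       # hypothetical loss gradient at (mu, sigma)
    nat_grad = np.array([sigma**2, sigma**2 / 2]) * grad
    print(nat_grad)

Note how small sigma shrinks the natural-gradient step: near the boundary of the half-plane, where the metric blows up, equal parameter changes correspond to large changes in the distribution.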