Introduction

The Wasserstein distance and optimal transport theory were first studied by the French mathematician Gaspard Monge in the 18th century. Since then, the field has been revisited by many illustrious mathematicians, such as Leonid Kantorovich in the 20th century and, more recently, the Fields medalist Cédric Villani.

In addition to these considerable theoretical advances, the field has also benefited from important progress in numerical methods and algorithms, which has made it amenable to large-scale practical problems. Consequently, optimal transport has been applied successfully in various contexts such as regularization in statistical learning, distributionally robust optimization, image retrieval and economics.

This post focuses on the basic concepts underlying the Wasserstein distance, detailing some of its interesting properties as well as practical numerical methods. It shies away from the often math-heavy formulations and tries to focus on the important practical questions to build intuition.

Basic formalism and notation

Given an arbitrary base set $\Omega \subseteq \mathbb{R}^d$, $M(\Omega)$ denotes the set of probability measures on $\Omega$ and $m(\mu,\nu)$ represents the set of joint distributions with marginals $\mu$ and $\nu$ in $M(\Omega)$. Using some distance $D:\Omega\times\Omega\to\mathbb{R}_+$, such as the $\ell_p$ norms with $p\in\mathbb{N}$, the $p$-Wasserstein distance is then defined as the solution to the following optimization problem:

$$W_p(\mu,\nu)=\inf_{\Pi\in m(\mu,\nu)}\left(\int_{\Omega\times\Omega} D(x,y)^p \, d\Pi(x,y)\right)^{\frac{1}{p}}.$$

A particular but useful case is the situation where we consider only discrete measures. In that case, we have finite sets of points/atoms $X=\{x_1,\dots,x_n\}$ and $Y=\{y_1,\dots,y_m\}$, and the measures take the form $\mu=\sum_{i=1}^n a_i\,\delta_{x_i}$ and $\nu=\sum_{j=1}^m b_j\,\delta_{y_j}$ for $a\in\Delta_n$ and $b\in\Delta_m$, where $\Delta_l=\{u\in\mathbb{R}^l_+ : \sum_i u_i = 1\}$ is the simplex and $\delta_{x_i}$ is a Dirac measure putting all of its weight on $x_i$.

In that case, the Wasserstein distance (raised to the $p$th power) takes the much simpler form of a linear transportation problem, which we write explicitly in terms of the marginal probabilities $a\in\Delta_n$ and $b\in\Delta_m$:

$$
\begin{aligned}
W_p^p(a,b)=\min_{\pi}\;& \sum_{i=1}^n\sum_{j=1}^m \pi_{ij}\, D(x_i,y_j)^p \\
\text{s.t.}\;& \sum_{i=1}^n \pi_{ij} = b_j, \;\forall j \\
& \sum_{j=1}^m \pi_{ij} = a_i, \;\forall i \\
& \pi_{ij} \ge 0, \;\forall (i,j)
\end{aligned}
$$

The problem has a simple intuitive meaning as the minimum amount of work required to transform one distribution into the other. Indeed, when $p=1$ this is called the Earth Mover's Distance and can be thought of as moving dirt from high-valued bins of one distribution to troughs in the other. Each time, we consider the physical distance between the bins and prioritize bins that are close to one another according to the ground distance $D$.
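To make this concrete, here is a minimal sketch that solves the transportation linear program above with `scipy.optimize.linprog`; the support points, weights and the choice of $|x-y|$ as ground distance are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(x, y, a, b, p=1):
    """Solve the discrete transportation LP for W_p^p between
    mu = sum_i a_i delta_{x_i} and nu = sum_j b_j delta_{y_j}."""
    n, m = len(x), len(y)
    # Ground cost D(x_i, y_j)^p (here the absolute difference on the line).
    C = np.abs(x[:, None] - y[None, :]) ** p

    # Equality constraints: row sums equal a, column sums equal b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j pi_ij = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i pi_ij = b_j
    b_eq = np.concatenate([a, b])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun  # optimal value of W_p^p

# Two toy histograms supported on the same three points of the line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 2.0])
a = np.array([0.8, 0.1, 0.1])
b = np.array([0.1, 0.1, 0.8])
print(wasserstein_p(x, y, a, b, p=1))  # mass must travel far: ~1.4
```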

Properties and rationale

Like other distances, it is non-negative, symmetric, subadditive (it satisfies the triangle inequality) and equal to 0 only if the two measures are identical. It also offers interesting advantages over standard distances such as the Euclidean distance. For instance, we notice that $n$ and $m$ can differ, which means we can compare distributions with different numbers of bins.¹

Together with an adequate choice of ground distance $D:\Omega\times\Omega\to\mathbb{R}_+$, the Wasserstein distance can capture important local properties of the problem. Each bin of one distribution is compared to every bin of the other distribution, but closer bins are penalized less than bins that are farther apart. For images, this can help capture the local correlation in pixel intensity.² In contrast, using the raw Euclidean distance would require having two histograms with exactly the same number of bins/points/atoms and comparing values bin by bin.

As shown in the important Rubner et al. paper, these properties help the Wasserstein distance better capture perceived similarities between distributions than the standard Euclidean distance. The authors show that the Wasserstein distance remains superior even when compared to distances induced by quadratic norms $D(x,y)=\|K(x-y)\|_2$ for some symmetric positive definite matrix $K\in\mathbb{R}^{d\times d}$, which do not suffer from the bin-by-bin comparison of the standard Euclidean distance.

A small example

Consider the following vectors $x_0, x_1, x_2 \in \mathbb{R}^5$, represented in the figure below:

Three five-dimensional vectors.

Further assume that each of the 5 dimensions represents a position on a line, so that the distance between positions/indices 0 and 4 is $D(0,4)=|0-4|=4$, which is four times $D(0,1)=|1-0|=1$, the distance between positions 0 and 1.

Euclidean distance


As the corresponding distance matrix illustrates, all three vectors are equally dissimilar, with a maximal Euclidean distance of $\sqrt{2}$ since they differ by 1 at exactly two positions. Hence the normalized distance is 1 off the diagonal and 0 on it.
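For reference, here is a quick numpy check of the normalized Euclidean distances, assuming the three vectors are the one-hot vectors $e_0$, $e_1$ and $e_4$ suggested by the figure:

```python
import numpy as np

# The three vectors from the figure: unit mass at positions 0, 1 and 4.
X = np.eye(5)[[0, 1, 4]]

# Pairwise Euclidean distances: every off-diagonal entry equals sqrt(2).
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(dists / dists.max())  # normalized: 1 off the diagonal, 0 on it
```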

Wasserstein distance

The figure below illustrates that the result is different when computing the Wasserstein distance between the three discrete measures $\mu_i=\delta_{j_i}$, $i=0,\dots,2$, placing unit mass at $j_0=0$, $j_1=1$, $j_2=4$. Indeed, with that choice of ground distance, $x_0$ and $x_1$ are much closer to each other than $x_0$ and $x_2$ or $x_1$ and $x_2$.³


Computing the Wasserstein distance between three discrete measures with a single atom. Although the ground distance is symmetric across both diagonals, the Wasserstein distance reflects the intuitive fact that moving 1 unit of mass from 4 to 1 is costlier than moving it from 1 to 0. The top left and right quadrants illustrate the optimal transport plan while the central bottom square illustrates the actual normalized Wasserstein distance.

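The same numbers can be reproduced with the POT (`ot`) Python package mentioned later in this post; the snippet below is a sketch under the assumption that `ot.emd2` returns the optimal value of the transportation problem for the given histograms and cost matrix.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Positions on the line and the three single-atom histograms
# mu_0, mu_1, mu_2 putting unit mass at positions 0, 1 and 4.
positions = np.arange(5, dtype=float)
hists = [np.eye(5)[0], np.eye(5)[1], np.eye(5)[4]]

# Ground cost D(i, j) = |i - j| between positions.
M = np.abs(positions[:, None] - positions[None, :])

W = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        # ot.emd2 returns the optimal value of the transportation LP.
        W[i, j] = ot.emd2(hists[i], hists[j], M)

print(W)
# Expected (the ground distance between the atoms, see footnote 3):
# [[0. 1. 4.]
#  [1. 0. 3.]
#  [4. 3. 0.]]
```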

Wasserstein barycenters

It is also interesting to study the barycenters generated by the Wasserstein distance. Following our intuition based on the usual Euclidean norm and the fact that $\frac{1}{N}\sum_{i=1}^N x_i \in \arg\min_u \sum_{i=1}^N \frac{1}{N}\|x_i-u\|^2$ (which is one of the key results underlying the k-means algorithm, for instance), we can think of barycenters as centroids or representatives of a set of objects (measures in this case). More formally, given a set of $N$ distributions $\{\nu_i\}_{i=1}^N$, the Wasserstein barycenter is defined as the measure that minimizes the following problem:

$$\mu^\star=\arg\min_{\mu}\frac{1}{N}\sum_{i=1}^N W_p^p(\mu,\nu_i)$$

The following sections detail two important cases in which we can find such solutions either numerically or analytically.

Analytical solution for Gaussian distributions

In the specific case where $p=2$ and the $\nu_i\sim\mathcal{N}(m_i,\Sigma_i)$ are all Gaussian, Agueh and Carlier show that $\mu^\star\sim\mathcal{N}\left(\frac{1}{N}\sum_{i=1}^N m_i,\hat{\Sigma}\right)$, where $\hat{\Sigma}$ is a symmetric positive semidefinite matrix satisfying $\hat{\Sigma}=\frac{1}{N}\sum_{i=1}^N\left(\hat{\Sigma}^{1/2}\Sigma_i\hat{\Sigma}^{1/2}\right)^{1/2}$. In other words, the barycenter remains Gaussian, with a mean given by the arithmetic average of the means of the $\nu_i$ and a covariance matrix satisfying a non-linear fixed-point equation.
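As a rough illustration, the sketch below approximates $\hat{\Sigma}$ by naively iterating the fixed-point equation with `scipy.linalg.sqrtm`; the two covariance matrices are made up, and convergence of this simple iteration is assumed rather than guaranteed (Agueh and Carlier do not prescribe this particular scheme).

```python
import numpy as np
from scipy.linalg import sqrtm

def barycenter_covariance(covs, n_iter=100, tol=1e-10):
    """Approximate the covariance of the 2-Wasserstein barycenter of
    Gaussians by iterating S <- (1/N) * sum_i (S^1/2 Sigma_i S^1/2)^1/2."""
    S = np.mean(covs, axis=0)  # initial guess: arithmetic mean
    for _ in range(n_iter):
        root = sqrtm(S)
        S_new = np.mean([sqrtm(root @ C @ root) for C in covs], axis=0)
        S_new = np.real(S_new)  # discard tiny imaginary parts from sqrtm
        if np.linalg.norm(S_new - S) < tol:
            return S_new
        S = S_new
    return S

# Two made-up 2x2 covariance matrices.
covs = np.array([[[1.0, 0.3], [0.3, 0.5]],
                 [[0.6, -0.2], [-0.2, 1.2]]])
Sigma_hat = barycenter_covariance(covs)
print(Sigma_hat)  # should satisfy the fixed-point equation approximately
```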

As illustrated in the figure below, this Gaussian barycenter contrasts with the mixture distribution $\frac{1}{N}\sum_{i=1}^N\mathcal{N}(m_i,\Sigma_i)$, which is in general not Gaussian. This shape-preserving property has numerous useful applications in image processing, particularly for texture interpolation and texture mixing.

Comparing Wasserstein barycenter and mixture for Gaussian distributions.


The Wasserstein barycenter in the preceding example was computed using the excellent ot Python module. For more details and in-depth examples, check out the documentation.
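For discrete histograms, a comparable computation might look as follows; the 1D grid, the discretized Gaussians, the regularization strength and the use of the entropic `ot.bregman.barycenter` routine (rather than whatever produced the figure above) are all assumptions made for illustration.

```python
import numpy as np
import ot
from scipy.stats import norm

# Discretize two Gaussians on a 1D grid and normalize them to histograms.
grid = np.linspace(-4, 4, 100)
A = np.vstack([norm.pdf(grid, loc=-1.5, scale=0.6),
               norm.pdf(grid, loc=1.5, scale=1.0)]).T
A = A / A.sum(axis=0, keepdims=True)   # columns are the input histograms

# Squared Euclidean ground cost on the grid, normalized for stability.
M = ot.dist(grid.reshape(-1, 1), grid.reshape(-1, 1))
M = M / M.max()

# Entropic-regularized Wasserstein barycenter with equal weights.
bary = ot.bregman.barycenter(A, M, reg=1e-3, weights=np.array([0.5, 0.5]))
```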

Numerical methods for discrete measures

Another practically important situation, discussed in detail in Cuturi and Doucet, is the case where we wish to find the barycenter of a set of arbitrary discrete measures that share the same support $X=\{x_1,\dots,x_n\}$. In order to obtain the Wasserstein barycenter of such a set of measures, we may use a simple projected subgradient method in which, at each iteration, we consider the average distance between the incumbent barycenter and the $N$ measures $\{\nu_k\}_{k=1}^N$.

More specifically, we formulate the problem in terms of the right-hand-side marginal probabilities $a\in\Delta_n$ and $b_k\in\Delta_n$ and solve the following problem:

$$\min_{a\in\Delta_n}\frac{1}{N}\sum_{k=1}^N W_p^p(a,b_k)$$

We let $f(a)=\frac{1}{N}\sum_{k=1}^N f_k(a)$ where $f_k(a)=W_p^p(a,b_k)$, and set $D^k_{ij}=D(x_i,x_j)^p$. We then observe that the dual of $f_k(a)$ is

$$\max_{\alpha\in\mathbb{R}^n,\,\beta\in\mathbb{R}^n}\;\alpha^\top a+\beta^\top b_k\quad\text{s.t.}\quad\alpha_i+\beta_j\le D^k_{ij},\;\forall(i,j)$$

We note that if $a$ and $b_k$ are in the simplex, both the primal and dual problems are feasible and bounded, so that they share the same finite optimal objective value. We can then show that $f(a)$ is convex⁴ and that $\alpha^\star=\frac{1}{N}\sum_{k=1}^N\alpha^0_k$ is a subgradient of $f$ at $a^0$, where $(\alpha^0_k,\beta^0_k)\in\arg\max_{\alpha,\beta}\left\{\alpha^\top a^0+\beta^\top b_k : \alpha_i+\beta_j\le D^k_{ij},\;\forall(i,j)\right\}$.

Indeed, $f_k(a^0)={\alpha^0_k}^\top a^0+{\beta^0_k}^\top b_k$ by strong duality. We also have $f_k(a)\ge{\alpha^0_k}^\top a+{\beta^0_k}^\top b_k$ for any $a\in\Delta_n$ by weak duality. It follows that $f_k(a^0)+{\alpha^0_k}^\top(a-a^0)\le f_k(a)$ for all $a\in\Delta_n$, and thus $\frac{1}{N}\sum_{k=1}^N f_k(a^0)+\frac{1}{N}\sum_{k=1}^N{\alpha^0_k}^\top(a-a^0)\le\frac{1}{N}\sum_{k=1}^N f_k(a)$.

We can therefore use the update $a^{l+1}=P_\Delta(a^l-\eta\alpha^\star)=\arg\min_{u\in\Delta_n}\|(a^l-\eta\alpha^\star)-u\|_2$, where $P_\Delta$ is the Euclidean projection onto the simplex and $\eta>0$ is some predetermined step length.

However, this projected subgradient method is only meant for presentation purposes. Indeed, each iteration with a fixed $a^l$ requires computing $N$ Wasserstein distances, and hence solving $N$ (primal) linear programs with $O(n^2)$ variables or $N$ (dual) linear programs with $O(n^2)$ constraints.
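Purely for illustration, then, here is a hedged sketch of the method: each dual LP is solved with `scipy.optimize.linprog`, the simplex projection uses the standard sort-based algorithm, and the step length, iteration count and toy data are arbitrary choices.

```python
import numpy as np
from scipy.optimize import linprog

def dual_potential(a, b_k, D):
    """Solve the dual transportation LP and return alpha, a subgradient
    of a -> W_p^p(a, b_k) at the point a."""
    n = len(a)
    # Variables are (alpha, beta); maximize a.alpha + b_k.beta, i.e. minimize the negation.
    c = -np.concatenate([a, b_k])
    # Constraints alpha_i + beta_j <= D_ij for every pair (i, j).
    A_ub = np.zeros((n * n, 2 * n))
    for i in range(n):
        for j in range(n):
            A_ub[i * n + j, i] = 1.0
            A_ub[i * n + j, n + j] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=D.ravel(), bounds=(None, None))
    return res.x[:n]

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv)[0][-1]
    return np.maximum(v - cssv[rho] / (rho + 1.0), 0.0)

def barycenter_subgradient(bs, D, eta=0.05, n_iter=50):
    """Projected subgradient method for min_{a in simplex} (1/N) sum_k W_p^p(a, b_k)."""
    n = D.shape[0]
    a = np.full(n, 1.0 / n)  # uniform starting point
    for _ in range(n_iter):
        alpha_star = np.mean([dual_potential(a, b, D) for b in bs], axis=0)
        a = project_simplex(a - eta * alpha_star)
    return a

# Toy example: 4 support points on a line, two input histograms.
x = np.arange(4, dtype=float)
D = np.abs(x[:, None] - x[None, :])           # ground cost, p = 1
bs = [np.array([1.0, 0, 0, 0]), np.array([0, 0, 0, 1.0])]
print(barycenter_subgradient(bs, D))
```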

There exist other, more sophisticated methods, such as a multimarginal linear programming formulation in which the number of variables is $O(n^N)$, making it intractable for most real problems. More importantly, an extremely efficient fixed-point algorithm can be used to solve an approximate, regularized version of the problem. Details on this Sinkhorn algorithm, as well as countless other details and variations, are given in the Peyré and Cuturi book.
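A minimal numpy sketch of these Sinkhorn iterations for the entropy-regularized problem is shown below; the regularization strength and tolerance are arbitrary, and practical implementations (for example in POT) add log-domain stabilization and other refinements.

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.05, n_iter=1000, tol=1e-9):
    """Entropy-regularized OT via Sinkhorn's fixed-point iterations.
    Returns the transport plan diag(u) @ K @ diag(v)."""
    K = np.exp(-M / reg)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u_prev = u
        u = a / (K @ v)             # match the row marginals
        v = b / (K.T @ u)           # match the column marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]

# Toy example on 5 points of a line.
x = np.arange(5, dtype=float)
M = np.abs(x[:, None] - x[None, :])
a = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
b = np.array([0.1, 0.1, 0.1, 0.2, 0.5])
plan = sinkhorn(a, b, M)
print(plan.sum(axis=1))  # approximately equal to a
print((plan * M).sum())  # regularized approximation of W_1
```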

Conclusion

This post illustrates that the Wasserstein distance is an extremely powerful and versatile concept. Its locally aware, shape-preserving properties can be used in numerous applications, and it benefits from various sophisticated numerical algorithms.

Beyond these important advances, the field of optimal transport theory remains extremely rich and dynamic. As of 2020, many theoretical and practical research initiatives are underway and numerous applications are still being developed.


  1. Of course, we can always take $l=\max\{n,m\}$ and consider vectors of dimension $l$ padded with zeros. However, this will increase the dimension of one of the vectors, which might be undesirable if $l$ is very large.

  2. A greyscale pixel image can be seen as a discrete 2-dimensional distribution where the bins represent pixels (points in a 2D grid) and the density is given by the normalized grey intensity in $[0,1]$. If we want to compare images with resolution (say) 30 × 30, we might then want to evaluate the intensity of the two images at pixels (2,3), (3,4) and (15,15). Using the $\ell_1$ norm as ground distance, (2,3) and (3,4) are at a distance of $|2-3|+|3-4|=2$, while (2,3) and (15,15) are at a distance of $|15-2|+|15-3|=25$. Moving intensity from pixel (2,3) to (15,15) is therefore much costlier in the Wasserstein sense than moving it from (2,3) to (3,4), which helps capture the local correlation in pixel intensity.

  3. In this simple case, the Wasserstein distance corresponds exactly to the basic ground distance since histograms only have a single bin of mass 1, but this is not the case in general.

  4. $f(a)$ is convex since duality reveals that each $f_k(a)$ is the maximum of a (finite) collection of affine functions of $a$, and a sum of convex functions is convex. See Boyd and Vandenberghe for more details.