Box Embeddings: Paper Overviews
Tamanna Hossain-Kay / 2021-11-09
Geometric embeddings can be used to represent transitive asymmetric relationships. Box-embeddings are n-dimensional hyperrectangles that can represent such relationships by containment.

Formally, a “box” is defined as a Cartesian product of closed intervals, B(θ)=n∏i=1[zi(θ),Zi(θ)]=[z1(θ),Z1(θ)]×⋯×[zn(θ),Zn(θ)]=[xm,1,xM,1]×⋯×[xm,n,xM,n]
Parameterization: θ may be the output from a neural network
Python package: Box Embeddings: An open-source library for representation learning using geometric structures
In order to learn box-embeddings, intersections and volumes of boxes need to be computed. This can be done in a few different ways:
- Hard Intersection + Hard Volume: Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures (Vilnis et al., 2018)
- Hard Intersection + Soft Volume: Smoothing the Geometry of Probabilistic Box Embeddings (Li et al., 2018)
- Gumbel Intersection + Bessel Approx. Volume: Improving Local Identifiability in Probabilistic Box Embeddings (Dasgupta et al., 2020)
Read on for brief overviews of each method.
Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures
Authors: Luke Vilnis, Xiang Li, Shikhar Murty, Andrew McCallum; University of Massachusetts Amherst
Method: Hard Intersection, Hard Volume
Defining Box operations
x∧y=⊥ if x and y disjoint, else x∧y=∏i[max(xm,i,ym,i),min(xM,i,yM,i)]x∨y=∏i[min(xm,i,ym,i),max(xM,i,yM,i)]
Note,
- While boxes are closed under intersection they’re not closed under union.
- (x∧y)=(x∩y), but (x∨y)>(x∪y)
Volume can be computed as,
Vol(B(θ))=∏imax(xM,i−xm,i,0)=∏imh(xM,i−xm,i)
where mh is the standard hinge function.
Volume can be interpreted as probability. For a concept c associated with box B(θ),
P(c)=Vol(B(θ))
Learning WordNet
The transitive reduction of the WordNet noun hierarchy contains 82,114 entities and 84,363 edges.
The learning task is an edge classification task. Given a pair of nodes (h,t), the probability of existence of the edge hi→ti is computed as P(hi→ti)=P(hi|ti)=P(hi∩ti)P(ti)=Vol(B(θhi)∩B(θti))Vol(B(θti))
Gold label: Existence (hi→ti)⟺P(hi|ti)=1
Loss: Cross-entropy
Surrogate Intersection: When boxes do not intersect, the joint probability is 0 so parameters cannot be updated. In this case, a lower bound is used:
p(a∩b)=p(a)+p(b)−p(a∪b)≥p(a)+p(b)−p(a∨b)
WordNet Results
Box-embeddings perform better than baselines in terms of classification accuracy. Refer to paper for details on baseline models.
Smoothing the Geometry of Probabilistic Box Embeddings
Authors: Xiang Li, Luke Vilnis,Dongxu Zhang, Michael Boratko, Andrew McCallum; University of Massachusetts Amherst
Method: Hard Intersection, Soft Volume
As mentioned in the previous paper (Vilnis et al., 2018), the gradient cannot flow when boxes disjoint. Vilnis et al., 2018 address this issue using a surrogate intersection probability. Here an alternative is proposed using using a Gaussian convolution with the original boxes to smooth the box edges into density functions.
Define boxes as indicator functions of intervals. Volume of boxes can then be expressed as integrals over the products of the indicator functions. Consider the 1D case for ease of exposition and results will generalize to higher dimensions.
The volume of the intersection of two boxes representing the intervals x=[a,b] and y=[c,d] (representing the joint probability between intervals) can be expressed as,
p(x∧y)=∫R1[a,b](x)1[c,d](x)dx
The smoothed version is,
pϕ(x∧y)=∫Rf(x;a,b,σ21)g(x;c,d,σ22)dx
where,
f(x;a,b,σ2)=1[a,b](x)∗ϕ(x;σ2)=∫R1[a,b](z)ϕ(x−z;σ2)dz=∫baϕ(x−z;σ2)dz
and ϕ(z;σ2)=1σ√2πe−z22σ2(Gaussian PDF)
Integral Solution:
pϕ(x∧y)=σ(mΦ(b−cσ)+mΦ(a−dσ)−mΦ(b−dσ)−mΦ(a−cσ)) where σ=√σ21+σ22, soft(x)=log(1+exp(x)) is the softplus function, the integral of the logistic sigmoid, and ρ=σ1.702
Proof
Let τ:=1√σ21+σ22, and note the identity from (Jebara et al., 2004; Vilnis & McCallum, 2015):
∫Rϕ(x−μ1;σ21)ϕ(x−μ2;σ22)dx=ϕ(μ1−μ2;σ21+σ22)
Then, pϕ(x∧y)=∫R∫baϕ(x−y;σ21)dy∫dcϕ(x−z;σ22)dzdxDefinition f,g=∫ba∫dc∫Rϕ(x−y;σ21)ϕ(x−z;σ22)dxdydzFubini's Theorem=∫dc∫baϕ(y−z;τ−2)dydz(Jebara et al., 2004; Vilnis & McCallum, 2015)=∫dc∫baΦ′(τ(y−z))τdydzChain Rule; Φ is the standard normal CDF=∫dcΦ(τ(b−z))−Φ(τ(a−z))dz=−1τ(mΦ(τ(b−d))−mΦ(τ(a−d))−mΦ(τ(b−c))+mΦ(τ(a−c))) Then with σ=τ−1, pϕ(x∧y)=σ(mΦ(b−cσ)+mΦ(a−dσ)−mΦ(b−dσ)−mΦ(a−cσ))
Approximation
Multiplication of Gaussian-smoothed indicators have the property f2≠f so for cases ρ>0,
- Idempotency requirement of bounded lattices is violated: a∧a=a∨a=a
- Practically then if we are to treat the outputs of pϕ as probabilities, the consequence is pϕ(x∣x)=pϕ(x,x)pϕ(x)=pϕ(x∧x)pϕ(x)≠1
So we will approximate to get a version that is idempotent.
pϕ(x∧y)≈(ρsoft(b−cρ)+ρsoft(a−dρ))−(ρsoft(b−dρ)+ρsoft(a−cρ))≈(soft(b−c)∨soft(a−d))∧(soft(b−d)∨soft(a−c))=soft(b∧d−a∨c)=soft(b∧d−a∨c)
The first approximation follows from the approximation of Φ by a logistic sigmoid given in Bowling et al. (2009).
However, it is still not idempotent except in the zero-temperature limit. Since as ρ goes to zero, we recover the original formula with the hinge function which is idempotent, pϕ(x∧y)=limρ→0(ρsoft(b−cρ)+ρsoft(a−dρ))−(ρsoft(b−dρ)+ρsoft(a−cρ))=(mh(b−c)+mh(a−d))−(mh(b−d)+mh(a−c))=mh(b∧d−a∨c)hinge function on an intervalThe limit
(From Wolfram Alpha) Two-sided limit does not exist.
Limit from the left: limρ→0−ρlog(1+exp(yρ))=0 for y>0
Limit from the right: limρ→0+ρlog(1+exp(yρ))=y for y>0
The second approximation is needed to get a similar idempotent property when using the softplus fuinction as when using the hinge function,
(mh(b−c)+mh(a−d))−(mh(b−d)+mh(a−c))=mh(b∧d−a∨c)
(soft(b−c)∨soft(a−d))∧(soft(b−d)∨soft(a−c))=soft(b∧d−a∨c)
Normalization
But finally, because softplus upper-bounds the hinge function, it is capable of outputting values that are greater than 1 , and therefore must be normalized.
Thus, in the multi-dimensional case we have:
p(x)≈∏im(i)soft(xM,i−xm,i)p(x,y)≈∏im(i)soft(xM,i∧yM,i−xm,i∨ym,i)
where,
m(i)soft(x)=soft(xρ)soft(Gm−Gmρ)
and (G(i)m,G(i)M)= Global minimum and maximum in ith dimension.
WordNet Results
Sparsity is increased (positive:negative ratio increased) in the data to create more disjoint cases. In sparser regimes, the smoothed box greatly outperforms the baselines (original hard box, and OE).
Improving Local Identifiability in Probabilistic Box Embeddings
Authors: Shib Sankar Dasgupta, Michael Boratko, Dongxu Zhang, Luke Vilnis, Xiang Lorraine Li, Andrew McCallum; University of Massachusetts, Amherst
Method: Gumbel Intersection, Bessel Approx. Volume
Both of the above proposed methods can have problems with local indentifiability since many different configurations of boxes can lead to the same joint probability value. This can lead to pathological parameter settings during train (Figure 2c; Figure 3c).
The method proposed here defines box edges as random variable, which leads to probabilities as expected volumes. This densifies the gradient of joint probabilities.
Gumbel Distribution
Since computing intersections of boxes involves computing minima and maxima, a min-max stable distribution is chosen: Gumbel distribution. This means that the maximum of multiple Gumbel random variables also follows a Gumbel distribution (MaxGumbel), and the minimum of multiple Gumbel random variables also follows a Gumbel distribution (MinGumbel).
Specifically,
If Xi∼MaxGumbel(μxi,β) then max(Xi)∼MaxGumbel(βln(∑ieμxiβ),β)=MaxGumbel(βLogSumExp(μxiβ)) and if Xi∼MinGumbel(μx,β), then min(Xi)∼MinGumbel(−βln(∑ie−μxiβ),β)=MinGumbel(−βLogSumExp(−μxiβ))
Gumbel-box process
For a probability space (ΩBox ,E,PBox ), with ΩBox⊆Rd, the Gumbel-box process is generated as β∈R+,μi,∨∈ΩBoxμi,∧∈ΩBoxXi,∧j∼MaxGumbel(μi,∧j,β),Xi,∨j∼MinGumbel(μi,∨j,β)Box(Xi)=d∏j=1[Xi,∧j,Xi,∨j],Xi=1Box(Xi)x1,…,xn∼P(X1,…,Xn) For a uniform base measure on Rd, each Xi is distributed according to Xi∼Bernoulli(d∏j=1(Xi,∨j−Xi,∧j))Intersection
x∧y=∏i[μmaxXt∈TXt,∨j,μminXt∈TXt,∧j]=∏i[βLogSumExp(μt,∨jβ),−βLogSumExpXt∈T(−μt,∧jβ)]
Volume Calculation
The expected length of an interval [X,Y] where X∼MaxGumbel(μx,β) and Y∼MinGumbel(μy,β) is given by E[max(Y−X,0)]=2βK0(2e−μy−μx2β) where K0 is the modified Bessel function of second kind, order 0.
Proof
E[max(Y−X,0)]=∬ We note that this is equal to E_{Y}\left[Y \cdot F_{\vee}\left(Y ; \mu_{x}\right)\right]-E_{X}\left[X \cdot F_{\vee}\left(-X ;-\mu_{y}\right)\right] however this calculation can be continued, combining the integrals, as =\int_{-\infty}^{\infty} z f_{\wedge}\left(z ; \mu_{y}\right) F_{\mathrm{\vee}}\left(z ; \mu_{x}\right)-z f_{\mathrm{V}}\left(z ; \mu_{x}\right) F_{\mathrm{V}}\left(-z ;-\mu_{y}\right) d z We integrate by parts, letting u=z, \quad d v=\frac{1}{\beta}\left(e^{\frac{\pi-\mu_{\mu}}{\beta}}-e^{-\frac{\tilde - \mu_{-}}{\beta}}\right) \exp \left(-e^{\frac{\pi-\mu_{\nu}}{\beta}}-e^{-\frac{\tilde - \mu_{2}}{\beta}}\right) d z, d u=d z, \quad v=-\exp \left(-e^{\frac{x-\mu_{\mu}}{\beta}}-e^{-\frac{x-\mu \pi}{\beta}}\right) where the d v portion made use of the substitution w=e^{\frac{\tilde - \mu_{\mu}}{\beta}}+e^{-\frac{-\mu-\mu}{\beta}}, \quad d w=\frac{1}{\beta}\left(e^{\frac{\tilde - \mu_{\mu}}{\beta}}-e^{-\frac{\tilde{-\mu}}{\beta}}\right) d z We have, therefore, that (12) becomes =-\left.z \exp \left(-e^{\frac{z-\mu_{\mu}}{\beta}}-e^{-\frac{\tilde - \mu}{\beta}}\right)\right|_{-\infty} ^{\infty}+\int_{-\infty}^{\infty} \exp \left(-e^{\frac{x-\mu_{\mu}}{\beta}}-e^{-\frac{\tilde - \mu}{\beta}}\right) d z The first term goes to 0 . Setting u=\frac{z-\left(\mu_{y}+\mu_{z}\right) / 2}{\beta} the second term becomes \beta \int_{-\infty}^{\infty} \exp \left(-e^{\frac{\mu_{x}-\mu_{u}}{2 \bar{\beta}}}\left(e^{u}+e^{-u}\right)\right) d u=2 \beta \int_{0}^{\infty} \exp \left(-2 e^{\frac{\mu_{x}-\mu_{u}}{2 \bar{N}}} \cosh u\right) d u With z=2 e^{\frac{\mu_{x}-\mu_{y}}{2, S}}, this is a known integral representation of the modified Bessel function of the second kind of order zero, K_{0}(z) [4, Eq. 10.32.9], which completes the proof of proposition 1 .
Let m(x)=\mathbb{E}[max(x,0)]=2 \beta K_{0}\left(2 e^{-\frac{x}{2 \beta}}\right) . Then the expected volume is, \begin{aligned} &\mathbb{E}\left[\prod _ { i = 1 } ^ { d } \operatorname { m a x } \left(\min _{X^{t} \in T} X_{j}^{t, \wedge}-\max _{X^{t} \in T} X_{j}^{t, \vee}, 0\right)\right] \\ &= \prod _ { i = 1 } ^ { d } \mathbb{E}\left[ \operatorname { m a x } \left(\min _{X^{t} \in T} X_{j}^{t, \wedge}-\max _{X^{t} \in T} X_{j}^{t, \vee}, 0\right)\right] \quad\text{rearrange product and expectation}\\ &=\prod _ { i = 1 } ^ { d } m(\min _{X^{t} \in T} X_{j}^{t, \wedge}-\max _{X^{t} \in T} X_{j}^{t, \vee}) \quad\text{definition of m}\\ &=\prod_{i=1}^{d} 2 \beta K_0\left(2 e ^{ \frac{-\beta \operatorname{LogSumExp}_{X^{t} \in T}\left(-\frac{\mu_{j}^{t, \wedge}}{\beta}\right)-\beta \operatorname{LogSumExp}\left(\frac{\mu_{j}^{t, \vee}}{\beta}\right)}{2\beta}} \right) \quad\text{solution for m} \end{aligned} where T is a set of variables taking the value 1, and \mu^{t, \wedge} and \mu^{t, \vee} are the vectors of location parameters for the minimum and maximum coordinates of a \operatorname{Box}\left(X^{t}\right).
Approximation
Computing conditional probabilities exactly requires the conditional expectation, \mathbb{E}\left[P\left(x^{i}, x^{j}, Z\right) / P\left(x^{j}, Z\right)\right] which is intractable since it is a quotient. Instead, we approximate it by its first-order Taylor expansion, which takes the form of a ratio of two simple expectations: \mathbb{E}\left[P\left(x^{i}, x^{j}, Z\right)\right] / \mathbb{E}\left[P\left(x^{j}, Z\right)\right]
Softplus Approximation
Due to extreme similarity of the exact volume to the softplus function, the following approximation is used:
m(x) \approx \beta \log (1+\exp (x / \beta-2 \gamma)
WordNet Results
Gumbel boxes perform better than baselines on WordNet’s noun hierarchy in terms of F1 score. Refer to paper for details on baseline models.