Tamanna Hossain-Kay

Box Embeddings: Paper Overviews

Tamanna Hossain-Kay / 2021-11-09


Geometric embeddings can be used to represent transitive asymmetric relationships. Box-embeddings are n-dimensional hyperrectangles that can represent such relationships by containment.

Source: Box Embeddings: An open-source library for representation learning
using geometric structures
Source: Box Embeddings: An open-source library for representation learning using geometric structures

Formally, a “box” is defined as a Cartesian product of closed intervals, B(θ)=i=1n[zi(θ),Zi(θ)]=[z1(θ),Z1(θ)]××[zn(θ),Zn(θ)]=[xm,1,xM,1]××[xm,n,xM,n]

Parameterization: θ may be the output from a neural network

Python package: Box Embeddings: An open-source library for representation learning using geometric structures

In order to learn box-embeddings, intersections and volumes of boxes need to be computed. This can be done in a few different ways:

Read on for brief overviews of each method.

Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures

Authors: Luke Vilnis, Xiang Li, Shikhar Murty, Andrew McCallum; University of Massachusetts Amherst

Method: Hard Intersection, Hard Volume

Defining Box operations

xy=⊥ if x and y disjoint, else xy=i[max(xm,i,ym,i),min(xM,i,yM,i)]xy=i[min(xm,i,ym,i),max(xM,i,yM,i)]

Note,

Volume can be computed as,

Vol(B(θ))=imax(xM,ixm,i,0)=imh(xM,ixm,i)

where mh is the standard hinge function.

Volume can be interpreted as probability. For a concept c associated with box B(θ),

P(c)=Vol(B(θ))

Learning WordNet

The transitive reduction of the WordNet noun hierarchy contains 82,114 entities and 84,363 edges.

The learning task is an edge classification task. Given a pair of nodes (h,t), the probability of existence of the edge hiti is computed as P(hiti)=P(hi|ti)=P(hiti)P(ti)=Vol(B(θhi)B(θti))Vol(B(θti))

Gold label: Existence (hiti)P(hi|ti)=1

Loss: Cross-entropy

Surrogate Intersection: When boxes do not intersect, the joint probability is 0 so parameters cannot be updated. In this case, a lower bound is used:

p(ab)=p(a)+p(b)p(ab)p(a)+p(b)p(ab)

WordNet Results

Box-embeddings perform better than baselines in terms of classification accuracy. Refer to paper for details on baseline models.

Smoothing the Geometry of Probabilistic Box Embeddings

Authors: Xiang Li, Luke Vilnis,Dongxu Zhang, Michael Boratko, Andrew McCallum; University of Massachusetts Amherst

Method: Hard Intersection, Soft Volume

As mentioned in the previous paper (Vilnis et al., 2018), the gradient cannot flow when boxes disjoint. Vilnis et al., 2018 address this issue using a surrogate intersection probability. Here an alternative is proposed using using a Gaussian convolution with the original boxes to smooth the box edges into density functions.

Define boxes as indicator functions of intervals. Volume of boxes can then be expressed as integrals over the products of the indicator functions. Consider the 1D case for ease of exposition and results will generalize to higher dimensions.

The volume of the intersection of two boxes representing the intervals x=[a,b] and y=[c,d] (representing the joint probability between intervals) can be expressed as,

p(xy)=R1[a,b](x)1[c,d](x)dx

The smoothed version is,

pϕ(xy)=Rf(x;a,b,σ12)g(x;c,d,σ22)dx

where,

f(x;a,b,σ2)=1[a,b](x)ϕ(x;σ2)=R1[a,b](z)ϕ(xz;σ2)dz=abϕ(xz;σ2)dz

and ϕ(z;σ2)=1σ2πez22σ2(Gaussian PDF)

Integral Solution:

pϕ(xy)=σ(mΦ(bcσ)+mΦ(adσ)mΦ(bdσ)mΦ(acσ)) where σ=σ12+σ22, soft(x)=log(1+exp(x)) is the softplus function, the integral of the logistic sigmoid, and ρ=σ1.702

Proof

Let τ:=1σ12+σ22, and note the identity from (Jebara et al., 2004; Vilnis & McCallum, 2015):

Rϕ(xμ1;σ12)ϕ(xμ2;σ22)dx=ϕ(μ1μ2;σ12+σ22)

Then, pϕ(xy)=Rabϕ(xy;σ12)dycdϕ(xz;σ22)dzdxDefinition f,g=abcdRϕ(xy;σ12)ϕ(xz;σ22)dxdydzFubini's Theorem=cdabϕ(yz;τ2)dydz(Jebara et al., 2004; Vilnis & McCallum, 2015)=cdabΦ(τ(yz))τdydzChain Rule; Φ is the standard normal CDF=cdΦ(τ(bz))Φ(τ(az))dz=1τ(mΦ(τ(bd))mΦ(τ(ad))mΦ(τ(bc))+mΦ(τ(ac))) Then with σ=τ1, pϕ(xy)=σ(mΦ(bcσ)+mΦ(adσ)mΦ(bdσ)mΦ(acσ))

Approximation

Multiplication of Gaussian-smoothed indicators have the property f2f so for cases ρ>0,

So we will approximate to get a version that is idempotent.

pϕ(xy)(ρsoft(bcρ)+ρsoft(adρ))(ρsoft(bdρ)+ρsoft(acρ))(soft(bc)soft(ad))(soft(bd)soft(ac))=soft(bdac)=soft(bdac)

The first approximation follows from the approximation of Φ by a logistic sigmoid given in Bowling et al. (2009).

However, it is still not idempotent except in the zero-temperature limit. Since as ρ goes to zero, we recover the original formula with the hinge function which is idempotent, pϕ(xy)=limρ0(ρsoft(bcρ)+ρsoft(adρ))(ρsoft(bdρ)+ρsoft(acρ))=(mh(bc)+mh(ad))(mh(bd)+mh(ac))=mh(bdac)hinge function on an interval
The limit

(From Wolfram Alpha) Two-sided limit does not exist.

Limit from the left: limρ0ρlog(1+exp(yρ))=0 for y>0

Limit from the right: limρ0+ρlog(1+exp(yρ))=y for y>0

The second approximation is needed to get a similar idempotent property when using the softplus fuinction as when using the hinge function,

(mh(bc)+mh(ad))(mh(bd)+mh(ac))=mh(bdac)

(soft(bc)soft(ad))(soft(bd)soft(ac))=soft(bdac)

Normalization

But finally, because softplus upper-bounds the hinge function, it is capable of outputting values that are greater than 1 , and therefore must be normalized.

Thus, in the multi-dimensional case we have:

p(x)imsoft(i)(xM,ixm,i)p(x,y)imsoft(i)(xM,iyM,ixm,iym,i)

where,

msoft(i)(x)=soft(xρ)soft(GmGmρ)

and (Gm(i),GM(i))= Global minimum and maximum in ith dimension.

WordNet Results

Sparsity is increased (positive:negative ratio increased) in the data to create more disjoint cases. In sparser regimes, the smoothed box greatly outperforms the baselines (original hard box, and OE).

Improving Local Identifiability in Probabilistic Box Embeddings

Authors: Shib Sankar Dasgupta, Michael Boratko, Dongxu Zhang, Luke Vilnis, Xiang Lorraine Li, Andrew McCallum; University of Massachusetts, Amherst

Method: Gumbel Intersection, Bessel Approx. Volume

Both of the above proposed methods can have problems with local indentifiability since many different configurations of boxes can lead to the same joint probability value. This can lead to pathological parameter settings during train (Figure 2c; Figure 3c).

The method proposed here defines box edges as random variable, which leads to probabilities as expected volumes. This densifies the gradient of joint probabilities.

Gumbel Distribution

Since computing intersections of boxes involves computing minima and maxima, a min-max stable distribution is chosen: Gumbel distribution. This means that the maximum of multiple Gumbel random variables also follows a Gumbel distribution (MaxGumbel), and the minimum of multiple Gumbel random variables also follows a Gumbel distribution (MinGumbel).

Specifically,

If XiMaxGumbel(μxi,β) then max(Xi)MaxGumbel(βln(ieμxiβ),β)=MaxGumbel(βLogSumExp(μxiβ)) and if XiMinGumbel(μx,β), then min(Xi)MinGumbel(βln(ieμxiβ),β)=MinGumbel(βLogSumExp(μxiβ))

Gumbel-box process

For a probability space (ΩBox ,E,PBox ), with ΩBoxRd, the Gumbel-box process is generated as βR+,μi,ΩBoxμi,ΩBoxXji,MaxGumbel(μji,,β),Xji,MinGumbel(μji,,β)Box(Xi)=j=1d[Xji,,Xji,],Xi=1Box(Xi)x1,,xnP(X1,,Xn) For a uniform base measure on Rd, each Xi is distributed according to XiBernoulli(j=1d(Xji,Xji,))

Intersection

xy=i[μmaxXtTXjt,,μminXtTXjt,]=i[βLogSumExp(μjt,β),βLogSumExpXtT(μjt,β)]

Volume Calculation

The expected length of an interval [X,Y] where XMaxGumbel(μx,β) and YMinGumbel(μy,β) is given by E[max(YX,0)]=2βK0(2eμyμx2β) where K0 is the modified Bessel function of second kind, order 0.

Proof

E[max(YX,0)]={y>x}(yx)f(y;μy)f(x;μx)dxdy=yyf(y;μy)f(x;μx)dxdyxxf(y;μy)f(x;μx)dydx=yf(y;μy)F(y;μx)dyxf(x;μx)(1F(x;μy))dxyf(y;μy)F(y;μx)dy=xf(x;μx)F(x;μy)dx We note that this is equal to EY[YF(Y;μx)]EX[XF(X;μy)] however this calculation can be continued, combining the integrals, as =zf(z;μy)F(z;μx)zfV(z;μx)FV(z;μy)dz We integrate by parts, letting u=z,dv=1β(eπμμβe~μβ)exp(eπμνβe~μ2β)dz, du=dz,v=exp(exμμβexμπβ) where the dv portion made use of the substitution w=e~μμβ+eμμβ,dw=1β(e~μμβeμ~β)dz We have, therefore, that (12) becomes =zexp(ezμμβe~μβ)|+exp(exμμβe~μβ)dz The first term goes to 0. Setting u=z(μy+μz)/2β the second term becomes βexp(eμxμu2β¯(eu+eu))du=2β0exp(2eμxμu2N¯coshu)du With z=2eμxμy2,S, this is a known integral representation of the modified Bessel function of the second kind of order zero, K0(z) [4, Eq. 10.32.9], which completes the proof of proposition 1 .

Let m(x)=E[max(x,0)]=2βK0(2ex2β). Then the expected volume is, E[i=1dm a x(minXtTXjt,maxXtTXjt,,0)]=i=1dE[m a x(minXtTXjt,maxXtTXjt,,0)]rearrange product and expectation=i=1dm(minXtTXjt,maxXtTXjt,)definition of m=i=1d2βK0(2eβLogSumExpXtT(μjt,β)βLogSumExp(μjt,β)2β)solution for m where T is a set of variables taking the value 1, and μt, and μt, are the vectors of location parameters for the minimum and maximum coordinates of a Box(Xt).

Approximation

Computing conditional probabilities exactly requires the conditional expectation, E[P(xi,xj,Z)/P(xj,Z)] which is intractable since it is a quotient. Instead, we approximate it by its first-order Taylor expansion, which takes the form of a ratio of two simple expectations: E[P(xi,xj,Z)]/E[P(xj,Z)]

Softplus Approximation

Due to extreme similarity of the exact volume to the softplus function, the following approximation is used:

m(x)βlog(1+exp(x/β2γ)

WordNet Results

Gumbel boxes perform better than baselines on WordNet’s noun hierarchy in terms of F1 score. Refer to paper for details on baseline models.