Home
PHOC embeddings, originally proposed by Almazán et al., have been widely used for isolated word recognition and keyword spotting (KWS). Several similarity/dissimilarity measures have been proposed to perform recognition, or to rank the candidate word images for a given query in KWS. In addition, the models that extract the PHOC embedding for a given image have also improved in recent years.
In particular, Sudholt proposed a VGG-like architecture to extract the PHOC embedding, and the Bray-Curtis dissimilarity to rank the candidate images in KWS. The PHOC embedding is simply the vector containing the probability that each dimension of the PHOC representation is equal to 1, independently of the others.
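As a point of reference, the Bray-Curtis dissimilarity between two non-negative vectors is the sum of absolute differences divided by the sum of all components. A minimal sketch (the 4-dimensional vectors are purely illustrative; real PHOC embeddings have hundreds of dimensions):

```python
import numpy as np

def braycurtis(p, q):
    """Bray-Curtis dissimilarity: sum |p_i - q_i| / sum (p_i + q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.abs(p - q).sum() / (p + q).sum()

# Toy PHOC-like probability vectors (illustrative only).
p = np.array([0.9, 0.1, 0.8, 0.2])
q = np.array([0.85, 0.2, 0.7, 0.3])
print(braycurtis(p, q))  # 0 for identical vectors; values near 1 are very dissimilar
```

In KWS, candidate images would be ranked by increasing dissimilarity to the query embedding.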
In my PhD thesis, I proposed a way of computing the "relevance probability" for a given pair of images, under the following assumptions:
- A pair of images (X and X') is relevant if, and only if, the two images share the same pyramid of histograms of characters (H = H'). For high-dimensional PHOC representations, I argue that this is equivalent to saying that the two words are equal, which is the typical definition of "relevance" in KWS.
- The two images, X and X', are independent, and their PHOC representations, H and H', only depend on their respective image.
- Each dimension of the PHOC representation is independent of the others (this is obviously false, but it is also the assumption made by the PHOCNet model when predicting the embedding).
Given these assumptions, it is easy to show that the two PHOC representations are equal, and hence the pair is relevant, with probability:

$$P(R \mid X, X') = \prod_{d} \left( p_d \, p'_d + (1 - p_d)(1 - p'_d) \right)$$

where $p_d = P(H_d = 1 \mid X)$ and $p'_d = P(H'_d = 1 \mid X')$: each dimension contributes the probability that both values are 1, plus the probability that both are 0.
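This product over dimensions can be sketched as follows (toy vectors are illustrative; for real high-dimensional PHOCs, summing logarithms instead of multiplying directly avoids numerical underflow):

```python
import numpy as np

def relevance_prob(p, q):
    """P(H = H') under per-dimension independence.

    Each dimension matches if both bits are 1 (prob p_d * q_d) or
    both are 0 (prob (1 - p_d) * (1 - q_d)); multiply across dimensions.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    match = p * q + (1.0 - p) * (1.0 - q)
    return match.prod()

# Toy PHOC-like probability vectors (illustrative only).
p = np.array([0.9, 0.1, 0.8, 0.2])
print(relevance_prob(p, p))  # identical embeddings still give prob < 1
```

Note that even comparing an embedding against itself gives a probability below 1 whenever the per-dimension probabilities are not exactly 0 or 1, since the underlying binary histograms are uncertain.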
The computation of the relevance probability has the same asymptotic cost as the popular Bray-Curtis measure, but it comes with a meaningful probabilistic interpretation.