Skip to content
This repository has been archived by the owner on May 19, 2022. It is now read-only.

Bayesian Ranking

clvv edited this page Feb 18, 2012 · 2 revisions

Bayesian Ranking

Introduction

Fasd has a mathematical model for its ranking algorithm. Specifically, it utilizes Bayesian inference for ranking.

In this article, all capitalized and boldface variables are considered sets, and all corresponding lower case variables are considered elements of those sets. The set of all observed paths is denoted as (\mathbf{Path}), and the set of given patterns is denoted as (\mathbf{Pattern}).

Bayes Rule

When users provide a set of patterns, (\mathbf{Pattern}). fasd tries to find (P(path|\mathbf{Pattern})) for all ( path \in \mathbf{Path}), or in words, the probability of each path given the set of search patterns. Apply Bayes rule, we have:

[ P( path | \mathbf{Pattern} ) = \frac{ P(\mathbf{Pattern}|path)P(path) }{ \sum \limits_{pattern \in \mathbf{Pattern}} P(path|pattern)P(pattern) } ]

Since fasd only needs find the path with highest probability, and notice that the denominator is the same for all (path). So we define:

[ rank(path) = P(\mathbf{Pattern}|path)P(path) ]

(P(path)) is the prior, (P(\mathbf{Pattern}|path)) is the likelihood, and their product yields (rank(path)), the "posterior."

Calculating the Prior

Fasd keeps a database of accessed paths each with the number of times it has been accessed and the most recent timestamp.

[ \forall (path \in \mathbf{Path}) : path.times \wedge path.timestamp ]

Fasd uses the "frecency" algorithm to estimate the prior, (P(Path)).

[ P(path) \approx frencent(path) = path.times \times weight(currentTime - path.timestamp) ]

[ weight(seconds) = \left{ \begin{array}{lr} 6 : seconds \le 3600 \ 4 : 3600 \leq seconds \le 86400 \ 2 : 86400 \leq seconds \le 604800 \ 1 : otherwise \end{array} \right ]

Function (weight) yields a weight according to the time elapsed.

Calculating the Likelihood

Fasd uses common sense and the principle of least surprise to find the likelihood.

We assum that all elements of (\mathbf{Pattern}) are mutually independent. Thus:

[ P(\mathbf{Pattern}|path) = \prod \limits_{pattern \in \mathbf{Pattern}} P(pattern|path) ]

And fasd uses a function to estimate (P(pattern|path)):

[ P(pattern|path) \approx likelihood(pattern, path) ]

Here is where a trick comes in. Since (path) consists of segments which looks like "/abc/def/ghi/.../xyz", assuming the user is using a pattern to refer to a path, it is more likely that the pattern matches the filename "xyz" instaead of other segments. Consider a function

[ n(pattern, path) = \frac{ \textrm{number of segments before and including pattern in path} }{ \textrm{number of segments in path} } ]

(For instance n("def", "/abc/def/ghi") = 2/3). Then we define the estimate function (likelihood) to be

[ likelihood(pattern, path) = \left{ \begin{array}{lr} F : n(pattern. path) = 1 \ n(pattern, path) : otherwise
\end{array} \right ]

Where (F) is relatively big constant, usually at least 10. This way, we can prioritize a path if the pattern matches the last segment. This algorithm resembles the common sense when users are forming the search patterns: search patterns are more likely to match the filennames.

Calculating the Rank

The rank, or "posterior", is just the product of the prior and the likelihood, as we have previously discussed.

[ rank(path) = frecent(path) \times \prod \limits_{pattern \in \mathbf{Pattern}} likelihood(pattern, path) ]

Finding the Optimal Path

The optimal path is just the path with the highest rank.