From 92176f5fda50ec4c71602c0c03ed6ed0f07fea12 Mon Sep 17 00:00:00 2001
From: schwabPhysics
Date: Wed, 11 Nov 2020 16:25:36 -0500
Subject: [PATCH] distinction of cross entropy and KL divergence

I'm not entirely sure of the inner workings of the algorithm, but when
reading this documentation and comparing with other sources I found
that the expression for what was named 'cross entropy' did not seem
correct. Instead, there are two separate terms describing two separate
KL divergences (one for the change in entropy in the probability of the
simplex existing, and one for not existing). It is not clear in the
text (even with my suggestions) why one needs two divergences. I make
no claims to the workings of the algorithm, but only suggest changes to
the descriptions of the mathematics in the documentation. I hope it
makes sense, and thanks to everyone for putting this great resource
together!
---
 doc/how_umap_works.rst | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/doc/how_umap_works.rst b/doc/how_umap_works.rst
index 947034c2..d95056ef 100644
--- a/doc/how_umap_works.rst
+++ b/doc/how_umap_works.rst
@@ -479,13 +479,15 @@ comparing share the same 0-simplices, we can imagine that we are
 comparing the two vectors of probabilities indexed by the 1-simplices.
 Given that these are Bernoulli variables (ultimately the simplex either
 exists or it doesn't, and the probability is the parameter of a
-Bernoulli distribution), the right choice here is the cross entropy.
+Bernoulli distribution), the right choice here is the `KL divergence
+<https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`__.
 
 Explicitly, if the set of all possible 1-simplices is :math:`E`, and we
 have weight functions such that :math:`w_h(e)` is the weight of the
 1-simplex :math:`e` in the high dimensional case and :math:`w_l(e)` is
-the weight of :math:`e` in the low dimensional case, then the cross
-entropy will be
+the weight of :math:`e` in the low dimensional case, then, using these
+two distributions of weights, we can find the KL divergence between the
+Bernoulli distributions of the simplex existing or not existing:
 
 .. math::
 
@@ -493,7 +495,7 @@ entropy will be
    \sum_{e\in E} w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e)) \log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)
 
 This might look complicated, but if we go back to thinking in terms of a
-graph we can view minimizing the cross entropy as a kind of force
+graph we can view minimizing the KL divergence as a kind of force
 directed graph layout algorithm.
 
 The first term, :math:`w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right)`,
@@ -522,8 +524,7 @@ Putting all these pieces together we can construct the UMAP algorithm.
 The first phase consists of constructing a fuzzy topological
 representation, essentially as described above. The second phase is
 simply optimizing the low dimensional representation to have as close
-a fuzzy topological representation as possible as measured by cross
-entropy.
+a fuzzy topological representation as possible as measured by KL divergence.
 
 When constructing the initial fuzzy topological representation we can
 take a few shortcuts. In practice, since fuzzy set membership strengths
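
As a side note for reviewers: term for term, the expression in the first
hunk is the KL divergence between two Bernoulli distributions, summed over
the 1-simplices. Below is a minimal sketch in Python that evaluates it,
purely illustrative and not part of the patch or of UMAP's actual
optimizer; `bernoulli_kl_sum`, `w_high`, and `w_low` are made-up names,
and the weights are clipped away from 0 and 1 so the logarithms stay
finite:

import numpy as np

def bernoulli_kl_sum(w_h, w_l, eps=1e-12):
    """Sum over all 1-simplices e of the KL divergence between
    Bernoulli(w_h(e)) and Bernoulli(w_l(e)), i.e. the expression in the
    patched docs: w_h log(w_h/w_l) + (1 - w_h) log((1 - w_h)/(1 - w_l)).
    """
    # Clip away from exactly 0 and 1 so the logarithms stay finite.
    w_h = np.clip(np.asarray(w_h, dtype=float), eps, 1.0 - eps)
    w_l = np.clip(np.asarray(w_l, dtype=float), eps, 1.0 - eps)
    return np.sum(
        w_h * np.log(w_h / w_l)
        + (1.0 - w_h) * np.log((1.0 - w_h) / (1.0 - w_l))
    )

# Hypothetical membership strengths for three edges in high/low dimensions.
w_high = [0.9, 0.5, 0.1]
w_low = [0.8, 0.5, 0.2]
print(bernoulli_kl_sum(w_high, w_low))  # non-negative; 0 only when equal

The clipping is only there to keep the toy example numerically safe; the
formula in the documentation assumes weights strictly between 0 and 1.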