From 92176f5fda50ec4c71602c0c03ed6ed0f07fea12 Mon Sep 17 00:00:00 2001
From: schwabPhysics
Date: Wed, 11 Nov 2020 16:25:36 -0500
Subject: [PATCH] distinction of cross entropy and KL divergence

I'm not entirely sure of the inner workings of the algorithm, but when
reading this documentation and comparing with other sources I found
that the expression for what was named 'cross entropy' did not seem
correct. Instead, there are two separate terms describing two separate
KL divergences (one for the change in entropy in the probability of the
simplex existing, and one for not existing). It is not clear in the
text (even with my suggestions) why one needs two divergences. I make
no claims to the workings of the algorithm, but only suggest changes to
the descriptions of the mathematics in the documentation. I hope it
makes sense, and thanks to everyone for putting this great resource
together!
---
 doc/how_umap_works.rst | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/doc/how_umap_works.rst b/doc/how_umap_works.rst
index 947034c2..d95056ef 100644
--- a/doc/how_umap_works.rst
+++ b/doc/how_umap_works.rst
@@ -479,13 +479,15 @@ comparing share the same 0-simplices, we can imagine that we are
 comparing the two vectors of probabilities indexed by the 1-simplices.
 Given that these are Bernoulli variables (ultimately the simplex either
 exists or it doesn't, and the probability is the parameter of a
-Bernoulli distribution), the right choice here is the cross entropy.
+Bernoulli distribution), the right choice here is the `KL divergence
+<https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`__.
 
 Explicitly, if the set of all possible 1-simplices is :math:`E`, and we
 have weight functions such that :math:`w_h(e)` is the weight of the
 1-simplex :math:`e` in the high dimensional case and :math:`w_l(e)` is
-the weight of :math:`e` in the low dimensional case, then the cross
-entropy will be
+the weight of :math:`e` in the low dimensional case, then, using these
+two distributions of weights, we can find the KL divergence between the
+Bernoulli distributions of the simplex existing or not existing:
 
 .. math::
 
@@ -493,7 +495,7 @@ entropy will be
    \sum_{e\in E} w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e)) \log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)
 
 This might look complicated, but if we go back to thinking in terms of a
-graph we can view minimizing the cross entropy as a kind of force
+graph we can view minimizing the KL divergence as a kind of force
 directed graph layout algorithm.
 
 The first term, :math:`w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right)`,
@@ -522,8 +524,7 @@ Putting all these pieces together we can construct the UMAP algorithm.
 The first phase consists of constructing a fuzzy topological
 representation, essentially as described above. The second phase is
 simply optimizing the low dimensional representation to have as close
-a fuzzy topological representation as possible as measured by cross
-entropy.
+a fuzzy topological representation as possible as measured by KL divergence.
 
 When constructing the initial fuzzy topological representation we can
 take a few shortcuts. In practice, since fuzzy set membership strengths
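
As a side note for reviewers: term for term, the expression in the first
hunk is the KL divergence between two Bernoulli distributions, summed over
the 1-simplices. Below is a minimal sketch in Python that evaluates it,
purely illustrative and not part of the patch or of UMAP's actual
optimizer; `bernoulli_kl_sum`, `w_high`, and `w_low` are made-up names,
and the weights are clipped away from 0 and 1 so the logarithms stay
finite:

import numpy as np

def bernoulli_kl_sum(w_h, w_l, eps=1e-12):
    """Sum over all 1-simplices e of the KL divergence between
    Bernoulli(w_h(e)) and Bernoulli(w_l(e)), i.e. the expression in the
    patched docs: w_h log(w_h/w_l) + (1 - w_h) log((1 - w_h)/(1 - w_l)).
    """
    # Clip away from exactly 0 and 1 so the logarithms stay finite.
    w_h = np.clip(np.asarray(w_h, dtype=float), eps, 1.0 - eps)
    w_l = np.clip(np.asarray(w_l, dtype=float), eps, 1.0 - eps)
    return np.sum(
        w_h * np.log(w_h / w_l)
        + (1.0 - w_h) * np.log((1.0 - w_h) / (1.0 - w_l))
    )

# Hypothetical membership strengths for three edges in high/low dimensions.
w_high = [0.9, 0.5, 0.1]
w_low = [0.8, 0.5, 0.2]
print(bernoulli_kl_sum(w_high, w_low))  # non-negative; 0 only when equal

The clipping is only there to keep the toy example numerically safe; the
formula in the documentation assumes weights strictly between 0 and 1.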