Persona-Hub creates diverse synthetic data through thematic roles drawn from one distribution (e.g., hospital -> nurse -> patient). Therefore, I use clustering and dimensionality reduction (HDBSCAN and UMAP) to gather the list of responses belonging to each role, in order to approximate that role's distribution. Text embeddings are obtained with two approaches: a BERT-based model or TF-IDF. A minimal sketch of the pipeline follows.
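The sketch below assumes the `sentence-transformers`, `umap-learn`, and `hdbscan` packages; the model name and hyperparameters are illustrative, not necessarily the ones actually used.

```python
from sentence_transformers import SentenceTransformer
import hdbscan
import umap

# Hypothetical input: one synthetic response per element
texts = ["response text 1", "response text 2", "..."]

# BERT-style sentence embeddings (model name is illustrative)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# UMAP: reduce dimensionality before density-based clustering
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# HDBSCAN: cluster the reduced embeddings; label -1 marks noise points
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)
```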
First approach: TF-IDF embeddings
For each cluster, I build a word frequency distribution and convert the counts to probabilities. The Kullback-Leibler divergence between two discrete probability distributions is

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

where:

- $P$ and $Q$ are the probability distributions,
- $P(i)$ and $Q(i)$ are the probability mass functions for the discrete case.
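A minimal sketch of this step, assuming scikit-learn's `CountVectorizer` for the per-cluster word counts; the `eps` smoothing is my assumption, added so that words missing from one cluster do not produce a division by zero:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def discrete_kl(cluster_p, cluster_q, eps=1e-9):
    """D_KL(P || Q) between the word distributions of two text clusters."""
    # Shared vocabulary so P(i) and Q(i) index the same word i
    vectorizer = CountVectorizer().fit(cluster_p + cluster_q)
    counts_p = np.asarray(vectorizer.transform(cluster_p).sum(axis=0)).ravel()
    counts_q = np.asarray(vectorizer.transform(cluster_q).sum(axis=0)).ravel()

    # Counts -> probabilities; eps smoothing (assumption) avoids log(0)
    p = (counts_p + eps) / (counts_p + eps).sum()
    q = (counts_q + eps) / (counts_q + eps).sum()

    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))
    return float(np.sum(p * np.log(p / q)))
```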
Second approach: BERT embeddings
Since the BERT embeddings are averages over word embeddings, the resulting cluster embeddings are approximately normally distributed. The KL divergence between two multivariate Gaussian distributions is

$$D_{\mathrm{KL}}(\mathcal{N}_1 \parallel \mathcal{N}_2) = \frac{1}{2}\left(\operatorname{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1) - k + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right)$$

where:

- $\mu_1$ and $\mu_2$ are the means of the distributions,
- $\Sigma_1$ and $\Sigma_2$ are the covariance matrices,
- $k$ is the dimensionality of the distributions.
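A sketch of this closed-form computation, fitting a Gaussian to each cluster's embedding matrix with NumPy; the small ridge `eps * I` added to the covariances is my assumption, kept so the matrices stay invertible for small clusters:

```python
import numpy as np

def gaussian_kl(X1, X2, eps=1e-6):
    """D_KL(N1 || N2) for Gaussians fitted to embedding matrices (rows = samples)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    k = X1.shape[1]
    # Ridge term (assumption) keeps covariances invertible for small clusters
    sigma1 = np.cov(X1, rowvar=False) + eps * np.eye(k)
    sigma2 = np.cov(X2, rowvar=False) + eps * np.eye(k)

    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    # slogdet avoids overflow in det() for high-dimensional embeddings
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)

    # 1/2 [ tr(S2^-1 S1) + diff^T S2^-1 diff - k + ln(det S2 / det S1) ]
    return 0.5 * (np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff
                  - k
                  + logdet2 - logdet1)
```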