
# DIS_project_2022

Decomposition using Spark

## I. Introduction

## II. Homogeneity Function

The homogeneity of a dataset $D$ is a metric $Hom(D)$ that indicates how homogeneous (similar) the elements of the dataset are. Under a decomposition (clustering), the homogeneity of the dataset becomes $\sum_i Hom(D_i)$, where the $D_i$ are the clusters.

We use a generalized cosine similarity as the homogeneity function in this project, since cosine similarity is also our distance metric. The cosine similarity $Cos()$ measures how close two vectors (or sequences of values) are: two orthogonal vectors have a cosine similarity of 0, two parallel vectors have a similarity of 1, and two opposite vectors have a similarity of -1. Since we only work in the positive space, the cosine similarity here ranges from 0 to 1.

$$ Cos(x, y) = \frac{x}{|x|} \cdot \frac{y}{|y|} $$
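As a minimal NumPy illustration of this formula (the helper name `cosine_similarity` is ours, for illustration only):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Dot product of the two unit-normalized vectors."""
    return float(np.dot(x / np.linalg.norm(x), y / np.linalg.norm(y)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))    # orthogonal -> 0.0
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))    # parallel   -> 1.0
print(cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0])))  # opposite   -> -1.0
```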

For the generalized cosine similarity, we first stack the vectors into a matrix $X$ whose columns are the normalized vectors $x_i$. We then compute the singular value decomposition (SVD) of $X$,

$$ X = USV^T $$

where the matrix $S$ is diagonal and the singular values $\sigma_i$ are non-negative (a singular value is zero if and only if there is multicollinearity in the dataset) and arranged in decreasing order. Once we have the SVD, the similarity can be calculated as

$$ SVDsim(X) = \frac{\sum_{k=1}^{K} \sigma_k^2}{||X||^{2}_F} $$

where $||X||^2_F$ is the squared Frobenius norm of $X$ and $K$ is the number of singular values retained.

The idea is that the SVD gives a lower-rank approximation of $X$: just as the cosine similarity can be seen as the loss of using one vector to represent two vectors, the SVD similarity measures how well $K$ ($K < n$) vectors can represent the whole dataset. The default value of $K$ is 1.

Note that when the dataset contains only two vectors (and $K = 1$), the above score equals $\frac{1 + Cos(x, y)}{2}$. To make it consistent with the plain cosine similarity, we rescale it as

$$ SVDsim = 2 \times \frac{\sum_{k=1}^{K} \sigma_k^2}{||X||^{2}_F} - 1 $$

which reduces exactly to $Cos(x, y)$ in the two-vector case.
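The following is a self-contained NumPy sketch of the two scores above (the names `svd_sim` and `svd_sim_scaled` are ours, not identifiers from the repository):

```python
import numpy as np

def svd_sim(X: np.ndarray, K: int = 1) -> float:
    """Share of the squared Frobenius norm captured by the top-K singular
    values of the column-normalized matrix (the SVDsim score above)."""
    Xn = X / np.linalg.norm(X, axis=0)           # normalize each column
    sigma = np.linalg.svd(Xn, compute_uv=False)  # singular values, decreasing order
    return float(np.sum(sigma[:K] ** 2) / np.sum(sigma ** 2))

def svd_sim_scaled(X: np.ndarray, K: int = 1) -> float:
    """Rescaled score 2 * SVDsim - 1; equals Cos(x, y) for two vectors and K = 1."""
    return 2.0 * svd_sim(X, K) - 1.0

# Two-vector sanity check: the scaled score recovers the plain cosine similarity.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])
cos_xy = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(svd_sim_scaled(np.column_stack([x, y])), cos_xy)  # both ≈ 0.64
```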

## III. Our Approach

See the getDecompositionbyColumn function.

The decomposition function takes two parameters: a dataframe df and the number of clusters K. The algorithm decomposes the dataframe into multiple subgroups. In each step, it selects a query (column) such that:

1. the cross homogeneity score is maximized,
2. there are still fewer constructed queries (clusters) than K,
3. the updated homogeneity score is larger than the old one, and
4. the query has not been used in an earlier step.

If no such query (column) exists, the function returns the dataframe together with the queries. A sketch of one possible reading of this loop is given below.
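The actual implementation is the getDecompositionbyColumn function operating on a Spark dataframe; the following is only a plain-Python/pandas sketch of one way such a greedy loop could look. The function name `greedy_column_decomposition`, the pandas dataframe, and the tie-breaking order are our assumptions, and condition 3 (accepting a query only if the score improves) is not reproduced here.

```python
import numpy as np
import pandas as pd

def svd_sim(X: np.ndarray, K: int = 1) -> float:
    """Generalized cosine similarity of the columns of X (Section II)."""
    Xn = X / np.linalg.norm(X, axis=0)
    sigma = np.linalg.svd(Xn, compute_uv=False)
    return float(np.sum(sigma[:K] ** 2) / np.sum(sigma ** 2))

def greedy_column_decomposition(df: pd.DataFrame, K: int) -> list[list[str]]:
    """Greedily group the columns of df into at most K clusters of similar columns."""
    clusters: list[list[str]] = []
    unused = list(df.columns)
    while unused:
        best_score, best_col, best_idx = -1.0, None, None
        if len(clusters) < K:
            # Condition 2: while fewer than K clusters exist, a new singleton
            # cluster can still be opened (a single column is fully homogeneous).
            best_score, best_col, best_idx = 1.0, unused[0], None
        for col in unused:                                       # condition 4: unused columns only
            for i, members in enumerate(clusters):
                score = svd_sim(df[members + [col]].to_numpy())  # condition 1: cross homogeneity
                if score > best_score:
                    best_score, best_col, best_idx = score, col, i
        if best_col is None:
            break                                                # no admissible query left
        if best_idx is None:
            clusters.append([best_col])
        else:
            clusters[best_idx].append(best_col)
        unused.remove(best_col)
    return clusters

# Toy usage: assigns each column to one of at most 2 clusters.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.1, 4.0, 6.2], "c": [3.0, -1.0, 0.5]})
print(greedy_column_decomposition(df, K=2))
```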
