A set of methods and model evaluation metrics for predicting links in an academic citation network using Apache Spark and Scala.
In this experimental study we develop methods and evaluate models for predicting links in an academic citation network, taking two different aspects into consideration:
- Having some insight into the existing network and some of its links, and trying to restore a portion of it that has been deliberately removed.
- Having no information about the existing network and relying only on the information in the scientific papers to predict the structure of the whole network.
For the first aspect we used supervised binary classification, specifically Logistic Regression, which achieved an F1 score close to 86% on the test set. For the second aspect we relied mainly on the Jaccard similarity computed through MinHash LSH over each paper’s abstract, which had been vectorized using TF-IDF.
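To make the similarity measure behind the second aspect concrete, here is a minimal plain-Scala sketch of exact Jaccard similarity on abstract tokens, together with a toy MinHash estimate of it. It is not the project's Spark pipeline: the object `JaccardSketch`, its function names, the tokenization rule, and the seed are all assumptions made for illustration, and the sketch works directly on token sets rather than on TF-IDF vectors.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper object, not part of the project code.
object JaccardSketch {

  // Lowercase the text and split on non-alphanumeric runs to get distinct tokens.
  // This tokenization rule is an assumption made for the example.
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSet

  // Exact Jaccard similarity |A ∩ B| / |A ∪ B|; taken as 0.0 when both sets are empty.
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // MinHash signature: for each of numHashes seeded hash functions,
  // keep the minimum hash value over all tokens in the set.
  def minHashSignature(items: Set[String], numHashes: Int, seed: Int = 42): Vector[Int] = {
    require(items.nonEmpty, "cannot build a signature for an empty set")
    Vector.tabulate(numHashes) { i =>
      items.iterator.map(t => MurmurHash3.stringHash(t, seed + i)).min
    }
  }

  // The fraction of signature positions that agree estimates the Jaccard similarity.
  def estimate(sigA: Vector[Int], sigB: Vector[Int]): Double = {
    require(sigA.size == sigB.size && sigA.nonEmpty, "signatures must have equal, non-zero length")
    sigA.zip(sigB).count { case (x, y) => x == y }.toDouble / sigA.size
  }
}
```

For example, `JaccardSketch.jaccard(JaccardSketch.tokens(a), JaccardSketch.tokens(b))` gives the exact similarity of two abstracts `a` and `b`, while comparing their signatures with `estimate` approximates it without comparing the full token sets. Longer signatures give a tighter estimate at a higher cost.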
For more detailed information check the draft paper.
Our dataset contains 27,770 academic papers, each associated with the following information:
1. unique ID
2. publication year (between 1993 and 2003)
3. title
4. authors
5. name of journal
6. abstract
The dataset is located under src/main/resources.
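The six fields listed above could be modelled in Scala roughly as follows. This is only a sketch: the class name `Paper`, the field names, and the field types (for instance, whether the ID is numeric or how authors are separated) are assumptions, since the file layout is not described here.

```scala
// Hypothetical record type for one paper; names and types are assumptions.
final case class Paper(
  id: String,            // unique ID
  year: Int,             // publication year
  title: String,
  authors: Seq[String],  // assumed to be already split into individual names
  journal: String,       // name of journal
  abstractText: String   // "abstract" is a reserved word in Scala, hence the longer name
) {
  // True when the year falls in the range stated for the dataset (1993 to 2003).
  def hasValidYear: Boolean = year >= 1993 && year <= 2003
}
```

A check such as `hasValidYear` can help flag malformed rows when the files are first parsed.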