Releases: IlyaGusev/tgcontest
Final submission
Fasttext language identification: https://fasttext.cc/docs/en/language-identification.html
Pretrained embedding models: fasttext with dim=50, bucket=200000, epoch=10, minCount=50
ru datasets: first 2 archives, Lenta (https://github.com/yutkin/Lenta.Ru-News-Dataset)
en datasets: first 2 archives, BBC (http://mlg.ucd.ie/datasets/bbc.html), News category (https://www.kaggle.com/rmisra/news-category-dataset), All The News (https://www.kaggle.com/snapcrack/all-the-news)
No preprocessing
News detection: manual and Yandex.Toloka markup; fasttext autotune to a 2M model size with pretrained embeddings
Categories: manual and Yandex.Toloka markup + Lenta, BBC, News category datasets; fasttext autotune to a 4M model size with pretrained embeddings
PageRank was computed for every agency using the links found in article texts.
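The agency PageRank step can be sketched as a standard power iteration over the link graph. The graph, damping factor, and iteration count below are illustrative assumptions, not the repo's actual values:

```python
# Hypothetical sketch: PageRank over a {agency: [linked agencies]} graph.
# Damping factor and iteration count are assumptions for illustration.

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; returns {agency: score}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[source] / len(nodes)
        rank = new_rank
    return rank

# Toy link graph "extracted from article texts" (made up for illustration).
ranks = pagerank({
    "lenta.ru": ["bbc.com", "meduza.io"],
    "meduza.io": ["bbc.com"],
    "bbc.com": [],
})
```

Agencies linked to by many others accumulate a higher score, which later feeds the title-choice and cluster-ranking factors.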
Embeddings for clustering: fasttext average/max/min embeddings concatenated and multiplied by a trained matrix. The matrix training process is described in https://github.com/IlyaGusev/tgcontest/blob/master/scripts/Similarity.ipynb
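The pooling-and-projection step can be sketched as follows. The token vectors and projection matrix here are toy values; in the pipeline the token vectors come from fasttext and the matrix is trained as in scripts/Similarity.ipynb:

```python
# Sketch: per-token vectors are pooled by element-wise average/max/min,
# concatenated, and multiplied by a learned projection matrix.
# All numbers below are toy values, not trained weights.

def pool(token_vectors):
    """Concatenate element-wise average, max and min over token vectors."""
    dim = len(token_vectors[0])
    avg = [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]
    mx = [max(v[i] for v in token_vectors) for i in range(dim)]
    mn = [min(v[i] for v in token_vectors) for i in range(dim)]
    return avg + mx + mn

def project(vector, matrix):
    """Multiply a pooled vector by a (rows x len(vector)) matrix."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

tokens = [[1.0, 2.0], [3.0, 0.0]]   # two token vectors, dim=2
pooled = pool(tokens)               # dim=6: avg + max + min concatenated
matrix = [[1, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 1]]       # toy 2x6 projection
embedding = project(pooled, matrix)
```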
Clustering algorithm: SLINK (https://sites.cs.ucsb.edu/~veronika/MAE/summary_SLINK_Sibson72.pdf)
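SLINK builds the single-linkage dendrogram in O(n^2) time and O(n) memory using Sibson's pointer representation. A minimal sketch, with a toy distance function and an assumed distance threshold for cutting flat clusters:

```python
# Sketch of SLINK (Sibson, 1972): single-linkage hierarchical clustering
# via the pointer representation (pi, lambda). Distances and the cut
# threshold below are toy assumptions.

def slink(points, dist):
    """Return the pointer representation (pi, lam) of the dendrogram."""
    n = len(points)
    pi = [0] * n                 # pi[i]: cluster representative i merges into
    lam = [float("inf")] * n     # lam[i]: distance at which the merge happens
    for k in range(1, n):
        pi[k] = k
        lam[k] = float("inf")
        m = [dist(points[i], points[k]) for i in range(k)]
        for i in range(k):
            if lam[i] >= m[i]:
                m[pi[i]] = min(m[pi[i]], lam[i])
                lam[i] = m[i]
                pi[i] = k
            else:
                m[pi[i]] = min(m[pi[i]], m[i])
        for i in range(k):
            if lam[i] >= lam[pi[i]]:
                pi[i] = k
    return pi, lam

def flat_clusters(pi, lam, threshold):
    """Cut the dendrogram: merge i into pi[i] whenever lam[i] < threshold."""
    parent = list(range(len(pi)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, (p, l) in enumerate(zip(pi, lam)):
        if l < threshold:
            parent[find(i)] = find(p)
    return [find(i) for i in range(len(pi))]

points = [0.0, 0.1, 0.2, 5.0, 5.1]   # toy 1-D "documents"
pi, lam = slink(points, lambda a, b: abs(a - b))
labels = flat_clusters(pi, lam, threshold=1.0)
```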
Clustering trick: all data was split into batches by time. Clustering was done per batch, with overlapping batches merged afterwards through their shared documents, reducing complexity from O(n^2) to O(k * m^2) for k batches of m documents each.
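The batching trick can be sketched with union-find: each overlapping batch is clustered independently (any clustering algorithm can be plugged in), and documents appearing in two batches chain their clusters together. Batch size, overlap, and the toy clustering function are illustrative assumptions:

```python
# Sketch: split time-sorted docs into overlapping batches, cluster each
# batch in O(m^2), and merge clusters that share a doc in the overlap.
# Batch size and overlap are toy assumptions.

def batch_cluster(docs, batch_size, overlap, cluster_fn):
    """docs: list sorted by time. Returns a cluster label per doc index."""
    parent = list(range(len(docs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    step = batch_size - overlap
    start = 0
    while start < len(docs):
        idx = list(range(start, min(start + batch_size, len(docs))))
        labels = cluster_fn([docs[i] for i in idx])  # cluster this batch only
        groups = {}
        for local, label in enumerate(labels):
            groups.setdefault(label, []).append(idx[local])
        # Union members of each local cluster; shared docs in the overlap
        # stitch clusters together across batches.
        for members in groups.values():
            for m in members[1:]:
                parent[find(members[0])] = find(m)
        start += step
    return [find(i) for i in range(len(docs))]

# Toy clustering: group timestamps whose gap to the previous one is < 1.0.
def threshold_cluster(batch):
    labels, current = [], 0
    for i, x in enumerate(batch):
        if i > 0 and x - batch[i - 1] >= 1.0:
            current += 1
        labels.append(current)
    return labels

times = [0.0, 0.2, 0.4, 5.0, 5.2, 9.0]
labels = batch_cluster(times, batch_size=4, overlap=2, cluster_fn=threshold_cluster)
```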
Title choice: PageRank and time factors
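A hypothetical sketch of combining those two factors: within a cluster, prefer the title from the highest-PageRank agency and break ties by earlier publish time (the exact combination used in the contest is not specified here):

```python
# Hypothetical title selection: highest agency PageRank first, then
# earliest timestamp as a tiebreaker. Docs and scores are toy values.

def choose_title(docs, agency_pagerank):
    """docs: list of (title, agency, timestamp); returns the chosen title."""
    return min(docs, key=lambda d: (-agency_pagerank.get(d[1], 0.0), d[2]))[0]

docs = [
    ("Breaking story", "smallblog.example", 100.0),
    ("Detailed report", "bbc.com", 200.0),
]
title = choose_title(docs, {"bbc.com": 0.5, "smallblog.example": 0.01})
```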
Cluster ranking: time, cluster size and weighted PageRank factors
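A hypothetical scoring sketch combining the three factors; the decay constants, weights, and combination formula are illustrative assumptions, not the contest values:

```python
# Hypothetical cluster ranking: recency decay * size factor * PageRank
# weight. All constants and the formula are assumptions for illustration.
import math

def rank_clusters(clusters, agency_pagerank, now):
    """clusters: list of (latest_doc_time, [agencies]). Returns sorted indices."""
    scores = []
    for latest_time, agencies in clusters:
        age_hours = (now - latest_time) / 3600.0
        recency = math.exp(-age_hours / 24.0)        # decay over ~a day
        size = math.log(1 + len(agencies))           # diminishing returns on size
        weight = sum(agency_pagerank.get(a, 0.0) for a in agencies)
        scores.append(recency * size * (1.0 + weight))
    return sorted(range(len(clusters)), key=lambda i: -scores[i])

pagerank = {"bbc.com": 0.5, "lenta.ru": 0.2}
clusters = [
    (1000.0, ["lenta.ru"]),               # older, smaller, lower PageRank
    (90000.0, ["bbc.com", "lenta.ru"]),   # fresh, bigger, higher PageRank
]
order = rank_clusters(clusters, pagerank, now=90000.0)
```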