GitHub - crazyleg/telegram_data_clustering_2019: Telegram contest on unsupervised news clustering

My entry to the Telegram Data Clustering contest. https://entry1144-dcround1.usercontent.dev/categories/en/

What's useful here?

You can sneak a peek at a GoLang and C wrapper for fastText here. It is an extended version of something I found on the internet, with an extra function wrapping the vector extraction routine.

Basically, this is my first attempt to write something useful in GoLang. Here are some of the things I implemented and the general approach.

FastText integration. I used the FastText C++ library for model construction and inference, so I copied a C wrapper for FastText and extended it to my needs so that I could call it from GoLang. It works smoothly.
For language detection I used an existing fastText model.
For news/non-news classification I looked through the list of domain names and cherry-picked some of them into 2 categories: definitely trustworthy and definitely spam/fake. Then I assigned the appropriate class to all articles coming from those domains and used them exclusively to train the classifier.
For category detection I labeled a few (5k) English articles using the Google Text Classification API and manually created conversion rules from Google categories into Telegram categories. Using those I trained the English category model. As for Russian, I cheated twice over: I translated 5k Russian texts into English with the Google Translation API, assigned them categories using the model created for English texts, and trained the Russian model on those pseudo-labels whose confidence was above a sufficiently high threshold.
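
The labeling steps above can be sketched roughly as follows. Everything specific here is an assumption for illustration: the domain names, the Google-to-Telegram mapping table, the `Prediction` type and the 0.8 threshold are hypothetical, since the README does not list the real ones. The `__label__` prefix is the usual fastText supervised training format.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Hypothetical hand-picked domain lists (the real lists are not in the README).
var trustedDomains = map[string]bool{"example-news.com": true}
var spamDomains = map[string]bool{"example-spam.net": true}

// domainLabel returns a fastText label for an article URL, or "" when the
// domain is in neither list (such articles are left out of training).
func domainLabel(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	switch {
	case trustedDomains[host]:
		return "__label__news"
	case spamDomains[host]:
		return "__label__not_news"
	}
	return ""
}

// Hypothetical conversion rules from top-level Google categories to
// Telegram contest categories.
var googleToTelegram = map[string]string{
	"Arts & Entertainment":    "entertainment",
	"Business & Industrial":   "economy",
	"Finance":                 "economy",
	"Computers & Electronics": "technology",
	"Law & Government":        "society",
	"Science":                 "science",
	"Sports":                  "sports",
}

// telegramCategory maps a Google category path such as "/Sports/Team Sports"
// to a Telegram category, falling back to "other".
func telegramCategory(googlePath string) string {
	parts := strings.Split(strings.TrimPrefix(googlePath, "/"), "/")
	if cat, ok := googleToTelegram[parts[0]]; ok {
		return cat
	}
	return "other"
}

// Prediction is a hypothetical container for one model output on a text.
type Prediction struct {
	Text       string
	Category   string
	Confidence float64
}

// pseudoLabelLines keeps predictions with confidence above minConf and
// formats them as fastText training lines.
func pseudoLabelLines(preds []Prediction, minConf float64) []string {
	var lines []string
	for _, p := range preds {
		if p.Confidence > minConf {
			lines = append(lines, fmt.Sprintf("__label__%s %s", p.Category, p.Text))
		}
	}
	return lines
}

func main() {
	fmt.Println(domainLabel("https://www.example-news.com/a/1"))
	fmt.Println(telegramCategory("/Sports/Team Sports/Soccer"))
	preds := []Prediction{
		{Text: "team wins the cup", Category: "sports", Confidence: 0.93},
		{Text: "something vague", Category: "other", Confidence: 0.41},
	}
	for _, line := range pseudoLabelLines(preds, 0.8) {
		fmt.Println(line)
	}
}
```

Dropping low-confidence predictions trades training set size for label quality, which matters when the labels come from a model rather than from a human.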

I didn't have time to do the last parts of the contest, Top and Threads, and my few initial experiments on threads were of low quality. Naive sentence embeddings and cosine distances produced low-quality results, and I spent too much time implementing LSH, as I was aiming for maximum performance to practice a bit. Clearly, a different approach was required here.
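
For reference, a minimal sketch of the kind of thing described above: cosine similarity plus random-hyperplane LSH, a common LSH scheme for cosine distance. This is an assumption about the general technique, not the code from this repo. The type `Hasher`, its methods and the toy 3-dimensional vectors are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if either vector has zero norm.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Hasher holds random hyperplanes; each one contributes one signature bit.
type Hasher struct {
	planes [][]float64
}

// NewHasher draws `bits` random Gaussian hyperplanes in `dim` dimensions.
// bits must be at most 64 because signatures are packed into a uint64.
func NewHasher(dim, bits int, seed int64) *Hasher {
	rng := rand.New(rand.NewSource(seed))
	planes := make([][]float64, bits)
	for i := range planes {
		planes[i] = make([]float64, dim)
		for j := range planes[i] {
			planes[i][j] = rng.NormFloat64()
		}
	}
	return &Hasher{planes: planes}
}

// Signature sets bit i when v lies on the non-negative side of hyperplane i.
// Vectors separated by a small angle tend to agree on most bits.
func (h *Hasher) Signature(v []float64) uint64 {
	var sig uint64
	for i, p := range h.planes {
		var dot float64
		for j := range v {
			dot += p[j] * v[j]
		}
		if dot >= 0 {
			sig |= 1 << uint(i)
		}
	}
	return sig
}

func main() {
	h := NewHasher(3, 16, 42)
	docs := map[string][]float64{
		"a": {0.9, 0.1, 0.0},
		"b": {0.8, 0.2, 0.1},
		"c": {-0.7, 0.5, 0.3},
	}
	// Group documents by signature; only documents sharing a bucket would
	// then be compared with the exact cosine similarity.
	buckets := map[uint64][]string{}
	for name, v := range docs {
		sig := h.Signature(v)
		buckets[sig] = append(buckets[sig], name)
	}
	fmt.Println(buckets)
	fmt.Printf("cosine(a, b) = %.3f\n", cosine(docs["a"], docs["b"]))
}
```

The point of the bucketing is to avoid comparing every pair of articles. It does nothing for embedding quality, so weak sentence embeddings stay weak however fast the lookup is.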

Folders and files (4 commits):
3dparty/fastText-0.9.1
cmd/tgnews
internal
models
.gitattributes
README.md
TODO
main.go
test.json
