This is a solution to the Data Clustering Contest task. The utility's data flow and the algorithms it uses are the following:
- language detection: Google's CLD3 is used to detect languages;
- normalization: the de facto standard Unicode library (ICU) is used to convert documents to lowercase and remove non-alphabetic characters; tokenizer from;
- documents vectorisation: all words of a document are embedded into a vector space with word2vec, and each document itself is embedded into a vector of the same size as the word2vec vectors;
- news detection: based on a very simple DNN with a binary loss layer, a fully connected layer and the document's vector as input;
- category detection: based on a very simple DNN with a multi-class loss layer, a fully connected layer and the document's vector as input;
- clustering: DBSCAN-based algorithm with a dynamic similarity threshold and improved neighbor detection logic;
- clusters ordering: based on a very simple DNN with a mean squared loss layer, a fully connected layer and the document's vector as input.
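The vectorisation and clustering steps above can be illustrated with a minimal sketch. It is not the utility's code: it assumes a document vector is the plain average of its word vectors, and it uses textbook DBSCAN with a fixed cosine-similarity threshold rather than the dynamic threshold and improved neighbor detection mentioned above. The names `document_vector`, `cosine`, `dbscan`, `min_similarity` and `min_points`, as well as the toy word vectors, are hypothetical.

```python
import math

def document_vector(tokens, word_vectors, size):
    """Average the vectors of known tokens (assumed pooling, not confirmed by the source)."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary words are skipped
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    return [t / known for t in total] if known else total

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dbscan(vectors, min_similarity, min_points):
    """Textbook DBSCAN over cosine similarity; returns one label per vector (-1 = noise)."""
    n = len(vectors)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        return [j for j in range(n) if cosine(vectors[i], vectors[j]) >= min_similarity]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1  # provisional noise; may still become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_points:  # j is a core point, keep expanding
                queue.extend(more)
    return labels

# Toy 2-dimensional "word2vec" table (made-up values for illustration only).
word_vectors = {"goal": [1.0, 0.0], "match": [0.9, 0.1],
                "bank": [0.0, 1.0], "loan": [0.1, 0.9]}
docs = [["goal", "match"], ["match", "goal", "unknown"],
        ["bank", "loan"], ["loan", "bank"], ["goal", "bank"]]
vectors = [document_vector(d, word_vectors, 2) for d in docs]
labels = dbscan(vectors, min_similarity=0.9, min_points=2)
print(labels)
```

In this toy run the two sports documents and the two finance documents form separate clusters, while the mixed document is left as noise because no other document is similar enough to it.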
Base packages installation:
sudo apt update
sudo apt upgrade
sudo apt install -y libtool g++ git cmake pkg-config libprotobuf-dev libprotoc-dev protobuf-compiler libblas-dev liblapack-dev libicu-dev libssl-dev
sudo /usr/sbin/ldconfig
Google's gumbo-parser library (HTML5 parser):
git clone
cd ./gumbo-parser
./autogen.sh
./configure
make -j 8
sudo make install
cd ../
Google's CLD3 library (language detection):
git clone
cd ./cld3
sed -i 's/add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0)//g' ./CMakeLists.txt
mkdir build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo cp ./libcld3.a /usr/local/lib
sudo mkdir -p /usr/local/include/google/cld_3
sudo cp -r ./cld_3/protos /usr/local/include/google/cld_3
sudo cp -r ../src/script_span /usr/local/include/google/cld_3
sudo cp ../src/*.h /usr/local/include/google/cld_3
cd ../../
DLib library (DNN, clustering, etc):
git clone
cd ./dlib
git checkout tags/v19.19
mkdir ./build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
Libevent library (HTTP/HTTPS server):
git clone
cd libevent
git checkout tags/release-2.1.11-stable
mkdir build-release
cd build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
Rapidjson library (JSON parsing/writing):
git clone
cd rapidjson
mkdir build-release
cd build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
Sqlite3 (document attributes storage):
tar xfz ./sqlite-autoconf-3310100.tar.gz
cd ./sqlite-autoconf-3310100
./configure --enable-shared=no
make -j 8
sudo make install
cd ../
Word2vec++ library (words and documents embedding into vector space):
git clone
cd ./word2vec
mkdir ./build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
sudo make install
cd ../../
TGNEWS utility:
sudo /usr/sbin/ldconfig
cd ./submission/src/tgnews/
mkdir ./build-release
cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ../
make -j 8
cd ../bin
- download model files archive (1.2GB)
- extract files to folder
- go to folder and run ./tgnews for more information
#dataclustering Bossy Gnu's source code is available here: