A Korean Version of Semantic Text Segmentation with Embedding
Given a text document d and a desired number of segments k, this repo shows how to segment the document into k semantically homogeneous segments.
The general approach is as follows:
- Convert the words in d into embedding vectors using the GloVe model.
- For every word sequence s in d, the average meaning (centroid) of s is represented by the average of the embeddings of all words in s.
- The error of the centroid for a sequence s is calculated as the average cosine distance between the centroid and the embeddings of all words in s.
- Segmentation is performed with a greedy heuristic that iteratively chooses the best split point p.
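The steps above can be sketched as follows. This is a minimal illustration, not the repo's implementation: it assumes the words have already been mapped to a NumPy array of embedding vectors (one row per word), and it takes "best split" to mean the split that most reduces the summed per-segment centroid error, which is one plausible reading of the greedy heuristic described here.

```python
import numpy as np

def centroid_error(vecs):
    # Average cosine distance between the segment centroid and its word vectors.
    c = vecs.mean(axis=0)
    c = c / np.linalg.norm(c)
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return float(np.mean(1.0 - v @ c))

def greedy_segment(vecs, k):
    # Start with one segment spanning the whole document; repeatedly add the
    # split point p whose introduction most reduces the total centroid error,
    # until k segments exist.
    bounds = [0, len(vecs)]
    while len(bounds) - 1 < k:
        best = None
        for i in range(len(bounds) - 1):
            s, e = bounds[i], bounds[i + 1]
            for p in range(s + 1, e):
                gain = (centroid_error(vecs[s:e])
                        - centroid_error(vecs[s:p])
                        - centroid_error(vecs[p:e]))
                if best is None or gain > best[0]:
                    best = (gain, p)
        bounds = sorted(bounds + [best[1]])
    # Return segments as (start, end) index pairs into the word sequence.
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
```

With two clearly separated clusters of word vectors, the first greedy split lands on the cluster boundary, since both resulting segments then have near-zero centroid error.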
The class text_segmentation_class in text_segmetnation.py contains functions to convert the document's words into GloVe embeddings and to choose the split points. The notebook semantc_text_segmentation_example.ipynb demonstrates how to use the class.
- Uses the glove model from the word-embeddings.zip provided at https://github.com/ratsgo/embedding/releases
- Converts the GloVe format to word2vec format using https://github.com/jroakes/glove-to-word2vec/blob/master/convert.py
- all_doc_tokens: tokenized at the morpheme level (using Okt.morphs())
- token_index: indices into all_doc_tokens
- doc_tokens: only the nouns from all_doc_tokens
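A small sketch of how these three structures relate. In the real pipeline the morphemes and part-of-speech tags come from KoNLPy's Okt tagger (e.g. `Okt().pos(text)`); a hand-tagged list stands in here so the example runs without KoNLPy and its Java dependency.

```python
# Hypothetical Okt-style (morpheme, tag) output for a short sentence;
# in the repo this would come from konlpy.tag.Okt.
tagged = [("자연어", "Noun"), ("처리", "Noun"), ("는", "Josa"),
          ("재미있다", "Adjective")]

all_doc_tokens = [m for m, _ in tagged]             # every morpheme
token_index = list(range(len(all_doc_tokens)))      # indices into all_doc_tokens
doc_tokens = [m for m, t in tagged if t == "Noun"]  # nouns only
```

Keeping all morphemes alongside a noun-only view lets the segmenter score meaning on content words while still mapping split points back to positions in the full token stream.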