forked from insightcampus/sesac-nlp
10 실습 - 표현(Representation) - 문서의 표현 (BoW, TDM)
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"10 실습 - 표현(Representation) - 문서의 표현 (BoW, TDM)","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"zEFesPBvXe2C"},"source":["# 문서 표현 (Document Representation)"]},{"cell_type":"markdown","metadata":{"id":"52uiZhBWaR4M"},"source":["# 1 BoW (Bag of Words)"]},{"cell_type":"markdown","metadata":{"id":"xuz1lvCi_e-y"},"source":["<img src=\"https://image.slidesharecdn.com/vector-space-models-170118145044/95/cs571-vector-space-models-3-638.jpg?cb=1485433004\" />\n","\n","https://en.wikipedia.org/wiki/Bag-of-words_model\n","https://www.slideshare.net/jchoi7s/cs571-vector-space-models"]},{"cell_type":"markdown","metadata":{"id":"jUABPDuYAO7Y"},"source":["## 1.1 직접구현"]},{"cell_type":"code","metadata":{"id":"dPZCmyM7aR4O"},"source":["docs = ['오늘 동물원에서 원숭이를 봤어',\n"," '오늘 동물원에서 코끼리를 봤어 봤어',\n"," '동물원에서 원숭이에게 바나나를 줬어 바나나를']"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"zuMcIp6_aR4R"},"source":["### 1) 띄어쓰기 단위로 토큰화"]},{"cell_type":"code","metadata":{"id":"HK8UIQfKaR4S","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636920892559,"user_tz":-540,"elapsed":5,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"539ccf71-240e-4f27-cd29-0101fe7c83d6"},"source":["doc_ls = [d.split() for d in docs]\n","doc_ls"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[['오늘', '동물원에서', '원숭이를', '봤어'],\n"," ['오늘', '동물원에서', '코끼리를', '봤어', '봤어'],\n"," ['동물원에서', '원숭이에게', '바나나를', '줬어', '바나나를']]"]},"metadata":{},"execution_count":4}]},{"cell_type":"markdown","metadata":{"id":"vxOK8R52aR4X"},"source":["### 2) 각 고유 토큰에 인덱스(Index)를 
지정"]},{"cell_type":"code","metadata":{"id":"HjQvx_d1aR4Y","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636921001278,"user_tz":-540,"elapsed":321,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"057acc0f-ac08-4f60-a709-bd585d12381b"},"source":["from collections import defaultdict\n","\n","word2id = defaultdict(lambda:len(word2id))\n","[word2id[t] for d in doc_ls for t in d]\n","word2id"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["defaultdict(<function __main__.<lambda>>,\n"," {'동물원에서': 1,\n"," '바나나를': 6,\n"," '봤어': 3,\n"," '오늘': 0,\n"," '원숭이를': 2,\n"," '원숭이에게': 5,\n"," '줬어': 7,\n"," '코끼리를': 4})"]},"metadata":{},"execution_count":5}]},{"cell_type":"markdown","metadata":{"id":"G7cZKHjeaR4n"},"source":["### 3) BoW 생성"]},{"cell_type":"code","metadata":{"id":"VqV9atAUZ5cS"},"source":["import numpy as np\n","\n","bow_ls = []\n","\n","for i, d in enumerate(doc_ls) :\n"," bow = np.zeros(len(word2id), dtype=int)\n"," for t in d :\n"," bow[word2id[t]] += 1\n"," bow_ls.append(bow.tolist())"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Hrtwk4vrJpkS","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636921203322,"user_tz":-540,"elapsed":300,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"0e612ea5-1287-428c-f826-eddf6058e850"},"source":["bow_ls"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 2, 1, 0, 0, 0], [0, 1, 0, 0, 0, 1, 2, 
1]]"]},"metadata":{},"execution_count":7}]},{"cell_type":"markdown","metadata":{"id":"Ax7YxzN89aNZ"},"source":["\n","\n","---\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"id":"CHnBJ8wuaR4t"},"source":["## 1.2 단어 순서를 고려하지 않은 BoW"]},{"cell_type":"code","metadata":{"id":"4UZDtYS7aR4u"},"source":["docs = ['나는 양념 치킨을 좋아해 하지만 후라이드 치킨을 싫어해',\n"," '나는 후라이드 치킨을 좋아해 하지만 양념 치킨을 싫어해']"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"55w1w8A1a41X"},"source":["### 1) 띄어쓰기 단위로 토큰화"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"0KwyJAlWHQl1","executionInfo":{"status":"ok","timestamp":1636921324279,"user_tz":-540,"elapsed":328,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"751417c1-aea7-40cd-b96b-101484e30efc"},"source":["doc_ls = [d.split() for d in docs]\n","doc_ls"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[['나는', '양념', '치킨을', '좋아해', '하지만', '후라이드', '치킨을', '싫어해'],\n"," ['나는', '후라이드', '치킨을', '좋아해', '하지만', '양념', '치킨을', '싫어해']]"]},"metadata":{},"execution_count":9}]},{"cell_type":"markdown","metadata":{"id":"oNpc8XfgHQl1"},"source":["### 2) 각 고유 토큰에 인덱스(Index)를 지정"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"HJ6_ToLJHQl2","executionInfo":{"status":"ok","timestamp":1636921328698,"user_tz":-540,"elapsed":311,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"92e36ac8-3d16-41d2-dc25-4d1bc8e89bb9"},"source":["from collections import defaultdict\n","\n","word2id = defaultdict(lambda:len(word2id))\n","[word2id[t] for d in doc_ls for t in d]\n","word2id"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["defaultdict(<function 
__main__.<lambda>>,\n"," {'나는': 0,\n"," '싫어해': 6,\n"," '양념': 1,\n"," '좋아해': 3,\n"," '치킨을': 2,\n"," '하지만': 4,\n"," '후라이드': 5})"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","metadata":{"id":"3iwRj7FpHQl2"},"source":["### 3) BoW 생성"]},{"cell_type":"code","metadata":{"id":"iqTvOcE0HQl2"},"source":["import numpy as np\n","\n","bow_ls = []\n","\n","for i, d in enumerate(doc_ls) :\n"," bow = np.zeros(len(word2id), dtype=int)\n"," for t in d :\n"," bow[word2id[t]] += 1\n"," bow_ls.append(bow.tolist())"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8xu8zoNbHQl2","executionInfo":{"status":"ok","timestamp":1636921332952,"user_tz":-540,"elapsed":4,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"32c9649e-4bcb-46e0-b0aa-1c0669e94648"},"source":["bow_ls"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[[1, 1, 2, 1, 1, 1, 1], [1, 1, 2, 1, 1, 1, 1]]"]},"metadata":{},"execution_count":12}]},{"cell_type":"markdown","metadata":{"id":"ogtoam3Ia41Z"},"source":["### 2) 각 고유 토큰에 인덱스(Index)를 지정"]},{"cell_type":"code","metadata":{"id":"pfv9UDZPa41Z"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"M5rOzF2ma41b"},"source":["### 3) BoW 생성"]},{"cell_type":"code","metadata":{"id":"ZIRpm6pLa41c"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"GoO1Ln-wa41d"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"491wb8Avb4ij"},"source":["\n","\n","---\n","\n"]},{"cell_type":"markdown","metadata":{"id":"Hlu7kB_fDyer"},"source":["https://en.wikipedia.org/wiki/Document-term_matrix"]},{"cell_type":"markdown","metadata":{"id":"e5Yfa1_j9Eav"},"source":["## 1.3 sklearn 
활용"]},{"cell_type":"code","metadata":{"id":"LQnAj2a4-WXk"},"source":["docs = ['오늘 동물원에서 원숭이를 봤어',\n"," '오늘 동물원에서 코끼리를 봤어 봤어',\n"," '동물원에서 원숭이에게 바나나를 줬어 바나나를']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"SYBIrkm4-QwU"},"source":["from sklearn.feature_extraction.text import CountVectorizer\n","\n","count_vect = CountVectorizer()\n","BoW = count_vect.fit_transform(docs)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"d3UWnvpANmMQ","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636921453220,"user_tz":-540,"elapsed":299,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"37d3a454-88f0-4340-84d3-87d8697287d7"},"source":["BoW.toarray()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[1, 0, 1, 1, 1, 0, 0, 0],\n"," [1, 0, 2, 1, 0, 0, 0, 1],\n"," [1, 2, 0, 0, 0, 1, 1, 0]])"]},"metadata":{},"execution_count":16}]},{"cell_type":"code","metadata":{"id":"CDYxifsz-c9P","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636921469933,"user_tz":-540,"elapsed":3,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"b87576d0-ecd7-4aa0-9aca-4f4904bac8bc"},"source":["count_vect.vocabulary_"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'동물원에서': 0,\n"," '바나나를': 1,\n"," '봤어': 2,\n"," '오늘': 3,\n"," '원숭이를': 4,\n"," '원숭이에게': 5,\n"," '줬어': 6,\n"," '코끼리를': 7}"]},"metadata":{},"execution_count":17}]},{"cell_type":"markdown","metadata":{"id":"lJTu69Kw9bcQ"},"source":["\n","\n","---\n"]},{"cell_type":"markdown","metadata":{"id":"ubYvSi3q9RhM"},"source":["## 1.4 gensim 
활용"]},{"cell_type":"code","metadata":{"id":"64nyDwx5Ao75"},"source":["docs = ['오늘 동물원에서 원숭이를 봤어',\n"," '오늘 동물원에서 코끼리를 봤어 봤어',\n"," '동물원에서 원숭이에게 바나나를 줬어 바나나를']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"GFXsCmaxAUV9"},"source":["import gensim\n","from gensim import corpora\n","\n","doc_ls = [d.split() for d in docs]\n","id2word = corpora.Dictionary(doc_ls)\n","bow = [id2word.doc2bow(d) for d in doc_ls]"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"LBzlUW6oA1S4","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636921674767,"user_tz":-540,"elapsed":317,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"4f3340ae-0a2d-4d95-9ccf-084116472cd1"},"source":["bow[2]"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[(0, 1), (5, 2), (6, 1), (7, 1)]"]},"metadata":{},"execution_count":30}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":35},"id":"EToZeQ0sIgqf","executionInfo":{"status":"ok","timestamp":1636921682322,"user_tz":-540,"elapsed":474,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"44f4ba08-3069-4981-df6b-cc453cdb2682"},"source":["id2word[5]"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"},"text/plain":["'바나나를'"]},"metadata":{},"execution_count":31}]},{"cell_type":"markdown","metadata":{"id":"hx3EGCegbh8f"},"source":["\n","\n","---\n","\n"]},{"cell_type":"markdown","metadata":{"id":"lYoJ_zN4BqtH"},"source":["# 2 TDM(Term-Document Matrix)"]},{"cell_type":"markdown","metadata":{"id":"mcUysxIbrO3j"},"source":["## 2.1 
직접구현"]},{"cell_type":"code","metadata":{"id":"sdljLf47YyEH"},"source":["docs = ['오늘 동물원에서 원숭이를 봤어',\n"," '오늘 동물원에서 코끼리를 봤어 봤어',\n"," '동물원에서 원숭이에게 바나나를 줬어 바나나를']"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"zsRNojiaYyEK"},"source":["### 1) 띄어쓰기 단위로 토큰화"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"N2upbwPXI-NE","executionInfo":{"status":"ok","timestamp":1636921768345,"user_tz":-540,"elapsed":4,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"5a3dfb3e-3633-44cd-e493-1ae286aa7ce9"},"source":["doc_ls = [d.split() for d in docs]\n","doc_ls"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[['오늘', '동물원에서', '원숭이를', '봤어'],\n"," ['오늘', '동물원에서', '코끼리를', '봤어', '봤어'],\n"," ['동물원에서', '원숭이에게', '바나나를', '줬어', '바나나를']]"]},"metadata":{},"execution_count":33}]},{"cell_type":"markdown","metadata":{"id":"eWG6MVdgI-NF"},"source":["### 2) 각 고유 토큰에 인덱스(Index)를 지정"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"BwLdV99EI-NF","executionInfo":{"status":"ok","timestamp":1636921769178,"user_tz":-540,"elapsed":5,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"81d4a19a-7341-4728-c8c4-c313a54b8699"},"source":["from collections import defaultdict\n","\n","word2id = defaultdict(lambda:len(word2id))\n","[word2id[t] for d in doc_ls for t in d]\n","word2id"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["defaultdict(<function __main__.<lambda>>,\n"," {'동물원에서': 1,\n"," '바나나를': 6,\n"," '봤어': 3,\n"," '오늘': 0,\n"," '원숭이를': 2,\n"," '원숭이에게': 5,\n"," '줬어': 7,\n"," '코끼리를': 
4})"]},"metadata":{},"execution_count":34}]},{"cell_type":"code","metadata":{"id":"SNEl8EmBYyEL"},"source":["import numpy as np\n","\n","TDM = np.zeros((len(word2id), len(doc_ls)), dtype=int)\n","for i, d in enumerate(doc_ls) :\n"," for t in d :\n"," TDM[word2id[t], i] += 1"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DSg1ouavJEW1","executionInfo":{"status":"ok","timestamp":1636921902076,"user_tz":-540,"elapsed":3,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"a8fbafd2-7d84-47d4-919d-1d0c0c11cded"},"source":["TDM"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[1, 1, 0],\n"," [1, 1, 1],\n"," [1, 0, 0],\n"," [1, 2, 0],\n"," [0, 1, 0],\n"," [0, 0, 1],\n"," [0, 0, 2],\n"," [0, 0, 1]])"]},"metadata":{},"execution_count":39}]},{"cell_type":"code","metadata":{"id":"wdXV_KsBY29j"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"M2xbmurHrzPe"},"source":["## 2.2 sklearn 활용"]},{"cell_type":"code","metadata":{"id":"PFM2h2vyrzPf"},"source":["docs = ['오늘 동물원에서 원숭이를 봤어',\n"," '오늘 동물원에서 코끼리를 봤어 봤어',\n"," '동물원에서 원숭이에게 바나나를 줬어 바나나를']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"YUty8uPYrzPh"},"source":["from sklearn.feature_extraction.text import CountVectorizer\n","\n","count_vect = CountVectorizer()\n","DTM = 
count_vect.fit_transform(docs)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"C65XtqpWrzPj","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636921953422,"user_tz":-540,"elapsed":355,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"9f63b206-4e30-43c4-89d8-7cfd481700a5"},"source":["DTM.toarray()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[1, 0, 1, 1, 1, 0, 0, 0],\n"," [1, 0, 2, 1, 0, 0, 0, 1],\n"," [1, 2, 0, 0, 0, 1, 1, 0]])"]},"metadata":{},"execution_count":44}]},{"cell_type":"markdown","metadata":{"id":"jL9M4jETrzPm"},"source":["\n","\n","---\n"]},{"cell_type":"markdown","metadata":{"id":"5Ffnrgshrge8"},"source":["## 2.3 gensim 활용"]},{"cell_type":"code","metadata":{"id":"oTnp6FWorzPn"},"source":["docs = ['오늘 동물원에서 원숭이를 봤어',\n"," '오늘 동물원에서 코끼리를 봤어 봤어',\n"," '동물원에서 원숭이에게 바나나를 줬어 바나나를']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"e2VedSd5rzPq"},"source":["import gensim\n","from gensim import corpora\n","\n","doc_ls = [d.split() for d in docs]\n","id2word = corpora.Dictionary(doc_ls)\n","TDM = [id2word.doc2bow(d) for d in doc_ls]"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1byNgZyw4Q1f","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636922018232,"user_tz":-540,"elapsed":384,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"9cac7cb1-780c-4ff9-d85b-4440bb225990"},"source":["TDM"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[[(0, 1), (1, 1), (2, 1), (3, 1)],\n"," [(0, 1), (1, 2), (2, 1), (4, 1)],\n"," [(0, 1), (5, 2), (6, 1), (7, 
1)]]"]},"metadata":{},"execution_count":47}]},{"cell_type":"code","metadata":{"id":"r8ilGKzkNTsJ"},"source":[" "],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"CegRF6DWrijc"},"source":["---"]}]}
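Note on sections 2.2 and 2.3 of the notebook: the sklearn and gensim cells there produce one row (or one list) per document, which is a document-term layout, while section 2.1 builds a term-by-document array. The following is a minimal pure-Python sketch, not the notebook author's code, of how the two layouts relate. It reuses the notebook's `docs`, `doc_ls`, and `word2id` names; `dtm` and `tdm` are hypothetical names introduced here for illustration.

```python
from collections import Counter

docs = ['오늘 동물원에서 원숭이를 봤어',
        '오늘 동물원에서 코끼리를 봤어 봤어',
        '동물원에서 원숭이에게 바나나를 줬어 바나나를']

# Whitespace tokenization, as in the notebook.
doc_ls = [d.split() for d in docs]

# Assign ids in order of first appearance
# (the same ordering the notebook's defaultdict cell produces).
word2id = {}
for d in doc_ls:
    for t in d:
        word2id.setdefault(t, len(word2id))

# Document-term matrix: one row per document, one column per term.
dtm = [[0] * len(word2id) for _ in doc_ls]
for i, d in enumerate(doc_ls):
    for t, n in Counter(d).items():
        dtm[i][word2id[t]] = n

# Term-document matrix: the transpose, one row per term.
tdm = [list(col) for col in zip(*dtm)]

print(dtm)
print(tdm)
```

With these inputs, `dtm` matches the `bow_ls` output of section 1.1 and `tdm` matches the `TDM` array of section 2.1, so a term-document matrix can be obtained from any document-term result by transposing it.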