forked from insightcampus/sesac-nlp
06 실습 - 표현(Representation) - 단어의 표현 (원핫인코딩, 유사도계산, 단어임베딩 개요)
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"06 실습 - 표현(Representation) - 단어의 표현 (원핫인코딩, 유사도계산, 단어임베딩 개요)","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"I_TVhSXBJk2g"},"source":["# Word Representation\n","\n","\n","Machines cannot process text directly, so words are converted into numbers.\n","\n"]},{"cell_type":"markdown","metadata":{"id":"tvwJYK2WJp3e"},"source":["# 1 One-Hot Encoding"]},{"cell_type":"markdown","metadata":{"id":"56ssCRkrSHVJ"},"source":["## 1-1 Implementing it by hand"]},{"cell_type":"markdown","metadata":{"id":"64_tK53CLEAw"},"source":["### One-hot encoding \"원숭이 (monkey), 바나나 (banana), 사과 (apple)\""]},{"cell_type":"code","metadata":{"id":"DtPpS_EkFmpg"},"source":["# List of the words to encode\n","# (monkey, banana, apple, dog, cat)\n","word_ls = ['원숭이','바나나','사과','개', '고양이']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"4TPATy0HFqCi","executionInfo":{"status":"ok","timestamp":1636892730021,"user_tz":-540,"elapsed":2,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"hCH7j4onKvzt"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"jSozqBtyLKnP"},"source":["### What if the word \"코끼리\" (elephant) is added?"]},{"cell_type":"code","metadata":{"id":"SVITUcpOKxLy"},"source":["word_ls = ['원숭이','바나나','사과','코끼리']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"-GNKR6CDLiy6"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0zuG3Gj3dLkK"},"source":["## 1-2 Using sklearn"]},{"cell_type":"markdown","metadata":{"id":"apagu5-2VDkF"},"source":["\n","Method | Description\n","--|--\n","fit(X[, y]) | Fit OneHotEncoder to X.\n","fit_transform(X[, y]) | Fit OneHotEncoder to X, then transform X.\n","inverse_transform(X) | Convert the data back to the original representation.\n","transform(X) | Transform X using one-hot encoding."]},{"cell_type":"code","metadata":{"id":"b7I0OUJkTVKK"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"YpcEUwjmGjq-"},"source":["\n","\n","---\n","\n"]},{"cell_type":"markdown","metadata":{"id":"BG0cSbZRMUK5"},"source":["# 2 Similarity Measures"]},{"cell_type":"markdown","metadata":{"id":"oaHlpji5epCE"},"source":["## 2-1 Euclidean distance\n","The straight-line distance between two vectors. It is easiest to understand by recalling the Pythagorean theorem."]},{"cell_type":"markdown","metadata":{"id":"kLqyanO4ggNv"},"source":["<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Euclidean_distance_2d.svg/220px-Euclidean_distance_2d.svg.png\" width=\"200\"/>\n","\n","<img src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/795b967db2917cdde7c2da2d1ee327eb673276c0\" width=\"350\"/>\n","\n","https://en.wikipedia.org/wiki/Euclidean_distance"]},{"cell_type":"code","metadata":{"id":"jN24SjnVMaFg"},"source":["word_embedding_dic = {\n"," '사과' : [1.0, 0.5],  # apple\n"," '바나나' : [0.9, 1.2],  # banana\n"," '원숭이' : [0.5, 1.5]  # monkey\n","}"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"MuK3qKuCfQZN"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"RQT4BjHECnmb"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"tgDBpcUlewCn"},"source":["## 2-2 Jaccard index"]},{"cell_type":"markdown","metadata":{"id":"3GrLhpFGksn3"},"source":["<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Intersection_of_sets_A_and_B.svg/200px-Intersection_of_sets_A_and_B.svg.png\" />\n","\n","<img src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/eaef5aa86949f49e7dc6b9c8c3dd8b233332c9e7\" />\n","\n","https://en.wikipedia.org/wiki/Jaccard_index"]},{"cell_type":"code","metadata":{"id":"uBhzYo5Slphr"},"source":["s1 = '대부분 원숭이는 바나나를 좋아합니다.'  # Most monkeys like bananas.\n","s2 = '코주부 원숭이는 바나나를 싫어합니다.'  # Proboscis monkeys dislike bananas.\n","\n"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Eq3OT71-Mho1"},"source":["## 2-3 Cosine similarity\n","\n","* One way to measure the similarity between two vectors\n","* Measures the cosine of the angle between the two vectors\n","* 0° gives 1, 90° gives 0, 180° gives -1, so the closer the value is to 1, the more similar the vectors\n","\n","\n"]},{"cell_type":"markdown","metadata":{"id":"j0_kplkjool0"},"source":["<img src=\"https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png\" width=\"300\"/>\n","\n","<img src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/1d94e5903f7936d3c131e040ef2c51b473dd071d\" width='350'/>\n","\n","https://en.wikipedia.org/wiki/Cosine_similarity"]},{"cell_type":"code","metadata":{"id":"B0KdUJ2sMaOZ"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"EIHKQr1JMnt3"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"3AR90MO-Mn25"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1MSy0DYTMoDy"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"fUQAqKS4it74"},"source":["\n","\n","---\n","\n"]}]}
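
The notebook leaves the code cells for one-hot encoding and the three similarity measures empty as exercises. The following is a minimal sketch of one way they might be filled in, using only the standard library and the notebook's own `word_ls`, `word_embedding_dic`, `s1`, and `s2`. The function names (`one_hot_encode`, `euclidean_distance`, `jaccard_index`, `cosine_similarity`) are hypothetical and not part of the notebook, and splitting sentences on whitespace for the Jaccard index is an assumption; a morphological tokenizer would likely give different sets for Korean text.

```python
import math

# Data copied from the notebook's code cells.
word_ls = ['원숭이', '바나나', '사과', '개', '고양이']
word_embedding_dic = {
    '사과': [1.0, 0.5],
    '바나나': [0.9, 1.2],
    '원숭이': [0.5, 1.5],
}
s1 = '대부분 원숭이는 바나나를 좋아합니다.'
s2 = '코주부 원숭이는 바나나를 싫어합니다.'


def one_hot_encode(words):
    """Map each word to a vector with a single 1 at that word's index."""
    word_to_index = {word: i for i, word in enumerate(words)}
    size = len(word_to_index)
    return {
        word: [1 if i == idx else 0 for i in range(size)]
        for word, idx in word_to_index.items()
    }


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_index(text_a, text_b):
    """|A ∩ B| / |A ∪ B| over whitespace-separated tokens (assumption)."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


one_hot = one_hot_encode(word_ls)
print(one_hot['바나나'])
print(euclidean_distance(word_embedding_dic['바나나'], word_embedding_dic['원숭이']))
print(jaccard_index(s1, s2))
print(cosine_similarity(word_embedding_dic['사과'], word_embedding_dic['바나나']))
```

Adding a word such as '코끼리' to the list lengthens every one-hot vector by one dimension, which is the growth problem the "What if the word is added?" exercise points at. For the sklearn section, `OneHotEncoder` exposes the `fit`, `transform`, `fit_transform`, and `inverse_transform` methods listed in the table.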