04 실습 - 전처리 (Preprocessing)
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"04 실습 - 전처리 (Preprocessing)","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"rLGhbEiOoAR7"},"source":["# Text Preprocessing\n","\n","* The task of standardizing text in advance so that it suits its intended use in natural language processing\n","* Preprocessing is performed to improve analysis efficiency by removing redundancy while preserving the information in the text\n","\n"]},{"cell_type":"markdown","metadata":{"id":"E585k45HDx5E"},"source":["### 1) Tokenization (Tokenizing)\n","* Splitting text into units for natural language processing is called tokenization\n","* Tokenization is divided into \"word tokenization\", which splits text into words, and \"sentence tokenization\", which splits text into sentences\n","\n","(In the exercises that follow, \"tokenization\" refers to word tokenization.)"]},{"cell_type":"markdown","metadata":{"id":"senwNSwgDzQc"},"source":["### 2) Part-of-Speech Tagging (PoS Tagging)\n","* Adds part-of-speech information to each token\n","* Used to remove parts of speech that are unnecessary for the analysis (e.g., particles, conjunctions) or to filter for the ones that are needed"]},{"cell_type":"markdown","metadata":{"id":"R15ri5czDyzc"},"source":["### 3) Named Entity Recognition (NER)\n","* Attaches an entity-type tag (organization, person, location, date, etc.) to each token\n","* Used to determine what the text is about\n","* For example, named entity recognition is what distinguishes apple the fruit from Apple the company"]},{"cell_type":"markdown","metadata":{"id":"Dfq99EkzD1Tk"},"source":["### 4) Base Form Restoration (Stemming & Lemmatization)\n","* Restoring each token to its base form standardizes the tokens and prevents unnecessary duplication in the data (fewer distinct words, so computation is more efficient)\n","* Stemming: extracts the stem using rules, ignoring part of speech\n","* Lemmatization: extracts the lemma while preserving part-of-speech information"]},{"cell_type":"markdown","metadata":{"id":"R5HQOjRvDxmd"},"source":["### 5) Stopword Removal (Stopword)\n","* The task of removing elements that are unnecessary for natural language processing\n","* Consists of removing unnecessary parts of speech and removing unnecessary words\n","* Removing unnecessary tokens makes computation more efficient"]},{"cell_type":"markdown","metadata":{"id":"QaIYJczuaS0n"},"source":["\n","\n","---\n","\n"]},{"cell_type":"markdown","metadata":{"id":"KysKAL3VlgQN"},"source":["# 1 English Preprocessing Practice\n","\n","\n","Uses the NLTK library (https://www.nltk.org/)"]},{"cell_type":"markdown","metadata":{"id":"yv0ASXb8qa6H"},"source":["## 1) 영문 
토큰화\n","https://www.nltk.org/api/nltk.tokenize.html"]},{"cell_type":"code","metadata":{"id":"ZPZeW4nqTpZD","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497813244,"user_tz":-540,"elapsed":3330,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"e7d72c8c-74b8-4f87-9f4e-dca5016056f2"},"source":["!pip install nltk"],"execution_count":65,"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.2.5)\n","Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from nltk) (1.15.0)\n"]}]},{"cell_type":"code","metadata":{"id":"ywTmZDer4iH-","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497842120,"user_tz":-540,"elapsed":271,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"a71e539f-0f22-4edb-e4a9-4ccb744a9665"},"source":["# word_tokenize() : 마침표와 구두점(온점(.), 컴마(,), 물음표(?), 세미콜론(;), 느낌표(!) 
등과 같은 기호)으로 구분하여 토큰화\n","import nltk\n","nltk.download('punkt')\n","from nltk.tokenize import word_tokenize\n","\n","text = 'Barack Obama likes fried chicken very much'\n","word_tokens = word_tokenize(text)\n","print(word_tokens)"],"execution_count":66,"outputs":[{"output_type":"stream","name":"stdout","text":["[nltk_data] Downloading package punkt to /root/nltk_data...\n","[nltk_data] Package punkt is already up-to-date!\n","['Barack', 'Obama', 'likes', 'fried', 'chicken', 'very', 'much']\n"]}]},{"cell_type":"code","metadata":{"id":"NkMwui2CBtTQ","executionInfo":{"status":"ok","timestamp":1637497888229,"user_tz":-540,"elapsed":303,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["WordPunctTokenizer?"],"execution_count":67,"outputs":[]},{"cell_type":"code","metadata":{"id":"rygb4BNXFd13","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497913936,"user_tz":-540,"elapsed":250,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"f8668454-9267-4c7c-fb35-6d1a7496180a"},"source":["# WordPunctTokenizer() : 알파벳이 아닌문자를 구분하여 토큰화\n","import nltk\n","from nltk.tokenize import WordPunctTokenizer\n","\n","text = 'Barack Obama likes fried chicken very much'\n","wordpuncttoken = WordPunctTokenizer().tokenize(text)\n","print(wordpuncttoken)"],"execution_count":68,"outputs":[{"output_type":"stream","name":"stdout","text":["['Barack', 'Obama', 'likes', 'fried', 'chicken', 'very', 
'much']\n"]}]},{"cell_type":"code","metadata":{"id":"UzZ-moT5EzwL","executionInfo":{"status":"ok","timestamp":1637497922472,"user_tz":-540,"elapsed":371,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["TreebankWordTokenizer?"],"execution_count":69,"outputs":[]},{"cell_type":"code","metadata":{"id":"VrvBRJqJlitx","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497929806,"user_tz":-540,"elapsed":313,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"7403ad8b-5b71-4916-862e-a7abb7b8b38a"},"source":["# TreebankWordTokenizer() : 정규표현식에 기반한 토큰화\n","import nltk\n","from nltk.tokenize import TreebankWordTokenizer\n","\n","text = 'Barack Obama likes fried chicken very much'\n","treebankwordtoken = TreebankWordTokenizer().tokenize(text)\n","print(treebankwordtoken)"],"execution_count":70,"outputs":[{"output_type":"stream","name":"stdout","text":["['Barack', 'Obama', 'likes', 'fried', 'chicken', 'very', 'much']\n"]}]},{"cell_type":"markdown","metadata":{"id":"8-Z-0Nnysqnq"},"source":["## 2) 영문 품사 부착 (PoS Tagging)\n","분리한 토큰마다 품사를 부착한다\n","\n","https://www.nltk.org/api/nltk.tag.html\n","\n","태크목록 : https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/"]},{"cell_type":"code","metadata":{"id":"mHWVrEmTlosg","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497969699,"user_tz":-540,"elapsed":279,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"ec3bd94f-9473-40f4-eba5-e891500dd3ee"},"source":["from nltk import 
pos_tag\n","nltk.download('averaged_perceptron_tagger')"],"execution_count":71,"outputs":[{"output_type":"stream","name":"stdout","text":["[nltk_data] Downloading package averaged_perceptron_tagger to\n","[nltk_data] /root/nltk_data...\n","[nltk_data] Package averaged_perceptron_tagger is already up-to-\n","[nltk_data] date!\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":71}]},{"cell_type":"code","metadata":{"id":"jwtt2LxqlrVS","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497997852,"user_tz":-540,"elapsed":243,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"0ddf6f41-0b6a-4811-efef-9759fbd8372d"},"source":["taggedToken = pos_tag(word_tokens)\n","print(taggedToken)"],"execution_count":72,"outputs":[{"output_type":"stream","name":"stdout","text":["[('Barack', 'NNP'), ('Obama', 'NNP'), ('likes', 'VBZ'), ('fried', 'VBN'), ('chicken', 'JJ'), ('very', 'RB'), ('much', 'JJ')]\n"]}]},{"cell_type":"markdown","metadata":{"id":"lDo-5-khs5Oz"},"source":["## 3) 개체명 인식 (NER, Named Entity Recognition)\n","\n","http://www.nltk.org/api/nltk.chunk.html"]},{"cell_type":"code","metadata":{"id":"Clj4X6Gilsi9","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498056852,"user_tz":-540,"elapsed":285,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"94504b0b-cf03-4f78-86aa-3b3c682f395b"},"source":["nltk.download('words')\n","nltk.download('maxent_ne_chunker')"],"execution_count":73,"outputs":[{"output_type":"stream","name":"stdout","text":["[nltk_data] Downloading package words to /root/nltk_data...\n","[nltk_data] Package words is already up-to-date!\n","[nltk_data] Downloading package maxent_ne_chunker 
to\n","[nltk_data] /root/nltk_data...\n","[nltk_data] Package maxent_ne_chunker is already up-to-date!\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":73}]},{"cell_type":"code","metadata":{"id":"VdkMJHO7mBgi","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637497605432,"user_tz":-540,"elapsed":22,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"b812be8c-b229-42f1-dc54-8abc50fe7e92"},"source":["from nltk import ne_chunk\n","neToken = ne_chunk(taggedToken)\n","print(neToken)"],"execution_count":43,"outputs":[{"output_type":"stream","name":"stdout","text":["(S\n"," (PERSON Barack/NNP)\n"," (ORGANIZATION Obama/NNP)\n"," likes/VBZ\n"," fried/VBN\n"," chicken/JJ\n"," very/RB\n"," much/JJ)\n"]}]},{"cell_type":"markdown","metadata":{"id":"aHjV0h0ZtM-t"},"source":["## 4) 원형 복원\n","각 토큰의 원형을 복원하여 표준화 한다. 
"]},{"cell_type":"markdown","metadata":{"id":"r2eCnbChtXjo"},"source":["### 4-1) 어간추출 (Stemming)\n","\n","* 규칙에 기반 하여 토큰을 표준화\n","* ning제거, ful 제거 등\n","\n","https://www.nltk.org/api/nltk.stem.html\n","\n","규칙상세 : https://tartarus.org/martin/PorterStemmer/def.txt"]},{"cell_type":"code","metadata":{"id":"n-AvZXHLmCy2","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498104280,"user_tz":-540,"elapsed":269,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"805fa064-5c5e-44a7-eb3c-3d37d1d7238b"},"source":["from nltk.stem import PorterStemmer\n","ps = PorterStemmer()\n","\n","print(\"running -> \" + ps.stem(\"running\"))\n","print(\"beautiful -> \" + ps.stem(\"beautiful\"))\n","print(\"believes -> \" + ps.stem(\"believes\"))\n","print(\"using -> \" + ps.stem(\"using\"))\n","print(\"conversation -> \" + ps.stem(\"conversation\"))\n","print(\"organization -> \" + ps.stem(\"organization\"))\n","print(\"studies -> \" + ps.stem(\"studies\"))"],"execution_count":74,"outputs":[{"output_type":"stream","name":"stdout","text":["running -> run\n","beautiful -> beauti\n","believes -> believ\n","using -> use\n","conversation -> convers\n","organization -> organ\n","studies -> studi\n"]}]},{"cell_type":"markdown","metadata":{"id":"4haNWIcCtZza"},"source":["### 4-2)표제어 추출 (Lemmatization)\n","\n","* 품사정보를 보존하여 토큰을 
표준화\n","\n","http://www.nltk.org/api/nltk.stem.html?highlight=lemmatizer"]},{"cell_type":"code","metadata":{"id":"MdxBuzdymR7w","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498146950,"user_tz":-540,"elapsed":666,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"1777d052-8979-4d92-f5c8-9d99710eb913"},"source":["nltk.download('wordnet')"],"execution_count":75,"outputs":[{"output_type":"stream","name":"stdout","text":["[nltk_data] Downloading package wordnet to /root/nltk_data...\n","[nltk_data] Package wordnet is already up-to-date!\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":75}]},{"cell_type":"code","metadata":{"id":"2mQSzsCZmMBd","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498164095,"user_tz":-540,"elapsed":287,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"2d928638-d8d8-4363-90b4-e034dee6f703"},"source":["from nltk.stem import WordNetLemmatizer\n","wl = WordNetLemmatizer()\n","\n","print(\"running -> \" + wl.lemmatize(\"running\"))\n","print(\"beautiful -> \" + wl.lemmatize(\"beautiful\"))\n","print(\"believes -> \" + wl.lemmatize(\"believes\"))\n","print(\"using -> \" + wl.lemmatize(\"using\"))\n","print(\"conversation -> \" + wl.lemmatize(\"conversation\"))\n","print(\"organization -> \" + wl.lemmatize(\"organization\"))\n","print(\"studies -> \" + wl.lemmatize(\"studies\"))"],"execution_count":76,"outputs":[{"output_type":"stream","name":"stdout","text":["running -> running\n","beautiful -> beautiful\n","believes -> belief\n","using -> using\n","conversation -> conversation\n","organization -> organization\n","studies -> 
study\n"]}]},{"cell_type":"markdown","metadata":{"id":"nmY_SvDMb0fz"},"source":["## 5) 불용어 처리 (Stopword)"]},{"cell_type":"code","metadata":{"id":"lOUE-BBKcn4S","executionInfo":{"status":"ok","timestamp":1637498256444,"user_tz":-540,"elapsed":355,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["stopPos = ['IN', 'CC', 'UH', 'TO', 'MD', 'DT', 'VBZ','VBP']"],"execution_count":78,"outputs":[]},{"cell_type":"code","metadata":{"id":"CyDJ4JiscnrY","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498257732,"user_tz":-540,"elapsed":6,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"a5b39f43-8c67-48a9-f1d3-85c3dbb04444"},"source":["# 최빈어 조회. 최빈어를 조회하여 불용어 제거 대상을 선정\n","from collections import Counter\n","Counter(taggedToken).most_common()"],"execution_count":79,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[(('Barack', 'NNP'), 1),\n"," (('Obama', 'NNP'), 1),\n"," (('likes', 'VBZ'), 1),\n"," (('fried', 'VBN'), 1),\n"," (('chicken', 'JJ'), 1),\n"," (('very', 'RB'), 1),\n"," (('much', 'JJ'), 1)]"]},"metadata":{},"execution_count":79}]},{"cell_type":"code","metadata":{"id":"zNhxqDVkcnX9","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498259362,"user_tz":-540,"elapsed":6,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"19bc2690-af21-4f3c-c67a-a0117339ebe4"},"source":["stopWord = [',','be','able','very']\n","\n","word = []\n","for tag in taggedToken:\n"," if tag[1] not in stopPos:\n"," if tag[0] not in stopWord:\n"," word.append(tag[0])\n"," 
\n","print(word)"],"execution_count":80,"outputs":[{"output_type":"stream","name":"stdout","text":["['Barack', 'Obama', 'fried', 'chicken', 'much']\n"]}]},{"cell_type":"markdown","metadata":{"id":"QV0orUsOb6wD"},"source":["## 6) Putting English Text Preprocessing Together"]},{"cell_type":"code","metadata":{"id":"Pbz6tLP_mNrn","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498339918,"user_tz":-540,"elapsed":342,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"346e8cd3-d43c-43ce-c1c1-0b73a8b495ca"},"source":["import nltk\n","nltk.download('averaged_perceptron_tagger')\n","nltk.download('words')\n","nltk.download('maxent_ne_chunker')\n","nltk.download('wordnet')\n","\n","\n","from nltk.tokenize import TreebankWordTokenizer\n","sumtoken = TreebankWordTokenizer().tokenize(\"Obama loves fried chicken of KFC\")\n","print(sumtoken)\n","\n","from nltk import pos_tag\n","sumTaggedToken = pos_tag(sumtoken)\n","print(sumTaggedToken)\n","\n","from nltk import ne_chunk\n","sumNeToken = ne_chunk(sumTaggedToken)\n","print(sumNeToken)\n","\n","from nltk.stem import PorterStemmer\n","ps = PorterStemmer()\n","print(\"loves -> \" + ps.stem(\"loves\"))\n","print(\"fried -> \" + ps.stem(\"fried\"))\n","\n","from nltk.stem import WordNetLemmatizer\n","wl = WordNetLemmatizer()\n","print(\"loves -> \" + wl.lemmatize(\"loves\"))\n","print(\"fried -> \" + wl.lemmatize(\"fried\"))\n","\n","# Stopword removal: drop tokens by PoS tag, then by word\n","sumStopPos = ['IN']\n","sumStopWord = ['fried']\n","\n","word = []\n","for tag in sumTaggedToken:\n","    if tag[1] not in sumStopPos:\n","        if tag[0] not in sumStopWord:\n","            word.append(tag[0])\n"," \n","print(word)"],"execution_count":81,"outputs":[{"output_type":"stream","name":"stdout","text":["[nltk_data] Downloading package averaged_perceptron_tagger to\n","[nltk_data] /root/nltk_data...\n","[nltk_data] Package averaged_perceptron_tagger is already 
up-to-\n","[nltk_data] date!\n","[nltk_data] Downloading package words to /root/nltk_data...\n","[nltk_data] Package words is already up-to-date!\n","[nltk_data] Downloading package maxent_ne_chunker to\n","[nltk_data] /root/nltk_data...\n","[nltk_data] Package maxent_ne_chunker is already up-to-date!\n","[nltk_data] Downloading package wordnet to /root/nltk_data...\n","[nltk_data] Package wordnet is already up-to-date!\n","['Obama', 'loves', 'fried', 'chicken', 'of', 'KFC']\n","[('Barack', 'NNP'), ('Obama', 'NNP'), ('likes', 'VBZ'), ('fried', 'VBN'), ('chicken', 'JJ'), ('very', 'RB'), ('much', 'JJ')]\n","(S\n"," (PERSON Barack/NNP)\n"," (ORGANIZATION Obama/NNP)\n"," likes/VBZ\n"," fried/VBN\n"," chicken/JJ\n"," very/RB\n"," much/JJ)\n","loves -> love\n","fried -> fri\n","loves -> love\n","fried -> fried\n","['Obama', 'loves', 'chicken', 'KFC']\n"]}]},{"cell_type":"markdown","metadata":{"id":"BMErzPcbuYEa"},"source":["\n","\n","---\n","\n"]},{"cell_type":"markdown","metadata":{"id":"C0Dhqm4zkHXl"},"source":["# 2 한글 전처리 실습\n","영문은 공백으로 토큰화가 가능하지만, 한글의 경우 품사를 고려하여 토큰화 해야한다."]},{"cell_type":"markdown","metadata":{"id":"w09FHRgIphw5"},"source":["## 1) 한글 토큰화 및 형태소 분석"]},{"cell_type":"code","metadata":{"id":"Xj3gdRSzhC8n","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498385232,"user_tz":-540,"elapsed":3444,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"70473af6-cff9-4836-b6be-b8827a41806d"},"source":["#konlpy 설치\n","!pip install konlpy"],"execution_count":82,"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: konlpy in /usr/local/lib/python3.7/dist-packages (0.5.2)\n","Requirement already satisfied: numpy>=1.6 in /usr/local/lib/python3.7/dist-packages (from konlpy) (1.19.5)\n","Requirement already satisfied: colorama in /usr/local/lib/python3.7/dist-packages 
(from konlpy) (0.4.4)\n","Requirement already satisfied: beautifulsoup4==4.6.0 in /usr/local/lib/python3.7/dist-packages (from konlpy) (4.6.0)\n","Requirement already satisfied: tweepy>=3.7.0 in /usr/local/lib/python3.7/dist-packages (from konlpy) (3.10.0)\n","Requirement already satisfied: JPype1>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from konlpy) (1.3.0)\n","Requirement already satisfied: lxml>=4.1.0 in /usr/local/lib/python3.7/dist-packages (from konlpy) (4.2.6)\n","Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from JPype1>=0.7.0->konlpy) (3.10.0.2)\n","Requirement already satisfied: requests[socks]>=2.11.1 in /usr/local/lib/python3.7/dist-packages (from tweepy>=3.7.0->konlpy) (2.23.0)\n","Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from tweepy>=3.7.0->konlpy) (1.3.0)\n","Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from tweepy>=3.7.0->konlpy) (1.15.0)\n","Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.7.0->tweepy>=3.7.0->konlpy) (3.1.1)\n","Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests[socks]>=2.11.1->tweepy>=3.7.0->konlpy) (1.24.3)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests[socks]>=2.11.1->tweepy>=3.7.0->konlpy) (2021.10.8)\n","Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests[socks]>=2.11.1->tweepy>=3.7.0->konlpy) (2.10)\n","Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests[socks]>=2.11.1->tweepy>=3.7.0->konlpy) (3.0.4)\n","Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.7/dist-packages (from requests[socks]>=2.11.1->tweepy>=3.7.0->konlpy) 
(1.7.1)\n"]}]},{"cell_type":"markdown","metadata":{"id":"5IZWN4xX4HXW"},"source":["한글 자연어처리기 비교\n","\n","https://blog.naver.com/PostView.nhn?blogId=wideeyed&logNo=221337575742"]},{"cell_type":"code","metadata":{"id":"__e0d_9Svzor","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498423335,"user_tz":-540,"elapsed":5172,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"63801f52-5338-4b0c-97de-cb092fb2ec19"},"source":["# 코모란(Komoran) 토큰화\n","from konlpy.tag import Komoran\n","komoran= Komoran()\n","kor_text = \"인간이 컴퓨터와 대화하고 있다는 것을 깨닫지 못하고 인간과 대화를 계속할 수 있다면 컴퓨터는 지능적인 것으로 간주될 수 있습니다.\"\n","komoran_tokens = komoran.morphs(kor_text)\n","print(komoran_tokens)"],"execution_count":83,"outputs":[{"output_type":"stream","name":"stdout","text":["['인간', '이', '컴퓨터', '와', '대화', '하', '고', '있', '다는', '것', '을', '깨닫', '지', '못하', '고', '인간', '과', '대화', '를', '계속', '하', 'ㄹ', '수', '있', '다면', '컴퓨터', '는', '지능', '적', '이', 'ㄴ', '것', '으로', '간주', '되', 'ㄹ', '수', '있', '습니다', '.']\n"]}]},{"cell_type":"code","metadata":{"id":"0ZD4PsSCeztM","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498425891,"user_tz":-540,"elapsed":1776,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"c87f1bfc-c1b2-4aa5-c39e-6e3dcdec5538"},"source":["# 한나눔(Hannanum) 토큰화\n","from konlpy.tag import Hannanum\n","hannanum= Hannanum()\n","kor_text = \"인간이 컴퓨터와 대화하고 있다는 것을 깨닫지 못하고 인간과 대화를 계속할 수 있다면 컴퓨터는 지능적인 것으로 간주될 수 있습니다.\"\n","hannanum_tokens = hannanum.morphs(kor_text)\n","print(hannanum_tokens)"],"execution_count":84,"outputs":[{"output_type":"stream","name":"stdout","text":["['인간', '이', '컴퓨터', '와', '대화', '하고', '있', '다는', '것', '을', '깨닫', '지', '못하', '고', '인간', '과', '대화', '를', '계속', '하', 'ㄹ', '수', 
'있', '다면', '컴퓨터', '는', '지능적', '이', 'ㄴ', '것', '으로', '간주', '되', 'ㄹ', '수', '있', '습니다', '.']\n"]}]},{"cell_type":"code","metadata":{"id":"ORRFr8tHe1VX","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498425891,"user_tz":-540,"elapsed":10,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"7b54cdb1-071a-4576-9791-ae94ca9548ab"},"source":["# Okt 토큰화\n","from konlpy.tag import Okt\n","okt= Okt()\n","kor_text = \"인간이 컴퓨터와 대화하고 있다는 것을 깨닫지 못하고 인간과 대화를 계속할 수 있다면 컴퓨터는 지능적인 것으로 간주될 수 있습니다.\"\n","okt_tokens = okt.morphs(kor_text)\n","print(okt_tokens)"],"execution_count":85,"outputs":[{"output_type":"stream","name":"stdout","text":["['인간', '이', '컴퓨터', '와', '대화', '하고', '있다는', '것', '을', '깨닫지', '못', '하고', '인간', '과', '대화', '를', '계속', '할', '수', '있다면', '컴퓨터', '는', '지능', '적', '인', '것', '으로', '간주', '될', '수', '있습니다', '.']\n"]}]},{"cell_type":"code","metadata":{"id":"COrUs_nHe26J","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498428373,"user_tz":-540,"elapsed":2488,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"60d762d4-234f-40a2-8873-2738658193bd"},"source":["# Kkma 토큰화\n","from konlpy.tag import Kkma\n","kkma= Kkma()\n","kor_text = \"인간이 컴퓨터와 대화하고 있다는 것을 깨닫지 못하고 인간과 대화를 계속할 수 있다면 컴퓨터는 지능적인 것으로 간주될 수 있습니다.\"\n","kkma_tokens = kkma.morphs(kor_text)\n","print(kkma_tokens)"],"execution_count":86,"outputs":[{"output_type":"stream","name":"stdout","text":["['인간', '이', '컴퓨터', '와', '대화', '하', '고', '있', '다는', '것', '을', '깨닫', '지', '못하', '고', '인간', '과', '대화', '를', '계속', '하', 'ㄹ', '수', '있', '다면', '컴퓨터', '는', '지능', '적', '이', 'ㄴ', '것', '으로', '간주', '되', 'ㄹ', '수', '있', '습니다', '.']\n"]}]},{"cell_type":"markdown","metadata":{"id":"2M7nyptjunTG"},"source":["## 2) 한글 품사 
부착 (PoS Tagging)\n","\n","PoS Tag 목록\n","\n","https://docs.google.com/spreadsheets/u/1/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0"]},{"cell_type":"code","metadata":{"id":"2t6txrctj8nC","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498506787,"user_tz":-540,"elapsed":262,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"2bb6e5b4-93f4-44f3-fe04-9c70006f9335"},"source":["# 코모란(Komoran) 품사 태깅\n","komoranTag = []\n","for token in komoran_tokens:\n"," komoranTag += komoran.pos(token)\n","print(komoranTag)"],"execution_count":87,"outputs":[{"output_type":"stream","name":"stdout","text":["[('인간', 'NNG'), ('이', 'MM'), ('컴퓨터', 'NNG'), ('오', 'VV'), ('아', 'EC'), ('대화', 'NNG'), ('하', 'NNG'), ('고', 'MM'), ('있', 'VV'), ('달', 'VV'), ('는', 'ETM'), ('것', 'NNB'), ('을', 'NNG'), ('깨닫', 'VV'), ('지', 'NNB'), ('못', 'MAG'), ('하', 'MAG'), ('고', 'MM'), ('인간', 'NNG'), ('과', 'NNG'), ('대화', 'NNG'), ('를', 'JKO'), ('계속', 'MAG'), ('하', 'NNG'), ('ㄹ', 'NA'), ('수', 'NNB'), ('있', 'VV'), ('다면', 'NNG'), ('컴퓨터', 'NNG'), ('늘', 'VV'), ('ㄴ', 'ETM'), ('지능', 'NNP'), ('적', 'NNB'), ('이', 'MM'), ('ㄴ', 'JX'), ('것', 'NNB'), ('으로', 'JKB'), ('간주', 'NNG'), ('되', 'NNB'), ('ㄹ', 'NA'), ('수', 'NNB'), ('있', 'VV'), ('습니다', 'EC'), ('.', 'SF')]\n"]}]},{"cell_type":"code","metadata":{"id":"msdBCzI6iA2w","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498508812,"user_tz":-540,"elapsed":313,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"27016e8a-d535-4595-cd12-d6427a24ce62"},"source":["# 한나눔(Hannanum) 품사 태깅\n","hannanumTag = []\n","for token in hannanum_tokens:\n"," hannanumTag += 
hannanum.pos(token)\n","print(hannanumTag)"],"execution_count":88,"outputs":[{"output_type":"stream","name":"stdout","text":["[('인간', 'N'), ('이', 'M'), ('컴퓨터', 'N'), ('와', 'I'), ('대화', 'N'), ('하', 'P'), ('고', 'E'), ('있', 'N'), ('다', 'M'), ('는', 'J'), ('것', 'N'), ('을', 'N'), ('깨닫', 'N'), ('지', 'N'), ('못하', 'P'), ('어', 'E'), ('고', 'M'), ('인간', 'N'), ('과', 'N'), ('대화', 'N'), ('를', 'N'), ('계속', 'M'), ('하', 'I'), ('ㄹ', 'N'), ('수', 'N'), ('있', 'N'), ('다면', 'N'), ('컴퓨터', 'N'), ('늘', 'P'), ('ㄴ', 'E'), ('지능적', 'N'), ('이', 'M'), ('ㄴ', 'N'), ('것', 'N'), ('으', 'N'), ('로', 'J'), ('간주', 'N'), ('되', 'N'), ('ㄹ', 'N'), ('수', 'N'), ('있', 'N'), ('슬', 'P'), ('ㅂ니다', 'E'), ('.', 'S')]\n"]}]},{"cell_type":"code","metadata":{"id":"dpe14zC3iCFi","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498509497,"user_tz":-540,"elapsed":413,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"8deedec9-aa6a-4881-cd0d-d46abd1a8dc7"},"source":["# Okt 품사 태깅\n","oktTag = []\n","for token in okt_tokens:\n"," oktTag += okt.pos(token)\n","print(oktTag)"],"execution_count":89,"outputs":[{"output_type":"stream","name":"stdout","text":["[('인간', 'Noun'), ('이', 'Noun'), ('컴퓨터', 'Noun'), ('와', 'Verb'), ('대화', 'Noun'), ('하고', 'Verb'), ('있다는', 'Adjective'), ('것', 'Noun'), ('을', 'Josa'), ('깨닫지', 'Verb'), ('못', 'Noun'), ('하고', 'Verb'), ('인간', 'Noun'), ('과', 'Noun'), ('대화', 'Noun'), ('를', 'Noun'), ('계속', 'Noun'), ('할', 'Verb'), ('수', 'Noun'), ('있다면', 'Adjective'), ('컴퓨터', 'Noun'), ('는', 'Verb'), ('지능', 'Noun'), ('적', 'Noun'), ('인', 'Noun'), ('것', 'Noun'), ('으로', 'Josa'), ('간주', 'Noun'), ('될', 'Verb'), ('수', 'Noun'), ('있습니다', 'Adjective'), ('.', 
'Punctuation')]\n"]}]},{"cell_type":"code","metadata":{"id":"xNQBKdYaiDd0","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498510029,"user_tz":-540,"elapsed":4,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"11d1c039-9b9a-48d3-9321-8e5fdbe6b9c0"},"source":["# Kkma 품사 태깅\n","kkmaTag = []\n","for token in kkma_tokens:\n"," kkmaTag += kkma.pos(token)\n","print(kkmaTag)"],"execution_count":90,"outputs":[{"output_type":"stream","name":"stdout","text":["[('인간', 'NNG'), ('이', 'NNG'), ('컴퓨터', 'NNG'), ('오', 'VA'), ('아', 'ECS'), ('대화', 'NNG'), ('하', 'NNG'), ('고', 'NNG'), ('있', 'VA'), ('달', 'VV'), ('는', 'ETD'), ('것', 'NNB'), ('을', 'NNG'), ('깨닫', 'VV'), ('지', 'NNG'), ('못하', 'VX'), ('고', 'NNG'), ('인간', 'NNG'), ('과', 'NNG'), ('대화', 'NNG'), ('를', 'UN'), ('계속', 'MAG'), ('하', 'NNG'), ('ㄹ', 'NNG'), ('수', 'NNG'), ('있', 'VA'), ('다면', 'NNG'), ('컴퓨터', 'NNG'), ('늘', 'VA'), ('ㄴ', 'ETD'), ('지능', 'NNG'), ('적', 'NNG'), ('이', 'NNG'), ('ㄴ', 'NNG'), ('것', 'NNB'), ('으', 'UN'), ('로', 'JKM'), ('간주', 'NNG'), ('되', 'VA'), ('ㄹ', 'NNG'), ('수', 'NNG'), ('있', 'VA'), ('슬', 'VV'), ('ㅂ니다', 'EFN'), ('.', 'SF')]\n"]}]},{"cell_type":"markdown","metadata":{"id":"VZY4s8tbuuXP"},"source":["## 3) 불용어(Stopword) 처리\n","분석에 불필요한 품사를 제거하고, 불필요한 단어(불용어)를 제거한다"]},{"cell_type":"code","metadata":{"id":"Nvjk1yIYkCfj","executionInfo":{"status":"ok","timestamp":1637498608107,"user_tz":-540,"elapsed":282,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["#불용어 처리\n","stopPos = 
['Suffix','Punctuation','Josa','Foreign','Alpha','Number']"],"execution_count":91,"outputs":[]},{"cell_type":"code","metadata":{"id":"573iqrTFkcJ3","executionInfo":{"status":"ok","timestamp":1637498608372,"user_tz":-540,"elapsed":2,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["# 최빈어 조회. 최빈어를 조회하여 불용어 제거 대상을 선정\n","from collections import Counter\n","#Counter(oktTag).most_common()"],"execution_count":92,"outputs":[]},{"cell_type":"code","metadata":{"id":"5lBkhHm1kYcz","executionInfo":{"status":"ok","timestamp":1637498609324,"user_tz":-540,"elapsed":2,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["stopWord = ['의','이','로','두고','들','를','은','과','수','했다','것','있는','한다','하는','그','있다','할','이런','되기','해야','있게','여기']"],"execution_count":93,"outputs":[]},{"cell_type":"code","metadata":{"id":"BJgERpoikh9s","executionInfo":{"status":"ok","timestamp":1637498611404,"user_tz":-540,"elapsed":243,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["word = []\n","for tag in oktTag:\n"," if tag[1] not in stopPos:\n"," if tag[0] not in stopWord:\n"," word.append(tag[0])"],"execution_count":94,"outputs":[]},{"cell_type":"code","metadata":{"id":"iUQTDj4KkkBN","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637498612712,"user_tz":-540,"elapsed":4,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"b9090230-cb69-418b-dc6e-62a6c45764eb"},"source":["print(word)"],"execution_count":95,"outputs":[{"output_type":"stream","name":"stdout","text":["['인간', '컴퓨터', '와', '대화', 
'하고', '있다는', '깨닫지', '못', '하고', '인간', '대화', '계속', '있다면', '컴퓨터', '는', '지능', '적', '인', '간주', '될', '있습니다']\n"]}]},{"cell_type":"code","metadata":{"id":"sWmQvENLhenG"},"source":[""],"execution_count":null,"outputs":[]}]}