3. Tokenization

trnlp.file_prossesing.count_all_txt(folder_path, ngram=1) -> dict

Reads every .txt file in the folder given by "folder_path" and in its subfolders, counts the words they contain, and returns the counts as a dictionary. The "ngram" parameter sets how many consecutive words are grouped together for counting; it defaults to 1.

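A rough idea of what such a counter does can be sketched with the standard library alone. This is not trnlp's implementation: the function name count_ngrams_in_folder, the whitespace splitting, and the choice to join each n-gram with a space as the dictionary key are all assumptions made for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_ngrams_in_folder(folder_path, ngram=1):
    """Count word n-grams in every .txt file under folder_path (recursively)."""
    counts = Counter()
    for path in Path(folder_path).rglob("*.txt"):
        words = path.read_text(encoding="utf-8").split()
        # Slide a window of `ngram` words over the file; n-grams never
        # cross file boundaries in this sketch.
        for i in range(len(words) - ngram + 1):
            counts[" ".join(words[i:i + ngram])] += 1
    return dict(counts)


# Small demonstration on a throwaway folder with one subfolder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("bir iki bir", encoding="utf-8")
    sub = Path(tmp) / "sub"
    sub.mkdir()
    (sub / "b.txt").write_text("iki üç", encoding="utf-8")
    unigrams = count_ngrams_in_folder(tmp)
    bigrams = count_ngrams_in_folder(tmp, ngram=2)

print(unigrams)
print(bigrams)
```
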
trnlp.tokenization.unitoascii(string: str) -> str

This function converts a text to an ASCII character set slightly extended for Turkish. trnlp passes the input text through this function before tokenizing it.

The rest of this section uses the following reference variable, "metin".

metin = """Saçma ve Gereksiz Bir Yazı.
    Bakkaldan 5 TL'lik 2 çikola-
    ta al. 12.02.2018 tarihinde saat tam 15:45'te yap-
    malıyız bu işi. Tamam mı? Benimle esatmahmutbayol@gmail.com 
    adresinden iletişime geçebilirsin. Yarışta 1. oldu. Doç. Dr. 
    Esat Bayol'un(Böyle bir ünvanım yok!) yanından geliyorum.
    12 p.m. mi yoksa 12 a.m. mi? 100 milyon insan gelmiş! www.deneme.com.tr 
    adresinden sitemizi inceleyebilirsin. 24 Eylül 2018 Pazartesi günü ge-
    lecekmiş. 19 Mayıs'ı coşkuyla kutladık."""

trnlp.tokenization.simple_token(utext: str, sw=None) -> list

Splits the text at every character that is not part of a letter or digit group. Whitespace is discarded. When the "sw" parameter is given, words found in the stopword list are removed. The ready-made stopword list that ships with trnlp can be used if desired; in that case pass sw=stopwords.

from trnlp import *

print(simple_token(metin))

>> ['Saçma', 've', 'Gereksiz', 'Bir', 'Yazı', '.', 'Bakkaldan', '5', 'TL', "'", 'lik', '2', 'çikola', '-', 'ta', 'al', '.', '12', '.', '02', '.', '2018', 'tarihinde', 'saat', 'tam', '15', ':', '45', "'", 'te', 'yap', '-', 'malıyız', 'bu', 'işi', '.', 'Tamam', 'mı', '?', 'Benimle', 'esatmahmutbayol', '@', 'gmail', '.', 'com', 'adresinden', 'iletişime', 'geçebilirsin', '.', 'Yarışta', '1', '.', 'oldu', '.', 'Doç', '.', 'Dr', '.', 'Esat', 'Bayol', "'", 'un', '(', 'Böyle', 'bir', 'ünvanım', 'yok', '!', ')', 'yanından', 'geliyorum', '.', '12', 'p', '.', 'm', '.', 'mi', 'yoksa', '12', 'a', '.', 'm', '.', 'mi', '?', '100', 'milyon', 'insan', 'gelmiş', '!', 'www', '.', 'deneme', '.', 'com', '.', 'tr', 'adresinden', 'sitemizi', 'inceleyebilirsin', '.', '24', 'Eylül', '2018', 'Pazartesi', 'günü', 'ge', '-', 'lecekmiş', '.', '19', 'Mayıs', "'", 'ı', 'coşkuyla', 'kutladık', '.']

print(simple_token(metin, sw=stopwords))

>> ['Saçma', 'Gereksiz', 'Yazı', '.', 'Bakkaldan', '5', 'TL', "'", 'lik', '2', 'çikola', '-', 'ta', 'al', '.', '12', '.', '02', '.', '2018', 'tarihinde', 'saat', 'tam', '15', ':', '45', "'", 'te', 'yap', '-', 'malıyız', 'işi', '.', '?', 'Benimle', 'esatmahmutbayol', '@', 'gmail', '.', 'com', 'adresinden', 'iletişime', 'geçebilirsin', '.', 'Yarışta', '1', '.', 'oldu', '.', 'Doç', '.', 'Dr', '.', 'Esat', 'Bayol', "'", 'un', '(', 'ünvanım', 'yok', '!', ')', 'yanından', 'geliyorum', '.', '12', 'p', '.', 'm', '.', '12', 'a', '.', 'm', '.', '?', '100', 'milyon', 'insan', 'gelmiş', '!', 'www', '.', 'deneme', '.', 'com', '.', 'tr', 'adresinden', 'sitemizi', 'inceleyebilirsin', '.', '24', 'Eylül', '2018', 'Pazartesi', 'günü', 'ge', '-', 'lecekmiş', '.', '19', 'Mayıs', "'", 'ı', 'coşkuyla', 'kutladık', '.']
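
The behaviour shown in the two outputs above can be approximated with a single regular expression. The sketch below is an assumption-laden stand-in, not the trnlp source: simple_token_sketch is a hypothetical name, stopword matching is assumed to ignore case, and Python's \w also accepts the underscore.

```python
import re


def simple_token_sketch(text, sw=None):
    # Runs of letters/digits become one token; every other non-space
    # character becomes its own token; whitespace is dropped.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if sw is not None:
        # Assumption: stopwords are matched case-insensitively.
        # Note that str.lower() is not Turkish-aware for I/İ.
        stop = {w.lower() for w in sw}
        tokens = [t for t in tokens if t.lower() not in stop]
    return tokens


sample = "Bakkaldan 5 TL'lik 2 çikola-\n    ta al. Tamam mı?"
plain = simple_token_sketch(sample)
filtered = simple_token_sketch(sample, sw=["tamam", "mı"])
print(plain)
print(filtered)
```

As in the real output, punctuation marks survive stopword removal because they are ordinary tokens that simply do not appear in the stopword list.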

trnlp.tokenization.whitespace_token(utext: str, sw=None) -> list

Splits the text at spaces. The spaces are discarded; newline characters remain in the tokens, as the example output shows. When the "sw" parameter is given, words found in the stopword list are removed. The ready-made stopword list that ships with trnlp can be used if desired; in that case pass sw=stopwords.

from trnlp import *

print(whitespace_token(metin))

>> ['Saçma', 've', 'Gereksiz', 'Bir', 'Yazı.\n', 'Bakkaldan', '5', "TL'lik", '2', 'çikola-\n', 'ta', 'al.', '12.02.2018', 'tarihinde', 'saat', 'tam', "15:45'te", 'yap-\n', 'malıyız', 'bu', 'işi.', 'Tamam', 'mı?', 'Benimle', 'esatmahmutbayol@gmail.com', '\n', 'adresinden', 'iletişime', 'geçebilirsin.', 'Yarışta', '1.', 'oldu.', 'Doç.', 'Dr.', '\n', 'Esat', "Bayol'un(Böyle", 'bir', 'ünvanım', 'yok!)', 'yanından', 'geliyorum.\n', '12', 'p.m.', 'mi', 'yoksa', '12', 'a.m.', 'mi?', '100', 'milyon', 'insan', 'gelmiş!', 'www.deneme.com.tr', '\n', 'adresinden', 'sitemizi', 'inceleyebilirsin.', '24', 'Eylül', '2018', 'Pazartesi', 'günü', 'ge-\n', 'lecekmiş.', '19', "Mayıs'ı", 'coşkuyla', 'kutladık.']
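
The tokens 'Yazı.\n' and '\n' in the output above suggest that only the space character acts as a separator. A minimal sketch under that assumption (whitespace_token_sketch is a hypothetical name, and the "sw" filtering is left out):

```python
def whitespace_token_sketch(text):
    # Split on the space character only and drop the empty strings
    # produced by runs of spaces; "\n" is not treated as a separator.
    return [t for t in text.split(" ") if t]


result = whitespace_token_sketch(
    "Bir Yazı.\n    Bakkaldan gmail.com \n    adresinden"
)
print(result)
```

Calling str.split() with no argument would instead split on every kind of whitespace and would not leave newline characters in the result.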

trnlp.tokenization.word_token(utext: str, numbers=True, stop=None) -> list

Lists the words in the text. When the "stop" parameter is given, words found in the stopword list are removed. The ready-made stopword list that ships with trnlp can be used if desired; in that case pass stop=stopwords. By default numbers are captured as well (numbers=True); pass numbers=False to leave out groups that contain digits.

from trnlp import *

print(word_token(metin))

>> ['Saçma', 've', 'Gereksiz', 'Bir', 'Yazı', 'Bakkaldan', '5', "TL'lik", '2', 'çikolata', 'al', '12', '02', '2018', 'tarihinde', 'saat', 'tam', '15', "45'te", 'yapmalıyız', 'bu', 'işi', 'Tamam', 'mı', 'Benimle', 'esatmahmutbayol', 'gmail', 'com', 'adresinden', 'iletişime', 'geçebilirsin', 'Yarışta', '1', 'oldu', 'Doç', 'Dr', 'Esat', "Bayol'un", 'Böyle', 'bir', 'ünvanım', 'yok', 'yanından', 'geliyorum', '12', 'p', 'm', 'mi', 'yoksa', '12', 'a', 'm', 'mi', '100', 'milyon', 'insan', 'gelmiş', 'www', 'deneme', 'com', 'tr', 'adresinden', 'sitemizi', 'inceleyebilirsin', '24', 'Eylül', '2018', 'Pazartesi', 'günü', 'gelecekmiş', '19', "Mayıs'ı", 'coşkuyla', 'kutladık']
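
Three things stand out in the output above: words hyphenated across a line break are rejoined ('çikolata'), apostrophe suffixes stay attached ("TL'lik"), and all other punctuation disappears. The sketch below reproduces those three effects with regular expressions. It is only an approximation: word_token_sketch is a hypothetical name, and the meaning given to numbers=False (drop every token that contains a digit) is an assumption.

```python
import re


def word_token_sketch(text, numbers=True):
    # Rejoin words that were hyphenated across a line break.
    joined = re.sub(r"-[ \t]*\n[ \t]*", "", text)
    # A token is a run of word characters, optionally continued by
    # an apostrophe plus suffix, e.g. TL'lik.
    tokens = re.findall(r"\w+(?:'\w+)*", joined)
    if not numbers:
        # Assumption: numbers=False drops every token containing a digit.
        tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return tokens


sample = "5 TL'lik 2 çikola-\n    ta al. Saat 15:45'te yap-\n    malıyız."
with_numbers = word_token_sketch(sample)
without_numbers = word_token_sketch(sample, numbers=False)
print(with_numbers)
print(without_numbers)
```
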

trnlp.tokenization.TrnlpToken

trnlp tokenization aims to split a sentence or text into smaller meaningful units. To that end it indexes the dates, times, abbreviations, financial expressions, telephone numbers, websites, and e-mail addresses in the text. After creating an instance of the TrnlpToken class, new text can be supplied with the settext() method.

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)

.tokens

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.tokens)

>> ['Saçma', ' ', 've', ' ', 'Gereksiz', ' ', 'Bir', ' ', 'Yazı', '.', '\n', 'Bakkaldan', ' ', '5 TL', "'", 'lik', ' ', '2', ' ', 'çikolata', ' ', 'al', '.', ' ', '12.02.2018', ' ', 'tarihinde', ' ', 'saat', ' ', 'tam', ' ', '15:45', "'", 'te', ' ', 'yapmalıyız', '\n', 'bu', ' ', 'işi', '.', ' ', 'Tamam', ' ', 'mı', '?', ' ', 'Benimle', ' ', 'esatmahmutbayol@gmail.com', '\n', 'adresinden', ' ', 'iletişime', ' ', 'geçebilirsin', '.', ' ', 'Yarışta', ' ', '1.', ' ', 'oldu', '.', ' ', 'Doç.', ' ', 'Dr.', '\n', 'Esat', ' ', "Bayol'un", '(', 'Böyle', ' ', 'bir', ' ', 'ünvanım', ' ', 'yok', '!', ')', ' ', 'yanından', ' ', 'geliyorum', '.', '\n', '12 p.m.', ' ', 'mi', ' ', 'yoksa', ' ', '12 a.m.', ' ', 'mi', '?', ' ', '100', ' ', 'milyon', ' ', 'insan', ' ', 'gelmiş', '!', ' ', 'www.deneme.com.tr', '\n', 'adresinden', ' ', 'sitemizi', ' ', 'inceleyebilirsin', '.', ' ', '24 Eylül 2018', ' ', 'Pazartesi', ' ', 'günü', ' ', 'gelecekmiş', '.', '\n', '19 Mayıs', "'", 'ı', ' ', 'coşkuyla', ' ', 'kutladık', '.']

.spans

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.spans)

>> [(0, 5), (5, 6), (6, 8), (8, 9), (9, 17), (17, 18), (18, 21), (21, 22), (22, 26), (26, 27), (27, 28), (28, 37), (37, 38), (38, 42), (42, 43), (43, 46), (46, 47), (47, 48), (48, 49), (49, 57), (57, 58), (58, 60), (60, 61), (61, 62), (62, 72), (72, 73), (73, 82), (82, 83), (83, 87), (87, 88), (88, 91), (91, 92), (92, 97), (97, 98), (98, 100), (100, 101), (101, 111), (111, 112), (112, 114), (114, 115), (115, 118), (118, 119), (119, 120), (120, 125), (125, 126), (126, 128), (128, 129), (129, 130), (130, 137), (137, 138), (138, 163), (163, 164), (164, 174), (174, 175), (175, 184), (184, 185), (185, 197), (197, 198), (198, 199), (199, 206), (206, 207), (207, 209), (209, 210), (210, 214), (214, 215), (215, 216), (216, 220), (220, 221), (221, 224), (224, 225), (225, 229), (229, 230), (230, 238), (238, 239), (239, 244), (244, 245), (245, 248), (248, 249), (249, 256), (256, 257), (257, 260), (260, 261), (261, 262), (262, 263), (263, 271), (271, 272), (272, 281), (281, 282), (282, 283), (283, 290), (290, 291), (291, 293), (293, 294), (294, 299), (299, 300), (300, 307), (307, 308), (308, 310), (310, 311), (311, 312), (312, 315), (315, 316), (316, 322), (322, 323), (323, 328), (328, 329), (329, 335), (335, 336), (336, 337), (337, 354), (354, 355), (355, 365), (365, 366), (366, 374), (374, 375), (375, 391), (391, 392), (392, 393), (393, 406), (406, 407), (407, 416), (416, 417), (417, 421), (421, 422), (422, 432), (432, 433), (433, 434), (434, 442), (442, 443), (443, 444), (444, 445), (445, 453), (453, 454), (454, 462), (462, 463)]
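
In the output above every span starts where the previous one ends, so the offsets appear to index the text that is obtained by concatenating the tokens (the normalised text, in which hyphenated line breaks are already rejoined) rather than the raw input. Under that assumption the spans can be rebuilt from the token list alone; spans_from_tokens is a hypothetical helper, not part of trnlp.

```python
def spans_from_tokens(tokens):
    # Each token occupies [start, end) in the concatenation of all tokens.
    spans, start = [], 0
    for token in tokens:
        end = start + len(token)
        spans.append((start, end))
        start = end
    return spans


tokens = ['Saçma', ' ', 've', ' ', 'Gereksiz', ' ', 'Bir', ' ', 'Yazı', '.', '\n']
spans = spans_from_tokens(tokens)
text = "".join(tokens)

# Slicing the concatenated text with a span gives back the token.
recovered = [text[s:e] for s, e in spans]
print(spans)
```
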

.types

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.types)

>> ['word', 'space', 'word', 'space', 'word', 'space', 'word', 'space', 'word', 'eos', 'enter', 'word', 'space', 'financial', 'punch', 'suffix', 'space', 'number', 'space', 'word', 'space', 'word', 'eos', 'space', 'date', 'space', 'word', 'space', 'word', 'space', 'word', 'space', 'time', 'punch', 'suffix', 'space', 'word', 'enter', 'word', 'space', 'word', 'eos', 'space', 'word', 'space', 'word', 'eos', 'space', 'word', 'space', 'web', 'enter', 'word', 'space', 'word', 'space', 'word', 'eos', 'space', 'word', 'space', 'th', 'space', 'word', 'eos', 'space', 'abbr', 'space', 'abbr', 'enter', 'word', 'space', 'word', 'punch', 'word', 'space', 'word', 'space', 'word', 'space', 'word', 'punch', 'punch', 'space', 'word', 'space', 'word', 'eos', 'enter', 'time', 'space', 'word', 'space', 'word', 'space', 'time', 'space', 'word', 'eos', 'space', 'number', 'space', 'word', 'space', 'word', 'space', 'word', 'eos', 'space', 'web', 'enter', 'word', 'space', 'word', 'space', 'word', 'eos', 'space', 'date', 'space', 'word', 'space', 'word', 'space', 'word', 'eos', 'enter', 'date', 'punch', 'suffix', 'space', 'word', 'space', 'word', 'eos']

.wordtoken

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.wordtoken)

>> ['Saçma', 've', 'Gereksiz', 'Bir', 'Yazı', 'Bakkaldan', 'lik', 'çikolata', 'al', 'tarihinde', 'saat', 'tam', 'te', 'yapmalıyız', 'bu', 'işi', 'Tamam', 'mı', 'Benimle', 'adresinden', 'iletişime', 'geçebilirsin', 'Yarışta', 'oldu', 'Esat', "Bayol'un", 'Böyle', 'bir', 'ünvanım', 'yok', 'yanından', 'geliyorum', 'mi', 'yoksa', 'mi', 'milyon', 'insan', 'gelmiş', 'adresinden', 'sitemizi', 'inceleyebilirsin', 'Pazartesi', 'günü', 'gelecekmiş', 'ı', 'coşkuyla', 'kutladık']
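
Comparing this list with the .types output suggests that .wordtoken keeps the tokens typed 'word' or 'suffix' and drops everything else (numbers, dates, abbreviations, punctuation, and so on). That reading is inferred from the sample output, not from the library source, and words_only below is a hypothetical helper.

```python
KEEP = {"word", "suffix"}  # assumption inferred from the sample output


def words_only(tokens, types):
    # Pair each token with its type and keep only the wanted types.
    return [tok for tok, typ in zip(tokens, types) if typ in KEEP]


tokens = ['Bakkaldan', ' ', '5 TL', "'", 'lik', ' ', '2', ' ', 'çikolata', ' ', 'al', '.']
types = ['word', 'space', 'financial', 'punch', 'suffix', 'space',
         'number', 'space', 'word', 'space', 'word', 'eos']
words = words_only(tokens, types)
print(words)
```
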

.phrasetoken

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.phrasetoken)

>> ['Saçma ve Gereksiz Bir Yazı.', "Bakkaldan 5 TL'lik 2 çikolata al.", "12.02.2018 tarihinde saat tam 15:45'te yapmalıyız bu işi.", 'Tamam mı?', 'Benimle esatmahmutbayol@gmail.com adresinden iletişime geçebilirsin.', 'Yarışta 1. oldu.', "Doç. Dr. Esat Bayol'un(Böyle bir ünvanım yok!) yanından geliyorum.", '12 p.m. mi yoksa 12 a.m. mi?', '100 milyon insan gelmiş!', 'www.deneme.com.tr adresinden sitemizi inceleyebilirsin.', '24 Eylül 2018 Pazartesi günü gelecekmiş.', "19 Mayıs'ı coşkuyla kutladık."]
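
The sentences above end exactly at tokens typed 'eos', and a line break inside a sentence comes out as a single space. A minimal sketch of that grouping, assuming this is roughly how the list is built (sentences_from_tokens is a hypothetical helper, not a trnlp function):

```python
def sentences_from_tokens(tokens, types):
    sentences, current = [], []
    for tok, typ in zip(tokens, types):
        # Line breaks are replaced by a space inside a sentence.
        current.append(" " if typ == "enter" else tok)
        if typ == "eos":
            # An end-of-sentence token closes the current sentence.
            sentences.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


tokens = ['Tamam', ' ', 'mı', '?', ' ', 'Yarışta', ' ', '1.', ' ', 'oldu', '.']
types = ['word', 'space', 'word', 'eos', 'space', 'word', 'space', 'th',
         'space', 'word', 'eos']
sentences = sentences_from_tokens(tokens, types)
print(sentences)
```

Because '1.' is typed 'th' rather than 'eos', the ordinal does not split the second sentence.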

.wordcounter

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.wordcounter)

>> Counter({'adresinden': 2, 'mi': 2, 'Saçma': 1, 've': 1, 'Gereksiz': 1, 'Bir': 1, 'Yazı': 1, 'Bakkaldan': 1, 'lik': 1, 'çikolata': 1, 'al': 1, 'tarihinde': 1, 'saat': 1, 'tam': 1, 'te': 1, 'yapmalıyız': 1, 'bu': 1, 'işi': 1, 'Tamam': 1, 'mı': 1, 'Benimle': 1, 'iletişime': 1, 'geçebilirsin': 1, 'Yarışta': 1, 'oldu': 1, 'Esat': 1, "Bayol'un": 1, 'Böyle': 1, 'bir': 1, 'ünvanım': 1, 'yok': 1, 'yanından': 1, 'geliyorum': 1, 'yoksa': 1, 'milyon': 1, 'insan': 1, 'gelmiş': 1, 'sitemizi': 1, 'inceleyebilirsin': 1, 'Pazartesi': 1, 'günü': 1, 'gelecekmiş': 1, 'ı': 1, 'coşkuyla': 1, 'kutladık': 1})
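
The printed value is a collections.Counter, so the standard Counter methods apply to it. Counting is case-sensitive in the output above ('Bir' and 'bir' are separate keys). The short example below uses a hand-written word list instead of a real TrnlpToken object.

```python
from collections import Counter

# Hand-written stand-in for obj.wordtoken.
wordtoken = ['Bir', 'adresinden', 'mi', 'bir', 'mi', 'adresinden']
counts = Counter(wordtoken)

# The two most frequent words; ties keep first-seen order.
top_two = counts.most_common(2)
print(top_two)
```
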

.ziptoken

ziptoken list format: [(token_1: str, token_type_1: str, token_span_1: tuple), ...]

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.ziptoken)

>> [('Saçma', 'word', (0, 5)), (' ', 'space', (5, 6)), ('ve', 'word', (6, 8)), (' ', 'space', (8, 9)), ('Gereksiz', 'word', (9, 17)), (' ', 'space', (17, 18)), ('Bir', 'word', (18, 21)), (' ', 'space', (21, 22)), ('Yazı', 'word', (22, 26)), ('.', 'eos', (26, 27)), ('\n', 'enter', (27, 28)), ('Bakkaldan', 'word', (28, 37)), (' ', 'space', (37, 38)), ('5 TL', 'financial', (38, 42)), ("'", 'punch', (42, 43)), ('lik', 'suffix', (43, 46)), (' ', 'space', (46, 47)), ('2', 'number', (47, 48)), (' ', 'space', (48, 49)), ('çikolata', 'word', (49, 57)), (' ', 'space', (57, 58)), ('al', 'word', (58, 60)), ('.', 'eos', (60, 61)), (' ', 'space', (61, 62)), ('12.02.2018', 'date', (62, 72)), (' ', 'space', (72, 73)), ('tarihinde', 'word', (73, 82)), (' ', 'space', (82, 83)), ('saat', 'word', (83, 87)), (' ', 'space', (87, 88)), ('tam', 'word', (88, 91)), (' ', 'space', (91, 92)), ('15:45', 'time', (92, 97)), ("'", 'punch', (97, 98)), ('te', 'suffix', (98, 100)), (' ', 'space', (100, 101)), ('yapmalıyız', 'word', (101, 111)), ('\n', 'enter', (111, 112)), ('bu', 'word', (112, 114)), (' ', 'space', (114, 115)), ('işi', 'word', (115, 118)), ('.', 'eos', (118, 119)), (' ', 'space', (119, 120)), ('Tamam', 'word', (120, 125)), (' ', 'space', (125, 126)), ('mı', 'word', (126, 128)), ('?', 'eos', (128, 129)), (' ', 'space', (129, 130)), ('Benimle', 'word', (130, 137)), (' ', 'space', (137, 138)), ('esatmahmutbayol@gmail.com', 'web', (138, 163)), ('\n', 'enter', (163, 164)), ('adresinden', 'word', (164, 174)), (' ', 'space', (174, 175)), ('iletişime', 'word', (175, 184)), (' ', 'space', (184, 185)), ('geçebilirsin', 'word', (185, 197)), ('.', 'eos', (197, 198)), (' ', 'space', (198, 199)), ('Yarışta', 'word', (199, 206)), (' ', 'space', (206, 207)), ('1.', 'th', (207, 209)), (' ', 'space', (209, 210)), ('oldu', 'word', (210, 214)), ('.', 'eos', (214, 215)), (' ', 'space', (215, 216)), ('Doç.', 'abbr', (216, 220)), (' ', 'space', (220, 221)), ('Dr.', 'abbr', (221, 224)), ('\n', 'enter', 
(224, 225)), ('Esat', 'word', (225, 229)), (' ', 'space', (229, 230)), ("Bayol'un", 'word', (230, 238)), ('(', 'punch', (238, 239)), ('Böyle', 'word', (239, 244)), (' ', 'space', (244, 245)), ('bir', 'word', (245, 248)), (' ', 'space', (248, 249)), ('ünvanım', 'word', (249, 256)), (' ', 'space', (256, 257)), ('yok', 'word', (257, 260)), ('!', 'punch', (260, 261)), (')', 'punch', (261, 262)), (' ', 'space', (262, 263)), ('yanından', 'word', (263, 271)), (' ', 'space', (271, 272)), ('geliyorum', 'word', (272, 281)), ('.', 'eos', (281, 282)), ('\n', 'enter', (282, 283)), ('12 p.m.', 'time', (283, 290)), (' ', 'space', (290, 291)), ('mi', 'word', (291, 293)), (' ', 'space', (293, 294)), ('yoksa', 'word', (294, 299)), (' ', 'space', (299, 300)), ('12 a.m.', 'time', (300, 307)), (' ', 'space', (307, 308)), ('mi', 'word', (308, 310)), ('?', 'eos', (310, 311)), (' ', 'space', (311, 312)), ('100', 'number', (312, 315)), (' ', 'space', (315, 316)), ('milyon', 'word', (316, 322)), (' ', 'space', (322, 323)), ('insan', 'word', (323, 328)), (' ', 'space', (328, 329)), ('gelmiş', 'word', (329, 335)), ('!', 'eos', (335, 336)), (' ', 'space', (336, 337)), ('www.deneme.com.tr', 'web', (337, 354)), ('\n', 'enter', (354, 355)), ('adresinden', 'word', (355, 365)), (' ', 'space', (365, 366)), ('sitemizi', 'word', (366, 374)), (' ', 'space', (374, 375)), ('inceleyebilirsin', 'word', (375, 391)), ('.', 'eos', (391, 392)), (' ', 'space', (392, 393)), ('24 Eylül 2018', 'date', (393, 406)), (' ', 'space', (406, 407)), ('Pazartesi', 'word', (407, 416)), (' ', 'space', (416, 417)), ('günü', 'word', (417, 421)), (' ', 'space', (421, 422)), ('gelecekmiş', 'word', (422, 432)), ('.', 'eos', (432, 433)), ('\n', 'enter', (433, 434)), ('19 Mayıs', 'date', (434, 442)), ("'", 'punch', (442, 443)), ('ı', 'suffix', (443, 444)), (' ', 'space', (444, 445)), ('coşkuyla', 'word', (445, 453)), (' ', 'space', (453, 454)), ('kutladık', 'word', (454, 462)), ('.', 'eos', (462, 463))]
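
Each entry lines up with the corresponding items of .tokens, .types, and .spans, so the same structure can be produced with zip(). Having the type next to each token also makes it easy to pull out one kind of token. The sketch below works on a hand-copied slice of the output; tokens_of_type is a hypothetical helper.

```python
tokens = ['12.02.2018', ' ', 'tarihinde', ' ', 'saat', ' ', 'tam', ' ', '15:45']
types = ['date', 'space', 'word', 'space', 'word', 'space', 'word', 'space', 'time']
spans = [(62, 72), (72, 73), (73, 82), (82, 83), (83, 87),
         (87, 88), (88, 91), (91, 92), (92, 97)]

# Same shape as the ziptoken format: (token, type, span).
ziptoken = list(zip(tokens, types, spans))


def tokens_of_type(zip_list, wanted):
    # Keep the token and its span for every entry of a wanted type.
    return [(tok, span) for tok, typ, span in zip_list if typ in wanted]


dates_and_times = tokens_of_type(ziptoken, {"date", "time"})
print(dates_and_times)
```
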

clean_punch(self, zip_list=None) -> list

Removes punctuation, space, and newline ("enter") tokens from the text. Punctuation inside typed tokens such as dates and times is not removed. If all punctuation marks need to be removed, the simple_token, whitespace_token, or word_token functions should be used. The "zip_list" parameter accepts input in "ziptoken" format, and the output is also in "ziptoken" format. This function and "clean_stopwords" can each take the other's result as a parameter.

ziptoken list format: [(token_1: str, token_type_1: str, token_span_1: tuple), ...]

WARNING: This function does not modify the other attributes of the object, so if the result is needed later it must be assigned to a variable. For example:

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.clean_punch()) # With no argument, the operation is applied to the "ziptoken" attribute.

>> [('Saçma', 'word', (0, 5)), ('ve', 'word', (6, 8)), ('Gereksiz', 'word', (9, 17)), ('Bir', 'word', (18, 21)), ('Yazı', 'word', (22, 26)), ('Bakkaldan', 'word', (28, 37)), ('5 TL', 'financial', (38, 42)), ('lik', 'suffix', (43, 46)), ('2', 'number', (47, 48)), ('çikolata', 'word', (49, 57)), ('al', 'word', (58, 60)), ('12.02.2018', 'date', (62, 72)), ('tarihinde', 'word', (73, 82)), ('saat', 'word', (83, 87)), ('tam', 'word', (88, 91)), ('15:45', 'time', (92, 97)), ('te', 'suffix', (98, 100)), ('yapmalıyız', 'word', (101, 111)), ('bu', 'word', (112, 114)), ('işi', 'word', (115, 118)), ('Tamam', 'word', (120, 125)), ('mı', 'word', (126, 128)), ('Benimle', 'word', (130, 137)), ('esatmahmutbayol@gmail.com', 'web', (138, 163)), ('adresinden', 'word', (164, 174)), ('iletişime', 'word', (175, 184)), ('geçebilirsin', 'word', (185, 197)), ('Yarışta', 'word', (199, 206)), ('1.', 'th', (207, 209)), ('oldu', 'word', (210, 214)), ('Doç.', 'abbr', (216, 220)), ('Dr.', 'abbr', (221, 224)), ('Esat', 'word', (225, 229)), ("Bayol'un", 'word', (230, 238)), ('Böyle', 'word', (239, 244)), ('bir', 'word', (245, 248)), ('ünvanım', 'word', (249, 256)), ('yok', 'word', (257, 260)), ('yanından', 'word', (263, 271)), ('geliyorum', 'word', (272, 281)), ('12 p.m.', 'time', (283, 290)), ('mi', 'word', (291, 293)), ('yoksa', 'word', (294, 299)), ('12 a.m.', 'time', (300, 307)), ('mi', 'word', (308, 310)), ('100', 'number', (312, 315)), ('milyon', 'word', (316, 322)), ('insan', 'word', (323, 328)), ('gelmiş', 'word', (329, 335)), ('www.deneme.com.tr', 'web', (337, 354)), ('adresinden', 'word', (355, 365)), ('sitemizi', 'word', (366, 374)), ('inceleyebilirsin', 'word', (375, 391)), ('24 Eylül 2018', 'date', (393, 406)), ('Pazartesi', 'word', (407, 416)), ('günü', 'word', (417, 421)), ('gelecekmiş', 'word', (422, 432)), ('19 Mayıs', 'date', (434, 442)), ('ı', 'suffix', (443, 444)), ('coşkuyla', 'word', (445, 453)), ('kutladık', 'word', (454, 462))]

print(obj.clean_punch(obj.clean_stopwords())) # Both stopwords and punctuation have now been removed.

>> [('Saçma', 'word', (0, 5)), ('Gereksiz', 'word', (9, 17)), ('Yazı', 'word', (22, 26)), ('Bakkaldan', 'word', (28, 37)), ('5 TL', 'financial', (38, 42)), ('lik', 'suffix', (43, 46)), ('2', 'number', (47, 48)), ('çikolata', 'word', (49, 57)), ('al', 'word', (58, 60)), ('12.02.2018', 'date', (62, 72)), ('tarihinde', 'word', (73, 82)), ('saat', 'word', (83, 87)), ('tam', 'word', (88, 91)), ('15:45', 'time', (92, 97)), ('te', 'suffix', (98, 100)), ('yapmalıyız', 'word', (101, 111)), ('işi', 'word', (115, 118)), ('Benimle', 'word', (130, 137)), ('esatmahmutbayol@gmail.com', 'web', (138, 163)), ('adresinden', 'word', (164, 174)), ('iletişime', 'word', (175, 184)), ('geçebilirsin', 'word', (185, 197)), ('Yarışta', 'word', (199, 206)), ('1.', 'th', (207, 209)), ('oldu', 'word', (210, 214)), ('Doç.', 'abbr', (216, 220)), ('Dr.', 'abbr', (221, 224)), ('Esat', 'word', (225, 229)), ("Bayol'un", 'word', (230, 238)), ('ünvanım', 'word', (249, 256)), ('yok', 'word', (257, 260)), ('yanından', 'word', (263, 271)), ('geliyorum', 'word', (272, 281)), ('12 p.m.', 'time', (283, 290)), ('12 a.m.', 'time', (300, 307)), ('100', 'number', (312, 315)), ('milyon', 'word', (316, 322)), ('insan', 'word', (323, 328)), ('gelmiş', 'word', (329, 335)), ('www.deneme.com.tr', 'web', (337, 354)), ('adresinden', 'word', (355, 365)), ('sitemizi', 'word', (366, 374)), ('inceleyebilirsin', 'word', (375, 391)), ('24 Eylül 2018', 'date', (393, 406)), ('Pazartesi', 'word', (407, 416)), ('günü', 'word', (417, 421)), ('gelecekmiş', 'word', (422, 432)), ('19 Mayıs', 'date', (434, 442)), ('ı', 'suffix', (443, 444)), ('coşkuyla', 'word', (445, 453)), ('kutladık', 'word', (454, 462))]

print(obj.tokens)

>> ['Saçma', ' ', 've', ' ', 'Gereksiz', ' ', 'Bir', ' ', 'Yazı', '.', '\n', 'Bakkaldan', ' ', '5 TL', "'", 'lik', ' ', '2', ' ', 'çikolata', ' ', 'al', '.', ' ', '12.02.2018', ' ', 'tarihinde', ' ', 'saat', ' ', 'tam', ' ', '15:45', "'", 'te', ' ', 'yapmalıyız', '\n', 'bu', ' ', 'işi', '.', ' ', 'Tamam', ' ', 'mı', '?', ' ', 'Benimle', ' ', 'esatmahmutbayol@gmail.com', '\n', 'adresinden', ' ', 'iletişime', ' ', 'geçebilirsin', '.', ' ', 'Yarışta', ' ', '1.', ' ', 'oldu', '.', ' ', 'Doç.', ' ', 'Dr.', '\n', 'Esat', ' ', "Bayol'un", '(', 'Böyle', ' ', 'bir', ' ', 'ünvanım', ' ', 'yok', '!', ')', ' ', 'yanından', ' ', 'geliyorum', '.', '\n', '12 p.m.', ' ', 'mi', ' ', 'yoksa', ' ', '12 a.m.', ' ', 'mi', '?', ' ', '100', ' ', 'milyon', ' ', 'insan', ' ', 'gelmiş', '!', ' ', 'www.deneme.com.tr', '\n', 'adresinden', ' ', 'sitemizi', ' ', 'inceleyebilirsin', '.', ' ', '24 Eylül 2018', ' ', 'Pazartesi', ' ', 'günü', ' ', 'gelecekmiş', '.', '\n', '19 Mayıs', "'", 'ı', ' ', 'coşkuyla', ' ', 'kutladık', '.']
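
Judging by the outputs above, the entries that disappear are exactly those typed 'punch', 'eos', 'space', or 'enter'. A minimal sketch of such a filter, assuming that reading is right (clean_punch_sketch is a hypothetical stand-in, not the trnlp method):

```python
DROPPED_TYPES = {"punch", "eos", "space", "enter"}  # inferred from the output


def clean_punch_sketch(zip_list):
    # Build and return a new list; the input list is left untouched.
    return [item for item in zip_list if item[1] not in DROPPED_TYPES]


ziptoken = [
    ('Yazı', 'word', (22, 26)), ('.', 'eos', (26, 27)),
    ('\n', 'enter', (27, 28)), ('Bakkaldan', 'word', (28, 37)),
    (' ', 'space', (37, 38)), ('5 TL', 'financial', (38, 42)),
    ("'", 'punch', (42, 43)), ('lik', 'suffix', (43, 46)),
]
cleaned = clean_punch_sketch(ziptoken)
print(cleaned)
```

Returning a new list mirrors the warning above: the original ziptoken stays as it was, so the result has to be stored if it is needed later.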

clean_stopwords(self, zip_list=None) -> list

It is used in the same way as clean_punch(). It removes stopwords and does not modify the attributes of the object.

ziptoken list format: [(token_1: str, token_type_1: str, token_span_1: tuple), ...]

from trnlp import *

obj = TrnlpToken()
obj.settext(metin)
print(obj.clean_stopwords()) # With no argument, the operation is applied to the "ziptoken" attribute.

>> [('Saçma', 'word', (0, 5)), (' ', 'space', (5, 6)), (' ', 'space', (8, 9)), ('Gereksiz', 'word', (9, 17)), (' ', 'space', (17, 18)), (' ', 'space', (21, 22)), ('Yazı', 'word', (22, 26)), ('.', 'eos', (26, 27)), ('\n', 'enter', (27, 28)), ('Bakkaldan', 'word', (28, 37)), (' ', 'space', (37, 38)), ('5 TL', 'financial', (38, 42)), ("'", 'punch', (42, 43)), ('lik', 'suffix', (43, 46)), (' ', 'space', (46, 47)), ('2', 'number', (47, 48)), (' ', 'space', (48, 49)), ('çikolata', 'word', (49, 57)), (' ', 'space', (57, 58)), ('al', 'word', (58, 60)), ('.', 'eos', (60, 61)), (' ', 'space', (61, 62)), ('12.02.2018', 'date', (62, 72)), (' ', 'space', (72, 73)), ('tarihinde', 'word', (73, 82)), (' ', 'space', (82, 83)), ('saat', 'word', (83, 87)), (' ', 'space', (87, 88)), ('tam', 'word', (88, 91)), (' ', 'space', (91, 92)), ('15:45', 'time', (92, 97)), ("'", 'punch', (97, 98)), ('te', 'suffix', (98, 100)), (' ', 'space', (100, 101)), ('yapmalıyız', 'word', (101, 111)), ('\n', 'enter', (111, 112)), (' ', 'space', (114, 115)), ('işi', 'word', (115, 118)), ('.', 'eos', (118, 119)), (' ', 'space', (119, 120)), (' ', 'space', (125, 126)), ('?', 'eos', (128, 129)), (' ', 'space', (129, 130)), ('Benimle', 'word', (130, 137)), (' ', 'space', (137, 138)), ('esatmahmutbayol@gmail.com', 'web', (138, 163)), ('\n', 'enter', (163, 164)), ('adresinden', 'word', (164, 174)), (' ', 'space', (174, 175)), ('iletişime', 'word', (175, 184)), (' ', 'space', (184, 185)), ('geçebilirsin', 'word', (185, 197)), ('.', 'eos', (197, 198)), (' ', 'space', (198, 199)), ('Yarışta', 'word', (199, 206)), (' ', 'space', (206, 207)), ('1.', 'th', (207, 209)), (' ', 'space', (209, 210)), ('oldu', 'word', (210, 214)), ('.', 'eos', (214, 215)), (' ', 'space', (215, 216)), ('Doç.', 'abbr', (216, 220)), (' ', 'space', (220, 221)), ('Dr.', 'abbr', (221, 224)), ('\n', 'enter', (224, 225)), ('Esat', 'word', (225, 229)), (' ', 'space', (229, 230)), ("Bayol'un", 'word', (230, 238)), ('(', 'punch', (238, 239)), (' ', 
'space', (244, 245)), (' ', 'space', (248, 249)), ('ünvanım', 'word', (249, 256)), (' ', 'space', (256, 257)), ('yok', 'word', (257, 260)), ('!', 'punch', (260, 261)), (')', 'punch', (261, 262)), (' ', 'space', (262, 263)), ('yanından', 'word', (263, 271)), (' ', 'space', (271, 272)), ('geliyorum', 'word', (272, 281)), ('.', 'eos', (281, 282)), ('\n', 'enter', (282, 283)), ('12 p.m.', 'time', (283, 290)), (' ', 'space', (290, 291)), (' ', 'space', (293, 294)), (' ', 'space', (299, 300)), ('12 a.m.', 'time', (300, 307)), (' ', 'space', (307, 308)), ('?', 'eos', (310, 311)), (' ', 'space', (311, 312)), ('100', 'number', (312, 315)), (' ', 'space', (315, 316)), ('milyon', 'word', (316, 322)), (' ', 'space', (322, 323)), ('insan', 'word', (323, 328)), (' ', 'space', (328, 329)), ('gelmiş', 'word', (329, 335)), ('!', 'eos', (335, 336)), (' ', 'space', (336, 337)), ('www.deneme.com.tr', 'web', (337, 354)), ('\n', 'enter', (354, 355)), ('adresinden', 'word', (355, 365)), (' ', 'space', (365, 366)), ('sitemizi', 'word', (366, 374)), (' ', 'space', (374, 375)), ('inceleyebilirsin', 'word', (375, 391)), ('.', 'eos', (391, 392)), (' ', 'space', (392, 393)), ('24 Eylül 2018', 'date', (393, 406)), (' ', 'space', (406, 407)), ('Pazartesi', 'word', (407, 416)), (' ', 'space', (416, 417)), ('günü', 'word', (417, 421)), (' ', 'space', (421, 422)), ('gelecekmiş', 'word', (422, 432)), ('.', 'eos', (432, 433)), ('\n', 'enter', (433, 434)), ('19 Mayıs', 'date', (434, 442)), ("'", 'punch', (442, 443)), ('ı', 'suffix', (443, 444)), (' ', 'space', (444, 445)), ('coşkuyla', 'word', (445, 453)), (' ', 'space', (453, 454)), ('kutladık', 'word', (454, 462)), ('.', 'eos', (462, 463))]

print(obj.clean_stopwords(obj.clean_punch())) # Both stopwords and punctuation have now been removed.

>> [('Saçma', 'word', (0, 5)), ('Gereksiz', 'word', (9, 17)), ('Yazı', 'word', (22, 26)), ('Bakkaldan', 'word', (28, 37)), ('5 TL', 'financial', (38, 42)), ('lik', 'suffix', (43, 46)), ('2', 'number', (47, 48)), ('çikolata', 'word', (49, 57)), ('al', 'word', (58, 60)), ('12.02.2018', 'date', (62, 72)), ('tarihinde', 'word', (73, 82)), ('saat', 'word', (83, 87)), ('tam', 'word', (88, 91)), ('15:45', 'time', (92, 97)), ('te', 'suffix', (98, 100)), ('yapmalıyız', 'word', (101, 111)), ('işi', 'word', (115, 118)), ('Benimle', 'word', (130, 137)), ('esatmahmutbayol@gmail.com', 'web', (138, 163)), ('adresinden', 'word', (164, 174)), ('iletişime', 'word', (175, 184)), ('geçebilirsin', 'word', (185, 197)), ('Yarışta', 'word', (199, 206)), ('1.', 'th', (207, 209)), ('oldu', 'word', (210, 214)), ('Doç.', 'abbr', (216, 220)), ('Dr.', 'abbr', (221, 224)), ('Esat', 'word', (225, 229)), ("Bayol'un", 'word', (230, 238)), ('ünvanım', 'word', (249, 256)), ('yok', 'word', (257, 260)), ('yanından', 'word', (263, 271)), ('geliyorum', 'word', (272, 281)), ('12 p.m.', 'time', (283, 290)), ('12 a.m.', 'time', (300, 307)), ('100', 'number', (312, 315)), ('milyon', 'word', (316, 322)), ('insan', 'word', (323, 328)), ('gelmiş', 'word', (329, 335)), ('www.deneme.com.tr', 'web', (337, 354)), ('adresinden', 'word', (355, 365)), ('sitemizi', 'word', (366, 374)), ('inceleyebilirsin', 'word', (375, 391)), ('24 Eylül 2018', 'date', (393, 406)), ('Pazartesi', 'word', (407, 416)), ('günü', 'word', (417, 421)), ('gelecekmiş', 'word', (422, 432)), ('19 Mayıs', 'date', (434, 442)), ('ı', 'suffix', (443, 444)), ('coşkuyla', 'word', (445, 453)), ('kutladık', 'word', (454, 462))]

print(obj.tokens)

>> ['Saçma', ' ', 've', ' ', 'Gereksiz', ' ', 'Bir', ' ', 'Yazı', '.', '\n', 'Bakkaldan', ' ', '5 TL', "'", 'lik', ' ', '2', ' ', 'çikolata', ' ', 'al', '.', ' ', '12.02.2018', ' ', 'tarihinde', ' ', 'saat', ' ', 'tam', ' ', '15:45', "'", 'te', ' ', 'yapmalıyız', '\n', 'bu', ' ', 'işi', '.', ' ', 'Tamam', ' ', 'mı', '?', ' ', 'Benimle', ' ', 'esatmahmutbayol@gmail.com', '\n', 'adresinden', ' ', 'iletişime', ' ', 'geçebilirsin', '.', ' ', 'Yarışta', ' ', '1.', ' ', 'oldu', '.', ' ', 'Doç.', ' ', 'Dr.', '\n', 'Esat', ' ', "Bayol'un", '(', 'Böyle', ' ', 'bir', ' ', 'ünvanım', ' ', 'yok', '!', ')', ' ', 'yanından', ' ', 'geliyorum', '.', '\n', '12 p.m.', ' ', 'mi', ' ', 'yoksa', ' ', '12 a.m.', ' ', 'mi', '?', ' ', '100', ' ', 'milyon', ' ', 'insan', ' ', 'gelmiş', '!', ' ', 'www.deneme.com.tr', '\n', 'adresinden', ' ', 'sitemizi', ' ', 'inceleyebilirsin', '.', ' ', '24 Eylül 2018', ' ', 'Pazartesi', ' ', 'günü', ' ', 'gelecekmiş', '.', '\n', '19 Mayıs', "'", 'ı', ' ', 'coşkuyla', ' ', 'kutladık', '.']
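
The first output of this subsection shows that stopword entries are removed while the spaces around them stay, and that capitalised forms such as 'Bir' are removed too. A sketch of a filter with that behaviour follows. It is an illustration only: clean_stopwords_sketch and sample_stopwords are hypothetical, the two-word list is not trnlp's stopword list, and case-insensitive matching is an assumption.

```python
def clean_stopwords_sketch(zip_list, stopwords):
    # Assumption: matching ignores case (str.lower() is not Turkish-aware).
    stop = {w.lower() for w in stopwords}
    # Return a new list; the input list is left untouched.
    return [item for item in zip_list if item[0].lower() not in stop]


ziptoken = [
    ('Saçma', 'word', (0, 5)), (' ', 'space', (5, 6)),
    ('ve', 'word', (6, 8)), (' ', 'space', (8, 9)),
    ('Gereksiz', 'word', (9, 17)), (' ', 'space', (17, 18)),
    ('Bir', 'word', (18, 21)),
]
sample_stopwords = ["ve", "bir"]  # hypothetical, for illustration only
kept = clean_stopwords_sketch(ziptoken, sample_stopwords)
print(kept)
```
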
