1000 samples tokenization #4

Merged
merged 3 commits on Aug 22, 2019
12 changes: 9 additions & 3 deletions word-tokenization/README.md
@@ -1,12 +1,18 @@
# WiseSight Samples with Word Tokenization Label

This directory contains WiseSight samples tokenized by humans. These samples are randomly drawn from the corpus, with 40 samples for each label.
This directory contains WiseSight samples tokenized by humans. These samples are randomly drawn from the corpus.

Because these samples are representative of real-world content, we believe having these annotated samples will allow the community to robustly evaluate tokenization algorithms.
For wisesight-160, we draw 40 samples for each label, while for wisesight-1000 we draw 250 samples for each label.

**Remark:** We removed a couple of samples from wisesight-1000 because they look like spam.

Although we have two sets of data, we recommend using **wisesight-1000** because it contains more samples.
Hence, its evaluation is more representative and reliable.

Because these samples are representative of real-world content, we believe having these annotated samples will allow the community to robustly evaluate tokenization algorithms.

## Acknowledgement

The annotation was done by several people, including Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and [Dr.Attapol Rutherford][ate].

[ate] https://attapol.github.io/index.html
[ate]: https://attapol.github.io/index.html
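
As the README above notes, these annotated samples are meant for evaluating tokenizers. The post-processing notebook in this PR writes each sample to a `.label` file as "|"-delimited tokens and to a `.txt` file with the pipes stripped. Below is a minimal sketch of how the label files could be scored against a tokenizer's output; the file name `wisesight-1000.label`, the `my_tokenize` placeholder, and the boundary-F1 metric are illustrative assumptions, not part of this repository.

```python
# Minimal sketch: score a tokenizer against the "|"-delimited gold labels.
# "wisesight-1000.label" and my_tokenize() are hypothetical placeholders.

def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends

def boundary_f1(gold_tokens, pred_tokens):
    """F1 over token-end positions shared by the gold and predicted segmentations."""
    gold, pred = boundaries(gold_tokens), boundaries(pred_tokens)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def my_tokenize(text):
    # placeholder tokenizer: one character per token
    return list(text)

scores = []
with open("wisesight-1000.label", encoding="utf-8") as fl:  # hypothetical file name
    for line in fl:
        gold = line.strip().split("|")
        pred = my_tokenize("".join(gold))  # rebuild the raw text, then re-tokenize
        scores.append(boundary_f1(gold, pred))

print("mean boundary F1 over %d samples: %.3f" % (len(scores), sum(scores) / len(scores)))
```

Boundary-based scoring is only one option; token-level precision and recall can be computed from the same "|"-delimited format just as easily.
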
62 changes: 54 additions & 8 deletions word-tokenization/data-preparation-and-post-processing.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -132,30 +132,76 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Postprocessing"
"# Postprocessing\n",
"\n",
"Google Spreadsheet: https://docs.google.com/spreadsheets/d/1F_qT33T2iy0tKbflnVC8Ma-EoWEHimV3NmNRgLjN00o/edit#gid=1302375309"
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"filepath = \"https://docs.google.com/spreadsheets/d/e/2PACX-1vRm-f8qstNhxICHzEfhbCacJNQSAZptP-6ockKwsxyck5vtl7e1-A2726Qj2hgp4Oht7WfcbdivQNPT/pub?gid=1302375309&single=true&output=csv\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"we have 160 samples\n"
"we have 1000 samples\n"
]
}
],
"source": [
"df = pd.read_csv(\"./wisesight-tokenised.csv\")\n",
"df = pd.read_csv(filepath)\n",
"print(\"we have %d samples\" % len(df))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"should_removed = ~df.label.apply(lambda x: len(x.split(\"-\")) > 1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df_filtered = df[should_removed]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"we have 993 after samples\n"
]
}
],
"source": [
"print(\"we have %d after samples\" % len(df_filtered))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -164,12 +210,12 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"with open(filename, \"w\") as ft, open(filename.replace(\".txt\", \".label\"), \"w\") as fl:\n",
" for l in df.tokenised.values:\n",
" for l in df_filtered.tokenised.values:\n",
" l = l.strip()\n",
" ft.write(\"%s\\n\" % l.replace(\"|\", \"\"))\n",
" fl.write(\"%s\\n\" % l)"