{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!bash /home/azureuser/cloudfiles/code/blobfuse/blobfuse_raadsinformatie.sh" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"..\")\n", "\n", "# MAKE SURE TO SET UP THE PATH -> use \"local\" to run with demo data; use \"azure\" to run with the complete dataset (access required)\n", "# Select where to run notebook: \"azure\" or \"local\"\n", "my_run = \"local\"\n", "\n", "if my_run == \"azure\":\n", "    import config_azure as cf\n", "    running_demo = False\n", "elif my_run == \"local\":\n", "    import config as cf\n", "    running_demo = True\n", "\n", "\n", "import os\n", "if my_run == \"azure\":\n", "    if not os.path.exists(cf.HUGGING_CACHE):\n", "        os.mkdir(cf.HUGGING_CACHE)\n", "    os.environ[\"TRANSFORMERS_CACHE\"] = cf.HUGGING_CACHE\n", "\n", "import pandas as pd\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook overview\n", "Goal: manually check some of the classes, especially documents with a very low token count.\n", "- Starting point: txtfiles_notcleaned.pkl\n", "- Ending point: txtfiles.pkl -> only includes 'good' documents\n", "\n", "We found documents with a very low token count, so we further checked for each class how clean the dataset was. This led to the complete removal of three classes and the removal of some documents based on their length.\n", "\n", "In this notebook:\n", "1. First, we display the token distribution to highlight the imbalance.\n", "2. Then, we provide the code to remove the faulty documents.\n", "3. Next, we remove duplicate documents.\n", "4. Then, we remove unnecessary columns: the tokens, token_count, clean_tokens and clean_tokens_count columns are no longer needed, as they were only used for analysis of the raw data.\n", "5. 
Finally, we will explain the reasons for the removal of the faulty documents, if applicable, for each class. The explanation is put at the end of the notebook since it takes up a lot of cells. \n", "\n", "Previous notebook: load_txt.ipynb\n", "\n", "Next notebook: duplicates.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Check document length\n", "We load in the unclean dataset and look into the document length. We can see that some documents are unusually short. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>count</th>\n", " <th>mean</th>\n", " <th>std</th>\n", " <th>min</th>\n", " <th>25%</th>\n", " <th>50%</th>\n", " <th>75%</th>\n", " <th>max</th>\n", " </tr>\n", " <tr>\n", " <th>label</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>actualiteit</th>\n", " <td>10.0</td>\n", " <td>535.0</td>\n", " <td>470.433370</td>\n", " <td>108.0</td>\n", " <td>233.75</td>\n", " <td>376.0</td>\n", " <td>633.00</td>\n", " <td>1606.0</td>\n", " </tr>\n", " <tr>\n", " <th>agenda</th>\n", " <td>10.0</td>\n", " <td>527.1</td>\n", " <td>626.263957</td>\n", " <td>111.0</td>\n", " <td>173.75</td>\n", " <td>240.0</td>\n", " <td>687.75</td>\n", " <td>2157.0</td>\n", " </tr>\n", " <tr>\n", " <th>besluit</th>\n", " <td>10.0</td>\n", " <td>2145.7</td>\n", " <td>3939.064160</td>\n", " <td>112.0</td>\n", " <td>246.75</td>\n", " <td>993.0</td>\n", " <td>1873.50</td>\n", " 
<td>13155.0</td>\n", " </tr>\n", " <tr>\n", " <th>brief</th>\n", " <td>10.0</td>\n", " <td>2064.0</td>\n", " <td>2160.666307</td>\n", " <td>287.0</td>\n", " <td>694.00</td>\n", " <td>1136.5</td>\n", " <td>2378.00</td>\n", " <td>6906.0</td>\n", " </tr>\n", " <tr>\n", " <th>factsheet</th>\n", " <td>10.0</td>\n", " <td>1642.7</td>\n", " <td>953.800241</td>\n", " <td>679.0</td>\n", " <td>823.75</td>\n", " <td>1481.0</td>\n", " <td>2035.00</td>\n", " <td>3388.0</td>\n", " </tr>\n", " <tr>\n", " <th>motie</th>\n", " <td>10.0</td>\n", " <td>268.5</td>\n", " <td>72.873178</td>\n", " <td>190.0</td>\n", " <td>219.25</td>\n", " <td>248.5</td>\n", " <td>293.50</td>\n", " <td>404.0</td>\n", " </tr>\n", " <tr>\n", " <th>onderzoeksrapport</th>\n", " <td>10.0</td>\n", " <td>14672.2</td>\n", " <td>8575.697416</td>\n", " <td>880.0</td>\n", " <td>9840.25</td>\n", " <td>13164.0</td>\n", " <td>20732.75</td>\n", " <td>28872.0</td>\n", " </tr>\n", " <tr>\n", " <th>raadsadres</th>\n", " <td>10.0</td>\n", " <td>842.0</td>\n", " <td>461.315269</td>\n", " <td>183.0</td>\n", " <td>541.75</td>\n", " <td>750.5</td>\n", " <td>1262.00</td>\n", " <td>1526.0</td>\n", " </tr>\n", " <tr>\n", " <th>raadsnotulen</th>\n", " <td>10.0</td>\n", " <td>32888.6</td>\n", " <td>10886.157226</td>\n", " <td>18848.0</td>\n", " <td>24880.25</td>\n", " <td>31943.5</td>\n", " <td>36826.25</td>\n", " <td>53547.0</td>\n", " </tr>\n", " <tr>\n", " <th>schriftelijke vraag</th>\n", " <td>10.0</td>\n", " <td>919.1</td>\n", " <td>672.090511</td>\n", " <td>278.0</td>\n", " <td>449.25</td>\n", " <td>717.5</td>\n", " <td>1135.00</td>\n", " <td>2482.0</td>\n", " </tr>\n", " <tr>\n", " <th>voordracht</th>\n", " <td>10.0</td>\n", " <td>1045.7</td>\n", " <td>1548.645433</td>\n", " <td>220.0</td>\n", " <td>315.50</td>\n", " <td>420.5</td>\n", " <td>955.75</td>\n", " <td>5319.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " count mean std min 25% 50% \\\n", "label \n", "actualiteit 10.0 535.0 
470.433370 108.0 233.75 376.0 \n", "agenda 10.0 527.1 626.263957 111.0 173.75 240.0 \n", "besluit 10.0 2145.7 3939.064160 112.0 246.75 993.0 \n", "brief 10.0 2064.0 2160.666307 287.0 694.00 1136.5 \n", "factsheet 10.0 1642.7 953.800241 679.0 823.75 1481.0 \n", "motie 10.0 268.5 72.873178 190.0 219.25 248.5 \n", "onderzoeksrapport 10.0 14672.2 8575.697416 880.0 9840.25 13164.0 \n", "raadsadres 10.0 842.0 461.315269 183.0 541.75 750.5 \n", "raadsnotulen 10.0 32888.6 10886.157226 18848.0 24880.25 31943.5 \n", "schriftelijke vraag 10.0 919.1 672.090511 278.0 449.25 717.5 \n", "voordracht 10.0 1045.7 1548.645433 220.0 315.50 420.5 \n", "\n", " 75% max \n", "label \n", "actualiteit 633.00 1606.0 \n", "agenda 687.75 2157.0 \n", "besluit 1873.50 13155.0 \n", "brief 2378.00 6906.0 \n", "factsheet 2035.00 3388.0 \n", "motie 293.50 404.0 \n", "onderzoeksrapport 20732.75 28872.0 \n", "raadsadres 1262.00 1526.0 \n", "raadsnotulen 36826.25 53547.0 \n", "schriftelijke vraag 1135.00 2482.0 \n", "voordracht 955.75 5319.0 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>count</th>\n", " <th>mean</th>\n", " <th>std</th>\n", " <th>min</th>\n", " <th>25%</th>\n", " <th>50%</th>\n", " <th>75%</th>\n", " <th>max</th>\n", " </tr>\n", " <tr>\n", " <th>label</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>actualiteit</th>\n", " <td>10.0</td>\n", " <td>272.0</td>\n", " <td>236.156257</td>\n", " <td>51.0</td>\n", " <td>111.75</td>\n", 
" <td>190.0</td>\n", " <td>362.00</td>\n", " <td>790.0</td>\n", " </tr>\n", " <tr>\n", " <th>agenda</th>\n", " <td>10.0</td>\n", " <td>293.4</td>\n", " <td>319.327488</td>\n", " <td>72.0</td>\n", " <td>107.50</td>\n", " <td>155.0</td>\n", " <td>385.50</td>\n", " <td>1115.0</td>\n", " </tr>\n", " <tr>\n", " <th>besluit</th>\n", " <td>10.0</td>\n", " <td>1071.2</td>\n", " <td>1951.986612</td>\n", " <td>62.0</td>\n", " <td>132.75</td>\n", " <td>497.0</td>\n", " <td>935.25</td>\n", " <td>6527.0</td>\n", " </tr>\n", " <tr>\n", " <th>brief</th>\n", " <td>10.0</td>\n", " <td>1010.6</td>\n", " <td>1062.573313</td>\n", " <td>154.0</td>\n", " <td>339.00</td>\n", " <td>555.0</td>\n", " <td>1159.00</td>\n", " <td>3397.0</td>\n", " </tr>\n", " <tr>\n", " <th>factsheet</th>\n", " <td>10.0</td>\n", " <td>1006.9</td>\n", " <td>502.997338</td>\n", " <td>467.0</td>\n", " <td>560.50</td>\n", " <td>916.0</td>\n", " <td>1438.25</td>\n", " <td>1778.0</td>\n", " </tr>\n", " <tr>\n", " <th>motie</th>\n", " <td>10.0</td>\n", " <td>136.4</td>\n", " <td>37.107052</td>\n", " <td>99.0</td>\n", " <td>111.00</td>\n", " <td>122.5</td>\n", " <td>146.75</td>\n", " <td>203.0</td>\n", " </tr>\n", " <tr>\n", " <th>onderzoeksrapport</th>\n", " <td>10.0</td>\n", " <td>7659.5</td>\n", " <td>4403.828745</td>\n", " <td>504.0</td>\n", " <td>5192.75</td>\n", " <td>6929.5</td>\n", " <td>10964.75</td>\n", " <td>14792.0</td>\n", " </tr>\n", " <tr>\n", " <th>raadsadres</th>\n", " <td>10.0</td>\n", " <td>410.3</td>\n", " <td>223.746409</td>\n", " <td>79.0</td>\n", " <td>259.50</td>\n", " <td>352.0</td>\n", " <td>602.50</td>\n", " <td>783.0</td>\n", " </tr>\n", " <tr>\n", " <th>raadsnotulen</th>\n", " <td>10.0</td>\n", " <td>14854.4</td>\n", " <td>5198.432674</td>\n", " <td>8394.0</td>\n", " <td>11067.00</td>\n", " <td>13942.5</td>\n", " <td>16973.75</td>\n", " <td>24926.0</td>\n", " </tr>\n", " <tr>\n", " <th>schriftelijke vraag</th>\n", " <td>10.0</td>\n", " <td>447.2</td>\n", " <td>307.041003</td>\n", " 
<td>143.0</td>\n", " <td>228.75</td>\n", " <td>357.5</td>\n", " <td>554.25</td>\n", " <td>1143.0</td>\n", " </tr>\n", " <tr>\n", " <th>voordracht</th>\n", " <td>10.0</td>\n", " <td>525.6</td>\n", " <td>745.903956</td>\n", " <td>122.0</td>\n", " <td>165.25</td>\n", " <td>222.5</td>\n", " <td>495.25</td>\n", " <td>2580.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " count mean std min 25% 50% \\\n", "label \n", "actualiteit 10.0 272.0 236.156257 51.0 111.75 190.0 \n", "agenda 10.0 293.4 319.327488 72.0 107.50 155.0 \n", "besluit 10.0 1071.2 1951.986612 62.0 132.75 497.0 \n", "brief 10.0 1010.6 1062.573313 154.0 339.00 555.0 \n", "factsheet 10.0 1006.9 502.997338 467.0 560.50 916.0 \n", "motie 10.0 136.4 37.107052 99.0 111.00 122.5 \n", "onderzoeksrapport 10.0 7659.5 4403.828745 504.0 5192.75 6929.5 \n", "raadsadres 10.0 410.3 223.746409 79.0 259.50 352.0 \n", "raadsnotulen 10.0 14854.4 5198.432674 8394.0 11067.00 13942.5 \n", "schriftelijke vraag 10.0 447.2 307.041003 143.0 228.75 357.5 \n", "voordracht 10.0 525.6 745.903956 122.0 165.25 222.5 \n", "\n", " 75% max \n", "label \n", "actualiteit 362.00 790.0 \n", "agenda 385.50 1115.0 \n", "besluit 935.25 6527.0 \n", "brief 1159.00 3397.0 \n", "factsheet 1438.25 1778.0 \n", "motie 146.75 203.0 \n", "onderzoeksrapport 10964.75 14792.0 \n", "raadsadres 602.50 783.0 \n", "raadsnotulen 16973.75 24926.0 \n", "schriftelijke vraag 554.25 1143.0 \n", "voordracht 495.25 2580.0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load data\n", "import ast\n", "import pandas as pd\n", "\n", "txtfiles = pd.read_pickle(f\"{cf.output_path}/txtfiles_notcleaned.pkl\")\n", "df = txtfiles.copy()\n", "\n", "display(txtfiles.groupby('label')['token_count'].describe())\n", "\n", "# clean tokens -> stopwords, punctuation etc. removed. 
very simple cleaning of tokens applied\n", "display(txtfiles.groupby('label')['clean_tokens_count'].describe())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Remove the faulty documents.\n", "Below, each class is checked and we explain why certain documents are removed.\n", "\n", "The removal includes:\n", "- classes: 'Adviesaanvraag', 'Amendement', 'Begroting'\n", "- some documents that are gibberish\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original amount of docs: 110\n", "Amount of docs after removal: 110\n" ] } ], "source": [ "print(\"Original amount of docs:\", len(txtfiles))\n", "# remove Adviesaanvraag, Amendement and Begroting. These classes include docs from other classes.\n", "txtfiles = txtfiles.loc[~txtfiles['label'].isin(['Adviesaanvraag', 'Amendement', 'Begroting'])]\n", "\n", "# remove docs that are too short (only contain gibberish)\n", "subdf = txtfiles.loc[txtfiles['label'] == 'Schriftelijke Vragen']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "subdf = subdf.loc[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "txtfiles = txtfiles.loc[~txtfiles['id'].isin(subdf['id'])]\n", "\n", "subdf = txtfiles.loc[(txtfiles['label'] == 'Brief') & (txtfiles['clean_tokens_count'] < 75)]\n", "txtfiles = txtfiles.loc[~txtfiles['id'].isin(subdf['id'])]\n", "\n", "subdf = txtfiles.loc[(txtfiles['label'] == 'Raadsadres') & (txtfiles['clean_tokens_count'] < 25)]\n", "txtfiles = txtfiles.loc[~txtfiles['id'].isin(subdf['id'])]\n", "\n", "# relabel 'Verslag' documents as 'Raadsnotulen'\n", "txtfiles.loc[txtfiles['label']=='Verslag','label'] = 'Raadsnotulen'\n", "\n", "\n", "print(\"Amount of docs after removal:\", len(txtfiles))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Remove duplicates\n", "\n", "Goal: check the data for duplicates.\n", "\n", "- Use MD5 hashing to remove duplicates.
\n", "- Data is re-split into subsets, using 2-split (train and test), 4-split (train, test, val and dev) and balanced-split (train, test and val; used for research)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6 docs removed. New total: 104 docs.\n" ] } ], "source": [ "import hashlib\n", "import sys\n", "\n", "# load functions to split the data into subsets (train, test, val and dev)\n", "sys.path.append('../src/')\n", "from data_split import save_split, save_balanced_split\n", "\n", "def calculate_md5(file_path):\n", "    \"\"\"Calculate the MD5 hash of a file.\"\"\"\n", "    hash_md5 = hashlib.md5()\n", "    with open(file_path, \"rb\") as f:\n", "        for chunk in iter(lambda: f.read(4096), b\"\"):\n", "            hash_md5.update(chunk)\n", "    return hash_md5.hexdigest()\n", "\n", "def drop_duplicates(df, file_path_column):\n", "    \"\"\"Calculate the MD5 hash for each file path in a DataFrame and drop duplicates.\"\"\"\n", "    df['md5_hash'] = df[file_path_column].apply(calculate_md5)\n", "\n", "    # remove rows with a duplicate md5_hash\n", "    clean_df = df.drop_duplicates(subset=['md5_hash'])\n", "    print(f\"{len(df)-len(clean_df)} docs removed. New total: {len(clean_df)} docs.\")\n", "    return clean_df\n", "\n", "def redo_datasplit(df):\n", "    \"\"\"After removing duplicates and faulty documents, the data needs to be re-split.\"\"\"\n", "    df = save_split(df)\n", "    df = save_balanced_split(df, demo=running_demo)\n", "    return df\n", "\n", "hash_df = drop_duplicates(txtfiles, 'path')\n", "cleaned_df = redo_datasplit(hash_df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. 
Remove unnecessary columns & save cleaned df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns before removing: ['label', 'path', 'id', '4split', '2split', 'balanced_split', 'text', 'tokens', 'token_count', 'clean_tokens', 'clean_tokens_count', 'pdf_path', 'num_pages', 'md5_hash']\n", "Columns after removing: ['label', 'path', 'id', '4split', '2split', 'balanced_split', 'text', 'pdf_path', 'num_pages', 'md5_hash']\n" ] } ], "source": [ "print(f\"Columns before removing: {list(cleaned_df.columns)}\")\n", "cleaned_df = cleaned_df.drop(columns=['tokens', 'token_count', 'clean_tokens', 'clean_tokens_count'])\n", "print(f\"Columns after removing: {list(cleaned_df.columns)}\")\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "cleaned_df.to_pickle(f\"{cf.output_path}/txtfiles.pkl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Explanation: Check each class\n", "\n", "NOTE: this analysis is run on 'df', which is the uncleaned dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Actualiteit**\n", "\n", "The lowest 1 percent by token count (10 documents with fewer than 58 tokens) were checked against the original PDFs. These documents are simply very short, so this is not a mistake in the txt extraction. " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "actualiteit = df.loc[df['label']=='Actualiteit']\n", "threshold = actualiteit['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = actualiteit[actualiteit['clean_tokens_count'] < threshold]\n", "# actualiteit['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Adviesaanvraag**\n", "\n", "Txt extraction from PDFs went well, except for images. The images result in gibberish in the data. 
Additionally, this class does not seem to contain only adviesaanvragen." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "adviesaanvraag = df.loc[df['label']=='Adviesaanvraag']\n", "threshold = adviesaanvraag['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = adviesaanvraag[adviesaanvraag['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# adviesaanvraag['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Agenda**\n", "\n", "Agendas can simply be very short. Txt extraction went well." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Agenda']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Amendement**\n", "\n", "This class does not contain only amendement documents. Txt extraction went well." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Amendement']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Begroting**\n", "\n", "This class does not seem to contain only begrotingen. Txt extraction went well, except for tables (e.g. discardfiles/1543430.pdf)." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Begroting']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Besluit**\n", "\n", "Looks good." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Besluit']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     
print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Brief**\n", "\n", "Remove documents with fewer than 75 tokens; these are odd posters. The rest looks good." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Brief']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Factsheets**\n", "\n", "Contains many posters and many images." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Factsheets']\n", "threshold = subdf['clean_tokens_count'].quantile(0.1)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Motie**\n", "\n", "I can't view the original PDFs, but the extracted text looks good." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Motie']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Onderzoeksrapport**\n", "\n", "Includes posters and possibly presentations (e.g. /home/azureuser/cloudfiles/code/blobfuse/raadsinformatie/OpenResearch/onderzoek-in-de-gemeenteraad/onderzoeken-rapporten-2019-gemeenteraad/5g-technische-sessie.pdf.ocr). Also includes images." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Onderzoeksrapport']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Raadsadres**\n", "\n", "Remove documents with fewer than 25 tokens. The rest looks good." 
] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Raadsadres']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Schriftelijke Vragen**\n", "\n", "Can't open the PDFs. The shortest documents look like a mess; the middle range looks much better. We should remove the lowest 1 percent by clean_tokens_count." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']== 'Schriftelijke Vragen']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "# threshold2 = subdf['clean_tokens_count'].quantile(0.02)\n", "\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# lower_5_percentile_rows = subdf[(subdf['clean_tokens_count'] < threshold2) & (subdf['clean_tokens_count']>threshold)].sort_values(by=['clean_tokens_count'])\n", "\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Termijnagenda**\n", "\n", "Includes many tables.\n", "Should be combined with Agenda." 
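] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merging Termijnagenda into Agenda is not performed in this notebook. If it were, a relabel analogous to the Verslag rename in section 2 would be enough (a hypothetical sketch, assuming the same df and its label column):\n", "\n", "```python\n", "# hypothetical sketch: relabel Termijnagenda documents as Agenda\n", "df.loc[df['label'] == 'Termijnagenda', 'label'] = 'Agenda'\n", "```" 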
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Termijnagenda']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Verslag**\n", "\n", "Should be renamed to Raadsnotulen. Looks good." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Verslag']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Voordracht**\n", "\n", "Looks good." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "subdf = df.loc[df['label']=='Voordracht']\n", "threshold = subdf['clean_tokens_count'].quantile(0.01)\n", "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n", "# print(len(lower_5_percentile_rows))\n", "\n", "# for index, row in lower_5_percentile_rows.iterrows():\n", "#     print(row['clean_tokens_count'], row['path'])\n", "#     print(row['text'])\n", "#     print('\\n\\n')\n", "\n", "# subdf['clean_tokens_count'].describe()\n" ] } ], "metadata": { 
"kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 2 }