{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash /home/azureuser/cloudfiles/code/blobfuse/blobfuse_raadsinformatie.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "\n",
    "# MAKE SURE TO SET-UP PATH -> use local to run with demo data; use azure to run with complete dataset (access required)\n",
    "# Select where to run notebook: \"azure\" or \"local\"\n",
    "my_run = \"local\"\n",
    "\n",
    "if my_run == \"azure\":\n",
    "    import config_azure as cf\n",
    "    running_demo = False\n",
    "elif my_run == \"local\":\n",
    "    import config as cf\n",
    "    running_demo = True\n",
    "\n",
    "\n",
    "import os\n",
    "if my_run == \"azure\":\n",
    "    if not os.path.exists(cf.HUGGING_CACHE):\n",
    "        os.mkdir(cf.HUGGING_CACHE)\n",
    "    os.environ[\"TRANSFORMERS_CACHE\"] = cf.HUGGING_CACHE\n",
    "\n",
    "import pandas as pd\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook overview\n",
    "Goal: manually check some of the classes. especially docs that have very low token count.\n",
    "- Starting point: txtfiles_notcleaned.pkl\n",
    "- Ending point: txtfiles.pkl -> only including 'good' documents\n",
    "\n",
    "We found that there are documents with very low token count. Thus we further checked for each class how clean the dataset was. This led to the complete removal of three classes and the removal of some documents based on their length. \n",
    "\n",
    "In this notebook:\n",
    "1. We will first display the token distribution to highlight the imbalance. \n",
    "2. Then, we will provide the code to remove the faulty documents.\n",
    "3. After, we also remove duplicate documents. \n",
    "4. Then, we remove unneccary columns. We don't need the tokens, token_count, clean_tokens, clean_tokens_count columns anymore, those were used for analysis of the raw data.\n",
    "5. Finally, we will explain the reasons for the removal of the faulty documents, if applicable, for each class. The explanation is put at the end of the notebook since it takes up a lot of cells. \n",
    "\n",
    "Previous notebook: load_txt.ipynb\n",
    "\n",
    "Next notebook: duplicates.ipynb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Check document length\n",
    "We load in the unclean dataset and look into the document length. We can see that some documents are unusually short. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>actualiteit</th>\n",
       "      <td>10.0</td>\n",
       "      <td>535.0</td>\n",
       "      <td>470.433370</td>\n",
       "      <td>108.0</td>\n",
       "      <td>233.75</td>\n",
       "      <td>376.0</td>\n",
       "      <td>633.00</td>\n",
       "      <td>1606.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agenda</th>\n",
       "      <td>10.0</td>\n",
       "      <td>527.1</td>\n",
       "      <td>626.263957</td>\n",
       "      <td>111.0</td>\n",
       "      <td>173.75</td>\n",
       "      <td>240.0</td>\n",
       "      <td>687.75</td>\n",
       "      <td>2157.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>besluit</th>\n",
       "      <td>10.0</td>\n",
       "      <td>2145.7</td>\n",
       "      <td>3939.064160</td>\n",
       "      <td>112.0</td>\n",
       "      <td>246.75</td>\n",
       "      <td>993.0</td>\n",
       "      <td>1873.50</td>\n",
       "      <td>13155.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>brief</th>\n",
       "      <td>10.0</td>\n",
       "      <td>2064.0</td>\n",
       "      <td>2160.666307</td>\n",
       "      <td>287.0</td>\n",
       "      <td>694.00</td>\n",
       "      <td>1136.5</td>\n",
       "      <td>2378.00</td>\n",
       "      <td>6906.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>factsheet</th>\n",
       "      <td>10.0</td>\n",
       "      <td>1642.7</td>\n",
       "      <td>953.800241</td>\n",
       "      <td>679.0</td>\n",
       "      <td>823.75</td>\n",
       "      <td>1481.0</td>\n",
       "      <td>2035.00</td>\n",
       "      <td>3388.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>motie</th>\n",
       "      <td>10.0</td>\n",
       "      <td>268.5</td>\n",
       "      <td>72.873178</td>\n",
       "      <td>190.0</td>\n",
       "      <td>219.25</td>\n",
       "      <td>248.5</td>\n",
       "      <td>293.50</td>\n",
       "      <td>404.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>onderzoeksrapport</th>\n",
       "      <td>10.0</td>\n",
       "      <td>14672.2</td>\n",
       "      <td>8575.697416</td>\n",
       "      <td>880.0</td>\n",
       "      <td>9840.25</td>\n",
       "      <td>13164.0</td>\n",
       "      <td>20732.75</td>\n",
       "      <td>28872.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>raadsadres</th>\n",
       "      <td>10.0</td>\n",
       "      <td>842.0</td>\n",
       "      <td>461.315269</td>\n",
       "      <td>183.0</td>\n",
       "      <td>541.75</td>\n",
       "      <td>750.5</td>\n",
       "      <td>1262.00</td>\n",
       "      <td>1526.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>raadsnotulen</th>\n",
       "      <td>10.0</td>\n",
       "      <td>32888.6</td>\n",
       "      <td>10886.157226</td>\n",
       "      <td>18848.0</td>\n",
       "      <td>24880.25</td>\n",
       "      <td>31943.5</td>\n",
       "      <td>36826.25</td>\n",
       "      <td>53547.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>schriftelijke vraag</th>\n",
       "      <td>10.0</td>\n",
       "      <td>919.1</td>\n",
       "      <td>672.090511</td>\n",
       "      <td>278.0</td>\n",
       "      <td>449.25</td>\n",
       "      <td>717.5</td>\n",
       "      <td>1135.00</td>\n",
       "      <td>2482.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>voordracht</th>\n",
       "      <td>10.0</td>\n",
       "      <td>1045.7</td>\n",
       "      <td>1548.645433</td>\n",
       "      <td>220.0</td>\n",
       "      <td>315.50</td>\n",
       "      <td>420.5</td>\n",
       "      <td>955.75</td>\n",
       "      <td>5319.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     count     mean           std      min       25%      50%  \\\n",
       "label                                                                           \n",
       "actualiteit           10.0    535.0    470.433370    108.0    233.75    376.0   \n",
       "agenda                10.0    527.1    626.263957    111.0    173.75    240.0   \n",
       "besluit               10.0   2145.7   3939.064160    112.0    246.75    993.0   \n",
       "brief                 10.0   2064.0   2160.666307    287.0    694.00   1136.5   \n",
       "factsheet             10.0   1642.7    953.800241    679.0    823.75   1481.0   \n",
       "motie                 10.0    268.5     72.873178    190.0    219.25    248.5   \n",
       "onderzoeksrapport     10.0  14672.2   8575.697416    880.0   9840.25  13164.0   \n",
       "raadsadres            10.0    842.0    461.315269    183.0    541.75    750.5   \n",
       "raadsnotulen          10.0  32888.6  10886.157226  18848.0  24880.25  31943.5   \n",
       "schriftelijke vraag   10.0    919.1    672.090511    278.0    449.25    717.5   \n",
       "voordracht            10.0   1045.7   1548.645433    220.0    315.50    420.5   \n",
       "\n",
       "                          75%      max  \n",
       "label                                   \n",
       "actualiteit            633.00   1606.0  \n",
       "agenda                 687.75   2157.0  \n",
       "besluit               1873.50  13155.0  \n",
       "brief                 2378.00   6906.0  \n",
       "factsheet             2035.00   3388.0  \n",
       "motie                  293.50    404.0  \n",
       "onderzoeksrapport    20732.75  28872.0  \n",
       "raadsadres            1262.00   1526.0  \n",
       "raadsnotulen         36826.25  53547.0  \n",
       "schriftelijke vraag   1135.00   2482.0  \n",
       "voordracht             955.75   5319.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>actualiteit</th>\n",
       "      <td>10.0</td>\n",
       "      <td>272.0</td>\n",
       "      <td>236.156257</td>\n",
       "      <td>51.0</td>\n",
       "      <td>111.75</td>\n",
       "      <td>190.0</td>\n",
       "      <td>362.00</td>\n",
       "      <td>790.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agenda</th>\n",
       "      <td>10.0</td>\n",
       "      <td>293.4</td>\n",
       "      <td>319.327488</td>\n",
       "      <td>72.0</td>\n",
       "      <td>107.50</td>\n",
       "      <td>155.0</td>\n",
       "      <td>385.50</td>\n",
       "      <td>1115.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>besluit</th>\n",
       "      <td>10.0</td>\n",
       "      <td>1071.2</td>\n",
       "      <td>1951.986612</td>\n",
       "      <td>62.0</td>\n",
       "      <td>132.75</td>\n",
       "      <td>497.0</td>\n",
       "      <td>935.25</td>\n",
       "      <td>6527.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>brief</th>\n",
       "      <td>10.0</td>\n",
       "      <td>1010.6</td>\n",
       "      <td>1062.573313</td>\n",
       "      <td>154.0</td>\n",
       "      <td>339.00</td>\n",
       "      <td>555.0</td>\n",
       "      <td>1159.00</td>\n",
       "      <td>3397.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>factsheet</th>\n",
       "      <td>10.0</td>\n",
       "      <td>1006.9</td>\n",
       "      <td>502.997338</td>\n",
       "      <td>467.0</td>\n",
       "      <td>560.50</td>\n",
       "      <td>916.0</td>\n",
       "      <td>1438.25</td>\n",
       "      <td>1778.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>motie</th>\n",
       "      <td>10.0</td>\n",
       "      <td>136.4</td>\n",
       "      <td>37.107052</td>\n",
       "      <td>99.0</td>\n",
       "      <td>111.00</td>\n",
       "      <td>122.5</td>\n",
       "      <td>146.75</td>\n",
       "      <td>203.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>onderzoeksrapport</th>\n",
       "      <td>10.0</td>\n",
       "      <td>7659.5</td>\n",
       "      <td>4403.828745</td>\n",
       "      <td>504.0</td>\n",
       "      <td>5192.75</td>\n",
       "      <td>6929.5</td>\n",
       "      <td>10964.75</td>\n",
       "      <td>14792.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>raadsadres</th>\n",
       "      <td>10.0</td>\n",
       "      <td>410.3</td>\n",
       "      <td>223.746409</td>\n",
       "      <td>79.0</td>\n",
       "      <td>259.50</td>\n",
       "      <td>352.0</td>\n",
       "      <td>602.50</td>\n",
       "      <td>783.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>raadsnotulen</th>\n",
       "      <td>10.0</td>\n",
       "      <td>14854.4</td>\n",
       "      <td>5198.432674</td>\n",
       "      <td>8394.0</td>\n",
       "      <td>11067.00</td>\n",
       "      <td>13942.5</td>\n",
       "      <td>16973.75</td>\n",
       "      <td>24926.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>schriftelijke vraag</th>\n",
       "      <td>10.0</td>\n",
       "      <td>447.2</td>\n",
       "      <td>307.041003</td>\n",
       "      <td>143.0</td>\n",
       "      <td>228.75</td>\n",
       "      <td>357.5</td>\n",
       "      <td>554.25</td>\n",
       "      <td>1143.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>voordracht</th>\n",
       "      <td>10.0</td>\n",
       "      <td>525.6</td>\n",
       "      <td>745.903956</td>\n",
       "      <td>122.0</td>\n",
       "      <td>165.25</td>\n",
       "      <td>222.5</td>\n",
       "      <td>495.25</td>\n",
       "      <td>2580.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     count     mean          std     min       25%      50%  \\\n",
       "label                                                                         \n",
       "actualiteit           10.0    272.0   236.156257    51.0    111.75    190.0   \n",
       "agenda                10.0    293.4   319.327488    72.0    107.50    155.0   \n",
       "besluit               10.0   1071.2  1951.986612    62.0    132.75    497.0   \n",
       "brief                 10.0   1010.6  1062.573313   154.0    339.00    555.0   \n",
       "factsheet             10.0   1006.9   502.997338   467.0    560.50    916.0   \n",
       "motie                 10.0    136.4    37.107052    99.0    111.00    122.5   \n",
       "onderzoeksrapport     10.0   7659.5  4403.828745   504.0   5192.75   6929.5   \n",
       "raadsadres            10.0    410.3   223.746409    79.0    259.50    352.0   \n",
       "raadsnotulen          10.0  14854.4  5198.432674  8394.0  11067.00  13942.5   \n",
       "schriftelijke vraag   10.0    447.2   307.041003   143.0    228.75    357.5   \n",
       "voordracht            10.0    525.6   745.903956   122.0    165.25    222.5   \n",
       "\n",
       "                          75%      max  \n",
       "label                                   \n",
       "actualiteit            362.00    790.0  \n",
       "agenda                 385.50   1115.0  \n",
       "besluit                935.25   6527.0  \n",
       "brief                 1159.00   3397.0  \n",
       "factsheet             1438.25   1778.0  \n",
       "motie                  146.75    203.0  \n",
       "onderzoeksrapport    10964.75  14792.0  \n",
       "raadsadres             602.50    783.0  \n",
       "raadsnotulen         16973.75  24926.0  \n",
       "schriftelijke vraag    554.25   1143.0  \n",
       "voordracht             495.25   2580.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# load data\n",
    "import ast\n",
    "import pandas as pd\n",
    "\n",
    "txtfiles = pd.read_pickle(f\"{cf.output_path}/txtfiles_notcleaned.pkl\")\n",
    "df = txtfiles.copy()\n",
    "\n",
    "display(txtfiles.groupby('label')['token_count'].describe())\n",
    "\n",
    "# clean tokens -> stopwords, punctuation etc. removed. very simple cleaning of tokens applied\n",
    "display(txtfiles.groupby('label')['clean_tokens_count'].describe())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Remove the faulty documents.\n",
    "Below each class is checked, and explained why these documents are removed.\n",
    "\n",
    "The removal includes:\n",
    "- classes: 'Adviesaanvraag', 'Amendement', 'Begroting'\n",
    "- some documents that are gibberish\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original amount of docs: 110\n",
      "Amount of docs after removal: 110\n"
     ]
    }
   ],
   "source": [
    "print(\"Original amount of docs:\", len(txtfiles))\n",
    "# remove Adviesaanvraag, Amendement and Begroting. These classes include docs from other classes.\n",
    "txtfiles = txtfiles.loc[~txtfiles['label'].isin(['Adviesaanvraag', 'Amendement', 'Begroting'])]\n",
    "\n",
    "# remove docs that are too short (only include gibberish)\n",
    "subdf = txtfiles.loc[txtfiles['label']== 'Schriftelijke Vragen']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "subdf = subdf.loc[txtfiles['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "txtfiles = txtfiles.loc[~txtfiles['id'].isin(subdf['id'])]\n",
    "\n",
    "subdf = txtfiles.loc[(txtfiles['label']== 'Brief') & (txtfiles['clean_tokens_count'] < 75)]\n",
    "txtfiles = txtfiles.loc[~txtfiles['id'].isin(subdf['id'])]\n",
    "\n",
    "subdf = txtfiles.loc[(txtfiles['label']== 'Raadsadres') & (txtfiles['clean_tokens_count'] < 25)]\n",
    "txtfiles = txtfiles.loc[~txtfiles['id'].isin(subdf['id'])]\n",
    "\n",
    "# rename verslag column\n",
    "txtfiles.loc[txtfiles['label']=='Verslag','label'] = 'Raadsnotulen'\n",
    "\n",
    "\n",
    "print(\"Amount of docs after removal:\", len(txtfiles))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Remove duplicates\n",
    "\n",
    "goal: check data for duplicates\n",
    "\n",
    "- Use md5 hashing to remove duplicates. \n",
    "- Data is re-split into subsets, using 2-split (train and test), 4-split (train, test, val and dev) and balance-split (train, test and val; used for research)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6 docs removed. New total: 104 docs.\n"
     ]
    }
   ],
   "source": [
    "import hashlib\n",
    "import sys\n",
    "\n",
    "# load function to split data into subsets (train,test,val and dev)\n",
    "sys.path.append('../src/') \n",
    "from data_split import save_split, save_balanced_split\n",
    "\n",
    "def calculate_md5(file_path):\n",
    "    \"\"\"Calculate the MD5 hash of a file.\"\"\"\n",
    "    hash_md5 = hashlib.md5()\n",
    "    with open(file_path, \"rb\") as f:\n",
    "        for chunk in iter(lambda: f.read(4096), b\"\"):\n",
    "            hash_md5.update(chunk)\n",
    "    return hash_md5.hexdigest()\n",
    "\n",
    "\"\"\"Calculate MD5 hash for each file path in a DataFrame.\"\"\"\n",
    "def drop_duplicates(df, file_path_column):\n",
    "    df['md5_hash'] = df[file_path_column].apply(calculate_md5)\n",
    "\n",
    "    # remove rows with duplicate md5_hash\n",
    "    clean_df = df.drop_duplicates(subset=['md5_hash'])\n",
    "    print(f\"{len(df)-len(clean_df)} docs removed. New total: {len(clean_df)} docs.\")\n",
    "    return clean_df\n",
    "\n",
    "\"\"\" After removing duplicates and faulty documents, data needs to be resplit \"\"\"\n",
    "def redo_datasplit(df):\n",
    "    df = save_split(df)\n",
    "    df = save_balanced_split(df, demo=running_demo)\n",
    "    return df\n",
    "\n",
    "hash_df = drop_duplicates(txtfiles, 'path')\n",
    "cleaned_df = redo_datasplit(hash_df)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Remove unnessecary columns & save cleaned df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns before removing: ['label', 'path', 'id', '4split', '2split', 'balanced_split', 'text', 'tokens', 'token_count', 'clean_tokens', 'clean_tokens_count', 'pdf_path', 'num_pages', 'md5_hash']\n",
      "Columns after removing: ['label', 'path', 'id', '4split', '2split', 'balanced_split', 'text', 'pdf_path', 'num_pages', 'md5_hash']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns before removing: {list(cleaned_df.columns)}\")\n",
    "cleaned_df = cleaned_df.drop(columns=['tokens', 'token_count', 'clean_tokens', 'clean_tokens_count'])\n",
    "print(f\"Columns after removing: {list(cleaned_df.columns)}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned_df.to_pickle(f\"{cf.output_path}/txtfiles.pkl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Explanation: Check each class\n",
    "\n",
    "NOTE: this analysis is run on 'df', which is the uncleaned dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Actualiteit**\n",
    "\n",
    "After checking the lowest 1 percent of number of tokens (10 documents which have less than 58 tokens) were checked using the PDFs. These documents are just very short, thus not a mistake of the txt extraction. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "actualiteit = df.loc[df['label']=='Actualiteit']\n",
    "threshold = actualiteit['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = actualiteit[actualiteit['clean_tokens_count'] < threshold]\n",
    "# actualiteit['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Adviesaanvraag**\n",
    "\n",
    "Txt extraction from PDFs went well, except for image. The images result in gibberish in the data. Additionally, it does not seem like there are only adviesaanvragen in there. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "adviesaanvraag = df.loc[df['label']=='Adviesaanvraag']\n",
    "threshold = adviesaanvraag['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = adviesaanvraag[adviesaanvraag['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# adviesaanvraag['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Agenda**\n",
    "\n",
    "Agenda's can just be very short. Txt extraction went well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Agenda']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Amendement**\n",
    "\n",
    "Not only amendement docs are included. txt extraction went well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Amendement']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Begroting**\n",
    "\n",
    "It does not seem like there are only begrotingen. txt went well. except for tables (file:///C:/Users/femke/Documents/MasterThesis/discardfiles/1543430.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Begroting']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Besluit**\n",
    "\n",
    "Looks good"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Besluit']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Brief**\n",
    "\n",
    "Remove docs with less than 75 tokens. These are some weird poster. The rest looks good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Brief']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Factsheets**\n",
    "\n",
    "Lots of posters. Lots of images included"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Factsheets']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.1)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Motie**\n",
    "\n",
    "I cant see the original PDFs. Looks good tho."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Motie']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Onderzoeksrapport**   \n",
    "   \n",
    "Include posters, presentations? (/home/azureuser/cloudfiles/code/blobfuse/raadsinformatie/OpenResearch/onderzoek-in-de-gemeenteraad/onderzoeken-rapporten-2019-gemeenteraad/5g-technische-sessie.pdf.ocr\n",
    ")\n",
    "\n",
    "\n",
    "Includes images."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Onderzoeksrapport']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "lower_5_percentile_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(lower_5_percentile_rows))\n",
    "\n",
    "# for index, row in lower_5_percentile_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Raadsadres**\n",
    "\n",
    "remove less than 25 tokens. looks good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Raadsadres']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "# Rows below the 1st-percentile threshold (quantile 0.01), shortest first\n",
    "low_count_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(low_count_rows))\n",
    "\n",
    "# for index, row in low_count_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Schriftelijke Vragen**\n",
    "\n",
    "Can't open the PDFs. The shortest documents look like a mess; documents in the middle look much better. Should remove documents below the 0.01 quantile of `clean_tokens_count`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']== 'Schriftelijke Vragen']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "# threshold2 = subdf['clean_tokens_count'].quantile(0.02)\n",
    "\n",
    "# Rows below the 1st-percentile threshold (quantile 0.01), shortest first\n",
    "low_count_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# Alternative: rows between the 1st and 2nd percentile (requires threshold2 above)\n",
    "# low_count_rows = subdf[(subdf['clean_tokens_count'] < threshold2) & (subdf['clean_tokens_count']>threshold)].sort_values(by=['clean_tokens_count'])\n",
    "\n",
    "# print(len(low_count_rows))\n",
    "\n",
    "# for index, row in low_count_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Termijnagenda**\n",
    "\n",
    "Includes many tables.\n",
    "Should be combined with the agenda class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Termijnagenda']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "# Rows below the 1st-percentile threshold (quantile 0.01), shortest first\n",
    "low_count_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(low_count_rows))\n",
    "\n",
    "# for index, row in low_count_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Verslag**\n",
    "\n",
    "Should be renamed to raadsnotulen. Looks good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Verslag']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "# Rows below the 1st-percentile threshold (quantile 0.01), shortest first\n",
    "low_count_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(low_count_rows))\n",
    "\n",
    "# for index, row in low_count_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Voordracht**\n",
    "\n",
    "Looks good"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdf = df.loc[df['label']=='Voordracht']\n",
    "threshold = subdf['clean_tokens_count'].quantile(0.01)\n",
    "# Rows below the 1st-percentile threshold (quantile 0.01), shortest first\n",
    "low_count_rows = subdf[subdf['clean_tokens_count'] < threshold].sort_values(by=['clean_tokens_count'])\n",
    "# print(len(low_count_rows))\n",
    "\n",
    "# for index, row in low_count_rows.iterrows():\n",
    "#     print(row['clean_tokens_count'], row['path'])\n",
    "#     print(row['text'])\n",
    "#     print('\\n\\n')\n",
    "\n",
    "# subdf['clean_tokens_count'].describe()\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}