<!DOCTYPE html>
<html>
<head>
<title>README.md</title>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">

<style>
/* https://github.com/microsoft/vscode/blob/master/extensions/markdown-language-features/media/markdown.css */
/*---------------------------------------------------------------------------------------------
 *  Copyright (c) Microsoft Corporation. All rights reserved.
 *  Licensed under the MIT License. See License.txt in the project root for license information.
 *--------------------------------------------------------------------------------------------*/

body {
	font-family: var(--vscode-markdown-font-family, -apple-system, BlinkMacSystemFont, "Segoe WPC", "Segoe UI", "Ubuntu", "Droid Sans", sans-serif);
	font-size: var(--vscode-markdown-font-size, 14px);
	padding: 0 26px;
	line-height: var(--vscode-markdown-line-height, 22px);
	word-wrap: break-word;
}

#code-csp-warning {
	position: fixed;
	top: 0;
	right: 0;
	color: white;
	margin: 16px;
	text-align: center;
	font-size: 12px;
	font-family: sans-serif;
	background-color:#444444;
	cursor: pointer;
	padding: 6px;
	box-shadow: 1px 1px 1px rgba(0,0,0,.25);
}

#code-csp-warning:hover {
	text-decoration: none;
	background-color:#007acc;
	box-shadow: 2px 2px 2px rgba(0,0,0,.25);
}

body.scrollBeyondLastLine {
	margin-bottom: calc(100vh - 22px);
}

body.showEditorSelection .code-line {
	position: relative;
}

body.showEditorSelection .code-active-line:before,
body.showEditorSelection .code-line:hover:before {
	content: "";
	display: block;
	position: absolute;
	top: 0;
	left: -12px;
	height: 100%;
}

body.showEditorSelection li.code-active-line:before,
body.showEditorSelection li.code-line:hover:before {
	left: -30px;
}

.vscode-light.showEditorSelection .code-active-line:before {
	border-left: 3px solid rgba(0, 0, 0, 0.15);
}

.vscode-light.showEditorSelection .code-line:hover:before {
	border-left: 3px solid rgba(0, 0, 0, 0.40);
}

.vscode-light.showEditorSelection .code-line .code-line:hover:before {
	border-left: none;
}

.vscode-dark.showEditorSelection .code-active-line:before {
	border-left: 3px solid rgba(255, 255, 255, 0.4);
}

.vscode-dark.showEditorSelection .code-line:hover:before {
	border-left: 3px solid rgba(255, 255, 255, 0.60);
}

.vscode-dark.showEditorSelection .code-line .code-line:hover:before {
	border-left: none;
}

.vscode-high-contrast.showEditorSelection .code-active-line:before {
	border-left: 3px solid rgba(255, 160, 0, 0.7);
}

.vscode-high-contrast.showEditorSelection .code-line:hover:before {
	border-left: 3px solid rgba(255, 160, 0, 1);
}

.vscode-high-contrast.showEditorSelection .code-line .code-line:hover:before {
	border-left: none;
}

img {
	max-width: 100%;
	max-height: 100%;
}

a {
	text-decoration: none;
}

a:hover {
	text-decoration: underline;
}

a:focus,
input:focus,
select:focus,
textarea:focus {
	outline: 1px solid -webkit-focus-ring-color;
	outline-offset: -1px;
}

hr {
	border: 0;
	height: 2px;
	border-bottom: 2px solid;
}

h1 {
	padding-bottom: 0.3em;
	line-height: 1.2;
	border-bottom-width: 1px;
	border-bottom-style: solid;
}

h1, h2, h3 {
	font-weight: normal;
}

table {
	border-collapse: collapse;
}

table > thead > tr > th {
	text-align: left;
	border-bottom: 1px solid;
}

table > thead > tr > th,
table > thead > tr > td,
table > tbody > tr > th,
table > tbody > tr > td {
	padding: 5px 10px;
}

table > tbody > tr + tr > td {
	border-top: 1px solid;
}

blockquote {
	margin: 0 7px 0 5px;
	padding: 0 16px 0 10px;
	border-left-width: 5px;
	border-left-style: solid;
}

code {
	font-family: Menlo, Monaco, Consolas, "Droid Sans Mono", "Courier New", monospace, "Droid Sans Fallback";
	font-size: 1em;
	line-height: 1.357em;
}

body.wordWrap pre {
	white-space: pre-wrap;
}

pre:not(.hljs),
pre.hljs code > div {
	padding: 16px;
	border-radius: 3px;
	overflow: auto;
}

pre code {
	color: var(--vscode-editor-foreground);
	tab-size: 4;
}

/** Theming */

.vscode-light pre {
	background-color: rgba(220, 220, 220, 0.4);
}

.vscode-dark pre {
	background-color: rgba(10, 10, 10, 0.4);
}

.vscode-high-contrast pre {
	background-color: rgb(0, 0, 0);
}

.vscode-high-contrast h1 {
	border-color: rgb(0, 0, 0);
}

.vscode-light table > thead > tr > th {
	border-color: rgba(0, 0, 0, 0.69);
}

.vscode-dark table > thead > tr > th {
	border-color: rgba(255, 255, 255, 0.69);
}

.vscode-light h1,
.vscode-light hr,
.vscode-light table > tbody > tr + tr > td {
	border-color: rgba(0, 0, 0, 0.18);
}

.vscode-dark h1,
.vscode-dark hr,
.vscode-dark table > tbody > tr + tr > td {
	border-color: rgba(255, 255, 255, 0.18);
}

</style>

<style>
/* Tomorrow Theme */
/* http://jmblog.github.com/color-themes-for-google-code-highlightjs */
/* Original theme - https://github.com/chriskempson/tomorrow-theme */

/* Tomorrow Comment */
.hljs-comment,
.hljs-quote {
	color: #8e908c;
}

/* Tomorrow Red */
.hljs-variable,
.hljs-template-variable,
.hljs-tag,
.hljs-name,
.hljs-selector-id,
.hljs-selector-class,
.hljs-regexp,
.hljs-deletion {
	color: #c82829;
}

/* Tomorrow Orange */
.hljs-number,
.hljs-built_in,
.hljs-builtin-name,
.hljs-literal,
.hljs-type,
.hljs-params,
.hljs-meta,
.hljs-link {
	color: #f5871f;
}

/* Tomorrow Yellow */
.hljs-attribute {
	color: #eab700;
}

/* Tomorrow Green */
.hljs-string,
.hljs-symbol,
.hljs-bullet,
.hljs-addition {
	color: #718c00;
}

/* Tomorrow Blue */
.hljs-title,
.hljs-section {
	color: #4271ae;
}

/* Tomorrow Purple */
.hljs-keyword,
.hljs-selector-tag {
	color: #8959a8;
}

.hljs {
	display: block;
	overflow-x: auto;
	color: #4d4d4c;
	padding: 0.5em;
}

.hljs-emphasis {
	font-style: italic;
}

.hljs-strong {
	font-weight: bold;
}
</style>

<style>
/*
 * Markdown PDF CSS
 */

 body {
	font-family: -apple-system, BlinkMacSystemFont, "Segoe WPC", "Segoe UI", "Ubuntu", "Droid Sans", sans-serif, "Meiryo";
	padding: 0 12px;
}

pre {
	background-color: #f8f8f8;
	border: 1px solid #cccccc;
	border-radius: 3px;
	overflow-x: auto;
	white-space: pre-wrap;
	overflow-wrap: break-word;
}

pre:not(.hljs) {
	padding: 23px;
	line-height: 19px;
}

blockquote {
	background: rgba(127, 127, 127, 0.1);
	border-color: rgba(0, 122, 204, 0.5);
}

.emoji {
	height: 1.4em;
}

code {
	font-size: 14px;
	line-height: 19px;
}

/* for inline code */
:not(pre):not(.hljs) > code {
	color: #C9AE75; /* Change the old color so it seems less like an error */
	font-size: inherit;
}

/* Page Break : use <div class="page"/> to insert page break
-------------------------------------------------------- */
.page {
	page-break-after: always;
}

</style>

<script src="https://unpkg.com/mermaid/dist/mermaid.min.js"></script>
</head>
<body>
  <script>
    mermaid.initialize({
      startOnLoad: true,
      theme: document.body.classList.contains('vscode-dark') || document.body.classList.contains('vscode-high-contrast')
          ? 'dark'
          : 'default'
    });
  </script>
<h1 style="background-color:tomato;">Background of the Data</h1>
<ul>
<li>Data Source: https://catalog.data.gov/dataset/consumer-complaint-database</li>
</ul>
<p>The dataset is public data from the data.gov website, published under the dataset name <code>consumer-complaint-database</code>. The Consumer Complaint Database is a collection of complaints about consumer financial products and services that the Consumer Financial Protection Bureau (CFPB) sends to companies for response. The database is generally updated daily, so a fresh download may be larger than an earlier one.</p>
<p>The dataset has more than 1 million rows and 18 columns. For text classification, we are only interested in two of them: <code>Product</code> and <code>Consumer complaint narrative</code>.</p>
<h1 style="background-color:tomato;">Business Problem</h1>
<p>This project aims to accurately classify the Product category of each complaint. There are more than 10 product categories, such as <code>Mortgage</code> and <code>Debt collection</code>. Our aim is to read the complaint text and classify it as one of these categories.</p>
<p style="color:green;">NOTE</p>
<p>The original database has more than 10 categories, but some of them are
ambiguous. For example, there are three different categories <code>Credit card</code>, <code>Prepaid card</code>, and <code>Credit card or prepaid card</code>. If we have a complaint about a credit card, should it be classified as <code>Credit card</code> or as <code>Credit card or prepaid card</code>? To avoid this problem, the ambiguous categories are merged into a single category, which leaves 10 distinct categories.
For a sample of 2,000 complaints, the category distribution looks like this:
<img src="images/labels.png" alt=""></p>
<h1 style="background-color:tomato;">Text Data Cleaning</h1>
<p>Written text is usually full of informal language and needs cleaning before we
analyze it. For example, we need to remove stop words and expand contractions.
Data cleaning strategy:</p>
<ol>
<li>split combined text: <code>areYou</code> ==&gt; <code>are You</code></li>
<li>lowercase: <code>You</code> ==&gt; <code>you</code></li>
<li>expand contractions: <code>you're</code> ==&gt; <code>you are</code></li>
<li>remove punctuation: <code>hi !</code> ==&gt; <code>hi</code></li>
<li>remove digits: <code>gr8</code> ==&gt; <code>gr</code></li>
<li>remove repeated substring: <code>ha ha</code> ==&gt; <code>ha</code></li>
<li>remove stop words: <code>I am good</code> ==&gt; <code>good</code></li>
<li>lemmatize: <code>apples</code> ==&gt; <code>apple</code></li>
</ol>
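<p>A minimal Python sketch of steps 1 to 7 above, using only the standard library. The <code>CONTRACTIONS</code> and <code>STOP_WORDS</code> lookups and the <code>clean_text</code> name are illustrative assumptions, not the project's actual code. Lemmatization (step 8) is left out because it needs an external library such as NLTK or spaCy.</p>

```python
import re
import string

# Tiny illustrative lookups (assumptions); a real project would use fuller lists.
CONTRACTIONS = {"you're": "you are", "don't": "do not", "i'm": "i am"}
STOP_WORDS = {"i", "am", "you", "are", "the", "a", "is", "do", "not"}


def clean_text(text):
    """Apply cleaning steps 1 to 7 in order and return the cleaned string."""
    # 1. split combined words: "areYou" becomes "are You"
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    # 2. lowercase
    text = text.lower()
    # 3. expand contractions
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 4. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. remove digits
    text = re.sub(r"\d+", "", text)
    # 6. collapse immediately repeated words: "ha ha" becomes "ha"
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)
    # 7. remove stop words (splitting also normalizes whitespace)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)
```

<p>For example, <code>clean_text("I am good")</code> returns <code>good</code> and <code>clean_text("gr8")</code> returns <code>gr</code>. The order matters: contractions are expanded before punctuation is removed, otherwise the apostrophes would already be gone.</p>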
<h1 style="background-color:tomato;">TF-IDF</h1>
<p>For text processing (NLP) tasks, a common way to weight words is <code>Term Frequency - Inverse Document Frequency</code> (TF-IDF).</p>
<p style="color:green;">Term Frequency:</p>
<p>This gives how often a given word appears within a document.</p>
<pre class="hljs"><code><div>     Number of times the term appears in the doc
TF = ---------------------------------------------
          Total number of words in the doc
</div></code></pre>
<p style="color:green;">Inverse Document Frequency:</p>
<p>This measures how rare the word is across the documents.
If a term is very common among documents (e.g., <code>the</code>, <code>a</code>, <code>is</code>),
then it has a low IDF score.</p>
<pre class="hljs"><code><div>     Number of docs in which the term appears
DF = ----------------------------------------
        Total number of docs in the corpus

Conventionally, the inverse document frequency (IDF) is defined as the log of the inverse ratio,

              Total number of docs in the corpus
IDF = ln ( ------------------------------------------ )
            Number of docs in which the term appears
</div></code></pre>
<p style="color:green;">Term Frequency – Inverse Document Frequency TF-IDF:</p>
<p>TF-IDF is the product of the TF and IDF scores of the term.</p>
<pre class="hljs"><code><div>TF-IDF = TF * IDF
</div></code></pre>
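<p>As a rough worked example, the scores can be computed by hand in plain Python, taking IDF as the natural log of the total number of docs divided by the number of docs containing the term, and TF-IDF as the product of TF and IDF. The three tokenized documents and the function names are made up for illustration, and the sketch assumes the term appears in at least one document.</p>

```python
import math

# Three made-up, already tokenized documents
docs = [
    ["credit", "card", "fee"],
    ["mortgage", "payment", "late", "fee"],
    ["credit", "report", "error"],
]


def tf(term, doc):
    """Share of the words in one document that are the given term."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (total docs / docs containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    """Product of the TF and IDF scores."""
    return tf(term, doc) * idf(term, docs)


# "fee" appears in 2 of 3 docs, "mortgage" in only 1, so "mortgage" scores higher
fee_score = tf_idf("fee", docs[1], docs)
mortgage_score = tf_idf("mortgage", docs[1], docs)
```

<p>Note that scikit-learn's <code>TfidfVectorizer</code> uses a smoothed IDF by default, so its values differ somewhat from this textbook form.</p>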
<h1 style="background-color:tomato;">Top N correlated terms per category</h1>
<p>We can use the scikit-learn text vectorizer class <code>sklearn.feature_extraction.text.TfidfVectorizer</code> to get
the vectorized form of the text data. Then, using feature selection (<code>sklearn.feature_selection.chi2</code>), we get
the following top unigrams and bigrams for each category:
<img src="images/top_correlated_terms.png" alt=""></p>
<h1 style="background-color:tomato;">Modelling Text data</h1>
<p>We cannot use raw text as the input for <code>scikit-learn</code> classifiers.
We first need to vectorize it, converting the words to numbers. In this
project I used the Tf-idf vectorizer with an n-gram range of (1,2) and tried various
classifiers. Among them, <code>svm.LinearSVC</code> gave the best accuracy.
For a sample of 2,000 complaints from the 2019 data (random seed 100), I got an
accuracy of 0.8125 on the test data. For the full 2019 data (124,907 complaints),
after an 80%-20% train-test split, I got an accuracy of 0.8068.</p>
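<p>The setup described above might look roughly like the sketch below: a Tf-idf vectorizer with unigrams and bigrams feeding a <code>LinearSVC</code>, evaluated on a 20% hold-out split. The ten complaints are made up, so the resulting accuracy says nothing about the real data. All other settings are left at scikit-learn defaults, which is an assumption rather than the project's configuration.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up complaints, for illustration only
texts = [
    "my mortgage payment was applied late",
    "the lender lost my mortgage escrow documents",
    "mortgage servicer refused my loan modification",
    "foreclosure started despite my mortgage payments",
    "escrow account for my home loan was miscalculated",
    "credit card charged an annual fee twice",
    "unauthorized purchase on my credit card statement",
    "credit card company raised my interest rate",
    "my card was declined and the bank charged a fee",
    "rewards points vanished from my credit card account",
]
labels = ["Mortgage"] * 5 + ["Credit card"] * 5

# 80%-20% train-test split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=100
)

# Vectorizer and classifier chained so raw text goes in and labels come out
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(random_state=100),
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

<p>Wrapping both steps in one pipeline means the vectorizer is fitted on the training split only, so no vocabulary leaks in from the test split.</p>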
<h1 style="background-color:tomato;">Model Evaluation</h1>
<p><img src="images/classification_report.png" alt="">
<img src="images/confusion_matrix.png" alt="">
<img src="images/roc_auc.png" alt="">
<img src="images/precision_recall.png" alt="">
<img src="images/class_prediction_error.png" alt=""></p>
<h1 style="background-color:tomato;">Big Data Analysis</h1>
<p>So far we have used only a small portion of the data (2,000 samples out of more than a million) and used <code>scikit-learn</code> models for the text analysis. For real-world data, we may need to use all the data for better performance.</p>
<p>For large data, pandas can run out of memory, and we need to look for alternatives such as Amazon AWS or IBM Watson. We can also use open source modules such as <code>dask</code> or <code>pyspark</code>, which can scale to multiple gigabytes of data. For this project, I have used both <code>pyspark</code> and Amazon AWS servers.</p>
<p style="color:green;">NOTE:</p>
<p><code>Pyspark</code> is less convenient than pandas for some tasks. It is a Python interface to Spark, which is written in Scala, and some functionality still needs to be implemented. For example, to read the <code>complaints.csv</code> file with pandas we can simply use <code>pd.read_csv</code>, but the Spark CSV reader does not by default handle records that span multiple lines. To get around this, we can use the Spark read options <code>multiLine=True, escape='&quot;'</code>.</p>
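<p>A small sketch of how those read options could be applied. The <code>read_complaints</code> helper is hypothetical, and it assumes the caller passes in an existing <code>SparkSession</code> and that the file has a header row.</p>

```python
def read_complaints(spark, path="complaints.csv"):
    """Read the complaints CSV, allowing quoted fields that span several lines.

    `spark` is an existing SparkSession created by the caller, for example
    with SparkSession.builder.getOrCreate() from pyspark.sql.
    """
    return (
        spark.read
        .option("header", True)      # assumption: first row holds column names
        .option("multiLine", True)   # a quoted narrative may contain newlines
        .option("escape", '"')       # quotes inside a field are escaped by doubling
        .csv(path)
    )
```

<p>Passing the session in keeps the helper free of global state, so the same function can be used in a local session or on a cluster.</p>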
<h1 style="background-color:tomato;">Modelling Pipeline</h1>
<p>For text data processing with <code>pyspark</code>, I used the following pipeline stages:</p>
<pre class="hljs"><code><div><span class="hljs-keyword">from</span> pyspark.ml.feature <span class="hljs-keyword">import</span> Tokenizer, StopWordsRemover, HashingTF, IDF

<span class="hljs-comment"># lowercase each complaint and split it into words</span>
tokenizer = Tokenizer().setInputCol(<span class="hljs-string">"complaint"</span>).setOutputCol(<span class="hljs-string">"words"</span>)
<span class="hljs-comment"># drop stop words, ignoring case</span>
remover = StopWordsRemover().setInputCol(<span class="hljs-string">"words"</span>).setOutputCol(<span class="hljs-string">"filtered"</span>).setCaseSensitive(<span class="hljs-literal">False</span>)
<span class="hljs-comment"># hash the remaining words into a 1000-dimensional term-frequency vector</span>
hashingTF = HashingTF().setNumFeatures(<span class="hljs-number">1000</span>).setInputCol(<span class="hljs-string">"filtered"</span>).setOutputCol(<span class="hljs-string">"rawFeatures"</span>)
<span class="hljs-comment"># rescale the term frequencies by inverse document frequency</span>
idf = IDF().setInputCol(<span class="hljs-string">"rawFeatures"</span>).setOutputCol(<span class="hljs-string">"features"</span>).setMinDocFreq(<span class="hljs-number">0</span>)
</div></code></pre>

</body>
</html>