Commit

Merge pull request #145 from pryn-kb/main
additions to summary statistics
pryn-kb authored Sep 2, 2024
2 parents 37c7a17 + 9d6983b commit d027848
Showing 9 changed files with 131 additions and 87 deletions.
Binary file modified .DS_Store
Binary file not shown.
Binary file not shown.
Binary file modified _build/.doctrees/environment.pickle
Binary file not shown.
53 changes: 31 additions & 22 deletions _build/html/_sources/docs/pandas/05_pandas_summary_statistics.ipynb
@@ -18,7 +18,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"id": "641047cc-98b3-42c1-87ee-4333d820e7fd",
"metadata": {
"tags": [
@@ -58,7 +58,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 3,
"id": "18196144-f324-4fc3-b575-7857c1138dbf",
"metadata": {
"editable": true,
@@ -376,16 +376,12 @@
},
{
"cell_type": "markdown",
"id": "dfa1c006-f748-47ac-ab42-fd307a4e0cf1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"id": "6b9be4d7-6a34-4d95-8a37-8c45ca2e4bc0",
"metadata": {},
"source": [
"What is the average age of the Titanic passengers?"
"Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.\n",
"\n",
"For example, we can use `.mean()` to find the average age of the Titanic passengers:"
]
},
{
@@ -401,7 +397,7 @@
{
"data": {
"text/plain": [
"29.69911764705882"
"np.float64(29.69911764705882)"
]
},
"execution_count": 5,
@@ -475,15 +471,15 @@
"tags": []
},
"source": [
"The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [subset data tutorial](03_pandas_subsets)) is calculated for each numeric column."
"The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [Subsets tutorial](03_pandas_subsets)) is calculated for each numeric column."
]
},
{
"cell_type": "markdown",
"id": "18813351-2fb9-4201-b69a-9fdd61adef72",
"metadata": {},
"source": [
"The aggregating statistic can be calculated for multiple columns at the same time. For this we can use the `describe()` function."
"Multiple aggregating statistics can easily be calculated for one or more columns at the same time. For this we can use the `describe()` function."
]
},
{
@@ -729,7 +725,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 16,
"id": "8dc2bb4f-52df-4ae0-900d-b16253cb1900",
"metadata": {
"tags": [
@@ -785,7 +781,7 @@
"male 30.726645"
]
},
"execution_count": 9,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -931,7 +927,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 18,
"id": "e2405337-1719-4950-8f59-1c5f389d008b",
"metadata": {
"tags": [
@@ -948,7 +944,7 @@
"Name: Age, dtype: float64"
]
},
"execution_count": 11,
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
@@ -970,10 +966,23 @@
"id": "18a23998-b0ea-47b2-aaa1-37163124a14b",
"metadata": {},
"source": [
"```{admonition} Categorical data\n",
"````{admonition} Categorical data\n",
":class: seealso dropdown\n",
"\n",
"The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a `Categorical` data type to handle this type of data. More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html)."
"The `Pclass` column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn't make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.\n",
"\n",
"Pandas provides a `Categorical` data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.\n",
"\n",
"To convert the `Pclass` column from numerical to categorical data in Pandas, you can use the following code:\n",
"\n",
"```python\n",
"titanic['Pclass'] = titanic['Pclass'].astype('category')\n",
"```\n",
"\n",
"This will treat `Pclass` as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.\n",
"\n",
"More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html).\n",
"````"
]
},
{
@@ -1145,7 +1154,7 @@
"## Key points\n",
"\n",
"* Aggregation statistics can be calculated on entire columns or rows.\n",
"* `groupby()* provides the power of the **split-apply-combine** pattern.\n",
"* `groupby()` provides the power of the **split-apply-combine** pattern.\n",
"* `value_counts()` is a convenient shortcut to count the number of entries in each category of a variable."
]
},
@@ -1178,7 +1187,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.12.5"
}
},
"nbformat": 4,
20 changes: 14 additions & 6 deletions _build/html/docs/pandas/05_pandas_summary_statistics.html
@@ -32,7 +32,7 @@
<link rel="stylesheet" type="text/css" href="../../_static/styles/sphinx-book-theme.css?v=384b581d" />
<link rel="stylesheet" type="text/css" href="../../_static/togglebutton.css?v=13237357" />
<link rel="stylesheet" type="text/css" href="../../_static/copybutton.css?v=76b2166b" />
<link rel="stylesheet" type="text/css" href="../../_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css" />
<link rel="stylesheet" type="text/css" href="../../_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css?v=be8a1c11" />
<link rel="stylesheet" type="text/css" href="../../_static/sphinx-thebe.css?v=4fa983c6" />
<link rel="stylesheet" type="text/css" href="../../_static/sphinx-design.min.css?v=87e54e7c" />

@@ -683,7 +683,8 @@ <h2>Creating a <code class="docutils literal notranslate"><span class="pre">Data
<section id="aggregating-statistics">
<h2>Aggregating statistics<a class="headerlink" href="#aggregating-statistics" title="Link to this heading">#</a></h2>
<p><img alt="Aggregating statistics from a single coloumn in a DataFrame" src="../../_images/05_aggregate.svg" /></p>
<p>What is the average age of the Titanic passengers?</p>
<p>Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.</p>
<p>For example, we can use <code class="docutils literal notranslate"><span class="pre">.mean()</span></code> to find the average age of the Titanic passengers:</p>
<div class="cell tag_hide-output docutils container">
<div class="cell_input above-output-prompt docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">titanic</span><span class="p">[</span><span class="s2">&quot;Age&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
@@ -725,8 +726,8 @@ <h2>Aggregating statistics<a class="headerlink" href="#aggregating-statistics" t
</div>
</details>
</div>
<p>The statistic applied to multiple columns of a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> (the selection of two columns returns a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, see the <a class="reference internal" href="03_pandas_subsets.html"><span class="doc std std-doc">subset data tutorial</span></a>) is calculated for each numeric column.</p>
<p>The aggregating statistic can be calculated for multiple columns at the same time. For this we can use the <code class="docutils literal notranslate"><span class="pre">describe()</span></code> function.</p>
<p>The statistic applied to multiple columns of a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> (the selection of two columns returns a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, see the <a class="reference internal" href="03_pandas_subsets.html"><span class="doc std std-doc">Subsets tutorial</span></a>) is calculated for each numeric column.</p>
<p>Multiple aggregating statistics can easily be calculated for one or more columns at the same time. For this we can use the <code class="docutils literal notranslate"><span class="pre">describe()</span></code> function.</p>
<div class="cell tag_hide-output docutils container">
<div class="cell_input above-output-prompt docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">titanic</span><span class="p">[[</span><span class="s2">&quot;Age&quot;</span><span class="p">,</span> <span class="s2">&quot;Fare&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
@@ -1049,7 +1050,14 @@ <h2>Aggregating statistics grouped by category<a class="headerlink" href="#aggre
<p><img alt="The split-apply-combine pattern" src="../../_images/05_groupby_select_detail.svg" /></p>
<div class="seealso dropdown admonition">
<p class="admonition-title">Categorical data</p>
<p>The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a <code class="docutils literal notranslate"><span class="pre">Categorical</span></code> data type to handle this type of data. More information is provided in the official Pandas documentation on <a class="reference external" href="https://pandas.pydata.org/docs/user_guide/categorical.html">Categorical data</a>.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">Pclass</span></code> column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn’t make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.</p>
<p>Pandas provides a <code class="docutils literal notranslate"><span class="pre">Categorical</span></code> data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.</p>
<p>To convert the <code class="docutils literal notranslate"><span class="pre">Pclass</span></code> column from numerical to categorical data in Pandas, you can use the following code:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">titanic</span><span class="p">[</span><span class="s1">&#39;Pclass&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">titanic</span><span class="p">[</span><span class="s1">&#39;Pclass&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>This will treat <code class="docutils literal notranslate"><span class="pre">Pclass</span></code> as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.</p>
<p>More information is provided in the official Pandas documentation on <a class="reference external" href="https://pandas.pydata.org/docs/user_guide/categorical.html">Categorical data</a>.</p>
</div>
<p>What is the mean ticket fare price for each of the sex and cabin class combinations?</p>
<div class="cell tag_hide-output docutils container">
@@ -1142,7 +1150,7 @@ <h2>Count number of records by category<a class="headerlink" href="#count-number
<h2>Key points<a class="headerlink" href="#key-points" title="Link to this heading">#</a></h2>
<ul class="simple">
<li><p>Aggregation statistics can be calculated on entire columns or rows.</p></li>
<li><p>`groupby()* provides the power of the <strong>split-apply-combine</strong> pattern.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">groupby()</span></code> provides the power of the <strong>split-apply-combine</strong> pattern.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">value_counts()</span></code> is a convenient shortcut to count the number of entries in each category of a variable.</p></li>
</ul>
</section>
2 changes: 1 addition & 1 deletion _build/html/searchindex.js

Large diffs are not rendered by default.

@@ -376,16 +376,12 @@
},
{
"cell_type": "markdown",
"id": "dfa1c006-f748-47ac-ab42-fd307a4e0cf1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"id": "6b9be4d7-6a34-4d95-8a37-8c45ca2e4bc0",
"metadata": {},
"source": [
"What is the average age of the Titanic passengers?"
"Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.\n",
"\n",
"For example, we can use `.mean()` to find the average age of the Titanic passengers:"
]
},
{
@@ -475,15 +471,15 @@
"tags": []
},
"source": [
"The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [subset data tutorial](03_pandas_subsets)) is calculated for each numeric column."
"The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [Subsets tutorial](03_pandas_subsets)) is calculated for each numeric column."
]
},
{
"cell_type": "markdown",
"id": "18813351-2fb9-4201-b69a-9fdd61adef72",
"metadata": {},
"source": [
"The aggregating statistic can be calculated for multiple columns at the same time. For this we can use the `describe()` function."
"Multiple aggregating statistics can easily be calculated for one or more columns at the same time. For this we can use the `describe()` function."
]
},
{
@@ -970,10 +966,23 @@
"id": "18a23998-b0ea-47b2-aaa1-37163124a14b",
"metadata": {},
"source": [
"```{admonition} Categorical data\n",
"````{admonition} Categorical data\n",
":class: seealso dropdown\n",
"\n",
"The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a `Categorical` data type to handle this type of data. More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html)."
"The `Pclass` column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn't make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.\n",
"\n",
"Pandas provides a `Categorical` data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.\n",
"\n",
"To convert the `Pclass` column from numerical to categorical data in Pandas, you can use the following code:\n",
"\n",
"```python\n",
"titanic['Pclass'] = titanic['Pclass'].astype('category')\n",
"```\n",
"\n",
"This will treat `Pclass` as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.\n",
"\n",
"More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html).\n",
"````"
]
},
{
@@ -1145,7 +1154,7 @@
"## Key points\n",
"\n",
"* Aggregation statistics can be calculated on entire columns or rows.\n",
"* `groupby()* provides the power of the **split-apply-combine** pattern.\n",
"* `groupby()` provides the power of the **split-apply-combine** pattern.\n",
"* `value_counts()` is a convenient shortcut to count the number of entries in each category of a variable."
]
},
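
Note on the change: the `Pclass` conversion added to the "Categorical data" admonition can be tried outside the notebook. The following is a minimal sketch, assuming a small hand-made stand-in for the tutorial's `titanic` DataFrame. The rows are invented for illustration and are not taken from the Titanic dataset, so the numbers will not match the notebook output.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's `titanic` DataFrame (made-up rows).
titanic = pd.DataFrame(
    {
        "Pclass": [1, 3, 2, 3, 1, 3],
        "Sex": ["female", "male", "female", "male", "male", "female"],
        "Age": [38.0, 22.0, 27.0, None, 54.0, 4.0],
    }
)

# Same conversion as in the admonition added by this commit.
titanic["Pclass"] = titanic["Pclass"].astype("category")

# Counting passengers per class treats 1, 2 and 3 as labels, not quantities.
class_counts = titanic["Pclass"].value_counts()

# mean() skips the missing age by default (skipna=True).
mean_age = titanic["Age"].mean()

# Split-apply-combine: mean age for each value of "Sex".
mean_age_by_sex = titanic.groupby("Sex")["Age"].mean()

print(titanic["Pclass"].dtype)
print(class_counts)
print(mean_age)
print(mean_age_by_sex)
```

After the conversion the column's dtype is `category`, which is the behaviour the new admonition text describes.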
0 comments on commit d027848