Merge pull request #145 from pryn-kb/main

additions to summary statistics
KUBDatalab · Sep 2, 2024 · d027848
2 parents 37c7a17 + 9d6983b
Showing 9 changed files with 131 additions and 87 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/_build/.doctrees/docs/pandas/05_pandas_summary_statistics.doctree b/_build/.doctrees/docs/pandas/05_pandas_summary_statistics.doctree
diff --git a/_build/.doctrees/environment.pickle b/_build/.doctrees/environment.pickle
diff --git a/_build/html/_sources/docs/pandas/05_pandas_summary_statistics.ipynb b/_build/html/_sources/docs/pandas/05_pandas_summary_statistics.ipynb
@@ -18,7 +18,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 2,
    "id": "641047cc-98b3-42c1-87ee-4333d820e7fd",
    "metadata": {
     "tags": [
@@ -58,7 +58,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 3,
    "id": "18196144-f324-4fc3-b575-7857c1138dbf",
    "metadata": {
     "editable": true,
@@ -376,16 +376,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dfa1c006-f748-47ac-ab42-fd307a4e0cf1",
-   "metadata": {
-    "editable": true,
-    "slideshow": {
-     "slide_type": ""
-    },
-    "tags": []
-   },
+   "id": "6b9be4d7-6a34-4d95-8a37-8c45ca2e4bc0",
+   "metadata": {},
    "source": [
-    "What is the average age of the Titanic passengers?"
+    "Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.\n",
+    "\n",
+    "For example, we can use `.mean()` to find the average age of the Titanic passengers:"
    ]
   },
   {
@@ -401,7 +397,7 @@
     {
      "data": {
       "text/plain": [
-       "29.69911764705882"
+       "np.float64(29.69911764705882)"
       ]
      },
      "execution_count": 5,
@@ -475,15 +471,15 @@
     "tags": []
    },
    "source": [
-    "The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [subset data tutorial](03_pandas_subsets)) is calculated for each numeric column."
+    "The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [Subsets tutorial](03_pandas_subsets)) is calculated for each numeric column."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "18813351-2fb9-4201-b69a-9fdd61adef72",
    "metadata": {},
    "source": [
-    "The aggregating statistic can be calculated for multiple columns at the same time. For this we can use the `describe()` function."
+    "Multiple aggregating statistics can easily be calculated for one or more columns at the same time. For this we can use the `describe()` function."
    ]
   },
   {
@@ -729,7 +725,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 16,
    "id": "8dc2bb4f-52df-4ae0-900d-b16253cb1900",
    "metadata": {
     "tags": [
@@ -785,7 +781,7 @@
        "male    30.726645"
       ]
      },
-     "execution_count": 9,
+     "execution_count": 16,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -931,7 +927,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 18,
    "id": "e2405337-1719-4950-8f59-1c5f389d008b",
    "metadata": {
     "tags": [
@@ -948,7 +944,7 @@
        "Name: Age, dtype: float64"
       ]
      },
-     "execution_count": 11,
+     "execution_count": 18,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -970,10 +966,23 @@
    "id": "18a23998-b0ea-47b2-aaa1-37163124a14b",
    "metadata": {},
    "source": [
-    "```{admonition} Categorical data\n",
+    "````{admonition} Categorical data\n",
     ":class: seealso dropdown\n",
     "\n",
-    "The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a `Categorical` data type to handle this type of data. More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html)."
+    "The `Pclass` column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn't make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.\n",
+    "\n",
+    "Pandas provides a `Categorical` data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.\n",
+    "\n",
+    "To convert the `Pclass` column from numerical to categorical data in Pandas, you can use the following code:\n",
+    "\n",
+    "```python\n",
+    "titanic['Pclass'] = titanic['Pclass'].astype('category')\n",
+    "```\n",
+    "\n",
+    "This will treat `Pclass` as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.\n",
+    "\n",
+    "More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html).\n",
+    "````"
    ]
   },
   {
@@ -1145,7 +1154,7 @@
     "## Key points\n",
     "\n",
     "* Aggregation statistics can be calculated on entire columns or rows.\n",
-    "* `groupby()* provides the power of the **split-apply-combine** pattern.\n",
+    "* `groupby()` provides the power of the **split-apply-combine** pattern.\n",
     "* `value_counts()` is a convenient shortcut to count the number of entries in each category of a variable."
    ]
   },
@@ -1178,7 +1187,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.8"
+   "version": "3.12.5"
   }
  },
  "nbformat": 4,
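
Note on the new `Categorical data` admonition: the conversion it describes can be tried in isolation. The sketch below is a minimal illustration, not part of the notebook. It assumes a small hypothetical stand-in for the `titanic` DataFrame (the values are made up), and it uses an ordered `pd.CategoricalDtype` rather than the plain `astype('category')` call shown in the diff, so that the 1st < 2nd < 3rd ranking is kept.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's `titanic` DataFrame (made-up values).
titanic = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2, 1, 3],
        "Fare": [7.25, 71.28, 7.93, 13.0, 53.1, 8.05],
    }
)

# Assumption: an ordered categorical, so the class ranking is preserved
# as an order rather than as a quantity.
pclass_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
titanic["Pclass"] = titanic["Pclass"].astype(pclass_type)

# Counting and grouping still work and follow the category order.
counts = titanic["Pclass"].value_counts().sort_index()
mean_fare = titanic.groupby("Pclass", observed=True)["Fare"].mean()

# Arithmetic reductions such as mean() are rejected for categorical data.
try:
    titanic["Pclass"].mean()
    mean_allowed = True
except TypeError:
    mean_allowed = False

print(counts)
print(mean_fare)
print("mean() allowed on Pclass:", mean_allowed)
```

With the plain `astype('category')` form from the admonition, counting and grouping behave the same way; only the explicit ordering is lost.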

diff --git a/_build/html/docs/pandas/05_pandas_summary_statistics.html b/_build/html/docs/pandas/05_pandas_summary_statistics.html
@@ -32,7 +32,7 @@
     <link rel="stylesheet" type="text/css" href="../../_static/styles/sphinx-book-theme.css?v=384b581d" />
     <link rel="stylesheet" type="text/css" href="../../_static/togglebutton.css?v=13237357" />
     <link rel="stylesheet" type="text/css" href="../../_static/copybutton.css?v=76b2166b" />
-    <link rel="stylesheet" type="text/css" href="../../_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css" />
+    <link rel="stylesheet" type="text/css" href="../../_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css?v=be8a1c11" />
     <link rel="stylesheet" type="text/css" href="../../_static/sphinx-thebe.css?v=4fa983c6" />
     <link rel="stylesheet" type="text/css" href="../../_static/sphinx-design.min.css?v=87e54e7c" />
 
@@ -683,7 +683,8 @@ <h2>Creating a <code class="docutils literal notranslate"><span class="pre">Data
 <section id="aggregating-statistics">
 <h2>Aggregating statistics<a class="headerlink" href="#aggregating-statistics" title="Link to this heading">#</a></h2>
 <p><img alt="Aggregating statistics from a single column in a DataFrame" src="../../_images/05_aggregate.svg" /></p>
-<p>What is the average age of the Titanic passengers?</p>
+<p>Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.</p>
+<p>For example, we can use <code class="docutils literal notranslate"><span class="pre">.mean()</span></code> to find the average age of the Titanic passengers:</p>
 <div class="cell tag_hide-output docutils container">
 <div class="cell_input above-output-prompt docutils container">
 <div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">titanic</span><span class="p">[</span><span class="s2">&quot;Age&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
@@ -725,8 +726,8 @@ <h2>Aggregating statistics<a class="headerlink" href="#aggregating-statistics" t
 </div>
 </details>
 </div>
-<p>The statistic applied to multiple columns of a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> (the selection of two columns returns a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, see the <a class="reference internal" href="03_pandas_subsets.html"><span class="doc std std-doc">subset data tutorial</span></a>) is calculated for each numeric column.</p>
-<p>The aggregating statistic can be calculated for multiple columns at the same time. For this we can use the <code class="docutils literal notranslate"><span class="pre">describe()</span></code> function.</p>
+<p>The statistic applied to multiple columns of a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> (the selection of two columns returns a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, see the <a class="reference internal" href="03_pandas_subsets.html"><span class="doc std std-doc">Subsets tutorial</span></a>) is calculated for each numeric column.</p>
+<p>Multiple aggregating statistics can easily be calculated for one or more columns at the same time. For this we can use the <code class="docutils literal notranslate"><span class="pre">describe()</span></code> function.</p>
 <div class="cell tag_hide-output docutils container">
 <div class="cell_input above-output-prompt docutils container">
 <div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">titanic</span><span class="p">[[</span><span class="s2">&quot;Age&quot;</span><span class="p">,</span> <span class="s2">&quot;Fare&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
@@ -1049,7 +1050,14 @@ <h2>Aggregating statistics grouped by category<a class="headerlink" href="#aggre
 <p><img alt="The split-apply-combine pattern" src="../../_images/05_groupby_select_detail.svg" /></p>
 <div class="seealso dropdown admonition">
 <p class="admonition-title">Categorical data</p>
-<p>The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a <code class="docutils literal notranslate"><span class="pre">Categorical</span></code> data type to handle this type of data. More information is provided in the official Pandas documentation on <a class="reference external" href="https://pandas.pydata.org/docs/user_guide/categorical.html">Categorical data</a>.</p>
+<p>The <code class="docutils literal notranslate"><span class="pre">Pclass</span></code> column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn’t make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.</p>
+<p>Pandas provides a <code class="docutils literal notranslate"><span class="pre">Categorical</span></code> data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.</p>
+<p>To convert the <code class="docutils literal notranslate"><span class="pre">Pclass</span></code> column from numerical to categorical data in Pandas, you can use the following code:</p>
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">titanic</span><span class="p">[</span><span class="s1">&#39;Pclass&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">titanic</span><span class="p">[</span><span class="s1">&#39;Pclass&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">&#39;category&#39;</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>This will treat <code class="docutils literal notranslate"><span class="pre">Pclass</span></code> as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.</p>
+<p>More information is provided in the official Pandas documentation on <a class="reference external" href="https://pandas.pydata.org/docs/user_guide/categorical.html">Categorical data</a>.</p>
 </div>
 <p>What is the mean ticket fare price for each of the sex and cabin class combinations?</p>
 <div class="cell tag_hide-output docutils container">
@@ -1142,7 +1150,7 @@ <h2>Count number of records by category<a class="headerlink" href="#count-number
 <h2>Key points<a class="headerlink" href="#key-points" title="Link to this heading">#</a></h2>
 <ul class="simple">
 <li><p>Aggregation statistics can be calculated on entire columns or rows.</p></li>
-<li><p>`groupby()* provides the power of the <strong>split-apply-combine</strong> pattern.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">groupby()</span></code> provides the power of the <strong>split-apply-combine</strong> pattern.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">value_counts()</span></code> is a convenient shortcut to count the number of entries in each category of a variable.</p></li>
 </ul>
 </section>
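
Note on the reworded `Aggregating statistics` text: the two calls it introduces, `.mean()` on one column and `describe()` on a selection of columns, can be checked with a few rows. This is a minimal sketch on a hypothetical four-row frame (not the Titanic data). The stored output changing from `29.69911764705882` to `np.float64(29.69911764705882)` in this commit most likely reflects the scalar repr used by NumPy 2; the value itself is unchanged, and `float()` gives the plain number either way.

```python
import pandas as pd

# Hypothetical sample with one missing age (not the real Titanic data).
titanic = pd.DataFrame(
    {
        "Age": [22.0, 38.0, None, 35.0],
        "Fare": [7.25, 71.28, 7.93, 53.1],
    }
)

# Single statistic on one column; missing values are skipped by default.
avg_age = titanic["Age"].mean()

# Convert the NumPy scalar to a built-in float for a version-independent repr.
avg_age_plain = float(avg_age)

# Several statistics for several numeric columns at once.
summary = titanic[["Age", "Fare"]].describe()

print(avg_age_plain)
print(summary)
```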

diff --git a/_build/html/searchindex.js b/_build/html/searchindex.js
diff --git a/_build/jupyter_execute/docs/pandas/05_pandas_summary_statistics.ipynb b/_build/jupyter_execute/docs/pandas/05_pandas_summary_statistics.ipynb
@@ -376,16 +376,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dfa1c006-f748-47ac-ab42-fd307a4e0cf1",
-   "metadata": {
-    "editable": true,
-    "slideshow": {
-     "slide_type": ""
-    },
-    "tags": []
-   },
+   "id": "6b9be4d7-6a34-4d95-8a37-8c45ca2e4bc0",
+   "metadata": {},
    "source": [
-    "What is the average age of the Titanic passengers?"
+    "Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.\n",
+    "\n",
+    "For example, we can use `.mean()` to find the average age of the Titanic passengers:"
    ]
   },
   {
@@ -475,15 +471,15 @@
     "tags": []
    },
    "source": [
-    "The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [subset data tutorial](03_pandas_subsets)) is calculated for each numeric column."
+    "The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [Subsets tutorial](03_pandas_subsets)) is calculated for each numeric column."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "18813351-2fb9-4201-b69a-9fdd61adef72",
    "metadata": {},
    "source": [
-    "The aggregating statistic can be calculated for multiple columns at the same time. For this we can use the `describe()` function."
+    "Multiple aggregating statistics can easily be calculated for one or more columns at the same time. For this we can use the `describe()` function."
    ]
   },
   {
@@ -970,10 +966,23 @@
    "id": "18a23998-b0ea-47b2-aaa1-37163124a14b",
    "metadata": {},
    "source": [
-    "```{admonition} Categorical data\n",
+    "````{admonition} Categorical data\n",
     ":class: seealso dropdown\n",
     "\n",
-    "The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a `Categorical` data type to handle this type of data. More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html)."
+    "The `Pclass` column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn't make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.\n",
+    "\n",
+    "Pandas provides a `Categorical` data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.\n",
+    "\n",
+    "To convert the `Pclass` column from numerical to categorical data in Pandas, you can use the following code:\n",
+    "\n",
+    "```python\n",
+    "titanic['Pclass'] = titanic['Pclass'].astype('category')\n",
+    "```\n",
+    "\n",
+    "This will treat `Pclass` as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.\n",
+    "\n",
+    "More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html).\n",
+    "````"
    ]
   },
   {
@@ -1145,7 +1154,7 @@
     "## Key points\n",
     "\n",
     "* Aggregation statistics can be calculated on entire columns or rows.\n",
-    "* `groupby()* provides the power of the **split-apply-combine** pattern.\n",
+    "* `groupby()` provides the power of the **split-apply-combine** pattern.\n",
     "* `value_counts()` is a convenient shortcut to count the number of entries in each category of a variable."
    ]
   },