feat(act): add activity 3 + test and check lessons that support it #18

Merged
merged 9 commits into from
Oct 11, 2024
254 changes: 254 additions & 0 deletions clean-modular-code/activity-3/clean-code-activity-3.md
@@ -0,0 +1,254 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.16.4
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

+++ {"editable": true, "slideshow": {"slide_type": ""}}

# Activity 3: Tests & Checks for your code

* In [activity 1](../activity-1/clean-code-activity-1), you took some code and made it cleaner using expressive variable names and docstrings to document the module.
* In [activity 2](../activity-2/clean-code-activity-2), you made your code more DRY ("Don't Repeat Yourself") using documented functions and conditionals.

In this activity, you will build checks into your workflow to handle data processing "features".


### Real-world data processing workflows and edge cases
Real-world data can rarely be imported without workarounds. You will often find unusual data entries and values you don't expect. Sometimes these values are documented: for example, a 9999 may represent a missing value in a dataset. Other times, there are typos and other errors in the data that you need to handle. These unusual values or instances in a dataset or workflow are sometimes called "edge cases".

Writing robust code that handles unexpected values will make your code run smoothly and fail gracefully. This type of code, which combines functions (or classes) and checks within the functions that handle messy data, will make your code easier to maintain over time.

:::{tip}
Using functions, classes, and methods (functions within a class) is a great first step in handling messy data. A function or method provides a modular unit you can test outside of the workflow for the edge cases you may encounter. Also, because a function is a modular unit, you can add elements to handle unexpected processing features as you build your workflow.
:::
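
As a minimal sketch of the idea above, the hypothetical helper below replaces a documented missing-value code (assumed here to be 9999) and is checked on its own before it goes into a workflow. The function name and the sample values are illustrative assumptions, not part of the JOSS data.

```python
import numpy as np
import pandas as pd


def replace_missing_code(values, missing_code=9999):
    """Replace a documented missing-value code with NaN.

    Hypothetical helper for illustration only.
    """
    return values.replace(missing_code, np.nan)


# Test the function outside of the workflow on a small edge case.
citation_counts = pd.Series([12, 9999, 3])
cleaned_counts = replace_missing_code(citation_counts)
print(cleaned_counts.isna().tolist())
```

Because the logic lives in one small function, you can try it on a handful of awkward values without running the full workflow.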

something about debuggers?
* https://jupyterlab.readthedocs.io/en/stable/user/debugger.html

## Manage the unexpected

In this activity, you will apply the following strategies:

* [conditional statements](../checks-conditionals/python-conditionals)
* try/except blocks

to process the JOSS citation data.

:::{todo}
What branch is the lesson with try/except // ask for forgiveness, checks elements in??
IN THIS PR:
https://github.com/pyOpenSci/lessons/pull/14/files#diff-7f4ff1b75e85d38f3955cca051e68e8746773c279b34c9a0a400b9c2dc1240ff
:::

When you can, try to use the Pythonic approach of asking for forgiveness later (i.e., use try/except blocks) rather than conditional statements.

```{code-cell} ipython3
---
editable: true
slideshow:
slide_type: ''
---
# This works but is less pythonic
def clean_title(title):
    """Notice that this function checks explicitly to see if it's a list and then processes the data.
    """
    if isinstance(title, list):
        return title[0]
    return title
```

## More "pythonic" - ask for forgiveness

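It is often easier to ask for forgiveness than permission (EAFP): try the operation first, then handle the exception if it fails.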

```{code-cell} ipython3
---
editable: true
slideshow:
slide_type: ''
---
# This is the preferred way to catch an error
def clean_title(title):
    """
    It's more Pythonic to try first and then ask for forgiveness later.
    If you are writing tests this also makes your code easier to test.
    """
    try:
        return title[0]
    except (TypeError, IndexError):
Contributor

I think it would be really helpful to the attendees if you explained how YOU knew what exceptions to catch, and why you handled (only) two exceptions here.

This is still a very hard skill, exceptions are not always documented, and even when they are, potential transitive exceptions can still be unexpected.

edit: I see the most common exception classes are touched on later, but IMHO it would still be very valuable to have attendees reason about what exceptions might occur for novel edge cases and be able to check their work. Lesson time permitting.

Member Author

i agree! I do think this content by itself could be a full workshop!
What if i fleshed out that concept a bit more in the lesson below? (this document is the activity)

then maybe we could have them spend time looking at some examples of issues in the data and identifying what type of error it may throw. i could make the notebook more interactive. And it could link to the explanations in the other lesson. i think your idea is a great one AND it might be more valuable than forcing them to debug the code honestly. debugging takes time and is super hard too. we have 4 hours and 3 activities.

        return title
```
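
One way to reason about which exceptions to catch is to run the operation on a small example of each input you expect and record what happens. The sketch below does that for `value[0]`. The helper name `probe_first_item` and the sample inputs are assumptions about what messy title data might look like, not values taken from the JOSS files.

```python
def probe_first_item(value):
    """Report what value[0] returns or which exception it raises."""
    try:
        return ("returned", value[0])
    except Exception as error:  # broad on purpose: exploring, not handling
        return ("raised", type(error).__name__)


# Hypothetical examples of what a "title" entry might contain.
sample_inputs = [["A title"], [], None, 42, {}, "A title"]
for sample in sample_inputs:
    print(repr(sample), "->", probe_first_item(sample))
```

With these inputs, an empty list raises `IndexError` and `None` or a number raises `TypeError`, which matches the two exceptions caught above. An empty dictionary raises `KeyError`, which that `except` clause would not catch. A plain string does not raise at all: `"A title"[0]` returns `"A"`, so the try/except version and the `isinstance` version behave differently for string titles.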

+++ {"editable": true, "slideshow": {"slide_type": ""}, "tags": ["hide-output", "hide-cell"]}

:::{tip}
### Applying functions to DataFrame values--`.apply`

The `.apply()` method in pandas lets you apply any function to the rows or columns of a `pandas.DataFrame`, or to each value in a single column (a `pandas.Series`). When you call `.apply()` on a DataFrame, `axis=0` (the default) passes each column to the function and `axis=1` passes each row. For example, to apply a function to each row of a DataFrame, use `df.apply(your_function, axis=1)`. This method is especially useful for logic that can't be easily expressed with built-in pandas functions.

You can use `.apply` in place of an explicit `for` loop to process row and column values in a `pandas.DataFrame`.

:::
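
A short sketch of both uses of `.apply` follows. The DataFrame contents and the column names `citations` and `summary` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: titles stored as one-item lists.
papers_df = pd.DataFrame(
    {"title": [["First paper"], ["Second paper"]], "citations": [3, 10]}
)

# Series.apply: the function receives one value at a time.
papers_df["title"] = papers_df["title"].apply(lambda title: title[0])

# DataFrame.apply with axis=1: the function receives one row at a time.
papers_df["summary"] = papers_df.apply(
    lambda row: f"{row['title']} ({row['citations']} citations)", axis=1
)
print(papers_df["summary"].tolist())
```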

+++ {"editable": true, "slideshow": {"slide_type": ""}}

### What's changed in your workflow?

:::{warning}
You have a new data file to open in your list of `.json` files in this activity. This file has some unexpected "features" that your code needs to handle gracefully so it can process all of the data.
:::

Your goal is to make the code below run on the data provided in the activity-3 `data/` directory.

:::{tip}
The code below will fail. You will likely want to use a debugger to determine why it's failing and get the code running.
:::
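
Alongside a debugger, one possible way to find out why a function fails is to run it on each value and collect the ones that raise. This is a minimal sketch: `find_failing_values` is a hypothetical helper, and the sample titles are assumptions rather than the contents of the activity data.

```python
import pandas as pd


def find_failing_values(series, func):
    """Return (index, value, exception name) for each value where func raises."""
    failures = []
    for index, value in series.items():
        try:
            func(value)
        except Exception as error:  # broad on purpose: diagnostic use only
            failures.append((index, value, type(error).__name__))
    return failures


# Hypothetical titles: one well-formed, one missing, one empty.
titles = pd.Series([["A paper"], None, []])
failures = find_failing_values(titles, lambda value: value[0])
for index, value, error_name in failures:
    print(index, repr(value), error_name)
```

The broad `except Exception` is acceptable here only because the goal is to discover which errors occur. Once you know them, catch the specific exception classes in your workflow code.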

The code below is an example of what your code might look like after completing activity 2. You can choose to work with this code, or you can use the code that you completed in activity 2.

+++ {"editable": true, "slideshow": {"slide_type": ""}, "tags": ["raises-exception"]}

```python
import json
from pathlib import Path

import pandas as pd


def load_clean_json(file_path, columns_to_keep):
    """
    Load JSON data from a file. Drop unnecessary columns and normalize
    to DataFrame.

    Parameters
    ----------
    file_path : Path
        Path to the JSON file.
    columns_to_keep : list
        List of columns to keep in the DataFrame.

    Returns
    -------
    pandas.DataFrame
        Normalized JSON data containing only the columns to keep.
    """

    with file_path.open("r") as json_file:
        json_data = json.load(json_file)
        normalized_data = pd.json_normalize(json_data)

    return normalized_data.filter(items=columns_to_keep)


def format_date(date_parts: list) -> str:
    """
    Format date parts into a string.

    Parameters
    ----------
    date_parts : list
        List containing year, month, and day.

    Returns
    -------
    str
        Formatted date string.
    """
    return f"{date_parts[0]}-{date_parts[1]:02d}-{date_parts[2]:02d}"


def clean_title(value):
    """A function that removes a value contained in a list."""
    return value[0]


def process_published_date(date_parts):
    """Parse a date provided as a list of values into a proper date format.

    Parameters
    ----------
    date_parts : list
        The elements of a date provided as a nested list from CrossRef,
        e.g. [[2024, 10, 11]].

    Returns
    -------
    pd.Timestamp
        The date as a pandas Timestamp.
    """

    date_str = (
        f"{date_parts[0][0]}-{date_parts[0][1]:02d}-{date_parts[0][2]:02d}"
Contributor

This is repeating the logic you wrote in format_date (which isn't used anywhere).

    )
    return pd.to_datetime(date_str, format="%Y-%m-%d")


columns_to_keep = [
    "publisher",
    "DOI",
    "type",
    "author",
    "is-referenced-by-count",
    "title",
    "published.date-parts",
]

data_dir = Path("data")

all_papers_list = []
for json_file in data_dir.glob("*.json"):
    papers_df = load_clean_json(json_file, columns_to_keep)

    papers_df["title"] = papers_df["title"].apply(clean_title)
    papers_df["published_date"] = papers_df["published.date-parts"].apply(
        process_published_date
    )

    all_papers_list.append(papers_df)

all_papers_df = pd.concat(all_papers_list, axis=0, ignore_index=True)

print("Final shape of combined DataFrame:", all_papers_df.shape)
```
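
Following the review comment about repeated date logic, here is one possible sketch of `process_published_date` that reuses `format_date` and handles missing or incomplete dates with a try/except block. Returning `pd.NaT` for bad input is an assumed policy for this sketch, not a decision made in the lesson, and the sample inputs in the usage lines are hypothetical.

```python
import pandas as pd


def format_date(date_parts):
    """Format [year, month, day] as a YYYY-MM-DD string."""
    return f"{date_parts[0]}-{date_parts[1]:02d}-{date_parts[2]:02d}"


def process_published_date(date_parts):
    """Convert nested date parts, e.g. [[2024, 10, 11]], to a Timestamp.

    Returns pd.NaT when the date parts are missing or incomplete
    (an assumed policy for this sketch).
    """
    try:
        # date_parts[0] is the inner [year, month, day] list.
        date_str = format_date(date_parts[0])
    except (TypeError, IndexError):
        return pd.NaT
    return pd.to_datetime(date_str, format="%Y-%m-%d")


# Hypothetical inputs: a complete date, a missing date, a partial date.
print(process_published_date([[2024, 10, 11]]))
print(process_published_date(None))
print(process_published_date([[2024, 10]]))
```

Only `TypeError` and `IndexError` are caught here. Other problems, such as date parts stored as strings, would still raise and would need their own handling.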

+++ {"editable": true, "slideshow": {"slide_type": ""}}

:::{admonition} On your own 1
:class: attention

Ideas for on your own welcome!
:::

+++ {"editable": true, "slideshow": {"slide_type": ""}}

:::{admonition} On your own 2
:class: attention
Ideas welcome?
Contributor

An observation I have is that this code uses neither conditionals nor exception blocks, even though those were the two stated tools for data "edge cases".

:::

I want to have them move their code into a module if possible during this workshop but we could also kick that off in the day 2 workshop.

```{code-cell} ipython3
---
editable: true
slideshow:
slide_type: ''
---

```

```{code-cell} ipython3
---
editable: true
slideshow:
slide_type: ''
---

```