feat(act): add activity 3 + test and check lessons that support it #18
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,254 @@ | ||
--- | ||
jupytext: | ||
text_representation: | ||
extension: .md | ||
format_name: myst | ||
format_version: 0.13 | ||
jupytext_version: 1.16.4 | ||
kernelspec: | ||
display_name: Python 3 (ipykernel) | ||
language: python | ||
name: python3 | ||
--- | ||
|
||
+++ {"editable": true, "slideshow": {"slide_type": ""}} | ||
|
||
# Activity 3: Tests & Checks for your code | ||
|
||
* In [activity 1](../activity-1/clean-code-activity-1), you took some code and made it cleaner using expressive variable names and docstrings to document the module. | ||
* In [activity 2](../activity-2/clean-code-activity-2), you made your code more DRY ("Don't Repeat Yourself") using documented functions and conditionals. | ||
|
||
In this activity, you will build checks into your workflow to handle data processing "features". | ||
|
||
|
||
### Real-world data processing workflows and edge cases | ||
Real-world data can rarely be imported without workarounds. You will often find unusual data entries and values you don't expect. Sometimes, these values are documented; for example, 9999 may represent a missing value in a dataset. Other times, there are typos and other errors in the data that you need to handle. These unusual values or instances in a dataset or workflow are sometimes called "edge cases". | ||
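
As a minimal sketch of handling a documented edge case, the snippet below replaces a sentinel value with `None`. The sentinel `9999` comes from the example above; the function name `replace_sentinel` and the sample values are hypothetical and only illustrate the idea.

```python
# Hypothetical sentinel: assume the dataset's documentation says 9999 means "missing".
MISSING_SENTINEL = 9999


def replace_sentinel(values, sentinel=MISSING_SENTINEL):
    """Return a new list in which every sentinel value is replaced by None."""
    return [None if value == sentinel else value for value in values]


# Made-up values, for illustration only.
raw_counts = [12, 9999, 3, 0]
cleaned_counts = replace_sentinel(raw_counts)
print(cleaned_counts)  # [12, None, 3, 0]
```

Undocumented problems such as typos cannot be handled with a fixed rule like this, which is where the checks in this activity come in.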
|
||
Writing robust code that handles unexpected values will make your code run smoothly and fail gracefully. This type of code, which combines functions (or classes) and checks within the functions that handle messy data, will make your code easier to maintain over time. | ||
|
||
:::{tip} | ||
Using functions, classes, and methods (functions within a class) is a great first step in handling messy data. A function or method provides a modular unit you can test outside of the workflow for the edge cases you may encounter. Also, because a function is a modular unit, you can add elements to handle unexpected processing features as you build your workflow. | ||
::: | ||
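
To make the idea of testing a function outside of the workflow concrete, here is a small sketch. The function `parse_citation_count` and the inputs are hypothetical (they are not part of the activity code); the point is only that a modular function can be run against a list of edge cases on its own.

```python
def parse_citation_count(raw_count):
    """Convert a raw citation count to an int; return None if that is not possible."""
    try:
        return int(raw_count)
    except (TypeError, ValueError):
        # TypeError: the value is not number-like at all (for example None).
        # ValueError: the value is a string that is not a number (for example "n/a").
        return None


# Each pair is (input you might meet in messy data, result you expect).
edge_cases = [
    (12, 12),
    ("12", 12),
    ("n/a", None),
    (None, None),
]

for raw_count, expected in edge_cases:
    result = parse_citation_count(raw_count)
    assert result == expected, f"{raw_count!r} gave {result!r}, expected {expected!r}"

print("All edge cases passed")
```

When you meet a new unexpected value in your data, you can add it to `edge_cases` and adjust the function until the check passes again.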
|
||
something about debuggers? | ||
* https://jupyterlab.readthedocs.io/en/stable/user/debugger.html | ||
|
||
## Manage the unexpected | ||
|
||
In this activity, you will apply the following strategies: | ||
|
||
* [conditional statements](../checks-conditionals/python-conditionals) | ||
* try/except blocks | ||
|
||
to process the JOSS citation data. | ||
|
||
:::{todo} | ||
What branch is the lesson with try/except // ask for forgiveness, checks elements in?? | ||
IN THIS PR: | ||
https://github.com/pyOpenSci/lessons/pull/14/files#diff-7f4ff1b75e85d38f3955cca051e68e8746773c279b34c9a0a400b9c2dc1240ff | ||
::: | ||
|
||
When you can, try to use the Pythonic approach of asking for forgiveness later (i.e., use try/except blocks) rather than conditional statements. | ||
|
||
```{code-cell} ipython3 | ||
--- | ||
editable: true | ||
slideshow: | ||
slide_type: '' | ||
--- | ||
# This works but is less pythonic | ||
def clean_title(title): | ||
    """Notice that this function checks explicitly to see if it's a list and then processes the data. | ||
    """ | ||
    if isinstance(title, list): | ||
        return title[0] | ||
    return title | ||
``` | ||
|
||
## More "pythonic" - ask for forgiveness | ||
|
||
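It is often easier to ask for forgiveness than permission (EAFP): try the operation first, then handle the exception if it fails. | ||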
|
||
```{code-cell} ipython3 | ||
--- | ||
editable: true | ||
slideshow: | ||
slide_type: '' | ||
--- | ||
# This is the preferred way to catch an error | ||
def clean_title(title): | ||
    """ | ||
    It's more Pythonic to try first and then ask for forgiveness later. | ||
    If you are writing tests this also makes your code easier to test. | ||
    """ | ||
    try: | ||
        return title[0] | ||
    except (TypeError, IndexError): | ||
        return title | ||
``` | ||
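
One way to reason about which exceptions to catch is to try the operation on the kinds of values you expect to meet and look at what happens. The sketch below probes the `title[0]` expression used above. The helper name `probe_first_element` and the sample inputs are hypothetical and are not taken from the JOSS data.

```python
def probe_first_element(value):
    """Return what ``value[0]`` produces, or the name of the exception it raises."""
    try:
        return value[0]
    except Exception as error:
        # Catching Exception broadly is deliberate here: this helper is for
        # exploring behavior, not for production error handling.
        return type(error).__name__


# Hypothetical inputs that a "title" field might contain.
samples = {
    "list with a title": ["A title"],
    "empty list": [],
    "missing value (None)": None,
    "number": 3.5,
    "empty dict": {},
    "plain string": "A title",
}

for label, sample in samples.items():
    print(f"{label}: {probe_first_element(sample)!r}")
```

Running this shows `IndexError` for the empty list and `TypeError` for `None` and the number, which matches the two exceptions caught above. It also shows two cases that `except (TypeError, IndexError)` does not cover: an empty dict raises `KeyError`, and a plain string raises nothing at all, so `title[0]` quietly returns its first character. Whether those cases matter depends on what your data actually contains.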
|
||
+++ {"editable": true, "slideshow": {"slide_type": ""}, "tags": ["hide-output", "hide-cell"]} | ||
|
||
:::{tip} | ||
### Applying functions to DataFrame values--`.apply` | ||
|
||
The `.apply()` method in pandas lets you apply any function to the rows or columns of a `pandas.DataFrame`, or to the values of a single column (a `pandas.Series`). When you call `.apply()` on a DataFrame, the `axis` argument controls what the function receives: each column (`axis=0`, the default) or each row (`axis=1`). For example, to apply a function to each row of a DataFrame, use `df.apply(your_function, axis=1)`. `.apply()` is especially useful for logic that can't be easily expressed with built-in pandas functions, allowing for more flexibility in data processing. | ||
|
||
You can use `.apply` in place of an explicit `for` loop to process row and column values in a `pandas.DataFrame`. The result is usually more concise, although a built-in vectorized pandas operation is typically faster when one exists. | ||
|
||
::: | ||
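
The following is a small sketch of both uses of `.apply` described in the tip. The DataFrame contents and the helper names `first_item` and `summarize_row` are made up for illustration; only the column names mirror the ones used in this activity.

```python
import pandas as pd

# A tiny, made-up DataFrame shaped like the citation data in this activity.
papers_df = pd.DataFrame(
    {
        "title": [["First paper"], ["Second paper"]],
        "is-referenced-by-count": [4, 10],
    }
)


def first_item(value):
    """Return the first element of a list-like value."""
    return value[0]


def summarize_row(row):
    """Build a short text summary from several columns of one row."""
    return f"{row['title']} ({row['is-referenced-by-count']} citations)"


# Series.apply: the function receives one cell value at a time.
papers_df["title"] = papers_df["title"].apply(first_item)

# DataFrame.apply with axis=1: the function receives one row at a time.
papers_df["summary"] = papers_df.apply(summarize_row, axis=1)

print(papers_df)
```

Calling `.apply` on a single column is the pattern used later in this activity; `axis=1` is helpful when the function needs values from more than one column.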
|
||
+++ {"editable": true, "slideshow": {"slide_type": ""}} | ||
|
||
### What's changed in your workflow? | ||
|
||
:::{warning} | ||
You have a new data file to open in your list of `.json` files in this activity. This file has some unexpected "features" that your code needs to handle gracefully so it can process all of the data. | ||
::: | ||
|
||
Your goal is to make the code below run on the data provided in the activity-3 `data/` directory. | ||
|
||
:::{tip} | ||
The code below will fail. You will likely want to use a debugger to determine why it's failing and get the code running. | ||
::: | ||
|
||
The code below is an example of what your code might look like after completing activity 2. You can choose to work with this code, or you can use the code that you completed in activity 2. | ||
|
||
+++ {"editable": true, "slideshow": {"slide_type": ""}, "tags": ["raises-exception"]} | ||
|
||
```python | ||
import json | ||
from pathlib import Path | ||
|
||
import pandas as pd | ||
|
||
|
||
def load_clean_json(file_path, columns_to_keep): | ||
    """ | ||
    Load JSON data from a file, normalize it to a DataFrame, and keep | ||
    only the requested columns. | ||
|
||
    Parameters | ||
    ---------- | ||
    file_path : Path | ||
        Path to the JSON file. | ||
    columns_to_keep : list | ||
        List of columns to keep in the DataFrame. | ||
|
||
    Returns | ||
    ------- | ||
    pandas.DataFrame | ||
        Normalized data containing only the columns in ``columns_to_keep``. | ||
    """ | ||
|
||
    with file_path.open("r") as json_file: | ||
        json_data = json.load(json_file) | ||
    normalized_data = pd.json_normalize(json_data) | ||
|
||
    return normalized_data.filter(items=columns_to_keep) | ||
|
||
|
||
def format_date(date_parts: list) -> str: | ||
    """ | ||
    Format date parts into a string. | ||
|
||
    Parameters | ||
    ---------- | ||
    date_parts : list | ||
        List containing year, month, and day. | ||
|
||
    Returns | ||
    ------- | ||
    str | ||
        Formatted date string. | ||
    """ | ||
    return f"{date_parts[0]}-{date_parts[1]:02d}-{date_parts[2]:02d}" | ||
|
||
|
||
def clean_title(value): | ||
    """Return the first item of a value that is stored in a list.""" | ||
    return value[0] | ||
|
||
|
||
def process_published_date(date_parts): | ||
    """Parse a date provided as a list of values into a proper date format. | ||
|
||
    Parameters | ||
    ---------- | ||
    date_parts : list | ||
        The elements of a date provided as a nested list from CrossRef, in the form ``[[year, month, day]]``. | ||
|
||
    Returns | ||
    ------- | ||
    pandas.Timestamp | ||
        The date as a pandas Timestamp. | ||
    """ | ||
|
||
    date_str = ( | ||
        f"{date_parts[0][0]}-{date_parts[0][1]:02d}-{date_parts[0][2]:02d}" | ||
This is repeating the logic you wrote in |
||
    ) | ||
    return pd.to_datetime(date_str, format="%Y-%m-%d") | ||
|
||
|
||
columns_to_keep = [ | ||
    "publisher", | ||
    "DOI", | ||
    "type", | ||
    "author", | ||
    "is-referenced-by-count", | ||
    "title", | ||
    "published.date-parts", | ||
] | ||
|
||
data_dir = Path("data") | ||
|
||
all_papers_list = [] | ||
for json_file in data_dir.glob("*.json"): | ||
    papers_df = load_clean_json(json_file, columns_to_keep) | ||
|
||
    papers_df["title"] = papers_df["title"].apply(clean_title) | ||
    papers_df["published_date"] = papers_df["published.date-parts"].apply( | ||
        process_published_date | ||
    ) | ||
|
||
    all_papers_list.append(papers_df) | ||
|
||
all_papers_df = pd.concat(all_papers_list, axis=0, ignore_index=True) | ||
|
||
print("Final shape of combined DataFrame:", all_papers_df.shape) | ||
``` | ||
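
The lesson does not say which "features" the new data file contains, so the sketch below is only one possible direction, not the solution. It assumes the published date may be missing, incomplete, or stored as strings, and it wraps the existing `format_date` logic in a try/except block instead of repeating the f-string. The name `safe_format_date` is hypothetical.

```python
def format_date(date_parts):
    """Format a [year, month, day] list as a YYYY-MM-DD string."""
    return f"{date_parts[0]}-{date_parts[1]:02d}-{date_parts[2]:02d}"


def safe_format_date(nested_date_parts):
    """Format a nested [[year, month, day]] value, or return None if it is unusable.

    Hypothetical helper: the failure cases handled here are assumptions about
    what messy data might look like, not facts about the activity's data file.
    """
    try:
        return format_date(nested_date_parts[0])
    except (TypeError, IndexError, ValueError):
        # TypeError: the value is not subscriptable (for example None or NaN).
        # IndexError: the list is empty or is missing the month or day.
        # ValueError: the parts are strings, so the ``:02d`` format fails.
        return None


print(safe_format_date([[2022, 5, 3]]))  # 2022-05-03
print(safe_format_date([[2022, 5]]))  # None
print(safe_format_date(None))  # None
```

Returning `None` keeps the workflow running, and the missing dates can be counted or inspected afterwards. Whether to skip, flag, or repair such rows is a design choice that depends on the analysis.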
|
||
+++ {"editable": true, "slideshow": {"slide_type": ""}} | ||
|
||
:::{admonition} On your own 1 | ||
:class: attention | ||
|
||
Ideas for on your own welcome! | ||
::: | ||
|
||
+++ {"editable": true, "slideshow": {"slide_type": ""}} | ||
|
||
:::{admonition} On your own 2 | ||
:class: attention | ||
Ideas welcome? | ||
An observation I have is that this code uses neither conditionals nor exception blocks, even though those were the two stated tools for data "edge cases" |
||
::: | ||
|
||
I want to have them move their code into a module if possible during this workshop but we could also kick that off in the day 2 workshop. | ||
|
||
```{code-cell} ipython3 | ||
--- | ||
editable: true | ||
slideshow: | ||
slide_type: '' | ||
--- | ||
|
||
``` | ||
|
||
```{code-cell} ipython3 | ||
--- | ||
editable: true | ||
slideshow: | ||
slide_type: '' | ||
--- | ||
|
||
``` |
I think it would be really helpful to the attendees if you explained how YOU knew what exceptions to catch, and why you handled (only) two exceptions here.
This is still a very hard skill, exceptions are not always documented, and even when they are, potential transitive exceptions can still be unexpected.
edit: I see the most common exception classes are touched on later, but IMHO it would still be very valuable to have attendees reason about what exceptions might occur for novel edge cases and be able to check their work. Lesson time permitting.
I agree! I do think this content by itself could be a full workshop!
What if I fleshed out that concept a bit more in the lesson below (this document is the activity)?
Then maybe we could have them spend time looking at some examples of issues in the data and identifying what type of error each may throw. I could make the notebook more interactive, and it could link to the explanations in the other lesson. I think your idea is a great one, and it might be more valuable than forcing them to debug the code, honestly. Debugging takes time and is super hard too. We have 4 hours and 3 activities.