DOC: update cheatsheet #39806

OliEfr · 2021-02-14T11:56:43Z

closes QST: Publish my personal modification of the official pandas cheat sheet #39724
tests added / passed (not required)
Ensure all linting tests pass, see here for how to run them (not required)
whatsnew entry (not required)

Major changes

Replaced Heading "Data Wrangling ..." with section "General"
- I found it useful to have the hyperlinks to pandas api reference and user guide included
- Also, whenever you work with pandas you need other packages (matplotlib etc.). That's why I added hyperlinks for them as well
- There is some place left for other useful links!
Replaced explanatory section "Tidy data" with section "Display & Visualize data"
- Cheat sheets are for users who already have basic knowledge. So this explanation of Tidy Data is of no use here
- Displaying and visualizing data is one of the main tasks. So this section should be included in the cheat sheet
- On the previous cheat sheet there also was a section "Plotting". But it was very small at the end of the second page.
Added section "Apply Functions" (bottom of second page)
- There are many ways to apply functions and I always find it confusing to remember them. So I added them here with a short description
- However, if you have a better suggestion for this section, it can be interchanged
Refactored section "Subset Observation"
- This section was pretty confusing in styling and layout before. E.g., there were multiple places where you could find the df.loc[] and df.iloc[] indexers.
- I reordered this section
- I added df.iloc[row_ind, column_ind] and df.loc[row_ind, column_ind]
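For readers without the PDF at hand, the "Apply Functions" section mentioned above presumably covers calls along the following lines. This is only a minimal sketch: the DataFrame and its column names are made up, and the thread does not show which methods the draft cheat sheet actually listed.

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the cheat sheet.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# DataFrame.apply: call a function once per column (axis=0, the default).
col_range = df.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: call a function once per row.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.map: transform a single column element by element.
a_squared = df["a"].map(lambda x: x ** 2)

# DataFrame.agg: reduce each column with one or more named aggregations.
summary = df.agg(["min", "max"])
```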

Let me know what you think about it and make suggestions.
Oli

Dr-Irv

@OliEfr
Thanks for your efforts, but I'm not OK with some of these changes, so please do the following:

On first page, please put back the top left part with "Data Wrangling with pandas Cheat Sheet". You can make the font smaller if you want. You can include the references to the pandas user guide and API reference, but do not include the references to Seaborn and matplotlib. We're not in a position to pick any particular visualization tool.
I'm OK with changing the title of the left panel from "Syntax - Creating DataFrames" to "Creating DataFrames", but please don't include the reference to IO Tools and read_csv there (see below for a suggestion)
Under Method Chaining, put back the comment "This improves readability of code"
At top of first page, please put back the Tidy Data section. Move the "Display and Visualize Data" section back to page 2 with just the original examples (but see below for a suggestion)
In Subset Observations, please put back the pictures that illustrate what subsetting is all about. Try to fit in a df.query example. Separate out the ones for rows (which are about subsetting observations) and columns (which are about subsetting the variables), as they were before. I'm fine with moving the last 3 examples on the current cheat sheet that subset variables above the df.filter example, which would move the regex table down. Note that those 3 examples are about filtering columns (with the last example being about filtering rows and columns)
On page 2, I'm OK with adding df.shape, but if you reformat that part to put df.shape to the right of len(df), then things will fit better.
Remove the comment in "Group Data" about "Possibly use reset_index() after!"
Remove the "apply Functions" example, and put the simple plotting example back. df.assign is also described above, so no need to include it again. (But see suggestion below).
For the attribution on the bottom of the 2 pages, please use "Cheatsheet for pandas (http://pandas.pydata.org/) originally written by Irv Lustig, Princeton Consultants, inspired by Rstudio Data Wrangling Cheatsheet". Keep it flush right on both pages (that was a nice change)
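As a reminder of what the Method Chaining comment refers to, here is a minimal sketch of chained calls. The data and column names are illustrative assumptions, not the exact example from the cheat sheet.

```python
import pandas as pd

# Illustrative wide-format data; the column names are assumptions.
df = pd.DataFrame({"x": [100, 250], "y": [300, 150]})

# Each method returns a new DataFrame, so the calls can be chained
# inside parentheses instead of storing intermediate results.
tidy = (
    pd.melt(df)
    .rename(columns={"variable": "var", "value": "val"})
    .query("val >= 200")
)
```

Written without chaining, the same steps would need two temporary variables, which is the readability point the comment makes.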

Now for the suggestion. I've long wanted to add a third page that would have additional information that people would find useful, and just never had the time to do it, so it would be great if you could start this. This could include:

More visualization examples (but do NOT use references to other libraries, e.g. your seaborn example)
The options you included, and more (also display_max_columns) See the list of frequently used ones here: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#frequently-used-options
I/O: A whole section showing a variety of popular IO usage (CSV, Excel, SQL, HTML), and also output file formats (feather, parquet, HDF)
The Apply Functions section - although I'm not sure that is the right name for the section you created.
The new Extension Types (String, Integer, Float) and how pd.NA works
Anything else that could fill up the space (if needed)
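Two of the suggested third-page topics can be sketched briefly: the nullable extension types with pd.NA, and display options. This is a hedged illustration, not proposed cheat sheet content. Note that the option is registered under the name "display.max_columns"; the values used here are arbitrary.

```python
import pandas as pd

# Nullable extension dtypes keep their type when values are missing.
ints = pd.array([1, None, 3], dtype="Int64")
strs = pd.array(["a", None], dtype="string")

# pd.NA propagates through arithmetic and comparisons.
na_plus_one = pd.NA + 1        # still pd.NA
na_equals_one = pd.NA == 1     # still pd.NA

# Kleene logic: the result is known when the other operand decides it.
true_or_na = True | pd.NA      # True
false_and_na = False & pd.NA   # False

# Temporarily change a display option without touching global state.
with pd.option_context("display.max_columns", 5):
    inside = pd.get_option("display.max_columns")
```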

OliEfr · 2021-02-19T20:10:01Z

OliEfr · 2021-02-21T06:22:15Z

Idea: when we have a third page, there will be enough space. So why don't we create a section - I don't mind on which page - "Design Principles" or "Best Practice". Within this section we can have the comment about tidy data, vectorization and also things such as "First column is 0".
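Two of the principles named here, vectorization and zero-based positions, could be illustrated roughly as follows. The data and column names are invented for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"length": [2.0, 3.0, 4.0], "width": [1.0, 2.0, 3.0]})

# Vectorized: one expression operates on whole columns at once.
df["area"] = df["length"] * df["width"]

# Equivalent element-by-element loop, shown only for comparison.
area_loop = [row.length * row.width for row in df.itertuples()]

# Positional indexing is zero-based: column 0 is the first column.
first_column = df.iloc[:, 0]
```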

github-actions · 2021-03-24T00:14:17Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

Dr-Irv · 2021-03-27T18:45:37Z

First, sorry for the delay on reviewing this. It slipped through the proverbial cracks.

> Done. However, I don't know why you are keen on having it there. People using pandas know this. Or am I mistaken / missing something?

Some people start using pandas by using the cheat sheet, so that's why I like putting the stuff at the top. Also, it maps the words observation to row and variable to column, which corresponds to wording in the statistics community.

> I partly agree. Thing is, I found it confusing to have the same function at multiple locations when using the sheet for the first few times. More importantly, I think it is very useful to have multiple examples of loc and iloc, because they are so versatile. I made a suggestion for this section. Let me know what you think.

We're getting close. See below.

> Not sure what you mean. Is this alright?

No. df.shape is like len(df). I think you can make a new line for df.shape and describe it as "Tuple of # of rows, # of columns in DataFrame"
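The difference between the two can be seen in a small sketch (the DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows = len(df)   # number of rows only
shape = df.shape   # tuple: (# of rows, # of columns)
```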

Now, for other comments:

On first page, let's use the white space below "Subset Variables - columns" as a new section "Using query", and remove df.query(expr) from the Subset Observations section, but provide a few examples in the new section. Here are 3 examples to use:
- df.query("Length > 7")
- df.query("Length > 7 and Width < 8")
- df.query("Name.str.startswith('abc')", engine="python")
On first page, for .iloc and .loc provide examples rather than using row_ind and column_ind
On second page, in the "Group Data" section, in the groupby example, remove the .mean()
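The three query examples above can be tried on a small DataFrame. The data below is an assumption made up for this sketch; only the column names and the query strings come from the comment.

```python
import pandas as pd

# Illustrative data; the column names follow the examples in the comment.
df = pd.DataFrame(
    {
        "Name": ["abc1", "xyz", "abcde"],
        "Length": [8, 9, 5],
        "Width": [7, 10, 3],
    }
)

long_rows = df.query("Length > 7")
long_narrow = df.query("Length > 7 and Width < 8")

# engine="python" is passed as in the review comment; string methods in
# the expression may not be supported by numexpr-based evaluation.
abc_rows = df.query("Name.str.startswith('abc')", engine="python")
```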

OliEfr · 2021-03-28T12:40:13Z

Alright, I adapted everything, apart from:

On first page, for .iloc and .loc provide examples rather than using row_ind and column_ind

For me, it was not obvious at first that the first index is row_ind and the second is column_ind. I always confused them and had to look it up quite often.

I removed row_ind and column_ind and added the sentence: First index selects rows, second index columns.
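The "rows first, columns second" rule can be sketched as follows. The DataFrame, its labels, and the selected ranges are illustrative assumptions, not the examples that ended up on the cheat sheet.

```python
import pandas as pd

df = pd.DataFrame(
    {"width": [1, 2, 3], "length": [4, 5, 6]},
    index=["a", "b", "c"],
)

# .iloc is position-based: rows 0-1 (end excluded), column 1.
by_position = df.iloc[0:2, 1]

# .loc is label-based: rows "a" through "b" (end included), column "length".
by_label = df.loc["a":"b", "length"]

# .iat reads a single scalar by position: row 1, column 0.
cell = df.iat[1, 0]
```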

Dr-Irv · 2021-03-28T16:16:55Z

@OliEfr We're getting even closer. Additional changes to make:

In the query examples, make sure the quotes are straight quotes, not curly quotes
Have a hyperlink from each query example to the docs
In the blue box for query, make it smaller, then have text before the examples that says "query() allows boolean expressions for filtering rows". To help make space, change the text for the df[['width', 'length', 'species']] examples to put the word "names" on the same line as the rest of the text.
In the examples for loc and iloc, add hyperlinks to the docs
This is a small thing, but on page 1, the box for "Creating DataFrames" should top align with the box for "Reshaping Data". If you move "Creating DataFrames" down a bit, you can increase the font size in the "Data Wrangling with pandas Cheat Sheet" title so that it is closer to the old font size.
Since there is now some space at the top of the box with the "Creating DataFrames" example, spread out the examples vertically to distribute the white space
In the df.iat[1,2] example, add a space after the comma.

OliEfr · 2021-03-28T17:36:38Z

Check

Dr-Irv · 2021-03-29T11:54:06Z

Should have caught a few more issues (some of which existed prior to your PR):

On page 2, in Group Data, in first example, get rid of .max()
In the df.iloc and df.loc examples, make sure there are spaces after every comma
Add spaces after commas in the last examples with MultiIndex. If the alignment isn't great, consider splitting the tuples over multiple lines, which you can now do because you have the white space available
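Dropping .max() (and earlier .mean()) from the groupby example leaves a call that only builds the grouping. A minimal sketch of that distinction, using made-up data and column names:

```python
import pandas as pd

df = pd.DataFrame({"col": ["x", "x", "y"], "val": [1, 5, 3]})

# groupby alone only describes the grouping; nothing is aggregated yet.
grouped = df.groupby(by="col")

# An aggregation is applied as a separate step.
max_per_group = grouped["val"].max()
sizes = grouped.size()
```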

OliEfr · 2021-03-29T13:45:54Z

Check - and corrected a few more comma and spacing issues.

Dr-Irv · 2021-03-29T15:24:13Z

Thanks @OliEfr

OliEfr · 2021-03-29T15:56:36Z

If you want, you can create an issue with your ideas for a third page/further ideas.

I'll probably keep working on it.

Dr-Irv · 2021-03-29T16:19:19Z

> If you want, you can create an issue with your ideas for a third page/further ideas.
>
> I'll probably keep working on it.

Good idea. Created #40680

If you do that work, I'll try to be more responsive!

* Added links to official docs in cheat sheets
* DOC: added links to official docs in cheat sheet (update)
* DOC: update cheatsheet
* DOC: minor changes cheatsheet; update honors
* DOC: rework according to requested changes
* Update Cheatsheet
* Update Cheatsheet
* update cheatsheet
* update cheatsheets
* update cheatsheet!

maripisravankumar · 2021-08-31T10:47:15Z

changes

for update cheatsheet issue #39806
1. Replace the heading "Data Wrangling" with "Shortcuts for DATASCIENCE tools".

Insert some reference links at the top of the sheet to make it easier to understand:
[https://www.w3schools.com/python/pandas/default.asp]

Rework the DataFrames section into a simple learning example showing how to read the data from a CSV file:

*for example

`from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()`

2. Subset Variables / Observations: insert some Python code that is useful for pandas DataFrames and updates, with a better introduction:

# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'plot_id']]

Select the last element in the list

(the slice starts at the last element, and ends at the end of the list)

surveys_df[-1:]

Using the 'copy() method'

true_copy_surveys_df = surveys_df.copy()

Using the '=' operator

ref_surveys_df = surveys_df
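The difference between the two assignments above can be checked with a small sketch. The contents of surveys_df are invented here; only the variable names come from the comment.

```python
import pandas as pd

# Illustrative stand-in for surveys_df; the values are assumptions.
surveys_df = pd.DataFrame({"species_id": ["NL", "DM"], "plot_id": [2, 3]})

# copy() creates an independent object.
true_copy_surveys_df = surveys_df.copy()

# "=" only binds a second name to the same object.
ref_surveys_df = surveys_df

# A change made through the reference is visible in the original,
# but not in the copy.
ref_surveys_df.loc[0, "plot_id"] = 99
```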

3. Summarizing data and handling missing data:

summarize the dataset

print(dataset.describe())

count the number of missing values for each column

num_missing = (dataset[[1,2,3,4,5]] == 0).sum()

report the result

Dr-Irv · 2021-08-31T21:46:01Z

> changes
>
> for update cheatsheet issue #39806

@maripisravankumar This is a closed PR on the cheat sheet. If you would like to propose other changes to the cheat sheet, you can create your own PR and do edits, and I will either accept or reject them, or suggest changes.

OliEfr added 11 commits February 9, 2021 20:53

Added links to official docs in cheat sheets

fec98ad

Merge https://github.com/pandas-dev/pandas

863ac94

Merge https://github.com/pandas-dev/pandas

2effcd7

DOC: added links to official docs in cheat sheet (update)

21c9099

Merge branch 'master' of https://github.com/OliEfr/pandas

e8afcfe

Merge https://github.com/pandas-dev/pandas

86f0784

Merge https://github.com/pandas-dev/pandas

5e984be

Merge https://github.com/pandas-dev/pandas

7051523

DOC: update cheatsheet

951c7d8

Merge branch 'master' of https://github.com/OliEfr/pandas

36fd6df

DOC: minor changes cheatsheet; update honors

20e82d2

lithomas1 requested a review from Dr-Irv February 14, 2021 22:45

lithomas1 added the Docs label Feb 14, 2021

Dr-Irv requested changes Feb 15, 2021

View reviewed changes

DOC: rework according to requested changes

2ffda8d

OliEfr added 2 commits February 19, 2021 21:14

Merge https://github.com/pandas-dev/pandas

ef995dc

Merge branch 'master' of https://github.com/OliEfr/pandas

b5b6969

github-actions bot added the Stale label Mar 24, 2021

Dr-Irv removed the Stale label Mar 27, 2021

OliEfr added 5 commits March 28, 2021 14:40

Update Cheatsheet

745b5f4

Update Cheatsheet

ea45c4d

update cheatsheet

ace0061

Merge https://github.com/pandas-dev/pandas

dac0f9b

Merge branch 'master' of https://github.com/OliEfr/pandas

8b10222

update cheatsheets

66d779e

OliEfr added 2 commits March 29, 2021 15:44

update cheatsheet!

1834e83

Merge https://github.com/pandas-dev/pandas

31260a3

Dr-Irv merged commit 8719c7a into pandas-dev:master Mar 29, 2021

Dr-Irv mentioned this pull request Mar 29, 2021

DOC: Additional items for the cheat sheet #40680

Open
