DOC: update cheatsheet #39806

Merged
merged 22 commits into from
Mar 29, 2021

Contributor

@OliEfr OliEfr commented Feb 14, 2021

Major changes

  1. Replaced Heading "Data Wrangling ..." with section "General"
    • I found it useful to have the hyperlinks to pandas api reference and user guide included
    • Also, whenever you work with pandas you need other packages (matplotlib etc.). That's why I added hyperlinks for them as well
    • There is some place left for other useful links!
  2. Replaced explanatory section "Tidy data" with section "Display & Visualize data"
    • Cheat sheets are for users who already have basic knowledge. So this explanation of Tidy Data is of no use here
    • Displaying and visualizing data is one of the main tasks. So this section should be included in the cheat sheet
    • On the previous cheat sheet there also was a section "Plotting". But it was very small at the end of the second page.
  3. Added section "Apply Functions" (bottom of second page)
    • There are many ways to apply functions and I always find it confusing to remember them. So I added them here with a short description
    • However, if you have a better suggestion for this section, it can be interchanged
  4. Refactored section "Subset Observation"
    • This section was pretty confusing in styling and layout before. E.g. there were multiple places where you can find the df.loc[] and df.iloc[] functions.
    • I reordered this section
    • I added df.iloc[row_ind, column_ind] and df.loc[row_ind, column_ind]
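
As a rough illustration of the "many ways to apply functions" mentioned in item 3, the sketch below contrasts a few of them. The DataFrame and its column names are made up for this example and are not taken from the cheat sheet.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: element-wise on a single column
doubled = df["a"].map(lambda x: x * 2)

# DataFrame.apply: the function receives one column (a Series) at a time
col_range = df.apply(lambda col: col.max() - col.min())

# DataFrame.agg: one or more named aggregations per column
summary = df.agg(["min", "max"])

# DataFrame.pipe: pass the whole DataFrame to a function, handy in method chains
with_total = df.pipe(lambda d: d.assign(total=d["a"] + d["b"]))
```

Which of these belongs on a cheat sheet is a judgment call; the sketch only shows how their inputs differ (single values, whole columns, or the whole DataFrame).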

Let me know what you think about it and make suggestions.
Oli

Contributor

@Dr-Irv Dr-Irv left a comment

@OliEfr
Thanks for your efforts, but I'm not OK with some of these changes, so please do the following:

  1. On first page, please put back the top left part with "Data Wrangling with pandas Cheat Sheet". You can make the font smaller if you want. You can include the references to the pandas user guide and API reference, but do not include the references to Seaborn and matplotlib. We're not in a position to pick any particular visualization tool.
  2. I'm OK with changing the title of the left panel from "Syntax - Creating DataFrames" to "Creating DataFrames", but please don't include the reference to IO Tools and read_csv there (see below for a suggestion)
  3. Under Method Chaining, put back the comment "This improves readability of code"
  4. At top of first page, please put back the Tidy Data section. Move the "Display and Visualize Data" section back to page 2 with just the original examples (but see below for a suggestion)
  5. In Subset Observations, please put back the pictures that illustrate what subsetting is all about. Try to fit in a df.query example. Separate out the ones for rows (which are about subsetting observations) and columns (which are about subsetting the variables), as they were before. I'm fine with moving the last 3 examples on the current cheat sheet that subset variables above the df.filter example, which would move the regex table down. Note that those 3 examples are about filtering columns (with the last example being about filtering rows and columns)
  6. On page 2, I'm OK with adding df.shape, but if you reformat that part to put df.shape to the right of len(df), then things will fit better.
  7. Remove the comment in "Group Data" about "Possibly use reset_index() after!"
  8. Remove the "apply Functions" example, and put the simple plotting example back. df.assign is also described above, so no need to include it again. (But see suggestion below).
  9. For the attribution on the bottom of the 2 pages, please use "Cheatsheet for pandas (http://pandas.pydata.org/) originally written by Irv Lustig, Princeton Consultants, inspired by Rstudio Data Wrangling Cheatsheet". Keep it flush right on both pages (that was a nice change)

Now for the suggestion. I've long wanted to add a third page that would have additional information that people would find useful, and just never had the time to do it, so it would be great if you could start this. This could include:

  • More visualization examples (but do NOT use references to other libraries, e.g. your seaborn example)
  • The options you included, and more (also display_max_columns) See the list of frequently used ones here: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#frequently-used-options
  • I/O: A whole section showing a variety of popular IO usage (CSV, Excel, SQL, HTML), and also output file formats (feather, parquet, HDF)
  • The Apply Functions section - although I'm not sure that is the right name for the section you created.
  • The new Extension Types (String, Integer, Float) and how pd.NA works
  • Anything else that could fill up the space (if needed)
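
Two of the suggested third-page topics, display options and the nullable extension types, can be sketched briefly. This is only an illustrative sketch and not proposed cheat sheet content; note that the option key in pandas is spelled "display.max_columns", and the values used here are arbitrary.

```python
import pandas as pd

# Display options: the option key is "display.max_columns";
# option_context limits the change to the with-block
with pd.option_context("display.max_columns", 20):
    inside = pd.get_option("display.max_columns")

# Nullable extension dtypes use pd.NA as the missing-value marker
ints = pd.array([1, 2, None], dtype="Int64")
missing = ints[2]

# pd.NA propagates through arithmetic instead of raising or becoming NaN
propagated = pd.NA + 1
```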

@OliEfr
Contributor Author

OliEfr commented Feb 19, 2021

  1. done (sorry if this is obvious - I should have known it!)
  2. done
  3. done
  4. Done. However, I don't know why you are keen on having it there. People using pandas know this. Or am I mistaken / missing something?
  5. I partly agree. Thing is, I found it confusing to have the same function at multiple locations when using the sheet for the first few times. More importantly, I think it is very useful to have multiple examples of loc and iloc, because they are so versatile. I made a suggestion for this section. Let me know what you think.
  6. Not sure what you mean. Is this alright?
  7. done
  8. done
  9. done

I haven't had time yet, but I agree with that third page. Will work on it later.

  • ToDo: Adding third page

@OliEfr
Contributor Author

OliEfr commented Feb 21, 2021

Idea: when we have a third page, there will be enough space. So why don't we create a section - I don't mind on which page - "Design Principles" or "Best Practice". Within this section we can have the comment about tidy data, vectorization and also things such as "First column is 0".

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 24, 2021
@Dr-Irv Dr-Irv removed the Stale label Mar 27, 2021
@Dr-Irv
Contributor

Dr-Irv commented Mar 27, 2021

First, sorry for the delay on reviewing this. It slipped through the proverbial cracks.

  1. Done. However, I don't know why you are keen on having it there. People using pandas know this. Or am I mistaken / missing something?

Some people start using pandas by using the cheat sheet, so that's why I like putting the stuff at the top. Also, it maps the words observation to row and variable to column, which corresponds to wording in the statistics community.

  1. I partly agree. Thing is, I found it confusing to have the same function at multiple locations when using the sheet for the first times. More importantly, I think it is very useful to have multiple examples of loc and iloc, because they are so versatile. I made a suggestion for this section. Let me know what you think.

We're getting close. See below.

  1. Not sure what you mean. Is this alright?

No. df.shape is like len(df). I think you can make a new line for df.shape and describe it as "Tuple of # of rows, # of columns in DataFrame"
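
To make the distinction concrete, here is a minimal sketch with a made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns
df = pd.DataFrame({"Length": [5, 8, 9], "Width": [3, 7, 9]})

# len(df) gives the number of rows only
n_rows = len(df)

# df.shape gives a tuple of (# of rows, # of columns)
shape = df.shape
```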

Now, for other comments:

  1. On first page, let's use the white space below "Subset Variables - columns" as a new section "Using query", and remove df.query(expr) from the Subset Observations section, but provide a few examples in the new section. Here are 3 examples to use
    • df.query("Length > 7")
    • df.query("Length > 7 and Width < 8")
    • df.query("Name.str.startswith('abc')", engine="python")
  2. On first page, for .iloc and .loc provide examples rather than using row_ind and column_ind
  3. On second page, in the "Group Data" section, in the groupby example, remove the .mean()
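
For reference, the three query examples from item 1 can be run against a small DataFrame. The data below is invented for illustration; only the query strings come from the comment above.

```python
import pandas as pd

# Hypothetical data with the column names used in the query examples
df = pd.DataFrame(
    {
        "Name": ["abc1", "xyz", "abc2", "def"],
        "Length": [9, 8, 5, 10],
        "Width": [6, 9, 4, 7],
    }
)

# Rows where Length exceeds 7
long_rows = df.query("Length > 7")

# Rows satisfying both conditions
long_narrow = df.query("Length > 7 and Width < 8")

# String methods inside query; engine="python" as in the example above
abc_rows = df.query("Name.str.startswith('abc')", engine="python")
```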

@OliEfr
Contributor Author

OliEfr commented Mar 28, 2021

Alright, I adapted everything, apart from:

  1. On first page, for .iloc and .loc provide examples rather than using row_ind and column_ind

For me, it was not obvious at first that the first index is row_ind and the second is column_ind. I always confused them and had to look it up quite often.

I removed row_ind and column_ind and added the sentence: First index selects rows, second index columns.
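
A minimal sketch of that sentence, using a made-up DataFrame (the row labels, column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with explicit row labels
df = pd.DataFrame(
    {"width": [1, 2, 3], "length": [4, 5, 6], "species": ["a", "b", "c"]},
    index=["r0", "r1", "r2"],
)

# iloc: integer positions; first index selects rows, second index columns
by_position = df.iloc[0:2, [0, 2]]

# loc: labels; note that label slices include the end label
by_label = df.loc["r0":"r1", ["width", "species"]]
```

Both selections return the same two rows and two columns here, which shows that the row selector always comes first regardless of whether positions or labels are used.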

@Dr-Irv
Contributor

Dr-Irv commented Mar 28, 2021

@OliEfr We're getting even closer. Additional changes to make:

  1. In the query examples, make sure the quotes are straight quotes, not curly quotes
  2. Have a hyperlink from each query example to the docs
  3. In the blue box for query, make it smaller, then have text before the examples that says "query() allows boolean expressions for filtering rows". To help make space, change the text for the df[['width', 'length', 'species']] examples to put the word "names" on the same line as the rest of the text.
  4. In the examples for loc and iloc, add hyperlinks to the docs
  5. This is a small thing, but on page 1, the box for "Creating DataFrames" should top align with the box for "Reshaping Data". If you move "Creating DataFrames" down a bit, you can increase the font size of "Data Wrangling with pandas Cheat Sheet" to something closer to the old font size.
  6. Since there is now some space at the top of the box with the "Creating DataFrames" example, spread out the examples vertically to distribute the white space
  7. In the df.iat[1,2] example, add a space after the comma.
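
For readers unfamiliar with the df.iat example in item 7, a small sketch follows. The data is hypothetical; df.iat takes integer positions (row, then column) and df.at is its label-based counterpart.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame(
    {"width": [1, 2, 3], "length": [4, 5, 6], "species": ["a", "b", "c"]},
    index=["r0", "r1", "r2"],
)

# Single value at row position 1, column position 2
by_position = df.iat[1, 2]

# The same cell addressed by row label and column name
by_label = df.at["r1", "species"]
```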

@OliEfr
Contributor Author

OliEfr commented Mar 28, 2021

Check

@Dr-Irv
Contributor

Dr-Irv commented Mar 29, 2021

Should have caught a few more issues (some of which existed prior to your PR):

  1. On page 2, in Group Data, in first example, get rid of .max()
  2. In the df.iloc and df.loc examples, make sure there are spaces after every comma
  3. Add spaces after commas in the last examples with MultiIndex. If the alignment isn't great, consider splitting the tuples over multiple lines, which you can now do because you have the white space available
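
For context on the Group Data examples, groupby on its own returns a GroupBy object, and aggregations such as .mean() or .max() are a separate step. A minimal sketch with made-up data (column names and values are assumptions):

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"species": ["a", "a", "b"], "length": [1.0, 3.0, 5.0]})

# groupby alone only defines the groups; nothing is aggregated yet
grouped = df.groupby(by="species")

# Aggregations are applied to the GroupBy object afterwards
means = grouped.mean()
maxes = grouped["length"].max()
```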

@OliEfr
Contributor Author

OliEfr commented Mar 29, 2021

Check - and corrected a few more comma and spacing issues.

@Dr-Irv Dr-Irv merged commit 8719c7a into pandas-dev:master Mar 29, 2021
@Dr-Irv
Contributor

Dr-Irv commented Mar 29, 2021

Thanks @OliEfr

@OliEfr
Contributor Author

OliEfr commented Mar 29, 2021

If you want, you can create an issue with your ideas for a third page/further ideas.

I'll probably keep working on it.

@Dr-Irv
Contributor

Dr-Irv commented Mar 29, 2021

If you want, you can create an issue with your ideas for a third page/further ideas.

I'll probably keep working on it.

Good idea. Created #40680

If you do that work, I'll try to be more responsive!

vladu pushed a commit to vladu/pandas that referenced this pull request Apr 5, 2021
* Added links to official docs in cheat sheets

* DOC: added links to official docs in cheat sheet (update)

* DOC: update cheatsheet

* DOC: minor changes cheatsheet; update honors

* DOC: rework according to requested changes

* Update Cheatsheet

* Update Cheatsheet

* update cheatsheet

* update cheatsheets

* update cheatsheet!
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
@maripisravankumar

changes

for update cheatsheet issue #39806

1. Replace the heading "Data Wrangling" with "Shortcuts for DATASCIENCE tools"

  • Insert some reference links at the top of the sheet to make it easier to understand
  • [https://www.w3schools.com/python/pandas/default.asp]

Rework it as a simple introduction to DataFrames, showing how to read the data from the csv file:

For example:

    from sklearn.datasets import load_iris
    import pandas as pd

    data = load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df.head()

image

2. Subset variables / observations: insert some Python code useful for pandas DataFrames and updates

  1. Better introduction
  2. Examples:

    # Select the species and plot columns from the DataFrame
    surveys_df[['species_id', 'plot_id']]

    # Select the last element in the list
    # (the slice starts at the last element, and ends at the end of the list)
    surveys_df[-1:]

    # Using the 'copy() method'
    true_copy_surveys_df = surveys_df.copy()

    # Using the '=' operator
    ref_surveys_df = surveys_df

3. Summarizing Data and Handling the missing Data:

    # summarize the dataset
    print(dataset.describe())

    # count the number of missing values for each column
    num_missing = (dataset[[1, 2, 3, 4, 5]] == 0).sum()

    # report the result
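
The difference between the copy() method and the '=' operator mentioned above can be shown in a short sketch. The surveys_df contents here are invented for illustration; only the variable names follow the comment.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df
surveys_df = pd.DataFrame({"species_id": ["NL", "DM"], "plot_id": [2, 3]})

# copy() creates an independent DataFrame
true_copy_surveys_df = surveys_df.copy()

# '=' only binds a second name to the same object
ref_surveys_df = surveys_df

# Modify the original and compare what each name sees
surveys_df.loc[0, "plot_id"] = 99
value_in_copy = true_copy_surveys_df.loc[0, "plot_id"]
value_in_ref = ref_surveys_df.loc[0, "plot_id"]
```

After the assignment, the reference reflects the change while the copy keeps the original value.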

@Dr-Irv
Contributor

Dr-Irv commented Aug 31, 2021

changes

for update cheatsheet issue #39806

@maripisravankumar This is a closed PR on the cheat sheet. If you would like to propose other changes to the cheat sheet, you can create your own PR and do edits, and I will either accept or reject them, or suggest changes.

Successfully merging this pull request may close these issues.

QST: Publish my personal modification of the official pandas cheat sheet
4 participants