The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.
pip install pandas-cub
pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.
pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.
- Subset selection with the brackets
- Arithmetic and comparison operators (+, -, <, !=, etc...)
- Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
- Grouping via pivot tables
- String-only methods for columns containing strings
- Reading in simple comma-separated value files
- Several other methods
pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.
Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.
pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.
pandas_cub consists of a single function, read_csv
, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head
method returns the first 5 rows by default.
import pandas_cub as pdc
df = pdc.read_csv('data/employee.csv')
df.head()
dept | race | gender | salary | |
---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 |
1 | Houston Fire Department (HFD) | White | Male | 63166 |
2 | Houston Police Department-HPD | Black | Male | 66614 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 |
4 | Houston Airport System (HAS) | White | Male | 42390 |
The shape
property returns a tuple of the number of rows and columns
df.shape
(1535, 4)
The len
function returns just the number of rows.
len(df)
1535
The dtypes
property returns a DataFrame of the column names and their respective data type.
df.dtypes
Column Name | Data Type | |
---|---|---|
0 | dept | string |
1 | race | string |
2 | gender | string |
3 | salary | int |
The columns
property returns a list of the columns.
df.columns
['dept', 'race', 'gender', 'salary']
Set new columns by assigning the columns
property to a list.
df.columns = ['department', 'race', 'gender', 'salary']
df.head()
department | race | gender | salary | |
---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 |
1 | Houston Fire Department (HFD) | White | Male | 63166 |
2 | Houston Police Department-HPD | Black | Male | 66614 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 |
4 | Houston Airport System (HAS) | White | Male | 42390 |
The values
property returns a single numpy array of all the data.
df.values
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
['Houston Fire Department (HFD)', 'White', 'Male', 63166],
['Houston Police Department-HPD', 'Black', 'Male', 66614],
...,
['Houston Police Department-HPD', 'White', 'Male', 43443],
['Houston Police Department-HPD', 'Asian', 'Male', 55461],
['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
dtype=object)
Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.
df['race'].head()
race | |
---|---|
0 | White |
1 | White |
2 | Black |
3 | Asian |
4 | White |
Select multiple columns with a list of strings.
df[['race', 'salary']].head()
race | salary | |
---|---|---|
0 | White | 45279 |
1 | White | 63166 |
2 | Black | 66614 |
3 | Asian | 71680 |
4 | White | 42390 |
Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.
rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
salary | race | |
---|---|---|
0 | 77076 | Black |
1 | 81239 | White |
2 | 81239 | White |
You can use integers for the columns as well.
rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
gender | department | |
---|---|---|
0 | Male | Houston Police Department-HPD |
1 | Male | Houston Police Department-HPD |
2 | Male | Houston Police Department-HPD |
You can use a single integer and not just a list.
df[99, 3]
salary | |
---|---|
0 | 66614 |
Or a single string for the columns
df[99, 'salary']
salary | |
---|---|
0 | 66614 |
You can use a slice for the rows
df[20:100:10, ['race', 'gender']]
race | gender | |
---|---|---|
0 | White | Male |
1 | White | Male |
2 | Hispanic | Male |
3 | White | Male |
4 | White | Male |
5 | Hispanic | Male |
6 | Hispanic | Male |
7 | Black | Female |
You can also slice the columns with either integers or strings
df[20:100:10, :2]
department | race | |
---|---|---|
0 | Houston Police Department-HPD | White |
1 | Houston Fire Department (HFD) | White |
2 | Houston Police Department-HPD | Hispanic |
3 | Houston Police Department-HPD | White |
4 | Houston Fire Department (HFD) | White |
5 | Houston Police Department-HPD | Hispanic |
6 | Houston Fire Department (HFD) | Hispanic |
7 | Houston Police Department-HPD | Black |
df[20:100:10, 'department':'gender']
department | race | gender | |
---|---|---|---|
0 | Houston Police Department-HPD | White | Male |
1 | Houston Fire Department (HFD) | White | Male |
2 | Houston Police Department-HPD | Hispanic | Male |
3 | Houston Police Department-HPD | White | Male |
4 | Houston Fire Department (HFD) | White | Male |
5 | Houston Police Department-HPD | Hispanic | Male |
6 | Houston Fire Department (HFD) | Hispanic | Male |
7 | Houston Police Department-HPD | Black | Female |
You can do boolean selection if you pass the brackets a one-column boolean DataFrame.
filt = df['salary'] > 100000
filt.head()
salary | |
---|---|
0 | False |
1 | False |
2 | False |
3 | False |
4 | False |
df[filt].head()
department | race | gender | salary | |
---|---|---|---|---|
0 | Public Works & Engineering-PWE | White | Male | 107962 |
1 | Health & Human Services | Black | Male | 180416 |
2 | Houston Fire Department (HFD) | Hispanic | Male | 165216 |
3 | Health & Human Services | White | Female | 100791 |
4 | Houston Airport System (HAS) | White | Male | 120916 |
df[filt, ['race', 'salary']].head()
race | salary | |
---|---|---|
0 | White | 107962 |
1 | Black | 180416 |
2 | Hispanic | 165216 |
3 | White | 100791 |
4 | White | 120916 |
You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:
df['bonus'] = 1000
df.head()
department | race | gender | salary | bonus | |
---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 1000 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1000 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 1000 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 1000 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 1000 |
You can assign with a numpy array the same length as a column.
import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()
department | race | gender | salary | bonus | |
---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 3536 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1296 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 511 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 4267 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3766 |
You can assign a new column with a one column DataFrame.
df['salary'] + df['bonus']
salary | |
---|---|
0 | 48815 |
1 | 64462 |
2 | 67125 |
3 | 75947 |
4 | 46156 |
5 | 110001 |
6 | 53738 |
7 | 185348 |
8 | 32575 |
9 | 57918 |
... | ... |
1525 | 32936 |
1526 | 49294 |
1527 | 34218 |
1528 | 82795 |
1529 | 104900 |
1530 | 46408 |
1531 | 67050 |
1532 | 47368 |
1533 | 60013 |
1534 | 52624 |
df['total salary'] = df['salary'] + df['bonus']
df.head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1296 | 64462 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 511 | 67125 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3766 | 46156 |
df1 = df[['salary', 'bonus']] * 5
df1.head()
salary | bonus | |
---|---|---|
0 | 226395 | 17680 |
1 | 315830 | 6480 |
2 | 333070 | 2555 |
3 | 358400 | 21335 |
4 | 211950 | 18830 |
df1 = df[['salary', 'bonus']] > 100000
df1.head()
salary | bonus | |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
df1 = df['race'] == 'White'
df1.head()
race | |
---|---|
0 | True |
1 | True |
2 | False |
3 | False |
4 | True |
Most of the common aggregation methods are available. They only work down the columns and not across the rows.
df.min()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Health & Human Services | Asian | Female | 24960 | 101 | 25913 |
Columns that the aggregation does not work are dropped.
df.mean()
salary | bonus | total salary | |
---|---|---|---|
0 | 56278.746 | 2594.283 | 58873.029 |
df.argmax()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | 3 | 0 | 0 | 145 | 1516 | 145 |
df['salary'].argmin()
salary | |
---|---|
0 | 347 |
Check if all salaries are greater than 20000
df1 = df['salary'] > 20000
df1.all()
salary | |
---|---|
0 | True |
Count the number of non-missing values
df.count()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | 1535 | 1535 | 1535 | 1535 | 1535 | 1535 |
Get number of unique values.
df.nunique()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | 6 | 5 | 2 | 548 | 1318 | 1524 |
These are methods that do not return a single value.
Get the unique values of each column. The unique
method returns a list of DataFrames containing the unique values for each column.
dfs = df.unique()
dfs[0]
department | |
---|---|
0 | Health & Human Services |
1 | Houston Airport System (HAS) |
2 | Houston Fire Department (HFD) |
3 | Houston Police Department-HPD |
4 | Parks & Recreation |
5 | Public Works & Engineering-PWE |
dfs[1]
race | |
---|---|
0 | Asian |
1 | Black |
2 | Hispanic |
3 | Native American |
4 | White |
dfs[2]
gender | |
---|---|
0 | Female |
1 | Male |
Rename columns with a dictionary.
df.rename({'department':'dept', 'bonus':'BONUS'}).head()
dept | race | gender | salary | BONUS | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1296 | 64462 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 511 | 67125 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3766 | 46156 |
Drop columns with a string or list of strings.
df.drop('race').head()
department | gender | salary | bonus | total salary | |
---|---|---|---|---|---|
0 | Houston Police Department-HPD | Male | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | Male | 63166 | 1296 | 64462 |
2 | Houston Police Department-HPD | Male | 66614 | 511 | 67125 |
3 | Public Works & Engineering-PWE | Male | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | Male | 42390 | 3766 | 46156 |
df.drop(['race', 'gender']).head()
department | salary | bonus | total salary | |
---|---|---|---|---|
0 | Houston Police Department-HPD | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | 63166 | 1296 | 64462 |
2 | Houston Police Department-HPD | 66614 | 511 | 67125 |
3 | Public Works & Engineering-PWE | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | 42390 | 3766 | 46156 |
The next several methods are non-aggregating methods that return a DataFrame with the same exact shape as the original. They only work on boolean, integer and float columns and ignore string columns.
Absolute value
df.abs().head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1296 | 64462 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 511 | 67125 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3766 | 46156 |
Cumulative min, max, and sum
df.cummax().head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 3536 | 64462 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 3536 | 67125 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | White | Male | 71680 | 4267 | 75947 |
Clip values to be within a range.
df.clip(40000, 60000).head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 40000 | 48815 |
1 | Houston Fire Department (HFD) | White | Male | 60000 | 40000 | 60000 |
2 | Houston Police Department-HPD | Black | Male | 60000 | 40000 | 60000 |
3 | Public Works & Engineering-PWE | Asian | Male | 60000 | 40000 | 60000 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 40000 | 46156 |
Round numeric columns
df.round(-3).head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45000 | 4000 | 49000 |
1 | Houston Fire Department (HFD) | White | Male | 63000 | 1000 | 64000 |
2 | Houston Police Department-HPD | Black | Male | 67000 | 1000 | 67000 |
3 | Public Works & Engineering-PWE | Asian | Male | 72000 | 4000 | 76000 |
4 | Houston Airport System (HAS) | White | Male | 42000 | 4000 | 46000 |
Copy the DataFrame
df.copy().head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | 45279 | 3536 | 48815 |
1 | Houston Fire Department (HFD) | White | Male | 63166 | 1296 | 64462 |
2 | Houston Police Department-HPD | Black | Male | 66614 | 511 | 67125 |
3 | Public Works & Engineering-PWE | Asian | Male | 71680 | 4267 | 75947 |
4 | Houston Airport System (HAS) | White | Male | 42390 | 3766 | 46156 |
Take the nth difference.
df.diff(2).head(10)
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | nan | nan | nan |
1 | Houston Fire Department (HFD) | White | Male | nan | nan | nan |
2 | Houston Police Department-HPD | Black | Male | 21335.000 | -3025.000 | 18310.000 |
3 | Public Works & Engineering-PWE | Asian | Male | 8514.000 | 2971.000 | 11485.000 |
4 | Houston Airport System (HAS) | White | Male | -24224.000 | 3255.000 | -20969.000 |
5 | Public Works & Engineering-PWE | White | Male | 36282.000 | -2228.000 | 34054.000 |
6 | Houston Fire Department (HFD) | Hispanic | Male | 10254.000 | -2672.000 | 7582.000 |
7 | Health & Human Services | Black | Male | 72454.000 | 2893.000 | 75347.000 |
8 | Public Works & Engineering-PWE | Black | Male | -22297.000 | 1134.000 | -21163.000 |
9 | Health & Human Services | Black | Male | -125147.000 | -2283.000 | -127430.000 |
Find the nth percentage change.
df.pct_change(2).head(10)
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | White | Male | nan | nan | nan |
1 | Houston Fire Department (HFD) | White | Male | nan | nan | nan |
2 | Houston Police Department-HPD | Black | Male | 0.471 | -0.855 | 0.375 |
3 | Public Works & Engineering-PWE | Asian | Male | 0.135 | 2.292 | 0.178 |
4 | Houston Airport System (HAS) | White | Male | -0.364 | 6.370 | -0.312 |
5 | Public Works & Engineering-PWE | White | Male | 0.506 | -0.522 | 0.448 |
6 | Houston Fire Department (HFD) | Hispanic | Male | 0.242 | -0.710 | 0.164 |
7 | Health & Human Services | Black | Male | 0.671 | 1.419 | 0.685 |
8 | Public Works & Engineering-PWE | Black | Male | -0.424 | 1.037 | -0.394 |
9 | Health & Human Services | Black | Male | -0.694 | -0.463 | -0.688 |
Sort the DataFrame by one or more columns
df.sort_values('salary').head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | Black | Female | 24960 | 953 | 25913 |
1 | Public Works & Engineering-PWE | Hispanic | Male | 26104 | 4258 | 30362 |
2 | Public Works & Engineering-PWE | Black | Female | 26125 | 3247 | 29372 |
3 | Houston Airport System (HAS) | Hispanic | Female | 26125 | 832 | 26957 |
4 | Houston Airport System (HAS) | Black | Female | 26125 | 2461 | 28586 |
Sort descending
df.sort_values('salary', asc=False).head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Fire Department (HFD) | White | Male | 210588 | 3724 | 214312 |
1 | Houston Police Department-HPD | White | Male | 199596 | 848 | 200444 |
2 | Houston Airport System (HAS) | Black | Male | 186192 | 1778 | 187970 |
3 | Health & Human Services | Black | Male | 180416 | 4932 | 185348 |
4 | Public Works & Engineering-PWE | White | Female | 178331 | 2124 | 180455 |
Sort by multiple columns
df.sort_values(['race', 'salary']).head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Airport System (HAS) | Asian | Female | 26125 | 4446 | 30571 |
1 | Houston Police Department-HPD | Asian | Male | 27914 | 2855 | 30769 |
2 | Houston Police Department-HPD | Asian | Male | 28169 | 2572 | 30741 |
3 | Public Works & Engineering-PWE | Asian | Male | 28995 | 2874 | 31869 |
4 | Public Works & Engineering-PWE | Asian | Male | 30347 | 4938 | 35285 |
Randomly sample the DataFrame
df.sample(n=3)
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Fire Department (HFD) | White | Male | 62540 | 2995 | 65535 |
1 | Public Works & Engineering-PWE | White | Male | 63336 | 1547 | 64883 |
2 | Houston Police Department-HPD | White | Male | 52514 | 1150 | 53664 |
Randomly sample a fraction
df.sample(frac=.005)
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Houston Police Department-HPD | Hispanic | Female | 60347 | 1200 | 61547 |
1 | Public Works & Engineering-PWE | Black | Male | 49109 | 3598 | 52707 |
2 | Health & Human Services | Black | Female | 48984 | 4602 | 53586 |
3 | Houston Police Department-HPD | White | Male | 55461 | 2813 | 58274 |
4 | Houston Airport System (HAS) | Black | Female | 29286 | 1877 | 31163 |
5 | Houston Police Department-HPD | Asian | Male | 66614 | 4480 | 71094 |
6 | Houston Fire Department (HFD) | White | Male | 28024 | 4475 | 32499 |
Sample with replacement
df.sample(n=10000, replace=True).head()
department | race | gender | salary | bonus | total salary | |
---|---|---|---|---|---|---|
0 | Parks & Recreation | Black | Female | 31075 | 1665 | 32740 |
1 | Public Works & Engineering-PWE | Hispanic | Male | 67038 | 644 | 67682 |
2 | Houston Police Department-HPD | Black | Male | 37024 | 1532 | 38556 |
3 | Health & Human Services | Black | Female | 57433 | 3106 | 60539 |
4 | Public Works & Engineering-PWE | Black | Male | 53373 | 924 | 54297 |
Use the str
accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.
df.str.count('department', 'P').head()
department | |
---|---|
0 | 2 |
1 | 0 |
2 | 2 |
3 | 2 |
4 | 0 |
df.str.lower('department').head()
department | |
---|---|
0 | houston police department-hpd |
1 | houston fire department (hfd) |
2 | houston police department-hpd |
3 | public works & engineering-pwe |
4 | houston airport system (has) |
df.str.find('department', 'Houston').head()
department | |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | -1 |
4 | 0 |
pandas_cub provides the value_counts
method for simple frequency counting of unique values and pivot_table
for grouping and aggregating.
The value_counts
method returns a list of DataFrames, one for each column.
dfs = df[['department', 'race', 'gender']].value_counts()
dfs[0]
department | count | |
---|---|---|
0 | Houston Police Department-HPD | 570 |
1 | Houston Fire Department (HFD) | 365 |
2 | Public Works & Engineering-PWE | 341 |
3 | Health & Human Services | 103 |
4 | Houston Airport System (HAS) | 103 |
5 | Parks & Recreation | 53 |
dfs[1]
race | count | |
---|---|---|
0 | White | 542 |
1 | Black | 518 |
2 | Hispanic | 381 |
3 | Asian | 87 |
4 | Native American | 7 |
dfs[2]
gender | count | |
---|---|---|
0 | Male | 1135 |
1 | Female | 400 |
If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize
parameter to True
.
df['race'].value_counts(normalize=True)
race | count | |
---|---|---|
0 | White | 0.353 |
1 | Black | 0.337 |
2 | Hispanic | 0.248 |
3 | Asian | 0.057 |
4 | Native American | 0.005 |
The pivot_table
method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.
df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')
race | Female | Male | |
---|---|---|---|
0 | Asian | 58304.222 | 60622.957 |
1 | Black | 48133.382 | 51853.000 |
2 | Hispanic | 44216.960 | 55493.064 |
3 | Native American | 58844.333 | 68850.500 |
4 | White | 66415.528 | 63439.196 |
If you don't provide values
or aggfunc
then by default it will return frequency (a contingency table).
df.pivot_table(rows='race', columns='gender')
race | Female | Male | |
---|---|---|---|
0 | Asian | 18 | 69 |
1 | Black | 207 | 311 |
2 | Hispanic | 100 | 281 |
3 | Native American | 3 | 4 |
4 | White | 72 | 470 |
You can group by just a single column.
df.pivot_table(rows='department', values='salary', aggfunc='mean')
department | mean | |
---|---|---|
0 | Health & Human Services | 51324.981 |
1 | Houston Airport System (HAS) | 53990.369 |
2 | Houston Fire Department (HFD) | 59960.441 |
3 | Houston Police Department-HPD | 60428.746 |
4 | Parks & Recreation | 39426.151 |
5 | Public Works & Engineering-PWE | 50207.806 |
df.pivot_table(columns='department', values='salary', aggfunc='mean')
Health & Human Services | Houston Airport System (HAS) | Houston Fire Department (HFD) | Houston Police Department-HPD | Parks & Recreation | Public Works & Engineering-PWE | |
---|---|---|---|---|---|---|
0 | 51324.981 | 53990.369 | 59960.441 | 60428.746 | 39426.151 | 50207.806 |