The Lahman Baseball Dataset contains complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.
In this project, we're interested in answering a few questions (see below) about Major League Baseball. To answer them, we're going to write some Python and use the Lahman Baseball Dataset!
As a University of Michigan alumnus (Go Blue!), I'm always curious which universities produce top talent. We'll look at which universities produce the most MLB All-Stars.
The salaries that MLB players get, compared to the general population, are astronomically high. Let's look at how MLB salaries have changed over time.
It's also interesting that the best players seem to be paid disproportionately more than the rest of the 'normal' players. Let's compare the salaries of All-Stars with those of their average counterparts.
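As a rough illustration of the salary question, here is a minimal sketch that averages salaries per year for All-Stars and for everyone else, using only the standard library. It is not the notebook's actual code. The function name `mean_salary_by_group` is hypothetical, and the column names `playerID`, `yearID`, and `salary` are assumptions about the layout of Salaries.csv and AllstarFull.csv, so check them against your copies of the files.

```python
import csv
from collections import defaultdict
from statistics import mean


def mean_salary_by_group(salaries_file, allstar_file):
    """Return {year: {"all_star": mean, "other": mean}} from two open CSV files.

    Assumes columns named playerID, yearID and salary (not verified here).
    """
    # (playerID, yearID) pairs for every All-Star selection.
    all_stars = {
        (row["playerID"], row["yearID"]) for row in csv.DictReader(allstar_file)
    }

    # Collect salaries per year, split by All-Star status in that same year.
    buckets = defaultdict(lambda: {"all_star": [], "other": []})
    for row in csv.DictReader(salaries_file):
        key = (row["playerID"], row["yearID"])
        group = "all_star" if key in all_stars else "other"
        buckets[int(row["yearID"])][group].append(float(row["salary"]))

    # Average each non-empty bucket; an empty group maps to None.
    return {
        year: {g: (mean(v) if v else None) for g, v in groups.items()}
        for year, groups in sorted(buckets.items())
    }
```

To try it on the real data, open Salaries.csv and AllstarFull.csv with `open(path, newline="")` and pass the two file objects in that order.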
You'll need to install:
- Master.csv: Full Set of Baseball Data
- AllstarFull.csv: A CSV file of MLB All-Star Players
- CollegePlaying.csv: A CSV file of MLB Players and their Colleges
- Salaries.csv: A CSV file of MLB Players and their Salaries
- Schools.csv: A CSV file of MLB Players and their School IDs
- Baseball_Data-Analysis.ipynb: Main project file, the IPython notebook that contains the analysis.
- baseball_data-audit.py: Audits the baseball data for cleanliness.
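To show how these files fit together for the university question, here is a minimal sketch that counts distinct All-Star players per school with the standard library's `csv` module. It is an illustration, not the notebook's method. The function name `all_stars_per_school` is hypothetical, and the column names `playerID`, `schoolID`, and `name_full` are assumptions about the CSV headers.

```python
import csv
from collections import Counter


def all_stars_per_school(allstar_file, college_file, schools_file):
    """Return [(school name, number of distinct All-Star players), ...].

    Assumes columns playerID, schoolID and name_full (not verified here).
    """
    # Distinct players who appear at least once in the All-Star file.
    all_star_ids = {row["playerID"] for row in csv.DictReader(allstar_file)}

    # Distinct (player, school) pairs, so several college seasons count once.
    pairs = {
        (row["playerID"], row["schoolID"])
        for row in csv.DictReader(college_file)
        if row["playerID"] in all_star_ids
    }

    # Map school IDs to readable names; fall back to the ID if a name is missing.
    names = {
        row["schoolID"]: row["name_full"] for row in csv.DictReader(schools_file)
    }

    counts = Counter(names.get(school_id, school_id) for _, school_id in pairs)
    return counts.most_common()
```

A player who attended two schools is counted once for each, which may or may not be what you want for a ranking.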
The project notebook, Baseball_Data-Analysis.ipynb, can be read using Jupyter Notebook. There's also an HTML version, Baseball_Data-Analysis.html, included for easier viewing.
- Open your Command Prompt (PC) or terminal (Mac or Linux).
  - On a PC, click the Start button and search for "Command Prompt".
  - On a Mac, press Command + Space, then type "terminal" in Spotlight Search. You can also search for "terminal" in Finder.
- Navigate to the directory where you downloaded the Jupyter notebook file.
  - On a PC you might type: cd C:\Users\username\Downloads, replacing username with your own user name.
  - On Mac or Linux you might type: cd ~/Downloads.
- Run the command jupyter notebook Baseball_Data-Analysis.ipynb in your terminal.
If you try running a code block in the notebook and get an error message like 'no module named matplotlib', then your distribution of Anaconda may be missing a package used in the project. That's okay; these packages are easy to install. Search online for any missing library to find easy-to-use installation guides!
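If you would rather check up front than wait for an import error, the sketch below lists which packages Python cannot find. The helper name `missing_packages` is hypothetical, and the package list is an assumption: only matplotlib is named above, while pandas and numpy are guesses at what the notebook uses.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; only matplotlib is mentioned in this README.
REQUIRED = ["matplotlib", "pandas", "numpy"]
print("Missing:", missing_packages(REQUIRED))
```

Anything reported as missing can usually be installed with `conda install <name>` in an Anaconda environment, or with `pip install <name>` otherwise.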
- Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
- Person identification and demographics data are provided by Chadwick Baseball Bureau, from its Register of baseball personnel.
- Player performance data for 1871 through 2014 is based on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.
- Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License