
04 Chapter: The Python Data Science Stack

Mikiko Bazeley edited this page Dec 19, 2019 · 1 revision

Overview

This unit introduces an ecosystem of powerful Python tools for data science, with a primary focus on visualization libraries such as Matplotlib. You’ll also learn advanced programming constructs and some best practices for writing Python code. Basic Python is a prerequisite for this unit; if you need a refresher, review the pre-work. See the What Will Help section of the Unit Plan for ideas on how to warm up for this unit.

What You’ll Learn: Learning Objectives

  • Learn some of the specialized Python tools and techniques that data scientists use every day, e.g., regular expressions, iterators, and list comprehensions
  • Learn how to write high-quality Python code that makes life easier for you and makes your portfolio pop in a hiring manager’s eyes
  • Learn the fundamentals of data visualization and plotting in Python, starting with Matplotlib, followed by Seaborn
  • Propose Capstone Project 1

Words to Know: Key Terms & Concepts

  • Tools in Python: idioms, constructs, and syntax are established ways of accomplishing common tasks in code.
  • Libraries: pieces of code that other people have written that you can pull into your code so that you don’t have to reinvent the wheel.
  • Technology stack: The collective set of tools and programs used in an organization or team.
  • Regular expressions: A technique to quickly search for or substitute complex patterns in strings.
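To make the regular expressions entry concrete, here is a minimal sketch using Python's standard `re` module. The sample text, the simplified email pattern, and the variable names are assumptions made up for illustration; real-world email matching needs a more careful pattern.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
text = "Contact alice@example.com or bob@example.org before 2019-12-19."

# Search: collect every substring that looks like an email address.
# This is a deliberately simplified pattern, not a full email validator.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Substitute: rewrite ISO dates (YYYY-MM-DD) as DD/MM/YYYY using capture groups.
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(emails)
print(reformatted)
```

`re.findall` covers the "search" half of the definition and `re.sub` covers the "substitute" half.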

Chapter 4.1 - Python for Data Science

In this unit, we'll cover basic Python functions and standard libraries that provide the foundation for becoming a great data scientist, including libraries for iterating over large volumes of data, handling errors, and visualizing data.
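As a small illustration of iterating over data lazily while handling errors, the sketch below combines a generator, a `try`/`except` block, and a list comprehension. The function name `parse_numbers` and the sample lines are hypothetical; in practice the lines might come from a large file that you do not want to load all at once.

```python
def parse_numbers(lines):
    """Yield one parsed float at a time, skipping lines that are not numbers."""
    for line in lines:
        try:
            yield float(line)
        except ValueError:
            # Blank or malformed lines are skipped rather than stopping the loop.
            continue


# Hypothetical raw input, standing in for lines read from a file.
raw_lines = ["3.5\n", "\n", "4.0\n", "not a number\n", "2.5\n"]

# The generator produces values on demand, so sum() never holds a full list.
total = sum(parse_numbers(raw_lines))

# A list comprehension filters and transforms in one readable expression.
large_squares = [x ** 2 for x in parse_numbers(raw_lines) if x > 3]

print(total)
print(large_squares)
```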

Interactive Exercises: Data Types for Data Science

In this DataCamp resource, you'll consolidate and practice your knowledge of lists, dictionaries, tuples, sets, and datetimes, using them to solve multistep problems, including an extended case study using Chicago metropolitan area transit data. You'll also learn how to use objects in the Python collections module, which let you store and manipulate data for a variety of data science purposes. After taking this course, you'll be ready to tackle many data science challenges Pythonically.
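The following is a minimal sketch of three objects from the standard `collections` module: `namedtuple`, `Counter`, and `defaultdict`. The `Ride` record and the ridership numbers are invented for illustration and are not taken from the actual transit dataset used in the course.

```python
from collections import Counter, defaultdict, namedtuple

# A namedtuple gives each field a readable name instead of a bare index.
Ride = namedtuple("Ride", ["station", "day_type", "riders"])

# Hypothetical records; the values are made up for illustration.
rides = [
    Ride("Austin", "weekday", 1200),
    Ride("Belmont", "weekday", 900),
    Ride("Austin", "weekend", 400),
    Ride("Clark", "weekday", 1500),
]

# Counter tallies how many records each station has.
station_counts = Counter(ride.station for ride in rides)

# defaultdict(int) starts every missing key at 0, so no key check is needed.
riders_by_station = defaultdict(int)
for ride in rides:
    riders_by_station[ride.station] += ride.riders

print(station_counts.most_common(1))
print(dict(riders_by_station))
```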

Interactive Exercises: Python Data Science Toolbox (Part 1)

It's now time to push forward and develop your Python chops even further using this DataCamp resource. As a data scientist, you'll constantly need to write functions to solve problems that are dictated by your data. By the end of this resource, you'll be able to write custom functions, complete with multiple parameters and multiple return values, along with default arguments and variable-length arguments. You'll also gain insight into scoping in Python and writing lambda functions, handling errors, and analyzing Twitter DataFrames for practice.
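The function features listed above can be sketched in a few lines. The function `summarize` and its inputs are hypothetical examples, not part of the DataCamp material: it takes variable-length arguments, has a default argument, returns multiple values, and raises an error that the caller handles.

```python
def summarize(*values, precision=2):
    """Return (minimum, maximum, mean) of the values, with the mean rounded."""
    if not values:
        raise ValueError("summarize() needs at least one value")
    mean = sum(values) / len(values)
    # Returning several values at once actually returns a single tuple.
    return min(values), max(values), round(mean, precision)


# Multiple return values are unpacked into separate names.
low, high, mean = summarize(3, 8, 4)

# The default argument can be overridden by keyword.
coarse = summarize(1, 2, 2, precision=1)

# A lambda is a small anonymous function, handy as a sort key.
by_length = sorted(["pandas", "re", "numpy"], key=lambda word: len(word))

# Error handling: catch the exception instead of letting the program crash.
try:
    summarize()
except ValueError as err:
    message = str(err)

print(low, high, mean, coarse, by_length, message)
```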

Interactive Exercises: Introduction to Data Visualization in Python

  • https://www.springboard.com/workshops/data-science-career-track/learn/#/curriculum/2598 This DataCamp resource provides a strong foundation in data visualization in Python, with broad coverage of the Matplotlib library and an overview of Seaborn, a package for statistical graphics. Topics covered include customizing graphics, plotting two-dimensional arrays (e.g. pseudocolor plots, contour plots, images), statistical graphics (e.g. visualizing distributions and regressions), and working with time series and image data.
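For a first taste of Matplotlib before starting the resource, here is a minimal sketch of a labeled line plot. It assumes Matplotlib is installed; the data, labels, and title are made up for illustration. The `Agg` backend and the in-memory buffer are choices made here so the script runs without a display and without writing a file.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical data: the first few square numbers.
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

# Create a figure with one set of axes, then draw and annotate the line.
fig, ax = plt.subplots()
(line,) = ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A minimal Matplotlib line plot")
ax.legend()

# Render the figure as PNG bytes in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

In a notebook you would typically skip the backend line and the buffer and simply call `plt.show()`.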

Chapter 4.2 - Write Better Python Code

1 Video: Transforming Code into Beautiful, Idiomatic Python
Students typically spend 1 - 1.5 Hours

As Guido van Rossum, creator of Python, has said, “Code is read more often than it is written.” As a data scientist, it’s critical that you learn to write code that is easy to read and modify. This video tutorial by Raymond Hettinger teaches you to take advantage of Python’s best features and improve existing code through a series of code transformations.
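To give a sense of what a code transformation looks like, here is a small sketch that replaces index-based loops with the built-ins `enumerate` and `zip`. The lists and variable names are illustrative assumptions and are not a transcript of the video.

```python
colors = ["red", "green", "blue"]
names = ["raymond", "rachel", "matthew"]

# Before: manual index bookkeeping, which is easy to get wrong.
indexed_before = []
for i in range(len(colors)):
    indexed_before.append((i, colors[i]))

# After: enumerate() supplies the index and the item together.
indexed_after = list(enumerate(colors))

# Before: looping over two lists by position.
pairs_before = {}
for i in range(min(len(names), len(colors))):
    pairs_before[names[i]] = colors[i]

# After: zip() walks both lists in step, and dict() builds the mapping.
pairs_after = dict(zip(names, colors))

print(indexed_after)
print(pairs_after)
```

Both versions produce the same results; the second states the intent directly and leaves less room for off-by-one mistakes.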

2 Video: PEP8 and Writing Readable Code
Students typically spend 10 - 15 Minutes

This tutorial provides an overview of PEP8, the communal coding standard that has been widely adopted by the Python developer community.
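As a quick illustration of a few PEP8 recommendations, the sketch below shows the same hypothetical function written twice. Both versions run, but the second follows PEP8's conventions for function names (lowercase with underscores), four-space indentation, whitespace around operators, and keeping compound statements off a single line.

```python
# Runs, but ignores PEP8: CamelCase function name, a one-letter argument
# that is easy to misread, no spaces around operators, body on the def line.
def CalcMean(L):return sum(L)/len(L)


# PEP8 style: descriptive snake_case names, a docstring, and readable spacing.
def calc_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


print(CalcMean([1, 2, 3]), calc_mean([1, 2, 3]))
```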

3 Video: The Importance of Perseverance in Programming
Students typically spend 20 - 40 Minutes

Programming is a skill. Most of the work programmers do involves grit: working through difficult challenges, trying and failing until something works. This video emphasizes the importance of perseverance in programming. If you find yourself struggling with programming, this video is for you! Don’t forget to reach out to your course TA if you need help with technical material.

4 Link: PEP8 Documentation OPTIONAL
Students typically spend 1 - 2 Hours

We highly recommend taking some time to read through the entire PEP8 coding guidelines.


Capstone Project Proposal

By this time, you should have a solid capstone project idea and dataset to turn into a project proposal. The next few resources will help you frame the problem in a meaningful and compelling way. This is often an iterative process, and entire semester-long classes are devoted to this exercise in graduate programs, so don’t give up! You may slightly shift the focus of your problem statement and proposal as you start your exploratory data analysis later in the course, which is normal.

1 Video: Using Decisions in Framing Analytics Problems
Students typically spend 30 - 40 Minutes

In data science applications, a key determinant of success is how well you frame an analytical problem, even before selecting any data sources or algorithms. This talk explores a framework in which the goal of analytics is to help organizations use data to make better decisions. The talk encourages you to ask three key questions for any analytics or data science project:

  • What decision is improved using analytics?
  • Who’s deciding?
  • What’s the value of an improved decision?

2 Article: Training Data Scientists: Problem Solving
Students typically spend 10 - 15 Minutes

This article from Data Science for Social Good (DSSG) uses a real-life case study to demonstrate problem framing for data scientists.
