Chapter 13: Software Engineering for Data Scientists

Overview

In this unit, you’ll learn best practices that help you become a better engineer and work more effectively with an engineering team. As a data scientist, no matter how many algorithms you design, how much data you crunch, or how many charts you create, you’ll ultimately be writing software. Some companies expect their data scientists to contribute directly to the code base; others have engineers who help translate prototype code into production code. No matter which kind of team you’re working with, it’s critical to learn how to be a good citizen of the code base, so that you can make life easier for yourself and the rest of your team. The better your code is, the easier it is to deploy, and the greater the likelihood that your projects will have an impact on the company!

What You’ll Learn: Learning Objectives

Practice writing clearer, well-commented code.
Learn best practices in discussing code across teams.
Practice testing and debugging code.

Words to Know: Key Terms & Concepts

Cauldron notebook: An up-and-coming alternative to Jupyter Notebook, with many software engineering best practices built in.
Debugging: The day-to-day work of finding and fixing issues in code, using techniques that range from inserting print() statements to more systematic methods.

Chapter 13.1 - Write Better Code

In this section, you’ll learn the software engineering best practices that apply to data scientists, including how to write better tests, how to use libraries and APIs effectively to reduce duplication of work, and how to give better feedback to your colleagues about their code.

Video: Stephanie Kim - How to be a 10x Data Scientist

Using basic principles from the world of software development, this talk covers ideas on how to become a more productive data scientist. This includes common principles such as:

Not reinventing the wheel by using APIs and libraries (instead of writing your own code)
Writing tests to future-proof your code and be your own QA
How to make your models available to team members regardless of what language they use
How to write your code for production, versioning, and automation

Video: Data Science is Software | SciPy 2016 Tutorial | Peter Bull & Isaac Slavitt

This tutorial provides vital developer lifehacks for the working Jupyter data scientist. You’ll learn the basics of writing good software, which will prepare you to be a valuable contributor to your company’s wider engineering organization. You’ll learn the following topics:

How to effectively structure your project, using the cookiecutter-datascience package
How to set up a virtual environment, allowing you to isolate the project you’re working on from your other projects
How to use a Linux tool called “make” to make automating parts of your project easier
How to better write your code, so it’s reproducible, meaning you can come back to a project six months later and easily figure out all the things you’ve done
How to modularize your code into packages so you don’t end up writing the same things repeatedly
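The reproducibility and modularity points above can be illustrated with a small sketch. It is not taken from the tutorial: the function name `train_test_split_indices` and its parameters are hypothetical, and it uses only the standard library. The idea is that a helper with an explicit seed lives in one importable place and gives the same answer every time it is run.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) for a reproducible random split.

    Hypothetical helper for illustration: in a real project it would live
    in a module that every notebook imports, instead of being copy-pasted.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # local generator, so no hidden global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


if __name__ == "__main__":
    train_idx, test_idx = train_test_split_indices(10)
    print("train:", train_idx)
    print("test:", test_idx)
```

Because the seed is an explicit argument, rerunning the analysis six months later produces the same split, and changing it is a visible, reviewable decision.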

Video: Erik Rose Constructive Code Review PyCon 2017

Too many code reviews are negative and sap the enthusiasm that drives open source. Instead, let’s explore how to give reviews that are truthful but encouraging, boosting the skill level of contributors and the quality of the project. We’ll look at “tact hacks” that nudge communication in a friendly direction, antipatterns to avoid, the pesky human emotions that can tempt us into reviewing poorly, and techniques for leveling up newcomers without losing all your coding time.

Video: James Powell - So you want to be a Python expert?

This is an engaging talk that includes highly intuitive and hands-on examples covering concepts that’ll turn you into the Python ninja that everyone wants, including: metaclasses, decorators, generators, and context managers.
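As a rough preview of three of the concepts named above (metaclasses are left out to keep it short), here is a minimal sketch. The examples are not from the talk; `count_calls`, `batches`, `timer`, and `batch_mean` are hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager
from functools import wraps


def count_calls(func):
    """Decorator: record how many times the wrapped function is called."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


def batches(items, size):
    """Generator: lazily yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


@contextmanager
def timer(results, label):
    """Context manager: store the elapsed seconds of a block in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always recorded.
        results[label] = time.perf_counter() - start


@count_calls
def batch_mean(batch):
    return sum(batch) / len(batch)


if __name__ == "__main__":
    timings = {}
    with timer(timings, "means"):
        means = [batch_mean(b) for b in batches([1, 2, 3, 4, 5, 6, 7], 3)]
    print(means)             # [2.0, 5.0, 7.0]
    print(batch_mean.calls)  # 3
```

Each construct wraps behavior around existing code without changing that code: the decorator around a function, the generator around iteration, and the context manager around a block.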

Video: Cauldron with Scott Ernst - Episode 111

The Cauldron notebook is an up-and-coming alternative to Jupyter notebook with many software engineering best practices built in.

Testing and Debugging for Data Scientists

Many programmers (and anyone else who writes code) treat testing and debugging as chores. Used strategically, however, both can save you hours of time and many headaches down the line.

Video: Ned Batchelder: Getting Started Testing - PyCon 2014

Although many people may view tests as an arduous task, they’re actually a solution to a problem that is important to you: Does my code work? This tutorial shows how Python tests are written and why.
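To make the question "Does my code work?" concrete, here is a minimal sketch of a test written with the standard library's unittest module. The function under test, `normalize`, is a hypothetical example and is not taken from the tutorial.

```python
import unittest


def normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalize a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]


class NormalizeTests(unittest.TestCase):
    def test_endpoints_and_midpoint(self):
        self.assertEqual(normalize([10, 20, 30]), [0.0, 0.5, 1.0])

    def test_order_is_preserved(self):
        self.assertEqual(normalize([30, 10]), [1.0, 0.0])

    def test_constant_input_raises(self):
        # The edge case is documented by a test instead of discovered in production.
        with self.assertRaises(ValueError):
            normalize([5, 5, 5])


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTests)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

Each test states one expectation about the function, so a failure points directly at the behavior that broke.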

Video: Best Debugging Practices for Python

Students typically spend 1.5 - 2 Hours

https://www.youtube.com/watch?v=04paHt9xG9U

Debugging is a daily activity of any programmer. Frequently, it is assumed that programmers can debug. However, programmers often have to deal with existing code that simply does not work. This tutorial attempts to change that by introducing concepts for debugging and corresponding programming techniques. It will teach you debugging methods far beyond the usual insertion of print() statements that you’ve been used to so far.
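As one illustration of what going beyond print() can look like, the sketch below uses the standard library's logging module, an assertion on input data, and a commented-out breakpoint() call. It is not taken from the tutorial; the logger name and both functions are hypothetical.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    logger.debug("safe_ratio(%r, %r)", numerator, denominator)
    if denominator == 0:
        logger.warning("zero denominator; returning None")
        return None
    return numerator / denominator


def conversion_rates(rows):
    """Compute clicks / views for each (clicks, views) pair."""
    rates = []
    for i, (clicks, views) in enumerate(rows):
        # Fail loudly, with context, as soon as the data looks wrong.
        assert clicks >= 0 and views >= 0, f"negative count in row {i}"
        # To inspect local variables interactively, uncomment the next line:
        # breakpoint()
        rates.append(safe_ratio(clicks, views))
    return rates


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(levelname)s %(name)s: %(message)s",
    )
    print(conversion_rates([(5, 100), (0, 0), (3, 12)]))  # [0.05, None, 0.25]
```

Unlike print() calls, the log messages can stay in the code: raising the level to logging.WARNING silences the debug output without deleting it.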

Video: Testing for Data Scientists OPTIONAL

Students typically spend 3 - 4 Hours

https://www.youtube.com/watch?v=yACtdj1_IxE&feature=youtu.be

In this tutorial, you’ll gain practical, hands-on experience writing tests in a data science setting so that you can continually ensure the integrity of your code and data. You’ll learn how to use py.test, coverage.py, and hypothesis to write better tests for your code. You’ll learn both general testing best practices and those that apply more directly to data science work.
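The sketch below shows the flavor of testing data as well as code. It is not from the tutorial and uses only the standard library: the test functions are written as plain functions with bare assert statements, a style that py.test can collect, and the last test is a hand-rolled randomized check standing in for what hypothesis automates. The row schema (`id`, `price`) and `validate_rows` are hypothetical.

```python
import math
import random


def validate_rows(rows):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        price = row["price"]
        if price is None or math.isnan(price) or price < 0:
            problems.append(f"row {i}: invalid price {price!r}")
    return problems


def test_clean_data_passes():
    rows = [{"id": 1, "price": 9.99}, {"id": 2, "price": 0.0}]
    assert validate_rows(rows) == []


def test_bad_rows_are_reported():
    # Second row has both a duplicate id and a NaN price: two problems.
    rows = [{"id": 1, "price": 5.0}, {"id": 1, "price": float("nan")}]
    assert len(validate_rows(rows)) == 2


def test_random_valid_rows_always_pass():
    # Hand-rolled stand-in for a property-based test: generate many valid
    # inputs and check that the validator never flags them.
    rng = random.Random(0)
    for _ in range(100):
        n_rows = rng.randint(0, 20)
        rows = [{"id": i, "price": rng.uniform(0, 1000)} for i in range(n_rows)]
        assert validate_rows(rows) == []


if __name__ == "__main__":
    test_clean_data_passes()
    test_bad_rows_are_reported()
    test_random_valid_rows_always_pass()
    print("all checks passed")
```

Running the same validator against each new data delivery is one way to catch schema drift before it silently skews a model.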

Working with Production Systems

Building machine learning models in a Jupyter Notebook is one thing, but actually deploying those models in the real world requires more work and skill. Knowing how to do that well will help you stand out on your team and succeed in your data science job.

1 Video: Deploying Python Models to Production
Students typically spend 30 - 50 Minutes

https://www.youtube.com/watch?v=f3I0izerPvc

In this talk, you’ll learn how to deploy pandas/scikit-learn machine learning models to production using Flask, Docker, and Kubernetes. You’ll also come to understand the Continuous Integration (CI) process, which automates away the manual steps.
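One idea that often helps when putting a model behind a web service is to keep the prediction logic separate from the web framework, so it can be tested without starting a server. The sketch below is an assumption-laden illustration, not the talk's code: it uses only the standard library, the "model" is a hypothetical hard-coded linear formula standing in for a trained scikit-learn estimator, and `handle_predict` is a hypothetical function that a Flask route could call with the request body.

```python
import json

# Hypothetical stand-in for a trained model: a fixed linear formula.
WEIGHTS = {"sqft": 150.0, "bedrooms": 10000.0}
INTERCEPT = 50000.0


def predict(features):
    """Return the stand-in model's prediction for one observation."""
    return INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def handle_predict(request_body):
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        payload = json.loads(request_body)
        features = {name: float(payload[name]) for name in WEIGHTS}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing feature, or a non-numeric value.
        return 400, json.dumps({"error": f"bad request: {exc!r}"})
    return 200, json.dumps({"prediction": predict(features)})


if __name__ == "__main__":
    print(handle_predict('{"sqft": 1000, "bedrooms": 2}'))
    print(handle_predict('{"sqft": 1000}')[0])  # 400
```

Because the handler takes a string and returns plain values, the same function can be exercised by unit tests in a CI pipeline and then wrapped by whichever web framework the team deploys.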

2 Interactive Exercises: Production Data Science using Git
Students typically spend 1 - 2 Hours

https://github.com/Satalia/production-data-science

This guide bridges the gap that data scientists may have in software development practices. You’ll look at a data science workflow in Python that adapts ideas from software development to ease collaboration and keep the project in a state that is easy to productionize.

3 Video: From Model to Production Like a Pro
Students typically spend 30 - 50 Minutes

https://www.youtube.com/watch?v=MKrPXfvIWoc

After an initial data-science proof-of-concept is completed, it often needs to be professionalized and deployed to production. Underestimating this step often leads to higher maintenance costs and a slower time-to-market for new features. In this talk, you’ll learn a set of practical software-engineering best practices for industrializing a machine-learning model to help minimize such problems.
