Chapter 13: Software Engineering for Data Scientists

Mikiko Bazeley edited this page Dec 19, 2019 · 1 revision

Overview

In this unit, you’ll learn best practices that help you become a better engineer and work more effectively with an engineering team. As a data scientist, no matter how many algorithms you design, how much data you crunch, or how many charts you create, you’ll ultimately be writing software. Some companies expect their data scientists to contribute directly to the code base; others have engineers who help translate prototype code to production. No matter which kind of team you’re working with, it’s critical to learn how to be a good citizen of the code base, so that you can make life easier for yourself and the rest of your team. The better your code is, the easier it is to deploy, and the greater the likelihood that your projects will have an impact on the company!

What You’ll Learn: Learning Objectives

  • Practice writing clearer, well-commented code.
  • Learn best practices in discussing code across teams.
  • Practice testing and debugging code.

Words to Know: Key Terms & Concepts

  • Cauldron notebook: An up-and-coming alternative to Jupyter Notebook, with many software engineering best practices built in.
  • Debugging: The everyday work of finding and fixing issues in code, using techniques that range from inserting print() statements to more systematic methods.

Chapter 13.1 - Write Better Code

  • In this section, you’ll learn the best practices of software engineering that apply to data scientists, including how to write better tests, use libraries and APIs effectively to reduce duplication of work, and how to give better feedback to your colleagues about their code.
  • Using basic principles from the world of software development, this talk covers ideas on how to become a more productive data scientist. This includes common principles such as:
  1. Not reinventing the wheel by using APIs and libraries (instead of writing your own code)
  2. Writing tests to future-proof your code and be your own QA
  3. How to make your models available to team members regardless of what language they use
  4. How to write your code for production, versioning, and automation
  • This tutorial provides vital developer lifehacks for the working Jupyter data scientist. You’ll learn the basics of writing good software, which will prepare you to be a valuable contributor to your company’s wider engineering organization. You’ll learn the following topics:
  1. How to effectively structure your project, using the cookiecutter-datascience package
  2. How to set up a virtual environment, allowing you to isolate the project you’re working on from your other projects
  3. How to use a Linux tool called “make” to make automating parts of your project easier
  4. How to better write your code, so it’s reproducible, meaning you can come back to a project six months later and easily figure out all the things you’ve done
  5. How to modularize your code into packages so you don’t end up writing the same things repeatedly
  • Too many code reviews are negative and sap the enthusiasm that drives open source. Instead, let’s explore how to give reviews that are truthful but encouraging, boosting the skill level of contributors and the quality of the project. We’ll look at “tact hacks” that nudge communication in a friendly direction, antipatterns to avoid, the pesky human emotions that can tempt us into reviewing poorly, and techniques for leveling up newcomers without losing all your coding time.
  • This is an engaging talk that includes highly intuitive and hands-on examples covering concepts that’ll turn you into the Python ninja that everyone wants, including: metaclasses, decorators, generators, and context managers.
  • The Cauldron notebook is an up-and-coming alternative to Jupyter notebook with many software engineering best practices built in.
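Three of the Python features named in the list above (decorators, generators, and context managers) can be sketched in a few lines. This is only an illustrative example, not material from any of the talks; the names `timed`, `batches`, `step`, and `total_by_batch` are hypothetical.

```python
import time
from contextlib import contextmanager
from functools import wraps


def timed(func):
    """Decorator: store the duration of the most recent call on the wrapper."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None
    return wrapper


def batches(items, size):
    """Generator: yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


@contextmanager
def step(name, log):
    """Context manager: record when a named step starts and ends."""
    log.append(f"start {name}")
    try:
        yield
    finally:
        log.append(f"end {name}")


@timed
def total_by_batch(values, size):
    return [sum(batch) for batch in batches(values, size)]


log = []
with step("aggregate", log):
    totals = total_by_batch(range(1, 8), size=3)  # totals == [6, 15, 7]
```

Each helper is written once and reused, which is the same idea as modularizing code so you don't end up writing the same things repeatedly.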

Testing and Debugging for Data Scientists

Many programmers (and anyone else who writes code) treat testing and debugging as chores. Used strategically, however, both can save you hours of time and many headaches down the line.

  • Although many people may view tests as an arduous task, they’re actually a solution to a problem that is important to you: Does my code work? This tutorial shows how Python tests are written and why.
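To make the question "Does my code work?" concrete, here is a minimal sketch of what Python tests can look like. The function `normalize` is a hypothetical example, not taken from the tutorial; the tests are plain functions with `assert` statements, a style that a test runner such as pytest can collect automatically.

```python
def normalize(values):
    """Scale a list of numbers linearly so they span the range 0 to 1."""
    lo, hi = min(values), max(values)  # min() raises ValueError on empty input
    if lo == hi:
        # All values are equal, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_spans_zero_to_one():
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalize_constant_input():
    assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]


def test_normalize_rejects_empty_input():
    try:
        normalize([])
    except ValueError:
        pass  # expected: there is nothing to normalize
    else:
        raise AssertionError("expected ValueError for empty input")
```

Each test states one expectation, so a failure points directly at the behavior that broke.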

Video: Best Debugging Practices for Python

Students typically spend 1.5 - 2 Hours

Debugging is a daily activity of any programmer. Frequently, it is assumed that programmers can debug. However, programmers often have to deal with existing code that simply does not work. This tutorial attempts to change that by introducing concepts for debugging and corresponding programming techniques. It will teach you debugging methods far beyond the usual insertion of print() statements that you’ve been used to so far.
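One common step beyond print() statements is the standard library's `logging` module, which lets you leave diagnostic messages in the code and switch them on only when needed. The sketch below is an assumption about what such a technique might look like, not the tutorial's own material; the function `mean_of_valid` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("cleaning")  # hypothetical logger name


def mean_of_valid(readings):
    """Average the readings that are not None, logging what was skipped."""
    valid = [r for r in readings if r is not None]
    # Debug messages stay silent unless the log level is lowered to DEBUG.
    logger.debug("kept %d of %d readings", len(valid), len(readings))
    if not valid:
        logger.warning("no valid readings; returning None")
        return None
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Turn on verbose output only while investigating a problem.
    logging.basicConfig(level=logging.DEBUG)
    print(mean_of_valid([1.0, None, 3.0]))
```

For interactive inspection, Python's built-in `breakpoint()` can be placed at a suspicious line to pause execution and drop into the `pdb` debugger.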

Video: Testing for Data Scientists (Optional)

Students typically spend 3 - 4 Hours

In this tutorial, you'll gain practical hands-on experience writing tests in a data science setting so that you can continually ensure the integrity of your code and data. You'll learn how to use py.test, coverage.py, and hypothesis to write better tests for your code. You’ll learn both general testing best practices and those that relate more directly to data science work.
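The idea behind property-based testing, which hypothesis automates, is to check that a property holds for many generated inputs rather than a few hand-picked ones. The sketch below deliberately uses only the standard library's `random` module as a stand-in, so it is not the hypothesis API; `clip_outliers` and `check_clip_properties` are hypothetical names.

```python
import random


def clip_outliers(values, lower, upper):
    """Clamp every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def check_clip_properties(trials=200, seed=0):
    """Try many random inputs and assert properties that must always hold."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        values = [rng.uniform(-1000, 1000) for _ in range(rng.randint(0, 20))]
        clipped = clip_outliers(values, lower=-10, upper=10)
        assert len(clipped) == len(values)                 # no rows lost
        assert all(-10 <= v <= 10 for v in clipped)        # bounds respected
        assert clip_outliers(clipped, -10, 10) == clipped  # idempotent
    return trials
```

A dedicated library adds features this stand-in lacks, such as shrinking a failing input to a minimal example.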


Working with Production Systems

Building machine learning models in a Jupyter Notebook is one thing, but actually deploying those models in the real world requires more work and skill. Knowing how to do that well will help you stand out among your team members and succeed at your data science job.

1 Video: Deploying Python Models to Production
Students typically spend 30 - 50 Minutes

In this talk, you’ll learn how to deploy pandas/scikit-learn machine learning models to production using Flask, Docker, and Kubernetes. You’ll also come to understand the Continuous Integration (CI) process, which automates away the manual steps.
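A useful first step toward serving a model over HTTP is to keep the prediction logic in a plain function that takes a request body and returns a response, so that a web framework only has to call it. The sketch below is framework-agnostic and uses only the standard library; it is not the talk's implementation. `ThresholdModel` is a hypothetical stand-in for a trained estimator, and the `instances` field name is an assumption.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained estimator with a predict method."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        # One label per row: 1 if the row's feature sum exceeds the threshold.
        return [int(sum(row) > self.threshold) for row in rows]


def handle_predict(model, request_body):
    """Turn a JSON request body into a (status_code, JSON response) pair."""
    try:
        payload = json.loads(request_body)
        rows = payload["instances"]  # assumed request field name
    except (json.JSONDecodeError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected JSON with an 'instances' list"})
    return 200, json.dumps({"predictions": model.predict(rows)})


model = ThresholdModel(threshold=1.0)
status, body = handle_predict(model, '{"instances": [[0.2, 0.3], [0.9, 0.8]]}')
```

Because `handle_predict` has no dependency on a web server, it can be unit tested directly and then wrapped in a route handler in whichever framework the team uses.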

2 Interactive Exercises: Production Data Science using Git
Students typically spend 1 - 2 Hours

This guide bridges the gap that data scientists may have in software development practices. You’ll look at a data science workflow in Python that adapts ideas from software development to ease collaboration and keep the project in a state that is easy to productionize.

3 Video: From Model to Production Like a Pro
Students typically spend 30 - 50 Minutes

After an initial data-science proof-of-concept is completed, it often needs to be professionalized and deployed to production. Underestimating this step often leads to higher maintenance costs and slower time-to-market for new features. In this talk, you’ll learn a set of practical software-engineering best practices for industrializing a machine-learning model to help minimize such problems.
