Tutorial proposal: Massively cutting server costs for model training with spot instances on Azure (#1718)

* Add proposal for course automation task

* Rework proposal

* Slight modification

* Update README.md

* Add essay proposal

* Add Khalid to proposal

* Delete README.md

Co-authored-by: Khashayar Etemadi <khaes@kth.se>
Co-authored-by: César Soto Valero <cesarsotovalero@gmail.com>
3 people authored Apr 6, 2022
1 parent 4a2d3c0 commit 32e7f2a
Showing 1 changed file with 23 additions and 0 deletions.
23 changes: 23 additions & 0 deletions contributions/executable-tutorial/marcelj/README.md
@@ -0,0 +1,23 @@
# Assignment Proposal

## Title

Massively cutting server costs for model training with spot instances on Azure.

## Names and KTH ID
- Marcel Juschak (marcelj@kth.se)
- Khalid El Yaacoub (khalidey@kth.se)

## Deadline

Task 3

## Category

Executable Tutorial

## Description

Training a machine learning (ML) model is a core part of MLOps, for example in continuous deployment pipelines. However, training can occupy high-end hardware for many days, which leads to high costs. Spot instances on cloud platforms such as Azure are servers that can usually be rented for 10% to 25% of the regular price. The downside is that access to the instance may be withdrawn at any time with only a 30-second notice. Model training can still take advantage of spot instances by checkpointing the training state, so that training resumes from the latest checkpoint once the server has been terminated and restarted.
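
The checkpoint-and-resume idea can be illustrated with a minimal sketch. It is not the tutorial's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON file format, the toy one-parameter "model", and the `stop_after` argument that simulates an eviction are all assumptions made for illustration. A real training job would store model and optimizer state with its framework's own serialization.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the target.

    os.replace is atomic, so an eviction in the middle of a write leaves
    the previous checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10, stop_after=None):
    """Toy training loop that resumes from the latest checkpoint.

    stop_after (hypothetical) simulates an eviction: the loop returns
    without saving, so progress since the last checkpoint is lost.
    """
    state = load_checkpoint(path)
    steps_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_run >= stop_after:
            return state  # simulated eviction, no final save
        # Toy update: one gradient step on the loss (weight - 3)^2.
        state["weight"] -= 0.1 * 2 * (state["weight"] - 3.0)
        state["step"] += 1
        steps_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `train(path, 100, stop_after=25)` followed by `train(path, 100)` loses only the five steps made after the last checkpoint at step 20. The checkpoint interval therefore trades the cost of writing checkpoints against the amount of work that is repeated after an eviction.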

In this tutorial, we will show how to create a Docker container that trains an ML model with checkpointing and resumes training after the Azure spot instance is terminated at a random time and automatically restarted.
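
The proposal does not say how the training process learns about an upcoming termination. One possible approach, sketched below, is to poll the Azure Scheduled Events metadata endpoint from inside the VM and look for an event of type `Preempt`. The endpoint URL, API version, and event type reflect our understanding of the Azure documentation and should be verified against it; the function names `fetch_scheduled_events` and `eviction_pending` are hypothetical.

```python
import json
import urllib.request

# Assumption: Scheduled Events endpoint of the Azure Instance Metadata
# Service, reachable only from inside an Azure VM. Verify the URL and
# API version against the current Azure documentation.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(timeout=2.0):
    """Query the metadata service and return the decoded JSON payload."""
    request = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def eviction_pending(payload):
    """Return True if the payload lists a spot eviction ("Preempt") event."""
    events = payload.get("Events", [])
    return any(event.get("EventType") == "Preempt" for event in events)
```

A training loop could call `eviction_pending(fetch_scheduled_events())` every few seconds and write a final checkpoint as soon as it returns `True`. Periodic checkpoints remain necessary, because a 30-second notice may be too short to save a large model.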
