(A manuscript under development)
Authors: Simon Goring, Yingzi Jin, John Shiu, Tony Shum, Orix Au Yeung, Rohan Alexander, Tiffany A. Timbers
Date: 2024
Checklists have been shown to decrease errors in safety critical systems (Gawande 2010), and the use of a reproducibility checklist at the Machine Learning NeurIPS 2019 conference led to an increase in the percentage of authors submitting the code for their work (from 50% to 75%; Pineau et al. 2021). Thus, in an effort to make applied machine learning software more trustworthy by increasing its robustness, we aimed to create a general and robust checklist for software tests for applied machine learning code. This checklist includes tests for data presence, quality and ingestion at the beginning of the analysis, the model fitting and evaluation, as well as tests for the artifacts (presence and quality) which are created by the analysis. The test checklist items were derived from software tests for applied machine learning code which either industry, or the scholarly literature have deemed as important for correct and robust applied machine learning software, as well as from examining things that commonly go wrong in machine learning code (threat model). Such a checklist may be used by data scientists and machine learning engineers to guide the manual writing of tests. It may also act as a source for engineering large-language model (LLM) prompts that act as reliable starting points for evaluating the quality of existing applied machine learning code, as well as for engineering reproducible test data and software tests themselves for each item on the checklist.
Please see proposal/tests-for-machine-learning.pdf
for details on data and sources.
- https://docs.google.com/presentation/d/1jHAjrQcv1aPqy4PbYn8q8nq2T1rlk95e2AlsUulANMM/edit?usp=sharing
Software licensed under the MIT License, non-software content licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See the license file for more information.