This module provides a comprehensive introduction to the Unix shell, covering file and directory navigation, command usage, script creation, and basic constructs such as pipes, filters, and loops. Participants will get started with version control and GitHub, exploring the ethical implications for reproducibility. Topics include Git setup, repository management (recording, viewing, and undoing changes), branch creation, and collaborative workflows. Advanced commands, debugging, and history editing will be introduced. Participants will learn effective problem-solving techniques using Google and Stack Overflow, emphasizing reproducibility and documentation. The module emphasizes ethics and equity considerations in projects, fostering discussion-based learning with pre-class readings and live coding exercises.
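By way of illustration, the pipe-and-filter pattern at the heart of this module chains small commands into a larger tool. The module itself works directly at the shell prompt; as a minimal sketch for readers of this outline, the same pipeline can be driven from Python's standard library. The input file species.txt is a hypothetical assumption.

    # Driving the shell pipeline  sort species.txt | uniq -c | sort -rn | head -n 3
    # from Python. The input file is hypothetical; in the module this is
    # typed directly at the shell prompt.
    import subprocess

    pipeline = "sort species.txt | uniq -c | sort -rn | head -n 3"
    result = subprocess.run(pipeline, shell=True, capture_output=True, text=True)
    print(result.stdout)  # the three most frequent lines, with their counts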
Learning Outcomes:
- Develop the ability to comfortably access the terminal and proficiently write scripts using basic commands, variables, pipes, filters, and loops.
- Understand how to utilize version control systems effectively for preserving personal work, accessing and editing previous code versions, collaborating with peers, and identifying and debugging errors in code.
- Develop the skills to independently troubleshoot issues by identifying problems, conducting research, and formulating questions that include the components needed for others to reproduce the issue.
- Identify ethical considerations within the field, including scrutinizing the composition of datasets for biases and considering the historical context of power abuses.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
Learners will be assigned to one of the two following streams based on their familiarity with programming languages. Knowledge of Python is needed for subsequent Certificate modules and is valued in industry.
The first half of the module will focus on the essentials of coding in Python and the ethical considerations of using algorithms. Participants will learn how to design functions, repeat code using loops, store data in lists, conduct code testing and debugging, and manipulate data using various data analysis and visualization tools such as numpy, pandas, matplotlib, seaborn, and plotly. Participants will take part in a facilitated discussion about the Tuskegee experiment, its long-term effects, and the trustworthiness of AI applications in disparate social systems.
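As a flavor of the function design approach mentioned above, here is a minimal sketch of a function written in that style: a type-annotated header, a docstring with a description and worked examples, then the body. The converter itself is a hypothetical example, not part of the module materials.

    def fahrenheit_to_celsius(temp_f: float) -> float:
        """Return the Celsius equivalent of temp_f degrees Fahrenheit.

        >>> fahrenheit_to_celsius(32.0)
        0.0
        >>> fahrenheit_to_celsius(212.0)
        100.0
        """
        # Following the recipe: examples first (above), then the body.
        return (temp_f - 32.0) * 5.0 / 9.0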
Learning Outcomes:
- Understand various Python data types and their role in coding, including differentiating and evaluating expressions using numeric types (integers and floating-point numbers), Booleans, strings, and lists.
- Implement the Function Design Recipe to create functions in Python and reduce duplication.
- Utilize numpy and pandas to analyze a dataset and use these libraries to manipulate numerical and tabular data in Python.
- Interact with databases using Python, and present results using visualization libraries such as matplotlib, seaborn, and plotly.
- Learn debugging and testing techniques to troubleshoot errors, select test cases, and ensure code correctness, reliability, and robustness.
- Understand the ethical issues with software and gain awareness of case studies involving software failures.
- Prepare to confidently answer technical job interview questions.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
The first part of this module teaches R with a focus on manipulating and visualizing data. Participants will get set up with a functional RStudio workflow, use different file types, transform data tables, import and manipulate data, use functions and loops, create data visualizations, make a Shiny app, and learn how to troubleshoot problems in their code. Both base R and tidyverse methods are taught. To work reproducibly, participants will create R Projects. The second part of the module will cover the ethics of consent, Equity, Diversity & Inclusion (EDI) training, and professional skills including presentation, project management, and data security. Finally, the module will conclude with an industry case study.
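The module itself is taught in R; purely as a point of comparison for readers of this outline, the filter-group-summarize pattern taught via the tidyverse has a direct analogue in Python's pandas. The data and column names below are hypothetical.

    import pandas as pd

    # Hypothetical data standing in for a dataset imported in class.
    surveys = pd.DataFrame({
        "site": ["A", "A", "B", "B"],
        "weight": [12.0, 15.0, 9.0, 11.0],
    })

    # dplyr-style filter -> group_by -> summarize, expressed in pandas.
    summary = (
        surveys[surveys["weight"] > 10]
        .groupby("site")["weight"]
        .mean()
        .reset_index(name="mean_weight")
    )
    print(summary)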
Learning Outcomes:
- Gain proficiency in utilizing R, including RStudio and coding best practices, and understand R's role in data science.
- Apply manipulation and wrangling techniques to describe and define the characteristics of datasets.
- Develop the ability to identify and describe data structures, reshape datasets through manipulation techniques, detect missing values, clean data, summarize and export data, and report findings.
- Produce a wide range of informative and visually appealing charts, graphs, and plots to effectively communicate insights and patterns hidden within the data.
- Develop the skills and strategies necessary to diagnose and fix errors in R code, interpret error messages, and employ effective debugging strategies.
- Demonstrate an understanding of data consent principles, addressing ethical and legal considerations for data collection and usage.
- Acquire the skills and knowledge required to create compelling presentations and effectively manage projects using R, including visualizations and code documentation.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
Much research these days is done using software. Researchers need to develop comfort with building, maintaining and improving high-quality software. The first half of this module focuses on equipping students with the skills to build robust software that can be used to answer research questions. It focuses on how to effectively write short programs, as part of a small team, in a reproducible way. Research software that is built correctly can be used by other teams, not just the researcher who originally wrote it.
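As a taste of the testing practice this module describes, below is a minimal sketch of an automated test in the pytest style; the function under test (a simple rescaler) is a hypothetical example.

    # test_rescale.py -- run with: pytest test_rescale.py
    def rescale(values):
        """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]

    def test_rescale_endpoints():
        # A correct rescaling maps the extremes to exactly 0 and 1.
        assert rescale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]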
SQL is used across the machine learning pipeline, and is a fundamental skill for data scientists to master. The second part of this module will focus on the technical skills needed for working with SQL, including flat-file datasets (JSON, CSV) ingestion, query design, and relational database management. Additionally, it will examine common data management concerns, data access management, and data privacy adherence. Learners will be introduced to principles around reproducibility, sharing data, and data ethics (for example, respecting those whose data we use). This module will also cover professional skills such as communication (with a variety of stakeholders) and documentation.
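To give a sense of the SQL skills involved, here is a minimal sketch using Python's built-in sqlite3 module to ingest a flat file and run an aggregate query; the table, columns, and orders.csv file are hypothetical.

    import csv
    import sqlite3

    conn = sqlite3.connect(":memory:")  # an in-memory relational database
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

    # Ingest a hypothetical CSV file with columns: customer, amount
    with open("orders.csv", newline="") as f:
        rows = [(r["customer"], float(r["amount"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

    # A typical aggregate query: total spend per customer, largest first.
    query = ("SELECT customer, SUM(amount) FROM orders "
             "GROUP BY customer ORDER BY SUM(amount) DESC")
    for customer, total in conn.execute(query):
        print(customer, total)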
Learning Outcomes:
- Learn how to work as a team using Git and GitHub, specifically branching, merging, resolving conflicts, and creating pull requests.
- Know how to create bug reports and prioritize requests.
- Develop comfort using makefiles and configuring programs.
- Proficiently test software, handle errors, and track provenance.
- Know how to create Python packages.
- Become comfortable calling APIs.
- Develop comfort with Docker.
- Develop a better understanding of the structure of databases.
- Save and transport data in CSV and JSON file formats.
- Gain familiarity with the essentials of querying and manipulating data in SQL, and learn how to research answers to future questions.
- Gain familiarity with the legal framework around sharing data.
- Analyze data requirements and work with different stakeholders such as analysts and managers.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
This module provides the skills required to design, implement, test, and validate a variety of supervised learning models. The basics of statistical learning including modeling with the goal of prediction versus inference, prediction accuracy and model interpretability trade-off, and the all-important bias-variance trade-off will be covered. Each section of this module will address a unique set of methods used for supervised learning on real data sets.
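To make the bias-variance trade-off concrete, here is a hedged sketch in Python with scikit-learn: the same linear model fit on polynomial features of increasing degree, where both the underfit and overfit extremes tend to validate worse than a moderate choice. The synthetic data and degree settings are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

    for degree in (1, 4, 15):  # underfit, moderate, overfit
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_tr, y_tr)
        print(degree, mean_squared_error(y_va, model.predict(X_va)))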
Learning Outcomes:
- Understand, implement and interpret the results from several supervised learning approaches for regression and classification.
- Utilize resampling methods to extract more information from a data set and to choose the best model.
- Perform exploratory data analysis for unsupervised learning.
- Understand what is required for reproducible learning.
- Appreciate the uncertainties associated with model results and the ethical consequences of acting on these results.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
Building a model differs significantly from creating a model that is usable by others. This module focuses on everything that happens after the model has been put together, specifically addressing machine learning system requirements such as: reliability, scalability, maintainability, and adaptability; feature engineering; model development and deployment; monitoring; and infrastructure and tooling.
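As one concrete example of the monitoring topic above, here is a minimal sketch of input-drift detection using a two-sample Kolmogorov-Smirnov test from scipy; the feature values and alert threshold are hypothetical illustrations, not production settings.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(1)
    training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
    live_feature = rng.normal(loc=0.4, scale=1.0, size=5000)  # drifted inputs

    stat, p_value = ks_2samp(training_feature, live_feature)
    if p_value < 0.01:  # an illustrative alerting threshold
        print(f"ALERT: possible feature drift (KS={stat:.3f}, p={p_value:.2g})")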
Learning Outcomes:
- Design machine learning systems that are reliable, scalable, maintainable, and adaptable.
- Apply feature engineering techniques to optimize machine learning models.
- Deploy machine learning models in production using various strategies.
- Implement monitoring and alerting systems to troubleshoot and diagnose issues in production.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
Participants pursuing the Data Science Certificate must complete the following four electives:
This module is designed to introduce the fundamentals of sampling, probability, and survey methodology. It covers various topics, including simple probability samples, stratified sampling, cluster sampling, addressing non-response, estimation, and survey quality. Participants will consider the theoretical foundations of different sampling approaches, as well as practical applications of this knowledge in contexts such as market research, political polling, and the Canadian census. Analysis using the R programming language will also be highlighted, drawing on skills developed in Introduction to R.
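Although the module's analysis is done in R, the idea of a proportionate stratified sample can be sketched briefly in Python with pandas for readers of this outline; the strata and population below are hypothetical.

    import pandas as pd

    population = pd.DataFrame({
        "region": ["East"] * 600 + ["West"] * 400,
        "income": range(1000),
    })

    # Draw 10% within each stratum so the sample mirrors the population mix.
    sample = population.groupby("region", group_keys=False).sample(
        frac=0.10, random_state=42
    )
    print(sample["region"].value_counts())  # 60 East, 40 West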
Learning Outcomes:
- Gain proficiency in executing simple probability samples.
- Understand complex sampling procedures and the tradeoffs involved.
- Acquire the skills to identify and address sources of error or inaccuracies in data stemming from sampling strategies.
- Develop an intuition around survey quality.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
Regardless of the quality of your analyses and data-related findings, if you cannot effectively communicate them, their impact will be severely limited. Technical skills in this module will focus on a step-by-step walkthrough of choosing, creating and modifying data visualizations in R using ggplot. Discussions will include general design principles applicable to other data visualization software used in industry and academia (e.g., Python, Tableau, PowerBI).
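Because the module notes that its design principles carry over to Python, here is a minimal matplotlib sketch of two accessibility-minded choices: a colorblind-safe palette (Okabe-Ito colors) and direct labels instead of a detached legend. The data are hypothetical.

    import matplotlib.pyplot as plt

    years = [2019, 2020, 2021, 2022]
    series = {"North": [3, 5, 6, 8], "South": [4, 4, 5, 5]}
    palette = {"North": "#0072B2", "South": "#D55E00"}  # Okabe-Ito colors

    fig, ax = plt.subplots()
    for name, values in series.items():
        ax.plot(years, values, color=palette[name], linewidth=2)
        # Direct labels are easier to scan than a legend keyed by color alone.
        ax.annotate(name, xy=(years[-1], values[-1]),
                    xytext=(5, 0), textcoords="offset points",
                    color=palette[name], va="center")
    ax.set_xlabel("Year")
    ax.set_ylabel("Sites monitored")
    plt.show()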
Incorporating case studies and real-world examples, the ethical components of this module will include:
- Ensuring reproducibility with data visualization.
- Building awareness of the decision-making that goes into sharing data visually.
- Addressing inequity in data visualization by focusing on accessible design.
Learning Outcomes:
- Acquire the skills to create and customize data visualizations from start to finish using R.
- Gain insights into the general design principles for creating accessible/equitable data visualizations in R and other software.
- Develop an understanding of data visualization as purposeful/telling a story (and the ethical/professional implications).
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
There is often a need to work out which algorithm or data structure should be used in a given practical situation. This module focuses on developing comfort with fundamental algorithms and data structures, covering Big-O notation, recursive functions, and the selection of appropriate data structures.
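As a quick illustration of why Big-O matters, compare the same recursive definition with and without memoization; the Fibonacci example is a standard textbook illustration, not module material.

    from functools import lru_cache

    def fib_slow(n: int) -> int:
        """Naive recursion: exponentially many calls; impractical past n ~ 35."""
        if n < 2:
            return n
        return fib_slow(n - 1) + fib_slow(n - 2)

    @lru_cache(maxsize=None)
    def fib_fast(n: int) -> int:
        """Memoized recursion: each n is computed once, so O(n) calls."""
        if n < 2:
            return n
        return fib_fast(n - 1) + fib_fast(n - 2)

    print(fib_fast(100))  # instant; fib_slow(100) would effectively never finish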
Learning Outcomes:
- Assess options and choices around fundamental algorithms and data structures using Big-O notation.
- Develop comfort with recursive functions.
- Identify appropriate data structures.
- Transform a client-led problem into an optimization challenge and identify opportunities for improvement.
- Identify causes for slow-running code and implement strategies to optimize performance.
Module Delivery:
Two weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 20 hours)
MACHINE LEARNING SOFTWARE FOUNDATIONS CERTIFICATE MODULES (following completion of foundational modules)
Participants pursuing the Machine Learning Software Foundations Certificate must complete the following two modules:
This module builds upon the statistical foundation provided in the Estimation, Testing & Machine Learning module, adding theory around linear methods and classification. The module will also focus on model assessment, inference, and boosting, and sets the foundation for deep learning with neural networks and related approaches. This module requires the Unix shell, Git and GitHub, R, and Python.
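To hint at the difference between the Ridge and Lasso regularizers covered in this module, here is a hedged scikit-learn sketch on synthetic data (the alpha values are illustrative assumptions): both penalties shrink coefficients, but only the L1 penalty drives irrelevant ones exactly to zero.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    # Only the first two features matter; the other eight are pure noise.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    print("ridge:", np.round(ridge.coef_, 2))  # all shrunk, none exactly zero
    print("lasso:", np.round(lasso.coef_, 2))  # noise coefficients driven to 0.0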
Learning Outcomes:
- Apply advanced linear methods such as Lasso and Ridge regression for feature selection and regularization, and understand their theoretical underpinnings.
- Evaluate machine learning models with techniques such as hypothesis testing and confidence intervals, and interpret the results in the context of the problem domain.
- Apply boosting algorithms such as XGBoost and LightGBM to improve the performance of machine learning models on large and complex datasets.
- Implement neural network architectures such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) in Python and/or R, and understand how to tune their hyperparameters for optimal performance.
- Discuss ethical considerations in machine learning, such as fairness, accountability, and transparency, and identify potential biases and issues that may arise in the development and deployment of machine learning models.
Module Delivery:
Three weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 25 hours)
In this module, participants will apply machine learning techniques to software applications using the Python language. Successful participants will apply each step of the machine learning workflow in relevant industry applications, and will be exposed to cutting-edge techniques and the underlying theory. Technical topics include: Machine Learning in Practice, Data Preparation and Feature Engineering, Supervised and Unsupervised Learning, Model Evaluation, and Ensemble Learning. Participants will also work on professional skills such as networking, building a community of influence, and mastering the interview process, including behavioral and technical interviews.
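As a compact sketch of the workflow steps listed above, chaining data preparation, a supervised model, and evaluation with scikit-learn keeps the assessment honest; the built-in toy dataset is used purely for illustration.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # A pipeline refits the scaler inside each fold, preventing data leakage.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")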
Learning Outcomes:
- Propose and present a business pitch for a machine learning project in a real-world business setting through a business case study with an industry partner or at their current organization.
- Design, formulate, and construct a comprehensive, full-lifecycle machine learning project as a module project, managing a timeline throughout.
- Apply, deploy, and implement each step of the machine learning lifecycle in the Python programming language, and debug errors and iterate on improvements.
Module Delivery:
Three weeks of technical facilitator-led live webinars, each lasting 2.5 hours (totaling 25 hours)