Department of Biochemistry & Biomedical Sciences, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
This is a living document; content will be updated frequently.
Introduction to bioinformatics theory, tools, and practice with an emphasis on high-throughput DNA sequencing technologies. Areas of emphasis include gene sequence analysis, functional prediction, genome assembly and annotation, gene expression analysis, gene regulation analysis, genome databases, and microbial genomics. Includes an introduction to the command line, software development, and cloud computing.
By the end of this course, the student should have practical skills with a number of bioinformatics techniques common in a modern research laboratory, familiarity with online databases and their use, and a knowledge of the use of genomics data for hypothesis testing.
https://academiccalendars.romcmaster.ca/preview_course_nopop.php?catoid=24&coid=142047
This GitHub repository contains only material developed directly by Dr. McArthur and does not include guest lectures, student-generated content, or course documents. These are available only to registered students via Avenue to Learn. In addition, some of the exercises require password access to class servers, which is available to registered students only. Access can be provided on request for undergraduate and graduate students in Biochemistry & Biomedical Sciences, the Michael G. DeGroote Institute for Infectious Disease Research, or other affiliated programs. Please see License and Copyright information.
Week | Dates | Lecture | Tutorial | Flash Updates | Assessment |
---|---|---|---|---|---|
1 | September 3, 5, 6 | Lecture 1: Introduction to Bioinformatics & the Course (~42 minute video) | Tours of FHS SeqCore & Computer Services Unit | ||
2 | September 10, 12, 13 | Tours of FHS SeqCore & Computer Services Unit | Lab 1: Introduction to Lab & Genome Databases | GenBank, Ensembl, Growth of Sequencing Data | tutorial, Flash Updates |
3 | September 17, 19, 20 | Lecture 2: Sequence Similarity & Searching (~48 minute video) | Lab 2: Searches, Protein Annotation | BLAST, Pfam, PROSITE | tutorial, Flash Updates |
4 | September 24, 26, 27 | Lecture 3: Evolutionary Biology (~50 minute video), Bonus: Bayesian Phylogenetics | Lab 3: Phylogeny Lab | Terminology, Sequence Alignment, Phylogenetic Trees | tutorial, Flash Updates |
5 | October 1, 3, 4 | Lecture 4: Beyond the Gene - Networks, Ontologies (~35 minute video) | Lab 4: Ontology and Antimicrobial Resistance | Gene Ontology, KEGG, CARD | tutorial, Flash Updates |
6 | October 8, 10, 11 | Lecture 5: TA Research Talk - Jalees (~15 minute video); Maddie (~15 minute video); Karyn (~15 minute video) | Linux & Sequencing Informatics (demo) | Sanger Sequencing, FASTA, Linux | lecture quiz, Flash Updates |
7 | October 15, 17, 18 | mid-term recess | |||
8 | October 22, 24, 25 | Lecture 6: DNA Sequencing & Genome Assembly (~50 minute video), Bonus: De Bruijn graph walkthrough | Lab 6: Galaxy, FASTQ, Assembly | Illumina Sequencing, FASTQ, Galaxy | tutorial, Flash Updates |
9 | October 29, 31, & November 1 | Lecture 7: Molecular Epidemiology (~37 minute video) | Lab 7: SNP analysis & Molecular Epidemiology | SNPs, Horizontal Gene Transfer, Metagenomics | tutorial, Flash Updates |
10 | November 5, 7, 8 | Lecture 8: Gene Expression Analysis (~53 minute video) | Lab 8: Microarray Lab (demo) | Microarrays, Normalization, False Discovery | Flash Updates |
11 | November 12, 14, 15 | Lecture 9: RNA-Seq, ChIP-Seq, Bisulfite-Seq (~34 minute video) | Lab 9: RNA-Seq | RNA-Seq, Illumina HT-12, Tn-Seq | tutorial, Flash Updates |
12 | November 19, 21, 22 | Guest Lecture: Dr. Kathleen Houlahan - Machine Learning (~42 minute video) | Lab 10: Machine learning classifier to predict breast cancer | Random Forest, Logistic Regression, Natural Language Processing | lecture quiz, tutorial, Flash Updates |
13 | November 26, 28, 29 | Lecture 10: Advances in DNA Sequencing (~31 minute video) | | | lecture quiz |
14 | December 3 | Lecture 11: Genomics of Pandemics (~63 minute video) | | | lecture quiz |
BONUS LECTURE (not official course content in 2024): Lecture 12: Internet of Things & Big Data (~44 minute video, recorded in 2023)
- All assignments are to be submitted to A2L by 11:59 pm on the date the assignment is due unless otherwise stated.
- The Critical Review is to be submitted to the assessment drop box on A2L by 11:59 pm on October 27, 2024.
- Throughout the term, each student will give a single 10-minute Flash Update presentation on an assigned topic and must upload their slides to A2L before the start of their tutorial.
WEEK 2 - GenBank, Ensembl, Growth of Sequencing Data
- NCBI & GenBank. Provide a review of the GenBank resource, with an emphasis on the variety of tools and data it offers. See Nucleic Acids Res. 2023 Jan 6;51(D1):D29-D38. PMID 36370100, Nucleic Acids Res. 2022 Jan 7;50(D1):D161-D164. PMID 34850943.
- Ensembl. Provide a review of the Ensembl resource, with an emphasis on the variety of tools and data it offers. See Nucleic Acids Res. 2023 Jan 6;51(D1):D933-D941. PMID 36318249.
- Growth of Sequencing Data. Provide an overview of the growth of DNA sequencing data as well as predicted growth. See Nucleic Acids Res. 2023 Jan 6;51(D1):D141-D144. PMID 36350640, GenBank and WGS Statistics, The Cost of Sequencing a Human Genome, and In The Year 2030—Looking at How Genomic Data Might Evolve.
WEEK 3 - BLAST, Pfam, PROSITE
- BLAST. Provide a review of the purpose of BLAST algorithms for database searching and how to perform them online. Specifically, outline the difference between BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX. See Nature Education 1: 215 & Curr Protoc Mol Biol. 2001 May;Chapter 19:Unit 19.3 PMID 18265177.
- Pfam. Provide a review of the Pfam resource, with an emphasis on the variety of tools and data it offers (as well as its migration to InterPro). See Nucleic Acids Res. 2021 Jan 8;49(D1):D412-D419 PMID 33125078 and Nucleic Acids Res. 2023 Jan 6; 51(D1): D418–D427 PMID 36350672.
- PROSITE. Provide a review of the PROSITE resource, with an emphasis on the variety of tools and data it offers. See Nucleic Acids Res. 2013 41(Database issue):D344-7 PMID 23161676 and the PROSITE website.
WEEK 4 - Terminology, Sequence Alignment, Phylogenetic Trees
- Terminology. Explain the difference between the terms “similarity” and “homology”. Differentiate between the terms “homolog”, “paralog”, “ortholog”. See Annu Rev Genet. 2005;39:309-38 PMID 16285863 and BLAST Glossary.
- Sequence Alignment. Explain the difference between local alignment (e.g. BLAST) and global alignment (e.g. CLUSTAL) and introduce the CLUSTAL family of algorithms. See Protein Sci. 2018 Jan;27(1):135-145 PMID 28884485.
- Phylogenetic Trees. Provide an overview of what a phylogenetic tree represents and the major concepts for its interpretation. See Nature Education 1: 190 and How to read a phylogenetic tree.
WEEK 5 - Gene Ontology, KEGG, CARD
- Gene Ontology. Introduce the Gene Ontology. See Nucleic Acids Res. 2019 Jan 8;47(D1):D330-D338 PMID 30395331 and Genetics 2023 May 4;224(1):iyad031 PMID 36866529.
- KEGG. Introduce the Kyoto Encyclopedia of Genes and Genomes (KEGG). See Nucleic Acids Res. 2023 Jan 6;51(D1):D587-D592 PMID 36300620 and Nucleic Acids Res. 2019 Jan 8;47(D1):D590-D595 PMID 30321428.
- CARD. Introduce the Comprehensive Antibiotic Resistance Database. See Nucleic Acids Res. 2023 Jan 6;51(D1):D690-D699 PMID 36263822 and Nucleic Acids Res. 2020 48(Database issue):D517-D525 PMID 31665441.
WEEK 6 - Sanger Sequencing, FASTA, Linux
- Sanger Sequencing. Review the Sanger DNA sequencing method, with emphasis upon automation by Applied Biosystems. See Nature Education 1:193 and The Order of Nucleotides in a Gene Is Revealed by DNA Sequencing. Note: You do not need to introduce 454, Illumina, or Next-Generation Sequencing (NGS) methods.
- FASTA. Introduce the FASTA file format, review its origins, and illustrate how it was adapted for raw DNA sequencing results. Also introduce the concept of quality scores generated by the legacy base calling software PHRED (the QUAL format file). See Wikipedia, PHRED, and Nucleic Acids Res. 2010 38:1767-71 PMID 20015970. Note: You do not need to introduce the FASTQ format for Next-Generation Sequencing (NGS) methods.
- Linux. Introduce the concept of operating systems (Windows, Mac, “command line”). Give a brief history of the origins of UNIX and how it differs from Linux. See What is Linux, Differentiating UNIX and Linux, and Difference between Unix and Linux.
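
As a companion to the FASTA topic above, the following is a minimal sketch of reading FASTA-formatted text in Python. It assumes the common convention that each record starts with a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the example records are hypothetical and are not part of the course material.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}.

    Assumes each record begins with a '>' header line and that the
    sequence may be wrapped across several lines.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join wrapped sequence lines into one string per record
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-record example
example = """>seq1 example gene
ATGGCGTACG
TTAGC
>seq2 another example
ATGCCC
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

For real analyses, an established parser (for example, the one in Biopython) is usually a better choice than hand-written code.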
WEEK 8 - Illumina Sequencing, FASTQ, Galaxy
- Illumina Sequencing. Review the Illumina DNA sequencing method. See DNA Sequencing: How to Choose the Right Technology and Explore Illumina sequencing technology. Note: You may use images from the “Illumina Sequencing Introduction” PDF.
- FASTQ. Introduce the FASTQ file format and review how it was developed for Next-Generation Sequencing (NGS). Review the concept of base calling quality and how it is encoded in FASTQ. See Nucleic Acids Res. 2010 38:1767-71 PMID 20015970. Note: We will be handling recent Illumina FASTQ data, which uses an offset of 33; see https://en.wikipedia.org/wiki/FASTQ_format.
- Galaxy. Introduce the Galaxy platform for bioinformatics analysis. See Genome Biol. 2010;11(8):R86 PMID 20738864 and Nucleic Acids Res. 2022 Jul 5;50(W1):W345-W351 PMID 35446428.
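
To illustrate the offset of 33 mentioned in the FASTQ topic above, here is a minimal sketch of decoding a quality string in Python. Each quality character encodes a Phred score Q as the ASCII character with code Q + 33, and Q relates to the base-calling error probability p by Q = -10 log10(p). The function names and the four-line record are hypothetical examples, not course data.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 quality string to a list of integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical FASTQ record: header, sequence, separator, quality string
record = ["@read1", "GATTACA", "+", "IIII5!#"]

scores = decode_phred33(record[3])
print(scores)  # [40, 40, 40, 40, 20, 0, 2]

# Pair each base with its score and estimated error probability
for base, q in zip(record[1], scores):
    print(base, q, error_probability(q))
```

A score of 20 corresponds to an estimated 1 in 100 chance of an incorrect base call, and a score of 40 to 1 in 10,000.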
WEEK 9 - SNPs, Horizontal Gene Transfer, Metagenomics
- SNPs. Define the term Single Nucleotide Polymorphism (SNP) and explain how these data can be used to determine organism/strain relatedness. Use SARS-CoV-2 as an example. See Microbiol Spectr. 2023 Jun 15;11(3):e0190022 PMID 37093060 and Phylogenetic Analysis of SARS-CoV-2 in Ontario.
- Horizontal Gene Transfer. Define the term Horizontal Gene Transfer (HGT; also known as Lateral Gene Transfer, LGT) and discuss how it could confound determination of organism/strain relatedness using SNP analysis. Use the emergence of MCR-1 as an example. See Lancet Infect Dis. 2015 Nov 18. pii: S1473-3099(15)00424-7 PMID 26603172.
- Metagenomics. Introduce metagenomics in the context of molecular and clinical epidemiology. See Expert Rev Mol Diagn. 2018 Jul;18(7):605-615. PMID 29898605.
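
As one simple way to see how SNPs can indicate strain relatedness (the SNPs topic above), the sketch below counts pairwise SNP differences between sequences that are assumed to be already aligned and of equal length. Positions with gaps or ambiguous bases are skipped. The strain names and sequences are invented for illustration, and real analyses typically call SNPs against a reference genome and build a phylogeny rather than relying on raw counts.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Only positions where both sequences have an unambiguous base
    (A, C, G, or T) are compared; gaps and N are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


# Hypothetical aligned sequences from three strains
strains = {
    "strain_A": "ACGTACGTAC",
    "strain_B": "ACGTTCGTAC",
    "strain_C": "ACNTTCGAAC",
}

# Pairwise SNP distances; smaller values suggest closer relatedness
distances = {
    (x, y): snp_distance(strains[x], strains[y])
    for x, y in combinations(strains, 2)
}
for pair, d in distances.items():
    print(pair, d)
```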
WEEK 10 - Microarrays, Normalization, False Discovery
- Microarrays. Review microarray technology for measurement of absolute or relative gene expression levels. Highlight the key difference between microarrays and RNA sequencing approaches. See Nature Education 1:195 and Scientists Can Study an Organism's Entire Genome with Microarray Analysis.
- Normalization. Introduce the concept of normalization and why it is needed in microarray analysis. Review the major normalization approaches. See Nat Genet. 32 Suppl:496-501. PMID 12454644.
- False Discovery. Introduce the concept of the false discovery rate and how it is handled in genomic analyses. See Proc Natl Acad Sci USA. 100: 9440-5. PMID 12883005 and P-values, False Discovery Rate (FDR) and q-values.
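
To make the false discovery rate topic above more concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment, one widely used FDR procedure. Note that this is not the q-value method of the PNAS paper cited above; it is shown only because it is short enough to follow by hand. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the adjusted
    value is the minimum of p * m / k over all ranks k >= r, capped at 1.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four gene-level tests
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)

# Genes whose adjusted p-value falls below a chosen FDR threshold
threshold = 0.05
significant = [i for i, q in enumerate(adjusted) if q < threshold]
print(adjusted, significant)
```

Calling a gene significant when its adjusted value is below 0.05 aims to keep the expected proportion of false positives among the called genes at or below 5%.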
WEEK 11 - RNA-Seq, Illumina HT-12, Tn-Seq
- RNA-Seq. Provide an overview of the steps in RNA-Seq analysis of transcriptomes. See Nat Rev Genet. 10:57-63. PMID 19015660 and Study gene expression using RNA sequencing.
- Illumina Bead Microarrays. Introduce ‘bead chip’ technologies for measurement of gene expression levels. Contrast the method with RNA-Seq and traditional two-channel microarrays. Illustrate how the technology can be used for gene expression, gene copy number, and gene methylation measurement. See Bead-Based Microarray Technology and embedded links.
- Tn-Seq. Provide an overview of the Tn-Seq approach to examining bacterial genetics. See MBio 2:e00315-10. PMID 21253457.
WEEK 12 - Random Forest, Logistic Regression, Natural Language Processing
- Random Forest. Provide an overview of the Random Forest method for classification of complex data. See An Introduction to Random Forest, Proc Natl Acad Sci U.S.A. 115:1690-1692 PMID 29440440, and Front Genet. 9:297 PMID 30123241.
- Logistic Regression. Provide an overview of Logistic Regression, a predictive machine learning method. See Introduction to Logistic Regression and mSystems 4:e00211-19 PMID 31387929.
- Natural Language Processing. Provide an overview of Natural Language Processing for turning text into data. See Introduction to Natural Language Processing for Text and BMC Bioinformatics 9:193 PMID 18410678.