Ali Haider Bangash edited this page Apr 6, 2020 · 51 revisions

Investigating feature identification approaches for AA sequences, K-mers, in-silico prediction of epitopes, and study of secondary structures.

Project Home

Communication

For the time being, there is a #machinelearning channel on the Slack group (check out the virtual-biohackathon@googlegroups.com group for the invitation link). During the BioHackathon, we'll update this section.

We've set up a dedicated GitHub organization here. For a detailed list of all tasks, code and resources, please go there. This page will only be updated with the main points.

Coordination calls

  • 1st e-meeting: Sunday, April 5th @ 17:00 CEST, using Zoom.
  • 2nd e-meeting: tbd

Participants

Resources

Introductory starter code for reading .fasta files with the Biopython library is available in the "Biopython library- CoVid 2019 BH20- starter notebook".

Please check out the Datasets and Tools page.

If you have any new resources in mind, please add them there directly.

Data Structures

The FASTA format is a simple text-based file format for sequence data that can accommodate both nucleic acid (DNA/RNA) and amino acid (Protein) sequences.

Each record consists of a header line that begins with a > symbol and contains fields that are often (although not always) separated by | characters, followed by the sequence data in IUPAC symbol representation. Example:

$ head MN908947.3.fasta
>MN908947.3 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC
GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG
CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT
GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT

We'll try to make most of the raw data available in this format to make developing workflows easier, but if we start incorporating others, we'll provide descriptions here.
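To make the format concrete, here is a minimal sketch of a FASTA reader in plain Python. It is for illustration only: the function name `read_fasta` and the two toy records are made up for this example, and it assumes well-formed input in which every sequence is preceded by a `>` header line.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record FASTA text, made up for illustration only.
example = io.StringIO(
    ">seq1 first toy record\n"
    "ATTAAAGG\n"
    "TTTATACC\n"
    ">seq2 second toy record\n"
    "ACGT\n"
)

records = list(read_fasta(example))
for header, sequence in records:
    seq_id = header.split()[0]  # text before the first whitespace
    print(seq_id, len(sequence))
```

The reader joins wrapped sequence lines into one string per record, which is the main thing a FASTA parser has to handle beyond splitting on headers.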

Luckily, libraries such as Biopython make I/O tasks with this data very easy, with flexible parsers such as SeqIO.

Example:

>>> from Bio import SeqIO
>>> seq = SeqIO.read("MN908947.3.fasta",'fasta')
>>> seq
SeqRecord(seq=Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', SingleLetterAlphabet()), id='MN908947.3', name='MN908947.3', description='MN908947.3 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome', dbxrefs=[])

The .read() method returns a single SeqRecord object and expects the target FASTA file to contain exactly one entry. The .parse() method returns an iterator over SeqRecord objects, for FASTA files that contain multiple entries. A SeqRecord object in Biopython has a .seq attribute that holds the actual sequence data. This can be converted efficiently to a Python str object for ordinary string operations.

>>> str(seq.seq)
'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGAT...'

Vectorization

There are many ways to take this sequence data and apply a transformation that makes it suitable for machine learning applications. The simplest example would be to reduce the sequence to a single summary statistic such as guanine-cytosine (GC) content. Example:

>>> from Bio.SeqUtils import GC
>>> GC(seq.seq)
37.97277865097148

This is obviously extremely reductive and is unlikely to contain much information about any response variables we might be interested in modelling.
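For reference, the same statistic can be computed by hand. The sketch below is an assumption-laden simplification: the helper name `gc_percent` is hypothetical, and it counts only the unambiguous bases G and C, ignoring IUPAC ambiguity codes.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a nucleotide string."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid division by zero on empty input
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


# First 20 bases of the MN908947.3 sequence shown earlier on this page.
toy = "ATTAAAGGTTTATACCTTCC"
print(gc_percent(toy))  # 6 of 20 bases are G or C, so this prints 30.0
```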

Another approach would be to take K-sized substrings, known as K-mers. The DNA class from Scikit-bio has convenient methods to produce K-mer counts from a sequence string. Example:

>>> from skbio import DNA
>>> kmers = DNA(str(seq.seq)).kmer_frequencies(5)
>>> kmers
{'ATTAA': 60, 'TTAAA': 95, 'TAAAG': 64, 'AAAGG': 45, 'AAGGT': 46, 'AGGTT': 51, 'GGTTT': 55, 'GTTTA': 69, 'TTTAT': 75,...}
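To show what such a count involves, here is a minimal sketch of K-mer counting using only the standard library. The helper name `kmer_counts` is hypothetical, and the sketch assumes overlapping windows (a window that slides one base at a time).

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping K-mers (substrings of length k) in a sequence string."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k along the sequence, one position at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


toy = "ATTAAAGGTTTATACC"  # first 16 bases of the MN908947.3 sequence
counts = kmer_counts(toy, 3)
print(counts["TTA"])         # 'TTA' occurs twice in the toy sequence
print(sum(counts.values()))  # 16 - 3 + 1 = 14 windows in total
```

A sequence of length n yields n - K + 1 overlapping windows, so the counts always sum to that number.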

We can use this returned dictionary to easily assemble a matrix using the pandas DataFrame constructor. Let's pull in the original SARS genome to demonstrate:

>>> seq2 = SeqIO.read("AY274119.3.fasta",'fasta')
>>> seq2
SeqRecord(seq=Seq('ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGA...AAA', SingleLetterAlphabet()), id='AY274119.3', name='AY274119.3', description='AY274119.3 Severe acute respiratory syndrome-related coronavirus isolate Tor2, complete genome', dbxrefs=[])
>>> kmers2 = DNA(str(seq2.seq)).kmer_frequencies(5)
>>> import pandas as pd
>>> pd.DataFrame([kmers,kmers2])
   AAAAA  AAAAC  AAAAG  AAAAT  AAACA  AAACC  AAACG  AAACT  AAAGA  AAAGC  \
0     85     59     66     70     75     42     18     56     73     30   
1     66     41     53     63     62     33     16     44     52     24   

   ...    TTTCG  TTTCT  TTTGA  TTTGC  TTTGG  TTTGT  TTTTA  TTTTC  TTTTG  TTTTT  
0  ...       11     63     56     58     55     88     90     51     97     61  
1  ...       19     80     49     53     50     56     64     46     67     39  

Choosing different K sizes allows for tuning the dimensionality of the data representation, but the useful range of K can be constrained by the low fidelity of replication in RNA virus genomes. As K increases, the feature space expands very quickly: for example, there are 4^13 = 67,108,864 theoretically possible K-mers of size 13. Vectors of this size require special data structures for efficient storage in memory, because most of their entries are 0. These are implemented in SciPy as sparse matrices and require special routines for efficient construction and operations. Luckily, many routines in scikit-learn accept these sparse objects in their .fit() methods.

Feature vectors built from these K-mer substrings can be represented as a vector of integer counts, or as a boolean vector that records the presence or absence of each K-mer. These feature extraction choices can influence the interpretation of downstream analysis and the choice of methods.
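The idea of storing only the non-zero entries can be sketched with plain dictionaries. This is an illustration of the principle, not a replacement for SciPy's sparse matrices: the helper names (`kmer_index`, `sparse_vector`) and the two toy sequences are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping K-mers in a sequence string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_index(k, alphabet="ACGT"):
    """Map every possible K-mer over the alphabet to a fixed column index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def sparse_vector(sequence, k, index):
    """Return {column: count}, holding only the non-zero entries."""
    counts = kmer_counts(sequence, k)
    # K-mers with symbols outside the alphabet (e.g. N) are skipped here.
    return {index[kmer]: n for kmer, n in counts.items() if kmer in index}


k = 3
index = kmer_index(k)  # 4**3 = 64 columns; 4**13 would be 67108864

toy_a = "ATTAAAGGTTTATACC"  # toy prefixes, for illustration only
toy_b = "ATATTAGGTTTTTACC"
vec_a = sparse_vector(toy_a, k, index)
vec_b = sparse_vector(toy_b, k, index)

# Boolean (presence/absence) view: just the set of occupied columns.
presence_a = set(vec_a)
presence_b = set(vec_b)
shared = presence_a & presence_b

print(len(index), len(vec_a), len(vec_b), len(shared))
```

Each vector stores at most as many entries as there are windows in the sequence, no matter how large 4^K becomes, which is why the sparse form stays manageable for large K.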

Ideas for Projects

Left here for reference / legacy - refer to the covid19-bh-machine-learning GitHub repo for details.

Investigating Multi-Layer Perceptrons, Convolutional Neural Networks, Regression Models, and ensemble methods for prediction of disease progression, the impact of geographical distribution, etc.

(Side Note: There seems to be some overlap between this Task and the BioStatistics Task. It may be worth considering merging these two.)

  • Machine learning often requires substantial computing resources, in many cases GPUs. Kubeflow, a highly portable, cloud-native platform for workflows, is optimised for machine learning, and containerised workloads can easily be ported onto it.

  • Apply Markov Clustering (MCL) to the currently available SARS-CoV-2 GenBank sequences in order to identify potential groupings beyond the traditional phylogenetic ones. Apply it at both the NT and the AA level, based on a number of distance metrics (e.g. e-value, string distance, etc.).

  • Diagnose COVID-19 based on image data from CT scans and X-rays, using neural net models for image classification.
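The Markov Clustering idea in the list above can be sketched as follows. This is a minimal pure-Python version of the usual expansion/inflation loop, not the reference MCL implementation. The function names, the default parameters, and the toy adjacency matrix are assumptions for illustration; in practice the matrix would be derived from pairwise sequence similarities.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (a column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a graph by alternating expansion and inflation steps."""
    n = len(adjacency)
    # Add self-loops, then turn each column into a probability distribution.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: two-step random walks
        inflated = normalize_columns(  # inflation: strengthen strong flows
            [[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Rows with a non-zero diagonal are attractors; each defines a cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > tol:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > tol))
    return sorted(sorted(c) for c in clusters)


# Toy graph: two disconnected triangles (nodes 0-2 and nodes 3-5).
toy_adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(toy_adjacency)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

The dense list-of-lists matrices keep the sketch short, but real sequence collections would need a sparse representation and a similarity threshold to stay tractable.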