This repository has been archived by the owner on Feb 6, 2022. It is now read-only.

A Haskell script that performs a Knuth-Morris-Pratt string search across a region of a fasta file for ambiguity codes, and determines whether variants are found within those ambiguity codes.


Matthew-Mosior/Fasta-Region-Inspector-OLD


Fasta-Region-Inspector: A Somatic Hypermutation Analysis Tool

Introduction

Fasta-Region-Inspector (FRI) is a computational tool for analyzing somatic hypermutation.
This Haskell script takes in variant information, corresponding region information, and ambiguity code string(s) to determine:

  1. Whether the user-defined variants are within 2 Kb of the transcription start site (TSS) of the corresponding gene.
  2. All possible start locations of mapped ambiguity code strings within 2 Kb of the TSS.
  3. The final list of user-defined variants that lie within a mapped ambiguity code string inside the 2 Kb window of the TSS of the corresponding gene.
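
The first check reduces to a coordinate comparison. The following is a minimal sketch, not FRI's actual code: the names `Strand`, `tssWindow`, and `withinTssWindow` are hypothetical, and the sketch assumes the window extends from the TSS in the direction of transcription (increasing coordinates on the forward strand, decreasing coordinates on the reverse strand).

```haskell
-- Hypothetical sketch: names and window orientation are assumptions, not FRI's code.

-- | Strand orientation as encoded in the region input (1 or -1).
data Strand = Forward | Reverse
  deriving (Eq, Show)

-- | Inclusive genomic interval covered by a window of the given size,
-- assumed here to extend from the TSS in the direction of transcription.
tssWindow :: Int -> Strand -> Int -> (Int, Int)
tssWindow size Forward tss = (tss, tss + size)
tssWindow size Reverse tss = (tss - size, tss)

-- | True when the variant position lies inside the window.
withinTssWindow :: Int -> Strand -> Int -> Int -> Bool
withinTssWindow size strand tss pos = lo <= pos && pos <= hi
  where
    (lo, hi) = tssWindow size strand tss
```

With a window size of 2000, `withinTssWindow 2000 Forward 10000 11500` holds, while the same position against a reverse-strand TSS at 10000 does not.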

FRI outputs three files:

  1. variants.tsv - Lists the user-defined variants with their corresponding region information and whether each variant is within 2 Kb of the TSS (Y/N).
  2. ambiguity_codes.tsv - Lists the mapped ambiguity code strings, along with corresponding region information and all start locations where these mapped ambiguity code strings are found.
  3. variants_in_ambiguity_codes.tsv - Lists the final set of user-defined variants that are within 2 Kb of the TSS of the corresponding gene and lie within a mapped ambiguity code string.
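
The relationship between the second and third files can be illustrated as an interval test: a variant lies within a mapped ambiguity code string when its position falls inside one of the reported occurrences. This is a sketch under assumptions, not FRI's implementation. The name `variantInMatch` is hypothetical, and the start locations and the variant position are assumed to share one coordinate system.

```haskell
-- Hypothetical sketch: assumes start locations and the variant position
-- are expressed in the same coordinate system.

-- | True when the position falls inside an occurrence of a mapped string
-- of length 'len' that begins at one of the given start locations.
variantInMatch :: Int -> [Int] -> Int -> Bool
variantInMatch len starts pos = any covers starts
  where
    covers start = start <= pos && pos < start + len
```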

Downstream analysis is left to the user, as the output files can easily be filtered using a scripting language like awk. Filtering on reference/alternate bases, strand orientation (1 vs. -1), and SYMBOL (gene) are some examples of ways to continue narrowing down the output of FRI. After filtering, computing basic statistics that summarize variants within ambiguity codes is fairly simple, and can give the user some perspective on which patients' phenotypes may be explained by somatic hypermutation.

Theory and Implementation

String searching plays a large role in this program, as it makes up an overwhelming proportion of the program's runtime. An implementation of the Knuth-Morris-Pratt algorithm provided by the stringsearch package is used to find all possible locations of mapped ambiguity code strings across the 2 Kb window of the TSS.
The Knuth-Morris-Pratt implementation was chosen over other string-searching algorithms like Boyer-Moore or Rabin-Karp due to its strength with low-complexity search alphabets (A, T, G, C).
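
To show what the algorithm does, here is a small teaching sketch of Knuth-Morris-Pratt over plain Haskell lists. It is not the stringsearch implementation that FRI calls (which works on ByteStrings), and the name `kmpIndices` is hypothetical. It returns the 0-based start index of every occurrence, including overlapping ones.

```haskell
import Data.Array (listArray, (!))

-- | All 0-based start indices at which the pattern occurs in the text,
-- including overlapping occurrences. Teaching sketch, not stringsearch.
kmpIndices :: Eq a => [a] -> [a] -> [Int]
kmpIndices [] _ = []
kmpIndices pat text = go 0 0 text
  where
    m      = length pat
    patArr = listArray (0, m - 1) pat
    -- border ! k: length of the longest proper prefix of the first k
    -- pattern characters that is also a suffix of them.
    border = listArray (0, m)
               (0 : 0 : [ advance (border ! (k - 1)) (patArr ! (k - 1)) | k <- [2 .. m] ])
    -- Given k matched characters (k < m) and the next character c,
    -- return the new number of matched characters.
    advance k c
      | patArr ! k == c = k + 1
      | k == 0          = 0
      | otherwise       = advance (border ! k) c
    -- i is the index of the current text character, k the matched length.
    go _ _ [] = []
    go i k (c : cs)
      | k' == m   = (i - m + 1) : go (i + 1) (border ! m) cs
      | otherwise = go (i + 1) k' cs
      where
        k' = advance k c
```

For example, `kmpIndices "AGCT" "TTAGCTAGCTA"` evaluates to `[2,6]`. The text is never re-scanned: after a mismatch the search falls back through the border table instead of moving backwards in the text.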

Generating all possible strings that match a regular expression is needed for this program, as the user-defined ambiguity code string(s) can be thought of as regular expressions.
Thompson's construction algorithm can be used to create a nondeterministic finite automaton (NFA) to solve this problem.

An example of transforming an ambiguity code string into a regular expression would be the following:

ambiguity code: <-> regular expression:

WRCY <-> [A|T][A|G]C[C|T]

FRI uses the sbv package, which relies on a theorem prover to construct an NFA, to generate all possible strings from the user-defined ambiguity code(s).
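
Because every position of an ambiguity code stands for a small, fixed set of bases, the set of matching strings can also be enumerated directly. The sketch below deliberately swaps in plain list enumeration in place of sbv, purely to illustrate what "all possible strings" means. The names `iupacBases` and `expandAmbiguityCode` are hypothetical and are not part of FRI.

```haskell
-- Illustration only: plain enumeration, not the sbv-based approach FRI uses.

-- | Bases allowed by each IUPAC nucleotide code; unknown characters allow none.
iupacBases :: Char -> String
iupacBases c = case c of
  'A' -> "A"
  'C' -> "C"
  'G' -> "G"
  'T' -> "T"
  'R' -> "AG"
  'Y' -> "CT"
  'S' -> "CG"
  'W' -> "AT"
  'K' -> "GT"
  'M' -> "AC"
  'B' -> "CGT"
  'D' -> "AGT"
  'H' -> "ACT"
  'V' -> "ACG"
  'N' -> "ACGT"
  _   -> ""

-- | Every concrete nucleotide string matched by an ambiguity code string.
-- 'traverse' in the list applicative picks one base per position in all
-- possible ways.
expandAmbiguityCode :: String -> [String]
expandAmbiguityCode = traverse iupacBases
```

`expandAmbiguityCode "WRCY"` evaluates to `["AACC","AACT","AGCC","AGCT","TACC","TACT","TGCC","TGCT"]`, the eight strings described by `[A|T][A|G]C[C|T]`.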

Prerequisites

fri.hs assumes you have the GHC compiler and the packages it imports installed. The easiest way to do this is to download the Haskell Platform.

Installing required packages

To install the peripheral packages fri.hs requires, you can call the following command assuming you have cabal, a package manager and build system for Haskell, installed on your system (it comes with the Haskell Platform).

$ cabal install [packagename]

Required packages

  • Bio.Core.Sequence
  • Bio.Sequence.Fasta
  • Control.DeepSeq
  • Data.ByteString
  • Data.ByteString.Char8
  • Data.ByteString.Lazy
  • Data.ByteString.Search.DFA
  • Data.Char
  • Data.Functor
  • Data.List
  • Data.List.Split
  • Data.Ord
  • Data.SBV
  • Data.SBV.String
  • Data.SBV.RegExp
  • Data.Traversable
  • System.Console.GetOpt
  • System.Process
  • System.Environment
  • System.Exit
  • System.IO
  • System.IO.Temp
  • Text.PrettyPrint.Boxes
  • Text.Regex.TDFA

Input

FRI requires four inputs:

  1. Variant Input - This input tsv file needs to have the following fields:

Sample\tSymbol\tChromosome\tStart\tStop\tRef\tAlt

This file holds all of the variant information.
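
As an illustration of this layout, one row of the variant input could be read into a record as follows. This is a hedged sketch rather than FRI's parser: the `Variant` type, its field names, and `parseVariant` are hypothetical, and Start and Stop are assumed to be integers. A header row, whose Start field is not numeric, yields `Nothing` in this sketch.

```haskell
import Text.Read (readMaybe)

-- | Hypothetical record mirroring the seven tab-separated fields.
data Variant = Variant
  { vSample     :: String
  , vSymbol     :: String
  , vChromosome :: String
  , vStart      :: Int
  , vStop       :: Int
  , vRef        :: String
  , vAlt        :: String
  } deriving (Eq, Show)

-- | Split a line on tab characters, keeping empty fields.
splitOnTab :: String -> [String]
splitOnTab s = case break (== '\t') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnTab rest

-- | Parse one row; Nothing when the field count or the coordinates are invalid.
parseVariant :: String -> Maybe Variant
parseVariant line = case splitOnTab line of
  [sample, symbol, chrom, start, stop, ref, alt] ->
    Variant sample symbol chrom
      <$> readMaybe start
      <*> readMaybe stop
      <*> pure ref
      <*> pure alt
  _ -> Nothing
```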

  2. Region Input - This input tsv file needs to have the following fields:

Chromosome\tTSS\tStrand\tGene_Name

This file holds all of the corresponding region information.

To create this file, you will typically take each variant of interest's corresponding ENST, and query bioMart to return the following fields:

Gene stable ID
Transcript stable ID
Strand
Chromosome/scaffold name
Transcription start site (TSS)
Gene name

You can then subset this file to contain only fields Chromosome/scaffold name, Transcription start site (TSS), Strand and Gene name.

  3. Ambiguity Codes String - This string argument describes the ambiguity codes to search for within the TSS of each gene.

The user may define as many ambiguity code strings as desired, but keep in mind, the more ambiguity codes you input, the more strings that need to be searched for within the TSS.

The string input is semicolon delimited and the string needs to start and end with a semicolon.

The following is an example string:

Assume the user wants to search for WRCY, WRC, YYGG and CCGY in the TSS.

Ambiguity Code String -> ;WRCY;WRC;YYGG;CCGY;

FRI sees this as: ;[A|T][A|G]C[C|T];[A|T][A|G]C;[C|T][C|T]GG;CCG[C|T];

These mapped ambiguity code strings will be used for genes where the TSS is on the forward (1) strand.
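
The mapping from the semicolon-delimited argument to the bracket notation above can be sketched as follows. This is an illustration, not FRI's parser: `splitCodes`, `basesFor`, and `toBracketForm` are hypothetical names, and only the two-base IUPAC codes are expanded to keep the example short.

```haskell
import Data.List (intercalate)

-- | Split the semicolon-delimited argument into individual ambiguity codes,
-- discarding the empty fields produced by the leading and trailing semicolons.
splitCodes :: String -> [String]
splitCodes = filter (not . null) . fields
  where
    fields s = case break (== ';') s of
      (code, [])       -> [code]
      (code, _ : rest) -> code : fields rest

-- | Bases allowed by the two-base IUPAC codes; any other character maps to itself.
basesFor :: Char -> String
basesFor c = case c of
  'R' -> "AG"
  'Y' -> "CT"
  'S' -> "CG"
  'W' -> "AT"
  'K' -> "GT"
  'M' -> "AC"
  _   -> [c]

-- | Render one ambiguity code in the bracket notation used in this README.
toBracketForm :: String -> String
toBracketForm = concatMap render
  where
    render c = case basesFor c of
      [b]   -> [b]
      bases -> "[" ++ intercalate "|" (map (: []) bases) ++ "]"
```

`map toBracketForm (splitCodes ";WRCY;WRC;YYGG;CCGY;")` evaluates to `["[A|T][A|G]C[C|T]","[A|T][A|G]C","[C|T][C|T]GG","CCG[C|T]"]`.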

FRI automatically calculates the reverse complement for each individual input ambiguity code string, so the following will also be taken into account given the input above:

;WRCY;WRC;YYGG;CCGY; -> ;RGYW;GYW;CCRR;RCGG;

So, FRI sees this as: ;[A|G]G[C|T][A|T];G[C|T][A|T];CC[A|G][A|G];[A|G]CGG;

These mapped ambiguity code strings will be used for genes where the TSS is on the reverse (-1) strand.
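
Reverse complementing an ambiguity code string means complementing each IUPAC code and then reversing the order. A minimal sketch of that rule follows. The names `complementCode` and `reverseComplement` are hypothetical, and this is not taken from FRI's source.

```haskell
-- | Complement of a single IUPAC nucleotide code. W, S and N are their own
-- complements; characters outside the IUPAC table are returned unchanged.
complementCode :: Char -> Char
complementCode c = case c of
  'A' -> 'T'
  'T' -> 'A'
  'C' -> 'G'
  'G' -> 'C'
  'R' -> 'Y'
  'Y' -> 'R'
  'K' -> 'M'
  'M' -> 'K'
  'B' -> 'V'
  'V' -> 'B'
  'D' -> 'H'
  'H' -> 'D'
  _   -> c

-- | Reverse complement of an ambiguity code string.
reverseComplement :: String -> String
reverseComplement = reverse . map complementCode
```

For instance, `reverseComplement "WRCY"` evaluates to `"RGYW"` and `reverseComplement "YYGG"` to `"CCRR"`. Applying the function twice returns the original string.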

Please see https://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html for nucleotide ambiguity codes.

Please see examples for actual test Variant Input and Region Input.

  4. Fasta File - This argument is the fasta file to search against.

IMPORTANT: Make sure the chromosome naming convention in the region input matches the one used in the fasta file.

Usage

fri.hs is easy to use.

You can call it using the runghc command provided by the GHC compiler as such: (NOT RECOMMENDED)
$ runghc fri.hs -o "/path/to/output/directory/" all_sequences.fa fasta-region-inspection-region-input_final_final.tsv fasta-region-inspection-variants-input_final.tsv ";WRCY;WRC;YYGG;CCGY;"

For maximum performance, please compile and run the source code as follows: (RECOMMENDED)
$ ghc -O2 -o FRI fri.hs
$ ./FRI -o "/path/to/output/directory/" all_sequences.fa fasta-region-inspection-region-input_final_final.tsv fasta-region-inspection-variants-input_final.tsv ";WRCY;WRC;YYGG;CCGY;"

There is a known issue with FRI where you will get all three expected output files, but both ambiguity_codes.tsv and variants_in_ambiguity_codes.tsv are empty (the program stalls). This is a memory-related issue. The function ambiguityCodesWithinRegionCheck is the culprit, as it performs a Knuth-Morris-Pratt string search on each 2 Kb TSS chunk of the fasta for each mapped ambiguity code string per gene defined in the region input.

To remedy this, you need to provide more memory to FRI.

Arguments

FRI has a few different command line arguments:

Fasta Region Inspector, Copyright (c) 2020 Matthew Mosior.
Usage: fri [-vV?o] [Fasta File] [Region File] [Variant File] [Ambiguity Codes String]
  -v               --verbose                       Output on stderr.
  -V, -?           --version                       Show version number.
  -o OUTDIRECTORY  --outputdirectory=OUTDIRECTORY  The directory path where output files will be printed.
                   --TSSwindowsize=TSSWINSIZE      The size of the window of which to search each region from the TSS.
                   --help                          Print this help message.

The -v option, the verbose option, will provide a full error message.
The -V option, the version option, will show the version of fri in use.
The -o option, the outputdirectory option, is used to specify the directory where output files will be printed.
The --TSSwindowsize option specifies the size of the window, measured from the TSS, in which to search each region (default: 2000 bp).
Finally, the --help option outputs the help message seen above.
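
For readers unfamiliar with System.Console.GetOpt (one of the imported modules listed above), the option table in the help message could be declared roughly as below. This is a sketch reconstructed from the help text, not a copy of fri.hs: the `Flag` type, its constructors, and `parseArgs` are hypothetical names.

```haskell
import System.Console.GetOpt

-- | Hypothetical flag type; one constructor per option in the help message.
data Flag
  = Verbose
  | Version
  | OutputDirectory FilePath
  | TssWindowSize String
  | Help
  deriving (Eq, Show)

-- | Option descriptions mirroring the help message.
options :: [OptDescr Flag]
options =
  [ Option ['v'] ["verbose"] (NoArg Verbose) "Output on stderr."
  , Option ['V', '?'] ["version"] (NoArg Version) "Show version number."
  , Option ['o'] ["outputdirectory"] (ReqArg OutputDirectory "OUTDIRECTORY")
      "The directory path where output files will be printed."
  , Option [] ["TSSwindowsize"] (ReqArg TssWindowSize "TSSWINSIZE")
      "The size of the window of which to search each region from the TSS."
  , Option [] ["help"] (NoArg Help) "Print this help message."
  ]

-- | Separate recognized flags from positional arguments, or report errors
-- followed by the generated usage text.
parseArgs :: [String] -> Either String ([Flag], [String])
parseArgs argv = case getOpt Permute options argv of
  (flags, positional, []) -> Right (flags, positional)
  (_, _, errs)            -> Left (concat errs ++ usageInfo header options)
  where
    header = "Usage: fri [-vV?o] [Fasta File] [Region File] [Variant File] [Ambiguity Codes String]"
```

With `Permute`, flags may appear anywhere on the command line. The four positional arguments come back in the order given, and `usageInfo` generates the option table from the descriptions.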

Docker

A docker container exists that contains all the necessary software to run FRI: matthewmosior/fastaregioninspector:final

Credits

Documentation was added February 2020.
Author : Matthew Mosior
