FILE_FORMATS.md
This document describes file formats used in the Taiyaki package.

Fast5 files

The package reads single or multi-read fast5 files using a wrapper around the ont_fast5_api package.
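For illustration, the underlying ont_fast5_api can be used directly to iterate over reads in either kind of file (the file path is a placeholder):

from ont_fast5_api.fast5_interface import get_fast5_file

# Works for both single- and multi-read fast5 files.
with get_fast5_file("reads/batch0.fast5", mode="r") as f5:
    for read in f5.get_reads():
        raw = read.get_raw_data()  # raw DAC values
        print(read.read_id, len(raw))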

Strand lists

Strand lists are tab-separated text files with a column named 'filename' or 'filename_fast5' (not both) and/or a column named 'read_id'. If a strand list file is supplied as an optional argument to a script, then

  1. If no 'read_id' column is present, all files named in the 'filename' or 'filename_fast5' column are read.
  2. If no 'filename' or 'filename_fast5' column is present, all reads with read ids in the 'read_id' column are read from the files in the specified directory.
  3. If both a ('filename' or 'filename_fast5') column and a 'read_id' column are present, the strand list is treated as a list of (filename, read_id) pairs.
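For illustration, a strand list pairing filenames with read ids (case 3 above) might look like this, where the filenames and read ids are placeholders:

filename_fast5	read_id
batch0.fast5	<read_id_1>
batch0.fast5	<read_id_2>
batch1.fast5	<read_id_3>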

Per-read parameter files

The script bin/generate_per_read_parameters.py creates a tsv file with columns ('UUID', 'trim_start', 'trim_end', 'shift', 'scale') which give instructions for handling each read. The shift and scale parameters are chosen so that

y = (current_in_pA - shift)/scale

is standardised: that is, roughly mean(y) = 0 and std(y) = 1 (the script uses more robust statistics to estimate the parameters). The first few lines of such a file look like this:

UUID				trim_start	trim_end	shift			scale
6a8a74ff-5316-41d8-825d-a018af4242bf	200	50	85.43114135742188	15.168887446289057
906f26ce-367a-4d3c-b279-ca86f6db7255	200	50	97.36762817382814	15.927331818603507
90b3c72f-ac34-4337-b33a-2fecd0216b99	200	50	82.36786376953125	15.076369731445316

We expect users to find reasons to generate their own per-read-parameter files or to modify the ones generated by this script.
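The document does not pin down the robust statistics used, so here is a minimal sketch of how such parameters could be derived, assuming median and MAD (the script's exact method may differ):

import numpy as np

def robust_shift_scale(current_pA):
    """Estimate shift and scale so that (current_pA - shift) / scale is standardised."""
    shift = np.median(current_pA)
    # MAD, rescaled so it matches the standard deviation for Gaussian data.
    scale = 1.4826 * np.median(np.abs(current_pA - shift))
    return shift, scale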

Reference files

Files storing a reference sequence for each read are used as an ingredient for remapping.

These are fasta files where the header line for each sequence is the read's UUID:

>6a8a74ff-5316-41d8-825d-a018af4242bf
GTGCTTGTGGGGTATTGCTCAAGAAATTTTTGCCCAGATCAATGTTCTGGAGATTTTACCCAATGT.....
>906f26ce-367a-4d3c-b279-ca86f6db7255
AATCCTGCCTCTAAAGAAAGAAAAAAAAAAATCAGCTAGGTGTAGCCATAGGCAGCTGTAGTCCCA.....
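A minimal sketch of loading such a file into a dict keyed by read id, using plain Python (no particular fasta library is assumed):

def load_references(path):
    """Parse a reference fasta into {read_id: sequence}."""
    refs, read_id, seq = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if read_id is not None:
                    refs[read_id] = "".join(seq)
                read_id, seq = line[1:].split()[0], []
            elif line:
                seq.append(line)
    if read_id is not None:
        refs[read_id] = "".join(seq)
    return refs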

Mapped signal files (v. 8)

Data for training is stored in mapped signal files. The MappedSignalReader and MappedSignalWriter classes in taiyaki/mapped_signal_files.py provide an API for reading and writing these files, as well as methods for checking that a file conforms to the specification.

The files are HDF5 files with the following structure.

HDF5_file/
  ├── attribute: alphabet (str)
  ├── attribute: collapse_alphabet (str)
  ├── attribute: mod_long_names (str)
  ├── attribute: version (integer)
  └── group: Reads/
      ├── group: <read_id_1>
      ├── group: <read_id_2>
      ├── group: <read_id_3>
      .
      .

Each read_id is a UUID, and the data in each read group is:

name            attr/dataset     type   description
shift_frompA    attr             float  shift parameter - see 'Per-read parameter files' above
scale_frompA    attr             float  scale parameter - see 'Per-read parameter files' above
range           attr             float  see equation below
offset          attr             float  see equation below
digitisation    attr             float  see equation below
Dacs            dataset          int16  signal data representing current through the pore (see equation below)
Ref_to_signal   dataset          int32  Ref_to_signal[n] = location in Dacs associated with Reference[n]
Reference       dataset          int16  alphabet[Reference[n]] is the nth base in the reference sequence
mapping_score   attr (optional)  str    score associated with the mapping of reference to signal
mapping_method  attr (optional)  str    short description of the mapping method

The current in pA is calculated from the integers in Dacs by the equation

current = (Dacs + offset) * range / digitisation
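As a sketch, the conversion can be applied directly with h5py (the file name is hypothetical and '<read_id_1>' stands in for an actual read UUID):

import h5py

with h5py.File("mapped_reads.hdf5", "r") as f:
    read = f["Reads/<read_id_1>"]  # '<read_id_1>' is a placeholder UUID
    dacs = read["Dacs"][()]
    # Apply the equation above to recover the current in pA.
    current = (dacs + read.attrs["offset"]) * read.attrs["range"] / read.attrs["digitisation"]
    # Standardise using the per-read parameters described above.
    standardised = (current - read.attrs["shift_frompA"]) / read.attrs["scale_frompA"]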

The batched variant of the HDF5 mapped signal format was introduced in version 5.2. This variant replaces the Reads group with a Batches group. Each group within the Batches group contains the same set of attributes and datasets listed in the table above, but the values for a set of reads are concatenated together into one dataset per batch. For each variable-length dataset, a new [dataset_name]_lengths dataset is added so that the data can be split back into per-read pieces (e.g. with numpy.split). The batched format is readable via the same API in the mapped_signal_files module.
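For illustration, a concatenated batch dataset can be split back into per-read arrays like this (a sketch assuming only the [dataset_name]_lengths convention above; the file name is hypothetical):

import h5py
import numpy as np

with h5py.File("mapped_reads_batched.hdf5", "r") as f:
    for batch_name, batch in f["Batches"].items():
        dacs = batch["Dacs"][()]
        lengths = batch["Dacs_lengths"][()]
        # numpy.split cuts at the given indices, i.e. the cumulative read lengths.
        per_read_dacs = np.split(dacs, np.cumsum(lengths)[:-1])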

Model files

  • Neural network descriptions (with parameters not specified) are needed as an input to training. These are Python files: an example is given in the models directory.
  • It is also possible to use the result of earlier training runs as a starting point: in this case use a .checkpoint file (see below).
  • Trained network files are much larger than the Python files which define the structure of a network. For example, bin/train_flipflop.py saves trained models at each checkpoint and at the end of training in two different formats:
    • .params files store the model parameters in a flat pytorch structure.
    • .checkpoint files can be used to read a network directly into pytorch using torch.load() (see the sketch after this list).
    • The script bin/dump_json.py transforms a .checkpoint file into a json-based format which can be used by Guppy.
    • bin/prepare_mapped_reads.py needs a trained flip-flop network to use for remapping. This is in the .checkpoint format, and an example can be found in the models directory.
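A minimal sketch of loading a checkpoint (the path is a placeholder; unpickling the network object requires taiyaki to be importable):

import torch

# Load the full network object saved during training.
network = torch.load("training/model_final.checkpoint", map_location="cpu")
network.eval()  # switch to inference mode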

Modified base output file

The modified base output file, produced by bin/basecall.py, stores information about the presence of modifications given the basecall. The information is stored in a per-read dataset containing the conditional (log) probability of a modification at each position of the basecall. The calls are ordered according to the names given in the mod_long_names dataset. Impossible calls, where the canonical basecall position and the modification are incompatible, are indicated by nan values.

The files are HDF5 files with the following structure:

HDF5_file/
  ├── dataset: mod_long_names (string)
  └── group: Reads/
      ├── dataset: <read_id_1>
      ├── dataset: <read_id_2>
      ├── dataset: <read_id_3>
      .
      .

Each read_id is a UUID, and each read dataset has size [basecall length] x [number of modified bases]. Each row holds the modified base scores for that position within the read's basecalls. Columns represent scores for the modified bases in the order specified in mod_long_names. Modified base scores are only produced where applicable according to the canonical base associated with each modification (e.g. 5mC calls are only produced at C basecalls); all other values within the read datasets are nan.
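A minimal sketch of pulling out the applicable scores for one modification with h5py (the file name is hypothetical and '<read_id_1>' stands in for an actual read UUID):

import h5py
import numpy as np

with h5py.File("basecalls_mods.hdf5", "r") as f:
    # Entries may be bytes depending on how the strings were written.
    mod_names = [n.decode() if isinstance(n, bytes) else n
                 for n in f["mod_long_names"][()]]
    scores = f["Reads/<read_id_1>"][()]  # shape: [basecall length, number of mods]
    first_mod = scores[:, 0]
    applicable = ~np.isnan(first_mod)  # positions where this call is possible
    print(mod_names[0], first_mod[applicable])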