FILE_FORMATS.md
This document describes file formats used in the Taiyaki package.

Fast5 files

The package reads single- or multi-read fast5 files using a wrapper around the ont_fast5_api package.
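As a minimal sketch of reading raw signal through ont_fast5_api directly (the filename below is a placeholder; get_fast5_file accepts both single- and multi-read files):

from ont_fast5_api.fast5_interface import get_fast5_file

with get_fast5_file("reads.fast5", mode="r") as f5:
    for read in f5.get_reads():
        raw = read.get_raw_data()  # raw DAC values as a numpy array
        print(read.read_id, len(raw))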

Strand lists

Strand lists should be tab-separated text files with a 'filename' or 'filename_fast5' column (not both) and/or a 'read_id' column. If a strand list file is supplied as an optional argument to a script, then

  1. If no 'read_id' column is present, all files named in the 'filename' or 'filename_fast5' column are read.
  2. If no 'filename' or 'filename_fast5' column is present, all reads whose read_ids appear in the 'read_id' column are read from the files in the specified directory.
  3. If both a ('filename' or 'filename_fast5') column and a 'read_id' column are present, the strand list is treated as a list of (filename, read_id) pairs.
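For example, a strand list of the third kind (the filename below is invented; the read_ids are the ones used elsewhere in this document) might look like:

filename_fast5	read_id
batch_0.fast5	6a8a74ff-5316-41d8-825d-a018af4242bf
batch_0.fast5	906f26ce-367a-4d3c-b279-ca86f6db7255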

Per-read parameter files

The script bin/generate_per_read_parameters.py creates a tsv file with columns ('UUID', 'trim_start', 'trim_end', 'shift', 'scale') which specify how each read should be handled. The shift and scale parameters are chosen so that

y = (current_in_pA - shift)/scale

is standardised: that is, so that roughly mean(y)=0 and std(y)=1 (the script actually uses more robust statistics to generate the parameters).

UUID				trim_start	trim_end	shift			scale
6a8a74ff-5316-41d8-825d-a018af4242bf	200	50	85.43114135742188	15.168887446289057
906f26ce-367a-4d3c-b279-ca86f6db7255	200	50	97.36762817382814	15.927331818603507
90b3c72f-ac34-4337-b33a-2fecd0216b99	200	50	82.36786376953125	15.076369731445316
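The precise statistics are defined by the script itself; as a rough sketch of the idea (the function below is hypothetical, using the median and the median absolute deviation as robust replacements for the mean and standard deviation):

import numpy as np

def robust_shift_scale(current_pA):
    # Shift is a robust centre estimate (median rather than mean).
    shift = np.median(current_pA)
    # The MAD, rescaled by 1.4826, estimates the standard deviation
    # for Gaussian data but is insensitive to outliers.
    scale = 1.4826 * np.median(np.abs(current_pA - shift))
    return shift, scale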

We expect users to find reasons to generate their own per-read-parameter files or to modify the ones generated by this script.

Reference files

Reference files store the reference sequence for each read and are used as an input to remapping.

These are fasta files in which the header line for each sequence is the read's UUID:

>6a8a74ff-5316-41d8-825d-a018af4242bf
GTGCTTGTGGGGTATTGCTCAAGAAATTTTTGCCCAGATCAATGTTCTGGAGATTTTACCCAATGT.....
>906f26ce-367a-4d3c-b279-ca86f6db7255
AATCCTGCCTCTAAAGAAAGAAAAAAAAAAATCAGCTAGGTGTAGCCATAGGCAGCTGTAGTCCCA.....

Mapped signal files (v. 8)

Data for training is stored in mapped signal files. The MappedSignalReader and MappedSignalWriter classes in taiyaki/mapped_signal_files.py provide an API for reading and writing these files, as well as methods for checking that a file conforms to the specification.

The files are HDF5 files with the following structure.

HDF5_file/
  ├── attribute: alphabet (str)
  ├── attribute: collapse_alphabet (str)
  ├── attribute: mod_long_names (str)
  ├── attribute: version (integer)
  └── group: Reads/
      ├── group: <read_id_1>
      ├── group: <read_id_2>
      ├── group: <read_id_3>
      .
      .

Each read_id is a UUID, and the data in each read group is:

| name | attribute/dataset | type | description |
|------|-------------------|------|-------------|
| shift_frompA | attr | float | shift parameter - see 'Per-read parameter files' above |
| scale_frompA | attr | float | scale parameter - see 'Per-read parameter files' above |
| range | attr | float | see equation below |
| offset | attr | float | see equation below |
| digitisation | attr | float | see equation below |
| Dacs | dataset | int16 | signal data representing current through the pore (see equation below) |
| Ref_to_signal | dataset | int32 | Ref_to_signal[n] = location in Dacs associated with Reference[n] |
| Reference | dataset | int16 | alphabet[Reference[n]] is the nth base in the reference sequence |
| mapping_score | attr (optional) | str | score associated with the mapping of reference to signal |
| mapping_method | attr (optional) | str | short description of the mapping method |

The current in pA is calculated from the integers in Dacs by the equation

current = (Dacs + offset) * range / digitisation
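For example, the standardised signal for one read can be recovered with h5py alone (the filename and read_id below are placeholders):

import h5py

with h5py.File("mapped_reads.hdf5", "r") as f:
    read = f["Reads/6a8a74ff-5316-41d8-825d-a018af4242bf"]
    dacs = read["Dacs"][()]
    attrs = read.attrs
    # Convert integer DAC values to current in pA using the equation above.
    current_pA = (dacs + attrs["offset"]) * attrs["range"] / attrs["digitisation"]
    # Standardise using the per-read shift and scale parameters.
    standardised = (current_pA - attrs["shift_frompA"]) / attrs["scale_frompA"]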

The batched variant of the HDF5 mapped signal format was introduced in version 5.2. This variant replaces the Reads group with a Batches group. Each group within the Batches group contains the same set of attributes and datasets listed in the table above, but the values for a set of reads are concatenated together into one dataset per batch. For each variable-length dataset, a new [dataset_name]_lengths dataset is added so that the concatenated data can be split back into per-read pieces (e.g. with numpy.split). The batched format is readable via the same API within the mapped_signal_files module.
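A sketch of recovering per-read data from a batch (the filename is a placeholder, and the batch group is simply taken to be the first one in the file):

import h5py
import numpy as np

with h5py.File("mapped_reads_batched.hdf5", "r") as f:
    batch = next(iter(f["Batches"].values()))  # first batch group
    dacs = batch["Dacs"][()]
    lengths = batch["Dacs_lengths"][()]
    # numpy.split expects the cut points, i.e. the cumulative lengths
    # of all reads except the last.
    per_read_dacs = np.split(dacs, np.cumsum(lengths)[:-1])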

Model files

  • Neural network descriptions (with parameters left unspecified) are needed as an input to training. These are Python files: an example is given in the models directory.
  • It is also possible to use the result of an earlier training run as a starting point: in this case use a .checkpoint file (see below).
  • Trained network files are much larger than the Python files which define the structure of a network. For example, bin/train_flipflop.py saves trained models at each checkpoint and at the end of training in two different formats:
    • .params files store the model parameters in a flat pytorch structure.
    • .checkpoint files can be read directly into pytorch using torch.load() (see the sketch after this list).
    • The script bin/dump_json.py transforms a .checkpoint file into a json-based format which can be used by Guppy.
    • bin/prepare_mapped_reads.py needs a trained flip-flop network to use for remapping. This is in the .checkpoint format, and an example can be found in the models directory.
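As a minimal sketch of the .checkpoint route (the path is a placeholder):

import torch

# torch.load unpickles the saved network object directly.
model = torch.load("training/model_checkpoint_00010.checkpoint")
model.eval()  # set to evaluation mode before running inference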

Modified base output file

The modified base output file, produced by bin/basecall.py, stores information about the presence of modified bases given the basecall. The information is stored in a per-read dataset containing the conditional (log) probability of a modification at each position of the basecall. The calls are ordered according to the names given in the mod_long_names dataset. Impossible calls, where the canonical basecall position and the modification are incompatible, are indicated by nan values.

The files are HDF5 files with the following structure:

HDF5_file/
  ├── dataset: mod_long_names (string)
  └── group: Reads/
      ├── dataset: <read_id_1>
      ├── dataset: <read_id_2>
      ├── dataset: <read_id_3>
      .
      .

Each read_id is a UUID, and each read dataset has size [basecall length] x [number of modified bases]. Each row holds the modified base scores for the corresponding position in that read's basecalls. Columns represent scores for the modified bases in the order specified in mod_long_names. Modified base scores are only produced where applicable according to the canonical base associated with each modification (e.g. 5mC calls are only produced at C basecalls); all other values within the read datasets are nan.
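A minimal sketch of reading these scores with h5py (the filename and read_id are placeholders):

import h5py

with h5py.File("basecalls_mods.hdf5", "r") as f:
    # Modification names, in column order; decode in case h5py
    # returns bytes rather than str.
    mod_names = [n.decode() if isinstance(n, bytes) else n
                 for n in f["mod_long_names"][()]]
    scores = f["Reads/6a8a74ff-5316-41d8-825d-a018af4242bf"][()]
    # scores has shape (basecall length, number of modified bases);
    # entries are conditional log probabilities, nan where not applicable.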