
AgResearch/kseq_split


kseq_split suite

These are kseq utilities based on Heng Li's kseq parser.

The utilities were previously distributed within tardis.

kseq_split

The kseq_split program splits a fastq or fasta file into chunks (for example, for submission to a compute cluster), optionally subsampling the file. Input files may be compressed or uncompressed. It is designed to work with a client that submits each chunk for processing as soon as it becomes available: data is written under an "interim" filename, which is renamed to the actual target chunk name only once the chunk is complete. A client can therefore poll for chunk files with a guarantee that any it finds are complete.
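The write-then-rename pattern described above can be sketched in Python as follows. This is only an illustration of the idea, not kseq_split's own code: the `.interim` suffix, the chunk filenames, and the function names `write_chunk` and `completed_chunks` are all hypothetical.

```python
import os
import tempfile

# Hypothetical suffix for illustration; not kseq_split's actual interim naming.
INTERIM_SUFFIX = ".interim"


def write_chunk(target_path, records):
    """Write records under an interim name, then rename to the target name."""
    interim_path = target_path + INTERIM_SUFFIX
    with open(interim_path, "w") as handle:
        handle.writelines(records)
    # On POSIX systems the rename is atomic when both paths are on the
    # same filesystem, so the target name never refers to a partial file.
    os.replace(interim_path, target_path)


def completed_chunks(directory, extension=".fastq"):
    """Return chunk files that are safe to process (interim files are skipped)."""
    return sorted(
        name for name in os.listdir(directory) if name.endswith(extension)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        write_chunk(
            os.path.join(workdir, "chunk_0001.fastq"),
            ["@read1\n", "ACGT\n", "+\n", "IIII\n"],
        )
        print(completed_chunks(workdir))
```

A polling client would call something like `completed_chunks` in a loop and submit each newly seen name. Because interim files carry a different suffix, the poll never picks up a chunk that is still being written.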

kseq_count

The kseq_count program counts the number of logical records in a fastq or fasta file and prints the count to stdout. The input file may be compressed or uncompressed. It has an optional "approximate mode" (-a), which estimates the number of records by reading a small preview of n records, writing that preview out, and then estimating N = empirical_adjustment_function(n * original file size / preview file size). If the original is compressed, then so is the preview. The empirical adjustment is determined by fitting a model to a test dataset, with Y = actual / raw_approximation modelled in terms of X1 = file size in bytes and X2 = compression type.
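The arithmetic behind the approximate mode can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's implementation: the function names are hypothetical, and the fitted empirical adjustment is not reproduced here, so the `adjust` argument defaults to a placeholder that returns the raw approximation unchanged.

```python
def raw_record_estimate(preview_records, original_size_bytes, preview_size_bytes):
    """Scale the preview record count by the ratio of file sizes."""
    if preview_size_bytes <= 0:
        raise ValueError("preview file size must be positive")
    return preview_records * original_size_bytes / preview_size_bytes


def estimate_records(preview_records, original_size_bytes, preview_size_bytes,
                     adjust=lambda raw: raw):
    """Apply an adjustment function to the raw approximation.

    `adjust` stands in for the fitted empirical adjustment; the default
    identity function is a placeholder assumption, not the real model.
    """
    raw = raw_record_estimate(
        preview_records, original_size_bytes, preview_size_bytes
    )
    return round(adjust(raw))


# A 1000-record preview occupying 250,000 bytes, from a 50,000,000-byte file.
print(estimate_records(1000, 50_000_000, 250_000))  # prints 200000
```

With these made-up numbers the original file is 200 times the size of the preview, so the raw approximation is 200 times the preview record count.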

Bugs and limitations associated with the -a option:

  1. Accuracy on compressed data is generally somewhat poor, as compressed size is probably nonlinear with respect to file size.
  2. Accuracy will be poor if the sequence lengths in the preview are unrepresentative of the whole file.
  3. It will not recognise .zz compression.
  4. For all compression types other than gzip, the -a approximation may be poor, because the compressed preview is always gzip.
  5. It has not been extensively tested on formats other than gzip and uncompressed.
  6. There are big-endian/little-endian variations on the compression magic bytes that are not yet supported. (One of these, for gzip, is supported.)
    • If compression is not detected, the -a estimate will be "way out".
  7. It has not been tested on variant fastq and fasta formats.
  8. The empirical adjustment is based on a fairly small test dataset and could be improved with more data and a better model.

kseq_test

Undocumented test program.

Installation

The suite is packaged as an installable Nix flake. See the example flake for how to consume it.

