These are kseq utilities based on Heng Li's kseq parser.
The utilities were previously distributed within tardis.
The kseq_split
program splits a fastq or fasta file into chunks (for example for
submission to a compute cluster), optionally subsampling the file. Input
files may be compressed or uncompressed. It is designed to work with a
client that submits each chunk for processing as soon as it becomes available:
each chunk is written under an "interim" filename and is renamed to its actual target
chunk name only when complete. A client can therefore poll for chunk files,
with a guarantee that any it finds are complete.
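The interim-name-then-rename pattern described above can be sketched in Python as follows. This is a minimal illustration of the general technique, not the kseq_split implementation: the ".tmp" interim suffix, the ".fastq" chunk suffix and both function names are assumptions made for the example.

```python
import os


def write_chunk_atomically(target_path, records):
    """Write records under an interim name, then rename to target_path."""
    # Hypothetical interim naming; kseq_split's actual interim name may differ.
    interim_path = target_path + ".tmp"
    with open(interim_path, "w") as handle:
        for record in records:
            handle.write(record)
    # os.replace is atomic when both paths are on the same filesystem,
    # so the target name only ever refers to a fully written file.
    os.replace(interim_path, target_path)


def completed_chunks(directory, suffix=".fastq"):
    """List chunk files that are safe to submit; interim files are skipped."""
    return sorted(
        name for name in os.listdir(directory) if name.endswith(suffix)
    )
```

A polling client would call completed_chunks repeatedly and submit any name it has not seen before. Because an interim file never carries the final suffix, the client cannot pick up a partially written chunk.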
The kseq_count
program counts the number of logical records in a fastq or fasta file and prints the count to stdout.
The input file may be compressed or uncompressed.
It has an optional "approximate mode" (-a), which estimates the number of records by
reading and writing a small preview of n records, and then estimating N = empirical_adjustment_function(n * original file size / preview file size)
(if the original is compressed then so is the preview). The empirical adjustment was determined by fitting a model of
Y = actual/raw_approximation to a test dataset, in terms of X1 = file size in bytes and X2 = compression type.
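The arithmetic behind the approximate mode can be sketched as below. This is only an illustration of the formula stated above: the function names are invented for the example, and the default adjustment is the identity function, a placeholder for the fitted model, which is not reproduced here.

```python
def raw_estimate(preview_records, original_size, preview_size):
    """Return n * original file size / preview file size (no adjustment)."""
    if preview_size <= 0:
        raise ValueError("preview file size must be positive")
    return preview_records * original_size / preview_size


def estimate_record_count(preview_records, original_size, preview_size,
                          adjust=lambda raw: raw):
    """Apply an adjustment function to the raw estimate and round it.

    `adjust` is a hypothetical stand-in for the empirical adjustment;
    the identity default means no correction is applied.
    """
    raw = raw_estimate(preview_records, original_size, preview_size)
    return round(adjust(raw))
```

For example, a preview of 1000 records occupying 250,000 bytes, taken from a 50,000,000 byte file, gives a raw estimate of 1000 * 50,000,000 / 250,000 = 200,000 records before any adjustment.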
- accuracy on compressed data is in general somewhat poor, as compressed size is probably nonlinear with respect to file size
- accuracy will be poor if the preview seq lengths are unrepresentative
- will not recognise .zz compression
- for all compression types other than gzip, the -a option approximation may be poor (because the compressed preview is always gzip)
- not extensively tested on formats other than gzip, and uncompressed
- there are big-endian/little-endian variations on the compression magic bytes that are not yet supported (one of these, for gzip, is supported)
- if compression is not detected, the -a estimate will be far off
- not tested on variant fastq and fasta
- The empirical adjustment is based on a fairly small test dataset and could be improved with more data and a better model
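Several of the points above depend on recognising the compression type from a file's leading "magic" bytes. The sketch below shows that general technique using the standard gzip, bzip2 and xz signatures. It is an illustration only: the function name is invented, it does not reproduce the detection logic in kseq_count, and it makes no attempt to handle the byte-order variations mentioned above.

```python
# Standard leading signatures for three common compression formats.
MAGIC_SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
]


def detect_compression(path):
    """Return the compression name for path, or None if no signature matches."""
    with open(path, "rb") as handle:
        head = handle.read(6)  # the longest signature above is 6 bytes
    for magic, name in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return name
    return None
```

A file whose signature is not in the table is reported as uncompressed (None), which is the situation in which a size-based estimate would be badly off.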
Undocumented test program.
The suite is packaged as an installable Nix flake. See the example flake for how to consume it.