Minimal encoding of canonical k-mers
gcc -O3 -D_FILE_OFFSET_BITS=64 -pthread -mbmi -o canonical fasta.c canonical.c -lm
./canonical <fasta> <k> <b>
where
- fasta is a (multi-)FASTA file,
- k is the k-mer length (optional, default 5, maximum 31),
- b is the number of buckets (optional, default 4).
Output: distribution of k-mers to buckets
To switch to the standard 2-bit encoding, comment out the call to process_string and uncomment the call to process_string_std, as in the following lines:
// process_string(seq,k,threads,t)
process_string_std(seq,k,threads,t)
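For comparison, the standard 2-bit encoding of a canonical k-mer can be sketched as follows. This is a minimal illustration in Python, not the code in canonical.c: the base-to-code mapping (A=0, C=1, G=2, T=3), the choice of the numerically smaller orientation as the canonical one, and all function names are assumptions made for this example.

```python
# Assumed base-to-code mapping for the standard 2-bit encoding.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_2bit(kmer):
    """Pack a DNA k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def revcomp_2bit(value, k):
    """Reverse-complement a 2-bit encoded k-mer of length k."""
    rc = 0
    for _ in range(k):
        # With this mapping, the complement of code c is 3 - c.
        rc = (rc << 2) | (3 - (value & 3))
        value >>= 2
    return rc

def canonical_2bit(kmer):
    """Return the smaller of the two orientation codes (assumed convention)."""
    fwd = encode_2bit(kmer)
    return min(fwd, revcomp_2bit(fwd, len(kmer)))

print(canonical_2bit("GAT"), canonical_2bit("ATC"))
```

Both orientations of a k-mer yield the same value, but the values of canonical k-mers are scattered over the full range 0 to 4^k - 1, so roughly half of that range stays unused.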
For encoding canonical k-mers on general (non-DNA) alphabets, Python scripts are provided in the corresponding subfolder: minenc.py outputs the encoding of all k-mers for a given alphabet size, and minenc_rc.py encodes k-mers taking reverse complementation into account.
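To illustrate what a gap-free encoding of canonical k-mers looks like, the sketch below assigns consecutive integers to canonical k-mers by brute-force enumeration. It is not the algorithm used by minenc.py, minenc_rc.py or canonical.c, and the codes it assigns need not match theirs. It only shows the target property, here on the DNA alphabet and with the lexicographically smaller orientation assumed as canonical. The function names are hypothetical.

```python
from itertools import product

# Complement mapping for the DNA alphabet (assumption for this example).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = "".join(COMPLEMENT[base] for base in reversed(kmer))
    return min(kmer, rc)

def dense_codes(k):
    """Map every canonical k-mer to a distinct integer in 0..N-1 by enumeration."""
    codes = {}
    for letters in product("ACGT", repeat=k):
        rep = canonical("".join(letters))
        if rep not in codes:
            codes[rep] = len(codes)  # next unused integer
    return codes

codes = dense_codes(3)
print(len(codes), max(codes.values()))
```

Enumeration takes time and memory exponential in k, so it is only practical for small k. A direct encoding avoids building such a table.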
A C++ implementation of this functionality is available in the genesis library, which also offers other useful functionality for working with sequences and k-mers.
Please cite: Wittler R. General encoding of canonical k-mers. Peer Community Journal. 2023;3: e87. https://doi.org/10.24072/pcjournal.323
- fasta.c and fasta.h are borrowed from FragGeneScan-Plus.
- MinEncCanKmer is licensed under the GNU General Public License.