This crate provides a Rust interface to the Edlib C++ library by Martin Šošić. See Martinsos-edlib
The reference paper is:
Martin Šošić, Mile Šikić; Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 2017, btw753. doi: https://doi.org/10.1093/bioinformatics/btw753
The crate offers two interfaces to edlib.
The first, accessed via module bindings, is the raw interface generated by the bindgen crate.
The second, accessed via module edlibrs, provides a more idiomatic Rust interface. It comes at the cost of cloning the data referenced by the pointers startLocations and endLocations of the C struct EdlibAlignResult, which yields a Rust struct EdlibAlignResultRs with Option<Vec<u8>> fields instead of pointers. The cigar string representation is also cloned when computed.
As a consequence, memory management is fully transferred to Rust.
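To illustrate what cloning pointer data into an owned Option<Vec<..>> involves, here is a minimal sketch. It is not the crate's actual code: the helper name `clone_c_array` is hypothetical, and `i32` stands in for the C `int` element type, which is an assumption.

```rust
/// Hypothetical helper (not part of edlib_rs): copies `len` values from a
/// C-owned buffer into a Rust-owned vector.
/// Returns `None` when the pointer is null or the buffer is empty.
///
/// # Safety
/// `ptr` must be null or point to at least `len` readable, initialized values.
unsafe fn clone_c_array(ptr: *const i32, len: usize) -> Option<Vec<i32>> {
    if ptr.is_null() || len == 0 {
        return None;
    }
    // View the C buffer as a slice, then copy it so that Rust owns the data.
    let copy = unsafe { std::slice::from_raw_parts(ptr, len) }.to_vec();
    Some(copy)
}
```

Once the copy exists, the C-side buffer can be released immediately and the Rust value is dropped like any other Vec.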
Structures and functions have the same names as in edlib, with "Rs" appended.
For the edlibrs interface we have, for example:
In normal mode:
use edlib_rs::edlibrs::*;
...
let query = "ACCTCTG";
let target = "ACTCTGAAA";
let align_res = edlibAlignRs(query.as_bytes(), target.as_bytes(), &EdlibAlignConfigRs::default());
assert_eq!(align_res.status, EDLIB_STATUS_OK);
assert_eq!(align_res.editDistance, 4);
In infix mode:
use edlib_rs::edlibrs::*;
...
let query = "ACCTCTG";
let target = "TTTTTTTTTTTTTTTTTTTTTACTCTGAAA";
//
let mut config = EdlibAlignConfigRs::default();
config.mode = EdlibAlignModeRs::EDLIB_MODE_HW;
let align_res = edlibAlignRs(query.as_bytes(), target.as_bytes(), &config);
assert_eq!(align_res.editDistance, 1);
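To make the difference between the alignment modes concrete, the following naive dynamic-programming sketch computes the same distances as the two examples above. It is only a reference illustration, not Edlib's algorithm and not part of the crate: the names `Mode` and `edit_distance` are hypothetical, and the quadratic-time approach is far slower than Edlib.

```rust
/// Alignment modes, named after Edlib's NW (global), SHW (prefix) and HW (infix).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Nw,
    Shw,
    Hw,
}

/// Naive O(query_len * target_len) edit distance, for illustration only.
fn edit_distance(query: &[u8], target: &[u8], mode: Mode) -> usize {
    let n = target.len();
    // Row 0: cost of aligning an empty query against target[..j].
    // In infix mode, skipping the start of the target is free.
    let mut prev: Vec<usize> = match mode {
        Mode::Hw => vec![0; n + 1],
        _ => (0..=n).collect(),
    };
    for (i, &q) in query.iter().enumerate() {
        // Column 0: the first i + 1 query characters aligned against nothing.
        let mut cur = vec![i + 1; n + 1];
        for j in 1..=n {
            let substitute = prev[j - 1] + usize::from(q != target[j - 1]);
            cur[j] = substitute.min(prev[j] + 1).min(cur[j - 1] + 1);
        }
        prev = cur;
    }
    match mode {
        // Global mode: the whole target must be consumed.
        Mode::Nw => prev[n],
        // Prefix and infix modes: the end of the target is free.
        _ => *prev.iter().min().unwrap(),
    }
}
```

With this sketch, `edit_distance(b"ACCTCTG", b"ACTCTGAAA", Mode::Nw)` gives 4 and the infix example gives 1, matching the assertions above.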
The package has the original Edlib library sources embedded in the source tree (see directory edlib-c, corresponding to the sources as of December 2020), minus the original test_data directory to limit the size of the crate. The standard "cargo build" command runs edlib's CMake build.
The crate enables a logger to monitor calls to the C interface. Its level is set by default in Cargo.toml to info for release mode and trace for debug mode, but can be changed by setting the variable RUST_LOG (see env_logger doc).
Some tests in module edlib.rs can serve as basic examples.
In directory examples there is also a small version of the edlib edaligner module (see apps/aligner in the edlib installation directory), which runs on FASTA files containing a single sequence, such as those in the original edlib directory test_data.
As the embedded sources do not contain the original test_data sub-directory, it must be downloaded separately to run the edaligner example module.
Contrary to the edlib version, this module runs all three modes (normal/NW, prefix/SHW and infix/HW) in one pass for a given query and target sequence.
With RUST_LOG=info ./target/release/examples/edaligner --dirdata "$edlibpath/test_data/Enterobacteria_Phage_1" --tf "Enterobacteria_phage_1.fasta" --qf "mutated_90_perc.fasta"
we get the following timings in release mode with Enterobacteria_phage_1.fasta as the target sequence and mutated_90_perc.fasta as the query sequence.
mode | edlibrs time(s) | edlib time(s) | distance |
---|---|---|---|
NW | 0.106 | 0.106 | 9506 |
SHW | 0.184 | 0.191 | 9502 |
HW | 0.682 | 0.695 | 9502 |
We get the following timings in release mode with Enterobacteria_phage_1.fasta as the target sequence and mutated_60_perc.fasta as the query sequence.
mode | edlibrs time(s) | edlib time(s) | distance |
---|---|---|---|
NW | 0.398 | 0.398 | 39829 |
SHW | 0.670 | 0.684 | 39828 |
HW | 1.182 | 1.206 | 39828 |
Apart from small variations in the CPU time measurements, the computation times are the same for both interfaces.
Licensed under either of
- Apache License, Version 2.0, LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0
- MIT license LICENSE-MIT or http://opensource.org/licenses/MIT
at your option.
This software was written on my own while working at CEA, CEA-LIST.