Conway-Bromage-Lyndon
A Rust library providing fully dynamic sets of k-mers with high locality.
The data structure is described in Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets, please cite it if you use this library.
It supports the following operations:
- inserting a single k-mer (with
insert), or every k-mer from a sequence (withinsert_seq) - deleting a single k-mer (with
remove), or every k-mer from a sequence (withremove_seq) - membership of a single k-mer (with
contains), or every k-mer from a sequence (withcontains_seq) - iterating over the k-mers stored in the set (with
iter) - union / intersection / difference of two sets (with
|/&/-) - (de)serialization with serde
Requirements
Rust nightly 1.77+
If you have not installed Rust yet, please visit rustup.rs to install it.
This library uses some nightly features of the Rust compiler (version 1.77+), you can install the latest nightly version with
rustup install nightlyIf you don't want to use the +nightly flag every time you run cargo, you can set it as default with
rustup default nightlyAdditional headers for Linux
This library uses C++ bindings for the sux library and tiered vectors.
Depending on your configuration, some headers used for the bindings might be missing, in that case please install the following packages:
Ubuntu
sudo apt install -y libstdc++-12-dev libclang-devFedora
sudo dnf install -y clang15-develUsing the library
You can add CBL in an existing Rust project with
cargo +nightly add --git https://github.com/imartayan/CBL.gitor by adding the following dependency in your Cargo.toml
cbl = { git = "https://github.com/imartayan/CBL.git" }If the build fails, try to install additional headers.
Choosing the right parameters
The CBL struct takes two main parameters as constants:
- an integer
Kspecifying the size of the k-mers - an integer type
T(e.g.u32,u64,u128) that must be large enough to store both a k-mer and its number of bits together
Therefore T should be large enough to store
In particular, since primitive integers cannot store more than 128 bits, this means that K must be ≤ 59.
Additionally, you can specify a third (optional) parameter PREFIX_BITS which determines the size of the underlying bitvector.
Changing this parameter affects the space usage and the query time of the data structure, see the paper for more details.
Example usage
use cbl::CBL;
use needletail::parse_fastx_file;
use std::env::args;
// define the parameters K and T
const K: usize = 25;
type T = u64; // T must be large enough to store $2k + \lg(2k)$ bits
fn main() {
let args: Vec<String> = args().collect();
let input_filename = args.get(1).expect("No argument given");
// create a CBL index with parameters K and T
let mut cbl = CBL::<K, T>::new();
let mut reader = parse_fastx_file(input_filename).unwrap();
// for each sequence of the FASTA/Q file
while let Some(record) = reader.next() {
let seqrec = record.expect("Invalid record");
// insert each k-mer of the sequence in the index
cbl.insert_seq(&seqrec.seq());
}
}Building from source
You can clone the repository and its submodules with
git clone --recursive https://github.com/imartayan/CBL.gitIf you did not use the --recursive flag, make sure to load the submodules with
git submodule update --init --recursiveRunning the binaries
You can compile the binaries with
cargo +nightly build --release --examplesIf the build fails, try to install additional headers.
By default, the binaries are compiled with a fixed K equal to 25, you can compile them with a different K as follows
K=59 cargo +nightly build --release --examplesNote that K values ≥ 60 are not supported by this library.
Similarly, PREFIX_BITS is equal to 24 by default and you can change it with
K=59 PREFIX_BITS=28 cargo +nightly build --release --examplesNote that PREFIX_BITS values ≥ 29 are not supported by this library.
Once compiled, the main binary will be located at target/release/examples/cbl.
It supports the following commands:
Usage: cbl <COMMAND>
Commands:
build Build an index containing the k-mers of a FASTA/Q file
count Count the k-mers contained in an index
list List the k-mers contained in an index
query Query an index for every k-mer contained in a FASTA/Q file
insert Add the k-mers of a FASTA/Q file to an index
remove Remove the k-mers of a FASTA/Q file from an index
merge Compute the union of two indexes
inter Compute the intersection of two indexes
diff Compute the difference of two indexes
sym-diff Compute the symmetric difference of two indexes
repartition Show the repartition of the k-mers in the data structure
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print versionRunning the tests
You can run all the tests with
cargo +nightly test --libBuilding the documentation
You can build the documentation of the library and open it in your browser with
cargo +nightly doc --lib --no-deps --openCitation
Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Martayan, I., Cazaux, B., Limasset, A., and Marchet, C. https://doi.org/10.1093/bioinformatics/btae217
@article{cbl,
title = {{Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets}},
author = {Martayan, Igor and Cazaux, Bastien and Limasset, Antoine and Marchet, Camille},
journal = {Bioinformatics},
volume = {40},
number = {Supplement_1},
pages = {i48-i57},
year = {2024},
month = {06},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae217},
url = {https://doi.org/10.1093/bioinformatics/btae217},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/Supplement\_1/i48/58354678/btae217.pdf}
}