Wednesday, July 2
🕐 13:30 – 14:30 (JST)
Giulio Ermanno PIBIRI
Ca’ Foscari University
🔗 GitHub Profile
Succinct Information Processing Team in RIKEN-AIP
Efficiently storing and querying large collections of kmers while preserving their original order in sequences is a critical challenge in bioinformatics. Some approaches often optimize for membership queries but neglect the importance of maintaining the sequential arrangement of kmers, which is essential for applications like read mapping and genome assembly.
We will introduce an order-preserving kmer dictionary built upon three fundamental building blocks:
Spectrum-preserving string sets, which enable compact representation of kmer
sets while maintaining the original spectrum.
(Random) minimizers, a class of randomized algorithms that sample representative
kmers, allowing for space-efficient indexing and localized searches.
(Compressed) Minimal perfect hashing, providing collision-free and constant-time lookups in highly compressed space.
By integrating these components, our data structure efficiently encodes the order of kmers and allows fast exact membership, as well as streaming and navigational queries. This data structure is also at the basis for the indexing of colored de Bruijn graph that we will cover on July 4th.
Public events of RIKEN Center for Advanced Intelligence Project (AIP)
Join community