| Title: | 'R' Bindings to the 'sassy' Approximate String Matcher |
|---|---|
| Description: | Fast approximate string matching for short patterns in longer texts using the 'sassy' Rust crate. 'sassy' implements SIMD-accelerated fuzzy search over ASCII, DNA, and IUPAC alphabets, with support for reverse-complement search, overhang alignments, CIGAR strings, and batched searches. See Beeloo and Groot Koerkamp (2025) <doi:10.1101/2025.07.22.666207> and Beeloo and Groot Koerkamp (2026) <doi:10.64898/2026.03.10.710811>. |
| Authors: | Sounkou Mahamane Toure [aut, cre], Ragnar Groot Koerkamp [cph], Rick Beeloo [cph] |
| Maintainer: | Sounkou Mahamane Toure <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.2.1-0.1.0.9000 |
| Built: | 2026-05-31 07:00:26 UTC |
| Source: | https://github.com/sounkou-bioinfo/Rsassy |
Print Rsassy feature information
## S3 method for class 'sassy_features'
print(x, ...)
x |
A |
... |
Ignored; accepted for compatibility with |
x, invisibly.
Print sassy match data frames
## S3 method for class 'sassy_matches'
print(x, ..., color = getOption("Rsassy.coloring", FALSE))
x |
A |
... |
Ignored; accepted for compatibility with |
color |
If |
x, invisibly.
Rsassy normally follows the upstream sassy TSV convention: reverse-strand
match_region values are reverse-complemented and CIGAR strings are oriented
in the input pattern direction. sassy_as_sam() converts reverse-strand rows
to the text direction used by SAM and by upstream sassy --sam output.
sassy_as_sam(x, alphabet = "dna")
x |
A |
alphabet |
Alphabet profile used for the search. One of |
A copy of x with reverse-strand cigar values reversed and, when
present, reverse-strand match_region values reverse-complemented back to
text direction.
sassy_as_sam(
  sassy_search(list("ACGA"), list("TTTCGTTT"), 0, alphabet = "dna", match_region = TRUE),
  alphabet = "dna"
)
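The two string operations named above (reverse-complementing a match region and reversing a CIGAR string) can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helpers `revcomp()` and `reverse_cigar()` are hypothetical names, the sketch handles plain A/C/G/T only, and it assumes CIGAR operations are written as a run length followed by a single operation character.

```r
# Hypothetical helper: reverse-complement a plain A/C/G/T string.
revcomp <- function(s) {
  comp <- chartr("ACGTacgt", "TGCAtgca", s)
  paste(rev(strsplit(comp, "", fixed = TRUE)[[1]]), collapse = "")
}

# Hypothetical helper: reverse the order of CIGAR operations while
# keeping each run length attached to its operation character.
reverse_cigar <- function(cigar) {
  ops <- regmatches(cigar, gregexpr("[0-9]+[A-Z=]", cigar))[[1]]
  paste(rev(ops), collapse = "")
}

revcomp("ACGA")            # "TCGT"
reverse_cigar("3=1X2=1I")  # "1I2=1X3="
```

The pattern `ACGA` from the example above reverse-complements to `TCGT`, which is the substring present in `TTTCGTTT`.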
sassy_crispr() is an R-level equivalent of the upstream sassy crispr
workflow for in-memory sequences. Guides include the PAM at the end. By
default, the PAM must match exactly under IUPAC matching, while the rest of
the guide may have up to k edits.
sassy_crispr(
  guide,
  text,
  k,
  pam_length = 3L,
  allow_pam_edits = FALSE,
  max_n_frac = 0.2,
  rc = TRUE,
  threads = 1L,
  pattern_id = NULL,
  text_id = NULL
)
guide |
List of guide sequences including the PAM suffix. Each element must be a raw vector or non-missing character scalar. |
text |
List of text sequences to search. Each element must be a raw vector or non-missing character scalar. |
k |
Maximum edit distance for the searched guide sequence. With
|
pam_length |
Length of the PAM suffix. |
allow_pam_edits |
If |
max_n_frac |
Maximum allowed fraction of |
rc |
If |
threads |
Number of worker threads to request. |
pattern_id |
Optional guide/pattern identifiers. If supplied, must be a
character vector with one entry per guide and adds/replaces a |
text_id |
Optional text identifiers. If supplied, must be a character
vector with one entry per text and adds/replaces a |
A data frame with CLI-style columns: guide, cost, strand,
start, end, match_region, and cigar. If pattern_id or text_id
are supplied, mapped identifier columns are included.
sassy_crispr(list("ACGTNGG"), list("TTTACGTAGGTTT"), k = 0, rc = FALSE, text_id = "chr1")
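The default PAM rule (exact match under IUPAC matching) can be illustrated with a small base R sketch. It is not how the package performs the check: the lookup table `iupac` and the helper `iupac_match()` are hypothetical, and the sketch assumes the text side contains only concrete A/C/G/T bases.

```r
# IUPAC nucleotide codes mapped to the concrete bases they stand for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "AG", Y = "CT", S = "CG", W = "AT", K = "GT", M = "AC",
           B = "CGT", D = "AGT", H = "ACT", V = "ACG", N = "ACGT")

# Hypothetical helper: TRUE when every text base is allowed by the
# IUPAC code at the same position of the pattern.
iupac_match <- function(pattern, text) {
  p <- strsplit(pattern, "", fixed = TRUE)[[1]]
  t <- strsplit(text, "", fixed = TRUE)[[1]]
  if (length(p) != length(t)) return(FALSE)
  all(vapply(seq_along(p),
             function(i) grepl(t[i], iupac[[p[i]]], fixed = TRUE),
             logical(1)))
}

guide <- "ACGTNGG"
pam_length <- 3L
pam <- substr(guide, nchar(guide) - pam_length + 1L, nchar(guide))  # "NGG"

iupac_match(pam, "AGG")  # TRUE: N accepts A
iupac_match(pam, "AGA")  # FALSE: the last G does not accept A
```

In the example above, the site `ACGTAGG` inside `TTTACGTAGGTTT` ends in `AGG`, which satisfies the `NGG` PAM.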
sassy_fastx_iter() opens a FASTA or FASTQ file and returns an iterator that
yields record-count-bounded batches. Parsing is performed by the vendored
Rust needletail parser. Sequence and quality data in each batch are exposed
as read-only raw ALTREP slices over immutable native batch buffers; they are
not eagerly materialized as R strings.
sassy_fastx_iter(path, batch_records = 100000L, include_qual = TRUE)
path |
Path to a FASTA/FASTQ file. Gzip-compressed input is supported by
the vendored |
batch_records |
Maximum number of records returned by each
|
include_qual |
If |
An external pointer with class sassy_fastx_iter.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "!!!!"), fq, useBytes = TRUE)
it <- sassy_fastx_iter(fq, batch_records = 1)
batch <- sassy_fastx_next(it)
rawToChar(batch$seq[[1]])
Get the next FASTA/FASTQ batch
sassy_fastx_next(iter)
iter |
An iterator created by |
NULL at end of file, otherwise a sassy_fastx_batch list with
id, seq, and qual elements. id is an ALTREP character vector, while
seq and qual are ALTREP lists whose elements are raw ALTREP vectors.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "!!!!"), fq, useBytes = TRUE)
it <- sassy_fastx_iter(fq, batch_records = 1)
batch <- sassy_fastx_next(it)
length(batch$id)
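Because `sassy_fastx_next()` returns NULL at end of file, a whole file can be consumed with a simple loop. The sketch below counts records across batches. It assumes the package is installed under the name `Rsassy` and skips the loop otherwise; the three-record file and the batch size of 2 are arbitrary choices for illustration.

```r
# Write a small three-record FASTQ file.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "!!!!",
             "@r2", "GGCC", "+", "!!!!",
             "@r3", "TTAA", "+", "!!!!"), fq, useBytes = TRUE)

total <- 0L
# Assumption: the package name is 'Rsassy'; the loop is skipped if absent.
if (requireNamespace("Rsassy", quietly = TRUE)) {
  it <- Rsassy::sassy_fastx_iter(fq, batch_records = 2L)
  # Pull batches until the iterator signals end of file with NULL.
  while (!is.null(batch <- Rsassy::sassy_fastx_next(it))) {
    total <- total + length(batch$id)
  }
}
total
```

Keeping `batch_records` small bounds how many records are held per batch, at the cost of more calls to `sassy_fastx_next()`.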
Returns diagnostic information about the selected Rsassy backend. Calling
this initializes the native backend if it has not already been loaded.
rsassy_selected_backend reports the runtime-selected backend.
rsassy_installed_backends is a character vector of backend libraries found
in the package installation, and rsassy_supported_backends is the subset
supported by the current CPU/runtime. With "auto" selection, Rsassy chooses
the best supported installed backend: AVX-512 before AVX2 on x86_64, NEON on
arm64, WebAssembly SIMD128 on wasm, and scalar otherwise. The selected_*
fields describe the loaded Rust backend. The cpu_* fields are detected by
the C shim.
sassy_features()
A sassy_features list of build, selected-backend, and
CPU/runtime feature values.
sassy_features()
Convenience wrapper that creates a searcher, searches, and returns a
sassy_matches data frame. Coordinates are 0-based and half-open.
sassy_search(
  pattern,
  text,
  k,
  alphabet = "dna",
  rc = TRUE,
  alpha = NULL,
  all = FALSE,
  threads = 1L,
  strategy = "pairwise",
  pattern_id = NULL,
  text_id = NULL,
  match_region = FALSE,
  sam = FALSE
)
pattern |
List of raw vectors or non-missing character scalars. |
text |
List of raw vectors or non-missing character scalars. |
k |
Maximum edit distance. |
alphabet |
Alphabet profile. One of |
rc |
If |
alpha |
Optional IUPAC overhang cost in |
all |
If |
threads |
Number of worker threads to request for bulk searches. |
strategy |
Search strategy. |
pattern_id |
Optional pattern identifiers. If supplied, must be a
non-missing character vector with one entry per pattern and adds/replaces a
|
text_id |
Optional text identifiers. If supplied, must be a non-missing
character vector with one entry per text and adds/replaces a |
match_region |
If |
sam |
If |
A data frame with 0-based indices and coordinates: pattern_idx, text_idx, text_start, text_end, pattern_start, pattern_end, cost, strand, and cigar. If pattern_id or text_id are supplied, mapped identifier columns are included. If requested, also includes match_region. Rows are ordered by input text, then text start/end coordinate, then pattern index.
sassy_search(list("ACGT"), list("TTACGTAA"), 0, alphabet = "dna", rc = FALSE)
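Since coordinates are 0-based and half-open while base R string functions are 1-based and inclusive, extracting a matched substring needs a small shift. The sketch below uses a hand-written `hit` row for illustration, not actual package output; the coordinates are simply where an exact `ACGT` occurrence sits in `TTACGTAA` under that convention.

```r
text <- "TTACGTAA"

# Hand-written row for illustration (0-based, half-open interval [2, 6)).
hit <- data.frame(text_start = 2L, text_end = 6L)

# 0-based half-open [start, end) maps to 1-based inclusive [start + 1, end].
matched <- substr(text, hit$text_start + 1L, hit$text_end)
matched                         # "ACGT"

# The interval length is end - start, with no off-by-one adjustment.
hit$text_end - hit$text_start   # 4
```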
A searcher stores the selected alphabet profile and reverse-complement behavior. Reuse a searcher when searching many patterns or texts with the same settings.
sassy_searcher(alphabet = "dna", rc = TRUE, alpha = NULL)
alphabet |
Alphabet profile. One of |
rc |
If |
alpha |
Optional IUPAC overhang cost in |
An external pointer with class sassy_searcher.
searcher <- sassy_searcher("dna", rc = FALSE)
sassy_searcher_search(searcher, list("ACGT"), list("TTACGTAA"), 0)
pattern and text must be lists of sequences. Each element must be a raw
vector or a non-missing character scalar. Every pattern is searched against
every text and the returned pattern_idx and text_idx columns identify the
0-based input indices. Use threads > 1 for larger batches.
sassy_searcher_search(
  searcher,
  pattern,
  text,
  k,
  all = FALSE,
  threads = 1L,
  strategy = "pairwise",
  pattern_id = NULL,
  text_id = NULL,
  match_region = FALSE,
  sam = FALSE
)
searcher |
A searcher created by |
pattern |
List of raw vectors or non-missing character scalars. |
text |
List of raw vectors or non-missing character scalars. |
k |
Maximum edit distance. |
all |
If |
threads |
Number of worker threads to request for bulk searches. |
strategy |
Search strategy. |
pattern_id |
Optional pattern identifiers. If supplied, must be a
non-missing character vector with one entry per pattern and adds/replaces a
|
text_id |
Optional text identifiers. If supplied, must be a non-missing
character vector with one entry per text and adds/replaces a |
match_region |
If |
sam |
If |
A data frame with 0-based indices and coordinates: pattern_idx, text_idx, text_start, text_end, pattern_start, pattern_end, cost, strand, and cigar. If pattern_id or text_id are supplied, mapped identifier columns are included. If requested, also includes match_region. Rows are ordered by input text, then text start/end coordinate, then pattern index.
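The 0-based `pattern_idx` and `text_idx` columns can be mapped back to the R input lists by adding 1. The sketch below uses a hand-written `hits` data frame as a stand-in for a result, so the rows are hypothetical rather than real package output.

```r
patterns <- list("ACGT", "GGCC")
texts <- list("TTACGTAA", "AAGGCCTT")

# Hand-written stand-in for two result rows (0-based input indices).
hits <- data.frame(pattern_idx = c(0L, 1L), text_idx = c(0L, 1L))

# R lists are 1-based, so shift each index by one before looking it up.
hits$pattern <- unlist(patterns)[hits$pattern_idx + 1L]
hits$text <- unlist(texts)[hits$text_idx + 1L]
hits

# Every pattern is searched against every text: 2 x 2 = 4 pairs here.
nrow(expand.grid(pattern = seq_along(patterns), text = seq_along(texts)))
```

Supplying `pattern_id` or `text_id` is the documented way to have mapped identifier columns included directly in the result.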
Select a backend for the current R process. Backend loading is intentionally
one-shot: the selected shared library is fixed for the lifetime of the R
process. This must be called before the first native Rsassy operation,
including sassy_features(), sassy_searcher(), or sassy_search(). Rsassy
does not unload and replace backend DLLs because that is not reliable across R
platforms. Use this for benchmarking installed backends against each other in
separate fresh R processes.
sassy_set_backend(
  backend = c("auto", "scalar", "avx2", "avx512", "neon", "wasm_simd128")
)
backend |
One of |
The requested backend name, invisibly. "auto" means runtime dispatch
will choose the best installed backend supported by the current CPU/runtime
when the backend is first loaded.
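Because the backend is fixed for the lifetime of the R process, comparing backends means starting one fresh process per backend. A minimal sketch of that workflow follows. It assumes the package is installed under the name `Rsassy` and that the listed backends exist on the machine; `make_expr()` is a hypothetical helper, and the child processes are launched only when the package is available.

```r
# Backends to compare; adjust to what is installed and supported locally.
backends <- c("scalar", "avx2")

# Hypothetical helper: R code that selects a backend first, then reports it.
make_expr <- function(backend) {
  sprintf('Rsassy::sassy_set_backend("%s"); print(Rsassy::sassy_features())',
          backend)
}
exprs <- vapply(backends, make_expr, character(1))

# Assumption: the package name is 'Rsassy'; skipped if it is not installed.
if (requireNamespace("Rsassy", quietly = TRUE)) {
  rscript <- file.path(R.home("bin"), "Rscript")
  for (e in exprs) {
    # Each call starts a fresh R process, so the backend choice cannot leak.
    system2(rscript, c("-e", shQuote(e)))
  }
}
```

In each child process `sassy_set_backend()` runs before any other native Rsassy operation, as the description above requires. Replacing the `print()` call with timing code would turn this into a benchmark.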