| Title: | Exploring Genome Scale Embeddings Parquet Files |
|---|---|
| Description: | Some explorations of the genomics embedding from the paper "Incorporating LLM Embeddings for Variation Across the Human Genome" <https://arxiv.org/html/2509.20702v1> |
| Authors: | Sounkou Mahamane Toure [aut, cre] |
| Maintainer: | Sounkou Mahamane Toure <[email protected]> |
| License: | LGPL (>= 3) |
| Version: | 0.0.0.9000 |
| Built: | 2026-06-04 08:07:32 UTC |
| Source: | https://github.com/sounkou-bioinfo/GenomeScaleEmbeddings |
Attach a houba file and return the bigmemory::big.matrix
attachHoubaBigMatrix(houba_file)
| Argument | Description |
|---|---|
| houba_file | Path to houba file (without .desc extension) |
bigmemory::big.matrix object
Copy remote parquet files into a local DuckDB database file using explicit URLs
CopyParquetToDuckDB(db_path = "local_embeddings.duckdb", urlList = DatasetParquetUrlList(), table_name = "embeddings", overwrite = FALSE)
| Argument | Description |
|---|---|
| db_path | Path to the local DuckDB database file |
| urlList | Character vector of parquet URLs |
| table_name | Name of the table to create in the DuckDB database |
| overwrite | Whether to overwrite the existing database |
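As a rough illustration of what copying parquet URLs into a DuckDB table can look like, the sketch below builds a `CREATE TABLE ... AS SELECT` statement around DuckDB's `read_parquet()`. The URLs are hypothetical placeholders, and the statement is an assumption about one possible approach, not necessarily what `CopyParquetToDuckDB()` runs internally.

```r
# Hypothetical URLs, used only for illustration
urls <- c(
  "https://example.org/data/part-0.parquet",
  "https://example.org/data/part-1.parquet"
)
table_name <- "embeddings"

# Quote each URL as a SQL string literal and join them into a DuckDB list
url_list_sql <- paste(sprintf("'%s'", urls), collapse = ", ")

# read_parquet() accepts a list of files; remote URLs need the httpfs extension
sql <- sprintf(
  "CREATE TABLE %s AS SELECT * FROM read_parquet([%s])",
  table_name, url_list_sql
)
cat(sql, "\n")

# To run the statement (assumes the DBI and duckdb packages; not executed here):
# con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "local_embeddings.duckdb")
# DBI::dbExecute(con, "INSTALL httpfs")
# DBI::dbExecute(con, "LOAD httpfs")
# DBI::dbExecute(con, sql)
# DBI::dbDisconnect(con, shutdown = TRUE)
```

Materializing the data in a local table trades disk space for faster repeated reads compared with querying the remote files each time.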
Compute correlation between PC scores and genomic position, per chromosome
correlatePCWithPosition(pc_scores, info_df, pc = 1, method = "spearman")
| Argument | Description |
|---|---|
| pc_scores | Matrix of principal component scores (variants x PCs) |
| info_df | Data frame with variant info (must contain 'chrom' and 'pos') |
| pc | Which principal component to correlate (default: 1) |
| method | Correlation method (default: 'spearman') |
Named vector of correlation values per chromosome
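The per-chromosome correlation described here can be sketched in base R. The example below uses simulated scores and positions (all values are made up for illustration) and is a minimal stand-in under those assumptions, not the package's own implementation.

```r
set.seed(1)

# Simulated variant info: two chromosomes with 50 variants each
info_df <- data.frame(
  chrom = rep(c("1", "2"), each = 50),
  pos   = rep(seq(1000, by = 1000, length.out = 50), times = 2)
)

# Simulated scores: PC1 rises monotonically with position on chromosome 1,
# and is pure noise on chromosome 2
pc_scores <- cbind(
  PC1 = c(info_df$pos[1:50] / 1000, rnorm(50)),
  PC2 = rnorm(100)
)

pc <- 1

# Split row indices by chromosome, then correlate the chosen PC with position
cors <- vapply(
  split(seq_len(nrow(info_df)), info_df$chrom),
  function(idx) cor(pc_scores[idx, pc], info_df$pos[idx], method = "spearman"),
  numeric(1)
)
cors
```

The result is a named numeric vector with one entry per chromosome; in this simulation the entry for chromosome "1" is 1 because the relationship is perfectly monotone.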
List of the Hugging Face datasets for the paper "Incorporating LLM Embeddings for Variation Across the Human Genome"
DatasetParquetUrlList()
A character vector of dataset names
Quick summary for houba mmatrix
embeddingSummary(embMat)
| Argument | Description |
|---|---|
| embMat | houba mmatrix |
List with dim, colMeans, rowMeans
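To show the shape of such a summary, the sketch below computes the same three quantities on an ordinary in-memory matrix. It assumes a plain R matrix instead of a houba mmatrix, so it only illustrates the structure of the returned list.

```r
# Small in-memory stand-in for an embedding matrix (2 rows x 3 columns)
embMat <- matrix(1:6, nrow = 2)

# Same three components as the documented return value
summary_list <- list(
  dim      = dim(embMat),
  colMeans = colMeans(embMat),
  rowMeans = rowMeans(embMat)
)
str(summary_list)
```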
Get PCA scores from houbaPCA result
getPcaScores(houbaPCA_res)
| Argument | Description |
|---|---|
| houbaPCA_res | List returned by houbaPCA (with 'pca' and 'houbaM') |
Matrix of principal component scores (variants x PCs)
PCA using bigPCAcpp on houba mmatrix or bigmemory::big.matrix
houbaPCA(embMat = "local_embeddings.houba", center = TRUE, scale = TRUE, ncomp = 15)
| Argument | Description |
|---|---|
| embMat | houba mmatrix path or bigmemory::big.matrix |
| center | logical, whether to center columns |
| scale | logical, whether to scale columns |
| ncomp | number of principal components to compute |
List with PCA result object and houbaM big.matrix
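For intuition about the `center`, `scale`, and `ncomp` arguments, the sketch below runs an equivalent in-memory PCA with base R's `stats::prcomp()` on simulated data. This deliberately swaps in `prcomp()` for bigPCAcpp and assumes the matrix fits in memory, so it is an analogy, not the package's code path.

```r
set.seed(42)

# Simulated embedding matrix: 200 variants x 10 dimensions
X <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

ncomp <- 3

# center/scale. mirror the documented center/scale flags;
# rank. limits the number of components, like ncomp
pca <- prcomp(X, center = TRUE, scale. = TRUE, rank. = ncomp)

# Scores matrix: variants x PCs
scores <- pca$x
dim(scores)
```

Scaling puts every embedding dimension on unit variance before decomposition, which matters when dimensions differ widely in magnitude.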
Quick summary for houba info mmatrix
infoSummary(infoMat)
| Argument | Description |
|---|---|
| infoMat | houba mmatrix |
List with dim, unique chroms, and example rsids
Iterate over embeddings as matrix batches from a local DuckDB file
IterateEmbeddingsMatrixBatches(chunk_size = 1e+05, db_path = "local_embeddings.duckdb", table_name = "embeddings")
| Argument | Description |
|---|---|
| chunk_size | Number of rows per batch |
| db_path | Path to the local DuckDB database file |
| table_name | Name of the table in the DuckDB database |
A list of embedding batches
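The batching idea can be sketched without a database: split the row range into chunks of `chunk_size` and, for each chunk, form a `LIMIT`/`OFFSET` query. The row count and the query shape below are assumptions for illustration; the package may page through the table differently.

```r
n_rows     <- 250L  # hypothetical table size
chunk_size <- 100L
table_name <- "embeddings"

# First row of each batch (1-based)
starts <- seq(1L, n_rows, by = chunk_size)

# Row indices covered by each batch; the last batch may be shorter
batches <- lapply(starts, function(s) s:min(s + chunk_size - 1L, n_rows))
lengths(batches)

# One paging query per batch (OFFSET is 0-based)
queries <- sprintf(
  "SELECT * FROM %s LIMIT %d OFFSET %d",
  table_name, chunk_size, starts - 1L
)
queries
```

Note that `LIMIT`/`OFFSET` paging is only reproducible when the row order is stable, for example with an explicit `ORDER BY`.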
Open remote parquet files as a DuckDB VIEW and return as tibble (minimal, http(s) only)
OpenRemoteParquetView(urlList = DatasetParquetUrlList(), view_name = "embeddings", db_path = tempfile(fileext = ".duckdb"), unify_schemas = FALSE)
| Argument | Description |
|---|---|
| urlList | Character vector of parquet URLs |
| view_name | Name of the DuckDB view to create |
| db_path | Path to DuckDB database file (default: temporary file; NULL for in-memory) |
| unify_schemas | Whether to unify schemas across files |
dplyr tibble referencing the DuckDB view
Plot PCA dimensions using ggplot2, colored by annotation
plotPcaDims(pc_scores, info_df, annotation_col = "chrom", dim1 = 1, dim2 = 2)
| Argument | Description |
|---|---|
| pc_scores | Matrix of principal component scores (variants x PCs) |
| info_df | Data frame with variant info (must contain annotation column) |
| annotation_col | Name of column in info_df to color by (e.g. 'gwas') |
| dim1 | First PC dimension to plot (default: 1) |
| dim2 | Second PC dimension to plot (default: 2) |
Plot spatial correlation between PC scores and genomic position, faceted by chromosome
plotPCSpatialCorrelation(pc_scores, info_df, pc = 1)
| Argument | Description |
|---|---|
| pc_scores | Matrix of principal component scores (variants x PCs) |
| info_df | Data frame with variant info (must contain 'chrom' and 'pos') |
| pc | Which principal component to plot (default: 1) |
Write embeddings to houba mmatrix and return info as data.frame from a local DuckDB file
writeEmbeddingsHoubaFromDuckDB(dbPath = "local_embeddings.duckdb", tableName = "embeddings", embeddingCol = "embedding", batchSize = 1e+05, embeddingDim = 3072, embeddingFile = gsub("\\.duckdb$", ".houba", dbPath), overwrite = FALSE)
| Argument | Description |
|---|---|
| dbPath | Path to the local DuckDB database file |
| tableName | Name of the table in the DuckDB database |
| embeddingCol | Name of the embeddings column |
| batchSize | Number of rows per batch |
| embeddingDim | Dimension of each embedding vector |
| embeddingFile | Path for houba mmatrix file |
| overwrite | Whether to overwrite existing houba file |
An object of class 'HoubaEmbeddings' with mmatrix, info data.frame, and houba file path
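When choosing `batchSize`, it can help to estimate how much memory one batch occupies. The arithmetic below uses the documented defaults and assumes each value is held as an 8-byte double while in R; the on-disk storage type of the houba file is not stated here, so treat the figure as a rough upper bound under that assumption.

```r
batchSize       <- 1e5   # documented default
embeddingDim    <- 3072  # documented default
bytes_per_value <- 8     # assumption: double precision in memory

# Memory for one batch, in GiB
batch_gib <- batchSize * embeddingDim * bytes_per_value / 2^30
round(batch_gib, 2)

# Default output path derived from dbPath, as in the usage above
dbPath <- "local_embeddings.duckdb"
embeddingFile <- gsub("\\.duckdb$", ".houba", dbPath)
embeddingFile
```

Under these assumptions a default batch takes a little over 2 GiB, so a smaller `batchSize` may be preferable on memory-constrained machines.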