Published August 1, 2025 | https://doi.org/10.59350/005jj-nf396

Weekly Recap (Aug 2025, Part 1)

This week's recap highlights Variant-EFFECTS for rewriting regulatory DNA to dissect and reprogram gene expression, zero-shot evaluation revealing the limitations of single-cell foundation models, EvoWeaver for large-scale prediction of gene functional associations from coevolutionary signals, and how assemblies of long-read metagenomes suffer from diverse errors.

Others that caught my attention include protein structure alignment with SoftAlign, off-target analysis with SANTON, fast+accurate clustering of viral genomes with Vclust, viral genome assembly from nanopore reads with BonoboFlow, strain-level metagenomic classification with MADRe, lossless pangenome indexing using tag arrays, a multistage fusion tabular transformer for disease prediction using metagenomic data, ConsensuSV-ONT for accurate structural variant calling, BEstimate for design and interpretation of CRISPR base editing experiments, and the evolutionary consequences of functional synonymous mutations.

Deep dive

Rewriting regulatory DNA to dissect and reprogram gene expression

Paper: Martyn, G. E., Montgomery, M. T., Jones, H., et al., "Rewriting regulatory DNA to dissect and reprogram gene expression" in Cell, 2025. https://doi.org/10.1016/j.cell.2025.03.034

Transcription factors bind to regulatory DNA to encode cell-type-specific gene expression patterns, but predicting and programming these regulatory sequences isn't easy. This paper tackles the problem with pooled CRISPR prime-editing screens.

TLDR: This paper introduces a high-throughput way to understand how specific DNA changes affect gene expression. Variant-EFFECTS precisely measures the impact of hundreds of designed edits in endogenous regulatory DNA, giving us a clearer picture of how gene regulation works and offering a path to new gene-editing therapies.

Summary: Variant-EFFECTS (Variant Effects From Flow-sorting Experiments with CRISPR Targeting Screens) is a new high-throughput technology designed to precisely quantify the effects of CRISPR-mediated edits on gene expression within endogenous regulatory DNA. The researchers applied this method to dissect and reprogram three regulatory elements across two genes in two cell types, providing detailed insights into their function. This work is important because Variant-EFFECTS overcomes key limitations of prior high-throughput methods like massively parallel reporter assays (MPRAs) by assessing variants directly in their native genomic context, ensuring more physiologically relevant measurements. The approach also accounts for complexities such as varying editing efficiencies and the presence of heterozygous genotypes, thereby improving the accuracy of estimated effect sizes. As a result, Variant-EFFECTS offers a robust and generalizable tool for mapping the logic of cis-regulatory elements and advancing the development of predictive models for gene regulation. Furthermore, this technology has significant potential for therapeutic applications, enabling the identification of small, precise edits that can tune gene expression over a wide range, paving the way for new prime-editing-based gene therapies targeting regulatory DNA.

Methodological highlights:

  • Variant-EFFECTS integrates pooled CRISPR prime editing with fluorescence-activated cell sorting (FACS), using RNA FlowFISH or fluorescent antibodies, to directly measure the quantitative effects of designed sequence edits on target gene expression.

  • The method incorporates a sophisticated computational analysis workflow that precisely estimates the effect of each edit by accounting for potential technical confounders, such as variable prime editing efficiencies and the presence of heterozygous genotypes within the cell population.

  • It enables high-throughput tiling mutagenesis screens across endogenous regulatory regions, facilitating the identification and quantitative measurement of functional transcription factor motif instances and their impact on gene expression in their native genomic context.
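To make the sort-and-sequence readout concrete, here's a toy sketch of my own (not the authors' analysis code). It assumes cells are sorted into expression bins, each bin is sequenced to get the frequency of the edited and reference alleles, and each bin has a representative expression value. The function names and all the numbers are hypothetical, and the sketch deliberately ignores the editing-efficiency and heterozygous-genotype corrections that the paper's workflow accounts for.

```python
from typing import Sequence


def weighted_mean_expression(
    allele_freq_per_bin: Sequence[float],
    cells_per_bin: Sequence[int],
    bin_expression: Sequence[float],
) -> float:
    """Mean expression of cells carrying an allele, from per-bin allele frequencies."""
    # Estimated number of allele-carrying cells in each sorted bin.
    weights = [f * n for f, n in zip(allele_freq_per_bin, cells_per_bin)]
    total = sum(weights)
    if total == 0:
        raise ValueError("allele not observed in any bin")
    return sum(w * x for w, x in zip(weights, bin_expression)) / total


def relative_effect(
    edit_freq: Sequence[float],
    ref_freq: Sequence[float],
    cells_per_bin: Sequence[int],
    bin_expression: Sequence[float],
) -> float:
    """Expression of the edited allele relative to the reference allele."""
    edited = weighted_mean_expression(edit_freq, cells_per_bin, bin_expression)
    reference = weighted_mean_expression(ref_freq, cells_per_bin, bin_expression)
    return edited / reference


# Hypothetical four-bin sort: the edit is enriched in the low-expression bins.
bin_expression = [0.25, 0.5, 1.0, 2.0]  # representative expression per bin (made up)
cells_per_bin = [1000, 1000, 1000, 1000]  # cells sorted into each bin (made up)
edit_freq = [0.04, 0.03, 0.02, 0.01]  # edited-allele frequency per bin (made up)
ref_freq = [0.96, 0.97, 0.98, 0.99]  # reference-allele frequency per bin (made up)

effect = relative_effect(edit_freq, ref_freq, cells_per_bin, bin_expression)
print(f"relative expression of edited allele: {effect:.2f}")
```

A value below 1 here would suggest the edit lowers expression. A real analysis would need to model partially edited and heterozygous cells rather than treat each allele read as an independent cell.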

New tools, data, and resources:

Figure 1 from Martyn 2025: Variant-EFFECTS combines prime editing with RNA FlowFISH to investigate the effects of non-coding variants on target gene expression.

Zero-shot evaluation reveals limitations of single-cell foundation models

Paper: Kedzierska, K. Z., Crawford, L., Amini, A. P., et al., "Zero-shot evaluation reveals limitations of single-cell foundation models" in Genome Biology, 2025. https://doi.org/10.1186/s13059-025-03574-x

I've linked to it before, but if you haven't yet read Abhishaike Mahajan's primer on scRNAseq foundation models, go read it now.

TLDR: This paper challenges the common assumption that large single-cell foundation models always perform well, especially in zero-shot scenarios. The study rigorously evaluates Geneformer and scGPT, showing that their zero-shot performance in tasks like cell type clustering and batch integration often falls short of simpler, established methods, highlighting the need for more robust evaluation practices in single-cell research.

Summary: This paper critically evaluates the zero-shot performance of two prominent single-cell foundation models, Geneformer and scGPT, in tasks such as cell type clustering and batch integration. Unlike evaluations focused solely on fine-tuning, this study reveals that in scenarios where no further training is possible, these models frequently underperform compared to simpler, established methods like selecting highly variable genes (HVG), scVI, or Harmony. The importance of this work lies in its rigorous assessment of models in a zero-shot setting, which is crucial for exploratory single-cell tasks where predefined labels for fine-tuning are often unavailable. The findings underscore that pretraining on vast datasets does not automatically guarantee superior out-of-the-box performance and can expose vulnerabilities not apparent through fine-tuned evaluations alone. This research serves as a vital call to action for the single-cell community to prioritize robust zero-shot evaluations, especially for applications where labels are unknown, to ensure the practical reliability and utility of developing foundation models for democratizing advanced single-cell analysis.

Methodological highlights:

  • The study performs a direct head-to-head zero-shot evaluation of Geneformer and scGPT against simpler baselines (highly variable genes, scVI, Harmony) for cell type clustering and batch integration tasks.

  • It assesses the models' performance on the gene expression reconstruction pretraining task (masked language modeling objective for scGPT, gene rankings for Geneformer) to investigate why they underperform in downstream tasks.

  • The evaluation includes an analysis of how pretraining dataset size and composition affect zero-shot performance, including testing different variants of scGPT trained on tissue-specific and large general human cell datasets.
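For a sense of what a zero-shot clustering check involves, here's a minimal sketch of my own. It assumes you've already clustered the frozen embeddings from each method and want to score the cluster labels against known cell types. I use the adjusted Rand index (ARI), a common clustering-agreement metric. Exactly which metrics feed the paper's scores isn't spelled out in this recap, and the labels below are made up.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Chance-corrected agreement between two partitions of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")

    # Count cell pairs that fall in the same group under each labelling.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_in_true * same_in_pred / comb(n, 2)
    max_index = (same_in_true + same_in_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both labellings
    return (same_in_both - expected) / (max_index - expected)


# Hypothetical cell-type labels and cluster assignments from two embeddings.
cell_types = ["alpha", "alpha", "beta", "beta"]
clusters_method_a = [1, 1, 0, 0]  # recovers the cell types (cluster names differ)
clusters_method_b = [0, 1, 0, 1]  # splits every cell type across clusters

for name, clusters in [("method A", clusters_method_a), ("method B", clusters_method_b)]:
    print(name, adjusted_rand_index(cell_types, clusters))
```

The point of the zero-shot setting is that nothing in this loop touches the model weights: the same scoring is applied to HVG, scVI, Harmony, and foundation-model embeddings alike.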

New tools, data, and resources:

Figure 1 from Kedzierska 2025: Evaluation of the cell embedding space generated by the models. A Overview of the evaluation setup. We compare Geneformer and scGPT to scVI, Harmony, and the selection of highly variable genes (HVG) on five diverse datasets. B Average BIO score for HVG and embeddings from Harmony, scVI, scGPT, and Geneformer. C, D Visualization of the UMAP projections of the Pancreas (16k) dataset using the cell embedding space generated by the models. Cells are color-coded by cell type (C) and batch (D). E Average batch score for HVG and embeddings from Harmony, scVI, scGPT, and Geneformer. Dashed line in B and E signifies the median calculated across the datasets.

EvoWeaver: large-scale prediction of gene functional associations from coevolutionary signals

Paper: Lakshman, A. H., & Wright, E. S., "EvoWeaver: large-scale prediction of gene functional associations from coevolutionary signals" in Nature Communications, 2025. https://doi.org/10.1038/s41467-025-59175-6

Tons of proteins have been discovered and their encoding genes sequenced, but we still have no idea what they do. You can compare their similarity to proteins with known functions, but what if those related proteins are also under-studied?

TLDR: This paper introduces a powerful new method called EvoWeaver that combines many signals of coevolution to predict how genes are functionally related on a large scale. EvoWeaver offers a highly accurate and scalable way to uncover previously unknown connections between proteins by analyzing how their genes have evolved together, which is a major step forward for annotating the vast number of uncharacterized proteins.

Summary: This paper introduces EvoWeaver, a novel computational method designed for the large-scale prediction of gene functional associations by integrating 12 distinct coevolutionary signals. Unlike existing methods that often rely on similarity to previously studied proteins, EvoWeaver leverages signals such as phylogenetic profiling, phylogenetic structure, gene organization, and sequence-level patterns to quantify shared evolutionary histories between genes. This approach is crucial because the number of uncharacterized proteins is rapidly expanding, and traditional annotation methods are not keeping pace. EvoWeaver addresses significant limitations of prior coevolutionary algorithms, such as insufficient accuracy and scalability, by implementing all analyses within a single software package and optimizing for large datasets. Its ability to combine disparate coevolutionary signals leads to higher-accuracy predictions of protein complexes and biochemical pathways. The application of EvoWeaver allows for the reconstruction of known biochemical pathways without prior knowledge beyond genomic sequences and can reveal missing or undiscovered links in popular databases. This makes EvoWeaver a valuable tool for generating high-quality hypotheses about the functions of understudied proteins, thus helping to combat annotation inequality and accelerate biomedical progress.

Methodological highlights:

  • EvoWeaver integrates 12 distinct coevolutionary signals across four categories (Phylogenetic Profiling, Phylogenetic Structure, Gene Organization, and Sequence Level methods) to quantify the degree of shared evolution between genes.

  • It employs ensemble machine learning classifiers (logistic regression, random forest, and neural network) to combine the 12 component scores, demonstrating significantly improved predictive accuracy for functional associations compared to individual signals.

  • The method is designed for high scalability, allowing for the analysis of large datasets (e.g., 1545 gene groups from 8564 genomes) by optimizing algorithms and distributing pairwise comparisons across compute clusters.
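Here's a toy illustration of two of the ideas in the bullets above: scoring a gene pair by how similar their presence/absence profiles are across genomes, and combining several component scores with a logistic model. This is my own simplified sketch, not EvoWeaver's implementation. The Jaccard index stands in for its phylogenetic profiling signals, and the weights, bias, and second score are invented for illustration.

```python
import math
from typing import Sequence, Set


def jaccard_profile_similarity(genomes_a: Set[str], genomes_b: Set[str]) -> float:
    """Overlap between the sets of genomes in which two gene groups are present."""
    union = genomes_a | genomes_b
    if not union:
        return 0.0
    return len(genomes_a & genomes_b) / len(union)


def logistic_ensemble(scores: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Combine component scores into one probability-like value in (0, 1)."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per score")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical presence/absence profiles for two gene groups.
gene_x = {"genome1", "genome2", "genome3"}
gene_y = {"genome2", "genome3", "genome4"}
profile_score = jaccard_profile_similarity(gene_x, gene_y)

# Pretend a second signal (e.g. gene-distance based) was computed elsewhere.
gene_distance_score = 0.8  # made-up value
prediction = logistic_ensemble(
    scores=[profile_score, gene_distance_score],
    weights=[2.0, 1.5],  # hypothetical weights; a real model would learn these
    bias=-1.5,  # hypothetical bias
)
print(f"profile similarity: {profile_score}, ensemble prediction: {prediction:.2f}")
```

Plain profile overlap like this ignores the tree: EvoWeaver's signals, per the figure caption, look at gain/loss events on a reference phylogeny, which is a much stronger form of evidence than raw co-occurrence.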

New tools, data, and resources:

Figure 1 from Lakshman 2025: Overview of the EvoWeaver algorithm and benchmarking. a Phylogenetic trees from groups of orthologous genes serve as the primary input to EvoWeaver. Four categories of coevolutionary signal are quantified for each pair of genes. These signals are combined in an ensemble classifier to predict functional relationships between gene pairs. EvoWeaver provides as output its 12 predictions for signals of coevolution, and can optionally provide an ensemble prediction using built-in pretrained models. b Functional associations often result in correlated gain/loss patterns on a reference phylogenetic tree (e.g., a species tree). EvoWeaver assesses the presence/absence patterns, correlation between gain/loss events, and distance between gain/loss events as signals of coevolution. c Similarity in phylogenetic structure is another indicator of coevolution between genes. EvoWeaver computes topological distance as well as correlation in patristic distances following dimensionality reduction using random projection. d Functionally associated genes sometimes cluster on the genome due to co-regulation or horizontal gene transfer. EvoWeaver derives signals from the conservation in gene orientation and the distance between gene pairs. e Functional associations sometimes cause concerted changes in sequences that are interrogated by EvoWeaver. EvoWeaver can analyze nucleotide sequences or amino acid sequences, though nucleotide sequences are pictured here. f Proteins involved in the same complex are functionally associated and can be identified through signals of coevolution. The goal of the Complexes benchmark is to distinguish orthology groups in the same complex (i.e., positives) from those in different complexes (i.e., negatives). g Functional associations between proteins that are adjacent in the same module are stronger than those between different modules. The goal of the Modules benchmark is to distinguish adjacent proteins in the same module from independent modules.

Assemblies of long-read metagenomes suffer from diverse errors

Paper: Trigodet, F., et al., "Assemblies of long-read metagenomes suffer from diverse errors" bioRxiv, 2025. https://doi.org/10.1101/2025.04.22.649783

Check out the anvi'o website at anvio.org, the platform this work's tooling is part of. It's a great resource for integrated multi-omics analysis and visualization.

TLDR: This paper sheds light on the unexpected types of errors that can creep into metagenome assemblies when using long-read sequencing data. It's a critical evaluation of long-read metagenome assemblers, revealing that while they produce highly contiguous assemblies, they often introduce complex errors like misassemblies, chimeric contigs, and premature circularization that aren't easily caught by standard quality control.

Summary: This paper rigorously investigates the types of errors prevalent in metagenome assemblies generated from long-read sequencing data, specifically focusing on data from PacBio HiFi and Oxford Nanopore Technologies (ONT). The authors demonstrate that while long-read assemblers produce remarkably contiguous assemblies, they are prone to introducing diverse and often complex errors, including misassemblies (e.g., scaffolding errors, insertions, deletions), chimeric contigs combining sequences from different organisms, and premature circularization of linear chromosomes. The importance of this work lies in its comprehensive characterization of these errors, which are frequently overlooked or poorly detected by conventional quality control metrics. By using both synthetic and mock community datasets, the study highlights that existing quality control methods often fail to identify these intricate issues, leading to potentially flawed biological conclusions. The applications of these findings are critical for the metagenomics community, emphasizing the need for more sophisticated validation strategies beyond simple contiguity and completeness metrics. It underscores the importance of scrutinizing assembled genomes for these specific error types to ensure the reliability of downstream analyses, particularly for studies aiming at accurate genome reconstruction and functional annotation of microbial communities.

Methodological highlights:

  • The study systematically generated and analyzed diverse long-read metagenome assemblies from both synthetic and mock community datasets to characterize a broad spectrum of assembly errors.

  • They developed an in-depth validation approach using reference genomes and visualization tools to identify and classify complex errors such as misassemblies, chimeras, and premature circularization, which are often missed by standard quality control metrics.

  • The paper introduces a pipeline for benchmarking and visualizing assembly errors, allowing for direct comparison of different assemblers and highlighting their specific error profiles.

New tools, data, and resources:

Figure 1 from Trigodet 2025: Assembly errors across assemblers. (A) A schematic representation of long reads mapping to a contig with multiple types of read disagreement with the reference, including indel and single nucleotide variants representing more than half or all the coverage, and clipping events spanning the entire coverage. All metrics for B, C, D, E, F, G are normalized by assembly size and exclude the two mock community metagenomes. (B) Number of clipping events supported by at least 10 reads. (C) Number of regions over 1,000 bp with no apparent coverage. Number of single-nucleotide variants representing (D) > 50% or (E) all the coverage at a given locus. Distribution of (F) indels > 50% of the coverage, or (G) all the coverage. (H) Length distribution of circular contigs by each assembler. The darker color represents the distribution of circular contigs with at least one clipping event.
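Two of the metrics in that caption, clipping events supported by at least 10 reads (B) and regions over 1,000 bp with no coverage (C), are easy to sketch once reads are mapped back to a contig. The code below is my own rough illustration, not the paper's pipeline. It assumes each alignment is given as a hypothetical (0-based start, CIGAR string) pair; in practice you'd pull these from a BAM file with a proper parser.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the contig
Alignment = Tuple[int, str]  # (0-based start on the contig, CIGAR string)


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def aligned_length(cigar: str) -> int:
    return sum(length for length, op in parse_cigar(cigar) if op in REF_CONSUMING)


def clipping_sites(alignments: Iterable[Alignment], min_reads: int = 10) -> List[int]:
    """Contig positions where at least `min_reads` reads are soft- or hard-clipped."""
    support: Counter = Counter()
    for start, cigar in alignments:
        ops = parse_cigar(cigar)
        if not ops:
            continue
        if ops[0][1] in "SH":  # clipped at the left end of the alignment
            support[start] += 1
        if len(ops) > 1 and ops[-1][1] in "SH":  # clipped at the right end
            support[start + aligned_length(cigar)] += 1
    return sorted(pos for pos, count in support.items() if count >= min_reads)


def uncovered_regions(
    alignments: Iterable[Alignment], contig_length: int, min_length: int = 1000
) -> List[Tuple[int, int]]:
    """Half-open (start, end) stretches longer than `min_length` with no aligned read."""
    intervals = sorted((start, start + aligned_length(cigar)) for start, cigar in alignments)
    regions, cursor = [], 0
    for start, end in intervals:
        if start - cursor > min_length:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if contig_length - cursor > min_length:
        regions.append((cursor, contig_length))
    return regions


# Hypothetical contig of 5,000 bp: three reads cover the first 1,500 bp, and ten
# reads pile up with a soft clip at position 3,000.
reads = [(0, "1500M")] * 3 + [(3000, "200S2000M")] * 10
print(clipping_sites(reads))
print(uncovered_regions(reads, contig_length=5000))
```

A pile of reads clipped at the same position next to a coverage gap is the kind of pattern that contiguity and completeness metrics alone won't flag.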

Other papers of note

Figure 1 from Zielezinski 2025 "Ultrafast and accurate sequence alignment and clustering of viral genomes"

Additional details

Identifiers

UUID
c57b3425-7fd2-4cc7-9e37-c110c2919355
GUID
165892906
URL
https://blog.stephenturner.us/p/weekly-recap-aug-2025-part-1

Dates

Issued
2025-08-01T11:54:05
Updated
2025-08-01T11:54:05