Publications
2025
- arXiv: Learning to Dissipate Energy in Oscillatory State-Space Models. Jared Boyer, T. Konstantin Rusch, and Daniela Rus. arXiv preprint, 2025.
State-space models (SSMs) are a class of networks for sequence learning that benefit from fixed state size and linear complexity with respect to sequence length, contrasting the quadratic scaling of typical attention mechanisms. Inspired by observations in neuroscience, Linear Oscillatory State-Space models (LinOSS) are a recently proposed class of SSMs constructed from layers of discretized forced harmonic oscillators. Although these models perform competitively, leveraging fast parallel scans over diagonal recurrent matrices and achieving state-of-the-art performance on tasks with sequence length up to 50k, LinOSS models rely on rigid energy dissipation ("forgetting") mechanisms that are inherently coupled to the timescale of state evolution. As forgetting is a crucial mechanism for long-range reasoning, we demonstrate the representational limitations of these models and introduce Damped Linear Oscillatory State-Space models (D-LinOSS), a more general class of oscillatory SSMs that learn to dissipate latent state energy on multiple timescales. We analyze the spectral distribution of the model’s recurrent matrices and prove that the SSM layers exhibit stable dynamics under simple, flexible parameterizations. D-LinOSS consistently outperforms previous LinOSS methods on long-range learning tasks, without introducing additional complexity, and simultaneously reduces the hyperparameter search space by 50%.
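For readers unfamiliar with the construction, the underlying continuous-time systems can be sketched as follows; the symbols (A, G, B, u) and the exact parameterization are an illustrative paraphrase of the abstract, not the paper's notation. LinOSS layers follow undamped forced harmonic oscillators, while D-LinOSS adds a learned damping term so that energy dissipation is no longer tied to the oscillation timescale:

```latex
\begin{aligned}
\text{LinOSS (undamped):}\quad & \ddot{x}(t) = -A\,x(t) + B\,u(t),\\
\text{D-LinOSS (damped):}\quad & \ddot{x}(t) = -A\,x(t) - G\,\dot{x}(t) + B\,u(t),
\end{aligned}
```

with diagonal, nonnegative A (setting the oscillation frequencies) and G (setting the dissipation rates), input signal u(t), and input matrix B; learning G alongside A is what allows energy to be dissipated on multiple timescales.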
- ICLR Oral: Oscillatory State-Space Models. T. Konstantin Rusch and Daniela Rus. The 13th International Conference on Learning Representations, 2025. Oral presentation (top 1.8% of all submitted papers).
We propose Linear Oscillatory State-Space models (LinOSS) for efficiently learning on long sequences. Inspired by cortical dynamics of biological neural networks, we base our proposed LinOSS model on a system of forced harmonic oscillators. A stable discretization, integrated over time using fast associative parallel scans, yields the proposed state-space model. We prove that LinOSS produces stable dynamics requiring only a nonnegative diagonal state matrix. This is in stark contrast to many previous state-space models relying heavily on restrictive parameterizations. Moreover, we rigorously show that LinOSS is universal, i.e., it can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. In addition, we show that an implicit-explicit discretization of LinOSS perfectly conserves the symmetry of time reversibility of the underlying dynamics. Together, these properties enable efficient modeling of long-range interactions, while ensuring stable and accurate long-horizon forecasting. Finally, our empirical results, spanning a wide range of time-series tasks from mid-range to very long-range classification and regression, as well as long-horizon forecasting, demonstrate that our proposed LinOSS model consistently outperforms state-of-the-art sequence models. Notably, LinOSS outperforms Mamba by nearly 2x and LRU by 2.5x on a sequence modeling task with sequences of length 50k.
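To make the mechanism concrete, here is a minimal NumPy sketch of one oscillatory state-space layer run as a plain sequential recurrence over a discretized forced harmonic oscillator. The symplectic-Euler step and the names (A, B, C, dt) are illustrative assumptions; the actual LinOSS layer uses the specific implicit/implicit-explicit discretizations analyzed in the paper and evaluates the recurrence with fast associative parallel scans rather than a Python loop.

```python
import numpy as np

def oscillatory_ssm_layer(u, A, B, C, dt=0.1):
    """Toy oscillatory state-space layer, run sequentially.

    u : (T, d_in) input sequence
    A : (d_h,) nonnegative diagonal "stiffness" (squared oscillation frequencies)
    B : (d_h, d_in) input matrix, C : (d_out, d_h) readout matrix
    Integrates y'' = -A y + B u with a symplectic-Euler step (illustrative only).
    """
    y = np.zeros(A.shape[0])               # oscillator positions (hidden state)
    z = np.zeros(A.shape[0])               # oscillator velocities
    outputs = []
    for t in range(u.shape[0]):
        z = z + dt * (-A * y + B @ u[t])   # velocity update (explicit in y)
        y = y + dt * z                     # position update (implicit in z)
        outputs.append(C @ y)
    return np.stack(outputs)               # (T, d_out)

# toy usage: scalar input of length 100, 8 oscillators, scalar output
rng = np.random.default_rng(0)
out = oscillatory_ssm_layer(
    rng.standard_normal((100, 1)),
    A=rng.uniform(0.0, 1.0, 8),
    B=rng.standard_normal((8, 1)),
    C=rng.standard_normal((1, 8)),
)
```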
- FPI: Low Stein Discrepancy via Message-Passing Monte Carlo. Nathan Kirk, T. Konstantin Rusch, Jakob Zech, and Daniela Rus. ICLR Workshop on Frontiers in Probabilistic Inference, 2025.
Message-Passing Monte Carlo (MPMC) was recently introduced as a novel low-discrepancy sampling approach leveraging tools from geometric deep learning. While originally designed for generating uniform point sets, we extend this framework to sample from general multivariate probability distributions with a known probability density function. Our proposed method, Stein-Message-Passing Monte Carlo (Stein-MPMC), minimizes a kernelized Stein discrepancy, ensuring improved sample quality. Finally, we show that Stein-MPMC outperforms competing methods, such as Stein Variational Gradient Descent and (greedy) Stein Points, by achieving a lower Stein discrepancy.
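As an illustration of the objective being minimized, the NumPy sketch below estimates the (squared) kernelized Stein discrepancy of a sample against a standard Gaussian target using an RBF kernel. This is a generic V-statistic KSD estimator for illustration only; the Gaussian target, the kernel, and the bandwidth h are assumptions, and it is not the Stein-MPMC training code.

```python
import numpy as np

def ksd_gaussian(x, h=1.0):
    """Squared kernelized Stein discrepancy (V-statistic) of points x against a
    standard Gaussian target, with RBF kernel k(a, b) = exp(-|a-b|^2 / (2 h^2)).
    x : (n, d) sample; smaller values indicate a better fit to the target."""
    n, d = x.shape
    score = -x                                   # grad log p(x) for N(0, I)
    diff = x[:, None, :] - x[None, :, :]         # (n, n, d)
    sq = np.sum(diff ** 2, axis=-1)              # (n, n) squared distances
    k = np.exp(-sq / (2 * h ** 2))               # kernel matrix
    grad_kx = -diff / h ** 2 * k[..., None]      # d/dx_i k(x_i, x_j)
    grad_ky = diff / h ** 2 * k[..., None]       # d/dx_j k(x_i, x_j)
    trace_term = (d / h ** 2 - sq / h ** 4) * k  # trace of mixed second derivative
    u = (k * (score @ score.T)
         + np.einsum('id,ijd->ij', score, grad_ky)
         + np.einsum('jd,ijd->ij', score, grad_kx)
         + trace_term)
    return u.mean()

# i.i.d. Gaussian points already give a small value; Stein-type methods aim to
# push it lower for the same number of points.
print(ksd_gaussian(np.random.default_rng(0).standard_normal((256, 2))))
```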
- WRL: Improving Efficiency of Sampling-based Motion Planning via Message-Passing Monte Carlo. Makram Chahine, T. Konstantin Rusch, Zach J. Patterson, and Daniela Rus. ICLR Workshop on Robot Learning, 2025.
Sampling-based motion planning methods, while effective in high-dimensional spaces, often suffer from inefficiencies due to irregular sampling distributions, leading to suboptimal exploration of the configuration space. In this paper, we propose an approach that enhances the efficiency of these methods by utilizing low-discrepancy distributions generated through Message-Passing Monte Carlo (MPMC). MPMC leverages Graph Neural Networks (GNNs) to generate point sets that uniformly cover the space, with uniformity assessed using the L_p-discrepancy measure, which quantifies the irregularity of sample distributions. By improving the uniformity of the point sets, our approach significantly reduces computational overhead and the number of samples required for solving motion planning problems. Experimental results demonstrate that our method outperforms traditional sampling techniques in terms of planning efficiency.
- MLGenX Spotlight: Relaxed Equivariance via Multitask Learning. Ahmed A. Elhag, T. Konstantin Rusch, Francesco Di Giovanni, and Michael M. Bronstein. ICLR Workshop on Machine Learning for Genomics Explorations, 2025. Spotlight presentation.
Incorporating equivariance as an inductive bias into deep learning architectures to take advantage of data symmetry has been successful in multiple applications, such as chemistry and dynamical systems. In particular, roto-translations are crucial for effectively modeling geometric graphs and molecules, where understanding the 3D structures enhances generalization. However, equivariant models often pose challenges due to their high computational complexity. In this paper, we introduce REMUL, a training procedure for approximating equivariance with multitask learning. We show that unconstrained models (which do not build equivariance into the architecture) can learn approximate symmetries by minimizing an additional simple equivariance loss. By formulating equivariance as a new learning objective, we can control the level of approximate equivariance in the model. Our method achieves competitive performance compared to equivariant baselines while being 10x faster at inference and 2.5x faster at training.
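The additional equivariance loss can be illustrated for planar rotations as follows: a PyTorch sketch that penalizes the mismatch between f(g·x) and g·f(x) over sampled rotations and is simply added to the task loss as a second objective. The model interface, the weight lam, and the restriction to 2D rotations are hypothetical choices for illustration, not the REMUL implementation.

```python
import torch

def rotation_equivariance_loss(model, x, n_samples=4):
    """Average ||f(R x) - R f(x)||^2 over random planar rotations R, for a model
    mapping 2D point clouds (B, N, 2) to 2D point clouds (B, N, 2)."""
    loss = 0.0
    fx = model(x)
    for _ in range(n_samples):
        theta = torch.rand(()) * 2 * torch.pi
        c, s = torch.cos(theta), torch.sin(theta)
        rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])  # (2, 2)
        loss = loss + ((model(x @ rot.T) - fx @ rot.T) ** 2).mean()
    return loss / n_samples

# multitask objective (lam is a hypothetical weight controlling how strictly
# equivariance is enforced):
# total_loss = task_loss(model(x), y) + lam * rotation_equivariance_loss(model, x)
```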
2024
- PNAS: Message-Passing Monte Carlo: Generating low-discrepancy point sets via Graph Neural Networks. T. Konstantin Rusch, Nathan Kirk, Michael M. Bronstein, Christiane Lemieux, and Daniela Rus. Proceedings of the National Academy of Sciences, 2024.
Discrepancy is a well-known measure for the irregularity of the distribution of a point set. Point sets with small discrepancy are called low-discrepancy and are known to efficiently fill the space in a uniform manner. Low-discrepancy points play a central role in many problems in science and engineering, including numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. In this work, we present the first machine learning approach to generate a new class of low-discrepancy point sets named Message-Passing Monte Carlo (MPMC) points. Motivated by the geometric nature of generating low-discrepancy point sets, we leverage tools from Geometric Deep Learning and base our model on Graph Neural Networks. We further provide an extension of our framework to higher dimensions, which flexibly allows the generation of custom-made points that emphasize the uniformity in specific dimensions that are primarily important for the particular problem at hand. Finally, we demonstrate that our proposed model achieves state-of-the-art performance superior to previous methods by a significant margin. In fact, MPMC points are empirically shown to be either optimal or near-optimal with respect to the discrepancy in low dimensions and for small numbers of points, i.e., in regimes where the optimal discrepancy can be determined.
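For concreteness, the L2-type star discrepancy that quantifies this notion of uniformity has a closed form (Warnock's formula); a small NumPy sketch is given below. This is a generic quality check for point sets, not the MPMC model or its training objective.

```python
import numpy as np
from scipy.stats import qmc

def l2_star_discrepancy(x):
    """L2 star discrepancy of points x in [0, 1]^d via Warnock's formula.
    x : (n, d) array; smaller values indicate a more uniform point set."""
    n, d = x.shape
    term1 = 3.0 ** (-d)
    term2 = 2.0 / n * np.prod((1.0 - x ** 2) / 2.0, axis=1).sum()
    pairwise_max = np.maximum(x[:, None, :], x[None, :, :])       # (n, n, d)
    term3 = np.prod(1.0 - pairwise_max, axis=-1).sum() / n ** 2
    return np.sqrt(term1 - term2 + term3)

# sanity check: a scrambled Sobol' set beats i.i.d. uniform points
print(l2_star_discrepancy(np.random.default_rng(0).random((256, 2))))
print(l2_star_discrepancy(qmc.Sobol(d=2, scramble=True, seed=0).random(256)))
```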
- TMLR: How does over-squashing affect the power of GNNs? Francesco Di Giovanni, T. Konstantin Rusch, Michael M. Bronstein, Andreea Deac, Marc Lackenby, Siddhartha Mishra, and Petar Veličković. Transactions on Machine Learning Research, 2024.
Graph Neural Networks (GNNs) are the state-of-the-art model for machine learning on graph-structured data. The most popular class of GNNs operate by exchanging information between adjacent nodes, and are known as Message Passing Neural Networks (MPNNs). Given their widespread use, understanding the expressive power of MPNNs is a key question. However, existing results typically consider settings with uninformative node features. In this paper, we provide a rigorous analysis to determine which function classes of node features can be learned by an MPNN of a given capacity. We do so by measuring the level of pairwise interactions between nodes that MPNNs allow for. This measure provides a novel quantitative characterization of the so-called over-squashing effect, which is observed to occur when a large volume of messages is aggregated into fixed-size vectors. Using our measure, we prove that, to guarantee sufficient communication between pairs of nodes, the capacity of the MPNN must be large enough, depending on properties of the input graph structure, such as commute times. For many relevant scenarios, our analysis results in impossibility statements in practice, showing that over-squashing hinders the expressive power of MPNNs. We validate our theoretical findings through extensive controlled experiments and ablation studies.
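One of the graph properties the capacity requirement is tied to, the commute time between pairs of nodes, can be computed directly from the pseudoinverse of the graph Laplacian via the standard identity C(u, v) = 2|E| (L+_{uu} + L+_{vv} - 2 L+_{uv}). The NumPy sketch below illustrates that quantity only; it is not the paper's over-squashing measure.

```python
import numpy as np

def commute_times(adj):
    """Pairwise commute times of a connected, undirected graph from its adjacency
    matrix, using the Moore-Penrose pseudoinverse of the graph Laplacian."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    lp = np.linalg.pinv(laplacian)
    two_m = adj.sum()                      # 2|E| for an undirected graph
    diag = np.diag(lp)
    return two_m * (diag[:, None] + diag[None, :] - 2 * lp)

# toy usage: path graph on 4 nodes; the two endpoints have the largest commute time
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(commute_times(adj))
```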
2023
- NeurIPS: Neural Oscillators are Universal. Samuel Lanthaler, T. Konstantin Rusch, and Siddhartha Mishra. The 37th Conference on Neural Information Processing Systems, 2023.
Coupled oscillators are being increasingly used as the basis of machine learning (ML) architectures, for instance in sequence modeling, graph representation learning and in physical neural networks that are used in analog ML devices. We introduce an abstract class of neural oscillators that encompasses these architectures and prove that neural oscillators are universal, i.e., they can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. This universality result provides theoretical justification for the use of oscillator-based ML systems. The proof builds on a fundamental result of independent interest, which shows that a combination of forced harmonic oscillators with a nonlinear read-out suffices to approximate the underlying operators.
- PhD thesis: Physics-inspired Machine Learning. T. Konstantin Rusch. PhD thesis, 2023.
Physics-inspired machine learning can be seen as incorporating structure from physical systems (e.g., given by ordinary or partial differential equations) into machine learning methods to obtain models with better inductive biases. In this thesis, we provide several of the earliest examples of such methods in the fields of sequence modelling and graph representation learning. We subsequently show that physics-inspired inductive biases can be leveraged to mitigate important and central issues in each particular field. More concretely, we demonstrate that systems of coupled nonlinear oscillators and Hamiltonian systems lead to recurrent sequence models that are able to process sequential interactions over long time scales by mitigating the exploding and vanishing gradients problem. Additionally, we rigorously prove that neural systems of oscillators are universal approximators for continuous and causal operators. Moreover, we show that sequence models derived from multiscale dynamical systems not only mitigate the exploding and vanishing gradients problem (and are thus able to learn long-term dependencies), but equally importantly yield expressive models for learning on (real-world) multiscale data. We further show the impact of physics-inspired approaches on graph representation learning. In particular, systems of graph-coupled nonlinear oscillators provide a powerful framework for learning on graphs that allows for stacking many graph neural network (GNN) layers on top of each other. Thereby, we prove that these systems mitigate the oversmoothing issue in GNNs, where node features exponentially converge to the same constant node vector for an increasing number of GNN layers. Finally, we propose to incorporate multiple rates that are inferred from the underlying graph data into the message-passing framework of GNNs. Moreover, we leverage the graph gradient modulated through gating functions to obtain multiple rates that automatically mitigate the oversmoothing issue. We extensively test all proposed methods on a variety of versatile synthetic and real-world datasets, ranging from image recognition, speech recognition, natural language processing (NLP), medical applications, and scientific computing for sequence models, to citation networks, computational chemistry applications, and article and website networks for graph learning models.
- arXiv: A Survey on Oversmoothing in Graph Neural Networks. T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. arXiv preprint, 2023.
Node features of graph neural networks (GNNs) tend to become more similar as the network depth increases. This effect is known as over-smoothing, which we axiomatically define as the exponential convergence of suitable similarity measures on the node features. Our definition unifies previous approaches and gives rise to new quantitative measures of over-smoothing. Moreover, we empirically demonstrate this behavior for several over-smoothing measures on different graphs (small-, medium-, and large-scale). We also review several approaches for mitigating over-smoothing and empirically test their effectiveness on real-world graph datasets. Through illustrative examples, we demonstrate that mitigating over-smoothing is a necessary but not sufficient condition for building deep GNNs that are expressive on a wide range of graph learning tasks. Finally, we extend our definition of over-smoothing to the rapidly emerging field of continuous-time GNNs.
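A typical similarity measure of this kind is a Dirichlet energy of the node features. The NumPy sketch below uses the plain unnormalized version and tracks its decay under repeated neighbourhood averaging, purely to illustrate how the exponential convergence can be measured; the survey's exact measures and normalizations may differ.

```python
import numpy as np

def dirichlet_energy(x, adj):
    """Unnormalized Dirichlet energy: sum over edges (i, j) of ||x_i - x_j||^2 for
    node features x (n, d) on a graph with symmetric adjacency matrix adj (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return 0.5 * np.sum(adj * np.sum(diff ** 2, axis=-1))  # 0.5: each edge counted twice

# repeated neighbourhood averaging (a crude stand-in for deep message passing)
# drives the energy towards zero exponentially fast
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
prop = adj / adj.sum(axis=1, keepdims=True)   # row-normalized averaging operator
x = rng.standard_normal((4, 8))
for layer in range(10):
    x = prop @ x
    print(layer, dirichlet_energy(x, adj))
```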
- Physics4ML Spotlight: Multi-Scale Message Passing Neural PDE Solvers. Léonard Equer, T. Konstantin Rusch, and Siddhartha Mishra. ICLR Workshop on Physics for Machine Learning, 2023. Spotlight presentation (top 7% of all submitted papers).
We propose a novel multi-scale message passing neural network algorithm for learning the solutions of time-dependent PDEs. Our algorithm possesses both temporal and spatial multi-scale resolution features by incorporating multi-scale sequence models and graph gating modules in the encoder and processor, respectively. Benchmark numerical experiments are presented to demonstrate that the proposed algorithm outperforms baselines, particularly on a PDE with a range of spatial and temporal scales.
- ICLR: Gradient Gating for Deep Multi-Rate Learning on Graphs. T. Konstantin Rusch, Benjamin P. Chamberlain, Michael W. Mahoney, Michael M. Bronstein, and Siddhartha Mishra. The 11th International Conference on Learning Representations, 2023.
We present Gradient Gating (G^2), a novel framework for improving the performance of Graph Neural Networks (GNNs). Our framework is based on gating the output of GNN layers with a mechanism for multi-rate flow of message passing information across nodes of the underlying graph. Local gradients are harnessed to further modulate message passing updates. Our framework flexibly allows one to use any basic GNN layer as the building block around which the multi-rate gradient gating mechanism is wrapped. We rigorously prove that G^2 alleviates the oversmoothing problem and allows the design of deep GNNs. Empirical results are presented to demonstrate that the proposed framework achieves state-of-the-art performance on a variety of graph learning tasks, including on large-scale heterophilic graphs.
2022
- ICML: Graph-Coupled Oscillator Networks. T. Konstantin Rusch, Benjamin P. Chamberlain, James Rowbottom, Siddhartha Mishra, and Michael M. Bronstein. The 39th International Conference on Machine Learning, 2022.
We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear forced and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.
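In symbols, the second-order system described above takes roughly the following form, where F_θ is any basic GNN layer acting as the coupling function, σ a nonlinearity, and γ, α restoring and damping coefficients (a paraphrase of the abstract; see the paper for the exact equations and discretization):

```latex
X''(t) = \sigma\big(F_{\theta}(X(t), t)\big) - \gamma\,X(t) - \alpha\,X'(t)
```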
- ICLR Spotlight: Long Expressive Memory for Sequence Modeling. T. Konstantin Rusch, Siddhartha Mishra, N. Benjamin Erichson, and Michael W. Mahoney. The 10th International Conference on Learning Representations, 2022. Spotlight presentation (top 6% of all submitted papers).
We propose a novel method called Long Expressive Memory (LEM) for learning long-term sequential dependencies. LEM is gradient-based, it can efficiently process sequential tasks with very long-term dependencies, and it is sufficiently expressive to be able to learn complicated input-output maps. To derive LEM, we consider a system of multiscale ordinary differential equations, as well as a suitable time-discretization of this system. For LEM, we derive rigorous bounds to show the mitigation of the exploding and vanishing gradients problem, a well-known challenge for gradient-based recurrent sequential learning methods. We also prove that LEM can approximate a large class of dynamical systems to high accuracy. Our empirical results, ranging from image and time-series classification through dynamical systems prediction to speech recognition and language modeling, demonstrate that LEM outperforms state-of-the-art recurrent neural networks, gated recurrent units, and long short-term memory models.
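The multiscale idea can be sketched as a recurrent cell in which two learned, input- and state-dependent time steps gate the updates of two hidden states, as in the toy NumPy cell below. Weight shapes, gate placement, and nonlinearities are illustrative assumptions in the spirit of the abstract, not the exact LEM equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_scale_cell(u, y, z, params, dt=1.0):
    """One step of a toy two-gate multiscale recurrent cell: learned time steps
    dt1, dt2 rescale the updates of the hidden states z and y (illustrative)."""
    W1, V1, W2, V2, Wz, Vz, Wy, Vy = params
    dt1 = dt * sigmoid(W1 @ y + V1 @ u)           # learned time scale for z
    dt2 = dt * sigmoid(W2 @ y + V2 @ u)           # learned time scale for y
    z = (1 - dt1) * z + dt1 * np.tanh(Wz @ y + Vz @ u)
    y = (1 - dt2) * y + dt2 * np.tanh(Wy @ z + Vy @ u)
    return y, z

# toy usage: hidden size 4, scalar inputs, sequence of length 20
rng = np.random.default_rng(0)
h, d_in = 4, 1
params = [rng.standard_normal((h, h)) if i % 2 == 0 else rng.standard_normal((h, d_in))
          for i in range(8)]
y = z = np.zeros(h)
for u_t in rng.standard_normal((20, d_in)):
    y, z = two_scale_cell(u_t, y, z, params)
```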
2021
- SISC: Higher-Order Quasi-Monte Carlo Training of Deep Neural Networks. Marcello Longo, Siddhartha Mishra, T. Konstantin Rusch, and Christoph Schwab. SIAM Journal on Scientific Computing, 2021.
We present a novel algorithmic approach and an error analysis leveraging Quasi-Monte Carlo points for training deep neural network (DNN) surrogates of Data-to-Observable (DtO) maps in engineering design. Our analysis reveals higher-order consistent, deterministic choices of training points in the input data space for deep and shallow neural networks with holomorphic activation functions such as tanh. These novel training points are proved to facilitate higher-order decay (in terms of the number of training samples) of the underlying generalization error, with consistency error bounds that are free from the curse of dimensionality in the input data space, provided that DNN weights in hidden layers satisfy certain summability conditions. We present numerical experiments for DtO maps from elliptic and parabolic PDEs with uncertain inputs that confirm the theoretical analysis.
- ICML: UnICORNN: A recurrent model for learning very long time dependencies. T. Konstantin Rusch and Siddhartha Mishra. The 38th International Conference on Machine Learning, 2021.
The design of recurrent neural networks (RNNs) to accurately process sequential inputs with long-time dependencies is very challenging on account of the exploding and vanishing gradient problem. To overcome this, we propose a novel RNN architecture which is based on a structure-preserving discretization of a Hamiltonian system of second-order ordinary differential equations that models networks of oscillators. The resulting RNN is fast, invertible (in time), memory efficient, and we derive rigorous bounds on the hidden state gradients to prove the mitigation of the exploding and vanishing gradient problem. A suite of experiments is presented to demonstrate that the proposed RNN provides state-of-the-art performance on a variety of learning tasks with (very) long-time dependencies.
- ICLR Oral: Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies. T. Konstantin Rusch and Siddhartha Mishra. The 9th International Conference on Learning Representations, 2021. Oral presentation (top 1% of all submitted papers).
Circuits of biological neurons, such as in the functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data.
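A minimal sketch of this kind of controlled-oscillator recurrence is given below in NumPy, using a simple explicit-Euler-style step of y'' = tanh(W y + W2 y' + V u + b) - gamma*y - eps*y'. The parameter names, default constants, and the fully explicit update are illustrative assumptions; the paper works with a specific time-discretization for which the gradient bounds are proven.

```python
import numpy as np

def oscillator_rnn_step(u, y, z, W, W2, V, b, dt=0.05, gamma=1.0, eps=0.1):
    """One explicit step of a toy controlled-oscillator RNN cell integrating
    y'' = tanh(W y + W2 y' + V u + b) - gamma*y - eps*y' (illustrative sketch).
    y: hidden "positions", z: hidden "velocities"."""
    z = z + dt * (np.tanh(W @ y + W2 @ z + V @ u + b) - gamma * y - eps * z)
    y = y + dt * z
    return y, z

# toy usage: hidden size 8, scalar inputs, sequence of length 100
rng = np.random.default_rng(0)
y = z = np.zeros(8)
W, W2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
V, b = rng.standard_normal((8, 1)), np.zeros(8)
for u_t in rng.standard_normal((100, 1)):
    y, z = oscillator_rnn_step(u_t, y, z, W, W2, V, b)
```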
- SINUM: Enhancing accuracy of deep learning algorithms by training with low-discrepancy sequences. Siddhartha Mishra and T. Konstantin Rusch. SIAM Journal on Numerical Analysis, 2021.
We propose a deep supervised learning algorithm based on low-discrepancy sequences as the training set. By a combination of theoretical arguments and extensive numerical experiments we demonstrate that the proposed algorithm significantly outperforms standard deep learning algorithms that are based on randomly chosen training data, for problems in moderately high dimensions. The proposed algorithm provides an efficient method for building inexpensive surrogates for many underlying maps in the context of scientific computing.
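The training-set construction can be illustrated with SciPy's quasi-Monte Carlo module: draw the training inputs from a low-discrepancy (here scrambled Sobol') sequence instead of i.i.d. uniform samples and label them with the underlying map, leaving the rest of the training pipeline unchanged. The map f, the dimension, and the set size below are placeholders for illustration.

```python
import numpy as np
from scipy.stats import qmc

def f(x):
    # placeholder for the underlying (possibly expensive) map to be learned
    return np.sin(2 * np.pi * x[:, :1]) * np.cos(np.pi * x[:, 1:2])

d, n_train = 4, 1024
x_train = qmc.Sobol(d=d, scramble=True, seed=0).random(n_train)  # low-discrepancy inputs in [0,1]^d
y_train = f(x_train)                                             # labels from the underlying map

x_random = np.random.default_rng(0).random((n_train, d))         # i.i.d. baseline for comparison
# ... train the same network on (x_train, y_train) vs. (x_random, f(x_random))
# and compare generalization errors, as in the paper's experiments.
```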