Publications
2025
- arXiv: Learning to Dissipate Energy in Oscillatory State-Space Models. Jared Boyer, T. Konstantin Rusch, and Daniela Rus. arXiv preprint, 2025.
State-space models (SSMs) are a class of networks for sequence learning that benefit from fixed state size and linear complexity with respect to sequence length, contrasting the quadratic scaling of typical attention mechanisms. Inspired by observations in neuroscience, Linear Oscillatory State-Space models (LinOSS) are a recently proposed class of SSMs constructed from layers of discretized forced harmonic oscillators. Although these models perform competitively, leveraging fast parallel scans over diagonal recurrent matrices and achieving state-of-the-art performance on tasks with sequence length up to 50k, LinOSS models rely on rigid energy dissipation ("forgetting") mechanisms that are inherently coupled to the timescale of state evolution. As forgetting is a crucial mechanism for long-range reasoning, we demonstrate the representational limitations of these models and introduce Damped Linear Oscillatory State-Space models (D-LinOSS), a more general class of oscillatory SSMs that learn to dissipate latent state energy on multiple timescales. We analyze the spectral distribution of the model’s recurrent matrices and prove that the SSM layers exhibit stable dynamics under simple, flexible parameterizations. D-LinOSS consistently outperforms previous LinOSS methods on long-range learning tasks, without introducing additional complexity, and simultaneously reduces the hyperparameter search space by 50%.
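For readers unfamiliar with the construction, the underlying continuous-time systems can be sketched as follows; the symbols (A, G, B, u) and the exact parameterization are an illustrative paraphrase of the abstract, not the paper's notation. LinOSS layers follow undamped forced harmonic oscillators, while D-LinOSS adds a learned damping term so that energy dissipation is no longer tied to the oscillation timescale:

```latex
\begin{aligned}
\text{LinOSS (undamped):}\quad & \ddot{x}(t) = -A\,x(t) + B\,u(t),\\
\text{D-LinOSS (damped):}\quad & \ddot{x}(t) = -A\,x(t) - G\,\dot{x}(t) + B\,u(t),
\end{aligned}
```

with diagonal, nonnegative A (setting the oscillation frequencies) and G (setting the dissipation rates), input signal u(t), and input matrix B; learning G alongside A is what allows energy to be dissipated on multiple timescales.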
- ICLR Oral: Oscillatory State-Space Models. T. Konstantin Rusch and Daniela Rus. The 13th International Conference on Learning Representations, 2025. Oral presentation (top 1.8% of all submitted papers).
We propose Linear Oscillatory State-Space models (LinOSS) for efficiently learning on long sequences. Inspired by cortical dynamics of biological neural networks, we base our proposed LinOSS model on a system of forced harmonic oscillators. A stable discretization, integrated over time using fast associative parallel scans, yields the proposed state-space model. We prove that LinOSS produces stable dynamics requiring only a nonnegative diagonal state matrix. This is in stark contrast to many previous state-space models relying heavily on restrictive parameterizations. Moreover, we rigorously show that LinOSS is universal, i.e., it can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. In addition, we show that an implicit-explicit discretization of LinOSS perfectly conserves the symmetry of time reversibility of the underlying dynamics. Together, these properties enable efficient modeling of long-range interactions, while ensuring stable and accurate long-horizon forecasting. Finally, our empirical results, spanning a wide range of time-series tasks from mid-range to very long-range classification and regression, as well as long-horizon forecasting, demonstrate that our proposed LinOSS model consistently outperforms state-of-the-art sequence models. Notably, LinOSS outperforms Mamba by nearly 2x and LRU by 2.5x on a sequence modeling task with sequences of length 50k.
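To make the mechanism concrete, here is a minimal NumPy sketch of one oscillatory state-space layer run as a plain sequential recurrence over a discretized forced harmonic oscillator. The symplectic-Euler step and the names (A, B, C, dt) are illustrative assumptions; the actual LinOSS layer uses the specific implicit/implicit-explicit discretizations analyzed in the paper and evaluates the recurrence with fast associative parallel scans rather than a Python loop.

```python
import numpy as np

def oscillatory_ssm_layer(u, A, B, C, dt=0.1):
    """Toy oscillatory state-space layer, run sequentially.

    u : (T, d_in) input sequence
    A : (d_h,) nonnegative diagonal "stiffness" (squared oscillation frequencies)
    B : (d_h, d_in) input matrix, C : (d_out, d_h) readout matrix
    Integrates y'' = -A y + B u with a symplectic-Euler step (illustrative only).
    """
    y = np.zeros(A.shape[0])               # oscillator positions (hidden state)
    z = np.zeros(A.shape[0])               # oscillator velocities
    outputs = []
    for t in range(u.shape[0]):
        z = z + dt * (-A * y + B @ u[t])   # velocity update (explicit in y)
        y = y + dt * z                     # position update (implicit in z)
        outputs.append(C @ y)
    return np.stack(outputs)               # (T, d_out)

# toy usage: scalar input of length 100, 8 oscillators, scalar output
rng = np.random.default_rng(0)
out = oscillatory_ssm_layer(
    rng.standard_normal((100, 1)),
    A=rng.uniform(0.0, 1.0, 8),
    B=rng.standard_normal((8, 1)),
    C=rng.standard_normal((1, 8)),
)
```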
- FPI: Low Stein Discrepancy via Message-Passing Monte Carlo. Nathan Kirk, T. Konstantin Rusch, Jakob Zech, and Daniela Rus. ICLR Workshop on Frontiers in Probabilistic Inference, 2025.
Message-Passing Monte Carlo (MPMC) was recently introduced as a novel low-discrepancy sampling approach leveraging tools from geometric deep learning. While originally designed for generating uniform point sets, we extend this framework to sample from general multivariate probability distributions with a known probability density function. Our proposed method, Stein-Message-Passing Monte Carlo (Stein-MPMC), minimizes a kernelized Stein discrepancy, ensuring improved sample quality. Finally, we show that Stein-MPMC outperforms competing methods, such as Stein Variational Gradient Descent and (greedy) Stein Points, by achieving a lower Stein discrepancy.
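As an illustration of the objective being minimized, the NumPy sketch below estimates the (squared) kernelized Stein discrepancy of a sample against a standard Gaussian target using an RBF kernel. This is a generic V-statistic KSD estimator for illustration only; the Gaussian target, the kernel, and the bandwidth h are assumptions, and it is not the Stein-MPMC training code.

```python
import numpy as np

def ksd_gaussian(x, h=1.0):
    """Squared kernelized Stein discrepancy (V-statistic) of points x against a
    standard Gaussian target, with RBF kernel k(a, b) = exp(-|a-b|^2 / (2 h^2)).
    x : (n, d) sample; smaller values indicate a better fit to the target."""
    n, d = x.shape
    score = -x                                   # grad log p(x) for N(0, I)
    diff = x[:, None, :] - x[None, :, :]         # (n, n, d)
    sq = np.sum(diff ** 2, axis=-1)              # (n, n) squared distances
    k = np.exp(-sq / (2 * h ** 2))               # kernel matrix
    grad_kx = -diff / h ** 2 * k[..., None]      # d/dx_i k(x_i, x_j)
    grad_ky = diff / h ** 2 * k[..., None]       # d/dx_j k(x_i, x_j)
    trace_term = (d / h ** 2 - sq / h ** 4) * k  # trace of mixed second derivative
    u = (k * (score @ score.T)
         + np.einsum('id,ijd->ij', score, grad_ky)
         + np.einsum('jd,ijd->ij', score, grad_kx)
         + trace_term)
    return u.mean()

# i.i.d. Gaussian points already give a small value; Stein-type methods aim to
# push it lower for the same number of points.
print(ksd_gaussian(np.random.default_rng(0).standard_normal((256, 2))))
```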
- WRL: Improving Efficiency of Sampling-based Motion Planning via Message-Passing Monte Carlo. Makram Chahine, T. Konstantin Rusch, Zach J. Patterson, and Daniela Rus. ICLR Workshop on Robot Learning, 2025.
Sampling-based motion planning methods, while effective in high-dimensional spaces, often suffer from inefficiencies due to irregular sampling distributions, leading to suboptimal exploration of the configuration space. In this paper, we propose an approach that enhances the efficiency of these methods by utilizing low-discrepancy distributions generated through Message-Passing Monte Carlo (MPMC). MPMC leverages Graph Neural Networks (GNNs) to generate point sets that uniformly cover the space, with uniformity assessed using the L_p-discrepancy measure, which quantifies the irregularity of sample distributions. By improving the uniformity of the point sets, our approach significantly reduces computational overhead and the number of samples required for solving motion planning problems. Experimental results demonstrate that our method outperforms traditional sampling techniques in terms of planning efficiency.
- MLGenX Spotlight: Relaxed Equivariance via Multitask Learning. Ahmed A. Elhag, T. Konstantin Rusch, Francesco Di Giovanni, and Michael M. Bronstein. ICLR Workshop on Machine Learning for Genomics Explorations, 2025. Spotlight presentation.
Incorporating equivariance as an inductive bias into deep learning architectures to take advantage of data symmetry has been successful in multiple applications, such as chemistry and dynamical systems. In particular, roto-translations are crucial for effectively modeling geometric graphs and molecules, where understanding the 3D structures enhances generalization. However, equivariant models often pose challenges due to their high computational complexity. In this paper, we introduce REMUL, a training procedure for approximating equivariance with multitask learning. We show that unconstrained models (which do not build equivariance into the architecture) can learn approximate symmetries by minimizing an additional simple equivariance loss. By formulating equivariance as a new learning objective, we can control the level of approximate equivariance in the model. Our method achieves competitive performance compared to equivariant baselines while being 10x faster at inference and 2.5x faster at training.
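The additional equivariance loss can be illustrated for planar rotations as follows: a PyTorch sketch that penalizes the mismatch between f(g·x) and g·f(x) over sampled rotations and is simply added to the task loss as a second objective. The model interface, the weight lam, and the restriction to 2D rotations are hypothetical choices for illustration, not the REMUL implementation.

```python
import torch

def rotation_equivariance_loss(model, x, n_samples=4):
    """Average ||f(R x) - R f(x)||^2 over random planar rotations R, for a model
    mapping 2D point clouds (B, N, 2) to 2D point clouds (B, N, 2)."""
    loss = 0.0
    fx = model(x)
    for _ in range(n_samples):
        theta = torch.rand(()) * 2 * torch.pi
        c, s = torch.cos(theta), torch.sin(theta)
        rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])  # (2, 2)
        loss = loss + ((model(x @ rot.T) - fx @ rot.T) ** 2).mean()
    return loss / n_samples

# multitask objective (lam is a hypothetical weight controlling how strictly
# equivariance is enforced):
# total_loss = task_loss(model(x), y) + lam * rotation_equivariance_loss(model, x)
```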
2024
- PNAS: Message-Passing Monte Carlo: Generating low-discrepancy point sets via Graph Neural Networks. T. Konstantin Rusch, Nathan Kirk, Michael M. Bronstein, Christiane Lemieux, and Daniela Rus. Proceedings of the National Academy of Sciences, 2024.
Discrepancy is a well-known measure for the irregularity of the distribution of a point set. Point sets with small discrepancy are called low-discrepancy and are known to efficiently fill the space in a uniform manner. Low-discrepancy points play a central role in many problems in science and engineering, including numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. In this work, we present the first machine learning approach to generate a new class of low-discrepancy point sets named Message-Passing Monte Carlo (MPMC) points. Motivated by the geometric nature of generating low-discrepancy point sets, we leverage tools from Geometric Deep Learning and base our model on Graph Neural Networks. We further provide an extension of our framework to higher dimensions, which flexibly allows the generation of custom-made points that emphasize the uniformity in specific dimensions that are primarily important for the particular problem at hand. Finally, we demonstrate that our proposed model achieves state-of-the-art performance superior to previous methods by a significant margin. In fact, MPMC points are empirically shown to be either optimal or near-optimal with respect to the discrepancy in low dimensions and for small numbers of points, i.e., in regimes where the optimal discrepancy can be determined.
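For concreteness, the L2-type star discrepancy that quantifies this notion of uniformity has a closed form (Warnock's formula); a small NumPy sketch is given below. This is a generic quality check for point sets, not the MPMC model or its training objective.

```python
import numpy as np
from scipy.stats import qmc

def l2_star_discrepancy(x):
    """L2 star discrepancy of points x in [0, 1]^d via Warnock's formula.
    x : (n, d) array; smaller values indicate a more uniform point set."""
    n, d = x.shape
    term1 = 3.0 ** (-d)
    term2 = 2.0 / n * np.prod((1.0 - x ** 2) / 2.0, axis=1).sum()
    pairwise_max = np.maximum(x[:, None, :], x[None, :, :])       # (n, n, d)
    term3 = np.prod(1.0 - pairwise_max, axis=-1).sum() / n ** 2
    return np.sqrt(term1 - term2 + term3)

# sanity check: a scrambled Sobol' set beats i.i.d. uniform points
print(l2_star_discrepancy(np.random.default_rng(0).random((256, 2))))
print(l2_star_discrepancy(qmc.Sobol(d=2, scramble=True, seed=0).random(256)))
```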
- TMLR: How does over-squashing affect the power of GNNs? Francesco Di Giovanni, T. Konstantin Rusch, Michael M. Bronstein, Andreea Deac, Marc Lackenby, Siddhartha Mishra, and Petar Veličković. Transactions on Machine Learning Research, 2024.
Graph Neural Networks (GNNs) are the state-of-the-art model for machine learning on graph-structured data. The most popular class of GNNs operate by exchanging information between adjacent nodes, and are known as Message Passing Neural Networks (MPNNs). Given their widespread use, understanding the expressive power of MPNNs is a key question. However, existing results typically consider settings with uninformative node features. In this paper, we provide a rigorous analysis to determine which function classes of node features can be learned by an MPNN of a given capacity. We do so by measuring the level of pairwise interactions between nodes that MPNNs allow for. This measure provides a novel quantitative characterization of the so-called over-squashing effect, which is observed to occur when a large volume of messages is aggregated into fixed-size vectors. Using our measure, we prove that, to guarantee sufficient communication between pairs of nodes, the capacity of the MPNN must be large enough, depending on properties of the input graph structure, such as commute times. For many relevant scenarios, our analysis results in impossibility statements in practice, showing that over-squashing hinders the expressive power of MPNNs. We validate our theoretical findings through extensive controlled experiments and ablation studies.
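One of the graph properties the capacity requirement is tied to, the commute time between pairs of nodes, can be computed directly from the pseudoinverse of the graph Laplacian via the standard identity C(u, v) = 2|E| (L+_{uu} + L+_{vv} - 2 L+_{uv}). The NumPy sketch below illustrates that quantity only; it is not the paper's over-squashing measure.

```python
import numpy as np

def commute_times(adj):
    """Pairwise commute times of a connected, undirected graph from its adjacency
    matrix, using the Moore-Penrose pseudoinverse of the graph Laplacian."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    lp = np.linalg.pinv(laplacian)
    two_m = adj.sum()                      # 2|E| for an undirected graph
    diag = np.diag(lp)
    return two_m * (diag[:, None] + diag[None, :] - 2 * lp)

# toy usage: path graph on 4 nodes; the two endpoints have the largest commute time
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(commute_times(adj))
```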
2023
- NeurIPS: Neural Oscillators are Universal. Samuel Lanthaler, T. Konstantin Rusch, and Siddhartha Mishra. The 37th Conference on Neural Information Processing Systems, 2023.
Coupled oscillators are being increasingly used as the basis of machine learning (ML) architectures, for instance in sequence modeling, graph representation learning and in physical neural networks that are used in analog ML devices. We introduce an abstract class of neural oscillators that encompasses these architectures and prove that neural oscillators are universal, i.e., they can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. This universality result provides theoretical justification for the use of oscillator-based ML systems. The proof builds on a fundamental result of independent interest, which shows that a combination of forced harmonic oscillators with a nonlinear read-out suffices to approximate the underlying operators.
- PhD thesis: Physics-inspired Machine Learning. T. Konstantin Rusch. PhD thesis, 2023.
Physics-inspired machine learning can be seen as incorporating structure from physical systems (e.g., given by ordinary or partial differential equations) into machine learning methods to obtain models with better inductive biases. In this thesis, we provide several of the earliest examples of such methods in the fields of sequence modelling and graph representation learning. We subsequently show that physics-inspired inductive biases can be leveraged to mitigate important and central issues in each particular field. More concretely, we demonstrate that systems of coupled nonlinear oscillators and Hamiltonian systems lead to recurrent sequence models that are able to process sequential interactions over long time scales by mitigating the exploding and vanishing gradients problem. Additionally, we rigorously prove that neural systems of oscillators are universal approximators for continuous and causal operators. Moreover, we show that sequence models derived from multiscale dynamical systems not only mitigate the exploding and vanishing gradients problem (and are thus able to learn long-term dependencies), but equally importantly yield expressive models for learning on (real-world) multiscale data. We further show the impact of physics-inspired approaches on graph representation learning. In particular, systems of graph-coupled nonlinear oscillators provide a powerful framework for learning on graphs that allows for stacking many graph neural network (GNN) layers on top of each other. Thereby, we prove that these systems mitigate the oversmoothing issue in GNNs, where node features exponentially converge to the same constant node vector for an increasing number of GNN layers. Finally, we propose to incorporate multiple rates that are inferred from the underlying graph data into the message-passing framework of GNNs. Moreover, we leverage the graph gradient modulated through gating functions to obtain multiple rates that automatically mitigate the oversmoothing issue. We extensively test all proposed methods on a variety of versatile synthetic and real-world datasets, ranging from image recognition, speech recognition, natural language processing (NLP), medical applications, and scientific computing for sequence models, to citation networks, computational chemistry applications, and article and website networks for graph learning models.
- arXiv: A Survey on Oversmoothing in Graph Neural Networks. T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. arXiv preprint, 2023.
Node features of graph neural networks (GNNs) tend to become more similar as the network depth increases. This effect is known as over-smoothing, which we axiomatically define as the exponential convergence of suitable similarity measures on the node features. Our definition unifies previous approaches and gives rise to new quantitative measures of over-smoothing. Moreover, we empirically demonstrate this behavior for several over-smoothing measures on different graphs (small-, medium-, and large-scale). We also review several approaches for mitigating over-smoothing and empirically test their effectiveness on real-world graph datasets. Through illustrative examples, we demonstrate that mitigating over-smoothing is a necessary but not sufficient condition for building deep GNNs that are expressive on a wide range of graph learning tasks. Finally, we extend our definition of over-smoothing to the rapidly emerging field of continuous-time GNNs.
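A typical similarity measure of this kind is a Dirichlet energy of the node features. The NumPy sketch below uses the plain unnormalized version and tracks its decay under repeated neighbourhood averaging, purely to illustrate how the exponential convergence can be measured; the survey's exact measures and normalizations may differ.

```python
import numpy as np

def dirichlet_energy(x, adj):
    """Unnormalized Dirichlet energy: sum over edges (i, j) of ||x_i - x_j||^2 for
    node features x (n, d) on a graph with symmetric adjacency matrix adj (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return 0.5 * np.sum(adj * np.sum(diff ** 2, axis=-1))  # 0.5: each edge counted twice

# repeated neighbourhood averaging (a crude stand-in for deep message passing)
# drives the energy towards zero exponentially fast
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
prop = adj / adj.sum(axis=1, keepdims=True)   # row-normalized averaging operator
x = rng.standard_normal((4, 8))
for layer in range(10):
    x = prop @ x
    print(layer, dirichlet_energy(x, adj))
```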
- Physics4ML Spotlight: Multi-Scale Message Passing Neural PDE Solvers. Léonard Equer, T. Konstantin Rusch, and Siddhartha Mishra. ICLR Workshop on Physics for Machine Learning, 2023. Spotlight presentation (top 7% of all submitted papers).
We propose a novel multi-scale message passing neural network algorithm for learning the solutions of time-dependent PDEs. Our algorithm possesses both temporal and spatial multi-scale resolution features by incorporating multi-scale sequence models and graph gating modules in the encoder and processor, respectively. Benchmark numerical experiments are presented to demonstrate that the proposed algorithm outperforms baselines, particularly on a PDE with a range of spatial and temporal scales.
- ICLR: Gradient Gating for Deep Multi-Rate Learning on Graphs. T. Konstantin Rusch, Benjamin P. Chamberlain, Michael W. Mahoney, Michael M. Bronstein, and Siddhartha Mishra. The 11th International Conference on Learning Representations, 2023.
We present Gradient Gating (G^2), a novel framework for improving the performance of Graph Neural Networks (GNNs). Our framework is based on gating the output of GNN layers with a mechanism for multi-rate flow of message passing information across nodes of the underlying graph. Local gradients are harnessed to further modulate message passing updates. Our framework flexibly allows one to use any basic GNN layer as the building block around which the multi-rate gradient gating mechanism is wrapped. We rigorously prove that G^2 alleviates the oversmoothing problem and allows the design of deep GNNs. Empirical results are presented to demonstrate that the proposed framework achieves state-of-the-art performance on a variety of graph learning tasks, including on large-scale heterophilic graphs.
2022
- ICML: Graph-Coupled Oscillator Networks. T. Konstantin Rusch, Benjamin P. Chamberlain, James Rowbottom, Siddhartha Mishra, and Michael M. Bronstein. The 39th International Conference on Machine Learning, 2022.
We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear forced and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.
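In symbols, the second-order system described above takes roughly the following form, where F_θ is any basic GNN layer acting as the coupling function, σ a nonlinearity, and γ, α restoring and damping coefficients (a paraphrase of the abstract; see the paper for the exact equations and discretization):

```latex
X''(t) = \sigma\big(F_{\theta}(X(t), t)\big) - \gamma\,X(t) - \alpha\,X'(t)
```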
- ICLR Spotlight: Long Expressive Memory for Sequence Modeling. T. Konstantin Rusch, Siddhartha Mishra, N. Benjamin Erichson, and Michael W. Mahoney. The 10th International Conference on Learning Representations, 2022. Spotlight presentation (top 6% of all submitted papers).
We propose a novel method called Long Expressive Memory (LEM) for learning long-term sequential dependencies. LEM is gradient-based, it can efficiently process sequential tasks with very long-term dependencies, and it is sufficiently expressive to be able to learn complicated input-output maps. To derive LEM, we consider a system of multiscale ordinary differential equations, as well as a suitable time-discretization of this system. For LEM, we derive rigorous bounds to show the mitigation of the exploding and vanishing gradients problem, a well-known challenge for gradient-based recurrent sequential learning methods. We also prove that LEM can approximate a large class of dynamical systems to high accuracy. Our empirical results, ranging from image and time-series classification through dynamical systems prediction to speech recognition and language modeling, demonstrate that LEM outperforms state-of-the-art recurrent neural networks, gated recurrent units, and long short-term memory models.
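The multiscale idea can be sketched as a recurrent cell in which two learned, input- and state-dependent time steps gate the updates of two hidden states, as in the toy NumPy cell below. Weight shapes, gate placement, and nonlinearities are illustrative assumptions in the spirit of the abstract, not the exact LEM equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_scale_cell(u, y, z, params, dt=1.0):
    """One step of a toy two-gate multiscale recurrent cell: learned time steps
    dt1, dt2 rescale the updates of the hidden states z and y (illustrative)."""
    W1, V1, W2, V2, Wz, Vz, Wy, Vy = params
    dt1 = dt * sigmoid(W1 @ y + V1 @ u)           # learned time scale for z
    dt2 = dt * sigmoid(W2 @ y + V2 @ u)           # learned time scale for y
    z = (1 - dt1) * z + dt1 * np.tanh(Wz @ y + Vz @ u)
    y = (1 - dt2) * y + dt2 * np.tanh(Wy @ z + Vy @ u)
    return y, z

# toy usage: hidden size 4, scalar inputs, sequence of length 20
rng = np.random.default_rng(0)
h, d_in = 4, 1
params = [rng.standard_normal((h, h)) if i % 2 == 0 else rng.standard_normal((h, d_in))
          for i in range(8)]
y = z = np.zeros(h)
for u_t in rng.standard_normal((20, d_in)):
    y, z = two_scale_cell(u_t, y, z, params)
```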
2021
- SISC: Higher-Order Quasi-Monte Carlo Training of Deep Neural Networks. Marcello Longo, Siddhartha Mishra, T. Konstantin Rusch, and Christoph Schwab. SIAM Journal on Scientific Computing, 2021.
We present a novel algorithmic approach and an error analysis leveraging Quasi-Monte Carlo points for training deep neural network (DNN) surrogates of Data-to-Observable (DtO) maps in engineering design. Our analysis reveals higher-order consistent, deterministic choices of training points in the input data space for deep and shallow neural networks with holomorphic activation functions such as tanh. These novel training points are proved to facilitate higher-order decay (in terms of the number of training samples) of the underlying generalization error, with consistency error bounds that are free from the curse of dimensionality in the input data space, provided that DNN weights in hidden layers satisfy certain summability conditions. We present numerical experiments for DtO maps from elliptic and parabolic PDEs with uncertain inputs that confirm the theoretical analysis.
- ICML: UnICORNN: A recurrent model for learning very long time dependencies. T. Konstantin Rusch and Siddhartha Mishra. The 38th International Conference on Machine Learning, 2021.
The design of recurrent neural networks (RNNs) to accurately process sequential inputs with long-time dependencies is very challenging on account of the exploding and vanishing gradient problem. To overcome this, we propose a novel RNN architecture which is based on a structure-preserving discretization of a Hamiltonian system of second-order ordinary differential equations that models networks of oscillators. The resulting RNN is fast, invertible (in time), memory efficient, and we derive rigorous bounds on the hidden state gradients to prove the mitigation of the exploding and vanishing gradient problem. A suite of experiments is presented to demonstrate that the proposed RNN provides state-of-the-art performance on a variety of learning tasks with (very) long-time dependencies.
- ICLR Oral: Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies. T. Konstantin Rusch and Siddhartha Mishra. The 9th International Conference on Learning Representations, 2021. Oral presentation (top 1% of all submitted papers).
Circuits of biological neurons, such as in the functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data.
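A minimal sketch of this kind of controlled-oscillator recurrence is given below in NumPy, using a simple explicit-Euler-style step of y'' = tanh(W y + W2 y' + V u + b) - gamma*y - eps*y'. The parameter names, default constants, and the fully explicit update are illustrative assumptions; the paper works with a specific time-discretization for which the gradient bounds are proven.

```python
import numpy as np

def oscillator_rnn_step(u, y, z, W, W2, V, b, dt=0.05, gamma=1.0, eps=0.1):
    """One explicit step of a toy controlled-oscillator RNN cell integrating
    y'' = tanh(W y + W2 y' + V u + b) - gamma*y - eps*y' (illustrative sketch).
    y: hidden "positions", z: hidden "velocities"."""
    z = z + dt * (np.tanh(W @ y + W2 @ z + V @ u + b) - gamma * y - eps * z)
    y = y + dt * z
    return y, z

# toy usage: hidden size 8, scalar inputs, sequence of length 100
rng = np.random.default_rng(0)
y = z = np.zeros(8)
W, W2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
V, b = rng.standard_normal((8, 1)), np.zeros(8)
for u_t in rng.standard_normal((100, 1)):
    y, z = oscillator_rnn_step(u_t, y, z, W, W2, V, b)
```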
- SINUM: Enhancing accuracy of deep learning algorithms by training with low-discrepancy sequences. Siddhartha Mishra and T. Konstantin Rusch. SIAM Journal on Numerical Analysis, 2021.
We propose a deep supervised learning algorithm based on low-discrepancy sequences as the training set. By a combination of theoretical arguments and extensive numerical experiments we demonstrate that the proposed algorithm significantly outperforms standard deep learning algorithms that are based on randomly chosen training data, for problems in moderately high dimensions. The proposed algorithm provides an efficient method for building inexpensive surrogates for many underlying maps in the context of scientific computing.
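The training-set construction can be illustrated with SciPy's quasi-Monte Carlo module: draw the training inputs from a low-discrepancy (here scrambled Sobol') sequence instead of i.i.d. uniform samples and label them with the underlying map, leaving the rest of the training pipeline unchanged. The map f, the dimension, and the set size below are placeholders for illustration.

```python
import numpy as np
from scipy.stats import qmc

def f(x):
    # placeholder for the underlying (possibly expensive) map to be learned
    return np.sin(2 * np.pi * x[:, :1]) * np.cos(np.pi * x[:, 1:2])

d, n_train = 4, 1024
x_train = qmc.Sobol(d=d, scramble=True, seed=0).random(n_train)  # low-discrepancy inputs in [0,1]^d
y_train = f(x_train)                                             # labels from the underlying map

x_random = np.random.default_rng(0).random((n_train, d))         # i.i.d. baseline for comparison
# ... train the same network on (x_train, y_train) vs. (x_random, f(x_random))
# and compare generalization errors, as in the paper's experiments.
```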