Program at a Glance
Conference days: Tue, Sept 2; Wed, Sept 3; Thu, Sept 4; Fri, Sept 5; Sat, Sept 6. Parallel events from different days are grouped by time slot below.

08:30  Registration (all day)
09:00  Registration (all day); Welcome Remarks (Auditorium Tamburi); Spatial Sound, Room Acoustics and Perception (Auditorium Tamburi); Deep Learning Methods, Effects and Data Analysis (Auditorium Tamburi)
09:30  Tutorial 1, Alessia Andò (Auditorium Tamburi); Virtual Analog (Auditorium Tamburi); Guided tour of Ancona: visit to the ancient historical town (meeting point: Piazza della Repubblica, at the stairs under the RAI sign)
10:30  Coffee Break (Foyer Tamburi) and Sponsor Demo 1 (Sala Boxe); Coffee Break (Foyer Tamburi) and Sponsor Demo 2 (Sala Boxe)
11:00  Coffee Break (Foyer Tamburi); Spatial Sound, Room Acoustics and Perception (Sala Boxe); Deep Learning Methods, Effects and Data Analysis (Sala Boxe)
11:30  Tutorial 2, Balázs Bank (Auditorium Tamburi); Virtual Analog (Sala Boxe)
12:00  Keynote 2, Gaël Richard (Auditorium Tamburi); Keynote 3, Andrew McPherson (Auditorium Tamburi)
12:30  Keynote 1, Johanna Devaney (Auditorium Tamburi)
13:00  Lunch (Foyer Tamburi)
13:30  Lunch (Foyer Tamburi)
14:30  Tutorial 3, Gloria Dal Santo (Auditorium Tamburi); Physical Modeling (Auditorium Tamburi); Signal Processing (Auditorium Tamburi); Deep Learning for Synthesis (Auditorium Tamburi)
15:45  Awards and Closing Session (Auditorium Tamburi)
16:00  Coffee Break (Foyer Tamburi)
16:15  Handover Address (Auditorium Tamburi)
16:30  Tutorial 4, Annika Neidhardt (Auditorium Tamburi); Physical Modeling (Sala Boxe); Virtual Analog and Physical Modeling (Sala Boxe); Signal Processing (Sala Boxe)
16:45  Board Meeting (venue TBA)
18:00  DAFx Welcome Aperitivo (Foyer Tamburi)
18:30  Aperitivo (Foyer Tamburi)
20:00  DAFx Banquet (MaWay); DAFx Concert: “Macchine Nostre” – A/V Performance for Italian Synthesizers (Auditorium Tamburi)

Color key (session types): Oral Sessions, Poster Sessions, Demo, Social Events, Other
Towards Efficient Emulation of Nonlinear Analog Circuits for Audio Using Constraint Stabilization and Convex Quadratic Programming
Miguel Zea and Luis A. Rivera
Abstract: This paper introduces a computationally efficient method for
the emulation of nonlinear analog audio circuits by combining state-space representations, constraint stabilization, and convex quadratic programming (QP). Unlike traditional virtual analog (VA) modeling approaches or computationally demanding
SPICE-based simulations, our approach reformulates the nonlinear
differential-algebraic (DAE) systems that arise from analog circuit
analysis into numerically stable optimization problems. The proposed method efficiently addresses the numerical challenges posed
by nonlinear algebraic constraints via constraint stabilization techniques, significantly enhancing robustness and stability, suitable
for real-time simulations. A canonical diode clipper circuit is presented as a test case, demonstrating that our method achieves accurate and faster emulations compared to conventional state-space
methods. Furthermore, our method performs very well even at
substantially lower sampling rates. Preliminary numerical experiments confirm that the proposed approach offers improved numerical stability and real-time feasibility, positioning it as a practical
solution for high-fidelity audio applications.
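For orientation, the conventional iterative baseline the abstract compares against can be sketched in a few lines: a first-order diode clipper discretized with backward Euler and solved per sample by Newton iteration. Component values and tolerances below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative circuit constants (assumed, not from the paper)
R, C = 2.2e3, 10e-9          # series resistance, capacitance
Is, Vt = 2.52e-9, 25.85e-3   # diode saturation current, thermal voltage

def f(v, vin):
    # State derivative of the first-order diode clipper ODE
    return (vin - v) / (R * C) - (2 * Is / C) * np.sinh(v / Vt)

def df_dv(v):
    return -1.0 / (R * C) - (2 * Is / (C * Vt)) * np.cosh(v / Vt)

def clipper_backward_euler(x, fs, newton_iters=20, tol=1e-9):
    """Conventional baseline: backward Euler with per-sample Newton solves."""
    h = 1.0 / fs
    y = np.zeros_like(x)
    v = 0.0
    for n, vin in enumerate(x):
        v_new = v                       # warm-start from the previous sample
        for _ in range(newton_iters):
            g = v_new - v - h * f(v_new, vin)
            step = g / (1.0 - h * df_dv(v_new))
            v_new -= step
            if abs(step) < tol:
                break
        v = v_new
        y[n] = v
    return y
```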
Simplifying Antiderivative Antialiasing with Lookup Table Integration
Leonardo Gabrielli and Stefano Squartini
Abstract: Antiderivative Antialiasing (ADAA) has become a pivotal method for reducing aliasing when dealing with nonlinear functions at audio rate. However, its implementation requires analytical computation of the antiderivative of the nonlinear function, which in practical cases can be challenging without a symbolic solver. Moreover, when the nonlinear function is given by measurements, it
must be approximated to get a symbolic description. In this paper, we propose a simple approach to ADAA for practical applications that employs numerical integration of lookup tables (LUTs)
to approximate the antiderivative. This method eliminates the need
for closed-form solutions, streamlining the ADAA implementation
process in industrial applications. We analyze the trade-offs of this
approach, highlighting its computational efficiency and ease of implementation while discussing the potential impact of numerical
integration errors on aliasing performance. Experiments are conducted with static nonlinearities (tanh, a simple wavefolder and
the Buchla 259 wavefolding circuit) and a stateful nonlinear system (the diode clipper).
Anti-Aliasing of Neural Distortion Effects via Model Fine Tuning
Alistair Carson, Alec Wright and Stefan Bilbao
Abstract: Neural networks have become ubiquitous with guitar distortion
effects modelling in recent years. Despite their ability to yield
perceptually convincing models, they are susceptible to frequency
aliasing when driven by high frequency and high gain inputs.
Nonlinear activation functions create both the desired harmonic
distortion and unwanted aliasing distortion as the bandwidth of
the signal is expanded beyond the Nyquist frequency. Here, we
present a method for reducing aliasing in neural models via a
teacher-student fine tuning approach, where the teacher is a pretrained model with its weights frozen, and the student is a copy of
this with learnable parameters. The student is fine-tuned against
an aliasing-free dataset generated by passing sinusoids through
the original model and removing non-harmonic components from
the output spectra.
Our results show that this method significantly suppresses aliasing for both long short-term memory (LSTM) networks and temporal convolutional networks (TCNs). In the
majority of our case studies, the reduction in aliasing was greater
than that achieved by two times oversampling. One side-effect
of the proposed method is that harmonic distortion components
are also affected.
This adverse effect was found to be model-dependent, with the LSTM models giving the best balance between
anti-aliasing and preserving the perceived similarity to an analog
reference device.
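The aliasing-free target construction lends itself to a short sketch. Frame length and helper names are our assumptions; we also assume an integer number of periods per frame so that harmonics fall exactly on FFT bins.

```python
import numpy as np

def harmonic_target(teacher, f0, fs, periods=32):
    """Pass a sinusoid through the (frozen) teacher model and keep only the
    harmonic bins of the output spectrum; the result serves as an
    aliasing-free training target for the student."""
    N = int(round(periods * fs / f0))
    f0 = periods * fs / N                  # snap f0 onto an exact FFT bin
    n = np.arange(N)
    y = teacher(np.sin(2 * np.pi * f0 * n / fs))
    Y = np.fft.rfft(y)
    mask = np.zeros_like(Y)
    mask[::periods] = 1.0                  # DC and integer multiples of f0
    return np.fft.irfft(Y * mask, n=N)
```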
MorphDrive: Latent Conditioning for Cross-Circuit Effect Modeling and a Parametric Audio Dataset of Analog Overdrive Pedals
Francesco Ardan Dal Rí, Domenico Stefani, Luca Turchet and Nicola Conci
Abstract: In this paper, we present an approach to the neural modeling of
overdrive guitar pedals with conditioning from a cross-circuit and
cross-setting latent space. The resulting network models the behavior of multiple overdrive pedals across different settings, offering continuous morphing between real configurations and hybrid
behaviors. Compact conditioning spaces are obtained through unsupervised training of a variational autoencoder with adversarial
training, resulting in accurate reconstruction performance across
different sets of pedals. We then compare three Hyper-Recurrent
architectures for processing, including dynamic and static Hyper-RNNs, and a smaller model for real-time processing. Additionally,
we present pOD-set, a new open dataset including recordings of
27 analog overdrive pedals, each with 36 gain and tone parameter combinations totaling over 97 hours of recordings. Precise parameter setting was achieved through a custom-deployed recording
robot.
Impedance Synthesis for Hybrid Analog-Digital Audio Effects
Francisco Bernardo, Matthew Davison and Andrew McPherson
Abstract: Most real systems, from acoustics to analog electronics, are
characterised by bidirectional coupling amongst elements rather
than neat, unidirectional signal flows between self-contained modules. Integrating digital processing into physical domains becomes
a significant engineering challenge when the application requires
bidirectional coupling across the physical-digital boundary rather
than separate, well-defined inputs and outputs. We introduce an
approach to hybrid analog-digital audio processing using synthetic
impedance: digitally simulated circuit elements integrated into an
otherwise analog circuit. This approach combines the physicality and classic character of analog audio circuits with the
precision and flexibility of digital signal processing (DSP). Our
impedance synthesis system consists of a voltage-controlled current source and a microcontroller-based DSP system. We demonstrate our technique through modifying an iconic guitar distortion pedal, the Boss DS-1, showing the ability of the synthetic
impedance to both replicate and extend the behaviour of the pedal’s
diode clipping stage. We discuss the behaviour of the synthetic
impedance in isolated laboratory conditions and in the DS-1 pedal,
highlighting the technical and creative potential of the technique as
well as its practical limitations and future extensions.
Antiderivative Antialiasing for Recurrent Neural Networks
Otto Mikkonen and Kurt James Werner
Abstract: Neural networks have become invaluable for general audio processing tasks, such as virtual analog modeling of nonlinear audio equipment.
For sequence modeling tasks in particular, recurrent neural networks (RNNs) have gained widespread adoption in recent years. Their general applicability and effectiveness
stem partly from their inherent nonlinearity, which makes them
prone to aliasing. Recent work has explored mitigating aliasing
by oversampling the network—an approach whose effectiveness is
directly linked with the incurred computational costs. This work
explores an alternative route by extending the antiderivative antialiasing technique to explicit, computable RNNs. Detailed applications to the Gated Recurrent Unit and Long Short-Term Memory cell are shown as case studies. The proposed technique is evaluated
on multiple pre-trained guitar amplifier models, assessing its impact on the amount of aliasing and model tonality. The method is
shown to reduce the models’ tendency to alias considerably across
all considered sample rates while only affecting their tonality moderately, without requiring high oversampling factors. The results
of this study can be used to improve sound quality in neural audio
processing tasks that employ a suitable class of RNNs. Additional
materials are provided in the accompanying webpage.
Towards Neural Emulation of Voltage-Controlled Oscillators
Riccardo Simionato and Stefano Fasciani
Abstract: Machine learning models have become ubiquitous in modeling
analog audio devices. Expanding on this line of research, our study
focuses on Voltage-Controlled Oscillators of analog synthesizers.
We employ black box autoregressive artificial neural networks to
model the typical analog waveshapes, including triangle, square,
and sawtooth. The models can be conditioned on wave frequency
and type, enabling the generation of pitch envelopes and morphing across waveshapes. We conduct evaluations on both synthetic
and analog datasets to assess the accuracy of various architectural
variants. The LSTM variant performed best, although lower frequency ranges present particular challenges.
Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling
Yicheng Gu, Runsong Zhang, Lauri Juvela and Zhizheng Wu
Abstract: Virtual Analog (VA) modeling aims to simulate the behavior
of hardware circuits via algorithms to replicate their tone digitally.
Dynamic Range Compressor (DRC) is an audio processing module
that controls the dynamics of a track by reducing the volume of loud sounds and amplifying quiet ones, which is essential in music
production. In recent years, neural-network-based VA modeling has
shown great potential in producing high-fidelity models. However,
due to the lack of data quantity and diversity, their generalization
ability in different parameter settings and input sounds is still limited. To tackle this problem, we present Solid State Bus-Comp, the
first large-scale and diverse dataset for modeling the classical VCA
compressor — SSL 500 G-Bus. Specifically, we manually collected
175 unmastered songs from the Cambridge Multitrack Library. We
recorded the compressed audio in 220 parameter combinations,
resulting in an extensive 2528-hour dataset with diverse genres, instruments, tempos, and keys. Moreover, to facilitate the use of our
proposed dataset, we conducted benchmark experiments in various
open-sourced black-box and grey-box models, as well as white-box
plugins. We also conducted ablation studies in different data subsets to illustrate the effectiveness of the improved data diversity and
quantity. The dataset and demos are on our project page: https://www.yichenggu.com/SolidStateBusComp/.
Real-Time Virtual Analog Modelling of Diode-Based VCAs
Coriander V. Pines
Abstract: Some early analog voltage-controlled amplifiers (VCAs) utilized
semiconductor diodes as a variable-gain element. Diode-based
VCAs exhibit a unique sound quality, with distortion dependent
both on signal level and gain control. In this work, we examine the
behavior of a simplified circuit for a diode-based VCA and propose
a nonlinear, explicit, stateless digital model. This approach avoids
traditional iterative algorithms, which can be computationally intensive. The resulting digital model retains the sonic characteristics
of the analog model and is suitable for real-time simulation. We
present an analysis of the gain characteristics and harmonic distortion produced by this model, as well as practical guidance for
implementation. We apply this approach to a set of alternative
analog topologies and introduce a family of digital VCA models
based on fixed nonlinearities with variable operating points.
Antialiasing in BBD Chips Using BLEP
Leonardo Gabrielli, Stefano D'Angelo and Stefano Squartini
Abstract: Several methods exist in the literature to accurately simulate Bucket
Brigade Device (BBD) chips, which are widely used in analog
delay-based audio effects for their characteristic lo-fi sound, shaped by noise, nonlinearities and aliasing. The latter is a desired quality, being typical of those chips. However, when simulating BBDs in the discrete-time domain, additional aliasing components occur that need to be suppressed. In this work, we
propose a novel method that applies the Bandlimited Step (BLEP)
technique, effectively minimizing aliasing artifacts introduced by
the simulation. The paper provides some insights on the design
of a BBD simulation using interpolation at the input for clock rate
conversion and, most importantly, shows how BLEP can be effective in reducing unwanted aliasing artifacts. Interpolation is shown
to have minor importance in the reduction of spurious components.
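For readers unfamiliar with BLEP, the snippet below shows the standard two-sample polyBLEP residual correcting a naive sawtooth. This is the generic building block only; how the paper applies it to the step discontinuities of a BBD simulation differs in the details.

```python
import numpy as np

def polyblep(t, dt):
    """Two-sample polynomial BLEP residual around a unit step.
    t is the phase in [0, 1), dt the phase increment per sample."""
    if t < dt:                    # sample just after the discontinuity
        x = t / dt
        return 2 * x - x * x - 1.0
    if t > 1.0 - dt:              # sample just before the discontinuity
        x = (t - 1.0) / dt
        return x * x + 2 * x + 1.0
    return 0.0

def saw_polyblep(f0, fs, num):
    """Naive sawtooth with the BLEP residual subtracted at each phase wrap."""
    dt = f0 / fs
    phase, out = 0.0, np.zeros(num)
    for n in range(num):
        out[n] = 2.0 * phase - 1.0 - polyblep(phase, dt)
        phase += dt
        if phase >= 1.0:
            phase -= 1.0
    return out
```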
Aliasing Reduction in Neural Amp Modeling by Smoothing Activations
Ryota Sato and Julius O. Smith III
Abstract: The increasing demand for high-quality digital emulations of analog audio hardware, such as vintage tube guitar amplifiers, led
to numerous works on neural network-based black-box modeling,
with deep learning architectures like WaveNet showing promising
results. However, a key limitation in all of these models was the
aliasing artifacts stemming from nonlinear activation functions in
neural networks. In this paper, we investigated novel and modified activation functions aimed at mitigating aliasing within neural
amplifier models. Supporting this, we introduced a novel metric,
the Aliasing-to-Signal Ratio (ASR), which quantitatively assesses
the level of aliasing with high accuracy. Measuring also the conventional Error-to-Signal Ratio (ESR), we conducted studies on a
range of preexisting and modern activation functions with varying
stretch factors. Our findings confirmed that activation functions
with smoother curves tend to achieve lower ASR values, indicating a noticeable reduction in aliasing. Notably, this improvement
in aliasing reduction was achievable without a substantial increase
in ESR, demonstrating the potential for high modeling accuracy
with reduced aliasing in neural amp models.
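One plausible way to measure an aliasing-to-signal ratio of this kind is sketched below; the paper's exact ASR definition may differ. The excitation frequency is snapped so that every harmonic falls on an FFT bin.

```python
import numpy as np

def aliasing_to_signal_ratio(model, f0, fs, periods=64):
    """Drive the model with a sinusoid spanning an exact number of periods,
    then compare non-harmonic (aliased) to harmonic energy, in dB."""
    N = int(round(periods * fs / f0))
    f0 = periods * fs / N                    # snap f0 onto an FFT bin
    y = model(np.sin(2 * np.pi * f0 * np.arange(N) / fs))
    Y = np.abs(np.fft.rfft(y)) ** 2
    harmonic = np.zeros(len(Y), dtype=bool)
    harmonic[::periods] = True               # DC and multiples of f0
    return 10 * np.log10(Y[~harmonic].sum() / Y[harmonic].sum())

# Example with a memoryless nonlinearity:
# aliasing_to_signal_ratio(np.tanh, 2093.0, 44100)
```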
Antialiased Black-Box Modeling of Audio Distortion Circuits Using Real Linear Recurrent Units
Fabián Esqueda and Shogo Murai
Abstract: In this paper, we propose the use of real-valued Linear Recurrent
Units (LRUs) for black-box modeling of audio circuits. A network architecture composed of real LRU blocks interleaved with
nonlinear processing stages is proposed.
Two case studies are presented: a second-order diode clipper and an overdrive distortion pedal. Furthermore, we show how to integrate the antiderivative antialiasing technique into the proposed method, effectively
lowering oversampling requirements. Our experiments show that
the proposed method generates models that accurately capture the
nonlinear dynamics of the examined devices and are highly efficient, which makes them suitable for real-time operation inside
Digital Audio Workstations.
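A minimal sketch of a real-valued linear recurrent block of the kind the title suggests: a bank of parallel first-order real recursions with a linear readout. Names and structure are our assumptions; in the paper such blocks are interleaved with nonlinear stages.

```python
import numpy as np

def real_lru_block(x, a, b, c, d):
    """h[n] = a * h[n-1] + b * x[n] (elementwise over parallel sections),
    y[n] = c . h[n] + d * x[n]. Stability requires |a_i| < 1."""
    h = np.zeros_like(a)
    y = np.zeros_like(x)
    for n, xn in enumerate(x):
        h = a * h + b * xn
        y[n] = c @ h + d * xn
    return y

# Example: 8 parallel sections with random stable real poles
rng = np.random.default_rng(0)
a = rng.uniform(-0.99, 0.99, 8)
b, c = rng.standard_normal(8), rng.standard_normal(8)
y = real_lru_block(np.sin(np.linspace(0, 20, 1000)), a, b, c, d=0.5)
```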
Training Neural Models of Nonlinear Multi-Port Elements Within Wave Digital Structures Through Discrete-Time Simulation
Oliviero Massi, Alessandro Ilic Mezza, Riccardo Giampiccolo and Alberto Bernardini
Abstract: Neural networks have been applied within the Wave Digital Filter
(WDF) framework as data-driven models for nonlinear multi-port
circuit elements. Conventionally, these models are trained on wave
variables obtained by sampling the current-voltage characteristic
of the considered nonlinear element before being incorporated into
the circuit WDF implementation. However, isolating multi-port
elements for this process can be challenging, as their nonlinear
behavior often depends on dynamic effects that emerge from interactions with the surrounding circuit. In this paper, we propose a
novel approach for training neural models of nonlinear multi-port
elements directly within a circuit’s Wave Digital (WD) discrete-time implementation, relying solely on circuit input-output voltage
measurements. Exploiting the differentiability of WD simulations,
we embed the neural network into the simulation process and optimize its parameters using gradient-based methods by minimizing
a loss function defined over the circuit output voltage. Experimental results demonstrate the effectiveness of the proposed approach
in accurately capturing the nonlinear circuit behavior, while preserving the interpretability and modularity of WDFs.
Distributed Single-Reed Modeling Based on Energy Quadratization and Approximate Modal Expansion
Champ C. Darabundit, Vasileios Chatziioannou and Gary Scavone
Abstract: Recently, energy quadratization and modal expansion have become popular methods for developing efficient physics-based
sound synthesis algorithms. These methods have been primarily
used to derive explicit schemes modeling the collision between
a string and a fixed barrier. In this paper, these techniques are
applied to a similar problem: modeling a distributed mouthpiece
lay-reed-lip interaction in a woodwind instrument. The proposed
model aims to provide a more accurate representation of how a musician’s embouchure affects the reed’s dynamics. The mouthpiece
and lip are modeled as distributed static and dynamic viscoelastic
barriers, respectively. The reed is modeled using an approximate
modal expansion derived via the Rayleigh-Ritz method. The reed
system is then acoustically coupled to a measured input impedance
response of a saxophone. Numerical experiments are presented.
A Wavelet-Based Method for the Estimation of Clarity of Attack Parameters in Non-Percussive Instruments
Gianpaolo Evangelista and Alberto Acquilino
Abstract: From the exploration of databases of instrument sounds to the self-assisted practice of musical instruments, methods for automatically
and objectively assessing the quality of musical tones are in high
demand. In this paper, we develop a new algorithm for estimating
the duration of the attack, with particular attention to wind and
bowed string instruments. In fact, for these instruments, the quality
of the tones is highly influenced by the attack clarity, for which,
together with pitch stability, the attack duration is an indicator often
used by teachers by ear. Since the direct estimation of the attack
duration from sounds is made difficult by the initial preponderance of the excitation noise, we propose a more robust approach
based on the separation of the ensemble of the harmonics from the
excitation noise, which is obtained by means of an improved pitch-synchronous wavelet transform. We also define a new parameter,
the noise ducking time, which is relevant for detecting the extent of
the noise component in the attack. In addition to the exploration of
available sound databases, for testing our algorithm, we created an
annotated data set in which several problematic sounds are included.
Moreover, to check the consistency and robustness of our duration
estimates, we applied our algorithm to sets of synthetic sounds with
noisy attacks of programmable duration.
Non-Iterative Numerical Simulation in Virtual Analog: A Framework Incorporating Current Trends
Alessia Andò, Enrico Bozzo and Federico Fontana
Abstract: Thanks to their low and constant computational cost, non-iterative methods for the solution of differential problems are gaining popularity in virtual analog, provided that their stability properties and accuracy afford their use without excessive temporal oversampling. At
least in some application case studies, one recent family of noniterative schemes has shown promise to outperform methods that
achieve accurate results at the cost of iterating several times while
converging to the numerical solution. Here, this family is contextualized and studied against known classes of non-iterative methods.
The results from these studies foster a more general discussion
about the possibilities, role and prospective use of non-iterative
methods in virtual analog.
Power-Balanced Drift Regulation for Scalar Auxiliary Variable Methods: Application to Real-Time Simulation of Nonlinear String Vibrations
Thomas Risse, Thomas Hélie and Stefan Bilbao
Abstract: Efficient stable integration methods for nonlinear systems are
of great importance for physical modeling sound synthesis. Specifically, a number of musical systems of interest, including vibrating
strings, bars or plates may be written as port-Hamiltonian systems
with quadratic kinetic energy and non-quadratic potential energy.
Efficient schemes have been developed for such systems through
the introduction of a scalar auxiliary variable. As a result, stable real-time simulation of nonlinear musical systems of up to a few thousand degrees of freedom is possible, even for nearly
lossless systems. However, convergence rates can be slow and
seem to be system-dependent. Specifically, at audio rates, they
may suffer from numerical drift of the auxiliary variable, resulting
in dramatic unwanted effects on audio output, such as pitch drifts
after several impacts on the same resonator.
In this paper, a novel method for mitigating this unwanted drift
while preserving power balance is presented, based on a control
approach. A set of modified equations is proposed to control the
drift artefact by rerouting energy through the scalar auxiliary variable and potential energy state. Numerical experiments are run
in order to check convergence on simulations in the case of a cubic nonlinear string. A real-time implementation is provided as
a Max/MSP external. 60-note polyphony is achieved on a laptop, and some simple high-level control parameters are provided, making the proposed implementation suitable for use in artistic contexts. All code is available in a public repository, along with compiled Max/MSP externals.
Fast Differentiable Modal Simulation of Non-Linear Strings, Membranes, and Plates
Rodrigo Diaz and Mark Sandler
Abstract: Modal methods for simulating vibrations of strings, membranes, and plates are widely used in acoustics and physically
informed audio synthesis. However, traditional implementations,
particularly for non-linear models like the von Kármán plate, are
computationally demanding and lack differentiability, limiting inverse modelling and real-time applications. We introduce a fast,
differentiable, GPU-accelerated modal framework built with the
JAX library, providing efficient simulations and enabling gradientbased inverse modelling.
Benchmarks show that our approach
significantly outperforms CPU- and GPU-based implementations,
particularly for simulations with many modes. Inverse modelling
experiments demonstrate that our approach can recover physical
parameters, including tension, stiffness, and geometry, from both
synthetic and experimental data. Although fitting physical parameters is more sensitive to initialisation compared to methods that
fit abstract spectral parameters, it provides greater interpretability
and more compact parameterisation. The code is released as open
source to support future research and applications in differentiable
physical modelling and sound synthesis.
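A toy version of the differentiable-modal workflow, written with JAX as in the paper but far simpler than its nonlinear models: a linear harmonic modal synthesizer whose physical-flavoured parameters can be fitted by gradient descent.

```python
import jax
import jax.numpy as jnp

def modal_tone(params, t, num_modes=40):
    """Toy differentiable modal synth: harmonic modal frequencies f0*m with a
    shared T60 decay. The paper's nonlinear (e.g. von Karman) models go well
    beyond this linear sketch."""
    f0, t60 = params
    m = jnp.arange(1, num_modes + 1)
    decay = jnp.log(1000.0) / t60                    # 60 dB decay over T60
    modes = (jnp.exp(-decay * t)[None, :]
             * jnp.sin(2 * jnp.pi * f0 * m[:, None] * t[None, :])
             / m[:, None])                           # assumed 1/m amplitudes
    return modes.sum(axis=0)

def loss(params, t, target):
    return jnp.mean((modal_tone(params, t) - target) ** 2)

# Gradient-based inverse modelling of (f0, T60) from a target signal
t = jnp.linspace(0.0, 1.0, 16000)
target = modal_tone(jnp.array([110.0, 0.8]), t)
grads = jax.grad(loss)(jnp.array([100.0, 1.0]), t, target)
```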
Learning Nonlinear Dynamics in Physical Modelling Synthesis Using Neural Ordinary Differential Equations
Victor Zheleznov, Stefan Bilbao, Alec Wright and Simon King
Abstract: Modal synthesis methods are a long-standing approach for modelling distributed musical systems. In some cases extensions are
possible in order to handle geometric nonlinearities. One such
case is the high-amplitude vibration of a string, where geometric nonlinear effects lead to perceptually important phenomena including pitch glides and a dependence of brightness on striking amplitude. A modal decomposition leads to a coupled nonlinear system of ordinary differential equations. Recent work in applied machine learning (in particular on neural ordinary differential equations) has shown that lumped dynamic systems such as electronic circuits can be modelled automatically from data. In this work,
we examine how modal decomposition can be combined with neural ordinary differential equations for modelling distributed musical systems. The proposed model leverages the analytical solution
for linear vibration of the system’s modes and employs a neural network to account for nonlinear dynamic behaviour. Physical parameters of the system remain easily accessible after training, without
the need for a parameter encoder in the network architecture. As
an initial proof of concept, we generate synthetic data for a nonlinear transverse string and show that the model can be trained to
reproduce the nonlinear dynamics of the system. Sound examples
are presented.
Physics-Informed Deep Learning for Nonlinear Friction Model of Bow-String Interaction
Xinmeng Luan and Gary Scavone
Abstract: This study investigates the use of an unsupervised, physics-informed deep learning framework to model a one-degree-of-freedom mass-spring system subjected to a nonlinear friction bow
force and governed by a set of ordinary differential equations.
Specifically, it examines the application of Physics-Informed Neural Networks (PINNs) and Physics-Informed Deep Operator Networks (PI-DeepONets). Our findings demonstrate that PINNs successfully address the problem across different bow force scenarios,
while PI-DeepONets perform well under low bow forces but encounter difficulties at higher forces. Additionally, we analyze the
Hessian eigenvalue density and visualize the loss landscape. Overall, the presence of large Hessian eigenvalues and sharp minima
indicates highly ill-conditioned optimization.
These results underscore the promise of physics-informed
deep learning for nonlinear modelling in musical acoustics, while
also revealing the limitations of relying solely on physics-based
approaches to capture complex nonlinearities. We demonstrate
that PI-DeepONets, with their ability to generalize across varying parameters, are well-suited for sound synthesis. Furthermore,
we demonstrate that the limitations of PI-DeepONets under higher
forces can be mitigated by integrating observation data within a
hybrid supervised-unsupervised framework. This suggests that a
hybrid supervised-unsupervised DeepONets framework could be
a promising direction for future practical applications.
Comparing Acoustic and Digital Piano Actions: Data Analysis and Key Insights
Michael Fioretti, Giuseppe Bergamino, Leonardo Gabrielli, Gianluca Ciattaglia and Susanna Spinsante
Abstract: The acoustic piano and its sound production mechanisms have been
extensively studied in the field of acoustics. Similarly, digital piano synthesis has been the focus of numerous signal processing
research studies. However, the role of the piano action in shaping the dynamics and nuances of piano sound has received less
attention, particularly in the context of digital pianos. Digital pianos are well-established commercial instruments that typically use
weighted keys with two or three sensors to measure the average
key velocity—this being the only input to a sampling synthesis
engine. In this study, we investigate whether this simplified measurement method adequately captures the full dynamic behavior of
the original piano action. After a brief review of the state of the art,
we describe an experimental setup designed to measure physical
properties of the keys and hammers of a piano. This setup enables
high-precision readings of acceleration, velocity, and position for
both the key and hammer across various dynamic levels. Through
extensive data analysis, we examine their relationships and identify
the optimal key position for velocity measurement. We also analyze
a digital piano key to determine where the average key velocity is
measured and compare it with our proposed optimal timing. We
find that the instantaneous key velocity just before let-off correlates
most strongly with hammer impact velocity, indicating a target
for improved sensing; however, due to the limitations of discrete
velocity sensing, this optimization alone may not suffice to replicate
the nuanced expressiveness of acoustic piano touch. This study
represents the first step in a broader research effort aimed at linking
piano touch, dynamics, and sound production.
Wave Pulse Phase Modulation: Hybridising Phase Modulation and Phase Distortion
Matthew Smart
Abstract: This paper introduces Wave Pulse Phase Modulation (WPPM), a
novel synthesis technique based on phase shaping. It combines
two classic digital synthesis techniques: Phase Modulation (PM)
and Phase Distortion (PD), aiming to overcome their respective
limitations while enabling the creation of new, interesting timbres.
It works by segmenting a phase signal into two regions, each independently driving the phase of a modulator waveform. This results
in two distinct pulses per period that together form the signal used
as the phase input to a carrier waveform, similar to PM, hence the
name Wave Pulse Phase Modulation. This method provides a minimal set of parameters that enable the creation of complex, evolving waveforms, and rich dynamic textures. By modulating these
parameters, WPPM can produce a wide range of interesting spectra, including those with formant-like resonant peaks. The paper
examines PM and PD in detail, exploring the modifications needed
to integrate them with WPPM, before presenting the full WPPM
algorithm alongside its parameters and creative possibilities. Finally, it discusses scope for further research and developments into
new similar phase shaping algorithms.
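A loose illustration of our reading of the description (parameter names and details are assumptions, not the published algorithm): a phase ramp is split into two segments, each renormalized to drive one pulse, and the two pulses per period phase-modulate a sine carrier.

```python
import numpy as np

def wppm_like(f0, fs, dur, split=0.5, index=2.0):
    """Split each phase period at 'split'; each segment is renormalized to
    [0, 1] and shaped into a half-sine pulse; the two pulses per period then
    phase-modulate a sine carrier."""
    n = np.arange(int(dur * fs))
    phase = (f0 * n / fs) % 1.0
    seg1 = np.clip(phase / split, 0.0, 1.0)
    seg2 = np.clip((phase - split) / (1.0 - split), 0.0, 1.0)
    mod = np.sin(np.pi * seg1) + np.sin(np.pi * seg2)   # two pulses per period
    return np.sin(2.0 * np.pi * phase + index * mod)
```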
Digital Morphophone Environment. Computer Rendering of a Pioneering Sound Processing Device
Daniel Scorranese
Abstract: This paper introduces a digital reconstruction of the morphophone,
a complex magnetophonic device developed in the 1950s within
the laboratories of the GRM (Groupe de Recherches Musicales)
in Paris. The analysis, design, and implementation methodologies
underlying the Digital Morphophone Environment are discussed.
Based on a detailed review of historical sources and limited
documentation – including a small body of literature and, most
notably, archival images – the core operational principles of the
morphophone have been modeled within the Max visual programming environment. The main goals of this work are, on the one
hand, to study and make accessible a now obsolete and unavailable
tool, and on the other, to provide the opportunity for new explorations in computer music and research.
Modeling the Impulse Response of Higher-Order Microphone Arrays Using Differentiable Feedback Delay Networks
Riccardo Giampiccolo, Alessandro Ilic Mezza, Mirco Pezzoli, Shoichi Koyama, Alberto Bernardini and Fabio Antonacci
Abstract: Recently, differentiable multiple-input multiple-output Feedback
Delay Networks (FDNs) have been proposed for modeling target multichannel room impulse responses by optimizing their parameters according to perceptually-driven time-domain descriptors. However, in spatial audio applications, frequency-domain
characteristics and inter-channel differences are crucial for accurately replicating a given soundfield. In this article, targeting the
modeling of the response of higher-order microphone arrays, we
improve on the methodology by optimizing the FDN parameters
using a novel spatially-informed loss function, demonstrating its
superior performance over previous approaches and paving the
way toward the use of differentiable FDNs in spatial audio applications such as soundfield reconstruction and rendering.
A Modified Algorithm for a Loudspeaker Line Array Multi-Lobe Control
Stefania Cecchi, Valeria Bruschi, Michele Frati, Marco Secondini and Andrea Tanoni
Abstract: The creation of personal sound zones is an effective solution
for delivering personalized auditory experiences in shared spaces.
Their applications span various domains, including in-car entertainment, home and office environments, and healthcare functions.
This paper presents a novel approach for the creation of personal
sound zones using a modified algorithm for multi-lobe control in
a loudspeaker line array. The proposed method integrates a pressure-matching beamforming algorithm with an innovative technique for
reducing side lobes, enhancing the precision and isolation of sound
zones.
The system was evaluated through simulations and experimental tests conducted in a semi-anechoic environment and a
large listening room. Results demonstrate the effectiveness of the
method in creating two separate sound zones.
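The textbook pressure-matching core that such systems build on can be sketched as a regularized least-squares solve; the paper's side-lobe reduction technique is not reproduced here.

```python
import numpy as np

def pressure_matching_weights(G, p_target, reg=1e-3):
    """Solve w = argmin ||G w - p||^2 + reg * ||w||^2 at one frequency.
    G: (num_control_points, num_drivers) complex transfer matrix;
    p_target: desired pressures (e.g. ones in the bright zone, zeros in
    the dark zone)."""
    L = G.shape[1]
    A = G.conj().T @ G + reg * np.eye(L)
    return np.linalg.solve(A, G.conj().T @ p_target)
```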
Estimation of Multi-Slope Amplitudes in Late Reverberation
Jeremy B. Bai and Sebastian J. Schlecht
Abstract: The common-slope model is used to model late reverberation of
complex room geometries such as multiple coupled rooms. The
model fits band-limited room impulse responses using a set of
common decay rates, with amplitudes varying based on listener
positions. This paper investigates amplitude estimation methods
within the common-slope model framework. We compare several traditional least squares estimation methods and propose using
LINEX regression, a Maximum Likelihood approach using log-squared RIR statistics. Through statistical analysis and simulation
tests, we demonstrate that LINEX regression improves accuracy
and reduces bias when compared to traditional methods.
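For orientation, the simplest traditional estimator in this setting is ordinary least squares on the energy decay with known common decay rates, as sketched below; this is a baseline of the kind compared against, not the proposed LINEX estimator.

```python
import numpy as np

def common_slope_amplitudes_ls(edc, t, t60s):
    """Fit EDC(t) ~ sum_i A_i * exp(-t / tau_i) by ordinary least squares,
    with the common decay rates fixed (given here as T60 values in seconds)."""
    taus = np.asarray(t60s, dtype=float) / np.log(1e6)   # T60 -> time constant
    B = np.exp(-t[:, None] / taus[None, :])              # (samples, slopes)
    A, *_ = np.linalg.lstsq(B, edc, rcond=None)
    return A
```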
Differentiable Scattering Delay Networks for Artificial Reverberation
Alessandro Ilic Mezza, Riccardo Giampiccolo, Enzo De Sena and Alberto Bernardini
Abstract: Scattering delay networks (SDNs) provide a flexible and efficient
framework for artificial reverberation and room acoustic modeling. In this work, we introduce a differentiable SDN, enabling
gradient-based optimization of its parameters to better approximate the acoustics of real-world environments. By formulating
key parameters such as scattering matrices and absorption filters
as differentiable functions, we employ gradient descent to optimize an SDN based on a target room impulse response. Our approach minimizes discrepancies in perceptually relevant acoustic
features, such as energy decay and frequency-dependent reverberation times. Experimental results demonstrate that the learned SDN
configurations significantly improve the accuracy of synthetic reverberation, highlighting the potential of data-driven room acoustic modeling.
Differentiable Attenuation Filters for Feedback Delay Networks
Ilias Ibnyahya and Joshua D. Reiss
Abstract: We introduce a novel method for designing attenuation filters in
digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS)
of Infinite Impulse Response (IIR) filters arranged as parametric
equalizers (PEQ), enabling fine control over frequency-dependent
reverberation decay. Unlike traditional graphic equalizer designs,
which require numerous filters per delay line, we propose a scalable solution where the number of filters can be adjusted. The frequency, gain, and quality-factor (Q) parameters are shared across delay lines; only the gain is adjusted based on delay
length. This design not only reduces the number of optimization
parameters, but also remains fully differentiable and compatible
with gradient-based learning frameworks. Leveraging principles
of analog filter design, our method allows for efficient and accurate filter fitting using supervised learning. Our method delivers
a flexible and differentiable design, achieving state-of-the-art performance while significantly reducing computational cost.
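The classical broadband rule that such frequency-dependent designs generalize assigns each delay line a gain tied to its length, so all lines decay at the same rate (standard FDN practice, not the paper's full PEQ design):

```python
import numpy as np

def delay_line_gains(delays_samples, t60_s, fs):
    """Per-line broadband gain g_i = 10 ** (-3 * d_i / (fs * T60)), i.e.
    each line loses 60 dB over T60 regardless of its delay length."""
    d = np.asarray(delays_samples, dtype=float)
    return 10.0 ** (-3.0 * d / (fs * t60_s))

# Example: delay_line_gains([1021, 1399, 1747, 2203], t60_s=2.0, fs=48000)
```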
Perceptual Decorrelator Based on Resonators
Jon Fagerström, Nils Meyer-Kahlen, Sebastian J. Schlecht and Vesa Välimäki
Abstract: Decorrelation filters transform mono audio into multiple decorrelated copies. This paper introduces a novel decorrelation filter design based on a resonator bank, which produces a sum of over a thousand exponentially decaying sinusoids. A headphone listening test was used to identify the minimum inter-channel time delays that perceptually match ERB-filtered coherent noise to corresponding incoherent noise. The decay rate of each resonator is set based on a group delay profile determined by the listening test results at its corresponding frequency. Furthermore, the delays from the test are used to refine frequency-dependent windowing in coherence estimation, which we argue represents the perceptually most accurate way of assessing interaural coherence. This coherence measure then guides an optimization process that adjusts the initial phases of the sinusoids to minimize the coherence between two instances of the resonator-based decorrelator. The delay results establish the necessary group delay per ERB for effective decorrelation, revealing higher-than-expected values, particularly at higher frequencies. For comparison, the optimization is also performed using two previously proposed group-delay profiles: one based on the period of the ERB band center frequency and another based on the maximum group-delay limit before introducing smearing. The results indicate that the perceptually informed profile achieves equal decorrelation to the latter profile while smearing less at high frequencies. Overall, optimizing the phase response of the proposed decorrelator yields significantly lower coherence compared to using a random phase.
Compression of Head-Related Transfer Functions Using Piecewise Cubic Hermite Interpolation
Tom Krueger and Julián Villegas
Abstract: We present a spline-based method for compressing and reconstructing Head-Related Transfer Functions (HRTFs) that preserves perceptual quality. Our approach focuses on the magnitude response and consists of four stages: (1) acquiring minimum-phase head-related impulse responses (HRIRs), (2) transforming
them into the frequency domain and applying adaptive Wiener
filtering to preserve important spectral features, (3) extracting a
minimal set of control points using derivative-based methods to
identify local maxima and inflection points, and (4) reconstructing
the HRTF using piecewise cubic Hermite interpolation (PCHIP)
over the refined control points. Evaluation on 301 subjects demonstrates that our method achieves an average compression ratio of
4.7:1 with spectral distortion ≤ 1.0 dB in each Equivalent Rectangular Band (ERB). The method preserves binaural cues with a
mean absolute interaural level difference (ILD) error of 0.10 dB.
Our method achieves about three times the compression obtained
with a PCA-based method.
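Stages (3) and (4) admit a compact SciPy sketch; the selection rule and names below are our assumptions.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def control_points(f, mag_db):
    """Sketch of stage (3): keep band edges plus local extrema and inflection
    points of the smoothed magnitude response (selection rule assumed)."""
    d1 = np.gradient(mag_db, f)
    d2 = np.gradient(d1, f)
    keep = np.zeros(len(f), dtype=bool)
    keep[[0, -1]] = True
    keep[1:-1] |= np.sign(d1[:-2]) != np.sign(d1[2:])   # local maxima/minima
    keep[1:-1] |= np.sign(d2[:-2]) != np.sign(d2[2:])   # inflection points
    return f[keep], mag_db[keep]

def reconstruct(f_ctrl, mag_ctrl, f_grid):
    """Stage (4): shape-preserving piecewise cubic Hermite reconstruction."""
    return PchipInterpolator(f_ctrl, mag_ctrl)(f_grid)
```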
Spatializing Screen Readers: Extending VoiceOver via Head-Tracked Binaural Synthesis for User Interface Accessibility
Giuseppe Bergamino, Michael Fioretti, Leonardo Gabrielli and Stefano Squartini
Abstract: Traditional screen-based graphical user interfaces (GUIs) pose significant accessibility challenges for visually impaired users. This
paper demonstrates how existing GUI elements can be translated
into an interactive auditory domain using high-order Ambisonics and inertial sensor-based head tracking, culminating in a realtime binaural rendering over headphones. The proposed system
is designed to spatialize the auditory output from VoiceOver, the
built-in macOS screen reader, aiming to foster clearer mental mapping and enhanced navigability.
A between-groups experiment
was conducted to compare standard VoiceOver with the proposed
spatialized version. Non-visually-impaired participants (n = 32),
with no visual access to the test interface, completed a list-based
exploration and then attempted to reconstruct the UI solely from
auditory cues. Experimental results indicate that the head-tracked
group achieved a slightly higher accuracy in reconstructing the interface, while user experience assessments showed no significant
differences in self-reported workload or usability. These findings
suggest that potential benefits may come from the integration of
head-tracked binaural audio into mainstream screen-reader workflows, but future investigations involving blind and low-vision users
are needed.
Although the experimental testbed uses a generic
desktop app, our ultimate goal is to tackle the complex visual layouts of music-production software, where a head-tracked audio
approach could benefit visually impaired producers and musicians
navigating plug-in controls.
Evaluating the Performance of Objective Audio Quality Metrics in Response to Common Audio Degradations
Xie He, Duncan Williams and Bruno Fazenda
Abstract: This study evaluates the performance of five objective audio quality metrics—PEAQ Basic, PEAQ Advanced, PEMO-Q, ViSQOL,
and HAAQI—in the context of digital music production. Unlike
previous comparisons, we focus on their suitability for production environments, an area currently underexplored in existing research. Twelve audio examples were tested using two evaluation
types: an effectiveness test under progressively increasing degradations (hum, hiss, clipping, glitches) and a robustness test under
fixed-level, randomly fluctuating degradations.
In the effectiveness test, HAAQI, PEMO-Q, and PEAQ Basic
effectively tracked degradation changes, while PEAQ Advanced
failed consistently and ViSQOL showed low sensitivity to hum
and glitches. In the robustness test, ViSQOL and HAAQI demonstrated the highest consistency, with average standard deviations
of 0.004 and 0.007, respectively, followed by PEMO-Q (0.021),
PEAQ Basic (0.057), and PEAQ Advanced (0.065).
However,
ViSQOL also showed low variability across audio examples, suggesting limited genre sensitivity.
These findings highlight the strengths and limitations of each
metric for music production, specifically quality measurement with
compressed audio. The source code and dataset will be made publicly available upon publication.
Room Acoustic Modelling Using a Hybrid Ray-Tracing/Feedback Delay Network Method
Haowen Zhao, Akihiko Suyama, Kazunobu Kondo and Damian T. Murphy
Abstract: Combining different room acoustic modelling methods could provide a better balance between perceptual plausibility and computational efficiency than using a single and potentially more computationally expensive model. In this work, a hybrid acoustic modelling system that integrates ray tracing (RT) with an advanced
feedback delay network (FDN) is designed to generate perceptually plausible RIRs. A multiple stimuli with hidden reference
and anchor (MUSHRA) test and a two-alternative-forced-choice
(2AFC) discrimination task have been conducted to compare the
proposed method against ground truth recordings and conventional
RT-based approaches. The results show that the proposed system
delivers robust performance in various scenarios, achieving highly
plausible reverberation synthesis.
DataRES and PyRES: A Room Dataset and a Python Library for Reverberation Enhancement System Development, Evaluation, and Simulation
Gian Marco De Bortoli, Karolina Prawda, Philip Coleman and Sebastian J. Schlecht
Abstract: Reverberation is crucial in the acoustical design of physical
spaces, especially halls for live music performances. Reverberation Enhancement Systems (RESs) are active acoustic systems that
can control the reverberation properties of physical spaces, allowing them to adapt to specific acoustical needs. The performance of
RESs strongly depends on the properties of the physical room and
the architecture of the Digital Signal Processor (DSP). However,
room-impulse-response (RIR) measurements and the DSP code
from previous studies on RESs have never been made open access, leading to non-reproducible results. In this study, we present
DataRES and PyRES—an RIR dataset and a Python library to increase the reproducibility of studies on RESs. The dataset contains RIRs measured in RES research and development rooms and
professional music venues. The library offers classes and functionality for the development, evaluation, and simulation of RESs.
The implemented DSP architectures are made differentiable, allowing their components to be trained in a machine-learning-like
pipeline. The replication of previous studies by the authors shows
that PyRES can become a useful tool in future research on RESs.
Auditory Discrimination of Early Reflections in Virtual Rooms
Junting Chen, Duncan Williams and Bruno Fazenda
Abstract: This study investigates the perceptual sensitivity to early reflection changes across different spatial directions in a virtual
reality (VR) environment. Using an ABX discrimination paradigm, participants evaluated speech stimuli convolved with thirdorder Ambisonic room impulse responses under three position
reversal conditions (Left–Right, Front–Back, and Floor–Ceiling) and three
reverberation conditions (RT60 = 1.0 s, 0.6 s, and 0.2 s). Binomial tests revealed that participants consistently detected early reflection differences in the Left–Right reversal, while discrimination performance in the other two directions remained at or near
chance. This result can be explained by the higher acuity and lower localisation blur of the human auditory system in the lateral (left–right) dimension. A
two-way ANOVA confirmed a significant main effect of spatial
position (p = 0.00685, η² = 0.1605), with no significant effect of
reverberation or interaction. The analysis of the binaural room
impulse responses showed waveform and Direct-to-Reverberant Ratio (DRR) differences in the Left–Right reversal position, aligning
with perceptual results. However, no definitive causal link between DRR variations and perceptual outcomes can yet be established.
Partiels – Exploring, Analyzing and Understanding Sounds
Pierre Guillot
Abstract: This article presents Partiels, an open-source application developed at IRCAM to analyze digital audio files and explore sound characteristics. The application uses Vamp plug-ins to extract various kinds of information on different aspects of the sound, such
as spectrum, partials, pitch, tempo, text, and chords. Partiels is the
successor to AudioSculpt, offering a modern, flexible interface for
visualizing, editing, and exporting analysis results, addressing a
wide range of issues from musicological practice to sound creation
and signal processing research. The article describes Partiels’ key
features, including analysis organization, audio file management,
results visualization and editing, as well as data export and sharing
options, and its interoperability with other software such as Max
and Pure Data. In addition, it highlights the numerous analysis
plug-ins developed at IRCAM, based in particular on machine
learning models, as well as the IRCAM Vamp extension, which
overcomes certain limitations of the original Vamp format.
Listener-Adaptive 3D Audio with Crosstalk Cancellation
Francesco Veronesi, Filippo Fazi and Jacob Hollebon
Abstract: Crosstalk cancellation is a technology that allows the delivery of binaural audio over loudspeakers using loudspeaker beamforming, without the need for headphones. It enables spatial audio to be reproduced using practical loudspeaker distributions, for example a soundbar of loudspeakers positioned in front of the user only.
Crosstalk cancellation requires the user to be positioned at a specific location in space, the 'sweet-spot'. However, by using a built-in camera or sensor, the listener's ear position relative to the audio device can be tracked in real time, enabling a mobile sweet-spot through precise beamforming and effective crosstalk cancellation no matter where the listener is positioned.
This demo allows users to experience listener-adaptive crosstalk cancellation developed by Audioscenic, on a multi-loudspeaker gaming soundbar. Audioscenic develops advanced crosstalk cancellation solutions for home audio, gaming, automotive, and public space applications. Founded in 2017 by Dr Marcos Simón and Professor Filippo Fazi, the company emerged from their collaborative research at the Institute of Sound and Vibration Research, University of Southampton.
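The classical frequency-domain core of crosstalk cancellation is a regularized per-bin matrix inversion, sketched generically below (not Audioscenic's implementation):

```python
import numpy as np

def ctc_filters(H, beta=1e-2):
    """Per-bin regularized inversion of the 2x2 plant:
    C(w) = (H^H H + beta*I)^-1 H^H, so that H @ C ~ I at the ears.
    H: (num_bins, 2, 2) complex, ears x loudspeakers."""
    Hh = np.conj(np.swapaxes(H, -1, -2))
    A = Hh @ H + beta * np.eye(2)
    return np.linalg.solve(A, Hh)            # (num_bins, 2, 2)

# Speaker feeds from binaural target spectra D of shape (num_bins, 2):
# feeds = (ctc_filters(H) @ D[..., None])[..., 0]
```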
Biquad Coefficients Optimization via Kolmogorov-Arnold Networks
Ayoub Malek, Donald Schulz and Felix Wuebbelmann
Abstract: Conventional Deep Learning (DL) approaches to Infinite Impulse
Response (IIR) filter coefficient estimation from arbitrary frequency responses are quite limited. They often suffer from inefficiencies such as tight training requirements, high complexity, and
limited accuracy. As an alternative, in this paper, we explore the
use of Kolmogorov-Arnold Networks (KANs) to predict the IIR
filter—specifically biquad coefficients—effectively. By leveraging the high interpretability and accuracy of KANs, we achieve
smooth optimization of the coefficients. Furthermore, by constraining
the search space and exploring different loss functions, we demonstrate improved performance in speed and accuracy. Our approach
is evaluated against other existing differentiable IIR filter solutions. The results show significant advantages of KANs over existing methods, offering steadier convergences and more accurate
results. This offers new possibilities for integrating digital IIR filters into deep-learning frameworks.
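The differentiable quantity at the heart of such estimation is the biquad's frequency response as a function of its coefficients. A generic NumPy version is shown below; a training setup would express the same formula in an autodiff framework.

```python
import numpy as np

def biquad_response(coeffs, w):
    """Frequency response of one biquad section from its coefficients.
    Convention: H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2),
    with w given in radians/sample."""
    b0, b1, b2, a1, a2 = coeffs
    zi = np.exp(-1j * w)                      # z^-1 on the unit circle
    return (b0 + b1 * zi + b2 * zi**2) / (1.0 + a1 * zi + a2 * zi**2)

# Example: magnitude in dB on a log-spaced grid, as a matching target
w = np.logspace(-3, 0, 256) * np.pi
mag_db = 20 * np.log10(np.abs(biquad_response((0.2, 0.4, 0.2, -0.5, 0.3), w)))
```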
Audio Processor Parameters: Estimating Distributions Instead of Deterministic Values
Côme Peladeau, Dominique Fourer and Geoffroy Peeters
Abstract: Audio effects and sound synthesizers are widely used processors in popular music. Their parameters control the quality of the output sound, and multiple combinations of parameters can lead to the same sound. While recent approaches have been proposed to estimate these parameters given only the output sound, they are deterministic, i.e., they only estimate a single solution among the many possible parameter configurations. In this work, we propose to model the parameters as probability distributions instead of deterministic values. To learn the distributions, we optimize two objectives: (1) we minimize the reconstruction error between the ground-truth output sound and the one generated using the estimated parameters, as is usually done, but also (2) we maximize the parameter diversity, using entropy. We evaluate our approach through two numerical audio experiments to show its effectiveness. These results show how our approach effectively outputs multiple combinations of parameters to match one sound.
A Parametric Equalizer with Interactive Poles and Zeros Control for Digital Signal Processing Education
Andrea Casati, Giorgio Presti and Marco Tiraboschi
Abstract: This article presents ZePolA, a digital audio equalizer designed
as an educational resource for understanding digital filter design.
Unlike conventional equalization plug-ins, which define the frequency response first and then derive the filter coefficients, this
software adopts an inverse approach: users directly manipulate the
placement of poles and zeros on the complex plane, with the corresponding frequency response visualized in real time. This methodology provides an intuitive link between theoretical filter concepts
and their practical application. The plug-in features three main
panels: a filter parameter panel, a frequency response panel, and a
filter design panel. It allows users to configure a cascade of first- or second-order filter elements, each parameterized by the location of its poles or zeros. The GUI supports interaction through
drag-and-drop gestures, enabling immediate visual and auditory
feedback. This hands-on approach is intended to enhance learning
by bridging the gap between theoretical knowledge and practical
application. To assess the educational value and usability of the
plug-in, a preliminary evaluation was conducted with focus groups
of students and lecturers. Future developments will include support for additional filter types and increased architectural flexibility. Moreover, a systematic validation study involving students
and educators is proposed to quantitatively evaluate the plug-in’s
impact on learning outcomes. This work contributes to the field
of digital signal processing education by offering an innovative
tool that merges the hands-on approach of music production with
a deeper theoretical understanding of digital filters, fostering an
interactive and engaging educational experience.
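For readers new to this inverse approach, a minimal sketch of how the frequency response follows directly from user-placed poles and zeros (illustrative only, not the plug-in's implementation):

```python
import numpy as np

def freq_response(zeros, poles, gain=1.0, n=512):
    w = np.linspace(0, np.pi, n)
    z = np.exp(1j * w)                 # evaluate H(z) on the unit circle
    H = gain * np.ones_like(z)
    for q in zeros:                    # each zero contributes a factor (1 - q z^-1)
        H *= 1 - q / z
    for p in poles:                    # each pole contributes 1 / (1 - p z^-1);
        H /= 1 - p / z                 # |p| < 1 keeps the filter stable
    return w, H

# A conjugate pole pair near the unit circle creates a resonant peak:
w, H = freq_response(zeros=[0.9j, -0.9j], poles=[0.5 + 0.4j, 0.5 - 0.4j])
```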
Zero-Phase Sound via Giant FFT
Vesa Välimäki, Stefan Bilbao, Sebastian J. Schlecht, Roope Salmi and David Zicarelli
Abstract: Given the speedy computation of the FFT in current computer
hardware, there are new possibilities for examining transformations for very long sounds. A zero-phase version of any audio
signal can be obtained by zeroing the phase angle of its complex
spectrum and taking the inverse FFT. This paper recommends additional processing steps, including zero-padding, transient suppression at the signal’s start and end, and gain compensation, to
enhance the resulting sound quality. As a result, a sound with the
same spectral characteristics as the original one, but with different temporal events, is obtained. Repeating rhythm patterns are
retained, however. Zero-phase sounds are palindromic in the sense
that they are symmetric in time. A comparison of the zero-phase
conversion to the autocorrelation function helps to understand its
properties, such as why the rhythm of the original sound is emphasized. It is also argued that the zero-phase signal has the same
autocorrelation function as the original sound. One exciting variation is to apply the method separately to the real
and imaginary parts of the spectrum to produce a stereo effect. A
frame-based technique enables the use of the zero-phase conversion in real-time audio processing. The zero-phase conversion is
another member of the giant FFT toolset, allowing the modification of sampled sounds, such as drum loops or entire songs.
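The basic transform is compact enough to sketch (zero-padding included; the paper's transient suppression and gain compensation steps are omitted here):

```python
import numpy as np

def zero_phase(x, pad_factor=1):
    N = len(x) * (1 + pad_factor)      # zero-pad before the giant FFT
    X = np.fft.rfft(x, n=N)
    y = np.fft.irfft(np.abs(X), n=N)   # zero phase: keep only the magnitude
    return np.fft.fftshift(y)          # center the time-symmetric (palindromic) result

x = np.random.randn(2**16)             # stand-in for a drum loop or a whole song
y = zero_phase(x)
```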
Partiels – Exploring, Analyzing and Understanding Sounds
Pierre Guillot
Abstract: This article presents Partiels, an open-source application developed at IRCAM to analyze digital audio files and explore sound characteristics.
The application uses Vamp plug-ins to
extract various information on different aspects of the sound, such
as spectrum, partials, pitch, tempo, text, and chords. Partiels is the
successor to AudioSculpt, offering a modern, flexible interface for
visualizing, editing, and exporting analysis results, addressing a
wide range of issues from musicological practice to sound creation
and signal processing research. The article describes Partiels’ key
features, including analysis organization, audio file management,
results visualization and editing, as well as data export and sharing
options, and its interoperability with other software such as Max
and Pure Data. In addition, it highlights the numerous analysis
plug-ins developed at IRCAM, based in particular on machine
learning models, as well as the IRCAM Vamp extension, which
overcomes certain limitations of the original Vamp format.
Stable Limit Cycles as Tunable Signal Sources
Wolfram E. Weingartner
Abstract: This paper presents a method for synthesizing audio signals from
nonlinear dynamical systems exhibiting stable limit cycles, with
control over frequency and amplitude independent of changes to
the system’s internal parameters. Using the van der Pol oscillator
and the Brusselator as case studies, it is demonstrated how parameters are decoupled from frequency and amplitude by rescaling the
angular frequency and normalizing amplitude extrema. Practical
implementation considerations are discussed, as are the limits and
challenges of this approach. The method’s validity is evaluated experimentally and synthesis examples show the application of tunable nonlinear oscillators in sound design, including the generation
of transients in FM synthesis by means of a van der Pol oscillator
and a Supersaw oscillator bank based on the Brusselator.
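A minimal sketch of the tuning idea on the van der Pol oscillator (the explicit integration scheme and constants are assumptions; the paper's exact period correction and amplitude normalization are omitted):

```python
import numpy as np

def van_der_pol(f0=220.0, mu=1.0, sr=48000, dur=1.0):
    w0 = 2 * np.pi * f0                # time rescaling by the target frequency
    dt = 1.0 / sr
    x, y = 1e-3, 0.0
    out = np.empty(int(sr * dur))
    for n in range(out.size):
        # rescaled van der Pol: x' = w0*y,  y' = w0*(mu*(1 - x^2)*y - x)
        x += dt * w0 * y
        y += dt * w0 * (mu * (1.0 - x * x) * y - x)
        out[n] = x
    return out

sig = van_der_pol()                    # ~220 Hz for small mu
```

For small mu the limit cycle completes close to f0 cycles per second regardless of mu; for strongly nonlinear settings the period drifts, which is precisely why full decoupling requires the rescaling and normalization the paper describes.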
Lookup Table Based Audio Spectral Transformation
Ryoho Kobayashi
Abstract: We present a unified visual interface for flexible spectral audio manipulation based on editable lookup tables (LUTs). In the proposed
approach, the audio spectrum is visualized as a two-dimensional
color map of frequency versus amplitude, serving as an editable
lookup table for modifying the sound. This single tool can replicate common audio effects such as equalization, pitch shifting, and
spectral compression, while also enabling novel sound transformations through creative combinations of adjustments. By consolidating these capabilities into one visual platform, the system has
the potential to streamline audio-editing workflows and encourage
creative experimentation. The approach also supports real-time
processing, providing immediate auditory feedback in an interactive graphical environment. Overall, this LUT-based method offers
an accessible yet powerful framework for designing and applying
a broad range of spectral audio effects through intuitive visual manipulation.
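A sketch of the core mechanism (an illustrative reimplementation, not the paper's tool): an editable table indexed by frequency bin and quantized magnitude returns a gain applied to each STFT frame.

```python
import numpy as np

def apply_spectral_lut(frame, lut):
    X = np.fft.rfft(frame * np.hanning(frame.size))
    mag, phase = np.abs(X), np.angle(X)
    n_levels = lut.shape[1]
    # quantize each bin's normalized magnitude to index the editable table
    idx = np.minimum((mag / (mag.max() + 1e-12) * (n_levels - 1)).astype(int),
                     n_levels - 1)
    gain = lut[np.arange(mag.size), idx]          # per-bin, per-level gain
    return np.fft.irfft(gain * mag * np.exp(1j * phase))

frame_len = 1024
lut = np.ones((frame_len // 2 + 1, 64))           # all-ones table = pass-through
y = apply_spectral_lut(np.random.randn(frame_len), lut)
```

Editing rows of the table reshapes the spectrum (equalization-like behavior), while editing along the magnitude axis yields level-dependent mappings such as spectral compression.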
A Non-Uniform Subband Implementation of an Active Noise Control System for Snoring Reduction
Abstract: Snoring noise can be extremely annoying and can negatively affect people’s social lives. To reduce this problem, active noise
affect people’s social lives. To reduce this problem, active noise
control (ANC) systems can be adopted for snoring cancellation.
Recently, adaptive subband systems have been developed to improve the convergence rate and reduce the computational complexity of the ANC algorithm. Several structures have been proposed
with different approaches. This paper proposes a non-uniform subband adaptive filtering (SAF) structure to improve a feedforward
active noise control algorithm. The non-uniform band distribution
allows for a higher frequency resolution of the lower frequencies,
where the snoring noise is most concentrated. Several experiments
have been carried out to evaluate the proposed system in comparison with a reference ANC system which uses a uniform approach.
Compositional Application of a Chaotic Dynamical System for the Synthesis of Sounds
Costantino Rizzuti
Abstract: The paper presents a review of compositional applications developed in recent years using a chaotic dynamical system in different
sound synthesis processes. The use of chaotic dynamical systems
in computer music has been a widespread practice for some time
now. The experimentation presented in this work shows the use
of a specific chaotic system, Chua’s oscillator, within different
sound synthesis methods. A family of new musical instruments
has been developed exploiting the potential offered by the use of
this chaotic system to produce complex timbres and sounds. The
instruments have been used for the creation of musical pieces and
for the realization of live electronics performances.
Delay Optimization Towards Smooth Sparse Noise
Cristóbal Andrade and Sebastian J. Schlecht
Abstract: Smooth sparse noise sequences are applied to efficiently model
reverberation. This paper addresses the problem of optimizing
sparse noise sequences for perceptual smoothness using gradient-
based methods. We demonstrate that sinc-shaped artifacts introduced by fractional delay create non-convexities in an envelope-based roughness loss function, hindering delay optimization. By temporarily removing pulse polarity and omitting envelope rectification, we obtain a convex loss suitable for gradient descent. Pulse
signs are reintroduced after optimization during synthesis. Optimization results show roughness reduction across various pulse densities, with the optimized sequences approaching the perceptual smoothness of velvet noise.
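A sketch of this reformulation (the moving-average envelope and variance-based roughness proxy are assumptions, not the paper's loss):

```python
import torch

L, K = 2048, 16                                     # sequence length, pulse count
t = torch.arange(L, dtype=torch.float32)
delays = (torch.rand(K) * L).requires_grad_(True)   # fractional pulse positions
opt = torch.optim.Adam([delays], lr=1.0)
for _ in range(300):
    # band-limited (sinc-interpolated) pulses with polarity removed
    pulses = torch.sinc(t[None, :] - delays[:, None]).sum(dim=0)
    env = torch.nn.functional.avg_pool1d(pulses[None, None, :], 128, stride=1)
    loss = env.var()                                # flat envelope = smoother noise
    opt.zero_grad(); loss.backward(); opt.step()

signs = torch.sign(torch.rand(K) - 0.5)             # polarity restored at synthesis
noise = (signs[:, None] * torch.sinc(t - delays[:, None].detach())).sum(dim=0)
```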
graetli: A Microcontroller-Based DSP Platform for Real-Time Audio Signal Processing
Jonas Roth, Silvan Krebs, and Christoph Studer
Abstract: This demonstration presents graetli, a standalone digital signal processing (DSP) platform for real-time audio applications,
built around the Electrosmith Daisy Seed [1] microcontroller platform. graetli features high-quality analog audio I/O, a zero-latency
analog dry signal path, a user interface with programmable potentiometers, and a rugged enclosure. graetli is suitable for both
performance interaction and algorithm prototyping. To showcase
its capabilities, we implement a frequency domain artificial reverberation algorithm. Conference visitors are invited to interact with
the platform and experience the real-time DSP reverb algorithm.
A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis
Esteban Gutiérrez, Frederic Font, Xavier Serra and Lonce Wyse
Abstract: In this work, we introduce TexStat, a novel loss function specifically designed for the analysis and synthesis of texture sounds
characterized by stochastic structure and perceptual stationarity.
Drawing inspiration from the statistical and perceptual framework
of McDermott and Simoncelli, TexStat identifies similarities
between signals belonging to the same texture category without
relying on temporal structure. We also propose using TexStat
as a validation metric alongside the Fréchet Audio Distance (FAD) to
evaluate texture sound synthesis models. In addition to TexStat,
we present TexEnv, an efficient, lightweight and differentiable
texture sound synthesizer that generates audio by imposing amplitude envelopes on filtered noise. We further integrate these components into TexDSP, a DDSP-inspired generative model tailored
for texture sounds. Through extensive experiments across various
texture sound types, we demonstrate that TexStat is perceptually meaningful, time-invariant, and robust to noise, features that
make it effective both as a loss function for generative tasks and as
a validation metric. All tools and code are provided as open-source
contributions, and our PyTorch implementations are efficient, differentiable, and highly configurable, enabling their use both in generative tasks and as a perceptually grounded evaluation metric.
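An illustrative statistics-matching loss in the spirit of McDermott and Simoncelli (a stand-in for intuition, not the actual TexStat definition): signals are compared through time-averaged moments of their spectral envelopes rather than sample by sample.

```python
import torch

def texture_stats(x, n_fft=512):
    S = torch.stft(x, n_fft, window=torch.hann_window(n_fft),
                   return_complex=True).abs()        # (freq bins, frames)
    mean = S.mean(dim=-1)                            # time-averaged moments;
    var = S.var(dim=-1)                              # temporal order is discarded
    skew = ((S - mean[:, None]) ** 3).mean(dim=-1) / var.clamp(min=1e-8) ** 1.5
    return torch.cat([mean, var, skew])

def texture_loss(x, y):
    return torch.nn.functional.mse_loss(texture_stats(x), texture_stats(y))

loss = texture_loss(torch.randn(48000), torch.randn(48000))
```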
DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions
Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Ben Hayes, Wei-Hsiang Liao, György Fazekas and Yuki Mitsufuji
Abstract: This study introduces a novel and interpretable model, DiffVox,
for matching vocal effects in music production. DiffVox, short
for “Differentiable Vocal Fx”, integrates parametric equalisation,
dynamic range control, delay, and reverb with efficient differentiable implementations to enable gradient-based optimisation for
parameter estimation. Vocal presets are retrieved from two datasets,
comprising 70 tracks from MedleyDB and 365 tracks from a private collection. Analysis of parameter correlations reveals strong
relationships between effects and parameters, such as the highpass and low-shelf filters often working together to shape the low
end, and the delay time correlating with the intensity of the delayed signals. Principal component analysis reveals connections to
McAdams’ timbre dimensions, where the most crucial component
modulates the perceived spaciousness while the secondary components influence spectral brightness. Statistical testing confirms
the non-Gaussian nature of the parameter distribution, highlighting
the complexity of the vocal effects space. These initial findings on
the parameter distributions set the foundation for future research
in vocal effects modelling and automatic mixing.
Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss
Tian Cheng, Tomoyasu Nakano and Masataka Goto
Abstract: This paper addresses the task of lyrics-to-audio alignment, which
involves synchronizing textual lyrics with corresponding music
audio. Most publicly available datasets for this task provide annotations only at the line or word level. This poses a challenge
for training lyrics-to-audio models due to the lack of frame-wise
phoneme labels. However, we find that phoneme labels can be
partially derived from word-level annotations: for single-phoneme
words, all frames corresponding to the word can be labeled with
the same phoneme; for multi-phoneme words, phoneme labels can
be assigned at the first and last frames of the word. To leverage
this partial information, we construct a mask for those frames and
propose a masked frame-wise cross-entropy (CE) loss that considers only frames with known phoneme labels. As a baseline model,
we adopt an autoencoder trained with a Connectionist Temporal
Classification (CTC) loss and a reconstruction loss. We then enhance the training process by incorporating the proposed framewise masked CE loss. Experimental results show that incorporating the frame-wise masked CE loss improves alignment performance. In comparison to other state-of-the art models, our model
provides a comparable Mean Absolute Error (MAE) of 0.216 seconds and a top Median Absolute Error (MedAE) of 0.041 seconds
on the testing Jamendo dataset.
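The proposed loss is easy to state in code; a minimal sketch (tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_frame_ce(logits, labels, mask):
    # logits: (T, n_phonemes); labels: (T,); mask: (T,), 1 where the phoneme
    # label could be derived from the word-level annotation, 0 elsewhere.
    per_frame = F.cross_entropy(logits, labels, reduction="none")
    return (per_frame * mask).sum() / mask.sum().clamp(min=1)

T, P = 100, 40
loss = masked_frame_ce(torch.randn(T, P), torch.randint(0, P, (T,)),
                       (torch.rand(T) > 0.7).float())
```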
Automatic Classification of Chains of Guitar Effects Through Evolutionary Neural Architecture Search
Michele Rossi, Giovanni Iacca and Luca Turchet
Abstract: Recent studies on classifying electric guitar effects have achieved
high accuracy, particularly with deep learning techniques. However, these studies often rely on simplified datasets consisting
mainly of single notes rather than realistic guitar recordings.
Moreover, in the specific field of effect chain estimation, the literature tends to rely on large models, making them impractical for
real-time or resource-constrained applications. In this work, we
recorded realistic guitar performances using four different guitars
and created three datasets by applying a chain of five effects with
increasing complexity: (1) fixed order and parameters, (2) fixed order with randomly sampled parameters, and (3) random order and
parameters. We also propose a novel Neural Architecture Search
method aimed at discovering accurate yet compact convolutional
neural network models to reduce power and memory consumption.
We compared its performance to a basic random search strategy,
showing that our custom Neural Architecture Search outperformed
random search in identifying models that balance accuracy and
complexity. We found that the number of convolutional and pooling layers becomes increasingly important as dataset complexity
grows, while dense layers have less impact. Additionally, among
the effects, tremolo was identified as the most challenging to classify.
Inference-Time Structured Pruning for Real-Time Neural Network Audio Effects
Christopher Johann Clarke and Jatin Chowdhury
Abstract: Structured pruning is a technique for reducing the computational
load and memory footprint of neural networks by removing structured subsets of parameters according to a predefined schedule
or ranking criterion.
This paper investigates the application of
structured pruning to real-time neural network audio effects, focusing on both feedforward networks and recurrent architectures.
We evaluate multiple pruning strategies at inference time, without retraining, and analyze their effects on model performance. To
quantify the trade-off between parameter count and audio fidelity,
we construct a theoretical model of the approximation error as a
function of network architecture and pruning level. The resulting bounds establish a principled relationship between pruninginduced sparsity and functional error, enabling informed deployment of neural audio effects in constrained real-time environments.
Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial Approaches
Eloi Moliner, Michal Švento, Alec Wright, Lauri Juvela, Pavel Rajmic and Vesa Välimäki
Abstract: Accurately estimating nonlinear audio effects without access to
paired input-output signals remains a challenging problem. This
work studies unsupervised probabilistic approaches for solving this
task. We introduce a method, novel for this application, based
on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a
previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the
effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show
that the diffusion-based approach provides more stable results and
is less sensitive to data availability, while the adversarial approach
is superior at estimating more pronounced distortion effects. Our
findings contribute to the robust unsupervised blind estimation of
audio effects, demonstrating the potential of diffusion models for
system identification in music technology.
Empirical Results for Adjusting Truncated Backpropagation Through Time While Training Neural Audio Effects
Yann Bourdin, Pierrick Legrand and Fanny Roche
Abstract: This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in
digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with
and without conditioning by user controls. Results demonstrate
that carefully tuning these parameters enhances model accuracy
and training stability, while also reducing computational demands.
Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the
revised TBPTT configuration maintains high perceptual quality.
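For orientation, a minimal TBPTT loop showing where the studied hyperparameters (batch size, truncation/sequence length, number of sequences) enter; the architecture and data here are placeholders:

```python
import torch

rnn = torch.nn.GRU(input_size=1, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))

x = torch.randn(4, 8192, 1)        # (batch size, time, features)
y = torch.randn(4, 8192, 1)        # target, e.g. compressor output
seq_len, h = 512, None             # truncation (sequence) length
for start in range(0, x.shape[1], seq_len):
    xc, yc = x[:, start:start + seq_len], y[:, start:start + seq_len]
    out, h = rnn(xc, h)
    loss = torch.nn.functional.mse_loss(head(out), yc)
    opt.zero_grad(); loss.backward(); opt.step()
    h = h.detach()                 # carry the state, but cut the gradient here
```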
Neural-Driven Multi-Band Processing for Automatic Equalization and Style Transfer
Parakrant Sarkar and Permagnus Lindborg
Abstract: We present a Neural-Driven Multi-Band Processor (NDMP), a differentiable audio processing framework that augments a static six-band Parametric Equalizer (PEQ) with per-band dynamic range
compression. We optimize this processor using neural inference
for two tasks: Automatic Equalization (AutoEQ), which estimates
tonal and dynamic corrections without a reference, and Production
Style Transfer (NDMP-ST), which adapts the processing of an input signal to match the tonal and dynamic characteristics of a reference. We train NDMP using a self-supervised strategy, where the
model learns to recover a clean signal from inputs degraded with
randomly sampled NDMP parameters and gain adjustments. This
setup eliminates the need for paired input–target data and enables
end-to-end training with audio-domain loss functions. At inference time, AutoEQ enhances previously unseen inputs in a blind setting, while NDMP-ST performs style transfer by predicting task-specific processing parameters. We evaluate our approach on the
MUSDB18 dataset using both objective metrics (e.g., SI-SDR,
PESQ, STFT loss) and a listening test.
Our results show that
NDMP consistently outperforms traditional PEQ and a PEQ+DRC
(single-band) baseline, offering a robust neural framework for audio enhancement that combines learned spectral and dynamic control.
TorchFX: A Modern Approach to Audio DSP with PyTorch and GPU Acceleration
Matteo Spanio and Antonio Rodà
Abstract: The increasing complexity and real-time processing demands of
audio signals require optimized algorithms that utilize the computational power of Graphics Processing Units (GPUs).
Existing Digital Signal Processing (DSP) libraries often do not provide
the necessary efficiency and flexibility, particularly for integrating
with Artificial Intelligence (AI) models. In response, we introduce TorchFX: a GPU-accelerated Python library for DSP, engineered to facilitate sophisticated audio signal processing. Built on
the PyTorch framework, TorchFX offers an Object-Oriented interface similar to torchaudio but enhances functionality with a novel
pipe operator for intuitive filter chaining. The library provides a
comprehensive suite of Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, with a focus on multichannel
audio, thereby facilitating the integration of DSP and AI-based
approaches.
Our benchmarking results demonstrate significant
efficiency gains over traditional libraries like SciPy, particularly
in multichannel contexts. While there are current limitations in
GPU compatibility, ongoing developments promise broader support and real-time processing capabilities. TorchFX aims to become a useful tool for the community, contributing to innovation
in GPU-accelerated DSP. TorchFX is publicly available on GitHub
at https://github.com/matteospanio/torchfx.
Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains
Aogu Wada, Tomohiko Nakamura and Hiroshi Saruwatari
Abstract: Audio effects (AFXs) are essential tools in music production, frequently applied in chains to shape timbre and dynamics. The order of AFXs in a chain plays a crucial role in determining the final sound, particularly when non-linear (e.g., distortion) or time-variant (e.g., chorus) processors are involved. Despite its importance, most AFX-related studies have primarily focused on estimating effect types and their parameters from a wet signal. To
address this gap, we formulate AFX chain recognition as the task
of jointly estimating AFX types and their order from a wet signal.
We propose a neural-network-based method that embeds wet signals into a hyperbolic space and classifies their AFX chains. Hyperbolic space can represent tree-structured data more efficiently
than Euclidean space due to its exponential expansion property.
Since AFX chains can be represented as trees, with AFXs as nodes
and edges encoding effect order, hyperbolic space is well-suited
for modeling the exponentially growing and non-commutative nature of ordered AFX combinations, where changes in effect order can result in different final sounds. Experiments using guitar
sounds demonstrate that, with an appropriate curvature, the proposed method outperforms its Euclidean counterpart. Further analysis based on AFX type and chain length highlights the effectiveness of the proposed method in capturing AFX order.
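The distance underlying such embeddings has a simple closed form on the Poincaré ball; a sketch with curvature fixed to -1 (the paper treats curvature as a tunable quantity):

```python
import torch

def poincare_dist(u, v, eps=1e-6):
    # u, v: points strictly inside the unit ball
    sq = ((u - v) ** 2).sum(-1)
    den = ((1 - (u ** 2).sum(-1)).clamp(min=eps)
           * (1 - (v ** 2).sum(-1)).clamp(min=eps))
    return torch.acosh(1 + 2 * sq / den)   # grows rapidly near the boundary,
                                           # giving room for tree-like structure

d = poincare_dist(torch.tensor([0.1, 0.2]), torch.tensor([-0.3, 0.4]))
```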
Towards an Objective Comparison of Panning Feature Algorithms for Unsupervised Learning
Richard Mitic and Andreas Rossholm
Abstract: Estimations of panning attributes are an important feature to extract from a piece of recorded music, with downstream uses such
as classification, quality assessment, and listening enhancement.
While several algorithms exist in the literature, there is currently
no comparison between them and no studies to suggest which one
is most suitable for any particular task. This paper compares four
algorithms for extracting amplitude panning features with respect
to their suitability for unsupervised learning. It finds synchronicities between them and analyses their results on a small set of
commercial music excerpts chosen for their distinct panning features. The ability of each algorithm to differentiate between the
tracks is analysed. The results can be used in future work to either
select the most appropriate panning feature algorithm or create a
version customized for a particular task.
Unsupervised Text-to-Sound Mapping via Embedding Space Alignment
Luke Dzwonczyk and Carmine-Emanuele Cella
Abstract: This work focuses on developing an artistic tool that performs an
unsupervised mapping between text and sound, converting an input text string into a series of sounds from a given sound corpus.
With the use of a pre-trained sound embedding model and a separate, pre-trained text embedding model, the goal is to find a mapping between the two feature spaces. Our approach is unsupervised, which allows any sound corpus to be used with the system.
The tool performs the task of text-to-sound retrieval, creating a
soundfile in which each word in the text input is mapped to a single sound in the corpus, and the resulting sounds are concatenated
to play sequentially. We experiment with three different mapping
methods, and perform quantitative and qualitative evaluations on
the outputs. Our results demonstrate the potential of unsupervised
methods for creative applications in text-to-sound mapping.
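A minimal sketch of the retrieval step (the linear alignment map W is a hypothetical stand-in for whichever of the three mapping methods is learned):

```python
import numpy as np

def retrieve(word_embs, sound_embs, W):
    mapped = word_embs @ W                                  # text space -> sound space
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = sound_embs / np.linalg.norm(sound_embs, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)                         # nearest corpus sound per word

idx = retrieve(np.random.randn(5, 128),      # one embedding per input word
               np.random.randn(1000, 128),   # embeddings of the sound corpus
               np.random.randn(128, 128))    # hypothetical learned alignment map
```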
Generative Latent Spaces for Neural Synthesis of Audio Textures
Aaron Dees and Seán O'Leary
Abstract: This paper investigates the synthesis of audio textures and the
structure of generative latent spaces using Variational Autoencoders (VAEs) within two paradigms of neural audio synthesis:
DSP-inspired and data-driven approaches. For each paradigm, we
propose VAE-based frameworks that allow fine-grained temporal
control. We introduce datasets across three categories of environmental sounds to support our investigations. We evaluate and compare the models’ reconstruction performance using objective metrics, and investigate their generative capabilities and latent space
structure through latent space interpolations.
RT-PAD-VC – Creative Applications of Neural Voice Conversion as an Audio Effect
Paolo Sani, Edgar Andres Suarez Guarnizo, Kishor Kayyar Lakshminarayana, and Christian Dittmar
Abstract: Streaming-enabled voice conversion (VC) bears the potential for many creative applications as an audio effect. This demo paper details our low-latency, real-time implementation of the recently proposed Prosody-aware Decoder Voice Conversion (PAD-VC). Building on this technical foundation, we explore and demonstrate diverse use cases in creative processing of speech and vocal recordings. Enabled by its voice cloning capabilities and fine-grained controllability, RT-PAD-VC can be used as a low-delay, quasi real-time audio effects processor for gender conversion, timbre and formant-preserving pitch-shifting, vocal harmonization and cross-synthesis from musical instruments. The on-site demo setup will allow participants to interact in a playful way with our technology.
SCHAEFFER: A Dataset of Human-Annotated Sound Objects for Machine Learning Applications
Maurizio Berta and Daniele Ghisi
Abstract: Machine learning for sound generation is rapidly expanding within
the computer music community. However, most datasets used to
train models are built from field recordings, foley sounds, instrumental notes, or commercial music. This presents a significant
limitation for composers working in acousmatic and electroacoustic music, who require datasets tailored to their creative processes.
To address this gap, we introduce the SCHAEFFER Dataset (Spectromorphological Corpus of Human-annotated Audio with Electroacoustic Features For Experimental Research), a curated collection of 1000 sound objects designed and annotated by composers and students of electroacoustic composition. The dataset,
distributed under Creative Commons licenses, features annotations
combining technical and poetic descriptions, alongside classifications based on pre-defined spectromorphological categories.
Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space
Christian Limberg, Fares Schulz, Zhe Zhang and Stefan Weinzierl
Abstract: This paper presents a novel approach to neural instrument sound
synthesis using a two-stage semi-supervised learning framework
capable of generating pitch-accurate, high-quality music samples
from an expressive timbre latent space. Existing approaches that
achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and
provide unintuitive user experiences. We address this limitation
through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a
Variational Autoencoder; second, we use this representation as
conditioning input for a Transformer-based generative model. The
learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the
proposed method effectively learns a disentangled timbre space,
enabling expressive and controllable audio generation with reliable
pitch conditioning. Experimental results show the model’s ability to capture subtle variations in timbre while maintaining a high
degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential
as a step towards future music production environments that are
both intuitive and creatively empowering:
https://pgesam.faresschulz.com/.
Neural Sample-Based Piano Synthesis
Riccardo Simionato and Stefano Fasciani
Abstract: Piano sound emulation has been an active topic of research and development for several decades. Although comprehensive physics-based piano models have been proposed, sample-based piano emulation is still widely utilized for its computational efficiency and
relative accuracy despite presenting significant memory storage
requirements. This paper proposes a novel hybrid approach to
sample-based piano synthesis aimed at improving the fidelity of
sound emulation while reducing memory requirements for storing samples. A neural network-based model processes the sound
recorded from a single example of a piano key at a given velocity.
The network is trained to learn the nonlinear relationship between
the various velocities at which a piano key is pressed and the corresponding sound alterations. Results show that the method achieves
high accuracy using a specific neural architecture that is computationally efficient, has few trainable parameters, and requires storing only one sample per piano key.
Piano-SSM: Diagonal State Space Models for Efficient MIDI-to-Raw Audio Synthesis
Dominik Dallinger, Matthias Bittner, Daniel Schnöll, Matthias Wess and Axel Jantsch
Abstract: Deep State Space Models (SSMs) have shown remarkable performance in long-sequence reasoning tasks such as raw audio classification and audio generation. This paper introduces Piano-SSM, an end-to-end deep SSM neural network architecture designed to synthesize raw piano audio directly from MIDI input. The network requires no intermediate representations or domain-specific expert knowledge, simplifying training and improving accessibility.
Quantitative evaluations on the MAESTRO dataset
show that Piano-SSM achieves a Multi-Scale Spectral Loss (MSSL)
of 7.02 at 16kHz, outperforming DDSP-Piano v1 with an MSSL of
7.09. At 24kHz, Piano-SSM maintains competitive performance
with an MSSL of 6.75, closely matching DDSP-Piano v2’s result of 6.58. Evaluations on the MAPS dataset achieve an MSSL
score of 8.23, which demonstrates the generalization capability
even when training with very limited data. Further analysis highlights Piano-SSM’s ability to train on high sampling-rate audio
while synthesizing audio at lower sampling rates, explicitly linking performance loss to aliasing effects. Additionally, the proposed model facilitates real-time causal inference through a custom C++17 header-only implementation. On a single core of an Intel Core i7-12700 processor at 4.5GHz, the largest network synthesizes one second of audio at 44.1kHz in 0.44s, with a workload of 23.1GFLOPS/s and a 10.1µs input/output delay, while the smallest network at 16kHz needs only 0.04s, with 2.3GFLOP/s and a 2.6µs input/output delay. These results underscore Piano-SSM’s practical utility and efficiency in
real-time audio synthesis applications.
Non-Iterative Simulation: A Numerical Analysis Viewpoint
Alessia Andò (University of Udine)
Abstract: Stiff ordinary differential equations (ODEs) frequently appear in scientific and engineering applications, necessitating numerical methods that ensure stability and efficiency. Non-iterative approaches for stiff ODEs provide an alternative to fully implicit schemes, whose computation time can be unpredictable to a degree that is unacceptable in real-time virtual analog applications.
This tutorial will focus on Rosenbrock-Wanner (ROW) methods and exponential integration techniques, whose origins date back to the 1960s. ROW methods are linearly implicit: they replace the solution of a nonlinear system with a fixed number of linear solves per step. Exponential integrators, on the other hand, incorporate stiff dynamics by leveraging matrix exponentials, and offer advantages in problems whose stiffness or oscillatory nature is mainly driven by their linear component. We will discuss the derivation, stability properties, and practical implementation of these methods, and compare their strengths, limitations, and potential for real-world virtual analog applications through illustrative examples.
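As a concrete illustration, the simplest Rosenbrock-type scheme, the linearly implicit Euler method, replaces the Newton iteration of implicit Euler with exactly one linear solve per step, so the per-step cost is fixed and predictable (a sketch, not drawn from the tutorial's material):

```python
import numpy as np

def rosenbrock_euler_step(f, jac, x, h):
    J = jac(x)                          # Jacobian of f at the current state
    A = np.eye(x.size) - h * J          # solve (I - h J) dx = h f(x)
    dx = np.linalg.solve(A, h * f(x))
    return x + dx

# Stiff linear test problem x' = -1000 x: stable even at a large step size.
f = lambda x: np.array([-1000.0 * x[0]])
jac = lambda x: np.array([[-1000.0]])
x = np.array([1.0])
for _ in range(10):
    x = rosenbrock_euler_step(f, jac, x, h=0.01)
```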
Dr. Alessia Andò is a postdoctoral fellow at the Department of Mathematics, Computer Science and Physics, University of Udine, where she received her PhD in 2020. She also worked as a postdoc at GSSI (Gran Sasso Science Institute), Italy.
Within the general area of Numerical Analysis, her main research interests are ordinary and delay differential equations and related dynamical systems and models. The focus is towards both the numerical time integration and the dynamical analysis of the models, which includes the computation of invariant sets and the study of their asymptotic stability.
Logarithmic Frequency Resolution Filter Design for Audio
Balázs Bank (Budapest University of Technology and Economics)
Abstract: Digital filters are often used to model or equalize acoustic or electroacoustic transfer functions. Applications include headphone, loudspeaker, and room equalization, or modeling the radiation of musical instruments for sound synthesis. As the final judge of quality is the human ear, filter design should take into account the quasi-logarithmic frequency resolution of the auditory system. This tutorial presents various approaches for achieving this goal, including warped FIR and IIR, Kautz, and fixed-pole parallel filters, and discusses their differences and similarities. Application examples will include physics-based sound synthesis, loudspeaker and room equalization, and the equalization of a spherical loudspeaker array.
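To make the core idea concrete, a minimal warped-FIR sketch: each unit delay of an ordinary FIR filter is replaced by a first-order allpass, which concentrates frequency resolution toward low frequencies for positive warping coefficients (lambda near 0.75 roughly matches the Bark scale at 44.1 kHz; this example is illustrative, not taken from the tutorial):

```python
import numpy as np

def warped_fir(x, coeffs, lam=0.75):
    K = len(coeffs)
    prev = np.zeros(K)                 # tap values d_0 .. d_{K-1} at sample n-1
    y = np.zeros(len(x))
    for n, xn in enumerate(x):
        new = np.empty(K)
        new[0] = xn
        for k in range(1, K):
            # allpass in place of a unit delay:
            # d_k[n] = -lam*d_{k-1}[n] + d_{k-1}[n-1] + lam*d_k[n-1]
            new[k] = -lam * new[k - 1] + prev[k - 1] + lam * prev[k]
        prev = new
        y[n] = coeffs @ new            # weighted sum of warped taps
    return y

y = warped_fir(np.random.randn(1024), coeffs=np.array([1.0, 0.5, 0.25]))
```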
Balázs Bank is an associate professor at the Department of Artificial Intelligence and Systems Engineering, Budapest University of Technology and Economics (BUTE), Hungary. He received his M.Sc. and Ph.D. degrees in Electrical Engineering from BUTE in 2000 and 2006, respectively, and his Hungarian Academy of Sciences (MTA) doctoral degree in 2023. In the academic year 1999/2000 and in 2007 he was with the Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland. In 2008 he was with the Department of Computer Science, Verona University, Italy. Between 2000 and 2006, and again since 2009, he has been with BUTE. He was an Associate Editor for IEEE Signal Processing Letters in 2013–2016 and for IEEE Signal Processing Magazine in 2018–2022, and the lead Guest Editor for the 2022 JAES special issue “Audio Filter Design”. His research interests include physics-based sound synthesis and filter design for audio applications.
Building Flexible Audio DDSP Pipelines: A Case Study on Artificial Reverb
Gloria Dal Santo (Aalto University)
Abstract: This tutorial focuses on Differentiable Digital Signal Processing (DDSP) for audio synthesis, an approach that applies automatic differentiation to digital signal processing operations. By implementing signal models in a differentiable manner, it becomes possible to backpropagate loss gradients through their parameters, enabling data-driven optimization without losing domain knowledge.
DDSP has gained popularity due to its domain-appropriate inductive biases, yet it still presents several challenges. The parameters of differentiable models are often constrained by stability conditions, affected by non-uniqueness issues, and may belong to different domains and distributions, making optimization nontrivial.
This tutorial provides an overview of these limitations and introduces FLAMO, a library designed to facilitate more flexible training pipelines. A key focus will be on loss functions: how to select appropriate ones, insights from perceptually informed losses, and techniques for validating them.
Demonstrations will use FLAMO, an open-source Python library built on PyTorch’s automatic differentiation framework. Practical examples will primarily centre on recursive systems for artificial reverberation applications.
Gloria Dal Santo received the M.Sc. degree in electrical and electronic engineering from the Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland in 2022, during which she interned at the Audio Machine Learning team at Logitech.
She is currently working toward a Doctoral degree with the Acoustics Lab, at Aalto University, Espoo, Finland. Her research interests include artificial reverberation and audio applications of machine learning, with a focus on designing more robust and psychoacoustically informed systems.
Plausible Editing of our Acoustic Environment
Annika Neidhardt (University of Surrey)
Abstract: The technology and methods for creating spatial auditory illusions have evolved phenomenally, to the point where we can create illusions that cannot be distinguished from reality anymore. However, so far, such convincing quality can only be achieved with accurate knowledge about the target environment based on measurements or detailed modelling. Rendering virtual content into previously unknown environments remains a challenge. Quick automatic characterisation of their acoustic properties is necessary. What information do we need to extract to render convincing illusions? Moreover, to what extent can we become creative in manipulating the appearance of the actual environment without compromising its plausibility and vividness? This tutorial will give insight into the perceptual requirements for rendering audio for Augmented and Extended Reality.
Annika Neidhardt is a Senior Research Fellow in Immersive Audio at the University of Surrey. She has been an active researcher of related topics for more than 10 years. She holds an MSc in Electrical Engineering (Automation & Robotics) from Technische Universität Chemnitz and an MSc in Audio Engineering (Computermusic & Multimedia) from the University of Music and Performing Arts Graz. After three years in advanced development and applied science, she started her own research project at Technische Universität Ilmenau in the group of Karlheinz Brandenburg in 2017 on 6DoF binaural audio and related perceptual requirements and evaluation. She defended her PhD thesis on the plausibility of simplified room acoustic representations in Augmented Reality in May 2023. In addition, she conducted research on the automatic characterisation of acoustic environments, and perceptual implications for audio in Social VR and XR. Since autumn 2023, Annika has continued her research at the Institute of Sound Recording in Surrey with more focus on room acoustic modelling and perceptual modelling.
DISCOVER THE POWER OF MODELING – PART 1: Synthesis Techniques for Acoustic Instrument Emulation
Audio Modeling
Abstract: This presentation introduces SWAM (Synchronous Wave Acoustic Modeling), a technology for accurate acoustic instrument emulation, developed by Audio Modeling. The development process and underlying synthesis techniques are discussed, highlighting their ability to reproduce expressive nuances. VariFlute, a physical model of the flute family, is presented as a case study demonstrating high realism and detailed playability.
DISCOVER THE POWER OF MODELING – PART 2: Room Modeling Combining Physical and Psychoacoustic Approaches
Audio Modeling
Abstract: This presentation discusses the need for an efficient spatializer to represent the continuous and coherent positioning of realistic instruments in a room. Ambiente, an integrated spatializer within SWAM, is presented, combining physical modeling and psychoacoustic principles to achieve accurate and immersive room simulation.
Bridging Symbolic and Audio Data: Score-Informed Music Performance Data Estimation
Johanna Devaney (Brooklyn College and the Graduate Center, CUNY)
Abstract: The empirical study of musical performance dates back to the birth of recorded media. From the laborious manual processes used in the earliest work to the current data-hungry end-to-end models, the estimation, modelling, and generation of expressive performance data remains challenging. This talk will consider the advantages of score-aligned performance data estimation, both for guiding signal processing algorithms and leveraging musical score data and other types of linked symbolic data (such as annotations) for analysing and modelling performance-related data. While the focus of this talk will primarily be on musical performance, connections to speech data will also be discussed, as well as the resultant potential for cross-modal analysis.
Johanna Devaney is an Associate Professor at Brooklyn College and the Graduate Center, CUNY, where she teaches courses in music theory, music technology, and data analysis. Johanna’s research primarily examines the ways in which recordings can be used to study and model performance, and she has developed computational tools to facilitate this. Her research on computational methods for audio understanding has been funded by the National Endowment for the Humanities (NEH) Digital Humanities program and the National Science Foundation (NSF). Johanna currently serves as the Co-Editor-in-Chief of the Journal of New Music Research.
Reverberation – Dereverberation: The promise of hybrid models
Gaël Richard (Télécom Paris, Institut Polytechnique de Paris)
Abstract: The propagation of acoustic waves within enclosed environments is inherently shaped by complex interactions with surrounding surfaces and objects, leading to phenomena such as reflections, diffractions, and the resulting reverberation. Over the years, a wide range of reverberation models have been developed, driven by both theoretical interest and practical applications, including artificial reverberation synthesis—where realistic reverberation is added to anechoic signals—and dereverberation, which aims to suppress reverberant components in recorded signals. In this keynote, we will provide a concise overview of some reverberation modeling approaches and illustrate how these models can be integrated into hybrid frameworks that combine classical signal processing, physical modeling, and machine learning techniques to advance artificial reverberation synthesis or dereverberation.
Gaël Richard received the State Engineering degree from Telecom Paris, France in 1990, and the Ph.D. degree and Habilitation from the University of Paris-Saclay in 1994 and 2001, respectively. After the Ph.D. degree, he spent two years at Rutgers University, Piscataway, NJ, in the Speech Processing Group of Prof. J. Flanagan. From 1997 to 2001, he successively worked for Matra, Bois d’Arcy, France, and for Philips, Montrouge, France. He then joined Telecom Paris, where he is now a Full Professor in audio signal processing. He is also the co-scientific director of the Hi! PARIS interdisciplinary center on AI and Data analytics. He is a coauthor of over 250 papers and inventor in 10 patents. His research interests are mainly in the field of speech and audio signal processing and include topics such as source separation, machine learning methods for audio/music signals and music information retrieval. He is a Fellow of the IEEE and was the chair of the IEEE SPS Technical Committee on Audio and Acoustic Signal Processing (2021–2022). In 2020, he received the IMT–Académie des Sciences Grand Prize. In 2022, he was awarded an ERC Advanced Grant from the European Union for a project on machine listening and artificial intelligence for sound.
Effecting Audio: An Entangled Approach to Signals, Concepts and Artistic Contexts
Andrew McPherson (Imperial College London)
Abstract: I propose to approach audio effects not as technical objects, but as a kind of activity. The shift from noun (“audio effect”) to verb (“effecting audio”, in the sense of applying transformations to sound) calls attention to the motivations, discourses and contexts in which audio processing, analysis and synthesis take place. We build audio-technical systems for specific reasons in specific situations. No system is ever devoid of sociocultural context or human intervention, and even the simplest technologies when examined in situ can exhibit fascinating complexity.
My talk will begin with a stubbornly contrarian take on some seemingly obvious premises of musical audio processing. Physicist and feminist theorist Karen Barad writes that “language has been granted too much power.” I would like to propose that as designers and researchers, we can let words about music take precedence over the messy and open-ended experience of making music, but that becoming overly preoccupied with language risks propagating clichés and reinforcing cultural stereotypes. Drawing on recent scholarship in human-computer interaction and science and technology studies, I will recount some alternative approaches and possible futures for designing digital audio technology when human and technical factors are inextricably entangled. I will illustrate these ideas with recent projects from the Augmented Instruments Laboratory, with a focus on rich bidirectional couplings between digital and analog electronics, acoustics and human creative experience.
Navigation Instructions
In the Overview Tab:
Clicking on any event card will bring you to the detailed daily program, scrolled to the time and day of the event (hit the Overview tab to go back to the program summary).
In the Daily Program:
On page reload, the program opens at the next closest date of conference events. Hit any date tab to change the day, or the Overview tab to reach the overview of the program.
Clicking on the title of any paper will open a popup window showing the abstract of that paper (close [X] the popup window or press Esc to continue navigating the program).
Clicking on the author(s) will open the original paper PDF in another tab of the browser.
Clicking on any tutorial or keynote title will open a popup window with a summary of the talk and a short biography of the speaker(s).
In the Abstract PopUp Window:
Move the horizontal slider to change the width of the window, and use the right scroll bar (if available) to browse the abstract.
About
This online conference program was generated by means of scripts developed for DAFx by Gianpaolo Evangelista