Guitar Effects Estimation with DDSP

Summary of SRIP 2024

Wittemann 1

Title:

Blind estimation of guitar AFX's using DDSP by Luke Wittemann

Abstract:

Given the ubiquity of audio effects in the creation and production of music, there is a
need to efficiently estimate the effects used in order to recreate a sound or tone. While
traditional machine learning techniques have produced some promising results, they require a
massive amount of properly labeled data, and extrapolation to unseen configurations still
leaves something to be desired. By using differentiable digital signal processing (DDSP), a
developing field which integrates differentiable modules of traditional DSP techniques into
neural networks, both the complexity of the models and the amount of required data can be
reduced appreciably. Additionally, the nature of DDSP is such that it can be used for a large
variety of tasks ranging from classification to timbre transfer between datasets. This study
plans to explore the potential of DDSP to estimate an entire guitar effects chain, starting with
the pickup(s) used and electronics settings and continuing to varying numbers of cascaded effects.

Progress:

Thankfully, I got the opportunity to prepare for the summer during a quarter of ECE 199
at UCSD. Coming from an audio signal processing background, I wanted to research something
that fused existing digital signal processing (DSP) techniques with more modern
machine learning (ML) ones. I used the quarter to narrow down my area of research by reading
numerous research papers in order to survey areas where work has and has not been done.
One area I was considering during this time was effects estimation, where a system
identifies the effects, such as reverb or equalization, used to create a given audio
sample. It wasn’t until I came across Google’s DDSP project that I really saw a clear path
forward. DDSP is a set of differentiable digital synthesizers and effects whose parameters can
become trainable features inside a neural network. It was this ability to combine a trainable
source instrument in the form of an autoencoder with differentiable effects that finally inspired
me to choose the estimation of an entire guitar signal chain as my area of research. This is
because, in essence, the different sounds that a guitar’s different pickups produce can be
thought of as distinct instruments: when playing the same note, the presence and
distribution of the harmonics are the prime factor differentiating the pickups. The DDSP
autoencoder takes audio and learns how to recreate it using the sum of a harmonic synthesizer
and an additive noise synthesizer. From my experience, the harmonic synth captures the
majority of the harmonic content of notes while the noise synth captures more cacophonous and
unpredictable moments such as the very start of a plucked note. By default, the autoencoder
is trained with a loss function that penalizes spectral magnitude and log magnitude error.
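
As a rough illustration of a loss that combines spectral magnitude and log magnitude error, here is a minimal single-frame sketch in plain Python. The function names, the `eps` constant, and the default weights are assumptions made for illustration; DDSP's own loss works on spectrograms at several FFT sizes and is not reproduced here.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectral_loss(target, estimate, mag_weight=1.0, logmag_weight=1.0, eps=1e-5):
    """Weighted sum of mean absolute magnitude and log-magnitude differences.

    eps is an assumed small constant that keeps log() finite for empty bins.
    """
    mag_t = magnitude_spectrum(target)
    mag_e = magnitude_spectrum(estimate)
    pairs = list(zip(mag_t, mag_e))
    mag_err = sum(abs(a - b) for a, b in pairs) / len(pairs)
    log_err = sum(abs(math.log(a + eps) - math.log(b + eps)) for a, b in pairs) / len(pairs)
    return mag_weight * mag_err + logmag_weight * log_err


def sine(freq_hz, n=64, sample_rate=16000, amp=1.0):
    """Hypothetical test signal: n samples of a sine wave."""
    return [amp * math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]
```

Comparing a frame with itself gives a loss of zero, while `spectral_loss(sine(440), sine(880))` is positive. The log term makes quiet bins count for more than they would in a magnitude-only comparison.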

Before I could start training, I needed to record some sample audio. I used the USB out
on my Boss Katana to send the audio to my computer, where it was recorded using REAPER.
The amp was left on the clean channel with the EQ and gain controls set at 12 o’clock. Though
the Fender Stratocaster I used to record the samples had five “positions”, I only recorded tracks
for the individual pickup positions (1, 3, 5). This is because positions 2 and 4 are simply sums of
the adjacent positions (2 is the sum of 1 and 3, while 4 is the sum of 3 and 5), so these could be
created after the fact using samples from the three distinct pickup positions. For my first training
attempt, I recorded a little over 20 minutes of audio for each pickup position, trying my best to
play the same thing, melodically and dynamically, for all three pickups.
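
Taking the description above at face value, creating an in-between position after the fact amounts to summing two tracks sample by sample. The sketch below assumes the two tracks are time-aligned, equally long, and stored as floats in the range -1 to 1; `mix_positions` is a hypothetical helper, not part of any recording tool.

```python
def mix_positions(track_a, track_b):
    """Sum two mono tracks sample by sample, rescaling only if the sum would clip."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same length")
    mixed = [a + b for a, b in zip(track_a, track_b)]
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        # Bring the loudest sample back to full scale to avoid clipping.
        mixed = [s / peak for s in mixed]
    return mixed
```

For example, `mix_positions(position_1, position_3)` would stand in for position 2, where both arguments are placeholder names for the recorded sample lists.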

When first trying to use DDSP, I started off on Google Colab trying to run DDSP’s sample
notebooks. To my surprise, these didn’t run at all. I eventually found a workaround in the form
of specifying certain Python packages to be installed before running them. This worked, but now the
model training was being done without a GPU, meaning it would take days to generate each
model. After some digging, I found that this issue was an incompatibility between Google
Colab’s current version of Linux (Ubuntu 22.04) and the Nvidia CUDA toolkit, so I decided to try
training the models through Linux under WSL (Windows Subsystem for Linux) on my personal PC
using an RTX 3080. It was during this time that I gained a familiarity with both basic Linux
commands and the underlying structure of Python and how its package system works. After much
trial and error, I finally found the winning combination of Linux, CUDA, Python, and Python
package versions to start training models. Now, with my CUDA-enabled desktop, I was able to
train models in 1-2 hours instead of the 1-2 days on Google Colab. I tuned the number of
training steps (a hyperparameter) by looking at the spectral loss, the magnitude of the noise
synthesis (one of the most direct signs of overfitting for this model), and by simply listening
to the resynthesized audio. This led me to the conclusion that, using my dataset, around 5000
training steps was optimal to minimize both loss and overfitting.

Sample from training data (left) and resynthesis after training (right)


To evaluate the models, I used the three trained models to resynthesize a previously
unseen audio clip (one not in the training data). The recreation that was the most similar to the
sample would then be the guess. Initially, I used a simple magnitude or log magnitude spectral error
as a means of comparison, but found this to be inconsistent. The magnitude error tended to
pick the signal whose average spectral power was closest to the original, while the log
magnitude error was too insensitive to the more minute spectral differences
between the recreations and the original sample. My first thought was to use a spectral
threshold such that the frequencies between harmonics which contained little to no information
were ignored in spectral comparisons. While initially promising using hand-selected threshold
values, this ended up working poorly, because automating the threshold value with a
statistic such as the spectral average proved to be inconsistent at best. I also noticed there were
times in the original sample where the harmonics would die off in a way that was not being
represented by the recreations. With a threshold based on the original, these moments
where the models were clearly failing would be ignored. In other words, some spots where all
spectra were near zero would be ignored as intended, but there were also areas where the
spectra did differ, and these were being ignored unintentionally.
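
The thresholded comparison and the guessing step can be sketched as follows. This is an illustrative reconstruction rather than the code used in the project: the function names are hypothetical, each spectrum is a plain list of bin magnitudes, and the spectral average is used as the automatic threshold mentioned above.

```python
def thresholded_error(original, recreation, threshold):
    """Mean absolute difference over bins where the ORIGINAL exceeds the threshold."""
    kept = [(o, r) for o, r in zip(original, recreation) if o > threshold]
    if not kept:
        return 0.0
    return sum(abs(o - r) for o, r in kept) / len(kept)


def guess_pickup(original, recreations, threshold=None):
    """Return the label of the recreation with the smallest thresholded error."""
    if threshold is None:
        # Automatic threshold: the average magnitude of the original spectrum.
        threshold = sum(original) / len(original)
    errors = {
        label: thresholded_error(original, spectrum, threshold)
        for label, spectrum in recreations.items()
    }
    return min(errors, key=errors.get)
```

The sketch also reproduces the pitfall described above: `thresholded_error([0.0, 4.0], [3.0, 4.0], 1.0)` returns 0.0, because the bin where the original has died off but the recreation has not falls below the threshold and is never compared.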

Next, I decided to threshold the spectral comparisons not by a magnitude, but by the
fundamental frequency (F0) confidence provided by the DDSP encoder. I did this because the start
of notes, which would largely be represented by noise synthesis, is not useful information in
comparing the models, as the original and recreations matched almost exactly. The spot where
valid comparisons could be made was after the string vibration settled to a fundamental
frequency and its harmonics. Thankfully, this corresponds almost exactly to when the F0
confidence from the encoder reaches a certain level (0.8-0.95 out of a maximum of 1 in practice).
Again, this seemed effective at first but had its own pitfalls. While more accurate than the initial
mag and log mag comparisons, the F0-confidence gated comparisons still had the same tendencies
to focus too much on average spectral energy or ignore smaller spectral differences. It was at
this point that I went back and recorded a second set of training and validation samples, this
time focusing more on actual guitar playing instead of just playing every note on the fretboard at
different loudnesses. While this did indeed improve the models, the pickup positions were still
just too similar for accurate differentiation using this method, and the erroneous non-decaying
harmonics of the recreations were still present. Thankfully, it was at this time that I got the
opportunity to present my project at UCSD’s SRC 2024 and to my PI Tara Javidi and all of her
research students. This gave me a chance to step back a little and get some valuable feedback.
It was in the meeting with my PI that one of her students recommended using the DDSP losses
used for model training as opposed to my own.
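
A minimal sketch of F0-confidence gating, assuming the spectra are stored as one list of bin magnitudes per frame and the encoder's confidence is available as one value per frame. The function name and the default gate of 0.9 (picked from the 0.8-0.95 range quoted above) are illustrative assumptions.

```python
def confidence_gated_error(original_frames, recreation_frames, f0_confidence,
                           min_confidence=0.9):
    """Average per-frame spectral error over frames whose F0 confidence passes the gate.

    Returns None when no frame passes, so "no data" is not mistaken for "no error".
    """
    total, count = 0.0, 0
    for orig, rec, conf in zip(original_frames, recreation_frames, f0_confidence):
        if conf < min_confidence:
            continue  # skip note onsets and other frames without a settled pitch
        total += sum(abs(o - r) for o, r in zip(orig, rec)) / len(orig)
        count += 1
    return total / count if count else None
```

Only frames with a settled pitch contribute, so note onsets, where all three recreations match the original almost exactly, no longer dilute the comparison.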

For training, the DDSP autoencoder used a loss that is an evenly weighted linear
combination of the spectral magnitude loss and the spectral log magnitude loss. As soon as I
moved to this, my guessing accuracy started to become more consistent. Not previously
mentioned is the fact that for each recreation, both the overall pitch and loudness (a high-level
abstraction of mag and log mag) could be adjusted to tune the results. Previously, I
would have to set and sometimes adjust these manually to make the comparisons between the
recreations more valid. To prevent these adjustable parameters from interfering with the
guessing, I developed a system that only produces a recreation whose pitch is aligned with the
original sample and whose loudness error relative to the original is minimized. With
the optimized resynthesis, I tested the guessing with the numerous different losses provided by
DDSP. These losses included magnitude, log magnitude, loudness, delta time, delta frequency,
and a cumulative sum of frequencies. Using the optimized resynthesis, fourteen of the
fifteen 45-second samples (five for each pickup) were guessed correctly by at least one of these
metrics. The cumulative sum of frequencies loss was the most accurate individual metric,
guessing nine of the fifteen samples correctly.
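
To make the less familiar metrics concrete, the sketch below computes magnitude, delta time, delta frequency, and cumulative-sum-of-frequencies differences on a small spectrogram stored as a list of frames. It is an assumption-laden illustration with hypothetical function names; DDSP's own implementation operates on multi-scale spectrograms and may differ in detail. L1 here means mean absolute difference and L2 means mean squared difference.

```python
def diff_time(spec):
    """Frame-to-frame difference of each bin (change over time)."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(spec, spec[1:])]


def diff_freq(spec):
    """Bin-to-bin difference inside each frame (change over frequency)."""
    return [[b - a for a, b in zip(frame, frame[1:])] for frame in spec]


def cumsum_freq(spec):
    """Running sum over the frequency bins of each frame."""
    out = []
    for frame in spec:
        total, row = 0.0, []
        for value in frame:
            total += value
            row.append(total)
        out.append(row)
    return out


def mean_difference(a, b, loss_type="L1"):
    """Mean absolute (L1) or mean squared (L2) difference between two spectrograms."""
    diffs = [x - y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    if loss_type == "L1":
        return sum(abs(d) for d in diffs) / len(diffs)
    return sum(d * d for d in diffs) / len(diffs)


def metric_losses(target, estimate, loss_type="L1"):
    """Evaluate each metric separately so every one can cast its own guess."""
    return {
        "magnitude": mean_difference(target, estimate, loss_type),
        "delta_time": mean_difference(diff_time(target), diff_time(estimate), loss_type),
        "delta_freq": mean_difference(diff_freq(target), diff_freq(estimate), loss_type),
        "cumsum_freq": mean_difference(cumsum_freq(target), cumsum_freq(estimate), loss_type),
    }
```

A per-metric guess would then be the pickup model whose recreation gives the smallest value for that metric. Because the cumulative sum carries an error in a low bin into every bin above it, it weights low-frequency mismatches more heavily than a plain magnitude comparison does.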

Pitch and loudness optimized sample resynthesis using trained DDSP instrument model. Note-on mask (left),
loudness (center), and pitch/fundamental frequency (right)

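Number of the twelve metrics (six losses, each computed as L1 and as L2) that guessed each
sample correctly. Samples are labeled by pickup position and sample number.

Sample   Correct metrics
1.1       3
1.2       7
1.3      11
1.4       9
1.5       6
3.1       6
3.2       2
3.3       4
3.4       1
3.5       0
5.1       9
5.2       3
5.3       2
5.4       7
5.5       9

Correct guesses per metric, out of fifteen samples:

Loss   Ma   Lg   Lo   Ti   Fr   Cs
L1      8    6    3    7    8    9
L2      8    6    3    5    8    8

Ma = magnitude, Lg = log magnitude, Lo = loudness, Ti = delta time, Fr = delta frequency,
Cs = cumulative sum of frequencies.

Loss testing results using pitch and L2 loudness optimized resynthesis. A guess counts as correct
when that metric alone picks the right pickup.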

While I would have loved to continue the effects estimation, the summer is over and I am
happy that I still got to make good progress on what I see as the most original part of the
project. If this work were to be continued, I see the 16 kHz sample rate as the single
biggest current bottleneck to performance. This is because the guitar, especially with effects, is
capable of outputting frequencies right up to the limit of human perception. A 16 kHz sample
rate can only represent frequencies below 8 kHz, so it doubtless cuts out a significant portion
of the signal’s information in exchange for integrating it more efficiently into the neural
network. Another issue with recreating audio using this method was the non-decaying harmonics,
or the tendency for the harmonics inside the recreations to decay more slowly than they do in
the original samples, and sometimes not at all. Whether this was a problem with the model or
with my implementation, I was never able to determine. In either case, I believe a system could
be implemented to clip the non-decaying harmonics using the loudness signal, as there is a
strong correlation between the decay of the loudness and the decay of the harmonics.
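
One way such a clipping system might look is a gate that fades the harmonic amplitudes out as the loudness approaches a floor. This is only a sketch of the proposed idea, which was not implemented in this project; the function name, the -60 dB floor, and the 10 dB ramp are all assumed values chosen for illustration.

```python
def gate_harmonics_by_loudness(harmonic_amps, loudness_db, floor_db=-60.0, knee_db=10.0):
    """Scale each frame's harmonic amplitudes toward zero as loudness nears the floor.

    harmonic_amps: one list of harmonic amplitudes per frame.
    loudness_db:   one loudness value (in dB) per frame.
    """
    gated = []
    for amps, loud in zip(harmonic_amps, loudness_db):
        # Gain ramps linearly from 0 at floor_db up to 1 at floor_db + knee_db.
        gain = min(1.0, max(0.0, (loud - floor_db) / knee_db))
        gated.append([a * gain for a in amps])
    return gated
```

Frames well above the floor pass through unchanged, so the gate only affects the tails of notes, where the original's harmonics have already decayed.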

Conclusion:

Though I was unable to attain my original goal of complete effects estimation, spending
my summer on this project under UCSD’s SRIP was as rewarding as it was challenging. Before
this summer, I had never used Python, Linux, or a professional DAW. Throughout the project, I
used Python and TensorFlow to train instrument models on my own CUDA-enabled desktop
through Linux under WSL. For the training and testing data, I personally recorded hours of sample
audio using REAPER. To analyze the accuracy of these models, I developed a system which
normalized the loudness and pitch of the different models’ recreations such that they could be
compared spectrally to an original sample. While this method has its disadvantages, this project
has confirmed the possibility of integrating instrument models into effects estimation
systems in order to reduce the amount of required training data. At this point, I would like to
thank Tara Javidi at UCSD for sponsoring my research under the SRIP and continually providing
insightful feedback.
