Speech Enhancement for Cochlear Implant
Recipients Using Deep Complex Convolution
Transformer With Frequency Transformation
ABSTRACT
Cochlear implant (CI) users face significant challenges in understanding speech in natural
environments due to background noise or overlapping speakers. These external disruptions
distort the time-frequency (T-F) characteristics of speech signals, including both the
magnitude spectrum and phase. While most traditional speech enhancement (SE) approaches focus
on improving the magnitude response alone, recent studies emphasize the critical role of
phase in enhancing perceptual speech quality. Inspired by multi-task machine learning, this
work introduces a deep complex convolution transformer network (DCCTN) designed for complex
spectral mapping, addressing both magnitude and phase enhancement simultaneously. The DCCTN
employs a complex-valued U-Net framework, integrating a transformer module within the
bottleneck layer to effectively capture low-level contextual details in the T-F domain. To
enhance harmonic correlation in speech, the architecture incorporates a frequency
transformation block in the encoder of the U-Net. By learning a complex transformation
matrix, the DCCTN accurately reconstructs clean speech from noisy input spectrograms in the
T-F domain. Experimental evaluations reveal that DCCTN surpasses existing models, including
the convolutional recurrent network (CRN), deep complex convolutional recurrent network
(DCCRN), and gated convolutional recurrent network (GCRN), in terms of objective speech
quality and intelligibility, under both known and unknown noise conditions. A listener study
conducted with four CI users demonstrated a marked improvement in speech intelligibility in
noisy environments. Moreover, DCCTN effectively suppresses highly non-stationary noise
without producing the musical artifacts often associated with traditional SE methods.
I. INTRODUCTION
Cochlear implants (CIs) offer an invaluable solution for individuals with severe to profound
sensorineural hearing loss. However, the quality of speech perception can still be influenced
by factors such as background noise, distortions, and reverberation [1], [2]. The presence of
noise often results in listener fatigue, strain, and other adverse effects, particularly when
individuals are exposed to noisy environments for extended periods. Speech enhancement (SE)
techniques aim to reduce the impact of background noise on speech signals, thereby improving
speech perception.
Recently, network-based SE methods have been classified into supervised and unsupervised
categories. Unsupervised SE methods, such as traditional Wiener filtering [3] and model-based
approaches [4], estimate speech production properties and the statistical characteristics of
noise. These methods perform well when these characteristics can be accurately estimated.
However, their efficacy diminishes in non-stationary noisy environments or when it is
challenging to obtain precise estimates of speech or environmental properties. To address
these limitations, various supervised SE approaches have been developed [5]-[8]. Text-directed
SE [9] and deep learning-based SE systems have significantly advanced the field, especially
in handling non-stationary audio environments [10]-[12]. These supervised SE techniques
employ deep learning models trained on large labelled datasets to differentiate between
speech and noise, thereby enhancing CI performance in challenging listening scenarios. These
advancements substantially improve speech perception for CI users, enabling better
communication and reducing the adverse effects of noise interference. Over the past few
decades, monaural SE or single-channel SE techniques have been extensively researched,
demonstrating remarkable success in low Signal-to-Noise Ratio (SNR) environments and
surpassing traditional methods. Deep neural network (DNN)-based approaches have shown
significant progress in noise-independent SE, though they struggle to generalize speaker
characteristics [13]. Convolutional neural networks (CNNs), initially designed for analysing
local image patterns, have proven effective for capturing local features in input signals
[14], [15]. However, CNNs face limitations in modelling explicit long-range dependencies due
to the localized nature of convolution operations. In contrast, recurrent neural networks
(RNNs), including long short-term memory (LSTM) and gated recurrent units (GRUs), excel at
modeling long-term dependencies and sequential information. Nevertheless, their
lack of parallel processing capabilities leads to high computational complexity. Researchers
have attempted to mitigate these limitations by integrating LSTM layers between the encoder
and decoder to extract high-level features and expand receptive fields. Despite these
efforts, contextual information is often underutilized, which impacts denoising performance.
To overcome these challenges, hybrid models combining CNNs and RNNs have been explored,
leveraging their respective strengths to capture both local and long-range dependencies. For
example, Tan et al. [16] introduced a convolutional recurrent neural network (CRN) as an
encoder-decoder architecture for SE, later extended to a gated convolutional recurrent
network (GCRN) [17], achieving improved SE results. Strake et al. [18] addressed data
reshaping issues in CRN models by employing convolutional LSTM for SE, which replaced fully
connected mappings with convolutional mappings, preserving local structures in CNN feature
maps.
Transformers have recently achieved remarkable success in natural language processing by
effectively capturing long-range dependency structures and supporting efficient parallel
processing [19], [20]. Unlike traditional CNN-based methods, transformers excel in modelling
global context and demonstrate superior transferability across tasks through large-scale
pre-training. Self-attention mechanisms within transformers enable the model to focus on
relevant input spectrogram features while ignoring less significant information.
Transformers also integrate multi-head attention, feed-forward neural networks, and residual
connections to generate robust hidden representations, enhancing the quality of enhanced
speech signals. However, the full potential of transformers in audio-visual SE remains
underexplored.
Although many DNN-based SE models achieve significant improvements in short-time Fourier
transform (STFT) magnitude, they typically combine it with noisy phase data to reconstruct
the time-domain waveform. While some studies have downplayed the importance of phase in SE
[21], others have recognized its critical role and proposed methods to estimate both
magnitude and phase. Early efforts [22]-[24] incorporated phase information into magnitude
processing. Subsequent work [16], [25]-[28] introduced networks for reconstructing the
complex spectrogram of clean speech. For instance, Tan and Wang [17] extended the CRN model
into GCRN by incorporating a gated linear unit (GLU) block to regulate information flow.
While these approaches improve SE, they often remove some speech components along with
background noise, leading to added distortion.
Complex-valued SE networks represent a cutting-edge advancement in this field by jointly
reconstructing magnitude and phase information [5], [28]-[30]. By leveraging the
interdependence between real and imaginary components, these networks effectively mitigate
noise and interference, demonstrating superior performance in challenging acoustic
environments. For instance, the deep complex CRN (DCCRN) [34] combines CNN and RNN
architectures to emulate complex-valued targets, achieving performance advantages across
both objective and subjective metrics. The DCCRN+ [35] extended this approach by
incorporating sub-band processing, resulting in a faster noise suppressor. Although these
methods enhance speech quality, they may still introduce processing distortions by removing
portions of speech while suppressing noise. The success of these networks highlights the
importance of developing advanced architectures to improve speech quality further. Harmonic
correlation plays a crucial role in speech perception, but traditional noise reduction
algorithms often suppress harmonic components in the original noisy signal, introducing
artifacts [36]-[38]. Techniques for phase reconstruction and harmonic enhancement [37], [39]
have been proposed to address these issues. For example, Mamun and Hansen [40] achieved
significant success in SE by restoring formants, demonstrating substantial benefits for CI
users.
Building on these insights, this study proposes a deep complex convolutional transformer
network (DCCTN) for SE tailored to CI users. The DCCTN integrates a U-Net-style
complex-valued CNN, frequency transformation layers (FTL) in the encoder, a two-stage
transformer in the bottleneck, and convolutional blocks in the skip connections. The
contributions are four-fold: (1) development of a fully complex-valued DCCTN using a complex
audio transformer and a complex frequency transformation network; (2) introduction of complex
FTLs in the encoder to exploit harmonic correlations, enabling an effective T-F
representation; (3) integration of a complex audio transformer in the bottleneck layer, using
self-attention to capture long-range dependencies, parallel processing for efficiency, and
multi-head attention for robustness and generalization; and (4) replacement of conventional
skip connections in the U-Net with complex-valued convolutional blocks to improve spectral
information sharing, enabling the recovery of lost frequency components in distorted signals.
The DCCTN uses complex-valued convolution throughout its processing pipeline to ensure
effective phase and magnitude reconstruction.
This paper is organized as follows. Section II details the implementation of the proposed
DCCTN. Sections III and IV describe the experimental setup and evaluation results,
respectively. Section V discusses the findings, and Section VI concludes the study.
II. METHODOLOGY
The proposed Deep Complex Convolutional Transformer
Network (DCCTN) is built on the principle of reconstructing
lost harmonics in speech signals. This approach addresses
the issue of degraded audio quality by utilizing an advanced
audio transformer network specifically designed for
complex-valued signal representations. These
representations effectively combine both magnitude and
phase information. By adopting this architecture, DCCTN
seeks to recover missing harmonics, thereby improving the
overall quality of speech and potentially enhancing its
intelligibility. The subsequent section provides a detailed
explanation of the DCCTN architecture.
A. Overall Architecture
The primary objective of DCCTN is to process degraded speech signals and restore them into
high-quality audio, thereby improving both the perceived quality and, potentially,
intelligibility. To accomplish this, the proposed DCCTN architecture (illustrated in Fig. 1)
consists of four key components: (1) a fully convolutional complex-valued encoder-decoder
network (Cplx-UNet), (2) a complex-valued audio transformer integrated into the bottleneck
layer, which effectively captures long-range dependencies that traditional convolutional
operations cannot model, (3) complex-valued frequency transformation modules, and (4)
complex-valued convolutional blocks embedded in the skip connections between the encoder and
decoder.
The encoder and decoder blocks within the network are constructed using complex-valued
convolution layers. These layers are specifically designed to progressively enhance both the
magnitude and phase components of the input signal. The encoder architecture is organized as
a sequence of encoder blocks, augmented with Frequency Transformation Layers (FTLs) applied
both before and after these blocks. This configuration ensures that the decoder can leverage
the complete range of spectral and temporal features extracted by the encoder.
In the bottleneck layer, transformer modules play a critical role by explicitly modelling
long-range dependencies that are crucial for processing sequential data, such as speech
signals. Unlike convolutional operations, which are limited to local receptive fields, the
transformer modules excel at capturing global context within the input sequence.
Furthermore, their parallel processing capability enhances computational efficiency, making
the DCCTN architecture well-suited for handling complex audio data.
To bridge the semantic gap between the encoder and decoder features, the skip connections are
equipped with several complex convolutional blocks, referred to as SkipBlocks. These blocks
guide the decoder in reconstructing the enhanced output more effectively by aligning features
from the encoder and decoder.
Figure 1 depicts the basic structure of DCCTN, including the complex frequency transformation
module. Key parameters, such as the number of channels (C), kernel dimensions (K), and input
feature dimensions (T) for the FTL, are highlighted. Additionally, the notation
"SkipBlocks*L" represents the number of SkipBlocks used in each skip connection.
B. Complex-Valued Encoder-Decoder Layer
The complex-valued encoder-decoder architecture distinguishes itself from traditional
real-valued networks by using complex convolutions to enhance the quality of reconstructed
speech signals. This section elaborates on its algorithmic foundation and operation.
The encoder in the proposed architecture comprises three essential components: complex
convolution, complex batch normalization, and complex nonlinear activation functions. Within
the U-Net framework, complex convolution is employed to improve both the magnitude and phase
components of the time-frequency (T-F) representation of noisy speech signals. While
conventional convolutional layers operate by sliding a kernel matrix over the input matrix
and performing point-wise multiplication and summation, the complex convolution generalizes
this operation to handle complex-valued inputs and kernels. This approach ensures that both
input and kernel matrices retain their complex-valued properties. Despite operating on
complex numbers, the fundamental convolution operation remains unchanged: the complex
convolutional layer performs the same point-wise operations as its real-valued counterpart.
However, the resulting output is a complex-valued matrix that incorporates both magnitude
and phase information. This additional information enables the network to extract more
intricate features and patterns, making complex-valued convolution a powerful tool for
processing audio signals. In parallel to the encoder, the decoder employs complex transposed
convolution instead of standard complex convolution. This mechanism is designed to
reconstruct clean speech signals by effectively using the information encoded by the
complex-valued encoder. The complex transposed convolution facilitates effective utilization
of the encoded features to generate the final output.
The mathematical formulation of complex convolution operates on the complex input
X = Xr + jXi and the network's complex kernel W = Wr + jWi. The operation combines the real
and imaginary components of X and W to produce a complex-valued output Z, expressed as
Z = (Xr * Wr − Xi * Wi) + j(Xr * Wi + Xi * Wr)
where * denotes convolution. This formulation integrates magnitude and phase information,
enabling the architecture to reconstruct cleaner speech signals with greater precision. By
using this approach, the complex-valued encoder-decoder architecture enhances its capability
to process noisy input data and deliver high-quality outputs.
C. Complex-Valued Audio Transformer
The transformer, a sophisticated machine learning module, employs a self-attention mechanism
that allows it to concentrate selectively on critical components of the input signal at each
layer. In this study, a complex-valued audio transformer is embedded within the bottleneck
layer of the DCCTN model, replacing the traditional transformer used for speech enhancement
(SE), as depicted in Fig. 2(a). This approach significantly enhances the model's capacity to
capture long-range dependencies in the input features.
Within the complex transformer network, real-valued convolutions are performed on the real
and imaginary components of the feature maps and weights. Let X denote the complex-valued
feature map from the encoder, where Xr and Xi represent its real and imaginary components,
respectively. The transformer output for each input, represented as T(.), is combined to
produce the complex-valued output YT, formulated as
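As a concrete illustration of how a complex-valued operation can be assembled from real-valued convolutions on the real and imaginary parts, the complex convolution of Section II-B can be sketched in a few lines. This is a minimal one-dimensional NumPy sketch, not the network's actual layer: the function name `complex_conv1d`, the random test signal, and the use of a plain (non-strided, single-channel) convolution are all assumptions made for illustration.

```python
import numpy as np


def complex_conv1d(x, w):
    """Complex convolution built from four real-valued convolutions.

    x, w: 1-D complex arrays (hypothetical input and kernel).
    Implements Z = (Xr*Wr - Xi*Wi) + j(Xr*Wi + Xi*Wr), where * is convolution.
    """
    xr, xi = x.real, x.imag
    wr, wi = w.real, w.imag
    # Real part: Xr*Wr - Xi*Wi
    zr = np.convolve(xr, wr, mode="valid") - np.convolve(xi, wi, mode="valid")
    # Imaginary part: Xr*Wi + Xi*Wr
    zi = np.convolve(xr, wi, mode="valid") + np.convolve(xi, wr, mode="valid")
    return zr + 1j * zi


# Illustrative random data (assumption: 16 input samples, kernel of length 3).
rng = np.random.default_rng(0)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)

z = complex_conv1d(x, w)

# The four-real-convolution form agrees with convolving the complex arrays directly.
assert np.allclose(z, np.convolve(x, w, mode="valid"))
```

The final assertion checks the decomposition against NumPy's native complex convolution, which is one way to validate a real-valued implementation of a complex layer before moving it into a deep learning framework.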
Figure 2(b) illustrates the architecture of the transformer module, which includes multiple
layers of self-attention, feed-forward neural networks (FFN), and optional positional
encoding. However, in this study, positional encoding was excluded, as it proved ineffective
for capturing acoustic sequences. The self-attention mechanism operates as an intra-attention
module, enabling the model to learn task-independent sequence representations. This
mechanism allows the network to concentrate on relevant features in the input spectrogram
while disregarding extraneous information. The attended features are subsequently processed
by the feed-forward layers to generate a robust hidden representation of the input.
The self-attention module computes relationships between elements in the sequence using
queries, keys, and values derived from linear transformations of the input. Queries identify
the elements to concentrate on, keys assess the similarity between elements, and values gauge
the significance of each element. These operations occur in parallel across multiple "heads"
within the attention mechanism. In a standard transformer, the input representation is
processed through fully connected layers to produce the matrices Q, K, and V, given by
Q = X Wq, K = X Wk, V = X Wv
where Wq, Wk, and Wv are trainable weight matrices, and X represents the input signal. The
attention mechanism uses these matrices to model the relationships between sequence elements
relevant to the task and to adapt to different input data types.
The feed-forward network operates independently on each position in the input sequence and
includes a gated recurrent unit (GRU) with a ReLU activation function, followed by a linear
transformation layer, T(.). The output at each position is passed through subsequent
transformer layers. The attention module's output is defined as
The final output of the attention module is concatenated with the input sequence X and
normalized to produce O1. The FFN introduces nonlinearity, enabling the model to capture
complex nonlinear relationships between positions in the speech sequence. This sophisticated
processing allows the transformer to identify intricate patterns and excel in tasks like
language modelling and sequence prediction.
By incorporating the complex-valued audio transformer in its bottleneck layer, the DCCTN
model effectively captures long-range dependencies in the input features. The self-attention
mechanism, combined with the omission of positional encoding, focuses on critical information
and generates a robust hidden representation, significantly enhancing the model's capability
to process and improve speech signals.
D. Complex-Valued Frequency Transformation
When noise or additional disruptions degrade a speech signal, harmonics can be disrupted,
resulting in diminished speech quality. To improve signal quality, preserving or
reconstructing these harmonics is essential. However, traditional CNN kernels are designed
for the spatial domain, working with images or matrices using a sliding window technique for
convolution operations. When applied to a time-frequency (T-F) spectrogram, which represents
speech in the frequency domain, these spatial-domain kernels are not effective at capturing
harmonics. This is because harmonics are spread across the frequency axis of the T-F
spectrogram and are not localized in the spatial domain.
To effectively capture harmonics, specialized T-F CNN kernels, such as gamma-tone filter-bank
kernels or wavelet convolutional kernels, are essential. Research has shown that using the
attention module can effectively capture harmonic components of speech, restore the enhanced
signal, and reconstruct missing frequencies in a band-limited signal [38]. However, most
existing networks employ attention modules across the frequency axis with real-valued
networks operating only on the magnitude response. Notably, [8] and [43] utilized attention
modules on both real and imaginary spectrograms.
In this study, we extend the FTL to the complex domain by introducing a complex-valued FTL
that focuses on frequency features while maintaining the interdependence between the real
and imaginary components of the complex-valued feature map. To utilize the interaction
between different feature channels, we apply the attention module to the incoming feature
maps, pointwise multiply its output with the input features, and output the result. Next, we
apply the trainable frequency transformation matrix (FTM) to the feature maps at time step t
to ensure global frequency correlation along the frequency axis. Finally, we concatenate the
output of the FTM module with the input features using a CNN layer to ensure both global and
local frequency correlation among harmonics.
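The frequency transformation step described above, in which a trainable matrix mixes all frequency bins of each time frame, can be sketched as follows. This is only an illustrative NumPy sketch under stated assumptions: the function name `apply_ftm`, the tensor sizes, the random values standing in for trained weights, and the choice of a complex-valued F×F matrix are hypothetical, and the attention and concatenation stages of the FTL are omitted.

```python
import numpy as np


def apply_ftm(features, w_ftm):
    """Apply a frequency transformation matrix to every time frame.

    features: complex array of shape (T, F, C)  (frames, frequency bins, channels)
    w_ftm:    complex array of shape (F, F)     (stand-in for the trainable FTM)
    Returns an array of shape (T, F, C) where, for each frame t,
    out[t] = w_ftm @ features[t].
    """
    # out[t, g, c] = sum_f w_ftm[g, f] * features[t, f, c]
    return np.einsum("gf,tfc->tgc", w_ftm, features)


# Illustrative sizes and random data (assumptions, not values from the paper).
rng = np.random.default_rng(0)
T, F, C = 4, 6, 2
u = rng.standard_normal((T, F, C)) + 1j * rng.standard_normal((T, F, C))
w = rng.standard_normal((F, F)) + 1j * rng.standard_normal((F, F))

u_tr = apply_ftm(u, w)

# The same result written with real-valued matrix products on the real and
# imaginary parts, which keeps the two components interdependent.
real_part = w.real @ u.real - w.imag @ u.imag
imag_part = w.real @ u.imag + w.imag @ u.real
assert np.allclose(u_tr, real_part + 1j * imag_part)
```

Because every output bin is a weighted sum over all input bins of the same frame, a single such layer has a receptive field spanning the whole frequency axis, which is the property the text relies on for capturing harmonic structure.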
As depicted in Fig. 3, the FTL comprises three stacked CNN layers: a fully connected layer, a
CNN layer for FTM, and a CNN layer for concatenation. The complex-valued feature maps are
extracted from the encoder's stacked CNN layers, forming a sequence of F frequency vectors
with C channels and T frames in total. The input feature vector is defined as follows:
U ∈ R^{T×F×C} (8)
The trainable Frequency Transformation Matrix (FTM) is applied to the feature map slice at
each point in time. Let W_FTM ∈ R^{F×F} represent the trainable FTM and U(t0) ∈ R^{F×C}
denote the feature slice at time step t0. The transformed feature slice at time step t0 is
given by:
U_tr(t0) = W_FTM · U(t0) (9)
where t0 ∈ {0, 1, ..., T − 1}.
E. Complex-Valued SkipBlocks
A recent study has identified a potential issue in the U-Net architecture concerning the
transfer of features between its encoder and decoder, linked to a probable semantic gap
[44]. The encoder's initial layer focuses on acquiring low-level regional spectral and
temporal features, while subsequent layers progressively acquire higher-level features.
Meanwhile, the final layer of the decoder receives highly processed features from its
preceding layer and low-level features from the first layer of the encoder via a skip
connection. The incompatibility between these two feature sets could potentially limit the
network's learning ability. To address this issue, our study proposes adding convolution
layers within the skip connection to directly transform the encoder features into a more
intuitive form for the decoder, thus compensating for this incompatibility. Although skip
connections are crucial for developing robust networks, the identified semantic gap can
impede speech synthesis quality. Incorporating convolution blocks within the skip connection
therefore enhances the network's ability to learn and share spectral information. This
approach has already proven successful in image segmentation and speech dereverberation. We
introduce a series of 'SkipBlocks' along each skip connection path within the architecture.
Each 'SkipBlock' comprises a complex convolution layer and a normalization layer, followed by
a complex ReLU activation function. Importantly, the number of SkipBlocks deployed decreases
with the depth of the respective encoder layer (as shown in Fig. 1). Thus, a skip connection
linked to the encoder's final layer will have just one SkipBlock, while one connected to the
first layer of the encoder will contain up to eight SkipBlocks. This ensures a tailored
approach to the varying levels of feature abstraction across the network.
F. Loss Function
Most advanced networks use either a time domain or frequency domain loss function to optimize
machine learning models, which may not fully align with the perceptual quality of the
reconstructed signal. Therefore, this study optimizes the proposed network by calculating
both time and frequency domain losses of the real and imaginary components. We adopt a
frame-level auxiliary loss, an STFT-based auxiliary loss, and a scale-invariant
signal-to-distortion ratio (SISDR) [45] loss to minimize the mean square error between the
network prediction and the corresponding clean spectrogram, Ya.
(10)
where α is the weight (with α = 50 chosen in this study to align with SISDR loss) and the
time domain SISDR loss is defined as
(11)
and the frequency domain loss, LFreq, is a combination of the spectral convergence loss, LSC,
and the logarithmic STFT magnitude loss, LMag, and can be expressed as:
where ||.||F and ||.||1 denote the Frobenius and L1 norms, respectively, and |STFT(.)|
denotes the magnitudes of the spectrogram.
III. EXPERIMENTAL SETUP
A. Speech Database
In our experiments, we evaluated the proposed SE model using the TIMIT database [46], which
includes 6300 speech utterances from American English speakers, each phonetically
transcribed. The duration of each sentence ranges from 3 to 5 seconds. The training set is a
subset of the TIMIT database, consisting of 950 utterances from 50 speakers. These sentences
were modified by adding eight distinct noise sources from the AURORA dataset, using five
different SNRs: −10, −5, 0, 5, and 10 dB. From the training utterances, we held out 150
randomly selected utterances to create a validation set. The environmental noise conditions
included samples from various sources such as airports, babble, cars, exhibitions, train
stations, city streets, speech-shaped noise (SSN), and white Gaussian noise.
For testing the model, a second subset of 50 samples was used. These utterances were mixed
with one of three known noise types (babble, car, and SSN) and two unknown noise types
(restaurant and train), using seven different SNR levels: −7.5, −5, −2.5, 0, 2.5, 5, and
10 dB (note: four of the seven SNRs are the same as in the training set). 'Known' noise
refers to the noise types encountered by the model during training, whereas 'unknown' noise
denotes noise types that were entirely new to the trained model. Consequently, the training
set includes 38,000 noisy-clean pairs, while the test set contains 1,750 pairs. All speech
samples and noises were resampled at a rate of 16 kHz.
B. Subjective Listener Evaluation
1) Stimuli and Subject Demographics: This study involved four CI users, three post-lingually
and one prelingually deafened (two males and two females). The ages of the CI subjects at
the time of the test ranged from 50 to 75 years, with an average age of 64.5 years. Implant
use varied from 5 to 14 years, with an average of 6.2 years.
Table I provides the demographic details of the CI participants. All participants were native
English speakers and had the Nucleus cochlear implant system by Cochlear Corporation. They
were compensated for their participation in this study. The test stimuli used in this
evaluation were derived from the TIMIT corpus. Each sentence contains 3–5 keywords voiced by
numerous speakers. The root-mean-square value of all sentences was standardized to
approximately 65 dB. All stimuli were sampled at 16 kHz. To simulate noisy speech, two types
of noise, babble and car noise, were used as maskers at 0 dB and 5 dB SNR. This database was
created to assess the speech recognition abilities of CI users in noisy environments. The
speech corpus comprises 12 lists, each containing 5 phonetically balanced sentences recorded
from six speakers.
2) Experiment I. Speech Intelligibility Assessment: The test was conducted using CCi-cloud
[47], an online research platform developed by UTD-CRSS-CILab, featuring a MATLAB GUI with a
test dataset running in the backend. CI users performed the test with their daily clinical
processor. Recordings from the TIMIT database were used to create the
experiment. The test comprised 60 randomly selected samples. It began with a short training
phase where participants listened to a set of five clean stimuli to familiarize themselves
with the testing procedure. After the training phase, a speech token was presented to the
listener. Participants were asked to listen to 60 samples in different conditions. Each
subject participated in a total of 12 test conditions (2 noise types * 2 SNR levels * 3
processing conditions). The noisy and enhanced samples were chosen to be distinct from each
other. Participants were allowed to listen to the test token only twice. They were then asked
to type what they had heard in a designated box within the MATLAB GUI. The total number of
speech samples was 60 (2 noise types * 2 SNR levels * 3 processing conditions * 5
samples/condition) throughout the experiment. The presentation order of the enhanced speech,
noisy speech, and SNR level was randomized throughout the session. The average testing time
for the experiment was one hour.
3) Experiment II. Speech Quality Assessment: The test set for this experiment is similar to
Experiment I. Each test consists of four speech samples: clean speech as a reference sample,
noisy speech, and two enhanced speech samples processed by a current state-of-the-art (SOA)
network and our proposed network. Each subject participated in a total of 12 test conditions
(2 noise types * 2 SNR levels * 3 processing conditions). The total number of test sets was
40 (2 noise types * 2 SNR levels * 10 samples/condition) and speech samples were 160 (40
samples/networks * 4 conditions) throughout the experiment. Participants were asked to
listen to a clean speech token in each test set as a reference sample. After that, three
test audio samples (one original distorted and two enhanced using two networks) were
randomly presented. Participants were allowed to listen to each speech token as many times
as they wanted. They were then asked to select the best sample (among the three test
samples) that was closest to the reference clean sample. Additionally, they were asked to
perceptually rate each of the three test samples on a scale from 1 to 5, with 1 being poor
quality and 5 being the highest quality. In each set, the clean and processed samples were
chosen to be unique in this experiment. The average testing time for the experiment was 30
minutes.
C. Evaluation Metrics
To assess the performance of the proposed network in terms of speech quality and
intelligibility, we employ several objective metrics. Specifically, we calculate the
Short-Time Objective Intelligibility (STOI) [48] and Spectrogram Orthogonal Polynomial
Measure (SOPM) [49] for intelligibility evaluation. For measuring speech quality, we use the
Perceptual Evaluation of Speech Quality (PESQ) [50] and SISDR [45] metrics. Additionally, we
assess speech distortion using the Log-Spectral Distance (LSD) [51] and Itakura-Saito (IS)
[52], [55], [56] metrics.
In all objective metrics except LSD and IS, higher values indicate better performance,
suggesting that the enhancement system effectively reduces distortion while maintaining the
quality of the target speech. PESQ scores typically range from −0.5 to 4.5, with higher
values indicating improved speech quality. Both SOPM and STOI map objective scores to the
range of [0, 1], where higher values indicate enhanced intelligibility. In contrast, the
SISDR, LSD, and IS scores are unbounded. Lower values for LSD and IS indicate better
similarity or less distortion between the signals. IS values ranging from 0 to 0.5 reflect
waveform coding level distortion, while values between 1.5 and 5.0 indicate greater additive
noise distortion. By taking into account these various metrics, we can comprehensively
assess the functionality of the proposed network and gain insights into its ability to
effectively enhance speech quality and intelligibility.
D. Comparison Systems
This study compared the proposed network with four state-of-the-art (SOA) algorithms: CRN,
DCCRN, GCRN, and CFTNet. To validate the proposed network, we used the same training and
testing dataset, ensuring consistent test conditions throughout our evaluation. CRN [16]
integrates CNNs and RNNs, with convolutional layers capturing local spatial or temporal
patterns and recurrent layers modeling long-term dependencies and sequential information in
the input data.
DCCRN [34] is tailored to handle complex-valued sequential or time-dependent data, employing
complex-valued operations to effectively model intricate relationships within the data. It
features convolutional layers for learning spatial or temporal patterns and recurrent
connections for capturing temporal dependencies. As both DCCRN and DCCTN are derived from
CRN, this study uses DCCRN as a baseline network.
GCRN [17] combines CNNs and RNNs with gating mechanisms, particularly utilizing gated units
to model sequential or temporal dependencies within the dataset. These gating mechanisms
enable the system to selectively update and preserve pertinent information over time, making
GCRNs highly efficient in managing sequential data with long dependencies and temporal
patterns.
CFTNet [40] employs a frequency transformation module to capture distorted frequency
components in speech. Additionally, it incorporates skip connections to improve gradient
flow and mitigate the vanishing gradient problem in
deep neural networks (DNNs).This allows the system to
acquire residual mappings and highlight distinctions
between the input and the desired output.
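The complex-valued operations that DCCRN and the proposed network rely on can be illustrated with a small sketch. The NumPy example below is illustrative only and is not taken from either network's implementation. It shows how a complex convolution decomposes into four real-valued convolutions, which is a common way to build complex layers from real-valued primitives. The function name `complex_conv1d` and the random test signal are assumptions made for this example.

```python
import numpy as np


def complex_conv1d(x, w):
    """Convolve a complex signal x with a complex kernel w using only
    real-valued convolutions.

    With x = xr + j*xi and w = wr + j*wi:
        real part: xr*wr - xi*wi
        imag part: xr*wi + xi*wr
    where * denotes (real) convolution.
    """
    xr, xi = x.real, x.imag
    wr, wi = w.real, w.imag
    real = np.convolve(xr, wr) - np.convolve(xi, wi)
    imag = np.convolve(xr, wi) + np.convolve(xi, wr)
    return real + 1j * imag


# Hypothetical data: one complex "spectrogram row" and one complex kernel.
rng = np.random.default_rng(0)
x = rng.standard_normal(32) + 1j * rng.standard_normal(32)
w = rng.standard_normal(5) + 1j * rng.standard_normal(5)

y = complex_conv1d(x, w)

# The decomposition agrees with NumPy's native complex convolution.
matches_native = np.allclose(y, np.convolve(x, w))
```

Because the real and imaginary outputs each mix both input parts, magnitude and phase are processed jointly rather than through two independent real-valued paths.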
E. Network Architecture
The DCCTN is designed to estimate non-linear
transformations from a noisy time-frequency (T-F) spectrum
of speech to a clean speech spectrum. The initial step
involves calculating the Short-Time Fourier Transform
(STFT) of the speech signal, utilizing a frame duration of 16
milliseconds and an overlap of 8 milliseconds.
The architecture of the network features eight layers of
encoder-decoder pairs, which include two frequency transformation
layers (FTLs) and two transformer layers situated in the bottleneck
section. Additionally, the architecture incorporates skip connections
that leverage convolutional layers. In the encoder layers,
convolutional layers are applied with defined kernel sizes and
strides, while the decoder layers mirror these parameters, employing
transposed convolution instead.
To maintain harmonic correlation along the frequency axis, an FTL is
incorporated after the input layer and prior to the bottleneck layer
in the encoder. The parameters of the FTL are chosen to align with
those of the equivalent encoder layer. This design improves
consistency in both magnitude and phase enhancement by leveraging the
complex spectrogram, implementing complex-valued convolutions, and
employing complex-valued LSTM layers.
The network is trained for 100 epochs using an Adam optimizer with an
initial learning rate of 0.0003 and a batch size of 16. The objective
function combines SISDR and STFT loss to minimize the mean square
error (MSE) between the network's predictions and the corresponding
clean spectrogram. The STFT loss calculates spectral convergence and
spectral magnitude losses in the STFT domain, while SISDR accounts for
channel variations, interference, and artifacts in the time domain
signal. The total number of parameters in the DCCTN model is 10.1
million, and the Multiply-Accumulate operations (MACs) total 130
million.
In summary, the DCCTN employs complex-valued convolution, a
complex-valued frequency transformation block, and complex-valued
transformer blocks to estimate the non-linear mapping from a noisy T-F
spectrum to a clean speech spectrum. The network architecture,
training process, and objective function are carefully crafted to
ensure accurate magnitude and phase representation, successfully
overcoming the obstacles presented by noisy speech signals.
IV. RESULTS
The performance of the proposed SE approach is evaluated in this
section. The assessment includes analyzing the findings of subjective
listening tests as well as objective measures for speech quality and
intelligibility. Under scenarios with both seen and unseen noise types
and different signal-to-noise ratios, the scores are compared with
those of existing models, including CRN, DCCRN, GCRN, and CFTNet.
A. Experiment 1: Subjective Speech Enhancement Performance Based on Speech Intelligibility
The speech intelligibility of cochlear implant (CI) recipients is
assessed using the Word Recognition Rate (WRR) derived from test
samples. Figure 4 displays the average WRR results under babble and
car noise conditions, with error bars representing ±1 standard
deviation in CI performance. Both the baseline and proposed approaches
improve speech intelligibility across all SNR levels and noise types.
Notably, the proposed DCCTN method consistently outperforms the
baseline network, regardless of the noise type or SNR level.
Specifically, in babble noise at 0 dB SNR, the mean WRR scores
increased from 5.95% to 21.39% with DCCRN processing and further to
54.76% with DCCTN processing. For DCCRN and DCCTN, the mean WRR scores
increased from 11.36% to 59.78% and 61.96%, respectively, at 5 dB. In
car noise, DCCTN outperformed DCCRN by 33.7% and 42.5% at 0 and 5 dB
SNR, respectively. To examine improvements across SNR levels and noise
conditions, an Analysis of Variance (ANOVA) was conducted with a
significance threshold of 0.05. The ANOVA results for car noise were
[F(2, 11) = 34.95, p < 0.0006] at 0 dB SNR and [F(2, 11) = 84.45,
p < 0.0000015] at 5 dB SNR. For babble noise, the corresponding ANOVA
values were [F(2, 11) = 15.79, p < 0.001] and [F(2, 11) = 13.48,
p < 0.0019], respectively. These findings show a significant
difference across processing conditions.
ANOVA analysis shows that, across all SNR levels and noise types, the
difference between the original noisy speech and the speech processed
by both the proposed and baseline networks is statistically
significant. Specifically, for babble noise at 0 and 5 dB SNR, the
ANOVA results indicate a significant difference with [F(2, 11) = 63.2,
p < 0.000005] and [F(2, 11) = 33.28, p < 0.00007], respectively.
Similarly, for car noise at 0 and 5 dB SNR, the results are
[F(2, 11) = 125.6, p < 0.00003] and [F(2, 11) = 50.83, p < 0.00001],
showing a significant difference as well.
Fig. 5. Mean speech quality assessment in babble and car noise conditions for CI recipients.
According to post hoc analysis, there is a significant difference in
scores between DCCRN and DCCTN. This difference is most noticeable in
babble noise at 0 dB SNR and car noise at 5 dB SNR, with
[F(1, 7) = 13.45, p < 0.01] and [F(1, 7) = 26.14, p < 0.002],
respectively. For car noise at 0 dB SNR, the observed difference is
not statistically significant [F(1, 7) = 3.89, p < 0.096].
B. Experiment 2: Subjective Speech Enhancement Performance for Speech Quality
We broadened our study to include pair preference tests that compare
original noisy speech, DCCRN-processed speech, and DCCTN-processed
speech to assess speech quality. Figure 5 shows the ratings of the
processed speech on a scale of 1 to 5, with 5 being the highest
quality. Overall, the proposed DCCTN solution had the best quality
score above baseline and was the network most preferred over the
baseline in all conditions. DCCTN's overall mean quality was 97.9%,
whereas DCCRN's was 76.8% and the original noisy speech's was 28.12%.
Furthermore, compared to DCCRN-processed speech in car noise,
DCCTN-processed speech showed the largest gain in speech quality
(+27.5%).
C. Experiment 3: Speech Enhancement Performance Assessment Using Objective Measures
1) Evaluation of the DCCTN Model: Table II displays the objective
scores evaluating the improvement in speech intelligibility, quality,
and distortion using the baseline DCCRN method and the proposed DCCTN.
Four objective metrics (STOI, SOPM, PESQ, and SISDR) were used. The
objective scores were computed for both the original noisy signals and
the enhanced signals after processing with DCCRN and DCCTN. The
evaluation considered a range of SNRs from −5 to +10 dB and included
three different types of noise: babble, car, and SSN. Each objective
score represents the average speech intelligibility or quality derived
from 50 utterances. The objective scores for the enhanced speech were
generally higher than those for the original noisy speech. For all
metrics, the DCCTN network outperformed the DCCRN network for all
noise types. The relative improvement in PESQ was especially
noticeable at higher SNRs, while other metrics displayed the opposite
trend. Additionally, the results indicated that the proposed network
performed better in handling babble noise compared to car and SSN
noise. Specifically, DCCTN surpassed DCCRN in PESQ by +18.9%, +22.4%,
and +31% for babble, car, and SSN noise at −5 dB, respectively.
2) Comparison With Existing Networks: Cochlear implants transmit sound
information by electrically stimulating the auditory nerve, but this
comes with a reduced
time-frequency (T-F) signal representation. In noisy environments, the
neural processing in the auditory system may be disrupted, affecting
the CI user's ability to decode speech signals in the brain.
Background noise can mask important speech cues, making it harder for
CI users to hear and understand speech. As the signal-to-noise ratio
(SNR) decreases, the interference from noise becomes more pronounced,
leading to reduced speech intelligibility.
The aim of this study was to assess the impact of noisy and enhanced
speech on cochlear implant (CI) recipients by categorizing the data
into three SNR-based groups: "High" (SNRs between 5–10 dB), "Medium"
(SNRs between 0–5 dB), and "Low" (SNRs below 0 dB). In the "High"
category, the speech signal is typically clear and suitable for CI
users, indicating a relatively high SNR where recipients can expect
high-quality and intelligible speech. The "Medium" category,
representing intermediate SNRs, suggests more challenging listening
conditions with moderate noise, in which speech can still be
intelligible depending on the type of noise. For CI recipients, the
"Low" group denotes highly noisy conditions, with SNRs below 0 dB.
Speech intelligibility is severely impaired in this range, making it
extremely difficult for CI recipients to comprehend speech content.
Note that CI recipients are significantly affected across all SNR
ranges due to the reduced T-F content delivery of their implants
(e.g., ∼10% of what normal hearing subjects experience), even though
listeners with normal hearing may also experience some speech
intelligibility loss in all ranges. Speech enhancement (SE) techniques
are essential to address these challenges and can benefit CI users.
Modern SE methods have shown effectiveness for CI users in the "High"
SNR range, where speech signals are relatively clean. However, in the
"Medium" and "Low" SNR ranges, these methods often introduce
processing distortions, making speech perception and comprehension
more difficult for CI users. This study categorizes noisy speech into
these three SNR ranges to highlight the varying impacts of different
noise levels and types, as well as the relative improvements provided
by the proposed network in enhancing speech intelligibility and
quality for CI users. To assess the performance of the proposed
network across different ranges, objective scores were calculated
under both seen and unseen noise conditions, with the results
presented in Table III. The evaluation utilized one speech quality
metric (PESQ), two speech distortion metrics (SISDR and LSD), and two
speech intelligibility metrics (STOI and SOPM). The "High," "Medium,"
and "Low" ranges correspond to SNRs of 5–10 dB, 0–2.5 dB, and −5 to
−2.5 dB, respectively, and represent the mean objective scores for
each category. Each score reflects the average for a specified number
of speech samples within that range. Each average score is based on
200 speech samples (50 samples × 2 noise types × 2 SNRs) under unseen
noise conditions and 300 speech samples (50 samples × 3 noise types ×
2 SNRs) under seen noise conditions. The results, shown for four
baseline networks (CFTNet, DCCRN, GCRN, and CRN), as well as the
proposed DCCTN algorithm, demonstrate that DCCTN outperforms all
previous methods. Performance evaluation and objective scores indicate
that the proposed DCCTN algorithm offers several key advantages over
existing methods. Objective ratings for each network show improvements
compared to the original unprocessed speech, regardless of the noise
type or group. However, the relative improvement is more significant
in seen noise scenarios than in unseen ones. The results reveal that
the proposed algorithm consistently outperforms the baseline networks
across all conditions. Its performance in both speech intelligibility
and distortion reduction makes it the best choice across all three SNR
ranges. The proposed DCCTN method demonstrates a significant
improvement in speech intelligibility and distortion reduction
compared to previous algorithms, particularly in the challenging "Red"
zone, where speech intelligibility is greatly compromised.
Additionally, the network maintains high speech intelligibility while
achieving competitive speech quality ratings, even in unseen noise
conditions. This advantage is especially important, as it highlights
the algorithm's potential for practical applications in real-world
environments for cochlear implant (CI) users.
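The grouping of objective scores into SNR ranges, and the relative improvements quoted between networks, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The range boundaries are those stated for Table III (5 to 10 dB, 0 to 2.5 dB, and −5 to −2.5 dB). The relative-improvement formula is an assumption, since the text does not state how the percentages were computed. The function names and the example scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean

# SNR ranges (dB, inclusive) as stated for Table III.
SNR_RANGES = {
    "High": (5.0, 10.0),
    "Medium": (0.0, 2.5),
    "Low": (-5.0, -2.5),
}


def snr_range(snr_db):
    """Return the name of the SNR range that contains snr_db."""
    for name, (low, high) in SNR_RANGES.items():
        if low <= snr_db <= high:
            return name
    raise ValueError(f"SNR {snr_db} dB is outside the defined ranges")


def mean_score_by_range(scores):
    """Average (snr_db, score) pairs within each SNR range."""
    grouped = {}
    for snr_db, score in scores:
        grouped.setdefault(snr_range(snr_db), []).append(score)
    return {name: mean(values) for name, values in grouped.items()}


def relative_improvement(baseline, proposed):
    """Percentage change of proposed over baseline (assumed definition)."""
    return 100.0 * (proposed - baseline) / abs(baseline)


# Hypothetical placeholder scores, for illustration only.
example_scores = [(10, 0.9), (5, 0.8), (2.5, 0.7), (0, 0.6), (-5, 0.5)]
per_range = mean_score_by_range(example_scores)
gain = relative_improvement(baseline=2.0, proposed=2.5)
```

Dividing by the absolute value of the baseline keeps the sign of the improvement meaningful for metrics such as SISDR, which can be negative at low SNRs.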
Table IV summarizes the average objective scores for five noise types
(three seen and two unseen) across seven SNR levels (−7.5, −5, −2.5,
0, 2.5, 5, and 10 dB). The DCCTN algorithm shows significant
performance improvements over baseline networks and noisy speech.
Specifically, it enhances STOI by +21.4% compared to the unprocessed
signal and by +11.8%, +6.5%, and +2.4% relative to the CRN, GCRN, and
DCCRN networks, respectively. Furthermore, compared to the earlier
CFTNet network, DCCTN achieves a +42.3% improvement in SISDR and a
+7.8% improvement in PESQ. These results highlight the effectiveness
of the proposed DCCTN algorithm in enhancing speech quality and
intelligibility and in reducing distortion, particularly for CI users.
We carried out a number of training and testing procedures for the
proposed model, displaying the mean objective scores in Table V to
demonstrate the contributions of each block.
The addition of convolutional blocks to the skip connection led to
improvements in PESQ, SISDR, and LSD scores, reflecting improvements
in speech quality and distortion while maintaining speech
intelligibility. Notably, the model showed significant performance
enhancements when operating in the complex domain, underscoring the
advantages of a complex-valued network over a real-valued network.
When compared to CRN, the inclusion of convolutional layers in the
skip connection (CRN+SkipConvNet) resulted in a significant +25.9%
relative improvement in SISDR. Additionally, the complex-valued FTL
(applied only to the first and last layer) showed substantial
enhancements over the real-valued CRNSkipConv, achieving +16.7% and
+51.1% relative improvements in STOI and PESQ, respectively. However,
increasing the number of FTL layers led to diminishing returns in
model performance, with the optimal number of FTL layers for the
proposed network being two. Further benefits were observed when a
transformer was added to the bottleneck layer of the proposed DCCTN
network. Specifically, DCCTN achieved relative improvements over
CplxCRN+SkipConv+FTL in STOI, PESQ, and IS scores of +2.4%, +3%, and
+72.4%, respectively. The absence of SkipConvNet in DCCTN
significantly affected the objective ratings.
We created a real-valued network comparable to DCCTN in order to
assess the contribution of a complex-valued DCCTN over a real-valued
network. The results showed that DCCTN performed better than its
real-valued counterpart, underscoring the benefits of a complex
network.
V. DISCUSSION
Cochlear implants use an inserted array of electrodes to electrically
stimulate the auditory nerve. Every CI recipient will have a different
CI MAP setup, which will affect their particular auditory perception
variations and residual hearing.
The auditory CI stimulation response is represented by a
two-dimensional time vs. electrode/channel electrodogram. Over the
auditory space, the electrode/channel is connected to frequency [53].
On the other hand, spectral envelope information is provided via
linear predictive coefficients (LPC). Understanding the patterns of
cochlear implant stimulation is possible through the analysis of
electrodogram and LPC responses.
A. Effect of Proposed Network on CI Electrodogram Responses
Background noise significantly impacts how CI users perceive speech,
and speech enhancement offers a potential solution to reduce this
interference. To evaluate the effect of the proposed algorithm on the
signals, electrodograms of processed signals are shown in Fig. 6. At
an SNR of 0 dB, the original clean signal is corrupted by babble
noise, and both the baseline and proposed networks are used to enhance
the noisy signal. The corresponding electrodograms are generated by
simulating the received signal for RF pulse production using the CI
Advanced Combined Encoder (ACE) signal processing technique [54]. A
standard CI parameter configuration is applied to create biphasic
electric RF pulse stimuli across 22 electrodes. This study displays
electrodograms of the clean, noisy, baseline-processed, and proposed
network-processed signals in order to compare the output of CI
devices.
According to the results, the proposed DCCTN network effectively
reduces noise while maintaining the electrodogram's harmonic speech
pattern. In contrast, residual noise, either added or preserved in the
electrodogram, and processing artifacts in the baseline
DCCRN-processed signal ultimately reduce cochlear implant recipients'
speech intelligibility.
B. Effect of DCCTN on Formant Restoration
In challenging listening environments, background noise and distortion
degrade formant information, reducing phonetic sound perception.
Speech enhancement (SE) can restore formant frequencies to improve the
overall quality and naturalness of speech. Formant analysis of speech
signals under various conditions (clean, noisy, baseline
DCCRN-processed, and proposed DCCTN-processed) is illustrated in
Fig. 7. The analysis reveals that the proposed DCCTN effectively
restores formant frequencies, as the LPC parameters for
DCCTN-processed speech closely align with those of clean speech. In
contrast, the LPC parameters for DCCRN-processed speech, while showing
some improvement, are more closely aligned with noisy speech and
exhibit smaller magnitudes. Overall, formant restoration through
speech enhancement algorithms is crucial for improving speech quality
and perception, especially in challenging listening environments. This
LPC analysis further supports the effectiveness of the DCCTN method in
successfully restoring the speech formant structure.
VI. CONCLUSION
DCCTN, a machine learning-based speech enhancement (SE) technique, was
introduced and demonstrated to improve speech perception in
naturalistic environments. The approach incorporates a transformer
within the bottleneck layer of a complex-valued U-Net architecture.
Additionally, a frequency transformation module was added to
accurately reconstruct harmonic components from the distorted speech.
To enhance speech quality, DCCTN focuses on improving both the
magnitude and phase response of the speech signal. Furthermore, DCCTN
has potential as a preprocessor for downstream speech technology tasks
such as speaker identification and CI user-specific speech recognition
algorithms. In general, DCCTN makes a significant contribution to the
field by providing a potential remedy to improve the quality and
perception of speech for CI listeners in a variety of real-world
situations.