Deep Learning for NAM to Speech Conversion

Q: In what ways do exemplar-based non-parametric approaches differ from GMM-based VC approaches regarding spectral details?

Exemplar-based non-parametric approaches differ from GMM-based VC techniques by directly using target samples to synthesize speech, thus preserving more spectral detail. In contrast, GMM-based approaches transform gross spectral properties effectively but may not retain finer features because of excessive spectral flattening. Hence, exemplar-based methods maintain spectral nuance better than their GMM counterparts.

Q: Describe the interplay between Amplitude Scaling (AS) and Frequency Warping (FW) in achieving high-quality voice conversion.

Amplitude Scaling (AS), when combined with Frequency Warping (FW), enhances VC by modifying the vertical axis of the warped spectrum. While FW adjusts the source spectrum to align with the frequency axis of the target spectrum, AS compensates for the remaining mismatch in spectral magnitude. Together they aim at high-quality voice conversion with preserved spectral detail.

Q: How does Dynamic Time Warping (DTW) improve alignment in parallel VC and what are its limitations?

DTW improves alignment in parallel VC by globally aligning two speech utterances, so that the spectral features of the source and target speakers are time-aligned. Its limitation lies in the assumption that the same phonemes uttered by different speakers have comparable spectral characteristics. This ignores speaker-dependent spectral variation, which can lead to incorrectly aligned feature pairs and thus reduce the quality of the converted voices.

Q: What effects do poorly aligned feature pairings have on non-parallel voice conversion quality, and how can outliers be managed?

Poorly aligned feature pairs in non-parallel voice conversion can significantly degrade quality, because they lead to incorrect mapping functions and hence to lower-quality converted voices. To manage such outliers, dedicated outlier removal procedures are applied to detect and eliminate incorrect alignments, thereby preserving conversion quality.

Q: What improvements do the BLFW+SAE approach bring over the BLFW+AS method in voice conversion systems?

The BLFW+SAE method modifies the BLFW+AS approach by incorporating the stated amplitude escalating technique, which adapts the amplitude range of the warped spectrum instead of adding an AS vector. The paper reports a higher voice-quality preference for BLFW+SAE in the subjective XAB test, together with a lower speaker-similarity preference and a higher MCD than BLFW+AS.

Q: What are the primary challenges in achieving effective voice conversion (VC) synchronization, particularly in non-parallel VC systems?

A primary challenge in VC alignment, especially for non-parallel systems, lies in extracting appropriate feature pairs despite variations between the utterances of the source and target speakers. In parallel VC, both speakers have spoken the same utterances, but differences in speaking rate still require time alignment of the spectral features. The problem is harder in non-parallel VC, where the lack of identical utterances makes it difficult to find the necessary feature pairs, which affects the learning of the mapping function and lowers voice conversion quality.

Q: How do alternative frequency warping methodologies like BLFW address overfitting in voice conversion systems?

BLFW, or Bilinear Frequency Warping, addresses overfitting in VC systems by reducing the number of parameters that need to be learned compared with other FW methods. Because it works in a parameterized domain with few parameters, it is less prone to overfitting, which improves the robustness and generalization of the VC system.

Q: What is the role of the nearest neighbor (NN) technique in text-independent and non-parallel VC, and what are its limitations?

The nearest neighbor (NN) technique in text-independent and non-parallel VC serves to find, for each source frame, the relevant target frame based on proximity, which yields the feature pairs needed for training. Its main limitation is the presupposed correlation between perceptual and geometric dimensions, which may not always hold and therefore reduces the reliability of the alignments.

Q: Explain the impact of MCD scores on the evaluation of FW-based VC systems, despite lacking correlation with subjective assessment.

Mel-Cepstral Distortion (MCD) scores are used to evaluate VC system performance objectively by measuring the spectral distortion between converted and target speech. Although low MCD scores generally indicate good VC performance, studies have shown a lack of strong correlation between these objective scores and subjective assessments of voice quality and speaker similarity, particularly in FW-based systems. This discrepancy suggests that, while MCD serves as a quantitative measure, it may not fully capture perceptual aspects of audio quality.

Q: What strategy is suggested to enhance neural network-based VC systems concerning training data?

The paper notes that the TTS-based route to alignment requires more training data: the quality of the text-to-speech (TTS) systems built for the two speakers depends on the quantity and quality of the training data. With larger datasets, TTS systems can better account for durations and synthesise the same sentences for both speakers with higher quality.

Gr paper format for journal conference

Uploaded by

Godwin Louis Malfoy

BILINEAR FREQUENCY WARPING WITH STATED AMPLITUDE ESCALATING OPTIMIZATION BASED DEEP LEARNING FOR CONVERSION OF NON-AUDIBLE MURMUR TO NORMAL SPEECH
S. Godvin Mani (1), T. Rajesh Kumar (2)

(1) Department of Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Science, Chennai, India.

(2) Department of Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Science, Chennai, India.

godvinglory@[Link], t.rajesh61074@[Link]

Abstract: Non-Audible Murmur (NAM) is an unvoiced speech signal that can be picked up through body tissue with the aid of specialized acoustic sensors, or NAM microphones, placed behind the talker's ear. Body transmission and the loss of lip radiation act as a low-pass filter in a NAM microphone, so the higher-frequency components of a NAM signal are attenuated. Although the NAM microphone can record even the small amount of NAM produced by patients, the signal loses quality because of the low-pass characteristics of soft tissue and the lack of radiation at the lips. These factors weaken the high-frequency information, and as a result most of the signal is unintelligible. In this research, we suggest using alignment methodologies for parallel voice conversion to improve the intelligibility of NAM signals. In particular, we suggest a voice conversion technique based on Bilinear Frequency Warping with Stated Amplitude Escalating (BLFW+SAE) and the rectified linear unit as the nonlinear activation function. The proposed technique transforms the poorly intelligible NAM signal into an audible speech signal. The effectiveness of the proposed system was compared with that of the method based on robust alignment using F-radio modelling. For a more objective evaluation, we also ran the Mel Cepstral Distortion (MCD) test. In particular, we found that, for Bilinear Frequency Warping with Amplitude Scaling (BLFW+AS), 93.9% of all stated amplitude escalating relate to silence-speech combinations. It was also found that cross-functional and cross voice conversion systems do not reduce the MCD as much as cross-functional and cross voice conversion systems do. In addition, the voice quality and speaker similarity of the proposed BLFW+SAE improved according to both subjective and objective assessments.

Keywords: Stated Amplitude Escalating, Frequency Warping, Speech Conversion, Voice Conversion, Normal Speech Conversion.

1. INTRODUCTION
Voice Conversion (VC) is the primary subject of this study. Without altering the linguistic content, VC changes the perceived speaker identity of a speech signal from a source speaker to a specific target speaker [8]. It is distinct from voice morphing and voice transformation (VT). VT refers to any kind of non-linguistic modification of the speech waveform. Voice morphing, by contrast, blends the voices of two speakers to create the voice of a third, unidentifiable person, typically for security purposes. Fig 1.1 displays the broad categories of VT. Depending on the nature of the training data, VC falls into two broad categories, parallel and non-parallel VC, as shown in Fig 1.2 [8]. In parallel VC, the source and target speakers have uttered the same sentences. In non-parallel VC, the utterances of the source and target speakers may differ. Furthermore, the utterances of the two speakers may be spoken in the same language or in different languages. VC techniques can also be divided into text-dependent and text-independent categories. For text-dependent VC, phonetic transcriptions must be supplied together with the audio signals [22]. Text-independent VC, by contrast, does not require any phonetic or transcriptional information.

Fig 1.1. Broad categories of voice transformation (VT). After [9].
Fig 1.2. VC approaches categorized by the characteristics of the training data. After [8]. In the figure, "Namaste" means "Hello".

Traditional VC techniques first extract features from the source speaker's speech signal. Next, the features of the source speaker are mapped to match those of the target speaker, and speech synthesis techniques are used to produce speech from the converted features. Since the speech signal carries several levels of information, only the features that reflect speaker identity need to be extracted and modified for VC.

2. LITERATURE SURVEY

2.1. Analysis and Augmentation of Speech

T. Rajesh Kumar et al. [26] analysed the FR-GMM using the conventional approach and the non-parallel training adaptation method with reference speaker voices. Since the objective is to alter the perceived speaker identity of a given speech signal, the only features that are extracted from the source speaker and converted into features of the target speaker are those that are important for speaker identity.

Fig 2.1. NAM derives the voice position and structure.

The factors that contribute to speaker individuality can broadly be divided into two categories, sociological and physiological [23]. An individual's speaking style is influenced mainly by sociological factors, such as the community they are from, their socioeconomic standing, their dialect, etc. Speaking styles are reflected acoustically in prosodic features including the fundamental frequency (F0) contour, duration, rhythm, power levels, etc.

The speech organs of an individual speaker, on the other hand, are linked to physiological factors. These factors determine the shape and length of the vocal tract, which in turn affect the formants, the spectral envelope, and other properties such as spectral tilt. The sociological and physiological components can be compared to software and hardware, respectively. When someone imitates someone else, they attempt to mimic that person's "software". With today's speech technologies, however, changing the software component is more difficult. Thus, the majority of VC systems focus on the hardware rather than the software [9]. The mean spectrum, formants, and pitch (F0) have been found in the VC literature to be the most important acoustic features for speaker identity.

2.2. Steps Enabling for Voice Conversion Synchronization

VC is usually treated as a supervised learning problem. However, in both parallel and non-parallel VC, finding suitable feature pairs from which to learn the mapping is very challenging. Even though both speakers in parallel VC have spoken the same utterances, speaking-rate variations between speakers (interspeaker variations) and within a speaker (intraspeaker variations) require that the spectral features of the source and target speakers be time-aligned during training. The number of feature frames varies with how long the source and target speakers take to speak the same utterance. Therefore, time alignment should be applied to account for these temporal differences and to extract the same number of features from both speakers. In non-parallel VC, on the other hand, the utterances of the two speakers differ, so obtaining the required feature pairs is the most difficult challenge. The VC mapping can therefore be learned only after the data have been aligned. Incorrectly aligned pairs affect how the mapping function is learned, which ultimately lowers the quality of the converted voices [4, 18, 21, 27]. Alignment is thus a crucial step in VC.
One of the most widely used alignment techniques for parallel VC is the Dynamic Time Warping (DTW) algorithm [8, 27]. DTW seeks to align two speech utterances globally rather than locally. To improve DTW performance, phonetic information has been used to align the two utterances locally, that is, at the phoneme level [12]. DTW assumes, however, that the same phonemes uttered by the two speakers have similar features. Spectral features are not speaker-independent, so DTW produces some incorrectly aligned pairs. These poorly aligned feature pairs, which do not follow the overall trend of the data, are treated as outliers. In this paper, we suggest using novel outlier removal procedures to detect these incorrectly aligned pairs, and we examine how removing outliers affects the quality of the converted voices. Recently, the alignment step has received considerable attention because of the need to build non-parallel VC systems for practical applications [8]. It was previously suggested that, since TTS can take durations into account, a unit selection TTS system could synthesise the same sentences for both speakers [20]. For this method to produce high-quality TTS systems for both speakers, more training data are required. The method is also text-dependent because it requires text. For text-independent and non-parallel VC systems, a unit selection strategy that determines the relevant target units from the target speaker based on the source speaker's features is proposed in [21].
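The DTW alignment described above can be sketched as follows. This is a minimal illustration in plain Python, not the alignment code used in this work: the function name `dtw_align`, the Euclidean frame distance, and the symmetric step pattern (diagonal, vertical, horizontal) are assumptions made for the example.

```python
import math


def dtw_align(source, target):
    """Align two feature sequences with dynamic time warping.

    source, target: lists of equal-dimension feature vectors (one per frame).
    Returns a list of (source_index, target_index) pairs on the optimal path.
    """
    n, m = len(source), len(target)
    inf = float("inf")

    def dist(a, b):
        # Euclidean distance between two frames (an assumed local cost).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # cost[i][j] = minimal accumulated cost of aligning the first i source
    # frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path
```

For example, `dtw_align([[0.0], [1.0], [2.0]], [[0.0], [0.0], [1.0], [2.0]])` pairs the first source frame with both leading target frames, which is how DTW absorbs a difference in speaking rate.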
Later, iterative alignment based on nearest-neighbour (NN) search gained prominence for text-independent and non-parallel VC [2-4, 12, 23]. The applicability of the method was confirmed empirically in the first INCA publication [2], and a mathematical analysis of the efficiency of the INCA approach is presented in [13]. The Temporal Context (TC) INCA technique was recently used to improve the performance of the INCA algorithm [23]. The INCA and TC-INCA algorithms can be made more effective by using dynamic features in addition to static features when computing NN pairs [3].
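A single nearest-neighbour search step of the kind that INCA-style methods iterate can be sketched as below. This is an assumption-laden illustration only: the function name `nearest_neighbour_pairs` and the squared Euclidean distance are choices made for the example, and the conversion step that INCA alternates with the search is omitted.

```python
def nearest_neighbour_pairs(source, target):
    """For each source frame, return the index of the closest target frame.

    source, target: lists of equal-dimension feature vectors.
    Returns a list of (source_index, target_index) pairs.
    """
    def sqdist(a, b):
        # Squared Euclidean distance; a geometric, not a perceptual, measure.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    pairs = []
    for i, frame in enumerate(source):
        j = min(range(len(target)), key=lambda k: sqdist(frame, target[k]))
        pairs.append((i, j))
    return pairs
```

Because the pairing relies purely on geometric closeness, the sketch also makes the limitation discussed in the text visible: nothing guarantees that the geometrically nearest target frame is the perceptually matching one.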
The main issue with NN-based alignment methods is that they presuppose a correlation between perceptual and geometric dimensions [12]. In addition, we recommended including phonetic information in the Multiple-input and multiple alignment approach [26]. To estimate the phonetic boundaries, a novel alignment method based on the Spectral Transition Measure (STM) is proposed [5, 16].
Recently, animated version [15, 26] and data acquisition and processing techniques that use the Phonetic PosteriorGram (PPG) have been developed [6, 17, 27] to get around the requirement for alignment in VC tasks. The founder VC text throughout this paper contained the PPG's flaws.

3. PROPOSED APPROACH

The GMM-based VC approach transforms the gross spectral properties effectively. However, because of excessive smoothing, the finer details are not converted well (as shown in Figure 3.2). To preserve more spectral detail, exemplar-based non-parametric approaches that use target samples directly to synthesize the converted speech, as well as the use of dynamic features and global variances (GVs), have been proposed [10–15, 24, 25]. In addition, there are methods based on frequency warping (FW) that adjust the source spectrum to fit the frequency axis of the target spectrum (as shown in Figure 3.1). Among the FW-based methods, the BLFW method has been selected [14, 19, 26]. As indicated in [19], BLFW-based VC can be represented in the parameterized domain. Moreover, compared with piecewise FW techniques, it has fewer parameters to learn, which makes it less prone to overfitting [19].

Fig 3.1. After [14], the fundamental concept of frequency warping voice conversion.

Frequency warping approaches produce converted speech of high quality because they leave all spectral details intact. However, they do not change the relative magnitude of the spectrum, so speaker similarity (SS) after conversion is lower than in GMM-based VC systems. Amplitude Scaling (AS) was added to the FW-based technique to address this issue [18, 19]. AS modifies the vertical axis of the warped spectrum. In real-world applications, the AS operation of the state-of-the-art BLFW+AS approach requires a perfect match between the warped and target formant structures [19]. As a consequence, in addition to information about the spectral amplitude, the AS vector also carries some information about where the formant frequencies are located, and the converted voice is therefore expected to sound worse.

We propose a novel AS approach at the spectrum level to eliminate such spurious peaks. The proposed AS maps the spectral range of the warped-only spectrum to the spectral range of the GMM-based spectrum. There have been numerous attempts to merge state-of-the-art techniques in order to combine the benefits of both methodologies; we focus in particular on GMM- and FW-based approaches [18], [20], [21]. Accordingly, our proposed AS strategy combines these two state-of-the-art techniques to produce results that are better than those of the BLFW+AS method.

3.1. Bilinear frequency warping based voice conversion using the amplitude escalating approach
The frequency-warped version $b$ of a $d$-dimensional input vector $a$ is given by

\[ b = M_{\beta}\, a, \tag{3.1} \]

\[ M_{\beta} = \begin{bmatrix} 1-\beta^{2} & 2\beta-2\beta^{3} & \cdots \\ -\beta+\beta^{3} & 1-4\beta^{2}+3\beta^{4} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \tag{3.2} \]

where $M_{\beta}$ is known as the warping matrix and is expressed here without the 0th cepstral coefficient. The all-pass transform used by the BLFW technique is given by [19]:

\[ C(p) = \frac{p^{-1}-\beta}{1-\beta\, p^{-1}}, \tag{3.3} \]

where $|\beta|<1$. If $p = k^{rm}$, the frequency response of the all-pass transform can be written as

\[ C(k^{rm}) = \frac{k^{-rm}-\beta}{1-\beta\, k^{-rm}}, \tag{3.4} \]

so that a frequency $\tau$ is mapped to the warped frequency

\[ \tau_{\beta} = \tan^{-1}\left[ \frac{(1-\beta^{2})\sin\tau}{(1+\beta^{2})\cos\tau-2\beta} \right]. \tag{3.5} \]
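The warping function of equation (3.5) can be evaluated numerically as in the sketch below. The function name `warp_frequency` is hypothetical, the frequency is assumed to be a normalized angular frequency in radians between 0 and π, and `atan2` is used (an implementation choice, not something stated in the text) so that the result stays in the same interval when the denominator becomes negative.

```python
import math


def warp_frequency(tau, beta):
    """Bilinear frequency warping of equation (3.5).

    tau:  original normalized angular frequency in radians, 0 <= tau <= pi.
    beta: warping factor with |beta| < 1.
    Returns the warped frequency in radians.
    """
    if abs(beta) >= 1.0:
        raise ValueError("the warping factor must satisfy |beta| < 1")
    numerator = (1.0 - beta ** 2) * math.sin(tau)
    denominator = (1.0 + beta ** 2) * math.cos(tau) - 2.0 * beta
    # atan2 selects the branch of the inverse tangent that lies in [0, pi].
    return math.atan2(numerator, denominator)
```

With `beta = 0` the function returns the input frequency unchanged; a positive `beta` moves a mid-band frequency such as π/2 upward and a negative `beta` moves it downward.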

Fig 3.2. Frequency response (normalized frequency vs. magnitude)

Figure 3.2 displays the magnitude responses of the all-pass and band-reject filters. It is apparent that the all-pass filter lets all frequencies pass. The relation between the original and the warped frequency is shown below:

Fig 3.3. After [15], the bilinear frequency warping function for different values of β.

Fig 3.3 shows the shape of this curve for several values of β, computed using equation (3.5). As shown in Figure 3.3, formants move from lower to higher frequencies when the value of β is positive (as in a male-to-female conversion), and vice versa when the value of β is negative (as in a female-to-male conversion). The inverse relation between vocal tract length and formant frequencies suggested by [10] is therefore preserved:

\[ Z_{j} = \frac{(2j+1)\,c}{4d}, \tag{3.6} \]

where $d$ is the length of the vocal tract, $c$ is the speed of sound, and $Z_{j}$ is the $j$th formant frequency. Here, we suggest estimating a warping factor for each GMM component rather than a single global warping factor for the entire training set. The GMM, denoted $\eta$, is trained on the source speaker's training data. Each GMM component $w$ is associated with an FW factor $\beta_{w}$ and an AS vector $e_{w}$, so the conversion function is given by [19]:
\[ b = M_{\beta(a,\eta)}\, a + e(a,\eta), \tag{3.7} \]

where $\beta(a,\eta)$ and $e(a,\eta)$ are obtained by combining, respectively, the basic warping factors and the AS vectors of each component of $\eta$:

\[ \beta(a,\eta) = \sum_{w=1}^{H_{t}} Q_{w}^{(\eta)}(a)\,\beta_{w}, \qquad e(a,\eta) = \sum_{w=1}^{H_{t}} Q_{w}^{(\eta)}(a)\,e_{w}, \tag{3.8} \]

where $H_{t}$ is the number of mixture components and $Q_{w}^{(\eta)}(a)$ is the probability that $a$ belongs to the $w$th mixture component of $\eta$. The warping factors $\beta_{w}$ are estimated first, by optimizing the parameters of the warping-only transformation. Once the source and target feature vectors have been aligned and the GMM has been trained on the source speaker data, the warping factors are found by minimizing the following error over the $H$ aligned training pairs:

\[ \varphi^{(\alpha)} = \sum_{x=1}^{H} \left\| b_{x} - M_{\beta(a_{x},\eta)}\, a_{x} \right\|^{2}. \tag{3.9} \]

The iterative approach recommended in [19] is applied to determine the set $\{\beta_{w}\}$ that minimizes equation (3.9).
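The posterior-weighted combination of equation (3.8) can be illustrated with the short sketch below. It assumes that the GMM posteriors for a frame have already been computed elsewhere; the function name `blend_parameters` and the list-based data layout are hypothetical choices for the example.

```python
def blend_parameters(posteriors, betas, as_vectors):
    """Combine per-component parameters as in equation (3.8).

    posteriors: posterior probability of each GMM component for one frame.
    betas:      one warping factor per component.
    as_vectors: one amplitude-scaling vector per component (equal lengths).
    Returns (beta, e): the frame-level warping factor and AS vector.
    """
    if not (len(posteriors) == len(betas) == len(as_vectors)):
        raise ValueError("one posterior, beta and AS vector per component")
    # Frame-level warping factor: posterior-weighted sum of the beta_w.
    beta = sum(q * b for q, b in zip(posteriors, betas))
    # Frame-level AS vector: posterior-weighted sum of the e_w, per dimension.
    dim = len(as_vectors[0])
    e = [sum(q * vec[k] for q, vec in zip(posteriors, as_vectors))
         for k in range(dim)]
    return beta, e
```

A frame that belongs mostly to one component therefore receives a warping factor and an AS vector close to that component's parameters, which is what makes the conversion function of equation (3.7) vary smoothly across frames.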
3.2. Amplitude Escalating Method

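After the $\{\beta_{w}\}$ have been estimated, [19] provides the values of $\{e_{w}\}$ that minimize the error between the warped and target vectors:

\[ \varphi^{(k)} = \sum_{x=1}^{H} \left\| f_{x} - e(a_{x},\eta) \right\|^{2}, \tag{3.10} \]

where $f_{x} = b_{x} - M_{\beta(a_{x},\eta)}\, a_{x}$. This amounts to finding the least-squares solution of the system $Q\,L = G$, where

\[ Q_{H\times H_{t}} = \begin{bmatrix} Q_{1}^{(\eta)}(a_{1}) & \cdots & Q_{H_{t}}^{(\eta)}(a_{1}) \\ \vdots & \ddots & \vdots \\ Q_{1}^{(\eta)}(a_{H}) & \cdots & Q_{H_{t}}^{(\eta)}(a_{H}) \end{bmatrix}, \tag{3.11} \]

\[ L_{H_{t}\times 1} = [\, L_{1} \;\cdots\; L_{H_{t}} \,]^{T}, \qquad G_{H\times 1} = [\, f_{1} \;\cdots\; f_{H} \,]^{T}. \tag{3.12} \]

The least-squares solution based on $\ell^{2}$-norm minimization is given by

\[ L_{\mathrm{opt}} = \left( Q^{T} Q \right)^{-1} Q^{T} G. \tag{3.13} \]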
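The normal-equation solution of equation (3.13) can be sketched in plain Python as below. The function name `least_squares_as_vectors` is hypothetical, and the small Gauss-Jordan solver is an illustrative choice; in practice a numerical library routine would normally be used instead of forming the inverse explicitly.

```python
def least_squares_as_vectors(Q, G):
    """Solve Q L = G in the least-squares sense, L = (Q^T Q)^{-1} Q^T G.

    Q: H x Ht matrix of component posteriors (one row per training frame).
    G: H x d matrix of residual vectors f_x (one row per training frame).
    Returns L as an Ht x d list of lists (one AS vector per component).
    """
    n_frames, n_comp, dim = len(Q), len(Q[0]), len(G[0])
    # Normal equations: A = Q^T Q (Ht x Ht), B = Q^T G (Ht x d).
    A = [[sum(Q[x][i] * Q[x][j] for x in range(n_frames)) for j in range(n_comp)]
         for i in range(n_comp)]
    B = [[sum(Q[x][i] * G[x][k] for x in range(n_frames)) for k in range(dim)]
         for i in range(n_comp)]
    # Gauss-Jordan elimination with partial pivoting on the block [A | B].
    M = [A[i] + B[i] for i in range(n_comp)]
    for col in range(n_comp):
        pivot = max(range(col, n_comp), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Q^T Q is singular; the system is under-determined")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n_comp):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    # The right-hand block now holds the solution L.
    return [row[n_comp:] for row in M]
```

The singularity check reflects a practical point: if some mixture component never receives any posterior mass in the training data, its AS vector cannot be determined from equation (3.13).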

The AS vector should account for the differences in formant amplitudes. However, in cases where the warped formants do not line up with those of the target, the AS vector is expected to capture confusing information about both the intensity of the target formants and the placement of the formant frequencies, which may be damaging to the speech quality of the converted sound [15].

4. STATED AMPLITUDE ESCALATING

In the approach outlined above, the AS operation makes the unrealistic assumption that the target formant structures and the warped formant structures match perfectly. In practice they do not, so the AS operation produces spurious peaks, and the speech quality of the converted signal declines because incorrect formant positions are perceived. Figure 3.4 shows that the BLFW+AS approach adds false peaks to the BLFW warped-only spectrum (OBLFW). In essence, the AS operation should change only the amplitude of the warped spectrum; however, this is not the case. We therefore suggest the following transformation at the spectrum level:

\[ \hat{b}_{u}(k^{rm}) = \frac{g_{3}-g_{4}}{g_{1}-g_{2}} \left( \hat{a}_{u}(k^{rm}) - g_{2} \right) + g_{4}, \tag{3.14} \]

where $\hat{a}_{u}(k^{rm})$ is the warped-only spectrum, and

\[ \begin{aligned} g_{1} &= \max\left( \hat{a}_{u}(k^{rm}) \right), & g_{2} &= \min\left( \hat{a}_{u}(k^{rm}) \right), \\ g_{3} &= \max\left( \hat{a}_{u}^{gmm}(k^{rm}) \right), & g_{4} &= \min\left( \hat{a}_{u}^{gmm}(k^{rm}) \right), \end{aligned} \tag{3.15} \]

where the functions max() and min() return the maximum and minimum values of a spectrum, respectively, and $\hat{a}_{u}^{gmm}(k^{rm})$ denotes the spectrum converted by the JDGMM method.

The proposed AS method maps the spectral range of the OBLFW spectrum to the spectral range of the spectrum converted by the GMM. Given that GMM-based VC converts the gross spectral properties accurately, the spectral range of the GMM-converted spectrum helps to compensate for the magnitude mismatch between the warped spectrum and the real target spectrum. Because it depends only on the spectral range rather than on the finer details of the GMM-converted spectrum, the proposed solution is immune to the over-smoothing issue. Figure 3.4 demonstrates that the proposed AS (i.e., BLFW+SAE) does not change the content of the converted speech; it only compensates for the magnitude difference and does not add any false peaks. Our aim is to show that the proposed AS approach is better than the state-of-the-art AS applied to the BLFW warped spectrum; hence, Figure 3.4 does not display the true target spectrum or the GMM-based spectrum. With the state-of-the-art AS approach, comparable false peaks are observed for the majority of frames.
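The range mapping of equations (3.14) and (3.15) amounts to a min-max rescaling of the warped spectrum, as the sketch below illustrates for a single frame. The function name `rescale_to_gmm_range` is hypothetical, both spectra are assumed to be sampled on a common scale (for example log magnitude), and the guard against a flat warped spectrum is an addition made for the example.

```python
def rescale_to_gmm_range(warped, gmm):
    """Map the range of a warped-only spectrum onto the range of a GMM spectrum.

    warped: spectral values of the warped-only (OBLFW) frame.
    gmm:    spectral values of the GMM-converted frame.
    Returns the rescaled spectrum as a list, following equation (3.14).
    """
    g1, g2 = max(warped), min(warped)   # range of the warped-only spectrum
    g3, g4 = max(gmm), min(gmm)         # range of the GMM-based spectrum
    if g1 == g2:
        raise ValueError("flat warped spectrum: the range mapping is undefined")
    scale = (g3 - g4) / (g1 - g2)
    # Only the vertical axis is changed; the peak locations are untouched.
    return [scale * (value - g2) + g4 for value in warped]
```

Because the mapping is affine with a non-negative slope, the positions of peaks and valleys in the warped spectrum are preserved, which is consistent with the claim that no false peaks are introduced.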

Fig 3.4. Spectrum converted using different VC techniques. After [15].

5. RESULT AND DISCUSSION

This research uses the first VC Challenge database [13]. From the available pairs of speakers (source and target), we created 25 systems overall using the JDGMM-based technique, the BLFW+AS approach and the proposed BLFW+SAE approach. A 1-D F0 value is extracted per frame (with a 25 ms frame duration and a 5 ms frame shift). The parallel training corpora are aligned as in [26]. The number of mixture components was varied, for instance m = 16, 32, 64, or 128, and the value that results in the best MCD was chosen. For the F0 transformation, we employed the mean-variance (MV) transform technique. The analysis-synthesis framework has been implemented using AHOCODER [27].
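A mean-variance F0 transformation of the kind mentioned above is commonly written as a linear mapping of log-F0 statistics. The sketch below makes that assumption explicit: the log domain, the treatment of unvoiced frames (F0 of zero is passed through), and the function name `convert_f0_mv` are illustrative choices, not details given in this paper.

```python
import math


def convert_f0_mv(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Mean-variance transformation of one F0 value.

    f0:                 source F0 in Hz (0 or negative marks an unvoiced frame).
    src_mean, src_std:  mean and standard deviation of the source log-F0.
    tgt_mean, tgt_std:  mean and standard deviation of the target log-F0.
    Returns the converted F0 in Hz.
    """
    if f0 <= 0.0:
        return 0.0  # unvoiced frames are left unvoiced (assumption)
    log_f0 = math.log(f0)
    # Normalize with the source statistics, then rescale to the target ones.
    converted = (log_f0 - src_mean) / src_std * tgt_std + tgt_mean
    return math.exp(converted)
```

A source frame whose log-F0 equals the source mean is mapped exactly to the target mean, and deviations from the mean are scaled by the ratio of the two standard deviations.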
Table 4.1. XAB test results for voice quality: comparison of BLFW+AS and BLFW+SAE, with 95% confidence intervals

System              Preference Score (%)
BLFW+AS             28.36
BLFW+SAE            40.73
Equal Preference    30.91

Figure 4.1. XAB test results for voice quality: comparison of BLFW+AS and BLFW+SAE, with 95% confidence intervals

Table 4.2. XAB test results for speaker similarity

System              Preference Score (%)
BLFW+AS             29.45
BLFW+SAE            20.36
Equal Preference    50.18

Figure 4.2. XAB test results for speaker similarity, with 95% confidence intervals

The subjective evaluation uses the XAB test, a comparative listening test. Subjects were asked to rate the speaker similarity (SS) and the speech quality of samples A and B, played in random order, relative to the real target sample X. When the samples were perceptually similar, the subjects could also choose equal preference. Figures 4.1 and 4.2 show the results for voice quality and SS, respectively. For voice quality, the subjects preferred the proposed AS system 56.36% of the time compared with 22.55% of the time for the GMM-based system. Similarly, the subjects chose BLFW+SAE 40.73% of the time and BLFW+AS only 28.36% of the time. For speaker similarity, the proposed method outperformed the GMM-based approach by 0.73%. Compared with BLFW+AS, the proposed system was preferred 9.09 percentage points less often (20.36% versus 29.45%), while the two systems were rated equal 50.18% of the time. The proposed system therefore has a lower preference for speaker similarity than the state-of-the-art system. The shape of the spectral trajectory, in addition to formant positions and amplitudes, clearly contributes to better speaker identity conversion, as demonstrated by BLFW+AS [23]. As a result, there is a trade-off between voice quality and speaker similarity. Other investigations in the literature [19, 26, 27] found similar trade-offs.
Table 4.3. MCD analysis for the different systems, with 95% confidence intervals

Mapping Technique   MCD (dB)
BLFW+AS             7.83
BLFW+SAE            9.39

Figure 4.3. MCD analysis for the different systems, with 95% confidence intervals

The conventional MCD is used for objective evaluation. Figure 4.3 shows that the proposed approach yields higher MCD values, particularly relative to the GMM-based VC. The BLFW approach moves the formants of the spectrum closer to those of the target speaker. The proposed AS then alters only the amplitude of the warped spectrum rather than accurately matching the target's spectral characteristics. As a result, it will receive comparable MCD scores to the VC that is based on GMM (as shown in Figure 4.3). Additionally, research has shown that, for FW-based VC, MCD and subjective scores are not strongly correlated [7, 19, 26, 27]. MCD is used here to determine the appropriate number of mixture components and to compare the relative performance of VC systems of the same type. In addition to these techniques based on signal processing, neural network-based strategies have recently gained a lot of traction [8].
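For reference, MCD is commonly computed per frame as (10 / ln 10) times the square root of twice the summed squared difference between target and converted Mel cepstral coefficients, and then averaged over frames. The sketch below follows that common definition; the function name `mel_cepstral_distortion` is hypothetical, and the inputs are assumed to be already time-aligned and to exclude the 0th (energy) coefficient.

```python
import math


def mel_cepstral_distortion(target_frames, converted_frames):
    """Average Mel cepstral distortion in dB over aligned frames.

    target_frames, converted_frames: lists of Mel cepstral vectors of equal
    length and dimension, time-aligned, without the 0th coefficient.
    """
    if not target_frames or len(target_frames) != len(converted_frames):
        raise ValueError("both inputs need the same, non-zero number of frames")
    const = 10.0 / math.log(10.0)
    total = 0.0
    for target, converted in zip(target_frames, converted_frames):
        squared = sum((t - c) ** 2 for t, c in zip(target, converted))
        total += const * math.sqrt(2.0 * squared)
    return total / len(target_frames)
```

A lower value means that the converted cepstra are closer to the target cepstra, which is why the higher MCD of BLFW+SAE in Table 4.3 has to be read alongside the subjective results rather than on its own.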

6. CONCLUSION

In this work, Bilinear Frequency Warping with Stated Amplitude Escalating (BLFW+SAE), using the rectified linear unit as the nonlinear activation function, was proposed. We found that the silence-speech pairs correspond to, on average, 93.9% of the total detected stated amplitude escalating. We examined the MCD and its 95% confidence interval for a number of systems and their relative MCD reduction compared with BLFW+AS. With the proposed BLFW+SAE approach, we were able to outperform the BLFW+AS VC systems. We also ran the MCD test for a more objective assessment. Hence, in practical implementations of VC, bilinear frequency warping with stated amplitude escalating plays a crucial role. Both subjective and objective assessments indicated improvements in speaker similarity and voice quality for BLFW+SAE.

REFERENCES
[1] N. J. Shah and H. A. Patil. (2019) Novel outliers’ removal approach for parallel voice conversion, Computer Speech and Language, Elsevier, vol. 58, no. 11, pp. 127–152.

[2] D. Erro, A. Moreno, and A. Bonafonte. (2010) INCA algorithm for training voice conversion systems from non-parallel corpora, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 944–953.

[3] N. J. Shah and H. A. Patil. (2018) Effectiveness of dynamic features in INCA and temporal context-INCA, in INTERSPEECH, Hyderabad, India, pp. 711–715.

[4] N. J. Shah and H. A. Patil. (2019) Novel metric learning for non-parallel voice conversion, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, pp. 3722–3726.

[5] N. J. Shah and H. A. Patil. (2019) Phone aware nearest neighbor technique using spectral transition measure for non-parallel voice conversion, submitted for possible publication in INTERSPEECH, Graz, Austria.

[6] N. J. Shah, S. R., N. Shah, and H. A. Patil. (2018) Novel unsupervised sorted GMM posteriorgram for DNN and GAN-based voice conversion framework, in Proceedings of Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference. Hawaii, USA: IEEE, pp. 1776–1781.

[7] A. Rajpal, N. J. Shah, M. Zaki, and H. A. Patil. (2017) Quality assessment of voice converted speech using articulatory features, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, USA, pp. 5515–5519.

[8] S. H. Mohammadi and A. Kain. (2017) An overview of voice conversion systems, Speech Communication, vol. 88, no. 04, pp. 65–82.

[9] Y. Stylianou. (2009) Voice transformation: A survey, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, pp. 3585–3588.

[10] T. F. Quatieri. (2006) Discrete-Time Speech Signal Processing: Principles and Practice, 1st Ed. (Pearson Education India).

[11] J. Kominek and A. W. Black. (2004) The CMU-ARCTIC speech databases, in ISCA Workshop on Speech Synthesis, Pittsburgh, USA, pp. 223–224.

[12] A. Kain and M. W. Macon. (1998) Spectral voice conversion for text-to-speech synthesis, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seattle, WA, USA, pp. 285–288.

[13] N. J. Shah and H. A. Patil. (2017) On the convergence of INCA algorithm, in Proceedings of Asia-Pacific Signal and Information Processing Association (APSIPA), IEEE, pp. 559–562.

[14] D. Sündermann and H. Ney. (2003) VTLN-based voice conversion, in IEEE International Symposium on Signal Processing and Information Technology, Darmstadt, Germany, pp. 556–559.

[15] N. J. Shah and H. A. Patil. (2017) Novel amplitude scaling method for bilinear frequency warping based voice conversion, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, USA, pp. 5520–5524.

[16] N. J. Shah, B. B. Vachhani, H. B. Sailor, and H. A. Patil. (2014) Effectiveness of PLP-based phonetic segmentation for speech synthesis, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, pp. 270–274.

[17] N. J. Shah, M. C. Madhavi, and H. A. Patil. (2018) Unsupervised vocal tract length warped posterior features for non-parallel voice conversion, in INTERSPEECH, Hyderabad, India, pp. 1968–1972.

[18] E. Helander, J. Schwarz, J. Nurminen, H. Silen, and M. Gabbouj. (2008) On the impact of alignment on voice conversion performance, in INTERSPEECH, Brisbane, Australia, pp. 1453–1456.

[19] D. Erro, E. Navas, and I. Hernaez. (2013) Parametric voice conversion based on bilinear frequency warping plus amplitude scaling, IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 3, pp. 556–566.

[20] D. Sündermann. (2008) Text-independent voice conversion, Ph.D. Thesis, Universitätsbibliothek der Universität der Bundeswehr München.

[21] S. V. Rao, N. J. Shah, and H. A. Patil. (2016) Novel pre-processing using outlier removal in voice conversion, in ISCA Speech Synthesis Workshop (SSW), Sunnyvale, CA, USA, pp. 147–152.

[22] D. Sündermann, A. Bonafonte, H. Ney, and H. Höge. (2004) A first step towards text-independent voice conversion, in International Conference on Spoken Language Processing (ICSLP), South Korea, pp. 1–4.

[23] H. Kuwabara and Y. Sagisaka. (1995) Acoustic characteristics of speaker individuality: control and conversion, Speech Communication, vol. 16, no. 2, pp. 165–173.

[24] G. Fant. (1973) Speech Sounds and Features, The MIT Press.

[25] H. Valbret, E. Moulines, and J. P. Tubach. (1992) Voice transformation using PSOLA technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, San Francisco, USA, pp. 145–148.

[26] T. Rajesh Kumar. (2019) Conversion of Non-Audible Murmur to Normal Speech Based on FR-GMM using Non-Parallel Training Adaptation Method, International Conference on Intelligent Sustainable Systems (ICISS).

[27] N. J. Shah. (2019) Voice conversion: alignment and mapping perspective, Dhirubhai Ambani Institute of Information and Communication Technology.
