0% found this document useful (0 votes)
12 views10 pages

Deep Learning for Audio Classification

This paper presents a deep learning-based approach for audio signal classification using Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The models were evaluated on a labeled dataset, achieving high accuracy with CNN excelling in spatial feature extraction and RNN in temporal dependencies. The results confirm that deep learning significantly enhances the performance of audio classification systems.

Uploaded by

sanjaysamson0522
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views10 pages

Deep Learning for Audio Classification

This paper presents a deep learning-based approach for audio signal classification using Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The models were evaluated on a labeled dataset, achieving high accuracy with CNN excelling in spatial feature extraction and RNN in temporal dependencies. The results confirm that deep learning significantly enhances the performance of audio classification systems.

Uploaded by

sanjaysamson0522
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Audio Signal Classification Using Deep

Learning
ABSTRACT :
Audio signal classification plays a significant role in various real-world applications such as
speech recognition, environmental sound analysis, and music genre identification. Traditional
approaches often depend on manually extracted features, which may not capture the full
complexity of audio data. This paper presents a deep learning-based method for automatic
audio signal classification using Convolutional Neural Networks (CNN) and Recurrent
Neural Networks (RNN). The CNN model is utilized to extract spatial features from
spectrogram representations, while the RNN model effectively captures temporal
dependencies within the audio sequences. Both models were trained and evaluated on a
labelled dataset, and their performance was compared using metrics such as accuracy,
precision, recall, and F1-score. The experimental results demonstrate that both CNN and
RNN architectures achieve high classification accuracy, with CNN excelling at spatial feature
extraction and RNN providing better temporal feature learning. The proposed approach
confirms that deep learning models can significantly enhance the performance and reliability
of audio signal classification systems.
KEYWORDS :
Audio Signal Classification, Deep Learning, Convolutional Neural Network (1DCNN),
Recurrent Neural Network (RNN), Spectrogram, Feature Extraction.

INTRODUCTION :
Audio signal classification plays a crucial In recent years, deep learning has emerged
role in numerous applications, including as a powerful approach for audio signal
speech recognition, environmental sound processing, enabling end-to-end learning
detection, music genre identification, and directly from raw or transformed audio
digital forensics. Audio signals contain representations. Among the various
both temporal and spectral information, architectures, Convolutional Neural
making their accurate classification a Networks (CNNs) and Recurrent Neural
challenging task. Traditional machine Networks (RNNs) have achieved
learning techniques, such as Support significant success in modelling spectral
Vector Machines (SVM) and Hidden and temporal features of sound. CNNs are
Markov Models (HMM), rely heavily on particularly effective in capturing local
handcrafted features like Mel-Frequency spatial patterns in spectrograms, whereas
Cepstral Coefficients (MFCCs) and RNNs, especially Long Short-Term
spectral centroids. However, these Memory (LSTM) and Gated Recurrent
approaches often fail to capture the Unit (GRU) models, excel at learning
complex and dynamic nature of audio data long-term temporal dependencies [2].
[1].
Rakesh Kumar et al. [3] developed types. Patel et al. [8] proposed a
an intelligent audio signal CNN-based forensic audio
processing system for rainforest classifier that accurately
species identification using CNN distinguishes recording conditions
and LSTM networks, achieving and environments using spectro-
accuracies of 95.62% and temporal cues. These results
93.12%, respectively. A hybrid indicate that CNN and RNN
CNN–LSTM model achieved architectures are not only
97.12% accuracy with reduced log effective in traditional audio
loss, demonstrating the classification tasks but also in
complementary nature of forensic and authentication
convolutional and recurrent scenarios.
architectures. Similarly, Meenu
This study focuses on the
Gupta et al. [4] implemented CNN
implementation and comparative
and RNN models for
evaluation of CNN and RNN
environmental sound
models for audio signal
classification, reporting superior
classification. The proposed
accuracy compared to traditional
system aims to achieve higher
classifiers.
classification accuracy while
In the field of music information minimizing the effects of
retrieval (MIR), Pons et al. [5] background noise and variations
reviewed the application of deep in recording conditions. The CNN
learning models for music signal model is designed to extract
processing and highlighted CNNs’ spectral features from
ability to learn timbral and spectrogram representations,
rhythmic representations directly while the RNN model captures
from spectrograms. Kim et al. [6] temporal dependencies from
utilized bidirectional RNNs to sequential data. Both models are
enhance rhythm and melody trained and tested on a labelled
recognition, demonstrating dataset, and their performances
improved temporal modeling are compared using evaluation
performance. Bhangale and metrics such as accuracy,
Kothandaraman [7] emphasized precision, recall, and F1-score.
that combining CNN and RNN The results reveal that CNNs
models results in robust systems perform efficiently in spatial
capable of handling various audio feature learning, whereas RNNs
domains efficiently. excel in temporal sequence
modelling, establishing a
Furthermore, deep learning
foundation for future hybrid deep
techniques have found application
learning-based audio
in digital audio forensics, where
classification frameworks.
they aid in identifying recording
environments and microphone METHODOLOGY :
This section outlines the methodological dimensionality of the feature map. The
framework adopted for classifying audio output of the pooling layer is given by:
signals into their respective music genres
Y m =max ( X m )
using deep learning. The primary objective
of the proposed system is to automatically Where X mrepresents the 1D input and Y m
learn distinctive audio feature patterns represents the pooling output. The output
from a structured dataset and accurately of final pooling layer was flattened and
predict the genre category. Two given to fully connected layer. The output
independent deep learning architectures — of the fully connected layer is given by:
the One-Dimensional Convolutional

(∑ )
n
Neural Network (1D-CNN) and the Simple
Y m =f W uv X q +B m
Recurrent Neural Network (SimpleRNN) q=1
were designed, trained, and evaluated to
compare their performance in music genre Where Y m denotes the output of the fully
recognition. Here are the explanation of connected layer, f denotes the ReLU activation
the models give below. function, W uv denotes the weight values, X q
denotes the 1D data obtained through the
A. One-Dimensional Convolutional flattened layer, Bm represents the bias, and n
Neural Network (1D-CNN) denotes the number of neurons. The output of
the fully connected layer was given to the
The 1D-CNN model is designed to extract
output layer to perform the multiclass
spatial feature representations from
classification . For the multi class
sequential numerical data. It processes classification, the softmax activation function
one-dimensional sequences, making it was used in the output layer for the multi-
suitable for time-series data such as music output regression, the linaer activation
signal attributes. The convolutional layers function was used. The expression for the
learn local patterns within the input vector, softmax function is given by:
such as rhythm, tempo, and harmonic
exp( X k )
structure correlations. Yk= n

The convolution operation for a given ∑ exp( X q)


q =1
filter ican be mathematically expressed as:
Where Y k represents the output, X k

(∑ )
j
represents the input, and n represents the
Y (k)
m =f H (k)
i X m + Bm
i=0 number of neurons.
(k)
Where Y m represents the output feature
map, X m represent the 1D input audio, H (k)
i

represent the kernel values, k represent the


number of kernels, j represent the kernel
size, f represents the ReLU activation
funcation, and Bm represents the bias. The
output of convolutional layer was passed
through the pooling layer to reduce the
Fig.1. Proposed 1DCNN Architecture for Audio The dataset used in this study consists of
classification. musical audio features stored in CSV
format. Each record represents a music
B. Simple Recurrent Neural Network sample with attributes such as tempo,
(SimpleRNN) danceability, energy, loudness, and
valence, among others. The data were
The SimpleRNN model is designed to normalized to ensure uniform scale across
capture temporal dependencies and features and were divided into training and
sequential relationships in the dataset. It testing subsets with 75% of the data for
maintains an internal memory of previous training and 25% for testing. This
feature states, allowing it to learn time- preprocessing step ensures the model is
related transitions in audio features such as trained efficiently and generalizes well to
rhythm progression and beat consistency. unseen data.
This makes RNNs suitable for tasks
involving sequence learning and B. Convolution Operation (1D-CNN
contextual understanding. Model)
The core of this model is the convolution operation,
At each time step t , the RNN computes its
defined as:
hidden state ht and output y t using the
following equations: O(i)=∑ I (i+ m)⋅ K (m)where I is the
m
ht =f (W h ht −1+ W x x t +b) input sequence, K is the convolution
y t =Softmax (W y ht + c)where W h and W x kernel, and O is the output feature map.
are weight matrices, b and c are biases. This operation enables the CNN to detect
local patterns within the sequential data,
f(x) represents the non-linear activation such as changes in rhythm or frequency
function, typically tanh : components in music. The extracted
x
e −e
−x features are then passed through pooling
f (x)=tanh ⁡(x)= x −x and dense layers for classification.
e +e

C. Recurrent Operation (SimpleRNN


Model)
The Recurrent Neural Network (RNN)
captures the temporal dependencies in
sequential data through recurrent
connections. At each time step t , the
hidden state ht is updated based on the
current input x t and the previous hidden
Fig.2. Proposed SimpleRNN Architecture for state ht −1:
Audio Classification.
ht =f (W x t +U ht −1+ b)

A. Data Preprocessing where W and U are the weight matrices, b


is the bias, and f ( ⋅ )is the activation
function, typically the tanh function. This TP
R=
recurrent formulation allows the model to TP+ FN
retain information over time, making it
 F1-Score: The harmonic mean of
effective for sequential tasks like audio
precision and recall.
classification.
P×R
F 1=2×
P+ R
D. Loss Function Where P-Precision, R-Recall
Both the 1D-CNN and RNN models use These metrics collectively provide a
the Categorical Cross-Entropy loss comprehensive understanding of each
function, suitable for multi-class
classification: Metrics Value (%)
C Accuracy 86.77%
L=−∑ y c log ⁡( pc ) Precision 75.31%
c=1
Recall 86.77%
where y c is the true label and pc is the
F1-Score 80.63%
predicted probability for class c . The
model minimizes this loss to improve model’s effectiveness in classifying audio
prediction accuracy. signals.

E. Performance Metrics : RESULTS AND DISCUSSION :

The following formulas were used to 1. Performance Metrics :


evaluate the model’s performance: The models were evaluated using
 Accuracy (A): Measures the Accuracy, Precision, Recall, F1-Score. The
overall correctness of predictions. results are summarized in the tables for
both 1DCNN and SimpleRNN.
(TP-True Positive, TN-True Negative,
Table 1- Performance Metrics for
FP-False Positive, FN-False Negative) 1DCNN
TP+TN
A=
TP+TN + FP+ FN
Table2 - Performance Metrics for Simple
 Precision (P): The ratio of RNN
correctly predicted positive
Metrics Value (%)
observations to the total predicted
positives. Accuracy 9.81%
Precision 87.78%
TP
P= Recall 9.81%
TP+ FP
F1-Score 1.82%
 Recall (R): The ratio of correctly
predicted positives to all actual 2. Training and Validation :
positives.
 The training and validation
accuracy and loss results for the
1DCNN model are as follows.
Figure 3 shows the training and validation
accuracy of the 1D-CNN model over 100
epochs. The training accuracy gradually
improves and stabilizes around 89%, while
the validation accuracy reaches about 91%.
The close values between them indicate
good generalization and effective learning
by the model without significant
overfitting.

Fig.4. Training and Validation Loss over Epochs

 The training and validation


accuracy and loss results for the
SimpleRNN model are as follows.

Figure 5 shows the training and validation


accuracy of the Simple RNN model over
100 epochs. The training accuracy
stabilizes around 89%, while the
validation accuracy remains slightly higher
Fig.3. Training and Validation Accuracy over at about 91%. The close alignment
Epochs between the two curves indicates
consistent learning and good
generalization by the RNN model.
Figure 4 shows the training and validation
loss of the 1D-CNN model over 100
epochs. The loss decreases sharply during
the initial epochs and stabilizes near zero
after about 15 epochs. Both training and
validation losses follow a similar trend,
indicating efficient learning and minimal
overfitting in the model.
Fig.6. Training and Validation Loss over Epochs

3. Confusion Matrix :

Figure 7 shows the confusion matrix of the


1D-CNN model, which illustrates how
well the model distinguishes between
different audio classes. The matrix
indicates strong performance, with the
majority of predictions concentrated along
the diagonal. For example, class 4 has the
highest number of correct predictions
Fig.5. Training and Validation Accuracy over (10,995), followed by class 3 with 1,257
Epochs correctly classified samples. Only a few
misclassifications occurred, such as 3
Figure 6 shows the training and samples from class 0 and 1 sample from
validation loss of the Simple RNN class 4 being incorrectly predicted.
model over 100 epochs. The loss Overall, the 1D-CNN model demonstrates
decreases rapidly in the initial high accuracy and effective feature
epochs and stabilizes close to learning in classifying audio signals.
zero after around 15 epochs. Both
training and validation losses
follow a similar pattern, showing
that the RNN model effectively
minimizes error and maintains
good consistency without
overfitting.

Fig.7. Confusion Matrix for 1DCNN


Figure 8 shows the confusion matrix of the and 0.27, respectively. Overall, the 1D
Simple RNN model, which illustrates how CNN model demonstrates strong
well the model differentiates between discriminative ability for certain classes,
various audio classes. The matrix indicates reflecting effective feature extraction and
that the model performs effectively, with classification performance.
most predictions correctly aligned along
the diagonal. For instance, class 4 has the
highest number of correct predictions
(10,971), followed by class 3 with 1,239
correctly classified samples. Only a few
misclassifications occurred, such as 18
samples from class 3 and 23 samples from
class 4 being incorrectly predicted.
Overall, the Simple RNN model shows
good classification performance with
minor errors across a few classes.

Fig.9. Receiver Operating Characteristic


(ROC) curve for 1DCNN.

Figure 10 presents the Receiver Operating


Characteristic (ROC) curve for the Simple
RNN model, illustrating the balance
between the true positive rate and false
positive rate across different audio classes.
The ROC curves indicate that the model
achieves strong discriminative
Fig.8. Confusion Matrix for SimpleRNN performance, particularly for class 0,
which attains an area under the curve
[Link] Operating Characteristic (ROC) (AUC) of 1.00, representing excellent
Curve : classification. Other classes, such as class
Figure 9 presents the Receiver Operating 3 and class 4, also perform reasonably well
Characteristic (ROC) curve for the 1D with AUC values around 0.75, while
CNN model, illustrating the trade-off classes 1 and 2 have slightly lower AUCs
between the true positive rate and false of 0.72 and 0.73, respectively. Overall, the
positive rate across different classes. The RNN model demonstrates reliable
Area Under the Curve (AUC) values classification capability, with class 0
indicate that class 0 achieves the best showing near-perfect separation
performance with an AUC of 0.96, performance.
followed by class 1 with 0.73, and class 2
with 0.56. In contrast, class 3 and class 4
show relatively lower AUC values of 0.26
M., Platt, D., Saurous, R.A., Seybold, B. and
Slaney, M., 2017, March. CNN architectures for
large-scale audio classification. In 2017 ieee
international conference on acoustics, speech and
signal processing (icassp) (pp. 131-135).

2. Choi, K., Fazekas, G., Sandler, M. and Cho, K.,


2017, March. Convolutional recurrent neural
networks for music classification. In 2017 IEEE
International conference on acoustics, speech and
signal processing (ICASSP) (pp. 2392-2396).

3. R. Kumar, M. Gupta, S. Ahmed, A. Alhumam,


and T. Aggarwal, “Intelligent audio signal
processing for detecting rainforest species using
deep learning,” Intelligent Automation & Soft
Computing, vol. 31, no. 2, pp. 692–706, 2022.
Fig.10. Receiver Operating Characteristic (ROC)
curve for Simple RNN. 4. M. Gupta and R. Sharma, “Deep learning-based
environmental sound classification using CNN and
RNN architectures,” Journal of Intelligent Systems,
Conclusion: vol. 30, no. 4, pp. 415–427, 2021.
In this study, a deep learning-based 5. Pons, J., Lidy, T. and Serra, X., 2016, June.
framework was implemented for audio Experimenting with musically motivated
signal classification using 1D convolutional neural networks. In 2016 14th
international workshop on content-based
Convolutional Neural Network (1D CNN)
multimedia indexing (CBMI) (pp. 1-6).
and Recurrent Neural Network (RNN)
architectures. Experimental results 6. K. Zaman, M. Sah, C. Direkoglu, and
M. Unoki, ‘‘A survey of audio
demonstrated that the 1D CNN model
classification using deep learning,’’ IEEE
achieved superior performance with an Access, vol.11, pp.106621–
accuracy of 86.77%, while the RNN 106652,Oct.2023,doi:10.1109/ACCESS.2
model attained an accuracy of 9.80%. The 023.3318015.
higher accuracy of the CNN model 7. Zaman, K., Sah, M., Direkoglu, C. and
highlights its effectiveness in capturing Unoki, M., 2023. A survey of audio
local temporal patterns and discriminative classification using deep learning. IEEE
audio features. These findings confirm that access, 11, pp.106620-106649.
CNN-based architectures are more 8. Qamhan, M.A., Altaheri, H., Meftah,
efficient for audio classification tasks A.H., Muhammad, G. and Alotaibi, Y.A.,
compared to traditional sequential models. 2021. Digital audio forensics:
The proposed system can be further microphone and environment
classification using deep learning. Ieee
extended for real-time sound recognition
Access, 9, pp.62719-62733.
applications such as speech emotion
analysis, environmental sound monitoring, 9. R. Kumar, M. Gupta, S. Ahmed, A.
Alhumam, and T. Aggarwal, ‘‘Intelligent
and multimedia content classification.
audio signal processing for detecting
References : rainforest species using deep learning’’,
Intelligent automation and Soft
1. Hershey, S., Chaudhuri, S., Ellis, D.P., Computing, vol. 31, no. 2, pp. 693–706,
Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, 2022, doi: 10.32604/iasc.2022.019811.
10. M. A. Aslam, M. U. Sarwar, M. K.
Hanif, R. Talib, and U. Khalid, ‘‘Acoustic
classification using deep learning,’’ Int. J.
Adv. Comput. Sci. Appl. (IJACSA), vol. 9,
no. 8, pp. 153–159, 2018.
15. H. Hasan, M. S. M. Rahman, and M. S.
11. H. Purwins, B. Li, T. Virtanen, J. Islam, “Audio forensic authentication
Schlüter, S.-Y. Chang, and T. Sainath, using background noise,” Applied
‘‘Deep learning for audio signal Intelligence, vol. 42, no. 3, pp. 627–641,
processing,’’ IEEE J. Sel. Topics Signal Mar. 2015, doi: 10.1007/s10489-014-
Process., vol. 13, no. 2, pp. 206–219, 0629-7
May 2019, doi:
10.1109/JSTSP.2019.2908700. 16. E. Hassan, S. Elbedwehy, M. Y. Shams,
T. Abd El-Hafeez, and N. El-Rashidy,
12. Akinpelu and S. Viriri, “Deep learning “Optimizing poultry audio signal
framework for speech emotion classification with deep learning and
classification,” IEEE Access, vol. 12, pp. burn layer fusion,” J. Big Data, vol. 11,
152152–152182, Oct. 2024, doi: no. 135, pp. 1–29, Sep. 2024, doi:
10.1109/ACCESS.2024.3474553. 10.1186/s40537-024-00985-8.
13. Qamhan, H. Altaheri, A. H. Meftah, G. 17. Alzahrani, M. A. Aljohani, and M. A. Alzahrani,
Muhammad, and Y. A. Alotaibi, “Digital “Audio-based activities recognition using machine
audio forensics: Microphone and learning algorithms and deep learning,” Sensors,
environment classification using deep vol. 19, no. 4819, pp. 1–19, Oct. 2019, doi:
learning,” IEEE Access, vol. 9, pp. 62719– 10.3390/s19224819.
62733, Apr. 2021, doi:
10.1109/ACCESS.2021.3073786. 18. Kim, J.W., Salamon, J., Li, P. and Bello, J.P.,
2018, April. Crepe: A convolutional representation
14. Hashemi, M. Aghabozorgi, and M. T. for pitch estimation. In 2018 IEEE international
Sadeghi, “Persian music source conference on acoustics, speech and signal
separation in audio visual data using processing (ICASSP) (pp. 161-165).
deep learning,” in Proc. 6th Iranian Conf.
Signal Process. Intelligent Syst. (ICSPIS),
Yazd, Iran, Dec. 2020, pp. 1–6, doi:
10.1109/ICSPIS51611.2020.9349614

You might also like