
S3T: SELF-SUPERVISED PRE-TRAINING WITH SWIN TRANSFORMER FOR MUSIC CLASSIFICATION

Hang Zhao¹, Chen Zhang², Bilei Zhu¹, Zejun Ma¹, Kejun Zhang²
¹ByteDance AI Lab Speech & Audio Team, ²Zhejiang University, China
arXiv:2202.10139v1 [[Link]] 21 Feb 2022

ABSTRACT

In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, aiming to learn meaningful music representations from massive, easily accessible unlabeled music data. S3T applies a momentum-based contrastive paradigm, MoCo, with Swin Transformer as its feature extractor, to the music time-frequency domain. For better music representation learning, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. To our knowledge, S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification. We evaluate S3T on music genre classification and music tagging tasks with linear classifiers trained on the learned representations. Experimental results show that S3T outperforms the previous self-supervised method (CLMR) by 12.5 percentage points in top-1 accuracy and 4.8 percentage points in PR-AUC on the two tasks respectively, and also surpasses the task-specific state-of-the-art supervised methods. In addition, S3T is label-efficient: using only 10% of the labeled data, it exceeds CLMR trained with 100% of the labeled data on both tasks.

Index Terms: Self-supervised learning, Swin Transformer, music genre classification, music tagging.

1. INTRODUCTION

Automatic music classification categorizes a music piece into labels based on several factors such as genre, mood, artist, and instrumentation. It is essential in the field of music information retrieval (MIR), with a variety of applications such as music recommendation, music retrieval, and music discovery. Therefore, music classification is of interest to both industry and academia and has been actively explored for decades.

Supervised learning methods have dominated previous music classification work. However, these methods normally consume a considerable amount of paired (labeled) data to train a model, and collecting paired data requires music annotation with certain expertise, which incurs a high cost. In contrast, unlabeled music data is easy to collect and available in large quantities. To exploit massive unpaired data, self-supervised learning (SSL) has been proposed in recent years and performs well in many fields. SSL conducts task-agnostic pre-training on a large amount of unlabeled data and learns representations through either heuristics-driven pretext tasks [1, 2] or augmentation-invariant contrastive methods [3, 4]; the learned representations are then transferred to a series of downstream tasks.

Recently, several works have attempted to improve music classification performance with self-supervised learning. CLMR [5] applied the SSL framework SimCLR [4] with SampleCNN [6] as the encoder on raw waveforms. SimCLR benefits from a large batch size due to its in-batch contrast setting; however, full-length music waveforms are normally too long to allow a batch size as large as that used for images. CLMR addressed this with a short fixed audio input of 2.7 s, leading to a trade-off between fair contrastive learning performance and sufficient temporal contextual information. Wu et al. [2] combined multi-task learning with SSL in the pre-training stage; however, it brought only marginal improvement, because the selected pretext tasks like waveform reconstruction and log power spectrum guide the representation learning process to focus more on detailed acoustic features than on discriminative content.

In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification. S3T builds upon MoCo [3], a momentum-based contrastive learning paradigm that decouples the number of negative samples from the batch size with a dynamic dictionary. S3T employs the Swin Transformer [7] as its backbone on the music time-frequency domain. Swin Transformer is a general-purpose vision backbone showing advances in several recognition tasks. In S3T, Swin Transformer helps extract meaningful features from music spectrograms under its inductive biases of locality, hierarchy, and translation invariance. In addition, we design a music augmentation pipeline and two pre-processors to better adapt S3T to the music domain.

We evaluate S3T on two music classification tasks, i.e., genre classification and music tagging, with linear classifiers. Empirical results show that S3T outperforms CLMR on both tasks, with top-1 genre classification accuracy on the GTZAN dataset being 12.5 percentage points higher, and the PR-AUC of music tagging on the MagnaTagATune dataset being 4.8 percentage points higher. S3T also surpasses the task-specific supervised state-of-the-art methods on both tasks. Moreover, S3T demonstrates its label efficiency: using only 10% of the labeled data, it surpasses CLMR trained with the full labeled data on both tasks.
[Figure 1: pipeline diagram. Left panel (Self-Supervised Pre-training): music audio signal, music data augmentation, augmented spectrograms, pre-processor, encoder and momentum encoder (Swin Transformer for music), representations, queue (enqueue), contrastive loss. Right panel (Downstream Transfer Learning): spectrogram, encoder, music classification (music genre classification and music tagging; example labels Pop/Jazz/Rock…, Violin/Guitar/Piano…, Happy/Sad/Peace…).]

Fig. 1: The pipeline of S3T. The left part illustrates the self-supervised pre-training stage and the right part introduces the downstream transfer learning stage.
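To make the pre-training loop pictured in Fig. 1 more concrete, the following is a minimal NumPy sketch of one MoCo-style contrastive step: an InfoNCE loss computed against a queue of negatives, the momentum update of the key encoder, and the queue refresh. It is a sketch under stated assumptions, not the authors' implementation: the single-matrix "encoders", the temperature of 0.2, the momentum of 0.999, the toy sizes, and all function names are illustrative choices that the paper does not specify, and the gradient step on the query encoder is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(q, k_pos, queue, temperature=0.2):
    """Mean InfoNCE loss for queries q (B, D), positive keys k_pos (B, D),
    and negative keys queue (K, D). Inputs are assumed L2-normalized."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)      # (B, 1) positive logits
    l_neg = q @ queue.T                                    # (B, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())

def momentum_update(w_q, w_k, m=0.999):
    """Key-encoder weights follow a moving average of the query encoder."""
    return m * w_k + (1.0 - m) * w_q

def update_queue(queue, keys):
    """Drop the oldest len(keys) entries and append the newest keys."""
    return np.concatenate([queue[len(keys):], keys], axis=0)

# Toy setup: linear maps stand in for the Swin Transformer encoders (assumption).
rng = np.random.default_rng(0)
B, D_in, D, K = 4, 16, 8, 32
w_q = rng.normal(size=(D_in, D))
w_k = w_q.copy()                                   # key encoder starts as a copy
queue = l2_normalize(rng.normal(size=(K, D)))      # stand-in for older keys

x_q = rng.normal(size=(B, D_in))                   # query view
x_k = x_q + 0.01 * rng.normal(size=(B, D_in))      # key view of the same samples
q = l2_normalize(x_q @ w_q)
k = l2_normalize(x_k @ w_k)

loss = info_nce(q, k, queue)
# (A gradient step on w_q would go here; it is omitted in this sketch.)
w_k = momentum_update(w_q, w_k)
queue = update_queue(queue, k)
```

Because the negatives come from the queue rather than from the current batch, the number of negatives (K) can be chosen independently of the batch size (B), which is the property the method relies on for long music inputs.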

2. METHOD

In this section, we introduce the design of S3T. We first describe the overall pipeline of S3T and then introduce the major components designed specifically for music audio.

2.1. Pipeline Overview

An overview of S3T is shown in Figure 1. S3T is based on the contrastive learning paradigm MoCo with a siamese structure and contains two stages: self-supervised pre-training and downstream transfer learning. The pre-training stage involves three modules: music data augmentation, an encoder and a momentum encoder with the same Swin Transformer structure, and a contrastive objective associated with a dynamic dictionary. During pre-training, S3T first converts the music audio signal into a time-frequency spectrogram and randomly applies a series of music data augmentations to yield two augmented spectrograms, one as the query and the other as the key. S3T then applies a specially designed pre-processor to them and sends them to the two encoders respectively. Finally, the encoded query looks up its corresponding key in the dictionary to learn a contrastive objective. In the transfer learning stage, we train simple linear classifiers on the self-supervised representations for downstream music classification tasks.

2.2. Self-Supervised Learning Framework

S3T employs a momentum-based contrastive learning framework, MoCo, on the music time-frequency domain. S3T has two encoders, a query encoder $\mathrm{Enc}_q$ and a momentum encoder $\mathrm{Enc}_k$, which encode the query $q = \mathrm{Enc}_q(x_q)$ and the key $k^+ = \mathrm{Enc}_k(x_k^+)$ respectively. S3T enqueues the encoded keys of the current mini-batch into a large queue that serves as a look-up dictionary, while dequeuing the oldest mini-batch. The encoded query $q$ then looks up its key $k^+$ in the dictionary with InfoNCE [1] as the contrastive loss, whose value is low if $q$ is similar to $k^+$ and dissimilar to all other keys in the dictionary. The dictionary is much larger than a training batch, which provides rich negative music samples for the contrast. This mechanism mitigates the heavy batch-size demand of in-batch contrastive methods like SimCLR, which only consider negative samples within one batch. To keep the dictionary tractable and consistent, the key encoder is updated by a momentum-based moving average of the query encoder instead of back-propagation.

2.3. Music Data Augmentation

Rich data augmentations promote contrastive learning by encouraging augmentation-invariant representations. To increase the robustness of music representations, we propose a music augmentation pipeline based on [8, 9], adding more complexity and stochasticity. First, Random Multi-crop is applied to an input sample, yielding 2 crops with random lengths. S3T assumes that two crops from the same music sample have similar representations. Then the other 4 transformations, each acting along either the frequency or the time axis, are randomly applied to the two crops, each with a specific probability $p$.

• Random Multi-crop. Cut two crops of lengths $T \cdot r_1$ and $T \cdot r_2$ at random from a spectrogram of length $T$, where $r_1$ and $r_2$ are uniformly sampled from the ratio range [0.1, 0.9].
• Random Frequency Masking. Mask $N$ frequency segments, each of length $F$. $N$ and $F$ are uniformly sampled from the ranges [1, 5] and [5, 30] respectively, with the total mask length constrained. Randomly applied with $p = 0.5$.
• Random Time Masking. Mask $N$ temporal segments, each of length $t = T \cdot r$, where $N$ and $r$ are uniformly sampled from [1, 10] and [0.01, 0.2] respectively; the total mask length is restricted to $T \cdot 0.4$. Randomly applied with $p = 0.5$.
• Time Warping. Following [8], a random point on the line along the time axis passing through the center of the spectrogram, within the time steps $(W, T - W)$, is warped either to the left or to the right by a distance $W$ sampled uniformly from [0, 10]. Randomly applied with $p = 0.4$.
• Random Shifting. Randomly apply a pitch shift or a time shift with $p = 0.4$. The spectrogram is shifted horizontally by $t$ steps or vertically by $f$ steps in either direction, where both $t$ and $f$ are sampled from a uniform distribution over [1, 10].
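The augmentation pipeline of Section 2.3 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: masked regions are filled with zeros, shifts are implemented as circular rolls, integer ranges are treated as inclusive, and the unspecified cap on the total frequency-mask length is left unenforced. Time Warping is omitted because it requires an interpolation scheme that the text does not detail. All function names are hypothetical.

```python
import numpy as np

def random_multi_crop(spec, rng, ratio_range=(0.1, 0.9)):
    """Return two random-length time crops of an (F, T) spectrogram."""
    T = spec.shape[1]
    crops = []
    for _ in range(2):
        length = max(1, int(T * rng.uniform(*ratio_range)))
        start = int(rng.integers(0, T - length + 1))
        crops.append(spec[:, start:start + length])
    return crops

def random_frequency_masking(spec, rng, p=0.5):
    """Zero out N in [1, 5] frequency bands, each F in [5, 30] bins wide."""
    if rng.random() >= p:
        return spec
    out = spec.copy()
    n_bins = out.shape[0]
    for _ in range(int(rng.integers(1, 6))):
        width = min(int(rng.integers(5, 31)), n_bins)
        start = int(rng.integers(0, n_bins - width + 1))
        out[start:start + width, :] = 0.0
    return out

def random_time_masking(spec, rng, p=0.5, max_total_ratio=0.4):
    """Zero out N in [1, 10] time segments of length T*r, r in [0.01, 0.2],
    keeping the summed mask length within max_total_ratio * T."""
    if rng.random() >= p:
        return spec
    out = spec.copy()
    T = out.shape[1]
    budget = int(T * max_total_ratio)
    for _ in range(int(rng.integers(1, 11))):
        if budget <= 0:
            break
        width = min(int(T * rng.uniform(0.01, 0.2)), budget)
        if width == 0:
            continue
        start = int(rng.integers(0, T - width + 1))
        out[:, start:start + width] = 0.0
        budget -= width
    return out

def random_shifting(spec, rng, p=0.4):
    """Shift by 1..10 steps along frequency (pitch) or time, in either direction.
    A circular roll is used here as a simplifying assumption."""
    if rng.random() >= p:
        return spec
    steps = int(rng.integers(1, 11)) * int(rng.choice([-1, 1]))
    axis = int(rng.integers(0, 2))  # 0: frequency axis, 1: time axis
    return np.roll(spec, steps, axis=axis)

def augment_pair(spec, rng):
    """Produce the (query, key) views: two crops, each independently transformed."""
    views = []
    for view in random_multi_crop(spec, rng):
        view = random_frequency_masking(view, rng)
        view = random_time_masking(view, rng)
        view = random_shifting(view, rng)
        views.append(view)
    return views

rng = np.random.default_rng(0)
spec = rng.random((84, 200)) + 1.0   # toy 84-bin spectrogram, strictly positive
query, key = augment_pair(spec, rng)
```

The two returned views generally have different time lengths, so a later step (such as the pre-processor) has to bring them to the fixed input size expected by the encoder.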
2.4. Swin Transformer for Music

Swin Transformer [7] was proposed as a general-purpose vision backbone and shows advances in both supervised and self-supervised works [10, 11]. It learns a hierarchical representation by gradually merging neighboring small-sized patches as the layers go deeper. Besides, the shifted windowing mechanism of Swin Transformer limits the self-attention computation to non-overlapping windows while also allowing cross-window connections. In S3T, the Swin Transformer extracts multi-resolution time-frequency features with self-attention within hierarchically partitioned spectrograms. Meanwhile, this scheme allows capturing cross-frequency and cross-temporal features. To leverage the powerful Swin Transformer without rigidly treating spectrograms as images, we design two spectrogram-specific pre-processors, Frequency Tiling and Time Folding, to adapt it properly to the music domain.

• Frequency Tiling. Replicate a spectrogram of shape $F \times T$ along the frequency axis $n$ times to $(F \cdot n) \times T$, then cut off the extra high-frequency part to form a shape of $T \times T$.
• Time Folding. Fold a music spectrogram along the time axis $n$ times, i.e., a spectrogram of shape $F \times T$ is directly collapsed into $(F \cdot n) \times (T / n)$, with equal length and width.

3. EXPERIMENTS AND RESULTS

3.1. Datasets

We conducted the self-supervised pre-training of S3T on an in-house dataset of 1 million music recordings covering various genres, languages, and themes. Each music recording is sampled at 16 kHz, with a length ranging from 3 to 5 minutes. The pre-trained model was then transferred to two downstream tasks: music genre classification and music tagging. The genre classification experiments were run on the GTZAN and FMA datasets, while the music tagging experiments were run on the MagnaTagATune dataset. The details of the three datasets are as follows:

• GTZAN [12] contains 1,000 30-second music clips distributed across 10 distinct genres, with a sampling rate of 22.05 kHz. In this paper, we used the “fault-filtered” split described in [13] to make a fair comparison with [5, 14].
• FMA [15] is a music analysis dataset collected from the Free Music Archive. We used the small version of FMA, which consists of 8,000 music clips of 8 genres, with each clip being around 30 seconds long and sampled at 44.1 kHz. We followed the official [Link] split to divide FMA into train/valid/test subsets.
• MagnaTagATune (MTAT) [16] contains 25,863 29-second music clips extracted from 5,223 songs, 445 albums and 230 artists. Each clip is associated with a vector of binary annotations of 188 tags. We followed the official partition using the top 50 most popular tags for evaluation.

3.2. Experimental Setup

Model Input. Each music sample input to S3T was first resampled to a sampling rate of 22.05 kHz and then converted to an 84-dim constant-Q transform (CQT) spectrogram, with 12 bins per octave, a hop size of 512, and the Hann window as the windowing function. To improve training efficiency, we further compressed the CQT spectrogram by a factor of 100 along the time axis. The empirical results demonstrate that, even with this 100-fold compression, S3T can still learn meaningful representations from CQTs, yielding competitive performance on music classification.

Feature Extractor. We followed the setting of Swin-T, the tiny version of Swin Transformer, with a hidden channel number of 96, {2, 2, 6, 2} layers in each block respectively, and a drop path rate of 0.1. To adapt the 84-dim CQT spectrogram for use with Swin-T, we changed the model input size to 256 × 256 with a window size of 8. We followed the MoCo v2 configuration, using a projection head with two hidden fully-connected layers to project the raw representation to a 128-dim hidden space. Unless otherwise mentioned, we used this setup in all experiments. For linear evaluation, we followed the common linear protocol, using a 1-layer MLP with the representations as input features for each downstream task.

Pre-training Setup. Following [3], the total size of the queue is 65,536. The pre-training of S3T was run on 4 Tesla V100 GPUs, with a batch size of 128 on each GPU and 400 training epochs in total. Following [7], we employed the AdamW [17] optimizer with default settings, a cosine decay learning rate scheduler, and 20 epochs of linear warm-up. The initial learning rate was set to 0.0005.

Downstream Setup. During the transfer learning stage, we trained a simple linear classifier for each downstream task with the feature extractor frozen. The training of each downstream task was run on 1 Tesla V100 GPU for 50 epochs in total. A batch size of 64 was used for GTZAN and 256 for both FMA and MTAT. We employed the AdamW optimizer with an initial learning rate of 0.001 and a weight decay of 0.05. Besides, we used the cosine decay learning rate scheduler with a linear warm-up for 5 epochs.

3.3. Experimental Results and Analyses

3.3.1. Main Results

We evaluated S3T on both music genre classification and music tagging, and compared our results with other methods using the linear protocol described in Section 3.2: 1) Random Init, which trains a linear classifier on a randomly initialized feature extractor; 2) Supervised, which pre-trains the feature extractor with labeled data in a supervised manner; 3) SSL (CLMR), which pre-trains the feature extractor with unpaired music by contrastive learning. For comparison, we use average tag-wise area under the receiver operating characteristic curve (ROC-AUC) and average precision (PR-AUC)
scores as metrics for music tagging on MTAT, and use top-1 accuracy for genre classification on GTZAN and FMA.

Method         MTAT PR-AUC   MTAT ROC-AUC   FMA Top-1(%)   GTZAN Top-1(%)
Random Init    18.0          73.5           26.0           27.6
Supervised     38.3          90.6           42.7           79.0
SSL (CLMR)     36.1          89.4           48.4           68.6
S3T (Ours)     40.9          89.9           56.4           81.1

Table 1: Comparison of linear evaluation results of S3T with the other three methods. Supervised results are collected from [14, 18].

S3T outperforms Random Init and CLMR on all metrics, and the respective Supervised methods on all metrics except ROC-AUC on MTAT (89.9% vs. 90.6%), as shown in Table 1. In particular, for music tagging, S3T achieves a PR-AUC of 40.9% and a ROC-AUC of 89.9% on MTAT, exceeding CLMR by 4.8 percentage points and the Supervised method by 2.6 percentage points in PR-AUC. For music genre classification, S3T surpasses CLMR by 12.5 and 8.0 percentage points in top-1 accuracy on GTZAN and FMA respectively, and exceeds the Supervised pre-trained method by 13.7 percentage points on FMA and 2.1 percentage points on GTZAN. Experimental results demonstrate that S3T can make good use of massive unlabeled music and pre-train a feature extractor with good representation learning ability for music classification tasks.

3.3.2. Method Analysis

In this part, we analyze the label efficiency, feature extractor choice, and pre-processor choice of S3T.

Label Efficiency. To study the label efficiency of S3T, we evaluated S3T with limited labeled subsets (1%, 10%) of MTAT and FMA for the two classification tasks and report the results in Table 2. Notably, for the music tagging task, S3T with only 10% of the labeled data achieves a PR-AUC of 37.2% on MTAT, which exceeds CLMR with the full labeled data (36.1%) and is competitive with the Supervised pre-trained result with the full labeled data (38.3%). For music genre classification, S3T with only 10% of the labeled data outperforms both CLMR and the Supervised pre-trained method with the full labeled data on FMA by 2.2 and 7.9 percentage points respectively. Evaluation results on both tasks demonstrate that S3T has strong representation learning capability even with only 10% or 1% labeled data.

Setting        MTAT PR-AUC   MTAT ROC-AUC   FMA Top-1(%)   FMA Top-5(%)
CLMR           36.1          89.4           48.4∗          90.6∗
Supervised     38.3          90.6           42.7           –
S3T-1%         27.2          77.4           39.0           84.8
S3T-10%        37.2          87.8           50.6           91.9
S3T-100%       40.9          89.9           56.4           95.3

Table 2: Comparison of linear evaluation results with different proportions of labeled data on MTAT for music tagging and FMA for genre classification. ∗ indicates reproduced results with the official release of the 10,000-epoch model.

Feature Extractor. To investigate the effect of the model backbone on music classification performance, we compared the performance of S3T on genre classification when using ResNet50 [19], MSNet [20]-based BiGRU [21], and Swin-T as the backbone, and report the results in Table 3. The results indicate that Swin-T outperforms both the CNN-based and the BiGRU-based backbones.

Backbone       GTZAN Top-1(%)   GTZAN Top-5(%)
ResNet50       73.6             93.4
MSNet-BiGRU    64.5             91.9
Swin-T         81.1             97.2

Table 3: Comparison of the results on GTZAN among different backbones.

Pre-processor. Our last experiment compares the two data pre-processors described in Section 2.4 on music genre classification, and the results are listed in Table 4. As shown in the table, Time Folding outperforms Frequency Tiling slightly (81.1% vs. 78.3% top-1 accuracy), mainly because it retains more of the original contextual information instead of just repeating small segments of the input sample.

Pre-processor      GTZAN Top-1(%)   GTZAN Top-5(%)
Frequency Tiling   78.3             96.6
Time Folding       81.1             97.2

Table 4: Comparison of the results on GTZAN between two pre-processors.

4. CONCLUSION

In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, which leverages massive unlabeled music data to improve the performance of music classification and reduce the dependence on a considerable amount of labeled music data. S3T is built upon a momentum contrast paradigm and is the first to apply the Swin Transformer as a feature extractor for music classification. Besides, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. Experiments show that S3T outperforms previous self-supervised methods and is competitive with task-specific supervised pre-trained methods. Our further method analysis verifies the effectiveness of the detailed designs in S3T, especially its label efficiency. In the future, we will consider pre-training on a larger amount of unlabeled data, explore a feature extractor that better leverages the temporal contextual information of music, and combine S3T with a specially designed classifier for music classification.
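As a concrete illustration of the two pre-processors from Section 2.4, the shape manipulations can be sketched in NumPy as below. This is a sketch under assumptions, not the authors' implementation: row 0 is taken to be the lowest frequency bin, Time Folding is read as stacking consecutive time chunks along the frequency axis, and the toy shapes are chosen so that $F \cdot n = T / n$ holds exactly. How the 84-bin CQT is brought to the 256 × 256 model input is not specified in the text and is not modeled here. The function names are hypothetical.

```python
import math
import numpy as np

def frequency_tiling(spec):
    """(F, T) -> (T, T): repeat the spectrogram along the frequency axis
    until it has at least T rows, then keep the first T rows."""
    F, T = spec.shape
    n = math.ceil(T / F)                 # number of replicas needed
    tiled = np.tile(spec, (n, 1))        # shape (F * n, T)
    return tiled[:T, :]                  # drop the extra rows at the top end

def time_folding(spec, n):
    """(F, T) -> (F * n, T // n): cut the time axis into n consecutive
    chunks and stack them along the frequency axis."""
    F, T = spec.shape
    if T % n != 0:
        raise ValueError("T must be divisible by n")
    chunks = np.split(spec, n, axis=1)   # n arrays of shape (F, T // n)
    return np.concatenate(chunks, axis=0)

# Toy example with F = 4 and T = 16, so that n = 2 gives a square 8 x 8 output.
spec = np.arange(4 * 16, dtype=float).reshape(4, 16)
tiled = frequency_tiling(spec)           # shape (16, 16)
folded = time_folding(spec, n=2)         # shape (8, 8)
```

In this reading, Frequency Tiling repeats the same content several times, whereas Time Folding keeps every original time frame exactly once, which is consistent with the explanation given for the results in Table 4.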
5. REFERENCES

[1] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.

[2] Ho-Hsiang Wu, Chieh-Chi Kao, Qingming Tang, Ming Sun, Brian McFee, Juan Pablo Bello, and Chao Wang, “Multi-task self-supervised pre-training for music classification,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 556–560.

[3] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.

[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.

[5] Janne Spijkervet and John Ashley Burgoyne, “Contrastive learning of musical representations,” in ISMIR, 2021.

[6] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam, “SampleCNN: End-to-end deep convolutional neural networks using very small filters for music classification,” Applied Sciences, vol. 8, no. 1, pp. 150, 2018.

[7] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” arXiv preprint arXiv:2103.14030, 2021.

[8] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.

[9] Haider Al-Tahan and Yalda Mohsenzadeh, “CLAR: Contrastive learning of auditory representations,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2530–2538.

[10] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu, “Self-supervised learning with swin transformers,” arXiv preprint arXiv:2105.04553, 2021.

[11] Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao, “Efficient self-supervised vision transformers for representation learning,” arXiv preprint arXiv:2106.09785, 2021.

[12] George Tzanetakis and Perry Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

[13] Corey Kereliuk, Bob L Sturm, and Jan Larsen, “Deep learning and music adversaries,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059–2071, 2015.

[14] Rodrigo Castellon, Chris Donahue, and Percy Liang, “Codified audio language modeling learns useful representations for music information retrieval,” in ISMIR, 2021.

[15] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson, “FMA: A dataset for music analysis,” in 18th International Society for Music Information Retrieval Conference, 2017.

[16] Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie, “Evaluation of algorithms using games: The case of music tagging,” in ISMIR. Citeseer, 2009, pp. 387–392.

[17] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

[18] Dat Thanh Tran, Nikolaos Passalis, Anastasios Tefas, Moncef Gabbouj, and Alexandros Iosifidis, “Attention-based neural bag-of-features learning for sequence data,” arXiv preprint arXiv:2005.12250, 2020.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[20] Xiaoyu Zhu, Junwei Liang, and Alexander Hauptmann, “MSNet: A multilevel instance segmentation network for natural disaster damage assessment in aerial videos,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2023–2032.

[21] Kyunghyun Cho, Bart Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP, 2014.
