Tail Label Challenges in Multi-Label Learning
Tail Label Challenges in Multi-Label Learning
➢ Summary
3
Review – Binary Classification
•Text categorization
•Audio categorization
•Image categorization
•Bioinformatics
Multi-label
Multi-class
6
Multi-Label Classification
[Link]
8
Real-world Examples: XML
Recommender Systems
Users’ information is collected as input features and their shopping history can be used to construct observed
labels. It is easy to see this problem can be formulated as a multi-label learning task.
9
Real-world Examples: XML
Recommender Systems
The problem of recommending related queries can be reformulated as an extreme multi-label learning task.
After the user submits a query, the search engine need to recommend the most related queries that might serve
the user’s requirements from a huge candidate set. Since the size of the label set can be up to millions, existing
ranking algorithms suffer from inadmissible computational cost
10
Real-world Examples: XML
Comparison XML with other machine learning settings.
[Link]
11
Multi-Label Classification
The quantity of publications related to the topic of MLC utilizing DL methods according to Google Scholar sources
from 2006-2023
12
Multi-Label Classification
Illustration of the core idea of different types of methods for solving XML problem, i.e., binary relevance, embedding based
method, and two kinds of tree-based method.
14
Multi-Label Classification
➢ Summary
17
Multi-Label Classification: Traditional Methods
18
Multi-Label Classification: Traditional Methods
[Link]
20
Problem Transformation
Binary Relevance
21
Problem Transformation
`Binary Relevance
22
Problem Transformation
`Binary Relevance
Strong sides:
- estimates single-label classifiers - can generalize beyond available label
combinations
Weak sides:
- not suitable for large number of labels - ignores label relations
Remark:
Though the rationale behind BR is simple, it often achieves performance very close to
sophisticated XML ap- proaches, including deep learning methods, which hence raises an
interesting question “does label correlation matter in XML?”. Even for extant deep learning
methods, very few approaches explicitly leverage label correlations.
Using techniques such as GNNs [73] to model the label correlation may be a promising
direction and can help answer the above question
23
Problem Transformation
Classifier Chains
The first classifier is trained just on the input data and then each next classifier is trained on the input space
and all the previous classifiers in the chain
24
Problem Transformation
Classifier Chains
Strong sides: - estimates single-label classifiers - can generalize beyond available label combinations - takes
label relations into account
Weak sides: - not suitable for large number of labels - quality strongly depends on the label ordering in chain.
25
Problem Transformation
Label Powerset
Transform the problem into a multi-class problem with one multi-class classifier is trained on all unique label
combinations found in the training data
26
Problem Transformation
Label Powerset
Strong sides: - estimates label dependencies, with only one classifier - often best solution for subset accuracy
if training data contains all relevant label combinations
Weak sides: - requires all label combinations predictable by the classifier to be present in the training data -
very prone to underfitting with large label spaces
27
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C Step 1: Calculate Distance to Neighbors
1 2 3 1 0 1
2 4 5 0 1 0
3 1 1 1 1 0
4 5 6 0 1 1
5 3 4 1 0 1
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
Step 2: Gather Labels of the Neighbors
NO F1 F2 Label A Label B Label C
1 2 3 1 0 1
2 4 5 0 1 0
3 1 1 1 1 0
4 5 6 0 1 1
5 3 4 1 0 1
The nearest neighbors based on distance
(with K = 3) are:
Instance 5 (distance = 0)
Instance 1 (distance ≈ 1.41)
Instance 2 (distance ≈ 1.41)
30
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
Step 3: Count the Occurrences of Each Label
F1 F2 Label A Label B Label C
2 3 1 0 1 Now, we count how many times each label occurs in the K nearest
neighbors:
4 5 0 1 0 •Label A:
1 1 1 1 0 • It appears in 2 out of 3 neighbors (Instance 5 and Instance 1).
5 6 0 1 1
•Label B:
• It appears in 1 out of 3 neighbors (Instance 2).
3 4 1 0 1 •Label C:
• It appears in 2 out of 3 neighbors (Instance 5 and Instance 1).
P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)
P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)
31
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)
P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C
P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C
P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C
P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.
NO F1 F2
❑ Nearest neighbors: Instance 1, Instance 2,
5 3 4 Instance 4.
❑ Labels for Label A: [1, 0, 0].
35
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
This means that, if the test instance should have Label A (Label A = 1), then there is an 67% chance that 2 out of 3 of its
neighbors would also have Label A.
36
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm
MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)
P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)
P(NA=2) = P(NA=2 ∣A=1) * P(A=1)+P(NA=2 ∣A=0) * P(A=0) = 0.6*0.67 + 0.4* 1.0 = 0.8
P(A=1|NA=2) = 0.67* 0.6 / 0.8 = 0.5025
Predict Label A
P(A=0|NA=2) = 1.0* 0.4 / 0.8 = 0.50
37
Neural Network
Remark:
It is a common belief that tree-based methods
usually suffer from low prediction accuracy affected
by the so- called cascading effect, where the
prediction error at the top cannot be corrected at a
lower level. Therefore, to alleviate this issue, addition
modules can be designed as a post-processing step
43
Outline
➢ Introduction to Multi-label Classification
➢ Summary
44
Limitation: Imbalanced Data
45
Limitation: Poor Generalization
Machine Learning
46
Limitation: Computational Cost
{∅}
{A}
{𝐵}
n {𝐶}
Class = {A, B, C} 2 function
{𝐴, 𝐵}
RAKEL (Random k-Labelsets) {𝐵, 𝐶}
{𝐶, 𝐴}
{𝐴, 𝐵, 𝐶}
47
Limitation: Computational Cost
[Link]
49
Limitation: Extreme Multi-label (XML)
Topic Modeling
Millions of tags
Dynamics Labels
Weak Supervision
(Missing labels)
50
Multi-Label Classification: Deep Learning
[Link]
51
Deep Learning Approaches
Deep Learning
Loss function Architecture
52
Deep Neural Network (DNN)
𝑒 𝑧𝑖 threshold = 0.9
Softmax =
𝑧𝑗
σ𝐾
𝑗=1 𝑒
threshold = 0.5
1
Sigmoid = 1+ 𝑒 −𝑧𝑖 threshold = 0.1
Cumulative
Output layer Softmax probability
2.1 𝑒 𝑧𝑖 0.75
σ𝐾 𝑧𝑗
𝑗=1 𝑒
1.0 0.25
=1
55
Softmax for Binary Classification
2.1 0.75
threshold = 0.5
1.0 0.25
=1
56
Softmax for Binary Classification
Probably a Dog!
2.1 0.75
Hard constraint
1.0 1 - 0.75
Multi-Class 0.8
0.01
0.05
0
. threshold = 0.5 …
.
.
0.03
0
0
=1
58
Softmax for Multi-Class Classification
Softmax Probably a
Multi-Class 0.8
Spindlehorn
0.01
0.05
0 Not a
Spindlehorn
. …
.
. 1 - 0.8
0.03
0
0
59
Softmax for Multi-Label Classification
Softmax
Multi-Label 0
threshold = 0.5
0
0
0.5
.
…
.
.
0
0.5
0
=1
60
Softmax for Multi-Label Classification
Softmax
Multi-Label 0
0
0
0.45
.
…
.
.
0.05
0.5
0
=1
61
Sigmoid for Binary Classification
1.8 0.86
-0.2 0.45
1
Sigmoid = 1+ 𝑒 −𝑧𝑖
62
Sigmoid for Binary Classification
Probably a Dog!
Multi-Label
0.86
0.45
Maybe a Cat…
Multi-Label
63
Sigmoid for Multi-Label Classification
Sigmoid
Definitely a
Multi-Label 2.1 0.89 Spindlehorn
-2.2 0.1
Maybe a
-3.0 0.05 Platypus
-0.03 0.49
. . . .
. . . . …
. . . .
Multi-Label -1 0.26
-2.2 0.1
-3.0 0.05
-0.03 0.49
. . .
. . . …
. . .
2 0.88
-0.9 0.3
4 0.98
65
Sigmoid for Multi-Label Classification
Sigmoid
Multi-Label -1 0.26
-2.2 0.1
-3.0 0.05
-0.03 0.49
. . .
. . . …
. . .
2 0.88
-0.9 0.3
4 0.98
66
Flexible threshold
Softmax Sigmoid
0.1 0.26
0 0.55
0 0.05
0.35 0.49
.
. . .
… . . …
.
. .
threshold = 0.05
0.05 0.88
0.5 0.3
0 0.98
=1
67
Flexible threshold
Softmax Sigmoid
0.1 0.26
0 0.55
0 0.05
0.35 0.49
.
. . .
… . . …
.
. .
threshold = 0.05
0.05 0.88
0.5 0.3
0 0.98
=1
68
Notation
Active labels
Irrelevant labels
(Relevant labels)
K = {Dog, Cat} L = {Human, Pig}
69
Global Loss
m
E = ei
Output Label Error
2
( 0.9 - 1 ) 0.01
i =1
2
( 0.3 - -1 ) 1.69 Sample Loss
ei = ( c − d )
Q
i i 2 e1 1 4.3
j j
j =1 2
e2 2 2.6
( 0.2 - -1 ) 0.64
e3 3 3.7
e4 4 1.5
2 e5 5 2.8
( 0.4 - 1 ) 1.96
14.9
4.3 = e1
70
Limitation: Multi-Label Ranking
Output Incorrect Correct
Label Dependency
Irrelevant
Ranking
Labels
(
exp − ( ck − cl ) )
K L
1
error of sample 𝑖 ei =
K L k =1 l =1
i i
{ A, C} → 2 {B} → 1
Number of Active Labels Number of Irrelavant Labels
74
BP-MLL
ei =
1 K L
K L k =1 l =1
exp (
− ( ck
i
− cl )
i
)
Example:
Output K = { A, C} L = {B}
0.8 Active Irrelevant
Labels Labels
1
ei = [exp(−(ciA − cBi )) + exp(−(cCi − cBi ))]
0.4 2 1
1
e0 = [exp(−(0.8 − 0.4)) + exp(−(0.2 − 0.4))] = 0.946
0.2 2 1
75
BP-MLL
Output 0 Output 1
A: 0.8 A: 0.8 1
e0 = [exp(−(0.8 − 0.4)) + exp(−(0.2 − 0.4))] = 0.946
2 1
1
B: 0.4 B: 0.4 e1 = [exp(−(0.8 − 0.4)) + exp(−(0.7 − 0.4))] = 0.71
2 1
−
79
BP-MLL
K = { A, C} L = {B} {𝐴, 𝐵}
Active Irrelevant
BP-MLL
Labels Labels
{𝐶, 𝐵}
{𝐴, 𝐵}
{𝐴, 𝐷}
K = { A, C , E} L = {B, D} BP-MLL
Active Irrelevant
{𝐶, 𝐵}
Labels Labels
{𝐶, 𝐷}
[Link]
84
Canonical Correlated AutoEncoder
Balanced term
Objective function:
The orthogonal constraint ensures Find two sets of vectors (features and
the vectors in both spaces remain labels) in the latent space so that they
independent. have the highest correlation.
86
Canonical Correlated AutoEncoder
Rectangle
0%
100%
Square The deterministic latent space learned in C2AE
lacks smoothness and clear structure.
0% Rhombus
100% Rectangle
Minor perturbations in this latent space can lead
noise 0% to significantly different Decoder outputs.
Square
0% Rhombus
88
Variational AutoEncoders
(VAEs)
89
Variational AutoEncoder
90
Variational AutoEncoder
[Link]
91
Multivariate Probit Variational AutoEncoder
Probit model
“In statistics, a probit model is a type of regression where the dependent variable can take only two values” - Wikipedia
Class A Y = X + ò,
*
ò ~ N (0, )
1 Y * 0 1 X T + 0
Y = =
0 otherwise 0 otherwise
Not class A
92
Multivariate Probit Variational AutoEncoder
Y = X +ò
* Condition probability of label Y given feature X
Logistics function
93
Multivariate Probit Variational AutoEncoder
Traditional VAEs
[Link]
94
Multivariate Probit Variational AutoEncoder
sample mean x
sample mean y
[Link]
95
Multivariate Probit Variational AutoEncoder
[Link]
96
[Link]
97
C-GNVAE
VAE Reconstruction + +
(Negative log-likelihood)
+ KL-Divergence Contrastive Cross-entropy
Rebuild input from latent space Measure difference between Pull similar, push different samples Compare predicted and true labels
probability distributions
98
C-GNVAE
99
Twin Encoding Network (TEN)
[Link]
10
0 Twin Encoding Network (TEN)
TEN split Encoding
Extract label correlations into different steps.
directly might be failed with
high-order correlations.
p p p
yˆ FM (x) = w0 + w j x j + v T
j v k x j xk
j =1 j =1 k = j +1
[Link]
Dual-AutoEncoder
[Link]
10
5 Dual-AutoEncoder
Autoencoder-based methods only rely on the single autoencoder model,
lacking the capability to compute similarities between data spaces.
[Link]
10
6 Dual-AutoEncoder
Dual Autoencoder used two different sequential AutoEncoder.
[Link]
10
7 Dual-AutoEncoder
Similar representations
are kept closed together.
[Link]
10
8 Transformer
[Link]
10
9
[Link]
11
0 Graph Attention Transformer (GATN)
Graph Attention Transformer is designed to effectively explore complex relationships between labels.
[Link]
11
1
Multi-label Decoder
11
2 Multi-label Decoder (ML-Decoder)
➢ Summary
11
6 Extreme Multi-label Learning
Label frequency of Wiki10-31K and Amazon-670K datasets, which follows a long-tailed distribution
[Link]
11
7 Extreme Multi-label Learning
Infrequently occurring labels (referred to as tail labels) possess limited training examples
and are harder to predict than frequently occurring ones (referred to as head labels).
[Link]
11
8 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Robust Loss Function
A simple idea to improve the tail label performance is to generate more data. For instance,
GLaS trains DNNs with input dropout which is a simple data augmentation technique for text
classification models with sparse features.
Re-Rank finds that models trained on long- tailed datasets tend to have imbalanced label
classifier norms. In other words, classifiers of head labels usually have larger norms, while
norms tail label classifiers are smaller. To this end, they propose to balance the classifier
weights for labels to improve the tail label performance through data augmentation
12
0 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Knowledge Transfer
Regarding transferring knowledge from some labels to some others, ECC is a simple
approach. ECC trains a binary classifiers for each label sequentially. Given each label, it trains
the classifier using both original features and the predictions of preceding labels’ classifiers.
This can effectively leverage label correlations.
J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Machine
learning, vol. 85, no. 3, pp. 333– 359, 2011.
12
1 Extreme Multi-label Learning
However, there are still big room for improvement:
❑ For instance, most existing studies are based on traditional machine learning
models, well-designed deep learning approaches should achieve superior
performance.
For instance, due the large number of candidate labels, some relevant labels are missing.
Even worse, examples for some labels may not present in the training dataset. Also, new
labels may emerge over time rather than available beforehand. In some applications, we
have to deal with multi-instance XML problem
Missing labels As an intuitive approach, LEML handles missing labels by training model on
observed labels only which means the position of missing entries in label matrix need be
known in advance.
12
3 Code Demo
Multi Label Image Classification Dataset
1 0 0 … 0 1
10 classes
12
4 Code Demo
12
5 Code Demo
12
6 Code Demo
12
7
Outline
➢ Introduction to Multi-label Classification
➢ Summary
12
8 Challenges & Techniques in MLC
Extreme MLC Involves an extremely large Traditional classifiers like Rank Auto-Encoder (Rank-AE)
number of labels (up to one-vs-all, SVM, and DNNs DeepXML Framework
millions) struggle due to AttentionXML
computational limits Two-stage XMTC (XRR)
High Dimensionality An increasing number of labels High-dimensional data Global and local feature
leads to a growth in features, introduces noise, causing selection (GLFS)
reducing classification overfitting
performance
13
0 Summary
131