0% found this document useful (0 votes)
21 views131 pages

Tail Label Challenges in Multi-Label Learning

The document provides an overview of multi-label classification, detailing traditional and deep learning methods, as well as challenges in extreme multi-label classification. It discusses various techniques for solving multi-label problems, including problem transformation and adapted algorithms like MLkNN. The document emphasizes the relevance of label correlation and the importance of addressing computational challenges in real-world applications such as recommender systems.

Uploaded by

raizel711247
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views131 pages

Tail Label Challenges in Multi-Label Learning

The document provides an overview of multi-label classification, detailing traditional and deep learning methods, as well as challenges in extreme multi-label classification. It discusses various techniques for solving multi-label problems, including problem transformation and adapted algorithms like MLkNN. The document emphasizes the relevance of label correlation and the importance of addressing computational challenges in real-world applications such as recommender systems.

Uploaded by

raizel711247
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Multi-label Classification

(From basic to advanced)

Vinh Dinh Nguyen - PhD in Computer Science


Khiem Nguyen – STA
Thinh Nguyen – STA
Code & Data
2
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
3
Review – Binary Classification

Logistics Regression Support Vector Machine


4
What is Multi-Label Classification

•Text categorization
•Audio categorization
•Image categorization
•Bioinformatics

What if I ask you that does this image contains a house?

all things (or labels) are relevant to this picture?

House Tree Beach Cloud Mountain Animal


Yes Yes No Yes No No
5
Multi-Label v/s Multi-Class

Multi-label

Multi-class
6
Multi-Label Classification

Binary Multi-Class Multi-Label

Spam Cat Cat


Not Spam Dog Dog
Pandas Human
7
Multi-Label Classification

Softmax Softmax Sigmoid

[Link]
8
Real-world Examples: XML
Recommender Systems

Users’ information is collected as input features and their shopping history can be used to construct observed
labels. It is easy to see this problem can be formulated as a multi-label learning task.
9
Real-world Examples: XML
Recommender Systems

The problem of recommending related queries can be reformulated as an extreme multi-label learning task.
After the user submits a query, the search engine need to recommend the most related queries that might serve
the user’s requirements from a huge candidate set. Since the size of the label set can be up to millions, existing
ranking algorithms suffer from inadmissible computational cost
10
Real-world Examples: XML
Comparison XML with other machine learning settings.

[Link]
11
Multi-Label Classification

The quantity of publications related to the topic of MLC utilizing DL methods according to Google Scholar sources
from 2006-2023
12
Multi-Label Classification

A chronological overview of recent representative work in XML


13
Multi-Label Classification

Illustration of the core idea of different types of methods for solving XML problem, i.e., binary relevance, embedding based
method, and two kinds of tree-based method.
14
Multi-Label Classification

Domain relevance of different types of XML methods.


15
Extreme Multi-label Learning

Refers to explosive growth in the amount


of data, which poses many challenges to
XML

Refers to the frequency of labels observed


in training data, which typically follows a
long-tailed distribution

Refers to the inferior quality of the


annotated output labels.
[Link]
16
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
17
Multi-Label Classification: Traditional Methods
18
Multi-Label Classification: Traditional Methods

Techniques for Solving a Multi-Label classification problem

[Link] Transformation Transform our multi-label problem into single-label problem(s).

2. Adapted Algorithm Adapting the algorithm to directly perform multi-label classification

3. Ensemble approaches Use different ensembling classification functions

4. Neural Network Use different ensembling classification functions


19
Problem Transformation
Binary Relevance
❑ Decompose the multilabel classification problem into a set of independent binary classification problems
❑ Binary Relevance (BR) assumes that labels are independent to each other, and it learns a binary classifier for
each label separately

[Link]
20
Problem Transformation
Binary Relevance
21
Problem Transformation
`Binary Relevance
22
Problem Transformation
`Binary Relevance
Strong sides:
- estimates single-label classifiers - can generalize beyond available label
combinations
Weak sides:
- not suitable for large number of labels - ignores label relations
Remark:
Though the rationale behind BR is simple, it often achieves performance very close to
sophisticated XML ap- proaches, including deep learning methods, which hence raises an
interesting question “does label correlation matter in XML?”. Even for extant deep learning
methods, very few approaches explicitly leverage label correlations.

Using techniques such as GNNs [73] to model the label correlation may be a promising
direction and can help answer the above question
23
Problem Transformation
Classifier Chains

The first classifier is trained just on the input data and then each next classifier is trained on the input space
and all the previous classifiers in the chain
24
Problem Transformation
Classifier Chains

Strong sides: - estimates single-label classifiers - can generalize beyond available label combinations - takes
label relations into account
Weak sides: - not suitable for large number of labels - quality strongly depends on the label ordering in chain.
25
Problem Transformation
Label Powerset

Transform the problem into a multi-class problem with one multi-class classifier is trained on all unique label
combinations found in the training data
26
Problem Transformation
Label Powerset

Strong sides: - estimates label dependencies, with only one classifier - often best solution for subset accuracy
if training data contains all relevant label combinations
Weak sides: - requires all label combinations predictable by the classifier to be present in the training data -
very prone to underfitting with large label spaces
27
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

No F1 F2 Label A Label B Label C Now, we want to classify a new sample:


1 2 3 1 0 1 ❖ Feature1 = 3
❖ Feature2 = 4
2 4 5 0 1 0 ❖ Result: Which labels
3 1 1 1 1 0
4 5 6 0 1 1 We set k = 3
5 3 4 1 0 1
28
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C Step 1: Calculate Distance to Neighbors
1 2 3 1 0 1
2 4 5 0 1 0
3 1 1 1 1 0
4 5 6 0 1 1
5 3 4 1 0 1

Now, we want to classify a new sample:


❖ Feature1 = 3
❖ Feature2 = 4
❖ Result: Which labels
29
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
Step 2: Gather Labels of the Neighbors
NO F1 F2 Label A Label B Label C

1 2 3 1 0 1
2 4 5 0 1 0
3 1 1 1 1 0
4 5 6 0 1 1
5 3 4 1 0 1
The nearest neighbors based on distance
(with K = 3) are:
Instance 5 (distance = 0)
Instance 1 (distance ≈ 1.41)
Instance 2 (distance ≈ 1.41)
30
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
Step 3: Count the Occurrences of Each Label
F1 F2 Label A Label B Label C
2 3 1 0 1 Now, we count how many times each label occurs in the K nearest
neighbors:
4 5 0 1 0 •Label A:
1 1 1 1 0 • It appears in 2 out of 3 neighbors (Instance 5 and Instance 1).
5 6 0 1 1
•Label B:
• It appears in 1 out of 3 neighbors (Instance 2).
3 4 1 0 1 •Label C:
• It appears in 2 out of 3 neighbors (Instance 5 and Instance 1).

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)

P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)
31
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)

P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2)

P(A=1) = 3/5 = 0.6

P(NA=2 |A=1) = ???


32
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) 1 2 3 1 0 1


P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) 2 4 5 0 1 0

P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.

NO F1 F2 ❑ Nearest neighbors: Instance 5, Instance 3,


Instance 2.
1 2 3
❑ Labels for Label A: [1, 1, 0].
❑ So, for Instance 1, 2 neighbors have Label
33
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) 1 2 3 1 0 1


P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) 2 4 5 0 1 0

P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.

❑ Nearest neighbors: Instance 1, Instance 5,


NO F1 F2
Instance 2.
3 1 1 ❑ Labels for Label A: [1, 1, 0].
❑ So, for Instance 3, 2 neighbors have Label A =
34
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) 1 2 3 1 0 1


P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) 2 4 5 0 1 0

P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.

NO F1 F2
❑ Nearest neighbors: Instance 1, Instance 2,
5 3 4 Instance 4.
❑ Labels for Label A: [1, 0, 0].
35
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

Frequency Table for NA(Neighbors with Label A)


P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2)
P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) Number of Neighbors with Label A Frequency when A=1

P(NA=2 |A=1) = 2/3 = 0.67 0 neighbors 0 coccurences


1 neighbors 1 coccurences
We are now interested in the frequency of how many 2 neighbors 2 coccurences
neighbors (among the 3 nearest ones) have Label A = 1.

This means that, if the test instance should have Label A (Label A = 1), then there is an 67% chance that 2 out of 3 of its
neighbors would also have Label A.
36
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)

P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2)

P(A=1) = 3/5 = 0.6 => P(A=0) = 0.4

P(NA=2 |A=1) = 0.67 => P(NA=2 |A=0) = 1.0

P(NA=2) = P(NA=2 ∣A=1) * P(A=1)+P(NA=2 ∣A=0) * P(A=0) = 0.6*0.67 + 0.4* 1.0 = 0.8
P(A=1|NA=2) = 0.67* 0.6 / 0.8 = 0.5025
Predict Label A
P(A=0|NA=2) = 1.0* 0.4 / 0.8 = 0.50
37
Neural Network

•Number of nodes in the output layer matches the number of labels.


•Sigmoid activation for each node in the output layer.
•Binary cross-entropy loss function.
38
Embedding-based methods
Embedding-based methods usually assume that the label matrix is low-rank which can
exploit the underlying correlations between labels.
39
Embedding-based methods
Embedding-based methods usually assume that the label matrix is low-rank which can
exploit the underlying correlations between labels.
40
Embedding-based methods
In summary:
❑ Conventional embedding-based methods offer
several advantages such as simplicity, ease of
implementation, strong theoretical foundations, and
the ability to manage label correlations and adapt to
online scenarios.

❑ However, they may sacrifice prediction accuracy due


to information loss during compression.

❑ Most deep learning methods can also be considered


embedding-based, they use more expressive neural
network representations compared to traditional
linear projections, potentially improving
classification performance.
41
Tree-based methods

In instance trees, each node consists of a set of


training examples and then distributed to the child
nodes. The intuition is that each region of feature
space contains only a small number of active labels.
The active label set at a node is the union of the
labels of all training points present in that node.
Label tree methods usually assume that labels are
organized in a tree structure. In label trees, each
node consists of a set of labels which are then
distributed to the child nodes. The tree structure is
determined by recursively clustering labels until
terminal conditions are reached.
42
Tree-based methods

Remark:
It is a common belief that tree-based methods
usually suffer from low prediction accuracy affected
by the so- called cascading effect, where the
prediction error at the top cannot be corrected at a
lower level. Therefore, to alleviate this issue, addition
modules can be designed as a post-processing step
43
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
44
Limitation: Imbalanced Data
45
Limitation: Poor Generalization

Machine Learning
46
Limitation: Computational Cost
{∅}
{A}
{𝐵}
n {𝐶}
Class = {A, B, C} 2 function
{𝐴, 𝐵}
RAKEL (Random k-Labelsets) {𝐵, 𝐶}
{𝐶, 𝐴}
{𝐴, 𝐵, 𝐶}
47
Limitation: Computational Cost

Some may never appear {∅}


{A}
Sparsity of the label space Long-tailed Distribution
{𝐵}
{𝐶}
Some appear frequently
{𝐴, 𝐵}
{𝐵, 𝐶}
{𝐶, 𝐴}
{𝐴, 𝐵, 𝐶}
[Link]
48
Limitation: Label Dependencies

[Link]
49
Limitation: Extreme Multi-label (XML)
Topic Modeling

Millions of tags

Dynamics Labels

Movie Recommendation Systems Imbalanced data

Weak Supervision
(Missing labels)
50
Multi-Label Classification: Deep Learning

[Link]
51
Deep Learning Approaches

Deep Learning
Loss function Architecture
52
Deep Neural Network (DNN)

Input layer Hidden layer Output layer


53
DNN – Softmax or Sigmoid?

𝑒 𝑧𝑖 threshold = 0.9
Softmax =
𝑧𝑗
σ𝐾
𝑗=1 𝑒

threshold = 0.5

1
Sigmoid = 1+ 𝑒 −𝑧𝑖 threshold = 0.1

Softmax & Sigmoid Activation function


54
Softmax for Binary Classification

Cumulative
Output layer Softmax probability

2.1 𝑒 𝑧𝑖 0.75
σ𝐾 𝑧𝑗
𝑗=1 𝑒

1.0 0.25

=1
55
Softmax for Binary Classification

Output layer Softmax

2.1 0.75

threshold = 0.5
1.0 0.25

=1
56
Softmax for Binary Classification

Probably a Dog!

2.1 0.75

Hard constraint
1.0 1 - 0.75

Maybe NOT a Dog!


57
Softmax for Multi-Class Classification
Softmax

Multi-Class 0.8
0.01
0.05
0

. threshold = 0.5 …
.
.

0.03
0
0

=1
58
Softmax for Multi-Class Classification
Softmax Probably a
Multi-Class 0.8
Spindlehorn

0.01
0.05
0 Not a
Spindlehorn
. …
.
. 1 - 0.8

0.03
0
0
59
Softmax for Multi-Label Classification
Softmax

Multi-Label 0
threshold = 0.5

0
0
0.5

.

.
.

0
0.5
0

=1
60
Softmax for Multi-Label Classification
Softmax

Multi-Label 0
0
0
0.45

.

.
.

0.05
0.5
0

=1
61
Sigmoid for Binary Classification

Multi-Label Output layer Sigmoid

1.8 0.86

-0.2 0.45

1
Sigmoid = 1+ 𝑒 −𝑧𝑖
62
Sigmoid for Binary Classification

Probably a Dog!
Multi-Label

0.86

0.45

Maybe a Cat…
Multi-Label
63
Sigmoid for Multi-Label Classification
Sigmoid
Definitely a
Multi-Label 2.1 0.89 Spindlehorn
-2.2 0.1
Maybe a
-3.0 0.05 Platypus
-0.03 0.49

. . . .
. . . . …
. . . .

-1.4 0.2 Unlikely a Cat


-0.9 0.3
-0.5 0.38
64
Sigmoid for Multi-Label Classification
Sigmoid

Multi-Label -1 0.26
-2.2 0.1
-3.0 0.05
-0.03 0.49

. . .
. . . …
. . .

2 0.88
-0.9 0.3
4 0.98
65
Sigmoid for Multi-Label Classification
Sigmoid

Multi-Label -1 0.26
-2.2 0.1
-3.0 0.05
-0.03 0.49

. . .
. . . …
. . .

2 0.88
-0.9 0.3
4 0.98
66
Flexible threshold
Softmax Sigmoid

0.1 0.26
0 0.55

0 0.05

0.35 0.49
.
. . .
… . . …
.
. .
threshold = 0.05

0.05 0.88

0.5 0.3
0 0.98

=1
67
Flexible threshold
Softmax Sigmoid

0.1 0.26
0 0.55

0 0.05

0.35 0.49
.
. . .
… . . …
.
. .
threshold = 0.05

0.05 0.88

0.5 0.3
0 0.98

=1
68
Notation

Labels = {Dog, Human, Cat, Pig}

Active labels
Irrelevant labels
(Relevant labels)
K = {Dog, Cat} L = {Human, Pig}
69
Global Loss
m
E =  ei
Output Label Error
2
( 0.9 - 1 ) 0.01
i =1

2
( 0.3 - -1 ) 1.69 Sample Loss
ei =  ( c − d )
Q
i i 2 e1 1 4.3
j j
j =1 2
e2 2 2.6
( 0.2 - -1 ) 0.64
e3 3 3.7
e4 4 1.5
2 e5 5 2.8
( 0.4 - 1 ) 1.96
14.9
4.3 = e1
70
Limitation: Multi-Label Ranking
Output Incorrect Correct

A: 0.9 A: 0.9 A: 0.9

Active labels = {A, C}


B: 0.3 D: 0.4 C: 0.2

Irrelevant labels = {B, D} C: 0.2 B: 0.3 D: 0.4

D: 0.4 C: 0.2 B: 0.3


71
Target of Multi-label Tasks
Incorrect Correct

A: 0.9 A: 0.9 A: 0.95


Active
Ranking Labels

D: 0.4 C: 0.2 C: 0.7

Label Dependency

B: 0.3 D: 0.4 D: 0.2

Irrelevant
Ranking
Labels

C: 0.2 B: 0.3 B: 0.1


72

Backpropagation for Multi-Label Loss


(BP-MLL)
73
Backpropagation for Multi-Label Loss

Output of Active Labels 0.9

(
exp − ( ck − cl ) )
K L
1
error of sample 𝑖 ei = 
K  L k =1 l =1
i i

Output of Irrelevant Labels 0.3

{ A, C} → 2 {B} → 1
Number of Active Labels Number of Irrelavant Labels
74
BP-MLL

ei =
1 K L

K  L k =1 l =1
exp (
− ( ck
i
− cl )
i
)
Example:

Output K = { A, C} L = {B}
0.8 Active Irrelevant
Labels Labels

1
ei = [exp(−(ciA − cBi )) + exp(−(cCi − cBi ))]
0.4 2 1

1
e0 = [exp(−(0.8 − 0.4)) + exp(−(0.2 − 0.4))] = 0.946
0.2 2 1
75
BP-MLL

Output 0 Output 1

A: 0.8 A: 0.8 1
e0 = [exp(−(0.8 − 0.4)) + exp(−(0.2 − 0.4))] = 0.946
2 1

1
B: 0.4 B: 0.4 e1 = [exp(−(0.8 − 0.4)) + exp(−(0.7 − 0.4))] = 0.71
2 1

Because {A, C} have dependent relationship → 𝑒0 > 𝑒1.


C: 0.2 C: 0.7
76
Why exp(-x)?
77
Why exp(-x)?

exp − ((0.9−0.2) + (0.7 −0.3) +...) = exp −+ → 0

If the number of labels is large:

When the active labels have high


probabilities and irrelevant labels have
low probabilities, the error will be low.
+
78
Why exp(-x)?
+

exp − ((0.1−0.8) + (0.3−0.7) +...) = exp −− → +

If the active labels have low probabilities


and irrelevant labels have high probabilities,
the error will be high.

−
79
BP-MLL

K = { A, C} L = {B} {𝐴, 𝐵}
Active Irrelevant
BP-MLL
Labels Labels
{𝐶, 𝐵}

Classes can be related to each other.


80
Still High Computational Cost

{𝐴, 𝐵}

{𝐴, 𝐷}
K = { A, C , E} L = {B, D} BP-MLL
Active Irrelevant
{𝐶, 𝐵}
Labels Labels

{𝐶, 𝐷}

Failed with extreme labels or the {𝐸, 𝐵}


number of labels is uncertain (may
increase over time). {𝐸, 𝐷}
81
AutoEncoder

Autoencoders are unsupervised models that encode


input data into a lower-dimensional latent space.
Only keeping important features.

Images and text are Feature extraction for better


unstructured and highly classification.
noisy information.
82
AutoEncoder

Autoencoders are unsupervised models that encode


input data into a lower-dimensional latent space.
Only keeping important features.
83
Canonical Correlated AutoEncoder

This turns C2AE into Semi-Supervised problem

Canonical Correlated AutoEncoder (C2AE) is the


first deep learning-based architecture for Multi-label
Classification that embeds both feature X and label
Y into the latent space.

[Link]
84
Canonical Correlated AutoEncoder

Loss between feature and label in latent space

Balanced term
Objective function:

 = min Fx ,Fe ,Fd  ( Fx , Fe ) +  ( Fe , Fd )

Loss between encoder and decoder


85
Canonical Correlated AutoEncoder

 = min Fx ,Fe ,Fd  ( Fx , Fe ) +  ( Fe , Fd )


Minimizing MSE between features and
min Fx , Fe || Fx ( X ) − Fe (Y ) ‖ 2F labels in latent space means
maximizing their correlation.
s.t Fx ( X ) Fx ( X )T = Fe (Y ) Fe (Y )T = I

The orthogonal constraint ensures Find two sets of vectors (features and
the vectors in both spaces remain labels) in the latent space so that they
independent. have the highest correlation.
86
Canonical Correlated AutoEncoder

These constraints are similar to PCA


Minimizing MSE between features and
min Fx , Fe || Fx ( X ) − Fe (Y ) ‖ 2F labels in latent space means
maximizing their correlation.
s.t Fx ( X ) Fx ( X )T = Fe (Y ) Fe (Y )T = I

The orthogonal constraint ensures


the vectors in both spaces remain
independent.
87
Limitations of C2AE

Rectangle
0%
100%
Square The deterministic latent space learned in C2AE
lacks smoothness and clear structure.
0% Rhombus

100% Rectangle
Minor perturbations in this latent space can lead
noise 0% to significantly different Decoder outputs.
Square

0% Rhombus
88

Variational AutoEncoders
(VAEs)
89
Variational AutoEncoder
90
Variational AutoEncoder

[Link]
91
Multivariate Probit Variational AutoEncoder
Probit model
“In statistics, a probit model is a type of regression where the dependent variable can take only two values” - Wikipedia

Class A Y = X  + ò,
*
ò ~ N (0, )

1 Y *  0  1 X T  +   0
Y = = 
0 otherwise  0 otherwise 
Not class A
92
Multivariate Probit Variational AutoEncoder

Cdf of Standard Normal Distribution

Y = X +ò
* Condition probability of label Y given feature X

Logistics function
93
Multivariate Probit Variational AutoEncoder
Traditional VAEs

Adaptively learning two


probabilistic embedding space

Multivariate Probit VAEs

Feature embedding space

Label embedding space

[Link]
94
Multivariate Probit Variational AutoEncoder

sample mean x

Shared Covariance matrix

sample mean y

[Link]
95
Multivariate Probit Variational AutoEncoder

During testing, since we don’t


have labels, so only yˆ f is used.

We can use this Covariance Metrix to


Dimensionality Reduction
interpret the distances between labels.

[Link]
96

Gaussian Mixture Variational Autoencoder


with Contrastive Learning
(C-GNVAE)

[Link]
97
C-GNVAE

VAE Reconstruction + +
(Negative log-likelihood)
+ KL-Divergence Contrastive Cross-entropy

Rebuild input from latent space Measure difference between Pull similar, push different samples Compare predicted and true labels
probability distributions
98
C-GNVAE
99
Twin Encoding Network (TEN)

Similar concepts with MPVAE,


Feature network
Twin Encoding Network (TEN) is
designed to compare two different
inputs X and Y by encoding them into
latent representations and measuring
Label network
the similarity between them.

[Link]
10
0 Twin Encoding Network (TEN)
TEN split Encoding
Extract label correlations into different steps.
directly might be failed with
high-order correlations.

Learn pairwise internal-interactions

Learn higher-order correlations


between features and labels.

Contain both feature information


and label correlations.
10
1 Neural Factorization Machine

Learn pairwise internal-interactions

p p p
yˆ FM (x) = w0 +  w j x j +  v T
j v k  x j xk
j =1 j =1 k = j +1

linear regression pairwise factorization

But it can only learn the second-order feature interactions


10
2 Twin Encoding Network (TEN)
Fully-connected layers

Learn higher-order correlations


between features and labels
10
3 Factorization Layer

[Link]

The Factorization layer is similar to kernel


functions, used to compute correlations between
features and labels through linear or non-linear
operations, such as the Hadamard product, Dot
Product, … helping reduce computational costs for
the model.

Additionally, techniques like matrix factorization and


bilinear interaction (capturing interactions using dual
weight matrices) can also be applied.
10
4

Dual-AutoEncoder

[Link]
10
5 Dual-AutoEncoder
Autoencoder-based methods only rely on the single autoencoder model,
lacking the capability to compute similarities between data spaces.

[Link]
10
6 Dual-AutoEncoder
Dual Autoencoder used two different sequential AutoEncoder.

Stacked Autoencoder with


RICA Manifold Regularization

[Link]
10
7 Dual-AutoEncoder

Improving the quality


Learning global features of representations
through manifold
regularization.
Stacked Autoencoder with
RICA Manifold Regularization

Similar representations
are kept closed together.

[Link]
10
8 Transformer

[Link]
10
9

Graph Attention Transformer (GATN)

[Link]
11
0 Graph Attention Transformer (GATN)
Graph Attention Transformer is designed to effectively explore complex relationships between labels.

Employs a two-step process


to enhance label relationship
representation.

[Link]
11
1

Multi-label Decoder
11
2 Multi-label Decoder (ML-Decoder)

For multi-label datasets with small number of classes

• There is no sequential structure like the words in a


sentence.
• The relationship between labels and input is static
and semantic, often fixed, without requiring step-
by-step dependency calculations as in text data.

Even if the input structure changes (e.g., text content),


the labels (topics) are likely to remain the same.
11
3 Multi-label Decoder (ML-Decoder)
Self-Attention is replaced with normal Label Embedding
to reduce computational cost and improve speed.

Fixed number of group labels K

Find the relationship between labels O( N 2 ) → O( N )

Expand K groups to N labels


O( N ) → O( K )
11
4 Multi-label Decoder (ML-Decoder)

Compare ML-Decoder with other methods on COCO dataset


11
5
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
11
6 Extreme Multi-label Learning

Label frequency of Wiki10-31K and Amazon-670K datasets, which follows a long-tailed distribution

[Link]
11
7 Extreme Multi-label Learning
Infrequently occurring labels (referred to as tail labels) possess limited training examples
and are harder to predict than frequently occurring ones (referred to as head labels).

Left: Label frequency follows a long-tailed


distribution. Right: Distribution of the norm of label
classifier weights

[Link]
11
8 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Robust Loss Function

H. Jain, Y. Prabhu, and M. Varma, “Extreme multi-label loss


functions for recommendation, tagging, ranking & other
missing label applications.” in Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and
Data Mining, 2016, pp. 935– 944.
11
9 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Data Manipulation

A simple idea to improve the tail label performance is to generate more data. For instance,
GLaS trains DNNs with input dropout which is a simple data augmentation technique for text
classification models with sparse features.

Re-Rank finds that models trained on long- tailed datasets tend to have imbalanced label
classifier norms. In other words, classifiers of head labels usually have larger norms, while
norms tail label classifiers are smaller. To this end, they propose to balance the classifier
weights for labels to improve the tail label performance through data augmentation
12
0 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Knowledge Transfer

Regarding transferring knowledge from some labels to some others, ECC is a simple
approach. ECC trains a binary classifiers for each label sequentially. Given each label, it trains
the classifier using both original features and the predictions of preceding labels’ classifiers.
This can effectively leverage label correlations.
J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Machine
learning, vol. 85, no. 3, pp. 333– 359, 2011.
12
1 Extreme Multi-label Learning
However, there are still big room for improvement:

❑ For instance, most existing studies are based on traditional machine learning
models, well-designed deep learning approaches should achieve superior
performance.

❑ Moreover, although data augmentation for tail labels are demonstrated to be


effective however, simply applying existing class-agnostic augmentation
techniques is unfavorable because they do not consider label correlations in
XML and may further increase the computational cost. How to better conduct
data augmentation for XML is still an open question.

❑ Additionally, in computer vision tasks, many training strategies are proposed


12
2 Weakly Supervised XML
In real-world XML applications, it is costly to obtain strong supervision by annotating all
data. Instead, we usually confront with a weakly supervised learning problem

For instance, due the large number of candidate labels, some relevant labels are missing.
Even worse, examples for some labels may not present in the training dataset. Also, new
labels may emerge over time rather than available beforehand. In some applications, we
have to deal with multi-instance XML problem

Missing labels As an intuitive approach, LEML handles missing labels by training model on
observed labels only which means the position of missing entries in label matrix need be
known in advance.
12
3 Code Demo
Multi Label Image Classification Dataset

Motorcycle, truck, boat, bus,


cycle, person, desert, mountains,
sea, sunset, trees, sitar, ektara,
flutes, tabla, harmonium

1 0 0 … 0 1

10 classes
12
4 Code Demo
12
5 Code Demo
12
6 Code Demo
12
7
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
12
8 Challenges & Techniques in MLC

Category Description Challenges Techniques & Models


Graph CNN
Represents relationships Difficult to represent Autoencoder
Label Dependencies
between labels relationships between labels Transformers
Hybrid models
Extreme MLC Involves an extremely large Traditional classifiers like Rank Auto-Encoder (Rank-AE)
number of labels (up to one-vs-all, SVM, and DNNs DeepXML Framework
millions) struggle due to AttentionXML
computational limits. Two-stage XMTC (XRR)
Weakly-Supervised MLC Some labels are missing, and Expensive and time- Graph Neural Networks
supervised learning requires consuming data labeling. (GNNs)
expensive manual annotation Deep generative models
Hierarchical MLC
Zero-shot learning
Few-shot learning
Self-supervised learning
12
9 Challenges & Techniques in MLC

Category Description Challenges Techniques & Models

Refers to imbalances within These imbalances can occur


Imbalanced Data labels, between labels, or simultaneously, leading to DL model adaptation
among label sets reduced model performance

Extreme MLC Involves an extremely large Traditional classifiers like Rank Auto-Encoder (Rank-AE)
number of labels (up to one-vs-all, SVM, and DNNs DeepXML Framework
millions) struggle due to AttentionXML
computational limits Two-stage XMTC (XRR)
High Dimensionality An increasing number of labels High-dimensional data Global and local feature
leads to a growth in features, introduces noise, causing selection (GLFS)
reducing classification overfitting
performance
13
0 Summary

Criteria Traditional DNN AutoEncoder Transformer


Effectively captures label
Multi-Label Assumes labels are Learns a compressed dependencies through
Relationships independent latent representation self-attention
mechanisms
Performance Low Moderate High performance,
especially with sequential
data
Applications General-purpose MLC Feature compression Emotion classificatio
tasks Dimensionality reduction Text-based MLC
Disease classification
Video inspection
Model Complexity Simple Moderate High
13
1

131

You might also like