0% found this document useful (0 votes)

21 views131 pages

Tail Label Challenges in Multi-Label Learning

The document provides an overview of multi-label classification, detailing traditional and deep learning methods, as well as challenges in extreme multi-label classification. It discusses various techniques for solving multi-label problems, including problem transformation and adapted algorithms like MLkNN. The document emphasizes the relevance of label correlation and the importance of addressing computational challenges in real-world applications such as recommender systems.

Uploaded by

raizel711247

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views131 pages

Tail Label Challenges in Multi-Label Learning

Uploaded by

raizel711247

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Multi-label Classification

(From basic to advanced)

Vinh Dinh Nguyen - PhD in Computer Science

Khiem Nguyen – STA
Thinh Nguyen – STA
Code & Data
2
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
3
Review – Binary Classification

Logistics Regression Support Vector Machine

4
What is Multi-Label Classification

•Text categorization
•Audio categorization
•Image categorization
•Bioinformatics

What if I ask you that does this image contains a house?

all things (or labels) are relevant to this picture?

House Tree Beach Cloud Mountain Animal

Yes Yes No Yes No No
5
Multi-Label v/s Multi-Class

Multi-label

Multi-class
6
Multi-Label Classification

Binary Multi-Class Multi-Label

Spam Cat Cat

Not Spam Dog Dog
Pandas Human
7
Multi-Label Classification

Softmax Softmax Sigmoid

[Link]
8
Real-world Examples: XML
Recommender Systems

Users’ information is collected as input features and their shopping history can be used to construct observed
labels. It is easy to see this problem can be formulated as a multi-label learning task.
9
Real-world Examples: XML
Recommender Systems

The problem of recommending related queries can be reformulated as an extreme multi-label learning task.
After the user submits a query, the search engine need to recommend the most related queries that might serve
the user’s requirements from a huge candidate set. Since the size of the label set can be up to millions, existing
ranking algorithms suffer from inadmissible computational cost
10
Real-world Examples: XML
Comparison XML with other machine learning settings.

[Link]
11
Multi-Label Classification

The quantity of publications related to the topic of MLC utilizing DL methods according to Google Scholar sources
from 2006-2023
12
Multi-Label Classification

A chronological overview of recent representative work in XML

13
Multi-Label Classification

Illustration of the core idea of different types of methods for solving XML problem, i.e., binary relevance, embedding based
method, and two kinds of tree-based method.
14
Multi-Label Classification

Domain relevance of different types of XML methods.

15
Extreme Multi-label Learning

Refers to explosive growth in the amount

of data, which poses many challenges to
XML

Refers to the frequency of labels observed

in training data, which typically follows a
long-tailed distribution

Refers to the inferior quality of the

annotated output labels.
[Link]
16
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
17
Multi-Label Classification: Traditional Methods
18
Multi-Label Classification: Traditional Methods

Techniques for Solving a Multi-Label classification problem

[Link] Transformation Transform our multi-label problem into single-label problem(s).

2. Adapted Algorithm Adapting the algorithm to directly perform multi-label classification

3. Ensemble approaches Use different ensembling classification functions

4. Neural Network Use different ensembling classification functions

19
Problem Transformation
Binary Relevance
❑ Decompose the multilabel classification problem into a set of independent binary classification problems
❑ Binary Relevance (BR) assumes that labels are independent to each other, and it learns a binary classifier for
each label separately

[Link]
20
Problem Transformation
Binary Relevance
21
Problem Transformation
`Binary Relevance
22
Problem Transformation
`Binary Relevance
Strong sides:
- estimates single-label classifiers - can generalize beyond available label
combinations
Weak sides:
- not suitable for large number of labels - ignores label relations
Remark:
Though the rationale behind BR is simple, it often achieves performance very close to
sophisticated XML approaches, including deep learning methods, which hence raises an
interesting question “does label correlation matter in XML?”. Even for extant deep learning
methods, very few approaches explicitly leverage label correlations.

Using techniques such as GNNs [73] to model the label correlation may be a promising
direction and can help answer the above question
23
Problem Transformation
Classifier Chains

The first classifier is trained just on the input data and then each next classifier is trained on the input space
and all the previous classifiers in the chain
24
Problem Transformation
Classifier Chains

Strong sides: - estimates single-label classifiers - can generalize beyond available label combinations - takes
label relations into account
Weak sides: - not suitable for large number of labels - quality strongly depends on the label ordering in chain.
25
Problem Transformation
Label Powerset

Transform the problem into a multi-class problem with one multi-class classifier is trained on all unique label
combinations found in the training data
26
Problem Transformation
Label Powerset

Strong sides: - estimates label dependencies, with only one classifier - often best solution for subset accuracy
if training data contains all relevant label combinations
Weak sides: - requires all label combinations predictable by the classifier to be present in the training data -
very prone to underfitting with large label spaces
27
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

No F1 F2 Label A Label B Label C Now, we want to classify a new sample:

1 2 3 1 0 1 ❖ Feature1 = 3
❖ Feature2 = 4
2 4 5 0 1 0 ❖ Result: Which labels
3 1 1 1 1 0
4 5 6 0 1 1 We set k = 3
5 3 4 1 0 1
28
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C Step 1: Calculate Distance to Neighbors
1 2 3 1 0 1
2 4 5 0 1 0
3 1 1 1 1 0
4 5 6 0 1 1
5 3 4 1 0 1

Now, we want to classify a new sample:

❖ Feature1 = 3
❖ Feature2 = 4
❖ Result: Which labels
29
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
Step 2: Gather Labels of the Neighbors
NO F1 F2 Label A Label B Label C

1 2 3 1 0 1
2 4 5 0 1 0
3 1 1 1 1 0
4 5 6 0 1 1
5 3 4 1 0 1
The nearest neighbors based on distance
(with K = 3) are:
Instance 5 (distance = 0)
Instance 1 (distance ≈ 1.41)
Instance 2 (distance ≈ 1.41)
30
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
Step 3: Count the Occurrences of Each Label
F1 F2 Label A Label B Label C
2 3 1 0 1 Now, we count how many times each label occurs in the K nearest
neighbors:
4 5 0 1 0 •Label A:
1 1 1 1 0 • It appears in 2 out of 3 neighbors (Instance 5 and Instance 1).
5 6 0 1 1
•Label B:
• It appears in 1 out of 3 neighbors (Instance 2).
3 4 1 0 1 •Label C:
• It appears in 2 out of 3 neighbors (Instance 5 and Instance 1).

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)

P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)
31
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)

P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2)

P(A=1) = 3/5 = 0.6

P(NA=2 |A=1) = ???

32
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) 1 2 3 1 0 1

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) 2 4 5 0 1 0

P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.

NO F1 F2 ❑ Nearest neighbors: Instance 5, Instance 3,

Instance 2.
1 2 3
❑ Labels for Label A: [1, 1, 0].
❑ So, for Instance 1, 2 neighbors have Label
33
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) 1 2 3 1 0 1

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) 2 4 5 0 1 0

P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.

❑ Nearest neighbors: Instance 1, Instance 5,

NO F1 F2
Instance 2.
3 1 1 ❑ Labels for Label A: [1, 1, 0].
❑ So, for Instance 3, 2 neighbors have Label A =
34
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.
NO F1 F2 Label A Label B Label C

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) 1 2 3 1 0 1

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) 2 4 5 0 1 0

P(NA=2 |A=1) 3 1 1 1 1 0
4 5 6 0 1 1
We are now interested in the frequency of how many 5 3 4 1 0 1
neighbors (among the 3 nearest ones) have Label A = 1.

NO F1 F2
❑ Nearest neighbors: Instance 1, Instance 2,
5 3 4 Instance 4.
❑ Labels for Label A: [1, 0, 0].
35
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

Frequency Table for NA(Neighbors with Label A)

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2)
P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2) Number of Neighbors with Label A Frequency when A=1

P(NA=2 |A=1) = 2/3 = 0.67 0 neighbors 0 coccurences

1 neighbors 1 coccurences
We are now interested in the frequency of how many 2 neighbors 2 coccurences
neighbors (among the 3 nearest ones) have Label A = 1.

This means that, if the test instance should have Label A (Label A = 1), then there is an 67% chance that 2 out of 3 of its
neighbors would also have Label A.
36
Adapted Algorithm
Multi-label k-Nearest Neighbour (MLkNN) algorithm

MLkNN builds uses k-Nearest Neighbors find nearest examples to a test class and uses
Bayesian inference to select assigned labels.

P(Label A=1∣count of Label A in neighbors) = P(A=1|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=1|N B=2)

P(Label A=0∣count of Label A in neighbors) = P(A=0|NA=2) P(Label B=1∣count of Label B in neighbors) = P(B=0|N B=2)

P(A=1|NA=2) = P(NA=2 |A=1) * P(A=1) / P(NA=2)

P(A=1) = 3/5 = 0.6 => P(A=0) = 0.4

P(NA=2 |A=1) = 0.67 => P(NA=2 |A=0) = 1.0

P(NA=2) = P(NA=2 ∣A=1) * P(A=1)+P(NA=2 ∣A=0) * P(A=0) = 0.6*0.67 + 0.4* 1.0 = 0.8
P(A=1|NA=2) = 0.67* 0.6 / 0.8 = 0.5025
Predict Label A
P(A=0|NA=2) = 1.0* 0.4 / 0.8 = 0.50
37
Neural Network

•Number of nodes in the output layer matches the number of labels.

•Sigmoid activation for each node in the output layer.
•Binary cross-entropy loss function.
38
Embedding-based methods
Embedding-based methods usually assume that the label matrix is low-rank which can
exploit the underlying correlations between labels.
39
Embedding-based methods
Embedding-based methods usually assume that the label matrix is low-rank which can
exploit the underlying correlations between labels.
40
Embedding-based methods
In summary:
❑ Conventional embedding-based methods offer
several advantages such as simplicity, ease of
implementation, strong theoretical foundations, and
the ability to manage label correlations and adapt to
online scenarios.

❑ However, they may sacrifice prediction accuracy due

to information loss during compression.

❑ Most deep learning methods can also be considered

embedding-based, they use more expressive neural
network representations compared to traditional
linear projections, potentially improving
classification performance.
41
Tree-based methods

In instance trees, each node consists of a set of

training examples and then distributed to the child
nodes. The intuition is that each region of feature
space contains only a small number of active labels.
The active label set at a node is the union of the
labels of all training points present in that node.
Label tree methods usually assume that labels are
organized in a tree structure. In label trees, each
node consists of a set of labels which are then
distributed to the child nodes. The tree structure is
determined by recursively clustering labels until
terminal conditions are reached.
42
Tree-based methods

Remark:
It is a common belief that tree-based methods
usually suffer from low prediction accuracy affected
by the so- called cascading effect, where the
prediction error at the top cannot be corrected at a
lower level. Therefore, to alleviate this issue, addition
modules can be designed as a post-processing step
43
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
44
Limitation: Imbalanced Data
45
Limitation: Poor Generalization

Machine Learning
46
Limitation: Computational Cost
{∅}
{A}
{𝐵}
n {𝐶}
Class = {A, B, C} 2 function
{𝐴, 𝐵}
RAKEL (Random k-Labelsets) {𝐵, 𝐶}
{𝐶, 𝐴}
{𝐴, 𝐵, 𝐶}
47
Limitation: Computational Cost

Some may never appear {∅}

{A}
Sparsity of the label space Long-tailed Distribution
{𝐵}
{𝐶}
Some appear frequently
{𝐴, 𝐵}
{𝐵, 𝐶}
{𝐶, 𝐴}
{𝐴, 𝐵, 𝐶}
[Link]
48
Limitation: Label Dependencies

[Link]
49
Limitation: Extreme Multi-label (XML)
Topic Modeling

Millions of tags

Dynamics Labels

Movie Recommendation Systems Imbalanced data

Weak Supervision
(Missing labels)
50
Multi-Label Classification: Deep Learning

[Link]
51
Deep Learning Approaches

Deep Learning
Loss function Architecture
52
Deep Neural Network (DNN)

Input layer Hidden layer Output layer

53
DNN – Softmax or Sigmoid?

𝑒 𝑧𝑖 threshold = 0.9
Softmax =
𝑧𝑗
σ𝐾
𝑗=1 𝑒

threshold = 0.5

1
Sigmoid = 1+ 𝑒 −𝑧𝑖 threshold = 0.1

Softmax & Sigmoid Activation function

54
Softmax for Binary Classification

Cumulative
Output layer Softmax probability

2.1 𝑒 𝑧𝑖 0.75
σ𝐾 𝑧𝑗
𝑗=1 𝑒

1.0 0.25

=1
55
Softmax for Binary Classification

Output layer Softmax

2.1 0.75

threshold = 0.5
1.0 0.25

=1
56
Softmax for Binary Classification

Probably a Dog!

2.1 0.75

Hard constraint
1.0 1 - 0.75

Maybe NOT a Dog!

57
Softmax for Multi-Class Classification
Softmax

Multi-Class 0.8
0.01
0.05
0

. threshold = 0.5 …
.
.

0.03
0
0

=1
58
Softmax for Multi-Class Classification
Softmax Probably a
Multi-Class 0.8
Spindlehorn

0.01
0.05
0 Not a
Spindlehorn
. …
.
. 1 - 0.8

0.03
0
0
59
Softmax for Multi-Label Classification
Softmax

Multi-Label 0
threshold = 0.5

0
0
0.5

.
…
.
.

0
0.5
0

=1
60
Softmax for Multi-Label Classification
Softmax

Multi-Label 0
0
0
0.45

.
…
.
.

0.05
0.5
0

=1
61
Sigmoid for Binary Classification

Multi-Label Output layer Sigmoid

1.8 0.86

-0.2 0.45

1
Sigmoid = 1+ 𝑒 −𝑧𝑖
62
Sigmoid for Binary Classification

Probably a Dog!
Multi-Label

0.86

0.45

Maybe a Cat…
Multi-Label
63
Sigmoid for Multi-Label Classification
Sigmoid
Definitely a
Multi-Label 2.1 0.89 Spindlehorn
-2.2 0.1
Maybe a
-3.0 0.05 Platypus
-0.03 0.49

. . . .
. . . . …
. . . .

-1.4 0.2 Unlikely a Cat

-0.9 0.3
-0.5 0.38
64
Sigmoid for Multi-Label Classification
Sigmoid

Multi-Label -1 0.26
-2.2 0.1
-3.0 0.05
-0.03 0.49

. . .
. . . …
. . .

2 0.88
-0.9 0.3
4 0.98
65
Sigmoid for Multi-Label Classification
Sigmoid

Multi-Label -1 0.26
-2.2 0.1
-3.0 0.05
-0.03 0.49

. . .
. . . …
. . .

2 0.88
-0.9 0.3
4 0.98
66
Flexible threshold
Softmax Sigmoid

0.1 0.26
0 0.55

0 0.05

0.35 0.49
.
. . .
… . . …
.
. .
threshold = 0.05

0.05 0.88

0.5 0.3
0 0.98

=1
67
Flexible threshold
Softmax Sigmoid

0.1 0.26
0 0.55

0 0.05

0.35 0.49
.
. . .
… . . …
.
. .
threshold = 0.05

0.05 0.88

0.5 0.3
0 0.98

=1
68
Notation

Labels = {Dog, Human, Cat, Pig}

Active labels
Irrelevant labels
(Relevant labels)
K = {Dog, Cat} L = {Human, Pig}
69
Global Loss
m
E =  ei
Output Label Error
2
( 0.9 - 1 ) 0.01
i =1

2
( 0.3 - -1 ) 1.69 Sample Loss
ei =  ( c − d )
Q
i i 2 e1 1 4.3
j j
j =1 2
e2 2 2.6
( 0.2 - -1 ) 0.64
e3 3 3.7
e4 4 1.5
2 e5 5 2.8
( 0.4 - 1 ) 1.96
14.9
4.3 = e1
70
Limitation: Multi-Label Ranking
Output Incorrect Correct

A: 0.9 A: 0.9 A: 0.9

Active labels = {A, C}

B: 0.3 D: 0.4 C: 0.2

Irrelevant labels = {B, D} C: 0.2 B: 0.3 D: 0.4

D: 0.4 C: 0.2 B: 0.3

71
Target of Multi-label Tasks
Incorrect Correct

A: 0.9 A: 0.9 A: 0.95

Active
Ranking Labels

D: 0.4 C: 0.2 C: 0.7

Label Dependency

B: 0.3 D: 0.4 D: 0.2

Irrelevant
Ranking
Labels

C: 0.2 B: 0.3 B: 0.1

Backpropagation for Multi-Label Loss

(BP-MLL)
73
Backpropagation for Multi-Label Loss

Output of Active Labels 0.9

(
exp − ( ck − cl ) )
K L
1
error of sample 𝑖 ei = 
K  L k =1 l =1
i i

Output of Irrelevant Labels 0.3

{ A, C} → 2 {B} → 1
Number of Active Labels Number of Irrelavant Labels
74
BP-MLL

ei =
1 K L

K  L k =1 l =1
exp (
− ( ck
i
− cl )
i
)
Example:

Output K = { A, C} L = {B}
0.8 Active Irrelevant
Labels Labels

1
ei = [exp(−(ciA − cBi )) + exp(−(cCi − cBi ))]
0.4 2 1

1
e0 = [exp(−(0.8 − 0.4)) + exp(−(0.2 − 0.4))] = 0.946
0.2 2 1
75
BP-MLL

Output 0 Output 1

A: 0.8 A: 0.8 1
e0 = [exp(−(0.8 − 0.4)) + exp(−(0.2 − 0.4))] = 0.946
2 1

1
B: 0.4 B: 0.4 e1 = [exp(−(0.8 − 0.4)) + exp(−(0.7 − 0.4))] = 0.71
2 1

Because {A, C} have dependent relationship → 𝑒0 > 𝑒1.

C: 0.2 C: 0.7
76
Why exp(-x)?
77
Why exp(-x)?

exp − ((0.9−0.2) + (0.7 −0.3) +...) = exp −+ → 0

If the number of labels is large:

When the active labels have high

probabilities and irrelevant labels have
low probabilities, the error will be low.
+
78
Why exp(-x)?
+

exp − ((0.1−0.8) + (0.3−0.7) +...) = exp −− → +

If the active labels have low probabilities

and irrelevant labels have high probabilities,
the error will be high.

−
79
BP-MLL

K = { A, C} L = {B} {𝐴, 𝐵}
Active Irrelevant
BP-MLL
Labels Labels
{𝐶, 𝐵}

Classes can be related to each other.

80
Still High Computational Cost

{𝐴, 𝐵}

{𝐴, 𝐷}
K = { A, C , E} L = {B, D} BP-MLL
Active Irrelevant
{𝐶, 𝐵}
Labels Labels

{𝐶, 𝐷}

Failed with extreme labels or the {𝐸, 𝐵}

number of labels is uncertain (may
increase over time). {𝐸, 𝐷}
81
AutoEncoder

Autoencoders are unsupervised models that encode

input data into a lower-dimensional latent space.
Only keeping important features.

Images and text are Feature extraction for better

unstructured and highly classification.
noisy information.
82
AutoEncoder

Autoencoders are unsupervised models that encode

input data into a lower-dimensional latent space.
Only keeping important features.
83
Canonical Correlated AutoEncoder

This turns C2AE into Semi-Supervised problem

Canonical Correlated AutoEncoder (C2AE) is the

first deep learning-based architecture for Multi-label
Classification that embeds both feature X and label
Y into the latent space.

[Link]
84
Canonical Correlated AutoEncoder

Loss between feature and label in latent space

Balanced term
Objective function:

 = min Fx ,Fe ,Fd  ( Fx , Fe ) +  ( Fe , Fd )

Loss between encoder and decoder

85
Canonical Correlated AutoEncoder

 = min Fx ,Fe ,Fd  ( Fx , Fe ) +  ( Fe , Fd )

Minimizing MSE between features and
min Fx , Fe || Fx ( X ) − Fe (Y ) ‖ 2F labels in latent space means
maximizing their correlation.
s.t Fx ( X ) Fx ( X )T = Fe (Y ) Fe (Y )T = I

The orthogonal constraint ensures Find two sets of vectors (features and
the vectors in both spaces remain labels) in the latent space so that they
independent. have the highest correlation.
86
Canonical Correlated AutoEncoder

These constraints are similar to PCA

Minimizing MSE between features and
min Fx , Fe || Fx ( X ) − Fe (Y ) ‖ 2F labels in latent space means
maximizing their correlation.
s.t Fx ( X ) Fx ( X )T = Fe (Y ) Fe (Y )T = I

The orthogonal constraint ensures

the vectors in both spaces remain
independent.
87
Limitations of C2AE

Rectangle
0%
100%
Square The deterministic latent space learned in C2AE
lacks smoothness and clear structure.
0% Rhombus

100% Rectangle
Minor perturbations in this latent space can lead
noise 0% to significantly different Decoder outputs.
Square

0% Rhombus
88

Variational AutoEncoders
(VAEs)
89
Variational AutoEncoder
90
Variational AutoEncoder

[Link]
91
Multivariate Probit Variational AutoEncoder
Probit model
“In statistics, a probit model is a type of regression where the dependent variable can take only two values” - Wikipedia

Class A Y = X  + ò,
*
ò ~ N (0, )

1 Y *  0  1 X T  +   0
Y = = 
0 otherwise  0 otherwise 
Not class A
92
Multivariate Probit Variational AutoEncoder

Cdf of Standard Normal Distribution

Y = X +ò
* Condition probability of label Y given feature X

Logistics function
93
Multivariate Probit Variational AutoEncoder
Traditional VAEs

Adaptively learning two

probabilistic embedding space

Multivariate Probit VAEs

Feature embedding space

Label embedding space

[Link]
94
Multivariate Probit Variational AutoEncoder

sample mean x

Shared Covariance matrix

sample mean y

[Link]
95
Multivariate Probit Variational AutoEncoder

During testing, since we don’t

have labels, so only yˆ f is used.

We can use this Covariance Metrix to

Dimensionality Reduction
interpret the distances between labels.

[Link]
96

Gaussian Mixture Variational Autoencoder

with Contrastive Learning
(C-GNVAE)

[Link]
97
C-GNVAE

VAE Reconstruction + +
(Negative log-likelihood)
+ KL-Divergence Contrastive Cross-entropy

Rebuild input from latent space Measure difference between Pull similar, push different samples Compare predicted and true labels
probability distributions
98
C-GNVAE
99
Twin Encoding Network (TEN)

Similar concepts with MPVAE,

Feature network
Twin Encoding Network (TEN) is
designed to compare two different
inputs X and Y by encoding them into
latent representations and measuring
Label network
the similarity between them.

[Link]
10
0 Twin Encoding Network (TEN)
TEN split Encoding
Extract label correlations into different steps.
directly might be failed with
high-order correlations.

Learn pairwise internal-interactions

Learn higher-order correlations

between features and labels.

Contain both feature information

and label correlations.
10
1 Neural Factorization Machine

Learn pairwise internal-interactions

p p p
yˆ FM (x) = w0 +  w j x j +  v T
j v k  x j xk
j =1 j =1 k = j +1

linear regression pairwise factorization

But it can only learn the second-order feature interactions

10
2 Twin Encoding Network (TEN)
Fully-connected layers

Learn higher-order correlations

between features and labels
10
3 Factorization Layer

[Link]

The Factorization layer is similar to kernel

functions, used to compute correlations between
features and labels through linear or non-linear
operations, such as the Hadamard product, Dot
Product, … helping reduce computational costs for
the model.

Additionally, techniques like matrix factorization and

bilinear interaction (capturing interactions using dual
weight matrices) can also be applied.
10
4

Dual-AutoEncoder

[Link]
10
5 Dual-AutoEncoder
Autoencoder-based methods only rely on the single autoencoder model,
lacking the capability to compute similarities between data spaces.

[Link]
10
6 Dual-AutoEncoder
Dual Autoencoder used two different sequential AutoEncoder.

Stacked Autoencoder with

RICA Manifold Regularization

[Link]
10
7 Dual-AutoEncoder

Improving the quality

Learning global features of representations
through manifold
regularization.
Stacked Autoencoder with
RICA Manifold Regularization

Similar representations
are kept closed together.

[Link]
10
8 Transformer

[Link]
10
9

Graph Attention Transformer (GATN)

[Link]
11
0 Graph Attention Transformer (GATN)
Graph Attention Transformer is designed to effectively explore complex relationships between labels.

Employs a two-step process

to enhance label relationship
representation.

[Link]
11
1

Multi-label Decoder
11
2 Multi-label Decoder (ML-Decoder)

For multi-label datasets with small number of classes

• There is no sequential structure like the words in a

sentence.
• The relationship between labels and input is static
and semantic, often fixed, without requiring step-
by-step dependency calculations as in text data.

Even if the input structure changes (e.g., text content),

the labels (topics) are likely to remain the same.
11
3 Multi-label Decoder (ML-Decoder)
Self-Attention is replaced with normal Label Embedding
to reduce computational cost and improve speed.

Fixed number of group labels K

Find the relationship between labels O( N 2 ) → O( N )

Expand K groups to N labels

O( N ) → O( K )
11
4 Multi-label Decoder (ML-Decoder)

Compare ML-Decoder with other methods on COCO dataset

11
5
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
11
6 Extreme Multi-label Learning

Label frequency of Wiki10-31K and Amazon-670K datasets, which follows a long-tailed distribution

[Link]
11
7 Extreme Multi-label Learning
Infrequently occurring labels (referred to as tail labels) possess limited training examples
and are harder to predict than frequently occurring ones (referred to as head labels).

Left: Label frequency follows a long-tailed

distribution. Right: Distribution of the norm of label
classifier weights

[Link]
11
8 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Robust Loss Function

H. Jain, Y. Prabhu, and M. Varma, “Extreme multi-label loss

functions for recommendation, tagging, ranking & other
missing label applications.” in Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and
Data Mining, 2016, pp. 935– 944.
11
9 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Data Manipulation

A simple idea to improve the tail label performance is to generate more data. For instance,
GLaS trains DNNs with input dropout which is a simple data augmentation technique for text
classification models with sparse features.

Re-Rank finds that models trained on long- tailed datasets tend to have imbalanced label
classifier norms. In other words, classifiers of head labels usually have larger norms, while
norms tail label classifiers are smaller. To this end, they propose to balance the classifier
weights for labels to improve the tail label performance through data augmentation
12
0 Extreme Multi-label Learning
To improve the performance of tail labels, many approaches have been studied.
In the following subsections, we introduce each type of methods with its
representative work.
Knowledge Transfer

Regarding transferring knowledge from some labels to some others, ECC is a simple
approach. ECC trains a binary classifiers for each label sequentially. Given each label, it trains
the classifier using both original features and the predictions of preceding labels’ classifiers.
This can effectively leverage label correlations.
J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Machine
learning, vol. 85, no. 3, pp. 333– 359, 2011.
12
1 Extreme Multi-label Learning
However, there are still big room for improvement:

❑ For instance, most existing studies are based on traditional machine learning
models, well-designed deep learning approaches should achieve superior
performance.

❑ Moreover, although data augmentation for tail labels are demonstrated to be

effective however, simply applying existing class-agnostic augmentation
techniques is unfavorable because they do not consider label correlations in
XML and may further increase the computational cost. How to better conduct
data augmentation for XML is still an open question.

❑ Additionally, in computer vision tasks, many training strategies are proposed

12
2 Weakly Supervised XML
In real-world XML applications, it is costly to obtain strong supervision by annotating all
data. Instead, we usually confront with a weakly supervised learning problem

For instance, due the large number of candidate labels, some relevant labels are missing.
Even worse, examples for some labels may not present in the training dataset. Also, new
labels may emerge over time rather than available beforehand. In some applications, we
have to deal with multi-instance XML problem

Missing labels As an intuitive approach, LEML handles missing labels by training model on
observed labels only which means the position of missing entries in label matrix need be
known in advance.
12
3 Code Demo
Multi Label Image Classification Dataset

Motorcycle, truck, boat, bus,

cycle, person, desert, mountains,
sea, sunset, trees, sitar, ektara,
flutes, tabla, harmonium

1 0 0 … 0 1

10 classes
12
4 Code Demo
12
5 Code Demo
12
6 Code Demo
12
7
Outline
➢ Introduction to Multi-label Classification

➢ Multi-Label Classification: Traditional Methods

➢ Multi-Label Classification: DL Methods

➢ Extreme Multi-label Classification

➢ Current Challenges and Future Research

➢ Summary
12
8 Challenges & Techniques in MLC

Category Description Challenges Techniques & Models

Graph CNN
Represents relationships Difficult to represent Autoencoder
Label Dependencies
between labels relationships between labels Transformers
Hybrid models
Extreme MLC Involves an extremely large Traditional classifiers like Rank Auto-Encoder (Rank-AE)
number of labels (up to one-vs-all, SVM, and DNNs DeepXML Framework
millions) struggle due to AttentionXML
computational limits. Two-stage XMTC (XRR)
Weakly-Supervised MLC Some labels are missing, and Expensive and time- Graph Neural Networks
supervised learning requires consuming data labeling. (GNNs)
expensive manual annotation Deep generative models
Hierarchical MLC
Zero-shot learning
Few-shot learning
Self-supervised learning
12
9 Challenges & Techniques in MLC

Category Description Challenges Techniques & Models

Refers to imbalances within These imbalances can occur

Imbalanced Data labels, between labels, or simultaneously, leading to DL model adaptation
among label sets reduced model performance

Extreme MLC Involves an extremely large Traditional classifiers like Rank Auto-Encoder (Rank-AE)
number of labels (up to one-vs-all, SVM, and DNNs DeepXML Framework
millions) struggle due to AttentionXML
computational limits Two-stage XMTC (XRR)
High Dimensionality An increasing number of labels High-dimensional data Global and local feature
leads to a growth in features, introduces noise, causing selection (GLFS)
reducing classification overfitting
performance
13
0 Summary

Criteria Traditional DNN AutoEncoder Transformer

Effectively captures label
Multi-Label Assumes labels are Learns a compressed dependencies through
Relationships independent latent representation self-attention
mechanisms
Performance Low Moderate High performance,
especially with sequential
data
Applications General-purpose MLC Feature compression Emotion classificatio
tasks Dimensionality reduction Text-based MLC
Disease classification
Video inspection
Model Complexity Simple Moderate High
13
1

131

Multi-Label Classification Techniques
No ratings yet
Multi-Label Classification Techniques
26 pages
Benchmarking Multi-Label Classifiers
No ratings yet
Benchmarking Multi-Label Classifiers
12 pages
Machine Learning Supervised Part1 15032022 090259pm
No ratings yet
Machine Learning Supervised Part1 15032022 090259pm
57 pages
Multiclass Classification Techniques Survey
No ratings yet
Multiclass Classification Techniques Survey
9 pages
Supervised Learning in Machine Learning
100% (1)
Supervised Learning in Machine Learning
124 pages
Supervised Learning in Machine Learning
No ratings yet
Supervised Learning in Machine Learning
27 pages
ML Lec 4 I-II-III S Classi and Reg Ir KNN DT Bayes SVM
No ratings yet
ML Lec 4 I-II-III S Classi and Reg Ir KNN DT Bayes SVM
67 pages
Multi-Label Classification with Hidden Nodes
No ratings yet
Multi-Label Classification with Hidden Nodes
34 pages
Big Data 5
No ratings yet
Big Data 5
66 pages
Collective Multi-Label Classification Techniques
No ratings yet
Collective Multi-Label Classification Techniques
8 pages
Supervised Learning
No ratings yet
Supervised Learning
79 pages
Supervised Learning: Regression & Classification
No ratings yet
Supervised Learning: Regression & Classification
141 pages
Understanding Classification in Machine Learning
No ratings yet
Understanding Classification in Machine Learning
10 pages
ML-kNN: Multi-Label Classification Algorithm
No ratings yet
ML-kNN: Multi-Label Classification Algorithm
4 pages
PR 2012
No ratings yet
PR 2012
22 pages
Multi-Label Learning Algorithms Review
No ratings yet
Multi-Label Learning Algorithms Review
26 pages
Multi-Label Classification With Weighted Labels Using Learning Classifier Systems
No ratings yet
Multi-Label Classification With Weighted Labels Using Learning Classifier Systems
7 pages
Chapter 4. Classification Algorithms-Stud
No ratings yet
Chapter 4. Classification Algorithms-Stud
43 pages
Multi-Label Learning Algorithms Review
No ratings yet
Multi-Label Learning Algorithms Review
42 pages
Supervised Learning: Classification in ML
No ratings yet
Supervised Learning: Classification in ML
33 pages
Supervised Learning: Classification Basics
No ratings yet
Supervised Learning: Classification Basics
13 pages
Multivariate Linear Regression & Classification
No ratings yet
Multivariate Linear Regression & Classification
35 pages
Classifier Chains For Multi-Label Classification: Jesse Read Bernhard Pfahringer Geoff Holmes Eibe Frank
No ratings yet
Classifier Chains For Multi-Label Classification: Jesse Read Bernhard Pfahringer Geoff Holmes Eibe Frank
27 pages
Classification Models in Supervised Learning
No ratings yet
Classification Models in Supervised Learning
48 pages
Hamming Loss in Multi-Label Classification
No ratings yet
Hamming Loss in Multi-Label Classification
15 pages
Understanding Binary and Multiclass Classification
No ratings yet
Understanding Binary and Multiclass Classification
24 pages
Supervised Learning: Classification Methods
No ratings yet
Supervised Learning: Classification Methods
145 pages
Supervised Learning: Classification Methods
No ratings yet
Supervised Learning: Classification Methods
37 pages
A Tutorial On Multilabel Learning
No ratings yet
A Tutorial On Multilabel Learning
38 pages
Beyond Binary: Multi-Class Classification
No ratings yet
Beyond Binary: Multi-Class Classification
16 pages
MBD - Lecture 3 Part 2
No ratings yet
MBD - Lecture 3 Part 2
38 pages
Text Categorization Algorithms Overview
No ratings yet
Text Categorization Algorithms Overview
68 pages
Understanding Machine Learning Classification
No ratings yet
Understanding Machine Learning Classification
49 pages
Overview of Classification Algorithms
No ratings yet
Overview of Classification Algorithms
75 pages
Weighted Nearest Neighbor for Unlabeled Data
No ratings yet
Weighted Nearest Neighbor for Unlabeled Data
12 pages
Hierarchical Multilabel Classification Method
No ratings yet
Hierarchical Multilabel Classification Method
12 pages
Overview of Classification Models in ML
No ratings yet
Overview of Classification Models in ML
38 pages
Non-Parametric Learning & kNN Methods
No ratings yet
Non-Parametric Learning & kNN Methods
47 pages
Machine Learning Algorithms Overview
No ratings yet
Machine Learning Algorithms Overview
33 pages
Understanding Multi-Class SVM Techniques
No ratings yet
Understanding Multi-Class SVM Techniques
19 pages
Extending ML-Plan for Multi-Label Classification
No ratings yet
Extending ML-Plan for Multi-Label Classification
8 pages
Multilabel Deep Learning.1
No ratings yet
Multilabel Deep Learning.1
56 pages
Predicting Class Labels in K-NN
No ratings yet
Predicting Class Labels in K-NN
56 pages
Comparison of Base Classifiers For Multi Label Learning - 2020 - Neurocomputing
No ratings yet
Comparison of Base Classifiers For Multi Label Learning - 2020 - Neurocomputing
10 pages
Optimizing Neural Networks for Multi-label Classification
No ratings yet
Optimizing Neural Networks for Multi-label Classification
17 pages
Classification Algorithms Overview
No ratings yet
Classification Algorithms Overview
44 pages
Multi-Class Classification Techniques
No ratings yet
Multi-Class Classification Techniques
18 pages
Ml-KNN: A Multi-Label Learning Method
No ratings yet
Ml-KNN: A Multi-Label Learning Method
26 pages
Machine Learning Classification Basics
No ratings yet
Machine Learning Classification Basics
6 pages
Overview of Binary Relevance in Multi-Label Learning
No ratings yet
Overview of Binary Relevance in Multi-Label Learning
12 pages
Supervised Learning Techniques Explained
No ratings yet
Supervised Learning Techniques Explained
6 pages
Handout 5 - Supervised Learning
No ratings yet
Handout 5 - Supervised Learning
34 pages
Logistic Regression and KNN Overview
No ratings yet
Logistic Regression and KNN Overview
48 pages
Machine Learning Classification Types
No ratings yet
Machine Learning Classification Types
25 pages
Ensemble Learning Techniques Explained
No ratings yet
Ensemble Learning Techniques Explained
31 pages
IML Unit2
No ratings yet
IML Unit2
82 pages
Bharat Hic ML 2010
No ratings yet
Bharat Hic ML 2010
8 pages
A Deep Learning Approach To The Classification of 3D CAD Models
No ratings yet
A Deep Learning Approach To The Classification of 3D CAD Models
16 pages
Hybrid CNN-LSTM Model for Traffic Prediction
No ratings yet
Hybrid CNN-LSTM Model for Traffic Prediction
6 pages
Visual Guide to Transformers (Part 2)
100% (1)
Visual Guide to Transformers (Part 2)
23 pages
Alzheimer's Detection with CNNs
No ratings yet
Alzheimer's Detection with CNNs
6 pages
Understanding Convolutional Neural Networks
No ratings yet
Understanding Convolutional Neural Networks
187 pages
Fall 2023 BS AI Exam Schedule
No ratings yet
Fall 2023 BS AI Exam Schedule
1 page
One-Shot Visual Imitation with Transformers
No ratings yet
One-Shot Visual Imitation with Transformers
14 pages
Pothole Detection via Deep Learning Survey
No ratings yet
Pothole Detection via Deep Learning Survey
3 pages
AI Advances in Actuarial Science
No ratings yet
AI Advances in Actuarial Science
23 pages
NLP Course Overview and Key Details
No ratings yet
NLP Course Overview and Key Details
35 pages
Advanced AI - Complete Course Notes
No ratings yet
Advanced AI - Complete Course Notes
38 pages
AI and Neural Networks: Key Concepts
No ratings yet
AI and Neural Networks: Key Concepts
3 pages
Deep Neural Networks: Architecture & Tuning
No ratings yet
Deep Neural Networks: Architecture & Tuning
33 pages
MLTS Exam Guidelines and Topics
No ratings yet
MLTS Exam Guidelines and Topics
6 pages
NLP Mastery Roadmap Guide
No ratings yet
NLP Mastery Roadmap Guide
3 pages
LLMs and RAG in EDA AI-Agent Development
No ratings yet
LLMs and RAG in EDA AI-Agent Development
29 pages
Oracle Machine Learning 1Z0-1096-23 Q&A
No ratings yet
Oracle Machine Learning 1Z0-1096-23 Q&A
4 pages
Image Super-Resolution Techniques Overview
No ratings yet
Image Super-Resolution Techniques Overview
6 pages
Previewpdf
50% (2)
Previewpdf
58 pages
The Weka Multilayer Perceptron Classifier: Daniel I. MORARIU, Radu G. Creţulescu, Macarie Breazu
No ratings yet
The Weka Multilayer Perceptron Classifier: Daniel I. MORARIU, Radu G. Creţulescu, Macarie Breazu
9 pages
FSA-YOLOv5 for Smart Home Detection
No ratings yet
FSA-YOLOv5 for Smart Home Detection
6 pages
Evolution of Artificial Neural Networks
No ratings yet
Evolution of Artificial Neural Networks
20 pages
AI Course: Learning Outcomes & Topics
No ratings yet
AI Course: Learning Outcomes & Topics
1 page
Deep Learning Concepts Overview
No ratings yet
Deep Learning Concepts Overview
17 pages
Unfolding RNN Computational Graphs
No ratings yet
Unfolding RNN Computational Graphs
44 pages
AI Concepts and Applications Overview
No ratings yet
AI Concepts and Applications Overview
19 pages
Leaf Disease Detection with ACO-CNN
No ratings yet
Leaf Disease Detection with ACO-CNN
6 pages
AI's Role in Reducing Human Error
No ratings yet
AI's Role in Reducing Human Error
2 pages
Predictive Maintenance with LSTM Models
No ratings yet
Predictive Maintenance with LSTM Models
15 pages
VoxNet: Real-Time 3D Object Recognition
No ratings yet
VoxNet: Real-Time 3D Object Recognition
8 pages