0% found this document useful (0 votes)
9 views24 pages

COMP1942 Final Exam Answer Sheet

The document is an exam answer sheet for a course on exploring and visualizing data. It contains instructions, parts A and B, with multiple choice and short answer questions on topics like frequent itemset mining, clustering, Bayesian networks and performance metrics for classification.

Uploaded by

Karin Wong
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views24 pages

COMP1942 Final Exam Answer Sheet

The document is an exam answer sheet for a course on exploring and visualizing data. It contains instructions, parts A and B, with multiple choice and short answer questions on topics like frequent itemset mining, clustering, Bayesian networks and performance metrics for classification.

Uploaded by

Karin Wong
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

COMP1942 Answer Sheet

COMP1942 Exploring and Visualizing Data (Spring Semester 2018)


Final Examination (Answer Sheet)
Date: 25 May, 2018 (Fri)
Time: 16:30-19:30
Duration: 3 hours

Student ID:__________________ Student Name:_________________________________

Seat No. :__________________

Instructions:
(1) Please answer all questions in Part A in this paper.
(2) You can optionally answer the bonus question in Part B in this paper. You can obtain additional
marks for the bonus question if you answer it correctly.
(3) The total marks in Part A are 200.
(4) The total marks in Part B are 20.
(5) The total marks you can obtain in this exam are 200 only.
If you answer the bonus question in Part B correctly, you can obtain additional marks.
But, if the total marks you obtain from Part A and Part B are over 200, your marks will be truncated to
200 only.
(6) You can use a calculator.

Answer Sheet
Part Question Full Mark Mark
Q1 20
Q2 20
Q3 20
Q4 20
Q5 20
A
Q6 20
Q7 20
Q8 20
Q9 20
Q10 20
Total (Part A) 200
B Q11 (OPTIONAL) 20
Total (Parts A and B) 200
1/24
COMP1942 Answer Sheet

Part A (Compulsory Short Questions)


Q1 (20 Marks)
(a)

The reason why we cannot simply output C as the final output is that not all itemsets in C are frequent (i.e.,
not all itemsets in C can be in the final output).

Let us use the size-2 itemset generation for illustration.

Originally, L1 = {P, Q, S, T}
After the counting step and the pruning step, we have
C2 = { PQ, PS, PT, QS, QT, ST }
Not all itemsets in C2 have frequency at least 2.
E.g., ST is not frequent since its frequency is equal to 1. Thus, ST is not in the output.

(b)

Itemset Frequency
{a, b, c} 4
{a, b} 7
{a, c} 4
{b, c} 4
a 15
b 7
c 4

2/24
COMP1942 Answer Sheet
Q2 (20 Marks)

(a)(i)

 Make initial guesses for the means m1, m2, …, mk


 Until Interrupted
o Acquire the next example x
o If mi is closest to x,
 replace mi by mi + a(x – mi)

(ii)

mn = mn-1 + a(xn – mn-1)


= (1-a)mn-1 + axn
= (1-a)[(1-a)mn-2 + axn-1] + axn
= (1-a)2mn-2 + (1-a)axn-1 + axn
=(1-a)2[(1-a)mn-3 + axn-2] + (1-a)axn-1 + axn
=(1-a)3mn-3 + (1-a)2axn-2 + (1-a)axn-1 + axn
=…
n
= (1-a)nm0 + 
p 1
(1-a)n-paxp

X = (1-a)n
Y = (1-a)n-pa

3/24
COMP1942 Answer Sheet
Q2 (Continued)

(b)

Consider the correlation between A and B.


B\A 1 0
1 2 0
0 1 1
XAB2 = 1.33

Consider the correlation between A and C.


C\A 1 0
1 1 1
0 2 0
2
XAC = 1.33

Consider the correlation between B and C.


C\B 1 0
1 0 2
0 2 0
2
XBC = 4

For attribute A,
XAB2 + XAC2 = 1.33 + 1.33 = 2.66
For attribute B,
XAB2 + XBC2 = 1.33 + 4 = 5.33
For attribute C,
XAC2 + XBC2 = 1.33 + 4 = 5.33

We choose attribute B for splitting since it has the largest value.


We divide the data into two groups, namely {1, 2} and {3, 4}.

Dendrogram:

4
3

2
1

4/24
COMP1942 Answer Sheet

Q3 (20 Marks)
(a)(i)

Predicted Class
Actual Class Yes No
Yes 2 1
No 2 2

(ii)
Lift curve
Cumulative 3
2
Baseline
1
0 1 2 3 4 5 6 7 # cases

5/24
COMP1942 Answer Sheet

Q3 (Continued)
(b)

No. This is because we do not know the distance between cluster (a, b) and cluster (c d) and the distance
between (a b) and e.

6/24
COMP1942 Answer Sheet
Q4 (20 Marks)
(a) P(LC=Yes) =  x{Yes , No}  y{Yes , No}
P( LC  Yes | FH  x, S  y ) P( FH  x, S  y )
= 0.7  0.3  0.6  0.45  0.3  0.4  0.55  0.7  0.6  0.2  0.7  0.4
= 0.467

P ( LC  Yes | FH  Yes, Smo ker  No, PR  Yes)


P( PR  Yes | FH  Yes, Smo ker  No, LC  Yes)
 P( LC  Yes | FH  Yes, Smo ker  No)
P( PR  Yes | FH  Yes, Smo ker  No)
P( PR  Yes | LC  Yes)  P( LC  Yes | FH  Yes, Smo ker  No)

x{Yes,No} P( PR  Yes | LC  x) P( LC  x | FH  Yes, Smo ker  No)
0.85  0.45

0.85  0.45  0.45  0.55
 0.607143
P ( LC  No | FH  Yes, Smo ker  No, PR  Yes)  1  0.601743  0.392857

 P ( LC  Yes | FH  Yes, Smo ker  No, PR  Yes)  P( LC  No | FH  Yes, Smo ker  No, PR  Yes)  It
is more likely that the person is likely to have Lung Cancer.

(b) Disadvantages:
The Bayesian Belief network classifier requires a predefined knowledge about the network.
The Bayesian Belief Network classifier cannot work directly when the network contains cycles.

7/24
COMP1942 Answer Sheet
Q5 (20 Marks)
(a) Yes.

specificity = 3/4
= 0.75 (or 75%)

(b) Yes.

precision = 3/4
= 0.75 (or 75%)

(c) Yes.

recall = 3/4
= 0.75 (or 75%)

(d) Yes.

f-measure = 2 x Precision x Recall /(Precision + Recall)


= 2 x 0.75 x 0.75 /(0.75 + 0.75)
= 0.75 (or 75%)

8/24
COMP1942 Answer Sheet
Q6 (20 Marks)
(a)
P(X, Y | Z)
P( X , Y , Z )
=
P( Z )
P( X , Y , Z ) P(Y , Z )
= 
P(Y , Z ) P( Z )
= P(X|Y, Z) x P(Y|Z)
= P(X|Z) x P(Y|Z)

(b)

The curse of dimensionality can be described as follows.


When the number of dimensions increases, the distance between any two points is nearly the same.

(c)

Iteration 1:
b w1 w2
0.1 0.1 0.1
y=1
Incorrect!
w1 = w1 + α d ‐ y x1
= 0.1 + 0.5*(0 - 1) * 0
= 0.1
w2 = w2 + α d ‐ y x2
= 0.1 + 0.5*(0 - 1) * 0
= 0.1
b=b+α d‐y
= 0.1 + 0.5*(0 - 1)
= -0.4

9/24
COMP1942 Answer Sheet
Q6 (Continued)

Iteration 2:
b w1 w2
y=0 Correct! ‐0.4 0.1 0.1

w1 = w1 + α d ‐ y x1
= 0.1 + 0.5*(0 - 0) * 0
= 0.1
w2 = w2 + α d ‐ y x2
= 0.1 + 0.5*(0 - 0) * 1
= 0.1
b=b+α d‐y
= -0.4 + 0.5*(0 - 0)
= -0.4

Iteration 3:
b w1 w2
y=0 ‐0.4 0.1 0.1
Correct!

w1 = w1 + α d ‐ y x1
= 0.1 + 0.5*(0 - 0) * 1
= 0.1
w2 = w2 + α d ‐ y x2
= 0.1 + 0.5*(0 - 0) * 0
= 0.1
b=b+α d‐y
= -0.4 + 0.5*(0 - 0)
= -0.4

Iteration 4:
b w1 w2
y=0 Incorrect! ‐0.4 0.1 0.1

w1 = w1 + α d ‐ y x1
= 0.1 + 0.5*(1 - 0) * 1
= 0.6
w2 = w2 + α d ‐ y x2
= 0.1 + 0.5*(1 - 0) * 1
= 0.6
b=b+α d‐y
= -0.4 + 0.5*(1 - 0)
= 0.1

10/24
COMP1942 Answer Sheet
Q6 (Continued)

Iteration 5:
b w1 w2
y=1 0.1 0.6 0.6
Incorrect!
w1 = w1 + α d ‐ y x1
= 0.6 + 0.5*(0 - 1) * 0
= 0.6
w2 = w2 + α d ‐ y x2
= 0.6 + 0.5*(0 - 1) * 0
= 0.6
b=b+α d‐y
= 0.1 + 0.5*(0 - 1) b w1 w2
= -0.4 ‐0.4 0.6 0.6

11/24
COMP1942 Answer Sheet
Q6 (Continued)

12/24
COMP1942 Answer Sheet
Q7 (20 Marks)
(a)

6859
  7
mean vector =  4    
 6  8  9  5   7 
 4 
 6  7    1
For data (6, 6), difference from mean vector =     
 6  7    1
 8  7  1
For data (8, 8), difference from mean vector =     
 8  7  1
5  7   2
For data (5, 9), difference from mean vector =     
9  7  2 
9  7  2 
For data (9, 5), difference from mean vector =     
5  7    2
 1 1  2 2
Y   
  1 1 2  2 
 1 1 
 
1 T 1   1 1  2 2  1 1 
  YY   
4 4   1 1 2  2   2 2 
 
 2  2
 
1 10  6 
  
4   6 10 
 5 3
  
 2 2
  3 5 

 2 2 
5 3
  5 3
2 2 0 (   ) 2  ( ) 2  0   4 or   1
3 5 2 2
 
2 2
when   4,
5 3   3 3
 4   x   0      x   0  1 1 x   0 
2 2  1       2
 x   0
2  1      
 x   0  1 1 x    0   x1  x2  0
1

  3 5
 4  2     
3 3
  2      2   
 2 2   2 2

13/24
COMP1942 Answer Sheet
Q7 (Continued)

 2 
 
 x1   2 
We choose the eigenvector of unit length:    .
 x2    2 
 
 2 
When   1 ,
5 3  3 3
  1   x   0     x   0   1  1 x   0 
2 2  1       2
 x   0
2  1      
 x   0    1 1  x    0   x1  x2  0
1

  3 5 3 3
 1 2       2      2   
 2 2   2 2 
 2
 
 x1   2 
We choose the eigenvector of unit length:    .
 x2   2 
 
 2 
 2 2  2 2
    
Thus,    2 2  , Y  X 
T  2 2 X .
 2 2  2 2 
   
 2 2   2 2 
 2 2
  
For data (6, 6), Y   2 2  6    0 
 2 2  6   8.49 
 
 2 2 
 2 2
   8
For data (8, 8), Y   2 2     0 
 2 2  8  11.31
 
 2 2 
 2 2
   5
For data (5, 9), Y   2 2      2.83 
 2 2  9   9.90 
 
 2 2 
 2 2
   9
For data (9, 5), Y   2 2     2.83 
 2 2  5   9.90 
 
 2 2 

 0  0  (2.83)  2.83 
   0 
The mean vector of the above transformed data points is  4    
 8.49  11.31  9.90  9.90   9.90 
 4 
The final transformed data points are:

14/24
COMP1942 Answer Sheet
Q7 (Continued)

 00   0 
For data (6, 6), final transformed vector =     
 8.49  9.90    1.41
 00   0 
For data (8, 8), final transformed vector =     
11.31  9.90  1.41
  2.83  0    2.83 
For data (5, 9), final transformed vector =     
 9.90  9.90   0 
 2.83  0   2.83 
For data (9, 5), final transformed vector =     
 9.90  9.90   0 

Thus, (6, 6) is reduced to (0);


(8, 8) is reduced to (0);
(5, 9) is reduced to (-2.83);
(9, 5) is reduced to (2.83).

(Note: Another possible answer is


(6, 6) is reduced to (0);
(8, 8) is reduced to (0);
(5, 9) is reduced to (2.83);
(9, 5) is reduced to (-2.83).
This is because the eigenvectors used in this case are:
 2  2
    
 x1   2   x1   2 
     
 x2   2   x2   2 
   
 2  and  2 . )

15/24
COMP1942 Answer Sheet
Q7 (Continued)

16/24
COMP1942 Answer Sheet
Q7 (Continued)

17/24
COMP1942 Answer Sheet
Q7 (Continued)
(b)

(5, 5) is reduced to (0);


(7, 7) is reduced to (0);
(4, 8) is reduced to (-2.83);
(8, 4) is reduced to (2.83).

(Note: Another possible answer is


(5, 5) is reduced to (0);
(7, 7) is reduced to (0);
(4, 8) is reduced to (2.83);
(8, 4) is reduced to (-2.83).)

(c)

(18, 18) is reduced to (0);


(24, 24) is reduced to (0);
(15, 27) is reduced to (-8.49);
(27, 15) is reduced to (8.49).

(Note: Another possible answer is


(18, 18) is reduced to (0);
(24, 24) is reduced to (0);
(15, 27) is reduced to (8.49);
(27, 15) is reduced to (-8.49).)

18/24
COMP1942 Answer Sheet
Q8 (20 Marks)

(a)
For a shorter query time (or for performing data analysis).

(b)
The greedy algorithm discussed in class can be modified be changing the heuristics function from the
computation of the benefit of a view to the computation of the benefit of a view per “unit space”.
i.e.
Let C(v) be the cost of view v(the number of rows in v)
Algorithm:
S  {top view} ;
X  X  C (v ) where v is the top view ;
While there exists a view v not in S s.t . C ( v )  X
Select the view v not in S s.t.
C (v )  X
B ( v , S ) / C ( v ) is maximized
S  S  {v}
X  X  C (v )
output S.

19/24
COMP1942 Answer Sheet

Q9 (20 Marks)
(a)

No.
We know that the whole dataset can be split into two clusters, {a, b} and {c, d, e}.
Consider cluster {c, d, e}.
We do not know the hierarchy for points c, d, and e.
We need two kinds of additional information, D({c}, {e}) and D({d}, {e}) to draw the dendrogram.

(b)

Cluster 1: {1, 2, 4, 5, 6}
Cluster 2: {3, 7, 8, 9, 10}

20/24
COMP1942 Answer Sheet
Q10 (20 Marks)

(a)

Yes.

x y z
x  0 1 0.5 
 
y  0.5 0 0.5 
z  0.5 0 0 

(b) (i)

w1, w2, w6, w7, w8

(ii)

w1, w2, w3, w6, w7, w8, w9

21/24
COMP1942 Answer Sheet

Part B (Bonus Question)


Note: The following bonus question is an OPTIONAL question. You can decide whether you will answer
it or not.

Q11 (20 Additional Marks)


(a)

Yes.

We can regard “Gain(V U {top view}, {top view})” as the cost reduction of materializing views in V
compared with that of materializing the top view only.

Similarly, we can regard “Gain(V U {top view, view A}, {top view, view A})” as the cost reduction of
materializing views in V compared with that of materializing the top view and the view A only.

Since materializing view A reduces the cost of accessing views which could be affected by the views in V,
we know that Gain(V U {top view}, {top view}) ≥ Gain(V U {top view, view A}, {top view, view A}).

22/24
COMP1942 Answer Sheet
Q11 (Continued)
(b)

No.

Consider the following example.


Let P = “view b” and C = “view c”.

We know that

Gain({view P} U {top view}, {top view}) = 0

and

Gain({view C} U {top view}, {top view}) = 92.

Thus,

Gain({view P} U {top view}, {top view}) < Gain({view C} U {top view}, {top view})

a 100

b 100 d 100

c2 e 100

f1

23/24
COMP1942 Answer Sheet
Q11 (Continued)
(c)

Yes.

We can regard “Gain({x} U S U {top view}, {top view}) - Gain(S U {top view}, {top view})” as the cost
reduction of materializing view x compared with that of materializing the top view and the views in S only.

Similarly, we can regard “Gain({x} U T U {top view}, {top view}) - Gain(T U {top view}, {top view}) ” as
the cost reduction of materializing view x compared with that of materializing the top view and the views in
T only.

Since S  T, materializing views in S reduces the cost of accessing views which could be affected by the
views in T, we know that
Gain({x} U S U {top view}, {top view}) - Gain(S U {top view}, {top view})
≥ Gain({x} U T U {top view}, {top view}) - Gain(T U {top view}, {top view})

End of Answer Sheet

24/24

You might also like