Problem
I have a dataset with a constant column: every row has the same value in it. I fit a model on this data and create an Explainer instance.
When I try to create an Aspect from the explainer, I get an error:
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    asp = dx.Aspect(exp)
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\dalex\aspect\object.py", line 92, in __init__
    self.linkage_matrix = utils.calculate_linkage_matrix(
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\dalex\aspect\utils.py", line 121, in calculate_linkage_matrix
    linkage_matrix = linkage(squareform(dissimilarity), clust_method)
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\scipy\spatial\distance.py", line 2345, in squareform
    is_valid_dm(X, throw=True, name='X')
  File "C:\Users\user\anaconda3\envs\noobenv\lib\site-packages\scipy\spatial\distance.py", line 2420, in is_valid_dm
    raise ValueError(('Distance matrix \'%s\' must be '
ValueError: Distance matrix 'X' must be symmetric.
How to replicate
You can run the following code to reproduce the error. Note that the third column of the data has the same value (3) in every row.
import numpy as np
import dalex as dx
from sklearn.tree import DecisionTreeClassifier

# the third column holds the same value (3) in every row
data = np.array([[242, 902, 3, 435],
                 [125, 684, 3, 143],
                 [162, 284, 3, 124],
                 [712, 844, 3, 145],
                 [122, 864, 3, 114],
                 [155, 100, 3, 25]])
target = np.array([723, 554, 932, 543, 654, 345])

clf = DecisionTreeClassifier()
clf.fit(data, target)

exp = dx.Explainer(clf, data, target)
asp = dx.Aspect(exp)  # raises the ValueError shown above
asp.plot_dendrogram()
Cause
When the Aspect instance is initialised, the pandas corr method is called inside utils.calculate_depend_matrix on the data we provide. If a column does not vary, its row and column in the resulting correlation matrix are NaN (related Pandas issue). Since NaN does not compare equal to itself, this is presumably why scipy then reports the distance matrix as not symmetric. When I change a single value in the constant column, the problem goes away.
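The NaN behaviour can be checked without dalex. The snippet below is a minimal sketch that calls pandas `DataFrame.corr` directly on the same data; it assumes the default Pearson method, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# same data as in the replication script; column 2 is constant
data = pd.DataFrame(
    [[242, 902, 3, 435],
     [125, 684, 3, 143],
     [162, 284, 3, 124],
     [712, 844, 3, 145],
     [122, 864, 3, 114],
     [155, 100, 3, 25]]
)

corr = data.corr()        # Pearson correlation between columns
nan_mask = corr.isnull()  # True wherever the correlation is undefined
print(nan_mask)

# NaN never compares equal to NaN, so an element-wise symmetry
# check on this matrix fails even though it "looks" symmetric
print(np.allclose(corr.values, corr.values.T))
```

The whole row and column of the constant feature come out as NaN, including the diagonal entry, and the symmetry check on the matrix returns False.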
Solution
The utils.calculate_depend_matrix function can be updated to replace NaN values before returning depend_matrix:
# pd, calculate_assoc_matrix and calculate_pps_matrix come from the same utils module
def calculate_depend_matrix(
    data, depend_method, corr_method, agg_method
):
    depend_matrix = pd.DataFrame()
    if depend_method == "assoc":
        depend_matrix = calculate_assoc_matrix(data, corr_method)
    if depend_method == "pps":
        depend_matrix = calculate_pps_matrix(data, agg_method)
    if callable(depend_method):
        try:
            depend_matrix = depend_method(data)
        except Exception as err:
            raise ValueError(
                "You have passed wrong callable in depend_method argument. "
                "'depend_method' is the callable to use for calculating dependency matrix."
            ) from err
    # if there is a non-varying column in data, there will be NaN values in the 'depend_matrix'.
    # replace NaN values on the diagonal with 1 and others with 0.
    depend_matrix[depend_matrix.isnull()] = 0
    for i in range(depend_matrix.shape[0]):
        depend_matrix.iloc[i, i] = 1
    return depend_matrix
With the function updated this way, I can create an Aspect instance and call the plot_dendrogram method. The following plot is generated:

Label 2 is the third column of my data, where every row has the value 3.
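The NaN replacement can also be tried in isolation. The following is only a sketch: `replace_nan_dependencies` is a hypothetical helper name, the 3x3 matrix is made up to mimic a dependency matrix with one constant feature, and the `1 - abs(...)` dissimilarity and the `"complete"` clustering method are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def replace_nan_dependencies(depend_matrix):
    """Hypothetical helper: NaN -> 0 off the diagonal, 1 on the diagonal."""
    fixed = depend_matrix.fillna(0.0)
    values = fixed.to_numpy(copy=True)
    np.fill_diagonal(values, 1.0)
    return pd.DataFrame(values, index=fixed.index, columns=fixed.columns)


# made-up dependency matrix; feature "c" plays the constant column
labels = ["a", "b", "c"]
depend_matrix = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, np.nan],
     [np.nan, np.nan, np.nan]],
    index=labels,
    columns=labels,
)

fixed = replace_nan_dependencies(depend_matrix)

# dissimilarity: 0 for identical features, 1 for unrelated ones
dissimilarity = 1 - fixed.abs().to_numpy()
condensed = squareform(dissimilarity)  # passes scipy's validity checks now
linkage_matrix = linkage(condensed, "complete")
print(linkage_matrix)
```

With this replacement the constant feature is treated as unrelated to every other feature (dissimilarity 1), so it joins the dendrogram last instead of breaking the clustering.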