Understanding the data


Interactive database exploration using multidimensional scaling


Exploratory data analysis: visualization of data.

Topography-preserving data mapping method: MDS (multidimensional scaling)

Idea: place the images of the data vectors on the map so that the distances between the data vectors are preserved as well as possible.
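
A common way to write this idea down is a weighted stress function. The notation here is an assumption: D_ij is the distance between data vectors i and j, d_ij is the distance between their images on the map, and F_n is a normalization constant whose exact form is not specified in this text.

```latex
S = \frac{1}{F_n} \sum_{i<j} w_{ij} \left( D_{ij} - d_{ij} \right)^{2}
```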

The weights wij of the stress function control which distances are preserved best.

The stress is minimized using a gradient descent method (steepest descent, conjugate gradient, quasi-Newton, ...).

Our choice: steepest descent with second-order optimization of the step size along the gradient.
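
A minimal sketch of this kind of minimization follows. It is not the authors' software. It assumes a raw (unnormalized) weighted stress, estimates the curvature along the gradient by finite differences, and takes the Newton step of the resulting one-dimensional quadratic model. The function names (`stress`, `gradient`, `mds_steepest_descent`) and the step-halving safeguard are illustrative additions.

```python
import math


def stress(Y, D, W):
    """Raw weighted stress: sum over i<j of W[i][j] * (D[i][j] - d_ij)^2."""
    total = 0.0
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            total += W[i][j] * (D[i][j] - math.dist(Y[i], Y[j])) ** 2
    return total


def gradient(Y, D, W):
    """Gradient of the stress with respect to every mapped coordinate."""
    n, dim = len(Y), len(Y[0])
    G = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(Y[i], Y[j])
            if d == 0.0:
                continue  # gradient undefined for coincident images; skip the pair
            coeff = -2.0 * W[i][j] * (D[i][j] - d) / d
            for k in range(dim):
                g = coeff * (Y[i][k] - Y[j][k])
                G[i][k] += g
                G[j][k] -= g
    return G


def shifted(Y, G, a):
    """Configuration moved by a step of size a against the gradient."""
    return [[y - a * g for y, g in zip(yi, gi)] for yi, gi in zip(Y, G)]


def mds_steepest_descent(D, W, Y, iters=200, tol=1e-12):
    """Steepest descent; step size from a second-order model along the gradient."""
    s = stress(Y, D, W)
    for _ in range(iters):
        G = gradient(Y, D, W)
        g2 = sum(g * g for row in G for g in row)
        if g2 < tol:
            break
        # Finite-difference curvature of phi(a) = stress(Y - a * G) at a = 0.
        h = 1e-3 / math.sqrt(g2)
        curv = (stress(shifted(Y, G, h), D, W) - 2.0 * s
                + stress(shifted(Y, G, -h), D, W)) / h ** 2
        # Newton step of the quadratic model: a = -phi'(0) / phi''(0) = g2 / curv.
        a = g2 / curv if curv > 0.0 else h
        # Safeguard (an addition): halve the step until the stress decreases.
        while a > 1e-12:
            Y_new = shifted(Y, G, a)
            s_new = stress(Y_new, D, W)
            if s_new < s:
                Y, s = Y_new, s_new
                break
            a *= 0.5
        else:
            break  # no decreasing step found; stop
    return Y, s
```

Here `D` and `W` are square lists of lists (distances and weights), and `Y` is the initial configuration as a list of coordinate lists. The function returns the final configuration and its stress.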

Multiply the previous weight wij by a Gaussian-like term centered on Pc that decreases as the mean distance Dcij = (Dci + Dcj)/2 between the endpoints of Dij and the point Pc increases.
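
One possible form of such a term is sketched below. It assumes a plain Gaussian with a hypothetical width parameter σ. Neither the exact expression nor σ is given in this text.

```latex
w_{ij} \leftarrow w_{ij} \, \exp\!\left( -\frac{\left( D^{c}_{ij} \right)^{2}}{2\sigma^{2}} \right),
\qquad
D^{c}_{ij} = \frac{D_{ci} + D_{cj}}{2}
```

With this choice, pairs whose endpoints both lie close to Pc keep almost their full weight, while pairs far from Pc contribute little to the stress.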

 


Real life example of a database visualization


Psychometric MMPI test: patients as samples, numerical factors as attributes

Two datasets: Men / Women.

Women dataset

Metric MDS mapping of the Women database.
S0 = 0.075 (PCA initialization), Sconv = 0.024.

 

 


Focusing on data point 'p554' from class 'organika'

 

- Purpose: see (understand) why this data point is classified into class 'organika'.

- The point was classified using the IncNet neural network, for which features 2, 4 and 7 are sufficient to classify class 'organika' correctly.

- To avoid interference from noisy dimensions, only those dimensions (2, 4, 7) were used for the MDS mapping.
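
Selecting the neighborhood of a point on a subset of features can be sketched as below. The function name `nearest_neighbors` and the zero-based column indices are assumptions for illustration. How features 2, 4 and 7 map onto column indices depends on how the dataset is stored.

```python
import math


def nearest_neighbors(data, center, features, k):
    """Indices of the k rows closest to data[center], using only `features`."""
    def project(row):
        # Keep only the selected (assumed zero-based) feature columns.
        return [row[f] for f in features]

    c = project(data[center])
    dists = [
        (math.dist(project(row), c), idx)
        for idx, row in enumerate(data)
        if idx != center
    ]
    dists.sort()
    return [idx for _, idx in dists[:k]]
```

The returned indices, together with the focus point itself, define the subset of data vectors that is then passed to the MDS mapping.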

200 nearest neighbors: Sconv = 0.02695 (random initialization, trial 6)

100 nearest neighbors: Sconv = 0.14635 (random initialization, trial 24)

50 nearest neighbors: Sconv = 0.02849 (random initialization, trial 2)

20 nearest neighbors: Sconv = 0.01899 (random initialization, trial 1)

Visualization of IncNet classifier's decision borders

The 50 nearest neighbors with 100 classified Gaussian points (σ = 1)

The 50 nearest neighbors with 100 classified Gaussian points (σ = 2)

   

1 - Generation of 100 new points from a Gaussian distribution centered at p554,

2 - Classification of the new points using IncNet classifier,

3 - Addition of the new points to the 100 nearest neighbors map using relative mapping (each point is mapped separately).
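
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. `classify` is a hypothetical callable standing in for the IncNet classifier. Relative mapping is implemented here as a simple gradient descent that moves only the new point while the already mapped points stay fixed. The function names and the step-halving rule are illustrative choices.

```python
import math
import random


def point_stress(y, targets, Y):
    """Stress of a single image y against the fixed map Y."""
    return sum((t - math.dist(y, yj)) ** 2 for t, yj in zip(targets, Y))


def relative_map(new_point, X, Y, iters=200):
    """Place one new point on an existing map without moving the points in Y."""
    targets = [math.dist(new_point, x) for x in X]  # distances in data space
    y = [sum(c) / len(Y) for c in zip(*Y)]          # start at the map centroid
    s = point_stress(y, targets, Y)
    for _ in range(iters):
        grad = [0.0] * len(y)
        for t, yj in zip(targets, Y):
            d = math.dist(y, yj)
            if d > 0.0:
                c = -2.0 * (t - d) / d
                for k in range(len(y)):
                    grad[k] += c * (y[k] - yj[k])
        step = 1.0 / (2.0 * len(Y))
        while step > 1e-10:  # halve the step until the stress decreases
            trial = [yk - step * gk for yk, gk in zip(y, grad)]
            s_trial = point_stress(trial, targets, Y)
            if s_trial < s:
                y, s = trial, s_trial
                break
            step *= 0.5
        else:
            break  # no decreasing step found; stop
    return y, s


def probe_decision_border(center, X, Y, classify, sigma=1.0, n=100, seed=0):
    """Steps 1 to 3: sample around `center`, classify, map each point separately."""
    rng = random.Random(seed)
    cloud = [[rng.gauss(c, sigma) for c in center] for _ in range(n)]  # step 1
    labels = [classify(p) for p in cloud]                              # step 2
    images = [relative_map(p, X, Y)[0] for p in cloud]                 # step 3
    return cloud, labels, images
```

Plotting `images` colored by `labels` on top of the existing map then shows where the classifier's decision changes in the neighborhood of the focus point.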


Sensitivity to initial configuration:

 

Initialization of the configuration:

Our strategy: run the mapping once from a PCA initialization and 20 times from random initializations, then keep the best run (the one with the lowest final stress).
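
A minimal sketch of this selection strategy is given below. `run_mds` is a hypothetical callable that takes an initial configuration and returns a pair (final configuration, final stress). `pca_config` is assumed to be a precomputed PCA projection of the data. The range of the random starting coordinates is an illustrative choice.

```python
import random


def best_of_runs(run_mds, pca_config, n_random=20, seed=0):
    """Run MDS from a PCA start and n_random random starts; keep the lowest stress."""
    rng = random.Random(seed)
    n_points, dim = len(pca_config), len(pca_config[0])
    # Draw random starts in the same coordinate range as the PCA configuration.
    scale = max(abs(v) for row in pca_config for v in row) or 1.0

    starts = [("PCA initialization", pca_config)]
    for trial in range(1, n_random + 1):
        config = [[rng.uniform(-scale, scale) for _ in range(dim)]
                  for _ in range(n_points)]
        starts.append((f"random initialization, trial {trial}", config))

    runs = []
    for label, start in starts:
        final_config, final_stress = run_mds(start)
        runs.append((final_stress, label, final_config))

    # Best run = smallest final stress.
    return min(runs, key=lambda run: run[0])
```

The returned label records which initialization produced the kept mapping, in the same form as the captions used for the maps.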

 

Three mappings of the 10 nearest neighbors of point p554

Sconv = 0.03904 (PCA initialization)

Sconv = 0.023181 (random initialization, trial 1)

Sconv = 0.023176 (random initialization, trial 2)


Features of MDS mapping for database visualization

 

 

Features of our MDS mapping software (prototype GUI)