Włodzisław Duch
With thanks to Rafał Adamczak, Karol Grudziński, Antoine Naud, Krzysztof Grąbczewski and Norbert Jankowski.
Some papers are in our on-line archive:
Use of various forms of knowledge in one system is still an open question.
Some methods developed in our group:
Architecture: Aggregation, Linguistic variable and Rule layers; one output per class.
Aggregation: used to combine and discover new useful features, no constraints.
L-units: provide intervals for fuzzy or crisp membership functions; each is made from 2 neurons, and only the biases are adaptive parameters here.
Constraint MLP cost function
First term: standard quadratic function (or any other)
Second term: weight decay & feature selection.
Third term: from complex to simple, hypercuboidal classification decision regions for crisp logic (for steep sigmoids).
Different regularizers and different error functions may be used.
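The three-term cost function described above can be written out. A sketch in the C-MLP2LN style, assuming the usual quadratic error and two regularization strengths λ₁, λ₂ (the slide's own equation is not reproduced here, so the exact symbols are illustrative):

```latex
E(W) = \frac{1}{2}\sum_{p}\bigl(Y^{(p)} - F(X^{(p)};W)\bigr)^{2}
     + \frac{\lambda_{1}}{2}\sum_{i,j} W_{ij}^{2}
     + \frac{\lambda_{2}}{2}\sum_{i,j} W_{ij}^{2}\,(W_{ij}-1)^{2}\,(W_{ij}+1)^{2}
```

The λ₁ term implements weight decay and feature selection; the λ₂ term vanishes only at weights 0 and ±1, so for steep sigmoids the surviving weights realize crisp logical conditions and hypercuboidal decision regions.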
Logical rules from MLP: simplify the network by enforcing weight decay and other constraints.
Strong and weak regularization make it possible to explore the simplicity-accuracy tradeoff.
Network generated for the Iris problem; the simplest (for strongest regularization) uses only x3.
Selection of the best prototypes - "supermen".
Rules possible with:
Regularization of models makes it possible to explore the simplicity-accuracy tradeoff.
Next step: exploring the confidence-rejection rate tradeoff.
Weighted combination of the "predictive power" of rules and the number of errors:
should be minimized without constraints, with an optional risk matrix.
This cost function makes it possible to reduce the number of errors to zero (large gamma) for models M that reject some instances.
Data from measurements/observations is not precise.
Finite resolution: Gaussian error distribution:
x → G_x = G(y; x, s_x), where G_x is a Gaussian (fuzzy) number.
Given a set of logical rules {R}, apply them to input data {G_x}.
Use Monte Carlo sampling to recover p(C_i | X; {R}) - this may be used with any classifier.
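The Monte Carlo step can be sketched as follows; `classify` stands for any crisp classifier (a rule set or otherwise), and all names here are illustrative, not the original code:

```python
import numpy as np

def mc_class_probabilities(classify, x, sx, n_classes, n_samples=10000, rng=None):
    """Estimate p(C_i | X) by Monte Carlo: sample inputs from Gaussians
    G(y; x, s_x) centred on the measured values and count how often the
    crisp classifier assigns each class."""
    rng = np.random.default_rng(rng)
    # Each row is one noisy replica of the measured input vector x.
    samples = rng.normal(loc=x, scale=sx, size=(n_samples, len(x)))
    labels = np.array([classify(s) for s in samples])
    return np.bincount(labels, minlength=n_classes) / n_samples

# Toy crisp rule standing in for a rule set {R}: class 1 iff x0 > 0.
# For a measurement exactly at the decision border both classes get ~0.5.
p = mc_class_probabilities(lambda s: int(s[0] > 0.0),
                           x=np.array([0.0]), sx=np.array([1.0]),
                           n_classes=2, rng=0)
```

Because only the classifier's crisp outputs are counted, the same procedure wraps around trees, rule sets or neural networks without modification.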
Analytical estimation of this probability is based on the cumulant function; the approximation is accurate to better than 2%. Probability that the rule is true:
Soft trapezoidal membership functions realized by L-units are obtained.
Fuzzy logic with such functions is equivalent to classical logic with Gaussian numbers as input, and to neural networks with logistic transfer functions.
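This equivalence can be checked numerically: the difference of two logistic sigmoids (an L-unit) tracks the probability that a crisp interval rule holds for a Gaussian input. A minimal sketch, assuming the standard logistic-to-normal-CDF matching constant β ≈ 1.702/s (the slide does not give the constant, so this choice is an assumption):

```python
import math

def soft_trapezoid(x, a, b, beta):
    """L-unit output: difference of two logistic sigmoids gives a soft
    trapezoidal membership function over the interval [a, b]."""
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    return sig(beta * (x - a)) - sig(beta * (x - b))

def gauss_window(x, a, b, s):
    """Probability that a crisp interval rule a < y < b is true for a
    Gaussian (fuzzy) number G(y; x, s): difference of two normal CDFs."""
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return Phi((x - a) / s) - Phi((x - b) / s)

# With beta = 1.702 / s each logistic approximates a normal CDF to ~1%,
# so the two membership functions agree to within about 2% everywhere.
s, a, b = 0.5, -1.0, 1.0
beta = 1.702 / s
err = max(abs(soft_trapezoid(x / 10, a, b, beta) - gauss_window(x / 10, a, b, s))
          for x in range(-40, 41))
```

The small discrepancy `err` is what the "approximation better than 2%" claim refers to: fuzzy rules with soft trapezoidal memberships reproduce crisp rules applied to Gaussian numbers.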
This is not a fuzzy approach!
Here small receptive fields are used; in the fuzzy approach, typically 2-3 large receptive fields define the linguistic variables.
Benefits:
The Mushroom Guide clearly states that there is no simple rule for determining the edibility of these mushrooms; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.
8124 cases, 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features.
51.8% represent edible, the rest non-edible mushrooms.
Example:
edible, convex, fibrous, yellow, bruises, anise, free, crowded, narrow, brown, tapering, bulbous, smooth, smooth, white, white, partial, white, one, pendant, purple, several, woods
poisonous, convex, smooth, white, bruises, pungent, free, close, narrow, white, enlarging, equal, smooth, smooth, white, white, partial, white, one, pendant, black, scattered, urban
Safe rule for edible mushrooms:
odor = (almond ∨ anise ∨ none) ∧ spore-print-color ≠ green | 48 errors, 99.41% correct |
This is why animals have such a good sense of smell!
Other odors: creosote, fishy, foul, musty, pungent or spicy.
Rules for poisonous mushrooms - 6 attributes only:
R1) odor = ¬(almond ∨ anise ∨ none) | 120 errors, 98.52% |
R2) spore-print-color = green | 48 errors, 99.41% correct |
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬brown | 8 errors, 99.90% |
R4) habitat = leaves ∧ cap-color = white | no errors! |
R1 + R2 are quite stable, found even with 10% of data;
R3 and R4 may be replaced by other rules:
R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered
Only 5 attributes used!
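As a sketch, the safe rule for edible mushrooms can be applied to the two example records listed above; the field names and the positions of odor and spore-print-color follow the UCI mushroom attribute order (an assumption - this is an illustration, not the original code):

```python
def edible_by_safe_rule(odor, spore_print_color):
    """Safe rule from the slide:
    odor in {almond, anise, none} AND spore-print-color != green -> edible."""
    return odor in {"almond", "anise", "none"} and spore_print_color != "green"

# The two example records, reduced to the attributes the rule uses.
examples = [
    ("edible",    {"odor": "anise",   "spore_print_color": "purple"}),
    ("poisonous", {"odor": "pungent", "spore_print_color": "black"}),
]
predictions = ["edible" if edible_by_safe_rule(**attrs) else "poisonous"
               for _, attrs in examples]
```

On the full data set this single conjunction already reaches 99.41% accuracy; the remaining rules only mop up the 48 exceptions.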
286 cases, 201 no recurrence cancer events (70.3%), 85 are recurrence (29.7%) events.
9 attributes, symbolic with 2 to 13 values.
Single rule:
with an else condition gives over 77% in crossvalidation; the best systems do not exceed 78%. All the knowledge contained in the data is:
if more than 2 nodes were involved and the tumor is highly malignant, there will be recurrence.
699 cases, 458 benign (65.5%), 241 (34.5%) malignant.
9 attributes, integers 1-10, one attribute missing in 16 cases.
The simplest rules, large regularization:
IF f2 ≥ 7 ∨ f7 ≥ 6 THEN malignant (95.6%)
Overall accuracy (including ELSE condition) is 94.9%.
f2 - uniformity of cell size; f7 - bland chromatin
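A minimal sketch of this single rule, with hypothetical feature values chosen just to exercise both branches:

```python
def malignant_simple_rule(f2, f7):
    """Simplest rule from the slide: IF f2 >= 7 OR f7 >= 6 THEN malignant,
    ELSE benign. f2 = uniformity of cell size, f7 = bland chromatin,
    both integer-valued 1-10 in the Wisconsin breast cancer data."""
    return f2 >= 7 or f7 >= 6

# Hypothetical (f2, f7) pairs, not taken from the data set.
labels = ["malignant" if malignant_simple_rule(f2, f7) else "benign"
          for f2, f7 in [(8, 1), (1, 9), (3, 2)]]
```

The ELSE branch is what the 94.9% overall accuracy figure below includes.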
Hierarchical sets of rules with increasing accuracy may be built.
More accurate set of rules:
R1: f2 < 6 ∧ f4 < 3 ∧ f8 < 8 | (99.8%) |
R2: f2 < 9 ∧ f5 < 4 ∧ f7 < 2 ∧ f8 < 5 | (100%) |
R3: f2 < 10 ∧ f4 < 4 ∧ f5 < 4 ∧ f7 < 3 | (100%) |
R4: f2 < 7 ∧ f4 < 9 ∧ f5 < 3 ∧ f7 ∈ [4,9] ∧ f8 < 4 | (100%) |
R5: f2 ∈ [3,4] ∧ f4 < 9 ∧ f5 < 10 ∧ f7 < 6 ∧ f8 < 8 | (99.8%) |
R1 and R5 misclassify the same single benign vector.
The ELSE condition makes 6 errors; overall reclassification accuracy is 99.00%.
In all cases features f3 and f6 (uniformity of cell shape and bare nuclei) are not important, f2 and f7 being the most important.
100% reliable set of rules rejects 51 cases (7.3%).
Results from 10-fold (stratified) crossvalidation - the accuracy of rules is hard to compare without a test set.
Method | % accuracy |
IncNet | 97.1 |
3-NN, Manhattan | 97.1 ± 0.1 |
Fisher LDA | 96.8 |
MLP+backpropagation | 96.7 |
LVQ (vector quantization) | 96.6 |
Bayes (pairwise dependent) | 96.6 |
FSM (density estimation) | 96.5 |
Naive Bayes | 96.4 |
Linear Discriminant Analysis | 96.0 |
RBF | 95.9 |
CART (decision tree) | 94.2 |
LFC, ASI, ASR (decision trees) | 94.4-95.6 |
Quadratic Discriminant Analysis | 34.5 |
Data from Machine Learning Database repository, UCI
3 classes: primary hypothyroid, compensated hypothyroid, normal;
# training vectors 3772 = 93+191+3488
# test vectors 3428 = 73+177+3178
21 attributes (medical tests), 6 continuous
Optimized rules: 4 errors on the training set (99.89%), 22 errors on the test set (99.36%)
primary hypothyroid: | TSH > 30.48 ∧ FTI < 64.27 | 97.06% |
primary hypothyroid: | TSH ∈ [6.02, 29.53] ∧ FTI < 64.27 ∧ T3 < 23.22 | 100% |
compensated: | TSH > 6.02 ∧ FTI ∈ [64.27, 186.71] ∧ TT4 ∈ [50, 150.5) ∧ On_Tyroxin = no ∧ surgery = no | 98.96% |
no hypothyroid: | ELSE | 100% |
4 continuous attributes and 2 binary attributes are used.
Method | Reference |
C-MLP2LN rules + ASA | our group |
CART | Weiss |
PVM | Weiss |
IncNet | our group |
MLP init + a,b opt. | our group |
C-MLP2LN rules | our group |
Cascade correlation | Schiffmann |
BP + local adapt. rates | Schiffmann |
BP + genetic opt. | Schiffmann |
Quickprop | Schiffmann |
RPROP | Schiffmann |
3-NN, Euclidean, 3 features used | our group |
1-NN, Euclidean, 3 features used | our group |
Best backpropagation | Schiffmann |
1-NN, Euclidean, 8 features used | our group |
Bayesian classif. | Weiss |
BP + conjugate gradient | Schiffmann |
1-NN Manhattan, std data | our group |
default: 250 test errors | |
1-NN Manhattan, raw data | our group |
Training set 43500, test set 14500, 9 attributes, 7 classes
Approximately 80% of the data belongs to class 1, only 6 vectors in class 6.
Rules from FSM after optimization: 15 rules, train 99.89%, test 99.81% accuracy.
32 rules obtained from SSV give 100% train, 99.99% test accuracy (1 error).
Method | % training | % test | Reference |
SSV, 32 rules | 100 | 99.99 | our group |
NewID decision tree | 100 | 99.99 | Statlog |
Baytree decision tree | 100 | 99.98 | Statlog |
CN2 decision tree | 100 | 99.97 | Statlog |
CART | 99.96 | 99.92 | Statlog |
C4.5 | 99.96 | 99.90 | Statlog |
FSM, 15 rules | 99.89 | 99.81 | our group |
MLP | 95.50 | 99.57 | Statlog |
k-NN | 99.61 | 99.56 | Statlog |
RBF | 98.40 | 98.60 | Statlog |
Logistic DA | 96.06 | 96.17 | Statlog |
LDA | 95.02 | 95.17 | Statlog |
Naive Bayes | 95.40 | 95.50 | Statlog |
Default | 78.41 | 79.16 |
More examples of logical rules discovered are on our rule-extraction WWW page
Ghostminer project (sponsored by Fujitsu): a data mining package, composed of:
Example of final product: analysis of psychometric data
MMPI test has 550 questions; any similar test may be computerized.
MMPI scales 1-4 are used for control; the next 10 coefficients are clinical scales: hypochondria, depression, hysteria, psychopathy, paranoia, schizophrenia etc.
Display the scales in a “psychogram”, interpreted by skilled psychologists diagnosing specific problems; show the rules that are true for this case. Rules are derived from data collected in the Academic Psychological Clinic of Nicolaus Copernicus University and in several psychiatric hospitals around Poland.
Two datasets were used, women and men, over 1600 cases each, 27 classes (normal, neurotic, drug addicts, schizophrenics, psychopaths, organic problems, malingerers, persons with criminal tendencies etc.).
2-3 rules per class found, a total of 50-100 rules.
Analyze how each rule fits the case; vary the uncertainty of the input measurements (the optimal uncertainty has been calculated by minimizing the generalization error).
Show probabilities of different diagnoses, graphing their dependence on the uncertainty of the inputs.
Show verbal interpretation of cases and rules.
If the probability of new classes grows quickly with the assumed uncertainty of the measurement, analyze probabilistic confidence levels.
Multidimensional scaling (MDS) makes it possible to see the case in relation to known cases.
Probabilities of different diagnoses may be interpolated to show change of the mental health over time.
Probabilistic confidence levels allow to see detailed changes.
Rules are very important here, allowing for detailed interpretation.
Rules generated using SSV classification tree and FSM neural network.
System | Data | # rules | Accuracy | Fuzzy accuracy |
C4.5 | Women | 55 | 93.0% | 93.7% |
C4.5 | Men | 61 | 92.5% | 93.1% |
FSM | Women | 69 | 95.4% | 97.6% |
FSM | Men | 98 | 95.9% | 96.9% |
10-fold crossvalidation gives 82-85% correct answers with FSM (crisp unoptimized rules), and 79-84% correct answers with C4.5. Fuzzification improves FSM crossvalidation results to 90-92%.
Some questions:
How good are our experts?
How to measure the correctness of such system?
Can we provide useful information if diagnosis is not reliable?
How to deal with several diseases - automatic creation of new classes?
In real world projects training and finding optimal networks is not our hardest problem ...
Discovering hierarchical structure in the data.
Dealing with unknown (unmeasured or lost) values.
Constructing new, more useful features.
Constructing theories that allow reasoning about data.
Constructing new classes and modifying existing ones.
Building complex systems interacting with humans.
References:
IEEE TNN paper (2000, did not make it to the data mining issue but perhaps the next?);
IJCNN'2000 Tutorial