Understanding the data

Optimization and interpretation of rule-based classifiers


Włodzisław Duch,
Norbert Jankowski,
Krzysztof Grąbczewski,
Rafał Adamczak

Computational Intelligence Laboratory,
Department of Informatics,
Nicolaus Copernicus University,

Grudziądzka 5, 87-100 Toruń, Poland.

WWW: https://int.umk.pl/kis/~duch

Plan

Rules and problems with understanding of data
Application and optimization of rule-based classifiers
Confidence intervals and probabilistic confidence intervals
Real-life example - psychometric data
Discussion

Rules and problems with understanding of data

Machine learning camp: neural networks are no good! Black boxes making decisions.
Knowledge in neural networks: opaque, hidden, incomprehensible.
Rules forever!

Are rules indeed the only way to understand the data?
What type of explanation is satisfactory? Interesting cognitive psychology problem.
Knowledge accessible to humans: symbols, similarity to prototypes, visualization.
Psychology: exemplar and prototype theories of categorization; rules only when the logic is simple.

IF the number of rules is relatively small and
IF the accuracy is sufficiently high,
THEN rules may be an optimal choice.

Crisp logical rules are most desirable but ...

only one class is predicted - black-and-white picture
reliable crisp rules may reject some cases as unclassified
discontinuous cost function, only non-gradient optimization

Fuzzy rules - continuous membership functions.

not so comprehensible as the crisp rules
discontinuous cost function, only non-gradient optimization
involve additional position/shape parameters, with a danger of overparameterization

Fixed set of membership functions with predetermined shapes - bad idea.
Curse of dimensionality: k linguistic variables per feature in d dimensions give k^d areas.
Context-dependent linguistic variables - adapt membership functions in each rule.
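
To make the k^d growth concrete, here is a minimal sketch. The values of k and d are illustrative only and do not come from the talk; the function name grid_cells is hypothetical.

```python
# Number of areas (grid cells) when each of d features is split into
# k fixed linguistic values, e.g. "low", "medium", "high" for k = 3.
def grid_cells(k: int, d: int) -> int:
    return k ** d

# Illustrative dimensions only.
for d in (2, 5, 10, 20):
    print(f"k=3, d={d}: {grid_cells(3, d)} areas")
```

Even with only three values per feature, a fixed grid over 20 features has far more areas than any realistic dataset has cases, which is one reason to adapt the intervals inside each rule instead.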

Interpretation of crisp rules may be misleading.
Crisp rules may be unstable against small perturbations of input values.
Statisticians: rule-based classifiers are unstable.

Probabilities estimated using fuzzy rules change smoothly.
How to find the best fuzziness/precision tradeoff?
How to understand what the best classifier is doing?

Application and optimization of rule-based classifiers

Methodology of rule extraction:

Select linguistic variables.
For continuous x use linguistic variables s_k(X_k, X'_k), true if x is in [X_k, X'_k].
Extract rules from data using neural, machine learning or statistical techniques;
explore the simplicity/accuracy rate tradeoff.
Optimize rules and linguistic variables (X_k,X'_k intervals) using the extracted rules;
explore the reliability/rejection rate tradeoff.
Explore the uncertainty of the input values.
Repeat the procedure until a stable set of rules is found.
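
A minimal sketch of how interval-based linguistic variables and crisp rules might be represented. The feature names, intervals and class labels are hypothetical, and the first-match strategy is an assumption rather than the method used in the talk.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]          # (X_k, X'_k)
Rule = Tuple[Dict[str, Interval], str]  # (conditions per feature, class label)

def s(x: float, interval: Interval) -> bool:
    """Linguistic variable: true if x lies in [X_k, X'_k]."""
    lo, hi = interval
    return lo <= x <= hi

def classify(case: Dict[str, float], rules: List[Rule]) -> Optional[str]:
    """Return the class of the first rule whose conditions all hold.

    None means the case is rejected as unclassified.
    """
    for conditions, label in rules:
        if all(s(case[f], iv) for f, iv in conditions.items()):
            return label
    return None

# Hypothetical rules and feature names, for illustration only.
rules: List[Rule] = [
    ({"f1": (0.0, 2.5), "f2": (1.0, 4.0)}, "A"),
    ({"f1": (3.0, 6.0)}, "B"),
]

print(classify({"f1": 1.0, "f2": 2.0}, rules))  # A
print(classify({"f1": 2.7, "f2": 2.0}, rules))  # None (falls between the rules)
```

The second case shows the rejection behaviour: widening or shifting the interval bounds trades reliability against the rejection rate.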

This approach leads to the following important improvements for any rule-based system:

Crisp logical rules are preserved giving maximal comprehensibility.
Instead of 0/1 decisions "probabilities" of classes p(C_i| X; M) are obtained.
Uncertainties of inputs s_i provide additional adaptive parameters.
Inexpensive gradient methods are used, allowing optimization of very large sets of rules.
Rules with wider classification margins are obtained, overcoming the brittleness problem.
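
One way to see how input uncertainties turn 0/1 decisions into smooth "probabilities" is sketched below. It assumes Gaussian measurement uncertainty and independent features; both are assumptions made for this illustration, and all names are hypothetical.

```python
import math
from typing import Dict, Tuple

Interval = Tuple[float, float]

def gauss_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_in_interval(x: float, sigma: float, interval: Interval) -> float:
    """Probability that the true value lies in [lo, hi] when the measured
    value x carries Gaussian uncertainty sigma > 0 (assumed model)."""
    lo, hi = interval
    return gauss_cdf((hi - x) / sigma) - gauss_cdf((lo - x) / sigma)

def p_rule(case: Dict[str, float],
           sigmas: Dict[str, float],
           conditions: Dict[str, Interval]) -> float:
    """Probability that a conjunctive rule holds, assuming the features
    are independent, so the per-feature probabilities multiply."""
    p = 1.0
    for feature, interval in conditions.items():
        p *= p_in_interval(case[feature], sigmas[feature], interval)
    return p

# A measurement sitting exactly on the upper bound of a hypothetical interval:
# the crisp rule says "true", the smoothed version says about one half.
print(round(p_in_interval(2.5, 0.1, (0.0, 2.5)), 3))  # 0.5
```

Because p_rule is a smooth function of the interval bounds and of the uncertainties, both can be tuned with gradient-based optimization while the rules themselves stay crisp.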

Confidence intervals

IF the probability of new classes grows quickly (here from 0 to 33%) with the assumed uncertainty of the measurement (here from 0 to 3%)

THEN analyze probabilistic confidence levels.

Probabilities of different diagnoses may be interpolated to show changes in mental health over time.
Probabilistic confidence levels make detailed changes visible.
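
The dependence of class probabilities on the assumed measurement uncertainty can be explored by sampling around the input, as in the sketch below. The one-feature rule, the 0-100 scale, the Gaussian noise model and all names are hypothetical choices for illustration, not the psychometric setup from the talk.

```python
import random
from collections import Counter
from typing import Callable, Dict

def crisp_rule(x: float) -> str:
    # Hypothetical one-feature rule set: class A below 50, class B otherwise.
    return "A" if x < 50.0 else "B"

def class_probabilities(x: float, sigma: float,
                        classify: Callable[[float], str],
                        n: int = 10000, seed: int = 0) -> Dict[str, float]:
    """Estimate class probabilities by applying crisp rules to n inputs
    sampled from a Gaussian centred on x with standard deviation sigma."""
    rng = random.Random(seed)
    counts = Counter(classify(rng.gauss(x, sigma)) for _ in range(n))
    return {label: count / n for label, count in counts.items()}

x = 49.0  # measured value on a hypothetical 0-100 scale, close to the boundary
for sigma in (0.0, 1.0, 2.0, 3.0):  # uncertainty as 0-3% of the scale
    print(sigma, class_probabilities(x, sigma, crisp_rule))
```

With zero uncertainty the crisp decision is recovered; as the assumed uncertainty grows, a case near a rule boundary gives an increasing share of its probability to the neighbouring class, which is the signal to look at probabilistic confidence levels.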

Real-life example - psychometric data

Discussion

There are many ways to understand the data: rules, prototypes, visualization.

Only reliable, accurate, stable and sufficiently simple rules are useful.
Unstable sets of rules contain little useful information and may be misleading.

Simplicity/accuracy rate tradeoff should be explored.
Optimization of sets of rules allows exploration of the reliability/rejection rate tradeoff.

Classification probabilities are important, rules are not sufficient.
The neighborhood of the unknown input should always be explored.
Probabilities of classification should be parametrized by uncertainties of inputs.
Probabilistic confidence intervals enable detailed interpretation of cases.
Exploratory data analysis (visualization) is always worth using.

These methods may be used with any classifier, so why not use the best one?

Włodzisław Duch