# Pattern Recognition

A pattern recognition system has two stages: feature extraction and classification.

### Feature Extraction

• Convert a raw pattern into a feature vector.
• Reduce redundancy in the pattern.
• e.g. convert image to line drawing.
### Classification
• Assign the feature vector a class
• Pattern space (or feature space) must be partitioned through training.
### Training and Testing
• The system is trained using a finite set of patterns - the training set.
• If the correct classification for these patterns is known then this is Supervised Learning, otherwise it is Unsupervised Learning.
• The system is evaluated using a different set of patterns - the test set.
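The split into training and test sets can be sketched as follows. The patterns and the split ratio here are made up for illustration; each pattern carries its class label, so this is the supervised setting:

```python
# Sketch: splitting labelled patterns into a training set and a test set.
# The feature vectors, labels, and 70/30 ratio are illustrative only.
import random

patterns = [([0.1, 0.2], "A"), ([0.9, 0.8], "B"), ([0.2, 0.1], "A"),
            ([0.8, 0.9], "B"), ([0.15, 0.25], "A"), ([0.85, 0.75], "B")]

random.seed(0)
random.shuffle(patterns)                # avoid any ordering bias
split = int(0.7 * len(patterns))        # e.g. 70% train, 30% test
training_set = patterns[:split]
test_set = patterns[split:]
```

The system is trained only on `training_set` and evaluated only on `test_set`, so the evaluation measures generalisation rather than memorisation.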
### Over-training and Under-training

It is easy to learn the training set perfectly: just memorise every pattern. Unfortunately this is of little use for classifying the test set - this is over-training. Conversely, if the system learns too little from the training set, it will also classify badly - this is under-training.

The system must not overgeneralise or undergeneralise about patterns.

## Bayesian Classification

• The probability of event A: Pr(A)
• The joint probability of A and B: Pr(A, B)
• The conditional probability of B given A: Pr(B | A) = Pr(A, B) / Pr(A)
For independent events:
Pr(B | A) = Pr(B)
i.e. Pr(A, B)=Pr(A)Pr(B)
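A quick numeric check of these definitions, using a made-up joint distribution over two events A and B:

```python
# Illustrative probabilities only; chosen so that A and B come out independent.
pr_A = 0.5
pr_B = 0.4
pr_AB = 0.2                      # Pr(A, B)

pr_B_given_A = pr_AB / pr_A      # Pr(B | A) = Pr(A, B) / Pr(A)
print(pr_B_given_A)              # equals Pr(B), so here A and B are independent
```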

### Bayes Rule

Pr(A, B) = Pr(B, A)
Pr(B | A) Pr(A) = Pr(A | B) Pr(B)
so
Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
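A worked check of Bayes' rule with made-up numbers, where Pr(A) is obtained from the total-probability expansion over B and not-B:

```python
# All probabilities here are illustrative only.
pr_B = 0.3
pr_A_given_B = 0.8
pr_A_given_notB = 0.1

# Total probability: Pr(A) = Pr(A | B) Pr(B) + Pr(A | not B) Pr(not B)
pr_A = pr_A_given_B * pr_B + pr_A_given_notB * (1 - pr_B)

# Bayes' rule: Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
pr_B_given_A = pr_A_given_B * pr_B / pr_A
print(round(pr_B_given_A, 4))
```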
For pattern classification:

Let x be the feature vector and wi denote class i.

Pr(wi) is the prior probability of the pattern being in class i.

p(x | wi) is the class-conditional probability density function for the feature vector.
By total probability, p(x) = Sumj( p(x | wj) Pr(wj) )

From Bayes Rule:

Pr(wi | x) = p(x | wi) Pr(wi) / p(x)

Substituting for p(x):

Pr(wi | x) = p(x | wi) Pr(wi) / Sumj( p(x | wj) Pr(wj) )

where:
Pr(wi | x) = posterior probability for class i.
The optimal decision is the i for which Pr(wi | x) is largest.
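This decision rule can be sketched as follows. The priors and class-conditional densities are assumed known here (Gaussian, with made-up parameters), which as the next paragraph notes is rarely true in practice:

```python
# Sketch of the Bayesian decision rule: pick the class wi with the
# largest posterior Pr(wi | x). Priors and densities are illustrative.
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, standing in for p(x | wi)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

priors = {"w1": 0.6, "w2": 0.4}                   # Pr(wi)
params = {"w1": (0.0, 1.0), "w2": (2.0, 1.0)}     # (mean, variance) per class

def classify(x):
    # p(x) is the same for every class, so it cancels in the argmax:
    # compare p(x | wi) Pr(wi) directly.
    return max(priors, key=lambda w: gaussian_pdf(x, *params[w]) * priors[w])

print(classify(0.2))   # near class w1's mean
print(classify(1.9))   # near class w2's mean
```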

Bayesian theory gives us the optimum decision boundary but in practice we don't have enough information to calculate this boundary.

## Nearest Neighbour Classification

A pattern is classified as belonging to the class of the training pattern that is closest to it. To measure closeness use a distance metric.

For a feature vector x = {x1, x2, x3,.....,xn}

and a training pattern t = {t1, t2, t3,.....,tn}

1. Euclidean distance:

   D² = Sumi( (xi − ti)² )
2. Dot product distance:

   D = Sumi( xi ti )
3. Angle between vectors:

   cos θ = Sumi( xi ti ) / (|x| |t|)
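The three closeness measures can be sketched in plain Python (vectors as lists of floats, no error checking). Note that for the dot product, a larger value means closer, unlike a true distance, and the angle is recovered from the normalised dot product:

```python
# Illustrative implementations of the three closeness measures.
import math

def euclidean_sq(x, t):
    """Squared Euclidean distance: D^2 = sum_i (x_i - t_i)^2."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, t))

def dot_product(x, t):
    """Dot product: larger means closer (a similarity, not a distance)."""
    return sum(xi * ti for xi, ti in zip(x, t))

def angle(x, t):
    """Angle between the vectors, via the normalised dot product."""
    cos_theta = dot_product(x, t) / (math.hypot(*x) * math.hypot(*t))
    return math.acos(max(-1.0, min(1.0, cos_theta)))   # clamp rounding error

x = [1.0, 0.0]
t = [0.0, 1.0]
print(euclidean_sq(x, t))   # 2.0
print(dot_product(x, t))    # 0.0
print(angle(x, t))          # pi / 2 for orthogonal vectors
```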

### Efficiency

• Training is trivial: just store all the patterns. This may require a lot of storage.
• Classification may be time-consuming, since the input must be compared with every stored pattern.

### Optimality

• Nearest neighbour is not optimal. Under simple assumptions it can be shown that its error probability is bounded by twice the optimal Bayesian error.
Nearest neighbour classification is prone to errors caused by rogue patterns. A rogue pattern is a mis-labelled training pattern.

## K Nearest Neighbour

To reduce the problem caused by rogue patterns, use not just the single nearest neighbour but a group of them.

Using K neighbours, take a majority vote of their classes to give a classification.
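A minimal sketch of this majority vote, using squared Euclidean distance as the metric and made-up training data:

```python
# K-nearest-neighbour classification by majority vote over the K closest
# training patterns. Training data is illustrative only.
from collections import Counter

training = [([0.1, 0.1], "A"), ([0.2, 0.0], "A"), ([0.0, 0.2], "A"),
            ([0.9, 0.9], "B"), ([1.0, 0.8], "B"), ([0.8, 1.0], "B")]

def knn_classify(x, k=3):
    dist = lambda t: sum((xi - ti) ** 2 for xi, ti in zip(x, t))
    nearest = sorted(training, key=lambda pair: dist(pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]          # majority class

print(knn_classify([0.15, 0.1]))   # "A"
print(knn_classify([0.85, 0.9]))   # "B"
```

With K = 1 this reduces to plain nearest neighbour; a single rogue pattern near the query can then flip the answer, which the vote over K > 1 neighbours suppresses.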

### Optimality

• As K gets larger this method approaches the optimal decision.

### Speeding up nearest neighbour methods

The biggest problem with this method is the time it takes to calculate the distances to all the training examples. Possible solutions are:
• Only store examples near the decision boundary.
• Use a precomputed search tree and branch and bound to search for the nearest neighbour.
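The second idea can be sketched as a small kd-tree searched with branch and bound, so whole subtrees are skipped once they cannot contain a closer training pattern. This is a minimal 2-D illustration with made-up points, not a production implementation:

```python
# Precomputed search tree (kd-tree) with branch-and-bound nearest
# neighbour search. 2-D points only; data is illustrative.
def build(points, depth=0):
    if not points:
        return None
    axis = depth % 2                                  # alternate split axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, x, best=None):
    """Return (squared distance, point) of the nearest stored pattern."""
    if node is None:
        return best
    point, axis, left, right = node
    d = sum((a - b) ** 2 for a, b in zip(x, point))
    if best is None or d < best[0]:
        best = (d, point)
    near, far = (left, right) if x[axis] < point[axis] else (right, left)
    best = nearest(near, x, best)
    # Bound: search the far side only if the splitting plane is closer
    # than the best distance found so far.
    if (x[axis] - point[axis]) ** 2 < best[0]:
        best = nearest(far, x, best)
    return best

tree = build([(0.1, 0.1), (0.9, 0.9), (0.2, 0.8), (0.8, 0.2)])
print(nearest(tree, (0.15, 0.12))[1])   # (0.1, 0.1)
```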