# Lecture 7

## The Multilayer Perceptron

The perceptron can only learn simple problems. It can place a hyperplane in pattern space and move the plane to reduce the error. Unfortunately, this is only useful if the problem is linearly separable.

A linearly separable problem is one in which the classes can be separated by a single hyperplane.

It is often the case that a problem is not linearly separable. To solve these we use a Multi-Layer Perceptron (MLP), in which one layer of units feeds into the next. Consider the following problem.

## The XOR problem

This is the simplest problem that cannot be solved by a perceptron.

For two inputs x1 and x2, the output is the exclusive OR of the inputs: 1 when exactly one input is 1, and 0 otherwise.

In pattern space the four input patterns sit at the corners of the unit square: (0,0) and (1,1) belong to one class, while (0,1) and (1,0) belong to the other.

No single line can separate the two classes; the solution uses two lines, with one class lying in the strip between them and the other outside.

A two-layer Multi-Layer Perceptron can solve this problem: one hidden unit implements each line, and the output unit combines them.

The shapes of the regions in pattern space that can be formed by Multi-Layer Perceptrons are summarised in the following table.

| Layers | Shape of decision region |
| --- | --- |
| One (single perceptron) | Half-plane bounded by a hyperplane |
| Two | Convex regions |
| Three | Arbitrary regions |

We can see that a three-layer MLP can learn arbitrary regions, while a two-layer MLP can only learn convex regions. (A region is convex if the straight line joining any two points in the region lies entirely inside it; if such a line can pass outside the region, the region is not convex.)

## Training the MLP

Last time we saw that the delta rule can be used to train a perceptron. When training an MLP, the error (delta) must be propagated back through the layers. This is called error back-propagation, or just backpropagation.

The following procedure can be used to train a backpropagation network.

```
t[i]              is the target for output unit i
units[l]          is the number of units in layer l
n[l][i]           is unit i in layer l
n[l][i].output    is the output of that unit
n[l][i].delta     is its delta
n[l][i].weight[j] is its weight j
weights           is the number of inputs to the network
ek                is the learning constant
```
```
adapt() {
    int i, j, k, l;

    /* Compute the deltas, working back from the output layer. */
    for (l = layers-1; l >= 0; l--)
        for (i = 0; i < units[l]; i++)
            if (l == layers-1)            /* output layer */
                n[l][i].delta =
                    ek * n[l][i].output *
                    (1.0 - n[l][i].output) *
                    (t[i] - n[l][i].output);
            else {                        /* hidden layer: sum the deltas
                                             fed back from the layer above */
                n[l][i].delta = 0.0;
                for (k = 0; k < units[l+1]; k++)
                    n[l][i].delta +=
                        n[l+1][k].delta *
                        n[l+1][k].weight[i];
                n[l][i].delta *=
                    ek * n[l][i].output *
                    (1.0 - n[l][i].output);
            }

    /* Update the weights: each unit in layer l has one weight
       per unit in layer l-1. */
    for (l = layers-1; l >= 1; l--)
        for (i = 0; i < units[l]; i++)
            for (j = 0; j < units[l-1]; j++)
                n[l][i].weight[j] +=
                    n[l-1][j].output * n[l][i].delta;

    /* The first layer takes its inputs from the input vector. */
    for (i = 0; i < units[0]; i++)
        for (j = 0; j < weights; j++)
            n[0][i].weight[j] +=
                input[j] * n[0][i].delta;
}
```

When this algorithm is applied to the XOR problem we get the following output.

```
iteration no 0, inputs 0 1, target 1, output 0.477995
iteration no 20, inputs 0 0, target 1, output 0.447816
iteration no 40, inputs 1 0, target 0, output 0.450292
iteration no 60, inputs 0 0, target 1, output 0.549096
iteration no 80, inputs 1 0, target 0, output 0.460706
iteration no 100, inputs 0 0, target 1, output 0.507636
iteration no 120, inputs 0 1, target 1, output 0.571619
iteration no 140, inputs 1 0, target 0, output 0.451493
iteration no 160, inputs 0 1, target 1, output 0.570574
iteration no 180, inputs 0 0, target 1, output 0.575979
iteration no 200, inputs 0 1, target 1, output 0.744079
iteration no 220, inputs 1 0, target 0, output 0.233541
iteration no 240, inputs 0 1, target 1, output 0.755600
iteration no 260, inputs 1 1, target 0, output 0.185273
iteration no 280, inputs 0 1, target 1, output 0.788309
iteration no 300, inputs 1 1, target 0, output 0.167068
iteration no 320, inputs 1 0, target 0, output 0.123461
iteration no 340, inputs 1 1, target 0, output 0.132892
iteration no 360, inputs 1 1, target 0, output 0.133583
iteration no 380, inputs 1 1, target 0, output 0.116641
iteration no 400, inputs 1 0, target 0, output 0.088269
iteration no 420, inputs 0 0, target 1, output 0.861810
iteration no 440, inputs 1 1, target 0, output 0.102406
iteration no 460, inputs 1 0, target 0, output 0.080179
iteration no 480, inputs 1 0, target 0, output 0.075584
iteration no 500, inputs 0 0, target 1, output 0.884442
iteration no 520, inputs 0 0, target 1, output 0.892789
iteration no 540, inputs 0 1, target 1, output 0.923969
iteration no 560, inputs 1 0, target 0, output 0.064146
iteration no 580, inputs 1 1, target 0, output 0.071938
iteration no 600, inputs 1 1, target 0, output 0.075764
iteration no 620, inputs 1 1, target 0, output 0.074536
iteration no 640, inputs 1 1, target 0, output 0.069014
iteration no 660, inputs 1 1, target 0, output 0.066534
iteration no 680, inputs 0 0, target 1, output 0.918422
iteration no 700, inputs 0 0, target 1, output 0.924860
iteration no 720, inputs 1 1, target 0, output 0.065864
iteration no 740, inputs 1 0, target 0, output 0.052634
iteration no 760, inputs 0 0, target 1, output 0.927081
iteration no 780, inputs 1 0, target 0, output 0.050964
iteration no 800, inputs 0 1, target 1, output 0.948869
iteration no 820, inputs 1 0, target 0, output 0.049082
iteration no 840, inputs 1 0, target 0, output 0.048074
iteration no 860, inputs 1 1, target 0, output 0.057916
iteration no 880, inputs 1 1, target 0, output 0.056088
iteration no 900, inputs 0 1, target 1, output 0.954659
iteration no 920, inputs 1 1, target 0, output 0.057337
iteration no 940, inputs 0 0, target 1, output 0.944243
iteration no 960, inputs 1 0, target 0, output 0.045653
iteration no 980, inputs 0 0, target 1, output 0.946199
```

## Example: NETtalk

The problem: convert English text into the vowel and consonant sounds (phonemes) that make up speech. English spelling is highly irregular; consider, for example, the words albany, Albania, misled and bedraggled.

One traditional approach is to:

1. Create if-then rules to encode all the regularities in the language.
2. Maintain a database of exceptions to these rules.
3. Build a production system to resolve conflicts.

For example, a "c" can be pronounced either like an "s", as in center, icy, and city, or like a "k", as in cat and crumb. Here is a rule that might encode this difference:

```
if a "c" appears before an "i", "e", or "y"
then pronounce it like an "s"
else pronounce it like a "k"

Exception: celtic
```

## Neural Network Approach

Allow a system to learn how to pronounce words by giving it lots of examples and correcting its mistakes.

1. Choose an architecture.
2. Choose a representation for the input and output.
3. Determine a training set. 16,000 words were chosen randomly from the 20,000 words in Webster's Dictionary.
4. Train the system. Present each word in the training set to the network several times, indicating the errors so the weights can be corrected.
5. Test the system. The network was able to respond correctly to 90% of the 4,000 remaining words from the dictionary.