# Artificial Intelligence - Introduction

## Can Machines Think?

This is a difficult question and to answer it we need to define intelligence and thinking.

Some aspects of intelligence:

• Planning
• Learning
• Reasoning
• Problem Solving
• Perception

In the 1950s, a test was proposed by Alan Turing to determine whether a machine is intelligent or not - the Turing Test.

### The Turing Test

The basis of the Turing Test is for a person to attempt to distinguish between a machine and a human being. For fairness, the machine and the human are isolated from the person carrying out the test, and messages are exchanged via a keyboard and screen. If the person cannot distinguish between the computer and the human being, then the computer must be intelligent. Each year there is a Turing Test contest called the Loebner Prize.

This test is often criticised because it only tests a limited aspect of intelligence.

Some people think that even if a machine could pass the Turing Test it may still not be intelligent.

### The Chinese Room Problem

A (non-Chinese-speaking) person is locked in a room and given the algorithm for a program that could pass the Turing Test in Chinese. He or she is asked questions in Chinese, applies the algorithm by hand, and feeds the answers back. The room will appear intelligent, but the person inside understands no Chinese, so is there any intelligence present?

This argument is criticised because it may well be possible for the complete system (i.e., the room and the person inside) to be intelligent without the person being intelligent.

Some people say that passing the Turing Test is sufficient to prove intelligence, but it is not necessary. In other words, a machine may fail the Turing Test but still be intelligent.

There are plenty of examples of computer systems that perform tasks that would require intelligence if they were performed by a human being.

One possible classification of AI tasks is into three classes: mundane problems, formal problems and expert problems.

Mundane problems

• Perception
• Vision
• Speech
• Natural language understanding, generation and translation
• Common-sense reasoning
• Simple reasoning and logical symbol manipulation
• Robot control

Formal problems

• Games
• Chess (Deep Blue recently beat Garry Kasparov)
• Backgammon
• Draughts
• Go

To solve these problems we must explore a large number of possible solutions quickly and choose the best one.

• Mathematics
• Geometry and logic
• Integral calculus
• Proving properties of programs, e.g. correctness

The Logic Theorist proved mathematical theorems; it actually proved several theorems from classical mathematics textbooks. Programs such as Mathematica and Mathcad can perform complicated symbolic integration and differentiation. These systems manipulate symbols and reduce the problem (usually recursively) until the answer is obvious, that is, it can be looked up in a table.

Expert problems

• Engineering
• Design
• Fault finding
• Manufacturing
• Planning
• Scientific analysis
• Medical diagnosis
• Financial analysis

Expert problems are often tackled with rule-based systems:

if (conditions) then action
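A rule-firing loop of this kind can be sketched in a few lines (the rules themselves are invented for illustration; real expert systems are far richer):

```python
# Minimal forward-chaining rule-based system (illustrative rules only).
rules = [
    ({"fever", "cough"}, "suspect flu"),
    ({"suspect flu", "high risk"}, "recommend antiviral"),
]

def forward_chain(facts, rules):
    """Repeatedly fire rules whose conditions are all satisfied."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, action in rules:
            if conditions <= facts and action not in facts:
                facts.add(action)   # the "action" here asserts a new fact
                changed = True
    return facts

print(forward_chain({"fever", "cough", "high risk"}, rules))
```

Note that the second rule can only fire after the first has added its conclusion, which is why the loop runs until no rule changes anything.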

## AI's Underlying Assumption

The Physical Symbol System Hypothesis: "A physical symbol system has the necessary and sufficient means for general intelligent action."

In other words:

Computers (Turing Machines) have the power for general intelligent action.

## Example of AI problem solving.

Problem: A farmer has a hungry fox, a fat goose and a bag of grain. The farmer needs to cross a river, but the boat can only carry two things: the farmer and at most one other item.

Constraints: the fox and the goose cannot be left together unattended, and the goose and the grain cannot be left together unattended.

How to cross the river?

The problem is hard to solve in an English-language representation.

Try visual/graphical representation:

To solve this problem we need only follow the tree from its root node to any leaf node.
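This search can be sketched in code: represent which bank each thing is on as 0 or 1 and explore the tree of safe states breadth-first from the start state to the goal (the state encoding here is one possible choice, not the only one):

```python
from collections import deque

# State: (farmer, fox, goose, grain), each 0 (start bank) or 1 (far bank).
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)

def safe(state):
    farmer, fox, goose, grain = state
    if fox == goose != farmer:      # fox left alone with goose
        return False
    if goose == grain != farmer:    # goose left alone with grain
        return False
    return True

def successors(state):
    farmer = state[0]
    for item in range(4):           # 0 = farmer crosses alone, 1-3 = take that item
        if state[item] != farmer:
            continue                # can only carry what is on the same bank
        nxt = list(state)
        nxt[0] = 1 - farmer
        nxt[item] = 1 - farmer
        if safe(tuple(nxt)):
            yield tuple(nxt)

def bfs(start, goal):
    """Breadth-first search; returns the shortest path of states."""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])

solution = bfs(START, GOAL)
print(len(solution) - 1, "crossings")  # 7 crossings
```

The familiar answer falls out: take the goose over first, and shuttle it back once so the fox and grain can cross safely.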

# Introduction

AI is one of the newest disciplines, formally initiated in 1956 when the name was coined. However, the study of intelligence is one of the oldest, going back some 2000 years. The advent of computers made it possible for the first time to test the models people proposed for learning, reasoning, perceiving, and so on.

## What is AI?

Figure 1.1

Definitions may be organised into four categories.

1. Systems that think like humans.
2. Systems that act like humans.
3. Systems that think rationally.
4. Systems that act rationally.

## Acting humanly

The first proposal for success in building a program that acts humanly was the Turing Test. To be considered intelligent, a program must be able to act sufficiently like a human to fool an interrogator. A human interrogates the program and another human via a terminal simultaneously. If, after a reasonable period, the interrogator cannot tell which is which, the program passes.

To pass this test requires:

• natural language processing
• knowledge representation
• automated reasoning
• machine learning

This test avoids physical contact and concentrates on "higher-level" mental faculties. A total Turing Test would require the program also to do:

• computer vision
• robotics

## Thinking Humanly

This requires "getting inside" the human mind to see how it works, and then comparing our computer programs to this. This is what cognitive science attempts to do. Another approach is to observe a human solving problems and argue that one's programs go about problem solving in a similar way.

Example: GPS (General Problem Solver) was an early computer program that attempted to model human thinking. The developers were not so much interested in whether or not GPS solved problems correctly. They were more interested in showing that it solved problems like people, going through the same steps and taking around the same amount of time to perform those steps.

## Thinking Rationally

Aristotle was one of the first to attempt to codify "thinking". His syllogisms provided patterns of argument structure that always give correct conclusions, given correct premises.

Example: All computers use energy. Using energy always generates heat. Therefore, all computers generate heat.

This initiated the field of logic. Formal logic was developed in the late nineteenth century, and this was the first step toward enabling computer programs to reason logically.

By 1965, programs existed that could, given enough time and memory, take a description of the problem in logical notation and find the solution, if one existed. The logicist tradition in AI hopes to build on such programs to create intelligence.

There are two main obstacles to this approach: First, it is difficult to make informal knowledge precise enough to use the logicist approach particularly when there is uncertainty in the knowledge. Second, there is a big difference between being able to solve a problem in principle and doing so in practice.

## Acting Rationally: The rational agent approach

Acting rationally means acting so as to achieve one's goals, given one's beliefs. An agent is just something that perceives and acts.

In the logical approach to AI, the emphasis is on correct inferences. This is often part of being a rational agent because one way to act rationally is to reason logically and then act on ones conclusions. But this is not all of rationality because agents often find themselves in situations where there is no provably correct thing to do, yet they must do something.

There are also ways to act rationally that do not seem to involve inference, e.g., reflex actions.

The study of AI as rational agent design has two advantages:

1. It is more general than the logical approach because correct inference is only a useful mechanism for achieving rationality, not a necessary one.
2. It is more amenable to scientific development than approaches based on human behaviour or human thought because a standard of rationality can be defined independent of humans.

Achieving perfect rationality in complex environments is not possible because the computational demands are too high. However, we will study perfect rationality as a starting place.

## Foundations of AI

Mathematics

Philosophers staked out most of the important ideas of AI, but to move to a formal science requires a level of mathematical formalism in three main areas: computation, logic and probability.

Mathematicians have proved that there exists an algorithm to prove any true statement in first-order logic. However, if one adds the principle of induction required to capture the semantics of the natural numbers, then this is no longer the case. Specifically, Gödel's incompleteness theorem showed that in any language expressive enough to describe the properties of the natural numbers, there are true statements that are undecidable: their truth cannot be established by any algorithm.

Analogously, Turing showed that there are some functions that no Turing machine can compute.

Although undecidability and noncomputability are important in the understanding of computation, the notion of intractability has had much greater impact on computer science and AI. A class of problems is called intractable if the time required to solve instances of the class grows at least exponentially with the size of the instances.

Polynomial-time problems are regarded as tractable and exponential-time ones as intractable; in between lies the class of nondeterministic polynomial (NP) problems.

Even moderately sized instances of intractable problem classes cannot be solved in reasonable amounts of time. Therefore, one should strive to divide the overall problem of generating intelligent behaviour into tractable subproblems rather than intractable ones.

Another important concept from mathematics is problem reduction. A reduction is a general transformation from one class of problems to another such that the solutions to the first class can be found by reducing them to problems in the second class and then solving those.

One notion for recognizing intractable problems is that of NP-completeness. NP-complete problems can be solved in nondeterministic polynomial time, and every problem in NP can be reduced to them. Any problem class to which an NP-complete problem can be reduced is likely to be intractable.

Probability is the principal mathematical tool that we have to represent and reason about uncertainty. Bayes proposed a rule for updating subjective probabilities in the light of new evidence. This rule forms the basis of the modern approach to uncertain reasoning in AI.
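Bayes' rule can be illustrated with a small numerical sketch (the probabilities below are invented for the example):

```python
def bayes(prior, likelihood, likelihood_given_not):
    """P(H|E) = P(E|H) P(H) / P(E), with P(E) expanded over H and not-H."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Invented example: a hypothesis with a 1% prior, and evidence that occurs
# with probability 0.90 if the hypothesis is true and 0.05 if it is false.
posterior = bayes(prior=0.01, likelihood=0.90, likelihood_given_not=0.05)
print(round(posterior, 3))  # → 0.154
```

Even strong evidence raises a rare hypothesis only modestly - exactly the kind of subjective-probability update the rule formalizes.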

Decision theory combines probability theory with utility theory (which provides a framework for specifying the preferences of an agent) to give a general theory that can distinguish good actions from bad ones.

Psychology

The principal characteristic of cognitive psychology is that the brain possesses and processes information. The claim is that beliefs, goals, and reasoning steps can be useful components of a theory of human behaviour. The knowledge-based agent has three key steps:

1. Stimulus is translated into an internal representation
2. The representation is manipulated by cognitive processes to derive new internal representations
3. These are translated into actions
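The three steps above can be sketched as a cycle (a schematic only, with trivial stand-ins for each stage):

```python
class KnowledgeBasedAgent:
    """Schematic of the three-step cycle: represent, reason, act."""
    def __init__(self):
        self.beliefs = set()

    def perceive(self, stimulus):
        # 1. Stimulus is translated into an internal representation.
        self.beliefs.add(("saw", stimulus))

    def think(self):
        # 2. Cognitive processes derive new internal representations.
        if ("saw", "obstacle") in self.beliefs:
            self.beliefs.add(("goal", "avoid"))

    def act(self):
        # 3. Internal representations are translated into actions.
        return "turn" if ("goal", "avoid") in self.beliefs else "forward"

agent = KnowledgeBasedAgent()
agent.perceive("obstacle")
agent.think()
print(agent.act())  # → turn
```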

Linguistics

Having a theory of how humans successfully process natural language is an AI-complete problem: if we could solve this problem, then we would have created a model of intelligence.

Much of the early work in knowledge representation was done in support of programs that attempted natural language understanding.

## The History of AI

In the early days of AI there was great optimism that intelligent computers were just a few decades off. However, the problem proved to be far more difficult than anticipated. Today most researchers in AI are smart enough not to make predictions. Also, many are not really concerned with creating intelligence; rather, they are concerned with creating computer programs more intelligent than those that currently exist.

The microworlds approach to AI was pioneered in the 1960s and tried to solve problems in limited domains.

The ANALOGY program could solve geometric analogy problems of the kind that appear in IQ tests.

The most famous microworld is the Blocks World (1970s). A command such as "Pick up the red block" could be used to manipulate the world.

Figure 1.3

The microworld approach turned out to have problems because the advances made in writing programs for microworlds turned out not to be generalisable.

Early work in the Logicist camp also had problems because of the use of weak methods (these use weak information about a domain). However, knowledge intensive approaches have been more successful. A key development from the Logicist tradition was knowledge-based systems in the 1980's.

In the late 1980's Neural Networks became fashionable again (they had been popular in the 60's) due to improved learning algorithms and faster processors.

AI Time Line

• 1943 - McCulloch and Pitts propose modelling neurons using on/off devices.
• 1950s - Claude Shannon and Alan Turing try to write chess-playing programs.
• 1956 - John McCarthy coins the name "Artificial Intelligence".
• 1960s - Logic Theorist, GPS, microworlds, neural networks.
• 1971 - NP-completeness theory (Cook and Karp) casts doubt on the general applicability of AI methods.
• 1970s - Knowledge-based systems and expert systems.
• 1980s - AI techniques in widespread use; neural networks rediscovered.
• 1990s - Deep Blue wins against the world chess champion. Image and speech recognition become practical.

# Intelligent Agents

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.

Our aim is to design agents.

A rational agent is one that performs the actions that cause the agent to be most successful.

We use the term performance measure for the criteria that determine how successful an agent is. We will insist on an objective performance measure imposed by some authority.

Example Consider the case of an agent that is supposed to vacuum a dirty floor. A plausible performance measure would be the amount of dirt cleaned in a certain period of time. A more sophisticated measure would also take into account the amount of electricity consumed and the amount of noise generated.

We need to be careful to distinguish between rationality and omniscience. If an agent is omniscient, it knows the actual outcomes of its actions. Rationality is concerned with expected success given what has been perceived. In other words, we cannot blame an agent for not taking into account something it could not perceive or for failing to take an action that it is not capable of taking.

What is rational at any given time depends on four things:

• The performance measure that defines degree of success
• Everything that the agent has perceived so far (the percept sequence)
• What the agent knows about the environment
• The actions the agent can perform

## The ideal rational agent:

For each possible percept sequence, an ideal rational agent should do whatever action is expected to maximize its performance measure, on the basis of the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Ideal mapping from percept sequences to actions

For an ideal agent, we can simply make a table of the action it should take in response to each possible percept sequence. (For most agents, this would be an infinite table.) This table is called a mapping from the percept sequences to actions.

Specifying which action an agent ought to take in response to any given percept sequence provides a design for an ideal agent.

It is, of course, possible to specify the mapping for an ideal agent without creating a table for every possible percept sequence.

Example: The sqrt agent

The percept sequence for this agent is a sequence of keystrokes representing a number, and an action is to display a number on a screen. The ideal mapping, when a percept is a positive number x, is to display a positive number z such that z^2 = x. This specification does not require the designer to actually construct a table of square roots.

Algorithms exist that make it possible to encode the ideal sqrt agent very compactly. It turns out that the same is true for much more general agents.
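For example, Newton's method encodes the whole (infinite) sqrt table in a few lines (a sketch; the iteration count is an arbitrary choice that is more than enough for ordinary inputs):

```python
def sqrt_agent(percept):
    """Ideal mapping for the sqrt agent: display z such that z^2 = x.
    Encoded compactly via Newton's method instead of a lookup table."""
    x = float(percept)          # percept: keystrokes representing a number
    z = max(x, 1.0)             # initial guess
    for _ in range(50):
        z = 0.5 * (z + x / z)   # Newton update for z^2 - x = 0
    return z                    # action: the number to display

print(sqrt_agent("2"))  # → 1.4142...
```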

One more requirement for agents: Autonomy

If an agent's actions are based completely on built-in knowledge, such that it need pay no attention to its percepts, then we say that the agent lacks autonomy.

An agent's behaviour can depend both on its built-in knowledge and on its experience. A system is autonomous if its behaviour is determined by its own experience.

It seems likely that the most successful agents will have some built-in knowledge and will also have the ability to learn.

## Structure of Intelligent Agents

Now we start talking about the insides of agents.

The job of AI is to design the agent program: a function that implements the agent mapping from percepts to actions. We assume this program will run on some sort of computing device called the architecture.

The architecture makes the percepts from the sensors available to the agent program, runs the program, and feeds the program's action choices to the effectors as they are generated.

• agent = architecture + program

Before we design an agent program, we must have a pretty good idea of the possible percepts and actions, what goals or performance measure the agent is supposed to achieve, and what sort of environment it will operate in.

Example A robot designed to inspect parts as they go by on a conveyor belt can make use of a number of simplifying assumptions: that the lighting will always be the same, that the only things on the conveyor belt will be parts of a certain kind, and that there are only two actions: accept and reject.

In contrast, some software agents (softbots) exist in rich unlimited domains, e.g., a robot designed to fly a 747 flight simulator or one designed to scan on-line news sources and show the interesting items to its customers. For the latter to do well, it must be able to process natural language, learn the interests of its customers and must be able to dynamically change its plans as news sources become available or unavailable.

Agent Programs

All agent programs have roughly the same skeleton; they accept percepts from the environment and generate actions.

Each agent uses some internal data structure that is updated as new percepts arrive. These data structures are operated on by the agent's decision-making procedures to generate an action choice, which is then passed to the architecture for execution. Good data structures are often very important in AI.

Agents will receive only single percepts as input. It is up to the agent to build up the percept sequence in memory if it so desires. In some environments it is possible to be quite successful without storing the percept sequence, and in complex domains it is infeasible to store the complete sequence.
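The skeleton described above might look like this (a schematic; the particular program passed in is an invented stand-in for a real decision procedure):

```python
class Agent:
    """Common skeleton: accept one percept, update internal data, choose an action."""
    def __init__(self, program):
        self.memory = []          # internal data structure (here: the percept sequence)
        self.program = program    # the agent's decision-making procedure

    def step(self, percept):
        self.memory.append(percept)          # build up the percept sequence
        return self.program(self.memory)     # pass an action choice back out

# A trivial program: echo the most recent percept as the action.
echo = Agent(program=lambda memory: ("echo", memory[-1]))
print(echo.step("ping"))  # → ('echo', 'ping')
```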

Why not do table lookup?

An agent that can play chess would need a table with about 35^100 entries.

Furthermore, such an agent has no autonomy at all because the calculation of best actions is entirely built-in. If the environment changed in some way, the agent would be entirely lost.

Learning in the context of such a large table is hopeless.

Example: The taxi driver agent

The full taxi driver task is extremely open-ended - there is no limit to the novel situations that can arise.

The taxi driver will need to know where it is, what else is on the road and how fast it is going. This information can be obtained from the percepts provided by one or more controllable cameras, a speedometer and odometer. To control the vehicle properly it should have an accelerometer. It will need to know the state of the vehicle, so it will need the usual arrays of engine and electrical sensors. It might also have instruments like a GPS to give exact position with respect to an electronic map. It might also have infrared or sonar sensors. Finally, it will need some way for the customer to communicate destination.

The actions will include control over the engine through the accelerator pedal and control over steering and braking, some way of talking to passengers, and perhaps some way to communicate with other vehicles.

Performance measures?

Getting to the correct destination, minimizing fuel consumption and wear and tear, minimizing trip time and cost, minimizing traffic violations and disturbances of other drivers, maximizing safety and passenger comfort. Some of these goals conflict and so there will be trade-offs.

Operating environment?

City streets? highways? snow and other road hazards? driving on right or left?

The more controlled the environment, the easier the problem.

We will now consider four types of agent programs:

• Simple reflex agents
• Agents that keep track of the world
• Goal-based agents
• Utility-based agents

Simple reflex agents

Constructing a lookup table is out of the question. The visual input from a simple camera comes in at the rate of 50 megabytes per second, so the lookup table for an hour would have 2^(60×60×50M) entries. However, we can summarize certain portions of the table by noting commonly occurring input/output associations. For example, if the car in front brakes, then the driver should also brake.

In other words some processing is done on visual input to establish the condition, "brake lights in front are on" and this triggers some established connection to the action "start braking". Such a connection is called a condition-action rule written as

If condition then action
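Such condition-action rules can be sketched directly in code (the conditions here are invented stand-ins for the real vision processing):

```python
# Simple reflex agent driven by condition-action rules (illustrative conditions).
RULES = [
    (lambda p: p.get("brake_lights_in_front"), "start braking"),
    (lambda p: p.get("clear_road"), "keep driving"),
]

def simple_reflex_agent(percept):
    """Match the current percept against the rules; fire the first that applies."""
    for condition, action in RULES:
        if condition(percept):
            return action
    return "do nothing"

print(simple_reflex_agent({"brake_lights_in_front": True}))  # → start braking
```

Note that the agent consults only the current percept - exactly the limitation the next section addresses.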

Agents that keep track of the world

Simple reflex agents only work if the correct action can be chosen based only on the current percept.

Even for the simple braking rule above, we need some sort of internal description of the world state. (To determine if the car in front is braking, we would probably need to compare the current image with the previous to see if the brake light has come on.)

Another example

From time to time the driver looks in the rear-view mirror to check on the locations of nearby vehicles. When the driver is not looking in the mirror, vehicles in the next lane are invisible. However, deciding on a lane change requires that the driver know where the vehicles in the next lane are.

This problem illustrates that for any complex domain, sensors do not provide access to the complete world state. In such domains, the agent must maintain an internal state that it updates as new sensor information becomes available.

Updating the state requires the agent to have two kinds of information: first, information about how the world changes over time; second, information about how its own actions affect the world.
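A sketch of such an update function, using both kinds of information (the mirror-checking world model and its fields are invented for the example):

```python
def update_state(state, last_action, percept):
    """Fold the latest percept and the known effect of our last action into the state."""
    state = dict(state)
    # How our own actions affect the world: after changing lanes, the old
    # "next lane" information no longer applies.
    if last_action == "change lane":
        state["next_lane_cars"] = None
    # How the world changes over time, as revealed by new percepts:
    if percept.get("mirror_checked"):
        state["next_lane_cars"] = percept["cars_in_mirror"]
    return state

state = {"next_lane_cars": None}
state = update_state(state, None, {"mirror_checked": True, "cars_in_mirror": 2})
print(state["next_lane_cars"])  # → 2
```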

Goal-based agents

Knowing about the state of the world is not always enough for the agent to know what to do next. For example, at an intersection, the taxi driver can either turn left, right, or go straight. Which turn it should make depends on where it is trying to get to: its goal.

Goal information describes states that are desirable and that the agent should try to achieve.

The agent can combine goal information with information about what its actions achieve in order to plan sequences of actions that achieve those goals.

Search and Planning are the subfields of AI devoted to finding action sequences that achieve goals.

Decision making of this kind is fundamentally different from condition-action rules, in that it involves consideration of the future. In the reflex agent design this information is not used because the designer has precomputed the correct action for the various cases. A goal-based agent could reason that if the car in front has its brake lights on, it will slow down. From the way in which the world evolves, the only action that will achieve the goal of not hitting the braking car is to slow down. To do so requires hitting the brakes. The goal-based agent is more flexible but takes longer to decide what to do.

Utility-based agents

Goals alone are not enough to generate high-quality behaviour. For example, there are many action sequences that will get the taxi to its destination, but some are quicker, safer, more reliable, cheaper, etc.

Goals just provide a crude distinction between "happy" and "unhappy" states whereas a more general performance measure should allow a comparison of different world states. "Happiness" of an agent is called utility.

Utility can be represented as a function that maps states into real numbers. The larger the number the higher the utility of the state.

A complete specification of the utility function allows rational decisions in two kinds of cases where goals have trouble. First, when there are conflicting goals, only some of which can be achieved (e.g., speed vs. safety), the utility function specifies the appropriate trade-off. Second, when there are several goals that the agent can aim for, none of which can be achieved with certainty, utility provides a way in which the likelihood of success can be weighed up against the importance of the goals.
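The first kind of trade-off can be sketched with a utility function over states (the routes, numbers and weights below are all invented for the example):

```python
# A utility function over route outcomes, trading off safety against speed.
def utility(state):
    return 2.0 * state["safety"] - state["minutes"] / 10.0

routes = {
    "motorway":  {"minutes": 30, "safety": 0.8},
    "back road": {"minutes": 45, "safety": 0.9},
}

# The agent picks the state with the highest utility, resolving the conflict
# between the speed and safety goals numerically.
best = max(routes, key=lambda r: utility(routes[r]))
print(best)  # → motorway
```

With different weights the same machinery would prefer the safer, slower route - the trade-off lives entirely in the utility function.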

## Environments

Now we look at how agents couple with the environment.

Properties of environments

• Accessible vs. inaccessible

If an agent's sensory apparatus gives it access to the complete state of the environment, we say that the environment is accessible. When the environment is accessible, less internal state is necessary.

• Deterministic vs. nondeterministic

If the next state is completely determined by the current state and the actions selected by the agent, then the environment is deterministic. An agent need not worry about uncertainty in an accessible, deterministic environment. If the environment is inaccessible, it may appear to be nondeterministic, so it is often best to think of an environment as deterministic or nondeterministic from the point of view of the agent.

• Episodic vs. nonepisodic

In an episodic environment, the agent's experience is divided into "episodes". The quality of an agent's actions during an episode depends only on that episode, because subsequent episodes do not depend on previous ones.

• Static vs. dynamic

If the environment can change while an agent is deliberating, we say that the environment is dynamic for that agent. If the environment does not change with the passage of time but the agent's performance score does, we say the environment is semidynamic.

• Discrete vs. continuous

If there are a limited number of distinct, clearly defined percepts and actions, we say that the environment is discrete; otherwise it is continuous.

Environment Programs

The basic relationship between agents and environments can be expressed as an environment simulator. The simulator takes one or more agents as input and arranges to repeatedly give each agent the right percepts and receive back an action. It then updates the environment based on the actions and possibly other dynamic processes in the environment that are not considered to be agents.
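A minimal sketch of such a simulator, in the spirit of Run-Eval-Environment (all names and the toy counting environment are illustrative, not the textbook's actual code):

```python
class CounterAgent:
    """A trivial agent that always asks to increment the state."""
    def step(self, percept):
        return "increment"

def run_eval_environment(state, update_fn, agents, percept_fn, performance_fn, steps):
    """Give each agent its percept, collect the actions, update the environment,
    and finally return the performance measure for the run."""
    for _ in range(steps):
        actions = [a.step(percept_fn(state, a)) for a in agents]
        state = update_fn(state, actions)   # environment dynamics + agent actions
    return performance_fn(state)

# Toy environment: the state is a number; "increment" actions add to it.
score = run_eval_environment(
    state=0,
    update_fn=lambda s, acts: s + acts.count("increment"),
    agents=[CounterAgent()],
    percept_fn=lambda s, a: s,        # here each agent perceives the whole state
    performance_fn=lambda s: s,
    steps=5,
)
print(score)  # → 5
```

Note how the simulator, not the agent, owns the environment state - the separation discussed below.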

The environment is therefore defined by an initial state and an update function. Obviously, an agent that works in a simulated environment ought also to work in the real thing.

Run-Eval-Environment returns the performance measure for a single environment defined by a single initial state and a particular update function. Usually, an agent is designed to work in an environment class, a whole set of different environments. So, in order to measure the performance of an agent, we need an environment generator that selects particular environments in which to run the agent. We are then interested in average performance.

A possible confusion arises between the state variable in the environment simulator and the state variable in the agent itself. These must be kept strictly separate!


# Solving Problems by Searching

Simple reflex agents are limited because they cannot plan ahead. They also have no knowledge of what their actions do or of what they are trying to achieve.

Now we investigate one type of goal-based agent called a problem-solving agent. This type of agent decides what to do by finding sequences of actions that lead to desirable states.

How can an agent formulate an appropriate view of the problem it faces?

Problem-Solving Agents

Intelligent agents are supposed to act in such a way that the environment goes through a sequence of states that maximizes the performance measure. Unfortunately, this specification is difficult to translate into a successful agent design. The task is simplified if the agent can adopt a goal and aim to satisfy it.

Example Suppose the agent is in Auckland and wishes to get to Wellington. There are a number of factors to consider, e.g. the cost, speed and comfort of the journey.

Goals such as this help to organize behaviour by limiting the objectives that the agent is trying to achieve. Goal formulation, based on the current situation, is the first step in problem solving. In addition to formulating a goal, the agent may wish to decide on some other factors that affect the desirability of different ways of achieving the goal.

We will consider a goal to be a set of states - just those states in which the goal is satisfied. Actions can be viewed as causing transitions between states.

How can the agent decide on what types of actions to consider?

Problem formulation is the process of deciding what actions and states to consider. For now let us assume that the agent will consider actions at the level of driving from one city to another.

The states will then correspond to being in particular towns along the way.

The agent has now adopted the goal of getting to Wellington, so unless it is already there, it must transform the current state into the desired one. Suppose that there are three roads leaving Auckland but that none of them lead directly to Wellington. What should the agent do? If it does not know the geography it can do no better than to pick one of the roads at random.

However, suppose the agent has a map of the area. The purpose of a map is to provide the agent with information about the states it might get itself into and the actions it can take. The agent can use the map to consider subsequent steps of a hypothetical journey that will eventually reach its goal.

In general, an agent with several intermediate options of unknown value can decide what to do by first examining different possible sequences of actions that lead to states of known value and then choosing the best one.

This process is called search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out. This is called the execution phase. Hence, we have a simple "formulate, search, execute" design for the agent.

Formulating Problems

Formulating problems is an art. First, we look at the different amounts of knowledge that an agent can have concerning its actions and the state that it is in. This depends on how the agent is connected to its environment.

There are four essentially different types of problems.

• single state
• multiple state
• contingency
• exploration

Knowledge and problem types

Let us consider the vacuum world - we need to clean the world using a vacuum cleaner. For the moment we will simplify it even further and suppose that the world has just two locations. In this case there are eight possible world states. There are three possible actions: left, right, and suck. The goal is to clean up all the dirt, i.e., the goal is equivalent to the set of states {7,8}.

First, suppose that the agent's sensors give it enough information to tell exactly which state it is in (i.e., the world is completely accessible) and suppose it knows exactly what each of its actions does.

Then it can calculate exactly what state it will be in after any sequence of actions. For example, if the initial state is 5, then it can calculate the result of the action sequence {right, suck}.

This simplest case is called a single-state problem.

Now suppose the agent knows all of the effects of its actions, but world access is limited. For example, in the extreme case, the agent has no sensors so it knows only that its initial state is one of the set {1,2,3,4,5,6,7,8}. In this simple world, the agent can succeed even though it has no sensors. It knows that the action {right} will cause it to be in one of the states {2,4,6,8}. In fact, it can discover that the sequence {right, suck, left, suck} is guaranteed to reach a goal state no matter what the initial state is.
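This belief-state reasoning can be checked in a few lines. Rather than the numbers 1-8, the sketch below represents a state as (location, dirt-on-left, dirt-on-right), which is an equivalent encoding of the same eight states:

```python
from itertools import product

# Vacuum world state: (location, dirt_left, dirt_right); location is "L" or "R".
def result(state, action):
    loc, dl, dr = state
    if action == "left":
        return ("L", dl, dr)
    if action == "right":
        return ("R", dl, dr)
    if action == "suck":
        return (loc, False, dr) if loc == "L" else (loc, dl, False)

def belief_update(belief, action):
    """Sensorless agent: map every state it might be in through the action."""
    return {result(s, action) for s in belief}

# Start knowing nothing: the agent could be in any of the eight states.
belief = set(product("LR", [True, False], [True, False]))
for action in ["right", "suck", "left", "suck"]:
    belief = belief_update(belief, action)

print(belief)  # → {('L', False, False)}: clean everywhere, whatever the start
```

After the four actions the belief set has collapsed to a single, all-clean state, confirming that the sequence is guaranteed to reach the goal from any initial state.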

In this case, when the world is not fully accessible, the agent must reason about sets of states that it might get to, rather than single states. This is called the multiple-state problem.
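The belief-state reasoning above is easy to sketch. The encoding below is illustrative (my own, not the text's state numbering): a state is a (location, dirt-left, dirt-right) triple, and a belief state is the set of states the agent might be in.

```python
# Two-cell vacuum world: a state is (location, dirt_left, dirt_right),
# where location is 'L' or 'R'.
def step(state, action):
    loc, dl, dr = state
    if action == 'left':
        return ('L', dl, dr)
    if action == 'right':
        return ('R', dl, dr)
    # 'suck' removes the dirt at the current location
    return (loc, dl and loc != 'L', dr and loc != 'R')

def update(belief, action):
    """A belief state is the set of states the agent might be in."""
    return {step(s, action) for s in belief}

# With no sensors, the initial belief state is all eight world states.
belief = {(loc, dl, dr) for loc in 'LR' for dl in (True, False) for dr in (True, False)}
for action in ['right', 'suck', 'left', 'suck']:
    belief = update(belief, action)

print(belief)   # the single state ('L', False, False): everything is clean
```

Running the sequence {right, suck, left, suck} collapses the eight-state belief state to a single goal state, which is exactly the guarantee described above.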

The case of ignorance about the effects of actions can be treated similarly.

Example Suppose the suck action sometimes deposits dirt when there is none. Then if the agent is in state 4, sucking will place it in one of {2,4}. There is still a sequence of actions that is guaranteed to reach the goal state.

However, sometimes ignorance prevents the agent from finding a guaranteed solution sequence. Suppose that the agent has the nondeterministic suck action as above and that it has a position sensor and a local dirt sensor. Suppose the agent is in one of the states {1,3}. The agent might formulate the action sequence {suck, right, suck}. The first action would change the state to one of {5,7}; moving right would then change the state to {6,8}. If the agent is, in fact, in state 6, the plan will succeed, otherwise it will fail. It turns out there is no fixed action sequence that guarantees a solution to this problem.

The agent can solve the problem if it can perform sensing actions during execution. For example, starting from one of {1,3}: first suck dirt, then move right, then suck only if there is dirt there. In this case the agent must calculate a whole tree of actions rather than a single sequence, i.e., plans now have conditionals in them that are based on the results of sensing actions.

For this reason, we call this a contingency problem. Many problems in the real world are contingency problems. This is why most people keep their eyes open when walking and driving around.

Single-state and multiple-state problems can be handled by similar search techniques. Contingency problems require more complex algorithms. They also lend themselves to an agent design in which planning and execution are interleaved.

We will only consider cases where guaranteed solutions consist of a single action sequence.

The last type of problem is the exploration problem. In this type of problem, the agent has no information about the effects of its actions. The agent must experiment, gradually discovering what its actions do and what sorts of states exist. This is search in the real world rather than in a model.

Well-defined problems and solutions

We have seen that the basic elements of a problem definition are the states and actions. To capture these ideas more precisely, we need the following:

• The initial state that the agent knows itself to be in.
• The set of possible actions available to the agent. The term operator is used to denote the description of an action in terms of which state will be reached by carrying out the action in a particular state.
Together these define the state space of the problem: the set of all states reachable from the initial state by any sequence of actions.

A path in the state space is simply any sequence of actions leading from one state to another.

• The goal test, which the agent can apply to a single state to determine if it is a goal state. Sometimes there is an explicit set of goal states and the test is to see if a state is in this set. Sometimes a goal is specified by an abstract property, e.g., "checkmate".
It may also be the case that one solution is preferable to another, even though they both reach the goal. To capture this idea, we use the notion of path cost.
• A path cost function assigns a cost to a path. We will take the cost of a path to be the sum of the costs of its individual steps. The path cost function is denoted g.
Together the initial state, operator set, goal test, and path cost function define a problem.

To deal with multiple-state problems, a problem consists of an initial state set; a set of operators specifying for each action the set of states reached from any given state; and a goal test and path cost function. The state space is replaced by the state set space.
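These four components can be sketched as a small data structure. The field names below are illustrative, not from the text:

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple

@dataclass
class Problem:
    initial_state: Any
    # successors(state) yields (operator name, resulting state, step cost) triples
    successors: Callable[[Any], Iterable[Tuple[str, Any, float]]]
    goal_test: Callable[[Any], bool]

    def path_cost(self, steps):
        """g: the cost of a path is the sum of the costs of its steps."""
        return sum(cost for _op, _state, cost in steps)

# A toy instance: walk along the integers from 0 to 3, one step at a time.
walk = Problem(
    initial_state=0,
    successors=lambda n: [('inc', n + 1, 1)],
    goal_test=lambda n: n == 3,
)
print(walk.path_cost([('inc', 1, 1), ('inc', 2, 1), ('inc', 3, 1)]))   # 3
```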

Measuring problem-solving performance
The effectiveness of a search can be measured in at least three ways:

• Does it find a solution?
• Is it a good solution (low cost)?
• What is the time and memory required to find a solution (search cost)?
The total cost of a search is the sum of the path cost and search cost.
What makes problem solving an art is deciding what goes into the description of states and operators and what should be left out. The process of removing detail is called abstraction. Without it, intelligent agents would be swamped.

## Example Problems

We will talk both about toy problems and real-world problems. Toy problems are used to illustrate and exercise various techniques. Real-world problems are usually much harder and we usually care about the solutions.

Toy problems
The 8-puzzle problem

One important trick is to notice that rather than use operators such as "move the 3 tile into the blank space," it is more sensible to have operators such as "the blank space changes places with the tile to its left." This is because there are fewer of the latter kind of operator. This leads to the following formulation:

• states: a state description specifies the location of each of the eight tiles in one of the nine squares. For efficiency it is also useful to include a location for the blank.
• operators: blank moves left, right, up or down.
• goal test: as in figure
• path cost: length of path
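The operator set above can be sketched as a successor function (an illustrative encoding: the board is a 9-tuple in row-major order, with 0 marking the blank):

```python
def successors(state):
    """state: tuple of 9 entries in row-major order; 0 marks the blank."""
    i = state.index(0)
    r, c = divmod(i, 3)
    moves = {'left': (r, c - 1), 'right': (r, c + 1),
             'up': (r - 1, c), 'down': (r + 1, c)}
    for op, (nr, nc) in moves.items():
        if 0 <= nr < 3 and 0 <= nc < 3:
            s = list(state)
            j = nr * 3 + nc
            s[i], s[j] = s[j], s[i]      # the blank changes places with a tile
            yield op, tuple(s)

# Blank in the centre: all four operators apply.
print(len(list(successors((1, 2, 3, 4, 0, 5, 6, 7, 8)))))   # 4
```

Note that there are at most four operators regardless of where the blank is, which is exactly why this formulation beats "move tile X" operators.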
The 8-queens problem
The goal of this problem is to place 8 queens on the board so that none can attack the others. The following is not a solution!

There are two main kinds of formulations. The incremental formulation involves placing queens one-by-one on the board. The complete-state formulation starts with all 8 queens on the board and moves them around until a solution is found.
•  goal test: 8 queens on board, none attacked
•  path cost: irrelevant
•  states: any arrangement of 0-8 queens on the board
•  operators: add or remove a queen to/from any square
In this formulation, we have 64^8 possible sequences to investigate. A more sensible choice would use the fact that placing a queen on a square that is already under attack cannot work, because subsequent placements will not undo the attack. So, try the following instead:
• states: arrangements of 0-8 queens with none attacked.
• operators: place a queen in the leftmost empty column such that it is not attacked by any other queen.
It is easy to see that the actions given can generate only states with no attacks; but sometimes no actions will be possible. For example, after making the first seven choices (left to right) in the above figure, there is no action available. However, in this formulation there are only 2057 possible sequences to investigate.
The right formulation makes a big difference to the size of the search space.
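The second formulation is small enough to enumerate completely. The sketch below (an illustrative implementation, not from the text) places one queen per column, only on unattacked squares, and finds all 92 solutions almost instantly:

```python
def place(rows):
    """rows[i] is the row of the queen in column i; columns are filled left
    to right, and a queen goes only on a square attacked by no earlier queen."""
    col = len(rows)
    if col == 8:
        yield tuple(rows)
        return
    for row in range(8):
        if all(row != r and abs(row - r) != col - c for c, r in enumerate(rows)):
            yield from place(rows + [row])

solutions = list(place([]))
print(len(solutions))   # 92 distinct solutions
```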

### Vacuum World

State Space with sensors

State space with no sensors (self loops omitted) - initial state is centre-top

## Real-world problems

Route finding, e.g., find a route from Auckland to Wellington.

The Travelling Salesperson Problem is a famous touring problem in which each city in a network of connected cities must be visited exactly once. The goal is to find the shortest tour. This problem is NP-hard.

VLSI layout - a VLSI chip can have millions of gates, and the positioning of and connections to every gate are crucial to the successful operation of the chip. CAD tools are used in every phase of the process. Two difficult subtasks are cell layout and channel routing. The goal in cell layout is to place every cell on the chip with no overlap and enough room in between to place connecting wires. There are additional constraints such as placing particular cells near each other. Channel routing is the process of placing the connecting wires.

Assembly sequencing - Robot assembly of complex objects.

## Searching for Solutions

The idea behind most search techniques is to maintain and extend a set of partial solution paths.

Generating action sequences
To solve the route planning example problem, the first step is to test if the current state is the goal state. If not, we must consider some other states. This is done by applying the operators to the current state, thereby generating a new set of states. This process is called expanding the state. Whenever there are multiple possibilities, a choice must be made about which one to consider further.

This is the essence of search - choosing one option and putting the others aside for later, in case the first choice does not lead to a solution. We continue choosing, testing, and expanding until a solution is found or until there are no more states that can be expanded. The choice of which state to expand first is determined by the search strategy.

It is helpful to think of the process as building a search tree that is superimposed over the state space. The root of the tree is a search node corresponding to the initial state. The leaves of the tree are nodes that do not have successors either because they are goal states or because no operators can be applied to them.

The general search algorithm is described as follows:

Data structures for search trees
We will assume a search tree node is a five component structure with the following components:

1. The state in the state space to which the node corresponds
2. The node in the search tree from which this node was generated (the parent node)
3. The operator that was applied to generate the node
4. The number of nodes in the path from the root to this node (depth)
5. The path cost of the path to this node
Remember that a node is a bookkeeping data structure used to represent a search tree for a particular problem instance. This is what makes a node different from a state.
Nodes have depths and parents, whereas states do not.

The EXPAND function is responsible for calculating each of the five components above for the nodes that it generates.
We also need to represent the collection of nodes that are waiting to be expanded - this collection is called the fringe or frontier.
For efficiency, the collection of nodes on the fringe will be implemented as a queue using the following operations:

•  MAKE-QUEUE(elements) creates a queue with the given elements
•  EMPTY?(queue) returns true if there are no elements in the queue
•  REMOVE-FRONT(queue) removes the element at the front of the queue and returns it
•  QUEUING-FN(elements,queue) inserts a set of elements into the queue.
Different varieties of queuing functions produce different search algorithms.

Now we can write a more formal version of the general search algorithm.
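As a sketch of that formal version (the function names follow the text; the Python encoding is my own - nodes here carry just the state and the operators applied so far):

```python
from collections import deque
from types import SimpleNamespace

def general_search(problem, queuing_fn):
    """problem must supply initial_state, successors(state) -> (operator, state)
    pairs, and goal_test(state); queuing_fn decides where new nodes go."""
    fringe = deque([(problem.initial_state, [])])   # (state, operators so far)
    while fringe:
        state, path = fringe.popleft()
        if problem.goal_test(state):
            return path
        children = [(s, path + [op]) for op, s in problem.successors(state)]
        queuing_fn(fringe, children)
    return None

def enqueue_at_end(queue, nodes):      # FIFO fringe: breadth-first search
    queue.extend(nodes)

def enqueue_at_front(queue, nodes):    # LIFO fringe: depth-first search
    queue.extendleft(reversed(nodes))

# A toy problem: reach 4 from 0 by adding 1 or 2.
toy = SimpleNamespace(
    initial_state=0,
    successors=lambda n: [('add1', n + 1), ('add2', n + 2)] if n < 4 else [],
    goal_test=lambda n: n == 4)
print(general_search(toy, enqueue_at_end))   # ['add2', 'add2'] - a shortest path
```

Swapping the queuing function is all it takes to change the search strategy, which is the point of the next section.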

## Search Strategies

We will evaluate search strategies in terms of four criteria:
• Completeness: is the strategy guaranteed to find a solution when there is one?
• Time complexity: how long does it take to find a solution?
• Space Complexity: how much memory is required to perform the search?
• Optimality: does the search strategy find the highest quality solution when there are multiple solutions?
Uninformed (or blind) search strategies have no information about the number of steps or the path cost from the current state to the goal.
In a route finding problem, given several choices of cities to go to next, uninformed search strategies have no way to prefer any particular choice.
Informed (or heuristic) search strategies use problem-specific considerations to prefer certain choices. For example, in the route finding problem with a map, if a choice is in the direction of the goal city, prefer it.

Even though uninformed search is less effective than heuristic search, uninformed search is still important because there are many problems for which information used to make informed choices is not available.

We now present six uninformed search strategies:

Breadth-first search

In this strategy, the root node is expanded first, then all of the nodes generated by the root are expanded before any of their successors. These successors are in turn all expanded before any of their successors.

In general, all the nodes at depth d in the search tree are expanded before any nodes at d+1. This kind of search can be implemented by using the queuing function ENQUEUE-AT-END.

Breadth-first search is complete, and optimal provided the path cost is a nondecreasing function of the depth of the node. Let's look at the amount of time and memory it requires. Consider a hypothetical search tree with b successors for every node. Such a tree is said to have a branching factor of b. If the first solution is at depth d, then the maximum number of nodes expanded before reaching a solution is 1 + b + b^2 + b^3 + ... + b^d.

This is not good. For example consider the following table for b=10, 1 node/ms, 100 bytes/node:

Note that memory requirements are a bigger problem here than execution time.
In general, exponential complexity search problems cannot be solved for any but the smallest instances.
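The growth is easy to check directly. A quick sketch using the same assumptions (b = 10, 1 node expanded per millisecond, 100 bytes per node):

```python
b = 10
for d in (2, 6, 10):
    nodes = sum(b**i for i in range(d + 1))   # 1 + b + ... + b^d
    seconds = nodes / 1000                    # at 1 node per millisecond
    gigabytes = nodes * 100 / 1e9             # at 100 bytes per node
    print(f"depth {d:2}: {nodes:>15,} nodes  {seconds:>12,.0f} s  {gigabytes:>10,.1f} GB")
```

Already at depth 10 the search needs over eleven billion nodes and about a terabyte of memory, which illustrates why memory runs out before time does.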

Uniform cost search (Dijkstra's Algorithm)
This strategy modifies breadth-first search to account for general path cost. Now the lowest cost node on the fringe is always the first one expanded (recall that the cost of a node is the cost of the path to that node).

When certain conditions are met, the first solution found is guaranteed to be the cheapest.

Consider an example

Uniform cost search finds the cheapest solution provided the cost of a path never decreases as it gets longer, i.e., g(SUCCESSOR(n)) >= g(n) for every node n.
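A sketch of uniform cost search with a priority queue ordered on g (an illustrative implementation; the tiny graph below is made up):

```python
import heapq
from itertools import count

def uniform_cost_search(start, successors, goal_test):
    """successors(state) -> iterable of (next_state, step_cost); costs >= 0."""
    tie = count()                      # tiebreaker so states are never compared
    frontier = [(0, next(tie), start, [start])]
    explored = set()
    while frontier:
        g, _, state, path = heapq.heappop(frontier)
        if goal_test(state):           # the cheapest path to the goal pops first
            return g, path
        if state in explored:
            continue
        explored.add(state)
        for nxt, cost in successors(state):
            if nxt not in explored:
                heapq.heappush(frontier, (g + cost, next(tie), nxt, path + [nxt]))
    return None

graph = {'A': [('B', 1), ('C', 5)], 'B': [('C', 1)], 'C': []}
print(uniform_cost_search('A', graph.get, lambda s: s == 'C'))   # (2, ['A', 'B', 'C'])
```

Note that the direct edge A-C of cost 5 loses to the two-step path of cost 2: the lowest-g node is always expanded first.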

Depth-first search

This search strategy always expands one node to the deepest level of the tree. Only when a dead-end is encountered does the search back up and expand nodes at shallower levels.

This strategy can be implemented by calling GENERAL-SEARCH with a queuing function that always puts the newest node on the front (ENQUEUE-AT-FRONT).
Depth-first search has modest memory requirements, storing only a single path from the root to the current node along with the remaining unexpanded sibling nodes for each node on the path. For a state space with branching factor b and maximum depth m, depth-first search requires storage of only b·m nodes, in contrast to b^d for breadth-first search. Using the same assumptions as our analysis of breadth-first search, depth-first search would require 12 kilobytes instead of 111 terabytes at depth 12, a factor of 10 billion times less space.

For problems that have a lot of solutions, depth-first search is often faster than breadth-first search because it has a good chance of finding a solution after exploring only a small portion of the search space.

However, depth-first search can get stuck on a wrong path in infinite search spaces.

It is common to implement depth-first search with a recursive function that calls itself on each of its children in turn. In this case, the queue is stored implicitly in the local state of each invocation on the calling stack.
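The recursive formulation can be sketched as follows (an illustrative version that also refuses to revisit states already on the current path, so it does not loop forever on this small graph):

```python
def depth_first_search(state, successors, goal_test, path=None):
    """The fringe is implicit in the calling stack; each call tries its children in turn."""
    path = path or [state]
    if goal_test(state):
        return path
    for nxt in successors(state):
        if nxt not in path:            # avoid cycling along the current path
            found = depth_first_search(nxt, successors, goal_test, path + [nxt])
            if found:
                return found
    return None

graph = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['D'], 'D': []}
print(depth_first_search('A', graph.get, lambda s: s == 'D'))   # ['A', 'B', 'D']
```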

Depth-limited search
This strategy avoids the pitfalls of depth-first search by imposing a cut-off on the maximum depth of a path. Depth-first search is used to search to the given depth.
Depth-limited search is complete but not optimal, and if we choose a depth limit that is too shallow it is not even complete.

Iterative deepening search
The hard part about depth-limited search is picking the right depth limit. Most of the time we will not know what to pick for a depth limit. Iterative deepening search is a strategy that side-steps the issue of choosing the depth limit by trying all possible limits in increasing order.

In effect, this search strategy keeps the best features of both breadth-first and depth-first search. It has the modest memory requirements of depth-first search but is optimal and complete like breadth-first search. Also it does not get stuck in dead ends.
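Iterative deepening can be sketched as a loop around depth-limited search (illustrative helper names, not from the text):

```python
def depth_limited(state, successors, goal_test, limit):
    """Depth-first search cut off at the given depth limit."""
    if goal_test(state):
        return [state]
    if limit == 0:
        return None
    for nxt in successors(state):
        found = depth_limited(nxt, successors, goal_test, limit - 1)
        if found:
            return [state] + found
    return None

def iterative_deepening(start, successors, goal_test, max_depth=50):
    for limit in range(max_depth + 1):      # try limits 0, 1, 2, ...
        found = depth_limited(start, successors, goal_test, limit)
        if found:
            return found
    return None

# A binary "tree" of integers: node n has children 2n and 2n+1.
children = lambda n: [2 * n, 2 * n + 1] if n < 8 else []
print(iterative_deepening(1, children, lambda n: n == 5))   # [1, 2, 5]
```

Because the limit grows one level at a time, the first solution found is a shallowest one, just as with breadth-first search.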

Iterative deepening may seem wasteful because many states are expanded multiple times. For most problems, however, the overhead of this multiple expansion is actually rather small because a majority of the nodes are at the bottom of the tree.

Compared with the
1 + b + b^2 + ... + b^(d-2) + b^(d-1) + b^d
nodes expanded by a single depth-limited search to depth d, with iterative deepening we have:
(d+1)·1 + d·b + (d-1)·b^2 + ... + 3·b^(d-2) + 2·b^(d-1) + 1·b^d

For b=10 and d=5, the numbers are 111,111 and 123,456. So iterative deepening is 11% more costly than depth-limited in this case.
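These counts are easy to verify directly for b = 10 and d = 5:

```python
b, d = 10, 5
depth_limited_nodes = sum(b**i for i in range(d + 1))             # 1 + b + ... + b^d
iterative_nodes = sum((d + 1 - i) * b**i for i in range(d + 1))   # (d+1)·1 + d·b + ... + 1·b^d
print(depth_limited_nodes, iterative_nodes)                       # 111111 123456
print(f"{100 * (iterative_nodes - depth_limited_nodes) / depth_limited_nodes:.0f}% overhead")
```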
In terms of complexity, iterative deepening has the same asymptotic time complexity as breadth-first search, but its space complexity is only O(b·d).
In general, iterative deepening is the preferred search method when there is a large search space and the depth of the solution is unknown.

Bidirectional Search
The idea here is to search forward from the initial state and backward from the goal state simultaneously. When it works well, the search meets in the middle. For problems where the branching factor is b in both directions, bidirectional search can make a big difference.

The solution will be found in O(b^(d/2)) steps because both searches have to go only half way. For example, if b=10 and d=6, breadth-first search generates 1,111,111 nodes, whereas bidirectional search generates only 2,222. Even though this sounds great in theory, there are several issues that must be addressed before this algorithm can be implemented.

• What does it mean to search backwards from the goal? Let the predecessor of a node n be all those nodes that have n as a successor. Searching backwards means successively generating predecessors starting with the goal node.
• Calculating predecessors can be difficult.
• What can be done if there are many possible goal states? If there is an explicit list of goal states, then we can apply the predecessor function for these just as the successor function is applied in the multiple-state case. If we only have an implicit description of the set of goal states, then what?
• There must also be an efficient way to check each new node to see if it appears in the search tree of the other direction of search.
• One needs to decide what kind of search is going to be used in each half.
The complexity figure assumes that testing for intersection of the two frontiers can be done in constant time. Also, in order for the two searches to meet, the frontier of one of them must be kept entirely in memory, so the space complexity is O(b^(d/2)).

## Avoiding Repeated States

We have so far ignored one of the most important complications to the search process: the possibility of wasting time by expanding states that have already been encountered. For many problems, this is unavoidable. Any problem where operators are reversible has this problem. For example in route planning problems, the search spaces are infinite, but if we prune some of the repeated states, we can cut the tree down to a finite size.

Even when the tree is finite, avoiding repeated states can yield an exponential speedup. The classic example is a state space that contains only m+1 states but whose search tree contains every possible path - 2^m branches.

There are three ways to deal with repeated states, in increasing order of effectiveness and complexity:

• Do not return to the state you just came from.
• Do not create paths with cycles in them.
• Do not generate any state that was ever generated before.
To implement this last option, search algorithms often make use of a hash table that stores all the nodes that are generated. This makes checking for repeated nodes fairly efficient. Whether or not to do this depends on how "loopy" the state space is.
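The hash-table approach can be sketched as a breadth-first search over a graph (Python sets are hash tables; the loopy graph below is made up):

```python
from collections import deque

def breadth_first_graph_search(start, successors, goal_test):
    """Never generates a state that was ever generated before."""
    seen = {start}                     # hash table of every state generated so far
    fringe = deque([[start]])
    while fringe:
        path = fringe.popleft()
        if goal_test(path[-1]):
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:        # skip repeated states
                seen.add(nxt)
                fringe.append(path + [nxt])
    return None

# A "loopy" space: without the seen-check, the search would revisit A and B forever.
loopy = {'A': ['B', 'A'], 'B': ['A', 'C'], 'C': ['B']}
print(breadth_first_graph_search('A', loopy.get, lambda s: s == 'C'))   # ['A', 'B', 'C']
```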

## Constraint satisfaction search

A constraint satisfaction problem (CSP) is a special kind of search problem in which states are defined by the values of a set of variables and the goal test specifies a set of constraints that the values must obey.

Example The 8-queens problem can be posed as a CSP in which the variables are the locations of the eight queens; the possible values are squares on the board; and the constraints state that no two queens can be in the same row, column, or diagonal.

• variables: Qi = (xi, yi), with 1 <= xi <= 8 and 1 <= yi <= 8
• goal test: for all i != j: xi != xj, yi != yj, |xi - xj| != |yi - yj|
A solution to a CSP specifies values for all of the variables such that the constraints are met.

Cryptoarithmetic and VLSI layout can be posed as CSPs.

Constraints can be unary, binary, or higher-order:

• Unary: concern the value of a single variable
• Binary: relate pairs of variables (in 8-Queens all constraints are binary)
• Higher-order: involve triples, 4-tuples, ... (e.g., in cryptoarithmetic, the constraints relating all the variables in a column)
Constraints can also be either absolute or preference.

Each variable vi in a CSP has a domain Di which is the set of possible values that the variable can take on. The domain can be either discrete or continuous.

A unary constraint specifies the allowable subset of a variable's domain and a binary constraint specifies the allowable subset of the cross-product of the two domains. In discrete CSPs with finite domains, constraints can be represented simply by enumerating the allowable combinations of values.

Example in 8-queens, the no attack constraint can be represented by a set of pairs of allowable values: {<1,3>,<1,4>,<1,5>,...,<2,4>,<2,5>,...}.

Constraints involving continuous variables cannot be enumerated in this way. Here we consider only CSPs with discrete, absolute, binary or unary constraints.

The initial state will be the state in which all variables are unassigned. Operators will assign a value to a variable from the set of possible values. The goal test will check if all variables are assigned and all constraints satisfied.

We can use depth-limited search because all solutions are of depth n, where n is the number of variables.

In the most naive implementation, any unassigned variable in a state can be assigned a value by an operator, so the branching factor is the sum of the cardinalities of the different domains, e.g., 64 in the 8-queens problem. The search space would then be 64^8.
But note that the order of assignments makes no difference to the final solution. So we can choose a value for a single variable at each node. This results in a search space of size 8^8 in 8-queens.

A straightforward depth-limited search will examine all of these possible states. CSP contain some well known NP-complete problems and so in the worst case any algorithm to solve CSPs will be exponential.

In most real problems, however, we can take advantage of problem structure to eliminate large parts of the search space.

In CSPs, the goal test is a set of constraints which can be considered individually, as opposed to all together. Depth-limited search on a CSP wastes time searching when constraints have already been violated. For example, suppose we put the first two queens in the top row. DLS will examine all 8^6 placements of the remaining six queens before discovering that no solution exists in that subtree.

A substantial improvement is to insert a test before the successor generation step to check whether any constraint has been violated by the variable assignments up to this point. The resulting algorithm backtracks immediately to try something else. This is called backtracking search.

Backtracking still has some failings. Suppose the squares chosen for the first six queens make it impossible to place the seventh queen, because they attack all 8 squares in the last column. Backtracking will still try all possible placings of the seventh queen. Forward checking avoids this problem by looking ahead to detect unsolvability. Each time a variable is instantiated, forward checking deletes from the domains of the as-yet-uninstantiated variables all of those values that conflict with the variables assigned so far. If any of the domains becomes empty, then the search backtracks immediately.
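Backtracking with forward checking for n-queens can be sketched as follows (an illustrative implementation with one variable per column; the helper names are my own):

```python
def queens_forward_checking(n=8):
    """domains[c] holds the rows still legal for column c."""
    def attacks(r1, c1, r2, c2):
        return r1 == r2 or abs(r1 - r2) == abs(c1 - c2)

    def search(col, placed, domains):
        if col == n:
            return placed
        for row in domains[col]:
            # forward checking: prune the domains of the columns not yet assigned
            pruned = [[r for r in domains[c] if not attacks(row, col, r, c)]
                      for c in range(col + 1, n)]
            if all(pruned):            # backtrack at once if any domain empties
                found = search(col + 1, placed + [row],
                               domains[:col + 1] + pruned)
                if found:
                    return found
        return None

    return search(0, [], [list(range(n))] * n)

print(queens_forward_checking(8))      # one valid placement, rows by column
```

The `if all(pruned)` test is the whole trick: an emptied domain means some future column has no legal square left, so the current choice is abandoned immediately instead of after an exhaustive subtree search.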

Forward checking is a special case of arc consistency checking. A state is arc-consistent if every variable has a value in its domain that is consistent with each of the constraints on that variable. This can be achieved by successive deletion of values that are inconsistent with some constraint. As values are deleted, other values may become inconsistent because they relied on the deleted values. This behaviour is called constraint propagation: choices are gradually narrowed down. Sometimes arc consistency is all that is needed to completely solve a problem.

# Informed Search Methods

Unfortunately, uninformed search methods are very inefficient in most cases. We now show how heuristic search strategies - ones that use problem specific knowledge - can find solutions more efficiently.

## Best-first search

If we use the GENERAL-SEARCH algorithm of the previous chapter, the only place where knowledge can be applied is in the queuing function, which determines the node to expand next. The knowledge used to make this determination is provided by an evaluation function that returns a number that describes the desirability of expanding the node.
When the nodes are ordered so that the one with the best evaluation is expanded first, the resulting strategy is called best-first search. It can be implemented as:

Note that the name best-first search is actually a misnomer because if it really always expanded the best node first, there would be no search at all.
What it, in fact, does is choose next the node that appears to be the best.
Just as there is a whole family of GENERAL-SEARCH algorithms with different ordering functions, there is a whole family of BEST-FIRST-SEARCH algorithms with different evaluation functions. These algorithms typically use some estimated measure of the cost of the solution and try to minimize it. One such measure is the cost of the path so far - we have already used that one. In order to focus the search, the measure must incorporate some estimate of the cost of the path from a state to the goal.

Minimize estimated cost to reach a goal: Greedy search
The simplest best-first strategy is to minimize the estimated cost to reach the goal, i.e., always expand the node that appears to be closest to the goal. A function that calculates cost estimates is called an heuristic function.

• h(n) = estimated cost of the cheapest path from the state at n to a goal state
A best-first search that uses h to select the next node to expand is called a greedy search.

To get an idea of what an heuristic function looks like, let's look at a particular problem. Here we can use as h the straight-line distance to the goal. To do this, we need the map co-ordinates of each city. This heuristic works because roads tend to head in more or less of a straight line.

This figure shows the progress of a greedy search to find a path from Arad to Bucharest.

For this problem, greedy search leads to a minimal search cost because no node off the solution path is expanded. However, it does not find the optimal path: the path it found via Sibiu and Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti.
Hence, the algorithm always chooses what looks locally best, rather than worrying about whether or not it will be best in the long run. (This is why its called greedy search.)
Greedy search is susceptible to false starts. Consider the problem of getting from Iasi to Fagaras. h suggests that Neamt be expanded first, but it is a deadend. The solution is to go first to Vaslui and then continue to Urziceni, Bucharest and Fagaras.
Note that if we are not careful to detect repeated states, the solution will never be found - the search will oscillate between Neamt and Iasi.
Greedy search resembles dfs in the way that it prefers to follow a single path to the goal and backup only when a deadend is encountered. It suffers from the same defects as dfs - it is not optimal and it is incomplete because it can start down an infinite path and never try other possibilities.
The worst-case complexity for greedy search is O(b^m), where m is the maximum depth of the search. Its space complexity is the same as its time complexity, but the worst case can be substantially reduced with a good heuristic function.

Minimizing the total path cost: A* search
Recall that uniform cost search minimizes the cost of the path so far, g(n): it is optimal and complete, but can be very inefficient. We can combine the use of g(n) and h(n) simply by summing them

• f(n)=g(n)+h(n)
f(n) = the estimated cost of the cheapest solution through n

Hence, if we are trying to find the cheapest solution, a reasonable thing to try is the node with the lowest value for f. It is possible to show that this strategy is complete and optimal, given a simple restriction on the h function: h must never overestimate the cost to reach the goal. Such an h is called an admissible heuristic.

If h is admissible, f(n) never overestimates the actual cost of the best solution through n.

Best-first search using f and an admissible h is known as A* search.

Notice that hsld (straight-line distance) is an admissible heuristic because the shortest path between two points is a straight line. The following diagram shows the first few steps of the A* search for Bucharest using hsld.

Notice that A* prefers to expand from Rimnicu Vilcea rather than Fagaras. Even though Fagaras is closer to Bucharest, the path taken to get to Fagaras is not as efficient in getting close to Bucharest as the path taken to get to Rimnicu.
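A* itself can be sketched with a priority queue ordered on f = g + h (an illustrative implementation; the tiny graph and heuristic values below are made up, but the heuristic is admissible):

```python
import heapq
from itertools import count

def a_star(start, successors, goal_test, h):
    """successors(state) -> (next_state, step_cost) pairs; h must never
    overestimate the true cost to the goal (admissible)."""
    tie = count()                      # tiebreaker so states are never compared
    frontier = [(h(start), next(tie), 0, start, [start])]   # (f, tie, g, state, path)
    best_g = {}
    while frontier:
        f, _, g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return g, path
        if best_g.get(state, float('inf')) <= g:
            continue                   # already reached this state more cheaply
        best_g[state] = g
        for nxt, cost in successors(state):
            heapq.heappush(frontier,
                           (g + cost + h(nxt), next(tie), g + cost, nxt, path + [nxt]))
    return None

graph = {'S': [('A', 1), ('B', 4)], 'A': [('G', 5)], 'B': [('G', 1)], 'G': []}
h = {'S': 4, 'A': 5, 'B': 1, 'G': 0}.get
print(a_star('S', graph.get, lambda s: s == 'G', h))   # (5, ['S', 'B', 'G'])
```

With h = 0 for every state, this code degenerates to uniform cost search, which is the relationship described above.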

The behaviour of A* search
An important thing to notice about the example A* search above: along any path from the root, the f-cost never decreases. This fact holds true for almost all admissible heuristics. An heuristic with this property is said to exhibit monotonicity.

Monotonicity is not quite the same as admissibility. In a few cases, heuristics are admissible but not monotonic.

When our heuristics are not monotonic, we will make them so; hence we can assume that if an heuristic is admissible it is also monotonic.

Now, we can take a look at what is going on in A*. If f never decreases along any path out of the root, we can conceptually draw contours in the state space.

Because A* expands the leaf node of lowest f, an A* search fans out from the start node, adding nodes in concentric bands of increasing f-cost.
With uniform-cost search (A* using h=0), the bands will be "circular" around the start node. With more accurate heuristics, the bands will stretch towards the goal state and become narrowly focused around the optimal path. If we define f* to be the cost of the optimal solution path, we can say:

• A* expands all nodes with f(n) < f*.
• A* may then expand some of the nodes right on the "goal contour", for which f(n) = f*, before selecting the goal node.
Hence, the first solution found must be the optimal one because nodes in all subsequent contours will have higher f-cost and hence higher g-cost.
A* search is also complete because as we add bands of increasing f, we must eventually reach a band where f is equal to the cost of the path to a goal state.

A* is optimally efficient for any given admissible heuristic function. This is because any algorithm that does not expand all nodes in the contours between the root and the goal runs the risk of missing the optimal solution.

Complexity of A*
The catch with A* is that even though it is complete, optimal, and optimally efficient, it still can't always be used, because for most problems the number of nodes within the goal contour is still exponential in the length of the solution.

Similarly to breadth-first search, however, the major difficulty with A* is the amount of space that it uses.

## Heuristic Functions

First, we will look at some different heuristics for the 8-puzzle problem. A typical solution to the puzzle has around 20 steps. The branching factor is about 3. Hence, an exhaustive search to depth 20 will look at about 3^20 ≈ 3.5 x 10^9 states.

By keeping track of repeated states, this number can be cut down to 9!=362,880 different arrangements of 9 squares. We need a heuristic to further reduce this number.

If we want to find shortest solutions, we need a function that never overestimates the number of steps to the goal. Here are two possibilities:

• h1= the number of tiles that are in the wrong position. This is admissible because any tile that is out of place must be moved at least once.
• h2= the sum of the distances of the tiles from their goal positions. Since tiles cannot be moved diagonally, we use city block distance. h2 is also admissible. Why?
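The two heuristics can be sketched directly in Python; states are assumed here to be 9-tuples in row-major order with the blank written as 0 (an encoding chosen for illustration):

```python
def h1(state, goal):
    """Number of misplaced tiles (the blank is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    """Sum of city-block (Manhattan) distances of tiles from their goals."""
    pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        r, c = divmod(i, 3)
        gr, gc = pos[tile]
        total += abs(r - gr) + abs(c - gc)
    return total

goal  = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 5, 6, 7, 0, 8)   # one move away from the goal
```

On this state both heuristics give 1, since a single tile is one city-block move out of place.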
The effect of heuristic accuracy on performance
One way to characterize the quality of an heuristic is the effective branching factor b*. If the total number of nodes expanded by A* for a particular problem is N and the solution depth is d, then b* is the branching factor that a uniform tree of depth d would have to have in order to contain N nodes.
• N = 1 + b* + (b*)^2 + ... + (b*)^d
Example: if A* finds a solution at depth 5 using 52 nodes, the effective branching factor is 1.91.
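There is no closed form for b*, but it is easy to solve the equation above numerically; a simple bisection sketch:

```python
def effective_branching_factor(N, d, tol=1e-6):
    """Solve N = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection."""
    def total(b):
        return sum(b ** i for i in range(d + 1))
    lo, hi = 1.0, float(N)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < N:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

b_star = effective_branching_factor(52, 5)   # the example above
```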

Usually, the effective branching factor of an heuristic is fairly constant over problem instances and, hence, experimental measurements of b* on a small set of problems provides a good approximation of the heuristic's usefulness.

A well designed heuristic should have a value of b* that is close to 1. The following is a table of the performance of A* with h1 and h2 above on 100 randomly generated problems. It shows that h2 is better than h1 and that both are much better than iterative deepening search.

Note that h2 is always a better heuristic function than h1, because h2(n) >= h1(n) for every node n; we say h2 dominates h1.

Inventing heuristic functions
Is it possible for heuristic functions to be invented mechanically?

In many cases, the answer is yes. In the 8-puzzle case, h1 and h2 can be considered perfectly accurate path lengths for simplified versions of the 8-puzzle problem. That is, if the rules of the puzzle were changed so that a tile could be moved anywhere instead of just into an opening, h1 would accurately give the number of steps in the shortest solution.
Is there a simplified problem in which h2 gives correct path lengths?
Yes. If a tile could move a square in any direction, even onto an occupied square.
A problem with less restrictions on its operators is called a relaxed problem. It is often the case that the cost of an exact solution to a relaxed problem is a good heuristic for the original problem.
If a problem definition is written down in a formal language, it is possible to automatically generate relaxations. Again, in the 8-puzzle, the operators can be described as:

• A tile can move from square A to square B if A is adjacent to B and B is blank.
One can generate three relaxed problems by removing conditions:
• A tile can move from square A to square B if A is adjacent to B.
• A tile can move from square A to square B if B is blank.
• A tile can move from square A to square B.
One problem with generating new heuristic functions is there is often no one best heuristic.

It turns out that if there are several choices for heuristic functions with none clearly dominating the others, and they are all admissible, then they can be combined as

• h(n)=max(h1(n),h2(n),...,hm(n))
This composite heuristic uses whichever function is most accurate on a node.
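The composite can be written in one line; the two component heuristics below are toy functions chosen purely for illustration:

```python
def combine(*heuristics):
    """Pointwise max of admissible heuristics: still admissible,
    and at least as accurate as each component on every node."""
    return lambda n: max(h(n) for h in heuristics)

# Toy component heuristics (illustrative assumptions, not from the notes).
h = combine(lambda n: n % 5, lambda n: n % 3)
```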

Often it is possible to pick out features of a state that contribute to its heuristic evaluation, even if it is hard to say what the contribution should be. For example, the goal in chess is to checkmate and the relevant features include number of pieces of each kind belonging to each side, the number of pieces that are attacked by opponent's pieces, etc.

Usually, the evaluation function is assumed to be a linear combination of the features and even if we don't know the weights, it is possible to use a learning algorithm to acquire them.

We have not considered the cost of computing the heuristic functions in our discussion. So long as this cost is about the same as the cost of expanding a node, it is not an issue, but when computing the heuristic costs as much as expanding several hundred nodes, it can no longer be ignored.

Heuristics for constraint satisfaction
We have so far examined variants of depth first search for solving CSPs. Now we extend the analysis by considering heuristics for selecting a variable to instantiate and for choosing a value for the variable.

Suppose we are trying to colour the following map with three colours so that no two adjacent regions have the same colour.

Clearly the best next move is to colour region E first because the only possible colour for it is blue. All the other regions have a choice and we might make the wrong one requiring us to backtrack.

This idea is called the most-constrained-variable heuristic. It is used with forward checking. At each point in the search, the variable with the fewest possible values is chosen to have a value assigned next. In this way the branching factor in the search tends to be minimized. When this heuristic is used in solving the n-queens problem, the maximum feasible value of n goes from around 30 for forward checking alone to around 100.

The most-constraining-variable heuristic attempts to reduce the branching factor on future choices by assigning a value to the variable that is involved in the largest number of constraints on other unassigned variables.

Once a variable has been selected, a value must still be chosen for it. For example, suppose we decide to assign a value to region C after A and B. Red is a better choice than blue because it leaves more freedom for future choices. This is the least-constraining-value heuristic - choose a value that rules out the smallest number of values in variables connected to the current variable by constraints. When applied to the n-queens problem, it allows problems up to n=1000 to be solved.
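The two orderings can be combined in a small backtracking solver. The sketch below uses a hypothetical 4-region map (the region names and borders are assumptions, not the map from the figure):

```python
def solve(variables, domains, neighbours, assignment=None):
    """Backtracking with most-constrained-variable ordering for variables
    and least-constraining-value ordering for values."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return assignment

    def legal(var):
        return [v for v in domains[var]
                if all(assignment.get(n) != v for n in neighbours[var])]

    # Most constrained variable: fewest remaining legal values.
    var = min((v for v in variables if v not in assignment),
              key=lambda v: len(legal(v)))

    # Least constraining value: rules out the fewest neighbour options.
    def ruled_out(value):
        return sum(value in legal(n)
                   for n in neighbours[var] if n not in assignment)

    for value in sorted(legal(var), key=ruled_out):
        assignment[var] = value
        result = solve(variables, domains, neighbours, assignment)
        if result:
            return result
        del assignment[var]
    return None

# Hypothetical map: A-B, A-C, B-C and B-D share borders.
regions = ["A", "B", "C", "D"]
doms = {r: ["red", "green", "blue"] for r in regions}
nbrs = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
colouring = solve(regions, doms, nbrs)
```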

## Memory Bounded Search

IDA* is a logical extension of iterative deepening search and SMA* is similar to A* but restricts the queue size to fit into the available memory.
Iterative deepening A* search (IDA*)

IDA* uses the same trick (iterative deepening) that we used to make breadth-first search memory efficient.
In this algorithm each iteration is a depth-first search, just as in regular iterative deepening. However, the search is modified to use an f-cost limit rather than a depth limit.
Hence, each iteration expands all nodes inside the contour for the current f-cost, peeping over the contour to find out where the next contour lies.
Once the search inside a contour is completed, a new iteration is started using a new f-cost for the next contour.
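The contour-by-contour iteration can be sketched as follows (the toy graph and h=0 heuristic are illustrative assumptions; `dfs` returns the smallest f-cost that exceeded the bound, which becomes the next contour's limit):

```python
def ida_star(start, goal, neighbours, h):
    """Each iteration is a depth-first search bounded by an f-cost limit;
    the next limit is the smallest f-cost seen beyond the current contour."""
    def dfs(node, g, bound, path):
        f = g + h(node)
        if f > bound:
            return f, None          # peep over the contour
        if node == goal:
            return f, path
        minimum = float("inf")
        for succ, cost in neighbours(node):
            if succ in path:
                continue            # avoid cycles on the current path
            t, found = dfs(succ, g + cost, bound, path + [succ])
            if found:
                return t, found
            minimum = min(minimum, t)
        return minimum, None

    bound = h(start)
    while True:
        bound, found = dfs(start, 0, bound, [start])
        if found:
            return found
        if bound == float("inf"):
            return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)], "D": []}
path = ida_star("A", "D", lambda n: graph[n], lambda n: 0)
```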

IDA* is complete and optimal, but because it is depth-first it only requires space proportional to the longest path that it explores; bd (branching factor times depth) is a good estimate of the storage requirement.

The algorithm's time complexity depends on the number of different values that the h function can take on. The city block distance function used in the 8-puzzle takes on one of a small number of integer values. So typically, f only increases two or three times along any solution path. Hence, IDA* only goes through 2 or 3 iterations. Because IDA* does not need to maintain a queue of nodes, its treatment of nodes is faster.

Unfortunately, IDA* has difficulty in more complex domains. In the travelling salesperson problem, for example, the heuristic value is different for every state. As a result, the search is much more like iterative deepening: if A* expands N nodes, IDA* will have to go through N iterations and will therefore expand 1+2+...+N = O(N^2) nodes.

One way around this problem is to increase the f-cost limit by a fixed amount ε on each iteration, so that the total number of iterations is proportional to 1/ε. This can reduce the search cost at the expense of returning solutions that can be worse than optimal by at most ε. Such an algorithm is called ε-admissible.

SMA* search

IDA*'s difficulties in some problem spaces can be traced to using too little memory. Between iterations, it retains only a single number: the current f-cost limit.

SMA* (Simplified Memory-bounded A*) can make use of all available memory to carry out the search. SMA* has the following properties:

• It will utilize whatever memory is made available.
• It avoids repeated states as far as its memory allows.
• It is complete if the available memory is sufficient to store the shallowest solution path.
• It is optimal if enough memory is available to store the shallowest optimal solution path. Otherwise it returns the best solution that can be reached with the available memory.
• When enough memory is available for the entire search tree, the search is optimally efficient.

When SMA* needs to make space on its queue, it drops a node from the queue. It prefers to drop unpromising nodes - that is, nodes with high f-cost. To avoid re-exploring subtrees that it has dropped from memory, it retains in the ancestor nodes information about the quality of the best path in the forgotten subtree. In this way, it only regenerates the subtree when all other paths have been shown to look worse than the path it has forgotten.

Example  Here is a hand simulation of SMA* that illustrates how it works.

• 1. At each stage, one successor is added to the deepest lowest-f-cost node that has some successors not currently in the tree. The left child B is added to the root A.
• 2. Now f(A) is still 12, so we add the right child G (f=13). Now that we have seen all the children of A, we can update its f-cost to the minimum of its children, that is, 13. The memory is now full.
• 3. G is now designated for expansion, but we must first drop a node to make room. We drop the shallowest highest-f-cost leaf, that is B. When we have done this, we note that A's best forgotten descendant has f=15, as shown in parentheses. We then add H, with f(H)=18. H is not a goal node, but the path to H uses up all the available memory. Hence, there is no way to find a solution through H, so we set f(H)=∞.
• 4. G is expanded again. We drop H and add I with f(I)=24. Now we have seen both successors of G, with values ∞ and 24, so f(G) becomes 24. f(A) becomes 15, the minimum of 15 (forgotten successor value) and 24. Notice that I is a goal node, but it might not be the best solution because A's f-cost is only 15.
• 5. A is once again the most promising node, so B is generated for a second time. We have found that the path through G was not so great after all.
• 6. C, the successor of B, is a nongoal node at the maximum depth, so f(C)=∞.
• 7. To look at the second successor, D, we first drop C. Then f(D)=20, and this value is inherited by B and A.
• 8. Now the deepest, lowest-f-cost node is D. D is therefore selected and because it is a goal node, the search terminates.
In this case, there is enough memory for the shallowest optimal solution path. If J had had a cost of 19 instead of 24, however, SMA* still would not have been able to find it because the solution path contains four nodes.

In this case, it would return D which is a reachable, non-optimal solution.

Given a reasonable amount of memory, SMA* can solve significantly more difficult problems than A* without incurring significant overhead in terms of extra nodes generated. It performs well on problems with highly connected state spaces with real-valued heuristics, on which IDA* has difficulty.

## Iterative Improvement Algorithms

Start with a complete configuration and make modifications to improve quality. Example: n-queens

For this kind of algorithm, consider a grid containing all of the problem states. The height of any point corresponds to the value of the evaluation function on the state at that point.

The idea of iterative improvement algorithms is to move around the grid, trying to find the highest peaks, which are optimal solutions.

These algorithms usually keep track only of the current state and do not look ahead beyond immediate neighbors.

There are two major classes of iterative improvement algorithms: Hill-climbing algorithms try to make changes that improve the current state. Simulated annealing algorithms sometimes make changes that make things worse.

Hill-Climbing Search

Always move in a direction of increasing quality. This simple policy has three well-known drawbacks:

• Local Maxima - a local maximum is a peak that is lower than the highest peak in the state space. Once on a local maximum, hill-climbing will halt, even though there is a better solution.
• Plateaux - a plateau is an area of the state space where the evaluation function is nearly flat. Hill-climbing will do a random walk in such an area.
• Ridges - a ridge can have steeply sloping sides, so that the search reaches the top with ease, but on the top may slope gently toward a peak. Unless there happen to be operators that move directly along the top of the ridge, the search may oscillate from side to side making little or no progress.
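The basic greedy loop is very short. The sketch below climbs a one-dimensional toy landscape (the landscape and integer-step neighbours are illustrative assumptions):

```python
def hill_climb(state, neighbours, value):
    """Steepest-ascent hill-climbing: move to the best neighbour until
    no neighbour improves on the current state (a peak or plateau)."""
    while True:
        best = max(neighbours(state), key=value, default=None)
        if best is None or value(best) <= value(state):
            return state
        state = best

# Toy landscape: a single smooth peak at x = 3.
value = lambda x: -(x - 3) ** 2
peak = hill_climb(0, lambda x: [x - 1, x + 1], value)
```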

Simulated Annealing

Allow hill-climbing to take some downhill steps to escape local maxima.

The innermost loop of simulated annealing is quite similar to hill-climbing. Instead of picking the best move, however, it picks a random move. If the move actually improves the situation, it is executed. Otherwise, the algorithm makes the move with some probability less than 1. The probability decreases exponentially with the badness of the move - the amount ΔE by which the evaluation is worsened.

A second parameter T is also used to determine the probability. At higher values of T, bad moves are more likely to be allowed. As T tends towards 0, they become more and more unlikely, until the algorithm behaves like hill-climbing. The schedule input determines the value of T as a function of how many cycles already have been completed.

Simulated annealing was developed from an explicit analogy with annealing - the process of gradually cooling a liquid until it freezes. The value function corresponds to the total energy of the atoms in the material and T corresponds to the temperature. The schedule determines the rate at which the temperature is lowered. Individual moves in the state space correspond to random fluctuations due to thermal noise. One can prove that if the temperature is lowered sufficiently slowly, the material will attain a lowest-energy configuration. This corresponds to the statement that if schedule lowers T slowly enough, the algorithm will find a global optimum.
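The loop described above can be sketched as follows; the toy landscape (local maximum at x = 0, global at x = 10) and the linear cooling schedule are illustrative assumptions:

```python
import math, random

def simulated_annealing(state, neighbour, value, schedule,
                        rng=random.Random(0)):
    """Pick a random move; always accept improvements, otherwise accept
    with probability exp(ΔE / T), which shrinks as T falls."""
    t = 0
    while True:
        t += 1
        T = schedule(t)
        if T <= 0:
            return state
        nxt = neighbour(state, rng)
        delta = value(nxt) - value(state)
        if delta > 0 or rng.random() < math.exp(delta / T):
            state = nxt

# Toy 1-D landscape and schedule (assumptions for illustration).
value = lambda x: 5 - abs(x) if abs(x) < 5 else 10 - abs(x - 10)
neighbour = lambda x, r: x + r.choice([-1, 1])
schedule = lambda t: max(0.0, 1.0 - t / 1000)
result = simulated_annealing(-3, neighbour, value, schedule)
```

Note that if the schedule starts at T = 0 the algorithm returns its start state immediately, and at very low T it accepts uphill moves only, i.e. it degenerates into (random-neighbour) hill-climbing.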

Applications to constraint satisfaction problems

CSPs can be solved by iterative improvement methods by first assigning values to all the variables and then applying modification operators to move the configuration towards a solution. These operators simply assign a different value to a variable, e.g., 8-queens.

Algorithms that solve CSPs in this fashion are called heuristic repair methods. In choosing a new value for a variable, the most obvious heuristic is to select the value that results in the minimum number of conflicts with other variables. This is called the min-conflicts heuristic.
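For n-queens, heuristic repair with min-conflicts might be sketched as below (one queen per column; the step limit and random tie-breaking are implementation choices of this sketch):

```python
import random

def min_conflicts(n, max_steps=10000, rng=random.Random(0)):
    """Start from a complete random assignment, then repeatedly move a
    conflicted queen to the row that minimises its conflicts."""
    cols = [rng.randrange(n) for _ in range(n)]  # queen in column c is in row cols[c]

    def conflicts(col, row):
        return sum(1 for c in range(n) if c != col and
                   (cols[c] == row or abs(cols[c] - row) == abs(c - col)))

    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(c, cols[c]) > 0]
        if not conflicted:
            return cols
        col = rng.choice(conflicted)
        best = min(conflicts(col, r) for r in range(n))
        cols[col] = rng.choice([r for r in range(n)
                                if conflicts(col, r) == best])
    return None

solution = min_conflicts(8)
```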

Section 05

# Games

Games require different search methods since we are often dealing with an adversary.

## Game Tree

A tree in which the nodes denote board configurations, and the branches indicate how one board configuration can be transformed into another by a single move.  A single move by one player is called a ply.

Need a way to rank game outcomes. The ranking function is called a utility function.  A simple utility function may give -1,0 or +1 for lose, draw and win.
Consider noughts and crosses with two players called MAX and MIN. If MAX is X and goes first then the tree is as follows.

MAX will always try to maximise the utility function for the terminal state and MIN will always try to minimise it.
An exhaustive search of the game tree for a solution is usually not possible.
Consider the following simple game tree.
The utility function is only applied to terminal nodes. If max makes move A1 then Min should make move A11. Thus the result of making move A1 is a value of 3 for the utility function. Similarly A2 leads to a value of 2 and A3 a value of 2. Max wants to maximise the utility function and so A1 is the best move to make.

Minimax consists of 5 steps.

1. Generate the complete game tree.
2. Apply the utility function to all the terminal states.
3. Use the utility of the terminal states to calculate a utility value for their parents (either max or min) depending on depth.
4. Continue up to root node.
5. Choose move with highest value from root node.
The chosen move is called the minimax decision because it maximises the utility under the assumption that the opponent will play perfectly to try and minimise it.

If the maximum depth of the tree is m and there are b moves at each node then the time complexity of minimax is O(b^m).  The algorithm is depth first so the memory requirement is only O(bm).
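The five steps above can be collapsed into one recursive function. The leaf values below are hypothetical, chosen so that A1 evaluates to 3 and A2, A3 to 2 as in the example:

```python
def minimax(node, maximising, utility, successors):
    """Return the minimax value of `node`. `successors` yields child
    nodes (empty for terminals); `utility` scores terminal nodes."""
    children = successors(node)
    if not children:
        return utility(node)
    values = [minimax(c, not maximising, utility, successors)
              for c in children]
    return max(values) if maximising else min(values)

# Hypothetical game tree: MAX moves first, MIN replies, leaves are utilities.
tree = {"root": ["A1", "A2", "A3"],
        "A1": [3, 12, 8], "A2": [2, 4, 6], "A3": [14, 5, 2]}
succ = lambda n: tree.get(n, [])
best = max(tree["root"], key=lambda m: minimax(m, False, lambda n: n, succ))
```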

## Evaluation functions

Minimax assumes that we can search all the way to the terminal states, this is not often practical.

Instead of doing this we could use a heuristic evaluation function to estimate which moves lead to better terminal states. Now a cutoff test must be used to limit the search.

Choosing a good evaluation function is very important to the efficiency of this method. The evaluation function must agree with the utility function for terminal states and it must be a good predictor of the terminal values. If the evaluation function is infallible then no search is necessary, just choose the move that gives the best position. The better the evaluation function, the less search needs to be done.

e.g. the following shows some evaluations of chess positions.

The evaluation function is often a linear combination of features.

### Cutting off search

The easiest way to stop the search is to have a fixed depth or to use iterative deepening.

### Heuristic Continuation

You search for N ply in a tree, and find a very good move. But, perhaps, if you had searched just one ply further you would have discovered
that this move is actually very bad.

In general:

The analysis may stop just before your opponent captures one of your pieces, or just before you capture one of your opponent's pieces.

This is called the horizon effect: a good (or bad) move may be just over the horizon.

### The Singular Extension Heuristic:

Search should continue as long as one move's static value stands out from the rest. If we don't use this heuristic, we risk harm from the horizon effect.

e.g. consider the following chess position. Black is ahead in material, but if white can reach the eighth row with its pawn then it can win. Black can stall this for some time by checking the white king, and so a depth-limited search may never see that this is a bad position.

# Alpha Beta Pruning

Using the Minimax algorithm allows a chess playing program to look ahead about 3 or 4 ply. This is roughly as good as an intermediate human player, but still fairly easy to beat. Minimax, however, is wasteful because it evaluates some branches of the game tree that it need not. The principle behind Alpha-Beta pruning is:

If you have an idea that is surely bad, do not take time to see how terrible it is.

Consider the following example:

Search proceeds exactly as for minimax, evaluating A11, A12, A13 and A1 until, after looking at A21, we find that A2 must have a utility of at most 2. We now know that A2 is a bad choice because A1 has a higher utility and we need not evaluate any more branches below A2.

The following  diagram shows this for a general case.

If we have a better choice m, either at the parent node of n or at any choice point further up, then n will never be reached in actual play.

Minimax is depth first, so all that need be remembered is the best choice found so far for the player (MAX) and the best choice found so far for the opponent (MIN). These are called alpha and beta respectively.

The following algorithm illustrates this.
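One possible Python sketch of the algorithm (the tree reuses the hypothetical leaf values from the minimax example; after seeing the leaf 2 under A2, the rest of A2 is pruned):

```python
def alphabeta(node, maximising, alpha, beta, utility, successors):
    """Minimax with alpha-beta pruning: alpha is the best value found so
    far for MAX, beta the best for MIN; prune when alpha >= beta."""
    children = successors(node)
    if not children:
        return utility(node)
    if maximising:
        value = float("-inf")
        for child in children:
            value = max(value, alphabeta(child, False, alpha, beta,
                                         utility, successors))
            alpha = max(alpha, value)
            if alpha >= beta:
                break               # MIN will never allow this branch
        return value
    value = float("inf")
    for child in children:
        value = min(value, alphabeta(child, True, alpha, beta,
                                     utility, successors))
        beta = min(beta, value)
        if alpha >= beta:
            break                   # MAX has a better option elsewhere
    return value

tree = {"root": ["A1", "A2", "A3"], "A1": [3, 12, 8],
        "A2": [2, 4, 6], "A3": [14, 5, 2]}
root_value = alphabeta("root", True, float("-inf"), float("inf"),
                       lambda n: n, lambda n: tree.get(n, []))
```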

### Effectiveness of alpha-beta pruning.

What are the maximum savings possible?

Suppose the tree is ordered as follows:

^                ____________________________o____________________________
/                            |                            \
v      ________o________              _______o________             ________o________
/        |        \           /        |        \           /        |        \
^    o         o         o         o         o         o         o         o         o
/ | \     / | \     / | \     / | \     / | \     / | \     / | \     / | \     / | \
14 15 16  17 18 19  20 21 22  13 14 15  26 27 28  29 30 31  12 13 14  35 36 37  38 39 40
*  *  *  *         *         *  *  *                       *  *  *
Only those nodes marked (*) need be evaluated.
How many static evaluations are needed?

If b is the branching factor (3 above) and d is the depth (3 above)

s = 2b^(d/2) - 1 if d is even

s = b^((d+1)/2) + b^((d-1)/2) - 1 if d is odd

For our tree d=3, b=3 and so s=11.
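As a quick check of the formulas above in Python:

```python
def perfect_ordering_evaluations(b, d):
    """Static evaluations needed by alpha-beta on a perfectly ordered
    tree with branching factor b and depth d."""
    if d % 2 == 0:
        return 2 * b ** (d // 2) - 1
    return b ** ((d + 1) // 2) + b ** ((d - 1) // 2) - 1
```

For b=3, d=3 this gives 3^2 + 3^1 - 1 = 11, matching the tree above.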

This is only for the perfectly arranged tree. It gives a lower bound on the number of evaluations of approximately 2b^(d/2). The worst case is b^d (minimax).
In practice, for reasonable games, the complexity is O(b^(3d/4)). Using minimax with alpha beta pruning allows us to look ahead about half as far again as without.

## Games of Chance

Many games, such as backgammon, involve chance.

To represent this game we need to add chance nodes to the game tree. In the tree below chance nodes are shown as circles. Below each circle are the possible outcomes of the dice throw with the probability of the throw occurring. (1/36 for a double, 1/18 otherwise).

Now each position has no known outcome only an expected outcome. The expected value for a chance node above a max node is:

expectimax(C) = sum_i p(d_i) max_s utility(s)
Where
C is a chance node
d_i is a dice roll
p(d_i) is the probability of dice roll d_i occurring
max_s utility(s) is the maximum of the possible utility values for all the moves that can be made after dice roll d_i.
The expectimin value is similarly defined for a chance node above a min node.
expectimin(C) = sum_i p(d_i) min_s utility(s)
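Both definitions are one-liners in Python; the chance node below, with two equally likely rolls and small utility lists, is a hypothetical example:

```python
def expectimax(outcomes, utilities):
    """sum_i p(d_i) * max_s utility(s) for a chance node above MAX."""
    return sum(p * max(utilities[roll]) for roll, p in outcomes)

def expectimin(outcomes, utilities):
    """sum_i p(d_i) * min_s utility(s) for a chance node above MIN."""
    return sum(p * min(utilities[roll]) for roll, p in outcomes)

# Hypothetical chance node: two equally likely rolls.
outcomes = [("r1", 0.5), ("r2", 0.5)]
utilities = {"r1": [3, 5], "r2": [1, 2]}
```

Here expectimax gives 0.5 * 5 + 0.5 * 2 = 3.5 and expectimin gives 0.5 * 3 + 0.5 * 1 = 2.0.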

### Complexity of expectiminimax

Minimax has a complexity of O(b^m) where b is the branching factor and m is the depth. If there are now also a number (n) of chance nodes at each level, then the complexity becomes O(b^m n^m).
This extra cost can make games of chance very difficult to solve.

## State of the art

The following diagram suggests that in 1997 a chess program would be as good as the best human. This is in fact what happened.
Deep Blue has 32 PowerPC CPUs and 256 dedicated hardware devices for evaluating board positions. It also has 100 years worth of grandmaster chess games and an opening and closing move database.

see: http://www.chess.ibm.com/meet/html/d.3.2.html for more details of Deep Blue

Section 06

# Agents that reason logically

We will now look at knowledge-based agents; these implement a view of agents in which they can be seen as knowing about their world and reasoning about their possible courses of action.

A knowledge-based agent needs to know: the current state of the world, how to infer unseen properties from percepts, how the world evolves over time, what it wants to achieve, and what its own actions do in various circumstances.

The basic elements of a reasoning agent's design are: a formal language in which knowledge can be expressed and a means of carrying out reasoning in such a language. These two elements constitute a logic.

## A knowledge-based agent

The central component of a knowledge-based agent is its knowledge base (KB). A KB is a set of representations of facts about the world. Often the individual units of a KB are called sentences.

There must be a way to add new sentences to the knowledge base and a way to query what is known. We will call these standard functions TELL and ASK, respectively.

Determining what follows from a KB is the job of the inference mechanism.

Like all other agents, this agent takes a percept as input and returns an action.

The agent maintains a KB which may initially contain some background knowledge.

In the process of answering this query, logical reasoning is used to prove which action is better than all others, given what the agent knows and what its goals are.

Make-percept-sentence takes a percept and a time and returns a sentence representing the fact that the agent perceived the percept at time t.

Make-action-query takes a time and returns a sentence that is suitable for asking what action should be performed at that time.
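The TELL/ASK loop described above might be skeletonised as follows. This is a stand-in sketch only: the KB is a bare set of sentence strings, ASK is mere membership testing rather than real logical inference, and the sentence formats are my own assumptions:

```python
class KnowledgeBasedAgent:
    """Skeleton of the agent loop: TELL the percept, ASK for an action,
    TELL the action taken, then advance time."""
    def __init__(self):
        self.kb = set()   # stand-in KB: a set of sentence strings
        self.t = 0

    def tell(self, sentence):
        self.kb.add(sentence)

    def ask(self, query):
        # A real agent would run logical inference here; this stand-in
        # just checks membership in the KB.
        return query in self.kb

    def make_percept_sentence(self, percept, t):
        return f"Percept({percept}, {t})"

    def make_action_query(self, t):
        return f"ShouldDo(Forward, {t})"   # hypothetical query format

    def agent_program(self, percept):
        self.tell(self.make_percept_sentence(percept, self.t))
        action = "Forward" if self.ask(self.make_action_query(self.t)) else "NoOp"
        self.tell(f"Did({action}, {self.t})")
        self.t += 1
        return action

agent = KnowledgeBasedAgent()
action = agent.agent_program("Stench")
```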

At any point, we can describe a knowledge-based agent at three levels:

• The knowledge level is the most abstract; we can describe the agent by saying what it knows.
• The logical level is the level at which the knowledge is encoded in sentences.
• The implementation level is the level that runs on the agent architecture.
The agent's initial program is built by adding sentences one at a time to the knowledge base. Provided that the representation language makes it easy to express this knowledge in the form of sentences, this simplifies the construction problem significantly. This is called the declarative approach to system building.

## The Wumpus World

We now describe a simple world.  We must explore a cave consisting of rooms connected by passageways. Lurking somewhere in the cave is a beast (the wumpus) that eats anyone who enters his room. Some rooms contain bottomless pits, and one room contains gold.

The percepts, actions and goals in the wumpus world are as follows:
• In the square containing the wumpus and in the directly adjacent squares, the agent will perceive a stench.
• In the squares adjacent to a pit the agent will perceive a breeze.
• In the square where the gold is, the agent will perceive a glitter.
• When an agent walks into a wall, it will perceive a bump.
• When the wumpus is killed, it emits a woeful scream that can be perceived anywhere in the cave.
• The percepts will be given to the agent in the form of a list of five symbols, e.g., [Stench, Breeze, Glitter, None, None], i.e., no bump, no scream. The agent cannot perceive its own location.
• The actions are: go forward, turn right 90, turn left 90, grab, which picks up an object that is in the same square as the agent, shoot, which fires an arrow in a straight line in the direction the agent is facing, climb, which is used to end the game when the agent is in square [1,1]. The agent has only one arrow.
• The agent dies a miserable death if it enters a square containing a pit or a live wumpus. It is safe to enter a square with a dead wumpus.
• The agent's goal is to find the gold and bring it back to the start as quickly as possible without getting killed. To be precise, 100 points for climbing out of the cave with the gold, but there is a 1 point penalty for each action taken and a 10,000 point penalty for getting killed.
The agent must do well over a class of wumpus problems. We will assume a 4x4 grid surrounded by walls. The locations of the gold and the wumpus are chosen randomly, with a uniform distribution over the squares other than [1,1]. In addition, each square other than the start can be a pit, with probability 0.2.

In most environments in this class, it is possible for the agent to return successfully with the gold. However, in some the agent must choose between going home empty handed or taking a chance that could lead to death or the gold. And in about 21% of the environments, there is no way the agent can get a positive score.

Acting and reasoning in the wumpus world

Why does an agent that is successful in this world need to be able to reason? The following figure shows the state of the agent's knowledge after it has received its initial percept in the world above.

From the fact that there was no stench or breeze in [1,1], the agent can infer that [1,2] and [2,1] are free of dangers. From the fact that it is still alive, it can infer that [1,1] is also OK. A cautious agent will only move into a square that it knows is OK. Let us suppose the agent moves to [2,1], giving us (b) above.

The agent detects a breeze in [2,1], so there must be a pit in a neighbouring square, either [2,2] or [3,1]. (The notation P? indicates a possible pit.) The pit cannot be in [1,1] because the agent has already been there and did not fall in. At this point, there is only one known square that is OK and has not yet been visited. So the prudent agent turns back and goes to [1,1] and then [1,2], giving us (a) below:

The agent detects a stench in [1,2], which means there must be a wumpus nearby. But the wumpus cannot be in [1,1] and it cannot be in [2,2] (or the agent would have detected a stench when it was in [2,1]), so the wumpus must be in [1,3]. More interesting is that the lack of a breeze in [1,2] means there cannot be a pit in [2,2], so the pit must be in [3,1]. ...

## Representation, Reasoning and Logic

Together representation and reasoning support the operation of a knowledge-based agent. The object of a knowledge representation is to express knowledge in computer-manipulatable form such that it can be used to help agents perform well. A knowledge representation language has two aspects:
• The syntax of a language defines what are legal sentences in the language. Each sentence is implemented by a physical configuration of some part of the agent. Example: if x and y are numeric expressions then x>=y is a legal sentence.
• The semantics determines the facts in the world to which the sentences refer. Without a semantics, a syntax is just a meaningless arrangement of symbols. With semantics, each sentence makes a claim about the world. Example: the semantics of a language tells us when sentences like x>=y are true and when they are false.
Languages with precisely defined syntax and semantics are called logics. In a logic, we can develop inference mechanisms for an agent that uses the language. A sentence inside of a computer is not a fact, it is a representation of a fact. An inference mechanism must ensure that it allows new sentences to be derived from existing ones only when the new fact is true just when the existing ones are.

We want to generate sentences that are necessarily true given that the old sentences are true. This relationship between sentences is called entailment.

How do we build sound inference procedures? By examining the semantics of logical languages, we can extract what is called a proof theory of the language.

Example Consider the sentence E = mc^2. The syntax of the "equation language" allows two expressions to be connected by =. An expression can be a symbol or a number, a concatenation of two expressions, two expressions joined by +, and so on.

The semantics of the language says that the two expressions on each side of the = refer to the same quantity; that the concatenation of two expressions refers to the quantity that is the product of the two quantities referred to by each of the expressions; and so on.

From the semantics, we can show that a new sentence can be generated by, for example, concatenating the same expression to both sides of the equation: ET = mc^2T.

Representation

Representations need to be good at the job for which they are intended, but what does this mean? We start by looking at some more familiar examples than representation languages.

Programming languages are good for describing algorithms and concrete data structures. It is easy to imagine using a 4x4 array to represent the wumpus world and world[2,2]<-pit is a natural way to place a pit in square [2,2] of a world. However, when we want to say less specific things, we run into trouble. For example, "there is a pit in [2,2] or [3,1]" or "there is a wumpus somewhere."

The reason we run into this problem is that programming languages are designed to completely describe the state of the computer and how it changes as a program executes. But we would like to be able to describe situations where complete information is not available. A language that does not allow us to do this is not sufficiently expressive.

A good knowledge representation language should be expressive and concise so that what we say today will still be understandable tomorrow. It should be unambiguous and independent of context. Also, it should be possible to design an inference procedure for it.

Semantics

In logic, the meaning of a sentence is what it states about the world. So, how do we establish the correspondence between facts and sentences? In order to say what a sentence means, an interpretation must be provided to say what fact it corresponds to. Without an interpretation, a sentence is meaningless. This idea is easiest to see in fabricated languages.

The languages that we will deal with are all compositional - the meaning of a sentence is a function of the meaning of its parts. Just as the meaning of x² + y² is determined from the meanings of x² and y², we want the meaning of "s1 and s2" to be determined from the meanings of "s1" and "s2".

A sentence is true under a particular interpretation if the state of affairs it represents is the case. Hence, truth depends both on the interpretation and the actual state of affairs in the world.

Example

"S1,2" would be true under the interpretation in which it means that there is a stench in [1,2], in the world we explored earlier, but it would be false in worlds that do not have a stench in [1,2]. Also it would be false in our example world if its interpretation were "there is a breeze in [1,2]."

Inference

We are concerned with sound inference which is called logical inference or deduction. Logical inference is a process that implements the entailment relation between sentences. We begin with the idea of necessary truth.

Validity and satisfiability

A sentence is valid or necessarily true if and only if it is true under all possible interpretations in all possible worlds.

Example "There is a stench at [1,1] implies there is a stench at [1,1]." By contrast, "There is an open area in the square in front of me or there is a wall in the square in front of me" is not valid.

A sentence is satisfiable if and only if there is some interpretation in some world that makes it true, e.g., "there is a wumpus at [1,2]". If there is no interpretation and no world in which a sentence is true, it is called unsatisfiable.

Example There is a wall in front of me if and only if there is no wall in front of me.

Inference in computers

It turns out that validity and unsatisfiability are crucial to the ability of a computer to reason logically.

A computer does not know the interpretation you are using for the sentences in the knowledge base, nor does it know anything more than what appears in the knowledge base (it has no direct experience with the world).

Example Suppose we ask if it is OK to move to square [2,2]. The computer does not know what OK means, nor does it know what a wumpus or a pit is. So it cannot reason informally as we do. All it can do is see if KB ⊨ "[2,2] is OK". The inference procedure must show that the sentence, "If KB is true then [2,2] is OK" is a valid sentence. If it can show this then [2,2] is OK under any interpretation and model.

Hence, the powerful thing about formal inference is that it can be used to derive valid conclusions even if the reasoner does not know the intended interpretation.

Logics

As we have said, logics consist of the following:

• 1. A formal system for describing states of affairs, consisting of
• (a) the syntax of the language
• (b) the semantics of the language
• 2. A proof theory - a set of rules for deducing the entailments of a set of sentences.
We will look first at propositional logic (0-order logic) and then at first-order logic with equality.

In propositional logic symbols represent whole propositions (what we have been calling facts), e.g., D might have the interpretation, "the wumpus is dead". These symbols can be combined into more complicated sentences using boolean connectives.

First-order logic commits to the representation of the world in terms of objects and predicates on those objects as well as connectives (as in propositional logic) and quantifiers.

It is illuminating to consider logics in the light of their ontological and epistemological commitments. Ontological commitments have to do with the nature of reality. For example, propositional logic assumes that there are facts that either hold or do not hold in the world. First-order logic further assumes that the world consists of objects with certain relationships between them that either hold or do not hold. Special purpose logics make further commitments.

Epistemological commitments have to do with the possible states of knowledge an agent can have using various types of logics. In propositional and first-order logic, a sentence represents a fact and the agent either believes the sentence to be true, believes it to be false or cannot make a determination. Hence, these logics have three possible states of belief about any sentence. Systems using probability theory can have any degree of belief in a sentence, ranging from 0 (false) to 1 (true).

## Propositional Logic

Syntax

The symbols are True and False, propositional symbols such as P or Q, the logical connectives ∧, ∨, ¬, => and <=>, and parentheses. All sentences are made by putting these symbols together using the following rules:

• The logical constants True and False are sentences.
• A propositional symbol by itself is a sentence.
• Wrapping a sentence in parentheses yields a sentence.
• A sentence can be formed by combining simpler sentences using the logical connectives.
BNF for propositional logic:

<sentence> := <atomicsentence> | <complexsentence>
<atomicsentence> := True | False | P | Q | R | ...
<complexsentence> := (<sentence>) | <sentence> <connective> <sentence> | ¬<sentence>
<connective> := ∧ | ∨ | <=> | =>
Precedence of operators is: ¬, ∧, ∨, =>, <=>.

Semantics

A propositional symbol can be interpreted as any fact. The propositional constants have the obvious interpretations.

For a complex sentence, the meaning (as a function of the truth values of its constituent parts) is given by truth tables.

The truth table for => may be confusing at first. This is because, in English, "if P then Q" has a causal interpretation, i.e., P causes Q. This is not the case in propositional logic, where "if P then Q" means that when P is the case, I claim that Q is also the case; when P is not the case, I am making no claim.

Validity and inference

Truth tables can also be used to test the validity of sentences.

Example

Since we can check validity of a sentence using truth tables, we can check whether a conclusion follows from some premises by checking whether or not Premises => Conclusion is valid.
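As a sketch of this mechanical check, the following Python snippet (the function names are my own, not from the notes) enumerates every truth assignment and tests whether a sentence is valid:

```python
from itertools import product

def implies(p, q):
    # truth table for =>: false only when p is true and q is false
    return (not p) or q

def is_valid(sentence, n_symbols):
    # valid = true under every possible truth assignment
    return all(sentence(*values)
               for values in product([True, False], repeat=n_symbols))

# Modus ponens as Premises => Conclusion: (P and (P => Q)) => Q
print(is_valid(lambda p, q: implies(p and implies(p, q), q), 2))  # True
# P => Q alone is not valid: it fails when P is true and Q is false
print(is_valid(lambda p, q: implies(p, q), 2))                    # False
```

The loop visits all 2^n rows of the truth table, which is why this check is exponential in the number of proposition symbols.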

Models

Any world in which a sentence is true under a particular interpretation is called a model of that sentence under that interpretation.

The more we claim, the fewer models there will be of those claims, e.g., there are fewer models of "There is a pit in [1,2] and there is a Wumpus in [4,2]" than of "There is a pit in [1,2]."

Models are very important in logic because of the relationship between models and entailment. If KB ⊨ α, then every model of KB is also a model of α. Hence, whenever KB is true, so also is α.

We could define the meaning of a sentence by means of set operations on sets of models. For example, the set of models of P ∧ Q is the intersection of the models of P and the models of Q.

Rules of inference for propositional logic

The process by which the soundness of an inference rule is established through truth tables can be extended to entire classes of inferences. There are certain patterns of inferences that occur over and over, and their soundness can be shown once and for all. Then the pattern can be captured in what is called an inference rule. Once this has been done, the rule can be used to make inferences without going through the tedious process of building truth tables.

We have already seen the notation KB ⊢ α to say that α can be derived from KB by inference. An alternate notation emphasizes that this is not a sentence, but rather an inference rule:

```
    α
---------
    β
```
Here α and β match sentences.

Whenever something in the knowledge base matches the pattern above the line, the inference rule concludes the sentence below the line.

An inference rule is sound if the conclusion is true in all cases in which the premises are true.

Some example inference rules:

Modus Ponens

```
α => β,   α
-----------
     β
```
And-Elimination
```
α1 ∧ α2 ∧ ... ∧ αn
------------------
        αi
```
Or-Introduction
```
        α1
------------------
α1 ∨ α2 ∨ ... ∨ αn
```
Double-Negation Elimination
```
¬¬α
---
 α
```
Unit Resolution
```
α ∨ β,   ¬β
-----------
     α
```
Resolution
```
α ∨ β,   ¬β ∨ γ
---------------
    α ∨ γ
```
Complexity of Propositional Inference

Since there are 2^n rows in a truth table for a sentence with n proposition symbols, one might suspect that the complexity of propositional inference is exponential. One might wonder if there is a polynomial-time procedure for propositional inference. This very problem was the motivation for the class of NP-complete problems.

There is a useful class of sentences for which a polynomial-time inference procedure exists. This is the class of Horn sentences. These sentences have the form:

• P1 ∧ P2 ∧ ... ∧ Pn => Q
where each of the Pi and Q are nonnegated atoms.
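A minimal sketch of why Horn sentences admit efficient inference: forward chaining repeatedly fires any rule whose premises are all known, and each rule can fire at most once. The (premises, conclusion) representation below is my own illustration, not from the notes:

```python
def forward_chain(rules, facts, query):
    # rules: list of (premises, conclusion) pairs for P1 ^ ... ^ Pn => Q
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)   # fire the rule
                changed = True
    return query in known

rules = [(["P1", "P2"], "Q"),   # P1 ^ P2 => Q
         (["Q"], "R")]          # Q => R
print(forward_chain(rules, {"P1", "P2"}, "R"))  # True
print(forward_chain(rules, {"P1"}, "R"))        # False
```

Each pass over the rules either derives a new atom or terminates, so the total work is polynomial in the size of the knowledge base.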

## An Agent for the Wumpus World

Let S1,1 mean there is a stench in [1,1].

Let B1,1 mean there is a breeze in [1,1].

Let W1,1 mean there is a Wumpus in [1,1].

Let P1,1 mean there is a pit in [1,1].

What we know about the wumpus world is:
¬S1,1
¬S2,1
S1,2
¬B1,1
B2,1
¬B1,2
How can we express the rules about Wumpuses and Pits?

r1: ¬S1,1 => ¬W1,1 ∧ ¬W1,2 ∧ ¬W2,1

r2: ¬S2,1 => ¬W1,1 ∧ ¬W2,1 ∧ ¬W2,2 ∧ ¬W3,1

r3: S1,2 => W1,3 ∨ W1,2 ∨ W2,2 ∨ W1,1

Now to find the wumpus using the inference rules:

modus ponens with ¬S1,1 and r1 gives:

¬W1,1 ∧ ¬W1,2 ∧ ¬W2,1

and-elimination gives:

¬W1,1
¬W1,2
¬W2,1

modus ponens followed by and-elimination with ¬S2,1 and r2 gives:

¬W1,1
¬W2,1
¬W2,2
¬W3,1

modus ponens with S1,2 and r3 gives:

W1,3 ∨ W1,2 ∨ W2,2 ∨ W1,1

apply unit resolution where α is W1,3 ∨ W1,2 ∨ W2,2 and β is W1,1 (we know ¬W1,1); this gives:

W1,3 ∨ W1,2 ∨ W2,2

apply unit resolution where α is W1,3 ∨ W1,2 and β is W2,2 (we know ¬W2,2); this gives:

W1,3 ∨ W1,2

apply unit resolution where α is W1,3 and β is W1,2 (we know ¬W1,2); this gives:

W1,3

We have found the wumpus!!


# First-Order Logic

Propositional logic can only represent facts about the world.
First-order logic describes a world which consists of objects and properties (or predicates) of those objects.
Among objects, various relations hold, e.g., Parent(Martin, Zac).
A function is a relation in which there is only one value for a given input.

Examples
Objects: people, houses, numbers, planets,...
Relations: parent, brother-of, greater-than,...
Properties: red, round, prime,...
Functions: father-of, one-more-than

Examples:
"One plus one equals two."
"Squares neighbouring the Wumpus are smelly."
First-order logic is universal in the sense that it can express anything that can be programmed.

## Syntax and semantics

First-order logic has sentences just like propositional logic and, in addition, it has terms, which represent objects. Constant symbols, variables and function symbols are used to build terms; quantifiers and predicate symbols are used to build sentences.

<Sentence> := <AtomicSentence>
| <Sentence> <Connective> <Sentence>
| <Quantifier> <Variable>,... <Sentence>
| ¬<Sentence>
| (<Sentence>)

<AtomicSentence> := <Predicate>(<Term>,...)
| <Term> = <Term>

<Term> := <Function>(<Term>,...)
| <Constant>
| <Variable>

<Connective> := ∧ | ∨ | <=> | =>
<Quantifier> := ∀ | ∃
<Constant> := Martin | 59302 | Cat | X | ...
<Variable> := a | x | s | ...
<Predicate> := Before | Likes | Raining | Fails | ...
<Function> := Father | Hairof | 304gradefor | ...

BNF for first-order logic

Constant symbols: A, B, C, 1, John,...
Each constant symbol names exactly one object in the world; not all objects need to have names, and some can have more than one name.

Predicate symbols: Round, Brother,...
A predicate symbol refers to a particular relation in the model. For example, Brother; since Brother is a binary relation symbol, the relation it refers to must also be binary, i.e., it must hold or fail to hold between pairs of objects.

Function symbols: Cosine, Fatherof, Leftlegof
A functional relation relates an object to exactly one other object. The last element in the tuple is the value of the function for the other elements. e.g. Officeof(Martin,sc1.40)

Terms
A term is a logical expression that refers to an object. Constant symbols are terms. Terms can also be constructed from function symbols and constant symbols, e.g., Fatherof(John).

The formal semantics of terms is as follows: an interpretation specifies a functional relation referred to by the function symbol and objects referred to by the constant symbols. Hence, a function term refers to the (n+1)th object in a tuple whose first n elements are those objects referred to by the arguments of the function.

Atomic sentences
An atomic sentence is formed from a predicate symbol followed by a parenthesized list of terms; for example, Brother(Richard,John) states that the object referred to by Richard is the brother of the object referred to by John.
Atomic sentences can have arguments that are complex terms:

Married(Fatherof(Richard),Motherof(John))
Complex sentences
We can use logical connectives to construct more complex sentences. The semantics of these is the same as in propositional logic.

Examples
Brother(Richard,John) ∧ Brother(John,Richard) is true just in case John is the brother of Richard and Richard is the brother of John.
Older(John,30) ∨ Younger(John,30) is true when John is older than 30 or he is younger than 30.

Quantifiers
We now have a logic that allows objects; quantifiers let us express properties of entire collections of objects.
There are two quantifiers in first-order logic: universal and existential.

Universal quantification (∀)
Using this we can say things like, "All cats are mammals."

∀x Cat(x) => Mammal(x)

Hence, a sentence ∀x P(x) is true in a model if and only if P is true of every object in the model.

Notice the difference between
∀x Cat(x) => Mammal(x)   and   ∀x Cat(x) ∧ Mammal(x)
Universal statements are true if they are true for every individual in the world. They can be thought of as an infinite conjunction.

Existential Quantification (∃)
Where universal quantification makes statements about every object, existential quantification makes statements about some objects. To say, for example, that Spot has a sister that is a cat, we write

∃x Sister(x,Spot) ∧ Cat(x)
∃x P is true if P is true for some object in the world. It can be thought of as an infinite disjunction.

Nested quantifiers
Very complex statements can be made if one nests quantifiers. Starting simply, without mixing types of quantifiers, we can say things like

∀x,y Parent(x,y) => Child(y,x)
We can also mix quantifiers, as in
• ∀x ∃y Goodat(x,y)
"Everybody is good at something"

Connections between ∀ and ∃
There is an intimate connection between the two quantifiers. To see this, consider the sentence

• ∀x ¬Likes(x,Deceitful Politicians)
"For all x, x does not like Deceitful Politicians."
Another way to say this is, "There does not exist an x that likes Deceitful Politicians":
• ¬∃x Likes(x,Deceitful Politicians)
This is true in general because ∀ is a conjunction over all objects and ∃ is a disjunction over all objects. In fact, all of the following are also true:
```
∀x ¬P <=> ¬∃x P         ¬(P∧Q) <=> ¬P∨¬Q
¬∀x P <=> ∃x ¬P         ¬(P∨Q) <=> ¬P∧¬Q
∀x P  <=> ¬∃x ¬P          P∧Q  <=> ¬(¬P∨¬Q)
∃x P  <=> ¬∀x ¬P          P∨Q  <=> ¬(¬P∧¬Q)
```
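Over a finite domain these equivalences can be checked directly, since ∀ becomes a finite conjunction (Python's all) and ∃ a finite disjunction (any). A small illustrative check, with a domain and property P of my own choosing:

```python
domain = range(-3, 4)
P = lambda x: x >= 0   # an arbitrary property, for illustration only

# forall x not P(x)  <=>  not exists x P(x)
assert all(not P(x) for x in domain) == (not any(P(x) for x in domain))
# not forall x P(x)  <=>  exists x not P(x)
assert (not all(P(x) for x in domain)) == any(not P(x) for x in domain)
print("both equivalences hold on this domain")
```

The same pattern verifies the other two quantifier equivalences; a check on one finite domain is an illustration, not a proof for all models.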
Equality
Often the equality symbol is included as a special symbol. This is because the notion of equality is so important in our thinking. With this symbol, we can write things like Father(John)=Bill, whose intended meaning is that the object that is the father of John is the same object as Bill.
Equality can also be thought of as an ordinary binary relation symbol, so the interpretation of = is a set of pairs.
Equality can be used to say that there are two or more individuals with a particular property:
• ∃x,y Sister(Spot,x) ∧ Sister(Spot,y) ∧ ¬(x=y)
"There is an x and a y that are sisters of Spot and they are not the same individual."
The equality symbol can also be used to restrict the number of objects that have a certain property, for example,
• ∀x,y P(x) ∧ P(y) => x=y
"Every pair of objects with property P are equal." This statement restricts there to being at most one object with property P.
A shorthand that is often used is ∃!x King(x), which means
• ∃x King(x) ∧ ∀y King(y) => x=y

## Using First-Order Logic

In knowledge representation, a domain is a section of the world about which we wish to express some knowledge.
Here we give some simple examples of using FOL to encode domains.

The Kinship domain

Examples
∀m,c Mother(c)=m <=> Female(m) ∧ Parent(m,c)
∀w,h Husband(h,w) <=> Male(h) ∧ Spouse(h,w)
∀x Male(x) <=> ¬Female(x)
∀g,c Grandparent(g,c) <=> ∃p Parent(g,p) ∧ Parent(p,c)
∀x,y Sibling(x,y) <=> x≠y ∧ ∃p Parent(p,x) ∧ Parent(p,y)

Axioms, Definitions and Theorems
Axioms capture the basic facts about a domain, e.g., the examples above are axioms. The axioms are then used to prove theorems.
We can have too few or too many axioms. If we have too few, then we cannot prove theorems that we expect should actually follow in a domain. If we have too many axioms, then some axioms follow from (combinations of) others. An axiom is said to be independent of a set of axioms if it does not follow from that set.
None of the primitive predicates (or functions) are defined; instead, we give partial specifications of their properties.

To add sentences to the knowledge base, we call TELL, e.g.,
TELL(KB, ∀m,c Mother(c)=m <=> Female(m) ∧ Parent(m,c))
This goes both for axioms (as above) and specific facts about a particular situation such as
TELL(KB, Female(Maxi) ∧ Parent(Maxi,Spot) ∧ Parent(Spot,Boots))
With these facts added, we can query the knowledge base by calling ASK, e.g., ASK(KB, ∃x Parent(Spot,x)).
Here we do not want simply a yes/no answer. In addition, we would like to know a term for x denoting an object in the domain. In general, for a query involving existentially quantified variables, we want to know bindings for those variables. Hence, ASK returns a binding list, e.g., {x/Boots}.

## Logical Agents for the Wumpus World

We will consider three agent architectures for the Wumpus World: reflex agents that merely classify their percepts and act on that classification, model-based agents that construct an internal representation of the world and use it to act, and goal-based agents that form goals and try to achieve them.

The first step is to define the interface between the agent and the world. The percept sequence must contain both the percepts and the time at which they occurred, otherwise the agent will not be able to tell when things were observed. We will use integers for time steps so a typical percept sentence would be

• Percept([Stench,Breeze,Glitter,None, None],5)
The agent's action must be one of:
Turn(Right), Turn(Left), Forward, Shoot, Grab, Release, Climb
To determine which one is best, we create a query such as ∃a Action(a,5). If we set things up right, such a query will return a binding list such as {a/Grab}.

## A Simple Reflex Agent

The simplest type of rule connects actions directly with percepts. These rules resemble reflexes. For example, if the agent sees a glitter, it should perform a Grab:
• ∀s,b,u,c,t Percept([s,b,Glitter,u,c],t) => Action(Grab,t)
The connection between percepts and actions can be mediated by rules for perception, which turn the immediate perceptual input into more useful forms:
• ∀b,g,u,c,t Percept([Stench,b,g,u,c],t) => Stench(t)
• ∀s,b,u,c,t Percept([s,b,Glitter,u,c],t) => AtGold(t)
Then a connection can be made from these predicates to action choices with separate rules:
• ∀t AtGold(t) => Action(Grab,t)
Limitations of simple reflex agents
Simple reflex agents react only to the current percept. Therefore, they cannot know what things were discovered in previous steps. Hence, they are doomed to a life of wandering aimlessly.

## Representing Change in the World

We would like to have rules that refer to previous times. The best way to do this is to maintain an internal model of the world.
The most straightforward way to deal with change is simply to change the knowledge base, erasing and adding sentences as the agent explores its environment. This approach solves some of the problems we saw with the propositional agent. However, it does not allow the agent to keep track of the past.

Situation Calculus
Situation Calculus is the name for a particular way of describing change in FOL. It conceives of the world as a sequence of situations, each of which is a snapshot of the state of the world. Situations are generated from previous situations by actions.
Every relation whose truth can change over time is handled by giving an extra situation argument to the corresponding predicate symbol. By convention, we make the situation argument always the last. Hence, instead of At(Agent,Location), we would have At(Agent,[1,1],S0) ∧ At(Agent,[1,2],S1).
Relations or properties that do not change over time need not have the extra argument, e.g., WallAt([0,1]).
The next step is to represent how the world changes from one situation to the next. One way to do this is to have a function Result that maps actions and situations to new situations. Then we would have:

• Result(Forward,S0)=S1
• Result(Turn(Right),S1)=S2
• Result(Forward,S2)=S3

Actions are described by stating their effects. We specify the properties of the situation that results from doing the action.
• Portable(Gold)
• ∀s AtGold(s) => Present(Gold,s)
• ∀x,s Present(x,s) ∧ Portable(x) => Holding(x,Result(Grab,s))
Or
• ∀x,s ¬Holding(x,Result(Release,s))
These axioms are called effect axioms. Alone they are not sufficient to keep track of whether the agent is holding the gold. We also need to say that if the agent is holding something in a state then it is still holding it in the state that results from an action other than Release:
• ∀a,x,s Holding(x,s) ∧ a≠Release => Holding(x,Result(a,s))
• ∀a,x,s ¬Holding(x,s) ∧ (a≠Grab ∨ ¬(Present(x,s) ∧ Portable(x))) => ¬Holding(x,Result(a,s))
Axioms like these are called frame axioms. The combination of effect and frame axioms provide a complete description of how the world changes in response to an agent's actions.

A more elegant representation combines the effects and frame axioms into a single axiom that describes how to compute the next time step. These axioms have the structure

• true afterwards <=> [an action made it true ∨ (true already ∧ no action made it false)]
The use of <=> ensures that a predicate will be true if it is made true or stays true and that it will be false in all other cases.

∀a,x,s Holding(x,Result(a,s)) <=> [(a=Grab ∧ Present(x,s) ∧ Portable(x)) ∨ (Holding(x,s) ∧ a≠Release)]

This type of axiom is called a successor-state axiom. One such axiom is needed for each predicate that may change its value over time. The axiom must list all the ways in which the predicate can become true and all the ways in which it can become false.
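The successor-state axiom for Holding can be read as a boolean update function: given the action and the facts of the current situation, it computes whether Holding is true in the resulting situation. A sketch (the function and argument names are mine, not from the notes):

```python
def holding_next(action, holding, present, portable):
    # Holding(x, Result(a, s)) <=> (a = Grab ^ Present(x,s) ^ Portable(x))
    #                              v (Holding(x,s) ^ a != Release)
    return (action == "Grab" and present and portable) or \
           (holding and action != "Release")

print(holding_next("Grab", False, True, True))     # True: picked up
print(holding_next("Forward", True, False, True))  # True: still holding
print(holding_next("Release", True, False, True))  # False: dropped
```

Because the axiom is a biconditional, the function determines Holding in the next situation in every case, covering both the effect and the frame conditions.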

Keeping track of location

In the Wumpus world, the agent cannot directly perceive its location, so it needs to keep track of where it has been. It's also a good idea for it to keep track of what it has seen in previous locations. The agent needs to know its location and orientation:

• At(Agent,[1,1],S0)
• Orientation(Agent,S0)=0 (an angle in degrees, with 0 along the x axis)
The agent also needs to know how the locations are arranged. Its "map" consists of the values of the function LocationToward, which takes a location and a direction and gives the location that is one step forward in the given direction:
• ∀x,y LocationToward([x,y],0)=[x+1,y], and so on.
From the map, it is possible to tell which square is directly ahead of the agent.
The agent also needs to know whatever is known about the contents of the locations. Assume that the perimeter locations contain walls and the rest do not.
• ∀x,y Wall([x,y]) <=> (x=0 ∨ x=5 ∨ y=0 ∨ y=5)
The agent needs to know what the actions do to locations:
• ∀a,l,p,s At(p,l,Result(a,s)) <=> [(a=Forward ∧ l=LocationAhead(p,s) ∧ ¬Wall(l)) ∨ (At(p,l,s) ∧ a≠Forward)]
Finally the agent needs to know what the actions do to orientations.

• ∀a,d,p,s Orientation(p,Result(a,s))=d <=> [(a=Turn(Right) ∧ d=Mod(Orientation(p,s)-90,360)) ∨ (a=Turn(Left) ∧ d=Mod(Orientation(p,s)+90,360)) ∨ (Orientation(p,s)=d ∧ ¬(a=Turn(Right) ∨ a=Turn(Left)))]
In addition to keeping track of location and the gold, the agent should also keep track of whether the Wumpus is alive or dead. We also need to describe shoot.

## Deducing Hidden Properties of the World

From the perceptual information we obtain in situations, we can infer properties of locations.
• ∀l,s At(Agent,l,s) ∧ Breeze(s) => Breezy(l)
• ∀l,s At(Agent,l,s) ∧ Stench(s) => Smelly(l)
Neither Breezy nor Smelly need situation arguments because pits and Wumpuses do not move around.

We need to write some rules that relate various aspects of single world states (as opposed to across states). There are two main kinds of such rules:

• Causal rules reflect the assumed direction of causality in the world, e.g., ∀l1,l2,s At(Wumpus,l1,s) ∧ Adjacent(l1,l2) => Smelly(l2).
• Systems that reason with causal rules are called model-based reasoning systems.
• Diagnostic rules infer the presence of hidden properties directly from the percept-derived information. We have already seen two diagnostic rules:
• ∀l,s At(Agent,l,s) ∧ Breeze(s) => Breezy(l)
• ∀l,s At(Agent,l,s) ∧ Stench(s) => Smelly(l)
Why do we need both causal and diagnostic rules? It would seem that diagnostic rules are enough. However, it is very tricky to ensure that they derive the strongest possible conclusions from the available information. For example, the absence of stench or breeze implies that adjacent squares are OK, but sometimes a square can be OK even when smells and breezes abound. The model-based rule
• ∀x,t (¬At(Wumpus,x,t) ∧ ¬Pit(x)) <=> OK(x)
is probably the best way to represent safety.

The important thing to remember about all this is that if the axioms correctly and completely describe the way the world works and the way percepts are produced, the inference procedure will correctly infer the strongest possible description of the world state given the available percepts.

## Preference Among Actions

A problem with the Wumpus world knowledge base that we have built so far is that it is difficult to decide which action is best among a number of possibilities. For example, to decide between a forward and a grab, axioms describing when it is OK to move to a square would have to mention glitter. This is not modular. We can solve this problem by separating facts about actions from facts about goals. This way our agent can be reprogrammed just by asking it to achieve different goals.

The first step is to describe the desirability of actions independent of each other. In doing this we will use a simple scale: actions can be Great, Good, Medium, Risky, or Deadly. Obviously, the agent should always do the best action it can find:

• ∀a,s Great(a,s) => Action(a,s)
• ∀a,s Good(a,s) ∧ ¬(∃b Great(b,s)) => Action(a,s)
• ∀a,s Medium(a,s) ∧ ¬(∃b Great(b,s) ∨ Good(b,s)) => Action(a,s)
• ...
We use this action quality scale in the following way. Up to the point where it finds the gold, the basic strategy for our agent will be:
• Great actions include picking up the gold when found and climbing out of the cave with the gold.
• Good actions include moving to a square that's OK and hasn't been visited yet.
• Medium actions include moving to a square that is OK and has already been visited.
• Risky actions include moving to a square that is not known to be deadly or OK.
• Deadly actions are moving into a square that is known to have a pit or a Wumpus.
Once the gold is found the policy must change. At this point the agent should have the goal of safely reaching square [1,1] as quickly as possible.
• ∀s Holding(Gold,s) => GoalLocation([1,1],s)
The presence of an explicit goal allows the agent to work out a sequence of actions that will achieve the goal. There are at least three ways to find such a sequence:
• Inference: this is what we are currently doing.
• Search: We can use best-first search to find a path to the goal
• Planning: This involves the use of special-purpose reasoning engines that reason about action.

# Pattern Recognition

Classification

Feature Extraction:

• Convert a raw pattern into a feature vector.
• Reduce redundancy in the pattern.
• e.g. convert image to line drawing.
Classification
• Assign the feature vector a class
• Pattern space (or feature space) must be partitioned through training.
Training and Testing
• The system is trained using a finite set of patterns- the training set.
• If the correct classification for these patterns is known then this is Supervised Learning, otherwise it is Unsupervised Learning.
• The system is evaluated using a different set of patterns - the test set.
Over-training and Under-training.

It is easy to learn about the training set: just remember all the patterns. Unfortunately, this will be of little use for classifying the test set - this is over-training. Similarly, if the system knows too little about the training set then it will classify badly.

The system must not overgeneralise or undergeneralise about patterns.

Bayesian Classification

• The probability of event A:
• Pr(A)
• The joint probability of A and B:
• Pr(A, B)
• Conditional probability of B given A:
• Pr(B | A)=Pr(A,B)/Pr(A)
For independent events:
Pr(B | A) = Pr(B)
i.e. Pr(A, B)=Pr(A)Pr(B)

### Bayes Rule

Pr(A, B) = Pr(B, A)
Pr(B | A) Pr(A) = Pr(A | B) Pr(B)
so
Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
For pattern classification:

if x is the feature vector and wi is class i.

Pr(wi) is the prior probability of the pattern being in class i.

p(x | wi) is the probability density function for the feature vector given class i.
p(x) = sumj(p(x | wj) Pr(wj))

From Bayes Rule:

Pr(wi | x) = p(x | wi) Pr(wi) / p(x)

Pr(wi | x) = p(x | wi) Pr(wi) / sumj(p(x | wj) Pr(wj))

where:
Pr(wi | x) = posterior probability for class i.
The optimal decision is the i for which Pr(wi | x) is largest.
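As a numerical sketch of the decision rule (the prior and likelihood values below are made up for illustration):

```python
def posteriors(priors, likelihoods):
    # priors[i] = Pr(wi); likelihoods[i] = p(x | wi) at the observed x
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

priors = [0.7, 0.3]
likelihoods = [0.2, 0.6]
post = posteriors(priors, likelihoods)
best = max(range(len(post)), key=lambda i: post[i])
print(best)  # 1: here the larger likelihood outweighs the larger prior
```

Note that the evidence p(x) is the same for every class, so comparing p(x | wi) Pr(wi) directly gives the same decision.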

Bayesian theory gives us the optimum decision boundary but in practice we don't have enough information to calculate this boundary.

### Nearest Neighbour Classification

A pattern is classified as belonging to the class of the training pattern that is closest to it. To measure closeness use a distance metric.

For a feature vector x = {x1, x2, x3,.....,xn}

and a training pattern t = {t1, t2, t3,.....,tn}

1. Euclidean distance: D² = Sumi((xi - ti)²)

2. Dot product distance: D = Sumi(xi * ti)

3. Angle between vectors: D = Sumi(xi * ti) / (|x| * |t|)
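The three metrics can be written directly (a sketch; note that the dot product and angle measures are similarities, where larger means closer, unlike the Euclidean distance):

```python
import math

def euclidean_sq(x, t):
    # D^2 = sum_i (x_i - t_i)^2
    return sum((xi - ti) ** 2 for xi, ti in zip(x, t))

def dot(x, t):
    # D = sum_i x_i * t_i  (a similarity: larger means closer)
    return sum(xi * ti for xi, ti in zip(x, t))

def cos_angle(x, t):
    # cosine of the angle between the two vectors
    return dot(x, t) / (math.sqrt(dot(x, x)) * math.sqrt(dot(t, t)))

x, t = [1.0, 2.0], [2.0, 2.0]
print(euclidean_sq(x, t))                   # 1.0
print(round(cos_angle([1, 0], [0, 1]), 6))  # 0.0 (orthogonal vectors)
```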

### Efficiency

• Training is trivial: just store all the patterns. This may require a lot of storage.
• Classification may be time consuming since all stored patterns must be compared.

### Optimality

• Nearest neighbour is not optimal. Making simple assumptions, it can be shown that its error probability is bounded by twice the optimal Bayesian error.
Nearest neighbour classification is prone to errors due to rogue patterns. A rogue pattern is a mislabelled pattern.

## K Nearest Neighbour

To eliminate the problem caused by rogue patterns use not just the nearest neighbour but a group of them.

Using K neighbours, take a majority vote of their classes to give a classification.
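A minimal sketch of K-nearest-neighbour classification using a squared-Euclidean metric and a majority vote (the data and names are invented for illustration):

```python
from collections import Counter

def knn_classify(training, x, k):
    # training: list of (feature_vector, label) pairs
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(training, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority class

training = [([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"),
            ([5, 5], "B"), ([5, 6], "B")]
print(knn_classify(training, [1, 1], 3))  # A
print(knn_classify(training, [5, 5], 3))  # B
```

With k = 1 this reduces to plain nearest neighbour; an odd k avoids ties in two-class problems.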

### Optimality

• As K gets larger this method approaches the optimal decision.

### Speeding up nearest neighbour methods.

The biggest problem with this method is the time it takes to calculate the distances to the training examples.
Possible solutions are:
• Only store examples near the decision boundary.
• Use a precomputed search tree and branch-and-bound to search for the nearest neighbour.

# Lecture 6

## Neural Networks

Consider the plane shown above; this can be used to make decisions about patterns in the x-y plane. For every point in the x-y plane there is a corresponding value for z such that (x,y,z) lies on the angled plane. If z is negative then the point lies above the dotted line, otherwise it is below the dotted line.

The equation for z is

z=wx+vy+c

We can generalise this for more dimensions

o=w1*i1+w2*i2+w3*i3....

o=sum[j=1..n](wj * ij)

o is the output, w are the weights and i is the input.

This can be drawn as shown below.

Generally we don't care about the size of the output; it will be a yes/no answer for a class, so use the following:

y = f(sumi(wi*xi) + theta)

The function in the box is called the sigmoid function and is:

f(x) = 1/(1+exp(-ax))

If t is the expected output then there will be an error between t and o. To train the system we need to minimise this error. Over a set of training patterns there is a single overall error - the sum of the individual errors.

E = (sumk((tk - yk)²))/2

E = (sumk((tk - f(sumi(wi*xik) + theta))²))/2

To train the classifier, just change each w so that E becomes smaller.

Now

dE/dwi = -sumk((tk-yk) * dyk/dwi)

dE/dwi = -sumk((tk-yk) * f '(sumj(wj*xjk)+theta) * xik)

dE/dwi = -sumk(dk*xik)

Where

dk = (tk-yk) * f '(sumj(wj*xjk)+theta)

f(x) = 1/(1+exp(-ax))

therefore, taking a = 1,

f '(x) = f(x)(1-f(x))

so

dk = (tk-yk)*yk(1-yk)

Weight updating rule:

dwi = n sumk(dkxik)

where n is a parameter used to change the speed of gradient descent. Note that we want to reduce the error so the minus sign disappears.

For theta:

dE/dtheta=-sumk(dk)

therefore

dtheta = n sumk(dk)

This device is sometimes called a perceptron. The training rule is called the delta rule.

### Perceptron Learning Algorithm

```
repeat {
    set dwi = 0 for i = 1..n; set dtheta = 0;
    for every pattern xk do {
        calculate yk;
        calculate dk;
        add n*dk*xik to dwi for i = 1..n;
        add n*dk to dtheta;
    }
    add dwi to wi for i = 1..n;
    add dtheta to theta;
} until E becomes small or changes little.
```
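The delta rule can be sketched as a runnable program. The following Python sketch trains a single sigmoid unit on the (linearly separable) AND function; the learning rate and number of epochs are arbitrary choices, not from the notes:

```python
import math

def f(x):
    # sigmoid with a = 1
    return 1.0 / (1.0 + math.exp(-x))

# Training set for logical AND: ((x1, x2), target)
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]   # weights
theta = 0.0      # threshold term
n = 0.5          # learning rate

for epoch in range(10000):
    dw = [0.0, 0.0]
    dtheta = 0.0
    for x, t in patterns:                  # batch: accumulate over all patterns
        y = f(w[0]*x[0] + w[1]*x[1] + theta)
        d = (t - y) * y * (1 - y)          # dk = (tk - yk) yk (1 - yk)
        dw[0] += n * d * x[0]
        dw[1] += n * d * x[1]
        dtheta += n * d
    w[0] += dw[0]
    w[1] += dw[1]
    theta += dtheta

for x, t in patterns:
    y = f(w[0]*x[0] + w[1]*x[1] + theta)
    print(x, t, round(y, 2))
```

After training, the outputs fall on the correct side of 0.5 for all four patterns.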

## The Multilayer Perceptron

This perceptron can only learn simple problems.

# Lecture 7

## The Multilayer Perceptron

The perceptron can only learn simple problems. It can place a hyperplane in pattern space and move the plane until the error is reduced. Unfortunately this is only useful if the problem is linearly separable.

A linearly separable problem is one in which the classes can be separated by a single hyperplane.

It is often the case that a problem is not linearly separable. To solve these we use a Multi-Layer Perceptron (MLP) where one layer feeds into the next. Consider the following problem.

## The XOR problem

This is the simplest problem that cannot be solved by a perceptron.

For two inputs x1 and x2, the output is the exclusive OR of the inputs.

The pattern space for this problem looks like this:

This cannot be solved using a single line, the solution uses two lines:

A two layer Multi-Layer Perceptron to solve this problem looks like this:

The shape of regions in pattern space that can be separated by a Multi-Layer Perceptron is shown in the following table.

We can see that a three layer MLP can learn arbitrary regions while a two layer MLP can learn convex regions. (A region is convex if the straight line joining any two points in the region lies entirely inside the region; if such a line can pass out of the region, the region is not convex.)
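A two layer MLP solving XOR can be demonstrated with hand-picked weights (an illustrative sketch, not from the notes: the weights are one choice among many, and threshold units are used instead of sigmoids for clarity). Each hidden unit implements one of the two lines:

```python
def step(x):
    # threshold unit: fires when its weighted sum is non-negative
    return 1 if x >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: each unit places one line in the pattern space.
    h1 = step(x1 + x2 - 0.5)   # x1 + x2 >= 0.5 : at least one input is on
    h2 = step(x1 + x2 - 1.5)   # x1 + x2 >= 1.5 : both inputs are on
    # Output layer: fire between the two lines (h1 AND NOT h2).
    return step(h1 - h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, '->', xor_mlp(a, b))
```

The output unit fires only for points that lie between the two hidden-unit lines, which is exactly the XOR region.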

## Training the MLP

Last time we saw that the delta rule can be used to train a perceptron. When training the MLP, the error (delta) must be propagated back through the layers. This is called error back-propagation, or just backpropagation.

The following procedure can be used to train a backpropagation network.

```
t is the target
units[l] is the number of units in layer l
n[l][i] is unit i in layer l
n[l][i].output is the output
n[l][i].delta is the delta
n[l][i].weight[j] is weight j
ek is the learning constant
```
```
adapt() {
    int i, j, k, l;

    /* propagate the deltas backwards, starting at the output layer */
    for (l = layers-1; l >= 0; l--)
        for (i = 0; i < units[l]; i++)
            if (l == layers-1)                 /* output layer */
                n[l][i].delta =
                    ek * n[l][i].output *
                    (1.0 - n[l][i].output) *
                    (t[i] - n[l][i].output);
            else {                             /* hidden layers */
                n[l][i].delta = 0.0;
                for (k = 0; k < units[l+1]; k++)   /* sum over the layer above */
                    n[l][i].delta +=
                        n[l+1][k].delta *
                        n[l+1][k].weight[i];
                n[l][i].delta = n[l][i].delta *
                    ek * n[l][i].output *
                    (1.0 - n[l][i].output);
            }

    /* update the weights using the deltas */
    for (l = layers-1; l >= 1; l--)
        for (i = 0; i < units[l]; i++)
            for (j = 0; j < weights; j++)
                n[l][i].weight[j] +=
                    n[l-1][j].output * n[l][i].delta;

    /* the first layer takes its inputs from the input array */
    for (i = 0; i < units[0]; i++)
        for (j = 0; j < weights; j++)
            n[0][i].weight[j] +=
                input[j] * n[0][i].delta;
}
```
When this algorithm is applied to the XOR problem we get the following output.

```
iteration no 0, inputs 0 1, target 1, output 0.477995
iteration no 20, inputs 0 0, target 1, output 0.447816
iteration no 40, inputs 1 0, target 0, output 0.450292
iteration no 60, inputs 0 0, target 1, output 0.549096
iteration no 80, inputs 1 0, target 0, output 0.460706
iteration no 100, inputs 0 0, target 1, output 0.507636
iteration no 120, inputs 0 1, target 1, output 0.571619
iteration no 140, inputs 1 0, target 0, output 0.451493
iteration no 160, inputs 0 1, target 1, output 0.570574
iteration no 180, inputs 0 0, target 1, output 0.575979
iteration no 200, inputs 0 1, target 1, output 0.744079
iteration no 220, inputs 1 0, target 0, output 0.233541
iteration no 240, inputs 0 1, target 1, output 0.755600
iteration no 260, inputs 1 1, target 0, output 0.185273
iteration no 280, inputs 0 1, target 1, output 0.788309
iteration no 300, inputs 1 1, target 0, output 0.167068
iteration no 320, inputs 1 0, target 0, output 0.123461
iteration no 340, inputs 1 1, target 0, output 0.132892
iteration no 360, inputs 1 1, target 0, output 0.133583
iteration no 380, inputs 1 1, target 0, output 0.116641
iteration no 400, inputs 1 0, target 0, output 0.088269
iteration no 420, inputs 0 0, target 1, output 0.861810
iteration no 440, inputs 1 1, target 0, output 0.102406
iteration no 460, inputs 1 0, target 0, output 0.080179
iteration no 480, inputs 1 0, target 0, output 0.075584
iteration no 500, inputs 0 0, target 1, output 0.884442
iteration no 520, inputs 0 0, target 1, output 0.892789
iteration no 540, inputs 0 1, target 1, output 0.923969
iteration no 560, inputs 1 0, target 0, output 0.064146
iteration no 580, inputs 1 1, target 0, output 0.071938
iteration no 600, inputs 1 1, target 0, output 0.075764
iteration no 620, inputs 1 1, target 0, output 0.074536
iteration no 640, inputs 1 1, target 0, output 0.069014
iteration no 660, inputs 1 1, target 0, output 0.066534
iteration no 680, inputs 0 0, target 1, output 0.918422
iteration no 700, inputs 0 0, target 1, output 0.924860
iteration no 720, inputs 1 1, target 0, output 0.065864
iteration no 740, inputs 1 0, target 0, output 0.052634
iteration no 760, inputs 0 0, target 1, output 0.927081
iteration no 780, inputs 1 0, target 0, output 0.050964
iteration no 800, inputs 0 1, target 1, output 0.948869
iteration no 820, inputs 1 0, target 0, output 0.049082
iteration no 840, inputs 1 0, target 0, output 0.048074
iteration no 860, inputs 1 1, target 0, output 0.057916
iteration no 880, inputs 1 1, target 0, output 0.056088
iteration no 900, inputs 0 1, target 1, output 0.954659
iteration no 920, inputs 1 1, target 0, output 0.057337
iteration no 940, inputs 0 0, target 1, output 0.944243
iteration no 960, inputs 1 0, target 0, output 0.045653
iteration no 980, inputs 0 0, target 1, output 0.946199
```

## Example: NETtalk

The problem

Convert English text into the vowel and consonant sounds (phonemes) that make up speech. For example, consider the following words: Albany, Albania, misled and bedraggled.

A traditional rule-based approach would be to:

1. Create if-then rules to encode all the regularities in the language.
2. Maintain a database of exceptions to these rules.
3. Build a production system to resolve conflicts.
For example, a ``c'' can either be pronounced like an ``s'' as in center, icy, and city or a ``k'' as in cat and crumb. Here is a rule that might encode this difference:

```
if a ``c'' appears before an ``i'', ``e'', or ``y''
then pronounce it like an ``s''
else pronounce it like a ``k''

Exception: celtic
```

## Neural Network Approach

Allow a system to learn how to pronounce words by giving it lots of examples and correcting its mistakes.

1. Choose an architecture.
2. Choose a representation for the input and output.
3. Determine a training set. 16,000 words were chosen randomly from the 20,000 words in Webster's Dictionary.
4. Train the system. Give the network each word in the training set several times, and indicate the errors for learning.
5. Test the system. The network was able to respond correctly to 90% of the 4,000 remaining words from the dictionary.

Advantages of the neural network approach:

• There may exist non-intuitive relationships between the problem's parameters that are easy for the network learning procedure to exploit but hard for a programmer to recognise.
• A network can consider a much larger number of inputs and outputs (ranging into the thousands) than is computationally feasible for most traditional methods.
• A network solution may often require less hardware and less software, leading to lower costs and faster design cycles.

Steps in the training of NETtalk:

• Babbling
• Consonant and vowel sounds
• Pauses
• Words
• Phrases
• Full sentences

# Lecture 8

## Unsupervised Learning

So far we have always known the correct classification for every member of the training set; this is not always the case.

Patterns are often created by processes we are not able to describe well. In these cases we can try to build a classifier based on inherent properties of the patterns.

This is Unsupervised Learning

example:

If we take speech signals and transform them to the frequency domain we get a speech spectrogram.

The spectrogram is a two dimensional signal showing how the frequencies change over time.

The spectrogram above shows the phrase "In nineteen ninety two".

For human speech we find that the signal consists of two or three main bands called formants. We could try to classify speech by using the positions of the formants (this is not easy because there is wide variation in the way the formants change for a particular sound).

Since there is some inherent structure in the signal we can use unsupervised learning to reduce the quantity of information in the formants.

We need to construct a classifier that minimises not the error between output and target, but the error in the fit of a particular model to the data.

If we use a very complex model it will be able to fit the data quite well.

An example model is a set of points in the pattern space, each point represents the centre of a cluster. An algorithm must be devised to move the points so that an error is reduced.

### The K Means Algorithm

Assume that there are K clusters in the data, i.e. K classes.

1. Choose small random values for each cluster mean:
• m1(0), m2(0), m3(0) .... mK(0) where mi(0) = (x1, x2, .... xn)
2. Use the training set to assign a cluster to each training pattern. A training pattern y is in cluster i at time t, Ci(t), if
• For all j, d(y, mi(t)) = min(d(y, mj(t)))
where d is some distance metric, i.e. y is in cluster i if i has the closest mean.
3. Calculate new values for m1(t+1) .... mK(t+1):
• mi(t+1) is the value that minimises Sum(d(y, mi(t+1))) over all y in Ci(t)
For the Euclidean distance this is just the mean of all the patterns assigned to class i.
4. Repeat from 2 until
• For all i, |mi(t+1) - mi(t)| < e
i.e. the clusters are not moving much.
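The algorithm can be sketched as follows (a minimal Python sketch; the random initialisation, stopping tolerance and test data are arbitrary choices, not from the notes):

```python
import random

def kmeans(patterns, K, eps=1e-6, seed=0):
    """Cluster `patterns` into K clusters; returns the cluster means."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(patterns, K)]   # step 1: initial means
    while True:
        # step 2: assign each pattern to the cluster with the closest mean
        clusters = [[] for _ in range(K)]
        for p in patterns:
            i = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        # step 3: the new mean is the average of the patterns in each cluster
        new_means = [[sum(xs) / len(xs) for xs in zip(*c)] if c else m
                     for c, m in zip(clusters, means)]
        # step 4: stop when the means are no longer moving much
        if all(sum((a - b) ** 2 for a, b in zip(m, nm)) < eps
               for m, nm in zip(means, new_means)):
            return new_means
        means = new_means

# Two well separated clusters:
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(sorted(kmeans(pts, 2)))
```

On this data the means converge to the centroids of the two clusters, (1/3, 1/3) and (28/3, 28/3), whichever patterns are chosen as the initial means.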
Clustering is applicable to problems where
• We have no labels for classes or we want to find inherent properties of data.
• We know the number of classes.
• We know very little about the process which has created the patterns.

### ISODATA

ISODATA is an algorithm based on K means that does not need to know the number of clusters. It starts with one cluster and splits/merges clusters according to certain parameters.
1. Minimum distance between two cluster centres below which they are merged.
2. Maximum standard deviation for a cluster above which it is split into two clusters.
3. Minimum proportion of training patterns in a cluster.
Amongst many others.

To modify the K means algorithm we need to add an extra step which can split or merge clusters.

Clustering algorithms perform dimensionality reduction: they take a high dimensional pattern space and produce a lower dimensional space. Hopefully the lower dimensional space still contains most of the information from the original space.

Image Processing

# Introduction

## What is an Image?

• A Two Dimensional Data Structure made up from picture elements (pixels).

## What is Image Processing?

• Performing operations on images.

## Image Characteristics

• Large amount of data (~1 million numbers for a high quality picture).
• Pixels are related to their neighbours.
• Most of the data is redundant (i.e. it contains little information).

## Image Processing Characteristics

• Many Simple Operations.
• It is only recently that sufficient processing power has become available; users may now perform image processing on relatively small computers.
• A working model is available:
The Human Visual System.

# Image Coding

Image coding reduces the storage requirement for an image by removing redundant information. This may be done in two ways.

## Exact Coding

• The image may be reconstructed exactly, using standard lossless coding techniques.
Compression of 2:1 is typically achieved.

## Fidelity Coding

• The image may be reconstructed so that a human observer notices little or no degradation.
Compression of 50:1 is typically achieved.

## Applications of Image Coding

• Digital Cameras.
• CDROM MPEG Movies.
• Video on Demand.
• Video-phones.
• Video-conferencing.
• Hypermedia Documents.

# Image Enhancement

An image may need to be processed so that it may be viewed. For example, an X-ray image may be displayed with greater contrast in some regions so that very faint features can be seen.

Original

Enhanced

## Applications of Image Enhancement

• Medical Imaging.
• Scientific Visualisation.
• Publishing.
• Astronomy.
• Microscopy.
• Particle Physics.

# Image Restoration

If an image has known faults then these may often be corrected. A blurred image may be 'deblurred' or a noisy image may have noise removed. Restoration often relies on knowledge of the process that caused the degradation.

Original Image

Noisy Image

Noise removal applied once

Noise Removal applied twice

Original Image

Sharpened Image

# Image Segmentation

Sometimes similar areas in an image must be identified, for example from an aerial photograph the land and the sea may need to be segmented.

Original Image

Segmented Image

# Feature detection

Often, only part of an image is of interest.

Feature detection creates an image that contains information about the presence of certain types of features. Edges are common features.

Original Image

Edges in Image

## Applications of Image Restoration, Segmentation and Feature Detection.

• Remote Sensing.
• Medical Imaging.
• Character Recognition.
• Data Analysis.
• Computational Vision.
• Feature Location.
• Gesture recognition.

#### Convolution operations

The convolution operation is a weighted sum of all the pixels in the neighbourhood. For example, shown below is an area in an image and a weight array:

    I00 I10 I20        M00 M10 M20
    I01 I11 I21        M01 M11 M21
    I02 I12 I22        M02 M12 M22

If I is a 3x3 neighbourhood from the image, and M is an array the same size as I, then the pixel I'11 that replaces I11 is found using:

I'11 = sum[i=0..2] sum[j=0..2] (Mij * Iij)

M is called the convolution mask, and by choosing values for the elements of M, many different operations may be performed. The convolution mask may be of any size, but is commonly 3x3, 5x5 or 7x7. The sum of the elements of M is commonly either 0 or 1. In order to see why, consider an area in an image with pixels of a uniform grey level.

If the sum of the elements in M is 0 then the operator produces an image in which with all pixels are zero. Thus such operators are used to detect properties of non uniform regions, for example edges.

If the sum of the elements in M is 1 then the operator produces an image with all pixels unchanged. Thus such operators are used to change the pixels in an image that are in non uniform areas. They may be used, for example, to blur or sharpen an image.

The symbol used to denote a convolution is a star (*).
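The operation can be sketched directly from the definition (a minimal Python sketch, not from the notes; border pixels are left unchanged for simplicity, and the masks and test image are illustrative). On a uniform image a mask summing to 0 gives 0, and a mask summing to 1 leaves the pixels unchanged, as described above:

```python
def convolve(I, M):
    """3x3 convolution: each pixel becomes a weighted sum of its neighbourhood.
    Border pixels are left unchanged for simplicity."""
    h, w = len(I), len(I[0])
    out = [row[:] for row in I]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(M[j][i] * I[y + j - 1][x + i - 1]
                            for j in range(3) for i in range(3))
    return out

I = [[5] * 5 for _ in range(5)]                  # a uniform image
laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # mask sums to 0
blur = [[1 / 9] * 3 for _ in range(3)]           # mask sums to 1

print(convolve(I, laplacian)[2][2])   # 0: uniform regions give zero
print(convolve(I, blur)[2][2])        # 5 (up to rounding): uniform regions unchanged
```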

### Convolution Masks for common operations

#### Blurring

If all the mask coefficients are positive, then the result will be a blurred image.

    1/9 1/9 1/9         1/12 1/12 1/12
    1/9 1/9 1/9    or   1/12 1/3  1/12
    1/9 1/9 1/9         1/12 1/12 1/12

The mask to the left uses the average in the neighbourhood. The mask to the right uses a weighted average. Commonly, the weighted average is calculated using a Gaussian distribution so that there is no sudden discontinuity in the mask.

#### Gradient Operators

The gradient in an image is important because it gives information about edges. The first and second derivatives are often used in edge detection.

##### First derivative

An image is a two dimensional function, thus the gradient at any point is a vector:

g(x,y) = (dI/dx, dI/dy)

The gradient in the x direction in an image I at a point x,y may be approximated as:

gx(x,y)=I(x+0.5,y)-I(x-0.5,y) and, since x and y are integers

gx(x-0.5,y)=I(x,y)-I(x-1,y) or 2gx(x,y)=I(x+1,y)-I(x-1,y)

Similarly the gradient in the y direction is:

gy(x,y-0.5)=I(x,y)-I(x,y-1) or 2gy(x,y)=I(x,y+1)-I(x,y-1)

The convolution masks for this operation are:

In the x direction                           In the y direction

    -1 0 1    or    -1 1                        -1          -1
                                                 0    or     1
                                                 1

Prewitt                                     Sobel

    -1 0 1     -1 -1 -1                     -1 0 1     -1 -2 -1
    -1 0 1      0  0  0                     -2 0 2      0  0  0
    -1 0 1      1  1  1                     -1 0 1      1  2  1

Isotropic                                  Roberts

    -1   0   1      -1 -√2 -1               1  0      0  1
    -√2  0   √2      0   0  0               0 -1     -1  0
    -1   0   1       1  √2  1

The magnitude and direction of the gradient may be found using:

|g| = sqrt(gx^2 + gy^2) and theta = atan(gy/gx)

Often, only the magnitude of the gradient is required. In this case an approximation, such as |gx + gy|, is usually used.

This needs only a single mask:

     0 -1
    -1  2
##### Second derivative

The magnitude of the gradient may be used to detect edges by thresholding. However, the width of the line in the thresholded image will be proportional to the strength of the edge. If the position of the edge is required then the second derivative can be used. Any place where the second derivative crosses zero will be on an edge.

The centre of the edge occurs at the zero crossing of the second derivative

###### Laplacian

In one dimension, an approximation to the second derivative can be obtained by using the approximation to the first derivative and applying it again.

f'(x)= f(x+0.5)-f(x-0.5)

f''(x)=f'(x+0.5)-f'(x-0.5)

f''(x)= f(x+1)-f(x)-(f(x)-f(x-1))

f''(x)=f(x+1)-2f(x)+f(x-1)
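This finite-difference approximation is exact for quadratics, which a quick check confirms (an illustrative sketch, not from the notes):

```python
# Samples of f(x) = x^2; its true second derivative is 2 everywhere.
f = [x * x for x in range(6)]
second = [f[x + 1] - 2 * f[x] + f[x - 1] for x in range(1, 5)]
print(second)   # [2, 2, 2, 2]
```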

In two dimensions, the second derivative is the Laplacian:

∇²I = d²I/dx² + d²I/dy²

Convolution Masks that will approximate the Laplacian or the negative Laplacian are:

    0  1  0          0 -1  0
    1 -4  1    or   -1  4 -1
    0  1  0          0 -1  0

The following mask will give the average for the pixels in a 3x3 neighbourhood.

    1/9 1/9 1/9
    1/9 1/9 1/9
    1/9 1/9 1/9

If this is applied to an image and then the Laplacian is applied, the result will be equivalent to using the following mask.

     0   -1/9 -1/9 -1/9  0
    -1/9  2/9  1/9  2/9 -1/9
    -1/9  1/9  0    1/9 -1/9
    -1/9  2/9  1/9  2/9 -1/9
     0   -1/9 -1/9 -1/9  0

Thus, two convolutions may be replaced by one, and vice versa. This is useful when two small convolution masks can be used to replace one large mask. This is only possible for certain masks. If a mask can be split into two one dimensional masks it is said to be separable. The Laplacian operator is not separable, but a similar mask is:

The column mask

    -1
     2
    -1

followed by the row mask

    -1 2 -1

is equivalent to

     1 -2  1
    -2  4 -2
     1 -2  1

In order to be separable each row in the mask must be a multiple of all other rows and each column must be a multiple of all other columns.

Using separable masks it is possible to use large neighbourhood sizes because the number of computations is proportional to the width of the neighbourhood and not its area.
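The equivalence can be checked numerically: the full mask is the outer product of the column and row masks (an illustrative sketch using the [-1, 2, -1] masks from the text):

```python
# The column mask and row mask from the text:
col = [-1, 2, -1]
row = [-1, 2, -1]

# Convolving with the column mask and then the row mask is equivalent to one
# 3x3 mask whose entries are the outer product col[j] * row[i].
full = [[c * r for r in row] for c in col]
for line in full:
    print(line)

# For an NxN separable mask this costs 2N multiplies per pixel instead of N*N.
```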

#### Edge Detection Using the Laplacian

##### The zero crossings of the Laplacian

As described earlier, the edges in an image correspond to the zero crossings of the Laplacian. These may be found by examining the pixels in a 2x2 neighbourhood:

    a b
    d c

If two pixels in the neighbourhood (for example a and c) have opposite signs and the magnitude of their difference is greater than a threshold t, then a zero crossing lies within the neighbourhood.

The value of t must be set so that noise pixels do not produce edges.

##### Laplacian of a Gaussian

The Laplacian locates edges, but it does not differentiate between edges of different widths. In practice some edges are more important than others. For example, given a face image a single hair will generate an edge, as will the border between face and background. In order to detect edges with a large width, a larger convolution mask is necessary.

If the image is blurred by a Gaussian then the Laplacian will only detect wide edges. These two operations are usually combined by finding the Laplacian of a Gaussian operator.

Since convolution is linear, blurring with a Gaussian G and then taking the Laplacian is equivalent to a single convolution with the Laplacian of the Gaussian:

∇²(G * I) = (∇²G) * I

This may be extended to two dimensions and used to generate a convolution mask.

#### Non Linear Neighbourhood Operations.

The convolution operation is a linear process. If there is a conditional step in the process of obtaining a pixel value from a neighbourhood, then it is said to be non linear.

##### Maximum

The maximum value in the neighbourhood is chosen. The maximum operation causes light areas to expand; it is sometimes called the dilation or expand operation. It is useful for displaying data which contains isolated points, and in segmentation.

##### Minimum

This operation involves choosing the minimum in the neighbourhood. Sometimes called an erode or shrink operation, it is often used in conjunction with the maximum operator to perform fuzzy morphological operations on grey scale images.

##### Median Filter

The most commonly used non linear neighbourhood operator is the median filter.

The pixel is replaced by the median value from within the neighbourhood. The pixel values must be sorted and the middle value chosen. For example, given the following neighbourhood:

     10  20  30
     15  10  40        These values are sorted to give 10,10,15,18,20,20,30,40,100
    100  18  20

In this case the median value is 20.

The median filter is important because it removes pixels with values that are not similar to those in the neighbourhood while preserving edges. To illustrate this, consider the one dimensional case shown below.

The example to the left shows a feature that has a size less than half the width of the neighbourhood. In this case there is no point where the median value will be taken from inside the feature. Thus it is removed.

The example to the right shows a feature that has a size greater than half the width of the neighbourhood. In this case the median will always be the same as the value in the centre of the neighbourhood. This feature remains intact.

This property of the median filter is important because other noise removal methods, for example blurring, will change edge features.

The disadvantage of the median filter is that it is computationally intensive because it involves sorting. It is possible to use a moving histogram to reduce the number of calculations for large neighbourhood sizes.
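A minimal sketch of the median filter (in Python, ignoring border pixels; not from the notes), applied to the example neighbourhood above:

```python
def median_filter(I):
    """Replace each interior pixel by the median of its 3x3 neighbourhood."""
    h, w = len(I), len(I[0])
    out = [row[:] for row in I]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nbhd = sorted(I[y + j][x + i] for j in (-1, 0, 1) for i in (-1, 0, 1))
            out[y][x] = nbhd[4]      # the middle of the nine sorted values
    return out

I = [[10, 20, 30],
     [15, 10, 40],
     [100, 18, 20]]
print(median_filter(I)[1][1])   # 20
```

The outlier 100 has no influence on the result; a blurring mask would have dragged the centre value towards it.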

##### Example of the Median Filter

Original Image 1/4 of all pixels replaced by random values

3x3 Median Filter applied once 3x3 Median Filter applied twice

##### The Mode Filter

The mode filter replaces a pixel with the most common pixel value in the neighbourhood. This is only useful if there are a small number of grey levels. If grey levels are used to represent segmented regions then noise can be removed by using the mode filter.

##### K Nearest Neighbour

Instead of using all the pixels in the neighbourhood use the K pixels closest in value to the central pixel. Replace it with the average of these. For example if K=6:

     10  20  30
     15  10  40        The 6 pixels nearest 10 are: 10,15,18,20,20 and 30
    100  18  20        Average = 113/6 ≈ 19

This operator averages over the neighbourhood without including extreme values (the 100 in this case).

##### Maximum Homogenity

Calculate the average for a number of neighbourhoods around a pixel and use the closest to the pixel value.

    n1  n2
      n5
    n3  n4

Calculate the average in each neighbourhood n1..n5 and use the closest of these to the original pixel value.

#### Image Enhancement using the Laplacian

The Laplacian operation may be used to enhance an image.

f(x) f''(x) f(x)-f''(x)

If the second derivative of a function is subtracted from the function then sharp transitions in the function are emphasised as shown above. Similarly, the Laplacian may be used to enhance the edges in an image.

    Image         Laplacian        Edge Enhancement

    0 0 0         0  1 0           0 -1  0
    0 1 0    -    1 -4 1     =    -1  5 -1
    0 0 0         0  1 0           0 -1  0

##### Example

Original                    Sharpened using Laplacian

#### Binary neighbourhood operators

Often it is necessary to process binary images. Since each pixel may take only two values, the pixel is primarily used to denote the presence or absence of a feature. This feature may be the result of previous processing, for example edge detection. Often the binary image gives information about the shape of an object. Neighbourhood operators for binary images are called morphological operators because they deal with shape information.

##### Connectivity

In order to use morphological operators it is necessary to decide how one pixel is connected to another pixel.

###### Eight Connected

The simplest method for deciding if a pixel is joined to one of its neighbours, is to look at all eight of the neighbours. If a pixel in this 8 neighbourhood is set, then it is joined to the central pixel. Eight connectedness correctly connects diagonal lines but it also connects the background across a diagonal line. Thus when eight connectedness is used the background must be treated differently to the foreground.

###### Four Connected

Pixels are four connected, if they are joined to the left, right, above or below, but not diagonally.

Eight Connected Figure    Four Connected Figure

##### Shrink and Expand

Two important morphological operators are shrink and expand. Shrink will clear the pixels that have any neighbours clear. Expand will set the pixels that have any neighbours set.

##### Close and Open

An expand followed by a shrink is called a Close because it will remove small gaps between objects.

A shrink followed by an expand is called an open because it will keep small gaps between objects open.

Shrink, Expand, Open and Close are illustrated below.

original expand shrink close open

Shrink and expand are analogous to the neighbourhood operators min and max.

The edges in an image may be found by taking the original away from the expanded image, or taking the shrunk image away from the original. These correspond to the outer 4 connected edge and the inner 4 connected edge respectively.
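These operators can be sketched directly from their definitions (a minimal Python sketch using 8-connectivity; the border handling and test images are arbitrary choices, not from the notes):

```python
def expand(img):
    """Set any pixel that has a set pixel in its 8-neighbourhood (dilation)."""
    h, w = len(img), len(img[0])
    return [[1 if any(img[y + j][x + i]
                      for j in (-1, 0, 1) for i in (-1, 0, 1)
                      if 0 <= y + j < h and 0 <= x + i < w) else 0
             for x in range(w)] for y in range(h)]

def shrink(img):
    """Clear any pixel that has a clear pixel in its 8-neighbourhood (erosion)."""
    h, w = len(img), len(img[0])
    return [[1 if all(img[y + j][x + i]
                      for j in (-1, 0, 1) for i in (-1, 0, 1)
                      if 0 <= y + j < h and 0 <= x + i < w) else 0
             for x in range(w)] for y in range(h)]

def close_(img):   # expand then shrink: removes small gaps between objects
    return shrink(expand(img))

def open_(img):    # shrink then expand: keeps gaps open, removes small specks
    return expand(shrink(img))

noise = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
print(open_(noise))            # the isolated pixel is removed
print(close_([[1, 0, 1]]))     # the one-pixel gap is filled
```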

##### Skeletonisation.

The shrink operator may be used to reduce the size of a region of set pixels but if it is repeatedly applied then the region will eventually disappear. If the shape of the region is important then rather than shrink to nothing an operator should shrink the region to its skeleton.

The skeleton (or medial axis) is a set of points which are equidistant from the two or more closest edge points in the image.

Skeletonisation may be performed by repeatedly applying an algorithm which thins the image until the skeleton remains.

##### Implementation

Binary morphological operators may be implemented by using a lookup table. The pixels in the neighbourhood are combined to form a binary word. This word will be between 0 and 511, for a 3x3 neighbourhood. This word is used to address a table which contains a new value for the central pixel.
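The lookup-table idea can be sketched as follows (illustrative Python, not from the notes; the row-major bit ordering of the packed word is an arbitrary choice):

```python
def make_lut(op):
    """Build a 512-entry table: each 3x3 binary neighbourhood packs into a
    9-bit word (row-major, bit 4 is the centre) indexing the new pixel value."""
    lut = [0] * 512
    for word in range(512):
        bits = [(word >> b) & 1 for b in range(9)]
        lut[word] = op(bits)
    return lut

# Example operator: shrink (clear the centre if any neighbour is clear).
shrink_lut = make_lut(lambda bits: 1 if all(bits) else 0)

print(shrink_lut[0b111111111])   # 1: all nine pixels set, centre stays set
print(shrink_lut[0b111101111])   # 0: one pixel clear, centre is cleared
```

At run time each neighbourhood is packed into a word and the new pixel is a single table lookup, whatever the operator.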

Image Transforms - Hough Transform

# The Hough Transform

The Hough transform is a technique which can be used to isolate features of a particular shape within an image.

Because it requires that the desired features be specified in some parametric form, the classical Hough transform is most commonly used for the detection of regular curves such as lines, circles, ellipses, etc.

A generalized Hough transform can be employed in applications where a simple analytic description of a feature(s) is not possible.

The Hough transform has many applications, as most manufactured parts (and many anatomical parts investigated in medical imagery) contain feature boundaries which can be described by regular curves or straight lines.

The main advantage of the Hough transform is that it is tolerant of gaps in feature boundary descriptions and is relatively unaffected by image noise.

## How It Works

The Hough technique is useful for computing a global description of a feature(s) (where the number of solution classes need not be known a priori), given (possibly noisy) local measurements.

The motivating idea behind the Hough technique for line detection is that each input measurement (e.g. coordinate point) indicates its contribution to a globally consistent solution (e.g. the physical line which gave rise to that image point).

As a simple example, consider the common problem of fitting a set of line segments to a set of discrete image points (e.g. pixel locations output from an edge detector). The diagram below shows some possible solutions to this problem.

a) Coordinate points. b) and c) Possible straight line fittings.

We can analytically describe a line segment in a number of forms. However, a convenient equation for describing a set of lines uses the parametric or normal notation:

rho = x cos(theta) + y sin(theta)

where rho is the length of a normal from the origin to this line and theta is the orientation of the normal with respect to the X-axis. (See Figure 2.) For any point (x, y) on this line, rho and theta are constant.

Figure 2 Parametric description of a straight line.

In an image analysis context, the coordinates of the points of edge segments (i.e. (xi, yi)) in the image are known and therefore serve as constants in the parametric line equation, while rho and theta are the unknown variables we seek. If we plot the possible (rho, theta) values defined by each (xi, yi), points in cartesian image space map to curves (i.e. sinusoids) in the polar Hough parameter space. This point-to-curve transformation is the Hough transformation for straight lines. When viewed in Hough parameter space, points which are collinear in the cartesian image space become readily apparent as they yield curves which intersect at a common (rho, theta) point.

The transform is implemented by quantizing the Hough parameter space into finite intervals or accumulator cells (i.e. a multidimensional array). As the algorithm runs, each (xi, yi) is transformed into a discretized curve and the accumulator cells which lie along this curve are incremented.

Peaks in the accumulator array represent strong evidence that a corresponding straight line exists in the image.
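The accumulator procedure for straight lines can be sketched as follows (a minimal Python sketch; the image size, grid resolutions and test points are arbitrary choices, not from the notes):

```python
import math

def hough_lines(points, width, height, rho_steps=180, theta_steps=180):
    """Vote in a quantised (rho, theta) accumulator for each edge point,
    using the normal parametrisation rho = x*cos(theta) + y*sin(theta)."""
    max_rho = math.hypot(width, height)
    acc = [[0] * theta_steps for _ in range(rho_steps)]
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            rho = x * math.cos(theta) + y * math.sin(theta)
            # map rho from [-max_rho, max_rho] onto an accumulator row
            r = int((rho + max_rho) * (rho_steps - 1) / (2 * max_rho))
            acc[r][t] += 1
    return acc

# Five collinear edge points on the horizontal line y = 3:
pts = [(x, 3) for x in range(5)]
acc = hough_lines(pts, 10, 10)
peak = max(v for row in acc for v in row)
print(peak)   # 5: all five sinusoids pass through one accumulator cell
```

Each edge point votes once per theta column, so the peak count equals the number of collinear points that generated it.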

We can use this same procedure to detect other features with analytical descriptions. For instance, in the case of circles, the parametric equation is

(x - a)^2 + (y - b)^2 = r^2

where a and b are the coordinates of the center of the circle and r is the radius. In this case, the computational complexity of the algorithm begins to increase as we now have three coordinates in the parameter space and a 3-D accumulator. (In general, the computation and the size of the accumulator array increase polynomially with the number of parameters. Thus, the basic Hough technique described here is only practical for simple curves.)

The Hough transform can be used to identify the parameter(s) of a curve which best fits a set of given edge points.

This edge description is commonly obtained by using an edge detector such as the zero crossings of the Laplacian. The edge image may be noisy, i.e. it may contain multiple edge fragments corresponding to a single whole feature.

Since the output of an edge detector defines only where features are in an image, the work of the Hough transform is to determine both what the features are (i.e. to detect the feature(s) for which it has a parametric description) and how many of them exist in the image.

In order to illustrate the Hough transform in detail, we begin with the simple image of two occluding rectangles,

An edge detector can produce a set of boundary descriptions for this part, as shown in

Here we see the overall boundaries in the image, but this result tells us nothing about the identity (and quantity) of feature(s) within this boundary description. In this case, we can use the Hough (line detecting) transform to detect the eight separate straight lines segments of this image and thereby identify the true geometric structure of the subject.

If we use these edge/boundary points as input to the Hough transform, a curve is generated in polar space for each edge point in cartesian space. The accumulator array, when viewed as an intensity image, looks like

Enhancing the contrast of this image allows us to see the patterns of information contained in the low intensity pixel values.

Although rho and theta are notionally polar coordinates, the accumulator space is plotted rectangularly.

Note that the accumulator space wraps around at the vertical edge of the image such that, in fact, there are only 8 real peaks.

Curves generated by collinear points in the gradient image intersect in peaks in the Hough transform space. These intersection points characterize the straight line segments of the original image.

There are a number of methods which one might employ to extract these bright points, or local maxima, from the accumulator array.

For example, a simple method involves thresholding and then applying some thinning to the isolated clusters of bright spots in the accumulator array image.

Here we use a relative threshold to extract the unique points corresponding to each of the straight line edges in the original image.

(In other words, we take only those local maxima in the accumulator array whose values are equal to or greater than some fixed percentage of the global maximum value.)
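The relative thresholding just described can be sketched as follows; the function name is an illustrative assumption, and the local-maximum test here is a simple 8-neighbourhood check rather than full thinning.

```python
import numpy as np

def relative_threshold_peaks(acc, frac=0.4):
    """Keep local maxima whose value is at least `frac` of the global maximum."""
    thresh = frac * acc.max()
    h, w = acc.shape
    peaks = []
    for r in range(h):
        for c in range(w):
            if acc[r, c] < thresh:
                continue
            # a cell is a peak if it is no smaller than its 8-neighbourhood
            nb = acc[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if acc[r, c] == nb.max():
                peaks.append((r, c))
    return peaks
```

Each surviving (r, θ) cell then corresponds to one detected straight line.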

Mapping back from Hough transform space (i.e. de-Houghing) into cartesian space yields a set of line descriptions of the image subject. By overlaying this image on an inverted version of the original, we can confirm the result that the Hough transform found the 8 true sides of the two rectangles.

The accuracy of alignment of detected and original image lines is not perfect due to the quantization of the accumulator array.

Note also that the lines generated by the Hough transform are infinite in length. If we wish to identify the actual line segments which generated the transform parameters, further image analysis is required in order to see which portions of these infinitely long lines actually have points on them.
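That further analysis can be sketched as follows: given a detected (r, θ) pair, collect the edge points lying within a small distance of the corresponding infinite line; their extent gives the actual segment. The function name and tolerance are illustrative assumptions.

```python
import numpy as np

def points_on_line(edges, r, theta, tol=1.0):
    """Return edge points within `tol` pixels of the line r = x*cos(t) + y*sin(t)."""
    ys, xs = np.nonzero(edges)
    # perpendicular distance of each edge point from the detected line
    dist = np.abs(xs * np.cos(theta) + ys * np.sin(theta) - r)
    keep = dist <= tol
    return xs[keep], ys[keep]
```

The minimum and maximum of the returned coordinates then bound the visible segment along the line.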

To illustrate the Hough technique's robustness to noise, the edge description has been corrupted by noise.

The result, plotted in Hough space, is

De-Houghing this result (and overlaying it on the original) yields

(As in the above case, the relative threshold is 40%.)

The sensitivity of the Hough transform to gaps in the feature boundary can be investigated by transforming the following image

The Hough representation is

The de-Houghed image is

In this case, because the accumulator space did not receive as many entries as in previous examples, only 7 peaks were found, but these are all structurally relevant lines.

We will now show some examples with natural images. In the first case, we have a city scene where the buildings are partially obscured by fog,

If we want to find the true edges of the buildings, an edge detector cannot recover this information very well, as shown in:

However, the Hough transform can detect some of the straight lines representing building edges within the obscured region. The accumulator space representation of the original image is shown here:

If we set the relative threshold to 70%, we get the following de-Houghed image

Only a few of the long edges are detected here, and there is a lot of duplication where many lines or edge fragments are nearly collinear. Applying a more generous relative threshold, i.e. 50%, yields

This yields more of the expected lines, but at the expense of many spurious lines arising from the many collinear edge fragments.

Another example comes from a remote sensing application. Here we would like to detect the streets in the image

of a reasonably rectangular city sector. We can edge detect the image as follows:

However, street information is not available as output of the edge detector alone. The image

shows that the Hough line detector is able to recover some of this information. Because the contrast in the original image is poor, a limited set of features (i.e. streets) is identified.

## Common Variants

### Generalized Hough Transform

The generalized Hough transform is used when the shape of the feature that we wish to isolate does not have a simple analytic equation describing its boundary. In this case, instead of using a parametric equation of the curve, we use a look-up table to define the relationship between the boundary positions and orientations and the Hough parameters.

For example, suppose that we know the shape and orientation of the desired feature. (See Figure 3.) We can specify an arbitrary reference point within the feature, with respect to which the shape (i.e. the distance and angle of normal lines drawn from the boundary to this reference point) of the feature is defined. Our look-up table (i.e. R-table) will consist of these distance and direction pairs, indexed by the orientation of the boundary.

Description of R-table components.

The Hough transform space is now defined in terms of the possible positions of the shape in the image, i.e. the possible ranges of the reference point coordinates (x_c, y_c). In other words, the transformation is defined by:

x_c = x + r cos(α)
y_c = y + r sin(α)

(The r and α values are derived from the R-table for particular known orientations φ.) If the orientation of the desired feature is unknown, this procedure is complicated by the fact that we must extend the accumulator by incorporating an extra parameter to account for changes in orientation.
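The R-table construction and voting procedure for a shape of known orientation can be sketched as follows. Function and variable names are illustrative assumptions, and boundary orientations are quantized crudely by rounding; a real implementation would bin them properly.

```python
import numpy as np
from collections import defaultdict

def build_r_table(boundary, gradient_dirs, reference):
    """Index (r, alpha) vectors from boundary points to the reference point,
    keyed by the (quantized) boundary orientation phi."""
    table = defaultdict(list)
    for (x, y), phi in zip(boundary, gradient_dirs):
        dx, dy = reference[0] - x, reference[1] - y
        table[round(phi, 2)].append((np.hypot(dx, dy), np.arctan2(dy, dx)))
    return table

def ght_accumulate(edge_points, gradient_dirs, table, shape):
    """Vote for possible reference-point positions (x_c, y_c)."""
    acc = np.zeros(shape, dtype=np.int32)
    for (x, y), phi in zip(edge_points, gradient_dirs):
        for r, alpha in table.get(round(phi, 2), []):
            xc = int(round(x + r * np.cos(alpha)))
            yc = int(round(y + r * np.sin(alpha)))
            if 0 <= xc < shape[1] and 0 <= yc < shape[0]:
                acc[yc, xc] += 1
    return acc
```

When the shape being sought appears in the image, every boundary point votes for the same reference position, producing a sharp peak in the accumulator.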