Reinforcement learning in game play

Material Information

Reinforcement learning in game play
Powell, Brian
Publication Date:
Physical Description:
139 leaves : illustrations ; 28 cm


Subjects / Keywords:
Reinforcement learning ( lcsh )
Machine learning ( lcsh )
Games -- Data processing ( lcsh )
Games -- Data processing ( fast )
Machine learning ( fast )
Reinforcement learning ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 138-139).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Brian Powell.

Record Information

Source Institution:
|University of Colorado Denver
Holding Location:
|Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
45217414 ( OCLC )
LD1190.E52 2000m .P69 ( lcc )

Full Text
Brian Powell
B.S., University of Colorado at Boulder, 1993
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science

This thesis for the Master of Science
degree by
Brian Powell
has been approved
Ross McConnell
Christopher Smith

Powell, Brian (M.S., Computer Science)
"Reinforcement Learning in Game Play"
Thesis directed by Professor William J. Wolfe
Machine Learning has always paralleled biologic learning research, and, more recently, that
of reinforcement learning. Reinforcement learning breaks from standard supervised learning
by allowing the agent to explore its environment to better understand the surrounding. This
exploration of the environment and exploitation of previous knowledge is particularly well
suited for stochastic game environments in which a supervisor could not possibly teach all
achievable states.
This abstract accurately represents the content of the candidate's thesis. I recommend its

To Lisa

My gracious thanks to the dedicated people at the University of Colorado system for
inspiring my interest in learning methods for machine learning. I also would like to thank
the dogs and cat not only for all of their sleeping during the writing of this document, but
also for showing me that basic intelligence is much more complex than we give credit for.

Tables ..................................................................................x
Code Listings...........................................................................xi
1. Introduction: Concept of Learning...................................................1
2. Biologic and Artificial Intelligence................................................2
2.1. Biological Learning..............................................................2
2.2. Artificial Intelligence..........................................................4
2.3. Machine Learning.................................................................5
2.4. Role of Game Playing in Machine Learning.........................................7
3. Samuel's Checkers Program...........................................................9
3.1. Samuel and Checkers.............................................................10
3.2. Searching the Space.............................................................11
3.3. Methods of Learning.............................................................15
4. Reinforcement Learning.............................................................22
4.1. History.........................................................................25
4.2. Classes of Reinforcement Learning Methods.......................................26
4.3. Temporal Difference.............................................................28
4.4. Temporal Difference with Eligibility: TD(A.)....................................32
4.5. Samuel's Checkers and Temporal Difference.......................................37
4.6. Gradient Descent with TD(A.)....................................................38
5. Tesauro's TD-Gammon................................................................41
5.1. Tesauro's Neurogammon...........................................................42

5.2. Tesauro's TD-Gammon.........................................................43
5.3. Tesauro's Conclusions.......................................................47
6. Tic-Tac-Toe and Reinforcement Learning........................................50
6.1. TD(^.) Implementation.......................................................51
6.2. Results of TD(^.)...........................................................53
6.3. Back-Propagation and Neural Networks........................................64
6.4. Temporal Difference and Neural Networks.....................................68
6.5. TD(Neural-Net) Implementation...............................................70
6.6. Results of TD(Neural-Network)...............................................75
6.7. Tic-Tac-Toe Conclusions.....................................................85
7. Conclusions...................................................................88
A. MDP Code......................................................................91
B. Tic-Tac-Toe Code..............................................................94
B.2. HumanPolicy.h..............................................................101
B.3. matrix.h...................................................................103
B.4. Player.h...................................................................106
B.5. Policy, h..................................................................108
B.6. RandomPolicy.h.............................................................109
B.7. TDlambdaPolicy.h...........................................................110
B.8. TDNeuralPolicy.h...........................................................114
B. 9. TDSuttonNet.h..............................................................118
C. MDP Data Result Tables.......................................................122
D. Tic-Tac-Toe TD(^.) Data Result Tables......................................125

3.1 Checkers Board...................................................................9
3.2 Player 1 Possible Moves.........................................................12
3.3 Player 2 Moves and Scores.......................................................12
3.4 Potential scores of each move...................................................13
4.1 Standard Reinforcement Model....................................................23
4.2 Robot Exploration as a MDP......................................................30
4.3 Comparison of TD and MC MDP results.............................................31
4.4 Comparison of Different X values for TD(A) and MDP Robot Walk...................36
5.1 Backgammon Board................................................................41
6.1 Tic-Tac-Toe Board...............................................................50
6.2 Percentage of Victories (A=0.5, a=0.05).........................................54
6.3 Percentage of Victories (A=0.5, a=0.1)..........................................55
6.4 Percentage of Victories (A=0.5, a=0. IS)........................................56
6.5 Percentage of Games Ending in a Draw For Each Reward............................57
6.6 Percentage of Victories for Differing X.........................................58
6.7 Percentage of Wins with TD(A.) as X.............................................59
6.8 Board Position with O's turn to move............................................62
6.9 Typical Neural Network..........................................................64
6.10 Percentage of Victories for TDNN vl.O..........................................76
6.11 Percentage of Victories for TDNN vl.l..........................................77
6.12 Percentage of Victories for TDNN vl.l vs. TD(A)................................78

6.13 Percentage of Victories for TDNN v2.0...........................................80
6.14 Percentage of Victories for TDNN v2.1 vs. TD(A.)................................81
6.15 Percentage of Victories for TDNN v3.0...........................................83
6.16 Percentage of Victories for TDNN v3.0 vs TD(A.).................................84

2.1 Disciplines Contributing to Machine Learning................................6
2.2 Learning Type Classifications...............................................6
5.2 Results of TD-Gammon.......................................................45
6.1 TD(^.) vs. Human Opponent..................................................60
6.2 TDNN vs. Human Opponent....................................................82
6.3 TDNN v3.0 vs. Human Opponent...............................................85

4.1 Pseudo-Code for TD(0) MDP....................................................31
4.2 Pseudo-Code for MC MDP.......................................................32
4.3 Pseudo-Code for TD(A.) implementation........................................37
6.1 The Policy for a State-Estimation Player.....................................53
6.2 Subroutine for Computing the Output of the Network...........................71
6.3 Subroutine for Computing the Change in Weights of the Network................72
6.4 Subroutine for Backpropagation of Eligibility Values.........................72

1. Introduction: Concept of Learning
The paradox of understanding learning begins with the seemingly indefinable
concept itself. The dictionary reveals that learning is to "1. gain knowledge or skill 2. acquire
as a habit or attitude" [Websters, 1982]. These definitions and concepts are rather vague as is
the current understanding of learning. This has led many to argue that the search for the
laws of learning is a never-ending task because they simply do not exist. Exploring the
philosophical implications within the definition of learning-though a fascinating Platonic
discourse-is not the author's intent.
For the sake of this discussion, learning shall be defined as merely an adaptive
change in behavior. Using this simple definition, intelligence can easily be assigned.
Consider that "behavior is usually considered intelligent when it can be seen as adaptive"
[Balkenius, 1994]. Superior intelligence can be assigned when the adaptive behavior
improves our domain knowledge. Inferior intelligence can be assigned when the adaptive
behavior worsens our domain knowledge.
Balkenius has an excellent example of perceived intelligence [Balkenius, 1994], A
squirrel gathering, hiding, and storing food for the winter would appear to be an intelligent
activity. Upon further inspection we find that the squirrel forgets where 80% of his food is
hidden, stores far more food than would ever be needed, and most oftentimes takes food
hidden by another squirrel. No longer does this activity seem intelligent, but, simply
random. Such counter-examples exemplify the notion that "intelligence is in the eye of the
beholder" [Brooks, 1991],

2. Biologic and Artificial Intelligence
Artificial Intelligence, throughout its history, has followed the philosophical and
theoretical trends in the biological learning fields rather closely. The fundamental difference
between the studies of animal intelligence and that of artificial intelligence is that the field of
artificial intelligence works to build new intelligent entities utilizing the knowledge learned.
These entities are built for research as well as applications, such as assisting in the diagnosis
of medical patients, controlling robotics within factories, guiding robotic rovers on other
planets, and even playing a challenging game against a human opponent.
Alan Turing defined intelligent behavior as the machine having the ability to
perform all human cognitive functions. This, rather homo-centric, view of intelligence is not
the goal of much of artificial intelligence because most believe that intelligence does not have
to achieve the level of human thought to be considered truly intelligent.
2.1. Biological Learning
The roots in the understanding of biological learning followed the belief that all
behaviors could be simplified into stimulus and response pairs. However, over one hundred
years of research has yet to form the general rule of learning based upon these stimulus-
response pairs. Other researchers, such as MacFarlane in 1930, set out to prove that learning
was more abstract [MacFarlane, 1930]. He taught rats to swim their way through a maze in
order to find the goal of a food treat. After the rats were adept at swimming the maze and
discovering their food every time, the maze was drained of water and the rats were allowed
to again run the maze in search of food. Every rat was able to recall the proper path through

the maze even though it was no longer swimming but running. He had proven that the
knowledge learned by the rats was not merely a response to the stimulus of swimming.
Through this research, MacFarlane and others showed that learning is not simply a complex
memorization of stimulus-response pairs.
Other biological responses or actions are not learned at all, but are instictual: handed
down from generation to generation by genetic code. For this discussion, instinct is defined
as any fixed reaction to a given stimulus (e.g. a "tick will bite everything that has a
temperature above +37C and smells of butyric acid" [Lorenz, 1977]). Using this definition of
instinct gives rise to the belief that learning may be nothing more than slight parameter
adjustment of pre-programmed motor reactions. An example of this theory can be displayed
with a newborn child. The child will smile at anything resembling a human face (from true
faces to masks and cardboard cutouts). As the child develops, however, these visual cues
must begin to become more and more resembling that of a true human face to elicit the
child's response.
Researchers from each of these two theories worked for years to prove their beliefs
and attempt to disprove the other side; however, as research continued, it led to the
realization that there is more than one type of learning and that neither is mutually exclusive
of the other. This has led to the now more widely held belief that there is only one general
learning system comprised of several (possibly many) learning methods.
As attention turned from attempting to fit all learning into stimulus-response or
adaptive instinct, researchers began to focus on the role of reinforcement learning in the
1950's. During this time, it was found that reinforcement learning is based upon causing
adaptive behavior with stimulus reinforcement: a rat that learns the maze to be rewarded
with food is simply being given a stimulus to learn. Some argue that this stimulus

reinforcement (or punishment) forces the actual learning (as opposed to adaptation).
Premack showed, however, that oftentimes, this stimulus is not so much as forced learning,
but that the stimulus serves only to "reinforce a less probable activity." For example, the rats
that swam the maze, though, not necessarily hungry learned to swim the maze not out of
need or of the capacity to learn, they swam the maze simply to get their treat.
The reinforcement learning theory is quite different from the mechanical view of
stimulus-response and is more generalized. Some argue that reinforcement learning is, in
fact, the oft-sought single general learning system. Most researchers, however, would argue
that reinforcement learning "plays a role in some but not all learning" [Balkenius, 1994].
To understand the forms of learning is an important step in the eventual
understanding of how learning as a whole occurs in biological organisms. Though many
theories are (or have been) presented separately by researchers, taken as a whole they can
help us to understand the rather abstract learning process that occurs.
2.2. Artificial Intelligence
It is interesting how development of general artificial intelligence theories tend to
closely follow the their respective theories and development of researchers in biological
learning. All primary threads of AI research can be mapped into one of the three learning
methods understood by biologic researchers.
Stimulus-Response actions are very much common place in rule based systems.
Simple look-up tables or expert system rules dictate a response or chain of responses for a
given stimulus or stimuli-chain respectively. Neural Networks, on the other hand,
theoretically attempt to model biological brains (utilizing a collection of synaptic nodes) such

as a Hopfield neural network (once it has been suitably trained) tend to follow the instinctual
model of parameter adjustment to a pre-programmed set of motor-controls and responses.
Genetic Algorithmsin generalalso fit into this parameter adjustment of instinctual
Thorndike defines law of effect to mean, "learning of a response is governed by the
effects of that response." The first to follow this belief in an Artificial Intelligence
representation and implementation was Arthur Samuel with his Checkers Program. Sutton
and Barto [Sutton, 1990] have explicitly made mention of the parallel between the behavioral
research in reinforcement learning and the temporal difference algorithm they have
developed and deployed in adaptive control.
2.3. Machine Learning
Machine Learning is the concept of adapting the theories of Artificial Intelligence to
that of a physical (typically discrete) machine. A broad definition for machine learning can
be given as, "a machine learns whenever it changes it structure, program, or data (based on
its inputs or in response to external information) in such a manner that its expected future
performance increases" [Nilsson, 1996]. In addition to the research of animal learning, other
fields have contributed philosophy and research to machine learning (as shown in table 2.1).

Subject Contribution
Statistics Common statistical problems attempt to resolve the value of an unknown function at some point given some sampling of previous points. This is considered because "decisions" are made only with data from the problem domain.
Brain Models One important discipline of machine learning is that of Neural Networks which attempt to model or mimic the basic structure and workings of biological brains.
Control Theory Adaptive controls attempt to control some process in an environment with unknown parameters that must be estimated (typically in real time) during the span of the process.
Psychological Models Much of the work performed by Sutton and Barto in the realm of Reinforcement Learning is based upon trying to model how rewards encourage learning. "Reinforcement Learning is an important theme in machine learning research" [Nilsson, 1996].
Evolutionary Models Genetic Algorithms attempt to model the evolutionary change as a species tries to "adapt" to the problem domain(s).
Table 2.1, Disciplines Contributing to Machine Learning
Within the application of Machine Learning, there are three classifications of the type
of learning taking place (as shown in table 2.2).
Supervised Learning The expected values, /, are known over n samples for the training set, S. The learning model can then be "taught" to adjust itself until it produces a result consistent with /. This is typically applied to Hopfield Neural Networks.
Unsupervised Learning For this case, only the training set, S, is known for which, the expected values, / are not known. In this case, the learning model must learn to typically maximize / for the set S.
Hybrid (or intermediate) Learning The learning model is shown subsets of the training set, S, in both supervised and unsupervised learning.
Table 2.2, Learning Type Classifications
Classification Description

Utilizing the philosophies of learning, incorporating the mathematics of statistics, the process
of control-theory and utilizing a choice of learning method, what can the machine actually
learn? Though there are many applications in use of such ideas and technologies, there are a
typical variety of structures that machines can learn (as listed, [Nilsson, 1996]):
Logic Problems and Rule Sets
Finite State Machines
Problem Solving Systems
It is the problem of teaching a machine to best fit (err approximate) a non-linear
function that most of our discussion will center.
2.4. Role of Game Playing in Machine Learning
Game playing has been a popular research topic in Artificial Intelligence and
Machine Learning because the problem typically has well-defined goals, is easily
represented, and has strict-rules. This is not to say that game play is easy. The computer
must deal with the uncertainty of its opponent while also managing a search space, which is
oftentimes infinite (in the realm of computer time and memory). Finally, once a computer
has excelled at a challenging game against strong human components, others typically
perceive it as exuding intelligence.
The first groundbreaking machine playable game was Samuel's Checkers Program.
His program "learned" a value-function (as represented by his linear function approximator)
and applied a training function that is very similar to what became temporal difference
within reinforcement learning. Arguably, the most successful game-playing machine is

Tesauro's TD-Gammon, which fully employs the methods of temporal difference to achieve a
master's level skill at the game of backgammon.
Samuel chose the game of checkers because he argued that checkers allows an
emphasis on learning techniques without complex rules to steer the concentration of the
learning method. The characteristics with which he believed checkers displayed intelligence
by a machine were:
No known algorithm guarantees a win or a draw because the state space is too large
to exhaustively search.
Definite Goal
Winning the game, with intermediate goals along the way.
Activity rules are definite and known
All responses must be allowed by the rules, thus keeping the possible space much
A background of knowledge exists
This is important for supervised learning. If you can teach the program what is good
and bad (with known inputs and outcomes), your learning will be more accurate.
Activity is understood by many people
By playing the game of checkers, even a child would be able to understand exactly
what the computer has learned and achieved.
To this day in research, games continue to attract tremendous attention because the
ability to plan, reason, and choose an option is perceived as intelligent. Additionally, these
exact skills are extremely well suited for translation to develop "real-world" applications.

3. Samuel's Checkers Program
The game of Checkers (also referred to as Draughts) was originally developed in
ancient Egypt. It is a basic game played between two players and does not involve an
element of chance. The game is played using only the black tiles of a standard chessboard as
shown in figure 3.1. Each opponent begins with twelve pieces. Each piece may move only
forward to its diagonally bordering, unoccupied tile. If an opponent's piece borders a
player's piece and the opposite tile is unoccupied, a player may "jump" the opponent's piece
and therefore remove it from the game. If a player's piece reaches the opposite side of the
board, it is crowned a "king" and is allowed to traverse in either forward or backward
direction. The player that takes the entirety of the opponent's pieces wins.
w W
tmjm IT L

spi r, ; wM Li.-., 1 L ...
I§IS 1 1 t
o u D o
a D ^ D
D u D D
Figure 3.1, Checkers Board
All popular computer checkers programs employ the use of a search tree. In each of
these trees, nodes represent possible positions within the game and branches represent
possible moves from each position. Each level of the tree represents a position with possible
moves based upon the opponent's move. For the game of checkers, to construct the entire

game tree would require 1040 non-terminal nodes. Assuming one had a rather fast computer,
capable of solving 3.rl0^ noc^es/s<2c' ^ wou^ require 1021 centuries to find all possible
3.1. Samuel and Checkers
Arthur L. Samuel, a researcher at IBM began work in 1952 on developing a program
which could play checkers at the level of a human opponent. By 1955, Samuel had developed
a program employing the first heuristic search method to play the game of checkers.
Samuel's Checkers Program was developed to run upon an IBM 700. He would oftentimes
sneak into the production facility after-hours to run his program, as IBM would not give him
computer time.
Samuel laid out the checkers board representing each possible location with a single
bit within the 36-bit word of the IBM 700. Starting at the lower, right position numbered as 1,
the next position left 2, and so forth for all possible 32 positions on a checkers board. Each
player was then assigned two words containing that player's current position. The first word
contained the location of all of the player's standard pieces. The second word contained the
location of all of the player's "kings." To make a move, a simple bit-shift could be performed
upon the bit representing the piece to move. To move forward and to the right, 4 right bit
shifts would move the proper checker. To move forward and left, 5 right bit shifts. Samuel
further created one additional word for all occupied spaces on the board. These five words
sufficiently describe the state-space of the checker's game.
An evaluation function to judge where the computer (and opponent) stand within
the game had to be created. The linear polynomial model that Samuel employed is based
upon the summation of weighted parameters. Such parameters include: piece advantage

(number of pieces vs. opponent's), simple boolean tests (do I have a "king"?), and others.
The selections of these parameters were initially set a priori. Later, the computer was allowed
to use those parameters with the most influence on the outcome of the game.
3.2. Searching the Space
With a representation and an evaluation function, a search tree is setup for each
move and its possible scores. Employing a simple Depth-First Search (DFS) when searching
for the best move, the machine cannot simply choose the move with the maximum score (our
assumed goal), instead, from that position, it must traverse back up the tree to ensure that the
goal chosen is attainable. The uncertainty in the attainability of this goal is due to the
opponent's moves (which are presumed to maximize his own position while minimizing our
position). This procedure, developed by Shannon came to be known as the MINIMAX search
[Shannon, 1950],
The MINIMAX search examines each of our possible moves. For each of those
moves, it again applies the evaluation function to each of the possible moves made by our
opponent after our potential move. Choosing the node with the minimum score, we traverse
back up to the previous node and assign it the minimum value. Thus, traversing back to our
root node, we choose the branch with the maximum score. This is the maximum score we
choose that would minimize our opponent's score.
For example, in figure 3.2, Player 1 may choose to move piece 1 forward left, forward
right, or piece 2 forward right.

Piece 1, Piece 1, Piece 2,
Forward Left Forward Right Forward Right
Figure 3.2, Player 1 Possible Moves
For each move Player 1 makes, there are a resultant number of moves that are
available to Player 2. For every possible move available to Player 2, we employ our
evaluation function to determine the "score" of Player 2 at each of these potential points in
the game as shown in figure 3.3.
Figure 3.3, Player 2 Moves and Scores
Thusly, we choose the minimum score on each branch for Player 2 and assign the
parent's node that minimum value as shown in figure 3.4.

Player 1
Forward Left Forward Right Forward Right
Figure 3.4, Potential scores of each move
We can then evaluate our potential moves based upon the minimum score that is
achievable by our opponent. We maximize this score and make our move. In this case,
Player 1 chooses to move piece 1 forward right because it will maximize his position while
minimizing his opponent's maximum position. This example proceeds only two levels (or
ply's) deep; however, the process is repeated as the search continues descending the tree.
Employing MINIMAX search (which utilizes DFS) would require a tremendous
amount of processing capability, in fact, an upper bound of (where B represents
branches and d the depth of the tree) nodes must be searched. Samuel applied what came to
be known as Alpha-Beta pruning of the MINIMAX search tree.
In the earlier example of MINIMAX, Player 1 first examined the Piece 1, forward left
move. The best score (which minimized his opponent's move) achieved by his move was six
Examining the Piece 1, forward right move returned a result of eight, which was better. In
the final case of Piece 2, forward right, the best score is two, worse than eight; therefore, this
branch may be pruned" and eliminated from searching further. This is the concept of a-p

As we descend the search tree, we set a to be the best value for our score and P to be
the best (minimum value) that we can force upon our opponent. While searching the tree, if
any branch is found to be worse than the current a or P value, we can back up to our parent
and move onto the next branch thus eliminating an entire branch from our search tree.
This pruning of our branches eliminates much of the unneeded time involved in
( d\
nodes that will be searched. This allows us to double the
searching and gives us O
amount of look-ahead in the same time allotment as the standard MINIMAX procedure.
Another modification was made to the search method to assist in the time that is
required by DFS. The problem with DFS is that the order in which the offspring are
examined can prove to be crucial. Unfortunately, a~P pruning will not assist the search in the
case in which the best result lies upon the last branch searched. By searching a child with a
better score first, the results will be found quicker, allowing a~P pruning to trim out the
results. Samuel ordered each offspring in the search employing what he titled the
"plausibility analysis." The search scored each of the children nodes with the evaluation
function and then ordered each of these nodes. The search would then return to the parent
node and use the a~P pruning with respect to these ordered, calculated child nodes.
Because of the search space and time restrictions placed upon Samuel with such an
early computer, the search tree is further limited in the number of ply to search forward. The
ply limitation algorithm he developed followed six basic rules:
1) Always search a minimum of 3 ply.
2) The search stops at ply 3 unless any of the following occur:
a) The next move is a jump.
b) The last move was a jump.

c) An exchange of pieces is possible.
3) The search stops at ply 4 unless 2a or 2c are then met.
4) Continue searching unless at any ply no jump is offered.
5) The search stops at ply 11 if one side is ahead by more than two "kings."
6) The search stops at ply 20 regardless of conditions
By limiting the ply levels available for searching and employing the a-fi pruning, the
program is required to search a path that follows a more winning (direct) route rather than
allowing the machine to explore paths which are known to lead away from the goal of
By utilizing this search method, the computer would be capable of calculating a
game of checkers; however, it would not "learn" about strategy, moves, etc. It would simply
try to maximize its score while minimizing its opponent's. Any average human opponent
could develop a basic strategy around this simple, fixed game play.
3.3. Methods of Learning
Samuel began work on developing a system to allow the machine to adapt to
mistakes and strategies presented to it during game play. The first component of this
method, rote-learning, was a very basic system to store results of previous games. After
failing to get substantial results, Samuel developed a second algorithm he called, learning by
In rote learning, the objective is to allow the computer the ability to search much
further into its search tree. This is accomplished by storing the best move found every time
the DFS a-fi pruning search is employed. During play, if the same move is encountered

again, the result is already computed, allowing the computer to begin searching deeper into
the tree. If a particular move was found during a 3 ply search, the second time that node is
encountered, the computer can begin its search at the third ply, thus enabling it to search
much deeper into the tree. Every time the same move was encountered, it was "rewarded"
by increasing its value slightly to encourage further exploration.
The problem Samuel found with this method is that no sense of direction was given
to the search routine. The machine did not recognize that other paths (though equal or lower
in ply-value) may lead to a quicker victory. In order to try to compensate for this problem,
each calculated score was weighted based upon the ply-value required to achieve the score.
Thus, if the machine is losing, it will take the path that maximizes the number of moves, and
if winning, it will take the path with fewest moves.
Samuel played this algorithm against himself, human players of varying abilities,
itself, and book games. During all of these games, the computer stored 53,000 positions.
Samuel found that the machine had very strong opening and closing games (where the
number of possible positions and moves is much lower) and that the middle game was
extremely weak. He estimates that over one million positions would have to be stored in
order for the machine to have a strong middle game.
Samuel began work on a more adaptive system by focusing not on the mechanics of
searching but on the evaluation function itself. The evaluation function is the most important
component of the game play and it is in combination with the rote-learning that Samuel
achieved his legendary results. The evaluation function must quickly and accurately
measure the current state of the game while also giving a reasonable interpretation toward
the possibility for the position to "win the game." Samuel developed a system using a first-
order linear polynomial.

Samuel's learning by generalization learns" by adjusting the weights of various
chosen parameters, which are used to calculate the current position's score. This evaluation
function was a basic linear polynomial of the form shown in equation 3.1.
v = a)]p]+(o2P2+-+a)p (3.1)
These parameters are selected for the game of checkers and are set up a priori to the
execution of the game. Samuel deemed thirty-eight different parameters to be important to
the outcome of the game. At any single calculation, sixteen were used by the linear
polynomial while twenty-two were in reserve. The machine arbitrarily chose these sixteen
parameters along with their weights at the initial stage. After each cycle, if a particular
parameter's weighting coefficient fell below the threshold, its counter would be increased.
Once a parameter's counter reached a set maximum (Samuel chose eight), that parameter
would be taken from the polynomial and placed at the end of the reserve queue. The top
parameter in the reserve queue was then removed and added into the polynomial.
A sampling of the parameters that Samuel applied to p, include:
Advancement of the piece
Exchange of pieces
Mobility of the piece
Currently a threat to an opponent's piece or pieces. (If p, is a THRET for some i,

assume Q leads to position R; therefore, /?, is equal to the value of THRET for R minus
the value of THRET for Q.)
Although the use of an a priori list of parameters is somewhat simplistic, at the time,
no other method for computer-generated terms had been found. Additionally, due to the
underlying complexity of the strategy of checkers, no one knows which parameters create the
minimum group required to play a winning game. Finally, a fraction of the parameters
created for representations by Samuel were combined into a single non-linear parameter
used within the linear evaluation function.
To enforce the learning, Samuel played the computer against itself. The first player,
called a, would adjust the weights during each turn. The second player, /}, would leave the
weights untouched; however, it used the same evaluation function that was being modified
by a. During play, a would perform its standard look-ahead calculations as performed by
the a-/3 pruning MINIMAX. This value would be stored and the move made. On the next
turn, a would use the evaluation function on its current position and compare this with the
stored value. If the difference were positive, the weights of the polynomial would be
Samuel found that these calculations and adjustments had to be made after each
move; otherwise, instability would become prevalent. This instability was due to the fact that
the determination of the opponent's move is not made with the scoring polynomial; however,
the anticipation of the opponent's move is calculated with the scoring polynomial during
MINIMAX. Samuel including another means to help stabilize the polynomial by only
updating the parameter weights if the delta between the stored value and the new value was
above a minimum threshold. This minimum was recomputed at each cycle and was set to
the average value of the coefficients for the terms of the utilized parameters.

The final stabilizing feature (and one which becomes key in reinforcement learning)
was to scale the weight based upon the number of cycles that the parameter has been used.
This aided in keeping stability when a parameter term is first used (and thus, possibly a large
delta) and after it has been in used for many iterations. The correction value is computed as
shown in equation 3.2.
C = C(I_,(3.2)
This correction value forces those weights that have been in use longer to be worth
more than those that have recently been added.
Samuel's game as described became a good starting point in solving checkers. When
the machine would compete against itself (using the a [i pairing as described), a changed
the top of its parameter list 14 times and lost 4 of 7 matches. Throughout the trials, a
continued this pattern and repeatedly played quite poorly. After approximately twenty-five
games, the machine would begin to play at a "better-than-average" level; however, the
learning at that point became "erratic and none too stable" [Samuel, 1959],
The program was also fooled by terrible play of the opponent. It seemed to be
learning to play like its opponent as opposed to learning to beat it. Samuel found that
changing the weighting coefficients less drastically when the delta was positive as compared
to when it was negative helped to stabilize the program. In a positive delta case, only
parameters that contributed negative terms in the positive scoring polynomial were updated.
In the negative scoring polynomial case, only parameters that contributed positively were

The program was found to be adding and removing parameters too frequently and it
caused very unstable evaluations. Samuel increased the removal count threshold from eight
to thirty-two to try and keep parameters more stable.
The program did not assign proper credit to the moves it had made. If a spectacular,
high scoring move was performed, credit was given to the previous move as opposed to the
series of moves that laid the groundwork for the spectacular move. Samuel never found a
precise method for correcting the incorrectly assigned credit; however, the fix applied to the
program allowed the remembered moves to increase until the delta exceeded its arbitrary,
minimum value and then apply the update to that historical position. This, however, may
assign credit falsely to a move along the chain.
The final instability found lied within the way the a-fi pairing was played. If a
played poorly and worsened its parameters, but, simply by chance, scored a victory over [i,
then in the next game, P would utilize these poor parameters and a would never improve its
game play. Simply by only updating the parameters used by ft if a minimum number of
victories were achieved by a solved this problem.
It was with this system that Samuel performed further testing, playing book games
against the computer, many games against itself, and games against humans. It eventually
achieved a master level beating all but the expert level players. Samuel showed that the
generalization scheme could be an effective learning method. The memory requirements for
such as system were modest (even by 1955 standards) and operating times remained fixed
and constant. Finally, even with an incomplete set of parameters, the computer was able to
learn to play a very high level game of checkers.

Samuel's method of learning by generalization was one of the first applications of what
is now referred to as Reinforcement Learning. Specifically, the method he developed is quite
similar to the temporal difference learning employed within the field of Reinforcement
Learning. Samuel assigned "rewards" for various parameters deemed as important to the
game (by increasing their weight). Samuel's algorithm always attempted to increase its piece
advantage; however, because the function was never given a reward to the generalization
function, it could (and, in fact, did) get worse with time.

4. Reinforcement Learning
The entire concept of reinforcement learning is built upon the principal of learning
through our interaction with a given environment. This idea utilizes the concepts observed
in the study of biological learning. A rat that runs through a maze, learning its way around
each turn until it finally reaches the destination, is rewarded with a treat, and freed from the
maze is a primary example of the basic tenets of reinforcement learning. The two most
important features of reinforcement learning are:
Trial-and-Error Search:
Discovery of actions within the environment that yield results.
Delayed Reward:
An action may affect the subsequent action or any subsequent rewards.
It is the trial-and-error search, which provides us with another key to reinforcement
learning. The learner must "trade-off between exploration and exploitation" [Sutton, 1998].
The learner needs to explore new territories that it has never seen, but, also exploit the
knowledge that it has already gained. To help with exploration, the learner should (at
randomly selected intervals) choose a non-greedy (or non-exploitative) path to seek new
answers. We call this an exploratory move "because they cause us to experience states we
might otherwise never see" [Sutton, 1998],
The standard reinforcement model (shown in figure 4.1) is based upon an agent (A)
with an environment (E). The agent at some point (t) takes an action (L). Upon the agent's
completion of action L, the state of the environment (S) may change. This change in the state

is given to the agent via a reward (R). Poor decisions are given smaller rewards (or possibly
a correction) than wise decisions.
Sutton and Barto define four primary components of the reinforcement learning
system [Sutton, 1998]:
Defines how the agent is to react to a given situation.
Reward Function
Ranks possible actions from the policy on their reward to the agent. The reward
function essentially maps our state with a given policy and scores it upon its ability
to reach the goal.
Value Function
Defines the total number of rewards the agent can accumulate between the current
state and the goal. Without rewards, we would not have a value; however, it is the
values with which the agent is most concerned simply to achieve more reward.
Environment Model
An estimator that can mimic the environment the agent is acting upon. Given a state
and policy, the environment model should give us a reasonably accurate
representation of the new environment.
Figure 4.1, Standard Reinforcement Model

Therefore, the job of the agent within its environment is to create a policy which
maps its actions to the environmental states that will achieve (typically maximize) the desired
goal (or long-term) state. What the rat running through a maze example shows us is an agent
interacting with its environment (the maze) making decisions to reach its goal in the face of
uncertainty on what lies around the next corner. This uncertainty is much more akin to a
true environment as opposed to a supervised learning model.
In supervised learning techniques, the agent is presented with known examples and
outcomes until the agent sets its policy to match the cases it was given; unfortunately, the
agent tends to learn only what is has been shown or "taught" by the supervisor. The
reinforcement learning agent learns through its interaction with the environment. It is this
interaction which distinctly distinguishes reinforcement learning from supervised learning in
two ways:
There is no feeding of known input and expected output.
The evaluation of each policy action is "concurrent" with learning in the supervised
Many research and industrial applications have begun to be implemented utilizing
reinforcement learning techniques. In fact, reinforcement learning is becoming quite popular
with researchers because it "serves as a theoretical tool for studying the principles of agents
learning to act" [Kaelbling, 1996]. Some of the applications which have proven
reinforcement learning to be a practical computational tool:
Industrial Manufacturing
Job Scheduling
Combinatorial Search Problems
Computer Game Research

4.1. History
Although the concept of reinforcement learning has been researched in both biologic
and artificial intelligence for years, it has been primarily in the last ten years that
reinforcement learning has gained more acceptance in machine learning. Samuel was the
first to employ what would now be called temporal difference learning; however, no one
carried on with his thoughts and methods until the resurgence of reinforcement learning in
the late remaining years of the 1980's.
The subject of reinforcement learning as it is now researched is a combination of
three differing research studies:
Trial and Error Learning
The focus of trial and error learning is quite common in psychology where it was
presented by Thorndike that actions followed by a reward (or punishment) cause the
action to be remembered by the agent for proper selection in the future. This is more
precisely Thorndike's law of effect discussed earlier. The first most important
characteristic of the lazu of effect is that it is "selectional, meaning that it involves
trying alternatives and selecting among them by comparing their consequences"
[Sutton, 1998]. The second important characteristic is that it is "associative, meaning
that the alternatives found by selection are associated with particular situations"
[Sutton, 1998].
Optimal Control & Dynamic Programming
Optimal Control systems began in the late 1950's as a means to solve the problem of
controlling a dynamic system to minimize the dynamics over time. Most of this
work continued upon the foundations laid by Hamilton and Jacobi in the previous
century. Richard Bellman furthered this approach to use the system's state and a
function valuator to create an equation for the system. The methods that Bellman

invented came to be known as Dynamic Programming. Bellman furthered his
research in optimal control and developed a discrete stochastic version that came to
be known as the Markovian Decision Processes. All of these methods and theories
have provided a pivotal theme to the theory of reinforcement learning.
Temporal-Difference Methods
In temporal-difference, learning is "driven" by repeated estimates of the same
valuation. Temporal-Difference parallels the concept in biological learning of
secondary reinforcers, which are paired with primary reinforcers and exhibit the same
qualities. Arthur Samuel was one of the first to include ideas of temporal-difference
by incorporating these concepts into his checkers program. During the 1960's, most
work began shifting toward supervised learning until 1972, when Klopf continued
with some of the earlier works. In the last ten years, interest and work in temporal-
difference has increased. Temporal-Difference plays a smaller role in reinforcement
learning than the other two threads; however, its inclusion is important and unique
to the concept of reinforcement learning.
4.2. Classes of Reinforcement Learning Methods
Sutton and Barto define three "fundamental classes of methods" which can be
employed for the solution of Reinforcement Learning problems. The first, dynamic
programming (DP), is well understood and developed from a mathematical viewpoint;
however, it requires an accurate and complete model of the environment. Because of the
stochastic nature of game play, there is not a complete model of the environment. The
second is based upon Monte-Carlo (MC) methods. These algorithms are conceptually simple
and, as opposed to dynamic programming, do not require a complete model of the
environment; unfortunately, Monte-Carlo methods are not well suited to solving incremental
(step-by-step) computations. This leads us to the third, and the focal point of our discussion:

the temporal difference method. Temporal Difference (TD) does not suffer from the
requirement of a complete, accurate environment and it is rather well suited for incremental
computation; however, temporal difference methods are far more difficult to analyze and
Markov property is simply a state that retains all relevant information. A problem that is
bound by the Markov property requires only the previous state and an action to achieve a
subsequent state. At any state, s, and any action, a, the probability of each next state, s', in a
finite Markov Decision Process is given by equation 4.1. The expected value of the next
reward is shown in equation 4.2.
These two quantities detail the important facts of the Markov Decision Process for
reinforcement learning.
Every form of reinforcement learning discussed satisfies the Markov property. A

4.3. Temporal Difference
Temporal Difference learning is the concept that is unique and key to the current
concept of reinforcement learning. Temporal difference methods take ideas from both
Dynamic Programming and Monte Carlo methods to form the framework of its method.
From DP, TD methods can utilize previously learned estimates for updating their current
estimate without the requirement of a final solution. From MC, TD methods do not require
the complete model of the environment for making their decisions-environmental stimulus is
The most interesting characteristic with TD is the fact that TD uses past experience to
predict its next estimation. The states are represented with Vn, a state at some time interval,
f, to be St, and the estimate at a given time of our states to be F(s,). We begin with the
simple, constant-a MC method for changing environments as in equation 4.3.
V(s,)=V(sl) + aSl (4.3)
S, = R,-V(s,) (4.4)
We define Rf to be the actual value found after time, t, and a as a constant step-size
parameter. The problem with this Monte-Carlo method is that we must wait until the end of
the entire episode to find our Rt value that would allow us to update our estimation: P(.v;).
Temporal Difference uses a separate goal from MC, replacing the Rt value with a more

immediate goal. The basic TD algorithm, called TD(0), uses the basic form of 4.3 with <5, as
defined in equation 4.5.
s, = 'i+i +7*/(-s,+i)- P(s,)
So, the target for temporal difference is for the next time step, as opposed to the
completion of the goal. We represent our next time step target with r,+| as our immediate
reward, and yas the discount parameter (discount for estimation). This concept of using the
previous estimation and results to generate the next estimation is similar to DP and is
referred to as the bootstrapping method.
This combination of the MC sampling technique with the DP bootstrapping is the
fundamental theme throughout the temporal difference method. The expected value from
our experience of n is utilized with the current estimation at time t. To illustrate, let us look
at the estimated value from the MC method as shown in equation 4.6. Next, we inspect the
estimated value from the DP method as shown in equation 4.7.
What the TD method does is to employ the estimated value from the DP method;
however, we do not know the current estimation in our experience because it has not been
performed yet. Thus, we substitute VK with Vr This is the key to utilizing our expected
value with our current estimation and what makes temporal difference a powerful tool to
r(5) = £,{/?(|s,=5}
^(5)= £^1+^(5, + 1^=5}

To demonstrate the advantage of temporal difference over the Monte Carlo method,
we shall examine a sample Markov Decision Process. This is an example modified from
Sutton [Sutton, 1998]. Suppose we have a robot exploring the floor of a building looking for a
particular room. It comes to a section of the floor in which it can make a series of decisions.
See figure 4.2 for a visual layout of the environment. At any point, the robot may randomly
(with equal probability) select to go to the next or previous point until it reaches one of the
two rooms.
Room #1
Room #2
Figure 4.2, Robot Exploration as a MDP
If it reaches room 2, it will be rewarded. If it reaches room 1, it will receive no
reward. In this simple example, we wish to properly find the estimated value of return for
each step along the way. There are 6 potential positions (5 stepping points and a final point).
Therefore, the proper estimate of reward for each point A through E is > ^3, , % > und %
We initialize every step to have an estimated return value of Yi.

initialize V[s] to 0.5
current := C
Repeat until current = a room
next := randomfnext, previous)
if next = room 2 then r := 1
else if next = room 1 then r := 0
V[current] := V[current]+ a* (r+y*V [next]-V[current])
current := next
Code 4.1, Pseudo-Code for TD(0) MDP
We then program the robot to attempt the problem by updating the estimates with
both temporal difference (See code 4.1) and constant-a MC (see code 4.2).
TD(0) vs. MC
Figure 4.3, Comparison of TD and MC MDP results
As shown in figure 4.3, over several different a values, TD(0) is consistently better
than MC. It is also apparent that the constant-a values for the MC method must be very
small as they become very unstable as it grows larger. Though both refine their estimations to
reasonably close values of the proper estimate, TD(0) has a smaller error. Furthermore, TD(0)

converges much quicker due to the fact that it is allowed to update its values during each
episode as opposed to waiting for the series of episodes to terminate before computing the
changes as is required with MC.
initialize V[s] to 0.5
current := C
repeat until current = a room
next := random(next, previous)
if next = room 2 then r := 1
else if next = room 1 then r := 0
push current onto S
current := next
for all n in S
V[n] := V[n]+a*(r-V[n])
Code 4.2, Pseudo-Code for MC MDP
TD gives us several distinct advantages:
Does not require a model or a priori knowledge of the surrounding environment: key
to game play.
Need only to wait until the end of each time step rather than until the end of the
entire trial as is dictated by Monte-Carlo methods.
Has been proven to converge for any fixed policy n. In fact, in constant-a, stochastic
tasks, TD has been shown to converge even faster than MC methods [Sutton, 1988].
4.4. Temporal Difference with Eligibility: TD(A.)
Eligibility traces allow us to utilize the entire spectrum of learning methods from the
Monte-Carlo (which stores all actions, then updates after completion) to TD(0) which stores
none of the actions, but, simply updates after each action. Eligibility traces assign credit (or
blame) for each action taken within a particular state allowing us to bridge between
information and actions taken. We can apply eligibility traces to any of the reinforcement

methods; however, it is when TD(0) is combined with eligibility traces to form TD(A.) that we
are concerned.
Eligibility traces are employed to provide the benefits of both TD(0) and MC within
one algorithm. With TD(0), we can only update based upon the immediate reward of the
action we have just taken. What if there is a delayed reward? It will not be recognized until
that reward is achieved. On the other hand, with MC, we can only update once we have
achieved the goal state and received the reward (if any). What if the reward was achieved
before reaching the goal state (or minor rewards were given along the path)? We will update
every step we took along the path giving it the same reward regardless of whether it
contributed or not. Eligibility traces will allow us combine the best of each algorithm without
falling into the traps of each respective method.
To combine the two, we would simply credit rewards back n steps: greater than one,
but less than every step taken until termination. The process of updating only one step
backward is equivalent to TD(0) and similarly, updating only once we have every step is
equivalent to MC. In MC, the only return we use is the reward plus the estimated value of
the next state as shown by equation 4.8. Equation 4.8 also includes the discount value for
discounting the estimates of next states (the discount is set to 1 for our examples, i.e. no
R,=rt+\+Vv,(st+1) (4-8)
Now, because MC is based upon a single step (the final reward step) to update its
previous steps, we must expand this return function to include n steps. Equation 4.9 shows
the return for an -step algorithm.

What if the episode was to hit a goal before the nth step? In that case, the standard
MC return, equation 4.8, is employed. However, this situation helps us to understand why
such an n-step method is not practical. The n-step method suffers the same restriction that is
placed upon the MC methods: it must wait a certain (n) number of episodes before
calculating the return. For real-time applications (such as control), this latency is can become
problematic. Therefore, we must apply the benefits of the n-step method without the
requirement of waiting for the steps to occur.
We define the eligibility trace to properly decay a state's contribution to the current
state as it moves further into the past (i.e. a state's eligibility is the "degree to which it has
been visited in the recent past" [Kaelbling, 1996]). Within each iteration, every state decays
by yk, and if we are updating the current state, its eligibility is incremented by one. Updating
the standard TD(0), equation 4.3, to include the eligibility trace, results in equation 4.10.
Where 5,is defined in equation 4.5 and e,(sr) is the eligibility value for state 5,.
Equation 4.11 defines the eligibility trace:
The new parameter, A, is defined as the trace-decay parameter. This parameter defines
how quickly the weighting of each state as it contributes to the current state will fall off in
time. For larger A, states are heavily weighted even as they have drifted further back into the
V(s,)= F(j,) + a5,e,(5,)
if s is our current state (st)

history. In fact, though it is computationally more expensive to utilize the generic TD(A), it
has been shown to converge considerably faster for systems using a larger X [Dayan, 1992].
Now we have included a way to historically weight states without waiting for the reward to
be given (as is the limitation of the MC and n-step algorithms). We can continually update as
with the standard, TD(0); however, we may properly credit previous states with any rewards
In fact, through simple inspection, we can see that the TD(A) algorithm is simply a
generic form that combines both TD(0) and MC at its extremes. First, allow us to examine the
case in which A. is 0 and pis 1. In this case, our eligibility trace is reduced to equation 4.12.
Substituting the eligibility back into equation 4.10, we arrive at equation 4.13 which is simply
the TD(0) case as shown in equation 4.3.
e,(s) =
if s is our current state (.?,)
r(l,)=F(S() + a5,l
if s is our current state (s()
Conversely, if we examine the case in which A. is 1 and yis 1, we see that the
eligibility never decays and every state will eventually be rewarded by the final reward.
Thus, we may consider the TD(1) case to be the Monte-Carlo method as described in equation
4.3 and 4.4 with one major caveat: TD(1) is performed on-line, that is, it occurs during the
episodes without waiting until a goal has been reached. Thus, with the TD(1) method, you
can utilize MC methods without waiting until the final goal to update the states.
Revisiting the robot-walk example from the last section, we can now implement the
TD(A) algorithm (as shown in code 4.3), and test it in various cases of X. At A=0, we should

expect to achieve the same results as the TD(0) case, and at A.=l, we should achieve very
similar results to the MC method; although, because it is still performed on-line it will
converge quicker than the standard MC method. It is at various values of X that we should
achieve faster convergence and better results. It will also introduce us to the problem of
choosing the proper trace-decay parameter value for the problem at hand. Some general
domain knowledge is required for proper selection of the decay value. For the simple robot
walk, we would suspect that a higher X (though less than 1) would lead to a faster
convergence because every step is important to our walk. In a game of chess, however, a
typical game that may last through 40 moves per player, a faster decay may be more valuable
as the initial pawn moves may not have contributed much to the current state.
1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
Figure 4.4, Comparison of Different X values for TDft) and MDP Robot
As shown in figure 4.4, the results of TD(A.) with a constant a=0.1, are consistently
better than either TD(0) or MC as shown in figure 4.3. In fact, one can see that the A.=0 case is

identical to the TD(0) case for a=0.1. Not only does the TD(A.) case converge quicker, but, it is
less susceptible to worsening its estimates over time as we saw in the previous example with
the MC method.
initialize V[s] to 0.5
initialize e[s] to 0.0
current := C
repeat until current = a room
next := random(next, previous)
if next = room 2 then r := 1
else If next = room 1 then r := 0
8 := r + y*V[next] V[current]
e[current] := e[current] + 1
push current onto S
for all i in S
V [ i ] : = V [i ] + a*8*e [ i]
e[i] := y*\*e[i]
end for
current := next
Code 4.3, Pseudo-Code for TD(^.) implementation
4.5. Samuel's Checkers and Temporal Difference
Samuel's Checkers Programas described in Chapter 3was one of the first
computer programs developed with the key aspect of reinforcement learning: the value of a
state is dependent upon the value of its subsequent states. In Samuel's simple rote-learning
method, he initially stored the value of the Depth First Search at each node and attached with
it a ply-value. Therefore, if the score were achieved with more ply levels (or moves) it was
discounted more than a move with the same score that required fewer moves. So, the rote-
learning score was based somewhat upon the fact that the state was dependent upon the
subsequent states of the game. It was with his second method, learning by generalization that
Samuel came closest to the modern notion of reinforcement learning.

In his learning by generalization method, Samuel adjusted the weights of each game
characteristic that contributed to the current state. So, after a move by both players, the
program would search for the best possible state to achieve. It would then backup the score
of this move to the last state and adjust the weights to guide it. This would happen after each
move so that at any state, the linear function would more accurately predict the following
state. Samuel's backup method is the same in concept with Temporal Difference Learning;
however, it lacked an important characteristic: rewards.
Samuel's learning by generalization method pushed the linear value function to be
consistent and accurate; however, he had no way to bind the function to the actual value of
the states. This is enforced in the standard Temporal Difference methods by tying rewards
and discounting from these terminal reward states. Samuel tried to introduce consistency to
his method by giving states with strong piece advantages a rather large, fixed weight.
However, by simply introducing a large weight within the evaluation function itself, the
function tried to fit the other weights in line with the single large term. This led to the
program actually becoming worse with age. Samuel would "kick-start" the program by
resetting some of the larger adjusted weights back to zero.
4.6. Gradient Descent with TDOi)
Oftentimes, the values being adjusted are not state values, but weights (or
parameters) of a non-linear function which attempts to estimate or predict our state. These
functions allow us to generate a fairly good approximation of the larger state as opposed to a
table type state (as was employed by the MDP robot-walking example previously).
However, the problem becomes, how does the TD(A.) method allow us to alter these weights
as opposed to table entries?

Firstly, we must now think of Vt as representing a parameterized function as
opposed to a table. We declare that the vector 0, is our parameter list. Therefore the non-
linear function, Vt, is completely dependent upon 0,, and changes from state to state as 0,
and the inputs change. This allows us to approximate many different functions, which are
suitable for many applications. The most popular of these functions in approximation are
those that fall into the gradient-descent methodology. Fortunately, TD(A.) lends itself very
well to solving the gradient descent problem.
Quite simply, gradient-descent methods attempt to reduce the error between the
estimated value and the actual returned value. So, on each iteration, we choose an action, a,
and estimate our resultant state as computed by our differentiable function Vr Upon
arriving at our new state, we find that the estimate, Vt, is different than the actual value, Vn.
We must, therefore, adjust our parameters, 0,, in the direction that would most reduce the
error. The generalized gradient descent method for adjusting 0, is shown in equation 4.14.
The gradient-descent method therefore adjusts 0, by a value that is "proportional to
the negative gradient of the squared error" [Sutton, 1998]. So, by applying equation 4.14 to
the general TD(A.) form as seen in equation 4.10, we come up with equation 4.15. No longer
are we adjusting the actual values of Vt, but, now we are adjusting the parameters, 0,, so that
the function will return a more accurate estimate.
0,+l = 0, + a8,e.

Where 5, is defined by equation 4.5 and e, is the eligibility trace for each component
of 0,, defined by equation 4.16.
e,=yXe,_\ + V-F,(s,)
where e0 = 0
The most immediate application of equations 4.15 and 4.16 is, of course, the multi-
layer neural networks employing backpropagation. We can directly relate 6, to the weights
in Back-Propagation [Hertz, 1991] and Vt to our neural activation. In fact, as we shall see in
section 6.4, these two equations map directly to the updating of the weights neural nets using
the back-propagation to compute the gradient descent of the eligibility traces.

5. Tesauro's TD-Gammon
Backgammon is an ancient game believed to have been developed more than one
thousand years before Chess. It is played with blots ("checkers") on a board as shown in
figure 5.1. Each player alternates rolling two dice and moving their blots in opposing
directions. The first player to move his blots around and off of the board is the winner. A
"gammon" is awarded if an opponent can move his blots off of the board without having lost
any blots to his opponent. A "backgammon" is awarded if a player moves his blots off of the
board without having lost any blots to his opponent and is able to keep all of his opponent's
blots in the far quadrant of the board.
Figure 5.1, Backgammon Board
If a player lands on (or "hits") an opponent's single blot, the blot is sent to the bar in
the center of the board. The person who has their blot "sent" must then re-enter the board
before making any other moves. Additionally, by placing more than one blot on a point, the
opponent is blocked from placing any of his blots onto that point. Finally, an opponent may
offer to double the current stakes of the game. If the opponent accepts this "doubling cube,"

he is given the right to make the next double offer; whereas, if he resigns the double, he
forfeits the current stakes. The points at the end of the game are calculated using the current
value of the doubling cube multiplied by 1 for a standard win, 2 for gammon win, and 3 for a
backgammon win.
The strategy complexity added to the game in hitting, blocking, and doubling make
Backgammon a very challenging game for humans to play, let alone for a computer to
master. In fact, due to the stochastic nature that comes from the dice roll, there are over 1020
possible states in the game of backgammon. This proves much too complex for a table
lookup. Deep tree searching algorithms fail for Backgammon as well because of the random
dice rolls. There are 21 possible dice rolls and an average of 20 moves per dice roll. This
would mean for every single ply there is an average branching ratio of around 400. This
would not be feasible on even the largest of computers.
5.1. Tesauro's Neurogammon
Neurogammon was Tesauro's first major work on the game of Backgammon.
Tesauro employed a standard Hopfield Neural Network with back-propagation (as
described in section 6.3) to attempt to give a generalized sense of direction for playing
Backgammon. The initial setup contained no hidden layers, 459 input nodes, and 1 single
output node. Each board location (of which there are 26) is represented by a number of input
nodes. Other input nodes represent varying board situations. This initial setup tended to
favor piling up as many pieces onto one point as the computer could. Tesauro and Sejnowski
experimented with varying sizes of the neural network including adding hidden layers of
varying node sizes.

Training the network was performed with supervised learning of approximately
3,200 positions and moves. These moves were generated by backgammon experts and
repeatedly shown to the network. Learning was accomplished with approximately 50 cycles
through the training set. Of course, the largest disadvantage to this was that the computer
could not learn the game as it played. It learned the game based upon what it was shown (or
taught). Tesauro (who is regarded as an expert player) continued to train the computer with
his own moves and later served as the evaluation on how well the computer could play
against a quality human opponent.
After training, the network was shown to have developed a generalized sense based
upon the moves with which it was taught as opposed to simply memorizing those moves.
The game was tested against other computer backgammon programs as well as against
human opponents. It was able to best the other computer programs around 60% of the time,
but, a human opponent only around 30% of the time. It made obviously stunning errors as
well as some rather surprisingly "thoughtful" strategic moves; however, overall, it was an
average to novice backgammon player. Even though (against human opponents) the
Neurogammon was not very successful, it did win the "backgammon championship at the
1989 International Computer Olympiad" [Tesauro, 1989],
5.2. Tesauro's TD-Gammon
Tesauro later began work to build his backgammon game based upon the new
learning paradigms built around reinforcement learning. He focused his attention upon the
temporal method of learning: the agent observes the input parameters and decides upon the
output. It is then rewarded or scolded by the consequence of its actions as opposed to
Neurogammon that required a teacher to correct it. Although Tesauro had been interested in
temporal methods, the primary problem had been the delay in reinforcement proved to be

difficult for proper credit assignment and that evaluations in previous models had tended to
be linear. Two advancements allowed Tesauro to test the capabilities of TD(A.) and to
compare to the previous Neurogammon work:
Many non-linear approximation functions had been developed
Advancement of temporal methods (particularly TD(A.) by Sutton [Sutton, 1998])
Tesauro constructed a standard Hopfield neural network with 198 input units, 40
(later 80) hidden units, and a single output unit. The output unit represented merely an
estimation of the probability of winning based upon the input. For the input layer, four units
represent the number of blots white possesses at a single point and four units represent black
at the same point. For the 24 points, there are 192 input units used. Two more input units
represent the number of blots white and black has on the bar. Two additional units represent
the number of blots that white and black has removed from the board. Finally, two units
specify whether it is white or black's turn.
The input was fed into the network which then used a standard feed forward to
compute the output: a value from 0 to 1 representing the probability that the specified player
would win. TD-Gammon then utilized the gradient descent method of TD(A.)as shown in
Section 4.6 and Section 6.4-to update the network after each move and observed state. The
initial weights were set to randomly small values, so the network had no a priori knowledge
of backgammon or the strategy. TD-Gammon was then played against itself allowing the
neural network to observe and predict for each player of the game.
This first version of TD-Gammon did not perform a multi-ply look-ahead search to
determine the best possible move to make. Rather, it evaluated each available position due to
its die roll and chose the position that would lead to the greatest probability of victory. In
later versions, Tesauro implemented a 2 ply look ahead to improve the results to which the

computer could use to choose the better move. Each version of TD-Gammon was tested
against world class backgammon players and the immediate results were somewhat
The initial version was trained for 300,000 games and played against Neurogammon.
This version consistently bested Neurogammon. The most amazing result of this was that
after only tens of thousands of games, Tesauro found a great deal of learning had taken place.
This initial version 1.0, played at a high intermediate level without any assistance. Later,
modifications were made to subsequent version to give the network more input as to the
progress of the game. These versions achieved a grand-champion level playing very
competitive games to the top human competitors in the world. Table 5.2 shows the results
for each version of TD-Gammon.
TD 1.0 300,000 Robertie, Davis, Magriel Lost by 13 points in 51 games (-25 PPg)
TD 2.0 800,000 Goulding, Woolsey, Lost by 7 points in 38 games
Snellings, Russell, Sylvester (-.18 ppg)
TD 2.1 1,500,000 Robertie Lost by 1 point in 40 games
(-.02 ppg)
Table 5.2, Results of TD-Gammon
Version 2.0 of TD-Gammon was shown at the 1992 World Cup of Backgammon
Tournament playing against some of the best players in the world. In fact, Wilcox Snellings
and Joe Russell were both former World Champions. Bill Robertie played the most number
of games against TD-Gammon, and, in fact, played a tournament against version 2.1 of TD-
Gammon. Robertie trailed TD-Gammon for the entire tournament until the very last game,
in which he was able to beat the computer and narrowly win the tournament by a single

point. Robertie later stated that TD-Gammon 2.1 plays at a "strong master level that is
extremely equaling the world's best human players." [Tesauro, 1995] He went on
to state that due to the tireless computation of the computer, it would be favored against the
best players during a long, grueling match play.
Kit Woolsey, who (as of 1995) was rated the number 3 player in the world, wrote a
favorable write-up of TD-Gammon. Woolsey analyzed the game play of TD-Gammon and
offered the following personal comments to Tesauro [Tesuaro, 1995]:
"There is no question in my mind that its positional judgment is better than mine. Only on
small technical areas can I claim a definite advantage... [TD-Gammon's] strength is in the
vague positional battles where judgment, not calculation, is the key. There, it has a definite
edge over humans... [Itsl judgment on bold vs. safe play decisions, which is what
backgammon really is all about, is nothing short of phenomenal.
One extremely powerful example of the game play of TD-Gammon is in opening
positional play. For 30 years, the standard human opening move (called "slotting") was to
setup your piece for possible hitting, but, allowing for an attack position. TD-Gammon
introduced a different strategy (now called "splitting") which avoids the "slotting" maneuver
and is now considered a superior strategy. In fact, most world-class players now employ the
"splitting" strategy for the opening move.

5.3. Tesauro's Conclusions
The results of utilizing the standard neural network to compute the gradient for non-
linear function approximation as applied to the game of backgammon was an astonishing
success that even surprised Tesauro. It was the first major display of what is possible using
reinforcement learning combined with the non-linear computation available via the neural
network. Due to his work, Tesauro realized three important insights in TD(A.) based game
The absolute error of each calculation made by the function estimator (neural
network) was often around a tenth of one point. This is a rather large and substantial error in
the estimation of the game; however, because every calculation has a similar error, the
relative accuracy of the neural network was very high. In a task in which estimations are
weighed and compared to each other, each estimation having very little (if any) relative error
meant that results were very accurate within each other. This would suggest that the
absolute error observed is simply a systematic error which therefore "cancels out in move-
making decisions" [Tesauro, 1992].
The game of backgammon lent itself very well to the temporal difference method.
Due to the stochastic nature of the game, during training against itself, the evaluation
function would see a very wide array of possible states and outcomes. This forces the learner
to more dramatically fit itself to a non-linear function describing the game. In more
deterministic games (such as chess), the learner may focus on one particular strategy and
never explore the environment; therefore, for deterministic applications of temporal
difference, the learner must be forced to explore. However, simply forcing random
exploration can not guarantee success by the learner. In the game of backgammon, the game

always progresses forward as determined by the dice roll until a player reaches the end. In
chess, a random player could, conceivably, consistently choose moves to avoid the other
player forcing the game to last almost infinitely. Finally, non-deterministic games tend to
have a much smoother continuous evaluation function; whereas, deterministic games
typically have a more discontinuous evaluation function.
The final interesting lesson learned is that the neural network's backpropagation
tends to learn more highly linear concepts during the initial training. Once these concepts
tend to be properly weighted, the network begins adjustment of values that focus more on
the non-linear aspects of that which it is learning. In the case of backgammon, the network
tended to learn within the first learning iterations topics such as hitting, playing safe, and
building up new points. It was later during training that the computer began to develop
strategy and more non-linear concepts. In fact, it was found that having only learned the
basic linear concepts of the evaluation, TD-Gammon was better than a typical beginning
human player.
Tesauro found that theoretically, there are a number of limitations that could inhibit
temporal difference learning. The algorithm may not necessarily converge for a predication
and control task and even when it does converge, it could get stuck in a rather poor local
minimum. Theoretically, as the problem size increases, the learning time required may
become too great that the solution may be too resource intensive. However, with these initial
theoretical concerns, Tesauro found that the method always at least converges to a local
minimum with the output of the solution typically very good. He found that as you increase
the number of hidden units, the results tended to improve. The network, he discovered did
not suffer greatly in training time as the size and complexity of the problem was increased.
In fact, after the work of TD-Gammon, he does not believe that this could be a problem.
Finally, the most exciting outcome of the work was the indisputable fact that the

networktrained against itself with zero knowledge about the game of backgammonwas
able to learn quicker and achieve a vastly superior game knowledge and playing ability than
Neurogammon (which was trained with a huge number of expert moves). This fact alone
would suggest that a number of applications and research could be conducted on the
utilization of learning non-linear functions with a neural network and temporal difference

6. Tic-Tac-Toe and Reinforcement Learning
Tic-Tac-Toe is a basic, children's game thought to date back to ancient Egypt. In fact,
the Romans played a similar game called Temi Lapilli that was extremely popular; although,
there is some discussion as to whether it was identical to modern Tic-Tac-Toe. It is believed
that Temi Lapilli used three pieces (as opposed to the two in modern Tic-Tac-Toe) and was
more complicated than Tic-Tac-Toe. For Temi Lapilli to be as popular as it was, it would
almost certainly be more complicated that Tic-Tac-Toe.
Tic-Tac-Toe is played on a board with 9 positions. There are three possible states for
each position: blank (no player has laid their piece), X (player X has laid their piece), or O
(player O has laid their piece). A typical game would appear as in figure 6.1. The player
with the X pieces goes first, putting a piece in any unoccupied space. The player with the O
pieces makes their move in the same fashion. The first player to lay their pieces in a line
(across, down, or diagonally) wins the game. If all spaces are occupied and no player has
completed a line, then the game ends in a draw.
The strategy of Tic-Tac-Toe is extremely simple. If X begins the game by placing his
piece on any of the corners, O must place his piece in the center to prevent X from winning

the game (as long as O continues to block X). In fact, the player in the O position is at a
disadvantage because he is almost always playing defense, reacting to the move that the X
player has made.
Because there are three pieces (blank, X, O) and nine possible positions, the number
of states is bounded by o|39 j = 0(19,683). In actuality, there are far fewer states because
many of the states are winning (or terminal) states. Additionally, many of the states within
the upper bound are unachievable due to the basic rules (e.g. O could never have more pieces
on the board than X). Needless to say, a table-based implementation will require storage
space to hold several thousand states.
6.1. TDOi) Implementation
The first attempt to solve the Tic-Tac-Toe problem used the standard TD(A.) method
without gradient descent. In fact, it is the same algorithm as used to test the Markov
Decision Problem in section 4.4. The game was set up so that one of three types of player
could be assigned to each position: human, random, TD(3.). All states were initially set to
zero, thus the initial TD(A.) player had no a priori knowledge about any state. If a player won,
it was granted a reward of 1.0. If the player lost, it was corrected with a reward of -1.0.
Various experiments were performed with varying rewards for a draw game before
eventually settling upon a reward of 0 to both players in a draw game.
TD(X.) is a table-based implementation storing the value estimations for every state
encountered. As discussed previously (and we shall see), this is not the best solution for a
game domain type problem; however, it provides us with a good comparison (and even

possible teacher) to the neural network based temporal difference estimator in section 6.5.
The TD(^.) player implementation updates its estimates as shown in code 4.3.
The program controlled game play by giving the current player an opportunity to
move based upon its policy. If the move resulted in a game winning move, that player was
rewarded and the other player given a correction. The board was reset and a new game
would occur repeating until the number of games specified had been completed. The
random player would randomly select from the available board positions and place its piece.
TD(^.) employed a more cohesive selection process. TD(A.) would select a random number
between 1 and 20. If it chose the number 7, it would randomly choose from available board
positions (just as the random player). This helped to force the TD(^.) player into exploration
mode so that it would occasionally try a non-optimal move. This randomness was tested
with an increased and decreased interval and was found to have little effect either way over
the 1 out of 20 chance. The random element was required though to increase the number of
states for which it had an estimate. For the majority of the choices, TD(^.) would examine
each available board position and compare the state estimate given by the TD(A.) estimation
with each of the others. The best estimate resulted in that move being made (see code 6.1 for
the pseudo-code representation of the policy employed). If no move was better than the
other, then a random move would be made. After the move was made and the opponent
made their move, TD(^.) would make another move and compare the results of its move with
the expected results of the previous move and apply the process as shown in code 4.3 to
update the last state.

sub Estimate-Policy
if random() = 7 then
move := random open board position
foreach possible in open board positions
if estimate(possible) > max then
max := estimate(possible)
move := possible
endi f
if move is empty then
move := random open board position
Code 6.1, The Policy for a State-Estimation Player
Learning took place by running one player as random and the other player as the
TD(^) player. The TD(A.) player was typically used as the O player because O is a more
difficult position. Some tests were also performed allowing TD(^) to play as X against a
random player. Finally, TD(^.) was allowed to play against itself as both X and O to develop
strategies against another learning player.
6.2. Results of TD(J.)
With TD(^.) playing as the O position, the random player was selected to play the X
position. The program was run for one and one-half million iterations to (hopefully) explore
every state and calculate an accurate estimate for that state. The first comparison was to find
the optimal a value with which to perform our learning (using ^.=0.5 and rewarding a draw
with 0). Three values were tested for a: 0.05, 0.10, and 0.15. The program was set up to run
for 500,000 iterations, save the states to disk, and run again for three iterations. This would
allow the pseudo-random number generator to reset and the game to avoid continual

repetition of the same moves. In fact, it usually gave a large boost" in the win percentage
because it was no longer playing catchup to the initial default estimation states.
Figure 6.2 shows the case for =0.5 through the first 750,000 iterations. As shown,
after approximately 85,000 iterations, the TD(A.) player had finally surpassed the random
player in the percentage of wins per 1,000 games. The TD(A.) player continued to improve
until approximately 550,000 iterations at which point, it never bettered its percentage. By the
500,000 mark, it was losing fewer games than draw games. This meant that the TD(A.) player
did a good job of trying to win the game, and failing that, would try to prevent a win by the
random player. In fact, after one and one-half million iterations, TD(X) had stored 2,217
states and was winning the game 51.8% of the time, would tie the game 29.5%, and lost 18.7%
of the games played. The a=0.5 case was a very continuous learning function with little
chatter as shown in figure 6.2.
Win Percentage (a=0.05)
Iterations (in Thousands)
Figure 6.2, Percentage of Victories (A=0.5, a=0.05)

Figure 6.3 shows the case for a=0.10. In this case, TD(A) learned quicker and was
able to surpass the random player in about 72,000 iterations. Again, as it was reset at the
500,000 mark, it began to perform better; however, as shown in figure 6.3, it was chattering
some after it had leveled off at approximately 550,000 iterations. After one and one-half
million iterations, TD(A) had stored 2,226 states and was winning the game 50.5%, drawing a
tie 30.3%, and losing 19.2% of the games played. Although a=0.10 was able to correctly
estimate the majority of states required to beat the random player quickly, it was apt to
worsening its state estimation position as time continued.
Win Percentage (a =0.1)
Iterations (in Thousands)
Figure 6.3, Percentage of Victories (A=0.5, a=0.1)
Finally, figure 6.4 shows the case for a=0.15. This case showed that the larger the
learning rate parameter, the estimations became more unstable. The TD(A) player took 92,000
iterations to finally best the random player. As the graph shows, it was very prone to
overshooting its estimates during updates. After one and one-half million iterations, TD(A)
had stored 2,291 states (the most of any of the other tests); however, it was winning the game

only 49.8%, drawing a tie 29.5%, and losing 20.7% of the time. This learning rate parameter
gave the worst results of the three, with a=0.05 giving the best results, just as it had with the
Markov Decision Process in section 4.4.
Win Percentage (a =0.15)
Iterations (in Thousands)
Figure 6.4, Percentage of Victories (X=0.5, a=0.15)
With the learning rate parameter chosen as 0.05, experiments were performed to find
the optimal reward for a draw game. Initially, a draw game received a reward of 0. The
reward for winning is, of course, 1.0 and conversely for losing, -1.0. This gives a function to
which all states lie between -1.0 and 1.0. However, should a tie be worth a reward of some
value? Trying four separate values for ties gave some interesting results.

Percentage of Draw Games
Iterations (in Thousands)
Figure 6.5, Percentage of Games Ending in a Draw For Each Reward
As shown in figure 6.5, any non-zero draw value heavily weighted the game to play
for a tie as opposed to playing for a win. For this reason, the draw reward was chosen to be
0. With a learning rate and a draw reward set, the trace-decay parameter (X) must now be
decided. Varying trace decays from 0.0 to 0.6 were tested before deciding that the results
were not improving with each increase in the trace-decay parameter. As shown in figure 6.6,
the number of games won by the player increased slightly faster for those cases with a higher
X; however, after several thousand iterations, all X values settled upon the same curve.

TD(W player for varying Vs
X=0.5 |
Iterations (in Thousands)
Figure 6.6, Percentage of Victories for Differing X
Because the O position is playing a game of defense-first, O typically cannot mount
an offensive strategy to win the game. With TD(A.) playing O, the random player was always
at an advantage until TD(A.) adapted its states enough to more accurately predict the
outcome. As a final experiment, TD(A.) was taught to estimate the X player to compare with
the previous O player results.

TD(A.) as X player
Iterations (in Thousands)
Figure 6.7, Percentage of Wins with TDW as X
As shown in figure 6.7, with ?t=0.5 and a=0.05, the results were significantly
improved over the case of TD(^.) playing as O with the same parameters. After being trained
for one and one half million iterations, it had generated 1,489 states. This was significantly
less than the states generated as the O player, in part, because the X player has such a piece
advantage that it can quickly win a game with a wrong move by the O player. In fact, after
the training set was complete, X was winning 71.2% of the matches while O won 3.8% and
games ended in a draw 25% of the time.
After all of the training iterations for varying values of a, X, and draw rewards, it was
time to perform a subjective test to see how well TD(A.) could actually play the game of Tic-
Tac-Toe against a worthy human opponent, as opposed to the random player. Because the
best results against the random player were achieved with a=0.05 and a draw rewarded with
0, those training sets were utilized against a human opponent: the author. For varying X, the
game was played against the human opponent for twenty iterations to test its prowess.

Number of Training Iterations! Spgpl Games won by Human 1 Games won by TIM) ** Draw Games *
MO.O 10,000,000 2,305 6 4 10
X=0.2 10,000,000 2,318 9 1 10
X=0.3 10,000,000 2,316 10 2 8
X=0.5 10,000,000 2,324 5 5 10
X=0.6 10,000,000 2,327 6 2 12
X=0.5 (computer X) 1,500,000 1,489 0 0 20
X=0.5 (tie-0.1) 10,000,000 2,418 6 2 12
Table 6.1, TD(X) vs. Human Opponent
As shown in table 6.1, the resultsthough not remarkablewere respectable for a
table based look-ahead method. The best results for TD(X) as player O were achieved with
/V=0.5 where the computer and the human opponent each won 5 games while ending in a
draw 10 games; therefore, the match ended in a draw. The best overall results were achieved
with TD(/l) playing as the X player. With only 15% of the training iterations and 64% of the
number of state estimations as compared to the O player, X=0.5 case, it was able to prevent
the human opponent from winning and forcing a tie every time.
After playing hundreds of games against the varying trained sets, some basic
patterns were seen that the TD(X) player had developed. Firstly, TD(X) did learn the basic
strategy of Tic-Tac-Toe. It was able to consistently avoid giving up the power position to the
opponent which made the game more difficult for the opponent to win. Secondly, it

ultimately learned only a defensive game rather than a winning game. TD(A.) was able to play
a fairly decent game; however, it was not a truly formidable opponent.
TD(A.) did learn the very basic strategy of Tic-Tac-Toe (arguably, the only strategic
move in Tic-Tac-Toe): when playing the O position, if X takes any of the corners, O must take
the center position or O will lose. In every single game in which the human opponent was X,
if X chose a corner position, TD(^.) would take the center position to prevent X from winning
the game outright. In the cases in which TD(A.) played X, it always chose a corner position to
begin with, forcing O into the defensive strategy of taking the center position. Interestingly,
TD(A.) would never begin the game with a weak position (one of the non-corner border
positions). Because these positions offer no piece advantage (unless used for the win or a
block), TD(^.) developed a strategy to select the more important positions first.
Unfortunately, TD(A.) became primarily focused upon preventing the opponent from
winning (even with the cases of a draw being rewarded with 0 or -0.1). Once this realization
was made, it became much easier to beat the computer because winning positions for the
computer could be left open.

Figure 6.8, Board Position with O's turn to move. (O would choose a. over
the winning position b.)
As shown in figure 6.8, if TD(X) (playing as O) had the opportunity to choose a
winning move or a blocking move, it always elected the blocking move. This allows a human
opponent to quickly develop a strategy of always placing pieces offensively to win and not
concentrate upon preventing a computer win. In the cases in which the computer was able to
both block and setup a winning position, the computer would always choose the winning
position. So, if the computer placed a piece to block and in doing so had two potential
winning moves remaining (always a winning position against any opponent), the computer
would win upon its next move.
In reviewing this problem, I found the major shortcoming of the basic TD(?l) table-
based implementation: the program must have an estimate for every single state in order to
be effective. In every single case investigated in which the computer chose a defensive move
over the winning move, the winning move's board estimation was unknown. Unfortunately,
even after 10 million iterations, there were still states missing from its estimation. Therefore,
for a more comprehensive state space estimation, more exploration is needed. By increasing
the random probability of an exploratory move did not assist with this problem.

Additionally, increasing the number of training iterations from one and one-half million to
ten million made little difference. In fact, the problem that it may be more inherent with the
random player. Essentially, the TD(?i) table-based algorithm can only estimate those states
which it sees. In the case of a random player, with so many combinations, it is unlikely that
the random player would show the TD(A.) player the "wiser" choices to be made.
To attempt exploration of all states, the results of the TD(A.) as the X player were used
to train a new TD(X) O player. The results for the O player were not promising. After one
and one-half million iterations, X won 50%, O won 7%, and the game ended in a draw 43% of
the time. In playing a game against this trained player, the human opponent won all 20 of
the games. As a final experiment, TD(^.) as the X player was used to further tune the states
learned from the O player, a=0.05, A.=0.5, tie=0.0 case. This case was trained for one and one-
half million iterations. The results were worse than the previous experiment. The already
trained O player could best the trained X player only 7.1% of the time, while losing 49.9% of
the time. In fact, in pitting this previously decent opponent against the human opponent, the
computer lost all 20 games.
Upon studying the resulting state estimations of those O players dominated by the
trained TD(A.) X player it was found the O player wasquite literallybeaten into
submission. It was so difficult for it to win a game initially that the only states it was able to
update and accurately estimate were losing states. Eventually every move it was making
was leading toward a "reward" of -1.0. This constant negative reinforcement pushed all
states toward a negative goal, thus, figuratively crushed its spirit.

6.3. Back-Propagation and Neural Networks
Neural Networks are a method of fitting a non-linear function to closely approximate
the problem domain. Neural Networks are composed of neurons that are connected in a
layered network. For a network, a minimum of two layers is required: an input layer which
is connected to the output layer. Every layer can have as many neurons as are required.
These neurons are then connected to neurons in the previous (if any) and next (if any) layers.
Each connection to other neurons is weighted. So, one neuron may have a stronger signal to
a target neuron than a second neuron may have to the same target neuron.
Output Layer
Hidden Layer
Input Layer
Figure 6.9, Typical Neural Network
As shown in figure 6.9, there is an input layer which "feeds" its activations to the
hidden layer wrhich, in turn, "feeds" activations to the output layer. On both the input and
hidden layers, a bias is added which is a fixed neuron acting as a constant for the equation
being approximated. Through back-propagation, a network can be adjusted so that it better
approximates the result that is to be achieved. Back-propagation is a gradient-descent
method that gradually lessens the output error. For instance, the error could be immediately
calculated (if the expected results are known) and the weights adjusted accordingly;

however, to keep the network from becoming unstable, the error is gradually lessened. Once
the error is known, the weights can be adjusted in the direction of the gradient from the
output layer back down to the input layer, hence the name, back-propagation.
Typically, back-propagation is employed in supervised learning mode. During
supervised learning, fixed input is fed to the network where the expected output is known.
The expected output is then compared to the actual output of the network. The error is
computed between the two values, and this error is propagated back through the network.
This process continues until the network has correctly fit its function to generate the correct
results. The hope, oftentimes, is that by training the network on many different inputs, it will
eventually develop a generic function for most input casesan example of this is
To utilize a neural network, one typically performs three tasks:
Feed the network the input data and retrieve the resulting output value(s). The
output is the result of the non-linear function to which our neural network is
If training the neural network, compute the error between the expected and actual
Propagate the error back through the network, allowing the neurons to adjust their
This process is repeated until the network no longer needs to be trained. Typically, a
training session lasts for tens or hundreds of thousands of iterations until the network is
sufficiently trained to generate the correct output for the given set of training input.

The first step is feed-forward. The input layer neurons are excited to some activation
by the initial data. From these new values contained in the input layer, we feed the data
forward to the hidden layer. The hidden layer's excited neurons can then feed their data
forward to the output layer. The activations of neurons on one layer are dependent upon the
weights and the activations of the previous layer. As we can see in equation 6.1, an activation
is a function of the summation of all weighted neuron activations connected to it from the
previous layer.
"(/) = /

/M =
1 + e~x
f{x)=. £ = /(*)['-/(*)] (6-3)
The term, a" (t), is defined to be the activation of neuron i on layer n at time, t.
Conversely, a)-" is defined to be the weight between neuron i on layer m to neuron j on layer
n. Equation 6.2 is the standard sigmoidal function for generating a neural activation between
0 and 1. Equation 6.3 is the derivative of equation 6.2 that will be used during back-

The feed forward step is very fast to compute, and, in fact, once a network has been
trained, it can calculate advanced non-linear functions very rapidly. During training, after
the activations of the output layer are computed, they are compared to the expected output to
compute the error as shown in equation 6.4.
Where desired is the vector of expected outputs at time t, and O is the vector of
neurons on the output layer. To update the weights, the output layer errors are propagated
to the previous layer. Once these weights are updated, the error is propagated back to the
weights from the previous layer, and, so forth, until the input layer is reached. Equation 6.5
is the weight update method.
The gradient of the weights with respect to the error must be found to solve equation

6.5. The learning rate is defined as a. Substituting the chain rule into equation 6.5 results in:
<"(' + 0 = <"(')-
dSk(t) da>r('K(t)
a<"(tK"(/) 3<"(0
Which reduces to:

The back-propagation process computes a"(t) as shown in 6.9.
where n = 0
The general back-propagation of neural networks has become quite popular for
many tasks that do not fit a standard, linear function or tasks that require constant
modification of the parameters such as visual recognition, speech recognition, and others
(including the aforementioned Neurogammon).
6.4. Temporal Difference and Neural Networks
The temporal difference method for updating states can also be used to help solve a
gradient descent method as shown in section 4.6. The TD(A.) method for neural networks;
however, is slightly different from the standard back-propagation method. In the standard
back-propagation, an error is propagated back to every neuron, which, in turn, adjusts its
weights for every other neuron signal that contributed to that error. In the case of utilizing
TD()i), the back-propagation is used to generate the eligibility of each neuron. After each
time step, the temporal difference is computed and adjusts each neuron based upon their
eligibility to the error. These eligibilities determine which neuron weights contribute to the
current temporal difference, allowing the properly assigned update.

The TD(X) gradient descent method follows the same form as with the standard
neural network; however, to correspond with the method, V/ (t) is used to represent the
activation for neuron i on layer m at time t, and 0''"' it) represents the weight between neuron
/ on layer m to neuron / on layer n at time t. The feed-forward step as shown in equation 6.10
is identical to equation 6.1 with the new variable definitions.

where, / and f are defined by equations 6.2 and 6.3.
Substituting into equation 4.5, the temporal difference error is computed by equation
8?() = n(t) + yv?(t + \)-vf>(t) (6.11)
Where /; (?) is the reward, and yis the discount parameter. The weights can be
updated based upon equation 6.12, which is adapted from the temporal difference gradient
descent method, equation 4.15.
0""'(/+l) = 0'(r) + a(5f)^'(O where k eO (6.12)
Where e"^(t)is the eligibility of the connection between neuron i on layer m to
neuron / on layer n to the output layer, unit k. Therefore, a change to an output unit affects
all other neurons through their "eligibility" to that output node. The eligibility is determined
by equation 4.16. Substituting for the gradient results in equation 6.13.

Where o'lj{t) is computed with the recursive back-propagation method as defined by
equation 6.14:
Utilizing the gradient descent method as defined in section 4.6, combined with the
standard back-propagation neural-network, a neural-network can employ real-time,
its own during the tasks it is set up to solve.
6.5. TD(Neural-Net) Implementation
To implement solving Tic-Tac-Toe with the temporal difference based
backpropagation approach, a neural network player was added to the same code base as
described in section 6.1. This gives the software four potential players to be tested and
compared: human, random, TD(A.), and TD(Neural-Net). The TD(Neural-Net) player utilizes
the neural network process described in section 6.4. The code is based upon Sutton's pseudo-
code for a temporal based neural network. The network is a three-layer network with a
single hidden layer. The hidden layer size was set to eight neurons for each of the results.
Both the input and output layers varied based upon each experiment. Additionally, the
rewards were varied for each experiment.
if m = n
if m eO & m n
reinforcement temporal difference learning. This alleviates the need for supervised learning
(one of the major drawbacks to useful neural networks) allowing the network to "learn" on

The network was implemented using the standard function provided by Sutton
[Sutton, 1992], The first step, feed forward, is used to generate an estimate of the current
state using the neural network. Using equation 6.10, Sutton created a subroutine (as shown
in code 6.2) which will feed the current input layer through the network to calculate the
current output layer.
sub feed forward
value := = 0
for j in hidden layer
for i in input layer
value := value + 0[i][j] *V[i]
end for
V [ j ] := f(value)
end for
value := = 0
for k in ouput layer
for j in hidden layer
value := value + 0[j] [k] * 1 1.
end for
V [ k] := f(value)
end for
Code 6.2, Subroutine for Computing the Output of the Network
As shown in code 6.2, each neuron's weighted connection to the layer below is
summed and computed with the sigmoidal function as defined by equation 6.2. After
feeding the input layer through, the output layer can be used as the estimation for
determining the next move. After a move is made the current state must be compared with
the previous state estimated by the previous move. In addition to applying any reward
received for the last move, the temporal error from the previous estimation to the current
estimation must be computed using equation 6.11. This error is utilized to adjust the weights
such that the estimation of the last move and the estimation of the current move both follow a
continuous function.

sub update_weights
for k in output_layer
error[k] := r[k] + y*V[k](t+1) V[k](t)
for j in hidden_layer
6>[j][k] :=0[j][k] + /)*error[k]*e[j] [k]
for i in input_layer
0[i][j] := 0[i][j] + terror [k] *e [i] [ j ] [k]
end for
end for
end for
Code 6.3, Subroutine for Computing the Change in Weights of the
Code 6.3 implements equation 6.12 to update the weights that connect every neuron
in the network. In code 6.3, a and P are learning rate parameters which scale the error to follow
a slower slope of the gradient-descent. The weight is changed slightly from its previous
value by how much it contributed to the current error. How the weight contributed to the
current error is determined by the eligibility. The eligibilities define how each neuron is
connected to the output layer and how heavily that neuron's activation contributed to the
error. The eligibilities are updated after each move as shown in code 6.4.
sub update eligibil ity
for j in hidden layer
for k in output layer
e[j][k] := y*A*e[j 1 1 k] + f' (V [k] ) * v [ j ]
for i in input layer
e [i] [jllk] := Y*X*e [ i 1 [j][k] +
f' (V[k])*0[j][k]* f' (V [ j ] )*V[i]
end for
end for
end for
Code 6.4, Subroutine for Backpropagation of Eligibility Values
As shown in code 6.4, the eligibility decays with the trace decay parameter, X. This
allows move contributions to decay over time (e.g., a move made at the initial point of the
game may be less crucial to a win or loss than a move made at the end). The eligibility is

changed slightly based upon is previous eligibility (scaled by the trace decay) and the change
computed by the back-propagation procedure for solving the gradient descent. Equation 6.3
is employed to compute the derivative of our sigmoidal function. Because the activation (the
result of equation 6.2) is known, the second form of the derivative using the values known
for /(x) is used.
There were three different networks developed to test how well TD(Neural-Net)
could learn the game of Tic-Tac-Toe. The first was based directly upon the initial work of
Tesauro with TD-Gammon. The second, still based upon the concepts of Tesauro's work was
modified to better estimate the outcome of a Tic-Tac-Toe game. The third and final, was
based upon TD(Neural-Net), but deviated with a different sigmoidal function and neural
network computation result. Each network was tested with a constant learning-rate parameter
of ^.=0.5. Additionally, each network used a hidden layer size of eight neurons. Each
network was trained against itself using varying rewards and lastly trained against a
"learned" player from the TD(A.)-based table implementation in section 6.2.
The first experiment built is very similar to Tesauro's TD-Gammon. Two neurons
represent each position on the Tic-Tac-Toe board. The first neuron is set to one if an X
occupies that position or the second neuron is set to one if it is occupied by an O piece;
otherwise, both are set to 0. This gives eighteen neurons with odd numbered neurons
representing X and even representing O. The final two neurons tell the network which
player has the current turn. The first neuron is set to one for X or the second neuron is set to
one for O. The output layer consists of a single neuron which represented the probability of
victory for the given state and player. It is hoped that the neural network would learn to be
an accurate state estimator for any given Tic-Tac-Toe state. By continually feeding the game
state, the network should return an estimation of the probability of victory. This allows the

player to use the estimator to determine the better move. The rewards for this experiment
were first tried as 0 for a loss, 0.5 for a draw, and 1 for a win. In the second trial, the rewards
were changed to -1 for a loss, 0 for a draw, and 1 for a win.
The second neural experiment is based upon the first experiment with one change to
the output layer. The output layer was increased to two neurons to estimate the likelihood of
a win and the likelihood of a draw game. The input vector was created using the same
process as described in the first neural network. The estimator is then used differently by the
player. The player would feed forward a possible state and compare the likelihood of a
victory vs. a draw game. The player would choose the outcome that had the greatest
possibility of occurring. The rewards were computed differently in that there were two
output neurons. If the outcome were a draw game, the neuron estimating a draw would be
rewarded with a 1, and the neuron estimating a win would be rewarded with a 0. For a
winning outcome, the rewards were reversed. In the case of a loss, both neurons were given
a reward of a 0 or a -1 depending upon the test trial for obtaining results.
The final neural experiment was still based upon the TDfNeural Net); however, its
function was slightly changed. Instead of being an estimator for the probability of a win, it
was altered to give an estimation of the expected reward (similar to the method as studied in
Section 6.2). There was a single output neuron that returned a result from -1 to 1 as the
estimation of the final reward to be given. In order to achieve a result from -1 to 1, our
sigmoidal function (equation 6.2) must be changed from its standard 0 to 1 form. Using
equation 6.15, our neuron's activation will range from -1 to 1.
= (6.15)
1 + e

With the derivative being:
f(-v) = -^-T = 2/(41-/(.,)] (6.16)
(l + ,')
Using equations 6.15 and 6.16 in the feed-forward and back propagation
computations allows the network to vary from -1 to 1. The rewards were used exactly as in
Section 6.2. In the case of a win, the player was given a reward of 1 and in the case of a loss, a
-1 was rewarded. For a draw game, no reward was given.
6.6. Results of TD(Neural-Network)
The procedure to train the neural network is somewhat different than that of the
TD(A) procedure in section 6.1. The neural network requires a repetition of the states in order
to begin to develop an estimation for that state. For this reason, the neural network cannot be
trained against a random player. The neural network needs a constant stream of games that
are played similarly for it to recognize and adapt its weights. The best source of endless
games in such a manner is against itself. For all three of the neural network experiments
described in section 6.5, the neural network was primarily trained against itself to achieve the
results. In all cases, a test was also run to see how well the network would train against a
known, average player. The source for average player is the TD(X) player using results from
section 6.2.
The first neural network is a direct implementation of the system Tesauro created for
TD-Gammon adapted to the game of Tic-Tac-Toe. With separate input neurons representing
each player and an output neuron intended to return the probability of victory for the given
player, it was trained against itself for 500,000 iterations to test how well it had progressed.

Win Percentage (TDNN vl.O)
90 L
70 -
j 60 -
50 -
40 -
i 30

10 ./...... .......................----------------------------------------------------
0 l ..............
1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97
Iterations (in Thousands)
Figure 6.10, Percentage of Victories for TDNN vl.O
As shown in figure 6.10, the neural network did not seem to improve either of its
players for the first 100,000 iterations (which are consistent for all iterations). This result was
somewhat surprising as it would be assumed that as the neural network becomes a more
accurate estimator and each player is utilizing that estimator, the number of tie games would
rise dramatically approaching the vast majority of game outcomes. This did not occur in this
case. In fact, closer inspection of the neural network at this time shows that the output of the
network has asymptotically approached one. Every input fed through the network results in
an output activation very close to one.
Inspection of the process used for TDNN vl.O reveals that the sigmoidal function
returns a result of 0.5 when the input is 0. Therefore, the initial activation is 0.5. Upon each
iteration, more neurons are excited, which increases the activation, pushing it further toward
one on the sigmoidal. The temporal error in that case is positive which forces the network to
try and increase the positive value of each activation. More excited neurons are fed forward,

exacerbating the issue. Because there are no negative rewards given to the network, there is
never a negative temporal error and it therefore is always approaching one.
To try to alleviate this situation, TDNN vl.l was altered to give different rewards. A
reward of 1 was given to the network for a winning game, while a "correction" reward of -1
was given to the network for a losing game. In all other cases, a reward of 0 was given.
TDNN vl.l was then trained against itself for 500,000 games to see if it could return better
Win Percentage (TDNN vl.l)
Iterations (in Thousands)
Figure 6.11, Percentage of Victories for TDNN vl.l
As shown in figure 6.11, TDNN vl.l did not fare any better than TDNN vl.0. The
results of TDNN vl.l continued the trend in which the estimator cannot properly compute
the state for a given board of Tic-Tac-Toe. The probability of winning function was not
anywhere near approximating a valid probability for the given state. There appeared to be
no learning of the Tic-Tac-Toe function (if one exists) at all.

The possibility existed that the network had fallen into an extreme local minimum
and could not work its way out; thus, it was simply playing the same game (or few) over and
over. I decided to attempt training the network against a player who had already learned the
game. 1 used the a=0.05, X=0.5 case from section 6.2. This player had faired well in the O
position, so I trained the neural network playing the X position against the TD(A.) player.
Win Percentage (TDNN vl.l vs TD(k))
Iterations (in Thousands)
Figure 6.12, Percentage of Victories for TDNN vl.l vs. TD(^.)
As shown in figure 6.12, the neural network did not perform well against a proper
trainer. In fact, the network only won about 10% of the games, with another 10% ending as a
draw, and the remaining 80% won by the inferior positioned O player. This proved to me
that the network was not learning the game as presented to it.
Because of the failed attempts with TDNN vl.O and vl.l, a change was made to the
algorithm to attempt to better represent the game of Tic-Tac-Toe with the neural network. In
the case of TD-Gammon, Tesauro's initial version contained a single ouput neuron to

estimate the probability of a win. On later iterations, he added more neurons to represent the
various outcomes of backgammon (victory, gammon, or backgammon). For TDNN v2.0,1
added a second output neuron to represent the probability of a draw game. So, for TDNN
v2.0, there were two output neurons, each representing the probability of occurrence. The
first represented the probability of victory, the second, probability of a draw game.
After testing TDNN v2.0, its results were identical to TDNN vl.O. In fact, upon
closer inspection of the output layer after complete training, the neuron representing the
probability of a victory always returned a value near zero while the neuron representing a
draw game returned a value near one. No matter the input, both neurons would return
activations very close to these values. For similar reasons as the change from TDNN vl.O to
TDNN vl.l, I altered the rewards for the output neurons. If a victory was achieved, the
neuron for victory was rewarded with a 1 while the draw game was given a -1. If a draw
game occurred, the values were reversed. In the case of a loss, both neurons were given a
reward of-1.

Win Percentage (TDNN v2.1)
Iterations (in Thousands)
Figure 6.13, Percentage of Victories for TDNN v2.0
As shown in figure 6.13, TDNN v2.1 seems no different than TDNN vl.l. In fact, the
results from TDNN v2.1 were no better than TDNN vl.l. The output layer seemed to have
little effect on how the network responded to learning games presented to it. Again, a test
was run to see if the network would perform better if trained against an average quality
opponent. In tests versus the TD(A.) opponent, it was able to better play the opponent as it
could play to a draw game as often as allowing the opponent to win the game as shown in
figure 6.14.

Win Percentage (TDNN v2.1 vs TD(A.))
Iteration (in Thousands)
Figure 6.14, Percentage of Victories for TDNN v2.1 vs. TD(A.)
Because the results for TDNN vl and v2 were not promising, a test was conducted to
see how well each would perform against a human opponent. I played against each trained
network for twenty games. The results of the tests against humans, based upon earlier
results, were not surprising. As shown in table 6.2, only the player who fared well against
the TD(>.) player was able to play a game that was not dominated.

Number of 4s
Training Iterations
4s Games
ins # won by
* Games won
by TD(A.)S
Draw Games
Ml -W- " W %
vl.O 500,000 20 0 0
vl.l 500,000 20 0 0
vl.l trained against TD(^.) 500,000 20 0 0
v2.0 500,000 20 0 0
v2.1 * 500,000 20 0 0
V2.1 trained against TD(A.) 500,000 13 2 5
Table 6.2, TDNN vs. Human Opponent
Players marked with an asterisk (*) did not even follow the basic rule of Tic-Tac-Toe:
as the O player, if X does not choose the centerpiece, O must take the center. Every single
player seemed to follow the same pattern. It would always choose the same positions (if
available) in the same order no matter what the opponent had chosen. These seemed to
suggest either the computer was not learning anything at all or that the training games rarely
were consistent (or possibly too consistent) to train the network to recognize a pattern.
I decided to break somewhat from the basic method outlined as the TD(X.) gradient
descent method with neural networks. I changed the sigmoidal function to return a value
from -1 to 1 so that the reward estimation could be accurately predicted as was performed
with the TD(A.) player in section 6.2. At first, the input layer was altered for TDNN v3.0. The
input vector was comprised of 10 neurons. A single neuron represented each board position
and the final neuron represented which player had the current turn. A -1 represented the O
player while the X player was represented with a 1. If no player occupied the space, it was
left as 0. This input vector, however, proved to be flawed because when training against

itself, the same board position would be presented (with either a positive or negative
activation) for each player with a different reward. The network would try to correct its error
from a 1 reward, then, attempt to correct the error from a -1 reward resulting in very large
offsetting weights which quickly became too large for a 32-bit processor to handle.
Therefore, TDNN v3.0 was returned to using the standard input vector as used for every
other version.
As with previous versions, TDNN v3.0 was trained against itself for 500,000
iterations. The results were more consistent with expectation in that the number of ties was
more recurring than in previous attempts. The X player still held the major advantage and
won most of the games; however, the number of draw games steadily increased as shown in
figure 6.15.
Win Percentage (TDNN v2.0)
7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97
Iterations (in Thousands)
Figure 6.15, Percentage of Victories for TDNN v3.0

Based on how well it trained against itself, the hope was that it would fair well
against a trained opponent. Playing TDNN v3.0 against the TD(^) player would test how
well the player had learned the game. As figure 6.16 shows, however, the results were much
poorer than expected. In fact, the TDNN v3.0 player did not play as well against a trained
opponent as the earlier TDNN v2.1 had.
Win Percentage (TDNN v2.0 vs TD(1.))
Iterations (in Thousands)
Figure 6.16, Percentage of Victories for TDNN v3.0 vs TD(A.)
Because TDNN v3.0 was based on a concept more similar to that of TD(A.) with a
reward estimator for a given state, it was hoped that it would learn to compute (or estimate)
a given state as opposed to storing a value such as the TD(1) player. Unfortunately, it was not
able to learn the function. Even after 1.5 million iterations, it played a rather poor game. As
table 6.3 shows, the various TDNN v3.0 networks were not able to perform at a level at which
it could compete with a human.

Trainer Number of Training Iterations Games won by Human Games won by; / Draw Games
Itself 500,000 20 0 0
TD(A.) 500,000 20 0 0
Random 1,500,000 20 0 0
Table 6.3, TDNN v3.0 vs. Human Opponent
Inspection of the reward estimations for TDNN v3.0 revealed a similar problem to
what had occurred with TDNN vl.O. All states fed forward through the input resulted in an
estimation that approached 1. One primary difficulty with a neural implementation of Tic-
Tac-Toe is that for a given input and a given output, one minor change in the input (changing
the user who is currently at play) must force the network to "invert" the estimation. In other
words, for a given board state for Player X, the estimation for a reward may be 0.78; however,
feed the same state through the estimation for Player O, and the reward estimation would be
roughly -0.78 because the likelihood for Player O is a loss.
6.7. Tic-Tac-Toe Conclusions
In the Tic-Tac-Toe problem domain, there were two distinctly different results in the
use of the concepts of temporal difference. In the use of TD(X) as a linear, table-based lookup
method, the results were extremely positive, particularly, as player X in which the human
opponent could not beat it. The second use of TD(A.) in the non-linear, neural network case
did not achieve results which would be construed as "learning." Although, this was the case,
it does not cast doubt upon the algorithm, more it requires thought as to the application.
Temporal difference methods as described by Sutton [Sutton, 1998] were primarily
focused upon linearly independent input patterns. In the case of utilizing TD(X) as a linear

predictor, it worked quite well. In the case of utilizing TD(A.) with a non-linear neural
network, the results were very poor. Possibly this speaks more toward the simplistic game of
Tic-Tac-Toe. Tic-Tac-Toe is a very basic Markovian process (a state depends only upon the
current state to reach the following) without non-linear aspects such as strategy, middle-
games, etc. The application of neural networks to real-world problems requires selection of a
proper problem, and the game of Tic-Tac-Toe may very well be a case in which the network
did not fit the problem. In many cases, a problem that combines prediction and control
would not converge with the TD(^.) algorithm. It can "get stuck in a self-consistent but non-
optimal predictor/controller" [Tesauro, 1992],
One of the most important considerations in the implementation of a multi-layered
neural network is that of data representation. A method with which to express the current
state and extract meaningful results from the network as related to that state is different for
every problem. If expressed incorrectly, the network may not have enough (or too many
minor) data points to accurately understand the state or its outcome. Tesauro outlines two
classifications [Tesauro, 1992]: a) lookup table in which there enough parameters such that the
network may store the correct output for every state, and b) compact representation where the
parameters are far fewer than the states and the network must derive the underlying task. In
the second case, which is what was attempted with TD(Neural-Network) for Tic-Tac-Toe, the
representation is all important. In fact, TD(A.) was described for the first case in which all
states could be stored separately, and may not be suitable in standard form for the case in
which it must compactly represent the model. Even with the compact model, the gradient-
descent method is only capable of finding local optima for each state, not a global-state
independent optimum.

Although the TD(Neural-Network) was not successful at implementing the game of
Tic-Tac-Toe, the TD(^.) player did adapt quite well and was able to play a very reasonable
game of Tic-Tac-Toe without an a priori knowledge of the underlying game. In the linear
case, it was shown to be a very successful algorithm. In fact, a comparison between the TD(A.)
implementation of Tic-Tac-Toe and of TD-Gammon reveals that an extremely complicated,
non-linear function such as BackGammon performed extremely well utilizing TD(Neural-
Network). It could, potentially, be implemented utilizing the TD(A.) linear based algorithm;
however, with the number of possible states exceeding even that of Chess, it would be
impractical and require years to train.
The reverse is true for TD(Neural-Network) applied to Tic-Tac-Toe. Tic-Tac-Toe has
approximately 3,000 viable game states without a complicated game structure. It shows itself
a very solvable problem with the linear TD(^.) player thus practically proving that it is too
simple a problem to be solved with the multi-layered neural network. The problem domain
being linear makes it difficult for the non-linear network to fit its hyper-plane to. The
addition complexity of properly assigning all parameters within the neural network (trace
decay, discount, learning rate, and others) make it very difficult to properly fit a function to the
game of Tic-Tac-Toe.
The temporal difference methods have had many successful applications in recent
years including robotic control, automated task scheduling, playing games such as
BackGammon and Go at world-class levels, and channel allocation for cellular radio-based
traffic. The concepts introduced with reinforcement learning and applied via the temporal
difference methods have applications in most problem spaces; however, as was shown with
the two examples exhibited, one must properly choose the temporal difference method for
their particular application.

7. Conclusions
Mankind has tried to understand the learning process since the earliest days of
history. To understand how we learn is to understand, possibly, what most gives us our
most human of attributes: the ability of abstraction. People have dreamt of building
machines that could think independently since the days of Pascal and later Babbage. To
achieve this goal, researchers have closely followed the advancements of biologic learning
research. De Morgan eventually applied this research in developing the "laws of thought and
thus took the first step towards AI software" [Hofstadter, 1979].
As machine learning techniques progress, those which require the least supervision
and learn on their own are the methods that are typically seen as holding the most promise.
Samuel's method of searching the tree space for all possible moves in conjunction with an
adaptive state estimator is viewed as an historical revolution in artificial intelligence. Now,
with faster, cheaper, and larger capacity hardware, programs such as Deep Blue can beat the
top chess player in the world utilizing look-ahead algorithms. When humans play games,
they employ basic look-ahead algorithms to formulate strategy; however, this is where we
differ from the machine. A human may abstract, formulate, and strategize whereas the
machine is merely performing a look ahead to find the optimal board position for itself. How
similar is a look-ahead of board estimations to a human playing the game? Quite similar as a
matter of fact. The primary difference, however, is in the state estimation being performed.
Although we do evaluate state estimations, it tends to be highly abstracted (oftentimes,
humans will sacrifice valuable pieces as a way to "trick" our opponent). This is where TD-
Gammon excelled.

TD-Gammon introduced the first major successful application of the concept of
reinforcement learning. By watching the game and comparing where you are with where
you thought you would be in the past allows us to develop a method to learn independently
of a teacher or supervisor. TD-Gammon was able to learn the game of backgammon in such
a way that it could accurately predict superior board positions (even if it may have appeared
inferior). In fact, although the estimated probability value by TD-Gammon may have been
incorrect, every estimation was consistent with the other. This parallels human state
estimation. Every person estimates their current states differently than another; however,
people are consistent with themselves. Despite this error in probabilistic calculation, TD-
Gammon, according to top players in the world, was unusually cunning in its move
selections, often seemingly attempting to lull its opponent into a trap.
Although the success of TD-Gammon cannot directly translate to other problem
domains, such as Tic-Tac-Toe, the basic principles of reinforcement learning, developed by
Sutton and employed by Tesauro, have proven to be an important new direction for the
research of machine learning. As with all other aspects of artificial intelligence, ideas are
"borrowed" from many different disciplines and adapted to fit machine learning model.
Temporal difference with its roots in dynamic control systems has given us a way to learn
on-line; to learn by doingby exploring. It is this explorationinquisitivenesswhich
allows humans to broaden their knowledge of the world around them. It is what makes the
temporal difference method rather unique in the field. Samuel first realized the importance
of exploration in his work before the concept began to again attract attention in the late
1980's. Independent exploration is an extremely important aspect in biological learning.
Curiosity may have killed the cat, but it certainly learned about its environment along the