The implementation for section two was pretty straight forward. We had to implement the value itteration aspect, and the value computation aspect
This was the most complicated aspect of project two. The core of it is defined by
Q(s,a) = R(s,a) +gV(d(s,a))
This says, that the QValue for action on every square on the map is defined by
the sum of the reward value for performing that action on that square, plus the
value of going to that square in the first place.
So to start with, we use a standard itterative for loop to access all the squares
on the map.
Then we go ahead and for each square, we populate the appropiate value for that
action on that square. Of course, we also set for the fringe cases where the
action would cause us to move off of the map. For these cases, we set the Q value
Then, we just follow this standard:
Q[x][y][i] = R[x][y][i] * .25 + V[x1][y+(i-1)] * .5
of course, this is modified to be generic. A slight variation in wich the 3
possible moves are hardcoded are used in the code.
I chose the values of .25 as the weight for the reward and .5 for the weight for
the value because they work, and they meet the rule where the sum of the weights
needs to be less than 1.
This portion is trivial. We compare all the possible actions in Q for a square,
and set the value to be the largest one. So just a max() function.
How Value Itteration Works:
Value itteration works itteritively… Every time the function is called to learn, it
goes ahead and has each square recalculate its q_value.
Each square, in turn, looks ahead only ~one~ move, and sets its q_value based upon
the reward of going into the target square, plus the biggest residual q_value from
prior learning calls.
This allows the q_values to grow over time, and eventually point the way to the
greatest reward. The q_values also make sense because the q_values are the value
function more or less. The value function just represents the biggest q_value.
So basically, Q_values provide for a giant, creeping map of values that give a weight
to a square based upon the optimal path from all the squares it can get to to a
My implementation of Q_learning closely resembles the prior sections q_learning,
except that it only performs Q_learning on the last two nodes visited, rather than
updating the whole grid with every learning section.
And it can be summed up in about one line of code:
Q[s.x][s.y][a] = Q[t.x][t.y][a]*(1-alpha-.5) + alpha *(reward + V[t.x][t.y] * .5)
So we set the Q value of the square we came from to be equal to sum of the Q value of
the square we just entered with respect to direction plus the greatest Q_value in the
square we just came from, and we add the reward value to that for good mesure.
This way, if we enter trash, the Q_Value of the square we came from goes down.
If we enter a good square, the Q_Value goes up.
Then we recycle the calculate_values from the last section (Mostly because I didnt
feel like writing it again, and it has no functional effect on how the code operates)
and things start to work
The downside is, this type of Q_Learning is slow, and takes many, many, many
It behaves very similarly to Value itteration, bieng a subset of the problem where
we only focus on the relationship of two squares, rather then the relationship of
every square on the map with its neighbors.
Exploration is useful as a way to speed up Q_Learning, and, if it somehow gets
itself into a working path, to get it out of the path it is stuck in.
How I set my values, however, tends to have the agent doing exploration on its
own, without needing me to specify the need for exploration. This doesnt mean
that I couldnt turn on exploration to speed things up, or to hit points it isnt
focusing on. It just means that for most cases, I dont have to.