Implementation:

The implementation for section two was fairly straightforward. We had to implement the value-iteration aspect and the value-computation aspect.

Value Iteration:

This was the most complicated aspect of project two. The core of it is defined by

Q(s,a) = R(s,a) + γ·V(δ(s,a))

This says that the Q-value for an action on a square of the map is the reward for performing that action on that square, plus the discounted value of the square that action leads to.

To start, we use a standard iterative for loop to visit every square on the map.

Then, for each square, we populate the appropriate Q-value for each action on that square. We also handle the fringe cases where an action would move us off the map; for these, we set the Q-value to zero.

Then, we just follow this template:

Q[x][y][i] = R[x][y][i] * .25 + V[x+1][y+(i-1)] * .5

Of course, this is written generically here; the actual code uses a slight variation in which the three possible moves are hardcoded.

I chose .25 as the weight for the reward and .5 as the weight for the value because they work, and they satisfy the rule that the sum of the weights must be less than 1.
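The sweep described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the grid shapes, the three forward moves (to x+1 with y offsets -1, 0, +1 for i = 0..2, inferred from the index expression above), and the function name `sweep` are all assumptions.

```python
REWARD_WEIGHT = 0.25   # weight on the immediate reward, as chosen above
VALUE_WEIGHT = 0.5     # weight (discount) on the target square's value

def sweep(R, V, Q):
    """One value-iteration sweep: recompute Q for every square and action.

    R[x][y][i] is the reward for action i on square (x, y);
    V[x][y] is the current value of square (x, y);
    Q[x][y][i] is overwritten in place.
    """
    width, height = len(R), len(R[0])
    for x in range(width):
        for y in range(height):
            for i in range(3):              # three moves: y-1, y, y+1 at x+1
                tx, ty = x + 1, y + (i - 1)
                if tx >= width or ty < 0 or ty >= height:
                    Q[x][y][i] = 0.0        # fringe case: action leaves the map
                else:
                    Q[x][y][i] = (R[x][y][i] * REWARD_WEIGHT
                                  + V[tx][ty] * VALUE_WEIGHT)
    return Q
```

Off-map actions are zeroed exactly as described above, so squares on the far edge contribute nothing until values creep back toward them.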

Value Computation:

This portion is trivial: we compare all the possible actions in Q for a square and set the square's value to the largest one. It is just a max() function.
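In code, that max() step might look like the sketch below, assuming the same Q and V shapes as before (the function name `compute_values` is an assumption):

```python
def compute_values(Q, V):
    """Set each square's value to its best Q-value -- a plain max()."""
    for x in range(len(Q)):
        for y in range(len(Q[0])):
            V[x][y] = max(Q[x][y])   # best action's Q-value for this square
    return V
```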

How Value Iteration Works:

Value iteration works iteratively. Every time the learning function is called, each square recalculates its Q-values.

Each square, in turn, looks ahead only one move, and sets its Q-value based on the reward of entering the target square, plus the largest residual Q-value left over from prior learning calls.

This allows the Q-values to grow over time and eventually point the way to the greatest reward. The Q-values also make sense because they are, more or less, the value function: the value function just takes the largest Q-value for each square.

So, basically, the Q-values form a giant, creeping map of values that weight each square according to the optimal path from it to a reward square.

My implementation of Q-learning closely resembles the prior section's value iteration, except that it only performs the update on the last two squares visited, rather than updating the whole grid on every learning call.

And it can be summed up in about one line of code:

Q[s.x][s.y][a] = Q[s.x][s.y][a] * (1 - alpha) + alpha * (reward + V[t.x][t.y] * .5)

So we blend the old Q-value of the square we came from with the reward for the move plus the discounted value of the square we just entered, weighted by the learning rate alpha.

This way, if we enter trash, the Q-value of the square we came from goes down; if we enter a good square, it goes up.

Then we recycle calculate_values from the last section (mostly because I didn't feel like writing it again, and it has no functional effect on how the code operates), and things start to work.
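The two-square update above can be sketched like this. The learning rate of 0.1, the tuple representation of squares, and the function name `q_update` are assumptions for illustration; the 0.5 discount on the entered square's value comes from the one-liner above.

```python
ALPHA = 0.1    # learning rate (hypothetical value)
GAMMA = 0.5    # discount on the entered square's value, as in the one-liner

def q_update(Q, V, s, t, a, reward):
    """Update only the square we came from (s), based on the square we
    just entered (t) -- the two-node update described above."""
    sx, sy = s
    tx, ty = t
    Q[sx][sy][a] = (Q[sx][sy][a] * (1 - ALPHA)
                    + ALPHA * (reward + V[tx][ty] * GAMMA))
```

Because only one (square, action) pair changes per move, values spread outward from reward squares one step at a time, which is why this variant needs so many repetitions.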

The downside is that this type of Q-learning is slow and takes many, many, many repetitions.

It behaves very similarly to value iteration, being a subset of the problem where we only focus on the relationship between two squares, rather than the relationship of every square on the map with its neighbors.

Exploration:

Exploration is useful as a way to speed up Q-learning and, if the agent gets stuck in a suboptimal path, to break it out of that path.

The way I set my values, however, tends to have the agent doing exploration on its own, without needing me to force exploration explicitly. This doesn't mean that I couldn't turn on exploration to speed things up, or to hit squares it isn't focusing on. It just means that for most cases, I don't have to.
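If exploration were turned on, a common way to do it is an epsilon-greedy action choice. This is a generic sketch of that technique, not the project's agent; the exploration rate and function name are assumptions.

```python
import random

EPSILON = 0.1  # hypothetical exploration rate

def choose_action(q_values, epsilon=EPSILON):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the largest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With epsilon = 0 this reduces to the pure greedy policy described above; raising epsilon makes the agent occasionally visit squares it isn't focusing on.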