## Thursday, August 17, 2017

### Q learning - and its core flaws

A few days ago I gave a talk on reinforcement learning, with a focus on Q-learning, at Cookpad's Tokyo office: https://www.meetup.com/tokyo-machine-learning-kitchen/events/242060161/

The main slides for the talk are here: https://github.com/ashleysmart/mlgym/blob/master/qlearning/main_slides.html

I have been neglecting my blog lately, so I figured I would convert the slides into a post. Let's get started.

The Q function is an estimate of the system's potential value. It is assessed based on:
• The environment or state that the system is in
• The actions that can be taken from that state
• The rewards that can be acquired by performing the action
The Q function is the target function we are trying to learn, and it can be implemented as a neural network, a table, or some other standard machine learning regression system or function. The textbook formula is:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
Where:
• Q: the function that estimates the total 'value' of rewards
• Q': the new iteration of the 'value'
• s_t: the "state" of the environment at time 't'
• a_t: the "action" performed at time 't'
• r_t: the "reward" received for the action at 't'
• s_{t+1}: the "state" of the environment after the action at time 't'
• a: a possible action performed from state s_{t+1}
• \alpha: the learning rate, i.e. how quickly to adjust when wrong; this is limited to between 0 and 1
• \gamma: the discount rate, i.e. how important/trusted future rewards are; this is limited to between 0 and 1, and has an effect comparable to an EMA (exponential moving average)
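To make the formula concrete, here is a minimal sketch of the tabular update, assuming a small hypothetical environment with discrete states and actions (the sizes, `alpha`, and `gamma` are invented for illustration):

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9

Q = np.zeros((n_states, n_actions))

def q_update(Q, s_t, a_t, r_t, s_next):
    """Apply one step of the textbook Q-learning update."""
    target = r_t + gamma * np.max(Q[s_next])       # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s_t, a_t] += alpha * (target - Q[s_t, a_t])  # nudge Q(s_t, a_t) toward the target
    return Q

Q = q_update(Q, s_t=0, a_t=1, r_t=1.0, s_next=2)
```

Since Q starts at all zeros, the first update moves Q(0, 1) from 0 toward a target of 1.0 by a factor of alpha, landing at 0.1.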

Of course everyone understood that... If we manipulate the formula a bit, how it works becomes much clearer. Starting with the textbook form:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
We notice that Q(s_t, a_t) appears in several places, so we can group its terms together:
Q'(s_t, a_t)=(1-\alpha)Q(s_t, a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1}, a)\Big)
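The grouping is just algebra, but it can be verified numerically. The values below are arbitrary, chosen only to show that the textbook form and the grouped form agree:

```python
alpha, gamma = 0.5, 0.9
q_sa = 2.0        # Q(s_t, a_t)
r_t = 1.0
max_q_next = 3.0  # max_a Q(s_{t+1}, a)

# Textbook form: Q + alpha * (target - Q)
textbook = q_sa + alpha * (r_t + gamma * max_q_next - q_sa)
# Grouped form: (1 - alpha) * Q + alpha * target
grouped = (1 - alpha) * q_sa + alpha * (r_t + gamma * max_q_next)

assert abs(textbook - grouped) < 1e-12
```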
Then we can group the terms that do not involve Q(s_t, a_t) into a single target:
Q_{target}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
And finally we have something very clear:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{target}
As you can see, alpha acts as a ratio that blends the current value of Q with the target value of Q to produce the value for the next iteration. Also, given that Q-learning proportionally mixes two numbers to produce a third, when learning is complete and stable all three parts will match, i.e.:
Q_{target} \approx Q_{current} \approx Q_{new}
So the core of what it learns is:
Q_{final} \approx Q_{target} = r_t+\gamma\max_{a} Q(s_{t+1},a)
This is just a recursive formula, and computer science folks will often instantly associate it with dynamic programming and tree diagrams. OK, so now let's have a look at how this produces solutions to problems. Each circle represents a world state; the number on the left is the instant reward for that state, and the number on the right is the current Q value for that state.
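The recursive fixed point can be sketched with dynamic-programming-style sweeps on a tiny hypothetical chain of states (the states, rewards, and transitions here are invented for illustration):

```python
# Chain: state 0 -> 1 -> 2 (terminal), one action per state, rewards [0, 0, 10].
# Repeatedly applying Q(s) = r(s) + gamma * Q(next(s)) converges to the
# discounted sums of future rewards.
gamma = 0.9
rewards = [0.0, 0.0, 10.0]
nxt = {0: 1, 1: 2}  # state 2 is terminal, so it has no entry

Q = [0.0, 0.0, 0.0]
for _ in range(100):       # repeated sweeps, as in dynamic programming
    for s in (2, 1, 0):
        future = gamma * Q[nxt[s]] if s in nxt else 0.0
        Q[s] = rewards[s] + future

# Q converges to roughly [8.1, 9.0, 10.0]: each state's value is the
# reward at the goal, discounted once per step of distance.
```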
But there is of course a problem. Pay very close attention to the formulas:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{update}
Note that:
• The formula is iterative
• It is top-down
Also:
Q_{update}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
Note carefully the effect and scope of the “max”
• This is the *local* best, not the *global* best
• It is a heuristic known in computer science as greedy optimization.
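To see why the local `max` matters, consider a hypothetical two-branch tree (the numbers are invented for illustration): from the root, action A yields an immediate reward of 5 and then nothing, while action B yields 1 now and 10 afterwards. Early in training, before the deep reward has propagated back, the greedy step picks the locally better branch:

```python
gamma = 0.9

# Q estimates at the root after only shallow backups -- the deep reward
# behind action B has not propagated yet:
Q_root = {"A": 5.0, "B": 1.0}
greedy_choice = max(Q_root, key=Q_root.get)  # local best: "A"

# True discounted returns, if we could see the whole tree:
true_returns = {"A": 5.0, "B": 1.0 + gamma * 10.0}
global_best = max(true_returns, key=true_returns.get)  # global best: "B"
```

The greedy `max` is only as good as the current Q estimates, which is exactly why exploration and many iterations are needed before the local choice agrees with the global one.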
These are the key flaws in this algorithm. So what really happens is: