A few days ago I gave a talk on reinforcement learning, with a focus on Q-learning, at Cookpad's Tokyo office. https://www.meetup.com/tokyo-machine-learning-kitchen/events/242060161/
The main slides for the talk are here: https://github.com/ashleysmart/mlgym/blob/master/qlearning/main_slides.html
I have been neglecting my blog lately, so I figured I would convert the slides into a post. Let's get started.
The Q function is an estimate of the system's potential value. It is accessed based on:
- The environment or state that the system is in
- The actions that can be taken from that state
- The rewards that can be acquired by performing the action
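For example, in a simple grid world the state is the cell the agent currently occupies, the actions are the moves available from that cell, and the reward is a score granted for reaching a goal. The Q-learning update rule is: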
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
- Q : The function that guesses the total 'value' of rewards
- Q' : The new iteration of the 'value'
- s_t : The “state” of the environment at time 't'
- a_t : The “action” performed at time 't'
- r_t : The “reward” received for the action at 't'
- s_{t+1} : The “state” of the environment after the action at time 't'
- a : A possible action performed from state 's_{t+1}'
- \alpha : The learning rate, i.e. how quickly to adjust when wrong. This is limited to between 0 and 1.
- \gamma : The discount rate, i.e. how important/trusted future rewards are. This is limited to between 0 and 1, and has an effect that can be thought of as an EMA (exponential moving average).
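Written as code, the update is just a small nudge to one entry of a table. Here is a minimal sketch assuming a small, discrete environment; the sizes `n_states` and `n_actions` and the parameter values are placeholders, not from the talk:

```python
import numpy as np

# Hypothetical sizes for a small, discrete environment (placeholder values).
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.9   # learning rate and discount rate, both in [0, 1]

# Q-table: one value estimate per (state, action) pair, initialised to zero.
Q = np.zeros((n_states, n_actions))

def q_update(s_t, a_t, r_t, s_next):
    """Apply one Q-learning update for the transition (s_t, a_t, r_t, s_next)."""
    # Best value we currently believe is reachable from the next state.
    best_next = np.max(Q[s_next])
    # Temporal-difference error: (reward + discounted future value) - current guess.
    td_error = r_t + gamma * best_next - Q[s_t, a_t]
    # Nudge the current estimate a fraction alpha toward the target.
    Q[s_t, a_t] += alpha * td_error
```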
Returning to the update formula:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
Q(s_t, a_t) appears in several places, so we can group the terms together.
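Distributing \alpha and collecting the two Q(s_t, a_t) terms:
Q'(s_t, a_t)=Q(s_t,a_t)-\alpha Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)\Big)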
Q'(s_t, a_t)=(1-\alpha)Q(s_t, a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1}, a)\Big)
Naming the bracketed term the target:
Q_{target}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
the update can be read as a weighted blend of the current estimate and the target:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{target}
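As a quick made-up numeric example: suppose r_t = 1, \gamma = 0.9, \max_{a} Q(s_{t+1}, a) = 2, the current estimate Q_{current} = 0.5, and \alpha = 0.1. Then:
Q_{target} = 1 + 0.9 \times 2 = 2.8
Q_{new} = (1 - 0.1) \times 0.5 + 0.1 \times 2.8 = 0.45 + 0.28 = 0.73
The estimate moves a small step from 0.5 toward the target of 2.8.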
Once learning has converged the estimate stops moving, so all three quantities approach the same value:
Q_{target} \approx Q_{current} \approx Q_{new}
Q_{final} \approx Q_{target} = r_t+\gamma\max_{a} Q(s_{t+1},a)
"Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{update}
- The forumla is iterative
- The process is top down
Q_{target}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
- This is the *local* best, not the *global* best
- It is a heuristic known in computer science as Greedy Optimization.
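Because the max picks only the locally best action, pure greed can get stuck; practical implementations usually mix in some random exploration. Here is a minimal epsilon-greedy sketch; the `epsilon` value is a placeholder, and the Q-table is assumed to have the same shape as the one sketched earlier:

```python
import numpy as np

rng = np.random.default_rng()
epsilon = 0.1  # exploration probability (placeholder value)

def choose_action(Q, s_t):
    """Epsilon-greedy: usually exploit the locally best action, sometimes explore."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: pick a uniformly random action.
        return int(rng.integers(n_actions))
    # Exploit: pick the greedy (locally best) action from the Q-table.
    return int(np.argmax(Q[s_t]))
```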