Thursday, August 17, 2017

Q-learning - and its core flaws






A few days ago I gave a talk on reinforcement learning with a focus on Q-learning at Cookpad's Tokyo office. https://www.meetup.com/tokyo-machine-learning-kitchen/events/242060161/

The main slides for the talk are here https://github.com/ashleysmart/mlgym/blob/master/qlearning/main_slides.html

I have been neglecting my blog lately, so I figured I would convert the slides into a post. Let's get started.


The Q function is an estimate of the system's potential value. It is assessed based on:
  • The environment or state that the system is in
  • The actions that can be taken from that state
  • The rewards that can be acquired by performing the action
The Q function is the target function that we are trying to learn, and it can be implemented as a neural network, a table, or some other standard regression-style machine learning system or function. The textbook formula:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
Where:
  • Q(s_t, a_t): The function that guesses the total 'value' of rewards
  • Q'(s_t, a_t): The new iteration of the 'value'
  • s_t: The “state” of the environment at time 't'
  • a_t: The “action” performed at time 't'
  • r_t: The “reward” received for the action at 't'
  • s_{t+1}: The “state” of the environment after the action at time 't'
  • a: A possible action performed from the state at time 't+1'
  • \alpha: The learning rate, i.e. how quickly to adjust when wrong. This is limited between 0 and 1
  • \gamma: The discount rate, i.e. how important/trusted future rewards are. This is limited between 0 and 1, and has an effect that can be thought of as an EMA (exponential moving average)
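To make the symbols concrete, here is a minimal sketch of the textbook update in Python, using a plain dictionary as the Q "table". The state and action names and the constants are made up purely for illustration.

from collections import defaultdict

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount rate

# Q 'table': maps a (state, action) pair to the current value estimate
Q = defaultdict(float)

def q_update(state, action, reward, next_state, next_actions):
    # One textbook Q-learning update:
    # Q'(s_t, a_t) = Q(s_t, a_t) + alpha * (r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t))
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# Hypothetical example call
q_update(state="s0", action="right", reward=1.0,
         next_state="s1", next_actions=["left", "right"])
print(Q[("s0", "right")])  # 0.1 -- the estimate moved 10% of the way towards the target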

Of course everyone understood that... If we manipulate the formula a bit, how it works becomes much clearer. Starting with the textbook form:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
We notice that
Q(s_t, a_t)
appears in several places, so we can group it together:
Q'(s_t, a_t)=(1-\alpha)Q(s_t, a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1}, a)\Big)
Then we can group the terms inside the brackets into a single target term:
Q_{target}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
And finally we have something very clear:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{target}
As you can see, alpha is acting as a ratio that merges the current value of Q with the target value of Q to give the new value for the next iteration. Also, given that Q-learning mixes 2 numbers by a percentage to produce a third, when learning is complete and stable all 3 parts will match, i.e.:
Q_{target} \approx Q_{current} \approx Q_{new}
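For a quick feel for this blending, take some numbers chosen purely for illustration: \alpha=0.1, a current estimate Q_{current}=5 and a target Q_{target}=10. Then:
Q_{new}=(1-0.1)\times 5+0.1\times 10=4.5+1.0=5.5
Each update only nudges the estimate 10% of the way towards the target.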
So the core of what it learns is:
Q_{final} \approx Q_{target} = r_t+\gamma\max_{a} Q(s_{t+1},a)
This is just a recursive formula, and computer science people will often instantly associate dynamic programming and tree diagrams with it. OK, so now let's have a look at how this works out as solutions to problems. Each circle represents a world state; the number on the left is the instant reward for that state and the number on the right is the current Q value for that state.
[Diagram: a tree of world states, each circle labelled "instant reward / current Q value", e.g. "+1 / 1", "+5 / 5", "-1 / ?", "0 / ?"]
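To see the recursive, dynamic-programming view in code, here is a minimal sketch on a made-up toy tree of states. The structure and rewards are illustrative only, not the exact diagram above.

GAMMA = 0.9

# Each state: (instant reward, list of child states reachable by one action).
# The tree and rewards are made up for illustration.
tree = {
    "root":      (0,  ["left", "right"]),
    "left":      (+1, ["left_end"]),
    "right":     (-1, ["right_end"]),
    "left_end":  (0,  []),
    "right_end": (+5, []),
}

def value(state):
    # A state's value is its reward plus the discounted value of its best child,
    # i.e. the "r_t + gamma * max" recursion written directly as dynamic programming.
    reward, children = tree[state]
    if not children:
        return reward
    return reward + GAMMA * max(value(child) for child in children)

for s in tree:
    print(s, round(value(s), 2))
# root ends up worth 0 + 0.9 * max(1.0, 3.5) = 3.15 -- the full recursion
# correctly prefers the branch with the delayed +5 reward.

An exhaustive recursion like this sees the whole tree before committing; the iterative, greedy form of the update below does not, which is exactly where the trouble starts.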
But there is of course a problem. Pay very close attention to the formulas:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{target}
Note that:
  • The formula is iterative
  • The update is top down
Also:
Q_{target}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
Note carefully the effect and scope of the “max”:
  • This is the *local* best, not the *global* best
  • It is a heuristic known in computer science as greedy optimization
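One way to picture the greedy problem, with a tiny made-up example: if choices are made on the immediately best-looking option (the local max), a branch with a worse first step but a much better payoff later never gets chosen.

# A made-up illustration of greedy (local) vs global optimisation.
paths = {
    "A": [+1, 0],    # looks best on the first step
    "B": [-1, +5],   # worse first step, far better in total
}

greedy_choice = max(paths, key=lambda p: paths[p][0])     # only compares the first reward
global_choice = max(paths, key=lambda p: sum(paths[p]))   # compares the whole return

print("greedy picks:", greedy_choice, "total =", sum(paths[greedy_choice]))  # A, total 1
print("global best: ", global_choice, "total =", sum(paths[global_choice]))  # B, total 4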
These are the key flaws in this algorithm. So what really happens is:
[Diagram: the world-state tree again, showing the Q values the updates actually settle on]
