Thursday, August 17, 2017

Q-learning - and its core flaws






A few days ago I gave a talk on reinforcement learning, with a focus on Q-learning, at Cookpad's Tokyo office: https://www.meetup.com/tokyo-machine-learning-kitchen/events/242060161/

The main slides for the talk are here: https://github.com/ashleysmart/mlgym/blob/master/qlearning/main_slides.html

I have been neglecting my blog lately, so I figured I would convert the slides into a post. Let's get started.


The Q function is an estimate of the system's potential value. It is assessed based on:
  • The environment or state that the system is in
  • The actions that can be taken from that state
  • The rewards that can be acquired by performing the action
The Q function is the target function that we are trying to learn, and it can be implemented as a neural network, a table, or some other standard regression machine learning system or function. The textbook formula is:
Q'(s_t, a_t) = Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big)
Where:
  • Q: the function that guesses the total 'value' of rewards
  • Q': the new iteration of the 'value'
  • s_t: the "state" of the environment at time 't'
  • a_t: the "action" performed at time 't'
  • r_t: the "reward" received for the action at 't'
  • s_{t+1}: the "state" of the environment after the action at time 't'
  • a: a possible action performed from state 's_{t+1}'
  • \alpha: the learning rate, i.e. how quickly to adjust when wrong. This is limited between 0 and 1.
  • \gamma: the discount rate, i.e. how important/trusted future rewards are. This is limited between 0 and 1, and has an effect that can be considered as an EMA (exponential moving average).
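In code, the whole update is one line of arithmetic. Here is a minimal Python sketch of the table implementation; the state/action counts and the sample transition are made-up placeholders, not from the talk:

import numpy as np

n_states, n_actions = 16, 4            # hypothetical sizes
Q = np.zeros((n_states, n_actions))    # Q implemented as a table

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # target = r_t + gamma * max_a Q(s_{t+1}, a)
    target = r + gamma * np.max(Q[s_next])
    # move Q(s_t, a_t) toward the target at rate alpha (the textbook form)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)   # one learning step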

Of course everyone understood that... If we manipulate the formula a bit, how it works becomes much clearer. Starting with the textbook form:
Q'(s_t, a_t) = Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big)
We notice that Q(s_t, a_t) appears in several places, so we can group it together:
Q'(s_t, a_t) = (1-\alpha) Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a} Q(s_{t+1}, a) \Big)
Then we can group the terms that don't involve Q(s_t, a_t) into a common item:
Q_{target} = r_t + \gamma \max_{a} Q(s_{t+1}, a)
And finally we have something very clear:
Q_{new} = (1-\alpha) Q_{current} + \alpha Q_{target}
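To convince yourself the grouping changes nothing, here is a quick numerical check in Python (the values are arbitrary):

alpha, q_current, q_target = 0.1, 2.0, 5.0
textbook   = q_current + alpha * (q_target - q_current)   # original form
rearranged = (1 - alpha) * q_current + alpha * q_target   # grouped form
assert abs(textbook - rearranged) < 1e-12                 # identical update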
As you can see, \alpha is acting as a ratio that merges the current value of Q with the target value of Q into the new value for the next iteration. Also, given that Q-learning just percentage-mixes two numbers to produce a third, when learning is complete and stable all three parts will match, i.e.:
Q_{target} \approx Q_{current} \approx Q_{new}
So the core of what it learns is:
Q_{final} \approx Q_{target} = r_t + \gamma \max_{a} Q(s_{t+1}, a)
This is just a recursive formula, and comp-sci people will often instantly associate dynamic programming and tree diagrams with it. OK, so now let's have a look at how this works out solutions to problems. Each circle represents a world state; the number on the left is the instant reward for that state, and the number on the right is the current Q value for that state.
[Tree diagram of world states, each labelled "instant reward / current Q value"; states not yet learned show '?' as their Q value.]
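As a toy illustration of that recursion in code (this tree is invented for the example, not the one from the slides):

# each node is (instant_reward, children)
tree = (0, [(+1, []),
            (0, [(+5, []), (-1, [])])])

def q_value(node, gamma=0.9):
    # r + gamma * max over the next states -- the recursive formula above
    r, children = node
    if not children:
        return r
    return r + gamma * max(q_value(c, gamma) for c in children)

print(q_value(tree))  # the best rewards propagate up from the leaves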
But there is of course a problem. Pay very close attention to the formulas:
"Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{update}
ReferenceError: katex is not defined
Note that:
  • The formula is iterative
  • It is top-down
Also:
Q_{update} = r_t + \gamma \max_{a} Q(s_{t+1}, a)
Note carefully the effect and scope of the "max":
  • This is the *local* best, not the *global* best
  • It is a heuristic known in computer science as Greedy Optimization
These are the key flaws in this algorithm. So what really happens is:
[Tree diagram: the same world states, showing the values the greedy updates actually settle on.]
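To make the greedy flaw concrete, here is a tiny made-up illustration. Until a better branch's future rewards have been explored and propagated back up the tree, the max picks the locally better-looking action every time:

# hypothetical one-step Q estimates for two branches:
# branch 'A' pays +1 immediately and nothing later,
# branch 'B' pays 0 immediately but +10 deeper in the tree
Q_next = {"A": 1.0, "B": 0.0}       # B's future is not yet propagated back
best = max(Q_next, key=Q_next.get)
print(best)                          # -> 'A': greedy locks onto the local best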

Saturday, March 18, 2017

Setting up basic Machine Learning rig - Ubuntu

For Ubuntu:
https://www.tensorflow.org/install/install_linux

First, install Python and pip:

sudo apt-get install python python-pip python-dev
sudo pip install --upgrade pip virtualenv 

--- WITH A GPU ---

Install the GPU if not already done as described here:
https://www.pugetsystems.com/labs/hpc/NVIDIA-CUDA-with-Ubuntu-16-04-beta-on-a-laptop-if-you-just-cannot-wait-775/

Verify that you even have a usable GPU (it must be one compatible with CUDA):
lspci | grep -i nvidia

Install CUDA drivers

Remove prior installs (if you have a problem with them):
sudo apt-get purge nvidia-cuda*
sudo apt-get install cuda

Download the recent CUDA drivers from
https://developer.nvidia.com/cuda-downloads

Install the drivers:
chmod 755 cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run --override

Confirm setup
which nvcc 
nvcc --version
nvidia-smi

Output should be something like
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17


Sat Mar 18 14:16:58 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970M    Off  | 0000:01:00.0     Off |                  N/A |
| N/A   55C    P0    22W /  N/A |    586MiB /  3016MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1144    G   /usr/lib/xorg/Xorg                             366MiB |
|    0      1922    G   compiz                                         111MiB |
|    0      2302    G   ...bled/ExtensionDeveloperModeWarning/Defaul   107MiB |
+-----------------------------------------------------------------------------+


Install cuDNN
http://askubuntu.com/questions/767269/how-can-i-install-cudnn-on-ubuntu-16-04

Download the drivers
https://developer.nvidia.com/cudnn


Locate where your CUDA installation is. It will be under /usr/lib/... and /usr/include, or /usr/local/cuda/:

which nvcc 
ldconfig -p | grep cuda

Copy the files:
cd extracted_driver/
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*

Confirm setup
ldconfig -p | grep cudnn

The output should be something like:
libcudnn.so.5 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so.5
libcudnn.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so

--- INSTALL KEY TOOLS ---

Install machine learning essentials

sudo pip install numpy
sudo pip install pandas
sudo pip install scikit-learn
sudo pip install jupyter
sudo pip install xgboost
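
A quick way to confirm the essentials are importable (run inside Python):

import numpy, pandas, sklearn, xgboost
print(numpy.__version__, pandas.__version__, sklearn.__version__, xgboost.__version__)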


Now you can install TensorFlow with GPU support as follows:
sudo pip install tensorflow-gpu
sudo pip install keras    

Or without:
sudo pip install tensorflow
sudo pip install keras
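
Finally, a quick sanity check (run in Python) that TensorFlow imports and, on the GPU build, can see the device. This uses the TF 1.x API that was current when this post was written:

import tensorflow as tf
print(tf.__version__)

from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])  # the GPU build should list a '/gpu:0' device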