
Thursday, August 17, 2017

Q learning - and its core flaws






A few days ago I gave a talk on reinforcement learning, with a focus on Q-learning, at Cookpad's Tokyo office: https://www.meetup.com/tokyo-machine-learning-kitchen/events/242060161/

The main slides for the talk are here https://github.com/ashleysmart/mlgym/blob/master/qlearning/main_slides.html

I have been neglecting my blog lately, so I figured I would convert the slides into a post. Let's get started.


The Q function is an estimate of the system's potential value. It is assessed based on:
  • The environment or state that the system is in
  • The actions that can be taken from that state
  • The rewards that can be acquired by performing the action
The Q function is the target function that we are trying to learn. It can be implemented as a neural network, a table, or some other standard machine learning regression system or function. The textbook formula is:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
Where:
  • Q : The function that guesses the total 'value' of rewards
  • Q' : The new iteration of the 'value'
  • s_t : The "state" of the environment at time 't'
  • a_t : The "action" performed at time 't'
  • r_t : The "reward" received for the action at 't'
  • s_{t+1} : The "state" of the environment after the action at time 't'
  • a : A possible action performed from state 's_{t+1}'
  • \alpha : The learning rate, how quickly to adjust when wrong. This is limited between 0 and 1
  • \gamma : The discount rate, how important/trusted future rewards are. This is limited between 0 and 1, and has an effect that can be considered as an EMA (exponential moving average)

Of course everyone understood that... If we manipulate the formula a bit, how it works becomes much clearer. Starting with the textbook form:
Q'(s_t, a_t)=Q(s_t,a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1},a)-Q(s_t,a_t)\Big)
We notice that Q(s_t, a_t) appears in several places, so we can group it together:
Q'(s_t, a_t)=(1-\alpha)Q(s_t, a_t)+\alpha\Big(r_t+\gamma\max_{a}Q(s_{t+1}, a)\Big)
Then we can collect the remaining terms into a single update target:
Q_{target}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
And finally we have something very clear:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{target}
As you can see, alpha is acting as a ratio that merges the current value of Q with the target value of Q to produce the new value for the next iteration. Also, given that Q-learning is a percentage mix of two numbers to produce a third, when learning is complete and stable all three parts will match, i.e.:
Q_{target} \approx Q_{current} \approx Q_{new}
So the core of what it learns is:
Q_{final} \approx Q_{target} = r_t+\gamma\max_{a} Q(s_{t+1},a)
This is just a recursive formula, and comp-sci people will often instantly associate dynamic programming and tree diagrams with it. OK, so now let's have a look at how this works out solutions to problems. Each circle in the diagrams represents a world state, the number on the left is the instant reward for that state, and the number on the right is the current Q value for that state.
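
To make the mixing concrete, here is a minimal tabular sketch of the update above (the state/action counts and the example transition are placeholders of my own, not something from the talk):

# Minimal tabular Q-learning update, matching the re-arranged formula above.
# The sizes and the sample transition below are made-up placeholders.
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def q_update(s_t, a_t, r_t, s_next):
    q_target = r_t + gamma * np.max(Q[s_next])                   # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * q_target   # mix current and target

q_update(0, 1, 1.0, 2)   # one example transition: s_t=0, a_t=1, r_t=1.0, s_{t+1}=2
print(Q)
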
But there is of course a problem. Pay very close attention to the formulas:
Q_{new}=(1-\alpha)Q_{current}+\alpha Q_{update}
Note that:
  • The formula is iterative
  • The update is top down
Also:
Q_{update}=r_t+\gamma\max_{a}Q(s_{t+1}, a)
Note carefully the effect and scope of the “max”:
  • This is the *local* best, not the *global* best
  • It is a heuristic known in computer science as greedy optimization (see the toy sketch below)
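
To make the greedy point concrete, here is a toy sketch of my own (it illustrates generic greedy optimization, not Q-learning itself): always grabbing the locally best one-step reward can miss the globally best path.

# Toy greedy-optimization example (my own illustration, not from the slides).
# A two-step decision tree: (first-step reward, {second-step rewards}).
tree = {
    'left':  (5, {'left': 0,  'right': 1}),   # big reward now, little later
    'right': (1, {'left': 10, 'right': 2}),   # small reward now, big later
}

def greedy_total():
    # pick whatever looks best one step ahead, then the best of what remains
    first = max(tree, key=lambda a: tree[a][0])
    reward, later = tree[first]
    return reward + max(later.values())

def global_total():
    # consider whole paths before committing
    return max(r + max(later.values()) for r, later in tree.values())

print("greedy total: %d, global best: %d" % (greedy_total(), global_total()))   # 6 vs 11
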
These are the key flaws in this algorithm. So what really happens is:

Saturday, March 18, 2017

Setting up basic Machine Learning rig - Ubuntu

For Ubuntu:
https://www.tensorflow.org/install/install_linux

First install python and pip

sudo apt-get install python python-pip python-dev
sudo pip install --upgrade pip virtualenv 

--- WITH A GPU ---

Install the GPU if not already done as described here:
https://www.pugetsystems.com/labs/hpc/NVIDIA-CUDA-with-Ubuntu-16-04-beta-on-a-laptop-if-you-just-cannot-wait-775/

Verify that you have a usable GPU (it must be one compatible with CUDA):
lspci | grep -i nvidia

Install CUDA drivers

Remove prior installs (if you have a problem with it)
sudo apt-get purge nvidia-cuda* 
sudo apt-get install cuda

Download the recent CUDA drivers from:
https://developer.nvidia.com/cuda-downloads

Install the drivers:
chmod 755 cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run --override

Confirm setup
which nvcc 
nvcc --version
nvidia-smi

Output should be something like
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17


Sat Mar 18 14:16:58 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970M    Off  | 0000:01:00.0     Off |                  N/A |
| N/A   55C    P0    22W /  N/A |    586MiB /  3016MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1144    G   /usr/lib/xorg/Xorg                             366MiB |
|    0      1922    G   compiz                                         111MiB |
|    0      2302    G   ...bled/ExtensionDeveloperModeWarning/Defaul   107MiB |
+-----------------------------------------------------------------------------+


Install cuDNN drivers
http://askubuntu.com/questions/767269/how-can-i-install-cudnn-on-ubuntu-16-04

Download the drivers
https://developer.nvidia.com/cudnn


Locate where your CUDA installation is. It is usually under /usr/lib/... and /usr/include, or /usr/local/cuda/.

which nvcc 
ldconfig -p | grep cuda

Copy the files:
cd extracted_driver/
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*

Confirm setup
ldconfig -p | grep cudnn

should be something like:
libcudnn.so.5 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so.5
libcudnn.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so

--- INSTALL KEY TOOLS ---

Install Machine learning essentials

sudo pip install numpy
sudo pip install pandas
sudo pip install scikit-learn
sudo pip install jupyter
sudo pip install xgboost


Now you can install TensorFlow with GPU support as follows:
sudo pip install tensorflow-gpu
sudo pip install keras    

Or without:
sudo pip install tensorflow
sudo pip install keras    
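
A quick sanity check that TensorFlow actually sees the GPU (this uses the TF 1.x-era API matching the packages above; the snippet is my own addition, not part of the original guide):

# List the devices TensorFlow can see and log where ops get placed.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())   # should include a /device:GPU:0 entry

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(tf.constant('hello, gpu')))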

Wednesday, October 26, 2016

analytic regression in python using a parameterized family of functions

I have been studying ML (machine learning). So here is the deal.. ML is just applied statistics. You take a bunch of data and you fit a model to it. The data affords you a certain upper limit of accuracy compared to the real world. Your chosen model then subtracts a certain percentage off the top. The cost of abstraction is an imperfect fit. Simple.

So one of the primary rules of thumb when working with this stuff is: start simple. You need to digest the dataset and then make incremental steps forward... it's easy, however, to be making steps backwards without knowing it.

Now regression is just curve fitting. The simplest way to do this is a direct analytic solution: no gradient descent, nothing more complex. But this approach has limits that quickly become clear when dealing with larger datasets.

So let's solve analytic regression for a parameterized family of functions and then codify it... you'll note I have some quirks to the way I do things.. I try to avoid matrix maths since I think it hides too much and fails too readily when you get beyond two axes of data.

Define the model:
\hat y_i := \sum_k w_k f(x_i;k)

Define the loss:
MSE = \sum_i(\hat y_i - y_i)^2

Set the target (when the MSE gradient is 0 the error is at its minimum, for each of the weights w_0 to w_n):
\frac{\partial MSE}{\partial w} = 0

Sub the model into the target equation for each "n" derivative term:
= \frac{\partial}{\partial w_n} \Big( \sum_i( \sum_k w_k f(x_i;k) - y_i)^2 \Big)

Expand the square (note: do not expand the sum over k, treat it as a single term):
= \frac{\partial}{\partial w_n} \Big( \sum_i\big( (\sum_k w_k f(x_i;k))^2 -2 y_i \sum_k w_k f(x_i;k) + y_i^2 \big) \Big)

Transfer outer sum to terms:
= \frac{\partial}{\partial w_n} \Big( \sum_i(\sum_k w_k f(x_i;k))^2 - \sum_i 2 y_i \sum_k w_k f(x_i;k) + \sum_iy_i^2 \Big)

Transfer derivative to terms and expand/rebase squares:
= \frac{\partial}{\partial w_n} \sum_i(\sum_k w_k f(x_i;k))(\sum_j w_j f(x_i;j)) - \frac{\partial}{\partial w_n} \sum_i \sum_k 2 y_i w_k f(x_i;k) + \frac{\partial}{\partial w_n} \sum_iy_i^2

Do the simple derivatives on the two right-most terms:
* Note that the partial derivative of the sum drops all terms unless n=k
* Note that the sum of y's looks like a constant, and the derivative of a constant is 0
= \sum_i\frac{\partial}{\partial w_n} (\sum_k w_k f(x_i;k))(\sum_j w_j f(x_i;j)) - \sum_i 2 y_i f(x_i;n)

Start applying the derivative product rule:
= \sum_i\sum_k w_k f(x_i;k) \frac{\partial}{\partial w_n}(\sum_j w_j f(x_i;j)) + \sum_i\sum_j w_j f(x_i;j) \frac{\partial}{\partial w_n}(\sum_k w_k f(x_i;k)) - \sum_i 2 y_i f(x_i;n)

Continue the derivative product rules:
= \sum_i\sum_k w_k f(x_i;k) f(x_i;n) + \sum_i\sum_j w_j f(x_i;j) f(x_i;n) - \sum_i 2 y_i f(x_i;n)

Rebase vars into common ones:
= \sum_i\sum_j w_j f(x_i;j) f(x_i;n) + \sum_i\sum_j w_j f(x_i;j) f(x_i;n) - \sum_i 2 y_i f(x_i;n)

Simplify:
= 2 \sum_i\sum_j w_j f(x_i;j) f(x_i;n) - 2 \sum_i y_i f(x_i;n)

Equate for a linear solver implementation (recall there are now "n" of these equations):
\sum_i\sum_j w_j f(x_i;j) f(x_i;n) = \sum_i y_i f(x_i;n)

Now the above maths when codified produces the following code:

#!/usr/bin/env python

# analytic regression for a general linear combination of functions

import random
import numpy as np
import matplotlib.pyplot as plt

# convert the maths into a function
# note this is likely weak in respect to:
#  - overflow (because we are summing numbers)
#  - large amounts of data (no batching etc)
#   etc
def analytic_regress(x,y,params,func):
    size = len(params)
    left  = np.zeros((size,size))
    right = np.zeros(size)

    for i in range(size):
        right[i] = np.sum(y*func(x,params[i]))

        for j in range(size):
            left[i,j] = np.sum(func(x,params[i])*func(x,params[j]))

    w = np.linalg.solve(left, right)

    model = lambda x: (np.sum(w * np.array([func(x,i) for i in params])))
    return w, model

### testing ###

def generate(func,
             x_min, x_max,
             yerr,
             count):
    x = np.random.uniform(x_min, x_max, count)
    y = np.array([func(xi) for xi in x]) + np.random.uniform(0,yerr,count)

    return x,y

# basis functions to trial
def ploy(x,n):
    return x**n

def cossin(x,n):
    if (n < 0):
        return np.sin(-n * x)
    return np.cos(n * x)

def sigmoid(x,n):
    return 1.0/(1.0 + np.exp(n-x))

def guassian(x,n):
    return np.exp(-0.5*(x-n)**2)

def square(x,n,w):
    return (1.0*np.logical_and(x >= n, x < n+w))

# a general function to trial a few regressions
def test_set(func,
             x_min, x_max, y_error, count,
             basis, params,
             x_test):

    x,y = generate(func, x_min, x_max, y_error, count)

    models = {}
    for p in params:
        w, models[p]  = analytic_regress(x,y,range(p),basis)
        print "model basis up to:", p, " has w:"
        print w

    y_test = {}
    for p in params:
        model = models[p]
        y_test[p]  = np.array([model(i) for i in x_test])

    plt.plot(x , y , "o")
    for p in params:
        plt.plot(x_test, y_test[p])

    plt.show()

# a general function to trial a single regression
def test_one(func,
             x_min, x_max, y_error, count,
             basis, params,
             x_test):
    x,y = generate(func, x_min, x_max, y_error, count)

    w, model = analytic_regress(x,y, params, basis)
    print w

    y_test = np.array([model(i) for i in x_test])

    plt.plot(x , y , "o")
    plt.plot(x_test, y_test)
    plt.show()

print " ploy regress for function y = 78 - 2x"
test_set(lambda x: (78.0 - 2.0*x),
         0.0, 6.0, 6, 100,
         ploy, [2,3,5,9,11],
         np.array(range(-1, 70))/10.0)

print " ploy regress for function y = 78 + 60x - 13x^2"
test_set(lambda x: (78.0 + 60.0*x - 13.0*x**2),
         0.0, 6.0, 6, 100,
         ploy, [2,3,5,9,11],
         np.array(range(-2, 72))/10.0)

print "square regress for function y = 1 iff 1 < x < 6 othewise 0"
test_one(lambda x: (0 if (x < 1 or x > 6) else 1),
         0.0, 7.0, 0.4, 100,
         lambda x,n: (square(x,n,0.5)), np.array(range(14))/2.0,
         np.array(range(-50, 150))/10.0)

print "square regress for function y = 10*cos(2x)"
test_one(lambda x: 10*np.cos(2.0*x),
         0.0, 7.0, 0.4, 100,
         lambda x,n: (square(x,n,0.5)), np.array(range(14))/2.0,
         np.array(range(-50, 150))/10.0)

print "sigmod regress for function y = 10*cos(2x)"
test_one(lambda x: 10*np.cos(2.0*x),
         0.0, 7.0, 0.4, 100,
         sigmoid, np.array(range(14))/2.0,
         np.array(range(-50, 150))/10.0)

print "guassian regress for function y = 10*cos(2x)"
test_one(lambda x: 10*np.cos(2.0*x),
         0.0, 7.0, 0.4, 100,
         guassian, np.array(range(14))/2.0,
         np.array(range(-50, 150))/10.0)

print "cos/sin regress for function y = 10*cos(2x)"
test_one(lambda x: 10*np.cos(2.0*x),
         0.0, 7.0, 0.4, 100,
         cossin, np.array(range(-6,8))/2.0,
         np.array(range(-50, 150))/10.0)

And then the various outputs...

The regression using a family of polynomials on a linear equation:

The regression using a family of polynomials on a quadratic:

The regression using square bars on a square bar:

The regression using square bars on a cos:

The regression using gaussians on a cos:

The regression using sigmoids on a cos:

The regression using sin and cos on a cos:

Tuesday, June 7, 2016

c++ like template AutoGrad - How to compute the Derivative Of a function using templates

You may have heard of a system called AutoGrad. It is a system that, when given an equation, computes its derivative automatically. Here is an example of how to build the same kind of system using C++ templates.

I was inspired to write this after I realized I messed up the differentiation for the neural network's backprop path in my last post.

Clearly you see where I'm headed with this.. a few more mods to make it work with tensors and I'll have a full autograd system for C++ that can automatically compute the backprop of a neural network.. it will also compile in C4droid.. and then toss in some Thrust code and it will be CUDA capable, able to use a PC with GPUs.

from smartphone to GPU in a few short steps...

Anyway here is the proof of concept code:

The output looks like:
x:2 m:3 c:4 p:6 y:10 z:16 q:100
 dm/dx:0 dx/dx:1 dp/dx:1 dy/dx:3 dz/dx:12 dq/dx:60

// compile with "g++ -std=c++11 ..."
// or in C4droid on an Android phone..

#include <tuple>
#include <iostream>
#include <cmath>

template <int VALUE>
struct Const
{
    template< typename... Types >
    static double op(std::tuple<Types...>& params )
    {
        return VALUE;
    }
};

template <int ID>
struct Var
{
    template< typename... Types >
    static double op(std::tuple<Types...>& params )
    {
        return std::get<ID>(params);
    }
};

template <typename L, typename R>
struct OpAdd
{
    template< typename... Types >
    static double op(std::tuple<Types...>& params)
    {
        return L::op(params) + R::op(params);
    }
};

template <typename L, typename R>
struct OpMult
{
    template< typename... Types >
    static double op(std::tuple<Types...>& params)
    {
        return L::op(params) * R::op(params);
    }
};

template <typename L, int POW>
struct OpPower
{
    template< typename... Types >
    static double op(std::tuple<Types...>& params)
    {
        return std::pow(L::op(params), POW);
    }
};

template <typename Num, typename Denum>
struct DerivativeOf {};

template <int V, int ID_D>
struct DerivativeOf<Const<V>, Var<ID_D> >
{
    typedef Const<0> Type;
};

template <int ID_N, int ID_D>
struct DerivativeOf<Var<ID_N>, Var<ID_D> >
{
    typedef Const<0> Type;
};

template <int ID_N>
struct DerivativeOf<Var<ID_N>, Var<ID_N> >
{
    typedef Const<1> Type;
};

template <typename L, typename R, typename D>
struct DerivativeOf<OpAdd<L,R>, D >
{
    typedef OpAdd<typename DerivativeOf<L,D>::Type,
                  typename DerivativeOf<R,D>::Type> Type;
};

template <typename L, typename R, typename D>
struct DerivativeOf<OpMult<L,R>, D >
{
    typedef OpAdd<OpMult<typename DerivativeOf<L, D>::Type, R>,
                  OpMult<L, typename DerivativeOf<R, D>::Type> > Type;
};

template <typename L, int POW>
struct DerivativeOf<OpPower<L,POW>, L >
{
    typedef OpMult<Const<POW>, OpPower<L, POW-1> > Type;
};

template <typename L, int POW, typename D>
struct DerivativeOf<OpPower<L,POW>, D >
{
    typedef OpMult<OpMult<Const<POW>, OpPower<L, POW-1> >,
                   typename DerivativeOf<L, D>::Type>  Type;
};

int main()
{
    typedef Var<0> X;
    typedef Var<1> M;
    typedef Var<2> C;
    typedef OpAdd<X,C> P;                      // p = x + c
    typedef OpAdd<OpMult<M,X>,C> Y;            // y = m*x + c
    typedef OpAdd<OpMult<M,OpPower<X,2>>,C> Z; // z = m*x^2 + c
    typedef OpPower<OpAdd<OpMult<M,X>,C>,2> Q; // q = y^2 = (m*x + c)^2

    typedef DerivativeOf<M, X>::Type  dM_dX;   // dm/dx = 0
    typedef DerivativeOf<X, X>::Type  dX_dX;   // dx/dx = 1
    typedef DerivativeOf<P, X>::Type  dP_dX;   // dP/dx = 1
    typedef DerivativeOf<Y, X>::Type  dY_dX;   // dY/dx = m
    typedef DerivativeOf<Z, X>::Type  dZ_dX;   // dZ/dx = 2*m*x
    typedef DerivativeOf<Q, X>::Type  dQ_dX;   // dQ/dx = dq/dy * dy/dx = 2*(m*x + c)*m

    X x;
    M m;
    C c;
    P p;
    Y y;
    Z z;
    Q q;

    dX_dX  dx_dx;
    dM_dX  dm_dx;
    dP_dX  dp_dx;
    dY_dX  dy_dx;
    dZ_dX  dz_dx;
    dQ_dX  dq_dx;

    std::tuple<double,double,double> params = std::make_tuple(2.0,3.0,4.0);

    std::cout << " x:" << x.op(params)
              << " m:" << m.op(params)
              << " c:" << c.op(params)
              << " p:" << p.op(params)
              << " y:" << y.op(params)
              << " z:" << z.op(params)
              << " q:" << q.op(params)
              << "\n";

    std::cout << " dm/dx:"  << dm_dx.op(params)
              << " dx/dx:"  << dx_dx.op(params)
              << " dp/dx:"  << dp_dx.op(params)
              << " dy/dx:"  << dy_dx.op(params)
              << " dz/dx:"  << dz_dx.op(params)
              << " dq/dx:"  << dq_dx.op(params)
              << "\n";
}

Thursday, May 5, 2016

libraryless c++ neural nets

UPDATE: DON'T USE THIS

I have noticed that I messed up the backprop.. it seems matrices are not my strong point and do/dw is a killer.. even the wiki page on it gives up: https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions ... check the "Result of differentiating various kinds of aggregates with other kinds of aggregates" table; it has "??" in the vector y by matrix X column.



I, like every other comp-sci guy on the planet, am studying machine learning... But to do this, tools are needed..

Libraries, toolkits and frameworks with their sundry list of requirements have always annoyed the hell out of me. These things are never cost-less.. We constantly have to burn time setting up a system that is compatible with the megabytes of dependencies that these things require to be installed, then deal with the portability issues that stop them from installing, then over the long term maintain and upgrade them and deal with package drift between the various components to keep it all running, and then hope that the package maintainers don't quit updating the entire mess.

Library-free raw language implementations have some big pluses going for them:
* Firstly, any of today's smartphones has enough processing power to do the job of training and working with neural networks.. but the tool chain makes it next to impossible to actually do the job..
* Secondly, companies are often paranoid about what you can install on workstations.

So in short I have tossed out the ML tool chain and given it a shot in plain simple C++.

ML at its core is just matrix math, so I have hacked together a Matrix class that can be compiled with C4droid, g++, and clang. Then, using this Matrix class, I built a simple xor model and trained it using hand-worked-out backprop. This is proof of concept that with less than 1000 lines of normal C++ code and a standard C++ compiler you're good to go for ML projects.

The whole mess is 728 lines of C++11 code (it needs a fair bit of clean up, but hey, whatever).

Here's the end result of the build and run on my smartphone (a Nexus 5X using C4droid):



And the code:
#include <iostream>
#include <vector>
#include <functional>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <cmath>
#include <cstdlib>

// ****************************************************************
// *************************** MATRIX *****************************
// ****************************************************************

template <typename Type>
struct Matrix
{
    //private:
    typedef std::vector<Type> Data;

    std::shared_ptr<Data>    data_;
    std::vector<std::size_t> shape_;

    // friend class MatrixUtils<Type>;
    //public:

    Matrix()
    {}

    Matrix(std::size_t cols,
           std::size_t rows)
    {
        data_.reset(new Data);
        data_->resize(cols*rows);
        shape_.push_back(cols);
        shape_.push_back(rows);
    }

    template <std::size_t N>
    Matrix(Type (&raw)[N],
           std::size_t cols,
           std::size_t rows)
    {
        data_.reset(new Data);
        data_->resize(N);
        data_->assign(raw,raw+N);
        shape_.push_back(cols);
        shape_.push_back(rows);
    }

    Matrix(Type* begin, Type* end,
           std::size_t cols,
           std::size_t rows)
    {
        data_.reset(new Data);
        data_->resize(cols*rows);
        data_->assign(begin,end);
        shape_.push_back(cols);
        shape_.push_back(rows);
    }

    Matrix(std::shared_ptr<Data> data,
           std::size_t cols,
           std::size_t rows) :
        data_(data)
    {
        shape_.push_back(cols);
        shape_.push_back(rows);
    }

    Type& at(std::size_t x, std::size_t y)
    {
        return (*data_)[y*shape_[0] + x];
    }

    Type at(std::size_t x, std::size_t y) const
    {
        return (*data_)[y*shape_[0] + x];
    }
};

// ****************************************************************
// ************************ MATRIX UTILS **************************
// ****************************************************************

template <typename Type>
struct MatrixUtils
{
    static void print(std::ostream& os, const Matrix<Type>& a)
    {
       for (std::size_t y = 0; y < a.shape_[1]; ++y)
        {
            for (std::size_t x = 0; x < a.shape_[0]; ++x)
            {
                os << a.at(x,y) << " ";
            }
            os << "\n";
        }
    }

    static Matrix<Type> matmul(const Matrix<Type>& a,
                               const Matrix<Type>& b)
    {
        if (a.shape_[0] != b.shape_[1])
        {
            std::stringstream ss;
            ss << "Matrix shapes wrong for matmul"
               << " a: " << a.shape_[0] << "x" << a.shape_[1]
               << " b: " << b.shape_[0] << "x" << b.shape_[1];
            throw std::runtime_error(ss.str());
        }

        Matrix<Type> res(b.shape_[0], a.shape_[1]);

        for (std::size_t y = 0; y < res.shape_[1]; ++y)
        {
            for (std::size_t x = 0; x < res.shape_[0]; ++x)
            {
                Type sum = 0;
                for (std::size_t i = 0; i < a.shape_[0]; ++i)
                {
                    sum += a.at(i,y) * b.at(x,i);
                }
                res.at(x,y) = sum;
            }
        }

        return res;
    }

    static Matrix<Type> selrow(std::size_t row,
                               const Matrix<Type>& a)
    {
        // TODO reimpl as an iterator!
        Matrix<Type> r(a.shape_[0],1);

        for (std::size_t x = 0; x < a.shape_[0]; ++x)
        {
            r.at(x,0) = a.at(x,row);
        }

        return r;
    }

    static Matrix<Type> selcol(std::size_t col,
                               const Matrix<Type>& a)
    {
        Matrix<Type> r(1,a.shape_[1]);

        for (std::size_t y = 0; y < a.shape_[1]; ++y)
        {
            r.at(0,y) = a.at(col,y);
        }

        return r;
    }

    static Matrix<Type> transpose(const Matrix<Type>& a)
    {
        Matrix<Type> r(a.shape_[1],a.shape_[0]);

        for (std::size_t y = 0; y < a.shape_[1]; ++y)
        {
            for (std::size_t x = 0; x < a.shape_[0]; ++x)
            {
                r.at(y,x) = a.at(x,y);
            }
        }

        return r;
    }

    static void unifunctor_inplace(std::function<Type (Type)> func,
                                   Matrix<Type>& a)

    {
        typedef typename Matrix<Type>::Data::iterator iterator;
        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??

        for(iterator it = a.data_->begin();
            it != a.data_->end();
            ++it)
        {
            *it = func(*it);
        }
    }

    static Matrix<Type> unifunctor(std::function<Type (Type)> func,
                                   const Matrix<Type>& a)

    {
        typedef typename Matrix<Type>::Data::iterator iterator;
        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??

        Matrix<Type> r(a.shape_[0],a.shape_[1]);

        iterator ait = a.data_->begin();
        for(iterator rit = r.data_->begin();
            rit != r.data_->end();
            ++rit)
        {
            *rit = func(*ait);
            ++ait;
        }

        return r;
    }

    static Matrix<Type> bifunctor(std::function<Type (Type,Type)> func,
                                  const Matrix<Type>& a,
                                  const Matrix<Type>& b)

    {
        typedef typename Matrix<Type>::Data::iterator iterator;

        if (a.shape_[0] != b.shape_[0] or
            a.shape_[1] != b.shape_[1])
        {
            std::stringstream ss;
            ss << "Matrix shapes mismatch for bifunctor"
               << " a: " << a.shape_[0] << "x" << a.shape_[1]
               << " b: " << b.shape_[0] << "x" << b.shape_[1];
            throw std::runtime_error(ss.str());
        }

        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??
        Matrix<Type> r(a.shape_[0], a.shape_[1]);

        iterator ait = a.data_->begin();
        iterator bit = b.data_->begin();
        iterator rit = r.data_->begin();

        while (rit != r.data_->end())
        {
            *rit = func(*ait, *bit);
            ++ait;
            ++bit;
            ++rit;
        }

        return r;
    }

    static void bifunctor_inplace(std::function<Type (Type,Type)> func,
                                  Matrix<Type>& a,
                                  const Matrix<Type>& b)

    {
        typedef typename Matrix<Type>::Data::iterator iterator;

        if (a.shape_[0] != b.shape_[0] or
            a.shape_[1] != b.shape_[1])
        {
            std::stringstream ss;
            ss << "Matrix shapes mismatch for bifunctor"
               << " a: " << a.shape_[0] << "x" << a.shape_[1]
               << " b: " << b.shape_[0] << "x" << b.shape_[1];
            throw std::runtime_error(ss.str());
        }

        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??
        Matrix<Type> r(a.shape_[0], a.shape_[1]);

        iterator bit = b.data_->begin();
        for(iterator ait = a.data_->begin();
            ait != a.data_->end();
            ++ait)
        {
            *ait = func(*ait, *bit);
            ++bit;
        }
    }

    static Matrix<Type> bifunctor_row(std::function<Type (Type,Type)> func,
                                      const Matrix<Type>& a,
                                      const Matrix<Type>& b)
    {
        typedef typename Matrix<Type>::Data::iterator iterator;

        if (a.shape_[0] != b.shape_[0] or
            b.shape_[1] != 1)
        {
            std::stringstream ss;
            ss << "Matrix shapes mismatch for bifunctor_row"
               << " a: " << a.shape_[0] << "x" << a.shape_[1]
               << " b: " << b.shape_[0] << "x" << b.shape_[1];
            throw std::runtime_error(ss.str());
        }

        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??
        Matrix<Type> r(a.shape_[0], a.shape_[1]);

        iterator ait = a.data_->begin();
        iterator bit = b.data_->begin();
        iterator rit = r.data_->begin();

        while (rit != r.data_->end())
        {
            if (bit == b.data_->end()) bit = b.data_->begin();

            *rit = func(*ait, *bit);
            ++ait;
            ++bit;
            ++rit;
        }

        return r;
    }

    static Matrix<Type> bifunctor_scaler(std::function<Type (Type,Type)> func,
                                        const Type a,
                                        const Matrix<Type>& b)

    {
        typedef typename Matrix<Type>::Data::iterator iterator;

        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??
        Matrix<Type> r(b.shape_[0], b.shape_[1]);

        iterator bit = b.data_->begin();
        iterator rit = r.data_->begin();

        while (rit != r.data_->end())
        {
            *rit = func(a,*bit);
            ++bit;
            ++rit;
        }

        return r;
    }

    static Matrix<Type> bifunctor_scaler(std::function<Type (Type,Type)> func,
                                        const Matrix<Type>& a,
                                        const Type b)
    {
        typedef typename Matrix<Type>::Data::iterator iterator;

        // TODO this feels llike there should be an std:algo for it
        //  maybe generate ??
        Matrix<Type> r(a.shape_[0], a.shape_[1]);

        iterator ait = a.data_->begin();
        iterator rit = r.data_->begin();

        while (rit != r.data_->end())
        {
            *rit = func(*ait,b);
            ++ait;
            ++rit;
        }

        return r;
    }

    struct Helpers
    {
        static Type add(Type a, Type b) { return a+b; }
        static Type sub(Type a, Type b) { return a-b; }
        static Type mul(Type a, Type b) { return a*b; }
        static Type div(Type a, Type b) { return a/b; }

        static Type relu(Type a) { return (a < 0) ? 0 : a; }
        static Type tanh(Type a) { return std::tanh(a); }
        static Type rand(Type a) { return static_cast<Type>(std::rand() % 100)/100.0 - 0.5; }
        static Type ones(Type a) { return 1.0; }

        static Type xor_f(Type a, Type b) { return (a*b < 0) ? -1.0 : 1.0; }
    };
};

// ****************************************************************
// ************************ MATRIX OPERATORS **********************
// ****************************************************************

template <typename Type>
std::ostream& operator<<(std::ostream& os, const Matrix<Type>& a)
{
    MatrixUtils<Type>::print(os, a);
    return os;
}

template <typename Type>
Matrix<Type> operator+(const Matrix<Type>& a,
                       const Matrix<Type>& b)
{
    return MatrixUtils<Type>::bifunctor(&MatrixUtils<Type>::Helpers::add,a,b);
}

template <typename Type>
Matrix<Type> operator-(const Matrix<Type>& a,
                       const Matrix<Type>& b)
{
    return MatrixUtils<Type>::bifunctor(&MatrixUtils<Type>::Helpers::sub,a,b);
}

template <typename Type>
Matrix<Type> operator*(const Matrix<Type>& a,
                       const Matrix<Type>& b)
{
    return MatrixUtils<Type>::matmul(a,b);
}

template <typename Type>
Matrix<Type> operator*(Type                a,
                       const Matrix<Type>& b)
{
    return MatrixUtils<Type>::bifunctor_scaler(&MatrixUtils<Type>::Helpers::mul,a,b);
}

template <typename Type>
Matrix<Type> operator*(const Matrix<Type>& a,
                       Type                b)
{
    return MatrixUtils<Type>::bifunctor_scaler(&MatrixUtils<Type>::Helpers::mul,a,b);
}

template <typename Type>
Matrix<Type> operator/(const Matrix<Type>& a,
                       const Matrix<Type>& b)
{
    return MatrixUtils<Type>::bifunctor(&MatrixUtils<Type>::Helpers::div,a,b);
}

template <typename Type>
Matrix<Type> operator+=(Matrix<Type>&       a,
                        const Matrix<Type>& b)
{
    MatrixUtils<Type>::bifunctor_inplace(&MatrixUtils<Type>::Helpers::add,a,b);
    return a;
}

template <typename Type>
Matrix<Type> product(const Matrix<Type>& a,
                     const Matrix<Type>& b)
{
    return MatrixUtils<Type>::bifunctor(&MatrixUtils<Type>::Helpers::mul,a,b);
}

template <typename Type>
Matrix<Type> rowadd(const Matrix<Type>& a,
                    const Matrix<Type>& b)
{
    return MatrixUtils<Type>::bifunctor_row(&MatrixUtils<Type>::Helpers::add,a,b);
}

template <typename Type>
Matrix<Type> transpose(const Matrix<Type>& a)
{
    return MatrixUtils<Type>::transpose(a);
}

template <typename Type>
Matrix<Type> tanh(const Matrix<Type>& a)
{
    return MatrixUtils<Type>::unifunctor(&MatrixUtils<Type>::Helpers::tanh,a);
}

template <typename Type>
Matrix<Type> tanh_derivate(const Matrix<Type>& a)
{
    // dtanh/dx = 1 - (tanh(x)) ^ 2
    Matrix<Type> th = tanh(a);

    return MatrixUtils<Type>::bifunctor_scaler(&MatrixUtils<Type>::Helpers::sub, 1, product(th,th));
}

template <typename Type>
void rand(Matrix<Type>& a)
{
    MatrixUtils<Type>::unifunctor_inplace(&MatrixUtils<Type>::Helpers::rand,a);
}

template <typename Type>
void ones(Matrix<Type>& a)
{
    MatrixUtils<Type>::unifunctor_inplace(&MatrixUtils<Type>::Helpers::ones,a);
}

// ****************************************************************
// ****************************************************************
// ****************************************************************

struct Model
{
    // input   2
    // hidden1 4
    // hidden2 2
    // output  1
    // function tanh
    // learn rate: 0.03
    // batch size 10
    // xor data [-1, 1]
    //  result in less than 140 iterations
    typedef float Type;

    struct Network
    {
        Matrix<Type> x_;
        Matrix<Type> o1_;
        Matrix<Type> h1_;
        Matrix<Type> o2_;
        Matrix<Type> h2_;
        Matrix<Type> o3_;
        Matrix<Type> y_ ;
    };

    struct Params
    {
        Matrix<float> Wh1_;
        Matrix<float> Bh1_;
        Matrix<float> Wh2_;
        Matrix<float> Bh2_;
        Matrix<float> Wo_;
        Matrix<float> Bo_;

        Params() :
            Wh1_(4,2),
            Bh1_(4,1),
            Wh2_(2,4),
            Bh2_(2,1),
            Wo_ (1,2),
            Bo_ (1,1)
        {
            // init weights
            rand(Wh1_);
            rand(Bh1_);
            rand(Wh2_);
            rand(Bh2_);
            rand(Wo_ );
            rand(Bo_ );
        }
    };

    Params p_;

    // init weights
    Model() :
        p_()
    {}

    void forward(Network& net, const Matrix<Type>& inputs)
    {
        // forward network params
        // h1 = tanh(Wh1*x  + Bh1)
        // h2 = tanh(Wh2*h1 + Bh2)
        // y  = tanh(Wo *h2 + Bo )
        net.x_ = inputs;

        net.o1_ = rowadd(net.x_  * p_.Wh1_, p_.Bh1_);  net.h1_ = tanh(net.o1_);
        net.o2_ = rowadd(net.h1_ * p_.Wh2_, p_.Bh2_);  net.h2_ = tanh(net.o2_);
        net.o3_ = rowadd(net.h2_ * p_.Wo_ , p_.Bo_ );  net.y_  = tanh(net.o3_);
    }

    Matrix<Type> loss(const Matrix<Type>& expected,
                      const Network& net)
    {
        // you don't have to run this unless you're printing stuff.. the back prop needs the "dE/dy" not the "E"
        Matrix<Type> delta = net.y_ - expected;
        return product(delta, delta);
    }

    void backward(const Matrix<Type>& expected,
                  const Network& net,
                  const Type alpha)
    {
        // note tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
        // dtanh/dx = 1 - (tanh(x)) ^ 2

        // forward network params
        // o1 = Wh1*x  + Bh1 ; h1 = tanh(o1)
        // o2 = Wh2*h1 + Bh2 ; h2 = tanh(o2)
        // o3 = Wo *h2 + Bo  ; y  = tanh(o3)
        // E = (e - y)^2

        // dE/dy =  2 * (y - e)
        //       =~     (y - e)    // toss the 2  out as alpha will take it

        // sigma3 = dE/do3
        //        = dE/dy  * dy/do3
        //        = (y - e) * (1 - (tanh(o3))^2)
        // sigma2 = dE/do2
        //        = dE/dy  * dy/do3 * do3/dh2 * dh2/do2
        //        = sigma3          * do3/dh2 * dh2/do2
        //        = sigma3          * Wo      * (1 - (tanh(o2))^2)
        // sigma1 = dE/do1
        //        = dE/dy  * dy/do3 * do3/dh2 * dh2/do2 * do2/dh1 * dh1/do1
        //        = sigma2                              * do2/dh1 * dh1/do1
        //        = sigma2                              * Wh2     * (1 - (tanh(o1))^2)

        // dWo    = -alpha * dE/dWo
        //        = -alpha * dE/dy * dy/do3 * do3/dWo
        //        = -alpha * sigma3         * do3/dWo
        //        = -alpha * sigma3         * h2
        // dWh2   = -alpha * dE/dWh2
        //        = -alpha * dE/dy * dy/do3 * do3/dh2 * dh2/do2 * do2/dWh2
        //        = -alpha * sigma2                             * do2/dWh2
        //        = -alpha * sigma2                             * h1
        // dWh1   = -alpha * dE/dWh1
        //        = -alpha * dE/dy * dy/do3 * do3/dh2 * dh2/do2 * do2/dh1 * dh1/do1 * do1/dWh1
        //        = -alpha * sigma1                                                 * do1/dWh1
        //        = -alpha * sigma1                                                 * x

        // dBo    = -alpha * dE/dBo
        //        = -alpha * dE/dy * dy/do3 * do3/dBo
        //        = -alpha * sigma3         * do3/dBo
        //        = -alpha * sigma3         * 1
        // dBh2   = -alpha * dE/dBh2
        //        = -alpha * dE/dy * dy/do3 * do3/dh2 * dh2/do2 * do2/dBh2
        //        = -alpha * sigma2                             * do2/dBh2
        //        = -alpha * sigma2                             * 1
        // dBh1   = -alpha * dE/dBh1
        //        = -alpha * dE/dy * dy/do3 * do3/dh2 * dh2/do2 * do2/dh1 * dh1/do1 * do1/dBh1
        //        = -alpha * sigma1                                                 * do1/dBh1
        //        = -alpha * sigma1                                                 * 1

        Matrix<Type> sigma3 = product((net.y_ - expected)        , tanh_derivate(net.o3_));
        Matrix<Type> sigma2 = product(sigma3 * transpose(p_.Wo_ ), tanh_derivate(net.o2_));
        Matrix<Type> sigma1 = product(sigma2 * transpose(p_.Wh2_), tanh_derivate(net.o1_));

        Matrix<Type> dWo  = -alpha * (transpose(net.h2_) * sigma3);
        Matrix<Type> dWh2 = -alpha * (transpose(net.h1_) * sigma2);
        Matrix<Type> dWh1 = -alpha * (transpose(net.x_ ) * sigma1);

        Matrix<Type> transposed_unit(sigma3.shape_[1], 1);
        ones(transposed_unit);

        Matrix<Type> dBo  = -alpha * (transposed_unit * sigma3);
        Matrix<Type> dBh2 = -alpha * (transposed_unit * sigma2);
        Matrix<Type> dBh1 = -alpha * (transposed_unit * sigma1);

        p_.Wo_  += dWo ;
        p_.Wh2_ += dWh2;
        p_.Wh1_ += dWh1;

        p_.Bo_  += dBo ;
        p_.Bh2_ += dBh2;
        p_.Bh1_ += dBh1;
    }

    void train(const Matrix<Type>& inputs,
               const Matrix<Type>& expected,
               const Type alpha,
               const int  cycles)
    {
        //gradient descent .. the hard way?
        for (int i = 0; i < cycles; ++i)
        {
            // batching ??
            Network net;

            forward(net, inputs);
            backward(expected, net, alpha);

            if ((i % 10) == 0)
            {
                Matrix<Type> err = loss(expected,net);
                Matrix<Type> transposed_unit(err.shape_[1], 1);
                ones(transposed_unit);

                Matrix<Type>  netErr = transposed_unit * err;
                std::cout << "**********************************************\n";
                std::cout << "step: " << i << " err:" << netErr;

                std::cout << "       Wo_ :\n" << p_.Wo_ ;
                std::cout << "       Wh2_:\n" << p_.Wh2_;
                std::cout << "       Wh1_:\n" << p_.Wh1_;
                std::cout << "       Bo_ :\n" << p_.Bo_ ;
                std::cout << "       Bh2_:\n" << p_.Bh2_;
                std::cout << "       Bh1_:\n" << p_.Bh1_;

            }
        }
    }
};

void basicTest()
{
    int rawA[] = {1,2,3,4};
    int rawB[] = {2,3,4,5};
    int rawC[] = {2,3,4,5,6,7};

    Matrix<int> a(rawA, 2, 2);
    Matrix<int> b(rawB, 2, 2);
    Matrix<int> c(rawC, 3, 2);

    std::cout << a << "\n";
    std::cout << b << "\n";
    std::cout << c << "\n";

    std::cout << (a + b) << "\n";
    try { std::cout << (a + c) << "\n"; }
    catch (std::exception& e) { std::cout << e.what() << "\n"; }

    std::cout << MatrixUtils<int>::matmul(a,c) << "\n";
    try { std::cout << MatrixUtils<int>::matmul(c,a) << "\n"; }
    catch (std::exception& e) { std::cout << e.what() << "\n"; }

    std::cout << "SUCCESS\n";
}

void xorNetTest()
{
    // generate some train data
    Matrix<float> inputs(2,100);
    rand(inputs);

    // compute xor for input data
    Matrix<float> expected
        = MatrixUtils<float>::bifunctor(&MatrixUtils<float>::Helpers::xor_f,
                                        MatrixUtils<float>::selcol(0, inputs),
                                        MatrixUtils<float>::selcol(1, inputs));

    Model model;
    model.train(inputs,expected,0.03,10001);
}

int main()
{
    basicTest();
    xorNetTest();

}