{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SCIKIT-LEARN: \n",
    "- Collection of machine learning algorithms and tools in Python.\n",
    "- BSD Licensed, used in academia and industry (Spotify, bit.ly, Evernote).\n",
    "- ~20 core developers.\n",
    "- [http://scikit-learn.org/stable/](SCIKIT-LEARN)\n",
    "\n",
    "** Other packages for Machine Learning in Python: **\n",
    "- Pylearn2\n",
    "- PyBrain\n",
    "- ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NUMPY ARRAYS (MATRICES)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scikit-learn does not use any of the standard data types of Python (let us remember those are: lists, tuples, dictionaries, and sets). Scikit-learn uses **arrays**, which represent numerical matrices. Let's see how they are used:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "# Let's create a 5 by 3 matrix by using command np.array\n",
    "myMatrix = np.array([[1, 10, 100],\n",
    "                     [2, 20, 200],\n",
    "                     [3, 30, 300],\n",
    "                     [4, 40, 400],\n",
    "                     [5, 50, 500]])\n",
    "\n",
    "print(myMatrix)\n",
    "# \"array\" identifies numpy matrices\n",
    "myMatrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Elements of this matrix can be accessed similarly as in lists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"Element at second row and third column\")\n",
    "print(myMatrix[1,2])\n",
    "print(\"Submatrix with rows 1 and 2, and column 1 to the end\")\n",
    "print(myMatrix[1:3,1:])\n",
    "print(\"Complete row 1\")\n",
    "print(myMatrix[1,:])\n",
    "print(\"Complete column 1\")\n",
    "print(myMatrix[:,1])"
   ]
  },
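  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional aside (not required by the rest of the notebook), the cell below sketches a few NumPy features that are often handy when preparing data for scikit-learn: the `shape` and `dtype` attributes, element-wise arithmetic, and per-column aggregation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Optional sketch: a few handy NumPy attributes and operations\n",
    "print(myMatrix.shape)    # (number of rows, number of columns) -> (5, 3)\n",
    "print(myMatrix.dtype)    # type of the elements, e.g. int64\n",
    "print(myMatrix.T.shape)  # the transpose swaps rows and columns -> (3, 5)\n",
    "# Arithmetic is element-wise: every element is multiplied by 2\n",
    "print(myMatrix * 2)\n",
    "# Column-wise mean (axis=0 aggregates over the rows)\n",
    "print(myMatrix.mean(axis=0))"
   ]
  },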
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DECISION TREES IN SK-LEARN\n",
    "[Decision trees](http://scikit-learn.org/stable/modules/tree.html)\n",
    "\n",
    "DecisionTreeClassifier take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of integer values, size [n_samples], holding the class labels for the training samples.\n",
    "\n",
    "**Important:** All input and output variables must be numerical. Categorical attributes, if needed, would have to be converted to integers."
   ]
  },
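  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the note above, one common way to convert a categorical attribute to integers is scikit-learn's `LabelEncoder` (from `sklearn.preprocessing`). The `color` column below is made-up example data, not part of this notebook's datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Illustrative sketch: encoding a categorical attribute as integers\n",
    "# ('color' and its values are made-up example data)\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "color = ['red', 'green', 'blue', 'green', 'red']\n",
    "encoder = LabelEncoder()\n",
    "color_encoded = encoder.fit_transform(color)\n",
    "print(encoder.classes_)   # the distinct categories that were found\n",
    "print(color_encoded)      # the same column, now encoded as integers"
   ]
  },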
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn import tree\n",
    "\n",
    "# X = input attributes. As usual, rows are instances, columns are attributes\n",
    "X = np.array([[0, 0], \n",
    "              [0, 1],\n",
    "              [1, 0],\n",
    "              [1, 1]])\n",
    "# Y = vector of outputs: one value for every instance\n",
    "y = np.array([0, 1, 1, 1])\n",
    "\n",
    "# Create an empty decision tree\n",
    "clf = tree.DecisionTreeClassifier()\n",
    "# Now, learn the model (fit) and store it in variable clf\n",
    "clf = clf.fit(X, y)\n",
    "clf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After being fitted, the model can then be used to predict test instances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's try with the training instances\n",
    "print(\"Let's see the predictions if we use the training instances as test instances\")\n",
    "print(clf.predict([[0, 0],\n",
    "                   [0, 1],\n",
    "                   [1, 0],\n",
    "                   [1, 1]]))\n",
    "\n",
    "# And now, with some new test instances\n",
    "print(\"And now, let's try some actually new instances (test instances)\")\n",
    "\n",
    "print(clf.predict([[0.5, 0],\n",
    "                   [0, 0.2],\n",
    "                   [0.1, 0],\n",
    "                   [0.9, 0.9]]))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Probabilities of each class can also be predicted (the fraction of training samples of the same class in a leaf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"The output gives the probability of 'o' and the probability of '1'\")\n",
    "clf.predict_proba([[0.9, 0.8]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also possible to do multi-class classification. For instance, let's try with the Iris dataset. The iris dataset is already included within sklearn module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "iris = load_iris()\n",
    "print(\"Let's print the names of the input attributes\")\n",
    "print(iris.feature_names)\n",
    "print(\"And the actual input attributes\")\n",
    "print(iris.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"Let's print the output variable\")\n",
    "print(\"We can see that there are three classes, encoded as 0, 1, and 2\")\n",
    "print(\"They actually are three different types of plants: 0=setosa, 1=versicolor, 2=virginica \")\n",
    "print(iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "X = iris.data[:, :2]  # we only take the first two features.\n",
    "y = iris.target\n",
    "\n",
    "plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)\n",
    "plt.xlabel('Sepal length')\n",
    "plt.ylabel('Sepal width')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's train the decision tree on the iris dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf = tree.DecisionTreeClassifier()\n",
    "clf = clf.fit(iris.data, iris.target)\n",
    "clf\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to visualize the learned decision tree, let's define the print_tree function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def print_tree(t, root=0, depth=1):\n",
    "    if depth == 1:\n",
    "        print 'def predict(X_i):'\n",
    "    indent = '    '*depth\n",
    "    print indent + '# node %s: impurity = %.2f' % (str(root), t.impurity[root])\n",
    "    left_child = t.children_left[root]\n",
    "    right_child = t.children_right[root]\n",
    "     \n",
    "    if left_child == tree._tree.TREE_LEAF:\n",
    "        print indent + 'return %s # (node %d)' % (str(t.value[root]), root)\n",
    "    else:\n",
    "        print indent + 'if X_i[%d] < %.2f: # (node %d)' % (t.feature[root], t.threshold[root], root)\n",
    "        print_tree(t, root=left_child, depth=depth+1)\n",
    "         \n",
    "        print indent + 'else:'\n",
    "        print_tree(t,root=right_child, depth=depth+1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print_tree(clf.tree_)"
   ]
  },
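  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional alternative to `print_tree`, scikit-learn also provides `tree.export_graphviz`, which writes the fitted tree in Graphviz .dot format. The sketch below only writes the .dot file; rendering it as an image requires the external Graphviz tool, which may not be installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Optional: export the fitted tree in Graphviz .dot format\n",
    "# (viewing it as an image requires the external Graphviz tool)\n",
    "with open('iris_tree.dot', 'w') as f:\n",
    "    tree.export_graphviz(clf, out_file=f, feature_names=iris.feature_names)\n",
    "print('Wrote iris_tree.dot')"
   ]
  },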
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After being fitted, the model can then be used to predict the class (and the probability) of samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"Predicion for this instance: {0}\".format(iris.data[:1, :]))\n",
    "print(\"Class is: {0}\".format(clf.predict(iris.data[:1, :])))\n",
    "print(\"Probabilities are (for class1, class2, class3): {0}\".format(clf.predict_proba(iris.data[:1, :])))"
   ]
  },
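  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far the tree has only been evaluated on the very same data it was trained on. A common sanity check is to hold out part of the data for testing. The sketch below assumes scikit-learn 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Optional sketch: hold out part of the iris data to estimate accuracy\n",
    "# (assumes scikit-learn >= 0.18, where train_test_split is in sklearn.model_selection)\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    iris.data, iris.target, test_size=0.3, random_state=0)\n",
    "\n",
    "clf2 = tree.DecisionTreeClassifier()\n",
    "clf2 = clf2.fit(X_train, y_train)\n",
    "predictions = clf2.predict(X_test)\n",
    "print(\"Accuracy on the held-out test set: {0}\".format(accuracy_score(y_test, predictions)))"
   ]
  },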
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = np.array([[0, 0], [2, 2]])\n",
    "y = np.array([0.5, 2.5])\n",
    "clf = tree.DecisionTreeRegressor()\n",
    "clf = clf.fit(X, y)\n",
    "clf.predict([[1, 1]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extra:\n",
    "- (Multi-output Decision Trees for Regression)[http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression_multioutput.html]\n",
    "- (Multi-output Decision Trees for Classification)[http://scikit-learn.org/stable/auto_examples/plot_multioutput_face_completion.html#example-plot-multioutput-face-completion-py]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}