{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DECISION TREE HYPER-PARAMETERS. TUNING DECISION TREES"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- ** max_depth : int or None, optional (default=None)**\n",
    "    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Ignored if max_leaf_nodes is not None.\n",
    "    \n",
    "- **min_samples_split : int, optional (default=2)**\n",
    "    The minimum number of samples required to split an internal node.\n",
    "\n",
    "- There are more hyper-parameters: \n",
    "  - help(\"sklearn.tree.DecisionTreeRegressor\")\n",
    "  - help(\"sklearn.tree.DecisionTreeClassifier\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn import tree\n",
    "from scipy.stats import sem\n",
    "from sklearn.cross_validation import cross_val_score, KFold\n",
    "\n",
    "boston = load_boston()\n",
    "X = boston.data\n",
    "y = boston.target\n",
    "\n",
    "#np.random.seed(0)\n",
    "cv = KFold(X.shape[0], 10, shuffle=True, random_state=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's see what happens if we change max_depth parameter **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#for max_depth in [2,4,6,8,10,12,14,16]:\n",
    "np.random.seed(0)\n",
    "mds = range(2,16,2)\n",
    "results = []\n",
    "for max_depth in mds:\n",
    "  clf = tree.DecisionTreeRegressor(max_depth=max_depth)\n",
    "  scores = -cross_val_score(clf, \n",
    "                            X, y, \n",
    "                            scoring='mean_squared_error', \n",
    "                            cv = cv)\n",
    "    \n",
    "  results.append(scores.mean())\n",
    "  print (\"Max_depth={0:d} :Mean score: {1:.3f} (+/-{2:.3f})\").format(max_depth, scores.mean(), sem(scores))\n",
    "\n",
    "plt.plot(np.array(mds,dtype=float),  results)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** We can see that the minimum value is obtained at max_depth = 10, so we should set the hyper-parameter to this value. However, it is important to see that if the random seed is changed from 0 to other values (try it by changing x in np.seed(x)), slightly different plots (and minima) are obtained, because the algorithm that builds decision trees is stochastic. **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Let's see now what happens if we change the other hyperparameter: min_samples_split hyper-parameter **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "np.random.seed(0)\n",
    "mds = range(2,16,2)\n",
    "results = []\n",
    "for min_samples_split in mds:\n",
    "  clf = tree.DecisionTreeRegressor(min_samples_split=min_samples_split)\n",
    "  scores = -cross_val_score(clf, \n",
    "                            X, y, \n",
    "                            scoring='mean_squared_error', \n",
    "                            cv = cv)\n",
    "    \n",
    "  results.append(scores.mean())\n",
    "  print (\"min_samples_split={0:d} :Mean score: {1:.3f} (+/-{2:.3f})\").format(min_samples_split, scores.mean(), sem(scores))\n",
    "\n",
    "plt.plot(np.array(mds,dtype=float),  results)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The minimum for min_samples_split is obtained at 12, but this could change slightly if the random seed is altered, because decision tree construction is an stochastic process.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GRID SEARCH\n",
    "**What if we want to find the best combination of hyper-parameters? (and not individual parameters as we did above). The process that performs a crossvalidation for all possible combinations of two (or more) hyper-parameters is called *grid-search* **\n",
    "\n",
    "Note: in priciple, n_jobs can be used to run the process in parallel. In practive, in Windows it does not work well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.grid_search import GridSearchCV\n",
    "param_grid = {'max_depth': range(2,16,2),\n",
    "              'min_samples_split': range(2,16,2)}\n",
    "\n",
    "clf = GridSearchCV(tree.DecisionTreeRegressor(), \n",
    "                   param_grid,\n",
    "                   scoring='mean_squared_error',\n",
    "                   cv=5 , n_jobs=1, verbose=1)\n",
    "%time _ = clf.fit(X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Let's see the best ten combinations of hyper-parameters**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf.grid_scores_.sort()\n",
    "for line in clf.grid_scores_[0:11]:\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** And now, the best hyper-parameters**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf.best_params_, clf.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The best model fit with the best hyper-parameters and the whole training set can be used to make predictions:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "predictions = clf.predict(X)\n",
    "print predictions[0:11]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Using Randomized Search instead of a systematic search**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.grid_search import GridSearchCV, RandomizedSearchCV\n",
    "from scipy.stats import randint as sp_randint\n",
    "\n",
    "param_dist = {'max_depth': sp_randint(2,16),\n",
    "              'min_samples_split': sp_randint(2,16)}\n",
    "\n",
    "n_iter_search = 20\n",
    "clfrs = RandomizedSearchCV(tree.DecisionTreeRegressor(), \n",
    "                                   param_distributions=param_dist,\n",
    "                                   scoring='mean_squared_error',\n",
    "                                   cv=5 , n_jobs=1, verbose=1,\n",
    "                                   n_iter=n_iter_search)\n",
    "clfrs.fit(X,y)\n",
    "clfrs.grid_scores_.sort()\n",
    "for line in clfrs.grid_scores_[0:11]:\n",
    "    print(line)\n",
    "    \n",
    "clfrs.best_params_, clfrs.best_score_\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}