{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DECISION TREE HYPER-PARAMETERS. TUNING DECISION TREES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- ** max_depth : int or None, optional (default=None)**\n", " The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Ignored if max_leaf_nodes is not None.\n", " \n", "- **min_samples_split : int, optional (default=2)**\n", " The minimum number of samples required to split an internal node.\n", "\n", "- There are more hyper-parameters: \n", " - help(\"sklearn.tree.DecisionTreeRegressor\")\n", " - help(\"sklearn.tree.DecisionTreeClassifier\")\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn.datasets import load_boston\n", "from sklearn import tree\n", "from scipy.stats import sem\n", "from sklearn.cross_validation import cross_val_score, KFold\n", "\n", "boston = load_boston()\n", "X = boston.data\n", "y = boston.target\n", "\n", "#np.random.seed(0)\n", "cv = KFold(X.shape[0], 10, shuffle=True, random_state=0)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's see what happens if we change max_depth parameter **" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "#for max_depth in [2,4,6,8,10,12,14,16]:\n", "np.random.seed(0)\n", "mds = range(2,16,2)\n", "results = []\n", "for max_depth in mds:\n", " clf = tree.DecisionTreeRegressor(max_depth=max_depth)\n", " scores = -cross_val_score(clf, \n", " X, y, \n", " scoring='mean_squared_error', \n", " cv = cv)\n", " \n", " results.append(scores.mean())\n", " print (\"Max_depth={0:d} :Mean score: {1:.3f} (+/-{2:.3f})\").format(max_depth, scores.mean(), sem(scores))\n", "\n", "plt.plot(np.array(mds,dtype=float), results)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** We can see that the minimum value is obtained at max_depth = 10, so we should set the hyper-parameter to this value. However, it is important to see that if the random seed is changed from 0 to other values (try it by changing x in np.seed(x)), slightly different plots (and minima) are obtained, because the algorithm that builds decision trees is stochastic. **" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Let's see now what happens if we change the other hyperparameter: min_samples_split hyper-parameter **" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "np.random.seed(0)\n", "mds = range(2,16,2)\n", "results = []\n", "for min_samples_split in mds:\n", " clf = tree.DecisionTreeRegressor(min_samples_split=min_samples_split)\n", " scores = -cross_val_score(clf, \n", " X, y, \n", " scoring='mean_squared_error', \n", " cv = cv)\n", " \n", " results.append(scores.mean())\n", " print (\"min_samples_split={0:d} :Mean score: {1:.3f} (+/-{2:.3f})\").format(min_samples_split, scores.mean(), sem(scores))\n", "\n", "plt.plot(np.array(mds,dtype=float), results)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The minimum for min_samples_split is obtained at 12, but this could change slightly if the random seed is altered, because decision tree construction is an stochastic process.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# GRID SEARCH\n", "**What if we want to find the best combination of hyper-parameters? (and not individual parameters as we did above). The process that performs a crossvalidation for all possible combinations of two (or more) hyper-parameters is called *grid-search* **\n", "\n", "Note: in priciple, n_jobs can be used to run the process in parallel. In practive, in Windows it does not work well." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV\n", "param_grid = {'max_depth': range(2,16,2),\n", " 'min_samples_split': range(2,16,2)}\n", "\n", "clf = GridSearchCV(tree.DecisionTreeRegressor(), \n", " param_grid,\n", " scoring='mean_squared_error',\n", " cv=5 , n_jobs=1, verbose=1)\n", "%time _ = clf.fit(X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Let's see the best ten combinations of hyper-parameters**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf.grid_scores_.sort()\n", "for line in clf.grid_scores_[0:11]:\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** And now, the best hyper-parameters**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf.best_params_, clf.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The best model fit with the best hyper-parameters and the whole training set can be used to make predictions:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "predictions = clf.predict(X)\n", "print predictions[0:11]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Using Randomized Search instead of a systematic search**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV, RandomizedSearchCV\n", "from scipy.stats import randint as sp_randint\n", "\n", "param_dist = {'max_depth': sp_randint(2,16),\n", " 'min_samples_split': sp_randint(2,16)}\n", "\n", "n_iter_search = 20\n", "clfrs = RandomizedSearchCV(tree.DecisionTreeRegressor(), \n", " param_distributions=param_dist,\n", " scoring='mean_squared_error',\n", " cv=5 , n_jobs=1, verbose=1,\n", " n_iter=n_iter_search)\n", "clfrs.fit(X,y)\n", "clfrs.grid_scores_.sort()\n", "for line in clfrs.grid_scores_[0:11]:\n", " print(line)\n", " \n", "clfrs.best_params_, clfrs.best_score_\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }