{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DECISION TREES WITH A TRAINING AND A TESTING SET" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Let's load the Boston dataset and check its description. Its data about housing prices depending on the characteristics of the zone**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The Boston dataset is also included within sklearn\n", "from sklearn.datasets import load_boston\n", "boston = load_boston()\n", "print(boston.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**These are the names of the input attributes**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "boston.feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**X will contain the input attributes, and y the output attribute. We can visualize the shape of the dataset (number of instances x number of input attributes), and the shape of the target (output) attribute.** " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = boston.data\n", "y = boston.target\n", "print X.shape, y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Here we can see the first three instances (input and output attributes) of the dataset**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print X[0:2]\n", "print y[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's split the dataset into a training and a testing set. 25% of data for testing. Selection is random**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import tree\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn import preprocessing\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", "test_size=0.25, random_state=33)\n", "print X_train.shape, y_train.shape" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "**Now we train a decision tree (for regression)**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "clf = tree.DecisionTreeRegressor()\n", "clf = clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**And the model is used to make predictions for the training set, and more importantly, for the testing set. We can observe that the training error is very small (0 in this case) but the testing error is much larger.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn import metrics\n", "y_train_pred = clf.predict(X_train)\n", "y_test_pred = clf.predict(X_test)\n", "print metrics.mean_squared_error(y_train, y_train_pred)\n", "print metrics.mean_squared_error(y_test, y_test_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now, we will use crossvalidation to obtain a better estimate of prediction error. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Now, we will use cross-validation to obtain a better estimate of the prediction error. Typically, 10 folds are used.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import cross_val_score, KFold\n", "# Let's make results reproducible\n", "np.random.seed(0)\n", "# Create a k-fold cross-validation iterator of k=10 folds\n", "cv = KFold(X.shape[0], n_folds=10, shuffle=True)\n", "# The \"minus\" is because scikit-learn maximizes scores, hence\n", "# errors are reported as negative values\n", "scores = -cross_val_score(tree.DecisionTreeRegressor(),\n", "                          X, y,\n", "                          scoring='mean_squared_error',\n", "                          cv=cv)\n", "# Printing the 10 scores\n", "print(scores)\n", "# Printing the average score and its standard error\n", "from scipy.stats import sem  # Standard error of the mean\n", "print \"Mean score: {0:.3f} (+/-{1:.3f})\".format(scores.mean(), sem(scores))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }