Author: Ricardo Aler

DECISION TREES WITH A TRAINING AND A TESTING SET

Let's load the Boston dataset and check its description. It contains data about housing prices as a function of the characteristics of the neighborhood.

In [ ]:
# The Boston dataset is also included within sklearn
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)
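
Note that load_boston was removed from scikit-learn in version 1.2. On recent versions, a minimal workaround sketch is to fetch the same data from OpenML; this assumes network access, and the OpenML dataset name "boston" is an assumption, not part of the original notebook.

In [ ]:
# Hedged alternative for scikit-learn >= 1.2, where load_boston was removed.
# Assumes network access and that OpenML hosts this data under the name "boston".
from sklearn.datasets import fetch_openml
boston = fetch_openml(name="boston", version=1, as_frame=False)
# Some columns may come back as strings; casting keeps the rest of the
# notebook working on a purely numeric array.
boston.data = boston.data.astype(float)
boston.target = boston.target.astype(float)
print(boston.DESCR)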

These are the names of the input attributes

In [ ]:
boston.feature_names

X will contain the input attributes, and y the output attribute. We can visualize the shape of the dataset (number of instances x number of input attributes), and the shape of the target (output) attribute.

In [ ]:
X = boston.data
y = boston.target
print(X.shape, y.shape)

Here we can see the first three instances (input and output attributes) of the dataset

In [ ]:
print(X[0:3])
print(y[0:3])

Let's split the dataset into a training set and a testing set, with 25% of the data reserved for testing. The selection is random.

In [ ]:
from sklearn import tree
# train_test_split lives in sklearn.model_selection (it was moved
# there from the old sklearn.cross_validation module)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=33)
print(X_train.shape, y_train.shape)

Now we train a decision tree (for regression)

In [ ]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
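
Before measuring errors, it is instructive to look at the size of the tree that was learned. A minimal sketch, assuming a scikit-learn version that provides get_depth and get_n_leaves (added in 0.21):

In [ ]:
# An unconstrained regression tree keeps splitting until its leaves are
# (nearly) pure, so it can grow very deep on the training data.
print("Tree depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())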

And the model is used to make predictions for the training set, and more importantly, for the testing set. We can observe that the training error is very small (0 in this case) but the testing error is much larger.

In [ ]:
import numpy as np
from sklearn import metrics
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print(metrics.mean_squared_error(y_train, y_train_pred))
print(metrics.mean_squared_error(y_test, y_test_pred))
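
This gap is the signature of overfitting: the unconstrained tree essentially memorizes the training set. A minimal sketch of one common remedy, limiting the tree depth; max_depth=5 is an arbitrary illustrative choice here, not a tuned value:

In [ ]:
# A depth limit trades a small increase in training error for (typically)
# a lower testing error. max_depth=5 is an assumption, not a tuned value.
clf_shallow = tree.DecisionTreeRegressor(max_depth=5, random_state=33)
clf_shallow.fit(X_train, y_train)
print(metrics.mean_squared_error(y_train, clf_shallow.predict(X_train)))
print(metrics.mean_squared_error(y_test, clf_shallow.predict(X_test)))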

Now, we will use cross-validation to obtain a better estimate of the prediction error. Typically, 10 folds are used.

In [ ]:
from sklearn.model_selection import cross_val_score, KFold
# Create a k-fold cross-validation iterator of k=10 folds
# (random_state makes the shuffling reproducible)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
# The "minus" is because scikit-learn maximizes scores, hence
# errors are reported as negative
scores = -cross_val_score(tree.DecisionTreeRegressor(),
                          X, y,
                          scoring='neg_mean_squared_error',
                          cv=cv)
# Printing the 10 scores
print(scores)
# Printing the average score
from scipy.stats import sem  # Standard error of the mean
print("Mean score: {0:.3f} (+/-{1:.3f})".format(scores.mean(), sem(scores)))