Author: Ricardo Aler

DECISION TREE HYPER-PARAMETERS. TUNING DECISION TREES

  • max_depth : int or None, optional (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Ignored if max_leaf_nodes is not None.

  • min_samples_split : int, optional (default=2) The minimum number of samples required to split an internal node.

  • There are more hyper-parameters:

    • help("sklearn.tree.DecisionTreeRegressor")
    • help("sklearn.tree.DecisionTreeClassifier")
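To see concretely what max_depth controls, here is a minimal sketch on synthetic data (the data and variable names here are illustrative, not part of the course dataset): an unrestricted tree memorizes the training set, while a shallow tree cannot.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy sine

for max_depth in (2, None):
    tree_model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree_model.fit(X_demo, y_demo)
    # With max_depth=None the tree grows until all leaves are pure, so the
    # training R^2 is (nearly) perfect; a depth-2 tree has only 4 leaves
    # and cannot memorize the noise.
    print(max_depth, round(tree_model.score(X_demo, y_demo), 3))
```

A high training score for the unrestricted tree is exactly the overfitting that tuning max_depth (via cross-validation, below) tries to avoid.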
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_boston  # note: removed in scikit-learn >= 1.2
from sklearn import tree
from scipy.stats import sem
from sklearn.model_selection import cross_val_score, KFold  # sklearn.cross_validation was removed

boston = load_boston()
X = boston.data
y = boston.target

#np.random.seed(0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # modern KFold no longer takes n as first argument

Let's see what happens if we change the max_depth parameter.

In [ ]:
#for max_depth in [2,4,6,8,10,12,14,16]:
np.random.seed(0)
mds = range(2,16,2)
results = []
for max_depth in mds:
  clf = tree.DecisionTreeRegressor(max_depth=max_depth)
  scores = -cross_val_score(clf,
                            X, y,
                            scoring='neg_mean_squared_error',  # 'mean_squared_error' was renamed
                            cv=cv)

  results.append(scores.mean())
  print("Max_depth={0:d}: Mean score: {1:.3f} (+/-{2:.3f})".format(max_depth, scores.mean(), sem(scores)))

plt.plot(np.array(mds,dtype=float),  results)
plt.show()

We can see that the minimum value is obtained at max_depth = 10, so we should set the hyper-parameter to this value. However, it is important to note that if the random seed is changed from 0 to other values (try it by changing x in np.random.seed(x)), slightly different plots (and minima) are obtained, because the algorithm that builds decision trees is stochastic.
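The seed sensitivity can be checked directly. Below is a small sketch on synthetic data (synthetic, not the dataset used above; names chosen here for illustration): the same model evaluated with differently shuffled folds gives slightly different cross-validation estimates.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_syn = rng.uniform(-3, 3, size=(300, 2))
y_syn = X_syn[:, 0] ** 2 + X_syn[:, 1] + rng.normal(scale=0.5, size=300)

for seed in (0, 1, 2):
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    mse = -cross_val_score(DecisionTreeRegressor(max_depth=10, random_state=0),
                           X_syn, y_syn,
                           scoring='neg_mean_squared_error', cv=folds)
    # Each fold shuffle yields a slightly different estimate of the error
    print(seed, round(mse.mean(), 3))
```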

Let's now see what happens if we change the other hyper-parameter, min_samples_split.

In [ ]:
np.random.seed(0)
mds = range(2,16,2)
results = []
for min_samples_split in mds:
  clf = tree.DecisionTreeRegressor(min_samples_split=min_samples_split)
  scores = -cross_val_score(clf,
                            X, y,
                            scoring='neg_mean_squared_error',  # 'mean_squared_error' was renamed
                            cv=cv)

  results.append(scores.mean())
  print("min_samples_split={0:d}: Mean score: {1:.3f} (+/-{2:.3f})".format(min_samples_split, scores.mean(), sem(scores)))

plt.plot(np.array(mds,dtype=float),  results)
plt.show()

The minimum for min_samples_split is obtained at 12, but this could change slightly if the random seed is altered, because decision tree construction is a stochastic process.

GRID SEARCH

What if we want to find the best combination of hyper-parameters (rather than tuning each one individually, as we did above)? The process that performs a cross-validation for all possible combinations of two (or more) hyper-parameters is called grid search.

Note: in principle, n_jobs can be used to run the search in parallel. In practice, it does not always work well on Windows.
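Before running the search, it helps to count how many models will be trained: one fit per (combination, fold) pair, plus a final refit. A quick check with sklearn's ParameterGrid, using the same grid as the cell below:

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {'max_depth': range(2, 16, 2),
             'min_samples_split': range(2, 16, 2)}
n_combos = len(ParameterGrid(demo_grid))  # 7 values x 7 values
print(n_combos, n_combos * 5)  # 49 combinations -> 245 fits with cv=5
```

This multiplicative growth is why grid search becomes expensive with many hyper-parameters, and why randomized search (below) is attractive.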

In [ ]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed
param_grid = {'max_depth': range(2,16,2),
              'min_samples_split': range(2,16,2)}

clf = GridSearchCV(tree.DecisionTreeRegressor(),
                   param_grid,
                   scoring='neg_mean_squared_error',
                   cv=5, n_jobs=1, verbose=1)
%time _ = clf.fit(X,y)

Let's see the best ten combinations of hyper-parameters

In [ ]:
# grid_scores_ was removed in modern scikit-learn; cv_results_ holds the same information
ranked = sorted(zip(clf.cv_results_['mean_test_score'], clf.cv_results_['params']),
                key=lambda t: t[0], reverse=True)
for score, params in ranked[:10]:
    print(score, params)

And now, the best hyper-parameters

In [ ]:
clf.best_params_, clf.best_score_

The best model, refit with the best hyper-parameters on the whole training set (the default refit=True behavior), can be used directly to make predictions:

In [ ]:
predictions = clf.predict(X)
print(predictions[0:11])

RANDOMIZED SEARCH

Instead of systematically evaluating every combination, randomized search samples a fixed number (n_iter) of candidate combinations from the specified distributions.
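To make the sampling step concrete, here is a minimal sketch using sklearn's ParameterSampler (the utility RandomizedSearchCV relies on to draw candidates), with the same distributions as the cell below:

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import randint as sp_randint

dist_demo = {'max_depth': sp_randint(2, 16),
             'min_samples_split': sp_randint(2, 16)}
candidates = list(ParameterSampler(dist_demo, n_iter=20, random_state=0))
print(len(candidates))  # 20 sampled settings instead of all 14 * 14 = 196
print(candidates[0])
```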

In [ ]:
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search was removed
from scipy.stats import randint as sp_randint

param_dist = {'max_depth': sp_randint(2,16),
              'min_samples_split': sp_randint(2,16)}

n_iter_search = 20
clfrs = RandomizedSearchCV(tree.DecisionTreeRegressor(),
                           param_distributions=param_dist,
                           scoring='neg_mean_squared_error',
                           cv=5, n_jobs=1, verbose=1,
                           n_iter=n_iter_search)
clfrs.fit(X, y)
# grid_scores_ was removed in modern scikit-learn; cv_results_ holds the same information
ranked = sorted(zip(clfrs.cv_results_['mean_test_score'], clfrs.cv_results_['params']),
                key=lambda t: t[0], reverse=True)
for score, params in ranked[:10]:
    print(score, params)

clfrs.best_params_, clfrs.best_score_