# decisionTreesHyperparameters.html

Author: Ricardo Aler

# DECISION TREE HYPER-PARAMETERS. TUNING DECISION TREES

• max_depth : int or None, optional (default=None). The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Ignored if max_leaf_nodes is not None.

• min_samples_split : int, optional (default=2) The minimum number of samples required to split an internal node.

• There are more hyper-parameters; the full list is in the docstrings:

• help("sklearn.tree.DecisionTreeRegressor")
• help("sklearn.tree.DecisionTreeClassifier")
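
To make the two stopping rules above concrete, here is a minimal sketch of a one-dimensional regression tree in plain Python. It is not scikit-learn's implementation: `build_tree`, `sse`, `tree_depth` and `n_leaves` are hypothetical helpers, and the toy data is invented for illustration. The only point is to show where max_depth and min_samples_split cut off the growth of the tree.

```python
# Minimal 1-D regression tree, written only to illustrate the two stopping rules.
# All helpers here are hypothetical; this is not how scikit-learn is implemented.

def sse(ys):
    """Sum of squared errors around the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(points, max_depth=None, min_samples_split=2, level=0):
    """Grow a tree from a list of (x, y) pairs sorted by x."""
    ys = [y for _, y in points]
    leaf = {"value": sum(ys) / len(ys), "n": len(ys)}
    # Stopping rule 1: the depth limit has been reached
    if max_depth is not None and level >= max_depth:
        return leaf
    # Stopping rule 2: too few samples to split, or the node is already pure
    if len(points) < min_samples_split or sse(ys) == 0:
        return leaf
    # Try every cut between distinct x values and keep the lowest total SSE
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return leaf
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2,
        "left": build_tree(points[:i], max_depth, min_samples_split, level + 1),
        "right": build_tree(points[i:], max_depth, min_samples_split, level + 1),
    }

def tree_depth(node):
    """Number of split levels below this node (0 for a leaf)."""
    if "threshold" not in node:
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))

def n_leaves(node):
    """Number of leaves below (and including) this node."""
    if "threshold" not in node:
        return 1
    return n_leaves(node["left"]) + n_leaves(node["right"])

# Toy data: 16 points whose targets are all different
data = [(x, (x * 7) % 16) for x in range(16)]

full = build_tree(data)                         # grown until every leaf is pure
shallow = build_tree(data, max_depth=2)         # at most two levels of splits
coarse = build_tree(data, min_samples_split=4)  # nodes with fewer than 4 samples stay leaves

print(n_leaves(full), tree_depth(shallow), n_leaves(coarse))
```

With the defaults the tree keeps splitting until each leaf holds a single sample, which is why both hyper-parameters act as a brake on overfitting.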
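
```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import sem
from sklearn import tree
# load_boston was removed in scikit-learn 1.2, so this notebook needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score, KFold

# Load the Boston housing data: X holds the features, y the target
boston = load_boston()
X = boston.data
y = boston.target

# 10-fold cross-validation with shuffling; the fixed seed keeps the folds reproducible
cv = KFold(n_splits=10, shuffle=True, random_state=0)
```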
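
As a rough picture of what a shuffled 10-fold split does, the sketch below partitions the sample indices by hand. `shuffled_kfold` is a hypothetical helper and its shuffling will not reproduce scikit-learn's folds; 506 is the number of rows in the Boston data.

```python
import random

def shuffled_kfold(n_samples, n_splits, seed):
    """Yield (train_indices, test_indices) pairs; hypothetical helper."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(shuffled_kfold(n_samples=506, n_splits=10, seed=0))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [51, 51, 51, 51, 51, 51, 50, 50, 50, 50]
```

Every sample lands in exactly one test fold, so each model is always scored on data it was not trained on.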

Let's see what happens when we change the max_depth parameter.

```python
np.random.seed(0)
mds = range(2, 16, 2)  # depths 2, 4, ..., 14
results = []
for max_depth in mds:
    clf = tree.DecisionTreeRegressor(max_depth=max_depth)
    # The scorer returns the negated MSE, so flip the sign to get the MSE
    scores = -cross_val_score(clf, X, y,
                              scoring='neg_mean_squared_error',
                              cv=cv)
    results.append(scores.mean())
    print("Max_depth={0:d} :Mean score: {1:.3f} (+/-{2:.3f})".format(
        max_depth, scores.mean(), sem(scores)))

# Mean cross-validated MSE as a function of max_depth
plt.plot(np.array(mds, dtype=float), results)
plt.show()
```
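
The "+/-" figure printed for each setting is the standard error of the mean over the folds. With its default settings, `scipy.stats.sem` divides the sample standard deviation (n - 1 denominator) by the square root of the number of folds. A small sketch of the same calculation using only the standard library follows; `standard_error` is a hypothetical helper and the per-fold values are made up.

```python
import math
import statistics

def standard_error(scores):
    """Sample standard deviation (n - 1 denominator) divided by sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))

fold_mse = [18.2, 25.1, 12.7, 30.4, 21.9]  # made-up per-fold MSE values
mean_mse = statistics.mean(fold_mse)
print("Mean score: {0:.3f} (+/-{1:.3f})".format(mean_mse, standard_error(fold_mse)))
```

Two settings whose means differ by less than this margin are hard to tell apart from a single cross-validation run.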

We can see that the lowest cross-validated mean squared error is obtained at max_depth = 10, so we should set the hyper-parameter to this value. However, if the random seed is changed from 0 to another value (try it by changing x in np.random.seed(x)), slightly different plots (and minima) are obtained, because the algorithm that builds decision trees is stochastic: scikit-learn permutes the features at random at each split, so ties between equally good splits can be broken differently.

Now let's see what happens when we change the other hyper-parameter, min_samples_split.

```python
np.random.seed(0)
mss = range(2, 16, 2)  # minimum split sizes 2, 4, ..., 14
results = []
for min_samples_split in mss:
    clf = tree.DecisionTreeRegressor(min_samples_split=min_samples_split)
    # The scorer returns the negated MSE, so flip the sign to get the MSE
    scores = -cross_val_score(clf, X, y,
                              scoring='neg_mean_squared_error',
                              cv=cv)
    results.append(scores.mean())
    print("min_samples_split={0:d} :Mean score: {1:.3f} (+/-{2:.3f})".format(
        min_samples_split, scores.mean(), sem(scores)))

# Mean cross-validated MSE as a function of min_samples_split
plt.plot(np.array(mss, dtype=float), results)
plt.show()
```

The minimum for min_samples_split is obtained at 12, but this could change slightly if the random seed is altered, because decision tree construction is a stochastic process.

# GRID SEARCH

What if we want to find the best combination of hyper-parameters, rather than tuning each one individually as we did above? The process that runs a cross-validation for every possible combination of two (or more) hyper-parameters is called grid search.
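
To get a feel for the cost of such a search, the sketch below enumerates every combination of the two value ranges used in this notebook with `itertools.product`. It only illustrates the bookkeeping and is not how scikit-learn is implemented; `combinations` and `n_folds` are names chosen here for illustration.

```python
from itertools import product

# Same candidate values as in the notebook: 2, 4, ..., 14 for each hyper-parameter
param_grid = {'max_depth': range(2, 16, 2),
              'min_samples_split': range(2, 16, 2)}

# Build one dict per combination of values
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

n_folds = 5
print(len(combinations))            # 7 values x 7 values = 49 combinations
print(len(combinations) * n_folds)  # 245 model fits during cross-validation
```

The number of fits is the product of the sizes of all the value lists times the number of folds, so the cost grows quickly with each extra hyper-parameter.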

Note: in principle, n_jobs can be used to run the process in parallel. In practice, it does not work well on Windows.

```python
from sklearn.model_selection import GridSearchCV

# Candidate values for each hyper-parameter: 2, 4, ..., 14
param_grid = {'max_depth': list(range(2, 16, 2)),
              'min_samples_split': list(range(2, 16, 2))}

clf = GridSearchCV(tree.DecisionTreeRegressor(),
                   param_grid,
                   scoring='neg_mean_squared_error',
                   cv=5, n_jobs=1, verbose=1)
%time _ = clf.fit(X, y)
```

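Let's see the ten best combinations of hyper-parameters:

```python
# Sort by rank (1 = best mean score, i.e. lowest MSE) and print the top ten
res = clf.cv_results_
for i in np.argsort(res['rank_test_score'])[:10]:
    print(res['params'][i], res['mean_test_score'][i])
```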

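And now, the best hyper-parameters and their score:

```python
# best_score_ is the mean cross-validated score (negated MSE) of the best combination
clf.best_params_, clf.best_score_
```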

By default, GridSearchCV refits a model with the best hyper-parameters on the whole training set, and that model can be used to make predictions:

```python
predictions = clf.predict(X)
print(predictions[0:11])
```

Randomized search can be used instead of a systematic search: it evaluates only n_iter combinations sampled from the given distributions.
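
As an illustration of the sampling step, the sketch below draws 20 combinations with the standard-library `random` module. It mimics the idea only and does not reproduce scikit-learn's sampler; `rng` and `sampled` are names chosen here. There are 14 x 14 = 196 possible combinations of values from 2 to 15, so 20 draws cover roughly a tenth of them, and because each draw is independent the same combination can appear more than once.

```python
import random

rng = random.Random(0)  # fixed seed so the draw is repeatable
n_iter_search = 20

# randrange(2, 16) draws from 2..15, the same support as scipy.stats.randint(2, 16)
sampled = [{'max_depth': rng.randrange(2, 16),
            'min_samples_split': rng.randrange(2, 16)}
           for _ in range(n_iter_search)]

for params in sampled[:3]:
    print(params)
```

Unlike the grid, the budget here is fixed by n_iter, whatever the size of the search space.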

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV

# sp_randint(2, 16) draws integers from 2 to 15 (the upper bound is exclusive)
param_dist = {'max_depth': sp_randint(2, 16),
              'min_samples_split': sp_randint(2, 16)}

n_iter_search = 20  # number of sampled combinations
clfrs = RandomizedSearchCV(tree.DecisionTreeRegressor(),
                           param_distributions=param_dist,
                           scoring='neg_mean_squared_error',
                           cv=5, n_jobs=1, verbose=1,
                           n_iter=n_iter_search)
clfrs.fit(X, y)

# Print the ten best sampled combinations (rank 1 = lowest MSE)
res = clfrs.cv_results_
for i in np.argsort(res['rank_test_score'])[:10]:
    print(res['params'][i], res['mean_test_score'][i])

clfrs.best_params_, clfrs.best_score_
```
