
# IntroDecisionTrees

Author: Ricardo Aler

# SCIKIT-LEARN

Other packages for Machine Learning in Python:

• Pylearn2
• PyBrain
• ...

# NUMPY ARRAYS (MATRICES)

Scikit-learn does not use the standard Python data structures (recall that these are lists, tuples, dictionaries, and sets). Instead, it uses NumPy arrays, which represent numerical matrices. Let's see how they are used:

In [ ]:
```
import numpy as np

# Let's create a 5 by 3 matrix by using command np.array
myMatrix = np.array([[1, 10, 100],
                     [2, 20, 200],
                     [3, 30, 300],
                     [4, 40, 400],
                     [5, 50, 500]])

print(myMatrix)
# "array" identifies numpy matrices
myMatrix
```

Elements of this matrix can be accessed much like elements of lists:

In [ ]:
```
print("Element at second row and third column")
print(myMatrix[1,2])
print("Submatrix with rows 1 and 2, and column 1 to the end")
print(myMatrix[1:3,1:])
print("Complete row 1")
print(myMatrix[1,:])
print("Complete column 1")
print(myMatrix[:,1])
```

# DECISION TREES IN SK-LEARN

DecisionTreeClassifier takes as input two arrays: an array X of shape [n_samples, n_features] holding the training samples, and an array Y of integer values, shape [n_samples], holding the class labels for the training samples.

Important: All input and output variables must be numerical. Categorical attributes, if present, must first be converted to integers.
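For instance, categorical values can be mapped to integer codes with sklearn's LabelEncoder. A minimal sketch (the `colors` attribute is a made-up example, not part of this notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical attribute (illustration only)
colors = ["red", "green", "blue", "green"]

le = LabelEncoder()
# Categories get integer codes in alphabetical order:
# blue=0, green=1, red=2
encoded = le.fit_transform(colors)
print(encoded)       # [2 1 0 1]
print(le.classes_)   # ['blue' 'green' 'red']
```

The resulting integer array can then be used as a column of X.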

In [ ]:
```
from sklearn import tree

# X = input attributes. As usual, rows are instances, columns are attributes
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
# y = vector of outputs: one value for every instance
y = np.array([0, 1, 1, 1])

# Create an empty decision tree
clf = tree.DecisionTreeClassifier()
# Now, learn the model (fit) and store it in variable clf
clf = clf.fit(X, y)
clf
```

After being fitted, the model can then be used to predict test instances

In [ ]:
```
# Let's try with the training instances
print("Let's see the predictions if we use the training instances as test instances")
print(clf.predict([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]]))

# And now, with some new test instances
print("And now, let's try some actually new instances (test instances)")

print(clf.predict([[0.5, 0],
                   [0, 0.2],
                   [0.1, 0],
                   [0.9, 0.9]]))
```

Probabilities of each class can also be predicted (the fraction of training samples of the same class in a leaf)

In [ ]:
```
print("The output gives the probability of '0' and the probability of '1'")
clf.predict_proba([[0.9, 0.8]])
```

It is also possible to do multi-class classification. For instance, let's try the Iris dataset, which is already included in the sklearn module.

In [ ]:
```
from sklearn.datasets import load_iris

# Load the dataset into variable iris
iris = load_iris()
print("Let's print the names of the input attributes")
print(iris.feature_names)
print("And the actual input attributes")
print(iris.data)
```
In [ ]:
```
print("Let's print the output variable")
print("We can see that there are three classes, encoded as 0, 1, and 2")
print("They actually are three different types of plants: 0=setosa, 1=versicolor, 2=virginica")
print(iris.target)
```
In [ ]:
```
%matplotlib inline

import matplotlib.pyplot as plt

X = iris.data[:, :2]  # we only take the first two features
y = iris.target

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()
```

Now, let's train the decision tree on the iris dataset

In [ ]:
```
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
clf
```

In order to visualize the learned decision tree, let's define the print_tree function:

In [ ]:
```
def print_tree(t, root=0, depth=1):
    if depth == 1:
        print('def predict(X_i):')
    indent = '    ' * depth
    print(indent + '# node %s: impurity = %.2f' % (str(root), t.impurity[root]))
    left_child = t.children_left[root]
    right_child = t.children_right[root]

    if left_child == tree._tree.TREE_LEAF:
        print(indent + 'return %s # (node %d)' % (str(t.value[root]), root))
    else:
        print(indent + 'if X_i[%d] < %.2f: # (node %d)' % (t.feature[root], t.threshold[root], root))
        print_tree(t, root=left_child, depth=depth + 1)

        print(indent + 'else:')
        print_tree(t, root=right_child, depth=depth + 1)
```
In [ ]:
```
print_tree(clf.tree_)
```
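As an alternative to a hand-written tree printer, recent versions of scikit-learn (0.21 and later) ship a built-in text renderer, export_text. A minimal, self-contained sketch:

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import export_text  # available in scikit-learn >= 0.21

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)
# One line per node, showing the split feature and threshold
print(export_text(clf, feature_names=iris.feature_names))
```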

After being fitted, the model can then be used to predict the class (and the probability) of samples:

In [ ]:
```
print("Prediction for this instance: {0}".format(iris.data[:1, :]))
print("Class is: {0}".format(clf.predict(iris.data[:1, :])))
print("Probabilities are (for classes 0, 1, 2): {0}".format(clf.predict_proba(iris.data[:1, :])))
```
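Note that so far we have only predicted on the training instances. To estimate how the tree behaves on unseen data, one can hold out part of the dataset as a test set. A minimal sketch using sklearn's train_test_split (the 70/30 split and random_state are arbitrary choices, not part of this notebook):

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
# Hold out 30% of the instances as a test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train, y_train)
# Accuracy on instances the tree has never seen
print("Test accuracy: {0:.2f}".format(accuracy_score(y_test, clf.predict(X_test))))
```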

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

In [ ]:
```
X = np.array([[0, 0], [2, 2]])
y = np.array([0.5, 2.5])
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, y)
clf.predict([[1, 1]])
```