IntroDecisionTrees

Author: Ricardo Aler

SCIKIT-LEARN

Scikit-learn is the machine learning package for Python that we will use in this course. Other packages for Machine Learning in Python include:

  • Pylearn2
  • PyBrain
  • ...
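
A quick way to check that scikit-learn is available in your installation (the version number printed will vary):

In [ ]:
# Print the installed scikit-learn version
import sklearn
print(sklearn.__version__)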

NUMPY ARRAYS (MATRICES)

Scikit-learn does not use the standard Python container types (remember those are: lists, tuples, dictionaries, and sets). Instead, it uses NumPy arrays, which represent numerical matrices. Let's see how they are used:

In [ ]:
import numpy as np
# Let's create a 5 by 3 matrix by using command np.array
myMatrix = np.array([[1, 10, 100],
                     [2, 20, 200],
                     [3, 30, 300],
                     [4, 40, 400],
                     [5, 50, 500]])

print(myMatrix)
# "array" identifies numpy matrices
myMatrix

Elements of this matrix can be accessed with square brackets, much as in lists:

In [ ]:
print("Element at second row and third column")
print(myMatrix[1,2])
print("Submatrix with rows 1 and 2, and column 1 to the end")
print(myMatrix[1:3,1:])
print("Complete row 1")
print(myMatrix[1,:])
print("Complete column 1")
print(myMatrix[:,1])
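
It is also useful to know the dimensions of an array, because scikit-learn expects input data with shape [n_samples, n_features]. A quick check on the matrix defined above:

In [ ]:
# shape returns (number of rows, number of columns)
print(myMatrix.shape)  # (5, 3): 5 instances, 3 attributes
# dtype is the numerical type of the elements
print(myMatrix.dtype)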

DECISION TREES IN SK-LEARN

Decision trees

DecisionTreeClassifier takes as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of integer values, of size [n_samples], holding the class labels for the training samples.

Important: All input and output variables must be numerical. Categorical attributes, if needed, would have to be converted to integers.
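
As a minimal sketch of such a conversion, scikit-learn's LabelEncoder maps each categorical value to an integer (the color attribute below is just a made-up example):

In [ ]:
from sklearn.preprocessing import LabelEncoder

# A made-up categorical attribute
colors = ['red', 'green', 'blue', 'green', 'red']
le = LabelEncoder()
# fit_transform learns the label-to-integer mapping and applies it
colors_encoded = le.fit_transform(colors)
print(colors_encoded)  # [2 1 0 1 2]: labels are numbered alphabetically
print(le.classes_)     # the original labels, in the order of their codes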

In [ ]:
from sklearn import tree

# X = input attributes. As usual, rows are instances, columns are attributes
X = np.array([[0, 0], 
              [0, 1],
              [1, 0],
              [1, 1]])
# Y = vector of outputs: one value for every instance
y = np.array([0, 1, 1, 1])

# Create a decision tree classifier (not yet trained)
clf = tree.DecisionTreeClassifier()
# Now, learn the model (fit) and store it in variable clf
clf = clf.fit(X, y)
clf

After being fitted, the model can be used to predict the class of test instances:

In [ ]:
# Let's try with the training instances
print("Let's see the predictions if we use the training instances as test instances")
print(clf.predict([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]]))

# And now, with some new test instances
print("And now, let's try some actually new instances (test instances)")

print(clf.predict([[0.5, 0],
                   [0, 0.2],
                   [0.1, 0],
                   [0.9, 0.9]]))

The probability of each class can also be predicted: it is the fraction of training samples of that class in the leaf reached by the instance.

In [ ]:
print("The output gives the probability of 'o' and the probability of '1'")
clf.predict_proba([[0.9, 0.8]])

It is also possible to do multi-class classification. For instance, let's try the Iris dataset, which is already included in the sklearn module.

In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
print("Let's print the names of the input attributes")
print(iris.feature_names)
print("And the actual input attributes")
print(iris.data)
In [ ]:
print("Let's print the output variable")
print("We can see that there are three classes, encoded as 0, 1, and 2")
print("They actually are three different types of plants: 0=setosa, 1=versicolor, 2=virginica ")
print(iris.target)
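
The names of the three classes can also be read directly from the dataset object:

In [ ]:
print(iris.target_names)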
In [ ]:
%matplotlib inline

import matplotlib.pyplot as plt

X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

Now, let's train the decision tree on the iris dataset

In [ ]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
clf
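
Before looking inside the tree, we can check how well it fits the training data. The score method returns the classifier's accuracy; here it is computed on the same data used for training, so a high value is expected and says nothing about generalization:

In [ ]:
# Fraction of training instances classified correctly
print(clf.score(iris.data, iris.target))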

In order to visualize the learned decision tree, let's define the print_tree function:

In [ ]:
def print_tree(t, root=0, depth=1):
    if depth == 1:
        print('def predict(X_i):')
    indent = '    '*depth
    print(indent + '# node %s: impurity = %.2f' % (str(root), t.impurity[root]))
    left_child = t.children_left[root]
    right_child = t.children_right[root]

    if left_child == tree._tree.TREE_LEAF:
        print(indent + 'return %s # (node %d)' % (str(t.value[root]), root))
    else:
        print(indent + 'if X_i[%d] < %.2f: # (node %d)' % (t.feature[root], t.threshold[root], root))
        print_tree(t, root=left_child, depth=depth+1)

        print(indent + 'else:')
        print_tree(t, root=right_child, depth=depth+1)
In [ ]:
print_tree(clf.tree_)
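
As an alternative to print_tree, scikit-learn also provides tree.export_graphviz, which writes the learned tree in Graphviz .dot format; the resulting file can then be rendered to an image with the external dot tool:

In [ ]:
# Export the tree to a .dot file (Graphviz format)
with open('iris.dot', 'w') as f:
    tree.export_graphviz(clf, out_file=f)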

After being fitted, the model can then be used to predict the class (and the probability) of samples:

In [ ]:
print("Predicion for this instance: {0}".format(iris.data[:1, :]))
print("Class is: {0}".format(clf.predict(iris.data[:1, :])))
print("Probabilities are (for class1, class2, class3): {0}".format(clf.predict_proba(iris.data[:1, :])))

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

In [ ]:
# For regression, the outputs are continuous values rather than class labels
X = np.array([[0, 0], [2, 2]])
y = np.array([0.5, 2.5])
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, y)
# The prediction is the mean of the training outputs in the leaf reached
clf.predict([[1, 1]])
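
A regression tree predicts a piecewise-constant function: within each region found by the tree, the prediction is constant. A minimal sketch on a made-up one-dimensional problem:

In [ ]:
# Made-up one-dimensional regression data (y = x squared at a few points)
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

# Limit the depth so that the tree (and hence the number of regions) stays small
regressor = tree.DecisionTreeRegressor(max_depth=2)
regressor = regressor.fit(X, y)

# Each prediction equals the mean training output of the leaf reached
print(regressor.predict([[0.5], [2.5], [4.5]]))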