Third assignment: Python notebook

Author: Ricardo Aler

Using decision trees in Spark MLlib

The Spark MLlib documentation explains how to use other algorithms (Random Forest, Gradient Boosting, etc.).

In [1]:
import sys
import os
import os.path
SPARK_HOME = r"C:\spark-1.5.0-bin-hadoop2.6"  # CHANGE THIS PATH TO YOURS! (raw string keeps the backslash literal)

sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "py4j-0.8.2.1-src.zip"))
sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))
os.environ["SPARK_HOME"] = SPARK_HOME

from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="PythonDecisionTreeClassificationExample")
In [2]:
%matplotlib inline
from pyspark.mllib.regression import LabeledPoint
import numpy as np
import matplotlib.pyplot as plt
In [3]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data            # Input attributes
y = iris.target          # Label
# zip is used so that each instance is a tuple of (label, input attributes).
# This will make life easier later.
# Note: list(zip([1,2,3], ["a","b","c"])) => [(1, 'a'), (2, 'b'), (3, 'c')]
data = list(zip(y, X))   # list() so the result is a concrete list under Python 3 as well
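The pairing step can be tried without Spark or scikit-learn. The following is a minimal sketch, assuming three hand-written rows as illustrative stand-ins for iris.target and iris.data (the names labels, features and pairs are hypothetical). Under Python 3, zip returns a lazy iterator, so the sketch wraps it in list() to get an indexable list.

```python
# Illustrative stand-ins for iris.target and iris.data (not the full dataset)
labels = [0, 0, 2]
features = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [5.9, 3.0, 5.1, 1.8],
]

# Pair each label with its feature row: one (label, features) tuple per instance
pairs = list(zip(labels, features))

# Each tuple can be unpacked, or indexed as x[0] (label) and x[1] (features)
first_label, first_features = pairs[0]
print(first_label, first_features)
```

Indexing the tuple as x[0] and x[1] is the access pattern the LabeledPoint conversion relies on.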
In [4]:
data_rdd = sc.parallelize(data,4)
print(data_rdd.getNumPartitions())
4
In [5]:
data_rdd = data_rdd.map(lambda x: LabeledPoint(x[0], x[1]))
data_rdd.take(1)
Out[5]:
[LabeledPoint(0.0, [5.1,3.5,1.4,0.2])]
In [6]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
In [11]:
(trainingData_rdd, testData_rdd) = data_rdd.randomSplit([0.7, 0.3])
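randomSplit assigns each instance to a split at random, so the 70/30 proportions are only approximate and change from run to run unless a seed is passed. The idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's implementation: random_split is a hypothetical helper, and the input is just the integers 0 to 149 standing in for the 150 iris instances.

```python
import random

def random_split(items, weights, seed=None):
    """Hypothetical helper: put each item independently into one split,
    with probability proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split's share of the [0, 1) interval
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if u < bound or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

train, test = random_split(range(150), [0.7, 0.3], seed=42)
print(len(train), len(test))  # sizes are close to 105 and 45, not exactly
```

Fixing the seed makes the split reproducible, which helps when comparing test errors across runs.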
In [13]:
# categoricalFeaturesInfo={} means every feature is treated as continuous
model = DecisionTree.trainClassifier(trainingData_rdd, numClasses=3, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5)
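The impurity='gini' argument selects the Gini impurity as the measure of how mixed the class labels in a node are: 1 minus the sum of the squared class frequencies. A minimal sketch of that formula in plain Python is given below; the function name gini is hypothetical and the code only illustrates the measure, not how MLlib evaluates candidate splits internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 0, 0]))  # pure node
print(gini([0, 0, 1, 1]))  # two classes, evenly mixed
print(gini([0, 1, 2]))     # three classes, evenly mixed
```

A pure node has impurity 0, and the value grows as the classes become more evenly mixed, so the tree prefers splits whose child nodes have low impurity.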
In [14]:
predictions = model.predict(testData_rdd.map(lambda x: x.features))
labelsAndPredictions = testData_rdd.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData_rdd.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
Test Error = 0.0208333333333
Learned classification tree model:
DecisionTreeModel classifier of depth 5 with 15 nodes
  If (feature 2 <= 1.7)
   Predict: 0.0
  Else (feature 2 > 1.7)
   If (feature 2 <= 4.8)
    If (feature 3 <= 1.6)
     Predict: 1.0
    Else (feature 3 > 1.6)
     If (feature 1 <= 2.8)
      Predict: 2.0
     Else (feature 1 > 2.8)
      Predict: 1.0
   Else (feature 2 > 4.8)
    If (feature 3 <= 1.7)
     If (feature 2 <= 5.0)
      If (feature 0 <= 6.0)
       Predict: 2.0
      Else (feature 0 > 6.0)
       Predict: 1.0
     Else (feature 2 > 5.0)
      Predict: 2.0
    Else (feature 3 > 1.7)
     Predict: 2.0
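The debug string above reads as nested if/else rules over the feature vector. As a reading aid, the same rules are transcribed by hand into a plain Python function below. The function name predict_iris is hypothetical, and the thresholds come from this particular run, so a different random split would give a different tree.

```python
def predict_iris(x):
    """Hand transcription of the printed tree; x is a list of 4 features."""
    if x[2] <= 1.7:
        return 0.0
    # feature 2 > 1.7
    if x[2] <= 4.8:
        if x[3] <= 1.6:
            return 1.0
        # feature 3 > 1.6
        return 2.0 if x[1] <= 2.8 else 1.0
    # feature 2 > 4.8
    if x[3] <= 1.7:
        if x[2] <= 5.0:
            return 2.0 if x[0] <= 6.0 else 1.0
        # feature 2 > 5.0
        return 2.0
    # feature 3 > 1.7
    return 2.0

# The instance shown earlier by data_rdd.take(1)
print(predict_iris([5.1, 3.5, 1.4, 0.2]))
```

Tracing one instance through the function shows which tests it passes on its way to a leaf, which is the main appeal of decision trees as interpretable models.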
